 
  
  
     
Akshat Verma, IBM India Research, akshatverma@in.ibm.com
Kaladhar Voruganti, Network Appliance, kaladhar.voruganti@netapp.com
Ramani Routray, IBM Almaden Research, routrayr@us.ibm.com
Rohit Jain, Yahoo India, rohitj@yahoo-inc.com
 
In this paper, we present the design and implementation of SWEEPER architecture and backup copy selection algorithms that specifically tackle the problem of quickly and systematically identifying a good recovery point. We monitor various system events and generate checkpoint records that help in quickly identifying a clean backup copy. The SWEEPER methodology dynamically determines the selection algorithm based on user specified recovery time and recovery point objectives, and thus, allows system administrators to perform trade-offs between recovery time and data currentness. We have implemented our solution as part of a popular Storage Resource Manager product and evaluated SWEEPER under many diverse settings. Our study clearly establishes the effectiveness of SWEEPER as a robust strategy to significantly reduce recovery time.
Data Resiliency is a very important concern for most organizations to ensure business continuity in the face of different types of failures and disasters, such as virus attacks, site failures, machine/firmware malfunction, and accidental or malicious human/application errors [17,2]. Resiliency is not only about being able to resurrect data after failures, but also about how quickly the data can be resurrected so that the business can be operational again. While in the case of total data loss the recovery time is largely dominated by the Restoration Cost, i.e., the time to restore the data from backup systems at local or remote locations, in the case of data corruption failures the time to identify a clean previous copy of data to revert to can be much larger. Data corruption is often detected long after the corruption itself occurred. As a result, the administrator has a large number of candidate backup copies from which to select the recovery point. The key to fast recovery in such cases is reducing the time required in this identification step. Much research and industrial attention has been devoted to protecting data from disasters. However, relatively little work has been done on how to quickly retrieve the latest clean copy of uncorrupted data. The focus of this paper is on how to efficiently identify clean data.
Data resiliency is based on Data Protection: taking either continuous or periodic snapshots of the data as it is being updated. Block level [7,15], file level [10], logical volume level [27] and database level [5] data replication/recovery mechanisms are the most prominent data protection mechanisms. These mechanisms vary with respect to their support for different data granularities, transactional support, replication site distance, backup latencies, and recovery point and recovery time objectives [11]. Continuous data protection (CDP) [24] is a form of data protection that allows one to go back in time after a failure and recover earlier versions of an object at the granularity of a single update.
The recovery process is preceded by Error Detection. Errors are usually found either by application users or by automated data integrity tools. Tools such as disk scrubbers and S.M.A.R.T. tools [23] can detect corruptions caused by hardware errors. Virus scanners, application specific data integrity checkers such as fsck for filesystem integrity, and storage level intrusion detection tools [19,3] can detect logical data corruptions caused by malicious software, improper system shutdown, and erroneous administrator actions. For complex Storage Area Network (SAN) environments, configuration validation tools [1] have been proposed that can be used in both proactive and reactive mode to identify configuration settings that could potentially lead to logical data corruptions.
Once system administrators are notified of a corruption, they need to solve the Recovery Point Identification problem, i.e., determine a recovery point that will provide a clean copy of their data. Typically, system administrators choose a recovery point by trading off recovery speed (the number of versions that need to be checked to find a clean copy) against data currentness (one might not want to lose valid data updates for the sake of fast recovery). Recovery point identification is currently a manual, error-prone and frustrating process for system administrators, given the pressure to quickly bring the organization's applications back online. Even though CDP technologies provide users with the ability to roll back every data update, they do not address the problem of identifying a clean copy of data. It is only after a good recovery point has been identified that Data Recovery can begin by replacing the corrupt copy with the clean copy of data. The efficacy of a recovery process is characterized by a Recovery Time Objective (RTO) and a Recovery Point Objective (RPO). RTO measures the downtime after detection of corruption, whereas RPO indicates the loss in data currency in terms of seconds of updates that are lost.
This paper describes and evaluates methods for efficient recovery point identification in CDP logs that reduce RTO without compromising on RPO. The basic idea behind our approach is to evaluate the events generated by various components such as applications, file systems, databases and other hardware/software resources, and to generate checkpoint records. Subsequently, upon detection of a failure, we efficiently process these checkpoint records to start the recovery process from an appropriate CDP record. Since CDP mechanisms are typically used along with point-in-time (PIT) snapshot technologies, it is possible to create data copies selectively. Further, since the time it takes to test a copy of data for corruption dominates overall recovery time, selective testing of copies can drastically reduce recovery time. This selective identification of copies that one can target for quick recovery is the focus of this work.
Some existing CDP solutions [16,21,30] are based on a similar idea of checkpointing interesting events in CDP logs. While event checkpointing mechanisms help in narrowing down the search space, they do not guarantee that the most appropriate checkpoint record will be identified. Thus, the CDP log evaluation techniques presented in this paper complement these checkpoint record generation mechanisms in quickly identifying the most suitable CDP record. The key contributions of this paper are the checkpoint-based SWEEPER architecture for recovery point identification and the CDP log processing algorithms that allow administrators to trade off recovery time against data currentness.
The rest of this paper is organized as follows. We formally define the Recovery Point Identification problem and model in Sec. 2.1 and 2.2. The SWEEPER architecture is presented in Sec. 2.3. The CDP record scanning algorithms are described in Sec. 3, and our implementation in Sec. 4. Sec. 5 describes our experimental evaluation. Sec. 6 discusses the strengths and weaknesses of SWEEPER in the context of related work, followed by a conclusion in Sec. 7.
The Recovery Point Identification problem is to identify a recovery point $t^*$ such that the data image at $t^*$ is clean, while minimizing the total recovery time. A constrained version of the problem is to minimize the loss in data currency subject to a bound on the recovery time. The identification of $t^*$ proceeds by finding an ordered set $S = \{s_1, \ldots, s_m\}$ of timestamps, which is a subset of the set of all the timestamps $T$, such that $s_m$, the last element of the set $S$, is the same as the error point $t_{err}$. Further, the total cost of creating and testing the data images corresponding to the $m$ timestamps in $S$, which equals $\sum_{i=1}^{m} C(s_i)$, should be minimized. The cost $C(s_i)$ of checking a data image at timestamp $s_i$ for corruption is the sum of (a) the cost of making the first PIT image preceding the timestamp available for read/write ($C_{PIT}$), (b) the cost of applying the $n_i$ ($s_i$ modulo $\omega$) intervening CDP logs ($n_i \cdot C_{CDP}$), and (c) the cost of testing the copy ($C_{test}$). Hence, the total time spent in isolating the uncorrupted copy is the sum of the costs of all the timestamps checked in the sequence $S$.
[Figure 1: Recovery timeline with total backups, incremental PIT backups and CDP logs.]
We now elaborate on the recovery parameters such as $C_{PIT}$, $C_{CDP}$ and $C_{test}$ and their typical values. For a typical example, consider Fig. 1, where a Recovery Point Identification strategy has decided to check the data image at timestamp $t$ for corruption. The data protection (continuous and point-in-time) solution employed in the setting takes a total backup of the data at regular intervals. In between total backups, incremental backups are taken after every $\omega$ writes (CDP logs). The number of incremental backups taken between two consecutive total backups is denoted by $k$. Hence, in order to construct the point-in-time snapshot of data at time $t$, we make the first total backup copy preceding $t$ (labeled in the example) online. Then, we apply incremental backups over this data till we reach the timestamp $t_{PIT}$ of the last incremental backup point before $t$. The total time taken in getting the PIT copy at $t_{PIT}$ online is denoted by $C_{PIT}$. On this PIT copy, we now apply the CDP logs that capture all the data changes made between $t_{PIT}$ and $t$, and incur a cost of $C_{CDP}$ for each log. Finally, an integrity check is applied over this data, and the running time of the integrity checker is denoted by $C_{test}$.
For average metadata and data write sizes of $s_m$ and $s_d$ respectively, write coalescing factor $\gamma$, average file size $f$, read and write bandwidths of $B_r$ and $B_w$ respectively, a file corpus of $n$ files, and a unit integrity test time $\tau$, the expected time taken for each of these three activities is given by the following equations. (The $C_{PIT}$ calculation is based on the assumption that constructing the PIT copy only requires metadata updates.)

[Equations (1) and (2): expressions for the expected values of $C_{PIT}$, $C_{CDP}$ and $C_{test}$ in terms of the parameters above.]
In a typical setup (files modified at a steady rate, representative update, file and metadata sizes, total backups every 12 hours, periodic incremental backups, and a standard disk transfer rate), the time taken to get a PIT image online ($C_{PIT}$) is approximately 50 seconds. The time taken to apply the CDP logs is of the order of 10 minutes, whereas the I/O time taken (not including any computations) by the integrity checker ($C_{test}$) comes to about 4 hours, assuming that the integrity checker only needs to check the modified files. In a deployment with a CDP log window of even a few days, sequentially checking the CDP records would therefore take on the order of days to find the point of corruption. Hence, sequential checking is infeasible, and minimizing the time taken by the recovery process essentially boils down to minimizing the number of data images that are checked by the recovery process.
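To make this cost model concrete, the sketch below computes the per-timestamp checking cost and the resulting sequential-scan time. The numeric parameter values are illustrative assumptions chosen to be of the same order as the discussion above, not the paper's exact configuration.

```python
# Illustrative sketch of the recovery cost model; all parameter values below
# are assumptions for illustration, not the exact values used in the paper.

def check_cost(c_pit, n_logs_to_apply, c_cdp, c_test):
    """Cost of verifying one timestamp: bring a PIT copy online, apply the
    CDP logs recorded since that PIT image, then run the integrity check."""
    return c_pit + n_logs_to_apply * c_cdp + c_test

# Assumed parameters (hypothetical values).
C_PIT = 50.0            # seconds to bring a PIT image online
C_CDP = 0.01            # seconds to apply one CDP log record
C_TEST = 4 * 3600.0     # seconds of integrity-checker I/O per timestamp
OMEGA = 60_000          # CDP logs between consecutive PIT images

# An average check applies about half a PIT interval of CDP logs.
avg_check = check_cost(C_PIT, OMEGA // 2, C_CDP, C_TEST)

# With even a few dozen candidate timestamps to examine, sequential
# checking already runs into days, which motivates non-sequential search.
for candidates in (10, 50, 200):
    total_days = candidates * avg_check / 86_400.0
    print(f"{candidates:4d} sequential checks ~ {total_days:.1f} days")
```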
The central idea of SWEEPER is to automatically checkpoint important system events from various application-independent monitoring sources as indicators of corruption failures. In contrast to earlier work [16] that requires an administrator to manually define, for each application, the events that are correlated with various corruption types, we automate the process of creating these checkpoints and indexing them with the type and scope of corruption, along with a number that indicates the probability of the event being correlated with the specific corruption type. We then use these checkpoints as hints to quickly pinpoint the most recent clean copy of data. The overall architecture of SWEEPER is shown in Fig. 2.
The key design question while architecting a recovery mechanism like SWEEPER is whether to build it a) as part of a CDP system, or b) as part of a Storage Resource Manager product like EMC Control Center, HP OpenView or IBM TotalStorage Productivity Centre, or c) as a standalone component. We have separated the design of algorithms and the architecture so that the algorithms can work with any architecture and vice versa. Implementing SWEEPER outside the CDP system allows it to be leveraged by many different types of CDP systems. The disadvantage of this implementation is that some CDP systems might not completely expose all of the internal system state changes via an event mechanism. Most of this information is available from Storage Resource Managers. Furthermore, most Storage Resource Managers have a comprehensive monitoring and resource discovery mechanism that can be leveraged by the Recovery module. Hence, we recommend the design choice (b) for SWEEPER (Fig. 2), with the following key components.
The checkpoint records can be organized into anywhere from one to several tiers; however, we observed in our experimental study that two tiers are usually sufficient for reducing recovery time.
SWEEPER operates through two flows: a Checkpoint Generation Flow, which monitors system events and creates checkpoint records during normal operation, and a Fault-Isolation Flow, which processes these checkpoint records after a corruption is detected in order to identify the recovery point.
The checkpoint log processing algorithms are implemented in the Checkpoint Record Locator and the CDP Record Scanner modules. The Checkpoint Record Locator iteratively queries the CDP Record Scanner for the next checkpoint that it should verify for correctness, whereas the CDP Record Scanner has the core intelligence for identifying the next eligible checkpoint. The iterative flow of the Checkpoint Record Locator is presented in Fig. 3.
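A minimal sketch of this interaction is shown below; the class, method and parameter names (e.g., next_checkpoint, test_image) are hypothetical stand-ins for the modules of Fig. 3, not the actual product interfaces.

```python
# Hypothetical sketch of the Checkpoint Record Locator / CDP Record Scanner
# interaction; names and signatures are illustrative, not the actual API.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    timestamp: float     # position in the CDP log stream
    probability: float   # correlation probability with the suspected corruption

def locate_recovery_point(scanner, test_image, start, end):
    """Iteratively ask the scanner for the next checkpoint to verify until the
    window between a clean and a corrupt timestamp contains no more
    checkpoints; the last clean timestamp found is the recovery point."""
    last_clean = start
    while True:
        cp = scanner.next_checkpoint(last_clean, end)   # scanner-specific policy
        if cp is None:                                   # window exhausted
            return last_clean
        if test_image(cp.timestamp):                     # image at cp is clean
            last_clean = cp.timestamp
        else:                                            # corruption is earlier
            end = cp.timestamp
```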
In the following, $N$ denotes the number of CDP logs, $C_{test}$ is the cost of testing a data image for corruption, $C_{PIT}$ is the cost of getting a PIT copy online, $C_{CDP}$ is the cost of applying one CDP log on a data image, and $\omega$ is the number of CDP logs after which a PIT image is taken. Further, we use $K$ to denote the number of checkpoint records in the relevant history.
The Sequential Scanning algorithm returns the first checkpoint record after the given start time. Hence, the number of integrity tests that the Checkpoint Record Locator needs to perform using the scanRecordsSequential method is proportional to the number of checkpoint records $K$ and not to the number of CDP logs $N$. The worst-case cost of creating the most current clean data image by the Sequential Scanning strategy is $O(K (C_{PIT} + \omega C_{CDP} + C_{test}))$.
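Under the same assumed interface as the locator sketch above, a sequential scanner can be written as follows; it simply returns the first checkpoint after the last known-clean timestamp.

```python
# Minimal sketch of a sequential scanner: return the first checkpoint after the
# last known-clean timestamp. Interface and names are assumed, not the real ones.
class SequentialScanner:
    def __init__(self, checkpoints):
        # checkpoints: objects with .timestamp (and .probability), sorted here
        self.checkpoints = sorted(checkpoints, key=lambda c: c.timestamp)

    def next_checkpoint(self, start, end):
        for cp in self.checkpoints:
            if start < cp.timestamp < end:
                return cp
        return None
```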
If the data image at a timestamp $t$ is corrupt, the data would remain corrupt for all timestamps $t' > t$. This order preservation in corruption testing allows us to partition the search space quickly. We use the intuition of binary-search algorithms that partitioning a search space into two equal-sized partitions leads one to converge to the required point in a logarithmic number of steps instead of a linear number of steps. Hence, for a search space with $K$ checkpoints at any given time, the binary scanning algorithm returns the timestamp corresponding to the middle ($K/2^{th}$) checkpoint record for inspection. If the data corresponding to the timestamp is corrupt, we recursively search for corruption in the timestamps between the first and the $K/2^{th}$ checkpoints. On the other hand, if the data is clean, our inspection window is now the timestamps between the $K/2^{th}$ and the $K^{th}$ checkpoint records. It is easy to see that since the inspection window reduces by a factor of $2$ after every check, we would complete the search in $O(\log K)$ steps and the total time (expected as well as worst case) spent in recovery point identification is given by $O((C_{PIT} + \omega C_{CDP} + C_{test}) \log K)$.
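A sketch of the binary scanning rule under the same assumed interface: it returns the checkpoint in the middle of the remaining window, so that each integrity test halves the number of candidate checkpoints.

```python
# Sketch of binary scanning: return the checkpoint closest to the middle of the
# remaining (start, end) window, halving the candidate set per test.
import bisect

class BinaryScanner:
    def __init__(self, checkpoints):
        self.checkpoints = sorted(checkpoints, key=lambda c: c.timestamp)
        self.times = [c.timestamp for c in self.checkpoints]

    def next_checkpoint(self, start, end):
        lo = bisect.bisect_right(self.times, start)   # first checkpoint after start
        hi = bisect.bisect_left(self.times, end)      # first checkpoint at/after end
        if lo >= hi:                                   # no checkpoints left inside window
            return None
        return self.checkpoints[(lo + hi) // 2]       # middle checkpoint
```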
The Informed Search strategy always picks, from the remaining search window, the checkpoint record that has the highest likelihood ($p_i$) of being correlated with the corruption $F$. Hence, for cases where data corruption is associated with a rare event, the search may terminate in a constant number of steps (a constant number of timestamps are examined) or take up to $K$ steps in the adversarial worst case. However, as long as the probability $p_i$ of a particular checkpoint being the cause of corruption is uncorrelated with the time of its occurrence, the search would still reduce the space exponentially and is expected to terminate in a logarithmic number of steps. Formally, we have the following result for the expected running time of Informed Search.
Theorem 1: If the correlation probability of a checkpoint record is uncorrelated with the time of its occurrence, the expected time spent by Informed Search in recovery point identification is $O(C \log K)$, where $C$ is the expected cost of checking a single timestamp.

Proof: Suppose Informed Search examines $m$ timestamps before it finds the most recent uncorrupted entry. It is easy to see that the average cost of testing a given timestamp is given by $C = C_{PIT} + \frac{\omega}{2} C_{CDP} + C_{test}$.

Observe that, as the highest probability checkpoint is uncorrelated with its timestamp, each of the $K$ checkpoint records is equally likely to be examined next. Further, if the $i^{th}$ checkpoint is examined, it divides the search space into two partitions, and we need to examine only one of them after the check. Hence, the recurrence relation for the expected search time satisfies

$T(K) \le C + \frac{1}{K} \sum_{i=1}^{K} T\big(\max(i-1,\, K-i)\big).$   (3)

$T(K) = c \log K$ satisfies the recurrence relation for a suitably large constant $c$. To verify, note that the right-hand side of the equation reduces to $C + \frac{c}{K} \sum_{i=1}^{K} \log \max(i-1, K-i)$ if $T$ is replaced by $c \log K$. Hence, we only need to show that $\frac{1}{K} \sum_{i=1}^{K} \log \max(i-1, K-i) \le \log K - \delta$ for some constant $\delta > 0$, which follows by using the fact that $\max(i-1, K-i) \le 3K/4$ for at least half of the values of $i$. Choosing $c \ge C/\delta$ completes the proof.
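A sketch of the Informed selection rule under the same assumed interface: it always returns the remaining checkpoint with the highest correlation probability for the suspected corruption type.

```python
# Sketch of informed scanning: return the remaining checkpoint with the highest
# correlation probability; names and interface are assumed, as before.
class InformedScanner:
    def __init__(self, checkpoints):
        self.checkpoints = sorted(checkpoints, key=lambda c: c.timestamp)

    def next_checkpoint(self, start, end):
        window = [c for c in self.checkpoints if start < c.timestamp < end]
        if not window:
            return None
        return max(window, key=lambda c: c.probability)
```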
The Balanced strategy is implemented by the algorithm in Fig. 5. The algorithm computes the total search time in an inductive manner and selects the checkpoint record that minimizes the expected running time. Intuitively speaking, the Balanced strategy picks checkpoint records that (a) are likely to have caused the corruption and (b) partition the space into two roughly equal-sized partitions. The algorithm strikes the right balance between partitioning the space quickly and selecting checkpoints that are correlated with the current corruption $F$. We have the following optimality result for the Balanced strategy.
Proof: Consider the checkpoint record picked by the Balanced strategy for failure $F$. The expected running time of the strategy can then be written in terms of the sizes and probabilities associated with the two partitions that this checkpoint induces (Eqn. 4), where $P_L$ and $P_R$ are the accumulated probabilities of the left and right partitions respectively, and $C$ is the total cost of creating and testing data for any given timestamp. Using the fact that the accumulated probability of a partition can be as high as that in the case where all checkpoint records have equal probabilities, we modify Eqn. 4 by taking this term out from the right-hand side. This completes the proof.
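One simple reading of the Balanced rule is a weighted-median split that equalizes the accumulated correlation probability of the two partitions; the sketch below uses this as a lightweight stand-in for the inductive expected-time computation of Fig. 5, under the same assumed interface as the earlier scanners.

```python
# Simplified sketch of the Balanced idea: split the window so that the two
# partitions carry roughly equal accumulated correlation probability (a
# weighted-median split). The paper's algorithm (Fig. 5) minimizes expected
# search time inductively; this is only an approximation of that rule.
class BalancedScanner:
    def __init__(self, checkpoints):
        self.checkpoints = sorted(checkpoints, key=lambda c: c.timestamp)

    def next_checkpoint(self, start, end):
        window = [c for c in self.checkpoints if start < c.timestamp < end]
        if not window:
            return None
        total = sum(c.probability for c in window)
        running, best, best_gap = 0.0, window[0], float("inf")
        for cp in window:
            running += cp.probability
            gap = abs(running - (total - running))   # probability imbalance
            if gap < best_gap:
                best, best_gap = cp, gap
        return best
```

Any of these scanners can be plugged into the locate_recovery_point loop sketched earlier, which mirrors how the Checkpoint Record Locator and the CDP Record Scanner divide responsibilities.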
The Checkpoint Record Locator outputs a recovery point that is fed to the CDP Recovery Module, and the CDP Recovery Module rolls back all updates till that time. Since we can leverage many components from the SRM, the only components we needed to implement for SWEEPER were the SWEEPER Event Analyzer and Checkpoint Generator, the SWEEPER Knowledge Database, the SWEEPER CDP Record Scanner, and the SWEEPER Checkpoint Record Locator. Details of the algorithms implemented by the SWEEPER CDP Record Scanner and the SWEEPER Checkpoint Record Locator have been presented in Sec. 3, so we restrict ourselves to describing the implementation of the remaining components in this section. We start with the checkpoint record structure and its implementation.
[Table 1: A split view of the SWEEPER Knowledge Base.]
Each checkpoint record is associated with an event and details the source of the event, the failure type(s) relevant to the event, and the correlation probability of the event with each relevant failure type. A split view of the Knowledge Base is presented in Table 1. We use expert information derived from the literature to generate the Knowledge Database. We obtained common configuration-related probabilities from the proprietary level-two field support problem database of a leading storage company for storage area networks. We obtained hardware failure information from the proprietary storage controller failure database of the same company. We obtained the virus failure information by looking at virus behavior for many common viruses at the virus encyclopedia site [28]. To elucidate with an example, for the corruption type of virus, we looked at the signatures of recently discovered viruses [28] and noted common and rare events associated with them. Based on how commonly an event is associated with a virus, we classified that event as having a low, medium or high probability of being seen in case of a virus attack. Because of the inherent noise in this expert data, we classify the correlation probabilities only into low, medium and high buckets, each of which corresponds to a fixed probability value.
The events listed in the Knowledge Base and monitored by the Event Analyzer can be classified as follows. Configuration Changes: addition, update (upgrade/downgrade of firmware, driver or software level) or removal of hardware and software resources, and changes in security access control; these events are checkpointed as it is common for data corruption problems to occur when one upgrades a software level or introduces a new piece of software or hardware. Background Checking Processes: successful or unsuccessful completion of background checking processes like virus scans, hardware diagnostics, and filesystem or application-provided consistency checkers such as fsck; these checkpoints provide markers for consistency in specific filesystems, databases or volumes. Application Specific Changes: applications can provide hints about abnormal behavior that may indicate corruption, and SWEEPER allows one to monitor such events. Hardware Failures: checksum/CRC type errors, self-diagnostic checks like SMART, and warnings generated by SMART. Performance Threshold Exceeded: events like high port activity or high CPU utilization, which are usually symptoms of either a virus attack or an application that has gone astray. Meta-data Update Changes: changes in the system directory and similar structures; abnormal meta-data updates indicate application misbehavior, which, in turn, can potentially lead to data overwrite or corruption problems.
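A small sketch of how such Knowledge Database entries and probability buckets might be represented follows; the bucket values and the individual entries are illustrative assumptions, not the contents of the actual database.

```python
# Illustrative sketch of Knowledge Database entries mapping event categories to
# corruption types and coarse probability buckets; the entries and bucket
# values are hypothetical examples, not the paper's actual database contents.
LOW, MEDIUM, HIGH = 0.1, 0.3, 0.6   # assumed bucket values

KNOWLEDGE_BASE = [
    # (event category, example event, corruption type, probability bucket)
    ("configuration_change",  "firmware_upgrade",       "logical_corruption", MEDIUM),
    ("background_check",      "fsck_failed",            "fs_corruption",      HIGH),
    ("application_specific",  "db_constraint_warning",  "logical_corruption", MEDIUM),
    ("hardware_failure",      "smart_warning",          "media_corruption",   HIGH),
    ("performance_threshold", "high_port_utilization",  "virus_attack",       LOW),
    ("metadata_update",       "system_dir_change",      "data_overwrite",     MEDIUM),
]

def relevant_events(corruption_type):
    """Events whose checkpoints should be weighted for a given corruption type."""
    return [(cat, ev, p) for cat, ev, c, p in KNOWLEDGE_BASE if c == corruption_type]
```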
The Storage Resource Manager has client agents that subscribe to various types of events from hosts, switches, storage controllers, file systems and applications. These events are consolidated and presented as part of an event bus. SWEEPER is only concerned with a subset of the events in the event bus and gets the list of relevant events from the Knowledge Database (Table 2). Once the relevant events are filtered, the Event Analyzer generates a checkpoint record for each event along with its timestamp. Tight synchronization between the SWEEPER event analyzer clock and the CDP system clock is not mandatory as the RPO granularity is no finer than a second. Hence, we perform hourly synchronization of the event analyzer clock with the CDP clock to ensure that the timestamps in the checkpoint records are reasonably accurate.
The key feature of the SWEEPER architecture is associating correlation probabilities with events to help any search strategy in speeding up the recovery flow. For every event $e_i$ and corruption type $F_j$, the correlation probability $p_{i,j}$ denotes the probability that the corruption $F_j$ happened in a given time interval given that $e_i$ occurred in that interval. The correlation probability is estimated using Bayes' rule (Eqn. 6): $p_{i,j} = P(F_j \mid e_i) = P(e_i \mid F_j)\, P(F_j) / P(e_i)$. In order to compute the correlation probability, one needs to estimate $P(e_i)$, $P(e_i \mid F_j)$ and $P(F_j)$. $P(e_i)$ is estimated by looking at the event stream for the event $e_i$ and using exponentially weighted averaging to compute the average frequency of the event. To elaborate further, the time line is divided into windows of a fixed size (Fig. 7), and for any window $w$, the number of occurrences $n_{i,w}$ of each event $e_i$ is noted. The local probability of event $e_i$ in the window $w$ (denoted by $p^{l}_{i,w}$) is calculated from $n_{i,w}$. In order to take into account the historical frequency of the event $e_i$ and obtain a more stable estimate, the probability of the event in the time window $w$ is damped using an exponential decay function (Eq. 7), i.e., it is a weighted combination of the local probability $p^{l}_{i,w}$ and the estimate from the previous window. For the first window, the probability is the same as the local probability.

The $P(e_i \mid F_j)$ estimates are obtained from the Knowledge Database. The final parameter in Eqn. 6 is $P(F_j)$. However, note that whenever checkpoints are being examined to figure out the source of a corruption $F_j$, all the checkpoints share the same common term $P(F_j)$. Hence, $p_{i,j}$ is computed by ignoring the common term $P(F_j)$ for all the events, and then normalizing the correlation probabilities so that they sum to one. The correlation probabilities thus computed are stored in the checkpoint records for use in the recovery flow.
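A minimal sketch of this estimation follows, assuming a fixed decay weight for the exponential damping and a simple in-memory layout for the per-window counts; both are assumptions rather than the implementation's actual choices.

```python
# Sketch of correlation probability estimation: exponentially damped event
# frequencies combined with expert P(event | failure) values, normalized per
# failure type. The decay weight and data structures are assumptions.
ALPHA = 0.5   # assumed exponential-decay weight for the newest window

class EventStats:
    def __init__(self):
        self.p_event = {}   # damped P(e_i) estimate per event

    def end_of_window(self, counts):
        """counts: occurrences of each event in the window just finished."""
        total = sum(counts.values()) or 1
        for event, n in counts.items():
            local = n / total                           # local probability in window
            prev = self.p_event.get(event)
            self.p_event[event] = local if prev is None else \
                ALPHA * local + (1 - ALPHA) * prev      # exponential damping

def correlation_probabilities(stats, expert_p_event_given_failure):
    """Score events for one failure type, dropping the common P(F) term and
    normalizing so that the scores sum to one."""
    scores = {e: p_ef / max(stats.p_event.get(e, 1e-9), 1e-9)
              for e, p_ef in expert_p_event_given_failure.items()}
    total = sum(scores.values()) or 1.0
    return {e: s / total for e, s in scores.items()}
```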
Synthetic events are generated by an Event Generator and fed to the SWEEPER Event Analyzer via the Storage Resource Manager Event Bus.
The key contribution of the checkpoint-based architecture is to correlate checkpoint records with corruption failures using the correlation probabilities $p_{i,j}$. Hence, we test our search algorithms for various distributions of $p_{i,j}$, which are listed below. We use synthetic correlation probability distributions because of the lack of authoritative traces of system and application events that would be applicable to a wide variety of systems. We have carefully inspected the nature of event distributions from many sources and use the following distributions as representative distributions of event-corruption probability.
A $k$-level (e.g., 2-level) uniform distribution has $k$ event-types, where all the events belonging to any given type have the same probability of having caused the corruption. However, certain event-types are more likely to have caused the error than other event-types, and this is captured by having more than one level of probabilities. One may note that the uniform and Zipf distributions capture the two extremes, in terms of skewness, that the correlation distribution may exhibit in practice. Hence, a study with these two extreme distributions not only captures many real settings, but also indicates the performance of the algorithms with other distributions.
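For reference, the two synthetic distributions can be generated as sketched below; the Zipf exponent and the 2-level split fractions are assumed parameters, not the values used in our experiments.

```python
# Sketch of the two synthetic correlation-probability distributions used in the
# evaluation; the Zipf exponent and the 2-level split are assumed parameters.
def two_level_uniform(k, high_fraction=0.1, high_mass=0.8):
    """A 2-level uniform distribution: a small fraction of checkpoints share
    most of the probability mass; the rest share the remainder uniformly."""
    n_high = max(1, int(k * high_fraction))
    n_low = k - n_high
    low = [] if n_low == 0 else [(1 - high_mass) / n_low] * n_low
    return [high_mass / n_high] * n_high + low

def zipf(k, s=1.0):
    """Zipf distribution: the i-th most likely checkpoint has mass ~ 1 / i^s."""
    weights = [1.0 / (i ** s) for i in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```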
Our first set of experiments studied the scalability and effectiveness of the Checkpoint Generation Flow. In the second set of experiments, we evaluate the various search strategies in the Recovery Flow. We conducted experiments for scalability (increase in the number of checkpoint records $K$) and for the ability of the strategies to deal with a recovery time constraint (RTO). We also study the usefulness of a 2-tier checkpoint record structure and the robustness of the SWEEPER framework with false negatives (the probability that the error is not captured in the event stream) and noisy data (error in correlation probability estimation). Since the CDP system was simulated, we had to manually fix the various parameters of the CDP system to realistic values. We fixed the time taken to check the data corresponding to any given timestamp ($C_{test}$) and the time taken to get a PIT copy online ($C_{PIT}$) at representative constants, and the time taken to create the snapshot using the CDP logs at 100 seconds.
We first investigated the scalability of our Checkpoint Record Generation Flow implementation to keep pace with events as the SAN size grows. For this, the Event Generator increases the number of resources managed by the Storage Resource Manager and generates more events, which are fed to the Event Analyzer. We observed that our Event Analyzer is able to efficiently deal with an increased number of events (Fig. 8) and that the lag in checkpoint record generation is almost independent of the event rate. We also observe that the checkpoint records lag by only a few minutes, and hence only the last few minutes of events may be unavailable to the recovery flow, which is insignificant compared to the typical CDP windows of weeks or months that the recovery flow has to look into. The efficiency of the checkpoint flow is because (a) the computations in the Event Analyzer (e.g., exponential-decay averaging) are fairly light-weight and (b) they depend only on a finite-sized window (Sec. 4.5), thus scaling well with the number of events.
We next focus on the Recovery Flow and investigate the impact of using a non-sequential strategy as opposed to sequential checking. Fig. 9 compares the time taken to find the most recent uncorrupted version of data by the sequential and the binary search strategies, for a uniform probability distribution. Since the probabilities of all checkpoints are equal, the Informed as well as the Balanced strategy have performance similar to Binary Search. For the sake of visual clarity, we only plot the performance of the Binary Search algorithm. It is clear that even for small values of $K$, the sequential algorithm fares poorly because of the $O(K)$ versus $O(\log K)$ running time ratio, which underlines the need for non-sequential algorithms to speed up recovery. For the remainder of our experiments, we only focus on non-sequential strategies and study their performance.
We next studied the relative performance and scalability of the proposed non-sequential strategies for various correlation probability distributions. Fig. 10(a) and Fig. 10(b) study the performance of the three non-sequential strategies with an increase in the number of checkpoint records under the Zipf and 2-level uniform probability distributions for checkpoints. We observe that the Balanced search strategy always outperforms the Binary search strategy. This validates our intuition that, since the Balanced strategy partitions the search space by balancing the likelihood of corruption in the partitions rather than the number of checkpoint records, it converges much faster than Binary. The gap in performance is larger for the Zipf distribution than for the 2-level uniform distribution, as the more skewed Zipf distribution increases the likelihood that partitions with an equal number of checkpoints have very different accumulated correlation probabilities. The Informed search strategy performs well under the Zipf distribution as it has to examine very few checkpoint records before it identifies the recovery point. For small $K$, Informed even outperforms the Balanced strategy, which takes $O(\log K)$ time, while Informed runs in a small constant number of steps with very high probability. Thus, if the operator has high confidence in the events associated with different types of corruption, Informed may be the strategy of choice. One may observe for the 2-level uniform correlation probability case that, as the number of checkpoints increases, the performance of Informed degrades to that of Binary search. This is because the individual probability of each checkpoint record (even of the checkpoints associated with the higher of the two levels) falls linearly with the number of points; hence, convergence in a constant number of steps is no longer possible and Informed search converges in $O(\log K)$ time, as predicted by Theorem 1. These experiments bring out the fact that the greedy Informed strategy is not good for large deployments or when the number of backup images is large.
We next modify the objective of the search algorithms. Instead of finding the most recent clean datapoint, the recovery algorithms now have a constraint on the recovery time and need to output a datapoint at the end of the time window. Fig. 11(a) and Fig. 11(b) examine the tradeoff between the recovery execution time and the data currency at the recovery point for the 2-level uniform and Zipf distributions respectively. Data currency is measured in terms of lost write updates between the most recent uncorrupted copy and the copy returned by the algorithms.
The largest difference in performance under the two distributions is for the Informed strategy. It performs well under Zipf, since it starts its search by focusing on the few high probability checkpoints to quickly prune the search space. Under the 2-level uniform distribution, it performs poorly compared to the other strategies overall, and more so when the recovery time budget is small. This is because it evaluates the uniform probability events in a random order, which on average leads to more unbalanced partitions than the other two strategies. Intuitively, both the Binary and Balanced strategies aim to reduce the unexplored space in each iteration. Hence, they minimize the distance of the error snapshot from the set of snapshots checked by them. On the other hand, Informed does not care about leaving a large space unexplored, and hence before it finds the actual error time, its best estimate of the error time may be way off. A similar observation can be made if the algorithms aim to achieve a certain (non-zero) data currency in the minimum time possible. By reversing the axes of Fig. 11, one can observe that both the Binary and Balanced strategies achieve a significant reduction in data currency loss fairly quickly, even though Informed catches up with them in the end. Hence, when the recovery time window is small, the use of the Informed strategy is not advisable.
We next study the impact of a 2-tiered checkpoint record structure on the performance of the algorithms. In a 2-tiered record structure, the algorithms execute on only a subset of checkpoint records consisting of high probability checkpoint records. The idea is that the recovery point is identified approximately by running the algorithms on the smaller set of records and then locating the exact recovery point using the checkpoints in its neighborhood. For the plot in Fig. 12, checkpoints are assigned probabilities according to a 2-level uniform distribution in which a small fraction of checkpoints carries most of the cumulative probability. Further, only a fraction of the total checkpoints is promoted to the high probability tier, and hence the skew in the high-tier checkpoints is much lower than in a single-tier checkpoint record structure. We observe in Fig. 12 that, as compared to single-tier checkpoint records (Fig. 10(b)), the performance gap between Binary and Balanced reduces significantly. This is because the Binary strategy reaps the benefit of operating only on the high probability checkpoint subset, making it closer in spirit to the Balanced strategy. Thus, a 2-tiered checkpoint architecture makes the probability-oblivious Binary Search algorithm also probability-aware, by forcing it to operate only on checkpoints that have a high correlation probability with the corruption.
We next investigate the robustness of our proposed algorithms in dealing with silent errors, i.e., errors that do not generate any events. We model silent errors using false negatives (the probability that the error is not captured in the checkpoint records). Fig. 13 studies the loss in data currency with an increase in false negatives. For this experiment, the average number of CDP logs between any two events was kept fixed, and we find that the non-sequential strategies are able to keep the data currency loss below or near this number even for a high false negative probability. Our results show that the correlation probability-aware algorithms like Balanced and Informed are no worse than the correlation probability-unaware algorithms like Sequential and Binary. Hence, even though these strategies factor the correlation probability into the search, they do not face a significant performance penalty when the error event is not captured in the checkpoints.
In practice, the expert-given values for $P(e_i \mid F_j)$ as well as the computed correlation probabilities $p_{i,j}$ have estimation errors, and the algorithms should be robust enough to deal with noise in the correlation probability estimates. Fig. 14 shows the recovery time taken by the various algorithms for the 2-level probability distribution as the expert data and event stream become noisy. The noise is generated by adding a zero-mean uniform distribution to the correlation probabilities, where the range of the noise is varied from zero up to the probability of the highest probability event. Note that the two vertical lines indicate the probability of the two types (levels) of events in the 2-level distribution; as the noise (error) approaches the right line, the standard deviation of the noise becomes greater than the mean of the original signal (i.e., the correlation probability). In such a situation, the correlation probabilities lose their significance as the complete data can be thought of as noise. The key observation in this study is the sensitivity of the Informed strategy to the accuracy of the correlation probabilities and the robustness of the Balanced strategy to noise. We observed that the Balanced strategy showed a graceful degradation with increase in error, degenerating to the Binary strategy even when the correlation probability contained only noise, whereas the Informed strategy showed a steep increase in recovery time. This is not surprising, given that the Informed strategy only considers the individual correlation probability and suffers with increased estimation error. On the other hand, the Balanced strategy takes into account cumulative probabilities, which are not affected as much by the zero-mean noise until the noise dominates the original signal. Even in the case where error dominates the original probabilities, Balanced performs as well as the Binary strategy, underlining its usefulness as an effective, versatile and robust strategy.
We present in this paper a scalable architecture and efficient algorithms that help administrators deal with data corruption by quickly rolling back to an earlier clean version of data. The related work in this area can be broadly classified into techniques for (a) Data Resiliency and Recovery, (b) Event Monitoring and Logging, (c) Indexing: correlating events with corruption and using the correlation values as an index, and (d) Searching the event logs for quick identification of the failure point.
Continuous Data Protection (CDP) solutions deal with data corruption by allowing an administrator to roll back at a granularity that is much finer than what was possible with traditional continuous copy or backup solutions. CDP solutions have been proposed at the file level [26] and at the block level in the network layer [16,21,30,15]. Versioning file systems [25,20] that preserve earlier versions of files also provide CDP functionality. CDP logs allow users to revert to earlier data versions but leave the onus of determining a recovery point on the administrator. A brute-force approach examining all CDP logs could lead to unacceptably high recovery time. To alleviate this, some products such as [16,21,30] incorporate Event Monitoring and Logging with the basic Data Resiliency mechanisms and allow applications to record specific checkpoint records in the CDP log, which then serve as landmarks in the log stream to narrow down the recovery point search. However, these solutions provide no Indexing and Searching mechanism, and the administrator has to devise a search strategy between the checkpoints, where all checkpoints are equally likely to be associated with the failure. Our work builds on existing solutions by (a) using system events along with application events to generate checkpoints, (b) attaching correlation probabilities to the checkpoints, and (c) providing CDP log processing algorithms that use these probabilities intelligently to automatically identify a good recovery point quickly.
The use of Searching and Mining techniques on an Event Log has been very popular in the area of problem detection and determination [9] in large scale systems. Numerous problem analysis tools [29], [32], [13] have been proposed that aid in automating the searching of logs for problem analysis. A major issue in applying these techniques in large systems is the complexity of event collection and subsequent analysis. Xu et al. [31] propose a flexible and modular architecture that enables the addition of new analysis engines with relative ease. Kiciman et al. [12] use anomalies in component interactions in an Internet service environment for problem detection. Chen et al. [4] use a decision tree approach for failure diagnosis in the eBay production environment. File system problem determination tools have been proposed [32] [13] that monitor system calls, file system operations and process operations to dynamically create resource graphs. Once the system administrator detects that there is a problem, these tools help to quickly identify the scope of the problem with respect to the affected files and processes. Similarly, Chronus [29] allows users to provide customized software probes that help in identifying faulty system states by executing the probe on past system states that have been persisted on disk. It selects past system states (virtual machine snapshots for a single machine) using binary search. Hence, Chronus addresses two of the problems, namely Event Monitoring and Searching, that SWEEPER handles. Even though it solves a completely different problem from SWEEPER, Chronus is most similar to SWEEPER in spirit. However, Chronus does not distinguish between different event checkpoints, whereas we index the checkpoints based on correlation probabilities and design a search strategy that uses this information for fast convergence. Further, we provide the user the flexibility to trade off data currentness against recovery time by appropriately setting the RTO and RPO values. Finally, the focus of our paper is on data corruption on disk in a distributed environment, whereas Chronus deals with system configuration problems for a single machine.
The novelty of SWEEPER lies in integrating Event Monitoring, Checkpoint Indexing, and new search techniques that are able to make use of the index (correlation) information. A possible weakness of SWEEPER lies in its inherent design that is based on checkpoint events: SWEEPER depends on events to provide hints about corruption, and silent errors or noise in the event-corruption correlation may degrade SWEEPER's performance. However, the proposed search algorithms perform a balancing act between searching events with high correlation probabilities and pruning the search space in logarithmic time. Our study suggests the use of Balanced Search as the most robust strategy for efficient Recovery Point Identification. The Balanced Search strategy achieves a good recovery time even for silent errors by degenerating to Binary Search. The only scenario where Balanced Search is (marginally) outperformed is in the case of probability skew: when a few checkpoints capture most of the correlation probability, Informed Search is the best performer. Our study makes a strong case for using search algorithms that exploit the correlation between system events and corruption to quickly recover from data corruption failures.
In this paper, we presented an architecture and algorithms to quickly identify clean data in a CDP record stream. The basic idea of our system is to monitor system events and generate checkpoint records that provide hints about potential data corruption. Upon discovering data corruption, system administrators can pass RTO and RPO requirements as input to our system, and it returns the most relevant checkpoint records. These checkpoint records point to locations in the CDP record stream that will potentially provide clean data. We use expert knowledge, resource dependency analysis, and event correlation analysis to generate the checkpoint records. We present different CDP record traversal algorithms that help traverse the CDP record stream to identify recovery points. Our studies suggest the use of the Balanced Search strategy as the most robust strategy for efficient Recovery Point Identification. However, in an environment where checkpoints with high correlation probabilities are very few, the Informed Search strategy is a good choice. Hence, the study emphasizes the need for either the log processing algorithms or the checkpoint architecture to be aware of the correlation probabilities of various events. Furthermore, our study conclusively establishes the need for non-sequential search mechanisms to quickly identify good recovery points in a CDP environment, as opposed to a linear sequential scan of checkpoints.