Analyzing Cases of Resilience Success and Failure—A Research Study

Abstract: Organizations that are using the CERT® Resilience Management Model and organizations that are considering using it want information about the business value of implementing resilience processes and practices, and how to determine which ones to implement. This report describes the SEI research study that begins to address this need. It includes a discussion of the completed phase 1 study and a proposed phase 2 project. Phase 1 included forming a hypothesis and set of research questions and using a variety of techniques to collect data and evaluate whether resilience practices have a discernible (measurable) effect on operational resilience, that is, an organization's ability to continue to carry out its mission (provide critical services) in the presence of operational stress and disruption. The outcomes of phase 1 provide the foundation for the proposed phase 2. The longer term goal includes developing a quantitative, validated business case for prioritizing and implementing specific resilience practices, including decision criteria for selecting and measuring investments in improved resilience.


List of Tables
Extension of Resilience Requirements to Types of Resilience Assets

Introduction
Since the release of the CERT® Resilience Management Model [Caralli 2010, Caralli 2011], organizations that are considering using the model and organizations that are implementing it have sought information about the business value of using resilience processes and practices and how to determine which ones to implement. This same return-on-investment concern has been raised by the community at large for years when considering any process improvement effort. We, as a model-developing community, have generally not been able to provide a satisfactory response to questions of this type. This is particularly the case for new models that have yet to be broadly adopted and, thus, for which there are very few experience reports and case studies. This research effort was conceived to begin to remedy this situation.
In January 2012, the CERT Program's¹ Cyber Enterprise Workforce Development Directorate decided to allocate internal funds to begin to address organizations' questions through an initial research study titled "Analyze Cases of Resilience Success and Failure." A proposed follow-on research project (see Section 4.0) will make progress in evaluating whether resilience practices have a discernible (measurable) effect on operational resilience, that is, an organization's ability to continue to carry out its mission (provide critical services) in the presence of operational stress and disruption. In the longer term, we also intend to develop a quantitative, validated business case for prioritizing and implementing specific resilience practices, including decision criteria for selecting and measuring investments in improved resilience.
Throughout this report, we use the term "phase 1" to describe the initial research study performed from January through September 2012. We use the term "phase 2" to describe the work we propose as a follow-on to the initial study, subject to available SEI research and customer funding.
During phase 1, the research team explored the hypothesis and research questions described in Section 1.2 and formulated the revised hypotheses and research questions described in Section 2.1. The team has collected a small sample of case data (disruptive events) from collaboration partners. We have used this data and other insights to develop a research scope and analysis approach that serves as a foundation for phase 2 of the research project in fiscal year (FY) 2013 (October 2012 through September 2013).
The remainder of this report is organized as follows:
• Section 1.1 describes the scope of phase 1.
• Section 1.2 presents our initial research objectives, hypothesis, and research questions formulated during phase 1.
• Section 2 describes the general research method for exploratory and explanatory research, the phase 1 data collection and coding approach, and several initial analysis methods based on mixed-methods research (qualitative and quantitative).
¹ The CERT Program is part of Carnegie Mellon University's Software Engineering Institute (SEI).
• Section 3 presents our research observations and lessons learned from phase 1, including a general description of our collaboration partners, the cases they provided, and insights we gained from working with this case data.
• Section 4 provides an overview of our proposed FY13 phase 2 project.
• Section 5 summarizes the report.
Terms used throughout this report are defined in the glossary.

Scope
For phase 1, the initial scope of candidate resilience practices comprises those described in the CERT® Resilience Management Model (CERT-RMM), Version 1.1 [Caralli 2011]. CERT-RMM is a capability-focused maturity model for process improvement that reflects best practices from industry and government for managing operational resilience across the domains of security management, business continuity management, and aspects of information technology (IT) operations management.² CERT-RMM defines operational resilience as the emergent property of an organization that can continue to carry out its mission in the presence of operational stress and disruption that does not exceed its limit.³ Stress and disruption arise from the realization of operational risk: failed internal processes; failures of systems or technology; the deliberate or inadvertent actions of people; and external events (such as natural disasters) [Caralli 2011]. To expand, operational resilience is the organization's ability to protect and sustain high-value services and their associated assets (information, facilities, people, and technology such as systems and software) to achieve the service mission. An operationally resilient service is one that can meet its mission under times of disruption or stress and can return to normalcy when the disruption or stress is eliminated. A service is not resilient if it cannot return to normalcy after being disrupted, even if it can temporarily withstand adverse circumstances [Allen 2011a].
Practices in the model focus on improving the organization's management of key operational resilience processes. These improvements enable high-value services to meet their mission consistently and with high quality, particularly during times of stress and disruption [Caralli 2011].
The primary resilience requirements that must be met in the presence of and after disruption are confidentiality, integrity, availability, and privacy.The applicability of a specific type of resilience requirement varies depending on the asset type, as shown in Table 1.

² For more information about CERT-RMM, see the book titled CERT® Resilience Management Model: A Maturity Model for Managing Operational Resilience and the CERT Resilience Management Model pages on the CERT website [Caralli 2011, CERT 2012].
³ Disruption in this definition applies to a disturbance that does not exceed the service's operational limit. A catastrophic loss of infrastructure would not be considered a disruption.

The research team selected availability of technology assets as the scope for phase 1. Our rationale for this scope was driven by the wide range of operational disruptions (caused, for example, by denial of service, disruption of service, jamming, or losing communication nodes) that can occur in an operational theater when technology is not available. In addition, our rationale was based on an initial sample of case data from collaboration partners where availability of technology assets was disrupted. The Department of Defense (DoD) Cyber Science and Technology Priority Steering Council Research Roadmap [King 2011] provided additional impetus by identifying time to restore operational capability and effort to restore operational capability as two key metrics of interest for characterizing a resilient infrastructure.⁴ We used these metrics as our primary focus and sought to identify recommended practices that measurably reduced time and effort to restore operational capability in the presence of and after an attack.

Objectives, Research Questions, and Hypothesis
This section describes the objectives, research questions, and hypothesis for phase 1. Revised research questions and hypotheses for the phase 2 project are described in Section 4.1.

Objectives
The goal of this overall research effort is to inform the general question "Do resilience practices have a discernible (measurable) effect on operational resilience?" This objective can be expressed as a statement: Measure the contribution of resilience practices to reducing the occurrence and impact of disruptive events. One could argue that there is ample anecdotal evidence and some documented case studies in the resilience domains described above showing that improved practices do have a discernible and positive effect on an organization's ability to be more resilient. That said, the authors of this report are currently unaware of comprehensive research results that evaluate and demonstrate a measurable effect that can also be used as a basis for selecting and prioritizing which resilience practices to implement. We are exploring a wide range of sources to validate this observation and to place this research within the relevant research literature and communities of practice. Several of these sources are cited in Section 4.6.

⁴ The Roadmap measures time to restore operational capability in minutes, hours, or days, and effort to restore operational capability in the number of cyber specialists required to resolve a significant attack: 100, 10, 1, or automated.
Resilience practices include any actions, methods, or techniques that help satisfy a resilience requirement.One example pertinent to our technology availability scope is a service level agreement that specifies availability requirements for key servers.Resilience practices that address this requirement could include redundant equipment, regular backups, and the ability to restore from backups in a specified time frame.
We study the general question by investigating and analyzing resilience cases, that is, cases in which an organization was successful or unsuccessful in satisfying a resilience requirement in the presence of and after a disruptive event. While the handling of any specific disruptive event may have both successful and unsuccessful aspects, we intend to identify these distinct aspects in different resilience case descriptions. Thus, any resilience case can be identified as having one of the following:
• a positive outcome, where the impact of the disruptive event is no more than some critical threshold (as defined by a resilience requirement)
• a negative outcome, where the impact of the disruptive event is more than some critical threshold (as defined by the resilience requirement)
Figure 1 depicts concepts and relationships associated with operational resilience that are relevant for this research effort, including the relationship of resilience requirements to impact thresholds in the context of services, assets, and controls. Resilience processes and practices implement the protection and sustainment strategies for high-value services and assets. In addition, this figure describes how operational resilience is driven by enterprise requirements. It is derived from the comparable Figure 2.6 in the CERT Resilience Management Model: A Maturity Model for Managing Operational Resilience [Caralli 2011]; thus it serves as the overall context for this research effort (both phase 1 and phase 2). We initially focused on two impact measures, the time to restore service and the effort to restore service, reflecting the interests of the Office of the Secretary of Defense, Science, and Technology (OSD S&T) [King 2011] and the emphasis on technology availability.
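The positive/negative outcome definition can be expressed as a small decision rule. The sketch below is illustrative only; the class, field names, and threshold values are hypothetical stand-ins, not fields from the case template.

```python
from dataclasses import dataclass

@dataclass
class ResilienceCase:
    """One disruptive event, measured against a resilience requirement."""
    time_to_restore_hours: float   # observed impact of the disruptive event
    threshold_hours: float         # critical threshold from the resilience requirement

def outcome(case: ResilienceCase) -> str:
    """Positive outcome: impact is no more than the critical threshold."""
    return "positive" if case.time_to_restore_hours <= case.threshold_hours else "negative"

# A thwarted incident (zero impact) also counts as a positive outcome.
assert outcome(ResilienceCase(0.0, 4.0)) == "positive"
assert outcome(ResilienceCase(3.5, 4.0)) == "positive"
assert outcome(ResilienceCase(9.0, 4.0)) == "negative"
```

The same rule applies unchanged if effort to restore, rather than time, is the impact measure of interest.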

Research Questions
We further elaborated on the broader question "Do resilience practices have a discernible effect on operational resilience?" by addressing two more specific research questions in addition to those noted above:
1. For disruptive events (incidents) with a positive outcome (impact <= threshold): What was the contribution of resilience practices to the successful handling of the incident causing the impact? How much did the use of resilience practices contribute to success? (Note: A successful response also includes cases where the incident was thwarted and no impact occurred.)
2. For resilience cases with a negative outcome (impact > threshold): What was the source of the failure and the root cause of the impact? Would the organization have been successful if it had implemented (selected) resilience practices?
To identify gaps in and improvements to CERT-RMM resilience practices, we intend to explore the following additional research questions:
• Did our collaboration partners perform any resilience practices that they found to be useful and that are not included in CERT-RMM?
• Are there practices suggested by CERT-RMM that don't have much positive effect and, therefore, could be reduced or eliminated, specifically those that have a high cost?
• Are the time and effort to restore reduced as a result of implementing sets of related resilience practices?
• Did partners implement any particular practices only because they were suggested by CERT-RMM, and were those practices effective? We would eventually like to make a claim on this question, supported by evidence and analysis.

Hypothesis
Phase 1 focused on the following general hypothesis: Disruptive events are more likely to have a positive outcome (i.e., response considered successful) when resilience practices are implemented than when they are not.
The research objectives, questions, and hypotheses, and the types of data we needed to explore them, resulted in the research approach described in the next section. Additional and more detailed hypotheses are also discussed.

Overview
This work was largely framed as an exploratory research study applying the multiple (or comparative) case study method described by Robert Yin [Yin 2009]. Thus, we did not start with concrete, specific hypotheses other than the general one stated in Section 1.2. One goal of the exploratory phase 1 study was, in fact, to generate more specific hypotheses for the phase 2 explanatory research project.
We defined the following analysis outcomes for phase 1 but accomplished only two of them, as indicated, primarily due to the lack of sufficient resilience case data:
• the generation of more specific hypotheses (accomplished)
• evidence of the validity of those hypotheses (deferred)
• a level of confidence associated with that evidence (deferred)
• a statement of the nature and degree of generalization of the research results (deferred)
• an understanding of experiments and studies that can be conducted and important data to be collected that will help establish or refute specific hypotheses in phase 2 (accomplished)
We did formulate several more specific hypotheses that we intend to explore during phase 2. The percentages in these hypotheses are examples; they might be modified according to the confidence goals of our research partners.

• Confidence that 80% of incidents in the category of interest (technology availability, information confidentiality) are reported.
• A known list of factors explains at least 80% of the time to handle a specific category of incidents.
• A known list of factors explains at least 80% of the cost of handling a specific category of incidents.
• A known list of factors causes at least 80% of a specific category of incidents.
• A known list of factors may control the reduction of a specific category of incidents by at least 80%.
• Identified factors are analogous to or inform the selection of resilience practices that confirm these hypotheses.
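One way to operationalize the "explains at least 80%" hypotheses is as a variance-explained (R²) criterion on incident data. The sketch below uses synthetic data purely for illustration; the factor values and coefficients are invented, not drawn from the study's case data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40  # roughly the 30-40 comparable incidents the study calls for

# Hypothetical factor measurements for each incident
# (e.g., 0-10 implementation scores for three resilience practices).
X = rng.uniform(0, 10, size=(n, 3))
# Synthetic "time to handle" driven mostly by the factors, plus small noise.
time_to_handle = X @ np.array([2.0, 1.0, 0.5]) + rng.normal(0, 1.0, n)

# Ordinary least-squares fit with an intercept column.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, time_to_handle, rcond=None)
residuals = time_to_handle - A @ coef
r_squared = 1 - residuals.var() / time_to_handle.var()

# The hypothesis would be supported if the known factors explain >= 80%.
print(f"variance explained: {r_squared:.2f}")
```

With real case data, the same computation would answer whether a candidate list of factors meets the 80% threshold, and the threshold itself could be tightened or relaxed per the partners' confidence goals.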
In addition to the hypotheses that were focused on resilience cases and their underlying state of resilience practices, we formulated questions associated with the practical measurement of factors directly tied to resilience practices, cases, and impacts.These questions articulate the need to better understand the measurement scaling of such factors and the degree of their repeatability and reproducibility.
• What are the potential factors to be measured to study the research questions?
• What is the proper measurement scale for each factor?
• What is the proper analytical or modeling approach related to each factor or combination of factors?
• Can these factors be measured practically, reliably, and with acceptable degrees of repeatability and reproducibility?
Many of these questions can be answered using small-scale, statistically designed experiments [Montgomery 2012] and studies, as described in Section 4.4, thereby enabling the collection of the correct types of data to be used in analyzing cases. Answers to these questions will help us perform a more targeted and quantitative analysis subsequent to phase 1.
While phase 1 was limited to exploratory research, in phase 2 we will strive to explain relationships between resilience practices and organizations' incidents through hypothesis testing [Sheskin 2011]. This overall research effort is, therefore, an example of mixed-methods research [Creswell 2011], which involves both qualitative and quantitative research. As Creswell and Clark describe, mixed-methods research may be appropriate when investigators do not know the exact questions to ask, variables to measure, and theories to guide the study, possibly because of the newness of the research topic. This is the case for phase 1. Creswell and Clark state that "in these situations, it is best to explore qualitatively to learn what questions, variables, theories, and so forth need to be studied and then follow up with a quantitative study to generalize and test what was learned from the exploration." Mixed methods are ideal for this type of research. The qualitative analysis conducted in phase 1 paves the way for quantitative analysis in phase 2.

Data Collection and Coding
The research team reached out to potential collaboration partners and collected case data using the definition of a resilience case as defined by the Resilience Case Request for Data template (see the appendix). For phase 1, we selected cases on a "target of opportunity" basis by engaging with collaboration partners where a strong trust relationship existed between individuals. Many of the individuals we have approached for case data thus far are reluctant to share such sensitive data and, in some cases, unwilling. However, where there is an existing customer project work statement, a CERT-RMM licensing agreement, a CERT-RMM Users Group relationship, or a professional relationship between peers characterized by mutual respect and trust, we find that individuals are more willing to share data and are able to provide sufficient rationale to their internal stakeholders to obtain permission to share the data. To date, using the form in the appendix, we have obtained descriptions of eight cases from four organizations. We were not able to obtain all the information for all fields for all cases but did obtain sufficient data to explore the analysis approaches described in this section.
The market sectors represented by the four collaboration partners that kindly provided case data for phase 1 included U.S. federal civilian agencies, U.S. defense contractors, commercial service providers, and academia.The eight cases we collected from these four partners described the following general types of incidents:

Business continuity
• power outage requiring migration to a backup site

Coding of cases is a critical process whereby information gathered through case review and interviews is entered into a resilience case template and (ultimately) a resilience case database according to a prescribed methodology documented in a codebook. Due to insufficient data, we did not pursue a defined case coding approach (other than the use of the template by research study members) or the development of a resilience case database during phase 1. We did perform in-depth analysis of our research questions and the case data template to ensure that we were collecting sufficient data to address our research questions. In the process of performing this analysis, we identified several gaps,⁵ which will be remedied in phase 2. All these tasks will be explored in more depth during the follow-on project.
After performing the preliminary analysis on a single event described in Section 2.3, we determined that the analyses necessary to support phase 1 and phase 2 require a range of events of a similar type over time (at least 7 with similar resilience practices; optimally 30-40). We have identified a source for such data where a deep trust relationship exists with the partner providing the data. We plan to pursue this data collection approach during phase 2.

Analysis Methods
Given the small data set we had to work with, we explored the question "How do I structure data to inform small controlled experiments to achieve a reasonable understanding of how to analyze resilience cases to inform our research questions?" Of the eight cases we collected, only one provided sufficient data to explore candidate analysis methods. Using that case, we experimented with two types of analysis: attribute agreement analysis and a preliminary conceptual Bayesian Belief Network (BBN). Our analysis experiments are described below.

Attribute Agreement Analysis
Attribute agreement analysis can be used to collect ratings provided by subject matter experts (e.g., a minimum of four) and then to examine such ratings (e.g., a minimum of 20 item ratings) for agreement and divergence. We defined and executed the following variation of an attribute agreement analysis process to conduct an experiment on our ability to render reproducible, subjective judgments to code resilience cases. This process was performed by three members of the research team.
1. Develop a detailed chronological time line of each action taken (or actions not taken that should have been) in the case. Identify CERT-RMM processes and practices that are relevant for each action. Prepare a spreadsheet that maps each action to the applicable CERT-RMM processes, goals, practices, and subpractices.⁶ For example, one action was to notify the organizational computer incident response team when the suspicious event was first detected. Two of the CERT-RMM practices that apply here are IMC:SG3.SP1, Define and maintain incident declaration criteria, and IMC:SG4.SP3, Communicate incidents. For this case, we identified 18 actions and 30 applicable CERT-RMM practices (with specific subpractices).
2. Answer and score the following two questions for each CERT-RMM practice:
− Question 1 (Q1): To what degree was this practice implemented in this case? Use a scale of zero to 10 (zero being not implemented at all; 10 being fully implemented) and do one of the following: a. Provide a brief rationale for each answer to Q1. b. Specify that additional data is needed to answer this question and the nature of that data.

− Question 2 (Q2): Given the answer to Q1, what is your subjective assessment of the role that this practice as implemented played in the resilience outcomes for phase 1? (Resilience response <= threshold; resilience response > threshold, expressed as time to restore and effort to restore.) Use a scale of -10 to +10, where -10 is a significant negative role (made the response much worse), zero is neutral (no role), and +10 is a significant positive role (made the response much better). Also, do one of the following: a. Provide a brief rationale for each answer to Q2. b. Specify that additional data is needed to answer this question and the nature of that data.
3. Initially, to ease the case analysis workload, we discussed answering Q2 only for practices that have answers at the extremes for Q1 (i.e., 0-2 or 9-10), as we may only care about analyzing practices that are significantly absent/weak/ineffective or significantly strong.
4. We initially evaluated each practice independently even though some practices may relate to others.We chose to defer analysis of practice interrelationships.
5. Practices for which we could not answer Q1 and Q2 were eliminated from consideration for this first experiment.
An answer to Q1 or Q2 is called a judgment. We needed 100 judgments to conduct our first analytical experiment (attribute agreement analysis). So if we could identify 12-13 practices in a case for which we could answer both questions (24-26 judgments) and four people could do this analysis, we would have approximately 100 judgments.
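The agreement step itself can be sketched as a pairwise comparison of raters' scores. The ratings below are fabricated purely for illustration; the computation simply asks how often two raters' 0-10 Q1 scores fall within one point of each other, a simplified stand-in for a full attribute agreement analysis.

```python
from itertools import combinations

# Hypothetical Q1 scores (0-10) from three raters on five practices.
ratings = {
    "rater_a": [9, 2, 7, 10, 1],
    "rater_b": [8, 1, 5, 10, 2],
    "rater_c": [9, 3, 7, 9, 0],
}

def pairwise_agreement(scores1, scores2, tolerance=1):
    """Fraction of items on which two raters agree within `tolerance` points."""
    matches = sum(abs(a - b) <= tolerance for a, b in zip(scores1, scores2))
    return matches / len(scores1)

# Report agreement for each pair of raters.
for (r1, s1), (r2, s2) in combinations(ratings.items(), 2):
    print(f"{r1} vs {r2}: {pairwise_agreement(s1, s2):.0%}")
```

Low pairwise agreement, as the team observed, signals that the coding guidance or the underlying case data is insufficient to support reproducible judgments.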
We observed that this analysis process was best done by two researchers first, independently, followed by their reconciling these first two sets of results. Then, one or two additional researchers performed the same analysis using the first set of results as their baseline. We determined that it was too difficult to reconcile three to four sets of analysis results simultaneously.
⁶ We found that we needed to go to the subpractice level to gain sufficient understanding and detail to perform a reasonable mapping. In the future, we may also need to map to finer grained control descriptions as identified in the CERT-RMM Crosswalk [Partridge 2011].
After three team members attempted this analysis, we realized that we lacked sufficient data to form a consensus, which led to subjective and varying results. We also determined that this approach was too resource intensive (in time and effort) to scale to the number of cases we would need to analyze to produce meaningful results. As a next step, we decided to focus on a subset of CERT-RMM practices that we believed provided the strongest contribution to a successful outcome and used the analysis approach described in the next section.

Conceptual Bayesian Belief Network
The purpose of this analysis was to develop a preliminary, conceptual Bayesian Belief Network (BBN). A BBN is a probabilistic graphical model that represents a set of random variables and their conditional dependencies.⁷ We believed a BBN structure could be useful in helping us identify key practices and their interdependencies toward predicting resilience outcomes of time and cost for a specific case. Historical use of BBNs in the field of quality and reliability suggested a simplified model, with directly causal practices connected to performance outcomes and additional "upstream" practices providing a leading indication of the directly causal practices.
A research team member selected 19 CERT-RMM practices (from the 30 identified in the case above) that we believed would most likely predict a successful outcome if performed adequately. These practices were placed into a spreadsheet structured to evaluate cause and effect relationships among practices. All practices were listed as rows and columns in the spreadsheet.
We then scored the intersection of each practice with every other practice as follows:
• 0: no relationship between these two practices
• 1: possible relationship between these two practices (weak influence)
• 2: agree that there is a relationship between these two practices (moderate influence)
• 3: strongly agree that there is a relationship between these two practices (strong influence)
For example, there is a strong relationship (a score of 3) between IMC:SG3.SP1, Define and maintain incident declaration criteria, and IMC:SG2.SP4, Analyze and triage events (assign disposition). There is no relationship between IMC:SG3.SP1, Define and maintain incident declaration criteria, and IMC:SG4.SP3, Communicate incidents (identify relevant stakeholders).
After we established and reviewed these relationships, we added the scores by practice and identified those with the highest score. Practices with the highest row scores indicated those that were most likely to be a precursor or input to other practices, and practices with the highest column scores indicated those that were downstream and more dependent on other practices. For example, VAR:SG3.SP1, Manage exposure to vulnerabilities, had the highest precursor (row) score. MON:SG2.SP3, Collect and record information, had the highest dependency (column) score. From this analysis, we developed a preliminary, conceptual BBN, attempting to identify practices that, if performed adequately, would most likely predict a successful outcome. However, this attempt at using a BBN did not immediately resolve to a simple, hierarchical model in which some practices were more directly causal of resilience outcomes and others were more indirectly causal. Instead, a high degree of bidirectional correlation was noted among the practices, confirming the need for more case data and in-depth analysis. The spreadsheets and diagrams resulting from this analysis are available upon request.
⁷ According to http://en.wikipedia.org/wiki/Bayesian_network
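The precursor/dependency scoring can be reproduced with a small matrix computation. The four practices and the influence scores below are illustrative stand-ins, not the study's actual 19-practice spreadsheet; only the row-sum/column-sum interpretation is taken from the text.

```python
import numpy as np

practices = ["VAR:SG3.SP1", "MON:SG2.SP3", "IMC:SG3.SP1", "IMC:SG2.SP4"]

# influence[i][j]: strength (0-3) with which practice i feeds practice j.
# These scores are hypothetical, chosen only to illustrate the method.
influence = np.array([
    [0, 3, 2, 1],   # VAR:SG3.SP1 influences the other three
    [0, 0, 1, 0],   # MON:SG2.SP3 weakly influences IMC:SG3.SP1
    [0, 2, 0, 3],   # IMC:SG3.SP1 strongly influences IMC:SG2.SP4
    [0, 1, 0, 0],   # IMC:SG2.SP4 weakly influences MON:SG2.SP3
])

row_scores = influence.sum(axis=1)    # high row score: likely precursor/input
col_scores = influence.sum(axis=0)    # high column score: downstream/dependent

precursor = practices[int(row_scores.argmax())]
dependent = practices[int(col_scores.argmax())]
print(f"strongest precursor: {precursor}; most dependent: {dependent}")
```

With these invented scores, the computation surfaces VAR:SG3.SP1 as the strongest precursor and MON:SG2.SP3 as the most dependent practice, mirroring the pattern the team reported from the full spreadsheet.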

Data Types, Analysis Approaches, and Outcomes
At this point in phase 1, we decided to step back from individual cases and describe the types of data that would inform our research questions and hypotheses.We then described each data type's value, limitations, and challenges; the types of analysis we could do with each type of data; and candidate analysis outcomes.The results of this effort are shown in Table 2.
CMU/SEI-2012-TN-025 | 13

From this analysis, we determined that to successfully conduct this research, we needed a range of incidents (between 7 and 100) of a comparable type (such as violations of information confidentiality) over a period of time (weeks, months). We identified a partner with the potential to provide such data and started to develop a guiding scenario and supporting analysis approaches with this data source in mind. We suspended further analysis of the existing set of eight incidents and formulated the data analysis research approach for the phase 2 project described in Section 4.4. This concluded the analysis task for phase 1.

Collecting and Coding Cases
Early on, we relied heavily on the CERT Program's experience in collecting and coding insider threat cases and derived significant benefit as we developed the Request for Resilience Case Data template included as the appendix. However, we identified a substantive difference in this research effort's objectives. Insider threat case analysis, in large part, is focused on characterizing the threat and identifying practices and controls that may help mitigate it. The derivation of insider threat controls has largely been an informal process relying on expert knowledge of the effectiveness of controls against various security vulnerabilities. Phase 1 focused on rigorously evaluating the efficacy of resilience practices in the presence of a realized threat (i.e., a disruptive event). While this area of focus is also of great interest for the insider threat research, we were unable to obtain relevant experience or learning from that body of work to inform phase 1.
It is important to ensure that the data being collected will adequately inform the research questions and hypotheses. This, too, may be obvious, but this task does require in-depth examination and analysis.
We discussed several approaches for coding cases in a reliable and repeatable manner.We also discussed several approaches for building a repository of resilience cases that would support some form of structured and automated analysis, including the use of tools.Based on the case data we were able to collect, we determined that these discussions and any decisions resulting from them were premature.
Many of the individuals we approached for case data were reluctant to share such sensitive data and, in some cases, unwilling even with existing non-disclosure agreements. However, where there was an existing customer project work statement, a CERT-RMM licensing agreement, a CERT-RMM Users Group relationship, or a professional relationship between peers characterized by mutual respect and trust, we found that such individuals were often willing to share such data. That said, in several instances, they were not able to provide sufficient rationale to their internal stakeholders to obtain permission to share the data beyond a summary level. Though this may not be a major issue when dealing with government organizations with which we have a formal relationship, we need to provide more information and assistance to help our private sector partners develop this rationale.

Analyzing Data
In parallel with working with collaboration partners to obtain case data, we drafted an approach for analyzing the data and then sought the review of two senior SEI researchers. Based on their feedback, we learned that it was premature to develop such an approach in the absence of case data and in the absence of a full description of the background and motivation for phase 1.
CERT-RMM is a management model, which means that the resilience practices are stated at a fairly high level when viewed from an implementation and action perspective. We found that we needed to go to the subpractice level to gain sufficient understanding and detail to perform a reasonable mapping of case actions to resilience practices. In the future, we may need to map to finer-grained control descriptions as identified in the CERT-RMM Crosswalk [Partridge 2011].
After performing the preliminary analysis on a single event, we determined that the analyses necessary to support phase 1 and phase 2 require a range of events of a similar type over time (at least 7 with similar resilience practices; optimally 30-40). Trends over time provide a much richer data set than individual events.
Here are some of the challenging analysis questions we discussed:
• If we know the impact of a given incident, how do we determine the contribution (or not) of a given practice?
• Is there practice correlation or causation with respect to incident impact?
• Do we have cases we can compare where incidents were handled well versus not handled well?
• Which practices tended to be more significant from a statistical viewpoint in determining different outcomes?
• Should we focus on individual practices, or are practice interrelationships and groups of practices more applicable?
• How do we determine the degree of practice implementation (say, on a scale of 1-10)?
• How do we need to record data and what do we need to record to make our analysis outcomes repeatable and defensible?
• Given the often sparse and incomplete case descriptions we had to work with, how reliable and repeatable are scoring and analyses performed by subject matter experts?
• Should we assess at the case level or the individual practice level? For example, how well did the organization detect and respond across the entire event? What allowed them to succeed or fail (greater focus on the root cause)?
• Should we consider opportunities to work with collaborating organizations and establish continuous data collection over a moderate to longer period of time?
• Can assessment data inform this research?
If we use a particular set of data to formulate candidate hypotheses, we need new and comparable data to confirm or refute those hypotheses. One approach is to analyze existing data and create a model, followed by a period of collecting new data to use for validating the model. Another approach, assuming sufficient case data, is to collect data and use half of it to analyze and develop a model, and then use the other half to immediately validate the model. In this latter approach, there is no need to wait for a set period of time to collect more data for model validation purposes.

In general, if you have a relatively small data set for which new cases are not readily available, you will probably want to split the data set into an analysis set and a test set. The analysis set is used to determine the effectiveness of a given set of practices for a specific security concern. The test set is used to test whether the hypothesis generated using the analysis set holds for the test set, thus giving evidence as to its generalizability.
There is no set rule for choosing the sizes of the analysis and test sets. It depends on your relative concern with generalizability versus hypothesis generation, all other things being equal. For example, if you are able to collect data describing 20 events, you should probably analyze 10 and save 10 for testing.
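This kind of analysis/test split can be sketched in a few lines; a minimal Python illustration, where the 20 hypothetical case identifiers, the 50/50 split, and the fixed seed are assumptions for the example:

```python
import random

def split_cases(cases, analysis_fraction=0.5, seed=42):
    """Randomly partition incident cases into an analysis set
    (for hypothesis generation) and a test set (for validation)."""
    rng = random.Random(seed)   # fixed seed keeps the split repeatable
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * analysis_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example: 20 events split 10/10, as suggested in the text
events = [f"case-{i}" for i in range(1, 21)]
analysis_set, test_set = split_cases(events)
```

Randomizing before splitting avoids accidentally placing all events of one type (or one time period) in the same half.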

General
Other research teams that depend on incident case data may find the analysis approaches and methods considered during phase 1 and planned for during phase 2 to be of value.
The CERT Program Cyber Enterprise Workforce Management Directorate provided initial funding to support phase 1. This gave our research team the ability to explore our research questions, experiment with a variety of data and analysis approaches, and make key discoveries along the way about how to more effectively conduct case-based research tied to resilience practice outcomes. Phase 1 was essential for identifying potential collaboration partners, identifying sources of case data, and developing a defensible proposal for phase 2.

Objectives and Scope
For our proposed FY13 phase 2 project, we intend to continue to pursue the objective of measuring the contribution of resilience practices to reducing the occurrence and impact of disruptive events. Phase 2 is the next step toward a larger, long-term objective of developing a method to evaluate the extent to which resilience practices contribute to an organization's ability to continue to carry out its mission in the presence of operational stress and disruption.
Due to the nature of the incident data available to us, as described in Section 2.2, the research will initially focus on information confidentiality, specifically the unauthorized disclosure of sensitive data. We will examine the effect of resilience practices for preventing and containing data disclosure on reducing the time and cost (impact measures) of handling disclosure incidents. A set of practices will be proposed to and vetted by our research partners.

Research Hypotheses and Questions
The same general hypothesis is posed for phase 2 as the one we began with in phase 1: Disruptive events are more likely to have a positive outcome (i.e., response considered successful) when resilience practices are implemented than when they are not. Data will be collected and experiments conducted to test the more specific hypotheses that were formulated during phase 1 (listed in Section 2.1).
In testing the general hypothesis, we will examine each practice for its relative contributions as follows:
• to reducing the occurrence and impact of disclosure incidents

Guiding Scenario
Our guiding scenario, the unauthorized disclosure of sensitive information, is shown in Figure 2 as a time line describing potential events and the resulting impact. A qualitative description of the impact is provided, based on the classification of the disclosed data. In reality, we find that the activities within the incident that realize this impact are variable and possibly unknown. In our time line, we assume the realization of the impact. The precipitating activities involve two asset types: people and information (marked as green and yellow borders, respectively). People may not clearly understand the nature of the information or the impact of possible disclosure. The information may not be clearly categorized and may not be clearly marked. As the time line progresses, actions shift from protecting the information (or not) by people to protecting and using information through technology (marked as a black border).
If the information is well understood and clearly marked, the resilience practices for protecting information are considered adequately implemented. The time line progresses only if the people or technology practices are ineffective or missing. Unauthorized disclosure of sensitive information occurs when the user chooses a data distribution method inappropriate for the information's classification. Either the actions of people or an implemented technology practice could detect certain methods of unauthorized disclosure. Often this disclosure is reported by the originating user or the recipient of the data. Practices known as Cross Domain Solutions (CDS) are used to monitor for data marked for the wrong domain. These practices have various limitations. If the data bypasses these practices and their companion controls, the only remaining barrier to realizing the impact that results from disclosing sensitive information is the capability of the adversary and environmental factors.
The adversary's ability to retrieve sensitive information depends on the environment in which it resides: its technological container, such as a server, a website, or an accessible email attachment. The adversary must have access to this container. Once in possession of the information, the adversary must be capable of recognizing its utility. Finally, the adversary must have the ability to use the information. This portion of the time line depends on the adversary's practices associated with effectively using people, information, and technology assets.
Mitigation is dependent on the disruption of the time line at any of the described points, including those that affect the adversary's ability to use the information.

Data Analysis Approach
Phase 2 will focus on the measured contribution of resilience practices to the reduction of the occurrence and impact of disruptive events. Our research team will use a quasi-experimental research design commonly referred to within the social research community as the Solomon Four-Group Quasi-Experimental Design [Frankfort-Nachmias 2008, p. 104]. This design incorporates four experimental treatments, two of which involve a pretest baseline measurement and two of which do not. Table 3 depicts this with an example of four different partner experimental groups (organizational units) that will participate in the experiment. This research design predominantly handles the possibility that baseline measures, taken before experimental conditions (called the intervention) are applied, may influence the conduct of the experimental subjects and thus affect the baseline measures taken after the intervention. By incorporating the two treatments without pretest baseline measures, this research design provides greater generalizability by removing any influence of those measurements. If, however, this research determines that there is no danger of influence from the pretest baseline measurements, the design would be reduced to just the first two treatments listed in Table 3 and a straightforward comparison of intervention and lack of intervention. For this research, the intervention will be the purposeful and complete implementation of one or more related resilience practices, while a lack of intervention will be the absence of that implementation. We will implement the Solomon Four-Group Quasi-Experimental Design in an iterative fashion such that each iteration will enable the testing of a different set of resilience practices. Additionally, we will run each iteration for a predefined period of time to observe multiple incidents and the effect of resilience practices on the time and cost required to handle these incidents. Statistical power and sampling rules will be used to determine the minimum number of incidents required to conduct a comparison and, consequently, the minimum expected time period required for a given iteration.
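The minimum-incident determination can be approximated with the standard two-sample power formula for comparing means; a minimal Python sketch, where the effect size, standard deviation, significance level, and power are illustrative assumptions rather than values from the study:

```python
import math
from statistics import NormalDist

def min_incidents_per_group(effect_hours, sd_hours, alpha=0.05, power=0.80):
    """Approximate minimum incidents per treatment group needed to detect
    a difference of `effect_hours` in mean handling time between two
    groups (two-sided test on means, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # ~0.84 for 80% power
    n = 2 * ((z_alpha + z_beta) * sd_hours / effect_hours) ** 2
    return math.ceil(n)

# Illustrative: detect an 8-hour mean reduction when individual
# handling times vary with a standard deviation of 12 hours
n = min_incidents_per_group(effect_hours=8, sd_hours=12)
```

Dividing the required incident count by an organization's historical incident rate then gives a rough estimate of the minimum iteration duration.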
Once an iteration of a set of resilience practices is finished, we will conduct a comparative analysis as depicted in Figure 3. Specifically, incident handling times and costs will be recorded for each incident between identified time line events, such as
• an incident being reported
• followed by the spread being contained
• followed by system cleansing being completed
• followed by the system(s) being returned to normal operation
Armed with these measurements, we will conduct hypothesis tests to compare the four treatment groups to discern differences in incident handling times and costs for each of the previously identified event segments. Figure 3 depicts the comparison of the different groups graphically, with the non-intervention group shown as the red baseline and the intervention group shown as the green improved baseline. Figure 3 also depicts example conclusions from hypothesis tests in which specific differences in incident handling time and cost may be confirmed with different levels of confidence. Once a series of hypotheses are tested for a given set of related resilience practices, as shown in Figure 3, the same data will be used to conduct time-based predictions of incident handling times and costs as depicted in Figure 4.
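A comparison of this kind might use a test such as Welch's t-test on per-incident handling times; a pure-Python sketch, where the two groups' handling-time data are invented for illustration:

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for
    comparing mean incident handling times of two treatment groups
    with possibly unequal variances."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Illustrative handling times (hours) per incident
intervention    = [10, 12, 9, 11, 13, 10, 12]   # practices implemented
no_intervention = [18, 22, 17, 25, 20, 19, 21]  # practices absent
t, df = welch_t(intervention, no_intervention)
# A large negative t suggests the intervention group handled incidents faster
```

The same test applies per event segment (report-to-containment, containment-to-cleansing, and so on), not just to total handling time.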
Essentially, Figure 4 demonstrates that the handling times and costs for any given event segment or consecutive segments may be modeled with distributionfitting approaches, including a modern approach of Weibull modeling [Dodson 1994].This modeling focuses on the probability that handling times and costs will be below or above (currently arbitrarily) selected thresholds.Figure 4 includes examples of statements that may be concluded from Weibull modeling.Statements concluded from Weibull modeling may become the basis for establishing a set of benchmarks of incident handling times and costs that can be compared to benchmarks from other organizations or periods of time in the same organization.These time-based predictions may also prove useful in the real-time management of incidents and allocation of resources.
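Threshold statements of this kind follow directly from the Weibull cumulative distribution function; a small sketch, where the shape and scale parameters are assumed values for illustration (in practice they would be fitted to observed handling times):

```python
import math

def weibull_cdf(t, shape, scale):
    """P(handling time <= t) under a Weibull(shape, scale) model
    of incident handling time."""
    return 1.0 - math.exp(-((t / scale) ** shape))

# Illustrative parameters: scale = 24 hours, shape = 1.5
# Probability that an incident is handled within a 24-hour threshold:
p_within_24h = weibull_cdf(24, shape=1.5, scale=24)
# Probability that handling exceeds a 48-hour threshold:
p_over_48h = 1.0 - weibull_cdf(48, shape=1.5, scale=24)
```

Statements such as "about 63% of incidents are expected to be handled within 24 hours" under these assumed parameters are the kind of benchmark the text describes.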
The analytics discussed so far form the rudimentary base of analytics for the resilience practice research. With the collection of more data for each treatment group of a given set of related resilience practices, more sophisticated predictive analytics can be performed, as shown in Figure 5. In this predictive analytic approach, measurements from incidents (handling time and cost, implementation status of resilience practices, and other contextual factors) will be used to support statistical regression analysis. The analysis will produce regression equations that predict incident handling time and cost based on the use of specific resilience practices, along with other contextual factors. This approach will enable both a statistical determination of the significance of resilience practices and the actual degree of influence of different resilience practices on handling time and cost. Regression equations may be developed using data from all organizations and incident types, or from specific segments of them. In either case, the equations will provide predictions with 95% prediction intervals that may serve to support real-time management decision making during an incident. These equations may also be used in a what-if or sensitivity analysis during management planning to help organizations better prepare for handling incidents. As shown in Figure 6, this modeling approach requires a definition of first-pass success for each step in the incident-handling event time line. With such definitions of success, the collected data from the research will enable the computation of conditional success probabilities for each step in the time line. By multiplying these probabilities, a first-pass rolled yield may be calculated that provides a useful measure for benchmarking and measuring the degree of resilience practice improvement over time for a given organization. Noticeable improvements in resilience practice adoption will produce improved rolled yields for handling incidents.
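A one-predictor version of the regression-with-prediction-interval idea can be sketched in pure Python; the data, the single predictor (number of implemented practices), and the hard-coded t critical value are illustrative assumptions, whereas the study would use multiple predictors and contextual factors:

```python
import math
from statistics import mean

def fit_line(x, y):
    """Least-squares fit y = b0 + b1*x, e.g., handling time in hours
    as a function of the number of implemented resilience practices."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def prediction_interval(x, y, x_new, t_crit):
    """95% prediction interval for a new observation at x_new.
    t_crit is the two-sided 0.05 critical value for n - 2 degrees
    of freedom, looked up from a t table."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r ** 2 for r in resid) / (n - 2))
    xbar = mean(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    se = s * math.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / sxx)
    y_hat = b0 + b1 * x_new
    return y_hat - t_crit * se, y_hat + t_crit * se

# Illustrative data: practices implemented vs. handling time (hours)
practices = [0, 1, 2, 3, 4, 5, 6, 7]
hours     = [30, 27, 26, 22, 21, 18, 16, 14]
lo, hi = prediction_interval(practices, hours, x_new=4, t_crit=2.447)
```

A negative fitted slope would correspond to the hypothesized effect: more implemented practices, shorter handling times, with the interval quantifying the uncertainty in any single prediction.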
As this plan indicates, we will use a variety of analytical techniques, beginning with simple analytics that require little measurement data and progressing to predictive analytical techniques that are enabled by progressively more research data. This approach makes the research team more robust to the varying amounts of data made available by participating organizations.

Data Analytics and Validity
The analytical toolkit and research design documented in this plan remain driven by four types of validity: • operational scope: Are we working on the right problem?
• field significance: Will our solution have the intended impact in the field?
• technical soundness: Is our technical approach scientifically founded?
• technical significance: Is our technical approach well situated in the discipline?
With regard to operational scope, we will use analytics involving designed surveys, metadata analysis, and structured quantitative decision techniques (e.g., the analytic hierarchy process [AHP], multi-attribute utility theory [MAUT], variants of wideband Delphi, and conjoint analysis) to quantify the utility or value of this work to various stakeholders.
With regard to field significance, analytics will play a key role in maximizing external validity through a number of mechanisms. First, the analytical methodology will provide valuable insight into what operational measures should be collected to meet the needs of the resilience performance outcomes and leading indicators. A primary philosophy guiding this activity will come from the research of Douglas Hubbard [Hubbard 2010], who shares a variety of ideas about how to measure intangibles. Measuring intangibles may become a significant activity within phase 2. Second, field quasi-experiments will be designed and analyzed, thereby providing maximum generalizability. Third, analytical results of surveys and structured interviews will provide quantified feedback from stakeholders on the perceived alignment and impact of this research.
With regard to technical soundness, analytics will play a key role in maximizing internal validity by ensuring a sound scientific approach to the research. This analytical plan outlines the use of popular and accepted social science research designs and scientifically driven hypothesis tests whose sampling plans build on sampling units, sampling frames, and sample size determined by acceptable alpha and beta error. Further enhancement of this analytical plan will include the analysis and mitigation of a list of commonly regarded external and internal threats to validity. Each step of the scientific research and analytics will be subject to technical peer review to ensure timely revision of the scientific methods.
With regard to technical significance, analytics will play a key role in contributing new knowledge and innovation within the field of security and resilience. Analytics will be used to quantify the completeness and quality of the early literature search within the resilience domain.
Additionally, the research team may choose to take advantage of text analytics as one approach to deciphering large amounts of existing literature and artifacts relevant to this research. Lastly, analytics may also help collaboration partners demonstrate the contribution of this research to their existing research and operational practices.
In summary, this analytical plan describes a set of analytical tools, methods, and activities driven by the four types of validity to ensure a successful first pass at this research in FY13. The analytical rigor described above will also position the results of phase 2 well for distribution in leading research publications and conference presentations.

Related Work and Communities of Practice
During phase 2, we will identify and compare other approaches being used to answer similar hypotheses and research questions. The primary basis for comparison is the measurement of the impact of applying resilience practices to validate their effectiveness (rather than endorsing the practices based on subject matter expert consensus alone). We will use other criteria as well, such as a focus on compliance to a standard (checklist approach) versus continuous improvement, and the scope of the practices across the resilience disciplines of security, business continuity and disaster recovery, and certain aspects of IT operations, all of which CERT-RMM addresses.
Table 4 describes some of the organizations and projects that may be included in our comparison of related work.

Summary
This report describes a research study that was performed from January through September 2012 (phase 1), and a proposed phase 2 research project to be conducted during FY13 (October 2012 through September 2013), subject to available funding.
The goal of this overall research effort (both phase 1 and phase 2) is to inform the general question "Do resilience practices have a discernible (measurable) effect on operational resilience?" In the FY13 phase 2 proposal, we express the objective as a statement: Measure the contribution of resilience practices to reducing the occurrence and impact of disruptive events.

Our general hypothesis is
Disruptive events are more likely to have a positive outcome (i.e., response considered successful) when resilience practices are implemented than when they are not.
After presenting our objectives, research questions, and hypotheses, we describe the various directions we pursued to define resilience cases, engage with collaboration partners, collect case data, and attempt to analyze such data. We share a wide range of observations and insights gained during phase 1 and present a summary of the proposed phase 2 project. That summary includes an in-depth discussion of promising analysis approaches and a first look at other related research efforts that we intend to build upon.
The authors of this report welcome readers' comments and questions. We are actively seeking collaboration partners who may be willing to share their incident data with us and receive the results of our ongoing research in return.

Glossary

consequence
The unwanted effect or undesirable outcome on the organization as the result of exploitation of a condition or threat (CERT-RMM RISK). There is a consequence if Impact > Threshold for some resilience requirement. An example of a consequence is the lack of availability of a key customer-facing website for 48 hours due to malware infection. As a result, customers are unable to complete specific business transactions, leading to a measurable reduction in revenue for the outage time period, an expression of impact.

effort to restore
The total number of staff hours that are required to restore operational capability during and following a disruptive event (incident).
event
One or more occurrences that affect organizational assets and have the potential to disrupt operations (CERT-RMM IMC).

impact
To have a direct effect upon (www.merriam-webster.com); a level of productive capability of an asset that has been lost as the result of exploitation of a condition or threat, after all incident response actions have been taken; expressed as time to restore, effort to restore, and cost to restore for phase 1.
incident
An event (or series of events) of higher magnitude that significantly affects organizational assets and requires the organization to respond in some way to prevent or limit organizational impact (CERT-RMM IMC).

incident response
The actions the organization takes to prevent or contain the impact of an incident to the organization while it is occurring or shortly after it has occurred (CERT-RMM IMC).

resilience case
A case in which an organization was successful or unsuccessful in satisfying a resilience requirement in the presence of and after a disruptive event. A resilience case may involve either a consequence or resilience evidence.
resilience evidence
Demonstrated resilience of an organization as the result of exploitation of a condition or threat. Resilience evidence accrues if Impact <= Threshold for some resilience requirement. An example of resilience evidence is service and personnel transition to a backup facility with restored, core IT operations and telephony services within 24 hours (threshold) as specified in the service continuity plan.

resilience practice
A method or technique that helps satisfy a resilience requirement.

resilience requirement
A constraint that the organization places on the productive capability of an asset (information, technology, facilities, personnel) to ensure that it remains viable and sustainable when charged into production to support a service (CERT-RMM). Availability service level agreements for key servers are an example of a resilience requirement.

resilience threshold (or just threshold)
The minimal level of productive capability of an asset as determined by resilience requirements.

sustainment practices
Activities and use of related controls necessary to maintain an asset in a desired operational state when it is subjected to harm or disruptive events.

time to restore
The total number of hours, days, weeks, or months that are required to restore operational capability during and following a disruptive event (incident); this may also include total elapsed (calendar) time.
List of Figures
Figure 1: Relevant Concepts and Relationships Associated with Operational Resilience
Figure 2: Unauthorized Disclosure of Sensitive Information
Figure 3: Comparative Analytics of Incident Handling Time and Cost
Figure 4: Time-Based Predictive Analytics of Incident Handling Time and Cost
Figure 5: Predictive Analytics of Incident Handling Time and Cost Using Leading Indicators
Figure 6: "Rolled Yield" Probability Model of Properly Handled Incidents


• hurricane requiring migration to a backup site
• fire requiring migration to a backup site
Technology availability
• web server compromise due to malicious software
• email phishing resulting in user credentials being provided to a malicious website
• prevention of a zero-day exploitation that would have installed malicious software
• insufficient monitoring and alerting to determine if requests to download unauthorized files were correctly blocked
• filled log volume preventing user access to email

• to reducing the time and cost of handling disclosure incidents
• to the root causes of incidents
• the ratio of the cost of implementing the practice to the resulting cost savings of handling disclosure incidents
In addition to the research questions listed in Section 1.2, during phase 2 we will seek answers to these:
• What factors are driving the occurrence of disclosure incidents, and what controllable factors exist to reduce the rate of disclosure incidents? Might resilience practices and measures [Allen 2010, Allen 2011a, Allen 2011b] predict candidate factors?
• What factors contribute to the time required to handle disclosure incidents, and what percentage of the total time do they account for?
• What factors contribute to the cost required to handle disclosure incidents, and what percentage of the total cost do they account for? Cost calculations will be informed by the Cost of Incidents and Mean Cost of Incidents metrics from the CIS Consensus Security Metrics [CIS 2010].

Figure 6 depicts another type of analytical model to pursue during phase 2, one that builds upon probability models commonly used in manufacturing environments. This probability model represents the concept of what is called a first-pass, rolled yield model.
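The first-pass rolled yield is simply the product of the conditional success probabilities for each time line step; a minimal sketch with invented probabilities for the four steps named earlier:

```python
# Conditional success probabilities for each step in the incident
# handling time line (illustrative values, not study data):
step_success = {
    "incident reported":         0.95,
    "spread contained":          0.90,
    "system cleansed":           0.92,
    "normal operation restored": 0.97,
}

# First-pass rolled yield: the probability that every step succeeds
# on the first attempt, obtained by multiplying the step probabilities.
rolled_yield = 1.0
for p in step_success.values():
    rolled_yield *= p
# Roughly 0.76 here: about 3 in 4 incidents are handled cleanly end to end.
```

Because the yield is a product, a single weak step dominates: improving the worst step's probability raises the overall yield more than polishing an already-strong step.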


Table 1: Extension of Resilience Requirements to Types of Resilience Assets
Table 2: Data Types, Candidate Analysis Approaches, and Potential Analysis Outcomes
Table 3: Solomon Four-Group Research Design
Table 4: Prospects for Related Work Comparisons
An architectural framework for network resilience based on the two-phase strategy D2R2+DR: defend, detect, remediate, recover, diagnose, refine. Ongoing work in simulation, experimentation, and metrics [Sterbenz 2012]