Measures for Managing Operational Resilience

Abstract : How resilient is my organization? Have our processes made us more resilient? Members of the CERT(Registered Trademark) Resilient Enterprise Management (REM) team are conducting research to address these and other related questions. The team's first report, Measuring Operational Resilience Using the CERT Resilience Management Model, defined high-level objectives for managing an operational resilience management (ORM) system, demonstrated how to derive meaningful measures from those objectives, and presented a template for defining resilience measures, along with example measures. In this report, REM team members suggest a set of top 10 strategic measures for managing operational resilience. These measures derive from high-level objectives of the ORM system defined in the CERT Resilience Management Model, Version 1.1 (CERT-RMM). The report also provides measures for each of the 26 process areas of CERT-RMM, as well as a set of global measures that apply to all process areas. This report thus serves as an addendum to CERT-RMM Version 1.1. Since CERT-RMM practices map to bodies of knowledge and codes of practice such as ITIL, COBIT, ISO2700x, BS25999, and PCI DSS, the measures may be useful for measuring security, business continuity, and IT operations management processes, either as part of adoption of CERT-RMM or independent of it.

Our research on resilience measurement and analysis focuses on addressing the following questions, which are often asked by organizational leaders: 1. How resilient is my organization? 2. Have our processes made us more resilient? 3. What should be measured to determine if performance objectives for operational resilience are being achieved?
To establish a basis for measuring operational resilience, we relied on the CERT-RMM as the process-based framework against which to measure. CERT-RMM comprises 26 process areas (such as Incident Management and Control (IMC) and Asset Definition and Management (ADM)) that provide a framework of goals and practices at four increasing levels of capability (Incomplete, Performed, Managed, and Defined.) Our initial work 4 provided organizational leaders with tools to determine and express their desired level of operational resilience. Specifically, we defined high-level objectives for an operational resilience management program, for example, ''in the face of realized risk, the program ensures the continuity of essential operations of high-value services and their associated assets.'' We then demonstrated how to derive meaningful measures from those objectives using a condensed Goal Question (Indicator) Metric method (refer to Park 1998 5 ), for example, determining the probability of delivering service through a disruptive event. We also defined a template for defining resilience measures and presented example measures using the template.
Too often, organizations collect ''type count'' measurements (such as numbers of incidents, systems with patches installed, or people trained) with little meaningful context on how these measures can help inform decisions and affect behavior. Based on the Goal Question (Indicator) Metric method, we identified strategic measures that help organizational leaders determine which process-level measures best address their needs. What follows is a description of five organizational objectives for managing operational resilience and 10 strategic measures for an operational resilience management (ORM) program The ORM program defines an organization's strategic resilience objectives (such as ensuring continuity of critical services in the presence of a disruptive event) and resilience activities (such as the development and testing of service No part of this newsletter may be reproduced in any form -by microfilm, xerography, or otherwise -or incorporated into any information retrieval system without the written permission of the copyright owner. Requests to publish material or to incorporate material into computerized databases or any other electronic form, or for other than individual or internal distribution, should be addressed to Editorial Services, 325 Chestnut Street, Suite 800, Philadelphia, PA 19106. All rights, including translation into other languages, reserved by the publisher in the U.S., Great Britain, Mexico, and all countries participating in the International Copyright Convention and the Pan American Copyright Convention. Authorization to photocopy items for internal or personal use, or the personal or internal use of specific clients may be granted by Taylor & Francis, provided that $20.00 per article photocopied is paid directly to Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA. The fee code for users of the Transactional Reporting Service is ISSN 0736-6981/06/$20.00+$0.00. The fee is subject to change without notice. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Product or corporate names may be trademarks or registered trademarks, and are only used for identification and explanation, without intent to infringe. POSTMASTER: Send address change to EDPACS, Taylor & Francis Group, LLC., 325 Chestnut Street, Suite 800, Philadelphia, PA 19106. continuity plans). We use an example of acquiring managed security services from an external provider to show how each measure could be used. Managed security services may include network boundary protection (such as firewalls and intrusion detection systems), security monitoring, incident management (such as forensic analysis and response), vulnerability assessment, penetration testing, and content monitoring and filtering.
Organizational objective 1: The ORM program derives its authority from-and directly traces it to-organizational drivers (which are strategic business objectives and critical success factors), as indicated by the following measures: Measure 1: Percentage of resilience activities that do not directly (or indirectly) support one or more organizational drivers.
Example use: External security services replace comparable inhouse services with a lower cost (less effort) and more effective (less impact from incidents) solution. After external security services are operational, 75 percent of in-house efforts no longer support organizational drivers. This measure can be used to ensure an effective transition of designated in-house services to externally-provided services and to retrain/reassign staff currently performing such services.
Measure 2: For each resilience activity, the number of organizational drivers that require it to be satisfied (the goal is equal to or greater than 1).
Example use: An example of a resilience activity is formalizing a relationship with a security services provider using a contract or service level agreement (SLA) that includes all resilience specifications. There is at least one organizational driver that calls for having security services in place to achieve the driver. This driver likely maps to a personal objective of the chief information officer or chief security officer. If there is no such traceability, one or more drivers may require updating.
Organizational objective 2: The ORM program satisfies resilience requirements that are assigned to high-value services and their associated assets, as indicated by the following measures: Measure 3: Percentage of high-value services that do not satisfy their assigned resilience requirements.
Example use: Resilience requirements for security services are specified in the SLA. Provider performance is periodically reviewed to ensure that all services are meeting the SLA requirements (for example, high priority alerts from incident detection systems are resolved within xx minutes). Optimally, this percentage should be zero. If it is greater than an SLA-stated threshold (for example, 20 percent for service A), corrective action is taken and confirmed. Measure 4: Percentage of high-value assets that do not satisfy their assigned resilience requirements.
Example use: This example is similar to the one above. The incident database is a high-value asset that is required to provide incident response services. The SLA specifies resilience requirements for DECEMBER 2011 this database, including daily automated backups and quarterly and event-driven (backup server upgrade and high-impact security incident) testing to ensure the provider's ability to successfully restore from backups. Optimally, this percentage should be zero. If it is greater than an SLA-stated threshold (for example, 20 percent for asset B), corrective action is taken and confirmed.
Organizational objective 3: The ORM program-via the internal control system-ensures that controls for protecting and maintaining high-value services and their associated assets operate as intended, as indicated by the following measures: Measure 5: Percentage of high-value services with controls that are ineffective or inadequate.
Example use: The SLA identifies the controls (policies, procedures, standards, guidelines, tools, etc.) that are required by a service. These controls can be tailored versions of the controls that the organization uses or can be negotiated based on the provider's standard suite of controls. Provider implementation of these controls is periodically reviewed (audited, assessed, scans performed, etc.). Optimally, this percentage should be zero. If it is greater than an SLA-stated threshold (for example, 20 percent for service A), corrective action is taken and confirmed.
Measure 6: Percentage of high-value assets with controls that are ineffective or inadequate.
Example use: This measure is as described above, with asset controls stated in the provider SLA.
Organizational objective 4: The ORM program manages operational risks to high-value assets that could adversely affect the operation and delivery of high-value services, as indicated by the following measures: Measure 7: Confidence factor that risks from all sources that require identification have been identified.
Example use: Major sources of risk are initially identified in the provider SLA and as part of an ongoing review based on changes in the operational environment within which services are provided. The elements that contribute to ''confidence factor'' (such as risk thresholds by service) are also identified. Confidence factor is represented as a Kiviat diagram 6 showing plan versus actual for all sources. Analysis of provider gaps is reviewed on a periodic basis and corrective action is taken and confirmed to reduce unacceptable gaps.
Measure 8: Percentage of risks with impact above threshold.
Example use: Assessment of provider risk is performed on a periodic basis as specified in the SLA. Optimally, this percentage should be zero. If it is greater than an SLA-stated threshold (for example, 20 percent for risk type A), corrective action is taken and confirmed.
Organizational objective 5: The ORM program ensures the continuity of essential operations of high-value services and their associated assets in the face of realized risk, as indicated by the following measures: Measure 9: Probability of delivered service through a disruptive event.
Example use: The SLA states service-specific availability and service levels to meet, both steady state and in degraded mode. Provider performance is periodically reviewed, including during and after a disruptive event (power outage, cyber attack, etc.). Probability of delivered service is determined and evaluated as a trend over time. Corrective action is taken and confirmed as required.
Measure 10: For disrupted, high-value services with a service continuity plan, percentage of services that did not deliver service as intended throughout the disruptive event.
Example use: The SLA includes requirements for service-specific continuity (SC) plans. For provider services with SC plans that do not maintain required service availability and service levels, corrective actions are taken and confirmed, including updates to SC plans. In addition, the customer uses this as an opportunity to review and update its own SC plans that depend on provider services, where service was not delivered as intended.
All these strategic measures derive from lower-level measures 7 at the CERT-RMM process area level, including average incident cost by root cause type and number of breaches of confidentially and privacy of customer information assets resulting from violations of provider access control policies.
To help organizational leaders determine what measures work best for their organization, we are collaborating with members of the CERT-RMM Users Group 8 , which includes the United States Postal Inspection Service, Discover Financial Services, Lockheed Martin, and Carnegie Mellon University. Through a series of twoday workshops, members define an improvement objective, assess their current level of operational resilience against that objective, identify areas of improvement, and implement improvement plans using the CERT-RMM processes and candidate measures as the guide. Please contact us if you are interested in joining a CERT-RMM Users Group.

ADDITIONAL RESOURCES
To learn more about CERT work in Resilience Measurement, please visit http://www.cert.org/resilience/rma.html