An Investigation of Techniques for Detecting Data Anomalies in Earned Value Management Data

Abstract: Organizations rely on valid data to make informed decisions. When data integrity is compromised, the veracity of the decision-making process is likewise threatened. Detecting data anomalies and defects is an important step in understanding and improving data quality. The study described in this report investigated statistical anomaly detection techniques for identifying potential errors associated with the accuracy of quantitative earned value management (EVM) data values reported by government contractors to the Department of Defense. This research demonstrated the effectiveness of various statistical techniques for discovering quantitative data anomalies. The following tests were found to be effective when used for EVM variables that represent cumulative values: Grubbs' test, the Rosner test, the box plot, autoregressive integrated moving average (ARIMA) models, and the control chart for individuals. For variables related to contract values, the moving range control chart, moving range technique, ARIMA, and Tukey box plot were equally effective for identifying anomalies in the data. One or more of these techniques could be used to evaluate data at the point of entry to prevent data errors from being embedded and then propagated in downstream analyses. A number of recommendations regarding future work in this area are proposed in this report.


1 Introduction

The Problem of Poor Quality Data
Organizations rely on valid data. They use the data to manage programs, make decisions, prioritize opportunities, and guide strategy and planning. But how reliable are the data organizations collect and use? The problem with poor data quality is that it leads to poor decisions. In addition, the rework required to correct data errors can be quite costly.
Existing evidence suggests that poor data quality is a pervasive problem in both industry and government. According to a report released by Gartner in 2009, the average organization loses $8.2 million annually because of poor data quality. The annual cost of poor data to U.S. industry has been estimated to be $600 billion [Gartner 2009]. Research indicates the Pentagon has lost more than $13 billion due to poor data quality [English 2009].

Data Quality Defined
Data are of high quality if "they are fit for their intended uses in operations, decision making, and planning" [Juran 1951]. This definition implies that data quality is both a subjective perception of the individuals involved with the data and an objective property of the measurements based on the data set in question. A number of studies have indeed confirmed that data quality is a multi-dimensional concept [Ballou 1985, Ballou 1998, Huang 1999, Redman 1996, Wand 1998, Wang 1996]. An international standard data quality model identifies 15 data quality characteristics: accuracy, completeness, consistency, credibility, currentness, accessibility, compliance, confidentiality, efficiency, precision, traceability, understandability, availability, portability, and recoverability [ISO 2008].

Data Defects vs. Data Anomalies
A data defect is defined as a data value that does not conform to its quality requirements. 1 Larry English defines it similarly as an item that does not conform to its quality standard 2 or customer expectation [English 2011].
Data defects come about in a variety of different ways, including human errors and errors created by faulty processing of the data. Examples of data defects include missing data, errors caused by typos, incorrectly formatted data, data that are outside the range of acceptable values for an attribute, and other similar problems. English has developed a classification of data defects that is summarized in Appendix A.
Some data defects are easier to detect than others. For example, a missing data value can be readily identified through simple algorithms that check for null values within a data field. Likewise, values that are clearly out of range of acceptable values for a datum can be detected using simple value-checking methods (e.g., a living person's birth date that is incorrectly entered so that it appears that the person is 300 years old). However, there is a class of defects that are more difficult to pinpoint. These are the data values that are referred to as anomalies.
1 A quality requirement is an application requirement that eliminates or prevents data errors, including requirements for domain control, referential integrity constraints, and edit and validation routines.
2 A quality standard is a mandated or required quality goal, reliability level, or quality model to be met and maintained [English 2011].
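Checks like these are simple to automate at the point of entry. The sketch below (hypothetical helper name and illustrative bounds of our own choosing) shows how null and range checks might be expressed:

```python
def check_value(value, lo=None, hi=None):
    """Return a list of issues found for a single datum (hypothetical helper)."""
    issues = []
    if value is None:
        return ["missing value"]
    if lo is not None and value < lo:
        issues.append(f"below minimum {lo}")
    if hi is not None and value > hi:
        issues.append(f"above maximum {hi}")
    return issues

# A birth year implying an age near 300 fails a simple range check:
print(check_value(1712, lo=1890, hi=2012))   # → ['below minimum 1890']
print(check_value(None))                     # → ['missing value']
print(check_value(1985, lo=1890, hi=2012))   # → []
```

Checks of this kind catch only the defects with known acceptable ranges; the anomalies discussed next require statistical context.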
A data anomaly is not the same as a data defect. A data anomaly might be a data defect, but it might also be accurate data caused by unusual, but actual, behavior of an attribute in a specific context. Data anomalies have also been referred to as outliers, exceptions, peculiarities, surprises, and novelties [Lazarevic 2008].
Chandola and colleagues refer to data anomalies as patterns in data that do not conform to a well-defined notion of normal behavior [Chandola 2009]. This is similar to how Hawkins defines an outlier as "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" [Hawkins 1980]. Johnson defines an outlier as "an observation in a data set which appears to be inconsistent with the remainder of that set of data" [Johnson 1992]. In this report, we use the term "anomaly" to refer to outliers, exceptions, peculiarities, and similarly unusual values.
Anomaly detection techniques have been suggested for numerous applications such as credit card fraud detection, clinical trials, voting irregularity analysis, data cleansing, network intrusion, geographic information systems, athlete performance analysis, and other data-mining tasks [Hawkins 1980, Barnett 1998, Ruts 1996, Fawcett 1997, Johnson 1998, Penny 2001, Acuna 2004, Lu 2003].

Current State of Practice
A burgeoning industry has developed to address the problem of data quality. Software applications are available that detect and correct a broad spectrum of data defects that exist in enterprise databases. Losses due to data quality issues would be higher than they are if not for the adoption of these data quality tools. According to Gartner, the data quality tools market grew by 26% in 2008, to $425 million [Gartner 2009]. These tools are geared toward customer relationship management (CRM), materials, and to a lesser degree, financial data. The Gartner survey found that 50% of respondents use data quality tools to support master data management (MDM) initiatives, and more than 40% use data quality technologies to assist in systems and data migration projects.
According to Ted Friedman, an analyst with The Gartner Group, data quality tools have been most often used in an offline, batch mode to cleanse data outside the boundaries of operational applications and processes [Kelly 2009]. Figure 1 provides an example of a typical CRM data identification/correction algorithm. Chen and colleagues state that "there is much prior work on improving the quality of data that already resides in a database. However, relatively little attention has been paid to improved techniques for data entry" [Chen 2009]. Friedman notes, "Gartner advises clients to consider pervasive data quality controls throughout their infrastructure, ensuring conformance of data to quality rules at the point of capture and maintenance, as well as downstream…Companies should invest in technology that applies data quality rules to data at the point of capture or creation, not just downstream" [Kelly 2009].
Much of the current work on data quality in the Department of Defense (DoD) is limited to identifying missing or duplicate data and discrepancies in recorded values from multiple sources. Other work at the DoD focuses on identifying business rules to screen for defects in repository data. Work is also ongoing in the DoD to apply automated data screening techniques to identify defects.

Our Research Focus
In our data quality research, SEMA is focusing on the accuracy characteristic of the International Organization for Standardization (ISO) 25012 quality model. Within the model, accuracy is defined as "the degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use" [ISO 2008].
Specifically, the objective of the study described in this report was to evaluate statistical techniques that could be used proactively to identify more and varied kinds of data anomalies than have thus far been recognized in the DoD.

Collaborators and Data Source for this Research
To accomplish our objectives, we collaborated with the Office of the Under Secretary of Defense for Acquisition, Technology, and Logistics (OUSD (AT&L)), Acquisition Visibility (AV) organization.

What is Earned Value Management?
Earned value management is a program or project management method for measuring performance and progress in an objective manner. EVM combines measurements of scope, schedule, and cost in a single integrated system. Figure 2 summarizes some of the key concepts and data items of the EVM system. A detailed discussion of EVM is beyond the scope of this paper. For a detailed description, see the resources available from the Defense Acquisition University [DAU 2011].
For our data anomaly detection research, we selected several EVM variables: BCWS, BCWP, and ACWP are shown in Figure 2 and are used together to measure performance in the EVM system. NCC and CBB are figures associated with government contracts that remain constant unless formally changed and hence are not routinely part of the EVM system of measures. Activities 1 through 4 are discussed in Sections 2.1.1 to 2.1.4. Section 2.2 describes each of the anomaly detection techniques and how they were applied to the EVM data.

Conduct Literature Search
Our literature research focused on the analytical strengths and limitations of existing anomaly detection techniques and their potential appropriateness for use in this research. Our team also reviewed the capabilities of some of the leading commercial data quality software tools to better understand the techniques that they incorporate. A brief summary of the review is presented in Appendix D.
Over 210 journal articles, web sites, and reference books were collected and catalogued for initial scanning by team members. The references were rated based on relevancy and items of high relevance were assigned to team members for in-depth review. Hodge and Austin partition their discussion of outlier detection methodologies under three overall categories: statistical models, neural networks, and machine learning. They also distinguish between clustering, classification, and recognition. There is no single definitive typology of anomaly detection techniques, and the techniques sometimes overlap several of these proposed categories. However, Chandola and colleagues provide a useful starter set to establish a high-level landscape of the techniques. All three papers, particularly the one by Chandola and colleagues, cite many references where these kinds of anomaly detection techniques have been used.
All of the techniques of anomaly detection that we describe in this document rely on the existence of patterns of "normal" behavior, from which the anomalies can be differentiated. Some of the techniques are limited to univariate data distributions while others consider anomalies based on atypical deviations from statistical relationships among two or more variables.

Select Data Source
During the latter part of 2010, our research team conducted two site visits to meet with data analysts from the DoD Acquisition Visibility (AV) organization. AV is responsible for providing accurate, authoritative, and reliable information supporting acquisition oversight, accountability, and decision making throughout the DoD. A key outcome of the meetings was the selection of the EVM-CR as the source of data for evaluating anomaly detection techniques. This repository source was selected based on several criteria, including the ability to obtain access privileges to the data, the abundance and richness of the data, and existing reports of errors in the data submitted to the repository. This evidence was drawn from analyses conducted by AV analysts as they were preparing reports to support executive decision making.
Program performance information is reported to the EVM-CR on a monthly basis. The volume of EVM data reported each month is staggering, and using skilled analysts to perform tedious, manual inspections of the data is impractical. For this reason, the development of an automated method for identifying potential data errors would be extremely beneficial, since it would relieve the analyst from searching for needles in the proverbial haystack.
The EVM data were provided in MS-Excel workbook format. After receiving the data for this research study, we organized the data set for analysis and characterized its contents. It consisted of 6211 records associated with 359 program tasks. A program task is made up of multiple records in a time series. Each record in the data set contained 167 columns. Most of these columns were text fields containing descriptive and administrative details about the record, such as who submitted it, files that were submitted, when it was submitted, the contract under which it was being submitted, and so on. Given our focus on statistical techniques that apply to quantitative measures, most of the content in a record was not used.

Select Test Cases and Establish Anomalous Data Values
The research team decided it would be most efficient to focus on a sample of the data and chose to examine the time series profiles of the 359 program tasks. From these, the research team selected four program tasks to use as test cases for evaluating the efficacy of the anomaly detection techniques. Criteria considered to select the cases included the number of records available and their completeness in terms of the variables of interest (i.e., BCWP, ACWP, BCWS, NCC, and CBB). As described further in Section 2.1.4, the nature of the data also influenced the techniques that could be used. The objective was to obtain an effective sample for evaluation purposes.
To establish the anomalous data values in the test cases, the team asked an OSD EVM subject matter expert (SME) to review them; this SME had extensive experience reviewing and analyzing data from the EVM-CR. This was necessary because the actual disposition of the data was unknown, and the research focus was on detecting anomalies that had a high probability of being defects.
We presented both the actual data values and graphical representations of the data and asked the SME to identify anomalies that should be investigated as possible data errors. One example of the results of the SME review is illustrated in Figure 3. The arrows indicate the values that the SME identified as data points that should be investigated as possible data errors. All test cases used in this research study are presented in Appendix B. The results of the SME review served as the benchmark for determining the effectiveness of each anomaly detection technique that was to be investigated.

Select Anomaly Detection Techniques
To determine which anomaly detection technique is appropriate for a given situation, the nature of the data being assessed and the type of anomaly being searched for should be considered.
The team's research focus was to identify techniques for finding specific types of data anomalies associated with accuracy. Data profiling methods and tools are already available for identifying and correcting the following:
• missing data
• incomplete data
• improper formats
• violations of business rules
• redundancy
Therefore, the team purposely set aside these basic types of data anomalies and focused on the accuracy attribute of five variables:

• budgeted cost for work scheduled (BCWS)
• budgeted cost for work performed (BCWP)
• actual cost of work performed (ACWP)
• negotiated contract cost (NCC)
• contract budget base (CBB)
The first three variables are cumulative cost values that are reported on a monthly basis. NCC and CBB are not cumulative. Based on our initial profiling of these variables, we believed that statistical analysis approaches would be fruitful as a means of identifying anomalous data that could be caused by error. The assumption was that a normal datum belongs to a grouping or a patterned distribution of other data points. When the grouping or distribution is understood, a model can be developed that establishes the boundaries of a region of normalcy, outside of which a datum is considered anomalous.

Anomaly Detection Techniques Investigated
As part of the literature review, we identified a number of statistical anomaly detection approaches that looked promising. These techniques were specifically developed to identify anomalous data. They included statistical control chart techniques: the control chart for individuals, the moving range (mR) control chart, the exponentially weighted moving average chart, and the moving average chart. We also investigated the related techniques described in the sections that follow: Grubbs' test, the Rosner test, the Dixon test, the Tukey box plot, ARIMA models, the 3-sigma outlier test, and the moving range technique.

Statistical Control Chart Techniques
A control chart is a statistical device principally used for the study and control of repetitive processes. It is a line graph that displays variation in a time-ordered fashion. A center line and control limits (based on ± 3 standard deviations) are placed on the graph to help analyze the patterns in the data. Common cause variation occurs randomly and behaves like a constant system of chance causes that are predictable. While individual values are all different, as a group, they tend to form a pattern that can be described by a probability distribution. A process that experiences only common cause variation is said to be in statistical control. A process that experiences special cause variation is said to be out of statistical control. Special cause variation refers to any factors causing variation that cannot be adequately explained by any single probability distribution of the output.
Walter Shewhart introduced the first control chart system during the 1930s [Shewhart 1931]. Since then, a large number and wide variety of control chart schemes have been developed for specific applications and objectives. For example, some control chart schemes are effective for detecting anomalies in a data set, while others are effective for detecting a subtle shift in the average value of a key characteristic measure. Some control chart implementations assume continuous-scaled measurement data, while other chart schemes assume the use of discrete data (such as defect counts).
Based on our research, we selected several control charts that held potential for identifying anomalies in the data. These are listed in Table 1. While the appearance of the different control charts is similar, the parameters of the charts themselves are very different. Parameter calculations for each of the control charts are accessible in the references provided in Table 1. We explored the efficacy of each control chart on each of the EVM variables under study. For the EVM variables BCWS, BCWP, and ACWP, the following approach was taken (both for control charts and the other techniques described in the following sections):
1. Filter EVM data based on task name of interest.
2. Transform the cumulative data into monthly values by calculating month-to-month differences. 4
3. Generate the control chart for the monthly values using statistical software. 5
4. Analyze the results by comparing the generated control chart to the relevant time series test case and compile results.
For the EVM variables NCC and CBB, the above steps were followed, with the exception of step 2, which was eliminated since NCC and CBB are non-cumulative variables.
An example of this type of control chart analysis is illustrated in Figure 5. The time series cumulative profile of ACWP is indicated in the chart at the right of the diagram. The control chart for the data is on the left. Two data anomalies are detected in the control chart, as indicated by the values' positions above the upper control limit.
4 BCWS, BCWP, and ACWP are cumulative values. The indicated calculations transform the data into monthly cost values.
5 Minitab is a statistical software package developed at Pennsylvania State University. See the Minitab website for more information (http://www.minitab.com).
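The calculations behind an individuals control chart are straightforward. The sketch below (function names and data are illustrative, not from the EVM-CR) estimates sigma from the average moving range divided by the bias-correction constant d2 = 1.128, the standard approach for individuals charts, and flags monthly values outside the 3-sigma limits:

```python
def individuals_chart_limits(values):
    """Center line and 3-sigma limits for an individuals (X) chart.

    Sigma is estimated as mean(mR) / d2, with d2 = 1.128 for moving
    ranges of two consecutive observations.
    """
    n = len(values)
    center = sum(values) / n
    moving_ranges = [abs(values[i] - values[i - 1]) for i in range(1, n)]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    sigma_hat = mr_bar / 1.128          # d2 for subgroups of size 2
    return center, center - 3 * sigma_hat, center + 3 * sigma_hat

def flag_points(values):
    """Indices of monthly values falling outside the control limits."""
    _, lcl, ucl = individuals_chart_limits(values)
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

# Monthly (differenced) cost values with one suspicious spike:
monthly = [10, 12, 11, 13, 12, 60, 11, 12, 13, 11]
print(flag_points(monthly))  # → [5]  (the value 60 falls above the UCL)
```

Note that a large spike also inflates the average moving range and thus widens the limits; this is one reason several chart schemes were compared in the study.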

Grubbs' Test
Grubbs' test is a statistical test developed by Frank E. Grubbs to detect anomalies in a univariate data set [Grubbs 1969]. Grubbs' test is also known as the maximum normed residual test [Stefansky 1972]. Grubbs' test is defined for the statistical hypotheses:
H0: The data set does not contain any anomalies.
Ha: There is at least one anomaly in the data set.
The test statistic is the largest absolute deviation from the data set mean in units of the data set standard deviation:

G = max |Y_i − Ȳ| / s

where Ȳ is the sample mean of the data set and s is the standard deviation of the data set. The hypothesis H0 is rejected at the significance level α if

G > ((n − 1) / √n) · √( t² / (n − 2 + t²) )

where t denotes the upper critical value of the t distribution with (n − 2) degrees of freedom and a significance level of α/(2n). Grubbs' test detects one anomaly at a time, as illustrated in Figure 6. Multiple iterations are executed until no further anomalies are discovered. The approach described in Section 2.2.1 was used to implement Grubbs' test for the EVM variables BCWS, BCWP, and ACWP. Grubbs' test is based on the assumption that the data are approximated by a normal distribution. In our research, when suspected anomalies were removed from the transformed values of BCWS, BCWP, and ACWP, the resultant data reasonably approximated a normal distribution. However, the NCC and CBB data sets are very different from BCWS, BCWP, and ACWP, making Grubbs' test ineffective for detecting anomalies in those variables.
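The iterative procedure just described can be sketched in a few lines using the t distribution from SciPy (function names and data are illustrative):

```python
import math
from scipy import stats

def grubbs_critical(n, alpha=0.05):
    """Two-sided Grubbs critical value for a sample of size n."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return ((n - 1) / math.sqrt(n)) * math.sqrt(t ** 2 / (n - 2 + t ** 2))

def grubbs_outliers(data, alpha=0.05):
    """Apply Grubbs' test iteratively, removing one anomaly per pass."""
    values = list(data)
    anomalies = []
    while len(values) > 2:
        mean = sum(values) / len(values)
        s = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
        if s == 0:
            break
        candidate = max(values, key=lambda v: abs(v - mean))
        if abs(candidate - mean) / s > grubbs_critical(len(values), alpha):
            anomalies.append(candidate)
            values.remove(candidate)
        else:
            break  # stop when no further anomaly is significant
    return anomalies

monthly = [10, 12, 11, 13, 12, 95, 11, 12, 13, 11]
print(grubbs_outliers(monthly))  # → [95]
```

The first pass flags 95 (G ≈ 2.84 against a critical value of about 2.29 for n = 10 at α = 0.05); the second pass finds nothing significant and stops.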

Rosner Test
Rosner developed a parametric test designed to detect 2 to 10 anomalies in a sample composed of 25 or more cases [Rosner 1975, Rosner 1983]. The test assumes that the data are normally distributed after the suspected anomalies are removed. As described above for the other tests, the Rosner test is performed on the monthly non-cumulative values. The test requires that the suspected anomalies be identified by inspecting the data beforehand. Once the maximum number of possible anomalies is identified, then they are ordered from most extreme to least extreme.
Using the ordered data, the following steps are performed for the Rosner test:
1. The sample mean X̄(1) and standard deviation S(1) are calculated from all n values.
2. The sample value with the largest deviation from the mean is used to calculate the test statistic R(1) as follows:

R(1) = |X(1) − X̄(1)| / S(1)

where X(1) is the value with the largest deviation from the mean; it can be either the largest or the smallest value in the sample.
3. The sample value X(1) is then removed from the sample, and the mean X̄(2), standard deviation S(2), and statistic R(2) are calculated from the remaining n − 1 values. This continues for as many iterations as there are suspected anomalies, and each R(i) is compared against a tabled critical value.
The Rosner test is best illustrated with an example. For our example, 37 data entries were ordered by magnitude and used for the Rosner calculations.
Looking at the data, which represents a time series of month-to-month differences, the team hypothesized that there could be four anomalous entries. These are displayed in Table 2 as Ys.
Choosing to test for four anomalies, the first iteration calculated the mean of the entire sample and the largest deviation from the mean to calculate the R value as described in the steps above. As the iterations progressed, the sample mean and the standard deviation were reduced as the entries with the largest deviations were dropped from each successive calculation. When four iterations were performed, the test of the R statistic failed for the fourth entry, but was positive for the third. This means that the Rosner test confirmed that there are three anomalies in this data set. The calculated R(i) is bolded where it exceeds the tabled critical value. For completeness, the three data records (Y) identified as anomalies in this example are shown.
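The iterative calculation above can be sketched using the generalized ESD formulation of the Rosner test (function names and data are illustrative; critical values are computed from the t distribution rather than read from a printed table):

```python
import math
from scipy import stats

def rosner_test(data, k, alpha=0.05):
    """Generalized ESD (Rosner) test for up to k anomalies in a sample.

    Returns the values confirmed as anomalies: the largest i for which
    R(i) exceeds its critical value lambda(i) determines the count.
    """
    values = list(data)
    n = len(values)
    candidates = []
    last_significant = 0
    for i in range(1, k + 1):
        m = len(values)
        mean = sum(values) / m
        s = math.sqrt(sum((v - mean) ** 2 for v in values) / (m - 1))
        candidate = max(values, key=lambda v: abs(v - mean))
        r_i = abs(candidate - mean) / s
        # Critical value for iteration i (shrinking sample size)
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = ((n - i) * t) / math.sqrt((n - i - 1 + t ** 2) * (n - i + 1))
        if r_i > lam:
            last_significant = i
        candidates.append(candidate)
        values.remove(candidate)
    return candidates[:last_significant]

# Illustrative sample: 27 routine monthly differences plus three extremes.
sample = [5] * 9 + [4] * 9 + [6] * 9 + [40, 45, 50]
print(rosner_test(sample, k=4))  # → [50, 45, 40]
```

As in the worked example above, testing for four anomalies confirms only three: the fourth iteration's R statistic falls below its critical value.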

Dixon Test
The Dixon test (sometimes referred to as Dixon's extreme test or Dixon's Q test) was designed for identifying anomalies when the sample size is less than or equal to 30 [Dixon 1951]. Recent research has extended its applicability to samples up to 100 and improved the precision and accuracy of the critical values for judging the test results [Verma 2006]. The test measures the ratio of difference between an anomaly and its nearest neighbor to the range of the sample (see Table 2).
The tests do not rely on the use of the mean or standard deviation of the sample.
Dixon initially posited a series of six calculations to test for anomalies. Which test to use depends on the sample size and whether the test is for a single anomaly or pairs of anomalies. The tests for pairs of anomalies were designed to account for masking effects when there is more than one extreme anomaly present in the data [Barnett 1998]. The data are tested for normality and, if necessary, transformed to fit a normal distribution. For our data, we used either a Box-Cox transformation or a Johnson transformation. In the earned value data, the negotiated contract cost (NCC) and the contract budget base (CBB) variables could not be normalized, so the Dixon test was not appropriate for them and was not used.
The data are then ordered, and the largest or smallest extreme values are tested by calculating the appropriate Dixon ratio statistic.
The value of the calculated ratio is then compared to a table of critical values. If the calculated ratio is greater than the corresponding critical value, the value can be characterized as an outlier. In this research, the critical values at the 95% confidence level were used. 6 The Dixon test is meant to identify one extreme outlier, although the r21 and r22 statistics have been shown to be robust in the presence of more than one anomaly [Ermer 2005]. For our purposes, we were interested in the performance of the Dixon test compared to other anomaly detection techniques using the monthly differences for the earned value variables BCWS, BCWP, and ACWP.
To judge the efficacy of the Dixon test in identifying anomalies, a series of rolling brackets was imposed on the data for each of the three earned value variables. That is, when testing r10 for a large extreme datum, the statistic was calculated using three consecutive data records at a time.
For r21, we used 8 consecutive cases, and for r22, we used 14 consecutive cases. Both the largest and smallest values were tested. The anomalous data records identified using this technique are shown in Appendix C.
6 ISO 5725 suggests that if the test result is significant at the 95% level but not at the 99% level, the datum should be characterized as a straggler and requires further examination [Huah 2005].
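A minimal sketch of the r10 ratio applied to one three-record rolling bracket follows (function names and data are illustrative; 0.941 is the commonly tabled r10 critical value for n = 3 at the 95% level and should be verified against the tables used in practice):

```python
def dixon_r10(sample):
    """Dixon r10 ratios for the smallest and largest values of a sample."""
    x = sorted(sample)
    span = x[-1] - x[0]
    if span == 0:
        return 0.0, 0.0
    r_low = (x[1] - x[0]) / span       # tests the smallest value
    r_high = (x[-1] - x[-2]) / span    # tests the largest value
    return r_low, r_high

def dixon_flags(sample, critical):
    """Flag extremes whose r10 ratio exceeds a tabled critical value."""
    x = sorted(sample)
    r_low, r_high = dixon_r10(sample)
    flags = []
    if r_low > critical:
        flags.append(x[0])
    if r_high > critical:
        flags.append(x[-1])
    return flags

# Three consecutive monthly values, as in the rolling-bracket approach:
print(dixon_flags([12.0, 13.0, 40.0], critical=0.941))  # → [40.0]
```

Note that, unlike Grubbs' or Rosner's tests, no mean or standard deviation is computed; the ratio uses only the ordered values themselves.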

Tukey Box Plot
Originally developed by John Tukey for exploratory data analysis, box plots have become widely used in many fields. As described earlier, the test is performed on the monthly non-cumulative values. Figure 7 contains an image from the JMP statistical package's help file on outlier box plots. The box runs from the 1st to the 3rd quartile (the 25th and 75th percentiles) of the data distribution; the distance between the two ends of the box is called the interquartile range. The whiskers stretch to the outermost data points above and below the box that still lie within 1.5 × (interquartile range) of the quartiles. Any dots above or below the whiskers are classified as anomalies.
The bracket to the left of the box indicates the range in which the densest 50% of the data lie. The confidence diamond represents the confidence interval in which the sample mean most likely lies; the mean may not coincide with the median, as the example shows.
Box plots as originally envisaged by Tukey make no assumption of statistical normality. They are simply based on the distribution of the data by percentiles. For normally distributed data, the region between the ends of the whiskers contains about 99.3% of the observations, which makes box plots roughly equivalent to the 3σ technique for Gaussian data, although slightly more generous than Shewhart's limits of approximately 99.7% used in statistical process control [Shewhart 1931]. As a result, a few more data points may be classified as anomalies using box plot techniques than with techniques using more stringent criteria.
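The fence calculation behind the box plot classification can be sketched as follows (data are illustrative; quartiles here use simple linear interpolation, and statistical packages may use slightly different quartile conventions):

```python
def quartiles(data):
    """First and third quartiles by linear interpolation on the sorted data."""
    x = sorted(data)
    def percentile(p):
        idx = p * (len(x) - 1)
        lo = int(idx)
        frac = idx - lo
        return x[lo] if frac == 0 else x[lo] + frac * (x[lo + 1] - x[lo])
    return percentile(0.25), percentile(0.75)

def tukey_anomalies(data, k=1.5):
    """Values beyond the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < lo or v > hi]

monthly = [10, 12, 11, 13, 12, 60, 11, 12, 13, 11]
print(tukey_anomalies(monthly))  # → [60]
```

Because the fences depend on quartiles rather than the mean and standard deviation, a single extreme value does not widen the fences the way it inflates control chart limits.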

Autoregressive Integrated Moving Average (ARIMA) Models
ARIMA models are widely used for both short- and long-term extrapolation of economic trends [Box 1970]. A particular strength of ARIMA is that it encompasses many related statistical time series methods in one general framework. While ARIMA models were originally intended (and continue to be used most widely) for modeling time series behavior and forecasting, they have also been applied to anomaly detection. Many time series patterns can be modeled by ARIMA, but all such patterns amenable to ARIMA have an autocorrelation or partial autocorrelation element to the model (that is, the value of any particular data record is related to earlier values). Differencing the data (calculating the differences between data values) is the step that simplifies the correlation pattern in the data. 8 Often a cyclic or seasonal pattern must also be accounted for in the model.

Once the proper order of differencing has been identified, the observations are integrated to characterize the overall trend in the original time series data (which accounts for the "I" in ARIMA). Autoregressive (AR) and/or moving average (MA) terms may be necessary to correct for any over- or under-differencing. An AR term or terms may be necessary if a pattern of positive autocorrelation still exists after the integration. An MA term or terms may be necessary if any negative autocorrelation has been introduced by the integration; this is likely to happen if there are step jumps where the original series mean increases or decreases at some thresholds over time.

The goal of ARIMA is to account for all factors that determine the values in the time series so that any residual variation is attributable to "noise." The best fit accurately models the values in the series while minimizing noise. Statistical software handles all the needed calculations and produces an array of visual outputs to guide the selection of an appropriate model.
Fortunately, the EVM time series that we analyzed tend to have much simpler best model fits than are sometimes required for more complex time series with seasonal cycles. ARIMA models can be quite varied in their construction; for our data, a nonseasonal ARIMA is appropriate. Such a model is classed as an ARIMA(p,d,q) model, where p is the number of autoregressive terms, d is the order of differencing, and q is the number of moving average terms. Since ARIMA models often are nonlinear, the best fits are displayed by line and curve segments. An example is shown in Figure 8, which displays one of the 20 time series that we used to compare the anomaly detection methods described in this report. The actual data values are represented as dots, some of which are identified as anomalies using the Tukey box plots described in Section 2.2.5. The most extreme anomalies appear clearly outside of the confidence intervals displayed around the best fit in the figure. Using many of the existing statistical packages, any data point can be identified simply by mousing over it.

8 The non-cumulative series are first-differenced series in mathematical terms. The transformation is done by subtracting the numerical value of the immediately preceding data point from the numerical value of each succeeding data point. The difference will be positive if the prior value was smaller, negative if the succeeding value is smaller, and zero if they are the same. Statistical software packages do the same transformation automatically for as many lags of integration as are necessary to find the best model fit (e.g., second differences, which are simply the differences between consecutive first-differenced values).

Figure 8: An Example ARIMA Best Fit of an EVM Distribution
Simple first-difference (e.g., X_t − X_{t−1}) fits are sufficient in the instances that we analyzed in doing our comparisons of anomaly detection methods. In addition, an ARIMA model can almost always be fit for variables that other anomaly detection methods do not handle well (e.g., for EVM management reserve).
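As a concrete illustration of the first-difference transformation described above, the following sketch (not taken from the report; the sample values are hypothetical) computes a non-cumulative series in Python:

```python
# A concrete illustration (not from the report) of the first-difference
# transformation: each value minus its immediate predecessor.
def first_difference(series):
    """Return x[t] - x[t-1] for t = 1..n-1."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

bcwp_cum = [100, 220, 360, 380, 600]   # hypothetical cumulative BCWP ($K)
print(first_difference(bcwp_cum))      # → [120, 140, 20, 220]
```

Statistical packages perform the same transformation automatically when fitting an ARIMA model with d = 1.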
The analysis of the model's residuals also plays a key role in determining the suitability of any particular model. We used Tukey box plots as part of the residual analysis to avoid making assumptions about normality of the data distributions as well as for their intuitive interpretation of what constitutes an anomaly (see Section 2.2.5).

3-Sigma Outlier
Many of the techniques discussed thus far operate on the data of a single program task and use all of the data within the program task data set as part of the anomaly detection technique. The 3-sigma outlier test is an automated algorithm that we developed as a way to evaluate the entire EVM data set, including all program tasks within the data set. The algorithm was implemented in a Microsoft Excel application. Rather than use the entire task data, the algorithm evaluated accumulated data beginning at month three (i.e., with three data values) and then carried out iterations for months four to n (where n is the total number of values in the program task). When a new program task ID was encountered, the calculations and counters were reset to their initial values. A summary of the algorithm is depicted in Figure 9.
This technique simulates the real-world situation of monitoring data as it is being recorded in a database, rather than the retrospective inspection of data once the entire data set is available.
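Figure 9 is not reproduced here, so the following Python sketch is our reading of the algorithm described above, with the exact flagging rule an assumption: values accumulate month by month, and each new value is flagged if it falls more than three standard deviations from the mean of the values seen so far. The sample data are hypothetical.

```python
import statistics

# Sketch of a sequential 3-sigma screen: test each incoming value against
# the mean and standard deviation of the values accumulated so far.
# (The exact rule in the report's Figure 9 is an assumption here.)
def three_sigma_outliers(values):
    anomalies = []
    for i in range(3, len(values)):          # begin testing at month four
        history = values[:i]                 # at least three accumulated values
        mean = statistics.mean(history)
        sd = statistics.stdev(history)
        if sd > 0 and abs(values[i] - mean) > 3 * sd:
            anomalies.append(i)
    return anomalies

data = [10, 20, 30, 40, 50, 500, 70, 80]     # hypothetical, bad entry at index 5
print(three_sigma_outliers(data))            # → [5]
```

Note that once an anomalous value enters the accumulated history, it inflates the standard deviation and desensitizes subsequent tests, which is one reason sequential screens generate results different from whole-data-set techniques.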

Moving Range Technique
We developed the moving range technique following the control chart analyses listed in Table 1. Based on the efficacy of the mR control chart for detecting anomalies in NCC and CBB data, we adapted that particular control chart scheme into a stand-alone technique.
As in the 3-sigma outlier test, we used a Microsoft Excel application that evaluates accumulated data beginning at month three (i.e., with three data values) and then carries out iterations for months four to n (where n is the total number of values in the program task). When a new program task ID was encountered, the calculations and counters were reset to their initial values. The general flow of the algorithm is the same as that shown in Figure 9 except for the anomaly detection test, which is depicted in the third box from the top of the diagram. The anomaly detection test for the moving range technique is as follows:

mR_i = |x_i − x_{i−1}|,   mR-bar = (1 / (k−1)) Σ mR_i

where
x_i is the value of NCC or CBB for record i
k is the number of data values in the program task; for k individual values, there are k−1 moving ranges
d2 is the sample-size-specific anti-biasing constant for n=2 observations used in calculating the limit from mR-bar [Montgomery 2005]

A value is flagged as an anomaly if mR_i > 3(mR-bar/d2).
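A minimal sketch of the moving range test, assuming the flagging rule mR_i > 3 * (mR-bar / d2) with d2 = 1.128 for n = 2 [Montgomery 2005]; the NCC values are hypothetical:

```python
# Moving range screen: flag any record whose jump from the previous record
# exceeds three estimated sigmas (sigma estimated as mRbar / d2).
D2 = 1.128   # anti-biasing constant for n = 2 [Montgomery 2005]

def moving_range_anomalies(values):
    ranges = [abs(b - a) for a, b in zip(values, values[1:])]   # mR_i
    mr_bar = sum(ranges) / len(ranges)
    limit = 3 * mr_bar / D2
    # ranges[j] is the jump into record j+1, so flag index j+1
    return [j + 1 for j, mr in enumerate(ranges) if mr > limit]

ncc = [1000, 1000, 1000, 1000, 5000, 1000, 1000, 1000]   # hypothetical step series
print(moving_range_anomalies(ncc))                        # → [4, 5]
```

A single spike produces two large moving ranges (the jump up and the jump back down), so both surrounding records are flagged for investigation.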

SPI/CPI Outlier
In the earned value management system, the schedule performance index (SPI) and cost performance index (CPI) are defined as

SPI = BCWP / BCWS
CPI = BCWP / ACWP

Our research team explored the use of these variables as a way to normalize the entire data set (i.e., the multiple program data available in the data set) so that anomaly detection analysis was not constrained to a program task by program task evaluation. This approach was explored because there was a possibility that anomalous SPI and CPI values could be detected across the entire data set (that is, across multiple program tasks).
The SPI/CPI outlier technique was implemented as follows for SPI (the steps for CPI are analogous):
1. Calculate SPI_i for i = 2 to n, where n is the total number of records in the EVM data set.
2. Calculate the average value of the SPI_i values.
3. Calculate the standard deviation (sd) of the SPI_i values.
4. If SPI_i is greater than the average plus 3(sd), or less than the average minus 3(sd), flag the value as an anomaly and investigate the corresponding EVM measures.
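The steps above can be sketched as follows; this is a minimal illustration with hypothetical data, and the 3-sd flagging band is an assumption:

```python
import statistics

# Sketch of the SPI outlier screen; the same logic applies to CPI with
# ACWP in the denominator. The 3*sd band is assumed, data are hypothetical.
def spi_outliers(bcwp, bcws):
    spi = [p / s for p, s in zip(bcwp, bcws)]          # step 1
    mean = statistics.mean(spi)                        # step 2
    sd = statistics.stdev(spi)                         # step 3
    return [i for i, v in enumerate(spi)               # step 4
            if abs(v - mean) > 3 * sd]

bcws = [100.0] * 20
bcwp = [100.0] * 20
bcwp[7] = 1000.0      # a suspect entry producing an extreme SPI
print(spi_outliers(bcwp, bcws))    # → [7]
```

Because the standard deviation is computed over all SPI values, including any anomalies, a 3-sd band can be hard to exceed in small data sets; this is one motivation for normalizing across the entire data set rather than a single program task.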

Comparison of Techniques
In this research study, we evaluated the following anomaly detection techniques:

* These techniques were found to be completely ineffective for detecting anomalies in the EVM data. Therefore, they are not discussed further in this report.
We found that some techniques were effective for discovering anomalies in the variables BCWS, BCWP, and ACWP, but proved ineffective for detecting anomalies in the NCC and CBB variables. This is because the variables behave in fundamentally different manners. BCWS, BCWP, and ACWP are cumulative variables whose typical time series profile is curvilinear. NCC and CBB are reported values tied to the contract and do not typically change on a month-to-month basis. When these variables do change over time, the resultant time series appears as a step function. We partitioned the analysis results into two sections to reflect the different character of these two groups of variables and the techniques that were used to detect anomalies within the variables. Figure 10 provides a graphical summary of the performance of the techniques that were found to be effective for BCWS, BCWP, and ACWP when the results of all test cases were combined. Table 4 shows the same results in tabular format, and a further breakdown of the results is presented in Appendix C.

Performance of Techniques Applied to BCWS, BCWP, and ACWP
With respect to detection rate, it may appear that Grubbs' test outperformed all other tests (with the highest detection rate of 85.4%). However, the differences in detection rates among the five top performers (i.e., Grubbs' test, Rosner test, box plot, ARIMA, and control chart for individuals) are not statistically significant. These techniques as a group did perform better than the four remaining techniques, and this outcome is statistically significant. 9 A probable explanation for this difference is that the top performers benefited from the use of the entire set of data in each test case data set to construct the statistical parameters of the anomaly detection technique. However, the four techniques represented on the right of Figure 10 were implemented in such a way as to simulate the monthly accumulation of EVM data over time. These techniques evaluated the existence of anomalies sequentially, without using all information in the data set to evaluate whether the new incoming data was anomalous. For example, at month six, the variable value was tested using only the available six data values. Then, the next record was read and value seven was evaluated using n=7. But, for the five techniques on the left of the graph, the entire data set (e.g., 73 values in some cases) was used to evaluate the month six value to determine whether it was anomalous. Having the benefit of all the information in the data set likely led to the detection rate effectiveness of the five top performing techniques.
Having established that these five techniques perform similarly with respect to anomaly detection rate (i.e., those that appear in the top row, starting in the left-most column of Table 4), the false alarm rates were compared among them. These appear in the bottom row of Table 4. The differences in false alarm rate among the five top performers were statistically insignificant, based on the outcome of the Chi-square test for equality of proportions. 10
Therefore, based on our two measures of effectiveness (that is, detection rate and false alarm rate), our analysis suggests that Grubbs' test, Rosner test, box plot, ARIMA, and the control chart for individuals (I-CC) all performed at the same level.
Sections 3.2.2 through 3.3.4 describe some of the qualitative factors associated with each of the five techniques that performed well with respect to detection rate. These qualitative factors are summarized in Table 5.

Efficiency
The extent to which time is well used for the intended task.
The number of times human intervention is required (for the purpose of decision-making) before the technique can execute to completion.
The amount of human intervention time required by a technique to complete the evaluation of the data set.

Flexibility
Susceptible to modification or adaptation. The ability of a technique to respond to potential changes affecting its value delivery in a timely and cost-effective manner.
The validity of results when data are from a non-Gaussian distribution.
Effectiveness of technique for small and large sample sizes.
Ease with which the sensitivity of the anomaly detection technique can be adjusted.

Simplicity
Freedom from complexity, intricacy, or division into parts.
Amount of burden put on someone to understand the technique or to try to explain it to a measurement and analysis novice.

Extensibility
The ability of a technique to be operationalized in a production environment with minimal effort or disruption of the existing system.
The level of effort required to extend technique to implementation in production environment.
10 See Appendix C for the details of the significance test results.
In the following sections, each of the high-performing anomaly detection techniques is discussed with respect to the qualitative criteria listed in Table 5.

Control Chart for Individuals
The control chart for individuals was a top performer as determined by the two measures of effectiveness used in this study: it had a detection rate of 75.7% and a false alarm rate of 1.5%. This control chart is a popular tool for understanding variation in a process and system and is particularly well-suited for identifying anomalies in a data set. Anomalies are identified by their appearance above the upper control limit or below the lower control limit of the control chart. The centerline and control limits of the chart are calculated using the available data. Therefore, the control chart operates best when there is sufficient data to generate an accurate portrayal of the average and standard deviation of the data set. For small data sets (n<10), the control limits (x̄ ± 3σ̂) may generate additional false alarms due to an inflated standard deviation caused by the contribution of even a single deviant data value. As n increases, the calculations of the control limits become more reliable in terms of representing the true standard deviation of the data set. In practice, control limits based on n<10 are typically referred to as trial limits until additional data become available.
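A minimal sketch of the individuals chart calculation, assuming the common practice of estimating sigma from the average moving range (σ̂ = mR-bar / d2, with d2 = 1.128 for n = 2 [Montgomery 2005]); the data are hypothetical:

```python
# Individuals (I) chart: centerline at the mean, limits at mean +/- 3 sigma,
# with sigma estimated from the average moving range (sigma_hat = mRbar / d2).
D2 = 1.128   # anti-biasing constant for n = 2 [Montgomery 2005]

def individuals_chart(values):
    mean = sum(values) / len(values)
    mr_bar = sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)
    sigma_hat = mr_bar / D2
    lcl, ucl = mean - 3 * sigma_hat, mean + 3 * sigma_hat
    flagged = [i for i, v in enumerate(values) if v < lcl or v > ucl]
    return lcl, ucl, flagged

lcl, ucl, flagged = individuals_chart([10, 12, 11, 13, 12, 11, 40, 12, 11, 13])
print(round(lcl, 1), round(ucl, 1), flagged)
```

With only ten values, the limits above would be treated as trial limits, as noted in the text.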
With respect to sensitivity, the upper and lower control limits can be adjusted for increased or decreased sensitivity. While the typical application is based on 3σ limits, these can be adjusted up or down to change the sensitivity of the detection scheme. For example, under certain conditions, this adjustment is implemented when control charts are used to monitor industrial processes. Here, the adjusted limits are sometimes referred to as warning limits [Montgomery 2005].
Incorporating the control chart for individuals scheme into a data collection stream of activities would not be difficult. The implementation of the tool in a production environment would be relatively straightforward and practical to accomplish.

Grubbs' Test
Grubbs' test was also a top performer with a high detection rate of 86.5% and a relatively low false alarm rate of 2.2%. With regard to efficiency, Grubbs' test is not difficult to apply when using a statistical package such as Minitab for the analysis. A macro has been developed for Minitab that implements Grubbs' test for a specific alpha [Griffith 2007]. When the test is performed manually, the calculations are compared to a look-up table of critical values. While empirical results using Grubbs' test were impressive, the test assumes an approximate normal distribution. In cases where there is a large departure from normality, false alarms may be generated due to non-normality, rather than the presence of anomalies.
The sensitivity of Grubbs' test can be adjusted by changing the value of alpha. The alpha used in this research study was set to α = 0.05.
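For illustration, a minimal sketch of a single pass of Grubbs' test. The critical value 2.709 below is the approximate tabulated two-sided value for n = 20 at α = 0.05 and is an assumption here; in practice it comes from a look-up table or a statistical package, and the data are hypothetical.

```python
import statistics

# Sketch of one pass of Grubbs' test: G = max|x - mean| / sd, compared to
# a tabulated critical value. G_CRIT is the approximate two-sided value for
# n = 20, alpha = 0.05 (an assumption; use a table or package in practice).
G_CRIT = 2.709

def grubbs_statistic(values):
    mean, sd = statistics.mean(values), statistics.stdev(values)
    g, idx = max((abs(v - mean) / sd, i) for i, v in enumerate(values))
    return g, idx

values = [1.0] * 19 + [10.0]     # hypothetical data with one suspect point
g, idx = grubbs_statistic(values)
print(g > G_CRIT, idx)           # flag index idx if G exceeds the critical value
```

Since Grubbs' test targets a single outlier per pass, repeated application (removing the flagged point each time) is one way to screen for multiple anomalies, at the cost of the masking effects the Rosner test is designed to avoid.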

Rosner Test
The Rosner test detected 83.8% of the anomalies with a false positive rate of 2.4%, making it the second best performer among the techniques presented here. Unfortunately, the Rosner test suffers several unique drawbacks that make its implementation problematic. First, the test is not generally available in statistical software packages. Although the algorithm involved is not complex, the iterative nature of the technique complicates the programming requirements such that it may be beyond the skills of a normal user. Any organization seeking to implement the Rosner test would need to devote resources to develop such software. A second major drawback is the maximum limitation of 10 anomalies and a minimum of 25 data records. Analysis of a long-term data series might exceed the limit of 10 anomalies, particularly when investigating programs with life cycles that span decades. The minimum of 25 data records also means that a program task would have to produce more than two years of data before the Rosner test could be used. Third, the Rosner test requires the analyst to identify the suspected anomalies before initiating the test. Although this might be done visually, it means additional time and effort on the part of an analyst in order to implement the test.
Like the Grubbs' and Dixon tests, the Rosner test assumes an approximate normal distribution of the non-anomalous data (those data records remaining after the anomalies are removed). This makes it susceptible to false positives when there is a departure from normality. The Rosner test also produces a test statistic, which is compared to a table of critical values [Rosner 1983]. The sensitivity of the Rosner test is also adjustable; the alpha used in this research study was set to α = 0.05, and critical values are available for α = 0.01 and α = 0.005.
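The iterative structure of the Rosner (generalized ESD) test can be sketched as follows. The critical values are supplied by the caller, and the ones in the example are illustrative placeholders, not tabulated values; in practice they depend on n and alpha and come from tables [Rosner 1983]. The data are hypothetical.

```python
import statistics

# Sketch of the Rosner (generalized ESD) iteration for up to r suspected
# anomalies: repeatedly remove the most extreme point and compare its
# studentized deviation against the step's critical value (lambda).
def rosner_outliers(values, lambdas):
    data = list(enumerate(values))
    flagged, confirmed = [], 0
    for step, lam in enumerate(lambdas, start=1):
        mean = statistics.mean(v for _, v in data)
        sd = statistics.stdev(v for _, v in data)
        r, (idx, val) = max((abs(v - mean) / sd, (i, v)) for i, v in data)
        flagged.append(idx)
        data.remove((idx, val))
        if r > lam:
            confirmed = step          # all removals up to this step are outliers
    return flagged[:confirmed]

values = [1.0] * 18 + [8.0, 12.0]     # hypothetical data with two suspect points
print(sorted(rosner_outliers(values, [2.71, 2.68])))   # lambdas are illustrative
```

Unlike repeated Grubbs' testing, the final decision uses the largest step whose statistic exceeds its critical value, which protects against one outlier masking another.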

Tukey Box Plot
The Tukey box plot technique for non-cumulative distributions was also a top performer, with a high detection rate of 83.8% and a relatively low false alarm rate of 2.6%. Box plots can be generated easily and efficiently using many readily available statistical packages. Transformation of the time series into non-cumulative format is easily done in a spreadsheet and can be done with a single mouse click in many statistical packages. Box plots make no assumptions about normality or other statistical properties, and the results are easy to interpret and describe intuitively. The cutoff points for determining what constitutes an anomaly can be easily adjusted based on historical experience and the judgment of domain experts in validating the statistical results. The anomalies can be identified for validation by domain experts with a simple copy and paste from the data tables in any good statistical package. The necessary procedures could be easily automated for use in a production environment.
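A minimal sketch of the Tukey fences applied to a non-cumulative series; the 1.5 multiplier is the conventional cutoff and corresponds to the adjustable sensitivity mentioned above. The monthly values are hypothetical.

```python
import statistics

# Tukey box plot fences: flag points beyond Q1 - k*IQR or Q3 + k*IQR.
# k = 1.5 is the conventional cutoff; raise it to reduce sensitivity.
def tukey_anomalies(values, k=1.5):
    q1, _, q3 = statistics.quantiles(values, n=4)   # exclusive quartiles
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]

monthly = [10, 11, 12, 11, 10, 12, 11, 50, 12, 11]   # hypothetical increments
print(tukey_anomalies(monthly))    # → [7]
```

No normality assumption is involved, which is the robustness property noted in the text.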

ARIMA
The ARIMA technique was also a top performer, with a high detection rate of 78.4% and a relatively low false alarm rate of 3.6%. For someone experienced with statistical packages, ARIMA techniques are relatively straightforward to use for anomaly detection in relatively simple univariate time series, such as the EVM data that we analyzed. There is no need to transform the time series data into non-cumulative series, which saves time and may be helpful for EVM analysts who are accustomed to visualizations of cumulative time series. Semi-automated software tools and relatively painless guidance for finding the best ARIMA model fit can be made available to EVM domain experts in a production environment. Anomalies can be easily determined by importing the residuals from an ARIMA model into existing box plot software.
A particular strength of ARIMA is that it subsumes many related statistical time series techniques into one general framework, and it may prove to be more widely applicable for EVM and other time series data that are more complex than those we used to compare statistical anomaly detection techniques thus far. A potential drawback is over-fitting to the data, potentially causing the number of false negatives to increase.
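As an illustration of the residual-screening idea, the following sketch uses the simplest possible case: for a hypothetical ARIMA(0,1,0) fit (a random walk), the residuals are just the first differences, which are then screened with Tukey fences. In real use the residuals would be exported from the model fitted in a statistical package, and the series below is hypothetical.

```python
import statistics

# Sketch: for an assumed ARIMA(0,1,0) model, residuals equal the first
# differences; screen them with Tukey fences as described in the text.
def arima010_residual_anomalies(series, k=1.5):
    resid = [b - a for a, b in zip(series, series[1:])]
    q1, _, q3 = statistics.quantiles(resid, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    # residual j corresponds to series index j+1
    return [j + 1 for j, r in enumerate(resid) if r < lo or r > hi]

series = [0, 10, 20, 30, 40, 150, 160, 170, 180, 190]   # hypothetical cumulative BCWP
print(arima010_residual_anomalies(series))               # → [5]
```

For series needing AR or MA terms, the same fence logic applies to the package-produced residuals; only the residual computation changes.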

Performance of Techniques Applied to NCC and CBB
Selection of an anomaly detection scheme is dependent on the characteristics of the data. The time series behavior of the EVM variables NCC and CBB is fundamentally different than the behavior of the variables BCWS, BCWP, and ACWP. NCC and CBB are non-cumulative variables whose time series profiles typically (but not always) appear as step functions (see Appendix B). Techniques that performed well for detecting anomalies in BCWS, BCWP, and ACWP did not necessarily work well for NCC and CBB.
The following four techniques effectively identified anomalies in NCC and CBB: the moving range control chart, the moving range technique, ARIMA, and the Tukey box plot. Figure 11 summarizes the ability of these techniques to discover anomalies in the NCC and CBB variables of the four test cases. Table 6 presents the results in tabular format.
All four proved to be 100% effective in discovering data anomalies in the test cases. With respect to false alarm rates, some techniques performed better than others; however, the differences were statistically insignificant (see Appendix C). In Sections 3.3.1 through 3.3.4 we discuss some of the qualitative factors associated with each of the four techniques that performed well with respect to detection rate. These qualitative factors are summarized in Table 5 on page 25.

Moving Range Control Chart
The mR control chart performed well for detecting anomalies in NCC and CBB variables, with a detection rate of 100% and a false positive rate of 3%. When used in the industrial domain, the mR control chart is paired with the control chart for individuals to monitor the variation of a process [Montgomery 2005]. However, for our purposes, the mR control chart was used solely for detecting anomalies for these variables.
This technique can be easily automated and does not require human judgments or interaction to execute the sequence of steps required for anomaly identification. The approach is straightforward. As with all control charts, anomalies are indicated by the appearance of a data point above the upper control limit or below the lower control limit.

Moving Range Technique
The moving range technique was essentially a direct implementation of the moving range chart within a Microsoft Excel spreadsheet application. The difference between the two was that the moving range chart relied on the entire data set for analysis of anomalies, while the moving range technique considered only the subset of data available when the EVM data was reported.
Using only a subset of the data for anomaly evaluation led to additional false alarms as compared to the mR control chart.
Implementing this technique confirmed that it would not be difficult to automate the mR control chart within a production environment.

ARIMA
The ARIMA technique performed well for the NCC and CBB variables, with a high detection rate of 100% and a relatively low false alarm rate of 7.6%. ARIMA is equally applicable to cumulative and non-cumulative series, including series with step jumps such as NCC and CBB. The same semi-automated software tools and relatively painless guidance for finding the best ARIMA model fit could be available to EVM domain experts in a production environment, and the anomalies could be easily determined by importing the residuals from an ARIMA model into existing box plot software.

Tukey Box Plot
The Tukey box plot technique did well, with a detection rate of 100%. Its false alarm rate of 12.9% is higher for the NCC and CBB series, although not statistically significantly so. As noted for the comparisons of the BCWS, BCWP, and ACWP time series, box plots are equally easy to use and interpret for any time series, and the cut-off points for determining what constitutes an anomaly can be easily adjusted based on experience. The necessary procedures could be easily automated for use in a production environment.

Summary of Results
In this research study, we investigated the efficacy of anomaly detection techniques on earned value management data submitted on a monthly basis by government contractors to the EVM-CR. Five variables from the data set were analyzed for anomalies. Based on their time series behavior (see Appendix B), the variables fell into two categories as shown in Table 7.

Summary of Results -BCWS, BCWP, ACWP
Of the various techniques we analyzed in this study, we found that five techniques were equally effective for identifying anomalies in BCWS, BCWP, and ACWP. These techniques were: Grubbs' test, the Rosner test, the Tukey box plot, ARIMA, and the control chart for individuals. The Grubbs' and Rosner tests assume that the data are from an approximate normal distribution.
In cases of non-normal data, there is a chance that anomalies will escape detection. However, Tukey box plot, ARIMA, and control chart for individuals are more robust in that they are not as sensitive to departures from normality.
In production environments, some techniques will require more human judgments than others. We believe that Grubbs' test, the Rosner test, the Tukey box plot, and the control chart for individuals could all be implemented in an automated environment without significant effort or disruption. However, ARIMA would require significant software programming to address the logic required to implement the technique in a fully automated way.
Therefore, when choosing among the top performers in this group, the conditions and trade-offs must be considered. Given the simplicity and robustness in situations of non-normality, the Tukey box plot appears to be a stand-out performer when sample sizes are greater than 10, while either Grubbs' or Rosner tests should be used when the sample size is small.

Summary of Results -NCC, CBB
Four techniques were found to be effective for discovering anomalies in the NCC and CBB variables: the moving range control chart, the moving range technique, ARIMA, and the Tukey box plot. These techniques performed at 100% effectiveness for identifying data anomalies in our test cases. The differences in the false alarm rate among the techniques were insignificant.
The moving range technique is an adaptation of the mR control chart. The techniques are essentially the same except that the moving range technique evaluated the data one record at a time (for n>3), while the mR control chart used the entire data set of values.
As stated in Section 4.1.1, ARIMA is somewhat complex because it requires human judgment as part of the method. Implementing a fully automated ARIMA method would be more costly than implementing a method based on the moving range of the data. The calculations and anomaly detection rules associated with the moving range technique are simple and would be easy to implement as an automated stand-alone anomaly detection system. Therefore, moving range is recommended as the technique of choice for the detection of anomalies in variables whose time series behave similarly to NCC or CBB.

Challenges Encountered During This Research
We encountered a number of challenges during the course of this research project. First, we were not able to test our techniques against data that had been previously verified as error free. We dealt with this issue by involving an EVM subject matter expert to identify probable defects that we used as test cases in our analysis.
A second challenge involved distinguishing data errors from accurate data that depicted anomalous program behavior. Data anomalies are detected by measuring the departure of values from what they are expected to be. The expectation in this research was based on statistical and probabilistic models and distributions. When a value is within an expected range, it is treated as valid and accurate. However, when it is a measurable departure from what is expected, it is treated as anomalous. Defining a normal region that minimizes the number of false positive and false negative anomalies can be difficult. A third challenge was the nature of EVM-type data, as it represents actual performance and is not from a stochastic process that can be modeled. Human intervention is at play as program managers make adjustments to the allocation of resources based on the current state of the program. This redistribution of resources throughout the program causes the performance indicator to change in ways that may not be predictable.
Finally, an additional concern associated with this factor is the process for resolving whether a defect is caused by an error or by actual program performance. In all cases, when an anomaly is discovered, the only reliable way to determine its true nature is to trace the data value back to the source to conduct root cause analysis. In this study, we were unable to obtain traceability back to the source (individual or authoritative record) that could resolve the nature of the anomaly. As in the previously identified challenge, we mitigated this issue by consulting with EVM subject matter experts to distinguish anomalies (identified in our test cases; see Appendix B) resulting from probable data defects vs. anomalies attributable to actual program performance.

Implications of This Research
Because the cost of poor data quality is a significant problem in government and commercial industry, the National Research Council (NRC) report, Critical Code: Software Producibility for Defense, Recommendation 2-2 states: "The DoD should take steps to accumulate high-quality data regarding project management experience and technology choices" [NRC 2010]. But committing errors is part of the human condition. We all do it, no matter how careful we are. We rely on quality checks, peer reviews, and inspections to weed out errors in the final product. Without these safeguards, defects are injected into the product and processes and remain there.
Information is the product of the data life cycle. As noted in Figure 12, the potential for errors is significant because errors can be injected whenever human beings touch the data through processing and analysis activities as the data are transformed into information that supports decision making. Correcting the data errors represents costly rework to determine the source of the error and fix it. When errors go undetected, flawed analysis leads to potentially flawed decisions that are based on the derived information. Also, since many information systems involve multiple shared repositories, data errors are replicated and propagate uncontrollably. This is why it is important to focus on correcting data errors at the time of entry rather than downstream in the data life cycle where the errors become embedded. Many organizations are flooded with data, and error detection methods are ad hoc or non-existent. While some errors are detected through manual "sanity" checks of the data, many types of errors escape detection due to the volume of data and the difficulty and tediousness of manual inspection.
The purpose of this research study was to investigate the efficacy of methods that detect potential data errors through automated algorithms. The development of automated support would improve data quality by reducing data defects and release the analysts from the tedious and repetitive task of manual inspection so they can focus their efforts more productively.

Recommendations
This research demonstrates that statistical techniques can be implemented to discover potential data anomalies that would have otherwise gone undetected. We believe that it would be technically feasible and potentially very practical to codify the high performing statistical techniques into automated procedures that would scan and screen data anomalies when data are being entered into a repository. Such a capability could be coupled to and preceded by more basic types of error checking that would initially screen basic types of errors from the data based on business rules. There also may be significant potential for improving anomaly detection based on multivariate approaches.
Future research should focus on the cost/benefit analysis to determine the economic advantages of automating a data anomaly detection capability that could serve as the front end of a data collection system. While it appears there will be a need for back-end checks that use all of the available records for a program, it may be that highly effective front-end checking would eventually eliminate the need for such a process.

Appendix A Data Defect Taxonomy
This table was adapted from the work of Larry English [English 2009].

Existence
The data the enterprise requires exists.
Record existence A record exists for every real-world object or event the enterprise needs to know about.
Value existence A given data element has a full value stored for all records that should have a value.

Completeness
Each process or decision has all the information it requires.
Value completeness A given data element (fact) has a full value stored for all records that should have a value.

Validity
Data values conform to the information product specifications.
Value validity A data value is a valid value or is within a specified range of valid values for this data element.

Business rule validity
Data values conform to the specified business rules.

Derivation validity
A derived or calculated data value is produced correctly according to a specified calculation formula or set of derivation rules.

Accuracy
The data value correctly represents the characteristic of the real-world object or event it describes.

Accuracy to reality
The data correctly reflects the characteristics of a real-world object or event being described. Accuracy and precision represent the highest degree of inherent information quality possible.
Accuracy to surrogate source The data agree with an original, corroborative source record of data, such as a notarized birth certificate, document, or unaltered electronic data received from a party outside the control of the organization that is demonstrated to be a reliable source.
Precision Data values are correct to the right level of detail or granularity, such as price to the penny or weight to the nearest tenth of a gram.

Non-duplication
There is only one record in a given data store that represents a single realworld object or event.
Source quality and security warranties or certifications The source of information (1) guarantees the quality of information it provides with remedies for non-compliance; (2) documents its certification in its Information Quality Management capabilities to capture, maintain, and deliver Quality Information; (3) provides objective and verifiable measures of the quality of information it provides in agreed-upon quality characteristics; and (4) guarantees that the information has been protected from unauthorized access or modification.

Equivalence of redundant or distributed data
Data about an object or event in one data store is semantically equivalent to data about the same object or event in another data store.

Concurrency of redundant or distributed data
The information float or lag time is acceptable between (a) when data are knowable (created or changed) in one data store to (b) when it is knowable in a redundant or distributed data store, and concurrent queries to each data store produce the same result.

Currency
The "age" of the data are correct for the knowledge workers' purpose or purposes.
This appendix presents the test cases we used to evaluate the effectiveness of the anomaly detection methods investigated as part of this research study. The arrows on each graph indicate values that were identified as possible data errors by an OSD subject matter expert.

Appendix D Analysis Results -Significance Tests
This section presents the significance tests referred to in the Results and Discussion section of the document.

A p-value of 0.013 demonstrates a significant difference in the effectiveness of the techniques listed in Table 35. A p-value of 0.000 (listed in Table 37) demonstrates a significant difference in the false positives generated by the techniques. The test summarized in Table 38 demonstrates that the 3-sigma method generates fewer false positives than all other techniques, and this difference is statistically significant. Other tests of proportions for false positives did not show significant differences.

A p-value of 0.000 demonstrates a significant difference in the generation of false positives by the techniques listed in Table 40. The test of two proportions demonstrates a significant difference between the mR CC method and the moving range technique: the mR CC method generates fewer false positives, and the difference is significant (p-value = 0.009). Other tests of two proportions failed to show significant differences in performance.
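As a rough illustration of the two-proportion tests cited above, the pooled z-test can be sketched as follows. The false-positive counts used here are hypothetical and are not the counts behind Tables 35 through 40.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions (pooled)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)            # pooled proportion under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical: technique A flags 3 false positives in 200 points tested,
# technique B flags 15 in 200.
z, p = two_proportion_z(3, 200, 15, 200)
print(round(z, 2), round(p, 4))
```

A p-value below the chosen significance level (0.05 in the comparisons above) indicates that the two techniques differ in their false-positive rates.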

Appendix E Summary of Leading Enterprise Data Quality Platforms
This appendix provides summaries of six enterprise data quality platforms that were highlighted in The Forrester Wave: Enterprise Data Quality Platforms, Q4 2010 [Karel 2010]. The focus of the summary for each tool is on the data profiling capabilities associated with the platform.

IBM InfoSphere Information Analyzer
Provides a window into a unified data integration platform, with insight into data source analysis, ETL (extract, transform, load) processes, data quality rules, business terminology, data models, and business intelligence reports.

Discovery
To understand data relationships: identifies and documents existing data, where it is located, and how it is linked across systems, by intelligently capturing relationships and determining applied transformations and business rules.

Data Profiling
IBM InfoSphere Information Analyzer is intended to help analysts understand data by offering data quality assessments, flexible data rules design and analysis, and quality monitoring capabilities. Capabilities include the following:
• deep profiling capabilities - provide a comprehensive understanding of data at the column, key, source, and cross-domain levels
• multi-level rules analysis (by rule, record, or pattern) unique to the data quality space - provides the ability to evaluate, analyze, and address multiple data issues by record rather than in isolation
• shared metadata foundation - integrates the modules across IBM InfoSphere Information Server in support of the enterprise
• native parallel execution for enterprise scalability - enables high performance against massive volumes of data
• supports data governance initiatives through auditing, tracking, and monitoring of data quality conditions over time
• enhanced data classification capabilities - help to focus attention on common personal identification information to build a foundation for data governance
• used to proactively identify data quality issues, find patterns, and set up baselines for implementing quality monitoring efforts and tracking data quality improvements

Informatica
http://www.informatica.com/products_services/Pages/index.aspx#page=page-8
Informatica offers a number of products that share various capabilities related to data profiling and data quality. The software product Data Explorer is its primary product for data profiling. However, additional products are available that provide data profiling capabilities but are geared for specific roles within the organization, as shown in the table below.

Software Tool               Purpose
Data Explorer               Business analysts, data stewards, IT developers
Informatica Analyst         Line-of-business managers, data stewards, and business analysts
Informatica Developer       IT developers
Informatica Administrator   IT administrators

Software Tool: Data Explorer
Data profiling capabilities include the following:
• analyze data to automatically profile the content, structure, and quality of highly complex data structures
• discover hidden inconsistencies and incompatibilities between data sources and target applications
• easily customize new rules to automatically profile new data entries

Data mapping capabilities include the following:
• generate accurate source-to-target mappings between different data structures and define the necessary transformation specifications
• compare actual data and metadata sources to target application requirements
• find data gaps, redundancies, and inaccuracies to resolve before moving data
• identify data anomalies and create a normalized schema capable of supporting data

Connectivity capabilities include the following:
• profile all data types in a wide variety of applications, systems, and databases
• extend support beyond basic customer data, such as names, addresses, and telephone numbers, to include product, financial, asset, pricing, and other data
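To make the notion of column-level profiling concrete, the sketch below summarizes completeness, cardinality, and format conformance for a single column. The sample values, null markers, and numeric pattern are illustrative assumptions; they do not represent Data Explorer's actual rules or output.

```python
import re
from collections import Counter

def profile_column(values, pattern=r"^\d+(\.\d+)?$"):
    """Summarize completeness, cardinality, and format conformance of one column."""
    total = len(values)
    # Treat None, empty strings, and a literal "NULL" marker as missing.
    non_null = [v for v in values if v not in (None, "", "NULL")]
    conforming = [v for v in non_null if re.match(pattern, str(v))]
    return {
        "rows": total,
        "null_rate": round(1 - len(non_null) / total, 3),
        "distinct": len(set(non_null)),
        "pattern_conformance": round(len(conforming) / len(non_null), 3),
        "top_values": Counter(non_null).most_common(3),
    }

# Hypothetical contract-value column with blanks and a malformed entry.
col = ["1200.5", "1300", "", "1300", "12,00", None, "1450"]
print(profile_column(col))
```

A profiling tool computes summaries like these for every column, then flags columns whose null rate, cardinality, or pattern conformance falls outside expected bounds.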

Software Tool: Informatica Analyst
This easy-to-use, browser-based tool is designed to empower the business to proactively participate in data quality processes without IT intervention. It enables line-of-business managers, data stewards, and business analysts to
• profile, analyze, and create data quality scorecards
• drill down to specific records with poor data quality to determine their impact on the business and how to fix them
• monitor and share data quality metrics and reports by emailing a URL to colleagues
• define data quality targets and valid reference data sets
• specify, validate, configure, and test data quality rules
• collaborate efficiently with IT developers to share profiles and implement data quality rules
• identify anomalies and manage data quality exception records
• track data quality targets on an ongoing basis

Software Tool: Informatica Developer
This Eclipse-based data quality development environment is designed to enhance IT productivity. It enables IT developers to
• discover and access all data sources, whether they are on premise, with partners, or in the cloud

Software Tool: Informatica Administrator
This easy-to-use, browser-based tool with centralized configuration and deployment capabilities for managing the data integration environment enables IT administrators to
• manage services and nodes, including configurations that support grid and high availability
• oversee security and user management, including users, groups, roles, privileges, and permissions

Pitney Bowes
Software Tool: Profiler Plus
Profiler Plus is intended to help the user discover and understand the quality of data. Its data profiling features and integrated analysis management framework enhance, accelerate, and reduce risk in data analysis activities.

Software Tool: Monitor Plus
Monitor Plus allows you to create rules that provide a proven way of checking and validating the data used in your business systems and applications, including the ability to
• run regular data checks using an external scheduler
• integrate monitoring into your existing operational data environment
• receive automatic alerts after every execution, with data reports sent directly to your inbox

Pitney Bowes also provides products that address other, more specific aspects of data quality. They include
• Address Now Module: Capture, validate, and correct addresses for the U.S., Canada, and over 220 countries worldwide with the Address Now Module for the Spectrum Technology Platform
• Advanced Matching Module: Marketing and business processes rely on accurate data to identify and understand the relationships between records. The Advanced Matching Module recognizes customers, products, duplicates, and households across data sources.
• Data Normalization Module: Creates a uniform customer experience by standardizing terms in your database with the Data Normalization Module for the Spectrum Technology Platform
• Universal Addressing Module: Address data exists throughout your enterprise in customer databases, call centers, web sites, and marketing systems. Reliable address information is required to communicate effectively with your customers, develop an accurate single customer view, and leverage your customer-facing technology investments. The Universal Addressing Module provides address validation, correction, and standardization technologies for more than 220 countries.
• Universal Name Module: Provides flexible and global name knowledge to better segment and target your customer base while matching, standardizing, analyzing, and consolidating complex records with confidence
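The normalization-and-matching idea behind modules like these can be sketched minimally as follows. The abbreviation table and sample addresses are invented for illustration and do not reflect the Spectrum Technology Platform's actual rules.

```python
from collections import defaultdict

def normalize(record):
    """Build a simplistic match key: lowercase, strip punctuation, expand common abbreviations."""
    abbrev = {"st": "street", "rd": "road", "ave": "avenue"}  # assumed expansions
    tokens = "".join(
        c.lower() if c.isalnum() or c.isspace() else " " for c in record
    ).split()
    return " ".join(abbrev.get(t, t) for t in tokens)

def find_duplicates(records):
    """Group records that normalize to the same match key."""
    groups = defaultdict(list)
    for r in records:
        groups[normalize(r)].append(r)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical customer addresses with formatting variants.
addrs = ["12 Main St.", "12 MAIN STREET", "4 Oak Ave", "7 Elm Rd"]
print(find_duplicates(addrs))  # groups the two "Main Street" variants
```

Commercial matching engines go well beyond exact keys (phonetic codes, edit distance, probabilistic weights), but the standardize-then-compare pattern is the same.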

SAP BusinessObjects
Software Tool: Data Quality Management
http://www.sap.com/solutions/sapbusinessobjects/large/eim/data-quality-management/index.epx
SAP BusinessObjects Data Quality Management, which is also available in versions for Informatica, SAP solutions, and Siebel, delivers a solution to help analyze, cleanse, and match customer, supplier, product, or material data (structured or unstructured) to ensure highly accurate and complete information anywhere in the enterprise.
SAP BusinessObjects Data Quality Management includes the following features and functionality:
• data quality dashboards that show the impact of data quality problems on all downstream systems or applications
• the ability to apply data quality transformations to all types of data, regardless of industry or data domain, including structured and unstructured data as well as customer, product, supplier, and material information
• intuitive business user interfaces and data quality blueprints to guide you through the process of standardizing, correcting, and matching data to reduce duplicates and identify relationships
• comprehensive global data quality coverage with support for over 230 countries
• comprehensive reference data
• broad, heterogeneous application and system support for both SAP and non-SAP sources and targets
• prepackaged native integration of data quality best practices for SAP, Siebel, and Informatica PowerCenter environments
• optimized developer productivity and application maintenance through intuitive transformations, a centralized business rule repository, and object reuse
• high performance and scalability with software that can meet high-volume needs through parallel processing, grid computing, and bulk data loading support
• flexible technology deployment options, from an enterprise platform to intuitive APIs that allow developers quick deployment of data quality functionality