Schedule Deviation in Software Maintenance


Abstract

The extent to which schedule estimates differ from reality is one important measure of project performance. But is higher maturity in fact correlated with a reduction in schedule deviation? Data from 752 maintenance projects drawn from 441 SW-CMM assessments are analyzed using a zero inflated Poisson (ZIP) regression model, and the results are validated using a bootstrap estimation method. Projects from higher maturity organizations typically report less schedule deviation than those from organizations assessed at lower maturity levels.

1 Introduction

The Importance of Software Maintenance
The Capability Maturity Model® (CMM®) for Software (SW-CMM) [Paulk et al. 96] cites the definition of maintenance from IEEE Std 610-1990 [IEEE 90] as "the process of modifying a software system or component after delivery to correct faults, improve performance or other attributes, or adapt to a changed environment." This definition includes at least three types of software maintenance:
1. corrective maintenance: to correct processing, performance, or implementation faults of the software.
2. adaptive maintenance: to adapt the software to changes in environment, such as new hardware or the next release of an operating system. Adaptive maintenance does not lead to changes in the system's functionality.
3. perfective maintenance: to perfect the software for its performance, processing efficiency, maintainability, or accommodation of new or changed user requirements.
The IEEE has estimated the annual cost of software maintenance in the United States to exceed $70 billion [Edelstein 93, Lerner 94]. Schrank has estimated it to be more than $30 billion annually [Schrank et al. 95]. Others have estimated the magnitude of software maintenance costs to range from 40 to 80 percent of overall software life-cycle costs [Alkhatib 92, Kemerer 95, Schrank et al. 95]. A widely used rule of thumb for the distribution of maintenance activities has been 60 percent for enhancements, 20 percent for adaptation, and 20 percent for error correction [Lientz & Swanson 80, Glass & Noiseux 81].
While the SW-CMM is intended to be suited for both development and maintenance processes, difficulties in implementing the model in maintenance-only organizations have been reported [Drew 92]. Others have criticized the SW-CMM for not directly addressing maintenance [Kuilboer & Ashrafi 00]. One survey study conducted in the United Kingdom failed to find evidence that higher maturity companies manage maintenance more effectively than lower maturity companies; however, the survey does not explicitly state how it defines maturity [Hall et al. 01].

This Study
A basic premise of the SW-CMM is that higher process maturity is associated with better project performance and product quality. Furthermore, improving maturity is expected to subsequently improve both performance and quality. Testing this premise can be considered an evaluation of the predictive validity of the assessment measurement procedure. Given both the high cost of software maintenance and enduring questions about the applicability of the SW-CMM, it is important to provide objective evidence about the predictive validity of the SW-CMM in a maintenance context.
This study provides evidence that higher process maturity is in fact associated with reduced mean and variance of schedule deviation in software maintenance. The analysis is based on 752 maintenance projects from 441 CMM-Based Appraisals for Internal Process Improvement (CBA IPI) assessments. A zero inflated Poisson (ZIP) regression model is used to account for nonnegative integer values and the existence of multiple reports of no deviations in schedule. The results are validated using a bootstrap estimation method.
Section 2 reviews previous studies on predictive validity and presents the study's hypotheses. Section 3 addresses data collection and the characteristics of our sample. Section 4 presents a brief introduction of a ZIP regression model and a bootstrap method for examining the stability of our results. Section 5 presents the results of the analysis. Section 6 contains our conclusions and final remarks.

Theoretical Basis
The SW-CMM provides a framework for organizing software processes into five evolutionary steps, or maturity levels, which lay successive foundations for continuous process improvement (Table 1). The SW-CMM covers practices for planning, engineering, and managing software development and maintenance. More mature software organizations, when following these key practices, are expected to be better able to meet their cost, schedule, functionality, product quality, and other performance objectives [Paulk et al. 96]. Testing the above basic premise of the SW-CMM requires an empirical evaluation of the predictive validity of the process maturity concept. Is there a characteristic relationship between process maturity and independently measured performance criteria? Clearly, such relationships may depend on other contextual factors; that is, the relationships may differ from one context to another or may exist in only a few contexts. This theoretical basis for evaluating predictive validity is depicted in Figure 1.

Variable Definition and Empirical Hypotheses
In the context of software maintenance shown in Figure 1, schedule deviation is the performance measure we use as our dependent variable. Schedule deviation is defined as the absolute value of the difference between actual and planned schedule, i.e., y = |Actual − Planned|. Schedule deviation y is expressed in months ahead of or behind schedule, with a value of zero indicating that the project is on schedule. (One might argue that being ahead of schedule is less serious than being behind schedule; however, too few projects reported being ahead of schedule to allow a separate analysis here. Other weaknesses of the schedule deviation measure are described in Section 3.3.) Our explanatory variable, process maturity, is coded from maturity level 5 down to maturity level 1. Maturity level is an ordinal scale, not an interval scale; however, we do employ parametric statistics in this analysis.
Previous studies show that the distribution of maturity levels differs between the United States and elsewhere in the world [SEI 02, Jung et al. 02]. Hence, we examine how the region where the assessment was conducted (U.S. versus non-U.S.) acts as a contextual factor in mediating the effects of our research hypotheses. The theoretical basis shown in Figure 1 implies that schedule deviation is negatively associated with maturity level: the higher the maturity level, the less schedule deviation. In addition, the association may differ across regions of the world. Two types of benefits are expected to follow:

• HYPOTHESIS 1: Increasing maturity level reduces the mean of schedule deviation in maintenance projects.
• HYPOTHESIS 2: Increasing maturity level reduces the variance of schedule deviation in maintenance projects.
Testing these two hypotheses in software maintenance projects allows us to evaluate the predictive validity of the process maturity concept. The same two hypotheses have been depicted elsewhere in graphical form, as seen in Figure 2 [Paulk et al. 96].
Figure 2: Process Capability as Indicated by Maturity Level [Paulk et al. 93a]

Previous Empirical Studies
All previous studies of predictive validity in process improvement are based either implicitly or explicitly on the theoretical model depicted in Figure 1. While some empirical studies examine variation across large numbers of organizations, most of them are case studies that describe the experiences and benefits from increasing process maturity in a single organization or a small number of organizations.
Case studies are quite useful for demonstrating proof of concept. Case studies, however, have a serious methodological disadvantage: it is difficult at best to generalize their results to a wider population. A case study can monitor projects in depth, but it is difficult to replicate the results later in a comparable context. Case studies also tend to suffer from a selection bias [adapted from El-Emam & Birk 00b]:
• Organizations that have not shown any process improvement or have even regressed are highly unlikely to publicize their results, so case studies tend to show mainly success stories.
• The majority of organizations do not collect objective process and product data (e.g., they do not track defect levels or even keep accurate effort records). Only organizations that have made improvements and reached a reasonable level of maturity will have the objective data needed to demonstrate improvements (in productivity, quality, or return on investment). Therefore, failures and non-movers are less likely to be considered viable case studies due to the lack of data.
By now, several predictive validity studies have collected data from larger numbers of organizations or projects, and they have statistically investigated relationships between capability maturity and independent measures of performance. A survey study of individuals from SW-CMM-assessed organizations shows that higher maturity organizations tend to perform better on subjective measures of performance (including ability to meet schedule), product quality, staff productivity, customer satisfaction, and staff morale [Hwang & Jung 03]. However, much weaker relationships were found between project-management process capability and any of the performance measures in small organizations.

Data Source
Authorized lead assessors are required to provide reports to the Software Engineering Institute (SEI) for their completed assessments. Assessment data from the reports are kept in an SEI repository called the Process Appraisal Information System (PAIS). The PAIS includes information for each assessment on the company and appraised entity, key process area (KPA) profiles, organization and project context, functional area representative groups, findings, and related data. This report considers only CBA IPI assessments. Not all CBA IPI assessments include KPA rating profiles, since the determination of a maturity level or KPA ratings is optional and is provided at the discretion of the assessment sponsor. The dataset that we analyzed for this study was extracted from appraisal reports in the PAIS for the period of January 1998 through December 2001.

Dataset Analyzed
A statistical rule of thumb states that there should be at least six observations (sometimes five) to have confidence in analysis results. A similar criterion was used in an earlier analysis of software process assessment [Jung et al. 01]. Briand and colleagues [Briand et al. 00] and El-Emam and colleagues also have used a "greater-than-five-observations" criterion for the validation of software product metrics.
We follow the same rule of thumb here. Fewer than five maintenance projects at maturity levels 4 and 5 reported any schedule deviation whatsoever. Hence, we exclude maturity levels 4 and 5 from our statistical analysis. Note, however, that the lower incidence of reported schedule deviation at maturity levels 4 and 5 is of course entirely consistent with our empirical hypotheses.
SEI is a service mark of Carnegie Mellon University. Submitting an assessment report does not imply that the SEI certifies any assessment findings or maturity levels. All assessment data are kept confidential and are available only to SEI personnel on a need-to-know basis for research and development. Information in the PAIS is used to produce industry profiles or as aggregated data for research publications, and the SEI publishes a Process Maturity Profile twice a year (http://www.sei.cmu.edu/sema/profile.html).
Data exist for 752 maintenance projects from 441 organizations assessed at maturity levels 1 through 3 inclusive. Figure 3 shows the number of organizations and maintenance projects assessed by region. Since more than one maintenance project exists in some organizations, the number of organizations is smaller than the number of projects.

Figure 3: Organizations and Maintenance Projects in Regions
Table 2 shows the number of assessed maintenance projects at each maturity level. Schedule delays were reported by a total of 47 projects, while 8 projects reported being ahead of schedule. The numbers in parentheses denote the number of projects that reported deviations in schedule.

Unit of Analysis
The units of analysis in this study are projects in the maintenance phase of their life cycles, and our performance measure is schedule deviation expressed in months. Since the organization typically is the unit of analysis in CBA IPI assessments, our measure of maturity is organization-wide. If several maintenance projects are assessed in a single organization, all of the projects have the same maturity level but have their own individual values of schedule deviation.

Data Quality
Our analysis relies mostly on two variables. The independent variable (covariate) is organizational maturity level as determined by CBA IPI assessment teams. A previous study provides ample confidence in the quality of that measure [Jung & Goldenson 02]. The dependent variable, schedule deviation, is a self-reported nonnegative integer measured in months, in which a project may be ahead, behind, or on schedule. Our reliance on such a measure raises significant accuracy issues.
In particular, a very large proportion, approximately 95 percent, of the projects in the maintenance phase of their life cycles reported being on schedule. That, of course, is contrary to both the results of previous studies and practical experience in the field.
Several reasons may account for this divergence. The question that is used to measure schedule deviation asks only whether or not the project is on time; the criteria for being on time are not specified. One likely conjecture is that many projects periodically modify their baseline schedule estimates, which results in less reported delay. Another is that assessments often include exemplary projects.
Time ahead of or behind schedule is measured in months, so there also is most probably rounding error in the projects' replies. If a maintenance project is delayed for six weeks, should it be recorded as a delay of one month or two? Similarly, should a two-week delay be reported as a one-month delay or as essentially on time? Moreover, the measure does not account for variations in project size and duration. For example, a two-month delay in a one-month project is treated the same as a two-month delay in a nine-month project.
That said, as one might expect, reported schedule deviation is in fact higher for projects in phases of their life cycles other than maintenance. For example, more than 25 percent of the projects in test and integration do report being a month or more behind schedule.
Self-reports and direct observation often differ. For example, one study shows that software engineers over-report the amount of time that they work by an average of almost three percent; the proportion of times that self-reports and observer reports agreed on what the software engineer actually was doing varied substantially, from 95 to 58 percent [Perry et al. 96]. Errors in self-reports have been noted in various other studies, including voting [Abelson et al. 92], receipt of health care [Loftus et al. 92], and doctor's visits [McCallum et al. 95].

In a previous study, we performed an internal-consistency reliability analysis using the same 676 CBA IPI assessments on which the present work is based [Jung & Goldenson 02]. The results identified three underlying dimensions of the capability maturity construct: "project implementation" includes the key process areas (KPAs) at maturity level 2, "organization implementation" covers the KPAs at maturity level 3, and the KPAs at both maturity levels 4 and 5 are subsumed under "quantitative process implementation." Cronbach's alpha coefficient of internal consistency for each of the three dimensions exceeds the recommended value of 0.9, which indicates a sufficiently high level of internal consistency for use in practice.
Every measure has its strengths and weaknesses. For example, some studies recommend using relative measures: Conte and colleagues [Conte et al. 86] suggest a magnitude of relative error (MRE) measure of schedule deviation, y_i = |(Actual − Planned)/Actual|, while Stensrud and colleagues [Stensrud et al. 02] prefer a measure of the magnitude of error relative (MER), y_i = |(Actual − Planned)/Planned|. But if the denominator has a small value, such a measure may be exaggerated and take on an unreasonably large value.
Other candidate measures of schedule deviation include arithmetic means and standard deviations. However, they too lack robustness; one very small or very large value can cause them to take on an arbitrarily large value. A trimmed method such as a Winsorized standard deviation or the median absolute deviation might be used [Lunneborg 00]; however, such methods cannot be applied to a dataset characterized by excess zeros such as ours.
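To make these trade-offs concrete, the following is a minimal Python sketch of the measures discussed above: the relative measures MRE and MER, and the trimmed alternatives (a Winsorized standard deviation and the median absolute deviation). The numeric values are invented for illustration and are not from the report's dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical schedules (months); not from the PAIS dataset.
actual = np.array([3.0, 5.0, 2.0, 9.0, 4.0])
planned = np.array([2.0, 5.0, 1.0, 7.0, 4.0])

# Relative measures discussed in the text:
mre = np.abs((actual - planned) / actual)    # magnitude of relative error [Conte et al. 86]
mer = np.abs((actual - planned) / planned)   # magnitude of error relative [Stensrud et al. 02]

# Trimmed (robust) alternatives mentioned as candidates [Lunneborg 00]:
deviation = np.abs(actual - planned)
winsorized_sd = np.sqrt(stats.mstats.winsorize(deviation, limits=0.1).var(ddof=1))
mad = stats.median_abs_deviation(deviation)

print(mre, mer, winsorized_sd, mad)
```

Note how a small denominator (the one-month planned project) inflates MER relative to MRE, which is exactly the exaggeration problem the text describes.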
We have very little independent basis for judging the criterion validity of the schedule deviation question per se. Moreover, maturity level is an organizational construct, while schedule deviation can vary by project. This too may introduce measurement error into our analysis. As will be seen later, however, our results are robust in spite of these limitations.
The relationships with maturity level provide compelling evidence of the predictive validity of the SW-CMM.

Sampling Characteristics of the Dataset
Statistical analysis and its interpretations depend on the criteria by which a sample (subset) is selected from a population. Classical population inference requires random sampling. Hence, we examine here the sampling characteristics of our dataset.
The simplest form of sampling is a random sample. A simple random sample is defined as "a set of cases selected from a well-defined population of cases by a process that ensures that every sample containing the same number of cases has the same chance of being the one selected" [Lunneborg 00]. In the context of SW-CMM assessments, this definition explicitly requires two things: (1) a well-defined population of assessment cases from which to sample, and (2) a well-defined random process for selecting the sample.
The assessments reported to the PAIS database do not satisfy these two requirements. The population and the size of its assessments cannot be clearly defined, and the assessed organizations are not selected on a random basis. Rather, the assessments in PAIS are a self-selected sample (i.e., assessed organizations have voluntarily participated in CBA IPI assessments to improve their software processes or were required to do so by contractors). Our analyses here clearly must be based on nonrandom sampling methods.

Under our nonrandom design, the PAIS dataset itself is a population of assessment cases, where the population is called a local population or a set of available cases [Lunneborg 00]. Although the PAIS database retains the largest number of assessment cases available anywhere, the dataset is not a random sample, and our results cannot be generalized to all SW-CMM assessments conducted around the world. Hence, interpretation of our results should rightly be limited to assessments reported to PAIS by the current base of CMM users. Still, it is sensible to make inferences about the descriptions to the local population. The descriptions are not inferences to a wider population; rather, they are descriptive statistics, which can neither be generalized to others nor carry causal implications. Typical descriptions include measures of central tendency (e.g., means or medians), dispersion (e.g., variance or control limits), or relationship (e.g., correlation coefficients or internal consistency). Descriptions based on a nonrandom sample need assurance that they truly characterize the available cases and that they are stable [Lunneborg 00, Montgomery et al. 98]. An available set of cases such as our assessment dataset cannot be assumed to have the same degree of homogeneity as a random sample. A fair description is a stable one that is relatively uninfluenced by the presence of specific cases. Thus, results such as those in this report should be tested for their stability (homogeneity).

A Zero Inflated Poisson (ZIP) Regression Model
Schedule deviation, as we have defined it, is limited to nonnegative integer values, which can be called count outcomes. More than one regression model exists for count outcomes [Long 97, King 88], so it is necessary to select an appropriate one. The selection should consider the strengths and weaknesses of each model in a specific application field, as well as perceptions in the research community about what are appropriate models of count outcomes. Schedule deviation is a relatively rare occurrence in our dataset: many projects reported being less than one month behind schedule, and there are many zero values. Hence, this study uses a zero inflated Poisson (ZIP) regression model.

ZIP regression has been used elsewhere for predicting count outcomes in software engineering [Khoshgoftaar et al. 02]. The ZIP regression model accounts for an excess number of zero values on the dependent variable, which meets our needs with schedule deviation. Commonly used Pearson or Spearman correlations are not sufficient to examine such an association.
Our ZIP regression model assumes that the software maintenance processes in an assessed organization are in either a "perfect" or an "imperfect" state. In the perfect state, no schedule deviation will occur, whereas in the imperfect state, there may or may not be schedule deviation. Several factors affect the distribution of schedule deviation in software maintenance and the probability of being in an imperfect state; for the purposes of this study, process maturity is treated as the single such factor.
Let $\psi_i$ be the probability that the $i$th maintenance project is performed by a maintenance process that is in a perfect state. Then $(1-\psi_i)$ is the probability that the process of the $i$th maintenance project is in an imperfect state. Maintenance projects whose processes are in a perfect state are always assumed to be on schedule. Projects whose processes are in an imperfect state may still be on schedule, with schedule deviation following a Poisson distribution with parameter $\mu_i$; the probability of being on schedule is then $\exp(-\mu_i)$. For maintenance processes that are in an imperfect state, the probability of a schedule deviation of one month or more is the product of the probability of being in an imperfect state and the Poisson probability of the schedule deviation $y_i$.
Therefore, the probability density function of the ZIP regression model is as follows [Lambert 92, Long 97]:

$$\Pr(Y_i = y_i) = \begin{cases} \psi_i + (1-\psi_i)\exp(-\mu_i), & y_i = 0, \\ (1-\psi_i)\,\dfrac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!}, & y_i = 1, 2, \ldots \end{cases} \tag{1}$$

The conditional mean and variance of the ZIP probability function (1) are $(1-\psi_i)\mu_i$ and $(1-\psi_i)\mu_i(1+\psi_i\mu_i)$, respectively. If $\psi$ is 0, then the ZIP regression model (1) reduces to a Poisson regression model. The term "conditional" is used to denote that the mean and variance depend on covariates. The only covariate in this study is maturity level.
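To make the density concrete, here is a minimal Python sketch of equation (1) and its conditional moments. The parameter values are made up for illustration and are not estimates from this report.

```python
from math import exp, factorial

def zip_pmf(y, psi, mu):
    """Probability function (1): zero inflated Poisson."""
    if y == 0:
        return psi + (1 - psi) * exp(-mu)
    return (1 - psi) * exp(-mu) * mu**y / factorial(y)

def zip_mean(psi, mu):
    return (1 - psi) * mu                    # conditional mean

def zip_var(psi, mu):
    return (1 - psi) * mu * (1 + psi * mu)   # conditional variance

# Illustrative parameter values only:
psi, mu = 0.7, 0.8
probs = [zip_pmf(y, psi, mu) for y in range(6)]

# Sanity check: the probabilities sum to one.
assert abs(sum(zip_pmf(y, psi, mu) for y in range(100)) - 1.0) < 1e-12
```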
The ZIP regression model is obtained through the following two link functions:

$$\log(\mu_i) = \beta_0 + \beta_1 x_i, \qquad \operatorname{logit}(\psi_i) = \log\!\frac{\psi_i}{1-\psi_i} = \gamma_0 + \gamma_1 x_i,$$

where $x_i$ denotes the maturity level for project $i$. A negative value of $\beta_1$ implies that a high-maturity maintenance process has less schedule deviation than a low-maturity one. The probability that the maintenance process of project $i$ is in a perfect state is estimated by

$$\hat{\psi}_i = \frac{\exp(\hat{\gamma}_0 + \hat{\gamma}_1 x_i)}{1 + \exp(\hat{\gamma}_0 + \hat{\gamma}_1 x_i)},$$

and the Poisson parameter is estimated by

$$\hat{\mu}_i = \exp(\hat{\beta}_0 + \hat{\beta}_1 x_i).$$
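For readers who want to reproduce this kind of analysis, the following hedged sketch fits a ZIP regression with the statsmodels library. Because the PAIS data are confidential, the data here are simulated with invented parameter values; the `inflate_*` rows in the fitted summary correspond to the $\gamma$ parameters of the logit link, and the remaining rows to the $\beta$ parameters of the log link.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

# Simulated stand-in data: maturity levels 1-3 and monthly deviation counts.
rng = np.random.default_rng(0)
ml = rng.integers(1, 4, size=500)              # maturity level covariate
psi = 1 / (1 + np.exp(-(0.5 + 0.6 * ml)))      # perfect-state probability rises with ML
mu = np.exp(0.4 - 0.45 * ml)                   # Poisson mean falls with ML
y = np.where(rng.random(500) < psi, 0, rng.poisson(mu))

X = sm.add_constant(ml.astype(float))          # [1, maturity level] for both links
model = ZeroInflatedPoisson(y, X, exog_infl=X, inflation='logit')
result = model.fit(disp=False)
print(result.summary())                        # expect a negative beta_1, positive gamma_1
```

With the signs chosen above, the recovered $\beta_1$ should be negative and $\gamma_1$ positive, mirroring the pattern the report finds in its datasets.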

Stability Examination
Since our dataset is not a simple random sample, we also need to examine the stability of the analysis results. For this purpose, we use a bootstrap resampling technique that samples $B$ times from the original observations with replacement, where $B$ is a large number such as 1,000. (This bootstrap method should not be confused with the Bootstrap model for process assessment [Kuvaja 99].) For each sample, the ZIP regression yields the descriptions of interest: $\beta_1$ (the coefficient of maturity level in the ZIP regression model), $(1-\psi)\mu$ (the mean), and $(1-\psi)\mu(1+\psi\mu)$ (the variance). Then, the lower and upper limits of the confidence interval of each description are determined at the 2.5 and 97.5 percentiles, respectively, from the empirical reference distribution (i.e., a histogram of the $B$ replications). The confidence interval of the empirical reference distribution is called the empirical confidence interval (ECI). The bootstrap method is free from unrealistic assumptions such as normality and homogeneity and is suitable for conducting local inferences.
As noted earlier, we use the region of assessed organizations as a mediating contextual factor. The proportions of assessments in the two regions are not fixed in advance; rather, a bootstrap sample is drawn with replacement from the original dataset and then divided into U.S. and non-U.S. cases before computing our descriptions. Each bootstrap sample is therefore likely to have different proportions of U.S. and non-U.S. cases. This is called sampling "not by design" from the original dataset [Lunneborg 00].
The description from the original dataset should be solidly in the middle of the empirical reference distribution to be considered stable; it should not lie at or near the limits of the distribution. A measure for evaluating stability, the bias, is defined as follows:

$$\text{Bias} = \bar{\theta}^{*} - \theta, \qquad \bar{\theta}^{*} = \frac{1}{B}\sum_{b=1}^{B}\theta^{*}_{b},$$

where $\theta^{*}_{b}$ is the value of the description at the $b$th subsample, $b = 1, \ldots, B$, and $\theta$ is the description value from the original dataset. The degree of bias is evaluated against the standard error (SE) of the description distribution of the $B$ replicates. The SE is computed as follows:

$$\text{SE} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\theta^{*}_{b} - \bar{\theta}^{*}\right)^{2}}.$$

If the bias is large relative to the SE, there is an instability problem. A criterion for judgment is that if the absolute value of the bias is less than one-quarter of the size of the SE, the bias can be ignored [Efron & Tibshirani 93]. Hence, a description from the original dataset can be considered stable.

Bootstrap methods have been used previously in empirical software engineering. El-Emam and Garro estimated the number of ISO/IEC 15504 assessments by utilizing a capture-recapture method [El-Emam & Garro 00]. Jung and Hunter utilized a bootstrap method in computing confidence levels for the capability levels of each ISO/IEC 15504 process [Jung & Hunter 01]. Jung and Goldenson used a bootstrap resampling method to evaluate the stability of internal consistency in the SW-CMM [Jung & Goldenson 02].
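The stability check just described can be sketched in a few lines of Python. The dataset below is fabricated for illustration, and `describe` stands for whichever description (coefficient, mean, or variance) is being examined.

```python
import numpy as np

def bootstrap_stability(y, describe, B=1000, seed=0):
    """Bias, SE, and 95% empirical confidence interval for a description.

    y        : original observations (e.g., schedule deviations in months)
    describe : function mapping a sample to the description of interest
    """
    rng = np.random.default_rng(seed)
    theta = describe(y)                            # description from original dataset
    reps = np.array([describe(rng.choice(y, size=len(y), replace=True))
                     for _ in range(B)])           # B bootstrap replicates
    bias = reps.mean() - theta
    se = reps.std(ddof=1)
    eci = np.percentile(reps, [2.5, 97.5])         # 95% empirical confidence interval
    stable = abs(bias) < se / 4                    # one-quarter-of-SE rule of thumb
    return theta, bias, se, eci, stable

# Hypothetical data: mostly on-schedule projects with a few delays.
y = np.array([0] * 45 + [1, 1, 2, 3, 4])
print(bootstrap_stability(y, np.mean))
```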

Descriptive Statistics
This study is based on 752 maintenance projects from 441 SW-CMM CBA IPI assessments.
Figure 4 shows the distribution of organizations and maintenance projects by region. A single maintenance project was reported in each of 56 percent of the assessed organizations (171 + 76 = 247). Approximately 26 percent of the assessed organizations in the United States included one maintenance project, while about 34 percent of the non-U.S. organizations included a single maintenance project. Two organizations assessed six maintenance projects each. The mean, median, and standard deviation of the number of maintenance projects in the assessed U.S. organizations are 1.67, 1, and 1.01, respectively. In the non-U.S. organizations, the mean, median, and standard deviation are 1.75, 1.5, and 0.95, respectively.

Figure 4: Number of Maintenance Projects in Each Assessed Organization
Figure 5 shows the distribution of maturity level by region. If two or more maintenance projects exist in an assessed organization, the maturity level is counted two or more times.
The most frequent maturity level is 2 (Repeatable) in both regions, followed by level 3 (Defined) and level 1, respectively. Means and standard deviations are presented in Table 3. Maturity levels 4 and 5 are not considered in this study because of the very small number of maintenance projects that report delayed schedules at those levels of process maturity.
As shown in Table 3, the arithmetic mean maturity level in the U.S. dataset is nearly equal to that in the non-U.S. dataset. But the arithmetic mean of schedule deviation in the U.S. dataset, 0.17, is less than half the value of 0.38 in the non-U.S. dataset.

Parameter Estimation and Stability Test
The results of our ZIP regression analyses are given in Table 5. As expected, the estimated coefficient of maturity level, $\beta_1$, is negative and statistically significant for both the U.S. and non-U.S. datasets. The negative association indicates that schedule deviation decreases across maintenance projects as their respective organizations' maturity levels progressively increase. This is consistent with the hypothesis in our theoretical model.
The probability of being in a perfect state has a positive association ($\gamma_1 > 0$) with maturity level. The estimate for the non-U.S. dataset is significant only at the 8.9 percent level, which indicates a weak association; however, the results for both regions indicate that the probability of being in a perfect state increases as maturity progressively increases.
The Chi-square goodness-of-fit values in the last row of Table 5 show the aptness of our ZIP regression model [Cameron & Trivedi 98]. (The null hypothesis of the Chi-square goodness-of-fit test is that no difference exists between actual counts and estimated counts; thus, a large statistic and a small p-value imply a poor model fit. The p-value is a right-tail probability.) Each of the two fitted models conforms to the assumptions of the ZIP regression model at an alpha value of 1 percent.
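To illustrate the mechanics of that check, here is a hedged Python sketch of a chi-square comparison between actual and ZIP-fitted counts. The parameter values, observed counts, and degrees-of-freedom bookkeeping are invented for illustration and do not reproduce the report's computation.

```python
import numpy as np
from scipy import stats

def zip_pmf(y, psi, mu):
    # vectorized form of equation (1)
    return (1 - psi) * stats.poisson.pmf(y, mu) + psi * (y == 0)

# Hypothetical fitted parameters and observed counts for deviations 0..4+:
psi_hat, mu_hat, n = 0.8, 0.9, 200
observed = np.array([180, 12, 5, 2, 1])
cells = np.arange(len(observed))

expected = n * zip_pmf(cells, psi_hat, mu_hat)
expected[-1] = n - expected[:-1].sum()     # lump remaining tail mass into the last cell

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1 - 2                # cells - 1 - estimated parameters (illustrative)
p_value = stats.chi2.sf(chi2, dof)         # right-tail probability
print(chi2, p_value)
```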
Figure 6 shows a graph comparing the fitted and actual probabilities for the non-U.S. case. The better the fit, the smaller the difference between probabilities. Figure 6 shows that the number of one-month deviation projects is slightly underestimated. On the other hand, the number of projects with three- and four-month deviations is slightly overestimated. However, all of the differences are negligible. For the U.S. dataset, the plot is omitted because the difference between fitted and actual probabilities is quite small.

The estimated coefficients $\beta_1$ are most directly related to our hypotheses, and they are examined for their stability here. Figure 7 shows a bootstrap distribution of the estimated coefficients $\beta_1$ of schedule deviation with 1,000 replicates. For the U.S. dataset (on the left in Figure 7), the dotted and solid vertical lines denote a bootstrap coefficient of -0.463 and an observed coefficient of -0.428, respectively. The difference between them, -0.035, is defined as the bias in bootstrap sampling. It is ignorable in comparison with the SE value of 0.299. Therefore, we conclude that the estimated coefficient $\beta_1$ of maturity level is stable. In the bootstrap distribution, 97 percent of the estimated coefficients have negative values.
For the non-U.S. dataset, the bias of the maturity level coefficient, -0.363 − (-0.364) = 0.001, is also ignorable in comparison with the SE value of 0.314. However, 89 percent of the estimates of the maturity level coefficient $\beta_1$ are negative. This is a relatively high value compared to the p-value of 0.009 in Table 5.

Mean and Its Stability
As seen in Table 5, the negative coefficients $\beta_1$ of process maturity support the hypothesis that increases in maturity level result in decreases in schedule deviation. Figure 8 shows the evaluation of HYPOTHESIS 1 in fuller detail. The expected mean $(1-\psi)\mu$ of probability density function (1) decreases, and the decrease is distinct for both the U.S. and non-U.S. datasets.

Figure 8: Mean of Schedule Deviation at Maturity Levels 1-3
Table 6 shows the results of the bootstrap resampling that examines the stability of the expected mean at each maturity level. We conclude that the mean at each level is stable because the bias is smaller than one-quarter of the SE. The observed means are a result of the sample in our dataset; different samples would produce different mean values. Hence, a confidence interval is employed to delimit the true (unknown) mean value of schedule deviation at each maturity level. The 95% ECI in Table 6 is computed from Figure 9, which is a bootstrap empirical reference distribution of mean schedule deviation with 1,000 replicates.
As an example of the ECI interpretation, we can say with 95 percent confidence that mean schedule deviation at maturity level 1 in the U.S. dataset lies somewhere in the interval between 0.134 and 0.689. But this interpretation is limited to the current dataset; it cannot be extended to all industries in the United States. Since the bootstrap empirical reference distribution in Figure 9 does not satisfy a normality assumption, using a bootstrap ECI is justified.
Note that there are long tails on the right-hand side of the non-U.S. distributions; they are truncated for reasons of space. As Figure 9 shows, however, the same basic results hold for both the U.S. and non-U.S. data.
The 95% ECIs among the maturity levels in Table 6 partially overlap each other. The empirical reference distributions in Figure 9 also show that overlap. Hence, we must test whether there is a significant difference in mean schedule deviation between maturity levels. The empirical reference distributions in Figure 9 clearly indicate that we cannot employ a parametric test to examine the mean differences; however, the bootstrap method shows that there are statistically significant differences in mean schedule deviation between maturity levels 1 and 2 and between levels 2 and 3, with a p-value of 0.005 for both cases in the U.S. dataset; corresponding p-values of 0.04 and 0.039 show that there also are significant differences in mean schedule deviation for the same two pairs of maturity levels in the non-U.S. dataset.
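As an illustration of how such a bootstrap comparison can be made, here is a hedged Python sketch. It resamples raw deviations per maturity level and compares group means, whereas the report compares ZIP conditional means; the data below are invented.

```python
import numpy as np

def bootstrap_mean_diff_pvalue(y_low, y_high, B=1000, seed=0):
    """One-sided bootstrap p-value that mean(y_low) > mean(y_high).

    Resamples each maturity-level group independently and reports the
    fraction of replicates in which the expected ordering fails.
    """
    rng = np.random.default_rng(seed)
    diffs = np.array([
        rng.choice(y_low, len(y_low), replace=True).mean()
        - rng.choice(y_high, len(y_high), replace=True).mean()
        for _ in range(B)
    ])
    return np.mean(diffs <= 0)   # replicates contradicting the hypothesized ordering

# Hypothetical deviations (months) at maturity levels 1 and 2:
ml1 = np.array([0] * 20 + [1, 2, 2, 3, 4])
ml2 = np.array([0] * 30 + [1, 1, 2])
print(bootstrap_mean_diff_pvalue(ml1, ml2))
```

The same pattern applies to the variance comparison in the next section by substituting a variance estimator for the group means.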

Variance and Its Stability
Our second hypothesis requires us to evaluate the reduction of variance in schedule deviation with respect to maturity level. Figure 10 shows how the conditional variance of the ZIP probability density function (1), $(1-\psi)\mu(1+\psi\mu)$, is reduced with respect to maturity level. Again, the reduction in variance is significant.

The results of our bootstrap resampling shown in Table 7 show that the bias is less than one-quarter of the SE; the estimated variance in schedule deviation therefore also is stable at each maturity level. Finally, the 95% ECIs of conditional variance in Table 7 also partially overlap, and the empirical reference distributions in Figure 11 lead to the same conclusion. Therefore, we can use the bootstrap empirical reference distributions in Figure 11 to evaluate the difference in schedule deviation variance between maturity levels. In the U.S. dataset, 95 percent of the 1,000 replicates show that the variance in schedule deviation at maturity level 2 is less than that at level 1; for maturity levels 2 and 3, the value also is 95 percent. The corresponding values for the non-U.S. dataset are 95 percent and 95 percent, respectively. Increases in process maturity are in fact regularly accompanied by reduced variation in schedule deviation in software maintenance projects.

Conclusion
This study presents compelling evidence about the predictive validity of the SW-CMM as applied to software maintenance. A basic premise of the SW-CMM is that higher maturity should result in better project performance. We find that assessed maturity level is in fact related as expected to schedule deviation in software maintenance projects, and our results are quite robust, in spite of the limitations of the data. While important distinctions remain to be addressed, the results are similar across the software development life cycle; they do not appear to be limited to maintenance projects.
A univariate ZIP regression model is employed to test the premise. Since the results are based on non-random sampling, they are validated using a bootstrap estimation method.
The results show that maintenance projects in higher maturity organizations typically have lower mean and variance in schedule deviation than do comparable projects from organizations assessed at lower levels of maturity. The schedule estimates of projects from higher maturity organizations are markedly more accurate and more predictable.
Clearly, organizational maturity is not the only factor that affects schedule deviation in software maintenance projects. Neither is schedule deviation the only performance measure worth considering. Other measures of performance such as cost, productivity, quality, and customer satisfaction should be evaluated in future analyses of the predictive validity of Capability Maturity Modeling®. Moreover, such analyses should be extended to CMM Integration(SM) and the full life cycle of the development, maintenance, and acquisition of software-intensive systems.
® Capability Maturity Modeling is registered in the U.S. Patent and Trademark Office by Carnegie Mellon University. SM CMM Integration is a service mark of Carnegie Mellon University.

Figure 1: Theoretical Basis in a Predictive Validity Study

Figure 5: Distribution of Maturity Level Among Assessed Organizations

The proportion of organizations at maturity level 2 clearly is not larger than that at maturity level 1 in software industries throughout the world [Fayad & Laitinen 97]. More likely, as early adopters of a new technology, and specifically as organizations interested in software process improvement, the organizations in our sample are drawn from the "higher end" of the maturity spectrum. This phenomenon has been detected in ISO/IEC PDTR 15504 assessments as well [Rout et al. 98].


Figure 6: A Plot of Actual and Estimated Probabilities

Figure 9: Bootstrap Distribution for Mean Schedule Deviation

Figure 11: Bootstrap Distribution of Schedule Deviation Variance

Swanson and Beath claimed that software maintenance is fundamentally different from the development of new systems, since the maintainer must interact with an existing system [Swanson & Beath 89]. Niessink and van Vliet investigated the difference between software maintenance and software development from a service point of view [Niessink & van Vliet 00]. They argued that software maintenance can be seen as providing a service, while software development is concerned with the development of products. Hence, they developed a separate information technology (IT) service Capability Maturity Model meant for software maintenance organizations and other IT service providers. Similarly, Kajko-Mattsson developed a problem management maturity model for corrective maintenance [Kajko-Mattsson 02].

Table 1: Maturity Levels and Their Key Process Areas [Paulk 99]

Table 2: Number of Maintenance Projects

Table 3: Descriptive Statistics of Maturity Level and Schedule Deviation

Table 4 shows the arithmetic mean value of schedule deviation at each maturity level. Though arithmetic means are subject to a lack of robustness, the performance of schedule deviation improves as maturity level increases in both the U.S. and non-U.S. datasets.

Table 4: Arithmetic Mean of Schedule Deviation at Each Maturity Level

Table 5: ZIP Regression Results of Schedule Deviation

Table 6: Bootstrap Results of Mean Schedule Deviation

Table 7: Bootstrap Results of Variance of Schedule Deviation