Vol. 2 No. 4
March 1999
Critical Defect, Fault, and Failure Prediction
Critical Questions
- Are critical defects, faults, and failures predictable?
- Can trusted software systems be fielded predictably?
- How can web-based dissemination of best software practice be achieved?
Global Software Competitiveness Studies
Sponsored by the
Center for National Software Studies (CNSS)
http://www.CNsoftware.org
Conducted by Don O'Neill
ONeillDon@aol.com
http://members.aol.com/ONeillDon/index.html
(301) 990-0377
© Copyright Don O'Neill, 1998, used by the NSC with permission
Critical Defect, Fault, and Failure Prediction
Prologue
If critical software defects, faults, and failures can be predicted, perhaps they can be detected, controlled, and prevented. If this could become standard software practice, the software industry could replace chaos and unpredictability with trustworthy software systems that earn public confidence.
Background
The prosperity of the nation is increasingly dependent on software. A growing number of embedded value points in critical industries are based on software. Consequently, it is essential that we obtain the deepest possible understanding of software engineering in all its dimensions and that we arrive at a realistic expectation for its application.
Experienced software practitioners and managers understand that software development is a process of experimentation involving the continuous discovery of technical information associated with the hypotheses of function, form, and fit of the software product as it moves through the requirements, specification, design, code, test, and maintenance activities of the life cycle. The defects, faults, and failures experienced at different stages are the leading indicators of experiment completion and success. Reasoning about, understanding, and predicting their behavior is the basis for managing software risk and uncertainty.
Need
Doubts about the trustworthiness of software systems threaten the harmonious operation of critical industries and are eroding public confidence in the orderliness of society and its institutions, both public and private.
Concerns about trustworthiness in software systems abound. With the increasing use of software, there is increasing focus on software product quality. Windows 95 was released with thousands of known defects. The Year 2000 Problem looms large... and uncertain. One third of all software projects started are abandoned, and another third do not achieve the functionality intended. Critical multi-billion dollar modernization initiatives have visibly failed even under the glare of public oversight and congressional scrutiny: the FAA Advanced Automation System and the IRS Modernization Program. The 1992 Defense Advanced Research Projects Agency (DARPA) goal to reduce software problems by a factor of ten by the year 2000 is not being met.
Problem
The organization that is to lead the industry in the production of trustworthy software systems is the organization that is capable of predicting and controlling the critical defects, faults, and failures in the software systems it builds. Bridging the gap of prediction practice among defects, faults, and failures remains an unsolved problem.
- Defects are detected early using software inspections as exit criteria for activities in the software life cycle. A defect is an instance where the software artifact does not meet the standard of excellence set as the exit criteria for the activity.
- Faults are detected later through exercise during integration and system test. A fault is an instance where the exercise of a software component yields an incorrect result.
- Failures are detected in operational test and operational deployment. A failure is a user visible instance where the operation of a software system does not meet expectation.
Yet there is no accepted method for using defect data available early to predict faults and failures that occur later.
Trustworthy software systems are the end game.
- Trustworthy software systems are dependable in operation from a user perspective.
- Trustworthy software systems are convincingly reliable from an engineering perspective.
- Engineers assess and predict the reliability of a system in terms of fault analysis, mean time to failure, availability, and mean time to repair.
- Managers report on the emerging quality of the software system in terms of completeness, correctness, conformance to requirements, compliance with standards, adherence to rules of construction for the application domain, and various viewpoints.
Specifically:
- There is insufficient defect, fault, and failure data available from the nation's factory floor.
- There is insufficient process, method, and tooling to combine defect data obtained through software inspections practice, software fault data obtained through software product test and use, and software failure data obtained through software system operation into predictions of trustworthy software system operation.
Approach
As software dependency increases in critical industries, defense systems, and government operations, it becomes essential to understand how to field trustworthy software systems. The practice of trustworthy software systems can be greatly improved with the confident prediction of faults and failures from defect data. The insights gained by analysis of early defect data assist in obtaining a deeper understanding of function, form, and fit and permit the adoption of those alternate architectures, design structures, and disciplined data constructs intended to increase reliability and availability.

The organization capable of producing trustworthy software systems understands defect detection metrics and calibrates defect leakage prediction and defect leakage type distributions. This organization approaches the capabilities specified in levels 4 and 5 of the Software Engineering Institute's Capability Maturity Model.
Critical components are pinpointed through a survivability assessment of the concept of operations, the software architecture, and the rules of construction for its components. The National Software Quality Experiment (NSQE), with its Software Inspection Lab and its Repository of core samples, uses defect detection to derive metrics capable of calibrating defect leakage prediction and defect leakage type distribution. The question then is: to what extent are Software Engineering Error Prediction Models capable of utilizing defect leakage prediction and defect leakage type distribution to predict faults and failures? The answer lies in the integration of models.

Survivability Assessment
A trustworthy software system must possess the attribute of survivability. The Software Engineering Institute, in its Survivable Network Analysis Method, defines survivability as the capability of a system to fulfill its mission in the presence of attacks, failures, or accidents. Focusing on the delivery of essential services and the preservation of essential assets, survivability includes security, fault tolerance, and reliability. An assessment of function, form, and fit, and of the features that ward off penetration, detect anomalous conditions, ensure continuous operation, and restore normal operation, reveals the critical software components [Linger 98].
NSQE and Repository
The National Software Quality Experiment (NSQE), underway since 1992, is a mechanism for obtaining core samples of software product quality. A micro-level national database of product quality is being populated by a continuous stream of samples from industry, government, and military services. The centerpiece of the experiment is the Software Inspection Lab, where data collection procedures, product checklists, and participant behaviors are packaged for operational project use. Thousands of participants from dozens of organizations are populating the experiment database with tens of thousands of defects of all types. Defect detection rates and defect type distributions suggest defect leakage rates and enable the prediction of faults and failures [O'Neill 98.2].
Software Engineering Error Prediction
Software Engineering Error Prediction Models use defect data from the software development process and software inspections to estimate latent defects. Predictions are based on fitting defect discovery data to software life cycle activities and provide input to reliability and availability models [Gaffney 97].
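As a rough illustration of what fitting defect discovery data to life cycle activities can look like, the sketch below fits a phase-based Rayleigh curve, a form often used for this kind of estimate. The choice of curve, the function names, the peak_phase parameter, and the defect counts are all assumptions made for illustration; they are not taken from the models cited here.

```python
import math

def rayleigh_fraction(phase, peak_phase):
    """Fraction of all defects expected to be discovered by the end of `phase`."""
    b = 1.0 / (2.0 * peak_phase ** 2)
    return 1.0 - math.exp(-b * phase ** 2)

def estimate_latent_defects(found_per_phase, peak_phase):
    """Fit the total defect count by least squares, then estimate what remains.

    found_per_phase[i] is the number of defects found in phase i + 1
    (hypothetical data). peak_phase is an assumed location of the
    discovery peak, measured in phases.
    """
    # Expected share of all defects discovered within each individual phase.
    shares = [
        rayleigh_fraction(i + 1, peak_phase) - rayleigh_fraction(i, peak_phase)
        for i in range(len(found_per_phase))
    ]
    # The model is linear in the total, so least squares has a closed form.
    total = (sum(d * s for d, s in zip(found_per_phase, shares))
             / sum(s * s for s in shares))
    # Latent defects are the share the curve places after the last observed phase.
    latent = total * (1.0 - rayleigh_fraction(len(found_per_phase), peak_phase))
    return total, latent

# Hypothetical counts for requirements, design, code, and test.
total, latent = estimate_latent_defects([20, 45, 40, 25], peak_phase=2.5)
print(f"estimated total defects: {total:.1f}, estimated latent: {latent:.1f}")
```

The latent estimate is the kind of figure that could then feed a reliability or availability model; a real calibration would fit the peak location as well rather than assume it.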
Goal, Question, Metric
The goal, question, metric (GQM) template introduced by Dr. Vic Basili can be used to focus the approach:
- The goal is to utilize defect data from the NSQE and the Software Inspection Lab to predict critical faults and failures and to calibrate Software Engineering Error Prediction Models.
- Several questions are asked. What are the critical software components? What is the defect type distribution of faults and failures? What is the defect leakage from design and code into test operations and from test to field operations?
- The metrics generated by the NSQE include both Software Inspection Lab Operations control panels and defect type distributions.
The NSQE Repository contains thousands of core samples from which control panels of upper and lower limits are derived. The control panel metrics are based on personnel effort, software component size, and defects detected and include:
- Minutes of preparation effort per major defect
- Minutes of preparation effort per defect
- Major defects per thousand lines of code
- Minor defects per thousand lines of code
- New development lines per conduct hour
- Legacy lines per conduct hour
- Defects per session
- Preparation effort / conduct effort
- Return on investment
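Most of the control panel metrics listed above are simple ratios of effort, size, and defect counts. The following sketch shows one way they might be computed from inspection session records. The record layout, field names, and sample values are hypothetical; return on investment is omitted because its formula is not given here, and new development and legacy lines are not separated.

```python
from dataclasses import dataclass

@dataclass
class InspectionSession:
    """One inspection session (hypothetical record layout)."""
    preparation_minutes: float
    conduct_minutes: float
    lines_inspected: int
    major_defects: int
    minor_defects: int

def control_panel(sessions):
    """Aggregate session records into control panel ratios.

    Assumes at least one session and nonzero totals for effort,
    lines, and major defects.
    """
    prep = sum(s.preparation_minutes for s in sessions)
    conduct = sum(s.conduct_minutes for s in sessions)
    lines = sum(s.lines_inspected for s in sessions)
    major = sum(s.major_defects for s in sessions)
    minor = sum(s.minor_defects for s in sessions)
    defects = major + minor
    return {
        "prep_minutes_per_defect": prep / defects,
        "prep_minutes_per_major_defect": prep / major,
        "major_defects_per_kloc": 1000 * major / lines,
        "minor_defects_per_kloc": 1000 * minor / lines,
        "lines_per_conduct_hour": lines / (conduct / 60),
        "defects_per_session": defects / len(sessions),
        "prep_to_conduct_ratio": prep / conduct,
    }

# Two hypothetical sessions.
sessions = [
    InspectionSession(120, 90, 250, 2, 6),
    InspectionSession(180, 90, 350, 4, 8),
]
panel = control_panel(sessions)
for name, value in panel.items():
    print(f"{name}: {value:.2f}")
```

Upper and lower control limits would then be derived by computing these ratios across many core samples.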

Software inspections practice detects 60-90% of defects inserted. Occasionally this drops below 50%; infrequently it exceeds 95%. A defect detection range of 60-90% corresponds to a defect leakage range of 40-10%. If the nominal defect detection is 75%, the nominal defect leakage is 25%. These control panel metrics reveal the effectiveness of the software inspections practice and the expectation for detection and leakage.
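The relationship between detection and leakage is simple arithmetic: leakage is the complement of detection. The short sketch below makes that explicit and also shows how a count of detected defects could be turned into an estimate of leaked defects. The function names are hypothetical, and the second function assumes the detection rate is known and applies uniformly.

```python
def leakage_rate(detection_rate):
    """Leakage is the share of inserted defects that inspections miss."""
    if not 0.0 <= detection_rate <= 1.0:
        raise ValueError("detection_rate must be between 0 and 1")
    return 1.0 - detection_rate

def estimated_leaked_defects(defects_detected, detection_rate):
    """Estimate defects that escaped, given the count found and an assumed rate."""
    if not 0.0 < detection_rate <= 1.0:
        raise ValueError("detection_rate must be greater than 0 and at most 1")
    defects_inserted = defects_detected / detection_rate
    return defects_inserted - defects_detected

# The nominal case from the text, plus the two ends of the usual range.
for rate in (0.60, 0.75, 0.90):
    print(f"detection {rate:.0%} -> leakage {leakage_rate(rate):.0%}")

# Hypothetical count: 75 defects found at a 75% detection rate.
print(estimated_leaked_defects(75, 0.75))
```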
In addition to control panels, the NSQE Repository contains defect type distributions containing the frequency of occurrence of each defect type. These defect types can be grouped:
- Requirements: documentation
- External: interface, human factors, I/O
- Internal: functionality, logic, data, performance
- Software practice: syntax, standards, maintainability, other
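The grouping above can be represented as a small lookup table, from which a grouped defect type distribution follows directly. The sketch below is illustrative only: the group and type names come from the list above, while the function name and the sample defect log are hypothetical.

```python
from collections import Counter

# Defect types by group, as listed in the text.
DEFECT_GROUPS = {
    "Requirements": ["documentation"],
    "External": ["interface", "human factors", "I/O"],
    "Internal": ["functionality", "logic", "data", "performance"],
    "Software practice": ["syntax", "standards", "maintainability", "other"],
}

# Reverse lookup from defect type to its group.
TYPE_TO_GROUP = {
    defect_type: group
    for group, types in DEFECT_GROUPS.items()
    for defect_type in types
}

def group_distribution(defect_types):
    """Return the share of recorded defects that falls in each group."""
    counts = Counter(TYPE_TO_GROUP[t] for t in defect_types)
    total = sum(counts.values())
    return {group: counts[group] / total for group in DEFECT_GROUPS}

# Hypothetical defect log from one core sample.
sample = ["documentation"] * 4 + ["logic"] * 3 + ["interface"] * 2 + ["standards"]
distribution = group_distribution(sample)
for group, share in distribution.items():
    print(f"{group}: {share:.0%}")
```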

Study Method
The Critical Defect, Fault, Failure Prediction project builds on existing research and conducts original research on these problems using the following methodology:
- Identify high value target sites.
- Utilize the SEI Survivability Assessment mechanism to select at risk software components from among the most critical embedded value points at each site.
- Conduct the National Software Quality Experiment (NSQE) at twenty-five such sites and obtain no fewer than forty core samples of software product quality from each site.
- Calibrate core samples with appropriate core samples in the NSQE Repository in terms of software inspection lab operations and defect type distribution.
- Generate defect, fault, and failure data projections using available industrial-strength software engineering error prediction models such as SWEEP, STEER, SMERFS, and CASRE.
- Utilize the NSQE defect type distributions from core samples to characterize the mapping to predictive fault distributions.
- Utilize actual fault distributions to calibrate the software engineering error prediction models.
- Utilize the NSQE defect type distributions from core samples to characterize the mapping to actual fault distributions.
- Assess and reconcile the mapping characterizations from the predictive and actual fault distributions.
- Utilize the calibrated software engineering error prediction models to generate predictive failure data distributions.
- Utilize the NSQE defect type distributions from the 1000 core samples to characterize the mapping to predictive failure distributions.
- Utilize actual failure distributions to calibrate the software engineering error prediction models.
- Utilize the NSQE defect type distributions from core samples to characterize the mapping to actual failure distributions.
- Assess and reconcile the mapping characterizations from the predictive and actual failure distributions.
- Package the defect data, fault data, failure data, mapping characterizations, and models for continuing use in the NIST defects, faults, and failures studies.
- Disseminate the methods, findings, and insights to industry, government, and academia through conference presentations, public seminars, and private on-site workshops.
- Design and produce an interactive web site to support critical defect, fault, and failure prediction.
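The "assess and reconcile" steps in the methodology above leave the comparison method open. One simple possibility, offered purely as an assumption, is to compare a predicted and an actual distribution over the same defect type groups using the total variation distance and a per-group gap. The function names and the sample distributions below are hypothetical.

```python
def total_variation_distance(predicted, actual):
    """Half the sum of absolute differences between two distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    keys = set(predicted) | set(actual)
    return 0.5 * sum(
        abs(predicted.get(k, 0.0) - actual.get(k, 0.0)) for k in keys
    )

def per_group_gap(predicted, actual):
    """Signed difference (actual minus predicted) for each group."""
    keys = sorted(set(predicted) | set(actual))
    return {k: actual.get(k, 0.0) - predicted.get(k, 0.0) for k in keys}

# Hypothetical predicted and actual fault distributions by defect type group.
predicted = {"Requirements": 0.4, "External": 0.2,
             "Internal": 0.3, "Software practice": 0.1}
actual = {"Requirements": 0.3, "External": 0.2,
          "Internal": 0.4, "Software practice": 0.1}

print(f"total variation distance: {total_variation_distance(predicted, actual):.2f}")
for group, gap in per_group_gap(predicted, actual).items():
    print(f"{group}: {gap:+.2f}")
```

A small distance suggests the predictive mapping agrees with what was observed; large per-group gaps point to where a model needs recalibration.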

Diffusion Strategy
Once the enterprise has distinguished itself from the competition and obtained a strategic advantage, this industry position can be locked in by opening up the three technologies and their processes, methods, and tools to the industry at large as standard practice for trustworthy software systems. Those enterprises unable to master this practice for trustworthy software systems will suffer a disadvantage in the marketplace.
Disseminating the knowledge, skills, and behaviors for predicting defects, faults, and failures among the industry practitioners on the factory floor is accomplished by push and pull:
- Pull is generated by several initiatives, including SEI CMM, 6-sigma, and ISO 9001, and by the growing number of lawsuits stemming from software failure.
- Push is generated through distribution of an error prediction kit composed of database structures, a data repository, spreadsheet templates, and a user handbook; through licensing the use of certain data; and through reporting to user groups and industry conferences.
The diffusion strategy for the critical defect prediction project follows the new rules for the new economy [Kelly 98]. The role of leadership is assumed by those who attract attention, promote community, set standards, and share intelligence. Once a community of interest has been assembled, the diffusion strategy can best be operated from an open web site on the internet.
- A critical defect prediction kit of the processes, methods, and tools for the three technologies is disseminated and supported interactively.
- A continuous flow of core samples of defect data can be collected from users.
- As the NSQE Repository grows in size linearly, one user at a time, its value increases exponentially as all users want up-to-date NSQE results for their analysis bins.
- As users and usage swell, so does control of the standard for trusted software systems and of the processes, methods, and tools that enable its practice.
- It is the combination of factors that form analysis bins with non-deterministic results that fuels interactive web usage.
In addition to the web site and prediction kit, the project sponsors and supports public seminars, private seminars, user group conferences, and industry conferences to service and support various audiences. These sessions are aimed at gathering up new innovations from users, obtaining additional NSQE core samples, educating new users, and exploring new modes of collaboration.
High Risk/High Payoff
A threat to the nation's information technology is the industry's inability to produce trustworthy software systems.
- Software operates at 3-sigma while other disciplines seek and achieve 6-sigma.
- Less than 1% of the organizations assessed have achieved SEI CMM level 5.
- Software project failures exceed successes.
Yet software as a competence is a high payoff enabling technology for industries of all kinds and is becoming the basis for global competitiveness.
Innovation
The prediction of critical defects, faults, and failures demands the integration of three models:
- Survivability assessment governs the selection of critical components.
- The NSQE supplies the repository of industry defect data and the consistent mechanism to obtain core samples from selected critical components.
- The Software Engineering Error Prediction models project faults and failures.
Each of these is a promising and innovative state-of-the-art technology. In combination, these three technologies and their integration promise greatly improved results. Once they are producing results, they can be disseminated through an innovative web-based diffusion strategy.
The integration of the three technologies can provide the adopting enterprise with an advantage as a trustworthy software systems provider, an advantage that can be enlarged upon and locked in.
Bibliography
[Gaffney 97] Gaffney, John E., "Software Defect Estimation, Prediction, and the CMM", Metrics 97 Conference, 1997
[Humphrey 89] Humphrey, Watts S., "Managing the Software Process", Addison-Wesley Publishing Company, Inc., 1989
[Kelly 98] Kelly, Kevin, "New Rules for the New Economy", Penguin Group, 1998
[Linger 98] Linger, Richard C., "Survivability Assessment", SEI Symposium, Pittsburgh, 1998
[O'Neill 92] O'Neill, Don, "Software Inspections: More Than a Hunt for Errors", CrossTalk, Issue 30, January 1992
[O'Neill 94] O'Neill, Don, "National Software Quality Experiment", International Conference on Software Quality, Washington DC, 1994
[O'Neill 95,96] O'Neill, Don, "National Software Quality Experiment: Results 1992-1995", Software Technology Conference, Salt Lake City, 1995 and 1996
[O'Neill 97.1] O'Neill, Don, "Issues in Software Inspection", IEEE Software, Vol. 14 No. 1, January 1997
[O'Neill 97.2] O'Neill, Don, "Setting Up a Software Inspection Program", CrossTalk, The Journal of Defense Software Engineering, Vol. 10 No. 2, February 1997
[O'Neill 97.3] O'Neill, Don, "National Software Quality Experiment: A Lesson in Measurement 1992-1996", Quality Week Conference, San Francisco, May 1997 and Quality Week Europe Conference, Brussels, November 1997
[O'Neill 98.1] O'Neill, Don, "National Software Quality Experiment: A Lesson in Measurement 1992-1997", CrossTalk, The Journal of Defense Software Engineering, Vol. 11 No. 12, Web Addition, December 1998
[O'Neill 98.2] O'Neill, Don, "National Software Quality Experiment: A Lesson in Measurement 1992-1997", NASA Goddard Software Engineering Workshop, 2-3 December 1998
[Paulk 95] Paulk, Mark C., "The Capability Maturity Model: Guidelines for Improving the Software Process", Addison-Wesley Publishing Company, 1995
[SEI 97] "Practical Software Measurement: Measuring for Process Management and Improvement", Software Engineering Institute, CMU/SEI-97-H8-003, 1997
[Tichy 98] Tichy, Walter F., "Should Computer Scientists Experiment More?", Computer, 31(5), 32-40, May 1998
[Voas 98] Voas, Jeffrey M. and Gary McGraw, "Software Fault Injection", John Wiley & Sons, Inc., New York, 1998
[Zelkowitz 98] Zelkowitz, Marvin V. and Dolores Wallace, "Experimental Models for Validating Technology", Computer, 31(5), 23-31, May 1998