ICST_2019_paper_64.pdf (1.03 MB)

How Detrimental is Coincidental Correctness to Defect Detection?

Version 2 2018-10-19, 21:43
Version 1 2018-09-12, 12:34
journal contribution
posted on 2018-10-19, 21:43, authored by Wes Masri


According to the PIE and RIP models, three conditions must be satisfied for a program failure to occur: 1) the defect's location must be executed or reached; 2) the program's state must become infected; and 3) the infection must propagate to the output. Weak coincidental correctness (or weak CC) occurs when the program produces the correct output while condition 1) is satisfied but conditions 2) and 3) are not. Strong coincidental correctness (or strong CC) occurs when the output is correct while conditions 1) and 2) are both satisfied but condition 3) is not. In the literature, coincidental correctness (CC) typically refers to strong CC.
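The distinction between weak CC, strong CC, and failure can be illustrated with a minimal sketch. The toy function, its seeded defect, and all names below are hypothetical and are not taken from the paper; the defect is always reached, so only conditions 2) and 3) vary with the input.

```python
def run_correct(x):
    """Reference version: returns (internal state, output)."""
    y = x * 2
    return y, y > 0


def run_faulty(x):
    """Faulty version: the defect is always reached (condition 1 holds)."""
    y = x + 2  # hypothetical defect: should be x * 2
    return y, y > 0


def classify(x):
    """Classify one execution of the faulty version against the reference."""
    good_state, good_out = run_correct(x)
    bad_state, bad_out = run_faulty(x)
    infected = bad_state != good_state   # condition 2
    propagated = bad_out != good_out     # condition 3
    if propagated:
        return "failure"
    if infected:
        return "strong CC"
    return "weak CC"


for x in (2, 3, -1):
    print(x, classify(x))
```

With `x = 2` both versions compute the same intermediate value, so the state is never infected (weak CC). With `x = 3` the intermediate values differ but the comparison masks the difference (strong CC). With `x = -1` the infection reaches the output (failure).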


Researchers have recognized the presence of CC and analytically demonstrated that it is a safety-reducing factor for spectrum-based fault localization (SBFL). However, they did not validate that finding empirically, which we do in this paper. Specifically, using the Defects4J benchmark, we compared the performance of SBFL under 52 different suspiciousness metrics when: a) both weak and strong CC tests are present (TwsCC); b) neither weak nor strong CC tests are present (TnoCC); c) only weak CC tests are present (TwCC); and d) only strong CC tests are present (TsCC). Similarly, using five multi-fault Java programs, we evaluated the performance of greedy Test Suite Reduction (TSR) in the presence and absence of CC. That is, we empirically studied the impact of CC on defect detection using two commonly used techniques.
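To make the SBFL setting concrete, the following sketch scores a faulty statement with and without CC tests in the suite. It assumes the Ochiai metric purely for illustration (the abstract does not list the 52 metrics used), and the coverage data, test names, and statement names are invented.

```python
import math


def ochiai(coverage, failed, stmt):
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep))."""
    ef = sum(1 for t, cov in coverage.items() if stmt in cov and t in failed)
    ep = sum(1 for t, cov in coverage.items() if stmt in cov and t not in failed)
    total_failed = sum(1 for t in coverage if t in failed)
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0


def without_tests(coverage, excluded):
    """Return a copy of the coverage map with the given tests removed."""
    return {t: cov for t, cov in coverage.items() if t not in excluded}


# Hypothetical suite: s2 is the faulty statement.
coverage = {
    "t1": {"s1", "s2"},        # fails
    "t2": {"s1", "s2"},        # passes although it executes s2 (CC)
    "t3": {"s1", "s2", "s3"},  # passes although it executes s2 (CC)
    "t4": {"s1", "s3"},        # passes, never reaches s2
}
failed = {"t1"}
cc_tests = {"t2", "t3"}

score_with_cc = ochiai(coverage, failed, "s2")
score_no_cc = ochiai(without_tests(coverage, cc_tests), failed, "s2")
print(score_with_cc, score_no_cc)
```

In this toy suite the passing tests that execute the faulty statement inflate its pass count, so its score rises once they are removed. This mirrors the TwsCC versus TnoCC comparison only in spirit and says nothing about the magnitude of the effect on real programs.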


For 49 of the 52 metrics, our results showed with statistical significance that SBFL performs better when using TwCC, TsCC, and TnoCC than when using TwsCC. They also showed that TnoCC yields the best performance, followed by TsCC and then TwCC. However, the effect sizes were mostly trivial, except for three metrics. Compared to TwsCC, our TSR results showed that TnoCC, TsCC, and TwCC resulted in, respectively, I, G, and %6 more detected defects. Therefore, our empirical study suggests that CC is detrimental to defect detection, and that weak CC is more detrimental than strong CC.

