How Detrimental is Coincidental Correctness to Defect Detection?

Masri, Wes

doi:10.6084/m9.figshare.7077203.v2

ICST_2019_paper_64.pdf (1.03 MB)


Version 2 2018-10-19, 21:43

Version 1 2018-09-12, 12:34

journal contribution

posted on 2018-10-19, 21:43, authored by Wes Masri

According to the PIE and RIP models, three conditions must be satisfied for a program failure to occur: 1) the defect's location must execute or be reached; 2) the program's state must become infected; and 3) the infection must propagate to the output. Weak coincidental correctness (weak CC) occurs when the program produces the correct output even though condition 1) is satisfied, because conditions 2) and 3) are not. Strong coincidental correctness (strong CC) occurs when the output is correct even though conditions 1) and 2) are both satisfied, because condition 3) is not. In the literature, the term coincidental correctness (CC) typically refers to strong CC.
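The two definitions can be made concrete with a minimal Python sketch. Everything in it is hypothetical and not taken from the paper: the toy program `run`, its seeded defect (`x * x` where `x * 2` was intended), and the helper `classify`, which compares a faulty run against a reference run to check the three conditions.

```python
def run(x, faulty):
    """Toy program (hypothetical). Returns (reached, internal_state, output)."""
    reached = False
    y = 0
    if x >= 0:
        reached = True                   # condition 1: the defect's location executes
        y = x * x if faulty else x * 2   # seeded defect: x * x instead of x * 2
    return reached, y, y > 10            # the output only reveals whether y > 10


def classify(x):
    """Label one test input by comparing the faulty run with the reference run."""
    reached, y_bad, out_bad = run(x, faulty=True)
    _, y_ok, out_ok = run(x, faulty=False)
    if not reached:
        return "not reached"
    infected = y_bad != y_ok             # condition 2: internal state differs
    propagated = out_bad != out_ok       # condition 3: the difference reaches the output
    if propagated:
        return "failure"
    return "strong CC" if infected else "weak CC"


for x in (-1, 2, 3, 4):
    print(x, classify(x))
```

With these inputs, `x = 2` reaches the defect without infecting the state (weak CC), `x = 3` infects the state but the infection is masked by the `y > 10` comparison (strong CC), and `x = 4` produces a wrong output (failure).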

Researchers have recognized the presence of CC and analytically demonstrated that it is a safety-reducing factor for spectrum-based fault localization (SBFL). However, they did not validate this empirically; we do so in this paper. Specifically, using the Defects4J benchmark, we comparatively evaluated the performance of SBFL using 52 different suspiciousness metrics when: a) both weak and strong CC tests are present (T_wsCC); b) neither weak nor strong CC tests are present (T_noCC); c) only weak CC tests are present (T_wCC); and d) only strong CC tests are present (T_sCC). Similarly, using five multi-fault Java programs, we evaluated the performance of greedy Test Suite Reduction (TSR) in the presence and absence of CC. That is, we empirically studied the impact of CC on defect detection using two commonly used techniques.
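As a rough illustration of why CC tests can hurt SBFL, the sketch below computes the Ochiai suspiciousness score, a widely used metric, with and without CC tests. The abstract does not name the 52 metrics, so using Ochiai here is an assumption; the coverage data, statement names, and the function `ochiai` are invented for illustration.

```python
import math


def ochiai(coverage, failed, stmt):
    """Ochiai score of stmt: ef / sqrt(total_failed * (ef + ep))."""
    ef = sum(1 for cov, f in zip(coverage, failed) if f and stmt in cov)
    ep = sum(1 for cov, f in zip(coverage, failed) if not f and stmt in cov)
    denom = math.sqrt(sum(failed) * (ef + ep))
    return ef / denom if denom else 0.0


# Hypothetical suite: s2 is the faulty statement.
coverage = [
    {"s1", "s2"},  # t1: fails
    {"s1", "s2"},  # t2: passes although it executes s2 (a CC test)
    {"s1", "s2"},  # t3: passes although it executes s2 (a CC test)
    {"s1", "s3"},  # t4: passes, never executes s2
]
failed = [True, False, False, False]

with_cc = ochiai(coverage, failed, "s2")

# Same suite with the two CC tests (t2, t3) removed.
no_cc_cov = [coverage[0], coverage[3]]
no_cc_failed = [failed[0], failed[3]]
without_cc = ochiai(no_cc_cov, no_cc_failed, "s2")

print(round(with_cc, 3), round(without_cc, 3))
```

In this toy suite the passing tests that execute `s2` inflate its passed-and-executed count, so its score is lower with the CC tests present than after they are removed.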

For 49 of the 52 metrics, our results showed with statistical significance that SBFL performs better when using T_wCC, T_sCC, and T_noCC than when using T_wsCC. They also showed that T_noCC yields the best performance, followed by T_sCC and then T_wCC. However, the effect sizes were mostly trivial, except for three metrics. Compared to T_wsCC, our TSR results showed that T_noCC, T_sCC, and T_wCC resulted in I, G, and %6 more detected defects, respectively. Therefore, our empirical study suggests that CC is detrimental to defect detection, and that weak CC is more detrimental than strong CC.
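One plausible mechanism behind the TSR result can be sketched as follows. This is an assumption for illustration, not the paper's procedure: `greedy_reduce` is a generic statement-coverage greedy reduction, and the test names and coverage sets are invented. A CC test that covers the faulty statement can satisfy the coverage requirement and displace the failing test from the reduced suite.

```python
def greedy_reduce(coverage):
    """Greedy test suite reduction over statement coverage.

    coverage maps a test name to the set of statements it covers.
    Ties are broken alphabetically so the result is deterministic.
    """
    uncovered = set().union(*coverage.values())
    selected = []
    while uncovered:
        # Pick the test that covers the most still-uncovered statements.
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & uncovered))
        selected.append(best)
        uncovered -= coverage[best]
    return selected


# Hypothetical suite: s2 is faulty; only t_fail exposes the defect.
suite = {
    "t_cc": {"s1", "s2", "s3"},   # passes although it executes s2 (CC test)
    "t_fail": {"s1", "s2"},       # fails
    "t_other": {"s1", "s4"},      # passes, never executes s2
}
reduced_with_cc = greedy_reduce(suite)

# Same suite with the CC test removed.
suite_no_cc = {name: cov for name, cov in suite.items() if name != "t_cc"}
reduced_no_cc = greedy_reduce(suite_no_cc)

print(reduced_with_cc, reduced_no_cc)
```

Here the reduced suite built with the CC test present keeps full statement coverage but drops `t_fail`, so the defect goes undetected; without the CC test, `t_fail` is retained.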