Lecture 2: “Are mutants a valid substitute for real faults in software testing?”
Agenda:
- Review: Mutation Testing
- Discussion: Just et al., FSE 2014
Code example:
public int min(int a, int b) {
    if (a < b)
        return a;
    return b;
}

public void testMin() {
    // What values of a, b?
    int ret = min(1, 2);
    assertEquals(1, ret);
}
Mutant: (Mutate relational operators)
public int min(int a, int b) {
    if (a > b)
        return a;
    return b;
}
DETECTED by input A=1, B=2
Mutant: (Delete statements)
public int min(int a, int b) {
    return b;
}
NOT DETECTED by input A=2, B=1; DETECTED by input A=1, B=2
Mutant: (Mutate relational operators)
public int min(int a, int b) {
    if (a <= b)
        return a;
    return b;
}
NOT DETECTED by any input (an equivalent mutant: a < b and a <= b differ only when a == b, and then returning a or b gives the same result)
Mutation Testing
- How should I build a test suite for min? What values to use? (see the JUnit sketch after this list)
- A = 1, B = 2, output = 1
- A = 5, B = 5, output = 5
- A = 6, B = 6, output = 6
- A = 5, B = 10, output = 5
- A = Integer.MAX_VALUE, B = Integer.MIN_VALUE, output = Integer.MIN_VALUE
- How do we evaluate whether this is an effective test suite?
- Make some argument based on considering boundary conditions + all paths (which is possible in this case)
- Or, look at some kind of coverage metric
- Gold standard: “My test suite found all of the bugs!”
- Mutation testing -> try to create a stand-in for having real bugs: find the mutants instead of finding the bugs
- What do we use mutation testing for?
- As a developer, I can use this to detect missing test cases (missing sets of inputs AND missing or weak assertions)
- Useful in software engineering research, too:
- For evaluating new test generation tools - is one tool/test suite better than another?
- Test suite selection, minimization, reduction - Measure the relative value of each test (or assertion) in detecting mutants
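The input list above, written out as a concrete suite - a minimal sketch assuming JUnit 4, with min inlined so the example is self-contained:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class MinTest {
    // Method under test, copied from the code example above.
    static int min(int a, int b) {
        if (a < b)
            return a;
        return b;
    }

    @Test public void smallerFirst()  { assertEquals(1, min(1, 2)); }
    @Test public void equalInputs()   { assertEquals(5, min(5, 5)); }
    @Test public void smallerSecond() { assertEquals(5, min(5, 10)); }
    @Test public void extremeValues() {
        assertEquals(Integer.MIN_VALUE, min(Integer.MAX_VALUE, Integer.MIN_VALUE));
    }
}

Running a mutation tool over this suite would report the a <= b mutant above as surviving - no input can kill it, since it is equivalent.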
“Are Mutants a Valid Substitute for Real Faults in Software Testing?”
- What is the motivation for this paper?
- Mutation testing was already being used by researchers and developers - but, does it actually work?
- What is hard about this problem?
- Gathering the dataset of real faults is “challenging”
- Controlling for code coverage and biases in selecting the dataset
- Requires non-trivial computing resources to execute the experiment
- What is the proposed solution? How do we answer this key motivating question about mutation testing?
- What is a “real fault” in this context?
- A real fault is a fault that has an issue number, is fixed in a single commit, and survives the following process:
- Create the dataset (“Defects4J”)
- Select 5 open source programs (“Have issue trackers”, “Have comprehensive test suites”)
- Find a single commit that fixes a failure in the program code:
- Compare source code versions V_bug and V_fix (V_fix has the issue number in its commit message; V_bug is the prior version)
- Review each fault to isolate it (remove unrelated changes added in that commit, creating a minimized “commit” that contains just the bug fix)
- Discard faults that couldn’t be reproduced on the prior revision, OR that couldn’t be isolated
- Discard commits where the only change was to tests (maybe the problem was in the test, not an actual bug)
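Those filters, sketched schematically as code - all names here (Commit, reproducesOnPriorRevision, and so on) are hypothetical stand-ins, not Defects4J's actual API:

import java.util.List;
import java.util.stream.Collectors;

interface Commit {
    boolean hasLinkedIssueNumber();      // fixing commit references an issue
    boolean changesOnlyTests();          // must touch program code, not just tests
    boolean reproducesOnPriorRevision(); // fault must be observable in V_bug
    boolean canBeIsolated();             // unrelated changes can be stripped out
}

class FaultSelection {
    // Keep only commits that survive every filter described above.
    static List<Commit> selectRealFaults(List<Commit> candidates) {
        return candidates.stream()
                .filter(Commit::hasLinkedIssueNumber)
                .filter(c -> !c.changesOnlyTests())
                .filter(Commit::reproducesOnPriorRevision)
                .filter(Commit::canBeIsolated)
                .collect(Collectors.toList());
    }
}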
- Where do developer tests come from (2.3)?
- Create T_pass, T_fail: two test suites that differ only by the triggering test - the test that reveals the defect. There might be multiple (T_pass, T_fail) pairs, where each pair differs by just a single test
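A minimal sketch of building those pairs (invented names; tests are just string IDs here, and Java 16+ records are assumed):

import java.util.*;

class SuitePairs {
    record Pair(Set<String> tPass, Set<String> tFail) {}

    // For each triggering test, build a (T_pass, T_fail) pair that
    // differs by exactly that one test.
    static List<Pair> buildPairs(Set<String> fullSuite, Set<String> triggeringTests) {
        Set<String> tPass = new HashSet<>(fullSuite);
        tPass.removeAll(triggeringTests);   // T_pass passes on V_bug
        List<Pair> pairs = new ArrayList<>();
        for (String trigger : triggeringTests) {
            Set<String> tFail = new HashSet<>(tPass);
            tFail.add(trigger);             // T_fail fails on V_bug
            pairs.add(new Pair(tPass, tFail));
        }
        return pairs;
    }
}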
- Where do the generated tests come from? (What is a test?)
- Run test generation tools (EvoSuite, Randoop, JCrasher). Create as many test suites as possible within some time bound (?) to get some that pass on V_bug and some that fail on V_bug
- Q: is EvoSuite using mutation testing to generate and evaluate its test suite? If so, is that problematic?
- What is a mutant?
- Generated by the “Major” mutation testing tool
- Consider four mutation operators (see the sketch at the end of this list):
- Replace constants
- Replace operators
- Modify branch conditions
- Delete statements
- Only apply mutations in classes that were modified by the bug fix
- Only consider mutants that might be coupled to the real fault (“the differences in mutation score would be washed out otherwise”); we look at one bug at a time and compare two suites on that bug (and its mutants)
- Example: a real fault in a REST API endpoint that uses some HTTP processing class. If you mutate the HTTP processing class, that can also break the API endpoint
- Is “230,000” a small number of mutants? That works out to 644 mutants per fault on average
- The fault selection criteria might be biased to select straightforward faults, and hence, we might have fewer mutants than expected because the faults are isolated to small files
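To make the four operator classes concrete, here are hand-written examples in the style of the mutants a tool like Major generates (illustrative only, not Major's actual output):

public class OperatorExamples {
    // Original method under mutation.
    static int scale(int x) {
        if (x > 10)
            return x * 2;
        return 0;
    }

    // Replace constants:        "x > 11"  or  "return x * 3;"
    // Replace operators:        "return x / 2;"  (arithmetic * becomes /)
    // Modify branch conditions: "x >= 10"  or  "if (false)"
    // Delete statements:        remove "return x * 2;" so the branch falls
    //                           through to "return 0;"
}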
- Run the experiments
- How do you control for code coverage?
- Consider only the intersection of mutants that are covered by all test suites
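A minimal sketch of that control (hypothetical names; mutants are integer IDs):

import java.util.*;

class CoverageControlledScore {
    // killed: mutant IDs this suite kills; commonlyCovered: mutant IDs
    // executed by every suite under comparison (assumed non-empty).
    static double score(Set<Integer> killed, Set<Integer> commonlyCovered) {
        Set<Integer> killedAndCovered = new HashSet<>(killed);
        killedAndCovered.retainAll(commonlyCovered);
        return (double) killedAndCovered.size() / commonlyCovered.size();
    }
}

With this restriction, a suite can only score higher by killing more of the mutants every suite reaches, not by merely covering more code.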
- RQ: Are real faults coupled to mutants generated by commonly used mutation operators?
- If mutants are a valid substitute, then any test suite that has a higher fault detection rate should also have a higher mutation score - each real fault should be “coupled” to some mutant
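One way to operationalize that check, as the notes frame it (a sketch with invented names, not necessarily the paper's exact computation): the fault counts as coupled if T_fail kills at least one mutant that T_pass misses - since T_fail is T_pass plus the triggering test, that is exactly when T_fail's mutation score is higher.

import java.util.*;

class CouplingCheck {
    static boolean coupled(Set<Integer> killedByTPass, Set<Integer> killedByTFail) {
        // Mutants the triggering test's suite kills beyond the passing suite.
        Set<Integer> extra = new HashSet<>(killedByTFail);
        extra.removeAll(killedByTPass);
        return !extra.isEmpty();
    }
}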
- RQ: What types of faults are not coupled to mutants?
- Manual investigation of results from previous RQ
- RQ: is mutant detection correlated with real fault detection?
- Use the automated test suites for the 194 faults that they could generate tests for; compare mutation scores of the suites that detect the fault to those that don’t
- What is our analysis of this methodology?
- Selection criteria for faults might bias us to select and study only “simple” faults
- Those that could be isolated
- Had to be fixed in a single commit
- They had to be buggy in the prior commit
- A test had to be added when they were fixed
- That test had to run on the prior version of the code (modulo simple changes, e.g., adding libraries to the classpath)
- Throw out non-deterministic tests
- Selection criteria for faults only includes those that go into an issue tracker, and the fixing commit is linked to the issue
- Maybe this is correlated strongly with developer experience - some developers will do this, others won’t. Maybe this is a problem, maybe not
- Are generated test suites a suitable stand-in for human-written test suites?
- What are the results?
- Are real faults coupled to mutants generated by commonly used operators? (RQ1)
- 73% of faults were coupled to mutants
- The replace-constants operator was not coupled; the others were
- RQ2 - What types of real faults are not coupled to mutation?
- RQ3 - Is mutant detection correlated with real fault detection?
- Reactionary/follow up questions:
- How syntactically similar are the mutants that are coupled to the faults?
- Why do they use statement coverage and not branch coverage?
- Maybe worthwhile to consider designing and implementing new mutation operators