Lecture 2: “Are mutants a valid substitute for real faults in software testing?”
Agenda:
- Review: Mutation Testing
- Discussion: Just et al., FSE 2014
Code example:
public int min(int a, int b) {
    if (a < b)
        return a;
    return b;
}

public void testMin() {
    // What values of a, b?
    int ret = min(1, 2);
    assertEquals(1, ret);
}
Mutant: (Mutate relational operators)
public int min(int a, int b) {
    if (a > b)
        return a;
    return b;
}
DETECTED by input A=1, B=2
Mutant: (Delete statements)
public int min(int a, int b) {
    return b;
}
NOT DETECTED by input A=2, B=1; DETECTED by input A=1, B=2
Mutant: (Mutate relational operators)
public int min(int a, int b) {
    if (a <= b)
        return a;
    return b;
}
NOT DETECTED by any input (an equivalent mutant: a < b and a <= b differ only when a == b, and then returning a or b gives the same result)
Mutation Testing
- How should I build a test suite for min? What values to use? (see the JUnit sketch after this list)
- A = 1, B = 2, output = 1
- A = 5, B = 5, output = 5
- A = 6, B = 6, output = 6
- A = 5, B = 10, output = 5
- A = Integer.MAX_VALUE, B = Integer.MIN_VALUE, output = Integer.MIN_VALUE
- How do we evaluate whether this is an effective test suite?
- Make some argument based on considering boundary conditions + all paths (which is possible in this case)
- Or, look at some kind of coverage metric
- Gold standard: “My test suite found all of the bugs!”
- Mutation testing -> try to create a stand-in for having real bugs: find the mutants instead of finding the bugs
- What do we use mutation testing for?
- As a developer, I can use this to detect missing test cases (missing sets of inputs AND missing or weak assertions)
- Useful in software engineering research, too:
- For evaluating new test generation tools - is one tool/test suite better than another?
- Test suite selection, minimization, reduction - Measure the relative value of each test (or assertion) in detecting mutants
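The input list above, written out as a concrete suite - a minimal sketch assuming JUnit 4, with min inlined so the example is self-contained:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class MinTest {
    // Method under test, copied from the code example above.
    static int min(int a, int b) {
        if (a < b)
            return a;
        return b;
    }

    @Test public void smallerFirst()  { assertEquals(1, min(1, 2)); }
    @Test public void equalInputs()   { assertEquals(5, min(5, 5)); }
    @Test public void smallerSecond() { assertEquals(5, min(5, 10)); }
    @Test public void extremeValues() {
        assertEquals(Integer.MIN_VALUE, min(Integer.MAX_VALUE, Integer.MIN_VALUE));
    }
}

Running a mutation tool over this suite would report the a <= b mutant above as surviving - no input can kill it, since it is equivalent.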
“Are Mutants a Valid Substitute for Real Faults in Software Testing?”
- What is the motivation for this paper?
- Mutation testing was already being used by researchers and developers - but, does it actually work?
- What is hard about this problem?
- Gathering the dataset of real faults is “challenging”
- Controlling for code coverage and biases in selecting the dataset
- Requires non-trivial computing resources to execute the experiment
- What is the proposed solution? How do we answer this key motivating question about mutation testing?
- What is a “real fault” in this context?
- A real fault is a fault that has an issue number, is fixed in a single commit, and survives the following process:
- Create the dataset (“Defects4J”)
- Select 5 open source programs (“Have issue trackers”, “Have comprehensive test suites”)
- Find a single commit that fixes a failure in the program code:
- Compare source code versions V_bug and V_fix (V_fix has the issue number in its commit message; V_bug is the prior version)
- Review each fault to isolate it (remove unrelated changes added in that commit, creating a minimized “commit” that contains just the bug fix)
- Discard faults that couldn’t be reproduced on the prior revision, OR that couldn’t be isolated
- Discard commits where the only change was to tests (maybe the problem was in the test, not an actual bug)
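Those filters, sketched schematically as code - all names here (Commit, reproducesOnPriorRevision, and so on) are hypothetical stand-ins, not Defects4J's actual API:

import java.util.List;
import java.util.stream.Collectors;

interface Commit {
    boolean hasLinkedIssueNumber();      // fixing commit references an issue
    boolean changesOnlyTests();          // must touch program code, not just tests
    boolean reproducesOnPriorRevision(); // fault must be observable in V_bug
    boolean canBeIsolated();             // unrelated changes can be stripped out
}

class FaultSelection {
    // Keep only commits that survive every filter described above.
    static List<Commit> selectRealFaults(List<Commit> candidates) {
        return candidates.stream()
                .filter(Commit::hasLinkedIssueNumber)
                .filter(c -> !c.changesOnlyTests())
                .filter(Commit::reproducesOnPriorRevision)
                .filter(Commit::canBeIsolated)
                .collect(Collectors.toList());
    }
}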
- Where do developer tests come from (2.3)?
- Create T_pass, T_fail: two test suites that differ only by the triggering test - the test that reveals the defect. There might be multiple (T_pass, T_fail) pairs, where each pair differs by just a single test
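A minimal sketch of building those pairs (invented names; tests are just string IDs here, and Java 16+ records are assumed):

import java.util.*;

class SuitePairs {
    record Pair(Set<String> tPass, Set<String> tFail) {}

    // For each triggering test, build a (T_pass, T_fail) pair that
    // differs by exactly that one test.
    static List<Pair> buildPairs(Set<String> fullSuite, Set<String> triggeringTests) {
        Set<String> tPass = new HashSet<>(fullSuite);
        tPass.removeAll(triggeringTests);   // T_pass passes on V_bug
        List<Pair> pairs = new ArrayList<>();
        for (String trigger : triggeringTests) {
            Set<String> tFail = new HashSet<>(tPass);
            tFail.add(trigger);             // T_fail fails on V_bug
            pairs.add(new Pair(tPass, tFail));
        }
        return pairs;
    }
}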
- Where do the generated tests come from? (What is a test?)
- Run test generation tools (EvoSuite, Randoop, JCrasher). Create as many test suites as possible within some time bound (?) to get some that pass on V_bug and some that fail on V_bug
- Q: is EvoSuite using mutation testing to generate and evaluate its test suite? If so, is that problematic?
- What is a mutant?
- Generated by the “Major” mutation testing tool
- Consider four mutation operators (see the sketch at the end of this list):
- Replace constants
- Replace operators
- Modify branch conditions
- Delete statements
- Only apply mutations in classes that were modified by the bug fix
- Only consider mutants that might be coupled to the real fault (“the differences in mutation score would be washed out otherwise”); we look at one bug at a time and compare two suites on that bug (and its mutants)
- Example: a real fault in a REST API endpoint that uses some HTTP processing class. If you mutate the HTTP processing class, that can also break the API endpoint
- Is “230,000” a small number of mutants? That works out to 644 mutants per fault on average
- The fault selection criteria might be biased to select straightforward faults, and hence, we might have fewer mutants than expected because the faults are isolated to small files
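To make the four operator classes concrete, here are hand-written examples in the style of the mutants a tool like Major generates (illustrative only, not Major's actual output):

public class OperatorExamples {
    // Original method under mutation.
    static int scale(int x) {
        if (x > 10)
            return x * 2;
        return 0;
    }

    // Replace constants:        "x > 11"  or  "return x * 3;"
    // Replace operators:        "return x / 2;"  (arithmetic * becomes /)
    // Modify branch conditions: "x >= 10"  or  "if (false)"
    // Delete statements:        remove "return x * 2;" so the branch falls
    //                           through to "return 0;"
}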
- Run the experiments
- How do you control for code coverage?
- Consider only the intersection of mutants that are covered by all test suites
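A minimal sketch of that control (hypothetical names; mutants are integer IDs):

import java.util.*;

class CoverageControlledScore {
    // killed: mutant IDs this suite kills; commonlyCovered: mutant IDs
    // executed by every suite under comparison (assumed non-empty).
    static double score(Set<Integer> killed, Set<Integer> commonlyCovered) {
        Set<Integer> killedAndCovered = new HashSet<>(killed);
        killedAndCovered.retainAll(commonlyCovered);
        return (double) killedAndCovered.size() / commonlyCovered.size();
    }
}

With this restriction, a suite can only score higher by killing more of the mutants every suite reaches, not by merely covering more code.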
- RQ: Are real faults coupled to mutants generated by commonly used mutation operators?
- If mutants are a valid substitute, then any test suite that has a higher fault detection rate should also have a higher mutation score - each real fault should be “coupled” to some mutant
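One way to operationalize that check, as the notes frame it (a sketch with invented names, not necessarily the paper's exact computation): the fault counts as coupled if T_fail kills at least one mutant that T_pass misses - since T_fail is T_pass plus the triggering test, that is exactly when T_fail's mutation score is higher.

import java.util.*;

class CouplingCheck {
    static boolean coupled(Set<Integer> killedByTPass, Set<Integer> killedByTFail) {
        // Mutants the triggering test's suite kills beyond the passing suite.
        Set<Integer> extra = new HashSet<>(killedByTFail);
        extra.removeAll(killedByTPass);
        return !extra.isEmpty();
    }
}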
- RQ: What types of faults are not coupled to mutants?
- Manual investigation of results from previous RQ
- RQ: is mutant detection correlated with real fault detection?
- Use the automated test suites for the 194 faults that they could generate tests for; compare mutation scores of the suites that detect the fault to those that don’t
- What is our analysis of this methodology?
- Selection criteria for faults might bias us to select and study only “simple” faults
- Those that could be isolated
- Had to be fixed in a single commit
- They had to be buggy in the prior commit
- A test had to be added when they were fixed
- That test had to run on the prior version of the code (modulo simple changes, e.g., adding libraries to the classpath)
- Throw out non-deterministic tests
- Selection criteria for faults only includes those that go into an issue tracker, and the fixing commit is linked to the issue
- Maybe this is correlated strongly with developer experience - some developers will do this, others won’t. Maybe this is a problem, maybe not
- Are generated test suites a suitable stand-in for human-written test suites?
- What are the results?
- Are real faults coupled to mutants generated by commonly used operators? (RQ1)
- 73% of faults were coupled to mutants
- The replace-constants operator was not coupled; the others were
- RQ2 - What types of real faults are not coupled to mutation?
- RQ3 - Is mutant detection correlated with real fault detection?
- Reactionary/follow up questions:
- How syntactically similar are the mutants that are coupled to the faults?
- Why do they use statement coverage and not branch coverage?
- Maybe worthwhile to consider designing and implementing new mutation operators