
Lecture 2: “Are mutants a valid substitute for real faults in software testing?”

Agenda:

  1. Review: Mutation Testing
  2. Discussion: Just et al., FSE 2014

Code example:

public int min(int a, int b){
	if (a < b)
		return a;
	return b;
}
public void testMin(){
	// What values of a, b?
	int ret = min(1, 2);
	assertEquals(1, ret);
}

Mutant: (Mutate relational operators)

public int min(int a, int b){
	if (a > b)
		return a;
	return b;
}

DETECTED by input A=1, B=2

Mutant: (Delete statements)

public int min(int a, int b){
	return b;
}

NOT DETECTED by input A=2, B=1; DETECTED by input A=1, B=2

Mutant: (Mutate relational operators)

public int min(int a, int b){
	if (a <= b)
		return a;
	return b;
}

NOT DETECTED by any input (an equivalent mutant: when a == b it returns the same value either way, and on every other input it behaves exactly like the original)
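
Mutation score (worked example): a common way to summarize the examples above is the fraction of generated mutants that the test suite detects ("kills"). Using only the single test testMin (A=1, B=2) against the three mutants above:

mutation score = killed mutants / generated mutants = 2 / 3 ≈ 67%

The third mutant is equivalent to the original program, so no test can ever kill it; a tool that excludes equivalent mutants would report 2 / 2 = 100% here.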

Mutation Testing

  • How should I build a test suite for min? What values to use? (See the JUnit sketch after this list.)
    • A = 1, B = 2, output = 1
    • A = 5, B = 5, output = 5
    • A = 6, B = 6, output = 6
    • A = 5, B = 10, output = 5
    • A = Integer.MAX_VALUE, B = Integer.MIN_VALUE, output = Integer.MIN_VALUE
  • How do we evaluate whether this is an effective test suite?
    • Make some argument based on considering boundary conditions + all paths (which is possible in this case)
    • Or, look at some kind of coverage metric
    • Gold standard: “My test suite found all of the bugs!”
    • Mutation testing -> use mutants as a stand-in for real bugs: try to detect the mutants instead of finding the (unknown) real bugs
  • What do we use mutation testing for?
    • As a developer, I can use this to detect missing test cases (missing sets of inputs AND missing or weak assertions)
    • Useful in software engineering research, too:
      • For evaluating new test generation tools - is one tool/test suite better than another?
      • Test suite selection, minimization, reduction - Measure the relative value of each test (or assertion) in detecting mutants
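
A minimal JUnit 4 sketch of the suite discussed above. The class name MinTest is an illustrative choice, and min is assumed to be available as a static helper in the same class:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class MinTest {
	// One test per input pair from the list above
	@Test public void smallerFirst() { assertEquals(1, min(1, 2)); }
	@Test public void equalValues()  { assertEquals(5, min(5, 5)); }
	@Test public void equalValues2() { assertEquals(6, min(6, 6)); }
	@Test public void largerSecond() { assertEquals(5, min(5, 10)); }
	@Test public void extremes()     { assertEquals(Integer.MIN_VALUE, min(Integer.MAX_VALUE, Integer.MIN_VALUE)); }

	// The method under test, copied from the example above
	static int min(int a, int b) {
		if (a < b)
			return a;
		return b;
	}
}

Against the three mutants shown earlier, this suite kills the first two (e.g., via min(1, 2)) and, as noted, cannot kill the a <= b mutant.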

“Are Mutants a Valid Substitute for Real Faults in Software Testing?”

  • What is the motivation for this paper?
    • Mutation testing was already being used by researchers and developers - but, does it actually work?
  • What is hard about this problem?
    • Gathering the dataset of real faults is “challenging”
    • Controlling for code coverage and biases in selecting the dataset
    • Requires non-trivial computing resources to execute the experiment
  • What is the proposed solution? How do we answer this key motivating question about mutation testing?
    • What is a “real fault” in this context?
      • A real fault is a fault that has an issue number, is fixed in a single commit, and survives the following process:
      • Create the dataset (“Defects4J”)
        • Select 5 open source programs (“Have issue trackers”, “Have comprehensive test suites”)
        • Find a single commit that fixes a failure in the program code:
          • Compare source code versions: V_bug, V_fix (V_fix has issue number in commit message, V_bug was the prior version).
          • Review each fault to isolate it (remove unrelated changes that were added in that commit, creating a minimized “commit” that contains just the bug fix)
          • Discard faults that couldn’t be reproduced on the prior revision, OR that couldn’t be isolated
          • Discard commits where the only change was to tests (maybe the problem was in the test, not an actual bug)
        • Where do developer tests come from (2.3)?
          • Create T_pass and T_fail: two test suites that differ only by the triggering test - the test that reveals the defect. There might be multiple pairs of (T_pass, T_fail), where each pair differs by just a single test
        • Where do the generated tests come from? (What is a test?)
          • Run test generation tools (EvoSuite, Randoop, JCrasher). Create as many test suites as possible within some time bound (?) to get some that pass on V_bug and some that fail on V_bug
            • Q: is EvoSuite using mutation testing to generate and evaluate its test suite? If so, is that problematic?
    • What is a mutant?
      • Generated by “Major” mutation testing tool
      • Consider four mutation operators (see the sketch at the end of these notes):
        • Replace constants
        • Replace operators
        • Modify branch conditions
        • Delete statements
      • Only apply mutations in classes that were modified by the bug fix
        • Only consider mutants that might be coupled to a real fault. “The differences in mutation score would be washed out otherwise” - they only look at one bug at a time, comparing two suites on that bug (and its mutants)
          • Example: a real fault in a REST API endpoint, which uses some HTTP processing class. If you mutate the HTTP processing class, it will also break the API endpoint
        • Is “230,000” a small number of mutants? That’s 644 mutants per fault on average
          • The fault selection criteria might be biased to select straightforward faults, and hence, we might have fewer mutants than expected because the faults are isolated to small files
    • Run the experiments
      • How do you control for code coverage?
        • Consider only the intersection of mutants that are covered by all test suites.
      • RQ: Are real faults coupled to mutants generated by commonly used mutation operators?
        • If mutants are a valid substitute, then any test suite that has a higher fault detection rate should also have a higher mutation score - each real fault should be “coupled” to some mutant
      • RQ: What types of faults are not coupled to mutants?
        • Manual investigation of results from previous RQ
      • RQ: is mutant detection correlated with real fault detection?
        • Use the automated test suites for the 194 faults that they could generate tests for; compare mutation scores of the suites that detect the fault to those that don’t
    • What is our analysis of this methodology?
      • Selection criteria for faults might bias us to select and study only “simple” faults
        • Those that could be isolated
        • Had to be fixed in a single commit
        • They had to be buggy in the prior commit
        • A test had to be added when they were fixed
        • That test had to run on the prior version of the code (modulo simple changes, e.g., adding libraries to the class path)
        • Throw out non-deterministic tests
      • Selection criteria for faults only includes those that go into an issue tracker, and the fixing commit is linked to the issue
        • Maybe this is correlated strongly with developer experience - some developers will do this, others won’t. Maybe this is a problem, maybe not
      • Are generated test suites a suitable stand-in for human-written test suites?
    • What are the results?
      • Are real faults coupled to mutants generated by commonly used operators? (RQ1)
        • 73% of real faults were coupled to the mutants
        • Replace constant not coupled, others were
      • RQ2 - What types of real faults are not coupled to mutation?
      • RQ3 - Is mutant detection correlated with real fault detection?
    • Reactions / follow-up questions:
      • How syntactically similar are the mutants that are coupled to the faults?
      • Why do they use statement coverage and not branch coverage?
      • Maybe worthwhile to consider designing and implementing new mutation operators
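
As referenced in the operator list above, here is a rough Java sketch of the kind of change each of the four operators makes. The method fee and these specific mutants are illustrative assumptions, not examples taken from the paper or from Major’s actual output:

// Original method
public int fee(int balance) {
	if (balance < 100)
		return 0;
	return balance / 10;
}

// Replace constants: 100 -> 0
public int fee_replaceConstant(int balance) {
	if (balance < 0)
		return 0;
	return balance / 10;
}

// Replace operators: '/' -> '*'
public int fee_replaceOperator(int balance) {
	if (balance < 100)
		return 0;
	return balance * 10;
}

// Modify branch conditions: negate the condition
public int fee_modifyBranch(int balance) {
	if (!(balance < 100))
		return 0;
	return balance / 10;
}

// Delete statements: remove the if statement and its early return
public int fee_deleteStatement(int balance) {
	return balance / 10;
}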

© 2021 Jonathan Bell. Released under the CC BY-SA license