Discussion: Test Oracles and Adequacy
Agenda:
- Administrivia
- Paper Proposal Feedback
- Updated readings for next Tuesday
- Flaky test and TotT follow-up
- Are mutants a valid substitute for real faults?
- Dynamically Discovering Likely Program Invariants to Support Program Evolution
These notes roughly capture the (mostly unedited) comments from the class:
Are mutants a valid substitute for real faults?
- Why do we need to know the answer to this question?
- We want to use mutation to assess the fault-finding ability of our tests - can we?
- Should it be: “are some mutants a valid substitute”?
- What is hard about answering this question?
- How to gather the dataset?
- What is a “bug fix”? - Approach is to manually isolate the part of a commit that fixes a bug
- How to control for code coverage?
- There could be a strong bias between code coverage and mutation
- How to run this experiment? - lots of CPU resources required, probably produces lots of output
- How to get a broad enough dataset so that this might generalize?
- Overall experimental design?
- Collect a bunch of isolated bugs
- Collect tests that reveal those bugs
- Generate more tests
- Do mutation analysis on all of the test suites
- Does a test suite that detects more mutants detect more faults?
- What is their definition of a “real fault?”
- Start with 5 open source projects
- Have issue tracker, have good version control
- Look for bug-fix commits
- Manually inspect them
- Discard faults that were not reproducible on the buggy commit
- What was the procedure to determine reproducibility? It required that a test already existed
- Are mutants a valid substitute for faults that developers don’t write test cases for?
- … are mutants a valid substitute for faults in distributed systems caused by exception handling bugs (OSDI 2014)?
- Ensured that there was only a single change in the commit (they tried to do this manually)
- Discard commits where the only changes were to tests
- Where do developer tests come from?
- (T_pass, T_fail) pairs: for a specific commit, T_pass is the test suite that passes, and T_fail differs by exactly one test case, which reveals the fault (if there are multiple failing tests, they form multiple pairs)
- What is a mutant?
- What operators? (illustrated in the sketch below)
- Replace constants
- Replace operators
- Modify branch conditions
- Delete statements
- Where to apply mutations?
- Applied to the classes that had been changed by the bug fix
- Didn’t want to create a lot of “useless” mutants; it’s OK to do fewer to save time
- Assumption: a mutant in the same location as the bug will represent that bug
- Null hypothesis is that there is no correlation. Making more mutants could only IMPROVE the strength of the finding that there is a correlation
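As a concrete illustration of these operator classes (hypothetical code, not taken from the paper or its tooling), one mutant per class might look like:

```java
// Hypothetical example: one mutant per operator class, shown as comments.
public class Fees {
    public int fee(int balance, int txCount) {
        if (balance < 1000) {       // branch-condition mutant: balance <= 1000
            return 5 + txCount;     // operator mutant: 5 - txCount
        }
        audit(balance);             // statement-deletion mutant: remove this call
        return 2;                   // constant mutant: return 3
    }

    private void audit(int balance) { /* logging elided */ }
}
```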
- Run the experiments
- How to control for code coverage?
- Only consider mutants that are covered by all of the tests
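A minimal sketch of that filtering step, and of computing a mutation score over the remaining mutants (all names here are hypothetical, not the paper's implementation):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical mutant representation: the source line it modifies.
record Mutant(int line) {}

public class CoverageControl {
    /** Keep only mutants whose mutated line is covered by every suite under comparison. */
    static List<Mutant> coveredByAll(List<Mutant> mutants, List<Set<Integer>> suiteCoverage) {
        return mutants.stream()
                .filter(m -> suiteCoverage.stream().allMatch(cov -> cov.contains(m.line())))
                .collect(Collectors.toList());
    }

    /** Mutation score: fraction of the considered mutants that a suite detects. */
    static double mutationScore(Set<Mutant> detected, List<Mutant> considered) {
        return considered.isEmpty() ? 0.0 : (double) detected.size() / considered.size();
    }
}
```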
- RQ: Are real faults coupled to mutants generated by commonly used mutation operators?
- If mutants are a valid substitute, then any test suite with a higher fault-detection rate should also have a higher mutation score; in that sense the mutants and real faults are “coupled”
- RQ: What types of real faults are not coupled?
- Manual analysis
- RQ: Is mutant detection correlated with real fault detection?
- Generate tests
- Need more test suites to determine whether the mutation score of a test suite is actually correlated with fault detection
- Used three different tools: EvoSuite, Randoop, JCrasher.
- Compare mutation scores between test suites
- What are the results?
- Are real faults coupled to mutants generated by commonly used mutation operators?
- 73% were coupled
- What was not coupled?
- Some, they believe, can not be coupled:
- “Incorrect algorithms” - cases where an algorithm had to be changed
- “Similar library method called”
- Some faults required stronger mutation operators (sketched below):
- Statement deletion
- Argument swapping
- Argument omission
- Similar library method called
- Here’s a less strong mutation operator:
- `x = x + 0`
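Hypothetical illustrations of those stronger operators (again, not the paper's code):

```java
// Hypothetical illustrations of stronger mutation operators.
public class StrongMutants {
    static double original(double x, double y) {
        return Math.pow(x, y);
    }
    static double argumentSwapping(double x, double y) {
        return Math.pow(y, x);          // arguments swapped
    }
    static double similarLibraryMethod(double x, double y) {
        return Math.hypot(x, y);        // a similar library method called instead
    }
    static double statementDeletion(double x, double y) {
        // sanityCheck(x, y);           // statement-deletion mutant: the call is removed
        return Math.pow(x, y);
    }
    // Argument omission is analogous: call an overload with fewer arguments.
}
```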
- Is mutant detection correlated with real fault detection?
- Yes
- Also, the correlation between mutant detection and real-fault detection is stronger than the correlation between code coverage and real-fault detection
- Reactions
- “are some mutants a valid substitute?”
- Extremely careful and time-consuming methodology, come to valid and sound conclusions for the dataset, but does this dataset generalize?
- More recent work shows that this still holds on bigger datasets built with the same methodology, but what about changing the methodology? (What we consider a bug, and also the language: just Java so far)
- New context: bugs = student implementation of an assignment that failed an instructor-provided test
- C, TypeScript
Dynamically Discovering Likely Program Invariants to Support Program Evolution
- What is the motivation?
- There are implicit invariants in a programmer’s mind while coding, the program could be more robust if they were encoded
- Example: `i` should monotonically increase in a for loop, or the bound of a loop should be set at some point
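A minimal Java sketch of making such implicit invariants explicit (illustrative only, not from the paper):

```java
// Illustrative only: encoding the implicit loop invariants as runtime checks.
public class LoopInvariants {
    static int sum(int[] a) {
        final int bound = a.length;          // the loop bound is fixed before the loop
        int total = 0;
        int prev = -1;
        for (int i = 0; i < bound; i++) {
            assert i > prev : "i increases monotonically";
            prev = i;
            total += a[i];
        }
        return total;
    }
}
```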
- What makes it a likely program invariant?
- Based on the tests, it was always true, but the tests might not be comprehensive
- Are they “likely” or “possible”?
- They define “likely” based on a probabilistic assumption that values are uniformly randomly distributed (rough math below)
- Example: if `x == 3` on every execution, but `x` is a 32-bit integer, it’s very unlikely that this would have happened every time by chance
- Case for possible: input a value in the range [3,4] -> `x`
- Case for likely:
- “Likely” based on developer acceptance
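A rough reconstruction of that probabilistic argument (our notation, not the paper's exact formula): if `x` were a uniformly random 32-bit value on each of the $N$ observed executions, then

$$\Pr[x = 3 \text{ on all } N \text{ runs}] = \left(2^{-32}\right)^{N},$$

so even a handful of observations pushes the probability of a coincidence below any sensible reporting threshold, and `x == 3` is flagged as a likely invariant.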
- What is the use-case for program evolution?
- If the maintainer becomes aware of the invariants, they will accept them, add them to the code, and this will help future developers (or the same developer) maintain the code
- Can use this to generate invariants on-demand for specific regions, too
- Maybe useful for fuzz testing: use as a fitness function to evaluate closeness to finding bugs. Or, in the 1999 terminology, good for finding edge cases :)
- Assertions are helpful for evolution
- There is a narrative explaining how this was helpful for augmenting a regex compiler
- Would it be helpful to prune the search space for model checking?
- What are the challenges in detecting invariants?
- Need to find the ones that are useful
- What’s not a useful invariant?
- `const x = 3; // assert(x==3); assert(x!=4)`
- “This list is not sorted”
- Probabilities, too…
- Where do the invariants come from?
- Record a lot of things
- Establish patterns for invariants (toy sketch after this list)
- They are established in the paper
- Only detect invariants for which there is a template
- Establish a threshold for “negative” invariants
- Only consider possible invariants that include at most 3 variables
- Otherwise it will take too long to run
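A toy sketch of template-based checking over recorded values (hypothetical names; the real tool has many more templates, relations over up to 3 variables, and a statistical confidence check):

```java
import java.util.List;

// Toy invariant backend: tries two unary templates ("x == c", "x >= c")
// against all recorded samples of a variable and keeps whichever holds.
public class TemplateChecker {
    static String infer(String var, List<Long> samples) {
        long first = samples.get(0);
        if (samples.stream().allMatch(v -> v == first)) {
            return var + " == " + first;                 // constant template
        }
        long min = samples.stream().mapToLong(Long::longValue).min().orElseThrow();
        return var + " >= " + min;                       // lower-bound template
    }

    public static void main(String[] args) {
        System.out.println(infer("i", List.of(0L, 1L, 2L, 3L)));  // prints: i >= 0
        System.out.println(infer("x", List.of(3L, 3L, 3L)));      // prints: x == 3
    }
}
```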
- Implementation details
- Frontend/backend distinction (sketched after this list)
- Frontend - records trace
- Backend - infers likely invariants
- Benefits?
- Can add new patterns and re-use the existing trace and frontend
- Can support multiple languages, just reuse the backend
- Could it have been more efficient to put frontend + backend together? What do we lose?
- Can’t infer while tracing - no feedback loop possible
- Stuck with variables that can be serialized into the trace format
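A sketch of the frontend side of that split (hypothetical record format; the real frontends write Daikon's own trace format):

```java
import java.io.PrintWriter;

// Toy frontend: serializes (program point, variable, value) records as text.
// Any language can emit this format, and new invariant templates can later be
// checked by re-reading the same trace; the cost is that only values which
// can be serialized here are ever visible to the backend.
public class TraceWriter {
    private final PrintWriter out;

    TraceWriter(PrintWriter out) {
        this.out = out;
    }

    void record(String programPoint, String variable, long value) {
        out.println(programPoint + "\t" + variable + "\t" + value);
    }
}
```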
- Evaluation
- What was evaluated?
- On the Gries program - do the inferred invariants match the textbook invariants? Yes
- What were these “Gries programs”?
- Examples for teaching programming and invariants, demonstrate that it can find those “easy” invariants
- Scalability
- How long it takes to run the tracer
- How long it takes to find the invariants
- 450 MHz Pentium II
- “The engine has not been optimized for time and space”
- The engine is implemented in Python, an interpreted language
- Relationship between number of tests and number of invariants
- Future studies?
- How many invariants are useful vs useless? (User study)
- Of the invariants detected, who accepts them as useful/valid
- Inspirational?
- Brilliant! Especially for large programs
- Having examples can be a useful tool for developers - showing what values might be in different places in a program is useful
- Still used/maintained, 30 years later!