Discussion: Test Oracles and Adequacy
Agenda:
- Administrivia
- Paper Proposal Feedback
- Updated readings for next Tuesday
- Flaky test and TotT follow-up
- Are mutants a valid substitute for real faults?
- Dynamically Discovering Likely Program Invariants to Support Program Evolution
These notes roughly capture the (mostly unedited) comments from the class:
Are mutants a valid substitute for real faults?
- Why do we need to know the answer to this question?
- We want to use mutation to assess the fault-finding ability of our tests - can we?
- Should it be: “are some mutants a valid substitute”?
- What is hard about answering this question?
- How to gather the dataset?
- What is a “bug fix”? - Approach is to manually isolate the part of a commit that fixes a bug
- How to control for code coverage?
- There could be a strong bias between code coverage and mutation
- How to run this experiment? - lots of CPU resources required, probably produces lots of output
- How to get a broad enough dataset so that this might generalize?
- Overall experimental design?
- Collect a bunch of isolated bugs
- Collect tests that reveal those bugs
- Generate more tests
- Do mutation analysis on all of the test suites
- Does a test suite that detects more mutants detect more faults?
- What is their definition of a “real fault?”
- Start with 5 open source projects
- Have issue tracker, have good version control
- Look for bug-fix commits
- Manually inspect them
- Discard faults that were not reproducible on the buggy commit
- What was the procedure to determine reproducibility? It required that a test already existed
- Are mutants a valid substitute for faults that developers don’t write test cases for?
- … are mutants a valid substitute for faults in distributed systems caused by exception handling bugs (OSDI 2014)?
- Ensured that there was only a single change in the commit (they tried to do this manually)
- Discard commits where the only changes were to tests
- Where do developer tests come from?
- (T_pass, T_fail) pairs: for a specific commit, T_pass is the test suite that passes, and T_fail differs by exactly one test case, which reveals the fault (if there are multiple failing tests, they form multiple pairs)
- What is a mutant?
- What operators? (illustrated in the sketch below)
- Replace constants
- Replace operators
- Modify branch conditions
- Delete statements
- Where to apply mutations?
- Applied to the classes that had been changed by the bug fix
- Didn’t want to create a lot of “useless” mutants; it’s OK to do fewer to save time
- Assumption: a mutant in the same location as the bug will represent that bug
- Null hypothesis is that there is no correlation. Making more mutants could only IMPROVE the strength of the finding that there is a correlation
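As a concrete illustration of these operator classes (hypothetical code, not taken from the paper or its tooling), one mutant per class might look like:

```java
// Hypothetical example: one mutant per operator class, shown as comments.
public class Fees {
    public int fee(int balance, int txCount) {
        if (balance < 1000) {       // branch-condition mutant: balance <= 1000
            return 5 + txCount;     // operator mutant: 5 - txCount
        }
        audit(balance);             // statement-deletion mutant: remove this call
        return 2;                   // constant mutant: return 3
    }

    private void audit(int balance) { /* logging elided */ }
}
```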
- Run the experiments
- How to control for code coverage?
- Only consider mutants that are covered by all of the tests
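A minimal sketch of that filtering step, and of computing a mutation score over the remaining mutants (all names here are hypothetical, not the paper's implementation):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical mutant representation: the source line it modifies.
record Mutant(int line) {}

public class CoverageControl {
    /** Keep only mutants whose mutated line is covered by every suite under comparison. */
    static List<Mutant> coveredByAll(List<Mutant> mutants, List<Set<Integer>> suiteCoverage) {
        return mutants.stream()
                .filter(m -> suiteCoverage.stream().allMatch(cov -> cov.contains(m.line())))
                .collect(Collectors.toList());
    }

    /** Mutation score: fraction of the considered mutants that a suite detects. */
    static double mutationScore(Set<Mutant> detected, List<Mutant> considered) {
        return considered.isEmpty() ? 0.0 : (double) detected.size() / considered.size();
    }
}
```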
- RQ: Are real faults coupled to mutants generated by commonly used mutation operators?
- If mutants are a valid substitute, then any test suite with a higher fault-detection rate should also have a higher mutation score; in that sense the mutants and real faults are “coupled”
- RQ: What types of real faults are not coupled?
- Manual analysis
- RQ: Is mutant detection correlated with real fault detection?
- Generate tests
- Need more test suites to determine whether the mutation score of a test suite is actually correlated with fault detection
- Used three different tools: EvoSuite, Randoop, JCrasher.
- Compare mutation scores between test suites
- What are the results?
- Are real faults coupled to mutants generated by commonly used mutation operators?
- 73% were coupled
- What was not coupled?
- Some, they believe, can not be coupled:
- “Incorrect algorithms” - cases where an algorithm had to be changed
- “Similar library method called”
- Some faults required stronger mutation operators (sketched below):
- Statement deletion
- Argument swapping
- Argument omission
- Similar library method called
- Here’s a less strong mutation operator:
- `x = x + 0`
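Hypothetical illustrations of those stronger operators (again, not the paper's code):

```java
// Hypothetical illustrations of stronger mutation operators.
public class StrongMutants {
    static double original(double x, double y) {
        return Math.pow(x, y);
    }
    static double argumentSwapping(double x, double y) {
        return Math.pow(y, x);          // arguments swapped
    }
    static double similarLibraryMethod(double x, double y) {
        return Math.hypot(x, y);        // a similar library method called instead
    }
    static double statementDeletion(double x, double y) {
        // sanityCheck(x, y);           // statement-deletion mutant: the call is removed
        return Math.pow(x, y);
    }
    // Argument omission is analogous: call an overload with fewer arguments.
}
```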
- Is mutant detection correlated with real fault detection?
- Yes
- Also, the correlation between mutant detection and real-fault detection is stronger than the correlation between code coverage and real-fault detection
- Reactions
- “are some mutants a valid substitute?”
- Extremely careful and time-consuming methodology, come to valid and sound conclusions for the dataset, but does this dataset generalize?
- More recent work shows that this still holds on bigger datasets built with the same methodology, but what about changing the methodology? (What we consider a bug, and also the language: just Java so far)
- New context: bugs = student implementation of an assignment that failed an instructor-provided test
- C, TypeScript
Dynamically Discovering Likely Program Invariants to Support Program Evolution
- What is the motivation?
- There are implicit invariants in a programmer’s mind while coding, the program could be more robust if they were encoded
- Example: `i` should monotonically increase in a for loop, or the bound of a loop should be set at some point
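A minimal Java sketch of making such implicit invariants explicit (illustrative only, not from the paper):

```java
// Illustrative only: encoding the implicit loop invariants as runtime checks.
public class LoopInvariants {
    static int sum(int[] a) {
        final int bound = a.length;          // the loop bound is fixed before the loop
        int total = 0;
        int prev = -1;
        for (int i = 0; i < bound; i++) {
            assert i > prev : "i increases monotonically";
            prev = i;
            total += a[i];
        }
        return total;
    }
}
```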
- What makes it a likely program invariant?
- Based on the tests, it was always true, but the tests might not be comprehensive
- Are they “likely” or “possible”?
- They define “likely” based on a probabilistic assumption that values are uniformly randomly distributed (rough math below)
- Example: if `x == 3` on every execution, but `x` is a 32-bit integer, it’s very unlikely that this would have happened every time by chance
- Case for possible: input a value in the range [3,4] -> `x`
- Case for likely:
- “Likely” based on developer acceptance
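A rough reconstruction of that probabilistic argument (our notation, not the paper's exact formula): if `x` were a uniformly random 32-bit value on each of the $N$ observed executions, then

$$\Pr[x = 3 \text{ on all } N \text{ runs}] = \left(2^{-32}\right)^{N},$$

so even a handful of observations pushes the probability of a coincidence below any sensible reporting threshold, and `x == 3` is flagged as a likely invariant.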
- What is the use-case for program evolution?
- If the maintainer becomes aware of the invariants, they will accept them, add them to the code, and this will help future developers (or the same developer) maintain the code
- Can use this to generate invariants on-demand for specific regions, too
- Maybe useful for fuzz testing: use as a fitness function to evaluate closeness to finding bugs. Or, in the 1999 terminology, good for finding edge cases :)
- Assertions are helpful for evolution
- There is a narrative explaining how this was helpful for augmenting a regex compiler
- Would it be helpful to prune the search space for model checking?
- What are the challenges in detecting invariants?
- Need to find the ones that are useful
- What’s not a useful invariant?
- `const x = 3; // assert(x==3); assert(x!=4)`
- “This list is not sorted”
- Probabilities, too…
- Where do the invariants come from?
- Record a lot of things
- Establish patterns for invariants (toy sketch after this list)
- They are established in the paper
- Only detect invariants for which there is a template
- Establish a threshold for “negative” invariants
- Only consider possible invariants that include at most 3 variables
- Otherwise it will take too long to run
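A toy sketch of template-based checking over recorded values (hypothetical names; the real tool has many more templates, relations over up to 3 variables, and a statistical confidence check):

```java
import java.util.List;

// Toy invariant backend: tries two unary templates ("x == c", "x >= c")
// against all recorded samples of a variable and keeps whichever holds.
public class TemplateChecker {
    static String infer(String var, List<Long> samples) {
        long first = samples.get(0);
        if (samples.stream().allMatch(v -> v == first)) {
            return var + " == " + first;                 // constant template
        }
        long min = samples.stream().mapToLong(Long::longValue).min().orElseThrow();
        return var + " >= " + min;                       // lower-bound template
    }

    public static void main(String[] args) {
        System.out.println(infer("i", List.of(0L, 1L, 2L, 3L)));  // prints: i >= 0
        System.out.println(infer("x", List.of(3L, 3L, 3L)));      // prints: x == 3
    }
}
```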
- Implementation details
- Frontend/backend distinction (sketched after this list)
- Frontend - records trace
- Backend - infers likely invariants
- Benefits?
- Can add new patterns and re-use the existing trace and frontend
- Can support multiple languages, just reuse the backend
- Could it have been more efficient to put frontend + backend together? What do we lose?
- Can’t infer while tracing - no feedback loop possible
- Stuck with variables that can be serialized into the trace format
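A sketch of the frontend side of that split (hypothetical record format; the real frontends write Daikon's own trace format):

```java
import java.io.PrintWriter;

// Toy frontend: serializes (program point, variable, value) records as text.
// Any language can emit this format, and new invariant templates can later be
// checked by re-reading the same trace; the cost is that only values which
// can be serialized here are ever visible to the backend.
public class TraceWriter {
    private final PrintWriter out;

    TraceWriter(PrintWriter out) {
        this.out = out;
    }

    void record(String programPoint, String variable, long value) {
        out.println(programPoint + "\t" + variable + "\t" + value);
    }
}
```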
- Evaluation
- What was evaluated?
- On the Gries program - do the inferred invariants match the textbook invariants? Yes
- What were these “Gries programs”?
- Examples for teaching programming and invariants, demonstrate that it can find those “easy” invariants
- Scalability
- How long it takes to run the tracer
- How long it takes to find the invariants
- 450 MHz Pentium II
- “The engine has not been optimized for time and space”
- The engine is implemented in Python, an interpreted language
- Relationship between number of tests and number of invariants
- Future studies?
- How many invariants are useful vs useless? (User study)
- Of the invariants detected, who accepts them as useful/valid
- Inspirational?
- Brilliant! Especially for large programs
- Having examples can be a useful tool for developers - showing what values might be in different places in a program is useful
- Still used/maintained, 30 years later!