
Discussion: Test Oracles and Adequacy

Agenda:

  1. Administrivia
    1. Paper Proposal Feedback
    2. Updated readings for next Tuesday
    3. Flaky test and TotT follow-up
  2. Are mutants a valid substitute for real faults?
  3. Dynamically Discovering Likely Program Invariants to Support Program Evolution

These notes roughly capture the (mostly unedited) comments from the class:

Are mutants a valid substitute for real faults?

  • Why do we need to know the answer to this question?
    • We want to use mutation to assess the fault-finding ability of our tests - can we?
  • Should it be: “are some mutants a valid substitute”?
  • What is hard about answering this question?
    • How to gather the dataset?
      • What is a “bug fix”? - Approach is to manually isolate the part of a commit that fixes a bug
    • How to control for code coverage?
      • Code coverage and mutation score could be strongly correlated, which would confound the comparison
    • How to run this experiment? - lots of CPU resources required, probably produces lots of output
    • How to get a broad enough dataset so that this might generalize?
  • Overall experimental design?
    • Collect a bunch of isolated bugs
    • Collect tests that reveal those bugs
    • Generate more tests
    • Do mutation analysis on all of the test suites
    • Does a test suite that detects more mutants detect more faults?
  • What is their definition of a “real fault?”
    • Start with 5 open source projects
      • Have issue tracker, have good version control
    • Look for bug-fix commits
      • Manually inspect them
      • Discard faults that were not reproducible on the buggy commit
        • What was the procedure to determine reproducible? Required that a test existed already
        • Are mutants a valid substitute for faults that developers don’t write test cases for?
        • … are mutants a valid substitute for faults in distributed systems caused by exception handling bugs (OSDI 2014)?
      • Ensured that there was only a single change in the commit (tried to manually do this)
      • Discard commits where the only changes were to tests
  • Where do developer tests come from?
    • (T_pass, T_fail): for a specific commit, T_pass is the developer test suite that passes on the buggy version; T_fail differs by exactly one test case, which reveals the fault (if there are multiple failing tests, it yields multiple pairs)
  • What is a mutant?
    • What operators? (see the sketch after this list)
      • Replace constants
      • Replace operators
      • Modify branch conditions
      • Delete statements
    • Where to apply mutations?
      • Applied to the classes that had been changed by the bug fix
        • Didn’t want to create a lot of “useless” mutants - acceptable to generate fewer to save time
        • Assumption: a mutant in the same location as the bug will represent that bug
          • Null hypothesis is that there is no correlation. Making more mutants could only IMPROVE the strength of the finding that there is a correlation
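
A minimal sketch of what these operator families can look like on a small Java method; this is a hypothetical example, not the exact operator implementations used in the paper’s tooling:

```java
// Hypothetical class under test; each mutant changes exactly one thing.
class Pricing {
    static int price(int quantity, int unitCost) {
        if (quantity > 10) {                    // branch condition
            return quantity * unitCost - 5;     // bulk discount
        }
        return quantity * unitCost;
    }
}

// Example mutants from the four operator families listed above:
//   Replace constant:        if (quantity > 11) { ... }
//   Replace operator:        return quantity + unitCost - 5;
//   Modify branch condition: if (quantity >= 10) { ... }
//   Delete statement:        remove "return quantity * unitCost - 5;"
//                            (execution falls through to the non-discounted return)
```
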
  • Run the experiments
    • How to control for code coverage?
      • Only consider mutants that are covered by all of the tests
    • RQ: Are real faults coupled to mutants generated by commonly used mutation operators?
      • If mutants are a valid substitute, then any test suite with a higher fault-detection rate should also have a higher mutation score - the mutants and real faults are “coupled” (see the sketch after this list)
    • RQ: What types of real faults are not coupled?
      • Manual analysis
    • RQ: Is mutant detection correlated with real fault detection?
      • Generate tests
        • Need more test suites to determine whether the mutation score of a test suite is actually correlated with fault detection
        • Used three different tools: EvoSuite, Randoop, JCrasher.
      • Compare mutation scores between test suites
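
A hedged sketch of the coupling check as it was discussed above: a real fault counts as coupled if the fault-revealing suite T_fail kills at least one mutant that T_pass does not, i.e., its mutation score is strictly higher. The class and method names are illustrative, not the paper’s actual tooling.

```java
import java.util.Set;

// Illustrative sketch only: mutant IDs and kill sets stand in for whatever
// the mutation tool actually reports; these are not real APIs.
class CouplingCheck {

    /** Mutation score = fraction of the considered (covered) mutants a suite kills. */
    static double mutationScore(Set<String> killedMutants, int totalMutants) {
        return (double) killedMutants.size() / totalMutants;
    }

    /**
     * As discussed above: the real fault counts as coupled to the generated
     * mutants if T_fail (which adds the one fault-revealing test) kills at
     * least one mutant that T_pass does not, i.e. has a higher mutation score.
     */
    static boolean faultCoupledToMutants(Set<String> killedByTPass,
                                         Set<String> killedByTFail,
                                         int totalMutants) {
        return mutationScore(killedByTFail, totalMutants)
                > mutationScore(killedByTPass, totalMutants);
    }
}
```
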
  • What are results?
    • Are real faults coupled to mutants generated by commonly used mutation operators?
      • 73% were coupled
    • What was not coupled?
      • Some, they believe, cannot be coupled:
        • “Incorrect algorithms” - cases where an algorithm had to be changed
      • “Similar library method called”
      • Some faults required stronger mutation operators:
        • Statement deletion
        • Argument swapping
        • Argument omission
        • Similar library method called
      • Here’s a less strong mutation operator:
        • x = x + 0
    • Is mutant detection correlated with real fault detection?
      • Yes
      • Also, the correlation between mutant detection and real-fault detection is stronger than the correlation between code coverage and real-fault detection
  • Reactions
    • “are some mutants a valid substitute?”
    • Extremely careful and time-consuming methodology, come to valid and sound conclusions for the dataset, but does this dataset generalize?
      • More recent work shows that this still holds on bigger datasets built with the same methodology, but what about changing the methodology? (What we consider a bug… and what about other languages? This study is just Java)
      • New context: bugs = student implementation of an assignment that failed an instructor-provided test
        • C, TypeScript

Dynamically Discovering Likely Program Invariants to Support Program Evolution

  • What is the motivation?
    • There are implicit invariants in a programmer’s mind while coding, the program could be more robust if they were encoded
      • Example: i should monotonically increase in a for loop, or the bound of a loop should be set at some point
  • What makes it a likely program invariant?
    • Based on the tests, it was always true, but the tests might not be comprehensive
    • Are they “likely” or “possible”?
      • They define “likely” based on a probabilistic assumption that values are uniformly randomly distributed (see the sketch after this list)
      • Example: if x==3 on every execution, but x is a 32-bit integer, it’s very unlikely that it would have happened every time
      • Case for possible:
        • Input a value in the range [3,4] -> x
      • Case for likely:
        • “Likely” based on developer acceptance
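
A minimal sketch of the kind of confidence calculation the uniform-randomness assumption suggests (illustrative only, not the tool’s actual formula): report an invariant only if the chance of it holding on every observed sample by coincidence is below a threshold.

```java
// Illustrative confidence check, assuming values are uniformly random.
// This is a sketch of the idea, not the tool's actual implementation.
class LikelyInvariant {

    /**
     * Probability that a property which only holds on `favorable` out of
     * `r` equally likely values would nevertheless hold on all n samples
     * by pure chance (e.g. "x != 0" over 32-bit ints: favorable = 2^32 - 1).
     */
    static double chanceOfCoincidence(double favorable, double r, int n) {
        return Math.pow(favorable / r, n);
    }

    public static void main(String[] args) {
        double r = Math.pow(2, 32);     // 32-bit integer domain
        double threshold = 0.01;        // report only if very unlikely by chance

        // "x == 3" observed on 10 samples: essentially impossible by chance.
        System.out.println(chanceOfCoincidence(1, r, 10) < threshold);     // true  -> likely invariant

        // "x != 0" observed on 10 samples: quite plausible by chance alone.
        System.out.println(chanceOfCoincidence(r - 1, r, 10) < threshold); // false -> not reported yet
    }
}
```
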
  • What is the use-case for program evolution?
    • If the maintainer becomes aware of the invariants, they can accept them and add them to the code as assertions (see the sketch after this list), which helps future developers (or the same developer) maintain the code
    • Can use this to generate invariants on-demand for specific regions, too
    • Maybe useful for fuzz testing: use as a fitness function to evaluate closeness to finding bugs. Or, in the 1999 terminology, good for finding edge cases :)
    • Assertions are helpful for evolution
    • There is a narrative explaining how this was helpful for augmenting a regex compiler
    • Would it be helpful to prune the search space for model checking?
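
A small hypothetical example of the evolution use-case above: an inferred likely invariant, once reviewed and accepted by the maintainer, is written back into the code as an assertion so that later changes violating it fail fast. The class and invariant here are invented for illustration.

```java
// Hypothetical example: the invariant detector reports that, over the test
// suite, "0 <= count <= items.length" always held on entry to add().
// After the maintainer accepts it, it is encoded as an assertion.
class Inventory {
    private final String[] items;
    private int count;

    Inventory(int capacity) {
        this.items = new String[capacity];
        this.count = 0;
    }

    void add(String item) {
        // Accepted likely invariant, now an explicit, checked assumption.
        assert 0 <= count && count <= items.length : "count out of range";

        if (count == items.length) {
            throw new IllegalStateException("inventory full");
        }
        items[count++] = item;
    }
}
```
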
  • What are the challenges in detecting invariants?
    • Need to find the ones that are useful
      • What’s not a useful invariant?
        • const x = 3; // assert(x == 3), assert(x != 4)
        • “This list is not sorted”
        • Probabilities, too…
    • Where do the invariants come from?
      • Record a lot of things
      • Establish patterns for invariants
        • They are established in the paper
        • Only detect invariants for which there is a template (see the sketch after this list)
      • Establish a threshold for “negative” invariants
      • Only consider possible invariants that relate at most 3 variables
        • Checking all larger combinations of variables would take too long to run
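
A rough sketch of what template-based detection means in practice: each template is checked against every observed sample and discarded as soon as it is falsified. The templates below are illustrative unary examples; the real catalog also includes templates over pairs and triples of variables, which is why the search is capped at 3 variables.

```java
import java.util.List;

// Sketch of template-based invariant checking over one traced variable.
// The templates shown are illustrative, not the paper's exact catalog.
class TemplateCheck {

    /** "x is constant": falsified as soon as two different values appear. */
    static boolean isConstant(List<Integer> samples) {
        return samples.stream().distinct().count() <= 1;
    }

    /** "x != 0": falsified by any zero sample. */
    static boolean isNonZero(List<Integer> samples) {
        return samples.stream().allMatch(x -> x != 0);
    }

    /** "low <= x <= high" for the observed bounds; true by construction,
     *  so it would only be reported if a confidence check also passes. */
    static int[] observedRange(List<Integer> samples) {
        int low = samples.stream().min(Integer::compare).orElseThrow();
        int high = samples.stream().max(Integer::compare).orElseThrow();
        return new int[] { low, high };
    }

    public static void main(String[] args) {
        List<Integer> xSamples = List.of(3, 3, 3, 3);
        System.out.println(isConstant(xSamples));      // true -> candidate "x == 3"
        System.out.println(isNonZero(xSamples));       // true -> candidate "x != 0"
        System.out.println(observedRange(xSamples)[1]); // 3   -> candidate "x <= 3"
    }
}
```
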
    • Implementation details
      • Frontend/backend distinction (see the sketch after this list)
        • Frontend - records trace
        • Backend - infers likely invariants
        • Benefits?
          • Can add new patterns and re-use the existing trace and frontend
          • Can support multiple languages, just reuse the backend
        • Could it have been more efficient to put frontend + backend together? What do we lose?
          • Can’t infer while tracing - no feedback loop possible
          • Stuck with variables that can be serialized into the trace format
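
A minimal sketch of the frontend/backend split discussed above; the trace format and class names are invented for illustration (the real tool defines its own trace formats). The frontend appends serialized variable values at program points while the program runs; the backend reads the finished trace offline, which is why there is no feedback loop and only serializable values are visible.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Invented, simplified trace format: one "programPoint variable value" line per observation.
class TraceFrontend {
    private final FileWriter out;

    TraceFrontend(String path) throws IOException {
        out = new FileWriter(path);
    }

    // Called from instrumented code at each program point of interest.
    void record(String programPoint, String variable, long value) throws IOException {
        out.write(programPoint + " " + variable + " " + value + "\n");
    }

    void close() throws IOException {
        out.close();
    }
}

class InferenceBackend {
    // Runs only after execution finishes: no feedback into the running program,
    // and it sees only the values the frontend managed to serialize.
    static void readObservations(String path) throws IOException {
        for (String line : Files.readAllLines(Path.of(path))) {
            String[] parts = line.split(" ");
            if (parts.length == 3) {
                // A real backend would feed each observation into template checks
                // like the ones sketched earlier; here we just echo it.
                System.out.println("observed " + parts[0] + ": " + parts[1] + " = " + parts[2]);
            }
        }
    }
}
```
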
    • Evaluation
      • What was evaluated?
        • On the Gries program - do the inferred invariants match the textbook invariants? Yes
        • What were these “Gries programs”?
          • Examples for teaching programming and invariants, demonstrate that it can find those “easy” invariants
        • Scalability
          • How long it takes to run the tracer
          • How long it takes to find the invariants
            • 450MHz Pentium II
            • “The engine has not been optimized for time and space”
            • The engine is written in the interpreted language Python
          • Relationship between number of tests and number of invariants
      • Future studies?
        • How many invariants are useful vs useless? (User study)
          • Of the invariants detected, who accepts them as useful/valid
      • Inspirational?
        • Brilliant! Especially for large programs
        • Having examples can be a useful tool for developers - showing what values might be in different places in a program is useful
        • Still used/maintained, more than 20 years later!
