
Lecture 17 - UniFuzz: A Holistic and Pragmatic Metrics-Driven Platform for Evaluating Fuzzers

(Image: XKCD's hot take on the futility of standards)

  • Compare to FuzzBench: maybe what we are most interested in learning is not “which fuzzer is better” but “where is one fuzzer better and why?”
  • Per-project data is useful to see... but wow, this is a lot of data in the tables + graphs
    • 1-plot-per-target vs 1-plot-per-fuzzer
    • Maybe an interactive online companion in Tableau or Flourish would be a nice way to examine the results more closely
    • Maybe this is a defense of FuzzBench :)
    • Relative ranks might be a more useful compression of the various factors into saying which is “best” (see the sketch after this list)
      • Most of the statistical tests are using rank anyway…
  • Where is the artifact with the results?
  • In general, how much of our page-limit goes to the evaluation vs presenting what we actually did and why
  • It is, for sure, hard to draw conclusions about fuzzers in general from this
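
As a rough sketch of what the “relative ranks” compression could look like: rank the fuzzers within each target, then average the ranks across targets. The fuzzer names and scores below are made-up placeholders, not results from the paper.

```python
# Sketch: compress per-target results into average relative ranks.
# Scores are made-up placeholders (e.g., median bugs found over repeated trials).
import numpy as np
from scipy.stats import rankdata

fuzzers = ["AFL", "AFLFast", "MOPT", "QSYM"]
# rows = targets, columns = fuzzers
scores = np.array([
    [3, 5, 6, 4],   # target 1
    [0, 1, 1, 2],   # target 2
    [7, 7, 9, 5],   # target 3
])

# Rank within each target (rank 1 = best), averaging ties.
ranks = np.vstack([rankdata(-row, method="average") for row in scores])
avg_rank = ranks.mean(axis=0)

for name, r in sorted(zip(fuzzers, avg_rank), key=lambda x: x[1]):
    print(f"{name}: average rank {r:.2f}")
```

The average rank hides per-target variation, of course, which is exactly the tension discussed above.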

Q: “Usability” evaluation?

  • What is it that we typically evaluate in a paper? “Does this power schedule result in more bugs detected than the default schedule?” What is the standard of quality for the performance, ease of use, and bug-proneness of research tools?
    • As we saw with the KLEE reflection paper, these performance/implementation issues can sometimes dramatically impact the overall conclusions/results of a paper
  • What are the aspects of usability that might be more relevant for fuzzing applications?
    • AFL - works out of the box on an application that takes a file as an input
    • JQF/Zest - need to write a generator and driver
    • Many other fuzzers - need to write some model of the system, grammar for input, driver for program, etc.
    • Is there additional cost during program evolution?

Q: What are the problems that these authors see with existing fuzzing evaluation methodologies, and that they are trying to address with this work?

  • Different fuzzers are evaluated on different targets, and the selection of those targets might be biased. The targets might also be too small
  • Metrics are not suitable (e.g., evaluations report only “bugs found” or, worse, “unique crashes”)
  • Results are re-used directly without rerunning them (so different fuzzers end up measured on different hardware)
  • No typical reporting of resource consumption

Q: Are these problems with fuzzing evaluation methodologies better solved with UniFuzz than by FuzzBench? Are they solved now? Are they still open problems?

  • The problem of different tools being evaluated on “different benchmarks” can only be solved when everyone actually agrees on one benchmark, not just when another paper is published saying it’s the best
  • Maintenance burden of updating the targets, and the fuzzers
    • FuzzBench pushes this to the maintainers of the projects that benefit from OSS-Fuzz
  • Ground truth on bug triggering is still very weak - stack hashing + ASan (see the dedup sketch below)
    • FuzzBench pushes this to maintainers of the projects who respond to bug reports
    • Compare to: CSmith (‘we reported X bugs and Y were patched, including Z patched quite urgently’) or GLFuzz (‘we found X reports, of which we found Y false positives, reported Z bugs…’)
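
A minimal sketch of the stack-hashing idea mentioned above, assuming a simplified “#N 0x… in function” frame format; real ASan reports need more careful parsing, and this style of dedup is known to both over- and under-count distinct bugs.

```python
# Sketch: deduplicate crashes by hashing the top-N stack frames.
# Assumes a simplified trace format ("#0 0x... in func file:line");
# real ASan output needs more careful parsing. Illustrative only.
import hashlib
import re

def stack_hash(trace: str, top_n: int = 3) -> str:
    """Hash the function names of the top-N frames of a crash trace."""
    frames = re.findall(r"#\d+\s+0x[0-9a-f]+\s+in\s+(\S+)", trace)
    key = "|".join(frames[:top_n])
    return hashlib.sha1(key.encode()).hexdigest()[:12]

def dedupe(crashes: dict[str, str]) -> dict[str, list[str]]:
    """Group crash IDs by their stack hash; each group is one 'unique' bug."""
    groups: dict[str, list[str]] = {}
    for crash_id, trace in crashes.items():
        groups.setdefault(stack_hash(trace), []).append(crash_id)
    return groups
```

Two crashes with the same top frames may still be different bugs, and one bug can produce many distinct top frames - which is why developer-confirmed fixes (as in the CSmith/GLFuzz reporting style) are a much stronger ground truth.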

Q: FuzzBench or UniFuzz? Can we bring something from UniFuzz to FuzzBench? Or from FuzzBench to UniFuzz?

  • UniFuzz’s metrics are nice and thorough (but also overwhelming)
  • FuzzBench is, maybe, “more usable”
  • Both works have a lot of strong opinions about what is right in an evaluation, without necessarily having a clear, reasoned argument to support those conclusions
    • FuzzBench showed some additional experimental results to justify why their number of trials made sense, or why their campaign length made sense…
    • Why is it possible to say “30 trials, 24 hours for all targets”, but not “AFL++ is best on every target”?
  • Seed selection:
    • UniFuzz: “We found random seeds on the internet”
    • FuzzBench: “We tried with no seed, and then we tried with the seeds that were provided by the developers”

Q: What do we think about all of these metrics?

  • “Rare” bugs and “dangerous” bugs
    • Useful to know, but unsure what conclusion we can really draw
    • We define dangerous as “GDB says it’s dangerous” ;)
    • Some bugs are certainly likely to be more impactful than others
      • Given multiple fuzzers with the same oracle, is it even possible to design a fuzzer that will find more “important” bugs than the others?
      • Same argument as “are defects evenly distributed in the codebase”?
  • Speed of finding bugs
    • Point: “We ran our enhanced AFL and AFL for 1 hour each, and we found a lot more bugs!”
      • More realistically: fuzzing-as-a-service products like OSS-Fuzz run with a time bound, so what you can find within that bound is probably useful
    • Counterpoint: Maybe the best thing to do is “run as many fuzzers as you can for as long as you can”, accepting that we don’t understand why what happens happens
    • Speed of finding bugs might vary significantly across programs
    • Alternative: “It doesn’t matter how fast you find the bugs, as long as you find them all within the specified time bound”
    • Hard to report metrics for this, particularly with repeated trials
    • Also hard to measure this when you have a relatively small number of bugs, because there is limited information (bugs are sparse - same issues as with reporting bug counts). Speed of FINDING BRANCH COVERAGE might be more interesting to look at
  • Coverage
    • UniFuzz does not look at the “differential coverage” aspect, but this would be a nice thing to examine to understand whether there is much diversity between the different fuzzers (sketched after this list)
    • Number of lines covered vs percentage of lines covered?
      • Figure 6 shows total lines
      • Figure 6 is mega misleading because every graph has a different y-axis - but using percentages would also be impossible to read
      • How else would we present this information in a way that is readable? (Table?)
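
One possible way to look at the differential-coverage question raised above: given per-fuzzer sets of covered branch IDs (however they were collected; the sets below are invented), compare what each fuzzer covers that no other fuzzer reaches.

```python
# Sketch: differential coverage between fuzzers, given sets of covered
# branch IDs per fuzzer. The data below is made up for illustration.
coverage = {
    "AFL":  {"b1", "b2", "b3", "b7"},
    "MOPT": {"b1", "b2", "b4", "b5"},
    "QSYM": {"b1", "b3", "b5", "b6"},
}

union = set().union(*coverage.values())
for name, covered in coverage.items():
    others = set().union(*(c for n, c in coverage.items() if n != name))
    unique = covered - others
    print(f"{name}: {len(covered)}/{len(union)} branches covered, "
          f"{len(unique)} covered by no other fuzzer")
```

If the “unique” sets are large, the fuzzers are complementary and a single aggregate rank is even less meaningful.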

Q: Why start with 35 fuzzers, then evaluate with 8?

  • AND no AFL++!?!?
    • Maybe because they started in 2018, found CVEs in 2019, and then the paper got bounced around?
    • For the scale of this research, fair that it takes 2-3 years
    • But because there is no AFL++: no LAF-INTEL, RedQueen, etc.
  • What was the goal of the “usability” evaluation and what is its contribution?
    • Could maybe have done with just the 8?
    • Maybe they had a lot of problems using the tools and were frustrated, and here we are :)
      • “Fuzzing the fuzzers” paper?
      • Interesting to look through the issues, authors of fuzzing tools often respond
    • Now that they have the 35 fuzzers scripted in Docker, could you make a meta-fuzzer?
      • How many fuzzers are in FuzzBench?
        • ONLY 22 ;)
    • Some of the fuzzers require models (Peach) - what to do about that for the targets?

Q: Is there some large number of projects at which we can say that we have a generalizable conclusion? Or: how do we control for bias in the selection of these projects, when my incentive as the benchmark developer is “there should be a clear rank-order of fuzzer performance across all benchmarks”?

  • Is there a Friedman test here that says “there is no clear winner”? (See the sketch after this list)
  • Maybe they could have done this for any/all of the metrics that were included
  • We still like to see the different performance across different targets, and to highlight this instead of ignoring it
  • Could maybe come up with something like “We have N benchmarks and we were in the top-M ranks for all of the benchmarks”
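
A sketch of the Friedman-test idea from the first bullet above: treat targets as blocks and fuzzers as treatments, and ask whether the per-target rankings are consistent with “no clear winner”. The measurements are placeholders, not data from either paper.

```python
# Sketch: Friedman test across targets (blocks) and fuzzers (treatments).
# Each list is one fuzzer's score on the same sequence of targets
# (e.g., median coverage per target); values are made-up placeholders.
from scipy.stats import friedmanchisquare

afl     = [0.61, 0.43, 0.72, 0.55, 0.38]
aflfast = [0.63, 0.41, 0.75, 0.54, 0.40]
mopt    = [0.66, 0.47, 0.74, 0.58, 0.45]

stat, p = friedmanchisquare(afl, aflfast, mopt)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3f}")
# A small p only says the ranks differ somewhere; a post-hoc pairwise test
# (e.g., Nemenyi) is needed before claiming any single "winner".
```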

Comparing FuzzBench + UniFuzz:

  • There is so much compression when looking only at the aggregated results

Figure 5: Correlation coefficient between number of unique bugs and coverage

  • Why are there so many omissions? (No bugs found?)
  • Overall conclusion?
  • Bug sparsity is a significant threat to validity of any conclusion/correlation here
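
For concreteness, a rank-correlation version of the bugs-vs-coverage comparison could be computed per fuzzer/target like this (the paper’s exact coefficient may differ, and the per-trial values below are invented); with so many zero or tied bug counts, the coefficient is fragile.

```python
# Sketch: rank correlation between bugs found and coverage across trials
# of one fuzzer on one target. Values are made-up placeholders.
from scipy.stats import spearmanr

bugs     = [0, 1, 1, 2, 0, 3, 1, 2]                     # unique bugs per trial
coverage = [812, 950, 901, 1040, 798, 1102, 930, 1015]  # branches per trial

rho, p = spearmanr(bugs, coverage)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
# Many tied/zero bug counts (bug sparsity) make rho unstable, which is one
# threat to validity for any bugs-vs-coverage correlation claim.
```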

What do we do to evaluate fuzzers now?

  • How do we adequately compress this information into a paper?
  • OSS-Fuzz has an “ideal integration” award for fuzz targets - they reward developers for the first integration of their tools into the fuzz harness
  • If only we had a technique for adding faults to a program that are a valid substitute for real faults :)

© 2021 Jonathan Bell. Released under the CC BY-SA license