Lecture 17 - UniFuzz: A Holistic and Pragmatic Metrics-Driven Platform for Evaluating Fuzzers
- Compare to FuzzBench: maybe what we are most interested in learning is not “which fuzzer is better” but “where is one fuzzer better and why?”
- Per-project data is useful to see, but wow, this is a lot of data across the tables and graphs
- 1-plot-per-target vs 1-plot-per-fuzzer
- Maybe an interactive online companion in Tableau or Flourish would be a nice way to examine closer
- Maybe this is a defense of FuzzBench :)
- Relative ranks might be a more useful compression of the various factors into saying which fuzzer is “best” (see the rank-aggregation sketch after this list)
- Most of the statistical tests use ranks anyway…
- Where is the artifact with the results?
- In general, how much of our page-limit goes to the evaluation vs presenting what we actually did and why
- It is, for sure, hard to draw conclusions about fuzzers in general from this
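- A minimal sketch (in Python) of the rank-aggregation idea above: rank the fuzzers within each target, then average the ranks. The fuzzer names are real, but the scores are invented purely for illustration:

```python
# Minimal sketch: compress per-target scores into mean relative ranks.
# Assumes one score per (target, fuzzer), e.g. mean unique bugs over trials.
# The scores below are invented for illustration only.
import numpy as np
from scipy.stats import rankdata

fuzzers = ["AFL", "AFLFast", "MOPT", "Honggfuzz"]
scores = np.array([
    [10, 12,  9, 11],   # target 1
    [ 3,  2,  5,  4],   # target 2
    [ 7,  7,  8,  6],   # target 3
])

# Rank within each target; negate so the highest score gets rank 1 (ties averaged)
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
mean_ranks = ranks.mean(axis=0)

for name, r in sorted(zip(fuzzers, mean_ranks), key=lambda x: x[1]):
    print(f"{name}: mean rank {r:.2f}")
```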
Q: “Usability” evaluation?
- What is it that we typically evaluate in a paper? “Does this power schedule result in more bugs detected than the default schedule?”
- What is the standard of quality for the performance, ease of use, and bug-proneness of research tools?
- As we saw with the KLEE reflection paper - sometimes these performance/implementation issues can dramatically impact the overall conclusions/results from a paper
- What are the aspects of usability that might be more relevant for fuzzing applications?
- AFL - works out of box on an application that takes a file as an input
- JQF/Zest - need to write a generator and driver
- Many other fuzzers - need to write some model of the system, grammar for input, driver for program, etc.
- Is there additional cost during program evolution?
Q: What are the problems that these authors see with existing fuzzing evaluation methodologies, and that they are trying to address with this work?
- Different fuzzers are evaluated on different targets, and the selection of those targets might be biased; the targets might also be too small
- Metrics are not suitable (e.g. just “bugs found” or, worse, “unique crashes”)
- Re-use of results directly without rerunning them (different hardware for different fuzzers)
- No typical reporting of resource consumption
Q: Are these problems with fuzzing evaluation methodologies better solved with UniFuzz than by FuzzBench? Are they solved now? Are they still open problems?
- Can only solve the problem of “different benchmarks for different tools” when everyone actually agrees on one, not just when another paper is published saying it’s the best
- Maintenance burden of updating the targets, and the fuzzers
- FuzzBench pushes this to the maintainers of the projects who benefit from OSSFuzz
- Ground truth on bug triggering is still very weak - stack hashing + ASan (see the sketch after this list)
- FuzzBench pushes this to maintainers of the projects who respond to bug reports
- Compare to: CSmith (‘we reported X bugs and Y were patched, including Z patched quite urgently’) or GLFuzz (‘we found X reports, of which we found Y false positives, reported Z bugs…’)
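- As a rough illustration of how weak that ground truth is: stack hashing usually just means “hash the top N frames of the ASan report”. A sketch below, where the frame regex and the choice of N=3 are assumptions rather than UniFuzz’s exact procedure; different choices can merge distinct bugs or split one bug into many “uniques”:

```python
# Sketch of stack-hash deduplication over ASan reports (assumed N=3 top frames).
import hashlib
import re

FRAME_RE = re.compile(r"^\s*#\d+\s+0x[0-9a-f]+\s+in\s+(\S+)", re.MULTILINE)

def stack_hash(asan_report: str, top_n: int = 3) -> str:
    frames = FRAME_RE.findall(asan_report)[:top_n]
    return hashlib.sha1("|".join(frames).encode()).hexdigest()[:12]

report = """==1234==ERROR: AddressSanitizer: heap-buffer-overflow ...
    #0 0x4f1a2b in parse_chunk src/parse.c:118
    #1 0x4f09c3 in parse_file src/parse.c:74
    #2 0x4e88d1 in main src/main.c:31
"""
print(stack_hash(report))  # two crashes with the same top frames collide here
```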
Q: FuzzBench or UniFuzz? Can we bring something from UniFuzz to FuzzBench? Or from FuzzBench to UniFuzz?
- UniFuzz’s metrics are nice and thorough (but also overwhelming)
- FuzzBench is, maybe “more usable”
- Both works have a lot of strong opinions about what is right in an evaluation, without necessarily having a clear, reasoned argument to support those conclusions
- FuzzBench showed some additional experimental results to justify why its number of trials and its campaign length made sense (a sketch of the kind of per-pair test involved follows this list)…
- Why is it possible to say “30 trials, 24 hours for all targets”, but not “AFL++ is best on every target”?
- Seed selection:
- UniFuzz: “We found random seeds on the internet”
- FuzzBench: “We tried with no seed, and then we tried with the seeds that were provided by the developers”
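- For reference, the kind of per-pair comparison behind “how many trials do we need” can be sketched in a few lines: a Mann-Whitney U test plus a Vargha-Delaney A12 effect size over end-of-campaign coverage. The coverage numbers below are invented:

```python
# Sketch: compare two fuzzers' final coverage across repeated trials.
# Numbers are invented for illustration.
from scipy.stats import mannwhitneyu

fuzzer_a = [1210, 1185, 1250, 1302, 1199, 1274, 1222, 1260, 1231, 1289]
fuzzer_b = [1190, 1170, 1205, 1215, 1188, 1201, 1194, 1210, 1183, 1207]

u, p = mannwhitneyu(fuzzer_a, fuzzer_b, alternative="two-sided")

# Vargha-Delaney A12: probability that a random trial of A beats a random trial of B
n, m = len(fuzzer_a), len(fuzzer_b)
wins = sum(a > b for a in fuzzer_a for b in fuzzer_b)
ties = sum(a == b for a in fuzzer_a for b in fuzzer_b)
a12 = (wins + 0.5 * ties) / (n * m)

print(f"U={u:.1f}, p={p:.4f}, A12={a12:.2f}")
```
- With too few trials, statistics like these are noisy, which is roughly the point of the trial-count experiments mentioned above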
Q: What do we think about all of these metrics?
- “Rare” bugs and “dangerous” bugs
- Useful to know, but unsure what conclusion we can really draw
- We define dangerous as “GDB says it’s dangerous” ;)
- Some bugs are certainly likely to be more impactful than others
- Given multiple fuzzers with the same oracle, is it even possible to design a fuzzer that will find more “important” bugs than the others?
- Same argument as “are defects evenly distributed in the codebase”?
- Speed of finding bugs
- Point: “We ran our enhanced AFL and AFL for 1 hour each, and we found a lot more bugs!”
- More realistically: fuzzing-as-a-service products like OSSFuzz run within a time bound, so what you can find in that time bound is probably useful
- Counterpoint: Maybe the best thing to do is “run as many fuzzers as you can for as long as you can, and we don’t understand why what happens happens”
- Speed of finding bugs might vary significantly across programs
- Alternative: “It doesn’t matter how fast you find the bugs, as long as you find them all within the specified time bound”
- Hard to report metrics for this, particularly with repeated trials
- Also hard to measure this when you have a relatively small number of bugs, because there is limited information (bugs are sparse - same issues as reporting bug counts). Speed of finding BRANCH COVERAGE might be more interesting to look at
- Coverage
- UniFuzz does not look at the “differential coverage” aspect, but this would be a nice thing to examine to understand how much diversity there is between the different fuzzers (see the sketch after this list)
- Number of lines covered vs percentage of lines covered?
- Figure 6 shows total lines
- Figure 6 is mega misleading because every graph has a different y-axis scale - but using percentages would also be hard to read
- How else would we present this information in a way that is readable? (Table?)
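- A sketch of the “differential coverage” comparison mentioned above: which lines does each fuzzer reach that no other fuzzer reaches? The covered-line sets below are invented; in practice they would come from coverage tooling such as gcov or llvm-cov:

```python
# Sketch: per-fuzzer unique coverage vs. the union over all fuzzers.
# Covered-line sets are invented for illustration.
coverage = {
    "AFL":       {"parse.c:10", "parse.c:11", "util.c:40"},
    "Honggfuzz": {"parse.c:10", "parse.c:12", "util.c:40", "util.c:41"},
    "MOPT":      {"parse.c:10", "parse.c:11", "parse.c:12"},
}

union_all = set().union(*coverage.values())
for fuzzer, covered in coverage.items():
    others = set().union(*(c for f, c in coverage.items() if f != fuzzer))
    unique = covered - others
    print(f"{fuzzer}: {len(covered)} covered, {len(unique)} unique -> {sorted(unique)}")
print(f"Union over all fuzzers: {len(union_all)} lines")
```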
Q: Why start with 35 fuzzers, then evaluate with 8?
- AND no AFL++!?!?
- Maybe because they started in 2018, found CVEs in 2019, then the paper got bounced around?
- For the scale of this research, fair that it takes 2-3 years
- But with no AFL++, there is also no laf-intel, RedQueen, etc.
- What was the goal of the “usability” evaluation and what is its contribution?
- Could maybe have done with just the 8?
- Maybe they had a lot of problems using the tools and were frustrated, and here we are :)
- “Fuzzing the fuzzers” paper?
- Interesting to look through the issues, authors of fuzzing tools often respond
- Now that they have the 35 fuzzers scripted in Docker, could you make a meta-fuzzer?
- How many fuzzers are in FuzzBench?
- ONLY 22 ;)
- Some of the fuzzers require models (Peach) - what to do about that for the targets?
Q: Is there some large number of projects at which we can say that we have a generalizable conclusion? Or: how do we control for bias in the selection of these projects, where my incentive as the benchmark developer is: “There should be a clear rank-order of fuzzer performance across all benchmarks?”
- Is there a Friedman test here that says “there is no clear winner”? (see the sketch after this list)
- Maybe they could have done this for any/all of the metrics that were included
- We still like to see the different performance across different targets, and to highlight this instead of ignoring it
- Could maybe come up with something like “We have N benchmarks and we were in the top-M ranks for all of the benchmarks”
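- A sketch of what such a Friedman test could look like, over one score per (fuzzer, target); all numbers are invented:

```python
# Sketch: Friedman test across targets -- "do the fuzzers' rankings differ at all?"
# Each list holds one fuzzer's score (e.g. unique bugs) per target; numbers invented.
from scipy.stats import friedmanchisquare

afl       = [10, 3, 7, 1, 5, 9]
aflfast   = [12, 2, 7, 1, 6, 8]
mopt      = [ 9, 5, 8, 2, 5, 9]
honggfuzz = [11, 4, 6, 1, 7, 8]

stat, p = friedmanchisquare(afl, aflfast, mopt, honggfuzz)
print(f"Friedman chi-square={stat:.2f}, p={p:.3f}")
# A large p-value is consistent with "no clear overall winner";
# a small one only says the rankings differ, not which fuzzer is best.
```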
Comparing FuzzBench + UniFuzz:
- So much compression when looking only at the aggregated results
Figure 5: Correlation coefficient between number of unique bugs and coverage
- Why are there so many omissions? (No bugs found?)
- Overall conclusion?
- Bug sparsity is a significant threat to validity of any conclusion/correlation here
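- For reference, a Figure-5-style correlation can be sketched as a rank correlation over per-(fuzzer, target) results (Spearman here; whether the paper uses Spearman or Pearson, the sparsity issue is the same). Note how many zero-bug entries there are in the invented data below - exactly the sparsity problem:

```python
# Sketch: correlation between unique bugs found and coverage reached,
# one pair per (fuzzer, target) run. Numbers are invented for illustration.
from scipy.stats import spearmanr

unique_bugs = [0, 0, 1, 0, 3, 0, 2, 0, 0, 5, 1, 0]
coverage    = [800, 950, 1200, 700, 1500, 900, 1300, 650, 980, 1600, 1100, 720]

rho, p = spearmanr(unique_bugs, coverage)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
```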
What do we do to evaluate fuzzers now?
- How do we adequately compress this information into a paper?
- OSSFuzz has “an ideal integration award” for fuzz targets - they reward developers for the first integration of their tools into their fuzz harness
- If only we had a technique for adding faults to a program that are a valid substitute for real faults :)