Lesson 10 - Coverage-based Greybox Fuzzing as Markov Chain - “AFLFast”

What is the problem?

Many inputs generated by AFL explore the same path

Q: What is a path?

  • “Normal” definition of path (e.g. from symbolic execution): a sequence of branch decisions
  • This paper: AFL’s definition of a path is a unique hash over all of the coverage probes’ hit counts (sketch below)
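
As a rough illustration of that definition (a minimal sketch, not AFL's actual implementation - AFL records hit counts in a shared-memory bitmap and buckets them into coarse ranges before hashing; all names here are ours):

```python
import hashlib

# AFL-style coarse hit-count buckets: 1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+
BUCKET_BOUNDS = [1, 2, 3, 7, 15, 31, 127]

def bucketize(count: int) -> int:
    """Map a raw hit count onto a coarse bucket index, saturating at the top."""
    for i, bound in enumerate(BUCKET_BOUNDS):
        if count <= bound:
            return i
    return len(BUCKET_BOUNDS)  # 128 and above

def path_id(hit_counts: dict) -> str:
    """One 'path' = a hash over the bucketized hit count of every probe that fired."""
    fired = sorted((probe, bucketize(c)) for probe, c in hit_counts.items() if c > 0)
    return hashlib.sha1(repr(fired).encode()).hexdigest()

# Executions that hit the same probes with similar counts share a path:
print(path_id({0: 1, 7: 5}) == path_id({0: 1, 7: 6}))  # True: 5 and 6 share a bucket
print(path_id({0: 1, 7: 5}) == path_id({0: 1, 7: 9}))  # False: different buckets
```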

How do we avoid generating lots of inputs that execute the same path?

  • Symbolic execution will (at least theoretically) guarantee exploration of all feasible paths
  • Fuzzing may not

Coverage guided fuzzers - review

Two key aspects for the search process:

  1. Schedule of how long to fuzz each input
  2. Process to determine which input to fuzz next (both knobs are marked in the loop sketch below)
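
To make those two knobs concrete, here is a toy sketch of the coverage-guided loop (hypothetical helper names and a fake target; nothing here is AFL's real code):

```python
import random

def mutate(s: bytes) -> bytes:
    """Flip one random bit - a stand-in for AFL's real mutation stack."""
    if not s:
        return bytes([random.randrange(256)])
    i = random.randrange(len(s))
    return s[:i] + bytes([s[i] ^ (1 << random.randrange(8))]) + s[i + 1:]

def run_and_hash(s: bytes) -> int:
    """Pretend execution: derive a fake path ID from the input.
    A real fuzzer would run the instrumented target here."""
    return hash(s) % 50  # pretend the program has 50 paths

def greybox_fuzz(seeds, iterations=100):
    queue = list(seeds)
    seen_paths = {run_and_hash(s) for s in queue}
    for _ in range(iterations):
        seed = random.choice(queue)   # knob (2): which input to fuzz next
        energy = 10                   # knob (1): how long to fuzz it
        for _ in range(energy):
            child = mutate(seed)
            if (path := run_and_hash(child)) not in seen_paths:
                seen_paths.add(path)  # new path: keep this input as a seed
                queue.append(child)
    return queue, seen_paths

queue, paths = greybox_fuzz([b"hello"])
print(f"{len(queue)} seeds in queue, {len(paths)} paths discovered")
```

AFLFast leaves the loop itself alone and changes only those two marked lines.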

Solution - Markov Chains

Q: What is a Markov Chain?

  • It’s a model with multiple states, where the probability of advancing from the current state to the next is independent of any prior history of how you reached the current state
  • AKA the process is “memoryless” (tiny example below)
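
For instance, a tiny two-state chain (illustrative only), where the next state is sampled from probabilities attached to the current state alone:

```python
import random

# Transition probabilities depend only on the current state, never on history:
P = {"A": {"A": 0.9, "B": 0.1},
     "B": {"A": 0.5, "B": 0.5}}

def step(state: str) -> str:
    """Sample the next state using only the current state's row of P."""
    r, acc = random.random(), 0.0
    for nxt, p in P[state].items():
        acc += p
        if r < acc:
            return nxt
    return nxt  # guard against floating-point round-off

state, trace = "A", []
for _ in range(20):
    state = step(state)
    trace.append(state)
print("".join(trace))  # e.g. AAABAAAAABBAAAAAAABA
```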

Q: Is Coverage Guided Fuzzing a Markov Process?

  • Yes: this is closely related to the multi-armed “bandit” problem in AI; think of each slot machine as a “seed” in fuzzing.
  • Yes: Each time that we fuzz some seed input, the probability of advancing to a new program state is independent of other inputs that we generated

Q: Even if we use memory to determine which input to select to fuzz, is it still a Markov process?

  • Yes - that just means “the probability of getting into the current state depends on something else”, not “the probability of getting into the next state depends on where I was before”. The transitions themselves stay memoryless.

Discuss: Figure 2 probabilities - maybe oversimplified

Q: “In other words, WLOG, we make the following two assumptions… We assume that generating an input that exercises undiscovered path j from seed t_i is as likely as generating from seed t_j an input that exercises undiscovered path i.” What does this actually mean?

  • Same probability to go from path i -> j as from path j -> i (written out after this list)
  • How effective would a symbolic execution + BFS alternative be? [James discussed DARPA CGC winners - if AFLFast came in second, what came in first?]
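
Written out (roughly in the paper's notation, where p_ij is the probability that fuzzing seed t_i generates an input that exercises path j), the assumption is just this symmetry:

```latex
% Fuzzing t_i is as likely to reach undiscovered path j
% as fuzzing t_j is to reach undiscovered path i:
p_{ij} = p_{ji}
```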

Key idea: After we discover a new input that reveals a new path, we want some estimate of whether there is low-hanging fruit - new paths easily reachable from this input

Q: Page 4 - what does the statement “It is not difficult to see that a greybox fuzzer is more likely to exercise paths in a high-density region of the space than in a low-density region” actually mean?

  • You’ll more often exercise paths that have fewer constraints, since more inputs satisfy them
  • This is something that is measured empirically - by assigning some power to the input, fuzzing, and seeing how many new paths you get

“If their goal is to find more bugs, and not just some particular bugs, they will do a good job with that using AFLFast”

  • Assumption: bugs are randomly distributed in code. Hence: a fuzzer that explores more diverse paths is exploring more code, and more likely to find more bugs
  • USENIX Security 2020’s sanitizer-guided greybox fuzzing: it doesn’t matter if you exercise a bug that your oracle can’t actually detect. If you are looking for certain types of bugs, then some parts of the code will be more bug-dense, because you are limited not only by what you can EXPOSE but also by what you can DETECT

Create a variety of “power schedules” (Why is this called a “power schedule”?)

  • “State of the practice” schedule - what AFL does out of the box. An input is assigned some specific “energy” based on its coverage, and always gets a similar number of executions based on that energy
  • FAST - exponential power schedule: the fuzzing time of each seed increases exponentially each time that seed is selected, and is inversely proportional to the number of generated inputs that exercise the same path. So an input that exercises a very common path gets less power (sketch after this list)
  • Always assign some minimum power - never say “I won’t fuzz it at all”
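
As a sketch of the FAST formula from the paper (the Python names and the example constants here are ours):

```python
def fast_energy(alpha: int, s: int, f: int, beta: float = 100.0, M: int = 16_000) -> int:
    """FAST power schedule, p(i) = min(alpha(i)/beta * 2^s(i) / f(i), M):
      alpha -- AFL's baseline ("state of the practice") energy for this seed
      s     -- number of times this seed has been selected so far
      f     -- number of generated inputs that exercise this seed's path
      beta  -- damping constant; M -- hard cap on the energy
    The floor of 1 reflects the "always assign some minimum power" point above.
    """
    energy = (alpha / beta) * (2 ** s) / max(f, 1)
    return max(1, min(int(energy), M))

# A repeatedly re-selected seed on a rare path ramps up quickly;
# a seed on a very common path stays cheap:
print(fast_energy(alpha=1000, s=5, f=2))    # rare path   -> 160
print(fast_energy(alpha=1000, s=5, f=500))  # common path -> 1 (the floor)
```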

Q: How sensitive will this approach be to the starting inputs that you use (the seeds)?

  • Very biased towards seed inputs - where you start will greatly influence where you go next.
  • Seeds that we start with will expose the initial “common” paths
  • Startup behavior might be similar for AFLFast and AFL?

Q: Can we determine some amount of time to fuzz for before calculating what is “common”?

  • Yes, there are approaches in AI where you do some calibration runs at first, where you gather information but do not use it to bias your power scheduling. How long you defer the process is a hyperparameter and should be tuned, e.g. with grid search (minimal sketch below)
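
A minimal sketch of that warm-up idea (entirely hypothetical, building on the fast_energy sketch above; WARMUP_EXECS is the hyperparameter to tune):

```python
from dataclasses import dataclass

WARMUP_EXECS = 10_000  # hyperparameter: how long to defer the bias

@dataclass
class SeedStats:
    alpha: int = 1000  # AFL's baseline energy for the seed
    s: int = 0         # times this seed has been selected
    f: int = 1         # inputs observed on this seed's path

def energy_for(stats: SeedStats, total_execs: int) -> int:
    """Flat energy during calibration (stats are still recorded),
    then switch over to the biased FAST schedule."""
    if total_execs < WARMUP_EXECS:
        return 16  # arbitrary flat warm-up energy
    return fast_energy(stats.alpha, stats.s, stats.f)  # from the sketch above
```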

General fuzzing evaluation discussion: “Crash” vs “bug”

Evaluation

Headline number: how many vulnerabilities do we get? Q: Does # vulnerabilities tell the whole picture?

  • Hard to generalize results to other applications
  • Only applicable to the configuration that they used to run both tools
    • What seed to use?
      • Empty seed
    • How long to run for?
      • Ran each for “6 or 24 hours”
    • How many trials?
      • Ran each “at least 8 times”

For a thorough treatment of how these parameters affect AFLFast’s performance, see: Klees et al., “Evaluating Fuzz Testing”, CCS 2018

What are the takeaways from AFLFast evaluation?

  • Would it make sense to plot crashes or coverage vs. inputs (instead of time)?
    • Paths vs. inputs would be a great graph to support the claim that they explore more paths faster
    • Still need a graph with time on the x-axis because some inputs take longer to run than others, and AFL attempts to prioritize inputs that run faster (which might let you explore more, faster)

What are our takeaways from this paper?

  • “AFLFast is better than AFL” depending on the input seed, fuzzing campaign duration, and metric to use to declare “better”
  • Is it the case that this will always perform at least as well as AFL?
    • Depends on your program, your seed, your campaign duration, and again, your definition of “good”
  • AFL touted its “lack of configurability” but this work shows that there are definitely some knobs (like the scheduler) that are worthwhile to tweak and customize
    • Basically - try changing all of these things, and see what happens
  • What is the relationship between the Markov process model and the final power schedule?

© 2021 Jonathan Bell. Released under the CC BY-SA license