
Discussion: Fuzzing - “Automated Testing of Graphics Shader Compilers.” (Donaldson et al., 2017)

Poll:

To what extent do you think that this work could impact SE? Not very: 0 Unsure/so-so: 5 Very: 15

  • Coming up with those semantics-preserving changes for new languages/domains can be tricky (also finding ones that are defect-revealing)
  • Having multiple implementations of shader compilers gives “more bang for buck” - find bugs in multiple implementations concurrently
  • An impressive effort

To what extent do you think that the article is convincing that this is the best approach for testing graphics shader compilers? Not very: 0 Unsure/so-so: 8 Convinced: 11

  • Are there potential inconsistencies/biases based on the shaders that they used as inputs - would a qualitative analysis of the shaders help provide more confidence?
  • “High-value” bugs are a good target

Context and background

  • Why are graphics shader compilers difficult to test?
    • No “correct” answer
  • Why not do “differential” testing - compare “fuzzily” the output of two compilers?
    • Requires multiple GPUs w/ benchmarks - need a precise definition of “equivalent”
    • Still need to determine which one has the bug
    • Vendor-specific extensions prevent this from being used fully
  • What is the relationship between this work and testing approximate computing?
    • What is the rigorous definition of “close”? For shader compiler fuzzing, we might need to have a human evaluator (“the fog didn’t render but that was OK”)

Metamorphic testing

Example: we want to test sin(x), but we don’t know what the outputs should be

Metamorphic relation: Specify some property that describes a transformation of the input, and a corresponding transformation on the output.

  • sin(-x)=-sin(x)
  • sin(2pi+x) = sin(x)
  • integral from 0 to 2pi of sin(x) = 0
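The first two relations above can be checked directly in code. This is a minimal sketch: it never needs the “true” value of sin(x), only that related inputs produce consistently related outputs.

```python
import math

def check_sin_relations(x, tol=1e-9):
    """Check two metamorphic relations for sin at "follow-on" inputs derived from x."""
    # Relation 1: sin(-x) = -sin(x)
    assert abs(math.sin(-x) - (-math.sin(x))) < tol
    # Relation 2: sin(2*pi + x) = sin(x)
    assert abs(math.sin(2 * math.pi + x) - math.sin(x)) < tol

# No oracle for sin itself is needed -- only the relations.
for x in [0.0, 1.0, -2.5, 123.456]:
    check_sin_relations(x)
```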

What does this buy us in terms of being able to test things - for example, the sin function?

  • Provides a relationship from function inputs to outputs
    • Create “follow-on” inputs
  • What is the metamorphic relation in the context of GLFuzz?
    • Image generated by shader should be equivalent to the one generated by a shader with a semantics-preserving transformation applied
  • What other domains might we be able to specify metamorphic relationships for, and maybe use to find bugs?
    • Machine learning - There are a lot of matrix multiplications, do relations over the multiplications.
    • SQL queries - semantics preserving transformations, or even some that are not preserving, but we understand (e.g. LIMIT 10 should restrict number of rows)
    • Testing for data races - add no-ops into code might expose a data race (aka the semantics-preserving transformation)
    • Numerical analysis - know the different transformations, what is acceptable in terms of approximation
    • (Generally: computer algebra systems)
    • Image/video processing - semantics preserving transformations, but also identity transformations (crop it, should still be similar, etc)
    • Compiler testing - same as this paper. Property: semantics-preserving change to program should yield same compiled program/output
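As a concrete instance of the machine-learning bullet above, here is a sketch of a metamorphic relation over matrix multiplication, using a deliberately naive pure-Python multiply (the helper names are ours, not from the paper). An optimized kernel that violates the transpose identity has a bug, and no expected-output oracle is needed:

```python
def matmul(A, B):
    """Naive matrix multiply for small lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    return [list(row) for row in zip(*M)]

# Metamorphic relation: (A @ B)^T == B^T @ A^T
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert transpose(matmul(A, B)) == matmul(transpose(B), transpose(A))
```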

What is the motivation?

  • Why do we care about shader compilers?
    • It produces nice graphics, looks good in a powerpoint
    • It’s also a nice research problem, because there is no easy/obvious way to do it
  • All of the poor gamers need games to work right; safety.
  • Safety can be both from security perspective (exploit-freedom) and general correctness (crash-freedom)
  • Tesla almost definitely is using OpenGL :)

What is the solution?

Apply a semantics-preserving transformation to some inputs, run them, and check whether the output is the same: compileAndRun(P, I) = compileAndRun(f(P), I), where f produces a semantically equivalent program
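A minimal sketch of this loop, under the assumption that we have some `compile_and_run`, transform, and image-comparison helpers available (these names are placeholders, not the paper’s actual API):

```python
import random

def metamorphic_fuzz(shaders, transforms, compile_and_run, images_similar, trials=100):
    """Sketch of a GLFuzz-style loop: apply a semantics-preserving rewrite,
    then check compileAndRun(P, I) == compileAndRun(f(P), I)."""
    suspects = []
    for _ in range(trials):
        p = random.choice(shaders)
        f = random.choice(transforms)   # a semantics-preserving rewrite
        variant = f(p)
        reference = compile_and_run(p)
        result = compile_and_run(variant)
        if not images_similar(reference, result):
            # A "deviant" variant: a possible compiler bug (or a false positive)
            suspects.append((p, variant))
    return suspects
```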

Given an input set of shaders…

  1. How do they generate the variants?
    1. Add dead code of various kinds: x = 1; if(x == 0){}
      1. Don’t just insert empty blocks - copy/paste code from other shaders, a break or a continue, just return
    2. “Live code injection”
      1. Copies some code that computes some local variables from one shader into another
      2. Why copy code from another shader instead of generating from scratch?
        1. There is some risk that what you generate could have some unexpected side-effect
        2. It might be hard
        3. Maybe it would be more likely to create something that would be recognized as a “real” bug
    3. Mutate code with mathematical identities, x=1*x, x=0 + x
    4. Pack variables into a vector
      1. x,y,z -> q = vec(x,y,z)
    5. Q: Need to ensure that the control flow of the program is preserved so that it doesn’t unexpectedly crash (i.e., avoid accidentally making non-semantics-preserving changes)
      1. Is there a guarantee that the changes are semantics-preserving? (No)
      2. What is the risk if we make a non-semantics-preserving change?
        1. False positives (report a bug when there isn’t)
      3. Every transformation needs to be reversible for minification
    6. Detecting “deviant” variations
      1. Some heuristic for determining “similarity”
      2. Build a histogram of the RGB distributions of each image
      3. Compare distributions to determine if they are statistically different (chi-squared distance)
    7. Reduce/minify the deviant variations
      1. Reverse a random subset of the transformations; if the image still differs from the original, keep the smaller variant and repeat until this converges on a minimal set of transformations that still reproduces the bug
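The “deviant detection” step above (item 6) can be sketched as follows; the `THRESHOLD` value and the flat-histogram representation are illustrative assumptions, not the paper’s calibrated settings:

```python
def histogram(channel_values, bins=256):
    """Intensity histogram over a flat list of 8-bit channel values."""
    hist = [0] * bins
    for value in channel_values:
        hist[value] += 1
    return hist

def chi_squared_distance(hist_a, hist_b, eps=1e-10):
    """Chi-squared distance between two same-length histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps)
                     for a, b in zip(hist_a, hist_b))

# Identical images give distance 0; a large distance flags a "deviant" variant.
THRESHOLD = 100.0  # illustrative only
ref = histogram([10, 10, 200, 200])
var = histogram([10, 10, 200, 200])
assert chi_squared_distance(ref, var) == 0.0
```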

Evaluation

  • What scale of bugs per-dollar should we be accepting of?
    • If we are Nvidia/arm…. :/
    • If we are a platform vendor, maybe we care a lot
    • If we are an application developer in, say, the medical domain, we might want/need to have some higher degree of certainty of whether our software generates correct visualizations on different hardware
  • 17 devices, 3 evaluation sets
  • 1,000 shaders from GLSLSandbox.com; took the 100 largest shaders that could be compiled
  • How do they choose to apply which transformation?
    • Randomly - “swarm testing”
  • Identify false positives
    • Three fellow researchers, not involved in the project, indicate via a key-press whether each pair is visually distinguishable (how long?)
  • Table 3 - the true positives and false positives by technique
    • Mutation is very productive, but it also is going to be most likely to introduce floating point errors that result in false positives
    • Could false positives be reduced? Does it matter?

Overall discussion

  • How to focus on detecting security vulnerabilities?
    • Identify shaders that are executing memory-risky parts of code?
    • Have no interest in checking the output images
      • But, having a frame-by-frame analysis is still important/valuable for finding those security issues
  • Faster/easier way to get the same results?
    • How to make this a generic approach/architecture for testing compilers, with “semantics preserving changes?”
      • Need some engine for rewriting the code
      • Need some checker to determine whether the output of the programs is the same
        • Maybe easy, maybe hard?
      • Need to be able to clearly specify the “semantics-preserving” part
  • Is it valuable to find bugs in older software that may have been fixed already?
    • Yes: it shows that approach works. Maybe one could argue that this approach is “cheaper”?
    • And it means that developers really care about the bugs!

© 2022 Jonathan Bell. Released under the CC BY-SA license