Lecture 18 - Automatic Creation of SQL Injection and Cross-Site Scripting Attacks

Meta-Q: What do we think so far about the differences (or not) between work in automated testing/security in SE vs Security vs PL?

Background

PHP + CGI execution model

A user makes a request to a web server, and the web server invokes a program to service that request. Each request is a separate program execution, so the challenge for dynamic analysis is correlating data across multiple user requests.

XSS + SQLi vulnerabilities

OWASP top 10

SQL injection example:
- Query: SELECT * FROM users WHERE userName='$THE_USER'
- Malicious user sends username: 1' OR '1'='1
- Evaluates to: SELECT * FROM users WHERE userName='1' OR '1'='1'

How could we prevent this from being exploited in this way?

Input validation - reject certain characters or patterns in input. This validation has to happen on the server side even if the client side also does it, since an attacker can bypass the client entirely
Automatically escape every quote in every input that we receive from users!
- 1' OR '1'='1 -> 1\' OR \'1\'=\'1
- SELECT * FROM users where userName='1\' OR \'1\'=\'1'
- “Magic Quotes”
- Problem with this: defensive programmers who also escape their inputs end up double-escaping, so values pick up extra backslashes
Avoid writing code this way - use “PreparedStatements”: SELECT * from users where userName=? and then something like set_parameter(1, $THE_USER)
- The actual implementation of how the prepared statement works depends on your database implementation, and also the implementation of your database API
Equifax hack - vulnerability in input validation for Apache Struts
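
To make the prepared-statement point above concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than PHP. The table contents and the helper names (make_db, lookup_unsafe, lookup_safe) are assumptions made up for illustration.

```python
import sqlite3

def make_db():
    # In-memory database with a hypothetical users table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (userName TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn

def lookup_unsafe(conn, user):
    # String concatenation: the input becomes part of the SQL text.
    query = "SELECT * FROM users WHERE userName='" + user + "'"
    return conn.execute(query).fetchall()

def lookup_safe(conn, user):
    # Placeholder: the input is bound as a value and is never parsed as SQL.
    return conn.execute(
        "SELECT * FROM users WHERE userName=?", (user,)
    ).fetchall()

conn = make_db()
attack = "1' OR '1'='1"
unsafe_rows = lookup_unsafe(conn, attack)  # the WHERE clause is always true
safe_rows = lookup_safe(conn, attack)      # no user has this literal name
```

With the concatenated query the attack string returns every row. With the placeholder it is compared as an ordinary (and nonexistent) user name.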

What is the problem being solved here?

Generating inputs?
- Try to solve for more paths through the inputs
- This “throw everything at the wall and see what sticks” approach might be effective for small programs, but unsure about big ones
Detecting vulnerabilities?
- (Especially non-crashing ones)…

Why would you not solve this problem with static analysis?

False positives - particularly at the database boundary, which the analysis has to reason across to find “2nd order XSS” vulnerabilities

High level approach

Taint tracking

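	int x = Tainted(5);   // x carries a taint mark
	int y = 10;           // y is untainted
	int z = x + y;        // taint propagates: z is derived from x
	if(isTainted(z)){
		// reached: z inherited the taint of x
	}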

$THE_USER = $_GET['user']; //taint source

$THE_USER = addslashes($THE_USER); //Sanitizer?

$result = mysql_query("SELECT * from users where userName='$THE_USER'"); //taint sink
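
The source / sanitizer / sink snippet above can be mimicked in Python. This is a rough sketch, not how any real taint tracker is implemented. The Tainted class and the source, sanitize, sink and is_tainted helpers are hypothetical, taint only propagates through the + operator, and sanitize is a simplified stand-in for addslashes.

```python
class Tainted(str):
    """A str subclass whose type acts as the taint mark (hypothetical)."""

    def __add__(self, other):
        # tainted + anything stays tainted
        return Tainted(str.__add__(self, other))

    def __radd__(self, other):
        # anything + tainted stays tainted
        return Tainted(str.__add__(other, self))

def is_tainted(value):
    return isinstance(value, Tainted)

def source(value):
    # Taint source, playing the role of $_GET['user'].
    return Tainted(value)

def sanitize(value):
    # Simplified stand-in for addslashes(): returns a plain str,
    # which clears the taint mark.
    return str(value).replace("\\", "\\\\").replace("'", "\\'")

def sink(query):
    # Taint sink, playing the role of mysql_query().
    if is_tainted(query):
        raise ValueError("tainted data reached the SQL sink")
    return query

user = source("1' OR '1'='1")
raw_query = "SELECT * from users where userName='" + user + "'"
clean_query = "SELECT * from users where userName='" + sanitize(user) + "'"
```

Here raw_query is still tainted and would be rejected by sink, while clean_query passes because the sanitizer returned an unmarked string. Whether that sanitizer is actually adequate for the sink is a separate question.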

Generate some inputs with the goal of reaching vulnerable sinks
Taint that input, use dynamic taint tracking to see if that input flows to a vulnerable sink
If yes, try to inject an attack
If an attack was injected, check whether it “succeeded” as an exploit
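
The four steps above can be sketched end to end on a toy program. Everything here is an assumption made for illustration: toy_app stands in for a PHP page, a fixed input list stands in for input generation, a unique marker string stands in for real dynamic taint tracking, and counting unescaped quotes stands in for a proper attack oracle.

```python
ATTACK = "1' OR '1'='1"
MARKER = "TAINT'MARK"  # contains a quote, so escaping visibly changes it

def toy_app(params):
    """Hypothetical page under test: returns the SQL it would send."""
    user = params.get("user", "")
    if params.get("mode") == "safe":
        user = user.replace("'", "\\'")  # sanitized path
    return "SELECT * FROM users WHERE userName='" + user + "'"

def generate_inputs():
    # Step 1: candidate inputs (a fixed list instead of real generation).
    return [
        {"user": "alice", "mode": "safe"},
        {"user": "alice", "mode": "raw"},
    ]

def flows_to_sink(params, name):
    # Step 2: crude taint check, does the marker reach the query unchanged?
    probe = dict(params, **{name: MARKER})
    return MARKER in toy_app(probe)

def unescaped_quotes(query):
    return query.count("'") - query.count("\\'")

def find_attacks():
    reports = []
    for params in generate_inputs():
        for name in params:
            if not flows_to_sink(params, name):
                continue
            # Step 3: inject an attack string into the flowing parameter.
            attacked = dict(params, **{name: ATTACK})
            # Step 4: did the injected input change the query's shape?
            before = unescaped_quotes(toy_app(params))
            after = unescaped_quotes(toy_app(attacked))
            if after != before:
                reports.append((params["mode"], name))
    return reports

reports = find_attacks()
```

Only the unsanitized path is reported. The sanitized path is dropped at step 2, which is the "sanitizer clears the taint" behavior.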

What do they do about sanitizers?

A sanitizer function will clear the taint set
Unclear what their sanitizers are
Different sanitizers might work better for different sensitive sinks
Might miss some true positives that were induced by using the wrong sanitizer
- Would have been an interesting evaluation to compare: how many of the reports get filtered out by sanitizers, which may or may not represent actual vulnerabilities?
- Interesting though that false positive cost is primarily machine time: you can always validate a report as a “vulnerability” or not based on whether or not you can induce one
- Better models of sanitizers can help mitigate this problem, but come with a maintenance burden as the language evolves, and might still be prone to developer errors, particularly when there are custom sanitizers
- XSS sanitization is tricky and depends on where the value flows into the browser:
```
<script>
alert($THE_EVIL_INPUT);
eval($THE_EVIL_INPUT);
// This page was generated by $THE_EVIL_INPUT
</script>
<a href="$THE_EVIL_INPUT">Click here!</a>
<$THE_EVIL_INPUT></$THE_EVIL_INPUT>
```

Example: <a href="$THE_EVIL_INPUT">Click here!</a>

Depending on sanitizer…

No sanitizer at all
- $THE_EVIL_INPUT= '"><script language="JavaScript">alert("hahaha");</script>'
Sanitizer: prevent me from adding double quotes
- $THE_EVIL_INPUT="javascript:alert('haha');"

<img src="$THE_EVIL_INPUT" /> <- could provide a URL to something malicious
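
A small sketch of why the right sanitizer depends on the output context, using Python's standard html and urllib.parse modules. The helpers render_link and is_safe_url are hypothetical, and the scheme allowlist is one possible policy, not a complete defense.

```python
import html
from urllib.parse import urlparse

def render_link(url):
    # Escapes &, <, > and both quote characters before writing the href.
    return '<a href="' + html.escape(url, quote=True) + '">Click here!</a>'

def is_safe_url(url):
    # Hypothetical extra check for URL contexts: allow only http(s).
    return urlparse(url).scheme in ("http", "https")

breakout = '"><script>alert("hahaha");</script>'
js_url = "javascript:alert(1)"

blocked = render_link(breakout)        # quotes and tags are neutralized
still_dangerous = render_link(js_url)  # unchanged: nothing to escape
```

HTML escaping stops the attribute breakout, but the javascript: URL contains no characters to escape and passes through untouched. In an href or src the sanitizer has to look at the URL itself.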

How does Ardilla detect that an attack is successful? (4.3.2)

SQL injection
- Compare the database statements (and any queries in them) - parse them, see if you get different parse tree
- This is pretty good - state-of-the-art at the time was maybe to do regex parsing on SQL
  - Might miss vulnerabilities in something like phpMyAdmin, which lets users enter and execute arbitrary MySQL queries, but that is by design
XSS attack checking
- Signal attack if the output contains “additional script-inducing constructs”
- Might be a weak oracle, as evidenced by the variety of injection routes + sanitizers
  - Compare the “strict” and “lenient” evaluations - might be an overly simplified model of XSS
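
The SQL injection check above (same query, different inputs, compare structure) can be approximated without a real SQL parser. This sketch compares token skeletons instead of parse trees, which is a deliberate simplification. The tokenizer, skeleton, build_query and is_injection are assumptions for illustration, not Ardilla's implementation.

```python
import re

TOKEN = re.compile(r"""
    '(?:[^'\\]|\\.)*'     # single-quoted string literal with backslash escapes
  | \d+                   # number
  | \w+                   # keyword or identifier
  | \S                    # any other single character
""", re.VERBOSE)

def skeleton(query):
    """Replace literals with placeholders so only the structure remains."""
    out = []
    for tok in TOKEN.findall(query):
        if len(tok) >= 2 and tok.startswith("'") and tok.endswith("'"):
            out.append("STR")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok.upper())
    return out

def build_query(user):
    # Hypothetical vulnerable query construction.
    return "SELECT * FROM users WHERE userName='" + user + "'"

def is_injection(build, benign, candidate):
    # An attack is signalled when the input changes the query's structure.
    return skeleton(build(benign)) != skeleton(build(candidate))

attack_detected = is_injection(build_query, "alice", "1' OR '1'='1")
escaped_detected = is_injection(build_query, "alice", "1\\' OR \\'1\\'=\\'1")
```

The raw attack adds an OR clause and changes the skeleton. The escaped version stays a single string literal and is not flagged.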

Concrete + symbolic database

Need to do taint tracking through the database - not just in PHP, since data is persisted in DB

This is engineering rather than research - a lot of work, but the system cannot actually work without it

To make this general, you also have to worry about triggers, complex queries like INSERT INTO X SELECT Y FROM Z, etc.

Alternative approaches?

Treat everything from the DB as tainted?
- Don’t trust anything that comes out of the DB. Any false positive by your tool should be solved by defensive programming - add sanitizers all over.
Have some heuristic?
- Try to create the most diverse set of inputs across different requests as possible
- Before storing into the DB, record the value on the side, e.g. “ABCDEFG”
- Any time you read anything from the DB, look for “ABCDEFG”, and if you see that exact value, apply taint mark T
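
The record-on-the-side heuristic can be sketched with sqlite3. The TaintLedger class and the comments table are hypothetical, and the exact-match rule is the heuristic from the list above, not a sound taint analysis.

```python
import sqlite3

class TaintLedger:
    """Wraps a DB and remembers which stored values were tainted."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE comments (body TEXT)")
        self.tainted_values = set()  # the "on the side" record

    def store(self, body, tainted):
        if tainted:
            self.tainted_values.add(body)
        self.conn.execute("INSERT INTO comments VALUES (?)", (body,))

    def load_all(self):
        rows = self.conn.execute(
            "SELECT body FROM comments ORDER BY rowid"
        ).fetchall()
        # Re-apply the taint mark when a value read back matches exactly.
        return [(body, body in self.tainted_values) for (body,) in rows]

ledger = TaintLedger()
ledger.store("ABCDEFG", tainted=True)   # came from user input
ledger.store("welcome", tainted=False)  # application constant
loaded = ledger.load_all()
```

Exact matching misses values the database transforms (substrings, case changes, concatenation) and can mislabel a constant that happens to equal a user input. Diverse, distinctive inputs reduce the second problem.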

Idealist view: defensive programming! Don’t even try to solve this problem. Use a static checker that will find candidates for this bug, force remediation.

Pragmatic view: We already have a lot of code, and it was written in PHP, so we don’t really believe that many good choices were made in its design or implementation :)

Where have these things gone in the past 12 years?

SQL injection
- We don’t use SQL anymore, or
- We use prepared statements
XSS
- Some browsers have implemented some filters
- Frontend frameworks have helped a lot also
LGTM + CodeQL
Input injection is not solved broadly, but we are more cognizant of the risks (better education) and have developed linguistic/API approaches that make it easier to do it right
- Constant trade-off in expressiveness vs security