Lecture 3: “Assessing Oracle Quality with Checked Coverage”

Received the N-10 most influential paper award (in 2021)

Agenda:

Friday check-in
Continued discussion: test adequacy
Discuss “Assessing Oracle Quality with Checked Coverage”

Some examples for discussion

int sum(int[] array){
	if(array == null)
		throw new IllegalArgumentException("Input must not be null");
	int ret = 0;
	for(int i = 0; i < array.length; i++){
		ret += array[i];
	}
	return ret;
}

void testSum(){
	assertEquals(10, sum(new int[]{0, 1, 2, 3, 4}));
	// TODO: also add a test for a null array, so that the exception branch is covered
}
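
The missing null-input test mentioned in the comment could look like the sketch below. It avoids a test framework so that it runs on its own; the class name SumNullTest and the helper throwsOnNull are made up for this example.

```java
class SumNullTest {
    // Same method as in the notes.
    static int sum(int[] array) {
        if (array == null)
            throw new IllegalArgumentException("Input must not be null");
        int ret = 0;
        for (int i = 0; i < array.length; i++) {
            ret += array[i];
        }
        return ret;
    }

    // Returns true if sum(null) is rejected with IllegalArgumentException.
    static boolean throwsOnNull() {
        try {
            sum(null);
            return false; // no exception: the null check did not fire
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("sum = " + sum(new int[]{0, 1, 2, 3, 4}));
        System.out.println("null rejected = " + throwsOnNull());
    }
}
```

With both calls, every statement and branch of sum is executed. That says nothing yet about how well the results are checked.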

Do mutation testing:

int sum(int[] array){
	int ret = 0;
	for(int i = 1; i < array.length; i++){ // Mutation: i = 1 instead of i = 0
		ret += array[i];
	}
	return ret;
}

void testSum(){
	assertEquals(10, sum(new int[]{0, 1, 2, 3, 4}));
	assertEquals(10, sum(new int[]{1, 0, 2, 3, 4}));
}

void testSum(){
	assert(sum(new int[]{1, 0, 2, 3, 4}) > 0);
}
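
To make the comparison between these test variants concrete, here is a small sketch that runs each oracle against the original and against the mutant. The class name MutantKillDemo and the test method names are invented for this example; a real mutation tool would generate and run the mutants automatically.

```java
import java.util.function.ToIntFunction;

class MutantKillDemo {
    // Original implementation (null check omitted, as in the mutated listing).
    static int sum(int[] array) {
        int ret = 0;
        for (int i = 0; i < array.length; i++) ret += array[i];
        return ret;
    }

    // Mutant: the loop starts at 1, so array[0] is skipped.
    static int sumMutant(int[] array) {
        int ret = 0;
        for (int i = 1; i < array.length; i++) ret += array[i];
        return ret;
    }

    // Test 1: the first element is 0, so skipping it is invisible.
    static boolean testLeadingZero(ToIntFunction<int[]> f) {
        return f.applyAsInt(new int[]{0, 1, 2, 3, 4}) == 10;
    }

    // Test 2: the first element is 1 and the exact result is checked.
    static boolean testLeadingOne(ToIntFunction<int[]> f) {
        return f.applyAsInt(new int[]{1, 0, 2, 3, 4}) == 10;
    }

    // Test 3: same input as test 2, but the oracle only checks "> 0".
    static boolean testWeakOracle(ToIntFunction<int[]> f) {
        return f.applyAsInt(new int[]{1, 0, 2, 3, 4}) > 0;
    }

    public static void main(String[] args) {
        // A mutant is killed when a test that passes on the original fails on the mutant.
        System.out.println("leading zero kills mutant: " + !testLeadingZero(MutantKillDemo::sumMutant));
        System.out.println("leading one kills mutant:  " + !testLeadingOne(MutantKillDemo::sumMutant));
        System.out.println("weak oracle kills mutant:  " + !testWeakOracle(MutantKillDemo::sumMutant));
    }
}
```

Only the test with a non-zero first element and an exact expected value kills the mutant. The weak oracle executes exactly the same code but lets the mutant survive.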

What were the problems in using mutation testing to evaluate test suite quality?

Time-consuming (machine time): Need to run lots of mutants and lots of tests
Time-consuming (human time): Need to detect equivalent mutants
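
As an illustration of the second point, here is a sketch of an equivalent mutant of sum. The class name EquivalentMutantDemo is made up, and the choice of mutation ("<" replaced by "!=") is just one example of a change that no test can detect.

```java
class EquivalentMutantDemo {
    static int sum(int[] array) {
        int ret = 0;
        for (int i = 0; i < array.length; i++) ret += array[i];
        return ret;
    }

    // Mutant: "<" replaced by "!=". Because i starts at 0 and grows by exactly 1,
    // it always reaches array.length exactly, so the loop runs the same number of times.
    static int sumMutant(int[] array) {
        int ret = 0;
        for (int i = 0; i != array.length; i++) ret += array[i];
        return ret;
    }

    public static void main(String[] args) {
        int[][] inputs = { {}, {7}, {0, 1, 2, 3, 4}, {-5, 5, 100} };
        for (int[] input : inputs) {
            // Agreement on sample inputs is only evidence; the argument in the
            // comment above is what actually shows equivalence.
            System.out.println(sum(input) == sumMutant(input));
        }
    }
}
```

No test can kill this mutant, so it drags the mutation score down until a human inspects it and marks it as equivalent.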

What is the motivation for this paper?

Something better than branch coverage to judge how well tested our program is

What is the design space of solutions that are considered?

- Want a scale from 0-100, where 100% means “you are done” and 90% means “keep going”
- Stronger criteria than just statement/branch coverage
- Want something FASTER than mutation testing

What is hard about this problem?

We need something that measures the quality of the oracle, not just the code that is executed
The same problem that we always see when evaluating test suites: the test's oracle states what the correct behavior is, but how do you know what the REAL correct behavior is that you should be checking?
- What does it mean to “have the results checked”?

What is the proposed solution?

“Checked coverage” -> which statements are executed AND are included on a backwards slice from any of the assertions?
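
As a rough sketch of how this turns into a number: take the statements that are both executed and on some assertion's backward slice, and divide by the number of coverable statements. The choice of denominator is an assumption here, and the class name CheckedCoverageRatio and the statement IDs are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class CheckedCoverageRatio {
    // Percentage of coverable statements that are executed AND lie on a
    // backward slice from at least one assertion.
    static double checkedCoverage(Set<Integer> coverable,
                                  Set<Integer> executed,
                                  Set<Integer> slicedFromAssertions) {
        if (coverable.isEmpty()) return 0.0;
        Set<Integer> checked = new HashSet<>(executed);
        checked.retainAll(slicedFromAssertions);
        checked.retainAll(coverable);
        return 100.0 * checked.size() / coverable.size();
    }

    public static void main(String[] args) {
        // Hypothetical statement IDs for a method with 8 coverable statements.
        Set<Integer> coverable = Set.of(1, 2, 3, 4, 5, 6, 7, 8);
        Set<Integer> executed = Set.of(1, 2, 3, 4, 5, 6);
        Set<Integer> sliced = Set.of(1, 3, 5);

        double statementCoverage = 100.0 * executed.size() / coverable.size();
        System.out.println("statement coverage: " + statementCoverage); // 75.0
        System.out.println("checked coverage:   "
                + checkedCoverage(coverable, executed, sliced));        // 37.5
    }
}
```

Checked coverage can never exceed statement coverage, because a statement has to be executed before it can be checked.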

Sidebar: What is backwards slicing?

static void sum(int N){ // prints I and sum; nothing is returned
  int I = 0;
  int sum = 0;
  while(I < N){
      sum = sum + I;
      I = I + 1;
  }
  System.out.println(I);
  System.out.println(sum);
}

Step 1: Take a trace of the program execution that you want to slice on, for instance sum(2)

int I = 0;
int sum = 0;
I < N (true)
sum = sum + I; // sum = 0
I = I + 1; // I = 1
I < N (true)
sum = sum + I; // sum = 1
I = I + 1; // I = 2
I < N (false)
System.out.println(I); // 2
System.out.println(sum); // 1

Example: Backward slice of that trace on println(I)

int I = 0;
I < N (true)
I = I + 1; // I = 1
I < N (true)
I = I + 1; // I = 2
I < N (false)
System.out.println(I); // 2

Note: this does not necessarily follow the formal definition of control dependence; the paper does not give a definition of control dependence.

What is the normal definition of control dependence that we should use?
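- “You are control dependent on a branch if you exist between the position where a branch splits control flow and the position where it comes back together”
- More precisely, this is a post-dominance relationship: the statement post-dominates one successor of the branch (taking that outcome always leads to the statement) but does not post-dominate the branch itself (another outcome can bypass the statement)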

static void sum(int N){
	int I = 0;
	while(I<N){
		I = I + 1;
	}
	System.out.println(I);
}
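
The slicing steps in this sidebar can be sketched as a single backward walk over the trace, keeping an executed statement if it defines a variable we still need or if it is a loop-condition evaluation that a kept statement depends on. This is a minimal sketch, not the paper's implementation: the class DynamicSliceDemo, the Event record layout, and the hand-written trace are all assumptions. In particular, each println is treated as depending on the final I < N (false) evaluation so that the result matches the informal slice above, which is looser than the formal definition of control dependence.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DynamicSliceDemo {
    // One executed statement instance in the trace.
    static final class Event {
        final String label;
        final String def;       // variable written, or null
        final Set<String> uses; // variables read
        final int control;      // index of the loop-condition event it depends on, or -1
        Event(String label, String def, Set<String> uses, int control) {
            this.label = label; this.def = def; this.uses = uses; this.control = control;
        }
    }

    // Hand-written trace of sum(2) from the sidebar.
    static List<Event> traceOfSum2() {
        List<Event> t = new ArrayList<>();
        t.add(new Event("int I = 0;", "I", Set.of(), -1));                   // 0
        t.add(new Event("int sum = 0;", "sum", Set.of(), -1));               // 1
        t.add(new Event("I < N (true)", null, Set.of("I", "N"), -1));        // 2
        t.add(new Event("sum = sum + I;", "sum", Set.of("sum", "I"), 2));    // 3
        t.add(new Event("I = I + 1;", "I", Set.of("I"), 2));                 // 4
        t.add(new Event("I < N (true)", null, Set.of("I", "N"), 2));         // 5
        t.add(new Event("sum = sum + I;", "sum", Set.of("sum", "I"), 5));    // 6
        t.add(new Event("I = I + 1;", "I", Set.of("I"), 5));                 // 7
        t.add(new Event("I < N (false)", null, Set.of("I", "N"), 5));        // 8
        t.add(new Event("System.out.println(I);", null, Set.of("I"), 8));    // 9
        t.add(new Event("System.out.println(sum);", null, Set.of("sum"), 8)); // 10
        return t;
    }

    // Returns the trace indices on the backward slice of the given event, in execution order.
    static List<Integer> backwardSlice(List<Event> trace, int criterion) {
        Set<String> wanted = new HashSet<>(trace.get(criterion).uses);
        Set<Integer> neededControl = new HashSet<>();
        neededControl.add(trace.get(criterion).control);
        List<Integer> slice = new ArrayList<>();
        slice.add(criterion);
        for (int i = criterion - 1; i >= 0; i--) {
            Event e = trace.get(i);
            boolean dataDep = e.def != null && wanted.contains(e.def);
            boolean controlDep = neededControl.contains(i);
            if (!dataDep && !controlDep) continue;
            slice.add(0, i);
            if (dataDep) wanted.remove(e.def); // this is the reaching definition
            wanted.addAll(e.uses);             // now we need whatever it read
            neededControl.add(e.control);
        }
        return slice;
    }

    public static void main(String[] args) {
        List<Event> trace = traceOfSum2();
        for (int i : backwardSlice(trace, 9)) {
            System.out.println(trace.get(i).label);
        }
    }
}
```

Slicing on event 9, println(I), drops every statement that touches sum. Slicing on event 10, println(sum), keeps everything except println(I), because sum depends on I.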

Why use slicing for checked coverage?
- Intuition: If some code is covered, but not included in a backwards slice from the assertions, there are no data or control dependencies between the assertion and that code - “it is not checked”
- Question: If some code IS included in that slice (it is “checked”) do we have more positive sentiment towards the test?
  - We can “loosely” say that this statement influences the outcome
  - This is purely Boolean: a statement either “is checked” or “is not”
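
A small sketch of the "covered but not checked" intuition. Everything here is hypothetical: the class UncheckedStatementDemo, the Stats class, and the countStep parameter, which exists only to inject a fault into the statement that updates count.

```java
class UncheckedStatementDemo {
    static final class Stats {
        final int countStep; // 1 is correct; any other value simulates a fault
        int total = 0;
        int count = 0;
        Stats(int countStep) { this.countStep = countStep; }

        void add(int value) {
            total = total + value;     // on the slice of any assertion about total
            count = count + countStep; // executed by every test, checked only if count is asserted
        }
    }

    // Oracle reads only total: the count update is covered but not checked.
    static boolean testTotalOnly(Stats s) {
        s.add(2);
        s.add(3);
        return s.total == 5;
    }

    // Oracle reads total and count: both updates are on a backward slice.
    static boolean testTotalAndCount(Stats s) {
        s.add(2);
        s.add(3);
        return s.total == 5 && s.count == 2;
    }

    public static void main(String[] args) {
        System.out.println("total-only test passes on faulty count:  " + testTotalOnly(new Stats(2)));
        System.out.println("total+count test passes on faulty count: " + testTotalAndCount(new Stats(2)));
    }
}
```

Both tests reach the same statement coverage of add, but only the second one can notice a fault in the count update.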

What are going to be the problems of using slicing for this?

Example: figure 1…?

This is a particularly opinionated definition of what a “good” test is, and that definition is heavily rooted in how the system was implemented (with slicing)

public void testValidExecution(){
  try{
    Object ret = doSomethingVeryRiskyThatMightThroughException(someValidInput);
    assert(ret != null); // Just because I added this, perfect checked coverage :/
  } catch(Exception ex){
    // uh oh!
    fail("Expected no exception!");
  }
}

public void testInvalidExecution(){
	try{
		doSomethingVeryRiskyThatMightThroughException(invalidInput);
		fail("Expected exception!");
	} catch(Exception ex){
		//Correct behavior
	}
}

Problem with branch not taken:

static HashMap myHashMap = new HashMap();
static{
  myHashMap = null;
  myHashMap.put(...);
}
public void testValidExecution(){
  boolean OK = determineIfSystemIsOK();

  if(OK){
      // don't fail test
  }
  else{
      // do more complex checks to understand the system state
      callSomeOtherMethodThatWillProbablyFailTheTestButTheyDontKnowThatItWill();
  }
}

“All test classes traced separately”
- Class initialization code will get run each time that a new test runs, but that initialization code will never be checked

Evaluation

“Qualitative analysis”
Discarding assertions - lots of noise
What additional evaluation do you think could have been done, or could be done now?
- More statistical analysis - what is the significance of the results? (Especially for performance…)
- Is checked coverage correlated with fault detection?