Sunday, March 8, 2015

Less Is More - Why I Favor Short Tests

We're not going to be looking at any code in this post. We are, however, going to examine the impact the length of the tests we write has on various aspects of the verification process. The "long vs. short tests" debate is one I often have with colleagues, and every time I have to reiterate the same points. Usually I can't touch on all of them, because the discussion is cut short or because I just forget to bring some of them up. This post is my attempt at formalizing my point of view, for myself first of all, but also for others.

When we talk about simulation duration we have two things to consider:

  1. simulation time - how much time has passed within the simulation; it's usually measured at the nano-/micro-/millisecond scale
  2. processor time - how long it takes for the simulation to execute on the machine; values are typically measured in seconds, minutes or hours

The two are connected by the complexity of our DUT. For a bigger design it will take more processor time to simulate 50 microseconds. A short test is a test that can be simulated in a short amount of time. Assuming we can't do anything to influence our simulation speed (which is usually a function of the simulation tool and the machines we simulate on), to get shorter tests they need to simulate less.

 

Short tests are easier to develop

By testing less in one simulation run we don't have to write complex stimuli, which means that we'll finish building our test faster. We'll need more tests, but this is what we have modern verification techniques for. Constrained random verification's most touted feature might be that it makes it easier to hit states in the design we wouldn't have dreamed of trying to hit, but to me its main use is as a test automation tool.

Even though I promised we won't look at code, let's just take a little sneak peek. Here's what a short test would look like. It just checks that the DUT can perform a certain operation of a specific kind and that's it:

<'
extend MAIN my_virtual_sequence {
  body() @driver.clock is only {
    do init_seq;
    do operation;
  };
};
'>

A longer test would do more than just test that a certain operation works. This test checks that all operations of that kind work and also tries to perform some other tasks afterwards:

<'
extend MAIN my_virtual_sequence {
  body() @driver.clock is only {
    do init_seq;
    
    for each (op) in [ OP1, OP2, OP3, OP4 ] {
      do operation keeping { .op == op };
      
      if op == OP2 {
        do something_else;
        do even_more;
      };
      
      if op == OP4 {
        wait delay (100 ns);  // TODO ask designer what the appropriate delay is
        do some_other_operation;
      };
      
      // ...
    };
  };
};
'>

The example is pretty basic, but the second test would take longer to develop. There are more things to consider now: what values to loop over, what we can perform after a certain operation, what an appropriate delay is, etc. If there's a bug in the design when doing an OP2 operation, for example, it's going to make it all the more difficult for us. We'll get information overload when trying to do too much at once. Humans are constructed to break problems down: we split tasks into sub-tasks, which we then split into further sub-tasks and so forth. Why not write our tests in a way that mirrors this?

More tests don't have to mean more code. Our first test can check that any operation works, and it does this using randomization. We will get a lot of redundant test runs, but what's more expensive, engineering time or compute time? I'd rather optimize the former first and then, only if required, the latter.
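To put a rough number on those redundant runs: if we assume the short test picks one of the four operations uniformly at random on every run (an assumption; real constraints may well be biased), the expected number of runs needed to see every operation at least once is the classic coupon-collector result. This is a back-of-the-envelope sketch in Python, not testbench code, and the helper name is made up for illustration.

```python
from fractions import Fraction


def expected_runs_to_cover(num_values):
    """Expected number of uniformly random runs until every value has been seen.

    Coupon-collector result: n * (1 + 1/2 + ... + 1/n).
    """
    harmonic = sum(Fraction(1, k) for k in range(1, num_values + 1))
    return num_values * harmonic


# Four operations, as in the longer test above (OP1..OP4).
runs = expected_runs_to_cover(4)
print(runs, float(runs))
```

Under that assumption it takes a little over eight runs of the short test, on average, to cover four operations. That's about twice the minimum, paid for in compute time rather than engineering time.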

 

Short tests are easier to parallelize

Test length also has an impact on the duration of the regression suite. You're probably thinking "Duh! If I have more to simulate my regression will be longer." That's not what I mean. What I mean is that long tests impact our ability to efficiently run tests in parallel. Let's start with the obvious: if we're able to run 100 jobs in parallel, but we only have 50 tests, we're making poor use of our compute resources. Assuming our tests are all the same length, and that the same work could be split evenly across all 100 slots, our regression will take twice as long as it potentially could.

[Figure: half_wasted]
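The arithmetic behind the 100-slot scenario can be sketched in a few lines of Python. The per-test durations (2 hours and 1 hour) are made-up numbers chosen so that both cases contain the same total amount of simulation work. The function name is hypothetical as well.

```python
import math


def regression_hours(num_tests, hours_per_test, num_slots):
    """Wall-clock time of a regression whose tests all have the same length.

    Tests run in waves of at most `num_slots` jobs, so the regression
    needs ceil(num_tests / num_slots) waves.
    """
    waves = math.ceil(num_tests / num_slots)
    return waves * hours_per_test


# Same total work (100 test-hours), packaged two ways on 100 job slots.
long_tests = regression_hours(num_tests=50, hours_per_test=2.0, num_slots=100)
short_tests = regression_hours(num_tests=100, hours_per_test=1.0, num_slots=100)

print(long_tests, short_tests)  # prints "2.0 1.0"
```

With 50 tests, half of the slots never do anything and the regression takes 2 hours. With the work split into 100 shorter tests, every slot is busy and it finishes in 1 hour.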

The same thing happens if we have tests that are much longer than the others. At some point, it's only going to be these tests that are running, while the others are already done. We're now in the same situation as before, where we use fewer job slots than we potentially could.

[Figure: long_test1]

I'll give you an example of how a long test ruined it for me while I was running the final regressions on my last project. I had a problem with the compute farm and had to restart the regression. Almost all of my tests were done in less than 8 hours, but someone wrote one that was taking about 12 hours. When I started the regression again this test probably got submitted to a slower machine, because it took 20 hours or more to complete. This caused me to miss a whole day. Now, what was that test doing? It was checking that all sectors in our non-volatile memory were writable. From the description alone you can immediately ask "why not write a test per sector then?". Each per-sector test would have needed only a fraction of the original simulation time (roughly the original time divided by the number of sectors), and this would have sped up the regression significantly.
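Here's a small Python sketch of why one long test dominates the regression and what splitting it buys. It assumes a simple greedy scheduler that hands each test, in submission order, to whichever slot frees up first. The numbers (4 slots, a dozen 1-hour tests, a 12-hour test that splits evenly into 12 per-sector tests) are hypothetical and only loosely modeled on the story above.

```python
import heapq


def makespan(durations, num_slots):
    """Finish time of a regression under greedy scheduling.

    Each test, in submission order, goes to the slot that frees up first.
    """
    slots = [0.0] * num_slots  # time at which each slot becomes free
    heapq.heapify(slots)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(slots)  # earliest available slot
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish


short_tests = [1.0] * 12

# One 12-hour "all sectors" test, submitted first, plus the short tests.
with_long_test = makespan([12.0] + short_tests, num_slots=4)

# Same work, but the long test is split into 12 one-hour per-sector tests.
with_split_test = makespan([1.0] * 12 + short_tests, num_slots=4)

print(with_long_test, with_split_test)  # prints "12.0 6.0"
```

In the first case three of the four slots are idle from hour 4 onward while the long test grinds on. In the second case all slots stay busy and the regression finishes in half the time, even though the total amount of simulation is identical.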

 

Short tests are easier to debug

Regressions are there to weed out failures in the DUT. Should a failure occur, short, focused tests are easier to debug. Imagine running a test that takes one hour and we find an issue at the 50-minute mark. Before we can even begin to analyze it, we'll need to first run up to that point in time with all debug knobs on maximum, which is going to make the simulation take even longer. After reaching that point we have to analyze it, possibly together with the designer, and make a change in either the DUT or in the testbench if it turns out the issue wasn't a bug. Regardless of what we need to change, there's a pretty good chance that the patch won't work on the first try, which means we'll need to repeat the cycle all over again. We can lie to ourselves all we want and say that we'll handle something else while this test is re-running, but modern science tells us that humans aren't as good at multitasking as we think, so we'll be wasting effort by not being fully focused on the task at hand. It's either that or we're going to start surfing the Internet.

Even if we do decide to go down this route, good luck running your simulation for a long time with full waves. We're going to say hello to our friend, the simulator crash. Then again, we can run up to a certain point without full debug, turn it on and run to the point of the failure. At least this way it won't crash, but what will we do if we have to trace the problem backwards and we run out of waves? We'll have to adjust the time when we turn on debugging over and over again until we reach an appropriate tradeoff. That sounds like wasted time to me...

If we're clever, we're going to try to isolate the issue in a short test and do our debugging on that. But, if a short test could have found this issue, why didn't we write it like this in the first place?

 

Short tests are easier to maintain

I deliberately used the word "short" so I can misuse it in this section. We can also refer to length in terms of lines of code. Granted, there are multiple factors that affect maintainability, but less code is easier to maintain because there are fewer opportunities for things to go wrong. What's more likely to be buggy, a "Hello world" program or a big testbench containing hundreds of classes and methods?

 

Long tests aren't inherently evil, but...

This doesn't mean that long tests aren't useful. Some issues might still lurk in the deep dark crevices of the design and it might only be possible to hit these by doing a very complicated sequence of operations. Other issues can only be found when the planets are in some special alignment. What I'm advocating is to write tests that are as long as necessary and not longer. Other times it might be very useful to have a test that simply stresses the design over a long period, but is this really the kind of test that you need to run in a nightly regression? Most probably not, since it's highly unlikely that it'll find anything new.

I for one will stick to my short tests that are easy to develop, debug, parallelize and maintain. What about you? Don't hesitate to share your thoughts on the topic in the comments section.

2 comments:

  1. Let me add an addendum. If you are going to write a lot of short tests, don't write a host of tests that are identical with only a few minor changes. That's a maintenance nightmare. Instead, write one powerful (short) test that randomizes a lot. Then run that same test with many different seeds and allow parameter settings to be made from the command line.

    Too many times I've seen a bunch of short tests that all needed to change because the original test they all copied was garbage. I had to make the same fix over and over again.

    1. Agreed! I actually wanted to touch on this topic in a future post: constrained random as a test automation tool.
