Sunday, March 29, 2020

Favor Composition Over Inheritance - Even for Constraints

Simulation is currently the dominant functional verification technique, with constrained random verification the most widely used methodology. While producing random data is a big part of it, letting the solver blindly generate stimulus isn't going to be very efficient. Constraints are needed to guide the stimulus toward interesting scenarios.

A good constrained random test suite contains a mixture of tests with varying degrees of randomness. This is achieved by progressively adding constraints to tests to reduce their randomness. This is best explained with an example.

Let's assume we're verifying a device that can handle read and write accesses to locations in its address map. These accesses can either be done in secure mode or in non-secure mode. We model an access using a UVM sequence item:

class sequence_item extends uvm_sequence_item;

  typedef enum {
    READ,
    WRITE
  } direction_e;

  typedef enum {
    SECURE,
    NONSECURE
  } sec_mode_e;

  rand direction_e direction;
  rand bit [31:0] address;
  rand sec_mode_e sec_mode;

  // ...
endclass

Not all accesses are legal; illegal accesses would be rejected by the device.

Only certain address ranges are mapped, while accesses to unmapped addresses are illegal. If we were to write a test that only accesses mapped addresses, we would have to add the following constraints to generated items:

constraint only_mapped_addresses {
  address inside {
      [CODE_START_ADDR:CODE_END_ADDR],
      [SRAM_START_ADDR:SRAM_END_ADDR],
      [PERIPHERAL_START_ADDR:PERIPHERAL_END_ADDR] };
}

Our device also only allows writes to aligned addresses. For a 32-bit bus, this would mean that the lowest two address bits have to be 0:

constraint only_writes_to_aligned_addresses {
  direction == WRITE;
  address[1:0] == 0;
}

Lastly, certain ranges of our device's address map are restricted to secure code. Let's assume that the address map is split into 16 regions of 256 MB each. Within each of these regions, the lower half is reserved for secure accesses. This means that bit 27 of the address is always 0 for a secure access:

constraint only_secure_accesses_to_lower_half_of_range {
  sec_mode == SECURE;
  address[27] == 0;
}

The test suite for this device would contain a random test where unconstrained items are generated. One test would be directed toward generating accesses to mapped addresses, another would only perform writes to aligned addresses, while yet another would perform only secure accesses. At the same time, we would need tests that lie at the intersections of the three features, so we would want the pairwise combinations: aligned writes to mapped addresses, aligned writes in secure mode and secure accesses to mapped addresses. Finally, we also need a test that combines all three and only does aligned secure writes to mapped addresses.

(While a real test suite would definitely need a lot more classes of tests, this post isn't focused on verification planning, but on the mechanical aspects of implementing a robust constraint management strategy, so please ignore the simplicity of the example.)

It might be the case that these behaviors will get tweaked over time as the project moves forward or as a new derivative of the device is developed. The address map might change as regions are moved, removed or resized, or new regions are added. The bus width might change, which would change which addresses are aligned, or we could get a feature request to implement writes of other granularities (e.g. half-word). The definition of secure regions could also change or they could become configurable via special function registers. Any of these changes should be easy to handle and shouldn't require massive changes to the verification code.

Let's skip the obvious idea of putting all constraints into the sequence item class and activating/deactivating them selectively based on the test. This won't scale for real projects, where we would have many more constraints, which would make the code unreadable.
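(For reference, that rejected approach would boil down to per-instance constraint_mode(...) calls; a minimal sketch, where item is a handle to a generated sequence item:)

// The fully random test would have to switch every block off on each item:
item.only_mapped_addresses.constraint_mode(0);
item.only_writes_to_aligned_addresses.constraint_mode(0);
item.only_secure_accesses_to_lower_half_of_range.constraint_mode(0);

// The test doing aligned writes to mapped addresses would only disable the
// secure mode constraint:
item.only_secure_accesses_to_lower_half_of_range.constraint_mode(0);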

Using mixins

Quite a while back I wrote about how to use mixins to manage constraints.

The mixin approach is flexible, because it allows us to handle each aspect individually. Instead of having all constraints in a single class, we can have one mixin for each constrained feature.

We need one for mapped addresses:

class only_mapped_addresses_mixin #(type T = sequence_item) extends T;

  constraint only_mapped_addresses {
    address inside {
        [CODE_START_ADDR:CODE_END_ADDR],
        [SRAM_START_ADDR:SRAM_END_ADDR],
        [PERIPHERAL_START_ADDR:PERIPHERAL_END_ADDR] };
  }

  // ...
endclass

We also need one for writes:

class only_legal_writes_mixin #(type T = sequence_item) extends T;

  constraint only_writes_to_aligned_addresses {
    direction == WRITE;
    address[1:0] == 0;
  }

  // ...
endclass

Finally, we need a mixin for secure accesses:

class only_legal_secure_accesses_mixin #(type T = sequence_item) extends T;

  constraint only_secure_accesses_to_lower_half_of_range {
    sec_mode == SECURE;
    address[27] == 0;
  }

  // ...
endclass

Assuming that we have a random test that starts regular sequence items, we would use these mixins to write our more directed tests by replacing the original sequence item type with one that has the relevant constraints mixed in.

The test that only accesses mapped addresses would do the following factory override:

class test_mapped_addresses extends test_all_random;

  protected virtual function void set_factory_overrides();
    sequence_item::type_id::set_type_override(
        only_mapped_addresses_mixin #(sequence_item)::get_type());
  endfunction

  // ...
endclass

The other two feature tests would look similar, but would use their respective mixins.

To only perform writes to mapped addresses we would need to chain the two mixins:

class test_legal_writes_to_mapped_addresses extends test_all_random;

  protected virtual function void set_factory_overrides();
    sequence_item::type_id::set_type_override(
        only_legal_writes_mixin #(only_mapped_addresses_mixin #(sequence_item))::get_type());
  endfunction

  // ...
endclass

Of course, we would do the same to handle the other two pairs.

Similarly, we could use the same principle to combine all three features:

class test_legal_writes_to_mapped_addresses_in_secure_mode extends test_all_random;

  protected virtual function void set_factory_overrides();
    sequence_item::type_id::set_type_override(
        only_legal_writes_mixin #(
            only_mapped_addresses_mixin #(
                only_legal_secure_accesses_mixin #(sequence_item)))::get_type());
  endfunction

  // ...
endclass

The mixin approach comes with some issues, though.

Constraints are always polymorphic, so we have to be very careful to use unique constraint names across all mixins. Applying two different mixins that use the same constraint name would result in only the outer mixin's constraints being applied, because it would override the constraint defined in the inner mixin. It's very easy to run into this issue when using copy/paste to define a new mixin and forgetting to change the name of the constraint. Frustration will follow, as the code looks right, but leads to unexpected results. Moreover, the more mixins are used in the code base, the easier it is for constraint name collisions to happen.
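As an illustration (with made-up mixin names), here is how easily this can happen:

class some_mixin #(type T = sequence_item) extends T;
  constraint c { address[1:0] == 0; }
endclass

class some_other_mixin #(type T = sequence_item) extends T;
  constraint c { direction == WRITE; }  // same name as in some_mixin
endclass

In some_other_mixin #(some_mixin #(sequence_item)), the outer c overrides the inner c, so the address alignment constraint is silently dropped.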

Chaining of mixins is not particularly readable. It is bearable for one or two levels, but the more levels there are, the worse it's going to get.

Finally, using mixins will cause the number of types in our code to explode. Each mixin applied on top of a class creates a new type. From a coding standpoint this isn't such a big deal, as we won't be referencing those types directly. The more types we have, though, the longer our compile times are going to get. Also, note that for the compiler mixin1 #(mixin2 #(some_class)) is a distinct type from mixin2 #(mixin1 #(some_class)), regardless of whether it results in the "same" class. It's very easy to use mixin1 #(mixin2 #(some_class)) in one test, but mixin3 #(mixin2 #(mixin1 #(some_class))) in another, which makes the compiler "see" an extra type: mixin2 #(mixin1 #(some_class)) in addition to mixin1 #(mixin2 #(some_class)).

The mixin pattern uses inheritance, which doesn't match the call to action in the post title, so obviously we're not going to stop here.

Using aspect oriented programming

It's much easier to write our test family using aspect oriented programming (AOP). AOP allows us to alter the definition of a class from a different file. Even though SystemVerilog doesn't support AOP, I'd still like to show an example in e, as it can provide us with some hints on how we could improve the mixin-based solution.

(Please note that the following code may not be idiomatic, so don't take it as a reference on how to handle constraints in e.)

Our sequence item definition would look similar:

<'
struct sequence_item like any_sequence_item {

  direction: direction_e;
  address: uint(bits: 32);
  sec_mode: sec_mode_e;

};
'>

In our test that only does mapped accesses, we would tell the compiler to add the constraint to the sequence item:

<'
import test_all_random;

extend sequence_item {
  keep address in [CODE_START_ADDR..CODE_END_ADDR] or
      address in [SRAM_START_ADDR..SRAM_END_ADDR] or
      address in [PERIPHERAL_START_ADDR..PERIPHERAL_END_ADDR];
};
'>

This does not result in a new type. It tweaks the existing sequence_item type for the duration of that test.

If we would like to reuse the constraint in the test that only writes to mapped addresses, we could put the extension into its own file. We could do the same for the other extensions that handle the other features. This would allow each test to load the relevant extension files. For example, for legal writes to mapped addresses we would have:

<'
import test_all_random;
import constraints/only_legal_writes;
import constraints/only_mapped_addresses;
'>

The file structure is similar to what we had when we used mixins, but the code is much cleaner.

Pay special attention to the natural language description of what we are doing: in test_mapped_addresses we are adding the constraint to the sequence_item type.

Using constraint objects

Regular object oriented programming doesn't allow us to change type definitions. What we can do, however, is build our code in such a way as to allow it to be extended when it is being used.

Back in 2015, there was an exciting DVCon paper that presented how to add constraints using composition. It showed how to add additional constraints to an instance of an object without changing the type of that object. This is done by encapsulating the constraints into their own objects which extend the behavior of the original object's randomize() function. Have a quick look at the paper before proceeding, to understand the exact way this is done.

While the paper shows how to add constraints to object instances, we can extend the approach to add constraints globally, to all instances of a type. If we look back at the AOP case from before, this would be conceptually similar to what we were doing there. We would be emulating the addition of constraints to the sequence_item type.

The paper makes an attempt at global constraints in its final section, by using the UVM configuration DB. While that approach works, I feel that it is not expressive enough. A better API, consisting of a static function to add constraints globally, would make the code much more readable than a very verbose config DB set(...) call.

To get the extensibility we want, we have to set up the necessary infrastructure for it. If the sequence item class is under our control, we can modify it directly. Alternatively, if the sequence item is part of an external UVC package, we can define a sub-class which contains the necessary code.

We'll assume that sequence_item can't be changed and we'll create a new constrained_sequence_item class. We would either use this sub-class in our sequences directly or use a factory override.
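If we go the factory route, the override is the usual UVM call (assuming the rest of the environment creates its items through the factory):

sequence_item::type_id::set_type_override(
    constrained_sequence_item::get_type());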

To execute code that affects all instances, the sequence item class needs a static function through which constraints are added:

class constrained_sequence_item extends sequence_item;

  static function void add_global_constraint(abstract_constraint c);
    // ...
  endfunction

  // ...
endclass

The abstract_constraint class would be the base class for our constraints and would provide us with a reference to the object that is being randomized:

virtual class abstract_constraint;

  protected sequence_item object;

  function void set_object(sequence_item object);
    this.object = object;
  endfunction

endclass

The code to handle global constraints is similar to the one presented in the paper. We store all global constraints in a static array:

class constrained_sequence_item extends sequence_item;

  local static rand abstract_constraint global_constraints[$];

  static function void add_global_constraint(abstract_constraint c);
     global_constraints.push_back(c);
  endfunction

  // ...
endclass

Before randomizing a sequence item instance, we have to set up the constraint objects to point to it:

class constrained_sequence_item extends sequence_item;

  function void pre_randomize();
    foreach (global_constraints[i])
      global_constraints[i].set_object(this);
  endfunction

  // ...
endclass

With the infrastructure set up, we can move on. We encapsulate the constraints for our features into their own constraint classes:

class only_mapped_addresses_constraint extends abstract_constraint;

  constraint c {
    object.address inside {
        [CODE_START_ADDR:CODE_END_ADDR],
        [SRAM_START_ADDR:SRAM_END_ADDR],
        [PERIPHERAL_START_ADDR:PERIPHERAL_END_ADDR] };
  }

endclass

class only_legal_writes_constraint extends abstract_constraint;

  constraint c {
    object.direction == sequence_item::WRITE;
    object.address[1:0] == 0;
  }

endclass

class only_legal_secure_accesses_constraint extends abstract_constraint;

  constraint c {
    object.sec_mode == sequence_item::SECURE;
    object.address[27] == 0;
  }

endclass
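Before wiring these into tests, it's worth seeing the mechanism end to end: because the constraint objects sit in a rand queue inside the item, a plain randomize() call on the item solves its own constraints together with those of every added constraint object. A rough usage sketch (assuming constrained_sequence_item has the usual UVM-style constructor):

only_mapped_addresses_constraint c = new();
constrained_sequence_item item = new("item");

constrained_sequence_item::add_global_constraint(c);

// pre_randomize() points c at item, so the solver treats the constraint in c
// as if it had been written inside the item itself.
if (!item.randomize())
  `uvm_error("RANDERR", "Randomization failed")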

In the test that only accesses mapped addresses we would make sure to add the required constraints:

class test_mapped_addresses extends test_all_random;

  protected virtual function void add_constraints();
    only_mapped_addresses_constraint c = new();
    constrained_sequence_item::add_global_constraint(c);
  endfunction

  // ...
endclass

The add_constraints() function should be called before any sequence items are started. A good place to call it from is the end_of_elaboration_phase(...) function.
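For example, the base random test could provide the hook and call it at the right time. A sketch, assuming test_all_random derives from uvm_test and with the rest of the class elided:

class test_all_random extends uvm_test;

  virtual function void end_of_elaboration_phase(uvm_phase phase);
    super.end_of_elaboration_phase(phase);
    add_constraints();
  endfunction

  // Empty by default; feature tests override it to add their constraints.
  protected virtual function void add_constraints();
  endfunction

  // ...
endclass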

In the other feature oriented tests we would simply add their respective constraints.

For the test that does writes to mapped addresses we just need to make sure that both constraints are added. We could do this by extending the random test and making two add_global_constraint(...) calls, one for each constraint object:

class test_legal_writes_to_mapped_addresses extends test_all_random;

  protected virtual function void add_constraints();
    only_legal_writes_constraint c0 = new();
    only_mapped_addresses_constraint c1 = new();
    constrained_sequence_item::add_global_constraint(c0);
    constrained_sequence_item::add_global_constraint(c1);
  endfunction

  // ...
endclass

We could also extend the test that only does legal writes and add the constraints for mapped addresses:

class test_legal_writes_to_mapped_addresses extends test_legal_writes;

  protected virtual function void add_constraints();
    only_mapped_addresses_constraint c = new();
    super.add_constraints();
    constrained_sequence_item::add_global_constraint(c);
  endfunction

  // ...
endclass

Of course, this approach can be used to handle all combinations of constraints.

Adding constraints dynamically has the same advantages as the mixin approach we looked at earlier.

It doesn't suffer from the same readability issue, because we don't rely on long parameterization chains. It is a bit verbose, though, due to the multiple add_global_constraint(...) calls; this could be improved by adding a variant of the function that accepts a list of constraint objects.
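Such a variant isn't part of the code shown so far, but it could look something like this, sitting next to add_global_constraint(...) inside constrained_sequence_item:

// Hypothetical convenience function that takes a whole list of constraints.
static function void add_global_constraints(abstract_constraint constraints[]);
  foreach (constraints[i])
    add_global_constraint(constraints[i]);
endfunction

A test would then be able to add all of its constraint objects in a single add_global_constraints(...) call.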

This approach also avoids the type explosion issue that mixins have and is potentially faster to compile.

There is a bit of boilerplate code required for the infrastructure. This can be extracted into a reusable library.

The first thing we need to do is to make the abstract constraint class parameterizable:

virtual class abstract_constraint #(type T = int);

  protected T object;

  function void set_object(T object);
    this.object = object;
  endfunction

endclass

The package should expose a macro to handle the constraint infrastructure:

`define constraints_utils(TYPE) \
  static function void add_global_constraint(constraints::abstract_constraint #(TYPE) c); \
  // ...

There was a subtle issue with the simplistic infrastructure code we looked at before. It wasn't able to handle randomization of multiple instances at the same time (for example, when randomizing an array of sequence items). As this is a more exotic use case, the problem won't show up immediately. It's a simple fix to make, but it would be very annoying to have to make it in multiple projects. Even though the code might look deceptively simple and make us think it's not worth the hassle of putting it into its own library, doing so makes it easier to implement and propagate fixes for such issues.
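To make the problem concrete, this is the kind of scenario that trips up the simple version (a hypothetical container class, assuming the usual UVM-style constructor; not taken from the library):

class burst;
  // All four items are solved in a single randomize() call on the burst.
  rand constrained_sequence_item items[4];

  function new();
    foreach (items[i])
      items[i] = new("item");
  endfunction
endclass

pre_randomize() runs on every item, and each call points the shared constraint objects at a different instance, so by the time the solver runs they only reference the last item.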

The macro makes the definition of constrained_sequence_item much cleaner:

class constrained_sequence_item extends sequence_item;

  `constraints_utils(sequence_item)

  // ...
endclass

You can find the constraints library on GitHub.

It already supports instance and global constraints. What I would like to add is the possibility to add constraints to all items under a UVM scope, similar to what the paper presents at the end, but using a nicer API that doesn't require the user to do any UVM config DB calls.

I also want to look at the possibility of magically generating all test combinations, given a list of constraint objects. So far we had to enumerate all combinations of constraints by writing a test class for each, which is very repetitive. It would be great if we could get this automatically and save ourselves the typing. This is something I'll definitely look at in a future post.

In the meantime you can have a look at the full example code on GitHub and try it out yourselves. I hope it inspires you to write flexible test suites that help you reach your verification goals faster.

Sunday, February 9, 2020

Bigger Is Not Always Better: Builds Are Faster with Smaller Packages

One trend over the past few years is that the projects I've been working on tend to get bigger and more complicated. Bigger projects come with new challenges. Among these are the fact that it's much more difficult to keep the entire project in one's head, the need to synchronize with more developers because team sizes grow, a higher risk of having to re-write code because of poorly understood requirements or because some requirements change, and many more.

There's one thing, though, that crept up on me: compile times get much longer. While this doesn't sound like a big deal, I've found that long build times are an absolute drain on productivity. I use SVUnit a lot, so I'm used to having a very short path between making a change and seeing the effects of that change. Ideally, there should be no delay between starting the test script and getting the result. A delay of a couple of seconds is tolerable. More than 10 seconds becomes noticeable. After exceeding the one minute mark, the temptation to switch to something else (like the Internet browser) becomes very high. This slowdown happens gradually, with each new class that is added, and it steadily chips away at development speed.

In this post I'd like to talk about compilation. This topic has a tendency to be trivialized and underestimated, even more so in the hardware industry, where it's common to have design flows already set up to deal with this process.

Full vs. incremental builds

A build is the process of taking the source code and producing an executable object. When talking about builds, there are two terms we need to be familiar with: full builds and incremental builds. A full build is performed when there isn't any build output, which requires the entire source code to be built. This is either the case when starting in a new workspace (for example, after cloning the source repository) or after deleting the previous build output. An incremental build only builds the parts of the source code that have changed since the previous build. Because only parts of the project are rebuilt in this case, this means that, generally, the process is faster.

We'll limit our discussion about builds to SystemVerilog packages, though the same concepts also apply to modules and interfaces.

Let's say we have two packages, package0 and package1, which we use in our verification environment:

// File: package0.sv

package package0;

  class some_base_class;
  endclass

endpackage

// File: package1.sv

package package1;

  import package0::*;

  class some_derived_class extends some_base_class;
  endclass

endpackage

Compiling these two packages using an EDA tool is pretty straightforward:

compiler package0.sv package1.sv

The very first time we run this command, the tool will parse the two source files and generate the data structures it uses to represent the compiler output. Since we didn't have any build output when we ran the command, we were performing a full build.

If we add a new class to package1 and run the compiler again, we will be performing an incremental build:

package package1;

  import package0::*;

  class some_derived_class extends some_base_class;
  endclass

  class some_other_class;
  endclass

endpackage

The tool will only recompile package1. It won't touch package0, since it didn't change. If compilation for package0 takes a lot of time, this will save us that time.

A deeper dive into SystemVerilog compilation

Before we continue with our main discussion, it makes sense to look a bit deeper into how SystemVerilog compilation works. Before I investigated this topic I had some misplaced ideas, which I would like to dispel.

I have only ever really looked at the behavior of one particular EDA tool, but I assume that other simulators behave similarly, as they all have a common heritage. Some SystemVerilog tools differentiate between compilation and elaboration. These definitions depend on the tool you're using. I've seen compilation used to mean parsing the code and generating syntax trees. Elaboration takes these syntax trees and generates executable code that is run in the simulator. I'll use the term compile to mean both of these steps.

Let's start small, with a single package that contains only one class:

package package0;

  class some_base_class;
  endclass

endpackage

After we compile the package, we will have performed a full build. Now, let's add another class to the package:

package package0;

  class some_base_class;
  endclass

  class some_base_class2;
  endclass

endpackage

In this case, you'll notice that the tool compiles both classes. I'm a bit cautious about posting the log files and about showing exactly how I can tell that it's compiling both classes. Some tools make this easier to see than others. One clear sign is that the compile takes longer. You can try it out by adding more and more classes and recompiling. I've created some scripts that can help out with such experiments: https://github.com/verification-gentleman-blog/smaller-packages/.

In this case, an incremental compile takes about as much time as a full build, which suggests that nothing is being reused from previous build attempts. Even if we only add classes, the build output for previously built classes is discarded.

What did we learn from this? That tools only organize build output using packages as their most granular unit. Changes within packages are "lost", from an incremental build point of view.

You could argue that from the previous experiment we could infer that tools organize build output based on files. If we were to put each class in its own file and `include those files in the package, then the tool might somehow be able to behave differently. This isn't the case, though. `include directives are handled by the pre-processor, which interleaves all of the files together and gives the compiler one big file with all the class definitions inline (the situation we had previously).
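To make this concrete, splitting the classes into their own files changes nothing from the compiler's perspective (hypothetical file names):

// File: some_base_class.svh
class some_base_class;
endclass

// File: some_base_class2.svh
class some_base_class2;
endclass

// File: package0.sv
package package0;
  `include "some_base_class.svh"
  `include "some_base_class2.svh"
endpackage

After pre-processing, the compiler still sees a single package0 with both class definitions inline, exactly as if they had been typed into one file.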

We can do another experiment to convince ourselves that builds aren't organized by files. Let's put two packages inside the same file:

package package0;
endpackage

package package1;
endpackage

Let's modify package1 by adding a new variable:

package package0;
endpackage

package package1;
  bit some_var;
endpackage

When rebuilding, we'll notice that only package1 gets rebuilt, but package0 is left alone. (This is also the behavior we would have liked to have for classes inside a package.)

Now let's also modify package0 by adding a variable to it:

package package0;
  bit some_var;
endpackage

package package1;
  bit some_var;
endpackage

When rebuilding, we'll see that package0 is being rebuilt, as we expected, but, surprisingly, so is package1. This is very confusing initially, but obvious once you know the explanation. Because we shifted the lines where package1 and its items are defined in the file, the tool has to update debug information regarding line numbers. This is important for debuggers and for messages that contain line numbers (like assertion errors, $info(...) calls, etc.). This, by the way, is a very good reason to only define one element (package, interface, module) per file.

Let's look at one more thing. Let's take two packages that have a dependency relationship:

package package0;

  class some_base_class;
  endclass

endpackage

package package1;

  import package0::*;

  class some_derived_class extends some_base_class;
  endclass

endpackage

It's clear that changes to package1 shouldn't (and indeed won't) cause rebuilds of package0. It's also clear that changing some_base_class will have to trigger a rebuild of package1. Now, let's add a new class to package0:

package package0;

  class some_base_class;
  endclass

  class some_base_class2;
  endclass

endpackage

At this point, we shouldn't be surprised anymore that both packages are rebuilt in this case. This is because the tool only understands changes at the package level. package1 depends on package0, so any change to package0 will lead to a rebuild of package1, regardless of whether the rebuild is really needed. Unfortunately, this isn't the behavior we would like to have.

Contrast the way SystemVerilog builds work to C++, where files are compiled individually and linked together in a separate step (a gross over-simplification). Changes to one class don't cause recompiles of other classes in the same namespace, if the two classes are unrelated. This is because C++ classes are split between the header (which declares which functions a class provides) and the implementation (which contains the function bodies). A class that depends on another class includes its header, to let the compiler know that it relies on the other class. Only changes to a class's header cause recompiles of dependent classes, while changes to its implementation don't. Because of this setup, C++ builds are much more powerful when it comes to build avoidance, rebuilding only the parts that absolutely have to be rebuilt. This allows for guidelines stating that incremental builds should take 5-10 seconds and that full builds (including tests) should take 1-2 minutes (according to http://www.bitsnbites.eu/faster-c-builds/), numbers which are incredibly low by SystemVerilog standards, where merely starting the simulator takes a double digit number of seconds.

Case study

The classic testbench structure for an IP block consists of one or more interface verification components (IVCs), that contain code related to the signal protocols used by the design, and one module verification component (MVC), that contains code for aspects related to the design functionality.

IVCs typically consist of a package and one or more interfaces. We don't usually make changes to the IVCs, so once we've built them via a full build, they won't have any impact on subsequent incremental builds.

Most of our work is focused on the MVC. As we've seen above, if we place our MVC code into one package, then any change we make to it will trigger a new build, because of the way SystemVerilog tools handle incremental builds. This isn't going to be very efficient, as an incremental build of the package after each change will take about as long as a full build.

What would happen if we could split our big MVC package into multiple smaller packages?

It's experiment time again! We'll assume that we can split the code such that building each package takes the same amount of time. We'll also ignore any extra costs from building multiple packages instead of one single package. This means that if an incremental build of the entire mvc package would have taken N seconds, then after splitting it into P packages each of the smaller packages would take N/P seconds to build. We'll also assume that we are just as likely to make changes to any of the smaller packages. This means that the probability to change any package is 1/P.

Let's assume that we can create two independent packages, p0 and p1. We can misuse UML to visualize the package topology:

[Image: UML package diagram showing p0 and p1 with no dependency between them]

Any change we make to p0 won't cause rebuilds of p1 and vice versa. We can compute the average incremental build time in this case. A change to either package requires rebuilding only that package, which takes N/2 seconds, and each package is changed half of the time. The average incremental build time is therefore N/2 * 1/2 + N/2 * 1/2 = N/2. By splitting the code into two independent packages, we've managed to halve our incremental build time. It's not very realistic, though, that we could manage such a split on a real project.

Let's have a look at something closer to reality. Let's assume that we can split our MVC into two packages, p0 and p1, but p1 depends on p0:

[Image: UML package diagram showing p1 depending on p0]

An incremental build of p1 would still take only N/2 seconds, because changing anything in p1 doesn't have any effect on p0. A change in p0 would mean that we also have to rebuild p1, which means that it would take N/2 + N/2 = N seconds. On average, we would need N/2 * 1/2 + N * 1/2 = 3/4 * N seconds.

We should try to structure our code in such a way as to increase the number of packages without any dependencies on each other. Let's say we can split p1 from the previous example into two independent packages, p1_0 and p1_1:

[Image: UML package diagram showing p1_0 and p1_1, each depending on p0, with no dependency between them]

In this case, changing anything in either p1_0 or p1_1 would take N/3 seconds. A change in p0 would require all three packages to be rebuilt and would take the full N seconds. On average, a change would take N/3 * 1/3 + N/3 * 1/3 + N * 1/3 = 5/9 * N seconds.

We could go on further with deeper package hierarchies, but I think you get the idea.

MVC code lends itself nicely to such a structure. We typically have some "common" code that models the basics of our DUT, on top of which we can model different higher-level aspects relating to the features of the DUT. We would use our models inside checks or coverage, which could be built independently of each other:

[Image: UML package diagram showing a common modeling package, with feature-specific model packages on top of it and independent check and coverage packages on top of those]

Conclusions

Splitting code across multiple packages will generally be better for compilation speed. There are also other advantages. It could make the code base easier to understand, by grouping code by theme (code for X goes in package p_x, code for Y goes in package p_y). It could also make development easier, by allowing developers to specialize in only a handful of the packages, instead of having to deal with the entire code base.

Having to manage multiple packages brings its own set of challenges, though. It could make the code base more difficult to understand if the boundaries between packages are arbitrary (where does code for X go, in p0 or p1?). More packages, especially when they have intricate dependency relationships, also make compilation more difficult to set up.

I'm not going to recommend making one package per class just to improve build times. Ideally, SystemVerilog compilers should evolve to better handle incremental compilation, by working at levels lower than just packages. At the same time, you should care about turnaround time, so dumping all code into one package shouldn't be your default mode of operation.