Testing and Code Coverage

2022-08-31

Software should be tested regularly throughout the entire development cycle, from the first written lines of code through production releases, to ensure correct operation. Thorough testing is often treated as an afterthought, but it is essential for ensuring that changes in one part of the code do not break other parts. This holds for projects of all sizes and for teams of one to hundreds.

With all of the test types described below, keep in mind the difference between “expected behavior” and “accurate returns.” In scientific code, we expect an operation like 2 + 2 to return 4, which is an “accurate return” (and also the “expected behavior” in this case). But if you pass in something like “Waffle” + 2, the “expected behavior” could be that an error is thrown, that the 2 is cast to a string and appended, or whatever else you have decided should happen. In such a case, the code is behaving as expected even though it is not returning scientifically accurate information. Unexpected behavior would be “Waffle” + 2 not throwing an error when it should, or calling some routine you did not expect.
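To make the distinction concrete, here is a minimal Python sketch. The `add` helper and the test names are hypothetical and exist only for illustration. In Python, adding a string and an integer raises `TypeError`, so that is the expected behavior the second test checks for; another language or design could legitimately choose differently.

```python
def add(a, b):
    """Hypothetical helper: return a + b using Python's own rules."""
    return a + b


def test_accurate_return():
    # Accurate return, which is also the expected behavior here.
    assert add(2, 2) == 4


def test_expected_error():
    # Expected behavior for mismatched types: Python raises TypeError.
    try:
        add("Waffle", 2)
    except TypeError:
        return  # the error we expected was raised
    raise AssertionError("add('Waffle', 2) should have raised TypeError")
```

When running under PyTest, the try/except pattern is usually written as `with pytest.raises(TypeError):` instead; it is spelled out here so the example has no dependencies.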

Two main types of testing are strongly encouraged:

Unit tests – Small tests designed to check that individual functions or operations behave correctly. This means that not only do the operations run without unexpected errors, but also that they behave as expected with the given inputs, even if that behavior is to throw or raise an error. Unit tests should be added whenever new features/code are added, ensuring that the code is covered by tests.
Regression tests – Tests that check whether, given a known input, the software correctly and consistently returns the correct values, even after changes to the code. These tests can take longer to run than unit tests, as they often require running the whole code base or large parts of it. Regression tests can also cover previously fixed bugs; this second use is the more common definition outside of scientific code bases. Ensuring that old bugs are not reintroduced during development helps you and others avoid repeating the same mistakes.
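As a rough illustration of the two test types above, the sketch below tests a hypothetical `kinetic_energy` function. The function, the test names, and the `REFERENCE` table are all assumptions made for this example; in a real project the reference values would typically be saved from a trusted earlier run rather than typed in by hand.

```python
import math


def kinetic_energy(mass, velocity):
    """Hypothetical function under test: classical kinetic energy, 0.5 * m * v**2."""
    if mass < 0:
        raise ValueError("mass must be non-negative")
    return 0.5 * mass * velocity**2


# Unit tests: each one checks a single small behavior.
def test_known_value():
    assert math.isclose(kinetic_energy(2.0, 3.0), 9.0)


def test_negative_mass_raises():
    # Raising an error is the expected behavior for invalid input.
    try:
        kinetic_energy(-1.0, 3.0)
    except ValueError:
        return
    raise AssertionError("negative mass should raise ValueError")


# Regression test: compare current results against stored reference results.
# REFERENCE stands in for values saved from a trusted earlier run.
REFERENCE = {
    (2.0, 3.0): 9.0,
    (1.0, 0.0): 0.0,
    (4.0, 0.5): 0.5,
}


def test_matches_reference():
    for (mass, velocity), expected in REFERENCE.items():
        result = kinetic_energy(mass, velocity)
        assert math.isclose(result, expected, abs_tol=1e-12)
```

If a later change to `kinetic_energy` alters any of the reference results, the regression test fails and flags the change for review, even if every unit test still passes.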

A third type of testing is recommended depending on how many dependents your code has, i.e., “How many other places is your code used?”

Integration tests – Wide-spanning tests that check that subsystems behave correctly together, including within other environments such as specific hardware or other packages. You likely won’t have to think about these until your package is included within other packages.

Code coverage measures how many code paths your tests touch, out of all the possible paths through your code. It is often reported as “percent of lines executed relative to total lines of code,” but the more informative reading is “what percent of decision branches are executed,” since multiple choices/conditions can lead to the same line being executed. There is no hard and fast rule for what percent you should aim for. 90-95% is a good target, especially early on. 100% is admirable, but often not realistically attainable, especially when working with sprawling code, since touching every line with tests can require such esoteric inputs that the calculations become unreasonable. However, every code base is different, and you should still strive for 100% coverage until you hit the limit of sane programming.
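The difference between counting lines and counting branches can be seen in a small sketch. The `clamp_to_zero` function and both tests are hypothetical examples, not taken from any particular code base.

```python
def clamp_to_zero(x):
    """Hypothetical example: replace negative values with zero."""
    result = x
    if x < 0:
        result = 0
    return result


def test_negative_only():
    # Executes every line of clamp_to_zero (100% line coverage),
    # but never takes the path where the `if` condition is False.
    assert clamp_to_zero(-5) == 0


def test_non_negative():
    # Covers the remaining branch: the `if` body is skipped.
    assert clamp_to_zero(3) == 3
```

With only `test_negative_only`, a line-based report would show the function as fully covered even though half of its decision outcomes were never exercised. Tools such as coverage.py can report branch coverage in addition to line coverage when asked to (for example with its `--branch` option).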

It’s also important to understand the limits of code coverage. Full coverage does not in any way ensure that the code ran correctly; it states only that all the code was touched by the interpreter. You still have to write tests that assess the “expected behavior” of your code, not just that it ran.

Lastly, there is the concept of “number of tests” as a metric for how well tested a code base is. This metric is meaningless by itself. For example, a test could be written that simply does “assert x == x” and is then repeated for every 32-bit unsigned integer from 0 to 2^32 - 1. Tests should never be written with “how many tests” in mind; they should always be written in the context of three things, in this order: the type of test (listed above), whether it captures all reasonable expected behaviors, and how much code coverage the collection of tests has.

Recommended:

Python: PyTest
C/C++/Fortran: CTest
Rust: Test Attribute
Julia: Test Macros

Tutorials: