Note: Testing with Cabal, HSpec, HUnit, and QuickCheck

How do we know that the code we write is correct? How do we know, when we've added functionality to our code, or fixed one bug, that we haven't introduced others? Unless we have to tools to actually prove that our code is correct, we're probably going to want to test our code. Haskell, has a number of tools that facilitate testing, including significant support for testing in cabal and stack, the most commonly used Haskell packaging/build systems.

There is, at this point, a fairly well developed understanding of the nature of testing. We won't review all aspects of this (there are other things we want to do this quarter), but we do want to hit the most important ideas, and show you how they apply in practice.

Testing, except in the unusual case of a program or function with only finitely many possible inputs, can never be complete. It can reveal the presence of bugs, but failing to reveal a bug isn't the same thing as proving that bugs don't exist. Maybe we just haven't looked at enough cases! Still, there are things we can do to improve the reliability of our tests.

A first consideration is coverage. We want to test all of the code. Bugs in code that's not tested won't be revealed.

A second consideration is that there is often a tension between bugs -- sometimes fixing one bug tends to introduces another, and fixing the new bug tends to re-introduces the original bug. If you're not careful, you can go back and forth many times before you recognize what's going on. This problem is exacerbated by long-lived code, which may be maintained by different people, who may traverse different parts of the cycle but not the whole. The standard approach to this is regression testing, i.e., you use tests to detect old bugs, and re-run them whenever the code changes, to make sure you haven't re-introduced an old bug.

Test-Driven Development

Text-driven development is a methodology for building tests along with code, in such a way as to ensure coverage, and creation of a demanding set of tests. The basic idea involves a very tight development loop, in which you write a failing test, and then write just enough code to get that test to pass. This is iterated, until the program is completed.

As a practical matter, pure code is the easiest to test, and this helps explain Haskell's emphasis on pure code, and why we want to minimize impure (mostly IO) code.

Test-driven development is often abbreviated TDD. Unfortunately, another development discipline — Type-driven development — shares the same abbreviation. And Haskell is a language in which types matter a lot, and following a type-driven development approach can be helpful. Fortunately, type-driven development and test-driven development are largely compatible disciplines, and we will illustrate both today.

A Working Example: mergeSort.

We're going to demonstration test-driven development, along with Haskell and cabal support for testing, but writing a module for merge sort. This is one of the standard, $O(n \; \log n)$, efficient sorting algorithms. As a practical matter, quick sort is more commonly used, but merge sort remains important when the data to be sorted is so large that it doesn't fit in memory all at once, as might be the case in dealing with a very large database.

The idea is to write two functions: split, which will break a list containing two or more elements into two smaller non-trivial lists, and merge which will combine two sorted lists into a single sorted list. Merge sort works by using split to break a long list into two non-trivial pieces (each of which is necessarily shorter than the original), calling itself recursively on each of the two resulting lists, and then merging the sorted lists together.

Exercise 1 The specification of split in this description is very much an underspecification. For mergeSort to be efficient, the two resulting lists should be of nearly equal size. This can be accomplished in a number of ways, and today's development considers just one such.

Come up with a fundamentally different approach to split, i.e., one that meets the specification, but typically produces a different output. Implement your algorithm, and show that mergeSort still passes its tests.

Stubbing Out the Code.

We start by creating a merge directory, and using cabal init to set up the directory. We'll be building a library, and creating a test suite, and make the appropriate choices in response to cabal's questions.

I have things set up (in ~/.cabal/config) so that my library source files are in ./lib, and the testing files are in ./test. A general principle is that the ./test directory will essentially mirror the ./lib directory, in that there's a xTest.hs (short for specification) file in the test directory for every file x in the source directory. Cabal sets us up with MyLib.hs and MyLibTest.hs, these need to be renamed MergeSort.hs and MergeSortTest.hs respectively, and the contents of these files (and the .cabal file) need to be updated accordingly.

One final tweak (for recent versions of cabal is to create a file cabal.project, and add the lines

packages : ./merge.cabal tests: True

Note that the Haskell tool chain is evolving very quickly at present, and this is no where more obvious than in cabal. I'm using cabal 3.2.0.0, but ghcup already has a release candidate for cabal 3.4.0.0, and there are alpha versions of 3.5.0.0 out there. The cabal.project, together with cabal.project.local files are new. We won't dwell much on their content now.

Somewhat more interesting is the file merge.cabal, where we edit the test suite stanza to be (note that you should specify the most recent versions of these packages)

test-suite merge-test default-language : Haskell2010 type : exitcode-stdio-1.0 hs-source-dirs : test main-is : MergeSortTest.hs build-depends : base ^>=4.14.1.0 , hspec ^>= 2.7.4 , merge

The declaration cabal-version: 3.0, refers to the format of the .cabal file, not the version of cabal actually used, although obviously it should be at least as large as the file format version. As this code develops, we'll be adding new modules to the build-depends verse. Note here that we don't provide a version dependency for the merge package here -- this file defines the current version of the merge package, and the assumption is that that's the version we want to use. Note also that I format my verses and stanzas a bit differently than cabal init does -- this style facilitates adding new packages to the build-depends verse, and that's a very common thing to do.

Next, we'll stub out MergeSort.hs:

-- | MergeSort provides a polymorphic mergeSort function module MergeSort where -- | sort a list using the merge sort algorithm mergeSort :: Ord a => [a] -> [a] mergeSort = undefined -- | split a list into two lists of almost equal length split :: Ord a => [a] -> ([a],[a]) split = undefined -- | merge two sorted lists into a single sorted list merge :: Ord a => [a] -> [a] -> [a] merge = undefined

I've declared top-level versions of the functions I'm going to need, with type constraints and Haddock-style comments. The latter isn't essential, but it's good to document in the source file what our functions are intended to do. Also note that software tools (e.g., Visual Studio Code) can use Haddock documentation to provide contextual help on various symbols, which becomes increasingly important as code complexity grows. Note, to, the use of undefined, which, if evaluated, will cause a run-time exception. This is practically guaranteed to make most ordinary tests fail.

Finally, we have the MergeSortSpec.hs, which is stubbed out as follows:

module Main where import Test.Hspec pendingTests :: Spec pendingTests = do it "needs some tests" $ do pending mergeSortSpec :: Spec mergeSortSpec = describe "mergeSort" $ do pendingTests -- | split a list into two lists of almost equal length splitSpec :: Spec splitSpec = describe "split" $ do pendingTests -- | merge two sorted lists into a single sorted list mergeSpec :: Spec mergeSpec = describe "merge" $ do pendingTests main :: IO () main = hspec $ do mergeSortSpec splitSpec mergeSpec

Our tests are values of type Spec, which is a monadical value (it is a type alias for SpecWith ()). The functions describe and it are used to name particular functions and tests respectively. Finally, pending is a test which neither passes nor fails, but is separately accounted.

We're now set up to run our tests!

$ cabal test Resolving dependencies... Build profile: -w ghc-8.10.2 -O1 In order, the following will be built (use -v for more details): - merge-0.1.0.0 (lib) (first run) - merge-0.1.0.0 (test:merge-test) (first run) Configuring library for merge-0.1.0.0.. Preprocessing library for merge-0.1.0.0.. Building library for merge-0.1.0.0.. ... [build noise] ... Running 1 test suites... Test suite merge-test: RUNNING... Test suite merge-test: PASS Test suite logged to: /Users/stuart/Sites/Courses/cmsc-16100/2020/Lectures/13a/merge/dist-newstyle/build/x86_64-osx/ghc-8.10.2/merge-0.1.0.0/t/merge-test/test/merge-0.1.0.0-merge-test.log 1 of 1 test suites (1 of 1 test cases) passed.

There's actually quite a lot of output here. At first glance, it looks as if all is well. We've passed our tests! It's not quite so simple, as a look at the log file reveals:

Test suite merge-test: RUNNING... mergeSort needs some tests # PENDING: No reason given split needs some tests # PENDING: No reason given merge needs some tests # PENDING: No reason given Finished in 0.0002 seconds 3 examples, 0 failures, 3 pending Test suite merge-test: PASS

We passed because we didn't fail any tests, but we're still pending tests. We've done a lot of setup, but no real work yet. It's time to do a test.

A First Iteration

We'll start by implementing the split function. But in test-driven development, we don't start with implementing anything. We start by writing a test. It is easy in this to get too focussed on the coding part of the problem, and to write tests that only our (current) implementation will pass. E.g., we probably don't want to write a test like:

it "correctly splits the [1,2,3,4]" $ split ([1,2,3,4] :: Int) == ([1,2],[3,4])

The it function takes as arguments a description of the test (which will ultimately be printed), together with the test itself, which should be a Bool.

as this over-specifies the behavior of split. Instead, we want that the result of splitting a list is two lists of nearly equal length, containing all of the same values (and multiplicities) as the original list. We can express these ideas via the follow two test functions (which in practice will be local to splitSpec:

preservesContent :: [Int] -> Bool preservesContent xs = sort xs == (sort $ as ++ bs) where (as,bs) = split xs dividesEvenly :: [Int] -> Bool dividesEvenly xs = abs (length as - length bs) <= 1 where (as,bs) = split xs

These can be combined into a Spec, again local to splitSpec:

splitsCorrectly :: [Int] -> Spec splitsCorrectly xs = do it "preserves content" $ preservesContent xs it "divides evenly" $ dividesEvenly xs

The form here anticipates changes to our testing routines which will come soon.

We can then run this test on several different cases, including the edge cases (empty and nearly empty lists), and lists with duplication:

splitSpec :: Spec splitSpec = describe "split" $ do splitsCorrectly [] splitsCorrectly [1] splitsCorrectly [1,2] splitsCorrectly [1..100] splitsCorrectly $ [1..100] ++ [1..100]

The describe function essentially adds a section to our testing log.

The resulting tests all fail, because our current definition of split is undefined:

12 examples, 10 failures, 2 pending Test suite merge-test: FAIL

This is what we want! Now, the game is to try to write a definition of split that passes these tests. With a keen knowledge of what the Prelude has to offer, we try:

split xs = splitAt (length xs `div` 2) xs

So if the argument has 7 elements, we'll return a pair of lists with 3 elements and 4 elements respectively. The tests now pass!.

Exercise 2 Give an alternative definition of split, and show that it passes these tests. Does a test-driven development approach to the alternative definition suggest additional tests?

A Second Iteration

Having dispensed with split, we now turn our attention to merge. Our idea here is that if we merge two sorted lists, we'll end up with a list that contains the elements of the arguments in sorted order, i.e.,

correctlyMerges :: [Int] -> [Int] -> Bool correctlyMerges xs ys = sort (xs ++ ys) == merge (sort xs) (sort ys)

It is handy to have a reference sort function lying around! You won't always have this.

Now, we could approach this as did in the case of the split function, but we're actually set up for something more interesting. The Hspec testing framework knows about two other testing frameworks, Test.HUnit, which handles unit tests (which are what we've been doing so far), and Test.QuickCheck, which does randomized testing. QuickCheck is quite easy to use, and requires only minimal changes to the code we've written.

mergeSpec :: Spec mergeSpec = describe "merge" $ do it "properly merges sorted lists" $ property $ correctlyMerges

The new thing here is Test.QuickCheck.property, which turns a Testable value into a (monadic) unit test. The remarkable thing here is that boolean valued functions with arguments from the Arbitrary type class are Testable, and the standard types and type constructors have Arbitrary instances. So correctlyMerges is Testable, and can be converted into a Property, suitable for use by Hspec. Using QuickCheck will necessitate adding QuickCheck to the build-depends verse in our test-suite stanza.

Again, this test fails (because merge is just undefined), so we now add an implementation:

merge :: Ord a => [a] -> [a] -> [a] merge (a:as) (b:bs) | a <= b = a : merge as (b:bs) | otherwise = b : merge (a:as) bs merge [] bs = bs merge as [] = as

This code passes, as we expect.

Of course, while we expect that the random instances produced by QuickCheck will include important edge cases, we don't want to rely too heavily on this. Fortunately, we can just add unit tests to the do block of describe:

mergeSpec :: Spec mergeSpec = describe "merge" $ do it "properly merges two empty lists" $ correctlyMerges [] [] it "properly merges an empty list with a non-empty list" $ correctlyMerges [] [1] it "properly merges a non-empty list with an empty list" $ correctlyMerges [0] [] it "properly merges sorted lists" $ property $ correctlyMerges where correctlyMerges :: [Int] -> [Int] -> Bool correctlyMerges xs ys = sort (xs ++ ys) == merge (sort xs) (sort ys)

A Code Review

In a code review at this point in the development of the MergeSort module, there are two criticisms that require action:

Exercise 3 The implementation of merge generates more garbage than necessary, because we recreate cons-cells like (a:as). Eliminate this inefficiency (hint: @-patterns), and show that the improved code passes the tests.

*Exercise 4 The tests for split should include QuickCheck based tests, as well as unit tests. Add QuickCheck versions of preservesContent and dividesEvenly.

Finishing up...

We can take advantage of a reference implementation of sort to write a very simple, but effective, test for mergeSort:

mergeSortSpec :: Spec mergeSortSpec = describe "mergeSort" $ do it "sorts its input" $ do property $ sortsCorrectly where sortsCorrectly :: [Int] -> Bool sortsCorrectly xs = mergeSort xs == sort xs

And follow up with the implementation:

mergeSort :: Ord a => [a] -> [a] mergeSort [] = [] mergeSort [a] = [a] mergeSort xs = merge (mergeSort as) (mergeSort bs) where (as,bs) = split xs

The code has to deal with cases where split xs might return a pair with an element the same length as xs, resulting in a divergent computation.

We do a final run, and the tests all pass!

Resources

Note that there are a number of other tutorials on the web on Hspec, HUnit, and QuickCheck.