Designing software tests that actually help — Part 2

Ifeora Okechukwu
20 min read · Dec 19, 2022

Welcome to Part 2 of the series! If you haven’t read Part 1, please do so before moving on. Firstly, I would like to apologise for taking so long to put this together. Secondly, this will be another long blog post, so strap in and let’s get to it!

In this part, I will be talking about a number of things:

  • What makes for a quality software test?
  • The testing pyramid and other concepts
  • Dealing with functional regressions in tests
  • Accidental complexity and how it can affect software testing
  • Conclusion

I am going to start by saying that design is not just about the final substance (superficial or not) of what is being designed; it is also about the process by which the thing is designed. The process is as important as the final substance. Testing is no different. There are a lot of tools in the grand toolbox of testing and each by itself serves a purpose (sometimes a plethora of purposes), but the process by which you employ them matters. Also, you will mostly have to combine tools and concepts in good measure to get the desired result: quality tests that actually help.

In this article, I will talk about the things you must do to have quality tests. I will also talk about the testing pyramid, test doubles, some test coverage pitfalls as well as test case design anti-patterns, accidental complexity, and dealing with functional regression and brittle tests.

Before you continue reading, there are a couple of terms you’d need to be familiar with: Coder Bias, Test Doubles, Testing Pyramid, Test Coverage, System Under Test (SUT), Coupling, Requirements Specification and Accidental Complexity. Finally, watch these 2 talks on YouTube (if you can): The Structure And Interpretation of Test Cases & Decoupling from Rails.

What makes for a quality software test?

Let’s begin by talking about the essence of testing. When building software, our major aim shouldn’t be to write tests just for the sake of it. The aim should be to write quality tests. How do you come up with quality tests? I have three ideas on how to do it, so let’s get into it.

A coin has two sides, a head and a tail, not two heads or two tails. The same applies to software and software tests. When the requirements specification document(s) for a software project are being drawn up during discovery meetings and system analysis conversations, only a list of functional (and perhaps non-functional) qualities is captured and detailed in such documents.

This is a good thing, but it simultaneously sets off the coder bias when building such software from the requirements. What happens is that the development team tends to focus mostly on implementing and testing the list of things the software is allowed/supposed to do, without also implementing and testing the list of things the software isn’t allowed/supposed to do. This is why both the implementation and the resulting tests end up brittle and flaky, hence of low quality. Creating stronger requirements for software development entails tracking functionality from both sides (of the “coin”) in both implementation and testing. This is the way to improve the quality of tests overall.

Another aspect to deliberate on is that when you write tests, you are led mostly by your desire to write tests that pass with the smallest set of assertions or in the shortest number of steps.

It really is counter-intuitive to consider doing it any other way, but what you should focus on is how to write tests that fail in the shortest number of steps using the largest number of test assertions (more on this in another part of this series). We all need to begin to treat writing software tests a bit like writing software stress tests. In a stress test, the goal is failure and nothing else. What this causes you to do is focus on the many ways your software can go into a state that isn’t valid/desirable or allowed (e.g. error/bug states) before considering the ways the software should work correctly. In summary, test for edge cases before anything else.

Let’s take a concrete example below:

Let’s say you were building a financial technology (fintech) app that allows people to save money in wallets and convert it to any (set of allowed) currencies of their choosing (at great rates, I might add 😉). They would also be able to spend the money saved in these wallets on any online/offline purchase.

Surely, they (the fintech app end-users) would need to log in to the application (passwords and all) for security and authentication purposes, as we don’t want to allow unauthorised access to someone’s wallet. Now, we have already built a feature that allows users to add funding sources. A funding source is an entity that provides a means by which a user puts money into their wallet (e.g. a credit/debit card or bank account).

Whenever a funding source is added to the database (to a funding sources table via an HTTP endpoint, of course), it gets stored with the unique database id of the user who added it (owner_id, a foreign key to the users table’s id column) so that funding sources can be grouped by their owners. Furthermore, the wallets are also modelled as a database table with unique database ids and an owner_id too.
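To make that data model concrete, here is a minimal sketch of what such migrations might look like (assuming Laravel’s schema builder; the exact table names, column names and types are illustrative, not taken from a real codebase):

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// A rough sketch: both funding sources and wallets carry an owner_id
// foreign key that points back at users.id, so each can be grouped by owner.
return new class extends Migration {
    public function up(): void
    {
        Schema::create('funding_sources', function (Blueprint $table) {
            $table->id();
            $table->foreignId('owner_id')->constrained('users'); // the user who added it
            $table->string('type'); // e.g. "card" or "bank_account"
            $table->timestamps();
        });

        Schema::create('wallets', function (Blueprint $table) {
            $table->id();
            $table->foreignId('owner_id')->constrained('users'); // the user who owns the wallet
            $table->string('currency', 3);
            $table->unsignedBigInteger('balance')->default(0); // stored in minor units (e.g. cents)
            $table->timestamps();
        });
    }

    public function down(): void
    {
        Schema::dropIfExists('wallets');
        Schema::dropIfExists('funding_sources');
    }
};
```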

At this point, we want to build the most important software feature of the sprint: selecting a funding source and using it to top up a wallet. The feature is quickly implemented over the next couple of hours, the coder bias kicks in and you want to write a test that ensures that the feature works as expected.

After a while, you are done and the green check mark is visible on the screen when you run your test. You smile from ear to ear, pat yourself on the back and your job is done, right? Well, not quite (confirmation bias). Where are the tests that ensure the things a user isn’t supposed to be able to do with a funding source and a wallet actually can’t be done? That scenario should have come first, but you didn’t think of it, did you? You see, without this balance (remember both sides of the “coin”?), you cannot have quality tests!

For every test case (or set of test cases, i.e. a test suite) that ensures that a software feature works as expected (the “coin” head), there must be a test case (or set of test cases) that ensures that the feature cannot be abused (the “coin” tail). In this case, a user who doesn’t own a funding source and the wallet should not be allowed to top it up. Also, a user who has reached the maximum limit for the balance of a wallet (if any) should not be allowed to top it up either. Together, these scenarios, when tested, make for quality tests. When each is done alone, however, they make for incomplete and non-comprehensive tests that mostly result in false positives. So, in summary, create tests for degenerate cases. It’s also in your best interest to try to break your own tests (i.e. ensure that a positive result for any test case is a true positive, not a false one).
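As a rough sketch of what testing both sides of the “coin” could look like, here is a PHPUnit example (the Wallet and FundingSource classes below are tiny, hypothetical in-memory stand-ins so the sketch stays self-contained; the real feature would go through the HTTP endpoint and the database):

```php
<?php

use PHPUnit\Framework\TestCase;

// Hypothetical stand-ins, kept deliberately tiny for illustration.
final class FundingSource
{
    public function __construct(public readonly int $ownerId) {}
}

final class Wallet
{
    public function __construct(
        public readonly int $ownerId,
        public int $balance = 0,
        public readonly int $maxBalance = 1_000_000, // minor units
    ) {}

    public function topUpFrom(int $userId, FundingSource $source, int $amount): void
    {
        if ($userId !== $this->ownerId || $source->ownerId !== $userId) {
            throw new DomainException('User does not own this wallet/funding source.');
        }
        if ($this->balance + $amount > $this->maxBalance) {
            throw new DomainException('Wallet balance limit exceeded.');
        }
        $this->balance += $amount;
    }
}

final class WalletTopUpTest extends TestCase
{
    // The "head" of the coin: the feature works for the rightful owner.
    public function test_owner_can_top_up_their_own_wallet(): void
    {
        $wallet = new Wallet(ownerId: 1);
        $wallet->topUpFrom(userId: 1, source: new FundingSource(ownerId: 1), amount: 5_000);

        $this->assertSame(5_000, $wallet->balance);
    }

    // The "tail" of the coin: the feature cannot be abused by a non-owner...
    public function test_non_owner_cannot_top_up_someone_elses_wallet(): void
    {
        $wallet = new Wallet(ownerId: 1);

        $this->expectException(DomainException::class);
        $wallet->topUpFrom(userId: 2, source: new FundingSource(ownerId: 2), amount: 5_000);
    }

    // ...and cannot push the wallet past its balance limit.
    public function test_top_up_beyond_the_balance_limit_is_rejected(): void
    {
        $wallet = new Wallet(ownerId: 1, balance: 999_999);

        $this->expectException(DomainException::class);
        $wallet->topUpFrom(userId: 1, source: new FundingSource(ownerId: 1), amount: 5_000);
    }
}
```

Notice that two of the three test cases exist purely to prove that the feature cannot be abused.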

The main reason why QA engineers are always fighting with coders on a software project is that QA engineers mostly employ black-box testing techniques (over white-box testing techniques), which is great for the project overall and for everyone involved, both in the short term and the long term. You see, to thoroughly test your software is to treat it like you didn’t write the code at all. That way, you eliminate coder bias. This is why I believe that the developers who write the code have no business writing end-to-end tests (a black-box testing technique). Only the QA engineer should be involved in writing the end-to-end tests.

Another thing that affects the quality of a software test is the kind of test doubles used at different stages of testing, and the pitfall of over-using test doubles in our test cases.

Finally, the last thing that affects the quality of tests is how extensive your assertions are. If your assertions are not extensive enough, your tests might not represent the true state of things in your application code. As Kent C. Dodds said: “The more your tests resemble the way your software is used, the more confidence they can give you.”

We will discuss test assertions in more detail in the 3rd and final part of this series.

The testing pyramid and other concepts

What’s the testing pyramid and why do we need to study it? Well, because we are about to talk about test doubles in more detail. Then, we are going to round up with a little discussion about test coverage. A testing pyramid is a pyramid-like arrangement of levels for each kind of test we have in software development, namely unit, integration and system/e2e (end-to-end) tests. The term “testing pyramid” was coined by Mike Cohn in his book Succeeding with Agile.

Now, in the first paragraph of Part 1, I talked about the huge “effort to benefit ratio” (trade-off) needed to practise truly helpful testing (not just testing for the sake of testing). The problem people have with testing is the amount of upfront work that has to be done before testing proper can begin.

There are 4 kinds of test doubles that need to be prepared upfront before testing can begin (listed in descending order of the amount of work and time it takes to get each ready):

1) Fixtures (most amount of work to get ready)
2) Fakes
3) Mocks
4) Spies/Stubs (least amount of work to get ready)

As you go higher up the testing pyramid, from unit tests to integration tests and then to e2e tests, the need for fixtures, mocks, fakes and spies/stubs should decline at each level. However, using the wrong test double at the wrong (pyramid) level can also cause problems. For instance, unit tests mostly require mocks and occasionally fakes, whereas integration tests mostly require fakes and never mocks or stubs. Incidentally, only unit tests require spies/stubs/mocks and relatively large amounts of fixtures. e2e tests, on the other hand, do require a smaller amount of fixtures but don’t require spies/stubs or mocks at all.

For instance, when you utilise mocks/stubs in integration tests, it significantly reduces the quality of your tests because you aren’t testing the actual integration of the software logic but a poor and shallow approximation of it. Also, your test doubles need to be of high quality too, because a bug in a fake (test double) can impact the quality of your tests as well. Finally, know when to use a mock and when to use a stub.
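To make the distinction between these doubles concrete, here is a small sketch using PHPUnit’s built-in doubles against a hypothetical ExchangeRateProvider interface (the interface and the in-memory fake are illustrative, not from any real codebase):

```php
<?php

use PHPUnit\Framework\TestCase;

// Hypothetical dependency of the system under test.
interface ExchangeRateProvider
{
    public function rateFor(string $from, string $to): float;
}

// A fake: a working, simplified implementation (an in-memory lookup table)
// of the kind you would typically reach for in integration tests.
final class InMemoryExchangeRateProvider implements ExchangeRateProvider
{
    public function __construct(private array $rates) {}

    public function rateFor(string $from, string $to): float
    {
        return $this->rates["$from/$to"];
    }
}

final class TestDoubleExamplesTest extends TestCase
{
    public function test_a_stub_only_supplies_canned_answers(): void
    {
        // A stub answers questions; the test makes no claims about how it is called.
        $stub = $this->createStub(ExchangeRateProvider::class);
        $stub->method('rateFor')->willReturn(1.25);

        $this->assertSame(1.25, $stub->rateFor('USD', 'EUR'));
    }

    public function test_a_mock_also_verifies_the_interaction(): void
    {
        // A mock adds expectations about *how* the dependency is used.
        $mock = $this->createMock(ExchangeRateProvider::class);
        $mock->expects($this->once())
             ->method('rateFor')
             ->with('USD', 'EUR')
             ->willReturn(1.25);

        $this->assertSame(1.25, $mock->rateFor('USD', 'EUR'));
    }

    public function test_a_fake_behaves_like_the_real_thing_only_simpler(): void
    {
        $fake = new InMemoryExchangeRateProvider(['USD/EUR' => 1.25]);

        $this->assertSame(1.25, $fake->rateFor('USD', 'EUR'));
    }
}
```

A spy, for completeness, is essentially a stub that also records how it was called so you can assert on those recorded interactions after the fact.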

The truth is that there’s no way to totally eliminate all of the upfront work involved in making test doubles and fixtures to set the stage for testing, because it’s all necessary. However, you can reduce this amount of upfront work (the number of fixtures, mocks and fakes you need) by mocking strictly only what you need to mock, and I’ll discuss this further in the next section.

Another concept to discuss is test coverage. Test coverage is only good at 2 things: identifying gaps in your requirements specification (thereby helping to eliminate defects early) and letting you know how much untested application code you have. However, it’s not good at letting you know how high the quality of the tests you have written is. This realisation needs to be factored in when writing your tests. Some technical articles flying around posit that test coverage helps to improve the quality of your tests. That’s just not true! You can have high test coverage and still have bugs in your codebase and/or brittle tests. In fact, test coverage becomes a vanity metric if your tests are not comprehensive (testing both sides of the coin, as I said earlier) and you don’t use the right test doubles, or you use test doubles of low quality.

Furthermore, each level of testing has its limits. For instance, you cannot use unit tests or integration tests alone to prove that your software system works without a doubt. No one level of testing can provide you with 100% certainty. However, when unit tests, integration tests and system tests are used together properly, the results can be remarkable. I think people ascribe very high expectations to what they can get from, say, unit tests alone. Most developers want (or wish) unit tests to serve as the only thing that proves that they have a working software system free of bugs, but that’s not possible because unit tests alone don’t have that kind of power.

Finally, as Dijkstra said, program testing can be used to show the presence of bugs, but never to show their absence. Unit tests are super important, and integration tests are even more important, but that doesn’t mean that integration tests should completely replace unit tests. It is imperative that both kinds of tests are written with the highest quality in mind. The picture in the link below is an analogy of what can happen if you prioritise unit tests over integration tests.

Dealing with functional regressions in tests

We all hate regressions of any kind in our code, and even more so brittle tests (tests that break and fail easily when we modify application code). It’s a joy killer! It’s also the second biggest area of difficulty that software engineers have to deal with (see the bar chart below 👇🏾). But why does it happen at all? Do we have any control over it?

Survey results from 30 random software engineers who write tests

To answer these questions, we must understand that tests are (and should be) written with the aim of communicating the feature set of the software that has been or is being built in a simple, succinct and comprehensive manner. When tests are written just for testing’s sake, supplying proper answers to these questions becomes difficult. Also, from the bar chart above, you can see that knowing what to mock and how to mock it is the number 1 area of difficulty for most software engineers. Most engineers say that they find themselves mocking a lot of dependencies ahead of writing actual tests, and this sometimes discourages them from writing tests at all. For these engineers, writing tests is super stressful. Why is this?

Survey results from 30 random software engineers who write tests

To answer that, we need to understand that when building software, trying to test what has already been tested and proven to work without errors or bugs is never a good idea. Although this is done unknowingly by most software engineers, it is common practice to try to test a piece of code that isn’t actually a unit (as if it were) and call it unit testing (which happens to be the most written kind of test: see above 👆🏾👆🏾👆🏾). This usually occurs when the so-called unit (of code) cannot be isolated easily. Therefore, the tests written for it are bloated and shallow. This is an example of a test case design anti-pattern.

These sorts of shallow tests don’t provide feedback that the software engineer can trust. Ultimately, we lose confidence in these tests and also lose control over how we detect and respond to regressions in the codebase. Once we can write tests that bring certainty and confidence, taming regression becomes possible. This can only be achieved if we look very closely at how we write code. The real problem is that we are often doing something wrong which isn’t obvious to us (early on), and in that frustration we go ahead and blame our tools and say things like “I hate testing! It adds/has no real value” or “TDD is dead, long live testing” (😂 very funny, DHH).

All the tests that you write are subject to the same objective design principles and coding best practices that your application (custom) code is subject to. Therefore, code design applies to test code too, and your test code should reflect that. The tests you write shouldn’t break when the implementation details of the business rules (features) change; they should only break when the business rules (features) themselves change.

Furthermore, by designing your application code properly, reducing and eliminating couplers (code smells) and adapting your test cases so that coverage of code paths in your application code overlaps without redundancy, you make defect leakage more evident. The higher the degree of coupling, the lower the ability to change and evolve the software. In other words, the higher the cost of change (i.e. it will cost you and your team of software developers more time and energy to make a change or set of changes to the codebase).

One of the processes to set up here is the proper decoupling of disparate but inter-related units of the whole software. Coupling is the primary reason for tests becoming brittle. It turns out that code that is tightly coupled is usually not very testable, or not easily testable. The result is that after a little change in the application logic, a plethora of changes have to be made to multiple test cases or else they all fail. Coupling (in codebases) is also hardly ever obvious.

By decoupling on 3 levels (as shown below), you can design your tests in such a way that you limit the amount and frequency of breakage in your tests (or eliminate the breakage completely in some cases).

  1. Decouple your application framework (service) code (e.g. Phoenix, Laravel, Masonite, Ruby on Rails, Spring, AdonisJS, NestJS) from the business rules application (custom) code which depends on the framework.
  2. Decouple your testing framework (service) code (e.g. Jest, PHPUnit, Japa, PyUnit, JUnit) from your test code.
  3. Finally, decouple your application code from your test code.

To get your tests to a place where you can continue to refactor/rewrite without the fear of breaking them in the process, they have to be decoupled on these 3 levels.

One negative effect of not decoupling your code as much as you should is that you have areas of your codebase that should be unit-testable but aren’t, and those areas can increase in scope as code smells (technical debt) increase. Another negative effect is that you spend time mocking out a lot of dependencies (that you shouldn’t be mocking) because those dependencies are tightly coupled to the business logic you are testing. The final negative effect is that your tests spend a longer time running, as is evident in this video.

Now, decoupling code has a formula where you need to start with some simple checks:

  1. Your test cases should be independent of one another (i.e. one test case in your spec or test suite should not share instance/fixture data, a reference in memory, or a shared mutex/semaphore with another). In other words, each test should start on a clean slate (as clean as can be).
  2. Your tests will have to interact with external artefacts/dependencies which such tests (and by extension your application code) merely influence but don’t control (e.g. the TCP/UDP network, the file system, date/timezone settings). You must isolate your tests from knowledge of these external dependencies by mocking/faking the interactions that utilise the external artefacts/dependencies, not the artefacts/dependencies themselves (i.e. by creating forwarders that delegate to the dependency, a.k.a. proxies; see the sketch after this list). In other words, you should mock only what you control and own. However, there are times when you must mock/fake what you do not control and own. At such times, do so if and only if the interaction with such an external artefact/dependency is not integral to the function you are trying to test within the system under test (SUT), or the side effects of the interaction aren’t desirable in a testing situation.
  3. Your application code needs to utilise several compact and self-sufficient layers of indirection to separate 3 distinct application concerns: 1) business logic 2) UI logic 3) data access logic. You already do this with paradigms like MVC or MVVM, but you need to take it a step further by specialising it properly, using things like the hexagonal architecture or onion architecture (or a hybrid of both), to separate the custom application code (business rules) you have written from the other layers of concern like the presentation layer, persistence layer and monitoring layer. However, one must never follow such an architecture blindly or abuse it, so as not to violate the reasonable flow, direction and degree of coupling that should be maintained along related ports or layers. It must be said that coupling in codebases is not necessarily a bad thing; it is the degree to which coupling exists that can be bad, especially when it is very high.
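Here is a minimal sketch of point 2 above: a Clock forwarder (proxy) that our code owns and that delegates to an external artefact we don’t own (the system clock), so the test doubles the proxy rather than PHP’s date functions themselves. All the class names here are hypothetical.

```php
<?php

use PHPUnit\Framework\TestCase;

// The forwarder (proxy) we own: it delegates to the uncontrolled dependency.
interface Clock
{
    public function now(): DateTimeImmutable;
}

final class SystemClock implements Clock
{
    public function now(): DateTimeImmutable
    {
        return new DateTimeImmutable('now'); // the real, uncontrolled dependency
    }
}

// The unit we actually want to test only depends on the Clock abstraction.
final class PromoPricing
{
    public function __construct(private Clock $clock) {}

    public function discountFor(int $amount): int
    {
        // 10% off on weekends, as an illustrative business rule.
        $isWeekend = in_array($this->clock->now()->format('N'), ['6', '7'], true);

        return $isWeekend ? (int) round($amount * 0.9) : $amount;
    }
}

final class PromoPricingTest extends TestCase
{
    public function test_weekend_orders_get_a_discount(): void
    {
        // We stub the proxy we own, not the system clock or timezone settings.
        $clock = $this->createStub(Clock::class);
        $clock->method('now')->willReturn(new DateTimeImmutable('2022-12-17')); // a Saturday

        $this->assertSame(900, (new PromoPricing($clock))->discountFor(1000));
    }
}
```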

Accidental Complexity and how it can affect software testing

As long as humans are involved in software creation, accidental complexity will always exist as one of the many things that plague commercially available and even open-source software. When it comes to testing, you require a clear head. Have you ever noticed that when you write code without a clear head and a proper thought process, you end up having to rewrite the code even though it may be working well? This is because the way you think about code affects the way you write it. Plus, if your thinking is wrong, then the code you write can have complexity introduced into it.

So, what is accidental complexity and why do we as software engineers need to be aware and cautious of it? Accidental complexity is the thing that sneaks up on you when you aren’t looking: complexity that comes from our tools, choices and implementation habits rather than from the problem we are actually solving.

It’s a baffling paradox how open-source software is one of the biggest sources of complexity (accidental complexity) distributed among codebases all over the world, and simultaneously the biggest source of cost-effective, productive and mostly reliable software for codebases around the world. It’s pertinent to note that this paradox pushes the software industry in two opposing directions, by distributing solutions that focus too much on increasing productivity and too little on offering flexibility and managing nuance. We try to make up for this by over-planning for expected future change and over-utilising many more needless developer tools that add even more complexity to the software engineering outcome. At last, we reach our wits’ end and begin to course-correct (usually after a decade or two) back to solid ancient wisdom. Indeed, we are all guilty of exchanging short-term convenience for long-term headaches from time to time.

There are two principles of mocking/stubbing we established earlier (above):

  1. Mock only what you own and control, and where an external dependency is involved, mock only the proxy wrapper around the dependency you do not own and control.
  2. Only directly mock objects (without proxies) that lie at the logical boundary of functionality which your custom code engages/interacts with, whose side effects aren’t desirable or are not integral to the outcome of the unit tests (e.g. mocking a filesystem adapter, mocking an HTTP redirect or mocking session data storage; you can’t really mock an HTTP redirect anyway, just the idea of it).

However, we hardly adhere to these 2 principles because we are running away from immediate inconvenience. In my article about decisions that matter broadly, I spoke about why it’s important to be an objectively lazy programmer instead of just being a lazy programmer.

The first big source of complexity is the error in our own formalisation of logic, packaged as the software solution to a problem. Quite often, the many opportunities by which we introduce complexity elude us in the way we think about testing in relation to what is being tested. This complexity sometimes leads us to erroneously blur the lines between, for example, what a unit test is and should be and what an integration test is and should be. A software developer starts out intending to write a unit test, but ends up writing an integration test because of the complexity contained in the code they are trying to test (a.k.a. the test is mocking objects it shouldn’t be mocking).

When a developer tries to test a Laravel middleware, what I think they should do is: “Write a test only for the logic inside the middleware, not for the middleware itself plus the logic inside it.” However, what they actually end up doing is: “First, mock framework-specific objects like the HTTP request object and the next closure, then write a test for the middleware and the logic inside it”, and this is a huge waste of time and energy. There are several issues here. The logic inside the middleware is usually tightly coupled to the middleware, so the only REAL way to test that logic is to test the entire middleware by booting the entire Laravel application and, voila, an integration test (often disguised as a unit test) is born. Firstly, the test doesn’t need the HTTP request object; it just needs the data provided by the HTTP request object (e.g. the extracted Content-Type header value or the Authorization bearer token). Secondly, the HTTP request object is not owned or controlled by the developer; it is owned and controlled by the framework (Laravel). Here is an example of how to write an easily unit-testable middleware in Laravel (you can run composer test in the terminal as seen in the photo below; also see the Online PHP Sandbox embed below).

See? The test passed!
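In case the sandbox embed doesn’t carry over here, below is a minimal sketch of the same idea (the class and method names are illustrative, not the exact code from the embed): the decision logic lives in a plain class that only receives the data it needs, the Laravel middleware merely extracts that data and delegates, and the unit test never touches the framework’s request object or the next closure.

```php
<?php

use Illuminate\Http\Request;
use PHPUnit\Framework\TestCase;

// Plain, framework-free logic: it receives the bearer token value, not the
// framework's request object, so it can be unit tested in complete isolation.
final class BearerTokenGuard
{
    public function isAcceptable(?string $bearerToken): bool
    {
        return $bearerToken !== null && strlen($bearerToken) >= 32;
    }
}

// The Laravel middleware is now a thin adapter: it extracts the data from the
// request and delegates the decision to the guard. It gets covered later by an
// integration/feature test, not by the unit test below.
final class EnsureValidBearerToken
{
    public function __construct(private BearerTokenGuard $guard) {}

    public function handle(Request $request, Closure $next)
    {
        if (! $this->guard->isAcceptable($request->bearerToken())) {
            abort(401);
        }

        return $next($request);
    }
}

// The unit test: no mocked Request, no mocked $next closure, no booted app.
final class BearerTokenGuardTest extends TestCase
{
    public function test_missing_or_short_tokens_are_rejected(): void
    {
        $guard = new BearerTokenGuard();

        $this->assertFalse($guard->isAcceptable(null));
        $this->assertFalse($guard->isAcceptable('too-short'));
    }

    public function test_sufficiently_long_tokens_are_accepted(): void
    {
        $guard = new BearerTokenGuard();

        $this->assertTrue($guard->isAcceptable(str_repeat('a', 32)));
    }
}
```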

Now, it’s possible for you, reading this article (after the preceding paragraph about writing a Laravel middleware test), to watch this video (from the folks at Spatie) and try to put forward a counter-argument: “Well, it’s possible to indeed mock the request object using the createRequest() function defined in this video, and the next closure too. Therefore, there’s no problem unit testing the middleware itself!” However, what you are forgetting is that the createRequest() function depends on SymfonyRequest::create(), which could have a breaking change introduced in the future or could be deprecated. When that happens, the createRequest() function becomes a source of instability to any code that depends on it. Consequently, the tests you have written run the risk of becoming brittle. Also, you are doing too much work mocking the request and everything on it (fake data and interfaces) that your test will interact with.

Another big source of complexity can be the testing framework itself. A framework like React Testing Library is a very good example. Sometimes testing frameworks get in the way of testing just what needs to be tested: the business logic and nothing else (even though, when doing an integration test, you shouldn’t be testing only the business logic).

For any component-based UI library for the frontend out there today (e.g. ReactJS, VueJS, etc.), there are 2 broad categories of components:

  1. Presentation Components: components that don’t have or control state and don’t render any other React components. They are also referred to as leaf components.
  2. Container Components: components that do have and control state and do render other React components (other presentation components or other container components).

Using React Testing Library to unit test a presentation component, or a container component that renders only presentation components, is very straightforward. No problems there at all! However, it is practically impossible to truly unit test a container component that renders other container components; that will be an integration test, not a unit test. And because you don’t realise this, you get stuck using a mocking library like Jest to try to mock things you shouldn’t be mocking. This can make things messy very quickly.

Conclusion

All in all, creating quality tests is serious hard work! Yet, in the long term, the juice is very much worth the squeeze. In the next article of this series, I will delve into the nitty-gritty of test assertions and fixtures, as well as take a deep dive into equivalence partitioning and boundary value analysis. Finally, I will give my 2 cents about TDD and why I think every programmer needs it.
