Property-Based Testing with `hypothesis`, and associated use cases
NOTE: This blog post complements a PyDistrict presentation on the same topic posted on this date.
Thanks to Rami Chowdhury for inviting me to speak, and the PyDistrict organizers for hosting me.
Code samples from this talk are available at this GitHub repository.
The full YouTube talk can be seen here.
Overview
Writing tests today sucks. Tests are expensive to write. They're even more expensive to maintain. They're somewhat opaque to anybody who's not intimately familiar with the test harness. You can't always be sure whether you're writing the "right" tests for your situation. Worst of all, it's sometimes difficult to persuade a non-technical stakeholder that software testing actually does anything under the hood, except maybe inflate the operational expenses in your software budget. Every software development horror story I've heard comes with some discussion of tests, and that's because software not only fails in more ways than anybody without job security may care to admit, it fails in novel and unpredictable ways.
What sucks even more is just how important effective software testing is for software engineers to quickly and confidently contribute value. The absolute first thing I do after cloning most git repositories these days, for work or otherwise, is run the test suite. Test suites are self-documenting and always more accurate than a wiki or other forms of documentation, and since they're programmatic, it's much easier to tie them into the rest of your developer workflow (say, when setting up automated deployments to production).
Honestly, this probably isn't news to you if you've been working in industry for a while, and it's easy to fall into a trap of powerlessness, believing any endemic lack of alignment is permanent. And while the root cause is not always (and usually not) a technical limitation, it can be useful to think of technical ways to grow beyond the immediate problem. So I'd like to point out two great aspects of professional software development that can be applied to software testing.
The first is that we're usually using paradigms that are decades old. Professional software is usually incremental in action and incrementalist in thinking. This implies there are often newer concepts in academia or research that might solve current problems. One professor I talked to around graduation 4 years ago brought up how the A* path-finding algorithm was invented 20 years before its adoption in the video game industry (and apparently that's at least one reason why NPCs don't collide with in-world walls anymore).
The second is that test suites are usually under the full control of engineering. There might be a QA department within engineering, but I have yet to see a sales or marketing department care deeply about how you structure a test harness, the way they care about the layout of a landing page. While serving the company is the goal of any business-oriented engineering department, this means engineering has more agency than usual in how it approaches testing, compared to, say, product implementation or devops.
What we're looking for as software engineers is to write as many good tests as possible within some limited time frame.
Different software testing strategies I've used
The most important thing about writing good tests is to start writing good source code and develop good habits. This might involve separating concerns in a codebase by using functions instead of classes, gradually applying type signatures using mypy or pyre-check and adding a typechecking step to your CI pipeline, and paying attention to software properties like immutability (updating an object creates a copy of the object with the update instead of mutating the original object) and idempotency (running a code block generates the same output for the same input even across multiple runs). You can bring a lot of value to the table by using tools and infrastructure like black for autoformatting and Docker / docker-compose for running local infrastructure, and a lot of this undergirds higher-order testing strategies like the property-based testing discussed later on.
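To make those two properties concrete, here's a minimal sketch (the class and function names are my own, invented purely for illustration):

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:
    retries: int
    base_url: str

def with_retries(config: Config, retries: int) -> Config:
    # Immutability: return an updated copy instead of mutating `config` in place.
    return replace(config, retries=retries)

def normalize_path(path: str) -> str:
    # Idempotency: running this twice gives the same result as running it once,
    # i.e. normalize_path(normalize_path(p)) == normalize_path(p) for any input.
    return path.rstrip("/")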
Some of the strategies I've applied in the field revolve around these simpler ideas. When I was building webapp frontends and backends, I adopted the idea of "functional core, imperative shell", where you have a soup of decoupled methods within your app that are easily unit-testable, and a set of well-defined I/O interfaces and stubs to imperative resources like your network or database to create the cheapest possible integration tests. This is nice for stateless, pass-through entities, but works less well when you're primarily looking for effects (e.g. it'd be very difficult to unit test most of PostgreSQL effectively). So when I became a data engineer and spent most of my day interacting with effectful systems, I attempted with some success to write a set of data-driven integration tests using pytest, leveraging its ability to parameterize over fixtures using data to create essentially a mongodump of test descriptions. This solution was nice since it allowed me to scale the amount of data I tested far beyond the amount of code I needed to write. However, I still needed to come up with my own oracle values to test with, and it's easy to run an expensive integration test with very little added value.
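As a sketch of that data-driven pattern (with a hypothetical `run_query` standing in for the real system under test, and the case data standing in for the dumped test descriptions):

import pytest

# Hypothetical test descriptions; in practice these might be dumped from a datastore.
CASES = [
    {"query": {"status": "active"}, "expected_count": 2},
    {"query": {"status": "archived"}, "expected_count": 0},
]

def run_query(query):
    # Stand-in for the real, effectful integration under test.
    data = [{"status": "active"}, {"status": "active"}]
    return sum(1 for row in data if all(row.get(k) == v for k, v in query.items()))

@pytest.fixture(params=CASES, ids=lambda case: str(case["query"]))
def case(request):
    return request.param

def test_query_returns_expected_count(case):
    assert run_query(case["query"]) == case["expected_count"]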
This is where my interest in property-based testing stems from. I'd also mention that even with property-based testing, all these testing strategies (and others I haven't covered or discovered yet) have their place and their tradeoffs. You should decide for yourself when to apply or to not apply each.
What is property-based testing?
At its simplest, property-based testing is the inverse of normal unit testing. Instead of providing some amount of data and a transformation so the computer can assert a property, you provide a property and a transformation, so the computer can provide some data.
Let's go back to the basics. Say you defined a method `addOne()` to increment an integer by one:

def addOne(x: int) -> int:
    return x + 1
In order to test `addOne()`, you can define a set of unit tests:

def test_addOne():
    assert addOne(5) == 6
You can even parameterize a list of oracle inputs and expected values:

import pytest

@pytest.mark.parametrize('_input, _expected', [(5, 6), (6, 7), (7, 8)])
def test_addOne(_input, _expected):
    assert addOne(_input) == _expected
The final leap is realizing that instead of coming up with your own oracle values, you just care that the inputs are integers and the outputs are a functional, deterministic transform resulting in an integer output. This is where property-based testing comes in:
from hypothesis import given, strategies as st

@given(st.integers())
def test_addOne(_input):
    assert addOne(_input) == _input + 1
For some properties, you can define and apply the opposite transform, which, when combined with the original transform, generates the identity function, which should leave the input untouched:

def subOne(x: int) -> int:
    return x - 1

from hypothesis import given, strategies as st

@given(st.integers())
def test_addsub_returns_value(_input):
    assert subOne(addOne(_input)) == _input
So what else is property-based testing similar to, and different from? One similar paradigm is generative testing: tests that create tests. Since we're creating a spec here, `hypothesis` is generating tests underneath the hood for us and running them. Where property-based testing differs is that it supports intelligent search strategies that allow us to shrink the error space.
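To see shrinking in action, here's a deliberately false property; `hypothesis` won't just report some random failing list, it will shrink the input toward a minimal counterexample (something like [-1]):

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sum_is_non_negative(xs):
    # Deliberately false: hypothesis finds a failing input and shrinks it down.
    assert sum(xs) >= 0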
Another testing paradigm similar to property-based testing is fuzz testing, where random inputs are inserted into a program to see how it behaves and whether it crashes. To me, fuzz testing is more an extension of black-box integration testing, and extends more into devops- or security-related testing, rather than the unit-focused nature of property-based testing.
What is hypothesis?
`hypothesis` is one of the more popular implementations of property-based testing in Python. I had thought it originated from a PBT library in Haskell called QuickCheck, but after reading through a Haskell book earlier this year and talking to some folks online, I learned it came from a more "intelligent" PBT library in Haskell called Hedgehog, which applies built-in shrinking as part of the test generation suite.
You can find the `hypothesis` for Python documentation here, and additional documentation around the team behind Hypothesis here.
hypothesis strategies
The key idea behind `hypothesis` is strategy generation. You can see examples of it by running this code segment in your python terminal:

from hypothesis import strategies as st

st.text().example()
`hypothesis` supports basic strategies like those targeting Python's "primitive" types. You can create `st.integers()` or `st.text()`. You can also do more advanced things, like create strategies for structures like dictionaries or lists, combine strategies together, infer strategies from a regular expression, and build strategies from well-recognized third-party classes from pandas or Django.
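Here's a quick sketch of a few of those building blocks (the particular strategies are chosen purely for illustration):

from hypothesis import strategies as st

# Strategies for "primitive" types:
ages = st.integers(min_value=0, max_value=120)
names = st.text(min_size=1)

# Structured strategies built out of other strategies:
records = st.dictionaries(keys=names, values=ages)
name_lists = st.lists(names, max_size=5)

# Combining strategies:
name_or_age = st.one_of(names, ages)

# Inferring a strategy from a regular expression:
zip_codes = st.from_regex(r"\A\d{5}\Z")

print(records.example())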
hypothesis tactics
I took a lot of these tactics from a talk by Zac Hatfield-Dodds introducing PBT at PyCon US. Check out his website here.
The types of properties you can assert with `hypothesis` suggest certain tactics you can play with.
One of these tactics might be manually defining the property, then letting `hypothesis` generate the test cases for you. One common way this tactic is applied is to look for "round-trip" properties, where you look for the inverse method and check that applying both would result in the identity method.
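A classic round-trip example is serialization. This minimal sketch checks that `json.dumps` and `json.loads` invert each other for simple dictionaries:

import json

from hypothesis import given, strategies as st

@given(st.dictionaries(keys=st.text(), values=st.integers()))
def test_json_roundtrip(record):
    assert json.loads(json.dumps(record)) == record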
Another tactic might be to use an oracle function. This can be really helpful if you're refactoring a piece of code and you want to carry forward the original behavior of that method, regardless of whether it's correct or incorrect. In this case, that method becomes an oracle and you can apply the equal-by-value constraint between the original and refactored methods as a property to test against.
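To make that concrete, here's a minimal sketch of the oracle tactic, where `legacy_slugify` and `new_slugify` are hypothetical stand-ins for the original and refactored implementations:

from hypothesis import given, strategies as st

def legacy_slugify(title: str) -> str:
    # Hypothetical original implementation, kept around as the oracle.
    return title.strip().lower().replace(" ", "-")

def new_slugify(title: str) -> str:
    # Hypothetical refactored implementation that should preserve behavior.
    return "-".join(title.strip().lower().split(" "))

@given(st.text())
def test_refactor_preserves_behavior(title):
    assert new_slugify(title) == legacy_slugify(title)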
There are other tactics, like metamorphic testing or hyperproperties, that I won't get into here (because I honestly don't understand them yet), but one great resource besides the `hypothesis` documentation and the blogs / talks given by the `hypothesis` core devs is Hillel Wayne's blog.
Working through some real-world problems
So this is all well and good, but for me personally, working through some examples really helps solidify these learnings. So I took the opportunity to create a few different real-world code samples, so we can discuss how to implement property-based testing for each one.
Example: password validator
This first example describes a sample password validator that you might implement as part of your codebase. The rules for this password validator are that the password must have at least 8 characters and contain at least one special character, one uppercase character, one lowercase character, and one number. These rules might change by codebase or by context, so you can say it's some method that takes a string input and applies some set of gates in order to generate a boolean value. When you describe it like that, this method doesn't appear all that different from other methods you might encounter in your codebase.
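The actual implementation lives in the code samples repository; as a stand-in, here's a minimal sketch of what `password_is_valid` might look like under those rules (the special-character set is an assumption on my part, matching the regex used later):

def password_is_valid(password: str) -> bool:
    # At least 8 characters, with at least one digit, one lowercase letter,
    # one uppercase letter, and one special character.
    return (
        len(password) >= 8
        and any(c.isdigit() for c in password)
        and any(c.islower() for c in password)
        and any(c.isupper() for c in password)
        and any(c in "!@#$%^&*" for c in password)
    )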
Some regular unit tests
So this method is fully unit-testable, no need for complex integration tests. What do conventional unit tests look like for this method?
@pytest.mark.parametrize(
    "_input, _expected",
    [
        ("", False),
        ("ohautoeauna", False),
        ("732814084212", False),
        ("OAHUETNASU", False),
        ("&*&*()&", False),
        ("789&(*&(ht", False),
        ("aBcD1234@", True),
    ],
)
def test_password_matches_validation_for_inputs(_input, _expected):
    assert password_is_valid(_input) == _expected
This is applying `@pytest.mark.parametrize` over some amount of data in order to generate test cases for our unit test.
These are tests that you can come up with just by looking at the defined rules within the source code. But you might forget that in Python 3, strings are Unicode, which goes well beyond ASCII, and that the notion of a character and of character length can change by language. And once you remember that, you're left wondering what else you've forgotten about.
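As a quick illustration of how slippery the notion of a "character" is, two visually identical strings can compare unequal and have different lengths depending on how they're composed:

composed = "é"          # a single code point, U+00E9
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

print(len(composed))           # 1
print(len(decomposed))         # 2
print(composed == decomposed)  # False, despite rendering identically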
PBT approach
So this is a great opportunity to apply `hypothesis`. We have the rules available in the source code. If we're able to create a regular expression from these rules, we might be able to generate data that passes our password validation method. If it doesn't, then we know there's an error.
After fiddling around on Pythex to create a proper regular expression, I came up with this: r'(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]){8,}'. To break this regex down:

- (?=.*[0-9]) requires at least one numeric digit somewhere in the string.
- (?=.*[a-z]) requires at least one lowercase alphabetical letter.
- (?=.*[A-Z]) requires at least one uppercase alphabetical letter.
- (?=.*[!@#$%^&*]) requires at least one special character.
- {8,} is intended to enforce that the sample text is at least 8 characters long.
If you run st.from_regex(r"(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]){8,}").example(), you should be able to get a good amount of arbitrary output. Note that printable characters go beyond `string.printable` or just ASCII.
So, this means our unit test suite can be re-written as:

from hypothesis import given, strategies as st

# `hypothesis` test suite
@given(st.from_regex(r"(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]){8,}"))
def test_password_matches_validation_for_arbitrary_inputs(_input):
    assert password_is_valid(_input)
Example: basic webapp (REST to SQL)
This example describes a sample web-based API backend that you might use when starting a new SaaS project or internal microservice. It takes in some URL with URL parameters, then spits out a JSON response either affirming the creation of a record in the database, or erroring out with an accurate status code and a helpful error message. It should not respond with an HTML error page, or a stack trace of any sort. Various CHECK clauses are added to the SQL table definition to prevent insertion of invalid records at the database layer.
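The schema itself isn't shown here, but the CHECK clauses might look something like the following (the table and column names are assumptions on my part, using an in-memory SQLite database for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        name TEXT NOT NULL CHECK (length(name) > 0),
        age INTEGER NOT NULL CHECK (age >= 18)
    )
    """
)

# A valid record inserts fine; an invalid one raises sqlite3.IntegrityError.
conn.execute("INSERT INTO person (name, age) VALUES (?, ?)", ("Ted", 25))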
Regular approach: functional soup, imperative shell
Normally, a web application like this would contain a good deal more business logic, which would be unit-testable, but since our example is rather small, the source code mainly comprises the API layer and the database call. For this reason, we're going to focus primarily on writing good integration tests.
Flask has a `pytest` plugin called `pytest-flask` that makes it easy to create app clients. The `pytest` fixture would look something like this:
@pytest.fixture
def client():
    with app.test_client() as testing_client:
        with app.app_context():
            yield testing_client
Let's say we want to test that every type of string can be a name, and every type of int can be an age. It's pretty easy to create some basic tests:
def test_init_returns_ok(client):
    response = client.get("/init")
    assert response.status_code == 200

def test_test_returns_ok_with_params(client):
    response = client.get("/test?name=Ted&age=25")
    assert response.status_code == 200
One problem with this approach is that while it works for naive URL parameters, any string requiring URL encoding will fail. This happens even with names, since names can have special characters. It's far better to handle cases like these programmatically, rather than working around the webapp (e.g. requiring manual data entry for names with special characters). We're also not properly exercising either CHECK clause here. These are things we could try to fix in our existing test harness. However, we may be able to handle this better if we moved to a PBT-based approach.
PBT approach to web server
Pretty much any bytestream can be a legal name, so let's use the regex r"\A\d{1,}\Z" as our name strategy, like st.from_regex(r"\A\d{1,}\Z"). I tried (\W+){1,} and ([a-zA-Z]+){1,}, but according to this GitHub issue, it's important to set boundary markers to exclude prefixes and suffixes. We can then apply a filter to the stream of possible integer inputs in order to constrain ourselves to ages of at least 18, like st.integers().filter(lambda x: x >= 18).
The full test templates the name and age into a sanitized URL, calls the endpoint in order to add a record with those attributes to the database, and then checks for an HTTP 200 OK response:
import json
import urllib.parse

from hypothesis import given, strategies as st

@given(st.from_regex(r"\A\d{1,}\Z"), st.integers().filter(lambda x: x >= 18))
def test_test_returns_ok_with_arbitrary_inputs(client, name, age):
    url_string = f"/test?name={urllib.parse.quote(name.encode('utf-8'))}&age={age}"
    response = client.get(url_string)
    json_response = json.loads(response.data)
    assert json_response.get("status") == 200
From doing this, I noticed that Flask has an `escape` method in order to sanitize inputs from request arguments. I didn't notice that before, and wouldn't have noticed it if I had approached testing this service without `hypothesis`.
Where to go from here
These are really quick, whipped-up examples of how to apply property-based testing. I advocate for the use of property-based testing not just because it can catch edge cases, but because it changes your mindset towards programming. I've learned to think more critically about types and properties after using property-based testing, and developed programming habits that guard against certain classes of errors.