#todayilearned: Encodings in Python
Today, I wanted to discuss one major failing in Python 2.7.x, and why you need to pay attention to how your application encodes data, to which dependencies it relies on to do so, and to what assumptions those dependencies make.
One key difference between Python 2.7.x and Python 3.x is the default encoding of the strings your code produces. By default, Python 2.7.x assumes ASCII, which was designed back in the day for simplicity and efficiency and is mainly limited to English (though extensions of ASCII do exist). In contrast, Python 3.x strings are Unicode, which is designed from the ground up for internationalization. Mixing the two is a big no-no, and the change broke many Python 2 dependencies. It's a major reason why the migration from Python 2 to Python 3 has taken more than a decade: every dependency that assumed a particular encoding required code changes to become compatible with both Python 2 and Python 3. In fact, many companies still cannot migrate due to the sheer volume of Python 2 code they rely on. For example, Google's TensorFlow, first released in 2015 and used internally at Google, is still both Python 2 and Python 3 compatible. In the end, the core Python development team simply set an end-of-life date of January 1st, 2020 in order to relieve themselves of Python 2's shortcomings and accelerate development on Python 3.
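To make the difference concrete, here is a minimal sketch (the values are made up) of how the two versions treat mixed bytes and text:

```python
# Minimal sketch, runnable under Python 3.
data = b"caf\xc3\xa9"   # raw UTF-8 bytes, e.g. something read off the wire
text = "café"           # a Unicode string

try:
    data + text         # Python 3 refuses to mix bytes and str
except TypeError as err:
    print(err)

# Python 2.7 would instead decode `data` on the fly using the default ASCII
# codec, raising UnicodeDecodeError for any byte above 0x7f and silently
# producing a unicode object otherwise. The explicit, version-safe route:
print(data.decode("utf-8") + text)
```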
When I first started on my current project at work, development and production both used Python 2.7.x. Today, both thankfully use Python 3.x. But for a long stretch in between, development used Python 3.x while production used Python 2.7.x, which meant the source code needed to be both Python 2 and Python 3 compatible. This usually didn't end up being a problem, because I had access to the source code and because many people write and maintain Python 2/3 compatibility guides, like this one. In addition, most major dependencies had already migrated over to Python 3 and were aware of encoding conflicts.
During one sprint, I was committed to adding support in my tool for ingesting data into the database while security was turned on. The feature worked fine, and after I added some regression tests to my development environment, all of them passed with flying colors. Then I compiled a build and encountered an error while running my test suite against it: instead of a successful ingest, I got a timeout. Since every test keeps a copy of its CLI arguments, I manually ran the same test through the CLI in the distribution, and it passed with no problems. In case it was a server-side issue, I removed all data in the database through the UI and rebooted the database, then retried both paths; the discrepancy was still reproducible. That told me a silent failure was occurring somewhere in the test harness I had built.
I dove into the Python API to see what the request looked like as it was being sent out to the native REST endpoints. Imagine my surprise when the request body contained an invalid Unicode character:
'\xeeJ{"fields": ...'
That "\xeeJ" is not supposed to be there; the entire string is supposed to
be valid stringified JSON and should start with the {
character. When I added
an ipdb
trace and manually propagated up
the stack from the error, it turned out I was appending Unicode strings to the
binary data passed into a call to the
/insert/records
REST endpoint. The binary data is binary and cannot be changed because it is
already encoded in Apache Avro, as detailed in our
product documentation. So this is actually a
bytes
type, which unfortunately in Python 2.7.x is the same as the str
type:
(base) hostname:/path/to/directory username$ conda activate python2.7
(python2.7) hostname:/path/to/directory username$ python
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> str
<type 'str'>
>>> bytes
<type 'str'>
>>> str is bytes
True
>>>
Another unfortunate characteristic of Python 2.7.x is that appending a string of type `str` to a string of type `unicode` creates a string of type `unicode`:
(base) hostname:/path/to/directory username$ conda activate python2.7
(python2.7) hostname:/path/to/directory username$ python
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> string1 = 'hi'
>>> type(string1)
<type 'str'>
>>> string2 = u'hi'
>>> type(string2)
<type 'unicode'>
>>> string3 = string1 + string2
>>> string3
u'hihi'
>>> type(string3)
<type 'unicode'>
>>>
So somewhere up the line, a `unicode` string was concatenated with a `str` string, which propagated the Unicode type all the way down until the code tried to treat the Avro data as Unicode too, causing the encoding error.
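Under Python 2.7, that chain of events looks roughly like this (a sketch of the failure mode with made-up values, not the actual request code):

```python
# Python 2.7 semantics: one unicode fragment upstream promotes the whole body.
body = '{"fields": []}'         # starts life as a plain byte string (str)
body = u'' + body               # somewhere, a unicode fragment gets prepended
print(type(body))               # <type 'unicode'>

# Attaching real binary data now forces an implicit ASCII decode of the bytes,
# which fails on anything above 0x7f -- such as an Avro-encoded payload.
avro_blob = '\xee\x4a\x01\x02'  # pretend Avro bytes
body + avro_blob                # UnicodeDecodeError: 'ascii' codec can't decode byte 0xee
```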
When I made my way up the stack, I found that the entire JSON test case I had passed in was encoded in Unicode. This presented a worrying predicament: due to the nature of `pytest`'s dynamic test generation, a lot of obfuscation took place between the loading of the JSON test case and the creation of the individual test, and I didn't have enough time to look into the internals before switching back to other tasks. Hence, there remained the possibility that `pytest` itself erroneously encoded the data before passing it off to the parametrized stub method.
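For context, the harness looked roughly like the sketch below; the file path, helper, and test names are hypothetical, but the shape is the same: a JSON file is loaded once and fanned out into individual tests via `pytest.mark.parametrize`.

```python
import json
import pytest

def load_cases(path):
    # Each JSON file holds a list of test cases for one feature. Under
    # Python 2, json.load hands every string back as unicode.
    with open(path) as handle:
        return json.load(handle)

def run_cli_ingest(case):
    # Stand-in for the real helper, which builds CLI arguments from the case
    # and drives the tool end to end.
    assert isinstance(case, dict)

@pytest.mark.parametrize('case', load_cases('tests/cases/secure_ingest.json'))
def test_secure_ingest(case):
    # By the time an individual test runs, the original bytes from the JSON
    # file are long gone; only the parsed dictionary is visible here.
    run_cli_ingest(case)
```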
In other words, I may need to migrate off of `pytest` to a framework that may be much older or that I am much less familiar with, and complete another migration right after having migrated most of my tests to the data-driven test harness, none of which directly generates business value, at a time when stakeholders are asking questions about where those new features are.
In the end, I got lucky. I had some breathing room after completing some additional features, and since I refused to believe that `pytest`, a framework I trusted very much, would do something like change the encoding of string inputs without user direction, I went back to my program trace and checked at every level that the encoding did not change. It remained Unicode at every step. Then I reached the stage where the parametrized tests were generated, and the encoding still hadn't changed. I breathed a sigh of relief, as `pytest` no longer touched the data past that point. `pytest` was not responsible for changing the encoding, and I would be able to keep the dependency and avoid another expensive harness migration.
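In hindsight, a small guard dropped at each layer would have caught the promotion much earlier. Something like this hypothetical helper (not part of the real codebase) is what I was effectively doing by hand in `ipdb`:

```python
def assert_byte_payload(value, where):
    """Fail loudly if a payload has been silently promoted to unicode.

    Under Python 2.7, bytes is an alias for str, so this only rejects unicode
    objects; under Python 3 it would also reject ordinary text strings.
    """
    if not isinstance(value, bytes):
        raise TypeError('%s: expected a byte string, got %s' % (where, type(value)))

# Example: assert_byte_payload(request_body, 'before POST to /insert/records')
```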
So where did the change come from, then? As it turns out, it was the Python 2 standard library. Specifically, it was the `json` module that read the JSON file. Take a look at this:
(base) hostname:/path/to/directory username$ conda activate python2.7
(python2.7) hostname:/path/to/directory username$ echo '{"a":"b"}' > out.json
(python2.7) hostname:/path/to/directory username$ python
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.load(open('out.json'))
{u'a': u'b'}
>>>
When `json` reads a JSON file into a Python dictionary, it casts keys and values to Unicode. You can specify the encoding you want in the method call, as detailed in the documentation for `json.load`, but it doesn't seem to make a difference:
>>> json.load(open('out.json'), encoding='ascii')
{u'a': u'b'}
>>> json.load(open('out.json'), encoding='ASCII')
{u'a': u'b'}
So there we have it. The Python 2 standard library disagrees with Python 2 about what the standard encoding should be. I guess I should not have been too surprised. Python 2.7.x has been maintained for so long after new features stopped coming out for Python 2 that its primary purpose has mostly been holding the line until people move to Python 3. Maybe if legacy Python 2 services needed to communicate with newer Python 3 services, this behavior could be useful. It could also be an issue of which tool I used to write the JSON file (in this case, a third-party IDE called VS Code); if the file was written with a Unicode encoding, the `load` call may ignore an encoding argument that doesn't match. Still, it seems rather incongruous for the standard library to fracture something as fundamental as default encodings, and it seems like it should fall on the standard library to handle encoding translations for something as common as JSON, since that is such a low-level operation.
The solution to this problem was rather simple: traverse the entire Python dictionary after reading it from a JSON file, and cast the strings within to the default encoding of the program. In Python 3, this keeps the strings as Unicode; in Python 2, it casts them to the native `str` type.
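A minimal sketch of that traversal, assuming Python 2.7 and a UTF-8 program default (the helper name is mine, not the actual patch):

```python
import sys

def to_native_strings(obj):
    """Recursively cast unicode values from json.load back to native str."""
    if sys.version_info[0] >= 3:
        return obj                      # Python 3: str is already Unicode text
    if isinstance(obj, unicode):        # only evaluated under Python 2
        return obj.encode('utf-8')      # or whatever the program's default is
    if isinstance(obj, dict):
        return {to_native_strings(k): to_native_strings(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_native_strings(item) for item in obj]
    return obj

# Usage: cases = to_native_strings(json.load(open('out.json')))
```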
This brushfire took about three weeks to fix, from first encounter to final patch commit. I think this is the first bug that is seared into my brain, because of how unexpected it was.
So, what did I learn in this whole experience?
- How data is stored is way different from how it is displayed: I've had a number of other encoding run-ins since this bug, but none have bitten me as hard. Nowadays, I treat the byte-level representation as the source of truth, not what I see on my monitor. For an ETL tool, not assuming encodings is an important step toward better compatibility with enterprise systems.
- Just because you created it doesn't mean you understand it: I created the JSON files but didn't understand how the test harness processed them. In this sense, the problem came up from behind. Double-check that your own work passes muster before exploring other failure conditions.
- Keep your development and production environments as similar as possible: One reason I did not catch this error during development was that I used an entirely different version of Python than the distribution. Since the same test passed on development and during manual runs on the distribution, it appeared to be a transient issue, and given the pile of bugs that comes with any ETL tool, the priority of this one fell accordingly, making it harder to follow up on. When the distribution began using Python 3.x, these compatibility errors dropped off.