#todayilearned: Encodings in Python
Today, I wanted to discuss one major failing in Python 2.7.x, and why you need to pay attention to how your application encodes data, to which dependencies it relies on to do so, and to what assumptions those dependencies make.
One key difference between Python 2.7.x and Python 3.x is the default encoding of the strings your code produces. By default, Python 2.7.x assumes ASCII, which was designed back in the day for simplicity and efficiency and is mainly limited to English (though extensions of ASCII do exist). In contrast, Python 3.x strings are Unicode, which is designed from the ground up for internationalization. Mixing the two is a big no-no, and the change broke many Python 2 dependencies. It's a major reason why the migration from Python 2 to Python 3 has taken more than a decade: every dependency that assumed a particular encoding required code changes to become compatible with both Python 2 and Python 3. In fact, many companies still cannot migrate due to the sheer volume of Python 2 code they rely on. For example, Google's TensorFlow, first released in 2015 and used internally at Google, is still both Python 2 and Python 3 compatible. In the end, the core Python development team simply set an end-of-life date of January 1st, 2020 in order to relieve themselves of Python 2's shortcomings and accelerate development on Python 3.
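To make the difference concrete, here is a minimal sketch (the values are made up) of how the two versions treat mixed bytes and text:

```python
# Minimal sketch, runnable under Python 3.
data = b"caf\xc3\xa9"   # raw UTF-8 bytes, e.g. something read off the wire
text = "café"           # a Unicode string

try:
    data + text         # Python 3 refuses to mix bytes and str
except TypeError as err:
    print(err)

# Python 2.7 would instead decode `data` on the fly using the default ASCII
# codec, raising UnicodeDecodeError for any byte above 0x7f and silently
# producing a unicode object otherwise. The explicit, version-safe route:
print(data.decode("utf-8") + text)
```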
When I first started on my current project at work, development and production both used Python 2.7.x. Today, both thankfully use Python 3.x. But for a long stretch in between, development used Python 3.x while production used Python 2.7.x, which meant the source code needed to be both Python 2 and Python 3 compatible. This usually didn't end up being a problem, because I had access to the source code and because many people write and maintain Python 2/3 compatibility guides, like this one. In addition, most major dependencies had already migrated over to Python 3 and were aware of encoding conflicts.
During one sprint, I was committed to adding support in my tool for ingesting data into the database while security was turned on. The feature worked fine, and after I added some regression tests to my development environment, all of them passed with flying colors. Then I compiled a build and encountered an error while running my test suite against it: instead of a successful ingest, I got a timeout. Since every test keeps a copy of its CLI arguments, I manually ran the same test through the CLI in the distribution, and it passed with no problems. In case it was a server-side issue, I removed all data in the database through the UI and rebooted the database, then retried both paths; the discrepancy was still reproducible. That told me a silent failure was occurring somewhere in the test harness I had built.
I dove into the Python API to see what the request looked like as it was being sent out to the native REST endpoints. Imagine my surprise when the request body contained an invalid Unicode character:
'\xeeJ{"fields": ...'
That "\xeeJ" is not supposed to be there; the entire string is supposed to
be valid stringified JSON and should start with the {
character. When I added
an ipdb
trace and manually propagated up
the stack from the error, it turned out I was appending Unicode strings to the
binary data passed into a call to the
/insert/records
REST endpoint. The binary data is binary and cannot be changed because it is
already encoded in Apache Avro, as detailed in our
product documentation. So this is actually a
bytes
type, which unfortunately in Python 2.7.x is the same as the str
type:
(base) hostname:/path/to/directory username$ conda activate python2.7
(python2.7) hostname:/path/to/directory username$ python
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> str
<type 'str'>
>>> bytes
<type 'str'>
>>> str is bytes
True
>>>
Another unfortunate characteristic of Python 2.7.x is that appending a string of type `str` to a string of type `unicode` creates a string of type `unicode`:
(base) hostname:/path/to/directory username$ conda activate python2.7
(python2.7) hostname:/path/to/directory username$ python
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> string1 = 'hi'
>>> type(string1)
<type 'str'>
>>> string2 = u'hi'
>>> type(string2)
<type 'unicode'>
>>> string3 = string1 + string2
>>> string3
u'hihi'
>>> type(string3)
<type 'unicode'>
>>>
So somewhere up the line, a `unicode` string was concatenated with a `str` string, which propagated the Unicode type all the way down until the code tried to treat the Avro data as Unicode too, causing the encoding error.
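Under Python 2.7, that chain of events looks roughly like this (a sketch of the failure mode with made-up values, not the actual request code):

```python
# Python 2.7 semantics: one unicode fragment upstream promotes the whole body.
body = '{"fields": []}'         # starts life as a plain byte string (str)
body = u'' + body               # somewhere, a unicode fragment gets prepended
print(type(body))               # <type 'unicode'>

# Attaching real binary data now forces an implicit ASCII decode of the bytes,
# which fails on anything above 0x7f -- such as an Avro-encoded payload.
avro_blob = '\xee\x4a\x01\x02'  # pretend Avro bytes
body + avro_blob                # UnicodeDecodeError: 'ascii' codec can't decode byte 0xee
```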
When I made my way up the stack, I found that the entire JSON test case I had passed in was encoded in Unicode. This presented a worrying predicament: due to the nature of `pytest`'s dynamic test generation, a lot of obfuscation took place between the loading of the JSON test case and the creation of the individual test, and I didn't have enough time to look into the internals before switching back to other tasks. Hence, there remained the possibility that `pytest` itself erroneously encoded the data before passing it off to the parametrized stub method.
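For context, the harness looked roughly like the sketch below; the file path, helper, and test names are hypothetical, but the shape is the same: a JSON file is loaded once and fanned out into individual tests via `pytest.mark.parametrize`.

```python
import json
import pytest

def load_cases(path):
    # Each JSON file holds a list of test cases for one feature. Under
    # Python 2, json.load hands every string back as unicode.
    with open(path) as handle:
        return json.load(handle)

def run_cli_ingest(case):
    # Stand-in for the real helper, which builds CLI arguments from the case
    # and drives the tool end to end.
    assert isinstance(case, dict)

@pytest.mark.parametrize('case', load_cases('tests/cases/secure_ingest.json'))
def test_secure_ingest(case):
    # By the time an individual test runs, the original bytes from the JSON
    # file are long gone; only the parsed dictionary is visible here.
    run_cli_ingest(case)
```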
In other words, I may need to migrate off of `pytest` to a framework that may be much older or that I am much less familiar with, and complete another migration right after having migrated most of my tests to the data-driven test harness, none of which directly generates business value, at a time when stakeholders are asking questions about where those new features are.
In the end, I got lucky. I had some breathing room after completing some additional features, and since I refused to believe that `pytest`, a framework I trusted very much, would do something like change the encoding of string inputs without user direction, I went back to my program trace and checked at every level that the encoding did not change. It remained Unicode at every step. Then I reached the stage where the parametrized tests were generated, and the encoding still hadn't changed. I breathed a sigh of relief, as `pytest` no longer touched the data past that point. `pytest` was not responsible for changing the encoding, and I would be able to keep the dependency and avoid another expensive harness migration.
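In hindsight, a small guard dropped at each layer would have caught the promotion much earlier. Something like this hypothetical helper (not part of the real codebase) is what I was effectively doing by hand in `ipdb`:

```python
def assert_byte_payload(value, where):
    """Fail loudly if a payload has been silently promoted to unicode.

    Under Python 2.7, bytes is an alias for str, so this only rejects unicode
    objects; under Python 3 it would also reject ordinary text strings.
    """
    if not isinstance(value, bytes):
        raise TypeError('%s: expected a byte string, got %s' % (where, type(value)))

# Example: assert_byte_payload(request_body, 'before POST to /insert/records')
```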
So where did the change come from, then? As it turns out, it was the Python 2 standard library. Specifically, it was the `json` module that read the JSON file. Take a look at this:
(base) hostname:/path/to/directory username$ conda activate python2.7
(python2.7) hostname:/path/to/directory username$ echo '{"a":"b"}' > out.json
(python2.7) hostname:/path/to/directory username$ python
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.load(open('out.json'))
{u'a': u'b'}
>>>
When `json` reads a JSON file into a Python dictionary, it casts keys and values to Unicode. You can specify the encoding you want in the method call, as detailed in the documentation for `json.load`, but it doesn't seem to make a difference:
>>> json.load(open('out.json'), encoding='ascii')
{u'a': u'b'}
>>> json.load(open('out.json'), encoding='ASCII')
{u'a': u'b'}
So there we have it. The Python 2 standard library disagrees with Python 2 about what the standard encoding should be. I guess I should not have been too surprised. Python 2.7.x has been maintained for so long after new features stopped coming out for Python 2 that its primary purpose has mostly been holding the line until people move to Python 3. Maybe if legacy Python 2 services needed to communicate with newer Python 3 services, this behavior could be useful. It could also be an issue of which tool I used to write the JSON file (in this case, a third-party IDE called VS Code); if the file was written with a Unicode encoding, the `load` call may ignore an encoding argument that doesn't match. Still, it seems rather incongruous for the standard library to fracture something as fundamental as default encodings, and it seems like it should fall on the standard library to handle encoding translations for something as common as JSON, since that is such a low-level operation.
The solution to this problem was rather simple: traverse the entire Python dictionary after reading it from a JSON file, and cast the strings within to the default encoding of the program. In Python 3, this keeps the strings as Unicode; in Python 2, it casts them to the native `str` type.
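A minimal sketch of that traversal, assuming Python 2.7 and a UTF-8 program default (the helper name is mine, not the actual patch):

```python
import sys

def to_native_strings(obj):
    """Recursively cast unicode values from json.load back to native str."""
    if sys.version_info[0] >= 3:
        return obj                      # Python 3: str is already Unicode text
    if isinstance(obj, unicode):        # only evaluated under Python 2
        return obj.encode('utf-8')      # or whatever the program's default is
    if isinstance(obj, dict):
        return {to_native_strings(k): to_native_strings(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_native_strings(item) for item in obj]
    return obj

# Usage: cases = to_native_strings(json.load(open('out.json')))
```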
This brushfire took about three weeks to fix, from first encounter to final patch commit. I think this is the first bug that is seared into my brain, because of how unexpected it was.
So, what did I learn in this whole experience?
- How data is stored is way different from how it is displayed: I've had a number of other encoding run-ins since this bug, but none have bitten me as hard. Nowadays, I treat the byte-level representation as the source of truth, not what I see on my monitor. For an ETL tool, not assuming encodings is an important step toward better compatibility with enterprise systems.
- Just because you created it doesn't mean you understand it: I created the JSON files but didn't understand how the test harness processed them. In this sense, the problem came up from behind. Double-check that your own work passes muster before exploring other failure conditions.
- Keep your development and production environments as similar as possible: One reason I did not catch this error during development was that I used an entirely different version of Python than the distribution. Since the same test passed on development and during manual runs on the distribution, it appeared to be a transient issue, and given the pile of bugs that comes with any ETL tool, the priority of this one fell accordingly, making it harder to follow up on. When the distribution began using Python 3.x, these compatibility errors dropped off.