#todayilearned: Isolate your Development Environment
Today I wanted to share with you the hardest bug that I have encountered yet at work. It's a great argument for why language-specific "virtual environments" don't really work well in practice, especially if there's a mismatch between its intended use cases and yours, and why you should strictly isolate your development environment from your build environment even while compiling locally.
I use conda
for my development environment.
conda
is very helpful in terms of isolating and deploying your source code
only into production, but as it turns out, it can fail pretty hard in terms
of making sure the rest of your development environment remains free of any side
effects it generates.
Sometimes, I need to compile on-premise distributions of my ETL tool in order to verify its functional integrity from my development laptop. This custom-built compilation process, while it manages itself quite well, doesn't really check for side effects because it assumes it is compiling in our CI/CD pipelines.
You might see where this is going.
So there's this dependency called freetds
,
which enables your Unix-based operating system to talk to Windows-based database
products like Microsoft SQL Server. This is important if you want your ETL tool
to talk to SQL Server and ingest data from it. This dependency in turn requires
a dependency on openssl
, which is an
open-source implementation of the TLS/SSL encryption protocol. We keep the copy
of openssl
we compile with in a special location, and the compilation process
fetches that copy of openssl
.
One day, I was executing a compilation to check a particular feature. Instead of the rather straightforward build I was expecting, I got this logged out instead:
./.libs/libtdssrv.a(tls.o): In function `sk_GENERAL_NAME_num':
tls.c:(.text+0x7c): undefined reference to `OPENSSL_sk_num'
./.libs/libtdssrv.a(tls.o): In function `sk_GENERAL_NAME_value':
tls.c:(.text+0x9e): undefined reference to `OPENSSL_sk_value'
./.libs/libtdssrv.a(tls.o): In function `sk_GENERAL_NAME_pop_free':
tls.c:(.text+0xc3): undefined reference to `OPENSSL_sk_pop_free'
./.libs/libtdssrv.a(tls.o): In function `tds_pull_func_login':
tls.c:(.text+0xe5): undefined reference to `BIO_get_data'
./.libs/libtdssrv.a(tls.o): In function `tds_push_func_login':
tls.c:(.text+0x20c): undefined reference to `BIO_get_data'
./.libs/libtdssrv.a(tls.o): In function `tds_pull_func':
tls.c:(.text+0x27d): undefined reference to `BIO_get_data'
./.libs/libtdssrv.a(tls.o): In function `tds_push_func':
tls.c:(.text+0x2f0): undefined reference to `BIO_get_data'
./.libs/libtdssrv.a(tls.o): In function `tds_ssl_ctrl_login':
tls.c:(.text+0x36a): undefined reference to `BIO_get_data'
./.libs/libtdssrv.a(tls.o): In function `tds_init_ssl_methods':
tls.c:(.text+0x3c7): undefined reference to `BIO_meth_new'
tls.c:(.text+0x3e9): undefined reference to `BIO_meth_set_write'
tls.c:(.text+0x3fc): undefined reference to `BIO_meth_set_read'
tls.c:(.text+0x40f): undefined reference to `BIO_meth_set_ctrl'
tls.c:(.text+0x422): undefined reference to `BIO_meth_set_destroy'
tls.c:(.text+0x433): undefined reference to `BIO_meth_new'
tls.c:(.text+0x455): undefined reference to `BIO_meth_set_write'
tls.c:(.text+0x468): undefined reference to `BIO_meth_set_read'
tls.c:(.text+0x47b): undefined reference to `BIO_meth_set_destroy'
./.libs/libtdssrv.a(tls.o): In function `tds_deinit_openssl_methods':
tls.c:(.text+0x491): undefined reference to `BIO_meth_free'
tls.c:(.text+0x4a0): undefined reference to `BIO_meth_free'
./.libs/libtdssrv.a(tls.o): In function `tds_init_openssl':
tls.c:(.text+0x4da): undefined reference to `OPENSSL_init_ssl'
tls.c:(.text+0x4fa): undefined reference to `TLS_client_method'
./.libs/libtdssrv.a(tls.o): In function `tds_ssl_init':
tls.c:(.text+0xb92): undefined reference to `SSL_CTX_set_options'
tls.c:(.text+0xd6d): undefined reference to `BIO_set_init'
tls.c:(.text+0xd80): undefined reference to `BIO_set_data'
tls.c:(.text+0xebd): undefined reference to `SSL_set_options'
tls.c:(.text+0xef1): undefined reference to `SSL_get_state'
tls.c:(.text+0xfdb): undefined reference to `BIO_set_init'
tls.c:(.text+0xfee): undefined reference to `BIO_set_data'
collect2: error: ld returned 1 exit status
To be honest, this was a little terrifying. Not only did I not see anything like this before, I not know how C/C++ projects are compiled, and I also didn't understand how my own tool's compilation process worked underneath the hood because it was handled by another team.
Although I asked around, the bug got propagated, and multiple people took a look, none of us really knew what was going on. The distribution was compiling fine on the CI/CD pipelines, and because we were finalizing our major release and didn't have enough man-hours to devote to this, we decided to work on source code only changes, pay extra super-duper attention during code review, and just verify it worked in development before merging into release. Obviously, this dont-poke-the-kraken approach was unsustainable over the long run given how different the development and distribution environments were (How different? Different enough to bite us), and one day we simply needed to squash this bug. Our senior devops engineer and I began pair programming on my machine to diagnose this issue.
So first, we tried logging out all the environment variables that touched the
compilation. One is the default $PATH
variable. My mistake was letting conda
prepend its own path to $PATH
. If conda
isn't at the head of $PATH
,
running $(which python)
may turn up a different Python version in your
environment than conda
, which is not what conda
wants.
Did you know that conda
comes with its own version of OpenSSL (besides the
version that is compiled with Python)? I didn't either!
So this was an issue of the compile process finding the wrong version of OpenSSL
to use. I simply renamed the executable in /path/to/anaconda2/bin
from ssl
to ssl.bak
, and then made sure that $PATH
did not have the path to conda
prepended, but rather appended instead, as I still needed access to my
development environment. Somehow though, the OpenSSL binaries from conda
were
still being pulled into the build process, even though the path to the correct
version of OpenSSL was ahead in $PATH
and should have been picked up first.
All this didn't address one particular nagging concern; exactly why was
OpenSSL complaining of not understanding some internal variables? Even if
the slightly different version of OpenSSL was used, it should all still compile
correctly, right? For a while, I wondered whether OpenSSL published a broken
build that I downloaded and tried to compile, because pyspark
had came broken
with some of the newer release builds of Apache Spark. This turned out to not be
the case. Then I wondered whether freetds
relied on a specific version of
OpenSSL, which after combing through the source code, turned out not to be the
same either. So this meant the freetds
compilation errors were blocked by
OpenSSL build failures while the OpenSSL libraries worked fine previously
(proved out by regressions successfully transporting data over HTTPS), and while
freetds
did not rely on a specific OpenSSL version.
We needed additional information on how to turn on verbose output during
compilation of freetds
to see what inputs it was taking in. There wasn't a
whole lot of information in the README for freetds
on how to do this, and not
much information online. Eventually we found a variable called AM_V_CC
, which
is used in make
debugging.
Recursively grep
-ing for this variable turned up the statement AM_V_CC = $(am__v_CC_$(V))
was found. Recursively grep
-ing am__v_CC_*
found the
statement am__v_CC_ = $(am__v_CC_$(AM_DEFAULT_VERBOSITY))
in
fisql
's Makefile,
which then led us to recursively grep
for AM_DEFAULT_VERBOSITY
.
A lot of work to tell us verbosity was passed in with the flag --V={0,1}
, 1
for verbose output. But now, we had the keys to the kingdom.
Now, when we ran make clean && make V=1 2>&1 | tee debug.out
, we could see
something suspicious:
In file included from /path/to/anaconda2/include/openssl/e_os2.h:13:0,
from /path/to/anaconda2/include/openssl/ssl.h:15,
from ../../include/freetds/tls.h:37,
from tls.c:55:
Somehow, miraculously, /path/to/anaconda2/include
was showing up! This means
some files related to OpenSSL are fetched from the conda
version instead of
the proper location, and being used by freetds
during compilation! So why is
this happening?
After executing the build process with an inline $PATH
excluding the directory
entirely, tee
-ing the verbose output to a file, and vimdiff
-ing it with a
log of the original run, the file /path/to/anaconda2/bin/odbc_config
was
discovered. This executable had a --include-prefix
flag, which when run added
/path/to/anaconda2/include
to the make
command.
Did you know that conda
came with an odbc_config
executable? I didn't
either! I still don't even know what it does. I renamed it to odbc_config.bak
,
reran compilation, and everything worked again.
So in conclusion, an unknown executable caused the freetds
Makefile
configuration setup to include the conda
header files, causing the header
files for OpenSSL to be fetched from a separate directory than the source files
for a separate version of OpenSSL. The headers and source were not compatible
with each other, which caused OpenSSL to fail compilation.
In doing all of this, I broke my development environment. I had failed to script the development environment install because small bus factors meant prioritization of features over taking time to setup reproducible development environments. This wasn't a large loss by any means, as I had kept instructions and flight rules elsewhere; this was just somewhat inconvenient. An hour of bash scripting, and it's patched, good as new.
So what did I learn from this whole experience?
-
conda
is a virtual environment for Python, and not a full-service virtual environment. If you are deploying source code, training models, or otherwise using Python at a high level,conda
is a good fit. Otherwise, consider a tool likevagrant
ordocker
. Vagrant in particular is intended for development environments, with source files compiled on a virtual machine connected to a regular developer front-end. I've heard of development issues with containerization (exactly what I don't remember), so if I had to do it over again, I would rely on Vagrant. -
When in doubt, turn on verbose logs during compilation and
diff
those logs against those you know work. We would not have been able to debug this issue had the logs not mentioned the incorrect include paths. -
Source code is the true documentation, especially for more obscure and less well maintained open source projects. You can't be afraid of diving into the source because sometimes that's the only way you can figure out what's going on.
When we have the time, we will probably create separate CI/CD pipelines for client-side tooling in order to not risk encountering issues like these.
Much thanks to Jeremy Hutchins (senior devops engineer) for the assistance; solving this problem would not be possible without him.