Recently, I found myself needing to generate a great deal of randomized data (e.g. a CSV file with a billion non-trivial records) to run load and performance tests, and to ensure that the ETL pipelines I developed ingested data correctly from static files into the database. Now, our products are quite fast. Very fast, actually; you can create a gigabyte-scale CSV export from a randomly generated table in seconds. So ultimately, data generation wasn’t the performance bottleneck for me.
The bottleneck was the very ETL product I created. In its current form, it’s
slow to run and slow to test. If this ETL pipeline has to be a performance
bottleneck in our product stack, it should at least try and be efficient in its
use of system resources. It’s not really trying at the moment. Ingestion through
the default single-threaded Python paths I wrote was very slow in comparison to
the one-off ETL tools this pipeline is intended to replace, and while there are
significant performance benefits to integrating pyspark as an acceleration
layer, it’s not a bulletproof solution. It’s not just calendar mismatches you
need to worry about.
Imagine debugging a deadlock that occurs in a JVM created by Python, with the
learned knowledge that py4j, the Java-to-Python bridge used underneath pyspark,
will cut off any logging output from log4j. You wouldn’t even get proper logging
output to get an idea of what’s going on, assuming there is even enough logging
in whatever dependency the deadlock occurred in to begin with. If it’s a
transient issue, you could just restart the run as a
workaround, but in general restarting ETL runs is expensive because you may need
to toss a lot of data or deduplicate it, and both waste a lot of time. In
addition, each concurrency model you introduce leaves you with fewer options for
unifying them in a workable manner. If you wanted to implement
concurrent ETL execution runs at the top-level interface, how would you inform
pyspark that a Python-based ETL run to ingest GIS-specific data (whose logic
can’t be easily ported to pyspark) into the same database is occurring at the
same time? A similar issue bedevils executing regression tests in parallel. I
don’t think you can parallelize regression tests if you can’t guarantee thread
safety at the interface level, which I don’t think our dependencies do.
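To make "thread safety at the interface level" a bit more concrete, here is a minimal sketch of what I have in mind, using only Python's standard `threading` module. Everything in it is hypothetical: `SharedSink` is a stand-in for the shared database, and `run_etl` is a stand-in for an ETL run. It is not how our pipeline works. It also only coordinates threads inside a single Python process, so it says nothing about how you would tell a JVM behind pyspark that another run is in flight, which is exactly the part I don't know how to solve.

```python
import threading

BATCH_SIZE = 20   # rows per batch (arbitrary, for illustration)
N_BATCHES = 50    # batches per ETL run (arbitrary, for illustration)


class SharedSink:
    """Hypothetical stand-in for a database that several ETL runs write to."""

    def __init__(self):
        self._lock = threading.Lock()
        self.rows = []

    def write_batch(self, source, batch):
        # The lock is the "interface-level" guarantee: a batch from one run
        # is never interleaved with a batch from another run.
        with self._lock:
            for row in batch:
                self.rows.append((source, row))


def run_etl(sink, source):
    """Hypothetical ETL run: writes N_BATCHES batches of integers to the sink."""
    for b in range(N_BATCHES):
        start = b * BATCH_SIZE
        sink.write_batch(source, list(range(start, start + BATCH_SIZE)))


sink = SharedSink()
threads = [
    threading.Thread(target=run_etl, args=(sink, name))
    for name in ("tabular", "gis")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(sink.rows))
```

The point of the sketch is that the caller never has to think about locking. If the dependencies underneath don't offer that kind of guarantee, every caller has to reinvent it, and a parallel test runner has no safe way to share them.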
I guess what I’m trying to say is I have a lot of problems, and I don’t have much visibility into what an ideal solution would be. Because I have no background in concurrency models, any concurrency model, I don’t know what I don’t know in this domain.
This does give me something to pursue, though. Now I have a problem I can solve.
I went ahead and purchased Paul Butcher’s Seven Concurrency Models in Seven Weeks, and I’ll try to communicate what I learn about each model in a series of blog posts.