Product Dimensionality
"The bigger the interface, the weaker the abstraction."
Today I wanted to introduce, or perhaps re-introduce, a nice way to think about client-side abstractions.
While I don't believe in a "right" and "wrong" way of writing code, I think it's fair to say that for a given set of requirements, you can establish a total ordering for "desirability" over a distribution of possible known solutions, and the primary problem in software is how to evolve your solution given an evolving set of requirements. My current experience teaches me that there is no hardship quite like persistent state. Luckily, I've mostly worked on client-side applications, and the nice thing about client-side applications is there isn't really any need to preserve state and hence any true complexity. Even so, I've found myself slogging through developing features at work, and I've been thinking about whether the abstraction I'm using is suitable for the use case at hand.
I'm building an ETL client, and generally my issue requests fall into three buckets:
-
Supporting a new format: This generally involves supporting various type definitions for different data formats and converting between type definitions in as lossless a manner as possible.
-
Adding a feature flag: Usually, to allow users to override default behaviors of the tool to tailor results to their specific requirements.
-
Making sure it works: I didn't fully cover one or both of the above cases, and a use case in production resulted in undefined or unexpected behavior.
In practice, it's a bit more complex than this, but the nature of the game is unchanged. So, what's hard about this? Why should I find myself slogging through developing new features?
To convey this conundrum, I want to introduce the term product dimensionality. It's something I made up; I couldn't find anything on this online, and it's not to be confused with the curse of dimensionality or dimensionality reduction. I'll define it as "how similar your solutions to new requirements are to older requirements". So a highly dimensional product integrates multiple similar solutions, and a low dimensional product integrates multiple dissimilar solutions.
When I worked as a full-stack developer, I generally could not anticipate what the next feature request would be, and a solution to one randomly selected problem doesn't look like the solution to another randomly selected problem. In this sense, CRUD applications would count as low-dimensionality products. Of course, you can always simplify sub-problems within this domain. For example, you can boil down any one view in an analytics dashboard to a particular SQL query, and then let the user define and save a set of SQL queries representing interesting data perspectives to the user as your analytics dashboard screen. However, how you approach designing an analytics dashboard is different from how you might approach implementing purchase orders, which might query third-party APIs and save PDFs to BLOB storage, in addition to adding SQL queries that read and write, as opposed to only read from the database. This means that the total set of abstractions relevant to all full-stack developers may be rather limited. The 12 Factor App, a list of design recommendations for software-as-a-service applications, may be a good set of heuristics to follow, but it's more a manifesto than an instructional guide.
As a data engineer, feature requests for this ETL client fit rather nicely within the three existing buckets I described previously. This means every feature looks pretty much like every other feature. In this sense, ETL applications would count as high-dimensionality products. If you have a high amount of automation in your story management tool, and you have a design abstraction you like, you could probably create a checklist to follow when implementing a feature request, and run automated checks in your CI/CD pipeline to ensure all bases are covered. You could even move your checklist to a SQL database, and then run SQL migrations when your request dimensions change and generate tickets for features to be added or updated for previously implemented stories. Add a type translation layer. Add a plugins interface for type extensions. Add CLI documentation. Add end-to-end and load tests and comparison logic. Rinse and repeat. The problem is that highly structured.
Right?
The answer is, sort of. There's a number of quirks to general ETL design:
-
Not all SQL is the same: In fact, not a whole lot of it is. I believe most SQL is at least up to the SQL92 standard, although some data warehouses and legacy databases may have older definitions of SQL. Many SQL databases also recognize portions of different SQL standards for backwards compatibility reasons. This presents some problems when trying to run certain SQL statements. For example, Teradata does not support
DROP TABLE IF EXISTS
syntax that exists in PostgreSQL or MySQL, using a series ofGOTO
statements and labels instead (community forum post here). Oracle also does not haveDROP TABLE IF EXISTS
syntax, preferring to run a try/catch statement instead (community forum post here). If you're trying to define batched reads from a database, the disjoint nature of SQL implementations may pose an impediment. -
Not all database clients are the same: Not every database has something as nice as
psql
, a multi-platform interactive terminal for PostgreSQL. Oracle has something called SQL*Plus, which is likepsql
but involves referencing additional.ora
configuration files for defining and listening to a particular Oracle service using Oracle TNS. Teradata has SQL Assistant, but it's Windows-only; the alternative UNIX client is Basic Teradata Query (BTEQ), which as far as I can tell has no available documentation anywhere (noman
pages, no CLI help menu, and no online documentation, although Teradata can email you a PDF if you're specific enough in referencing what you want).These issues are likely why the SQL community came up with the ODBC and JDBC standards, which are simpler and more standardized ways to connect to various databases. However, I believe these standards are still limited by the SQL syntax supported by each SQL database.
-
Feature gaps exist depending on the product use case: For example, a particular SQL database catering to various federal government agencies may have specific functionality to satisfy Federal Information Processing Standards. Other databases with heavier relative commercial usage may not have such functionality. If the company offer a database product to a federal agency, data ingestion from one government database to another may need to abide by FIPS regulations (depending on the size/importance of the contract/RFP). If available, this will need to be implemented for the ingest stage of a proof-of-concept sale if you want to chase government agencies as customers.
This is a tasting menu of the fundamental limitations for an ETL tool needing to interface with wholly external services and the outside world. Add in the troubles of ingesting dirty, semi-structured data with no SQL interface at all, like CSVs (especially CSVs that do not follow a particular specification, like RFC 4180), weird data definitions causing edge-case performance behavior (say, a customer defining an entire city as one multi-megabyte polygon in one record, resulting in a slow records/second ingest and triggering a slow ingest warning), the difficulties of working with legacy code, and other problems of translating requirements to reality, and the highly structured nature of the problem appears to loosen before your eyes.
The most lesson to me in appreciating product dimensionality is how it affects the kinds of design tradeoffs that are possible or practicable for my particular use case.
As a full-stack software engineer, I've used Flask and Django in the past, and both come with particular design abstractions. Flask is literally the Wild West. This is a full Flask application (cut and pasted from the Flask homepage):
from flask import Flask
app = Flask(__name__)
@app.route("/")
def hello():
return "Hello World!"
No design abstraction whatsoever. It's great for smaller projects, and if you design your product with lots of native Python and clear separation between code and protocol, you can cut and paste your application into a different framework with little difficulty. On the other hand, if you are hiring somebody who claims they know Flask, you can only be certain they know about this much Python (unless they demonstrate otherwise), and as every Flask application is different because there's no design, there's really no such thing as "Flask best practices".
In the middle, you have something like Django, which preaches The Zen of
Python. It does a lot of the work
for you (including a very nice pre-configured production dashboard), but it
still allows you to change things around. See the difference between something
like
cookiecutter-django
, and
something generated by django-admin startproject mysite
. If you interview a
Django developer, you probably have enough to talk about over a beer (and the
water cooler if you hire him/her).
On the far side of the spectrum, you have something like
apistar
, a still experimental Python
web framework that has you define YAML files to implement the
OpenAPI/Swagger standard. Cool
to try out, good luck using it in production because implementing a REST
endpoint that is not OpenAPI/Swagger may result in breaking the design
encapsulation of the framework.
Three examples, with three sets of design tradeoffs. You could see how in
designing a CRUD application, you may start off with Flask since your problem
space is so massive and your solution so ill-defined, that you want as much
flexibility as possible when building something to satisfy the market. Later on,
as you formalize your solution and tailor it towards a particular audience, you
may choose to move to Django to "professionalize" your application. However, you
may never get to the point of committing only YAML files to version control
using apistar
if your problem model may not be that dimensional.
odo
, the open-source project that
my project is forked off of, helpfully exposes Python APIs for integrating a
backend into the project, which I discussed in my series on data-driven
testing.
I would consider odo
to be a medium-dimensionality product. The APIs it
exposes are not direct links to methods, but rather dispatch handlers to a given
method handled by the arguments passed in the function signature. Guido van
Rossum did a blog post introducing multimethods in
Python, and odo
leverages this knowledge through a third-party dependency called
multipledispatch
.
The flexibility the medium-dimensonality abstractions multipledispatch
and
odo
provided me, both in the features implemented in this abstraction and the
ability to extend this abstraction, helped me immensely when I started off. For
instance, if there are particular options that need to be set for a specific
JDBC client of a SQL database, they can be done in one dispatch handler without
affecting the others. You could add in a feature flag for that feature you wish
to expose, list it in the feature-specific CLI documentation, and then pass the
configuration along in **kwargs
. Another example would be implementing direct
path ingest using a custom-defined path()
handler. As odo
is a data
transformation tool and not an ETL tool, we had designed the ETL tool to
transform to an "intermediate format" in a hub-and-spoke model to tame the graph
nature of the framework and provide predictable results. Therefore, its typing
schema would be more tightly bound to the intermediate format than we would like
sometimes, and you may experience unpredictable results if your source format
had a typing schema that didn't overlay well with this "intermediate format".
With the "direct path" transformation, you could provide a custom-defined type
conversion between something like a
shapefile and a SQL database, and go
around this problem. Finally, if there was a particular pattern noticed in how
dispatch handlers were implemented, you could refactor the code after ensuring
functionality was deduplicated by creating a number of utility methods. odo
didn't block me from doing what was necessary to make a sale -- yet it
accelerated development to the point where a single developer could provide for
a large number of the company's customers at once.
Recently though, I had begun feeling jealous of tools like Apache
Airflow and
intake
, that are very
configuration-driven when viewed by the end user. Why?
I think the problem's root starts at addressing a high-dimensonality problem
with a medium-dimensionality abstraction. While odo
was wonderful in getting
my feet wet in the field of data engineering, I do think it's still very
code-heavy, and I blame the notion of using an abstraction that is simply too
loose for the task at hand. You need to write a lot of code in order to ensure a
new feature is supported, and writing code generates technical debt and costs
time and effort, as well as opportunity costs for other, perhaps more fruitful
initiatives. The abstraction isn't powerful enough on a day-to-day development
basis.
It has other costs, too:
-
I can't guarantee how new features should be integrated to downstream dependencies: There's no predictable schema; I have to manually ensure that the CHANGELOG is up to date, send out emails, or walk over to somebody's desk if they have problems. That's because every new feature flag I create has a made up name. I fear comparisons to Oracle, where this ETL tool will dissolve into an indecipherable mess of feature flags and macros that override each other. It's difficult, if not impossible, to scale development in this fashion. If the end user could validate which features are available programatically, this wouldn't be a problem, and while I do have a Python API ideally mimicing a thin client on top of the core ETL logic, in practice it remains unused because many services are not Python-based. Instead, we start from the root of
bash
compatibility instead. As in, they generate and sendbash
commands to my tool, which parses thebash
for arguments. -
Type conversions are opaque: Oftentimes, they're hidden behind a dependency, and there's no authoritative typing paradigm in the project itself, and any type conversion errors have to be patched on an ongoing basis. This means the typing system is more a patchwork quilt than a well-thought out tapestry. This is immensely frustrating because lossy type conversions will lose some data permanently, and are some of the easiest problems to fix. Even if you define many types that are used once, it's a lot better than relying on many different client that version differently from you and each other. It also means you're relegated to a non-source-of-truth role (like analytics), where it doesn't matter as much if some data is lost, and blocks you from offering your tool for OLTP workloads.
-
Feature migrations are manual, and feature integrity is questionable: For example, let's say you wanted to migrate a current logging/analytics stack into this project. You'd need to manually hook into every ingest path in order to ensure successful data pass-through into said logging/analytics. This opens up a wide field for basic manual errors, where the developer can fat-finger something without thinking about it, and the error would not show up until runtime. Other features, like setting a default batch size to read a certain number of records at once, or graceful failure where any batch's ingestion failure wouldn't affect any other batch, would also need to be ported manually. This means given a number of features, and a number of ingest/export paths, the dimensionality of the problem grows exponentially, and would soon exceed my ability to tame it.
There are some abstractions that I can implement that address some of these above issues. One is ensuring a clear separation between the request layer (the type of transformation to apply), and the data layer (the actual moving of data from one location to another). This is done by leveraging Apache Spark, and the use of its lazily evaluated dataframes and its ability to spill to disk upon memory overload. It's not perfect -- a lot of issues occur with JVM configuration management -- but it's alot better than mixing requests and data together in a Python-native tool.
Any other abstractions, such as defining a set of immutable, transactional
pipelines from the client up, or using a configuration file with strictly
defined schemas instead of a feature flag, might be implemented as shims on the
grandfathered legacy functions. It's still possible to use odo
in this sense,
and would get around the problem of having the perfect be the enemy of the good.
But it would take a good amount of additional work to implement, validate, and
sell.
If you are starting off on a new project, definitely think about how dimensional your problem model is, and whether your product's dimensionality matches that of the problem. It may save you a lot of time and energy down the line.