Concurrency with Python: Data-Intensive Architectures


Overview

The field of data-intensive computing concerns systems designed around datasets much larger than a single machine's disk or memory, which must therefore be persisted and processed across multiple machines. Ensuring that data-intensive tasks are correct, performant, and efficient, among other criteria, is a key priority when designing scalable systems. Achieving those goals means leveraging the concurrency primitives the language(s) in use make available to the developer.

Martin Kleppmann's "Designing Data-Intensive Applications" is a much more thorough and general introduction to this field. I covered the book in a book review this year, which helped solidify some of the core principles behind building large-scale systems.

Python brings a number of strengths around data-intensive processing:

  • Python-based I/O-bound tasks are not affected by the GIL: CPython releases the GIL while a thread blocks on I/O, and since I/O-bound work usually dominates the wall-clock time of a data-intensive process, the GIL should not set the real latency floor for I/O-bound Python processes. Of course, this claim comes with its own disclaimers. To verify which circumstances apply to your situation, profile your system with tools like pyflame or snakeviz and examine how real time is spent over the lifetime of the process. A short threading sketch follows this list.

  • Support for typing across its C FFI: Python exposes both the ability to define custom Python types via C extensions and the ability to call shared libraries with the ctypes library. In data-intensive tasks, one prime difficulty is maintaining a common type system across all services: different definitions of common types (e.g. float) cause data loss or corruption, many-to-many mappings between language-specific types rapidly lead to confusion, and some data specifications (like that of ESRI Shapefiles) are defined at the byte level. This problem is also extraordinarily difficult to retrofit a solution for, since other libraries (like one implementing a common data definition language, or DDL) depend on the type system for their own definitions of correctness. With the ability to define types in C, and to use C types explicitly at the Python level, Python can adopt an entire third-party type system if need be and communicate with other languages over a commonly accepted one (see the ctypes sketch after this list).

  • Continued investment into existing Python-based tooling: Python libraries intended for high performance on a single machine can evolve in step with libraries that span multiple machines. Since many of these libraries rely on hardware-based parallelism and use Python for composability and orchestration, the existing separation of concerns may make retrofitting a low-level API for distributed tasks easier. One such evolution is dask, which builds distributed data structures (scheduled locally or via dask.distributed) by composing the data structures defined by pandas or numpy; a short sketch follows this list.
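
To make the I/O-bound claim concrete, here is a minimal sketch that uses a thread pool to overlap network waits. The URLs and worker count are illustrative placeholders; the point is that CPython releases the GIL while each thread blocks on a socket read, so the fetches proceed concurrently despite the GIL.

```python
# A minimal sketch of I/O-bound concurrency under the GIL.
# The URLs and worker count here are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = [
    "https://example.com/",
    "https://example.org/",
    "https://example.net/",
]

def fetch(url: str) -> int:
    # urlopen blocks on network I/O; CPython releases the GIL while waiting,
    # so the other threads keep running during the wait.
    with urlopen(url, timeout=10) as response:
        return len(response.read())

with ThreadPoolExecutor(max_workers=len(URLS)) as pool:
    for url, size in zip(URLS, pool.map(fetch, URLS)):
        print(f"{url}: {size} bytes")
```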
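
The C FFI point is easiest to see with ctypes. The sketch below assumes a Unix-like system where the math library (libm) can be located, and declares the exact C signature of cos so that both sides of the boundary agree on the byte-level representation of a double.

```python
# A minimal sketch of calling a shared library through ctypes with explicit
# C types. Assumes a Unix-like system where libm can be located.
import ctypes
import ctypes.util
import math

libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature explicitly: double cos(double).
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

# Python and C operate on the same IEEE 754 double here, so there is no
# implicit narrowing or widening at the language boundary.
print(libm.cos(math.pi))  # -1.0
```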
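
And as a sketch of the dask point: the same pandas-style operations run per-partition whether the scheduler is local or a dask.distributed cluster. The file pattern and column names below are illustrative.

```python
# A minimal sketch of composing pandas through dask. Each partition of the
# dask DataFrame is an ordinary pandas DataFrame under the hood.
import dask.dataframe as dd

df = dd.read_csv("events-*.csv")                 # illustrative file pattern
daily_totals = df.groupby("day")["value"].sum()  # illustrative column names

# Nothing has executed yet; compute() materializes the result on the local
# scheduler by default, or on a dask.distributed cluster if one is configured.
print(daily_totals.compute())
```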

Python may bring a number of downsides when it comes to creating data-intensive systems:

  • Pipelining systems may cause latency amplification: If data infrastructure architectures follow data-driven programming best practices, and multiple applications come together to compose different data pipelines, there may be unavoidable propagation of control flow to Python user code if library components communicate with each other using Python APIs. Some amount of latency amplification may occur as a result, as CPU-bound Python code is generally orders of magnitude slower than e.g. CPU-bound C code.

  • Python's dynamic type system: Python is a dynamically typed language, where type checking occurs at runtime. This contrasts with static typing, where type checking occurs at compile time. While dynamic typing may boost developer productivity, it comes at the cost of latent technical debt when scaling codebases, in addition to typing errors only being caught at runtime. Type systems are sometimes the subject of religious wars, as noted in this xkcd forum thread. However, when prioritizing robustness and reliability, static typing provides more tangible advantages than dynamic typing.

    This cost may be mitigated in Python by combining Python 3.5+'s native type annotations with a type checker like mypy or pyre-check (see the sketch after this list). Ultimately, though, bolting a type checker onto a dynamically typed language does not make it as computationally efficient for data-intensive systems as a statically typed language, and it may increase developer cognitive load and organizational process overhead. The cost of having a static type system may also decrease with time, as newer languages like Rust emulate the terser, inference-friendly Haskell-like style of static typing rather than the verbose Java-like style (Paul Chiusano describes his conversion back to static typing in this blog post).
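
As a small sketch of that mitigation: the annotations below are standard Python 3.5+ syntax, and a checker such as mypy can reject the bad call before the code ever runs. The function and its arguments are illustrative.

```python
# A minimal sketch of native type annotations checked ahead of runtime
# with a tool like mypy (e.g. `mypy example.py`).
from typing import Sequence

def mean(values: Sequence[float]) -> float:
    if not values:
        raise ValueError("mean of an empty sequence")
    return sum(values) / len(values)

print(mean([1.0, 2.0, 3.0]))  # accepted by the checker and correct at runtime

# mean("oops")  # rejected by mypy as an incompatible argument type;
#               # without a checker this would only surface as a runtime TypeError
```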

Common Data Engineering Patterns

MapReduce

MapReduce is a way of processing large quantities of data on a distributed system. The seminal paper on MapReduce describes the operation as a two-step process: a "map" step that generates key/value pairs from input data, and a "reduce" step that merges all values sharing the same key.

The primary advantage of MapReduce is its simplicity of use. There are only two operations to supply, map and reduce. They are not native methods of the framework, but rather mapper and reducer source files loaded onto it; the framework expects the results of executing those files to behave like native methods, and handles the cases where they don't. In this sense, map and reduce behave like user-defined functions, while the framework handles data sharding/indexing, fault tolerance, and parallelizing code execution.
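
A single-process sketch of that division of labor, using word counting as the canonical example: the mapper and reducer are plain user-defined functions, while the grouping step stands in for the shuffle/sort, sharding, and fault tolerance a real framework would provide.

```python
# A minimal, single-process sketch of the MapReduce pattern. A real framework
# would shard the input, retry failures, and run these steps across machines.
from collections import defaultdict
from typing import Iterable, Iterator, List, Tuple

def mapper(line: str) -> Iterator[Tuple[str, int]]:
    # Emit a key/value pair per word.
    for word in line.split():
        yield word.lower(), 1

def reducer(key: str, values: Iterable[int]) -> Tuple[str, int]:
    # Merge all values that share a key.
    return key, sum(values)

def map_reduce(lines: Iterable[str]) -> List[Tuple[str, int]]:
    grouped = defaultdict(list)  # stands in for the shuffle/sort step
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return [reducer(key, values) for key, values in grouped.items()]

print(map_reduce(["the quick brown fox", "the lazy dog jumps over the fox"]))
```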

Lambda Architectures

Lambda architectures combine both batch and stream processing paradigms to deliver query results that are both up-to-date and accurate. The batch processing pipeline focuses on maximizing correctness, while the stream processing pipeline focuses on minimizing latency. The serving layer that handles queries joins the two services together to present a cohesive result to the end user.

Most lambda service components, such as Apache Hadoop, Apache Spark, and Apache Druid, are written in JVM-based languages like Java or Scala. JVM-based services can be connected to Python in a variety of ways. For example, pyspark builds on py4j, a library for invoking Java methods and classes from Python, to create a native Python API for Apache Spark. Other services, like Apache Kafka, are reachable from Python libraries like Faust that speak a language-agnostic wire protocol.
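
A minimal sketch of what the py4j-backed pyspark API looks like from the Python side, assuming pyspark is installed and a local Spark runtime is available; the input path and column names are illustrative.

```python
# A minimal sketch of pyspark's Python API over the JVM-based Spark engine
# (bridged via py4j). The input path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-layer-sketch").getOrCreate()

events = spark.read.json("events.json")  # raw events for the batch layer
totals = events.groupBy("user_id").agg(F.count("*").alias("event_count"))
totals.show()

spark.stop()
```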

For less data-intensive tasks, like job orchestration, Python remains a competitive language to implement frameworks in. For example, Apache Airflow is a job orchestration framework written almost entirely in Python; orchestration is less data-intensive because CPU time is spent coordinating tasks and statuses rather than processing the data itself.
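
A minimal sketch of an Airflow DAG illustrates the point: the Python code coordinates tasks rather than moving data itself. The import paths follow the Airflow 2.x layout and may differ across versions, and the task callables are placeholders.

```python
# A minimal sketch of a two-task Airflow DAG; the callables are placeholders
# and the import paths follow the Airflow 2.x layout.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")

def load():
    print("hand the extracted data to the warehouse loader")

with DAG(
    dag_id="orchestration_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # coordination, not data processing
```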

Conclusion

While Python does serve a useful purpose as a last-mile client, its performance characteristics (e.g. latency numbers) and language properties (e.g. dynamic typing) make it less suited to building native data-intensive frameworks. In addition, building data-intensive frameworks is less about any individual language and more about overall systems design competencies. Some fundamental questions that hold across any service implementation include the following:

  • What kind of error model does the service implement: Many-to-many language-specific error mappings rapidly cause confusion as to exactly which part of a service is failing. A clearly defined error model, with well-specified error propagation and exit codes, helps a service fail more predictably and assists debugging/patching in production, but it also increases implementation costs, since branching and state mutation exponentially increase the number of error conditions to handle.

  • How does the service interact with outside agents: One example is how it handles process signals like SIGTERM, SIGINT, or SIGKILL. This is an important consideration with Python because a single statement may compile into multiple bytecode instructions, as discussed in the blog post on complecting identity and state, and the Python virtual machine only promises to act on a signal between bytecode instructions, not between statements. Hence, handling interrupts means anticipating race conditions at every point during code execution (see the signal-handling sketch after this list).

  • How does the service interact with other services: The service may be connected as part of a pipeline, in which case all of its outputs must be treated as inputs to another service, or it may be wrapped in a security sandbox in order to execute untrusted code and/or operate on untrusted data, in which case it is expected to fail in a manner the sandbox can handle.

  • What assumptions does the service make, and are those assumptions accurate in all environments: Python as a language assumes only that a single CPU core and some data store (RAM, disk) are available to execute Python code. Additional performance can be gained by making more assumptions about the environment, or by changing existing ones. For example, dropping the assumption that CPU is always faster than I/O, while being able to assume I/O availability, leads to very different systems designs. This Hacker News thread covers some of these tradeoffs, which are also mentioned in one of Dan Luu's blog posts.

  • Does this service clearly respect separation of concerns: Services that focus on doing one thing well are much more likely to reach a stable state where all possible ways of doing that thing well have been covered. This results in predictable behavior, greater test coverage over time, and increased code reuse.
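
As a sketch of the signal-handling point above: the handler only flips a flag, and the worker loop checks it between units of work, because the interpreter may deliver the signal between bytecode instructions, i.e. in the middle of what reads as a single statement. SIGKILL cannot be caught at all, so any state that must survive an abrupt exit has to be durable.

```python
# A minimal sketch of cooperative SIGTERM/SIGINT handling. The "unit of work"
# here is a placeholder sleep; a real service would do a bounded amount of
# work per iteration so shutdown latency stays predictable.
import signal
import time

shutting_down = False

def request_shutdown(signum, frame):
    # Do as little as possible inside the handler; just record the request.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, request_shutdown)
signal.signal(signal.SIGINT, request_shutdown)

while not shutting_down:
    time.sleep(1)  # one bounded unit of work

print("flushed state, exiting cleanly")
```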

This tiny subset of concerns around data-intensive systems design lends itself to building out well-tested, stable libraries around which user-defined scripts and data can execute reliably. Python's role in this design may be relegated to:

  • Increasing system observability for non-technical stakeholders: This role is key to preserving engineering flexibility in implementation and to ensuring effective communication between engineering and the rest of the company when establishing roadmaps and requirements. Python's toolchain lends itself well to observable deliverables like GUIs and applications. Netflix's data operations team discusses scheduling Jupyter notebooks with tools like papermill in order to replace one-off scripts and templates (a short sketch follows at the end of this list).

  • Increasing asynchronicity between engineering teams: Forced synchronicity wastes time as implementation efforts bottleneck on one another. Python's ability to deliver more observability means less face-to-face communication is needed to keep track of progress, and next steps can be anticipated and prepared before the need for them arises.
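
As a sketch of the notebook-scheduling workflow mentioned above, papermill parameterizes and executes a notebook end to end; the notebook paths and parameter name below are illustrative placeholders.

```python
# A minimal sketch of executing a parameterized notebook with papermill.
# The paths and the parameter name are illustrative; the input notebook is
# assumed to have a cell tagged "parameters".
import papermill as pm

pm.execute_notebook(
    "daily_report.ipynb",
    "output/daily_report_2024-01-01.ipynb",
    parameters={"run_date": "2024-01-01"},
)
```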