Bytes by Ying

Concurrency with Python: Data-Intensive Architectures


Overview

The field of data-intensive computing concerns systems designed around datasets much larger than a single machine’s disk or memory, which must therefore be persisted and processed across multiple machines. Ensuring that data-intensive tasks are correct, performant, and efficient, among other criteria, is a key priority when designing scalable systems. Achieving those goals means leveraging the concurrency primitives that the language(s) used make available to the developer.

Martin Kleppmann’s “Designing Data-Intensive Applications” is a much more thorough and general introduction to this field. I covered the book in a book review this year, which helped solidify some of the core principles behind building large-scale systems.

Python brings a number of strengths to data-intensive processing, but it also brings a number of downsides when it comes to creating data-intensive systems.

Common Data Engineering Patterns

MapReduce

MapReduce is a way of processing large quantities of data on a distributed system. The seminal paper on MapReduce describes the operation as a two-step process: a “map” step that generates key/value pairs from input data, and a “reduce” step that merges all values associated with the same key.

The primary advantage of MapReduce is its simplicity of usage. There are two primary methods, map and reduce. However, they are not native methods, but rather mapper and reducer source files that are loaded onto a framework. The framework merely expects that executing those files will behave like native methods, and handles the failure cases when they do not. In this sense, map and reduce behave more like user-defined functions, while the framework handles data sharding/indexing, fault tolerance, and parallelization of code execution.
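The division of labor can be sketched in plain Python. The mapper and reducer below are a hypothetical word-count pair, and `run_mapreduce` is a single-process stand-in for the shuffle phase that a real framework would distribute across machines:

```python
from collections import defaultdict
from typing import Iterable, Iterator

def mapper(line: str) -> Iterator[tuple[str, int]]:
    # Map step: emit a (key, value) pair for each word seen.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key: str, values: Iterable[int]) -> tuple[str, int]:
    # Reduce step: merge all values that share the same key.
    return (key, sum(values))

def run_mapreduce(lines: list[str]) -> dict[str, int]:
    # Stand-in for the framework's shuffle: group mapper output
    # by key, then apply the reducer once per key.
    groups: defaultdict[str, list[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())
```

Only the mapper and reducer are user-defined; everything in `run_mapreduce` is what the framework would provide, along with sharding and fault tolerance.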

Lambda Architectures

Lambda architectures combine batch and stream processing paradigms to deliver query results that are both up-to-date and accurate. The batch processing pipeline focuses on maximizing correctness, while the stream processing pipeline focuses on minimizing latency. The serving layer that handles queries merges results from the two pipelines to present a cohesive result to the end user.
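A toy serving layer illustrates the merge. The batch view and real-time view here are hypothetical in-memory dictionaries standing in for precomputed batch output and recent streaming increments not yet absorbed by the batch pipeline:

```python
def serve_query(key: str,
                batch_view: dict[str, int],
                realtime_view: dict[str, int]) -> int:
    # Merge the accurate-but-stale batch result with the
    # fresh-but-partial streaming result at query time.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```

When the next batch run completes, its output replaces `batch_view` and the corresponding entries are dropped from `realtime_view`, keeping the combined answer both fresh and eventually correct.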

Most lambda service components, such as Apache Hadoop, Apache Spark, and Apache Druid, are written in JVM-based languages like Java or Scala. JVM-based services can be connected to Python in a variety of ways. For example, pyspark builds on py4j, a library for invoking Java methods and classes from Python, to expose a native Python API for Apache Spark. Other services, like Apache Kafka, leverage Python libraries like Faust to communicate across a language-agnostic service layer.

For less data-intensive tasks, like job orchestration, Python remains a competitive language for implementing frameworks. For example, Apache Airflow is a job orchestration framework written almost entirely in Python; job orchestration is a less data-intensive task since CPU time is spent coordinating tasks and statuses rather than processing records.
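The coordination core of such a framework fits comfortably in pure Python. This is a minimal sketch of dependency-ordered task execution using the standard library's `graphlib`, not Airflow's actual API; the task graph and functions are illustrative:

```python
from graphlib import TopologicalSorter
from typing import Callable

def run_dag(dependencies: dict[str, set[str]],
            tasks: dict[str, Callable[[], None]]) -> list[str]:
    # dependencies maps each task name to the set of tasks it waits on.
    # Compute a valid execution order, then run each task in turn;
    # a real orchestrator would also track state, retries, and schedules.
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order  # the execution order, for inspection
```

The heavy lifting per task (a Spark job, a SQL query) happens elsewhere; the Python layer only decides what runs when, which is why the language's throughput limits matter less here.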

Conclusion

While Python does serve a useful purpose as a last-mile client, its performance characteristics (e.g. latency) and language properties (e.g. dynamic typing) lend themselves less toward building native data-intensive frameworks. In addition, building data-intensive frameworks is less about any individual language and more about overall systems design competencies. Some fundamental principles that hold across any service implementation include the following:

This tiny subset of concerns around data-intensive systems design lends itself to building out well-tested, stable libraries around which user-defined scripts and data can execute reliably. Python’s role in this design may be relegated to:

