Bytes by Ying

Concurrency with Python: Hardware-Based Parallelism


Overview

If there is one concurrency model that makes Python one of the dominant programming languages of today, it’s hardware-based parallelism. Python’s C/C++ API, backed by an extensive integration tutorial, transforms Python from a general-purpose scripting language into a data orchestration language. This, combined with the superlinearly increasing value-proposition gap between companies that sanctify data and those that do not, makes Python and its ecosystem very much worth investing in.

Python excels with hardware-based parallelism for a number of reasons:

There are a number of serious downsides to applying hardware-based parallelism as a concurrency model, which should be carefully considered before deploying hardware-accelerated Python to production:

Forms of Hardware

SIMD/MIMD

Single instruction, multiple data (SIMD) is a hardware paradigm in which a single CPU instruction operates on multiple chunks of data in parallel. Some builds of numpy apply SIMD to numpy data structures through their compiled, vectorized ufunc loops (note that numpy.vectorize() is only a convenience wrapper around a Python-level loop and does not itself use SIMD). Hardware support comes via Intel’s SSE and AVX extensions, and Intel’s special compilation of Python.
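A minimal sketch of the distinction, using numpy’s standard API (whether the fast path actually dispatches to SIMD instructions depends on how your numpy build was compiled):

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# Vectorized ufunc expression: runs in numpy's compiled C loops, which
# SIMD-capable builds (e.g., with SSE/AVX support) can dispatch to
# vector instructions.
c_fast = a * b + 1.0

# numpy.vectorize(): convenience only -- it invokes the Python function
# once per element, so neither SIMD nor C-level looping applies.
mul_add = np.vectorize(lambda x, y: x * y + 1.0)
c_slow = mul_add(a, b)

assert np.allclose(c_fast, c_slow)
```

Both produce the same result; only the first form can benefit from hardware vectorization.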

SIMD can be further composed in hardware architectures to become multiple instructions, multiple data (MIMD), where multiple processors each execute their own SIMD instruction streams.
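MIMD can be loosely sketched in Python by combining process-level parallelism (independent instruction streams) with numpy’s vectorized operations (SIMD-eligible data parallelism within each process). This is an illustrative analogy of my own using the standard library’s multiprocessing, not a hardware-level MIMD implementation:

```python
import numpy as np
from multiprocessing import Pool

def chunk_sum_of_squares(chunk: np.ndarray) -> float:
    # Within each worker process, this vectorized expression can use
    # SIMD lanes (depending on the numpy build) -- data parallelism.
    return float(np.sum(chunk * chunk))

def parallel_sum_of_squares(data: np.ndarray, workers: int = 4) -> float:
    # Each worker process runs its own instruction stream on its own
    # chunk of data -- the "multiple instructions, multiple data" part.
    chunks = np.array_split(data, workers)
    with Pool(workers) as pool:
        return sum(pool.map(chunk_sum_of_squares, chunks))

if __name__ == "__main__":
    data = np.arange(1_000, dtype=np.float64)
    print(parallel_sum_of_squares(data))
```

Each process is free to execute different code on different data, which is what distinguishes MIMD from pure SIMD lockstep.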

SIMD enforces lockstep: each chunk of data must be processed at the same time as the other chunks. Strictly speaking, this means SIMD is parallel but not concurrent; asynchronous execution is not possible within a SIMD unit.
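One practical consequence of lockstep is that data-dependent branches cannot diverge per element; vectorized code instead evaluates both sides and selects results with a mask. A small sketch of this pattern using numpy’s np.where (my own illustrative example):

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# A scalar loop could branch per element ("if x < 0: ...").
# In lockstep-style vectorized code, both branches are computed for
# every element, and a boolean mask selects which result survives.
negated = -x          # "then" branch, computed for all elements
halved = x / 2.0      # "else" branch, also computed for all elements
result = np.where(x < 0, negated, halved)

print(result)  # negatives are flipped, non-negatives are halved
```

This is the same masked-execution strategy SIMD hardware uses to keep all lanes in step through a branch.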

SIMT/GPGPUs

Single instruction, multiple threads (SIMT) is an execution model in which a single instruction is broadcast to many threads running on multiple discrete processing cores, as opposed to SIMD lanes within a single processor. GPGPUs leverage SIMT to process large quantities of data with many simple processors.

One interesting aspect of GPU architectures is that the instruction set architecture varies significantly between generations of GPUs, as noted in this NVIDIA DevTalk comment thread. This contrasts heavily with CPUs, which strongly prioritize binary-code compatibility and ISA stability. Hence, software targeting GPUs leverages a high-level library such as CUDA that abstracts away targeting multiple GPU generations. This may imply that software compiled for GPU targets particularly benefits from concolic testing. angr is one Python-based concolic testing framework, while GKLEE is a concolic testing tool targeted towards GPUs.

ASICs

Application-specific integrated circuits (ASICs) optimize for a specific workload, granting benefits like performance speedups or cost reductions compared to other hardware-acceleration targets, at the cost of vendor lock-in. Major ASIC developments, such as TPUs, have evolved from GPU-based workloads where the additional performance cleared benefit/cost thresholds.

Conclusion

Python’s ability to leverage hardware-based parallelism, and the resultant development of highly performant and composable libraries, is a big reason behind its relevance today and why user adoption is “sticky”. However, that same reliance on hardware makes deploying and reproducing the benefits gained from hardware-based parallelism difficult. This problem already presents itself as a realized business opportunity: for example, h2o.ai eases the process of Jupyter notebook deployment, removing the last-mile blockers between data science and business impact. As long as Python’s lower-level bindings exist, Python can be made performant.



Previous Post
Concurrency with Python: Data-Intensive Architectures
Next Post
Concurrency with Python: CSP and Coroutines