Spark as a Workflow Orchestrator to Parallelize Command-Line Bioinformatics Tools
You can use Spark as a workflow orchestrator to manage running a bioinformatics tool across a set of samples or regions of the genome. Orchestration in this context means that each row of a Spark DataFrame defines a task: an individual sample or genomic region, together with the parameters for the bioinformatics tool to run on it. These rows are distributed to the available cores of the cluster, and each task runs the tool in parallel from the command line.
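A minimal sketch of this pattern follows, assuming a PySpark session and a command-line tool already installed on every worker. The tool name (plink), the local paths, and the task parameters are hypothetical placeholders; substitute your own tool and arguments.

```python
import subprocess

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Each row is one task: a sample (or genomic region) plus tool parameters.
tasks = spark.createDataFrame(
    [("sample_01", "1", "--maf 0.01"),
     ("sample_02", "2", "--maf 0.05")],
    ["sample_id", "chrom", "extra_args"],
)

@pandas_udf(StringType())
def run_tool(sample_id: pd.Series, chrom: pd.Series, extra_args: pd.Series) -> pd.Series:
    statuses = []
    for s, c, a in zip(sample_id, chrom, extra_args):
        # Shell out to the command-line tool on whichever worker core this row landed on.
        cmd = f"plink --bfile /local/data/{s} --chr {c} {a} --out /local/out/{s}_chr{c}"
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        statuses.append("OK" if proc.returncode == 0 else proc.stderr[-200:])
    return pd.Series(statuses)

# Rows are distributed across the cluster's cores and the tasks run in parallel.
tasks.withColumn("status", run_tool("sample_id", "chrom", "extra_args")).show(truncate=False)
```

Each batch of rows is handled on a single core, and the tool's exit status (or the tail of its stderr) comes back as a DataFrame column, which keeps failures visible alongside the task that produced them.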
This approach is efficient and flexible, and is similar to submitting jobs to a high performance compute (HPC) cluster or applying multithreading across a single node.
Tip
Spark offers these advantages over multithreading / multiprocessing:
- Scale
Instead of using one large node with multithreading, you can use many nodes and choose the virtual machine type with the best price/performance
- Efficiency
Most bioinformatics tools are single-threaded, using only one core of a node
Tools that do use multithreading often do not fully utilize CPU resources
- Observability
Multithreading or multiprocessing code is difficult to maintain and debug compared to Spark, where errors are captured in the Spark worker logs
When to use the workflow orchestrator vs. the pipe transformer architecture
The pipe transformer is designed for embarrassingly parallel processing of a large dataset, where each row is processed in an identical way.
The pipe transformer supports VCF, txt, and csv formats, but does not support bioinformatics tools that depend on Plink, BGEN, or other specialized file formats.
Furthermore, the pipe transformer is not designed for parallel processing of distinct samples or regions of the genome, nor can it apply different parameters to those samples or regions.
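For contrast, a pipe transformer call streams every row of the input DataFrame through one identical command. The sketch below is based on Glow's pipe transformer; the VCF path is a placeholder, and the exact option names should be checked against the pipe transformer documentation.

```python
import json

import glow
from pyspark.sql import SparkSession

spark = glow.register(SparkSession.builder.getOrCreate())

# Placeholder path to a VCF accessible from the cluster.
vcf_df = spark.read.format("vcf").load("/path/to/genotypes.vcf.gz")

# Every row is piped through the same command with the same parameters;
# there is no per-sample or per-region parameterization.
piped_df = glow.transform(
    "pipe",
    vcf_df,
    cmd=json.dumps(["grep", "-v", "#INFO"]),
    input_formatter="vcf",
    in_vcf_header="infer",
    output_formatter="vcf",
)
```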
Important
- data must be accessible locally on each Spark worker using one of these approaches:
data is downloaded to each node of the cluster at start-up via an initialization script
cloud storage is mounted on the local filesystem using an open source tool such as goofys
Databricks’ local file APIs automatically mount cloud object storage on the local filesystem
- bioinformatics tools must be installed on each node of the cluster via an initialization script or a Glow Docker Container (see the check sketched after this list)
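Before launching tasks, it can be worth confirming that every worker satisfies both requirements. The sketch below runs a lightweight check on each partition; the tool name (bcftools) and the local data path are placeholders for whatever your cluster actually stages.

```python
import os
import shutil
import socket

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def check_partition(_):
    # Report the hostname, whether the tool is on PATH, and whether the
    # locally staged data directory exists on this worker.
    yield (socket.gethostname(),
           shutil.which("bcftools") is not None,
           os.path.isdir("/local/data"))

# Run one lightweight check per partition and collect the distinct results.
report = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
            .mapPartitions(check_partition)
            .distinct()
            .collect())

for host, tool_ok, data_ok in report:
    print(f"{host}: tool installed={tool_ok}, local data present={data_ok}")
```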