Getting Started

Running Locally

Glow requires Apache Spark 3.2.1.

If you don’t have a local Apache Spark installation, you can install it from PyPI:

pip install pyspark==3.2.1

or download a specific distribution.

Install the Python frontend from pip:

pip install glow.py

and then start the Spark shell with the Glow Maven package:

./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.2.1 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

To start a Jupyter notebook instead of a shell:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.2.1 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

And now your notebook is glowing! To access the Glow functions, you need to register them with the Spark session.

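import glow
# `spark` is the session already created by the pyspark shell or notebook
spark = glow.register(spark)
# `path` is a placeholder for the location of your VCF file
df = spark.read.format('vcf').load(path)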
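If you launch Python directly rather than through ./bin/pyspark, the same settings can probably be passed when the session is built. The following is a minimal sketch, not the project's official recipe: the helper names glow_spark_settings and create_glow_session and the application name are hypothetical, and it assumes that spark.jars.packages is honored as the configuration counterpart of --packages because no Spark JVM is running yet.

```python
GLOW_VERSION = "1.2.1"   # Glow release used on this page
SCALA_VERSION = "2.12"   # Scala suffix of the Maven artifact used on this page


def glow_spark_settings(glow_version=GLOW_VERSION, scala_version=SCALA_VERSION):
    """Return Spark settings equivalent to the command-line flags shown above."""
    return {
        # spark.jars.packages is the configuration counterpart of --packages
        "spark.jars.packages": f"io.projectglow:glow-spark3_{scala_version}:{glow_version}",
        # Same codec setting as the --conf flag in the shell command
        "spark.hadoop.io.compression.codecs": "io.projectglow.sql.util.BGZFCodec",
    }


def create_glow_session(app_name="glow-local"):
    """Build a SparkSession with the Glow package and register the Glow functions."""
    # Imported lazily so the settings helper can be used without pyspark installed.
    from pyspark.sql import SparkSession
    import glow

    builder = SparkSession.builder.appName(app_name)
    for key, value in glow_spark_settings().items():
        builder = builder.config(key, value)
    # glow.register returns the session with the Glow functions registered
    return glow.register(builder.getOrCreate())
```

Calling create_glow_session() requires pyspark and glow.py to be installed and downloads the Maven package on first use. The settings helper on its own has no dependencies.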

Getting started on Databricks

Databricks makes it simple to run Glow on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

To spin up a cluster with Glow, use the Databricks Glow Docker container to manage the environment. The container includes genomics libraries that complement Glow and can be installed via Databricks Container Services.

Here is how to set it up on the Databricks web application:

  1. Have your Databricks administrator enable container services via Settings -> Admin Console

_images/databricks_container_services_admin_console.png
  2. Go to Compute -> Create Cluster and configure the cluster as follows:

_images/glow_databricks_container_services_cluster_config.png

Important

Use the projectglow/databricks-glow:<tag> Docker Image URL, replacing <tag> with the latest version of Glow listed on the Project Glow Dockerhub page. Then match the version of Glow to a version of the Databricks Runtime that includes the same version of Spark. For example, Glow v1.2.1 and Databricks Runtime 10.4 Long Term Support (LTS) are both built on Spark 3.2.1. Use LTS runtimes where available; 10.4 LTS will be supported until Mar 18, 2024.

  3. Sync the Glow notebooks via Repos

    1. Fork the Glow GitHub repo.

    2. Clone your fork to your Databricks workspace using Repos (step-by-step guide).

    3. The notebooks are located under docs/source/_static.

_images/glow-repo-notebooks.png
  4. Create automated jobs

To build an automated Glow workflow in your Databricks workspace, follow these steps, which simulate data and then run the Glow GWAS tutorial:

  1. Configure the Databricks CLI, authenticating via a Databricks personal access token (docs).

  2. Create a directory in your Databricks workspace:

databricks workspace mkdirs /Repos/test
  3. Import source files from your fork of the Glow GitHub repository to this directory using repos:

databricks repos create --url https://github.com/<github_profile>/glow --provider gitHub --path /Repos/test/glow
  4. Switch to the branch of Glow that you are working on using repos:

databricks repos update --branch master --path /Repos/test/glow
  5. Create a workflow using jobs:

  • Azure GWAS tutorial

databricks jobs create --json-file docs/dev/glow-gwas-tutorial-azure.json
  • AWS GWAS tutorial

databricks jobs create --json-file docs/dev/glow-gwas-tutorial-aws.json
  6. Take the job ID that is returned and run the job:

databricks jobs run-now --job-id <job id>
  7. Go to the Databricks web application and view the output of the job:

_images/glow_gwas_tutorial_run.png
  5. Epilogue

The full set of notebooks in Glow undergoes nightly integration testing orchestrated by CircleCI (example output) using the latest version of the Glow Docker container on Databricks. CircleCI kicks off these notebooks from the Databricks command line interface (CLI) via a Python script, which contains the above steps. The workflow is defined in this configuration JSON template, and the output is shown below. You can adapt these as you build your own production jobs.

_images/glow_ci_pipeline.png
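The CLI steps listed earlier can be assembled in Python before they are handed to a runner. The following is a minimal sketch and not the script used by the Glow CI. The function names are hypothetical, the commands are copied from the steps above, and it assumes that databricks jobs create prints a JSON object containing a job_id field. When run as a script it only prints the commands and executes nothing.

```python
import json
import shlex


def glow_job_commands(github_profile, cloud="aws", branch="master",
                      repo_path="/Repos/test/glow"):
    """Build the Databricks CLI commands from the steps above, in order."""
    if cloud not in ("aws", "azure"):
        raise ValueError("cloud must be 'aws' or 'azure'")
    # The workspace directory that will contain the repo, e.g. /Repos/test
    parent_dir = repo_path.rsplit("/", 1)[0]
    return [
        ["databricks", "workspace", "mkdirs", parent_dir],
        ["databricks", "repos", "create",
         "--url", f"https://github.com/{github_profile}/glow",
         "--provider", "gitHub", "--path", repo_path],
        ["databricks", "repos", "update", "--branch", branch, "--path", repo_path],
        ["databricks", "jobs", "create",
         "--json-file", f"docs/dev/glow-gwas-tutorial-{cloud}.json"],
    ]


def run_now_command(create_output):
    """Build the run-now command from the text printed by `jobs create`."""
    # Assumption: the CLI prints a JSON object with a "job_id" field.
    job_id = json.loads(create_output)["job_id"]
    return ["databricks", "jobs", "run-now", "--job-id", str(job_id)]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in glow_job_commands("<github_profile>"):
        print(shlex.join(command))
```

To execute the commands, each list could be passed to subprocess.run(command, check=True, capture_output=True, text=True), feeding the captured output of the jobs create step into run_now_command.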

Important

These notebooks must be run in order!

As you build out your pipelines, consider the following points:

Important

  • Start small. Experiment on individual variants, samples or chromosomes.

  • Steps in your pipeline might require different cluster configurations.

Tip

  • Use compute-optimized virtual machines to read variant data from cloud object stores.

  • Use Delta Cache accelerated virtual machines to query variant data.

  • Use memory-optimized virtual machines for genetic association studies.

  • The Glow Pipe Transformer supports parallelization of deep learning tools that run on GPUs.
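As one way to follow the "start small" advice above, a variant DataFrame can be cut down to a single chromosome and a bounded number of rows before a full run. This is a minimal sketch: the helper name subset_for_testing and its default values are hypothetical, and it assumes df was loaded with Glow's VCF reader, whose schema stores the chromosome in the contigName column.

```python
def subset_for_testing(df, contig="22", max_rows=1000):
    """Keep variants from one chromosome and cap the number of rows."""
    # contigName holds the chromosome name in the VCF reader's schema.
    # The value is interpolated into a SQL expression, so pass trusted input only.
    return df.where(f"contigName = '{contig}'").limit(max_rows)
```

For example, small_df = subset_for_testing(df, contig="21", max_rows=100) gives a DataFrame that is cheap enough to iterate on with a small cluster.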

Getting started on other cloud services

Glow is packaged into a Docker container, based on an image from Data Mechanics, that can be run locally and that also includes connectors to Azure Data Lake, Google Cloud Storage, Amazon Web Services S3, Snowflake, and Delta Lake. This container can be installed using the projectglow/open-source-glow:<tag> Docker Image URL, replacing <tag> with the latest version of Glow.

This container can be used or adapted to run Glow outside of Databricks (source code). It was contributed by Edoardo Giacopuzzi (edoardo.giacopuzzi at fht.org) from Human Technopole.

Please submit a pull request to add guides for specific cloud services.

Notebooks embedded in the docs

Documentation pages are accompanied by embedded notebook examples. Most code in these notebooks can be run on Spark and Glow alone, but Databricks-specific utilities such as display() or dbutils are only available on Databricks. See Databricks notebooks for more info.
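When adapting a notebook to run outside Databricks, calls to display() can be routed through a small wrapper. This is a minimal sketch: the helper names on_databricks and show are hypothetical, and it assumes that Databricks Runtime sets the DATABRICKS_RUNTIME_VERSION environment variable and that the wrapper is defined in a notebook cell, where display() is available as a global.

```python
import os


def on_databricks():
    """Return True when the code appears to run on a Databricks Runtime cluster."""
    # Assumption: Databricks Runtime sets this environment variable.
    return "DATABRICKS_RUNTIME_VERSION" in os.environ


def show(df, n=20):
    """Render a DataFrame with display() on Databricks and df.show() elsewhere."""
    if on_databricks():
        display(df)  # noqa: F821 (global provided by Databricks notebooks)
    else:
        # Plain-text table from the standard Spark DataFrame API
        df.show(n)
```

Replacing display(df) with show(df) keeps a notebook usable on both platforms. Calls to dbutils have no direct open-source equivalent and need to be replaced case by case.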

These notebooks are located in the Glow GitHub repository here and are tested nightly end-to-end. They include notebooks that define constants, such as the number of samples to simulate and the output paths for each step in the pipeline. The notebooks that define constants are invoked with %run at the start of each notebook in the documentation. Please see Data Simulation to get started.