Getting Started

Running Locally

Glow requires Apache Spark 3.2.0.

If you don’t have a local Apache Spark installation, you can install it from PyPI:

pip install pyspark==3.2.0

or download a specific distribution.

Install the Python frontend from PyPI:

pip install glow.py

and then start the PySpark shell with the Glow Maven package:

./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.1.2 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

To start a Jupyter notebook instead of a shell:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.1.2 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec
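
If you launch Python or Jupyter directly rather than through ./bin/pyspark, you can set the same package and codec configuration when building the Spark session instead. A minimal sketch, assuming the same Glow version as in the commands above:

from pyspark.sql import SparkSession

# Pull in the Glow Maven package and register the BGZF codec,
# mirroring the --packages and --conf flags used above
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:1.1.2")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)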

And now your notebook is glowing! To access the Glow functions, you need to register them with the Spark session.

import glow
spark = glow.register(spark)
df = spark.read.format('vcf').load(path)
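
Once registered, Glow's functions and transformers can be called through the session. For example, a minimal sketch of applying the split_multiallelics transformer to the DataFrame loaded above (assuming path points to a VCF file):

# Split multiallelic variants into biallelic rows with a Glow transformer
split_df = glow.transform('split_multiallelics', df)
split_df.printSchema()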

Getting started on Databricks

The Databricks documentation shows how to get started with Glow on:

  • Amazon Web Services (AWS - docs)

  • Microsoft Azure (docs)

  • Google Cloud Platform (GCP - docs)

We recommend managing the environment with the Databricks Glow Docker container, which includes genomics libraries that complement Glow. This container can be installed via Databricks Container Services using the Docker image URL projectglow/databricks-glow:<tag>, replacing <tag> with the latest version of Glow.

Getting started on other cloud services

Glow is also packaged into a Docker container based on a Data Mechanics image. The container can be run locally and includes connectors to Azure Data Lake, Google Cloud Storage, Amazon Web Services S3, Snowflake, and Delta Lake. It can be installed using the Docker image URL projectglow/open-source-glow:<tag>, replacing <tag> with the latest version of Glow.

This container can be used or adapted to run Glow outside of Databricks (source code). It was contributed by Edoardo Giacopuzzi (edoardo.giacopuzzi at fht.org) of Human Technopole.
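
For example, a sketch of pulling the image and opening a shell in it locally (assuming the image provides a bash entrypoint; replace <tag> as above):

docker pull projectglow/open-source-glow:<tag>
docker run -it projectglow/open-source-glow:<tag> bash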

Please submit a pull request to add guides for specific cloud services.

Notebooks embedded in the docs

Documentation pages are accompanied by embedded notebook examples. Most code in these notebooks can be run on Spark and Glow alone, but functions such as display() or dbutils() are only available on Databricks. See Databricks notebooks for more info.

These notebooks are located in the Glow GitHub repository here and are tested end-to-end nightly. Some notebooks define constants, such as the number of samples to simulate and the output path for each step in the pipeline; these constants notebooks are %run at the start of each notebook in the documentation. Please see Data Simulation to get started.