Getting Started

Running Locally

Glow requires Apache Spark 2.4.3 (or a later version of Spark 2.4.x that is built on Scala 2.11).

If you don’t have a local Apache Spark installation, you can install it from PyPI:

pip install pyspark==2.4.3

or download a specific distribution.
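Whichever route you take, it can help to confirm that the installed Spark release falls within the supported range before continuing. The following is a minimal sketch, not part of Glow: the helper name `is_supported_spark` is hypothetical, and it checks only the version string reported by PySpark. It cannot tell whether the distribution was built on Scala 2.11.

```python
import re


def is_supported_spark(version):
    """Return True for Spark 2.4.x releases at or above 2.4.3.

    Hypothetical helper: it inspects the version string only and does not
    verify which Scala version the distribution was built on.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return False
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor) == (2, 4) and patch >= 3


if __name__ == "__main__":
    try:
        import pyspark
    except ImportError:
        print("pyspark is not installed")
    else:
        supported = is_supported_spark(pyspark.__version__)
        status = "supported" if supported else "not supported"
        print("Spark", pyspark.__version__, "is", status)
```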

Install the Python frontend from pip:

pip install glow.py

and then start the Spark shell with the Glow maven package:

./bin/pyspark --packages io.projectglow:glow_2.11:0.3.0 \
  --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

To start a Jupyter notebook instead of a shell:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark --packages io.projectglow:glow_2.11:0.3.0 \
  --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

And now your notebook is glowing! To access the Glow functions, you need to register them with the Spark session.

import glow
glow.register(spark)
df = spark.read.format('vcf').load(path)  # path: location of your VCF file

To use Glow from Scala instead, you need a local Apache Spark installation. If you don’t have one, download a specific distribution.

Start the Spark shell with the Glow maven package:

./bin/spark-shell --packages io.projectglow:glow_2.11:0.3.0 \
  --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

To access the Glow functions, you need to register them with the Spark session.

import io.projectglow.Glow
Glow.register(spark)
val df = spark.read.format("vcf").load(path)  // path: location of your VCF file

Running in the cloud

The easiest way to use Glow in the cloud is with the Databricks Runtime for Genomics. However, Glow works with any cloud provider or Spark distribution. You need to install the maven package io.projectglow:glow_2.11:${version} and, optionally, the Python frontend glow.py. Also set the Spark configuration spark.hadoop.io.compression.codecs to io.projectglow.sql.util.BGZFCodec in order to read and write BGZF-compressed files.
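On platforms where you create the Spark session yourself, the two settings above can also be supplied programmatically. The sketch below assumes that approach suits your environment. `glow_spark_conf` and `build_session` are hypothetical helper names, and `spark.jars.packages` is the standard Spark property corresponding to the `--packages` flag. Managed services often attach libraries through their own cluster settings instead, and the package setting generally takes effect only if no Spark session is already running.

```python
CODEC_KEY = "spark.hadoop.io.compression.codecs"
CODEC_CLASS = "io.projectglow.sql.util.BGZFCodec"


def glow_spark_conf(version, scala_version="2.11"):
    """Return the Spark settings that pull in Glow and enable BGZF.

    Hypothetical helper: `version` is the Glow release you want, e.g. "0.3.0".
    """
    coordinate = "io.projectglow:glow_{}:{}".format(scala_version, version)
    return {
        "spark.jars.packages": coordinate,  # same effect as --packages
        CODEC_KEY: CODEC_CLASS,             # read and write BGZF-compressed files
    }


def build_session(version):
    """Create (or reuse) a Spark session configured for Glow."""
    # Imported lazily so that glow_spark_conf works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in glow_spark_conf(version).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

After `spark = build_session("0.3.0")`, register the Glow functions with `glow.register(spark)` as in the local setup.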

Notebooks embedded in the docs

To demonstrate example use cases of Glow functionality, most doc pages are accompanied by embedded Databricks Notebooks. Most of the code in these notebooks can be run on Spark and Glow alone, but a few functions such as display() or dbutils() are only available on Databricks. See Running Databricks notebooks for more information.
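If you run a notebook outside Databricks, one possible workaround for display() is a small stand-in that prints the DataFrame with the standard `DataFrame.show` method. This is an assumption about what is good enough for following along, not a reproduction of the Databricks rendering, and it does not cover dbutils. Define it only outside Databricks, since it replaces any existing `display` name in the session.

```python
def display(df, num_rows=20):
    """Plain-Spark stand-in for the Databricks display() helper.

    Prints the first `num_rows` rows as text without truncating long values.
    Intended for use outside Databricks only.
    """
    df.show(num_rows, truncate=False)
```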

Also note that the paths to the example datasets used in these notebooks usually point to a folder in /databricks-datasets/genomics/ and should be replaced with the appropriate paths for your own folder structure.
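One way to handle this replacement consistently is to rewrite the prefix in a single place. The sketch below is illustrative only: `localize_path` is a hypothetical helper, and the directory and file names in the usage note are placeholders rather than real dataset locations.

```python
import posixpath

DATABRICKS_PREFIX = "/databricks-datasets/genomics/"


def localize_path(path, local_root):
    """Map a path under /databricks-datasets/genomics/ onto local_root.

    Paths outside that folder are returned unchanged.
    """
    if not path.startswith(DATABRICKS_PREFIX):
        return path
    relative = path[len(DATABRICKS_PREFIX):]
    return posixpath.join(local_root, relative)
```

For example, `localize_path("/databricks-datasets/genomics/sample/example.vcf", "/data/genomics")` yields `/data/genomics/sample/example.vcf`, which can then be passed to `spark.read.format('vcf').load(...)`.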

Demo notebook

This notebook showcases some of the key functionality of Glow, such as reading in a genomic dataset, saving it in Delta Lake format, and performing a genome-wide association study.