Getting Started

Running Locally

Glow requires Apache Spark 3.1.2.

If you don’t have a local Apache Spark installation, you can install it from PyPI:

pip install pyspark==3.1.2

or download a specific distribution.

Install the Python frontend from pip:

pip install glow.py

and then start the PySpark shell with the Glow Maven package:

./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.1.0 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

To start a Jupyter notebook instead of a shell:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.1.0 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec
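
If you prefer to configure everything from Python instead of passing flags to pyspark, you can build the Spark session yourself. The following is only a sketch: the spark.jars.packages and spark.hadoop.io.compression.codecs settings mirror the command-line flags above, and the application name is arbitrary.

from pyspark.sql import SparkSession

# Equivalent to the --packages and --conf flags above (sketch; adjust to your environment)
spark = (
    SparkSession.builder
    .appName("glow-local")  # arbitrary name
    .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:1.1.0")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)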

And now your notebook is glowing! To access the Glow functions, you need to register them with the Spark session.

import glow
spark = glow.register(spark)
df = spark.read.format('vcf').load(path)
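
Once registered, Glow's functions can be used in DataFrame expressions or Spark SQL. The snippet below is a minimal sketch that builds on the DataFrame above; it assumes Glow's genotype_states function and the column names of Glow's VCF schema (contigName, start, genotypes).

from pyspark.sql.functions import expr

# Convert each sample's genotype calls into numeric states (sketch; follows Glow's VCF schema)
df.select('contigName', 'start', expr('genotype_states(genotypes)').alias('states')).show()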

Notebooks embedded in the docs

To demonstrate use cases of Glow, documentation pages are accompanied by embedded notebooks. Most code in these notebooks can be run on Spark and Glow alone, but functions such as display() or dbutils() are only available on Databricks. See Databricks notebooks for more info.

Also note that the paths to the datasets used as examples in these notebooks usually point to folders under /databricks-datasets/genomics/ and should be replaced with the appropriate paths based on your own folder structure.

Getting started on Databricks

The Databricks documentation shows how to get started with Glow on:

  • Amazon Web Services (AWS - docs)

  • Microsoft Azure (docs)

  • Google Cloud Platform (GCP - docs)

Getting started on other cloud services

Please submit a pull request to add a guide for other cloud services.