Liftover

LiftOver converts genomic data between reference assemblies. The UCSC liftOver tool uses a chain file to perform simple coordinate conversion, for example on BED files. The Picard LiftOverVcf tool also uses the new reference assembly file to transform variant information (eg. alleles and INFO fields). Glow can be used to run coordinate liftOver and variant liftOver.

Create a liftOver cluster

For both coordinate and variant liftOver, you need a chain file on every node of the cluster. On a Databricks cluster, an example of a cluster-scoped init script you can use to download the required file for liftOver from the b37 to the hg38 reference assembly is as follows:

#!/usr/bin/env bash
set -ex
set -o pipefail
mkdir /opt/liftover
curl https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain --output /opt/liftover/b37ToHg38.over.chain

Tip

Chain files may represent chromosomes with the “chr” prefix or not, e.g. “chr1” or “1”. Use the Spark SQL function regexp_replace to transform your variant dataset to match the chain file. For example:

import pyspark.sql.functions as fx
#add 'chr' prefix
vcf_df = vcf_df.withColumn("contigName", fx.regexp_replace(fx.col('contigName'), '^', 'chr'))
#remove prefix
vcf_df = vcf_df.withColumn("contigName", fx.regexp_replace(fx.col('contigName'), 'chr', ''))

Coordinate liftOver

To perform liftOver for genomic coordinates, use the function lift_over_coordinates. lift_over_coordinates has the following parameters.

chromosome: string
start: long
end: long
chain file: string (constant value, such as one created with lit())
minimum fraction of bases that must remap: double (optional, defaults to .95)

The returned struct has the following values if liftOver succeeded. If not, the function returns null.

contigName: string
start: long
end: long

output_df = input_df.withColumn('lifted', glow.lift_over_coordinates('contigName', 'start',
  'end', chain_file, 0.99))

Variant liftOver

For genetic variant data, use the lift_over_variants transformer. In addition to performing liftOver for genetic coordinates, variant liftOver performs the following transformations:

Reverse-complement and left-align the variant if needed
Adjust the SNP, and correct allele-frequency-like INFO fields and the relevant genotypes if the reference and alternate alleles have been swapped in the new genome build

Pull a target assembly reference file down to every node in the Spark cluster in addition to a chain file before performing variant liftOver.

The lift_over_variants transformer operates on a DataFrame containing genetic variants and supports the following options:

Parameter	Default	Description
`chain_file`	n/a	The path of the chain file.
`reference_file`	n/a	The path of the target reference file.
`min_match_ratio`	.95	Minimum fraction of bases that must remap.

The output DataFrame’s schema consists of the input DataFrame’s schema with the following fields appended:

INFO_SwappedAlleles: boolean (null if liftOver failed, true if the reference and alternate alleles were swapped, false otherwise)
INFO_ReverseComplementedAlleles: boolean (null if liftover failed, true if the reference and alternate alleles were reverse complemented, false otherwise)
liftOverStatus: struct
- success: boolean (true if liftOver succeeded, false otherwise)
- errorMessage: string (null if liftOver succeeded, message describing reason for liftOver failure otherwise)

If liftOver succeeds, the output row contains the liftOver result and liftOverStatus.success is true. If liftOver fails, the output row contains the original input row, the additional INFO fields are null, liftOverStatus.success is false, and liftOverStatus.errorMessage contains the reason liftOver failed.

output_df = glow.transform('lift_over_variants', input_df, chain_file=chain_file, reference_file=reference_file)

Liftover notebook

How to run a notebook Get notebook link