LiftOver converts genomic data between reference assemblies. The UCSC liftOver tool uses a chain file to perform simple coordinate conversion, for example on BED files. The Picard LiftOverVcf tool also uses the new reference assembly file to transform variant information (e.g., alleles and INFO fields). Glow can be used to run both coordinate liftOver and variant liftOver.
Create a liftOver cluster
For both coordinate and variant liftOver, you need a chain file on every node of the cluster. On a Databricks cluster, you can use a cluster-scoped init script to download the required file. The following example downloads the chain file for liftOver from the b37 to the hg38 reference assembly:
#!/usr/bin/env bash
set -ex
set -o pipefail
mkdir /opt/liftover
curl https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain --output /opt/liftover/b37ToHg38.over.chain
To perform liftOver for genomic coordinates, use the function lift_over_coordinates. As in the example below, it takes the contig name, start, and end columns of the interval, followed by these parameters:
chain file: string (constant value)
minimum fraction of bases that must remap: double (optional)
If liftOver succeeds, the function returns a struct holding the lifted coordinates.
output_df = input_df.withColumn('lifted', glow.lift_over_coordinates('contigName', 'start', 'end', chain_file, 0.99))
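To make the role of the minimum remap fraction concrete, here is a minimal pure-Python sketch of coordinate liftOver over a toy chain. It is an illustration only, not Glow's or UCSC's implementation: the `Block` type, the `lift_interval` helper, the `TOY_CHAIN` coordinates, and the choice to return `None` on failure are all hypothetical, and real chain files carry strand and gap information that this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical ungapped block: source [src_start, src_end) maps to dst_start onward."""
    src_start: int
    src_end: int
    dst_start: int


# Toy chain (invented coordinates): two aligned blocks with a 10-base gap in the source.
TOY_CHAIN = [Block(100, 200, 1100), Block(210, 300, 1200)]


def lift_interval(start: int, end: int, blocks: List[Block],
                  min_match_ratio: float) -> Optional[Tuple[int, int]]:
    """Lift the half-open interval [start, end).

    Returns the lifted (start, end), or None when the fraction of bases
    covered by aligned blocks is below min_match_ratio.
    """
    if end <= start:
        raise ValueError("interval must have positive length")
    mapped = 0
    new_start = None
    new_end = None
    # Assumes blocks do not overlap and appear in the same order on both assemblies.
    for block in sorted(blocks, key=lambda b: b.src_start):
        lo = max(start, block.src_start)
        hi = min(end, block.src_end)
        if lo >= hi:
            continue  # this block does not overlap the interval
        mapped += hi - lo
        offset = block.dst_start - block.src_start
        if new_start is None:
            new_start = lo + offset  # first overlapping base sets the new start
        new_end = hi + offset        # last overlapping base sets the new end
    if new_start is None or mapped / (end - start) < min_match_ratio:
        return None
    return new_start, new_end


if __name__ == "__main__":
    print(lift_interval(120, 180, TOY_CHAIN, 0.95))  # fully inside one block
    print(lift_interval(150, 250, TOY_CHAIN, 0.95))  # spans the gap: 90% of bases remap
    print(lift_interval(150, 250, TOY_CHAIN, 0.85))  # same interval, looser threshold
```

In this toy model, raising the threshold toward 1.0 rejects any interval that touches an unaligned gap, while lowering it accepts intervals that only partially remap.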
For genetic variant data, use the lift_over_variants transformer. In addition to performing liftOver for genetic coordinates, variant liftOver performs the following transformations:
Reverse-complement and left-align the variant if needed
Adjust the SNP, and correct allele-frequency-like INFO fields and the relevant genotypes if the reference and alternate alleles have been swapped in the new genome build
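The two transformations above can be illustrated with a small pure-Python sketch. It is not the Picard or Glow implementation: `reverse_complement`, `swap_ref_alt`, and the dict-based variant record are hypothetical stand-ins, the sketch handles only a biallelic site with a single allele frequency, and it omits left-alignment entirely.

```python
from typing import Any, Dict

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(allele: str) -> str:
    """Complement each base, then reverse the sequence."""
    return allele.translate(_COMPLEMENT)[::-1]


def swap_ref_alt(variant: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a biallelic variant with ref and alt exchanged.

    The allele frequency becomes 1 - AF, and genotype calls 0 and 1 are
    exchanged; any other call value (for example a missing call encoded
    as -1) is left unchanged.
    """
    swapped = dict(variant)
    swapped["ref"], swapped["alt"] = variant["alt"], variant["ref"]
    swapped["AF"] = 1.0 - variant["AF"]
    swapped["genotypes"] = [
        [1 - call if call in (0, 1) else call for call in calls]
        for calls in variant["genotypes"]
    ]
    return swapped


if __name__ == "__main__":
    # Hypothetical record; the field names are chosen for this sketch only.
    example = {"ref": "A", "alt": "G", "AF": 0.25,
               "genotypes": [[0, 1], [1, 1], [0, 0], [-1, -1]]}
    print(reverse_complement("ACCG"))
    print(swap_ref_alt(example))
```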
Pull a target assembly reference file down to every node in the Spark cluster in addition to a chain file before performing variant liftOver.
The lift_over_variants transformer operates on a DataFrame containing genetic variants and supports the following options:
The path of the chain file.
The path of the target reference file.
Minimum fraction of bases that must remap.
The output DataFrame’s schema consists of the input DataFrame’s schema with the following fields appended:
INFO field: boolean (null if liftOver failed, true if the reference and alternate alleles were swapped, false otherwise)
INFO field: boolean (null if liftOver failed, true if the reference and alternate alleles were reverse complemented, false otherwise)
liftOverStatus.success: boolean (true if liftOver succeeded, false otherwise)
liftOverStatus.errorMessage: string (null if liftOver succeeded, message describing reason for liftOver failure otherwise)
If liftOver succeeds, the output row contains the liftOver result and liftOverStatus.success is true.
If liftOver fails, the output row contains the original input row, the additional INFO fields are null, liftOverStatus.success is false, and liftOverStatus.errorMessage contains the reason liftOver failed.
output_df = glow.transform('lift_over_variants', input_df, chain_file=chain_file, reference_file=reference_file)
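Because failed rows stay in the output with their original coordinates, downstream code usually needs to separate them from the lifted rows. The sketch below shows that logic on plain Python dicts standing in for DataFrame rows. The helpers `split_by_status` and `failure_reasons`, the sample rows, and the error message text are hypothetical; only the `liftOverStatus.success` and `liftOverStatus.errorMessage` field names come from the description above.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_by_status(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Partition rows into (lifted, failed) using liftOverStatus.success."""
    lifted: List[Row] = []
    failed: List[Row] = []
    for row in rows:
        target = lifted if row["liftOverStatus"]["success"] else failed
        target.append(row)
    return lifted, failed


def failure_reasons(failed: List[Row]) -> Counter:
    """Count failed rows by liftOverStatus.errorMessage."""
    return Counter(row["liftOverStatus"]["errorMessage"] for row in failed)


if __name__ == "__main__":
    # Invented sample rows; the error message is a placeholder, not real tool output.
    sample = [
        {"contigName": "chr1", "start": 100,
         "liftOverStatus": {"success": True, "errorMessage": None}},
        {"contigName": "1", "start": 200,
         "liftOverStatus": {"success": False, "errorMessage": "example failure reason"}},
    ]
    lifted_rows, failed_rows = split_by_status(sample)
    print(len(lifted_rows), len(failed_rows))
    print(failure_reasons(failed_rows))
```

On a Spark DataFrame the analogous step would be a filter on the nested field, for example `output_df.filter('liftOverStatus.success')`.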