Liftover
LiftOver converts genomic data between reference assemblies. The UCSC liftOver tool uses a chain file to perform simple coordinate conversion, for example on BED files. The Picard LiftOverVcf tool also uses the new reference assembly file to transform variant information (e.g., alleles and INFO fields). Glow can be used to run both coordinate liftOver and variant liftOver.
Create a liftOver cluster
For both coordinate and variant liftOver, you need a chain file on every node of the cluster. On a Databricks cluster, you can use a cluster-scoped init script to download the required file. The following example downloads the chain file for liftOver from the b37 to the hg38 reference assembly:
#!/usr/bin/env bash
set -ex
set -o pipefail
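# -p keeps the script from failing if the directory already exists
mkdir -p /opt/liftover
# --fail makes curl exit non-zero on an HTTP error instead of saving the error page
curl --fail https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain --output /opt/liftover/b37ToHg38.over.chain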
Tip
Chain files may represent chromosomes with or without the “chr” prefix, e.g. “chr1” or “1”.
Use the Spark SQL function regexp_replace to transform your variant dataset to match the chain file. For example:
import pyspark.sql.functions as fx
# Add the 'chr' prefix
vcf_df = vcf_df.withColumn("contigName", fx.regexp_replace(fx.col('contigName'), '^', 'chr'))
# Remove the 'chr' prefix (anchored with ^ so only a leading 'chr' is stripped)
vcf_df = vcf_df.withColumn("contigName", fx.regexp_replace(fx.col('contigName'), '^chr', ''))
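An unanchored pattern such as 'chr' matches anywhere in a contig name, so anchoring it to the start of the string is the safer choice. The following plain-Python sketch illustrates the idea with the standard re module. The helper names and the sample contig names are hypothetical, and the same patterns are expected to carry over to regexp_replace, which uses Java regular expressions.

```python
import re


def add_chr_prefix(contig: str) -> str:
    # Prepend 'chr' only when the name does not already start with it,
    # so applying the function twice gives the same result as applying it once.
    return re.sub(r"^(?!chr)", "chr", contig)


def remove_chr_prefix(contig: str) -> str:
    # Strip only a leading 'chr'; any later occurrence in the name is kept.
    return re.sub(r"^chr", "", contig)


# Hypothetical contig names, including one that contains 'chr' twice.
names = ["1", "chr1", "chrUn_example_chr"]

print([add_chr_prefix(n) for n in names])     # ['chr1', 'chr1', 'chrUn_example_chr']
print([remove_chr_prefix(n) for n in names])  # ['1', '1', 'Un_example_chr']
```

Whichever direction you convert, check the result against the contig names that actually appear in your chain file.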
Coordinate liftOver
To perform liftOver for genomic coordinates, use the function lift_over_coordinates. lift_over_coordinates has the following parameters.
chromosome: string
start: long
end: long
chain file: string (constant value, such as one created with lit())
minimum fraction of bases that must remap: double (optional, defaults to .95)
The returned struct has the following values if liftOver succeeded. If not, the function returns null.
contigName: string
start: long
end: long
import glow
chain_file = '/opt/liftover/b37ToHg38.over.chain'
output_df = input_df.withColumn('lifted', glow.lift_over_coordinates('contigName', 'start',
  'end', chain_file, 0.99))
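To build intuition for the minimum-fraction parameter and the null result, here is a toy model in plain Python. It is only a sketch under simplifying assumptions (ungapped, non-overlapping, forward-strand blocks) and is not how Glow or the UCSC tool is implemented. The Block type, the lift_interval function, and all coordinates are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """One ungapped aligned block (a simplified, hypothetical chain record)."""
    src_contig: str
    src_start: int  # 0-based, inclusive
    src_end: int    # exclusive
    dst_contig: str
    dst_start: int  # target position that src_start maps to


def lift_interval(contig: str, start: int, end: int, blocks: List[Block],
                  min_match: float = 0.95) -> Optional[Tuple[str, int, int]]:
    """Return (contig, start, end) in the target assembly, or None on failure."""
    length = end - start
    if length <= 0:
        return None
    covered = 0
    hits = []
    for b in blocks:
        if b.src_contig != contig:
            continue
        # Portion of the query interval that falls inside this block.
        lo, hi = max(start, b.src_start), min(end, b.src_end)
        if lo < hi:
            covered += hi - lo
            offset = b.dst_start - b.src_start
            hits.append((b.dst_contig, lo + offset, hi + offset))
    # Fail when too few bases remap or the hits span several target contigs.
    if not hits or covered / length < min_match:
        return None
    if len({h[0] for h in hits}) != 1:
        return None
    return (hits[0][0], min(h[1] for h in hits), max(h[2] for h in hits))


# Two blocks on contig '1' separated by a 10-base gap in the source assembly.
blocks = [Block("1", 100, 200, "chr1", 1100), Block("1", 210, 300, "chr1", 1210)]

print(lift_interval("1", 120, 180, blocks))                 # fully inside one block
print(lift_interval("1", 150, 250, blocks))                 # 90% covered: None at 0.95
print(lift_interval("1", 150, 250, blocks, min_match=0.8))  # succeeds at a lower threshold
```

In this model, lowering the threshold lets intervals that span small gaps lift over, at the cost of less exact coordinates.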
Variant liftOver
For genetic variant data, use the lift_over_variants transformer. In addition to performing liftOver for genetic coordinates, variant liftOver performs the following transformations:
Reverse-complement and left-align the variant if needed
Adjust the SNP, and correct allele-frequency-like INFO fields and the relevant genotypes if the reference and alternate alleles have been swapped in the new genome build
Before performing variant liftOver, download a target assembly reference file to every node in the Spark cluster, in addition to the chain file.
The lift_over_variants transformer operates on a DataFrame containing genetic variants and supports the following options:
Parameter | Default | Description |
---|---|---|
chain_file | n/a | The path of the chain file. |
reference_file | n/a | The path of the target reference file. |
min_match_ratio | .95 | Minimum fraction of bases that must remap. |
The output DataFrame’s schema consists of the input DataFrame’s schema with the following fields appended:
INFO_SwappedAlleles: boolean (null if liftOver failed, true if the reference and alternate alleles were swapped, false otherwise)
INFO_ReverseComplementedAlleles: boolean (null if liftOver failed, true if the reference and alternate alleles were reverse complemented, false otherwise)
liftOverStatus: struct
  success: boolean (true if liftOver succeeded, false otherwise)
  errorMessage: string (null if liftOver succeeded, message describing reason for liftOver failure otherwise)
If liftOver succeeds, the output row contains the liftOver result and liftOverStatus.success is true.
If liftOver fails, the output row contains the original input row, the additional INFO fields are null, liftOverStatus.success is false, and liftOverStatus.errorMessage contains the reason liftOver failed.
output_df = glow.transform('lift_over_variants', input_df, chain_file=chain_file, reference_file=reference_file)
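Because failed rows are kept in the output, it is often useful to count how many variants lifted over and why the rest did not. The sketch below works on plain Python dictionaries, assuming a small sample of rows has been collected to the driver (for example with Row.asDict(recursive=True)). The summarize_liftover helper, the sample rows, and the error messages are hypothetical placeholders.

```python
from collections import Counter


def summarize_liftover(rows):
    """Count successful rows and tally failure reasons by error message."""
    succeeded = 0
    failures = Counter()
    for row in rows:
        status = row["liftOverStatus"]
        if status["success"]:
            succeeded += 1
        else:
            failures[status["errorMessage"]] += 1
    return succeeded, failures


# Hypothetical rows shaped like the liftOverStatus struct described above.
sample = [
    {"contigName": "chr1", "liftOverStatus": {"success": True, "errorMessage": None}},
    {"contigName": "1", "liftOverStatus": {"success": False, "errorMessage": "example reason A"}},
    {"contigName": "2", "liftOverStatus": {"success": False, "errorMessage": "example reason A"}},
    {"contigName": "3", "liftOverStatus": {"success": False, "errorMessage": "example reason B"}},
]

succeeded, failures = summarize_liftover(sample)
print(succeeded)               # 1
print(failures.most_common())  # [('example reason A', 2), ('example reason B', 1)]
```

For a full dataset, the same split is better expressed as a DataFrame filter or aggregation on liftOverStatus.success, so that the work stays distributed.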