.. _liftover:

=========
Liftover
=========

.. invisible-code-block: python

    import glow
    input_df = spark.read.format('vcf').load('test-data/combined.chr20_18210071_18210093.g.vcf')
    chain_file = 'test-data/liftover/hg38ToHg19.over.chain.gz'
    reference_file = 'test-data/liftover/hg19.chr20.fa.gz'

LiftOver converts genomic data between reference assemblies. The `UCSC liftOver tool`_ uses a `chain file`_ to
perform simple coordinate conversion, for example on `BED files`_. The `Picard LiftOverVcf tool`_ also uses the target
`reference assembly file`_ to transform variant information (e.g., alleles and INFO fields).
Glow can be used to run `coordinate liftOver`_ and `variant liftOver`_.

.. _`UCSC liftOver tool`: https://genome.ucsc.edu/cgi-bin/hgLiftOver
.. _`chain file`: https://genome.ucsc.edu/goldenPath/help/chain.html
.. _`reference assembly file`: https://gatk.broadinstitute.org/hc/en-us/articles/360035531652?id=11013
.. _`BED files`: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
.. _`Picard LiftOverVcf tool`: https://gatk.broadinstitute.org/hc/en-us/articles/360036857991-LiftoverVcf-Picard

Create a liftOver cluster
==========================

Both coordinate and variant liftOver require a chain file on every node of the cluster.
On a Databricks cluster, you can use a
`cluster-scoped init script <https://docs.databricks.com/en/init-scripts/cluster-scoped.html>`_
to download the required file. For example, the following script downloads the chain file for liftOver from the b37 to the hg38 reference assembly:

.. code-block:: bash

    #!/usr/bin/env bash
    set -ex
    set -o pipefail
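    # -p avoids an error if the directory already exists
    mkdir -p /opt/liftover
    # --fail makes curl exit non-zero on an HTTP error instead of saving the error page
    curl --fail https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain --output /opt/liftover/b37ToHg38.over.chain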

.. tip::
   
   Chain files may name chromosomes with or without the "chr" prefix, e.g. "chr1" or "1".
   Use the Spark SQL function ``regexp_replace`` to transform the contig names in your variant dataset to match the chain file.
   For example:

.. invisible-code-block: python
    import glow
    vcf_df = spark.read.format('vcf').load('test-data/combined.chr20_18210071_18210093.g.vcf')

.. code-block:: python

    import pyspark.sql.functions as fx
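    # Add the 'chr' prefix (e.g. '20' becomes 'chr20')
    vcf_df = vcf_df.withColumn("contigName", fx.regexp_replace(fx.col('contigName'), '^', 'chr'))
    # Remove the 'chr' prefix (e.g. 'chr20' becomes '20'); the pattern is anchored so only a leading 'chr' is removed
    vcf_df = vcf_df.withColumn("contigName", fx.regexp_replace(fx.col('contigName'), '^chr', ''))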
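
Adding the prefix to a name that already has it produces values such as ``chrchr20``. The following is a minimal
sketch in plain Python, using the standard ``re`` module rather than Spark, of patterns that leave already-normalized
names unchanged. The helper names ``add_chr_prefix`` and ``remove_chr_prefix`` are hypothetical and are not part of Glow.
The same patterns should also work with ``regexp_replace``, since Java regular expressions support anchors and
negative lookahead, but verify them on your own data.

.. code-block:: python

    import re

    def add_chr_prefix(contig_name):
        # Insert 'chr' at the start only when the name does not already begin with it
        return re.sub(r'^(?!chr)', 'chr', contig_name)

    def remove_chr_prefix(contig_name):
        # Remove only a leading 'chr'; names without the prefix are returned unchanged
        return re.sub(r'^chr', '', contig_name)

    print(add_chr_prefix('20'), add_chr_prefix('chr20'))        # chr20 chr20
    print(remove_chr_prefix('chr20'), remove_chr_prefix('20'))  # 20 20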

Coordinate liftOver
====================

To perform liftOver for genomic coordinates, use the function ``lift_over_coordinates``. ``lift_over_coordinates`` has
the following parameters:

- chromosome: ``string``
- start: ``long``
- end: ``long``
- chain file: ``string`` (constant value, such as one created with ``lit()``)
- minimum fraction of bases that must remap: ``double`` (optional, defaults to ``0.95``)

The returned ``struct`` has the following values if liftOver succeeded. If not, the function returns ``null``.

- ``contigName``: ``string``
- ``start``: ``long``
- ``end``: ``long``

.. code-block:: python

    output_df = input_df.withColumn('lifted', glow.lift_over_coordinates('contigName', 'start',
      'end', chain_file, 0.99))

.. invisible-code-block: python

    from pyspark.sql import Row
    assert_rows_equal(output_df.select('lifted').head().lifted, Row(contigName='chr20', start=18190714, end=18190715))
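
Because ``lift_over_coordinates`` returns ``null`` when liftOver fails, it can be useful to separate the rows that
could not be lifted. The following is a minimal sketch, assuming the result column is named ``lifted`` as in the
example above. ``split_lifted_coordinates`` is a hypothetical helper, not part of Glow; it relies only on
``DataFrame.where`` with SQL expression strings.

.. code-block:: python

    def split_lifted_coordinates(df, lifted_col='lifted'):
        """Return (lifted_df, unlifted_df) for a DataFrame with a liftOver result column."""
        # Rows where liftOver succeeded hold a non-null struct
        lifted_df = df.where(f'{lifted_col} IS NOT NULL')
        # Rows where liftOver failed hold null in the result column
        unlifted_df = df.where(f'{lifted_col} IS NULL')
        return lifted_df, unlifted_df

For example, ``lifted_df, unlifted_df = split_lifted_coordinates(output_df)``. The lifted coordinates can then be
read from ``lifted.contigName``, ``lifted.start``, and ``lifted.end``.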

Variant liftOver
=================

For genetic variant data, use the ``lift_over_variants`` transformer. In addition to performing liftOver for genomic
coordinates, variant liftOver performs the following transformations:

- Reverse-complement and left-align the variant if needed
- If the reference and alternate alleles have been swapped in the new genome build, adjust the SNP and correct
  allele-frequency-like INFO fields and the relevant genotypes

Before performing variant liftOver, download a reference file for the target assembly to every node in the Spark
cluster, in addition to the chain file.

The ``lift_over_variants`` transformer operates on a DataFrame containing genetic variants and supports the following
options:

.. list-table::
  :header-rows: 1

  * - Parameter
    - Default
    - Description
  * - ``chain_file``
    - n/a
    - The path of the chain file.
  * - ``reference_file``
    - n/a
    - The path of the target reference file.
  * - ``min_match_ratio``
    - 0.95
    - Minimum fraction of bases that must remap.

The output DataFrame's schema consists of the input DataFrame's schema with the following fields appended:

- ``INFO_SwappedAlleles``: ``boolean`` (null if liftOver failed, true if the reference and alternate alleles were
  swapped, false otherwise)
- ``INFO_ReverseComplementedAlleles``: ``boolean`` (null if liftOver failed, true if the reference and alternate
  alleles were reverse complemented, false otherwise)
- ``liftOverStatus``: ``struct``

   * ``success``: ``boolean`` (true if liftOver succeeded, false otherwise)
   * ``errorMessage``: ``string`` (null if liftOver succeeded, message describing reason for liftOver failure otherwise)

If liftOver succeeds, the output row contains the liftOver result and ``liftOverStatus.success`` is true.
If liftOver fails, the output row contains the original input row, the additional ``INFO`` fields are null,
``liftOverStatus.success`` is false, and ``liftOverStatus.errorMessage`` contains the reason liftOver failed.

.. code-block:: python

    output_df = glow.transform('lift_over_variants', input_df, chain_file=chain_file, reference_file=reference_file)

.. invisible-code-block: python

   lifted_variant = output_df.select('contigName', 'start', 'end', 'INFO_SwappedAlleles', 'INFO_ReverseComplementedAlleles', 'liftOverStatus').head()
   expected_variant = Row(contigName='chr20', start=18190714, end=18190715, INFO_SwappedAlleles=False, INFO_ReverseComplementedAlleles=False, liftOverStatus=Row(errorMessage=None, success=True))
   assert_rows_equal(lifted_variant, expected_variant)
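
Since failed rows are kept in the output with ``liftOverStatus.success`` set to false, a common follow-up is to
separate them and inspect why they failed. The following is a minimal sketch using only the fields described above.
The helper names ``split_by_liftover_status`` and ``summarize_liftover_failures`` are hypothetical and not part of
Glow; they rely on the standard ``DataFrame.where``, ``groupBy``, and ``count`` methods.

.. code-block:: python

    def split_by_liftover_status(df):
        """Return (lifted_df, failed_df) based on liftOverStatus.success."""
        lifted_df = df.where('liftOverStatus.success')
        failed_df = df.where('NOT liftOverStatus.success')
        return lifted_df, failed_df

    def summarize_liftover_failures(df):
        """Count failed variants per error message."""
        return (df.where('NOT liftOverStatus.success')
                  .groupBy('liftOverStatus.errorMessage')
                  .count())

For example, ``summarize_liftover_failures(output_df).show(truncate=False)`` lists each failure reason with the
number of variants affected.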

.. notebook:: .. etl/10_liftOver.html
  :title: Liftover notebook