.. _split_multiallelics:

===============================
Split Multiallelic Variants
===============================

.. invisible-code-block: python

    import glow

    test_dir = 'test-data/variantsplitternormalizer-test/'
    df_original = spark.read.format('vcf').load(test_dir + '01_IN_altered_multiallelic.vcf')

**Splitting multiallelic variants to biallelic variants** is a transformation sometimes required before further downstream analysis. Glow provides the ``split_multiallelics`` transformer to be applied on a variant DataFrame to split multiallelic variants in the DataFrame to biallelic variants. This transformer is able to handle any number of ``ALT`` alleles and any ploidy.

.. note::

    The splitting logic used by the ``split_multiallelics`` transformer is the same as the one used by the `vt decompose tool <https://genome.sph.umich.edu/wiki/Vt#Decompose>`_ of the vt package with option ``-s`` (note that the example provided at `vt decompose user manual page <https://genome.sph.umich.edu/wiki/Vt#Decompose>`_ does not reflect the behavior of ``vt decompose -s`` completely correctly).

    The precise behavior of the ``split_multiallelics`` transformer is presented below:

    - A given multiallelic row with :math:`n` ``ALT`` alleles is split to :math:`n` biallelic rows, each with one of the ``ALT`` alleles of the original multiallelic row. The ``REF`` allele in all split rows is the same as the ``REF`` allele in the multiallelic row.

    - If the ``split_info_fields`` option is provided, only the specified INFO columns will be split

    - If the ``split_info_fields`` option is not provided, ``INFO`` columns derived from VCF fields with number ``A`` will be split

    - Genotype fields for each sample are treated as follows: The ``GT`` field becomes biallelic in each row, where the original ``ALT`` alleles that are not present in that row are replaced with no call. The fields with number of entries equal to number of ``REF`` + ``ALT`` alleles, are properly split into rows, where in each split row, only entries corresponding to the ``REF`` allele as well as the ``ALT`` allele present in that row are kept. The fields which follow colex order (e.g., ``GL``, ``PL``, and ``GP``) are properly split between split rows where in each row only the elements corresponding to genotypes comprising of the ``REF`` and ``ALT`` alleles in that row are listed. Other genotype fields are just repeated over the split rows.

    - Any other field in the DataFrame is just repeated across the split rows.

    As an example (shown in VCF file format), the following multiallelic row

    .. code-block::

        #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE1
        20	101	.	A	ACCA,TCGG	.	PASS	VC=INDEL;AC=3,2;AF=0.375,0.25;AN=8	GT:AD:DP:GQ:PL	0/1:2,15,31:30:99:2407,0,533,697,822,574

    will be split into the following two biallelic rows:

    .. code-block::

        #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE1
        20	101	.	A	ACCA	.	PASS	VC=INDEL;AC=3;AF=0.375;AN=8;OLD_MULTIALLELIC=20:101:A/ACCA/TCGG	GT:AD:DP:GQ:PL	0/1:2,15:30:99:2407,0,533
        20	101	.	A	TCGG	.	PASS	VC=INDEL;AC=2;AF=0.25;AN=8;OLD_MULTIALLELIC=20:101:A/ACCA/TCGG	GT:AD:DP:GQ:PL	0/.:2,31:30:99:2407,697,574

Options
=======
The ``split_multiallelics`` transformer has the following options:

.. list-table::
   :header-rows: 1

   * - Option
     - Type
     - Possible values and description
   * - ``split_info_fields``
     - string
     - A comma separated list of info columns that should be split i.e., ``INFO_AC,INFO_AF``
   
Usage
=====

Assuming ``df_original`` is a variable of type DataFrame which contains the genomic variant records, an example of using this transformer for splitting multiallelic variants is:

.. tabs::

    .. tab:: Python

        .. code-block:: python

            df_split = glow.transform("split_multiallelics", df_original)

        .. invisible-code-block: python

            from pyspark.sql import Row

            expected_split_variant = Row(contigName='20', start=100, end=101, names=None, referenceAllele='A', alternateAlleles=['ACCA'], qual=None, filters=['PASS'], splitFromMultiAllelic=True, INFO_VC='INDEL', INFO_AC=[3], INFO_AF=[0.375], INFO_AN=8, **{'INFO_refseq.name':'NM_144628', 'INFO_refseq.positionType':'intron'},INFO_OLD_MULTIALLELIC='20:101:A/ACCA/TCGG', genotypes=[Row(sampleId='SAMPLE1',  calls=[0, 1], alleleDepths=[2,15], phased=False, depth=30, conditionalQuality=99, phredLikelihoods=[2407,0,533]), Row(sampleId='SAMPLE2', calls=[1, -1], alleleDepths=[2,15], phased=False, depth=30, conditionalQuality=99, phredLikelihoods=[2407,585,533]), Row(sampleId='SAMPLE3',  calls=[0, 1], alleleDepths=[2,15], phased=False, depth=30, conditionalQuality=99, phredLikelihoods=[2407,0,533]), Row(sampleId='SAMPLE4',  calls=[0, -1], alleleDepths=[2,15], phased=False, depth=30, conditionalQuality=99, phredLikelihoods=[2407,822,533])])
            assert_rows_equal(df_split.head(), expected_split_variant)

    .. tab:: Scala

        .. code-block:: scala

            df_split = Glow.transform("split_multiallelics", df_original)

.. tip::

    The ``split_multiallelics`` transformer is often significantly faster if the `whole-stage code generation` feature of Spark Sql is turned off. Therefore, it is recommended that you temporarily turn off this feature using the following command before using this transformer.

    .. tabs::

       .. tab:: Python

            .. code-block:: python

              spark.conf.set("spark.sql.codegen.wholeStage", False)

       .. tab:: Scala

            .. code-block:: scala

              spark.conf.set("spark.sql.codegen.wholeStage", false)

    Remember to turn this feature back on after your split DataFrame is materialized.

.. notebook:: .. etl/splitmultiallelics-transformer.html
  :title: Split Multiallelic Variants notebook