Variant Normalization

Different genomic analysis tools often represent the same genomic variant in different ways, which makes it non-trivial to integrate and compare variants across call sets. Variant normalization is therefore an essential step before downstream analysis: it ensures that the same variant is represented identically in different call sets. A variant is normalized when it is parsimonious (represented in as few nucleotides as possible without reducing any allele to length zero) and left-aligned (shifted to the leftmost possible position); see Variant Normalization for more details.
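To make the two properties concrete, here is a minimal pure-Python sketch of the trim-and-left-extend idea commonly described for tools of this kind. It is not Glow's implementation. The function name `normalize_sketch`, the in-memory `ref_seq` string standing in for the indexed reference genome, and the 0-based start with exclusive end are all assumptions made for illustration.

```python
def normalize_sketch(ref_seq, start, ref, alts):
    """Trim and left-align one variant against an in-memory reference string.

    ref_seq: contig sequence (hypothetical stand-in for the indexed FASTA).
    start:   0-based start of the reference allele within ref_seq.
    Returns (start, end, ref, alts) with an exclusive end coordinate.
    """
    alleles = [ref] + list(alts)
    if not all(alleles) or len(set(alleles)) == 1:
        raise ValueError("need non-empty alleles and at least one real alternate")
    if ref_seq[start:start + len(ref)] != ref:
        raise ValueError("reference allele does not match the reference sequence")

    changed = True
    while changed:
        changed = False
        # Right-trim: drop the last base if every allele ends in the same base.
        if len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # Left-extend: if an allele became empty, prepend the preceding reference base.
        if any(not a for a in alleles):
            if start == 0:
                # Simplification: a real tool would handle the contig start differently.
                raise ValueError("cannot extend past the start of the contig")
            start -= 1
            alleles = [ref_seq[start] + a for a in alleles]
            changed = True

    # Left-trim: drop a shared leading base while every allele keeps at least one base.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        start += 1

    return start, start + len(alleles[0]), alleles[0], alleles[1:]


# Deleting one "CA" unit from a CA repeat, written at two different positions,
# collapses to the same normalized record.
print(normalize_sketch("GGGCACACACAGGG", 8, "ACA", ["A"]))  # (2, 5, 'GCA', ['G'])
print(normalize_sketch("GGGCACACACAGGG", 3, "CAC", ["C"]))  # (2, 5, 'GCA', ['G'])
```

Both inputs describe the same deletion, which is why a shared normalized form is needed before call sets can be compared row by row.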

Glow provides variant normalization both as a DataFrame transformer and as a SQL expression function with a Python API, so the operation can scale to large variant DataFrames.

Note

Glow’s variant normalization algorithm follows the same logic as normalization tools such as bcftools norm and vt normalize. This logic differs from the one used by GATK’s LeftAlignAndTrimVariants, which sometimes yields incorrect normalization (see Variant Normalization for more details).

`normalize_variants` Transformer

The normalize_variants transformer can be applied to normalize a variant DataFrame, such as one generated by loading VCF or BGEN files. The output of the transformer is described under the replace_columns option below.

Usage

Assuming df_original is a variable of type DataFrame which contains the genomic variant records, and ref_genome_path is a variable of type String containing the path to the reference genome file, a minimal example of using this transformer for normalization is as follows:

Python:

import glow
df_normalized = glow.transform("normalize_variants", df_original, reference_genome_path=ref_genome_path)

Scala:

import io.projectglow.Glow
val df_normalized = Glow.transform("normalize_variants", df_original, Map("reference_genome_path" -> ref_genome_path))

Options

The normalize_variants transformer has the following options:

Option	Type	Possible values and description
`reference_genome_path`	string	Path to the reference genome `.fasta` or `.fa` file. This file must be accompanied by a `.fai` index file in the same folder.
`replace_columns`	boolean	`False`: The transformer does not modify the original `start`, `end`, `referenceAllele`, and `alternateAlleles` columns. Instead, a StructType column called `normalizationResult` is added to the DataFrame. This column contains the normalized `start`, `end`, `referenceAllele`, and `alternateAlleles` fields, plus a `normalizationStatus` StructType as the fifth field with two subfields: (1) `changed`, which indicates whether the variant data was changed as a result of normalization; (2) `errorMessage`, which holds an error message if the attempt to normalize the row failed, and is `null` otherwise. When an error occurs, `changed` is set to `False` and the first four fields of `normalizationResult` are `null`. `True` (default): The original `start`, `end`, `referenceAllele`, and `alternateAlleles` columns are replaced with the normalized values if they have changed. Otherwise (no change, or an error), the original columns are left unmodified. A StructType `normalizationStatus` column with the same two subfields is added to the DataFrame.
`mode` (deprecated)	string	`normalize`: Only normalizes the variants (the default if the option is not passed). `split_and_normalize`: Splits multiallelic variants into biallelic variants and then normalizes them. This usage is deprecated; use the split_multiallelics transformer followed by the normalize_variants transformer instead. `split`: Only splits multiallelic variants into biallelic variants without normalizing. This usage is deprecated; use the split_multiallelics transformer instead.
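As a rough illustration of how the status column described above might be used, the sketch below separates rows that were rewritten from rows that failed to normalize. It assumes a DataFrame produced with the default `replace_columns` setting, so that `normalizationStatus` is a top-level column. The helper name `split_by_normalization_status` is hypothetical and not part of Glow; only the standard `DataFrame.where` method with SQL expression strings is used.

```python
def split_by_normalization_status(df_normalized):
    """Return (changed, failed) subsets of a normalize_variants output.

    Assumes df_normalized carries the normalizationStatus struct column
    described for replace_columns=True; the function name is hypothetical.
    """
    # Rows whose start, end, or alleles were rewritten by normalization.
    changed = df_normalized.where("normalizationStatus.changed")
    # Rows where normalization hit an error and the original values were kept.
    failed = df_normalized.where("normalizationStatus.errorMessage IS NOT NULL")
    return changed, failed
```

Inspecting the `failed` subset, for example by selecting `contigName`, `start`, and `normalizationStatus.errorMessage`, is one way to check whether errors cluster on particular contigs before continuing downstream.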

`normalize_variant` Function

The normalizer can also be used as a SQL expression function. See Glow PySpark Functions for details on how to use it in the Python API. An example of an expression using the normalize_variant function is as follows:

import glow
# Column names are passed as strings; ref_genome_path is the path to the reference genome file.
normalization_expr = glow.normalize_variant('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles', ref_genome_path)
df_normalized = df_original.withColumn('normalizationResult', normalization_expr)