Variant Normalization

Different genomic analysis tools often represent the same genomic variant in different ways, which makes it non-trivial to integrate and compare variants across call sets. Variant normalization is therefore an essential step before downstream analysis: it ensures that the same variant is represented identically in different call sets. A variant is normalized when its representation is parsimonious (its alleles are as short as possible without any allele being empty) and left-aligned (its position is shifted as far left as possible); see Variant Normalization for more details.
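To make "parsimonious and left-aligned" concrete, the following is a minimal pure-Python sketch of one commonly described normalization procedure: trim shared trailing bases (extending the alleles to the left from the reference whenever an allele would become empty), then trim shared leading bases. It is an illustration only, not Glow's implementation; the function name normalize_variant, its arguments, and the toy reference string are assumptions made for this example.

```python
def normalize_variant(ref_seq, pos, alleles):
    """Return (pos, alleles) in parsimonious, left-aligned form.

    ref_seq : reference sequence of one contig as a string (hypothetical input)
    pos     : 1-based position of the first base of the REF allele
    alleles : REF allele followed by one or more ALT alleles
    """
    alleles = list(alleles)
    # Nothing to do unless there are at least two distinct alleles.
    if len(set(alleles)) < 2:
        return pos, alleles

    # Step 1: drop the shared rightmost base. If that would leave an
    # allele empty, first extend every allele one base to the left.
    while len({a[-1] for a in alleles}) == 1:
        if min(len(a) for a in alleles) == 1:
            if pos == 1:
                break  # already at the start of the contig
            pos -= 1
            left_base = ref_seq[pos - 1]
            alleles = [left_base + a for a in alleles]
        alleles = [a[:-1] for a in alleles]

    # Step 2: drop the shared leftmost base while every allele keeps
    # at least one base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1

    return pos, alleles


# Toy reference: positions 1..10 are G G C A C A C A G G.
reference = "GGCACACAGG"

# Deleting one "CA" repeat unit, written at the right end of the repeat.
deletion = normalize_variant(reference, 6, ["ACA", "A"])

# A SNP padded with unnecessary flanking bases.
snp = normalize_variant(reference, 3, ["CAC", "CGC"])
```

In this toy case the deletion is moved to the left end of the CA repeat and the padded SNP is reduced to a single base, so two call sets that wrote these variants differently would end up with identical records.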

Glow provides the normalize_variants transformer, which normalizes the variants in a variant DataFrame and allows this operation to scale to large call sets. When applied to an input DataFrame of variants (e.g., one generated by loading VCF or BGEN files), the transformer returns a DataFrame containing the normalized variants.

Note

The variant normalization algorithm used by the normalize_variants transformer follows the same logic as the one used in normalization tools such as bcftools norm and vt normalize. This logic differs from the one used by GATK’s LeftAlignAndTrimVariants, which sometimes yields incorrect normalization (see Variant Normalization for more details).

Usage

Assuming df_original is a variable of type DataFrame which contains the genomic variant records, and ref_genome_path is a variable of type String containing the path to the reference genome file, a minimal example of using this transformer for normalization is:

Python:

df_normalized = glow.transform("normalize_variants", df_original, reference_genome_path=ref_genome_path)

Scala:

val df_normalized = Glow.transform("normalize_variants", df_original, Map("reference_genome_path" -> ref_genome_path))
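The reference genome passed as ref_genome_path must be accompanied by .fai and .dict files, as noted under Options. The sketch below shows one way to check for them before running the transformer. It is a hypothetical helper, not part of Glow: the function names are invented for this example, it only works for paths on the local filesystem, and it assumes the two widespread naming conventions (extension appended, as in genome.fa.fai, or extension replaced, as in genome.dict).

```python
from pathlib import Path


def companion_candidates(reference_path):
    """List plausible companion file paths for a reference FASTA.

    Both naming conventions are tried because the exact convention
    expected is an assumption of this sketch.
    """
    ref = Path(reference_path)
    return {
        ".fai": [Path(str(ref) + ".fai"), ref.with_suffix(".fai")],
        ".dict": [ref.with_suffix(".dict"), Path(str(ref) + ".dict")],
    }


def missing_companions(reference_path):
    """Return the companion extensions for which no candidate file exists."""
    missing = []
    for extension, candidates in companion_candidates(reference_path).items():
        if not any(path.is_file() for path in candidates):
            missing.append(extension)
    return missing
```

Calling missing_companions(ref_genome_path) and getting an empty list back suggests the index and dictionary files are in place; any extension in the returned list points to a file that still needs to be generated.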

Options

The normalize_variants transformer has the following options:

Option | Type | Possible values and description
referenceGenomePath | string | Path to the reference genome .fasta or .fa file (required for normalization). Note: .fai and .dict files with the same name must be present in the same folder.
mode (deprecated) | string | One of the following values:
  normalize: Only normalizes the variants (this is the default if the option is not passed).
  split_and_normalize: Splits multiallelic variants into biallelic variants and then normalizes them. This usage is deprecated. Instead, use the split_multiallelics transformer followed by the normalize_variants transformer.
  split: Only splits multiallelic variants into biallelic variants without normalizing. This usage is deprecated. Instead, use the split_multiallelics transformer.
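As a sketch of the recommended replacement for the deprecated split_and_normalize mode, the two transformers can be chained as shown below. The wrapper function split_then_normalize and its transform parameter are hypothetical conveniences added here so the chaining can be exercised without a Spark session; by default the function falls back to glow.transform, which assumes Glow is installed and registered.

```python
def split_then_normalize(df, ref_genome_path, transform=None):
    """Split multiallelic variants, then normalize the result.

    `transform` defaults to glow.transform; it is a parameter only so
    that the call sequence can be checked with a stand-in function.
    """
    if transform is None:
        import glow  # imported lazily; requires Glow to be installed

        transform = glow.transform

    # Step 1: split multiallelic variants into biallelic variants.
    df_split = transform("split_multiallelics", df)

    # Step 2: normalize the split variants against the reference genome.
    return transform(
        "normalize_variants",
        df_split,
        reference_genome_path=ref_genome_path,
    )
```

With Glow available, df_normalized = split_then_normalize(df_original, ref_genome_path) gives the same two-step pipeline that the deprecation notes describe, without using the mode option.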

Variant normalization notebook