Variant Quality Control

Glow includes a variety of tools for variant quality control.

Tip

This topic uses the terms “variant” or “variant data” to refer to single nucleotide variants and short indels.

You can calculate quality control statistics on your variant data using Spark SQL functions, which can be expressed in Python, R, Scala, or SQL.

Function	Arguments	Return
`hardy_weinberg`	The `genotypes` array. This function assumes that the variant has been converted to a biallelic representation.	A struct with two elements: the expected heterozygous frequency according to Hardy-Weinberg equilibrium and the associated p-value.
`call_summary_stats`	The `genotypes` array	A struct containing the following summary stats: `callRate`: the fraction of samples with a called genotype; `nCalled`: the number of samples with a called genotype; `nUncalled`: the number of samples with a missing or uncalled genotype, as represented by a ‘.’ in a VCF or -1 in a DataFrame; `nHet`: the number of heterozygous samples; `nHomozygous`: an array with the number of samples that are homozygous for each allele, where the 0th element is the number of hom-ref samples; `nNonRef`: the number of samples that are not hom-ref; `nAllelesCalled`: an array with the number of times each allele was seen; `alleleFrequencies`: an array with the frequency of each allele.
`dp_summary_stats`	The `genotypes` array	A struct containing the min, max, mean, and sample standard deviation for genotype depth (DP in VCF v4.2 specification) across all samples
`gq_summary_stats`	The `genotypes` array	A struct containing the min, max, mean, and sample standard deviation for genotype quality (GQ in VCF v4.2 specification) across all samples
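
To make the field definitions in the table concrete, the following is a minimal pure-Python sketch of the quantities that `call_summary_stats` and `dp_summary_stats` describe. It is an illustration only, not Glow's implementation. The helper names `call_summary` and `depth_summary`, the list-of-allele-indices input format, the `stdDev` key, and the choices to treat a genotype with any -1 allele as uncalled and to skip missing depths are all assumptions made for this example.

```python
from statistics import mean, stdev


def call_summary(genotypes, n_alleles=2):
    """Summarize calls for one variant.

    genotypes: one list of allele indices per sample, with -1 for a missing
    allele. Assumption: a genotype with any -1 allele counts as uncalled.
    """
    n_called = n_uncalled = n_het = n_non_ref = 0
    n_homozygous = [0] * n_alleles
    allele_counts = [0] * n_alleles

    for calls in genotypes:
        if not calls or any(c == -1 for c in calls):
            n_uncalled += 1
            continue
        n_called += 1
        for c in calls:
            allele_counts[c] += 1
        if len(set(calls)) == 1:
            n_homozygous[calls[0]] += 1   # index 0 counts hom-ref samples
        else:
            n_het += 1
        if any(c != 0 for c in calls):
            n_non_ref += 1                # het or hom-alt

    n_samples = n_called + n_uncalled
    total_alleles = sum(allele_counts)
    return {
        "callRate": n_called / n_samples if n_samples else None,
        "nCalled": n_called,
        "nUncalled": n_uncalled,
        "nHet": n_het,
        "nHomozygous": n_homozygous,
        "nNonRef": n_non_ref,
        "nAllelesCalled": allele_counts,
        "alleleFrequencies": (
            [c / total_alleles for c in allele_counts] if total_alleles else []
        ),
    }


def depth_summary(depths):
    """Min, max, mean, and sample standard deviation (n - 1 denominator).

    Assumption: missing depths are given as None and are skipped.
    """
    present = [d for d in depths if d is not None]
    if not present:
        return None
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "stdDev": stdev(present) if len(present) > 1 else None,
    }


# Four samples at a biallelic site: hom-ref, het, hom-alt, and uncalled.
stats = call_summary([[0, 0], [0, 1], [1, 1], [-1, -1]])
dp = depth_summary([10, 20, 30, None])
```

For the four example samples, three genotypes are called and one is uncalled, so `callRate` is 0.75, and each allele is seen three times. The same `depth_summary` logic applies to genotype quality values, which is the quantity `gq_summary_stats` describes.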

Notebook