Sample Quality Control
You can calculate quality control statistics on your variant data using Spark SQL functions, which can be expressed in Python, R, Scala, or SQL.
Each of these functions returns an array of structs, with one struct of metrics per sample. If sample IDs are included in the input DataFrame, they are propagated to the output. The functions assume that the genotypes array in each row of the input DataFrame contains the same samples in the same order.
Functions | Arguments | Return
---|---|---
`sample_call_summary_stats` | `genotypes` (with `calls`), `refAllele`, `alternateAlleles` | A struct containing the following summary stats: `callRate`, `nCalled`, `nUncalled`, `nHet`, `nHomozygous`, `nSnp`, `nInsertion`, `nDeletion`, `nTransition`, `nTransversion`, `nSpanningDeletion`, `rTiTv`, `rInsertionDeletion`, `rHetHomVar`
`sample_dp_summary_stats` | `genotypes` (with `depth`) | A struct with `min`, `max`, `mean`, and `stDev` of the genotype depths for each sample
`sample_gq_summary_stats` | `genotypes` (with `conditionalQuality`) | A struct with `min`, `max`, `mean`, and `stDev` of the genotype qualities for each sample
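To make the shape of the `min`/`max`/`mean`/`stDev` summary structs concrete, here is a plain-Python sketch of the same computation on a hypothetical list of per-sample depth values. This is an illustration only, not Glow's implementation; the population standard deviation is assumed here for simplicity.

```python
import math

def summary_stats(values):
    # Summary in the same shape as the min/max/mean/stDev structs above.
    # Population standard deviation is an assumption for this sketch.
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "stDev": math.sqrt(variance),
    }

# Hypothetical depth (DP) values for one sample across four sites
stats = summary_stats([30, 25, 40, 35])
print(stats)
```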
Computing user-defined sample QC metrics
In addition to the built-in QC functions discussed above, Glow provides two ways to compute user-defined per-sample statistics.
Explode and aggregate
If your dataset is not in a normalized, pVCF-esque shape, or if you want the aggregation output in a
table rather than a single array, you can explode the genotypes
array and use any of the
aggregation functions built into Spark. For example, this code snippet computes the number of sites
with a non-reference allele for each sample:
import pyspark.sql.functions as fx

# One row per (site, sample) pair after exploding the genotypes array
exploded_df = df.withColumn("genotype", fx.explode("genotypes"))\
  .withColumn("hasNonRef", fx.expr("exists(genotype.calls, call -> call != -1 and call != 0)"))

# Count sites with and without a non-reference allele for each sample
agg = exploded_df.groupBy("genotype.sampleId", "hasNonRef")\
  .agg(fx.count(fx.lit(1)).alias("siteCount"))\
  .orderBy("sampleId", "hasNonRef")
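For intuition, the explode-and-aggregate pattern above can be sketched in plain Python on a toy exploded dataset. The sample IDs and call arrays below are hypothetical, and the `hasNonRef` predicate mirrors the `exists(genotype.calls, call -> call != -1 and call != 0)` expression from the snippet.

```python
from collections import Counter

# Hypothetical rows after exploding the genotypes array:
# one (sampleId, calls) pair per sample per site.
exploded = [
    ("HG001", [0, 0]), ("HG002", [0, 1]),   # site 1
    ("HG001", [1, 1]), ("HG002", [0, 0]),   # site 2
    ("HG001", [0, 0]), ("HG002", [0, 1]),   # site 3
]

# Group by (sampleId, hasNonRef) and count rows, like the groupBy/agg above
counts = Counter(
    (sample_id, any(c not in (-1, 0) for c in calls))
    for sample_id, calls in exploded
)
for key in sorted(counts):
    print(key, counts[key])
```

Each `(sampleId, True)` count is that sample's number of sites carrying a non-reference allele, matching what the Spark aggregation produces per group.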