Merging Variant Datasets

You can use Glow and Spark to merge genomic variant datasets from non-overlapping sample sets into a multi-sample dataset. In these examples, we will read from VCF files, but the same logic works on DataFrames backed by other file formats.

First, read the VCF files into a single Spark DataFrame:

from pyspark.sql.functions import *

df = spark.read.format('vcf').load([path1, path2])

# Alternatively, you can use the "union" DataFrame method if the VCF files have the same schema
df1 = spark.read.format('vcf').load(path1)
df2 = spark.read.format('vcf').load(path2)
df = df1.union(df2)

The resulting DataFrame contains all records from the VCFs you want to merge, but the genotypes from different samples at the same site have not been combined. You can use an aggregation to combine the genotype arrays.

from pyspark.sql.functions import *

merged_df = df.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
  .agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'))

Important

When reading VCF files for a merge operation, sampleId must be the first field in the genotype struct. This is the default Glow schema.

The genotypes from different samples now appear in the same genotypes array.

Note

If the VCFs you are merging contain different sites, elements will be missing from the genotypes array after aggregation. Glow automatically fills in missing genotypes when writing to bigvcf, so an exported VCF will still contain all samples.

Aggregating INFO fields

To preserve INFO fields in a merge, you can use the aggregation functions in Spark. For instance, to emit an INFO_DP column that is the sum of the INFO_DP columns across all samples:

from pyspark.sql.functions import *

merged_df = df.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
  .agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'),
       sum('INFO_DP').alias('INFO_DP'))

Joint genotyping

The merge logic in this document allows you to quickly aggregate genotyping array data or single sample VCFs. For a more sophisticated aggregation that unifies alleles at overlapping sites and uses cohort-level statistics to refine genotype calls, we recommend running a joint genotyping pipeline.

Notebook