Split Multiallelic Variants
Splitting multiallelic variants into biallelic variants is a transformation that is sometimes required before downstream analysis. Glow provides the split_multiallelics transformer, which is applied to a variant DataFrame to split its multiallelic variants into biallelic variants. The transformer handles any number of ALT alleles and any ploidy.
Note
The splitting logic used by the split_multiallelics transformer is the same as the one used by the vt decompose tool of the vt package with option -s (note that the example on the vt decompose user manual page does not fully reflect the behavior of vt decompose -s).
The precise behavior of the split_multiallelics transformer is presented below:
- A given multiallelic row with \(n\) ALT alleles is split into \(n\) biallelic rows, each with one of the ALT alleles of the original multiallelic row. The REF allele in all split rows is the same as the REF allele in the multiallelic row.
- If the split_info_fields option is provided, only the specified INFO columns are split.
- If the split_info_fields option is not provided, INFO columns derived from VCF fields with number A are split.
- Genotype fields for each sample are treated as follows:
  - The GT field becomes biallelic in each row, where the original ALT alleles that are not present in that row are replaced with no-call.
  - Fields whose number of entries equals the number of REF + ALT alleles are split so that each split row keeps only the entries corresponding to the REF allele and the ALT allele present in that row.
  - Fields that follow colex order (e.g., GL, PL, and GP) are split so that each row lists only the elements corresponding to genotypes composed of the REF and ALT alleles in that row.
  - Other genotype fields are repeated across the split rows.
- Any other field in the DataFrame is repeated across the split rows.
As an example (shown in VCF file format), the following multiallelic row
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
20 101 . A ACCA,TCGG . PASS VC=INDEL;AC=3,2;AF=0.375,0.25;AN=8 GT:AD:DP:GQ:PL 0/1:2,15,31:30:99:2407,0,533,697,822,574
will be split into the following two biallelic rows:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
20 101 . A ACCA . PASS VC=INDEL;AC=3;AF=0.375;AN=8;OLD_MULTIALLELIC=20:101:A/ACCA/TCGG GT:AD:DP:GQ:PL 0/1:2,15:30:99:2407,0,533
20 101 . A TCGG . PASS VC=INDEL;AC=2;AF=0.25;AN=8;OLD_MULTIALLELIC=20:101:A/ACCA/TCGG GT:AD:DP:GQ:PL 0/.:2,31:30:99:2407,697,574
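The per-field rules can be checked by hand against this example. The following is a minimal plain-Python sketch of those rules for a single sample; it is not Glow's implementation, and the helper names (split_gt, split_per_allele, colex_indices, split_colex) as well as the use of -1 for a no-call are assumptions made for illustration. The colex ranking uses the genotype ordering formula from the VCF specification.

```python
from math import comb

NO_CALL = -1  # sketch convention for a missing call ("." in VCF)


def split_gt(calls, alt_index):
    """Rewrite a multiallelic GT for the biallelic row of ALT number alt_index (1-based)."""
    out = []
    for allele in calls:
        if allele == 0:
            out.append(0)        # REF stays REF
        elif allele == alt_index:
            out.append(1)        # the row's own ALT becomes allele 1
        else:
            out.append(NO_CALL)  # other ALT alleles (or existing no-calls) become no-call
    return out


def split_per_allele(values, alt_index):
    """Fields with one entry per REF + ALT allele (e.g. AD): keep REF and the row's ALT."""
    return [values[0], values[alt_index]]


def colex_indices(alt_index, ploidy):
    """Positions, in the multiallelic colex ordering, of genotypes made only of REF and this ALT."""
    indices = []
    for alt_copies in range(ploidy + 1):
        # Sorted genotype: (ploidy - alt_copies) REF alleles, then alt_copies ALT alleles.
        # The rank of a sorted genotype a_1 <= ... <= a_p is sum_i C(a_i + i - 1, i);
        # REF alleles (a_i = 0) contribute 0, so only the ALT positions are summed.
        first_alt_pos = ploidy - alt_copies + 1
        rank = sum(comb(alt_index + i - 1, i) for i in range(first_alt_pos, ploidy + 1))
        indices.append(rank)
    return indices


def split_colex(values, alt_index, ploidy):
    """Fields in colex order (GL, PL, GP): keep only entries for REF/ALT genotypes."""
    return [values[i] for i in colex_indices(alt_index, ploidy)]


# SAMPLE1 from the multiallelic row: GT=0/1, AD=2,15,31, PL=2407,0,533,697,822,574
calls, ad, pl = [0, 1], [2, 15, 31], [2407, 0, 533, 697, 822, 574]

for alt_index in (1, 2):  # ACCA, then TCGG
    print(split_gt(calls, alt_index),
          split_per_allele(ad, alt_index),
          split_colex(pl, alt_index, ploidy=len(calls)))
```

For the second ALT allele (TCGG), the sketch yields a GT of 0 and no-call, AD of 2,31, and PL of 2407,697,574, which agrees with the second split row.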
Options
The split_multiallelics transformer has the following options:
Option | Type | Possible values and description |
---|---|---|
split_info_fields | string | A comma-separated list of INFO columns that should be split |
Usage
Assuming df_original is a variable of type DataFrame that contains the genomic variant records, an example of using this transformer to split multiallelic variants is:
Python:
import glow
df_split = glow.transform("split_multiallelics", df_original)

Scala:
import io.projectglow.Glow
val df_split = Glow.transform("split_multiallelics", df_original)
Tip
The split_multiallelics transformer is often significantly faster when the whole-stage code generation feature of Spark SQL is turned off. It is therefore recommended that you temporarily turn off this feature with the following command before using this transformer.
Python:
spark.conf.set("spark.sql.codegen.wholeStage", False)

Scala:
spark.conf.set("spark.sql.codegen.wholeStage", false)
Remember to turn this feature back on after your split DataFrame is materialized.
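One way to avoid forgetting to restore the setting is to wrap the change in a context manager. The following is a minimal sketch and not part of Glow: the helper name whole_stage_codegen_disabled is hypothetical, and it assumes only that spark is a SparkSession exposing spark.conf.get and spark.conf.set.

```python
from contextlib import contextmanager

CODEGEN_KEY = "spark.sql.codegen.wholeStage"


@contextmanager
def whole_stage_codegen_disabled(spark):
    """Temporarily turn off whole-stage code generation (hypothetical helper)."""
    # Remember the current value; fall back to "true" (Spark's default) if it is unset.
    previous = spark.conf.get(CODEGEN_KEY, "true")
    spark.conf.set(CODEGEN_KEY, "false")
    try:
        yield
    finally:
        # Restore the earlier value even if the body raised an exception.
        spark.conf.set(CODEGEN_KEY, previous)


# Intended usage (requires a running SparkSession and Glow, so it is left commented out):
# with whole_stage_codegen_disabled(spark):
#     df_split = glow.transform("split_multiallelics", df_original)
#     df_split.cache().count()  # materialize while code generation is still off
```

Because Spark evaluates DataFrames lazily, the split DataFrame has to be materialized inside the with block; otherwise the transformation would run after the setting has already been restored.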