Glow PySpark Functions

Glow includes a number of functions that operate on PySpark columns. These functions are interoperable with functions provided by PySpark or other libraries.

glow.functions.add_struct_fields(struct: Union[pyspark.sql.column.Column, str], *fields: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Adds fields to a struct.

Added in version 0.3.0.

Examples

>>> import glow
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import lit
>>> df = spark.createDataFrame([Row(struct=Row(a=1))])
>>> df.select(glow.add_struct_fields('struct', lit('b'), lit(2)).alias('struct')).collect()
[Row(struct=Row(a=1, b=2))]
Parameters
  • struct – The struct to which fields will be added

  • fields – The new fields to add. The arguments must alternate between string-typed literal field names and field values.

Returns

A struct consisting of the input struct and the added fields
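
As a rough mental model only, the alternating name/value convention can be sketched with plain Python dicts standing in for Spark structs. This is not Glow's implementation, and the helper name add_fields is hypothetical.

```python
def add_fields(struct, *fields):
    """Return a copy of `struct` extended with alternating name/value pairs."""
    if len(fields) % 2 != 0:
        raise ValueError("fields must alternate between names and values")
    result = dict(struct)  # copy so the input is left untouched
    # Even positions hold field names, odd positions hold field values.
    for name, value in zip(fields[0::2], fields[1::2]):
        if not isinstance(name, str):
            raise TypeError("field names must be strings")
        result[name] = value
    return result


print(add_fields({"a": 1}, "b", 2))  # {'a': 1, 'b': 2}
```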

glow.functions.array_summary_stats(arr: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Computes the minimum, maximum, mean, and standard deviation for an array of numerics.

Added in version 0.3.0.

Examples

>>> df = spark.createDataFrame([Row(arr=[1, 2, 3])])
>>> df.select(glow.expand_struct(glow.array_summary_stats('arr'))).collect()
[Row(mean=2.0, stdDev=1.0, min=1.0, max=3.0)]
Parameters

arr – An array of any numeric type

Returns

A struct containing double mean, stdDev, min, and max fields
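
The example output above (stdDev=1.0 for [1, 2, 3]) is consistent with the sample standard deviation, which divides by n - 1. Under that assumption, which is inferred from the example rather than stated in this section, the four fields can be reproduced in plain Python. The function name summary_stats is hypothetical.

```python
import statistics


def summary_stats(values):
    """Plain-Python sketch of the mean/stdDev/min/max struct."""
    xs = [float(v) for v in values]  # the struct fields are doubles
    return {
        "mean": statistics.mean(xs),
        # Assumption: sample standard deviation (n - 1 denominator).
        # statistics.stdev needs at least two values.
        "stdDev": statistics.stdev(xs),
        "min": min(xs),
        "max": max(xs),
    }


print(summary_stats([1, 2, 3]))
# {'mean': 2.0, 'stdDev': 1.0, 'min': 1.0, 'max': 3.0}
```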

glow.functions.array_to_dense_vector(arr: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Converts an array of numerics into a spark.ml DenseVector.

Added in version 0.3.0.

Examples

>>> from pyspark.ml.linalg import DenseVector
>>> df = spark.createDataFrame([Row(arr=[1, 2, 3])])
>>> df.select(glow.array_to_dense_vector('arr').alias('v')).collect()
[Row(v=DenseVector([1.0, 2.0, 3.0]))]
Parameters

arr – The array of numerics

Returns

A spark.ml DenseVector

glow.functions.array_to_sparse_vector(arr: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Converts an array of numerics into a spark.ml SparseVector.

Added in version 0.3.0.

Examples

>>> from pyspark.ml.linalg import SparseVector
>>> df = spark.createDataFrame([Row(arr=[1, 0, 2, 0, 3, 0])])
>>> df.select(glow.array_to_sparse_vector('arr').alias('v')).collect()
[Row(v=SparseVector(6, {0: 1.0, 2: 2.0, 4: 3.0}))]
Parameters

arr – The array of numerics

Returns

A spark.ml SparseVector
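
To make the sparse representation concrete, the following plain-Python sketch builds the (size, {index: value}) pair shown in the example output by keeping only the non-zero entries. It is an illustration of the data layout, not Glow's code, and to_sparse is a hypothetical name.

```python
def to_sparse(arr):
    """Return (size, {index: value}) with only the non-zero entries kept."""
    size = len(arr)
    nonzero = {i: float(v) for i, v in enumerate(arr) if v != 0}
    return size, nonzero


print(to_sparse([1, 0, 2, 0, 3, 0]))  # (6, {0: 1.0, 2: 2.0, 4: 3.0})
```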

glow.functions.call_summary_stats(genotypes: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Computes call summary statistics for an array of genotype structs. See Variant Quality Control for more details.

Added in version 0.3.0.

Examples

>>> schema = 'genotypes: array<struct<calls: array<int>>>'
>>> df = spark.createDataFrame([Row(genotypes=[Row(calls=[0, 0]), Row(calls=[1, 0]), Row(calls=[1, 1])])], schema)
>>> df.select(glow.expand_struct(glow.call_summary_stats('genotypes'))).collect()
[Row(callRate=1.0, nCalled=3, nUncalled=0, nHet=1, nHomozygous=[1, 1], nNonRef=2, nAllelesCalled=6, alleleCounts=[3, 3], alleleFrequencies=[0.5, 0.5])]
Parameters

genotypes – The array of genotype structs with calls field

Returns

A struct containing callRate, nCalled, nUncalled, nHet, nHomozygous, nNonRef, nAllelesCalled, alleleCounts, alleleFrequencies fields. See Variant Quality Control.
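
The field definitions are not spelled out in this section. The plain-Python sketch below uses definitions that reproduce the example output: nHomozygous is counted per allele, nNonRef counts genotypes with at least one non-reference call, and a genotype with any -1 call is treated as uncalled. All three are assumptions, and call_summary is a hypothetical name.

```python
def call_summary(genotypes):
    """Sketch of per-variant call statistics from a list of call arrays."""
    # Assumption: a genotype counts as called only if it has no -1 entries.
    called = [g for g in genotypes if all(c >= 0 for c in g)]
    n_alleles = max((c for g in called for c in g), default=0) + 1
    allele_counts = [0] * n_alleles
    n_homozygous = [0] * n_alleles
    n_het = 0
    n_non_ref = 0
    for calls in called:
        for c in calls:
            allele_counts[c] += 1
        if len(set(calls)) == 1:
            n_homozygous[calls[0]] += 1  # homozygous for this allele
        else:
            n_het += 1
        if any(c > 0 for c in calls):
            n_non_ref += 1
    total = sum(allele_counts)
    return {
        "callRate": len(called) / len(genotypes) if genotypes else 0.0,
        "nCalled": len(called),
        "nUncalled": len(genotypes) - len(called),
        "nHet": n_het,
        "nHomozygous": n_homozygous,
        "nNonRef": n_non_ref,
        "nAllelesCalled": total,
        "alleleCounts": allele_counts,
        "alleleFrequencies": [c / total for c in allele_counts] if total else [],
    }


print(call_summary([[0, 0], [1, 0], [1, 1]]))
```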

glow.functions.dp_summary_stats(genotypes: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Computes summary statistics for the depth field from an array of genotype structs. See Variant Quality Control.

Added in version 0.3.0.

Examples

>>> df = spark.createDataFrame([Row(genotypes=[Row(depth=1), Row(depth=2), Row(depth=3)])], 'genotypes: array<struct<depth: int>>')
>>> df.select(glow.expand_struct(glow.dp_summary_stats('genotypes'))).collect()
[Row(mean=2.0, stdDev=1.0, min=1.0, max=3.0)]
Parameters

genotypes – An array of genotype structs with depth field

Returns

A struct containing mean, stdDev, min, and max of genotype depths

glow.functions.expand_struct(struct: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Promotes fields of a nested struct to top-level columns similar to using struct.* from SQL, but can be used in more contexts.

Added in version 0.3.0.

Examples

>>> df = spark.createDataFrame([Row(struct=Row(a=1, b=2))])
>>> df.select(glow.expand_struct(col('struct'))).collect()
[Row(a=1, b=2)]
Parameters

struct – The struct to expand

Returns

Columns corresponding to fields of the input struct

glow.functions.explode_matrix(matrix: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Explodes a spark.ml Matrix (sparse or dense) into multiple arrays, one per row of the matrix.

Added in version 0.3.0.

Examples

>>> from pyspark.ml.linalg import DenseMatrix
>>> m = DenseMatrix(numRows=3, numCols=2, values=[1, 2, 3, 4, 5, 6])
>>> df = spark.createDataFrame([Row(matrix=m)])
>>> df.select(glow.explode_matrix('matrix').alias('row')).collect()
[Row(row=[1.0, 4.0]), Row(row=[2.0, 5.0]), Row(row=[3.0, 6.0])]
Parameters

matrix – The spark.ml Matrix to explode

Returns

An array column in which each row is a row of the input matrix
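
The example output may look surprising at first: a spark.ml DenseMatrix stores its values in column-major order by default, so values=[1, 2, 3, 4, 5, 6] with 3 rows and 2 columns has rows [1, 4], [2, 5], and [3, 6]. The plain-Python sketch below shows the index arithmetic; rows_from_column_major is a hypothetical helper, not part of Glow.

```python
def rows_from_column_major(num_rows, num_cols, values):
    """Split a column-major value list into one list per matrix row."""
    return [
        # Entry (row, col) lives at offset col * num_rows + row.
        [float(values[col * num_rows + row]) for col in range(num_cols)]
        for row in range(num_rows)
    ]


print(rows_from_column_major(3, 2, [1, 2, 3, 4, 5, 6]))
# [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
```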

glow.functions.genotype_states(genotypes: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Gets the number of alternate alleles for each genotype in an array of genotype structs. Returns -1 for a genotype if its calls array contains any -1s (no-calls).

Added in version 0.3.0.

Examples

>>> genotypes = [
... Row(calls=[1, 1]),
... Row(calls=[1, 0]),
... Row(calls=[0, 0]),
... Row(calls=[-1, -1])]
>>> df = spark.createDataFrame([Row(genotypes=genotypes)], 'genotypes: array<struct<calls: array<int>>>')
>>> df.select(glow.genotype_states('genotypes').alias('states')).collect()
[Row(states=[2, 1, 0, -1])]
Parameters

genotypes – An array of genotype structs with calls field

Returns

An array of integers containing the number of alternate alleles in each call array
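
The rule can be sketched in plain Python as follows. The sketch counts calls with a non-zero allele index, which matches the example for biallelic calls; how Glow treats allele indices greater than 1 is not stated here, so that part is an assumption. The function names are hypothetical.

```python
def genotype_state(calls):
    """Number of alternate alleles in one calls array, or -1 for a no-call."""
    if any(c == -1 for c in calls):
        return -1  # any missing call makes the whole genotype a no-call
    # Assumption: every non-reference allele index counts once.
    return sum(1 for c in calls if c > 0)


def genotype_states(genotypes):
    return [genotype_state(calls) for calls in genotypes]


print(genotype_states([[1, 1], [1, 0], [0, 0], [-1, -1]]))  # [2, 1, 0, -1]
```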

glow.functions.gq_summary_stats(genotypes: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Computes summary statistics about the genotype quality field for an array of genotype structs. See Variant Quality Control.

Added in version 0.3.0.

Examples

>>> genotypes = [
... Row(conditionalQuality=1),
... Row(conditionalQuality=2),
... Row(conditionalQuality=3)]
>>> df = spark.createDataFrame([Row(genotypes=genotypes)], 'genotypes: array<struct<conditionalQuality: int>>')
>>> df.select(glow.expand_struct(glow.gq_summary_stats('genotypes'))).collect()
[Row(mean=2.0, stdDev=1.0, min=1.0, max=3.0)]
Parameters

genotypes – The array of genotype structs with conditionalQuality field

Returns

A struct containing mean, stdDev, min, and max of genotype qualities

glow.functions.hard_calls(probabilities: Union[pyspark.sql.column.Column, str], numAlts: Union[pyspark.sql.column.Column, str], phased: Union[pyspark.sql.column.Column, str], threshold: float = None) → pyspark.sql.column.Column

Converts an array of probabilities to hard calls. The probabilities are assumed to be diploid. See Variant data transformations for more details.

Added in version 0.3.0.

Examples

>>> df = spark.createDataFrame([Row(probs=[0.95, 0.05, 0.0])])
>>> df.select(glow.hard_calls('probs', numAlts=lit(1), phased=lit(False)).alias('calls')).collect()
[Row(calls=[0, 0])]
>>> df = spark.createDataFrame([Row(probs=[0.05, 0.95, 0.0])])
>>> df.select(glow.hard_calls('probs', numAlts=lit(1), phased=lit(False)).alias('calls')).collect()
[Row(calls=[0, 1])]
>>> # Use the threshold parameter to change the minimum probability required for a call
>>> df = spark.createDataFrame([Row(probs=[0.05, 0.95, 0.0])])
>>> df.select(glow.hard_calls('probs', numAlts=lit(1), phased=lit(False), threshold=0.99).alias('calls')).collect()
[Row(calls=[-1, -1])]
Parameters
  • probabilities – The array of probabilities to convert

  • numAlts – The number of alternate alleles

  • phased – Whether the probabilities are phased. If phased, we expect 2 * numAlts values in the probabilities array. If unphased, we expect one probability per possible genotype.

  • threshold – The minimum probability to make a call. If no probability falls into the range of [0, 1 - threshold] or [threshold, 1], a no-call (represented by -1s) will be emitted. If not provided, this parameter defaults to 0.9.

Returns

An array of hard calls
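
For the unphased case, the thresholding can be sketched in plain Python. The sketch assumes that the genotype probabilities are listed in the usual VCF genotype order (0/0, 0/1, 1/1, 0/2, ...) and that a call is made only when the largest probability reaches the threshold. Both are assumptions that reproduce the examples above, and hard_call_unphased is a hypothetical name.

```python
def hard_call_unphased(probabilities, num_alts, threshold=0.9):
    """Pick the most probable diploid genotype, or [-1, -1] below threshold."""
    # Assumption: genotypes are enumerated in VCF order.
    genotypes = [(j, k) for k in range(num_alts + 1) for j in range(k + 1)]
    if len(probabilities) != len(genotypes):
        raise ValueError("expected one probability per possible genotype")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < threshold:
        return [-1, -1]  # no genotype is confident enough: emit a no-call
    return list(genotypes[best])


print(hard_call_unphased([0.95, 0.05, 0.0], 1))        # [0, 0]
print(hard_call_unphased([0.05, 0.95, 0.0], 1))        # [0, 1]
print(hard_call_unphased([0.05, 0.95, 0.0], 1, 0.99))  # [-1, -1]
```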

glow.functions.hardy_weinberg(genotypes: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Computes statistics relating to the Hardy-Weinberg equilibrium. See Variant Quality Control for more details.

Added in version 0.3.0.

Examples

>>> genotypes = [
... Row(calls=[1, 1]),
... Row(calls=[1, 0]),
... Row(calls=[0, 0])]
>>> df = spark.createDataFrame([Row(genotypes=genotypes)], 'genotypes: array<struct<calls: array<int>>>')
>>> df.select(glow.expand_struct(glow.hardy_weinberg('genotypes'))).collect()
[Row(hetFreqHwe=0.6, pValueHwe=0.7)]
Parameters

genotypes – The array of genotype structs with calls field

Returns

A struct containing two fields, hetFreqHwe (the expected heterozygous frequency according to Hardy-Weinberg equilibrium) and pValueHwe (the associated p-value)
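
This section does not say which test is used. The plain-Python sketch below uses the exact distribution of heterozygote counts given the allele counts (the Levene-Haldane distribution) together with a mid-p correction. That choice is an assumption, made because it reproduces the example output for biallelic calls. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def het_count_distribution(n_samples, n_minor):
    """Exact probability of each het count, given the minor allele count."""
    n_major = 2 * n_samples - n_minor
    pmf = {}
    # The het count always has the same parity as the minor allele count.
    for n_het in range(n_minor % 2, n_minor + 1, 2):
        n_hom_minor = (n_minor - n_het) // 2
        n_hom_major = (n_major - n_het) // 2
        arrangements = factorial(n_samples) // (
            factorial(n_hom_minor) * factorial(n_het) * factorial(n_hom_major)
        )
        pmf[n_het] = Fraction(
            2**n_het * arrangements * factorial(n_minor) * factorial(n_major),
            factorial(2 * n_samples),
        )
    return pmf


def hardy_weinberg_sketch(genotypes):
    """Return (expected het frequency, mid-p value) for biallelic calls."""
    n_samples = len(genotypes)
    n_alt = sum(c for calls in genotypes for c in calls)
    n_minor = min(n_alt, 2 * n_samples - n_alt)
    observed_hets = sum(1 for calls in genotypes if calls[0] != calls[1])
    pmf = het_count_distribution(n_samples, n_minor)
    expected_het_freq = sum(k * p for k, p in pmf.items()) / n_samples
    p_obs = pmf[observed_hets]
    # Mid-p: less likely outcomes count fully, equally likely ones count half.
    mid_p = sum(p for p in pmf.values() if p < p_obs)
    mid_p += sum(p for p in pmf.values() if p == p_obs) / 2
    return float(expected_het_freq), float(mid_p)


print(hardy_weinberg_sketch([[1, 1], [1, 0], [0, 0]]))  # (0.6, 0.7)
```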

glow.functions.lift_over_coordinates(contigName: Union[pyspark.sql.column.Column, str], start: Union[pyspark.sql.column.Column, str], end: Union[pyspark.sql.column.Column, str], chainFile: str, minMatchRatio: float = None) → pyspark.sql.column.Column

Performs liftover for the coordinates of a variant. To perform liftover of alleles and add additional metadata, see Liftover.

Added in version 0.3.0.

Examples

>>> df = spark.read.format('vcf').load('test-data/liftover/unlifted.test.vcf').where('start = 18210071')
>>> chain_file = 'test-data/liftover/hg38ToHg19.over.chain.gz'
>>> reference_file = 'test-data/liftover/hg19.chr20.fa.gz'
>>> df.select('contigName', 'start', 'end').head()
Row(contigName='chr20', start=18210071, end=18210072)
>>> lifted_df = df.select(glow.expand_struct(glow.lift_over_coordinates('contigName', 'start', 'end', chain_file)))
>>> lifted_df.head()
Row(contigName='chr20', start=18190715, end=18190716)
Parameters
  • contigName – The current contig name

  • start – The current start

  • end – The current end

  • chainFile – Location of the chain file on each node in the cluster

  • minMatchRatio – Minimum fraction of bases that must remap to do liftover successfully. If not provided, defaults to 0.95.

Returns

A struct containing contigName, start, and end fields after liftover

glow.functions.linear_regression_gwas(genotypes: Union[pyspark.sql.column.Column, str], phenotypes: Union[pyspark.sql.column.Column, str], covariates: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Performs a linear regression association test optimized for performance in a GWAS setting. See Linear regression for details.

Added in version 0.3.0.

Examples

>>> from pyspark.ml.linalg import DenseMatrix
>>> phenotypes = [2, 3, 4]
>>> genotypes = [0, 1, 2]
>>> covariates = DenseMatrix(numRows=3, numCols=1, values=[1, 1, 1])
>>> df = spark.createDataFrame([Row(genotypes=genotypes, phenotypes=phenotypes, covariates=covariates)])
>>> df.select(glow.expand_struct(glow.linear_regression_gwas('genotypes', 'phenotypes', 'covariates'))).collect()
[Row(beta=0.9999999999999998, standardError=1.4901161193847656e-08, pValue=9.486373847239922e-09)]
Parameters
  • genotypes – A numeric array of genotypes

  • phenotypes – A numeric array of phenotypes

  • covariates – A spark.ml Matrix of covariates

Returns

A struct containing beta, standardError, and pValue fields. See Linear regression.
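
For intuition, the special case used in the example (a single intercept covariate) reduces to ordinary least squares with one predictor. The plain-Python sketch below computes beta and its standard error for that case only. It is not Glow's implementation, it omits pValue because the Student's t distribution is not in the standard library, and simple_linear_fit is a hypothetical name.

```python
import math


def simple_linear_fit(genotypes, phenotypes):
    """OLS slope and standard error for phenotype ~ intercept + genotype."""
    n = len(genotypes)
    x_mean = sum(genotypes) / n
    y_mean = sum(phenotypes) / n
    sxx = sum((x - x_mean) ** 2 for x in genotypes)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(genotypes, phenotypes))
    dof = n - 2  # two fitted parameters: intercept and slope
    if sxx == 0 or dof <= 0:
        raise ValueError("need varying genotypes and at least three samples")
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(genotypes, phenotypes))
    standard_error = math.sqrt(rss / dof / sxx)
    return beta, standard_error


print(simple_linear_fit([0, 1, 2], [2, 3, 4]))  # (1.0, 0.0)
```

With this perfectly linear data the sketch returns a standard error of exactly zero, whereas the example output above shows a very small non-zero value.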

glow.functions.logistic_regression_gwas(genotypes: Union[pyspark.sql.column.Column, str], phenotypes: Union[pyspark.sql.column.Column, str], covariates: Union[pyspark.sql.column.Column, str], test: str, offset: Union[pyspark.sql.column.Column, str] = None) → pyspark.sql.column.Column

Performs a logistic regression association test optimized for performance in a GWAS setting. See Logistic regression for more details.

Added in version 0.3.0.

Examples

>>> from pyspark.ml.linalg import DenseMatrix
>>> phenotypes = [1, 0, 0, 1, 1]
>>> genotypes = [0, 0, 1, 2, 2]
>>> covariates = DenseMatrix(numRows=5, numCols=1, values=[1, 1, 1, 1, 1])
>>> offset = [1, 0, 1, 0, 1]
>>> df = spark.createDataFrame([Row(genotypes=genotypes, phenotypes=phenotypes, covariates=covariates, offset=offset)])
>>> df.select(glow.expand_struct(glow.logistic_regression_gwas('genotypes', 'phenotypes', 'covariates', 'Firth'))).collect()
[Row(beta=0.7418937644793101, oddsRatio=2.09990848346903, waldConfidenceInterval=[0.2509874689201784, 17.569066925598555], pValue=0.3952193664793294)]
>>> df.select(glow.expand_struct(glow.logistic_regression_gwas('genotypes', 'phenotypes', 'covariates', 'LRT'))).collect()
[Row(beta=1.1658962684583645, oddsRatio=3.208797538802116, waldConfidenceInterval=[0.29709600522888285, 34.65674887513274], pValue=0.2943946848756769)]
>>> df.select(glow.expand_struct(glow.logistic_regression_gwas('genotypes', 'phenotypes', 'covariates', 'Firth', 'offset'))).collect()
[Row(beta=0.8024832156793392, oddsRatio=2.231074294047771, waldConfidenceInterval=[0.2540891981649045, 19.590334974925725], pValue=0.3754070658316332)]
>>> df.select(glow.expand_struct(glow.logistic_regression_gwas('genotypes', 'phenotypes', 'covariates', 'LRT', 'offset'))).collect()
[Row(beta=1.1996041727573317, oddsRatio=3.3188029900720117, waldConfidenceInterval=[0.3071189078535928, 35.863807161497334], pValue=0.2857137988674153)]
Parameters
  • genotypes – A numeric array of genotypes

  • phenotypes – A double array of phenotype values

  • covariates – A spark.ml Matrix of covariates

  • test – Which logistic regression test to use. Can be LRT or Firth

  • offset – An optional double array of offset values. The offset vector is added with coefficient 1 to the linear predictor term X*b.

Returns

A struct containing beta, oddsRatio, waldConfidenceInterval, and pValue fields. See Logistic regression.
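
The role of the offset can be illustrated without fitting a model. In the plain-Python sketch below, the offset is added to the linear predictor X*b with a fixed coefficient of 1 before the logistic function is applied. The coefficient values and the function name predicted_probabilities are hypothetical; the sketch does not perform the LRT or Firth test.

```python
import math


def predicted_probabilities(design_rows, coefficients, offset=None):
    """Logistic-model probabilities with an optional per-sample offset."""
    if offset is None:
        offset = [0.0] * len(design_rows)
    probabilities = []
    for row, off in zip(design_rows, offset):
        # Linear predictor X*b, plus the offset with coefficient 1.
        eta = sum(x * b for x, b in zip(row, coefficients)) + off
        probabilities.append(1.0 / (1.0 + math.exp(-eta)))
    return probabilities


rows = [[1.0, 0.0], [1.0, 2.0]]  # columns: intercept, genotype
print(predicted_probabilities(rows, [0.0, 0.0]))              # [0.5, 0.5]
print(predicted_probabilities(rows, [0.0, 0.0], [1.0, 0.0]))  # offset shifts the first sample only
```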

glow.functions.mean_substitute(array: Union[pyspark.sql.column.Column, str], missingValue: Union[pyspark.sql.column.Column, str] = None) → pyspark.sql.column.Column

Substitutes the missing values of a numeric array using the mean of the non-missing values. Any values that are NaN, null or equal to the missing value parameter are considered missing. See Variant data transformations for more details.

Added in version 0.4.0.

Examples

>>> df = spark.createDataFrame([Row(unsubstituted_values=[float('nan'), None, 0.0, 1.0, 2.0, 3.0, 4.0])])
>>> df.select(glow.mean_substitute('unsubstituted_values', lit(0.0)).alias('substituted_values')).collect()
[Row(substituted_values=[2.5, 2.5, 2.5, 1.0, 2.0, 3.0, 4.0])]
>>> df = spark.createDataFrame([Row(unsubstituted_values=[0, 1, 2, 3, -1, None])])
>>> df.select(glow.mean_substitute('unsubstituted_values').alias('substituted_values')).collect()
[Row(substituted_values=[0.0, 1.0, 2.0, 3.0, 1.5, 1.5])]
Parameters
  • array – A numeric array that may contain missing values

  • missingValue – A value that should be considered missing. If not provided, this parameter defaults to -1.

Returns

A numeric array with substituted missing values
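
The substitution rule can be sketched in plain Python as follows. The sketch follows the description above (NaN, null, and the missing value are all treated as missing) and reproduces both examples. Returning NaN when every value is missing is an assumption, and mean_substitute_sketch is a hypothetical name.

```python
import math


def mean_substitute_sketch(values, missing_value=-1):
    """Replace missing entries with the mean of the non-missing ones."""

    def is_missing(v):
        if v is None:
            return True
        if isinstance(v, float) and math.isnan(v):
            return True
        return v == missing_value

    present = [float(v) for v in values if not is_missing(v)]
    # Assumption: with nothing to average, fall back to NaN.
    fill = sum(present) / len(present) if present else float("nan")
    return [fill if is_missing(v) else float(v) for v in values]


print(mean_substitute_sketch([float("nan"), None, 0.0, 1.0, 2.0, 3.0, 4.0], 0.0))
# [2.5, 2.5, 2.5, 1.0, 2.0, 3.0, 4.0]
print(mean_substitute_sketch([0, 1, 2, 3, -1, None]))
# [0.0, 1.0, 2.0, 3.0, 1.5, 1.5]
```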

glow.functions.normalize_variant(contigName: Union[pyspark.sql.column.Column, str], start: Union[pyspark.sql.column.Column, str], end: Union[pyspark.sql.column.Column, str], refAllele: Union[pyspark.sql.column.Column, str], altAlleles: Union[pyspark.sql.column.Column, str], refGenomePathString: str) → pyspark.sql.column.Column

Normalizes the variant with a behavior similar to vt normalize or bcftools norm. Creates a StructType column including the normalized start, end, referenceAllele and alternateAlleles fields (whether they are changed or unchanged as the result of normalization) as well as a StructType field called normalizationStatus that contains the following fields:

  • changed: A boolean field indicating whether the variant data was changed as a result of normalization

  • errorMessage: An error message in case the attempt at normalizing the row hit an error. In this case, the changed field will be set to false. If no errors occur, this field will be null.

In case of an error, the start, end, referenceAllele and alternateAlleles fields in the generated struct will be null.

Added in version 0.3.0.

Examples

>>> df = spark.read.format('vcf').load('test-data/variantsplitternormalizer-test/test_left_align_hg38_altered.vcf')
>>> ref_genome = 'test-data/variantsplitternormalizer-test/Homo_sapiens_assembly38.20.21_altered.fasta'
>>> df.select('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles').head()
Row(contigName='chr20', start=400, end=401, referenceAllele='G', alternateAlleles=['GATCTTCCCTCTTTTCTAATATAAACACATAAAGCTCTGTTTCCTTCTAGGTAACTGGTTTGAG'])
>>> normalized_df = df.select('contigName', glow.expand_struct(glow.normalize_variant('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles', ref_genome)))
>>> normalized_df.head()
Row(contigName='chr20', start=268, end=269, referenceAllele='A', alternateAlleles=['ATTTGAGATCTTCCCTCTTTTCTAATATAAACACATAAAGCTCTGTTTCCTTCTAGGTAACTGG'], normalizationStatus=Row(changed=True, errorMessage=None))
Parameters
  • contigName – The current contig name

  • start – The current start

  • end – The current end

  • refAllele – The current reference allele

  • altAlleles – The current array of alternate alleles

  • refGenomePathString – A path to the reference genome .fasta file. The .fasta file must be accompanied by a .fai index file in the same folder.

Returns

A struct as explained above

glow.functions.sample_call_summary_stats(genotypes: Union[pyspark.sql.column.Column, str], refAllele: Union[pyspark.sql.column.Column, str], alternateAlleles: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Computes per-sample call summary statistics. See Sample Quality Control for more details.

Added in version 0.3.0.

Examples

>>> sites = [
... Row(refAllele='C', alternateAlleles=['G'], genotypes=[Row(sampleId='NA12878', calls=[0, 0])]),
... Row(refAllele='A', alternateAlleles=['G'], genotypes=[Row(sampleId='NA12878', calls=[1, 1])]),
... Row(refAllele='AG', alternateAlleles=['A'], genotypes=[Row(sampleId='NA12878', calls=[1, 0])])]
>>> df = spark.createDataFrame(sites, 'alternateAlleles: array<string>, genotypes: array<struct<sampleId: string, calls: array<int>>>, refAllele: string')
>>> df.select(glow.sample_call_summary_stats('genotypes', 'refAllele', 'alternateAlleles').alias('stats')).collect()
[Row(stats=[Row(sampleId='NA12878', callRate=1.0, nCalled=3, nUncalled=0, nHomRef=1, nHet=1, nHomVar=1, nSnp=2, nInsertion=0, nDeletion=1, nTransition=2, nTransversion=0, nSpanningDeletion=0, rTiTv=inf, rInsertionDeletion=0.0, rHetHomVar=1.0)])]
Parameters
  • genotypes – An array of genotype structs with calls field

  • refAllele – The reference allele

  • alternateAlleles – An array of alternate alleles

Returns

An array of structs, one per sample, each containing sampleId, callRate, nCalled, nUncalled, nHomRef, nHet, nHomVar, nSnp, nInsertion, nDeletion, nTransition, nTransversion, nSpanningDeletion, rTiTv, rInsertionDeletion, rHetHomVar fields. See Sample Quality Control.
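
The variant-type counts and ratios in the example output can be reproduced with the plain-Python sketch below for a single sample. It assumes that every called alternate allele is counted once (so a homozygous-variant SNP contributes two), that indels are classified by allele length, and that a ratio with a zero denominator follows floating-point division. These rules are inferred from the example, and all names are hypothetical.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_alt(ref, alt):
    """Classify one alternate allele relative to the reference allele."""
    if alt == "*":
        return "spanningDeletion"
    if len(ref) == 1 and len(alt) == 1:
        return "transition" if (ref, alt) in TRANSITIONS else "transversion"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"


def variant_counts(sites):
    """Count called alternate alleles by type; sites are (ref, alts, calls)."""
    kinds = ["transition", "transversion", "insertion", "deletion",
             "spanningDeletion", "other"]
    counts = dict.fromkeys(kinds, 0)
    for ref, alts, calls in sites:
        for call in calls:
            if call > 0:  # allele index 0 is the reference
                counts[classify_alt(ref, alts[call - 1])] += 1
    return counts


def ratio(numerator, denominator):
    # Assumption: mimic floating-point division by zero (inf or NaN).
    if denominator == 0:
        return math.inf if numerator > 0 else math.nan
    return numerator / denominator


sites = [("C", ["G"], [0, 0]), ("A", ["G"], [1, 1]), ("AG", ["A"], [1, 0])]
counts = variant_counts(sites)
print(ratio(counts["transition"], counts["transversion"]))  # inf
print(ratio(counts["insertion"], counts["deletion"]))       # 0.0
```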

glow.functions.sample_dp_summary_stats(genotypes: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Computes per-sample summary statistics about the depth field in an array of genotype structs.

Added in version 0.3.0.

Examples

>>> sites = [
... Row(genotypes=[Row(sampleId='NA12878', depth=1)]),
... Row(genotypes=[Row(sampleId='NA12878', depth=2)]),
... Row(genotypes=[Row(sampleId='NA12878', depth=3)])]
>>> df = spark.createDataFrame(sites, 'genotypes: array<struct<depth: int, sampleId: string>>')
>>> df.select(glow.sample_dp_summary_stats('genotypes').alias('stats')).collect()
[Row(stats=[Row(sampleId='NA12878', mean=2.0, stdDev=1.0, min=1.0, max=3.0)])]
Parameters

genotypes – An array of genotype structs with depth field

Returns

An array of structs where each struct contains mean, stdDev, min, and max of the genotype depths for a sample. If sampleId is present in a genotype, it will be propagated to the resulting struct as an extra field.
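
Unlike dp_summary_stats, which summarizes one row, this function aggregates across rows. The plain-Python sketch below assumes that samples are matched by their position in each row's genotypes array and that stdDev is the sample standard deviation; both are assumptions consistent with the example. The name sample_summary_stats is hypothetical, and the same shape applies to any numeric genotype field.

```python
import statistics


def sample_summary_stats(sites, field):
    """Aggregate one numeric genotype field per sample across all sites."""
    values = {}      # sample position -> values seen across sites
    sample_ids = {}  # sample position -> sampleId, when present
    for genotypes in sites:
        for position, genotype in enumerate(genotypes):
            values.setdefault(position, []).append(float(genotype[field]))
            if "sampleId" in genotype:
                sample_ids[position] = genotype["sampleId"]
    result = []
    for position in sorted(values):
        xs = values[position]
        stats = {
            "mean": statistics.mean(xs),
            "stdDev": statistics.stdev(xs),  # needs at least two sites
            "min": min(xs),
            "max": max(xs),
        }
        if position in sample_ids:
            stats = {"sampleId": sample_ids[position], **stats}
        result.append(stats)
    return result


sites = [
    [{"sampleId": "NA12878", "depth": 1}],
    [{"sampleId": "NA12878", "depth": 2}],
    [{"sampleId": "NA12878", "depth": 3}],
]
print(sample_summary_stats(sites, "depth"))
# [{'sampleId': 'NA12878', 'mean': 2.0, 'stdDev': 1.0, 'min': 1.0, 'max': 3.0}]
```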

glow.functions.sample_gq_summary_stats(genotypes: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Computes per-sample summary statistics about the genotype quality field in an array of genotype structs.

Added in version 0.3.0.

Examples

>>> sites = [
... Row(genotypes=[Row(sampleId='NA12878', conditionalQuality=1)]),
... Row(genotypes=[Row(sampleId='NA12878', conditionalQuality=2)]),
... Row(genotypes=[Row(sampleId='NA12878', conditionalQuality=3)])]
>>> df = spark.createDataFrame(sites, 'genotypes: array<struct<conditionalQuality: int, sampleId: string>>')
>>> df.select(glow.sample_gq_summary_stats('genotypes').alias('stats')).collect()
[Row(stats=[Row(sampleId='NA12878', mean=2.0, stdDev=1.0, min=1.0, max=3.0)])]
Parameters

genotypes – An array of genotype structs with conditionalQuality field

Returns

An array of structs where each struct contains mean, stdDev, min, and max of the genotype qualities for a sample. If sampleId is present in a genotype, it will be propagated to the resulting struct as an extra field.

glow.functions.subset_struct(struct: Union[pyspark.sql.column.Column, str], *fields: str) → pyspark.sql.column.Column

Selects fields from a struct.

Added in version 0.3.0.

Examples

>>> df = spark.createDataFrame([Row(struct=Row(a=1, b=2, c=3))])
>>> df.select(glow.subset_struct('struct', 'a', 'c').alias('struct')).collect()
[Row(struct=Row(a=1, c=3))]
Parameters
  • struct – Struct from which to select fields

  • fields – Fields to select

Returns

A struct containing only the indicated fields

glow.functions.vector_to_array(vector: Union[pyspark.sql.column.Column, str]) → pyspark.sql.column.Column

Converts a spark.ml Vector (sparse or dense) to an array of doubles.

Added in version 0.3.0.

Examples

>>> from pyspark.ml.linalg import DenseVector, SparseVector
>>> df = spark.createDataFrame([Row(v=SparseVector(3, {0: 1.0, 2: 2.0})), Row(v=DenseVector([3.0, 4.0]))])
>>> df.select(glow.vector_to_array('v').alias('arr')).collect()
[Row(arr=[1.0, 0.0, 2.0]), Row(arr=[3.0, 4.0])]
Parameters

vector – Vector to convert

Returns

An array of doubles
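
For the sparse case, the conversion amounts to filling the unlisted positions with zeros. The plain-Python sketch below shows this for a (size, {index: value}) pair like the one in the example; sparse_to_array is a hypothetical helper, not part of Glow.

```python
def sparse_to_array(size, entries):
    """Expand a sparse (size, {index: value}) pair into a dense list."""
    dense = [0.0] * size  # positions not listed in `entries` are zero
    for index, value in entries.items():
        dense[index] = float(value)
    return dense


print(sparse_to_array(3, {0: 1.0, 2: 2.0}))  # [1.0, 0.0, 2.0]
```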