Utility Functions

Glow includes a variety of utility functions for performing basic data manipulation.
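
The snippets below assume an active SparkSession named spark with Glow registered on it, along the lines of the following sketch (the exact registration call may vary slightly between Glow versions):

import glow
spark = glow.register(spark)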

Struct transformations

Glow’s struct transformation functions change the schema structure of the DataFrame. These transformations integrate with functions whose parameter structs require a certain schema. The expected result schemas for the examples below are sketched after the list.

  • subset_struct: subset fields from a struct

from pyspark.sql import Row
row_one = Row(Row(str_col='foo', int_col=1, bool_col=True))
row_two = Row(Row(str_col='bar', int_col=2, bool_col=False))
base_df = spark.createDataFrame([row_one, row_two], schema=['base_col'])
subsetted_df = base_df.select(glow.subset_struct('base_col', 'str_col', 'bool_col').alias('subsetted_col'))
  • add_struct_fields: append fields to a struct

from pyspark.sql.functions import lit, reverse
added_df = base_df.select(glow.add_struct_fields('base_col', lit('float_col'), lit(3.14), lit('rev_str_col'), reverse(base_df.base_col.str_col)).alias('added_col'))
  • expand_struct: explode a struct into columns

expanded_df = base_df.select(glow.expand_struct('base_col'))
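
As a rough sketch of what these struct transformations produce (field types are inferred by Spark from the Python values above, so this is approximate rather than verbatim output):

subsetted_df.printSchema()  # subsetted_col: struct<str_col: string, bool_col: boolean>
added_df.printSchema()      # added_col: struct<str_col: string, int_col: long, bool_col: boolean, float_col: double, rev_str_col: string>
expanded_df.printSchema()   # str_col: string, int_col: long, bool_col: boolean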

Spark ML transformations

Glow supports transformations between double arrays and Spark ML vectors for integration with machine learning libraries such as Spark MLlib.

  • array_to_dense_vector: transform from an array to a dense vector

array_df = spark.createDataFrame([Row([1.0, 2.0, 3.0]), Row([4.1, 5.1, 6.1])], schema=['array_col'])
dense_df = array_df.select(glow.array_to_dense_vector('array_col').alias('dense_vector_col'))
  • array_to_sparse_vector: transform from an array to a sparse vector

sparse_df = array_df.select(glow.array_to_sparse_vector('array_col').alias('sparse_vector_col'))
  • vector_to_array: transform from a vector to a double array

from pyspark.ml.linalg import SparseVector
row_one = Row(vector_col=SparseVector(3, [0, 2], [1.0, 3.0]))
row_two = Row(vector_col=SparseVector(3, [1], [1.0]))
vector_df = spark.createDataFrame([row_one, row_two])
array_df = vector_df.select(glow.vector_to_array('vector_col').alias('array_col'))
  • explode_matrix: explode a Spark ML matrix such that each row becomes an array of doubles

from pyspark.ml.linalg import DenseMatrix
matrix_df = spark.createDataFrame([Row(DenseMatrix(2, 3, list(range(6))))], schema=['matrix_col'])
array_df = matrix_df.select(glow.explode_matrix('matrix_col').alias('array_col'))
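
Note that Spark's DenseMatrix stores values in column-major order, so the 2x3 matrix above has rows [0.0, 2.0, 4.0] and [1.0, 3.0, 5.0]; exploding it should therefore yield one output row per matrix row, roughly as follows (a sketch, not verbatim output):

array_df.show(truncate=False)
# [0.0, 2.0, 4.0]
# [1.0, 3.0, 5.0]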

Variant data transformations

Glow supports numeric transformations on variant data for downstream calculations such as GWAS. Expected outputs for the examples below are sketched after the list.

  • genotype_states: create a numeric representation for each sample’s genotype data. This calculates the sum of the calls (or -1 if any calls are missing); the sum is equivalent to the number of alternate alleles for biallelic variants.

from pyspark.sql.types import *

missing_and_hom_ref = Row([Row(calls=[-1,0]), Row(calls=[0,0])])
het_and_hom_alt = Row([Row(calls=[0,1]), Row(calls=[1,1])])
calls_schema = StructField('calls', ArrayType(IntegerType()))
genotypes_schema = StructField('genotypes_col', ArrayType(StructType([calls_schema])))
genotypes_df = spark.createDataFrame([missing_and_hom_ref, het_and_hom_alt], StructType([genotypes_schema]))
num_alt_alleles_df = genotypes_df.select(glow.genotype_states('genotypes_col').alias('num_alt_alleles_col'))
  • hard_calls: get hard calls from genotype probabilities. These are determined based on the number of alternate alleles for the variant, whether the probabilities are phased (true for haplotypes and false for genotypes), and a call threshold (if not provided, this defaults to 0.9). If no calls have a probability above the threshold, the call is set to -1.

unphased_above_threshold = Row(probabilities=[0.0, 0.0, 0.0, 1.0, 0.0, 0.0], num_alts=2, phased=False)
phased_below_threshold = Row(probabilities=[0.1, 0.9, 0.8, 0.2], num_alts=1, phased=True)
uncalled_df = spark.createDataFrame([unphased_above_threshold, phased_below_threshold])
hard_calls_df = uncalled_df.select(glow.hard_calls('probabilities', 'num_alts', 'phased', 0.95).alias('calls'))
  • mean_substitute: substitutes the missing values of a numeric array with the mean of the non-missing values. Any values that are NaN, null, or equal to the missing value parameter are considered missing. If all values are missing, they are substituted with the missing value. If the missing value parameter is not provided, it defaults to -1.

unsubstituted_row = Row(unsubstituted_values=[float('nan'), None, -1.0, 0.0, 1.0, 2.0, 3.0])
unsubstituted_df = spark.createDataFrame([unsubstituted_row])
substituted_df = unsubstituted_df.select(glow.mean_substitute('unsubstituted_values', lit(-1.0)).alias('substituted_values'))
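
To sanity-check the genotype_states example by hand: the first row contains a sample with a missing call and a homozygous-reference sample, and the second row contains a heterozygous and a homozygous-alternate sample, so the expected states are roughly (a sketch, not verbatim output):

num_alt_alleles_df.show()
# [-1, 0]  <- a call of -1 marks the first sample as missing; 0 + 0 = 0 for the second
# [1, 2]   <- 0 + 1 = 1 alternate allele; 1 + 1 = 2 alternate alleles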
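
Working through the hard_calls example by hand, and assuming the standard VCF/BGEN ordering of genotype probabilities: in the first row the probability 1.0 clears the 0.95 threshold, so a concrete genotype is called, while in the second row neither haplotype's best probability (0.9 and 0.8) reaches 0.95, so both calls are set to -1. A sketch of the expected result:

hard_calls_df.show()
# [0, 2]    <- unphased; index 3 of the six probabilities corresponds to genotype 0/2 under VCF ordering
# [-1, -1]  <- phased; both haplotype probabilities fall below the 0.95 threshold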
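
For the mean_substitute example, the non-missing values are 0.0, 1.0, 2.0, and 3.0, so their mean is 1.5 and the NaN, null, and -1.0 entries are replaced with it. A sketch of the expected result:

substituted_df.show(truncate=False)
# [1.5, 1.5, 1.5, 0.0, 1.0, 2.0, 3.0]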