GloWGR: Genome-Wide Association Study (GWAS) Regression Tests
Glow contains functions for performing regression analyses used in genome-wide association studies (GWAS). These functions are best used in conjunction with the GloWGR whole genome regression method, but also work as standalone analysis tools.
Tip
Glow automatically converts literal one-dimensional and two-dimensional numpy ndarrays of doubles to array<double> and spark.ml DenseMatrix respectively.
Linear regression
linear_regression performs a linear regression association test optimized for performance in a GWAS setting. You provide a Spark DataFrame containing the genetic data and Pandas DataFrames with the phenotypes, covariates, and optional offsets (typically predicted phenotypes from GloWGR). The function returns a Spark DataFrame with association test results for each (variant, phenotype) pair.
Each worker node in the cluster tests a subset of the total variant dataset. Multiple phenotypes and variants are tested together to take advantage of efficient matrix-matrix linear algebra primitives.
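The benefit of testing many phenotypes and variants together can be illustrated with plain NumPy. The following is a minimal sketch under stated assumptions, not Glow's implementation: it projects the covariates out of both the phenotypes and the genotypes, after which the effect-size estimate for every (variant, phenotype) pair comes from a single matrix-matrix product. The function name batched_linreg_betas and all variable names are hypothetical.

```python
import numpy as np

def batched_linreg_betas(G, Y, C):
    """Effect sizes for every (variant, phenotype) pair.

    G: (n_samples, n_variants) genotype matrix
    Y: (n_samples, n_phenotypes) phenotype matrix
    C: (n_samples, n_covariates) covariate matrix, including an intercept column
    Illustrative only; this is not Glow's implementation.
    """
    # Orthonormal basis for the column space of the covariates
    Q, _ = np.linalg.qr(C)
    # Residualize phenotypes and genotypes against the covariates
    Y_res = Y - Q @ (Q.T @ Y)
    G_res = G - Q @ (Q.T @ G)
    # One matrix-matrix product gives the numerators for all pairs at once
    numerators = G_res.T @ Y_res               # (n_variants, n_phenotypes)
    denominators = np.sum(G_res ** 2, axis=0)  # (n_variants,)
    return numerators / denominators[:, None]

# Toy data: 50 samples, 4 variants, 3 phenotypes, intercept plus one covariate
rng = np.random.default_rng(0)
n = 50
C = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.integers(0, 3, size=(n, 4)).astype(float)
Y = rng.normal(size=(n, 3))
betas = batched_linreg_betas(G, Y, C)
```

Each entry betas[v, p] matches the genotype coefficient that a separate least-squares fit of phenotype p on variant v plus the covariates would produce, but the batched form replaces the per-pair fits with dense matrix operations.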
Example
import glow
import numpy as np
import pandas as pd
from pyspark.sql import Row
from pyspark.sql.functions import col, lit
# Read in VCF file
variants = spark.read.format('vcf').load(genotypes_vcf)
# genotype_states returns the number of alt alleles for each sample
# mean_substitute replaces any missing genotype states with the mean of the non-missing states
genotypes = (glow.transform('split_multiallelics', variants)
    .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes'))))
    .select('contigName', 'start', 'names', 'gt')
    .cache())
# Read covariates from a CSV file
covariates = pd.read_csv(covariates_csv, index_col=0)
# Read phenotypes from a CSV file
continuous_phenotypes = pd.read_csv(continuous_phenotypes_csv, index_col=0)
# Run linear regression test
lin_reg_df = glow.gwas.linear_regression(genotypes, continuous_phenotypes, covariates, values_column='gt')
For complete parameter usage information, check out the API reference for glow.gwas.linear_regression().
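The preprocessing step in the example above replaces missing genotype states with the mean of the non-missing states. The effect on a single variant can be sketched in NumPy as follows. This is an illustration only: it assumes missing states are encoded as -1, and the helper name mean_substitute_sketch is hypothetical and is not part of Glow.

```python
import numpy as np

def mean_substitute_sketch(states, missing_value=-1.0):
    """Replace missing genotype states with the mean of the non-missing ones.

    Assumes missing states are encoded as `missing_value` (an assumption
    made for this sketch).
    """
    states = np.asarray(states, dtype=float)
    missing = states == missing_value
    if missing.all():
        # Nothing to average; this sketch returns the input unchanged
        return states
    filled = states.copy()
    filled[missing] = states[~missing].mean()
    return filled

# Alt-allele counts for five samples; the third sample is missing
example = mean_substitute_sketch([0, 1, -1, 2, 1])
```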
Note
Glow also includes a SQL-based function for performing linear regression. However, this function only processes one phenotype at a time, and so performs more slowly than the batch linear regression function documented above. To read more about the SQL-based function, see the docs for glow.linear_regression_gwas().
Logistic regression
logistic_regression performs a logistic regression hypothesis test optimized for performance in a GWAS setting.
Example
import glow
import numpy as np
import pandas as pd
from pyspark.sql import Row
from pyspark.sql.functions import col, lit
# Read in VCF file
variants = spark.read.format('vcf').load(genotypes_vcf)
# genotype_states returns the number of alt alleles for each sample
# mean_substitute replaces any missing genotype states with the mean of the non-missing states
genotypes = (glow.transform('split_multiallelics', variants)
    .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes'))))
    .select('contigName', 'start', 'names', 'gt')
    .cache())
# Read covariates from a CSV file
covariates = pd.read_csv(covariates_csv, index_col=0)
# Read phenotypes from a CSV file
binary_phenotypes = pd.read_csv(binary_phenotypes_csv, index_col=0)
# Run logistic regression test with approximate Firth correction for p-values below 0.05
log_reg_df = glow.gwas.logistic_regression(
    genotypes,
    binary_phenotypes,
    covariates,
    correction='approx-firth',
    pvalue_threshold=0.05,
    values_column='gt'
)
For complete parameter usage information, check out the API reference for glow.gwas.logistic_regression().
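To make the two-stage idea in the example concrete (a fast first-pass test, with a correction applied only where the p-value falls below pvalue_threshold), here is a minimal NumPy sketch. It uses a score test as the first pass, which is an assumption chosen for illustration and is not presented as Glow's implementation. The Firth correction itself is not implemented; the sketch only flags which tests would be passed to it. The function names fit_null_logistic and score_test_pvalue are hypothetical.

```python
import math
import numpy as np

def fit_null_logistic(C, y, n_iter=25):
    """Fit the covariate-only model y ~ C by Newton-Raphson; return fitted probabilities."""
    beta = np.zeros(C.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(C @ beta)))
        w = p * (1.0 - p)
        # Newton step: (C^T W C)^{-1} C^T (y - p)
        beta = beta + np.linalg.solve(C.T @ (C * w[:, None]), C.T @ (y - p))
    return 1.0 / (1.0 + np.exp(-(C @ beta)))

def score_test_pvalue(g, y, C, p_null):
    """Score test for adding genotype vector g to the covariate-only model."""
    w = p_null * (1.0 - p_null)
    score = g @ (y - p_null)
    CtWg = C.T @ (w * g)
    CtWC = C.T @ (C * w[:, None])
    # Variance of the score after adjusting for the covariates
    variance = g @ (w * g) - CtWg @ np.linalg.solve(CtWC, CtWg)
    chi2 = score ** 2 / variance
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))

# Toy data: 40 samples, intercept-only covariates, strongly associated variant
g = np.array([0.0] * 20 + [2.0] * 20)
y = np.array([0.0] * 18 + [1.0] * 2 + [1.0] * 18 + [0.0] * 2)
C = np.ones((40, 1))

p_null = fit_null_logistic(C, y)
pvalue = score_test_pvalue(g, y, C, p_null)

# Only tests below the threshold would be re-evaluated with the correction
pvalue_threshold = 0.05
needs_correction = pvalue < pvalue_threshold
```

Because the covariate-only model is fitted once per phenotype and reused for every variant, the first pass is cheap, and the more expensive correction is reserved for the small set of variants that look significant.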
Note
Glow also includes a SQL-based function for performing logistic regression. However, this function only processes one phenotype at a time, and so performs more slowly than the batch logistic regression function documented above. To read more about the SQL-based function, see the docs for glow.logistic_regression_gwas().
Offset
The linear and logistic regression functions accept GloWGR phenotypic predictions (either global or per chromosome) as an offset.
continuous_offsets = pd.read_csv(continuous_offset_csv, index_col=0)
lin_reg_df = glow.gwas.linear_regression(
    genotypes,
    continuous_phenotypes,
    covariates,
    offset_df=continuous_offsets,
    values_column='gt'
)
binary_offsets = pd.read_csv(binary_offset_csv, index_col=0)
log_reg_df = glow.gwas.logistic_regression(
    genotypes,
    binary_phenotypes,
    covariates,
    offset_df=binary_offsets,
    correction='approx-firth',
    pvalue_threshold=0.05,
    values_column='gt'
)
Tip
The offset parameter is especially useful for incorporating GloWGR results with phenotypes in a GWAS. Please refer to GloWGR: Whole Genome Regression for details and an example notebook.
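In statistical terms, an offset is a term in the linear predictor whose coefficient is fixed at 1 instead of being estimated. For a linear model this is equivalent to subtracting the offset from the phenotype before fitting. The NumPy sketch below illustrates that equivalence on toy data. It is an illustration of the general concept, not Glow's implementation, and the function name linreg_beta_with_offset is hypothetical.

```python
import numpy as np

def linreg_beta_with_offset(g, y, C, offset):
    """Genotype effect from regressing (y - offset) on g and the covariates C."""
    X = np.column_stack([g, C])
    coef, *_ = np.linalg.lstsq(X, y - offset, rcond=None)
    return coef[0]

# Toy data: 12 samples, true genotype effect 0.5, intercept-only covariates
g = np.tile([0.0, 1.0, 2.0], 4)
C = np.ones((12, 1))
offset = np.linspace(-1.0, 1.0, 12)   # stands in for a predicted phenotype
y = 1.0 + 0.5 * g + offset

beta_with_offset = linreg_beta_with_offset(g, y, C, offset)
# Ignoring the offset attributes part of its signal to the genotype
beta_without_offset = linreg_beta_with_offset(g, y, C, np.zeros(12))
```

In this toy data the offset is mildly correlated with the genotype, so the fit that ignores it overestimates the effect, while the fit that accounts for it recovers 0.5.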
Example notebooks and blog post
GloWGR: GWAS for quantitative traits
GloWGR: GWAS for binary traits
A detailed example and explanation of a GWAS workflow is available here.