Genome-wide Association Study Regression Tests
Glow contains functions for performing simple regression analyses used in genome-wide association studies (GWAS).
Linear regression
linear_regression_gwas performs a linear regression association test optimized for performance in a GWAS setting.
Example
from pyspark.ml.linalg import DenseMatrix
import pyspark.sql.functions as fx
import numpy as np

# Read in VCF file (spark is an active SparkSession, path points to the VCF)
df = spark.read.format('vcf') \
    .option("splitToBiallelic", True) \
    .load(path) \
    .cache()

# Generate random phenotypes and an intercept-only covariate matrix
n_samples = df.select(fx.size('genotypes')).first()[0]
covariates = DenseMatrix(n_samples, 1, np.ones(n_samples))
np.random.seed(500)
phenotypes = np.random.random(n_samples).tolist()
covariates_and_phenotypes = spark.createDataFrame([[covariates, phenotypes]],
    ['covariates', 'phenotypes'])

# Run linear regression test
lin_reg_df = df.crossJoin(covariates_and_phenotypes).selectExpr(
    'contigName',
    'start',
    'names',
    # genotype_states returns the number of alt alleles for each sample
    'expand_struct(linear_regression_gwas(genotype_states(genotypes), phenotypes, covariates))')
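Parameters
| Name | Type | Details |
| --- | --- | --- |
| genotypes | array<double> (or numeric type that can be cast to double) | A numeric representation of the genotype for each sample at a given site, for example the result of the genotype_states function. |
| covariates | spark.ml Matrix | A matrix containing the covariates to use in the linear regression model. Each row in the matrix represents observations for a sample. The indexing must match that of the genotypes array. |
| phenotypes | array<double> | A numeric representation of the phenotype for each sample. This parameter may vary for each row in the dataset. The indexing of this array must match the genotypes and covariates parameters. |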
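The example above passes an intercept-only matrix, built as a single column of ones. As a minimal sketch of assembling a matrix with more columns, the helper below stacks an intercept and extra covariates with NumPy and flattens them in column-major order, which is the layout that pyspark.ml.linalg.DenseMatrix expects for its values argument. The helper name covariate_values and the covariates age and sex are hypothetical, chosen only for illustration; they are not part of Glow.

```python
import numpy as np

def covariate_values(columns):
    """Stack 1-D covariate columns (one value per sample) behind an intercept.

    Returns (n_rows, n_cols, values), with values flattened column-major.
    Hypothetical helper for illustration; not a Glow function.
    """
    n_samples = len(columns[0])
    for col in columns:
        if len(col) != n_samples:
            raise ValueError("every covariate needs exactly one value per sample")
    intercept = np.ones(n_samples)
    matrix = np.column_stack([intercept] + [np.asarray(c, dtype=float) for c in columns])
    # order="F" walks down each column before moving to the next (column-major)
    return matrix.shape[0], matrix.shape[1], matrix.ravel(order="F")

# Made-up covariates for three samples, listed in the same order as the genotypes array
age = [34.0, 51.0, 46.0]
sex = [0.0, 1.0, 1.0]
n_rows, n_cols, values = covariate_values([age, sex])

# With PySpark available, the matrix would then be built as:
#   covariates = DenseMatrix(n_rows, n_cols, values)
```

Flattening row by row instead would silently scramble samples and covariates, so it is worth checking the first few entries of values against the intended columns.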
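Return
The function returns a struct with the following fields. The computation of each value matches the lm function in R.
| Name | Type | Details |
| --- | --- | --- |
| beta | double | The fit effect coefficient of the genotypes parameter. |
| standardError | double | The standard error of beta. |
| pValue | double | The P-value of the t-statistic for beta. |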
Implementation details
The linear regression model is fit using the QR decomposition. For performance, the QR decomposition of the covariate matrix is computed once and reused for each (genotypes, phenotypes) pair.
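To make the reuse concrete, here is a minimal NumPy sketch of one way a single QR factorization of the covariates can serve every variant: project the covariates out of both the phenotype and each genotype vector, then run a one-variable regression on the residuals. This is a textbook formulation under my own assumptions, not Glow's actual implementation; the function name qr_association and the synthetic data are hypothetical. The P-value, which would come from a t-distribution with n - k - 1 degrees of freedom, is left out to keep the sketch free of extra dependencies.

```python
import numpy as np

def qr_association(genotype_rows, phenotypes, covariates):
    """Regress phenotypes on each genotype vector, adjusting for covariates.

    Returns a list of (beta, standard_error, t_statistic) tuples.
    Hypothetical sketch; not Glow's implementation.
    """
    y = np.asarray(phenotypes, dtype=float)
    C = np.asarray(covariates, dtype=float)
    n, k = C.shape
    # Factor the covariates once; the columns of Q span the covariate space
    Q, _ = np.linalg.qr(C)
    y_res = y - Q @ (Q.T @ y)          # phenotype with covariates projected out
    dof = n - k - 1                    # samples minus covariates minus the genotype term
    results = []
    for g in genotype_rows:
        g = np.asarray(g, dtype=float)
        g_res = g - Q @ (Q.T @ g)      # the same Q is reused for every variant
        gg = g_res @ g_res
        beta = (g_res @ y_res) / gg
        rss = np.sum((y_res - beta * g_res) ** 2)
        se = np.sqrt(rss / dof / gg)
        results.append((beta, se, beta / se))
    return results

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n = 50
C = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one covariate
genotypes = rng.integers(0, 3, size=(3, n))             # three variants, 0/1/2 alt alleles
y = rng.normal(size=n)
results = qr_association(genotypes, y, C)
```

Per variant, the loop costs two matrix-vector products with Q rather than a fresh least-squares solve, which is where the saving comes from when the covariates are shared across many sites.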
Logistic regression
logistic_regression_gwas performs a logistic regression hypothesis test optimized for performance in a GWAS setting.
Example
# Likelihood ratio test
log_reg_df = df.crossJoin(covariates_and_phenotypes).selectExpr(
    'contigName',
    'start',
    'names',
    "expand_struct(logistic_regression_gwas(genotype_states(genotypes), phenotypes, covariates, 'LRT'))")

# Firth test
firth_log_reg_df = df.crossJoin(covariates_and_phenotypes).selectExpr(
    'contigName',
    'start',
    'names',
    "expand_struct(logistic_regression_gwas(genotype_states(genotypes), phenotypes, covariates, 'Firth'))")
Parameters
The parameters for the logistic regression test are largely the same as those for linear regression. The primary differences are that each phenotypes value should be either 0 or 1 and that there is one additional parameter, test, to specify the hypothesis test method.
| Name | Type | Details |
| --- | --- | --- |
| genotypes | array<double> (or numeric type that can be cast to double) | A numeric representation of the genotype for each sample at a given site, for example the result of the genotype_states function. |
| covariates | spark.ml Matrix | A matrix containing the covariates to use in the logistic regression model. Each row in the matrix represents observations for a sample. The indexing must match that of the genotypes array. |
| phenotypes | array<double> | A numeric representation of the phenotype for each sample. This parameter may vary for each row in the dataset. The indexing of this array must match the genotypes and covariates parameters. |
| test | string | The hypothesis test method to use. Currently likelihood ratio (LRT) and Firth (Firth) tests are supported. |
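Return
The function returns a struct with the following fields. The computation of each value matches the glm function in R for the likelihood ratio test and the logistf R package for the Firth test.
| Name | Type | Details |
| --- | --- | --- |
| beta | double | Log-odds associated with the genotypes parameter. |
| oddsRatio | double | Odds ratio associated with the genotypes parameter. |
| waldConfidenceInterval | array<double> | Wald 95% confidence interval of the odds ratio. |
| pValue | double | p-value for the specified test. |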
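The odds ratio and its Wald interval are both simple transforms of the log-odds estimate. The sketch below shows the standard textbook relationship, assuming a standard error for the log-odds is available; it is not taken from Glow's source, and the function name and the input values are hypothetical.

```python
import math

def odds_ratio_with_wald_ci(beta, standard_error, z=1.959964):
    """Convert a log-odds estimate into an odds ratio and a Wald interval.

    z defaults to the 97.5th percentile of the standard normal, which
    gives a two-sided 95% interval. Hypothetical helper for illustration.
    """
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * standard_error)
    upper = math.exp(beta + z * standard_error)
    return odds_ratio, (lower, upper)

# Made-up estimate: log-odds of 0.4 with a standard error of 0.1
odds_ratio, (lower, upper) = odds_ratio_with_wald_ci(beta=0.4, standard_error=0.1)
```

Because the interval is built on the log-odds scale and then exponentiated, it is symmetric around beta in log space but not around the odds ratio itself.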
Implementation details
The logistic regression null model and fully-specified model are fit using Newton iterations. For performance, the null model is computed once for each phenotype and used as a prior for each (genotypes, phenotypes) pair.
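As a rough illustration of this scheme, the NumPy sketch below fits a logistic model with Newton iterations, fits the covariate-only null model once, and then reuses its coefficients as the starting point for each full model before forming a likelihood ratio statistic. Reading "prior" as a starting point for the iterations is my assumption, and the function names, tolerances, and synthetic data are all hypothetical; this is not Glow's implementation and it does not cover the Firth test.

```python
import math
import numpy as np

def fit_logistic_newton(X, y, beta0=None, max_iter=25, tol=1e-8):
    """Fit a logistic model by Newton's method; returns (beta, log_likelihood)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1]) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])   # X^T W X
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    log_lik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return beta, log_lik

def lrt_for_variant(genotype, y, covariates, null_beta, null_ll):
    """Likelihood ratio test for one variant, warm-started from the null fit."""
    X = np.column_stack([covariates, genotype])
    start = np.append(null_beta, 0.0)        # null coefficients, genotype effect at 0
    full_beta, full_ll = fit_logistic_newton(X, y, beta0=start)
    stat = 2.0 * (full_ll - null_ll)
    # Survival function of a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return full_beta[-1], stat, p_value

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
n = 200
C = np.column_stack([np.ones(n), rng.normal(size=n)])    # intercept plus one covariate
g = rng.integers(0, 3, size=n).astype(float)             # 0/1/2 alt alleles
logit = -0.5 + 0.3 * C[:, 1] + 0.8 * g
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

null_beta, null_ll = fit_logistic_newton(C, y)           # once per phenotype
beta_g, stat, p_value = lrt_for_variant(g, y, C, null_beta, null_ll)
```

Starting from the null coefficients usually puts the full model close to its optimum already, since a single variant rarely shifts the covariate effects much, so fewer Newton steps are needed per site.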