GloWGR: Genome-Wide Association Study (GWAS) Regression Tests

Glow contains functions for performing regression analyses used in genome-wide association studies (GWAS). These functions are best used in conjunction with the GloWGR whole genome regression method, but also work as standalone analysis tools.

Tip

Glow automatically converts literal one-dimensional and two-dimensional numpy ndarrays of doubles to array<double> and spark.ml DenseMatrix respectively.

Linear regression

linear_regression performs a linear regression association test optimized for performance in a GWAS setting. You provide a Spark DataFrame containing the genetic data and Pandas DataFrames with the phenotypes, covariates, and optional offsets (typically predicted phenotypes from GloWGR). The function returns a Spark DataFrame with association test results for each (variant, phenotype) pair.

Each worker node in the cluster tests a subset of the total variant dataset. Multiple phenotypes and variants are tested together to take advantage of efficient matrix-matrix linear algebra primitives.
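To illustrate why testing many variants and phenotypes together helps, here is a minimal numpy sketch of batched association testing. It is not Glow's implementation; the function name `batched_linear_tests` and the array layout are assumptions made for illustration. Covariates are projected out of both genotypes and phenotypes, after which a single matrix-matrix product yields the regression numerators for every (variant, phenotype) pair.

```python
import numpy as np

def batched_linear_tests(G, Y, C):
    """Effect sizes and t-statistics for every (variant, phenotype) pair.

    G: (n_samples, n_variants) genotype values
    Y: (n_samples, n_phenotypes) phenotype values
    C: (n_samples, n_covariates) covariates, including an intercept column
    """
    n, k = C.shape
    # Orthonormal basis for the covariate space, used to project covariates out
    Q, _ = np.linalg.qr(C)
    G_res = G - Q @ (Q.T @ G)
    Y_res = Y - Q @ (Q.T @ Y)

    # One matrix-matrix product gives the numerators for all pairs at once
    gty = G_res.T @ Y_res                           # (n_variants, n_phenotypes)
    gtg = np.sum(G_res * G_res, axis=0)[:, None]    # (n_variants, 1)
    yty = np.sum(Y_res * Y_res, axis=0)[None, :]    # (1, n_phenotypes)

    beta = gty / gtg
    # Residual sum of squares once the variant is added to the model
    rss = yty - beta * gty
    dof = n - k - 1
    std_err = np.sqrt(rss / dof / gtg)
    return beta, beta / std_err
```

Each entry of the returned arrays corresponds to one (variant, phenotype) pair, matching the per-pair rows that the paragraph above describes.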

Example

import glow
import numpy as np
import pandas as pd
from pyspark.sql import Row
from pyspark.sql.functions import col, lit

# Read in VCF file
variants = spark.read.format('vcf').load(genotypes_vcf)

# genotype_states returns the number of alt alleles for each sample
# mean_substitute replaces any missing genotype states with the mean of the non-missing states
genotypes = (glow.transform('split_multiallelics', variants)
  .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes'))))
  .select('contigName', 'start', 'names', 'gt')
  .cache())

# Read covariates from a CSV file
covariates = pd.read_csv(covariates_csv, index_col=0)

# Read phenotypes from a CSV file
continuous_phenotypes = pd.read_csv(continuous_phenotypes_csv, index_col=0)

# Run linear regression test
lin_reg_df = glow.gwas.linear_regression(genotypes, continuous_phenotypes, covariates, values_column='gt')

For complete parameter usage information, check out the API reference for glow.gwas.linear_regression().
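The genotype preparation step in the example above can be pictured with a small numpy sketch. The helper names below are hypothetical stand-ins, not Glow functions, and the sketch assumes that a missing call is encoded as -1 and that a sample with any missing call gets a missing state.

```python
import numpy as np

def genotype_states_sketch(calls):
    """Count alt alleles per sample; -1 marks a sample with any missing call.

    calls: (n_samples, ploidy) integer array, with -1 for a missing call.
    """
    calls = np.asarray(calls)
    missing = (calls < 0).any(axis=1)
    states = calls.sum(axis=1).astype(float)
    states[missing] = -1.0
    return states

def mean_substitute_sketch(states, missing_value=-1.0):
    """Replace missing states with the mean of the non-missing states."""
    states = np.asarray(states, dtype=float)
    observed = states != missing_value
    out = states.copy()
    if observed.any():
        out[~observed] = states[observed].mean()
    return out

# Four diploid samples at one biallelic site; the third sample is a no-call
calls = np.array([[0, 1], [1, 1], [-1, -1], [0, 0]])
gt = mean_substitute_sketch(genotype_states_sketch(calls))
```

The resulting `gt` array has one double per sample, which is the shape the `values_column` in the example is expected to hold.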

Note

Glow also includes a SQL-based function for performing linear regression. However, this function processes only one phenotype at a time, so it is slower than the batch linear regression function documented above. To read more about the SQL-based function, see the docs for glow.linear_regression_gwas().

Logistic regression

logistic_regression performs a logistic regression hypothesis test optimized for performance in a GWAS setting.

Example

import glow
import numpy as np
import pandas as pd
from pyspark.sql import Row
from pyspark.sql.functions import col, lit

# Read in VCF file
variants = spark.read.format('vcf').load(genotypes_vcf)

# genotype_states returns the number of alt alleles for each sample
# mean_substitute replaces any missing genotype states with the mean of the non-missing states
genotypes = (glow.transform('split_multiallelics', variants)
  .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes'))))
  .select('contigName', 'start', 'names', 'gt')
  .cache())

# Read covariates from a CSV file
covariates = pd.read_csv(covariates_csv, index_col=0)

# Read phenotypes from a CSV file
binary_phenotypes = pd.read_csv(binary_phenotypes_csv, index_col=0)

# Run logistic regression test with approximate Firth correction for p-values below 0.05
log_reg_df = glow.gwas.logistic_regression(
  genotypes,
  binary_phenotypes,
  covariates,
  correction='approx-firth',
  pvalue_threshold=0.05,
  values_column='gt'
)

For complete parameter usage information, check out the API reference for glow.gwas.logistic_regression().
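The role of pvalue_threshold in the example above can be sketched as a two-pass procedure: a cheap test runs for every variant, and only variants whose p-value falls below the threshold are passed on to the more expensive correction. The numpy sketch below assumes the first pass is a score test against a covariate-only model; the function names are hypothetical, the approximate Firth correction itself is not implemented, and none of this is Glow's actual code.

```python
import math
import numpy as np

def fit_null_probabilities(C, y, n_iter=25):
    """Fit the covariate-only logistic model by Newton's method."""
    beta = np.zeros(C.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(C @ beta)))
        w = p * (1.0 - p)
        # Newton step: (C' W C)^-1 C' (y - p)
        beta = beta + np.linalg.solve(C.T @ (C * w[:, None]), C.T @ (y - p))
    return 1.0 / (1.0 + np.exp(-(C @ beta)))

def score_test_pvalues(G, y, C):
    """Score-test p-value for each column of G against one binary phenotype."""
    p = fit_null_probabilities(C, y)
    w = p * (1.0 - p)
    WC = C * w[:, None]
    CtWC = C.T @ WC
    CtWG = WC.T @ G                                  # (n_covariates, n_variants)
    score = G.T @ (y - p)                            # (n_variants,)
    # Variance of the score after adjusting for the covariates
    variance = (w[:, None] * G * G).sum(axis=0) - np.sum(
        CtWG * np.linalg.solve(CtWC, CtWG), axis=0)
    stat = score ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    return np.array([math.erfc(math.sqrt(s / 2.0)) for s in stat])

def flag_for_correction(pvalues, pvalue_threshold=0.05):
    """Boolean mask of variants that would be re-tested with the correction."""
    return pvalues < pvalue_threshold
```

Restricting the correction to flagged variants keeps the expensive step off the large majority of variants that show no signal.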

Note

Glow also includes a SQL-based function for performing logistic regression. However, this function processes only one phenotype at a time, so it is slower than the batch logistic regression function documented above. To read more about the SQL-based function, see the docs for glow.logistic_regression_gwas().

Offset

The linear and logistic regression functions accept GloWGR phenotypic predictions (either global or per chromosome) as an offset.

# Read offsets for the continuous phenotypes (typically GloWGR predictions)
continuous_offsets = pd.read_csv(continuous_offset_csv, index_col=0)
lin_reg_df = glow.gwas.linear_regression(
  genotypes,
  continuous_phenotypes,
  covariates,
  offset_df=continuous_offsets,
  values_column='gt'
)

# Read offsets for the binary phenotypes
binary_offsets = pd.read_csv(binary_offset_csv, index_col=0)
log_reg_df = glow.gwas.logistic_regression(
  genotypes,
  binary_phenotypes,
  covariates,
  offset_df=binary_offsets,
  correction='approx-firth',
  pvalue_threshold=0.05,
  values_column='gt'
)
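As a rough picture of what an offset does in the linear case, the sketch below subtracts a per-sample offset from the phenotype before fitting, so the genotype effect is estimated on what the offset does not already explain. This is a simplified illustration with a hypothetical function name, assuming a single variant and a single phenotype; it is not Glow's implementation. In a logistic model the offset would instead enter the linear predictor as a fixed term.

```python
import numpy as np

def linear_effect_with_offset(g, y, C, offset):
    """Estimate the genotype effect on y after subtracting a per-sample offset.

    g: (n_samples,) genotype values
    y: (n_samples,) phenotype values
    C: (n_samples, n_covariates) covariates, including an intercept column
    offset: (n_samples,) predicted phenotype to subtract from y
    """
    adjusted = y - offset
    X = np.column_stack([g, C])
    coef, *_ = np.linalg.lstsq(X, adjusted, rcond=None)
    # The first coefficient belongs to the genotype column
    return coef[0]
```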

Tip

The offset parameter (offset_df) is especially useful for incorporating GloWGR results into GWAS association tests. Refer to GloWGR: Whole Genome Regression for details and an example notebook.

Example notebook and blog post

A detailed example and explanation of a GWAS workflow is available in the accompanying example notebook and blog post.