.. _gwas: ============================================================= GloWGR: Genome-Wide Association Study (GWAS) Regression Tests ============================================================= .. invisible-code-block: python import glow genotypes_vcf = 'test-data/gwas/genotypes.vcf.gz' covariates_csv = 'test-data/gwas/covariates.csv.gz' continuous_phenotypes_csv = 'test-data/gwas/continuous-phenotypes.csv.gz' continuous_offset_csv = 'test-data/gwas/continuous-offsets.csv.gz' binary_phenotypes_csv = 'test-data/gwas/binary-phenotypes.csv.gz' binary_offset_csv = 'test-data/gwas/binary-offsets.csv.gz' Glow contains functions for performing regression analyses used in genome-wide association studies (GWAS). These functions are best used in conjunction with the :ref:`GloWGR whole genome regression method `, but also work as standalone analysis tools. .. tip:: Glow automatically converts literal one-dimensional and two-dimensional ``numpy`` ``ndarray`` s of ``double`` s to ``array`` and ``spark.ml`` ``DenseMatrix`` respectively. .. _linear-regression: Linear regression ================= ``linear_regression`` performs a linear regression association test optimized for performance in a GWAS setting. You provide a Spark DataFrame containing the genetic data and Pandas DataFrames with the phenotypes, covariates, and optional offsets (typically predicted phenotypes from GloWGR). The function returns a Spark DataFrame with association test results for each (variant, phenotype) pair. Each worker node in the cluster tests a subset of the total variant dataset. Multiple phenotypes and variants are tested together to take advantage of efficient matrix-matrix linear algebra primitives. Example ------- .. code-block:: python import glow import numpy as np import pandas as pd from pyspark.sql import Row from pyspark.sql.functions import col, lit # Read in VCF file variants = spark.read.format('vcf').load(genotypes_vcf) # genotype_states returns the number of alt alleles for each sample # mean_substitute replaces any missing genotype states with the mean of the non-missing states genotypes = (glow.transform('split_multiallelics', variants) .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes')))) .select('contigName', 'start', 'names', 'gt') .cache()) # Read covariates from a CSV file covariates = pd.read_csv(covariates_csv, index_col=0) # Read phenotypes from a CSV file continuous_phenotypes = pd.read_csv(continuous_phenotypes_csv, index_col=0) # Run linear regression test lin_reg_df = glow.gwas.linear_regression(genotypes, continuous_phenotypes, covariates, values_column='gt') .. invisible-code-block: python expected_lin_reg_row = Row( contigName='22', start=16050114, names=['rs587755077'], phenotype='Continuous_Trait_1', effect=0.14722512852575978, stderror=0.14155327969643167, pvalue=0.2984087428847886, tvalue=1.0400686500623064 ) assert_rows_equal(lin_reg_df.filter('contigName = 22 and start = 16050114').head(), expected_lin_reg_row) For complete parameter usage information, check out the API reference for :func:`glow.gwas.linear_regression`. .. note:: Glow also includes a SQL-based function for performing linear regression. However, this function only processes one phenotype at time, and so performs more slowly than the batch linear regression function documented above. To read more about the SQL-based function, see the docs for :func:`glow.linear_regression_gwas`. .. _logistic-regression: Logistic regression =================== ``logistic_regression`` performs a logistic regression hypothesis test optimized for performance in a GWAS setting. Example ------- .. code-block:: python import glow import numpy as np import pandas as pd from pyspark.sql import Row from pyspark.sql.functions import col, lit # Read in VCF file variants = spark.read.format('vcf').load(genotypes_vcf) # genotype_states returns the number of alt alleles for each sample # mean_substitute replaces any missing genotype states with the mean of the non-missing states genotypes = (glow.transform('split_multiallelics', variants) .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes')))) .select('contigName', 'start', 'names', 'gt') .cache()) # Read covariates from a CSV file covariates = pd.read_csv(covariates_csv, index_col=0) # Read phenotypes from a CSV file binary_phenotypes = pd.read_csv(binary_phenotypes_csv, index_col=0) # Run logistic regression test with approximate Firth correction for p-values below 0.05 log_reg_df = glow.gwas.logistic_regression( genotypes, binary_phenotypes, covariates, correction='approx-firth', pvalue_threshold=0.05, values_column='gt' ) .. invisible-code-block: python expected_corrected_row = Row( contigName='21', start=9411301, names=['rs531010746'], effect=1.6771295643163748, stderror=0.727886866796042, correctionSucceeded=True, chisq=7.561977300936178, pvalue=0.005961255160993639, phenotype='Binary_Trait_2' ) expected_uncorrected_row = Row( contigName='22', start=16050114, names=['rs587755077'], effect=None, stderror=None, correctionSucceeded=None, chisq=2.8620926885005393, pvalue=0.09068944361066741, phenotype='Binary_Trait_1' ) assert_rows_equal( log_reg_df.filter("contigName = 21 and start = 9411301 and phenotype = 'Binary_Trait_2'").head(), expected_corrected_row ) assert_rows_equal( log_reg_df.filter("contigName = 22 and start = 16050114 and phenotype = 'Binary_Trait_1'").head(), expected_uncorrected_row ) For complete parameter usage information, check out the API reference for :func:`glow.gwas.logistic_regression`. .. note:: Glow also includes a SQL-based function for performing logistic regression. However, this function only processes one phenotype at time, and so performs more slowly than the batch logistic regression function documented above. To read more about the SQL-based function, see the docs for :func:`glow.logistic_regression_gwas`. Offset ====== The linear and logistic regression functions accept GloWGR phenotypic predictions (either global or per chromosome) as an offset. .. code-block:: python continuous_offsets = pd.read_csv(continuous_offset_csv, index_col=0) lin_reg_df = glow.gwas.linear_regression( genotypes, continuous_phenotypes, covariates, offset_df=continuous_offsets, values_column='gt' ) .. invisible-code-block: python expected_lin_reg_row = Row( contigName='22', start=16050114, names=['rs587755077'], effect=0.14270211676783012, stderror=0.16438780778222647, tvalue=0.8680821205236547, pvalue=0.38543283172628373, phenotype='Continuous_Trait_1') assert_rows_equal(lin_reg_df.filter('contigName = 22 and start = 16050114').head(), expected_lin_reg_row) .. code-block:: python binary_offsets = pd.read_csv(binary_offset_csv, index_col=0) log_reg_df = glow.gwas.logistic_regression( genotypes, binary_phenotypes, covariates, offset_df=binary_offsets, correction='approx-firth', pvalue_threshold=0.05, values_column='gt' ) .. invisible-code-block: python expected_corrected_row = Row( contigName='21', start=9411301, names=['rs531010746'], effect=1.7685129759078362, stderror=0.7304047109477187, correctionSucceeded=True, chisq=8.406635446007385, pvalue=0.003738539179357663, phenotype='Binary_Trait_2' ) expected_uncorrected_row = Row( contigName='22', start=16050114, names=['rs587755077'], effect=None, stderror=None, correctionSucceeded=None, chisq=3.637471149006905, pvalue=0.05649268083569545, phenotype='Binary_Trait_1' ) assert_rows_equal( log_reg_df.filter("contigName = 21 and start = 9411301 and phenotype = 'Binary_Trait_2'").head(), expected_corrected_row ) assert_rows_equal( log_reg_df.filter("contigName = 22 and start = 16050114 and phenotype = 'Binary_Trait_1'").head(), expected_uncorrected_row ) .. tip:: The ``offset`` parameter is especially useful in incorporating the results of :ref:`GloWGR ` with phenotypes in GWAS. Please refer to :ref:`glowgr` for details and example notebook. Example notebooks and blog post ------------------------------- .. notebook:: .. tertiary/gwas-quantitative.html :title: GloWGR: GWAS for quantitative traits .. notebook:: .. tertiary/gwas-binary.html :title: GloWGR: GWAS for binary traits A detailed example and explanation of a GWAS workflow is available `here `_.