# GloWGR: Genome-Wide Association Study (GWAS) Regression Tests

Glow contains functions for performing regression analyses used in genome-wide association studies (GWAS). These functions are best used in conjunction with the GloWGR whole genome regression method, but also work as standalone analysis tools.

Tip

Glow automatically converts literal one-dimensional and two-dimensional `numpy` `ndarray`s of `double`s to `array<double>` and `spark.ml` `DenseMatrix`, respectively.

## Linear regression

`linear_regression` performs a linear regression association test optimized for performance
in a GWAS setting. You provide a Spark DataFrame containing the genetic data and Pandas DataFrames
with the phenotypes, covariates, and optional offsets (typically predicted phenotypes from
GloWGR). The function returns a Spark DataFrame with association test results for each
(variant, phenotype) pair.

Each worker node in the cluster tests a subset of the total variant dataset. Multiple phenotypes and variants are tested together to take advantage of efficient matrix-matrix linear algebra primitives.
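The benefit of testing many variants and phenotypes together can be illustrated outside of Glow. The NumPy sketch below is not Glow's implementation; it is a minimal illustration, on made-up data, of how covariate-adjusted effect sizes for every (variant, phenotype) pair can come from a few matrix products rather than one regression per pair. The function name `batched_linear_betas` and all inputs are hypothetical.

```python
import numpy as np


def batched_linear_betas(G, Y, C):
    """Hypothetical helper: effect of each variant (column of G) on each
    phenotype (column of Y), adjusted for the covariates in C."""
    # Orthonormal basis for the space spanned by the covariates
    Q, _ = np.linalg.qr(C)
    # Remove the covariate component from all variants and phenotypes at once
    G_res = G - Q @ (Q.T @ G)
    Y_res = Y - Q @ (Q.T @ Y)
    # One matrix-matrix product yields every (variant, phenotype) numerator
    numer = G_res.T @ Y_res
    denom = np.sum(G_res * G_res, axis=0)
    return numer / denom[:, None]


# Made-up data: 50 samples, 4 variants, 3 phenotypes, intercept + 2 covariates
rng = np.random.default_rng(0)
n, m, p = 50, 4, 3
C = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
G = rng.integers(0, 3, size=(n, m)).astype(float)
Y = rng.normal(size=(n, p))

betas = batched_linear_betas(G, Y, C)  # shape: (variants, phenotypes)
```

Each entry `betas[j, k]` equals the coefficient of variant `j` from an ordinary least-squares fit of phenotype `k` on that variant plus the covariates, so the batched form changes the cost of the computation, not its result.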

### Example

```
# Assumes `spark` is an active SparkSession with Glow registered, and that
# genotypes_vcf, covariates_csv, and continuous_phenotypes_csv are input file paths
import glow
import numpy as np
import pandas as pd
from pyspark.sql import Row
from pyspark.sql.functions import col, lit

# Read in VCF file
variants = spark.read.format('vcf').load(genotypes_vcf)

# genotype_states returns the number of alt alleles for each sample
# mean_substitute replaces any missing genotype states with the mean of the non-missing states
genotypes = (glow.transform('split_multiallelics', variants)
    .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes'))))
    .select('contigName', 'start', 'names', 'gt')
    .cache())

# Read covariates from a CSV file
covariates = pd.read_csv(covariates_csv, index_col=0)

# Read phenotypes from a CSV file
continuous_phenotypes = pd.read_csv(continuous_phenotypes_csv, index_col=0)

# Run linear regression test
lin_reg_df = glow.gwas.linear_regression(genotypes, continuous_phenotypes, covariates, values_column='gt')
```

For complete parameter usage information, check out the API reference for `glow.gwas.linear_regression()`.
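The example reads phenotypes and covariates with `index_col=0`, so the first CSV column becomes the DataFrame index, presumably the sample ID. The pandas sketch below shows one way to line the two tables up before passing them on. It assumes that both tables should list the same samples in the same order, which should be checked against the API reference; the sample IDs, column names, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical phenotypes indexed by sample ID
phenotypes = pd.DataFrame(
    {"height": [1.2, -0.3, 0.5], "bmi": [0.1, 0.9, -1.1]},
    index=pd.Index(["s1", "s2", "s3"], name="sample_id"),
)

# Hypothetical covariates for the same samples, stored in a different order
covariates = pd.DataFrame(
    {"age": [0.4, -1.0, 0.2], "sex": [1.0, 0.0, 1.0]},
    index=pd.Index(["s3", "s1", "s2"], name="sample_id"),
)

# Put the covariate rows in the same sample order as the phenotypes
covariates_aligned = covariates.reindex(phenotypes.index)

# Fail early if any sample has phenotypes but no covariates
missing = covariates_aligned.index[covariates_aligned.isna().any(axis=1)]
if len(missing) > 0:
    raise ValueError(f"Samples without covariates: {list(missing)}")
```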

Note

Glow also includes a SQL-based function for performing linear regression. However, this function
processes only one phenotype at a time, so it is slower than the batch linear regression function
documented above. To read more about the SQL-based function, see the docs for
`glow.linear_regression_gwas()`.

## Logistic regression

`logistic_regression` performs a logistic regression hypothesis test optimized for performance
in a GWAS setting.

### Example

```
# Assumes `spark` is an active SparkSession with Glow registered, and that
# genotypes_vcf, covariates_csv, and binary_phenotypes_csv are input file paths
import glow
import numpy as np
import pandas as pd
from pyspark.sql import Row
from pyspark.sql.functions import col, lit

# Read in VCF file
variants = spark.read.format('vcf').load(genotypes_vcf)

# genotype_states returns the number of alt alleles for each sample
# mean_substitute replaces any missing genotype states with the mean of the non-missing states
genotypes = (glow.transform('split_multiallelics', variants)
    .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes'))))
    .select('contigName', 'start', 'names', 'gt')
    .cache())

# Read covariates from a CSV file
covariates = pd.read_csv(covariates_csv, index_col=0)

# Read phenotypes from a CSV file
binary_phenotypes = pd.read_csv(binary_phenotypes_csv, index_col=0)

# Run logistic regression test with approximate Firth correction for p-values below 0.05
log_reg_df = glow.gwas.logistic_regression(
    genotypes,
    binary_phenotypes,
    covariates,
    correction='approx-firth',
    pvalue_threshold=0.05,
    values_column='gt'
)
```

For complete parameter usage information, check out the API reference for `glow.gwas.logistic_regression()`.

Note

Glow also includes a SQL-based function for performing logistic regression. However, this function
processes only one phenotype at a time, so it is slower than the batch logistic regression function
documented above. To read more about the SQL-based function, see the docs for
`glow.logistic_regression_gwas()`.

## Offset

The linear and logistic regression functions accept GloWGR phenotypic predictions (either global or per chromosome) as an offset.

```
# Reuses genotypes, continuous_phenotypes, and covariates from the linear regression example
continuous_offsets = pd.read_csv(continuous_offset_csv, index_col=0)
lin_reg_df = glow.gwas.linear_regression(
    genotypes,
    continuous_phenotypes,
    covariates,
    offset_df=continuous_offsets,
    values_column='gt'
)
```

```
# Reuses genotypes, binary_phenotypes, and covariates from the logistic regression example
binary_offsets = pd.read_csv(binary_offset_csv, index_col=0)
log_reg_df = glow.gwas.logistic_regression(
    genotypes,
    binary_phenotypes,
    covariates,
    offset_df=binary_offsets,
    correction='approx-firth',
    pvalue_threshold=0.05,
    values_column='gt'
)
```
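In general regression terms, an offset is a term added to the linear predictor with its coefficient fixed at 1. For a linear model this is equivalent to regressing the phenotype minus the offset on the remaining terms. The NumPy sketch below illustrates that idea on simulated data; it is not a description of Glow's internals, and the variable names and the simulated effect size of 0.5 are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated inputs: one variant and an offset standing in for a predicted phenotype
genotype = rng.integers(0, 3, size=n).astype(float)
offset = rng.normal(size=n)

# Phenotype = offset + variant effect (0.5 per alt allele) + small noise
phenotype = offset + 0.5 * genotype + rng.normal(scale=0.1, size=n)

# Fit intercept + genotype against the phenotype with the offset removed
X = np.column_stack([np.ones(n), genotype])
coef, *_ = np.linalg.lstsq(X, phenotype - offset, rcond=None)
beta = coef[1]  # estimated per-allele effect of the variant
```

Because the offset absorbs the variation it already explains, the variant's effect is estimated against less residual noise than it would be without the offset.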

Tip

The `offset_df` parameter is especially useful for incorporating GloWGR results when testing
phenotypes in a GWAS. Refer to GloWGR: Whole Genome Regression for details and an example notebook.