GloWGR
WGR functions
- class glow.wgr.LogisticRidgeRegression(reduced_block_df, label_df, sample_blocks, cov_df=Empty DataFrame Columns: [] Index: [], add_intercept=True, alphas=[])[source]
The LogisticRidgeRegression class is used to fit logistic ridge regression models against one or more labels optimized over a provided list of ridge alpha parameters. The optimal ridge alpha value is chosen for each label by minimizing the average out of fold log_loss scores.
- fit()[source]
Fits a logistic regression model, represented by a Spark DataFrame containing coefficients for each of the ridge alpha parameters, for each block in the reduced block matrix, for each label in the target labels, as well as a Spark DataFrame containing the optimal ridge alpha value for each label.
- Return type
(
DataFrame
,DataFrame
)- Returns
Two Spark DataFrames, one containing the model resulting from the fitting routine and one containing the results of the cross validation procedure.
- fit_transform(response='linear')[source]
Fits a logistic ridge regression model, then transforms the matrix using the model.
- Parameters
response (
str
) – String specifying the desired output. Can be ‘linear’ to specify the direct output of the linear WGR model (default) or ‘sigmoid’ to specify predicted label probabilities.- Return type
DataFrame
- Returns
Pandas DataFrame containing prediction y_hat values. The shape and order match labeldf such that the rows are indexed by sample ID and the columns by label. The column types are float64.
- fit_transform_loco(response='linear', chromosomes=[])[source]
Fits a logistic ridge regression model with a block matrix, then generates predictions for the target labels in the provided label DataFrame by applying the model resulting from the LogisticRidgeRegression fit method to the starting reduced block matrix using a leave-one-chromosome-out (LOCO) approach (this method caches the model and cross-validation DataFrames in the process for better performance).
- Parameters
response (
str
) – String specifying the desired output. Can be ‘linear’ to specify the direct output of the linear WGR model (default) or ‘sigmoid’ to specify predicted label probabilities.chromosomes (
List
[str
]) – List of chromosomes for which to generate a prediction (optional). If not provided, the chromosomes will be inferred from the block matrix.
- Return type
DataFrame
- Returns
Pandas DataFrame containing prediction y_hat values per chromosome. The rows are indexed by sample ID and chromosome; the columns are indexed by label. The column types are float64. The DataFrame is sorted using chromosome as the primary sort key, and sample ID as the secondary sort key.
- classmethod from_ridge_reduction(cls, ridge_reduced, alphas=[])[source]
Initializes an instance of LogsiticRidgeRegression using a RidgeReduction object
- Parameters
ridge_reduced (
RidgeReduction
) – A RidgeReduction instance based on which the LogisticRidgeRegression instance must be madealphas (
List
[float
]) – array_like of alpha values used in the logistic ridge regression (optional).
- reduce_block_matrix(response)[source]
Transforms a starting reduced block matrix by applying a linear model. The form of the output can either be a direct linear transformation (response = “linear”) or a linear transformation followed by a sigmoid transformation (response = “sigmoid”).
- Parameters
response (
str
) – String specifying what transformation to apply (“linear” or “sigmoid”)- Return type
DataFrame
- Returns
Spark DataFrame containing the result of the transformation.
- transform(response='linear')[source]
Generates GWAS covariates for the target labels in the provided label DataFrame by applying the model resulting from the LogisticRidgeRegression fit method to the starting reduced block matrix.
- Parameters
response (
str
) – String specifying the desired output. Can be ‘linear’ to specify the direct output of the linear WGR model (default) or ‘sigmoid’ to specify predicted label probabilities.- Return type
DataFrame
- Returns
Pandas DataFrame containing covariate values. The shape and order match label_df such that the rows are indexed by sample ID and the columns by label. The column types are float64.
- transform_loco(response='linear', chromosomes=[])[source]
Generates predictions for the target labels in the provided label DataFrame by applying the model resulting from the LogisticRidgeRegression fit method to the starting reduced block matrix using a leave-one-chromosome-out (LOCO) approach (this method caches the model and cross-validation DataFrames in the process for better performance).
- Parameters
response (
str
) – String specifying the desired output. Can be ‘linear’ to specify the direct output of the linear WGR model (default) or ‘sigmoid’ to specify predicted label probabilities.chromosomes (
List
[str
]) – List of chromosomes for which to generate a prediction (optional). If not provided, the chromosomes will be inferred from the block matrix.
- Return type
DataFrame
- Returns
Pandas DataFrame containing prediction y_hat values per chromosome. The rows are indexed by sample ID and chromosome; the columns are indexed by label. The column types are float64. The DataFrame is sorted using chromosome as the primary sort key, and sample ID as the secondary sort key.
- class glow.wgr.RidgeReduction(block_df, label_df, sample_blocks, cov_df=Empty DataFrame Columns: [] Index: [], add_intercept=True, alphas=[], label_type='detect')[source]
The RidgeReduction class is intended to reduce the feature space of an N by M block matrix X to an N by P<<M block matrix. This is done by fitting K ridge models within each block of X on one or more target labels, such that a block with L columns to begin with will be reduced to a block with K columns, where each column is the prediction of one ridge model for one target label.
- fit()[source]
Fits a ridge reducer model, represented by a Spark DataFrame containing coefficients for each of the ridge alpha parameters, for each block in the starting matrix, for each label in the target labels.
- Return type
DataFrame
- Returns
Spark DataFrame containing the model resulting from the fitting routine.
- class glow.wgr.RidgeRegression(reduced_block_df, label_df, sample_blocks, cov_df=Empty DataFrame Columns: [] Index: [], add_intercept=True, alphas=[])[source]
The RidgeRegression class is used to fit ridge models against one or more labels optimized over a provided list of ridge alpha parameters. It is similar in function to RidgeReduction except that whereas RidgeReduction attempts to reduce a starting matrix X to a block matrix of smaller dimension, RidgeRegression is intended to find an optimal model of the form Y_hat ~ XB, where Y_hat is a matrix of one or more predicted labels and B is a matrix of coefficients. The optimal ridge alpha value is chosen for each label by maximizing the average out of fold r2 score.
- fit()[source]
Fits a ridge regression model, represented by a Spark DataFrame containing coefficients for each of the ridge alpha parameters, for each block in the starting reduced matrix, for each label in the target labels, as well as a Spark DataFrame containing the optimal ridge alpha value for each label.
- Return type
(
DataFrame
,DataFrame
)- Returns
Two Spark DataFrames, one containing the model resulting from the fitting routine and one containing the results of the cross validation procedure.
- fit_transform()[source]
Fits a ridge regression model, then transforms the matrix using the model.
- Return type
DataFrame
- Returns
Pandas DataFrame containing prediction y_hat values. The shape and order match labeldf such that the rows are indexed by sample ID and the columns by label. The column types are float64.
- fit_transform_loco(chromosomes=[])[source]
Fits a ridge regression model and then generates predictions for the target labels in the provided label DataFrame by applying the model resulting from the RidgeRegression fit method to the starting reduced block matrix using a leave-one-chromosome-out (LOCO) approach ((this method caches the model and cross-validation DataFrames in the process for better performance).
- Parameters
chromosomes (
List
[str
]) – List of chromosomes for which to generate a prediction (optional). If not provided, the chromosomes will be inferred from the block matrix.- Return type
DataFrame
- Returns
Pandas DataFrame containing offset values (y_hat) per chromosome. The rows are indexed by sample ID and chromosome; the columns are indexed by label. The column types are float64. The DataFrame is sorted using chromosome as the primary sort key, and sample ID as the secondary sort key.
- classmethod from_ridge_reduction(cls, ridge_reduced, alphas=[])[source]
Initializes an instance of RidgeRegression using a RidgeReduction object
- Parameters
ridge_reduced (
RidgeReduction
) – A RidgeReduction instance based on which the RidgeRegression instance must be madealphas (
List
[float
]) – array_like of alpha values used in the ridge regression (optional).
- transform()[source]
Generates predictions for the target labels in the provided label DataFrame by applying the model resulting from the RidgeRegression fit method to the reduced block matrix.
- Return type
DataFrame
- Returns
Pandas DataFrame containing prediction y_hat values. The shape and order match label_df such that the rows are indexed by sample ID and the columns by label. The column types are float64.
- transform_loco(chromosomes=[])[source]
Generates predictions for the target labels in the provided label DataFrame by applying the model resulting from the RidgeRegression fit method to the starting reduced block matrix using a leave-one-chromosome-out (LOCO) approach (this method caches the model and cross-validation DataFrames in the process for better performance).
- Parameters
chromosomes (
List
[str
]) – List of chromosomes for which to generate a prediction (optional). If not provided, the chromosomes will be inferred from the block matrix.- Return type
DataFrame
- Returns
Pandas DataFrame containing offset values (y_hat) per chromosome. The rows are indexed by sample ID and chromosome; the columns are indexed by label. The column types are float64. The DataFrame is sorted using chromosome as the primary sort key, and sample ID as the secondary sort key.
- glow.wgr.block_variants_and_samples(variant_df, sample_ids, variants_per_block, sample_block_count)[source]
Creates a blocked GT matrix and index mapping from sample blocks to a list of corresponding sample IDs. Uses the same sample-blocking logic as the blocked GT matrix transformer.
Requires that:
Each variant row has the same number of values
The number of values per row matches the number of sample IDs
- Parameters
- Return type
- Returns
tuple of (blocked GT matrix, index mapping)
- glow.wgr.estimate_loco_offsets(block_df, label_df, sample_blocks, cov_df=Empty DataFrame Columns: [] Index: [], add_intercept=True, reduction_alphas=[], regression_alphas=[], label_type='detect', chromosomes=[])[source]
The one-stop function to generate WGR predictors to be used as offsets in gwas functions. Given the input the function performs the ridge reduction followed by appropriate choice of ridge regression or logistic ridge regression in a loco manner.
- Parameters
block_df (
DataFrame
) – Spark DataFrame representing the beginning block matrix Xlabel_df (
DataFrame
) – Pandas DataFrame containing the target labels used in fitting the ridge modelssample_blocks (
Dict
[str
,List
[str
]]) – Dict containing a mapping of sample_block ID to a list of corresponding sample IDscov_df (
DataFrame
) – Pandas DataFrame containing covariates to be included in every model in the stacking ensemble (optional).add_intercept (
bool
) – If True, an intercept column (all ones) will be added to the covariates (as the first column)reduction_alphas (
List
[float
]) – array_like of alpha values used in the ridge reduction (optional). If not provided, the automatically generates alphas for reduction.regression_alphas (
List
[float
]) – array_like of alpha values used in the ridge or logistic ridge regression (optional). If not provided, the automatically generates alphas for regression.label_type – String to determine type treatment of labels. It can be ‘detect’ (default), ‘binary’, or ‘quantitative’. On ‘detect’ the function picks binary or quantitative based on whether label_df is all binary or not, respectively.
chromosomes (
List
[str
]) – List of chromosomes for which to generate offsets (optional). If not provided, the chromosomes will be inferred from the block matrix.
- Return type
DataFrame
- Returns
Pandas DataFrame containing offset values per chromosome. The rows are indexed by sample ID and chromosome; the columns are indexed by label. The column types are float64. The DataFrame is sorted using chromosome as the primary sort key, and sample ID as the secondary sort key.
- glow.wgr.get_sample_ids(data)[source]
Extracts sample IDs from a variant DataFrame, such as one read from PLINK files.
Requires that the sample IDs:
Are in
genotype.sampleId
Are the same across all the variant rows
Are a list of strings
Are non-empty
Are unique
- glow.wgr.reshape_for_gwas(spark, label_df)[source]
Reshapes a Pandas DataFrame into a Spark DataFrame with a convenient format for Glow’s GWAS functions. This function can handle labels that are either per-sample or per-sample and per-contig, like those generated by GloWGR’s transform_loco function.
Examples
>>> label_df = pd.DataFrame({'label1': [1, 2], 'label2': [3, 4]}, index=['sample1', 'sample2']) >>> reshaped = reshape_for_gwas(spark, label_df) >>> reshaped.head() Row(label='label1', values=[1, 2])
>>> loco_label_df = pd.DataFrame({'label1': [1, 2], 'label2': [3, 4]}, ... index=pd.MultiIndex.from_tuples([('sample1', 'chr1'), ('sample1', 'chr2')])) >>> reshaped = reshape_for_gwas(spark, loco_label_df) >>> reshaped.head() Row(contigName='chr1', label='label1', values=[1])
Requires that:
The input label DataFrame is indexed by sample id or by (sample id, contig name)
- Parameters
spark (
SparkSession
) – A Spark sessionlabel_df (
DataFrame
) – A pandas DataFrame containing labels. The Data Frame should either be indexed by sample id or multi indexed by (sample id, contig name). Each column is interpreted as a label.
- Return type
DataFrame
- Returns
A Spark DataFrame with a convenient format for Glow regression functions. Each row contains the label name, the contig name if provided in the input DataFrame, and an array containing the label value for each sample.
GWAS functions
- glow.gwas.linear_regression(genotype_df, phenotype_df, covariate_df=Empty DataFrame Columns: [] Index: [], offset_df=Empty DataFrame Columns: [] Index: [], contigs=None, add_intercept=True, values_column='values', dt=<class 'numpy.float64'>, verbose_output=False, intersect_samples=False, genotype_sample_ids=None)[source]
Uses linear regression to test for association between genotypes and one or more phenotypes. The implementation is a distributed version of the method used in regenie: https://www.biorxiv.org/content/10.1101/2020.06.19.162354v2
Implementation details:
On the driver node, we decompose the covariate matrix into an orthonormal basis and use it to project the covariates out of the phenotype matrix. The orthonormal basis and the phenotype residuals are broadcast as part of a Pandas UDF. In each Spark task, we project the covariates out of a block of genotypes and then compute the regression statistics for each phenotype, taking into account the distinct missingness patterns of each phenotype.
Examples
>>> np.random.seed(42) >>> n_samples, n_phenotypes, n_covariates = (710, 3, 3) >>> phenotype_df = pd.DataFrame(np.random.random((n_samples, n_phenotypes)), columns=['p1', 'p2', 'p3']) >>> covariate_df = pd.DataFrame(np.random.random((n_samples, n_phenotypes))) >>> genotype_df = (spark.read.format('vcf').load('test-data/1kg_sample.vcf') ... .select('contigName', 'start', 'genotypes')) >>> results = glow.gwas.linear_regression(genotype_df, phenotype_df, covariate_df, ... values_column=glow.genotype_states('genotypes')) >>> results.head() Row(contigName='1', start=904164, effect=0.0453..., stderror=0.0214..., tvalue=2.114..., pvalue=0.0348..., phenotype='p1')
>>> phenotype_df = pd.DataFrame(np.random.random((n_samples, n_phenotypes)), columns=['p1', 'p2', 'p3']) >>> covariate_df = pd.DataFrame(np.random.random((n_samples, n_phenotypes))) >>> genotype_df = (spark.read.format('vcf').load('test-data/1kg_sample.vcf') ... .select('contigName', 'start', 'genotypes')) >>> contigs = ['1', '2', '3'] >>> offset_index = pd.MultiIndex.from_product([phenotype_df.index, contigs]) >>> offset_df = pd.DataFrame(np.random.random((n_samples * len(contigs), n_phenotypes)), ... index=offset_index, columns=phenotype_df.columns) >>> results = glow.gwas.linear_regression(genotype_df, phenotype_df, covariate_df, ... offset_df=offset_df, values_column=glow.genotype_states('genotypes'))
- Parameters
genotype_df (
DataFrame
) – Spark DataFrame containing genomic dataphenotype_df (
DataFrame
) – Pandas DataFrame containing phenotypic datacovariate_df (
DataFrame
) – An optional Pandas DataFrame containing covariatesoffset_df (
DataFrame
) – An optional Pandas DataFrame containing the phenotype offset, as output by GloWGR’s RidgeRegression or Regenie step 1. The actual phenotype used for linear regression is the mean-centered, residualized and scaledphenotype_df
minus the appropriate offset. Theoffset_df
may have one or two levels of indexing. If one level, the index should be the same as thephenotype_df
. If two levels, the level 0 index should be the same as thephenotype_df
, and the level 1 index should be the contig name. The two level index scheme allows for per-contig offsets like LOCO predictions from GloWGR.contigs (
Optional
[List
[str
]]) – When using LOCO offsets, this parameter indicates the contigs to analyze. You can use this parameter to limit the size of the broadcasted data, which may be necessary with large sample sizes. If this parameter is omitted, the contigs are inferred from theoffset_df
.add_intercept (
bool
) – Whether or not to add an intercept column to the covariate DataFramevalues_column (
Union
[str
,Column
]) – A column name or column expression to test with linear regression. If a column name is provided,genotype_df
should have a column with this name and a numeric array type. If a column expression is provided, the expression should return a numeric array type.dt (
type
) – The numpy datatype to use in the linear regression test. Must benp.float32
ornp.float64
.verbose_output (
bool
) – Whether or not to generate additional test statistics (n, sum_x, y_transpose_x) to the output DataFrame. These values are derived directly from phenotype_df and genotype_df, and does not reflect any standardization performed as part of the implementation of linear_regression.intersect_samples (
bool
) – The current implementation of linear regression is optimized for speed, but is not robust to high levels missing phenotype values. Without handling missingness appropriately, pvalues may become inflated due to imputation. When intersect_samples is enabled, samples that do no exist in the phenotype dataframe will be dropped from genotypes, offsets, and covariates prior to regression analysis. Note that if phenotypes in phenotypes_df contain missing values, these samples will not be automatically dropped. The user is responsible for determining their desired levels of missingness and imputation. Drop any rows with missing values from phenotype_df prior to linear_regression to prevent any imputation. If covariates are provided, covariate and phenotype samples will automatically be intersected.genotype_sample_ids (
Optional
[List
[str
]]) – Sample ids from genotype_df. i.e. from applying glow.wgr.functions.get_sample_ids(genotype_df) or if include_sample_ids=False was used during the generation genotype_df, then using an externally managed list of sample_ids that correspond to the array of genotype calls.
- Return type
DataFrame
- Returns
A Spark DataFrame that contains
All columns from
genotype_df
except thevalues_column
and thegenotypes
column if one existseffect
: The effect size estimate for the genotypestderror
: The estimated standard error of the effecttvalue
: The T statisticpvalue
: P value estimated from a two sided T-testphenotype
: The phenotype name as determined by the column names ofphenotype_df
``n``(int): (verbose_output only) number of samples with non-null phenotype
``sum_x``(float): (verbose_output only) sum of genotype inputs
- ``y_transpose_x``(float): (verbose_output only) dot product of phenotype response (missing values encoded as zeros)
and genotype input, i.e. phenotype value * number of alternate alleles
- glow.gwas.logistic_regression(genotype_df, phenotype_df, covariate_df=Empty DataFrame Columns: [] Index: [], offset_df=Empty DataFrame Columns: [] Index: [], correction='approx-firth', pvalue_threshold=0.05, contigs=None, add_intercept=True, values_column='values', dt=<class 'numpy.float64'>, intersect_samples=False, genotype_sample_ids=None)[source]
Uses logistic regression to test for association between genotypes and one or more binary phenotypes. This is a distributed version of the method from regenie: https://www.nature.com/articles/s41588-021-00870-7
Implementation details:
On the driver node, we fit a logistic regression model based on the covariates for each phenotype:
\[logit(y) \sim C\]where \(y\) is a phenotype vector and \(C\) is the covariate matrix.
We compute the probability predictions \(\hat{y}\) and broadcast the residuals (\(y - \hat{y}\)), \(\gamma\) vectors (where \(\gamma = \hat{y} * (1 - \hat{y})\)), and \((C^\intercal \gamma C)^{-1}\) matrices. In each task, we then adjust the new genotypes based on the null fit, perform a score test as a fast scan for potentially significant variants, and then test variants with p-values below a threshold using a more selective, more expensive test.
- Parameters
genotype_df (
DataFrame
) – Spark DataFrame containing genomic dataphenotype_df (
DataFrame
) – Pandas DataFrame containing phenotypic datacovariate_df (
DataFrame
) – An optional Pandas DataFrame containing covariatesoffset_df (
DataFrame
) – An optional Pandas DataFrame containing the phenotype offset. This value will be used as an offset in the covariate only and per variant logistic regression models. Theoffset_df
may have one or two levels of indexing. If one level, the index should be the same as thephenotype_df
. If two levels, the level 0 index should be the same as thephenotype_df
, and the level 1 index should be the contig name. The two level index scheme allows for per-contig offsets like LOCO predictions from GloWGR.correction (
str
) – Which test to use for variants that meet a significance threshold for the score test. Supported methods arenone
andapprox-firth
.pvalue_threshold (
float
) – Variants with a pvalue below this threshold will be tested using thecorrection
method.contigs (
Optional
[List
[str
]]) – When using LOCO offsets, this parameter indicates the contigs to analyze. You can use this parameter to limit the size of the broadcasted data, which may be necessary with large sample sizes. If this parameter is omitted, the contigs are inferred from theoffset_df
.add_intercept (
bool
) – Whether or not to add an intercept column to the covariate DataFramevalues_column (
str
) – A column name or column expression to test with linear regression. If a column name is provided,genotype_df
should have a column with this name and a numeric array type. If a column expression is provided, the expression should return a numeric array type.dt (
type
) – The numpy datatype to use in the linear regression test. Must benp.float32
ornp.float64
.
- Return type
DataFrame
- Returns
A Spark DataFrame that contains
All columns from
genotype_df
except thevalues_column
and thegenotypes
column if one existseffect
: The effect size (if approximate Firth correction was applied)stderror
: Standard error of the effect size (if approximate Firth correction was applied)correctionSucceeded
: Whether the correction succeeded (if the correction test method is notnone
).True
if succeeded,False
if failed,null
if correction was not applied.chisq
: The chi squared test statistic according to the score test or the correction methodpvalue
: p-value estimated from the test statisticphenotype
: The phenotype name as determined by the column names ofphenotype_df