Data Simulation

These data simulation notebooks generate phenotypes, covariates, and genotypes at a user-defined scale. The resulting dataset can be used for integration and scale testing.

The variables n_samples and n_variants, which define this scale, are set in the notebook 0_setup_constants_glow. The notebooks below %run this notebook using its relative path. It is located in the Glow GitHub repository.

Simulate Covariates & Phenotypes

This data simulation notebook uses pandas to simulate quantitative and binary phenotypes and covariates.
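As a rough illustration of this kind of simulation, the sketch below draws random covariates, quantitative phenotypes, and binary phenotypes with pandas and NumPy. It is not the notebook's actual code: the function name, column names, default counts, and the choice of standard normal and fair coin-flip distributions are all assumptions made for the example. Only n_samples corresponds to a variable named in the setup notebook.

```python
import numpy as np
import pandas as pd


def simulate_covariates_and_phenotypes(n_samples, n_covariates=3, n_phenotypes=2, seed=0):
    """Hypothetical helper: return (covariates, quantitative, binary) DataFrames."""
    rng = np.random.default_rng(seed)
    sample_ids = [f"sample_{i}" for i in range(n_samples)]

    # Covariates: independent standard normal draws (an assumption).
    covariates = pd.DataFrame(
        rng.normal(size=(n_samples, n_covariates)),
        index=sample_ids,
        columns=[f"cov_{j}" for j in range(n_covariates)],
    )

    # Quantitative phenotypes: independent standard normal draws (an assumption).
    quantitative = pd.DataFrame(
        rng.normal(size=(n_samples, n_phenotypes)),
        index=sample_ids,
        columns=[f"qt_{j}" for j in range(n_phenotypes)],
    )

    # Binary phenotypes: 0/1 with equal probability (an assumption).
    binary = pd.DataFrame(
        rng.integers(0, 2, size=(n_samples, n_phenotypes)),
        index=sample_ids,
        columns=[f"bt_{j}" for j in range(n_phenotypes)],
    )
    return covariates, quantitative, binary
```

All three frames share the same sample index, so they can be joined or passed side by side to downstream tests. Fixing the seed keeps repeated runs reproducible.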

Notebook

Simulate Genotypes

This data simulation notebook loads variant call format (VCF) files from the 1000 Genomes Project and returns a Delta Lake table with simulated genotypes, preserving Hardy-Weinberg equilibrium and the allele frequency of each variant.
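One way to preserve Hardy-Weinberg equilibrium is to draw each of a sample's two alleles independently at the variant's allele frequency. This yields genotype frequencies of (1-p)^2, 2p(1-p), and p^2 for an alternate allele frequency p. The sketch below shows that idea in NumPy only. It is an assumption about the general technique, not the notebook's implementation: the function name and arguments are hypothetical, and the VCF loading and Delta Lake steps are omitted.

```python
import numpy as np


def simulate_genotypes(alt_allele_freqs, n_samples, seed=0):
    """Hypothetical helper: simulate diploid calls for biallelic variants.

    Returns an int8 array of shape (n_variants, n_samples, 2), where each
    entry is 0 (reference allele) or 1 (alternate allele).
    """
    rng = np.random.default_rng(seed)
    freqs = np.asarray(alt_allele_freqs, dtype=float)

    # Draw both alleles independently: an allele is alternate when its
    # uniform draw falls below the variant's alternate allele frequency.
    draws = rng.random((freqs.size, n_samples, 2))
    calls = draws < freqs[:, None, None]
    return calls.astype(np.int8)
```

Summing over the last axis gives the alternate allele count per sample (0, 1, or 2), and averaging the calls for a variant should approach its input allele frequency as n_samples grows.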

Notebook