GWAS guide using Alzheimer's disease
Explore one of the ways to conduct a GWAS on the Research Analysis Platform.
The code provided in the following tutorial is delivered "As-Is." Notwithstanding anything to the contrary, DNAnexus will have no warranty, support or other obligations with respect to Materials provided hereunder. The MIT License applies to this guide.
Introduction
This tutorial will provide the Python code needed to conduct the QC portion of a genome-wide association study (GWAS) focusing on discovering variants associated with Alzheimer's Disease (AD) in JupyterLab. For this GWAS, we used UK Biobank data to search for genetic loci associated with AD. For more detailed information about the AD study, see this blog post.
One of the motivations for performing this study was to demonstrate the process of running an end-to-end GWAS on the UK Biobank Research Analysis Platform. While researchers can bring their own programming languages and tools to the Platform, the Platform’s user interface enables researchers to run analyses easily and quickly. We wanted to demonstrate that conducting a GWAS could be done without the knowledge of sophisticated command-line methods.
Overview
Here is a brief overview of the steps that we conducted to perform this GWAS:
In JupyterLab:
Access data and construct phenotype.
Perform sample QC
In CLI or UI:
Conduct LiftOver.
Variant QC using Swiss Army Knife.
Conduct GWAS using regenie.
In UI:
Visualize analysis results using LocusZoom.
Accessing the Data
The genetic data needed to run this GWAS is stored in multiple folders in the Research Analysis Platform’s file structure, which is detailed here. For this analysis, we use the array genotype and whole exome sequencing data stored in the “/Bulk/Genotype Results/Genotype calls/” and “/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/” folders, respectively. Phenotypic data needed to collect the AD phenotype is located in the SQL database named “app<APPLICATION-ID>_<CREATION-TIME>”, and can be accessed via the corresponding dataset named “app<APPLICATION-ID>_<CREATION-TIME>.dataset” in the root (“/”) folder.
Pre-GWAS Steps
The first part of this analysis will consist of conducting sample QC and processing the UK Biobank data fields to calculate the individual’s AD risk by proxy.
Sample QC Steps and Derived Case/Control Phenotype
We conducted our QC using a Jupyter Lab spark node. Here is the following configuration:
Cluster configuration: Spark Cluster
Instance type: mem1_hdd1_v2_x4
Number of Nodes: 2
Getting the Data
We first imported dxpy
to find and extract the UKB datasets on the Research Analysis Platform.
The extracted UKB data is returned as 3 .csv files. The data_dictionary.csv
contains a table with participants as the rows and metadata along the columns, including field names (see table below for naming convention). The codings.csv
contains a lookup table for the different medical codes, including ICD-10 codes that will be displayed in the diagnosis field column (p41270). The entity_dictionary.csv
contains the different entities that can be queried, where participant
is the main entity that corresponds to most phenotype fields.
To access the data_dictionary.csv
, use the command:
Selecting Participant Fields by Field Index, Instance Index, or Array Index¶
For the main participant phenotype entity, the Research Analysis Platform uses field names with the following convention:
Type of field
Syntax for field name
Example
Neither instanced nor arrayed
p<FIELD>
p31
Instanced but not arrayed
p<FIELD>_i<INSTANCE>
p40005_i0
Arrayed but not instanced
p<FIELD>_a<ARRAY>
p41262_a0
Instanced and arrayed
p<FIELD>_i<INSTANCE>_a<ARRAY>
p93_i0_a0
Data-Field ID and Corresponding Description
Using the UKB showcase, we located the following field IDs of interest from UKB showcase. We note them below so we can easily refer to them throughout the rest of the notebook.
The full list of all data-fields used in our case study:
Data-Field ID
Description
Data-Coding
Sex
Father's age at death
Mother's age
Father's age
Mother's age at death
Illnesses of father
Illnesses of mother
Age at recruitment
Units of measurement are years
Genetic sex
Genetic ethnic grouping
Genetic principal components
Units of measurement are Units; Array indices run from 1 to 40
Sex chromosome aneuploidy
Genetic kinship to other participants
Outliers for heterozygosity or missing rate
Diagnoses - main ICD10
Diagnoses - secondary ICD10
Extract Data-Fields Used for Deriving Phenotype and Sample QC
The fields we want include principal components as well as the fields we need to construct our phenotype. We use the following function to collect all the field names and create a list of field IDs:
If you view these field names now, each will look something like this: p<field_id>_iYYY_aZZZ
.
Now we can retrieve our necessary fields as a Pandas DataFrame:
The dispensed_dictionary.csv file contains a table with participants as rows and the fields of interest as columns. From this point onward, we are interacting with the data in a Pandas DataFrame.
Use this command to visualize your table:
We will format the column headers to drop the "participant" prefix for the field names:
Extract ICD-10 Codes for AD by Patterns (G30 and F00)
Here we resolve the ICD-10 codes for AD from the dataset. This list of codes will be used later for defining the phenotype.
Your output may look like the following:
Derive the Phenotype (AD-by-Proxy)
The AD phenotype is generated using both a participant’s disease status and that participant's parental disease status and age. We are following the definition of AD-by-proxy described in this article by Jansen et al.
Here is a summary of the steps of the process. There are two calculation methods:
Participant's disease status.
If the participant has been diagnosed with AD. Count risk as the maximum value, 2.
Additive of participant's parents' risk, determined by disease status along with parent's age.
If a biological parent has been diagnosed with AD, count the risk as a 1. The risk is set to 2 if both biological parents were diagnosed with AD.
Otherwise, get the parent' risk by the parents age (recorded age or age at death, whichever is older):
risk = max( 0.32, (100 - age)/100 )
0.32 is the general risk for an adult to be diagnosed with AD
Take the maximum of steps 1 and 2.
To get the participant's disease status, we look for the participant’s primary or secondary ICD-10 codes. If a participant has ICD-10 codes for AD, we record their AD risk as “2”. Note that we first need to format the values in the columns to be a list object instead of a string.
Now we look at the illness statuses of the participant’s mother and father. We first get a list of illnesses with which the participants' biological parents have been diagnosed.
Data-fields 20107 and 20110 record illnesses of each participant's father and mother. If diagnosed with AD, assign the parent's risk as "1":
UKB participants attended multiple measurement-taking sessions, so there are multiple fields for the participant’s parents’ ages. Since the risk of developing AD increases by age, we only need the maximum value for each parent’s age to calculate the AD risk:
If the parent does not have any AD diagnosis, calculate the parent’s risk using their age, as shown in the following, then organize the data into a table:
Perform Sample QC and Determine Phenotype
Now we conduct the Sample QC steps. We filtered for participants for whom reported sex and genetic sex are the same, ancestry is listed as "White British," and there is no sex chromosome aneuploidy. We filter out for outliers that are heterozygous and missing rates, or are distinguished excessive kinship, i.e. have ten or more third-degree relatives. Participants who had parental ages as “do not know" or "prefer not to answer" were also removed from the cohort:
Use this command to check the resulting number of samples:
Visualizing the Distribution and Setting the Case/Control Threshold
Since GWAS analyses have been run before for AD, and the results have been documented, we took the extra step of visualizing the distribution of cases with a histogram, to double check that our results looked reasonable in a general population. Our histogram looked reasonable, so we continued with our analysis:
Seeing the distribution, we decide to set the case/control threshold with AD risk at 1, with the range going from 0 to 2. All participants with a value for 1 or above for “AD-by-proxy” are considered cases, and those below 1 are considered to be controls:
Cleaning up the Pandas DataFrame and Saving to Your Project
Next we cleaned up the DataFrame to set up as input with regenie. We renamed the columns for human-readability and added an FID column, which is the required format for regenie. We also convert the list of parental illnesses to a string, for ease of formatting:
At this point, your table should look something like this, noting that displayed data values are placeholders, so to not disclose actual values:
IID
sex
age
missingness
ethnic_group
sex_chromosome_aneuploidy
0
0000011
1
45
0.003
1
None
1
0000022
0
67
0.022
1
None
2
0000003
1
50
0.001
1
None
To upload your phenotype file back to the project:
Now the QC steps are complete.
The next steps for the GWAS is to conduct variant QC and the array-genotype LiftOver.
Variant QC Steps and Array-Genotype LiftOver
LiftOver
Note that the UKB SNP array genotype calls (bulk, genotype calls) are mapped to GRCh37, while the whole-exome sequences are mapped to GRCh38, the most recent version. We chose to lift the coordinates of genotype calls over to GRCh38 using LiftOver before conducting the rest of the analysis, so that our genotype calls are consistent.
The following pipeline may be useful in conducting this part of the analysis.
Variant QC
To conduct variant QC, we use PLINK2, available as part of the Swiss Army Knife app, to create a list of variants that pass the QC threshold for array genotype and WES variant data. To access it from the UK Biobank Research Analysis Program, open your project, then click Start Analysis. From the list of tools that appear, click Swiss Army Knife. The following shows what you will put as input string under the cmd
option of the Swiss Army Knife app.
For more information about PLINK2 see the official PLINK2 documentation.
Array Genotype Data
The input files to our variant QC job for running the array genotype data is:
The UK Biobank genotype call files in .bed, .bim, and .fam formats. We merged the chromosomes together into one file for each file format in preparation for running the whole genome regression model using Step 1 of regenie.
Below are the Swiss Army Knife cmd
inputs used to run the job:
The outputs we get from this step are:
A list of SNPs that pass QC,
final_array_snps_CRCh38_qc_pass.snplist
A list of sample IDs that pass QC,
final_array_snps_CRCh38_qc_pass.id
A log file (
.log
)
WES Data
The WES QC is performed using the bgens_qc.wdl
script found here. The variants are QC’d per chromosome and then concatenated at the end to create a single list of variants to keep. The variants were filtered using following options:
The input file to run the bgens_qc.wdl
was created by first running the generate_inputs.ipynb
notebook shown here. Within this notebook you’ll need to update the following parameters:
Run the WDL workflow using the .json file created in the generate_inputs.ipynb
notebook by running the command: dx run <path_to_folder_on_UKB_RAP>/bgens_qc -f research_project/bgens_qc/bgens_qc_input.dx.json
. Otherwise you can run the WDL workflow in the UI and specify the inputs manually instead of using the Jupyter notebook.
The outputs we get from this step are:
A list of SNPs that pass QC,
final_WES_snps_CRCh38_qc_pass.snplist
A list of sample IDs that pass QC,
final_WES_snps_CRCh38_qc_pass.id
A log file (.
log
)
GWAS using regenie
We will be using regenie to conduct our GWAS. DNAnexus provides a suite of apps to run the analysis as a whole. The Regenie orchestration app is used to facilitate the run of two required steps as well as input preprocessing, validation, resource allocation and post-processing (including creation of Manhattan plots per trait).
regenie Step 1
For Step 1, we will estimate how background SNPs contribute to the phenotype.
The inputs used for Step 1 are:
A merged array genotype calls:
ukb_c1-22_hs38DH_merged.[bed,bim,fam]
that we also used as our input in the variant QC step above.Our phenotype file:
ad_risk_by_proxy_wes.phe
SNPs that pass QC list:
final_array_snps_CRCh38_qc_pass.snplist
and the SNP ID file,final_array_snps_CRCh38_qc_pass.id
The following outputs from Step 1 are the necessary inputs for Step 2:
LOCO file:
ukb_c1-22_hs38DH_merged_1.loco
Prediction list:
ukb_c1-22_hs38DH_merged_pred.list
regenie Step 2
Then in Step 2, linear or logistic regression is used to test association between the variants and phenotype conditional upon the prediction from the regression model in Step 1. In Step 2, the orchestration app distributes each set of BGEN files into a separate job, as each variant is independently tested to the phenotype, and this parallelization helps accelerate the end-to-end runtime.
The inputs to the second step are:
23 sets of WES genotype files (
ukb23159_c[1-22,X]_b0_v1.[bgen,bgi,sample]
).ID file of SNPs that pass QC:
final_WES_snps_CRCh38_qc_pass.id
Phenotype file:
ad_risk_by_proxy_wes.phe
Prediction list:
ukb_c1-22_hs38DH_merged_pred.list
(from Step 1)LOCO file:
ukb_c1-22_hs38DH_merged_1.loco
(from Step 1).
After the completion of Step 2, a set of association plots is generated. This concatenation is done per phenotype. On the output of the orchestration app are concatenated Manhattan plots as well as statistic in lmm.tsv.gz
(Extra output files/extra_outputs ) that can be investigated with LocusZoom.
Running regenie in the UI
We used the following input options to run the regenie app using the UI. If an input option isn’t specified, then the default option was used.
Inputs:
Genotype BED for Step 1: ukb_c1-22_hs38DH_merged.bed
Genotype BIM for Step 1: ukb_c1-22_hs38DH_merged.bim
Genotype FAM for Step 1: ukb_c1-22_hs38DH_merged.fam
Genotype BGEN files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen
Genotype BGI index files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen.bgi
Sample files for Step 2: ukb23159_c[1-22,X]_b0_v1.sample
Phenotypes file: ad_risk_by_proxy_wes.phe
Covariates file: ad_risk_by_proxy_wes.phe
Variant IDs to extract (Step 1): final_array_snps_CRCh38_qc_pass.snplist
Variant IDs to extract (Step 2): final_WES_snps_CRCh38_qc_pass.snplist
Common
Quantitative traits: FALSE
Phenotypes to include: ad_by_proxy
Array of covariates to apply: age,sex,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10
WGR (Step 1)
Size of the genotype blocks (Step 1): 1000
First allele as reference (Step 1): TRUE
Association Testing (Step 2)
Size of the genotype blocks (Step 2): 200
Firth approximation instead of SPA: TRUE
Minimum allele count (MAC): 3
Running regenie in the CLI
To run regenie in the CLI, do not directly copy and paste this portion of code. Fill in the file paths specified by <> and add file paths for the chromosomes you want to analyze using the igenotype_bgens/bgis/samples.
LocusZoom
Next you can use LocusZoom to visualize the results of your analysis. The LocusZoom app is available in the Tools Library of the Research Analysis Platform. It generates Manhattan plots, QQ plots, and a table of high p-value loci.
To visualize your regenie outputs (e.g. the lmm.tsv.gz
file), provide it as input to LocusZoom, and follow the app's instructions. From these results you can conduct further analyses and look into variant effects, or you may find new loci to explore.
Examples of these visualizations are shown below.
Benchmarking
A few example cases that might be interesting were tested to provide expectations on the Regenie runtime.
Step 1
The runtime and resources of various scenarios for Step 1 execution are summarized in the following table. Please note that this is only for orientational purposes and analysis-specific factors can influence the final results.
Step 2
Focusing on chromosome 1 (99236 variants after QC) as we analyze all chromosomes in parallel. The runtime of Step 2 varied from 7 to 140 minutes for mentioned use cases. The resources used are relatively stable across the tested scenarios and the default values are established to 3700 MB RAM and 40 GB disk.
How to Cite
If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
Last updated
Was this helpful?