GWAS guide using Alzheimer's disease

Explore one of the ways to conduct a GWAS on the Research Analysis Platform.

circle-info

The code provided in the following tutorial is delivered "As-Is." Notwithstanding anything to the contrary, DNAnexus will have no warranty, support or other obligations with respect to Materials provided hereunder. The MIT License applies to this guide.

Introduction

This tutorial will provide the Python code needed to conduct the QC portion of a genome-wide association study (GWAS) focusing on discovering variants associated with Alzheimer's Disease (AD) in JupyterLab. For this GWAS, we used UK Biobank data to search for genetic loci associated with AD. For more detailed information about the AD study, see this blog postarrow-up-right.

One of the motivations for performing this study was to demonstrate the process of running an end-to-end GWAS on the UK Biobank Research Analysis Platform. While researchers can bring their own programming languages and tools to the Platform, the Platform’s user interface enables researchers to run analyses easily and quickly. We wanted to demonstrate that conducting a GWAS could be done without the knowledge of sophisticated command-line methods.

Overview

Here is a brief overview of the steps that we conducted to perform this GWAS:

  1. In JupyterLab:

    1. Access data and construct phenotype.

    2. Perform sample QC

  2. In CLI or UI:

    1. Conduct LiftOver.

    2. Conduct GWAS using regeniearrow-up-right.

  3. In UI:

    1. Visualize analysis results using LocusZoomarrow-up-right.

Accessing the Data

The genetic data needed to run this GWAS is stored in multiple folders in the Research Analysis Platform’s file structure, which is detailed herearrow-up-right. For this analysis, we use the array genotype and whole exome sequencing data stored in the “/Bulk/Genotype Results/Genotype calls/” and “/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/” folders, respectively. Phenotypic data needed to collect the AD phenotype is located in the SQL database named “app<APPLICATION-ID>_<CREATION-TIME>”, and can be accessed via the corresponding dataset named “app<APPLICATION-ID>_<CREATION-TIME>.dataset” in the root (“/”) folder.

Pre-GWAS Steps

The first part of this analysis will consist of conducting sample QC and processing the UK Biobank data fields to calculate the individual’s AD risk by proxy.

Sample QC Steps and Derived Case/Control Phenotype

We conducted our QC using a Jupyter Lab spark node. Here is the following configuration:

  • Cluster configuration: Spark Cluster

  • Instance type: mem1_hdd1_v2_x4

  • Number of Nodes: 2

Getting the Data

We first imported dxpy to find and extract the UKB datasets on the Research Analysis Platform.

The extracted UKB data is returned as 3 .csv files. The data_dictionary.csv contains a table with participants as the rows and metadata along the columns, including field names (see table below for naming convention). The codings.csv contains a lookup table for the different medical codes, including ICD-10 codes that will be displayed in the diagnosis field column (p41270). The entity_dictionary.csv contains the different entities that can be queried, where participant is the main entity that corresponds to most phenotype fields.

To access the data_dictionary.csv, use the command:

Selecting Participant Fields by Field Index, Instance Index, or Array Index

For the main participant phenotype entity, the Research Analysis Platform uses field names with the following convention:

Type of field

Syntax for field name

Example

Neither instanced nor arrayed

p<FIELD>

p31

Instanced but not arrayed

p<FIELD>_i<INSTANCE>

p40005_i0

Arrayed but not instanced

p<FIELD>_a<ARRAY>

p41262_a0

Instanced and arrayed

p<FIELD>_i<INSTANCE>_a<ARRAY>

p93_i0_a0

Data-Field ID and Corresponding Description

Using the UKB showcase, we located the following field IDs of interest from UKB showcasearrow-up-right. We note them below so we can easily refer to them throughout the rest of the notebook.

The full list of all data-fields used in our case studyarrow-up-right:

Data-Field ID

Description

Data-Coding

Father's age at death

Units of measurement are years or 100291arrow-up-right

Mother's age

Units of measurement are years or 100291arrow-up-right

Father's age

Units of measurement are years or 100291arrow-up-right

Mother's age at death

Units of measurement are years or 100291arrow-up-right

Age at recruitment

Units of measurement are years

Genetic principal components

Units of measurement are Units; Array indices run from 1 to 40

Sex chromosome aneuploidy

Genetic kinship to other participants

Outliers for heterozygosity or missing rate

Diagnoses - secondary ICD10

Extract Data-Fields Used for Deriving Phenotype and Sample QC

The fields we want include principal components as well as the fields we need to construct our phenotype. We use the following function to collect all the field names and create a list of field IDs:

If you view these field names now, each will look something like this: p<field_id>_iYYY_aZZZ.

Now we can retrieve our necessary fields as a Pandas DataFrame:

The dispensed_dictionary.csv file contains a table with participants as rows and the fields of interest as columns. From this point onward, we are interacting with the data in a Pandas DataFrame.

Use this command to visualize your table:

We will format the column headers to drop the "participant" prefix for the field names:

Extract ICD-10 Codes for AD by Patterns (G30 and F00)

Here we resolve the ICD-10 codes for AD from the dataset. This list of codes will be used later for defining the phenotype.

Your output may look like the following:

Derive the Phenotype (AD-by-Proxy)

The AD phenotype is generated using both a participant’s disease status and that participant's parental disease status and age. We are following the definition of AD-by-proxy described in this articlearrow-up-right by Jansen et al.

Here is a summary of the steps of the process. There are two calculation methods:

  1. Participant's disease status.

    • If the participant has been diagnosed with AD. Count risk as the maximum value, 2.

  2. Additive of participant's parents' risk, determined by disease status along with parent's age.

    • If a biological parent has been diagnosed with AD, count the risk as a 1. The risk is set to 2 if both biological parents were diagnosed with AD.

    • Otherwise, get the parent' risk by the parents age (recorded age or age at death, whichever is older):

      • risk = max( 0.32, (100 - age)/100 )

      • 0.32 is the general risk for an adult to be diagnosed with AD

  3. Take the maximum of steps 1 and 2.

To get the participant's disease status, we look for the participant’s primary or secondary ICD-10 codes. If a participant has ICD-10 codes for AD, we record their AD risk as “2”. Note that we first need to format the values in the columns to be a list object instead of a string.

Now we look at the illness statuses of the participant’s mother and father. We first get a list of illnesses with which the participants' biological parents have been diagnosed.

Data-fields 20107 and 20110 record illnesses of each participant's father and mother. If diagnosed with AD, assign the parent's risk as "1":

UKB participants attended multiple measurement-taking sessions, so there are multiple fields for the participant’s parents’ ages. Since the risk of developing AD increases by age, we only need the maximum value for each parent’s age to calculate the AD risk:

If the parent does not have any AD diagnosis, calculate the parent’s risk using their age, as shown in the following, then organize the data into a table:

Perform Sample QC and Determine Phenotype

Now we conduct the Sample QC steps. We filtered for participants for whom reported sex and genetic sex are the same, ancestry is listed as "White British," and there is no sex chromosome aneuploidy. We filter out for outliers that are heterozygous and missing rates, or are distinguished excessive kinship, i.e. have ten or more third-degree relatives. Participants who had parental ages as “do not know" or "prefer not to answer" were also removed from the cohort:

Use this command to check the resulting number of samples:

Visualizing the Distribution and Setting the Case/Control Threshold

Since GWAS analyses have been run before for AD, and the results have been documented, we took the extra step of visualizing the distribution of cases with a histogram, to double check that our results looked reasonable in a general population. Our histogram looked reasonable, so we continued with our analysis:

Historgram of the values of AD-by-proxy.

Seeing the distribution, we decide to set the case/control threshold with AD risk at 1, with the range going from 0 to 2. All participants with a value for 1 or above for “AD-by-proxy” are considered cases, and those below 1 are considered to be controls:

Cleaning up the Pandas DataFrame and Saving to Your Project

Next we cleaned up the DataFrame to set up as input with regeniearrow-up-right. We renamed the columns for human-readability and added an FID column, which is the required format for regenie. We also convert the list of parental illnesses to a string, for ease of formatting:

At this point, your table should look something like this, noting that displayed data values are placeholders, so to not disclose actual values:

IID

sex

age

missingness

ethnic_group

sex_chromosome_aneuploidy

0

0000011

1

45

0.003

1

None

1

0000022

0

67

0.022

1

None

2

0000003

1

50

0.001

1

None

To upload your phenotype file back to the project:

Now the QC steps are complete.

The next steps for the GWAS is to conduct variant QC and the array-genotype LiftOver.

Variant QC Steps and Array-Genotype LiftOver

LiftOver

Note that the UKB SNP array genotype calls (bulk, genotype calls) are mapped to GRCh37arrow-up-right, while the whole-exome sequences are mapped to GRCh38arrow-up-right, the most recent version. We chose to lift the coordinates of genotype calls over to GRCh38 using LiftOverarrow-up-right before conducting the rest of the analysis, so that our genotype calls are consistent.

The following pipelinearrow-up-right may be useful in conducting this part of the analysis.

Variant QC

To conduct variant QC, we use PLINK2, available as part of the Swiss Army Knife app, to create a list of variants that pass the QC threshold for array genotype and WES variant data. To access it from the UK Biobank Research Analysis Program, open your project, then click Start Analysis. From the list of tools that appear, click Swiss Army Knife. The following shows what you will put as input string under the cmd option of the Swiss Army Knife app.

For more information about PLINK2 see the official PLINK2 documentationarrow-up-right.

Array Genotype Data

The input files to our variant QC job for running the array genotype data is:

  • The UK Biobank genotype call files in .bed, .bim, and .fam formats. We merged the chromosomes together into one file for each file format in preparation for running the whole genome regression model using Step 1 of regenie.

Below are the Swiss Army Knife cmd inputs used to run the job:

The outputs we get from this step are:

  • A list of SNPs that pass QC, final_array_snps_CRCh38_qc_pass.snplist

  • A list of sample IDs that pass QC, final_array_snps_CRCh38_qc_pass.id

  • A log file (.log)

WES Data

The WES QC is performed using the bgens_qc.wdl script found herearrow-up-right. The variants are QC’d per chromosome and then concatenated at the end to create a single list of variants to keep. The variants were filtered using following options:

The input file to run the bgens_qc.wdl was created by first running the generate_inputs.ipynb notebook shown herearrow-up-right. Within this notebook you’ll need to update the following parameters:

Run the WDL workflow using the .json file created in the generate_inputs.ipynb notebook by running the command: dx run <path_to_folder_on_UKB_RAP>/bgens_qc -f research_project/bgens_qc/bgens_qc_input.dx.json. Otherwise you can run the WDL workflow in the UI and specify the inputs manually instead of using the Jupyter notebook.

The outputs we get from this step are:

  • A list of SNPs that pass QC, final_WES_snps_CRCh38_qc_pass.snplist

  • A list of sample IDs that pass QC, final_WES_snps_CRCh38_qc_pass.id

  • A log file (.log)

GWAS using regenie

We will be using regeniearrow-up-right to conduct our GWAS. DNAnexus provides a suite of apps to run the analysis as a whole. The Regenie orchestration apparrow-up-right is used to facilitate the run of two required steps as well as input preprocessing, validation, resource allocation and post-processing (including creation of Manhattan plots per trait).

regenie Step 1

For Step 1, we will estimate how background SNPs contribute to the phenotype.

The inputs used for Step 1 are:

  • A merged array genotype calls: ukb_c1-22_hs38DH_merged.[bed,bim,fam] that we also used as our input in the variant QC step above.

  • Our phenotype file: ad_risk_by_proxy_wes.phe

  • SNPs that pass QC list: final_array_snps_CRCh38_qc_pass.snplist and the SNP ID file, final_array_snps_CRCh38_qc_pass.id

The following outputs from Step 1 are the necessary inputs for Step 2:

  • LOCO file: ukb_c1-22_hs38DH_merged_1.loco

  • Prediction list: ukb_c1-22_hs38DH_merged_pred.list

regenie Step 2

Then in Step 2, linear or logistic regression is used to test association between the variants and phenotype conditional upon the prediction from the regression model in Step 1. In Step 2, the orchestration app distributes each set of BGEN files into a separate job, as each variant is independently tested to the phenotype, and this parallelization helps accelerate the end-to-end runtime.

The inputs to the second step are:

  • 23 sets of WES genotype files (ukb23159_c[1-22,X]_b0_v1.[bgen,bgi,sample]).

  • ID file of SNPs that pass QC: final_WES_snps_CRCh38_qc_pass.id

  • Phenotype file: ad_risk_by_proxy_wes.phe

  • Prediction list: ukb_c1-22_hs38DH_merged_pred.list (from Step 1)

  • LOCO file: ukb_c1-22_hs38DH_merged_1.loco (from Step 1).

After the completion of Step 2, a set of association plots is generated. This concatenation is done per phenotype. On the output of the orchestration app are concatenated Manhattan plots as well as statistic in lmm.tsv.gz (Extra output files/extra_outputs ) that can be investigated with LocusZoomarrow-up-right.

Running regenie in the UI

We used the following input options to run the regenie app using the UIarrow-up-right. If an input option isn’t specified, then the default option was used.

Inputs:

Genotype BED for Step 1: ukb_c1-22_hs38DH_merged.bed

Genotype BIM for Step 1: ukb_c1-22_hs38DH_merged.bim

Genotype FAM for Step 1: ukb_c1-22_hs38DH_merged.fam

Genotype BGEN files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen

Genotype BGI index files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen.bgi

Sample files for Step 2: ukb23159_c[1-22,X]_b0_v1.sample

Phenotypes file: ad_risk_by_proxy_wes.phe

Covariates file: ad_risk_by_proxy_wes.phe

Variant IDs to extract (Step 1): final_array_snps_CRCh38_qc_pass.snplist

Variant IDs to extract (Step 2): final_WES_snps_CRCh38_qc_pass.snplist

Common

Quantitative traits: FALSE

Phenotypes to include: ad_by_proxy

Array of covariates to apply: age,sex,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10

WGR (Step 1)

Size of the genotype blocks (Step 1): 1000

First allele as reference (Step 1): TRUE

Association Testing (Step 2)

Size of the genotype blocks (Step 2): 200

Firth approximation instead of SPA: TRUE

Minimum allele count (MAC): 3

Running regenie in the CLI

To run regenie in the CLI, do not directly copy and paste this portion of code. Fill in the file paths specified by <> and add file paths for the chromosomes you want to analyze using the igenotype_bgens/bgis/samples.

LocusZoom

Next you can use LocusZoom to visualize the results of your analysis. The LocusZoom app is available in the Tools Library of the Research Analysis Platform. It generates Manhattan plots, QQ plots, and a table of high p-value loci.

To visualize your regenie outputs (e.g. the lmm.tsv.gz file), provide it as input to LocusZoom, and follow the app's instructions. From these results you can conduct further analyses and look into variant effects, or you may find new loci to explore.

Examples of these visualizations are shown below.

Manhattan plot of GWAS results of Alzheimer's Disease from WES variants of 500k participants from UKB.
GWAS results of Alzheimer's Disease around the APOE gene, which is a known association.

Benchmarking

A few example cases that might be interesting were tested to provide expectations on the Regenie runtime.

Step 1

The runtime and resources of various scenarios for Step 1 execution are summarized in the following table. Please note that this is only for orientational purposes and analysis-specific factors can influence the final results.

Step 2

Focusing on chromosome 1 (99236 variants after QC) as we analyze all chromosomes in parallel. The runtime of Step 2 varied from 7 to 140 minutes for mentioned use cases. The resources used are relatively stable across the tested scenarios and the default values are established to 3700 MB RAM and 40 GB disk.

How to Cite

If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.comarrow-up-right)" in your work.

Last updated

Was this helpful?