Genomic Target Discovery for Ischaemic Heart Disease using GWAS, LD Clumping, and PheWAS

The code provided in the following tutorial is delivered "As-Is." Notwithstanding anything to the contrary, DNAnexus will have no warranty, support or other obligations with respect to Materials provided hereunder. The MIT License applies to this tutorial.

Note that all run time estimates and configurations given in this tutorial are not guaranteed; they are provided only as a reference.

Introduction

This research project provides an example of performing end-to-end genomic target discovery on UK Biobank data, using ischaemic heart disease as an example phenotype. Ischaemic heart disease was chosen because of its large cohort size and because previous GWAS studies exist for this phenotype, allowing for simpler comparison of results. The first step in this analysis was to create the case and control cohorts. After that, sample data (phenotypic data) from the cohorts and genomic data (array and imputed) were cleaned. Then a genome-wide association study (GWAS) was performed and significant GWAS variants were aggregated using the linkage disequilibrium (LD) clumping approach. Lastly, a phenome-wide association study (PheWAS) was performed for each variant.

The original analysis and guide were created by Anastazie Sedlakova, and there is a corresponding YouTube tutorial here.

This tutorial demonstrates how to:

  • Create case and control cohorts

  • Perform sample QC in JupyterLab

  • Lift over array data by compiling and running a WDL workflow

  • Perform quality control of array data using Swiss Army Knife

  • Perform quality control of imputed data by compiling and running a WDL workflow

  • Run GWAS analysis using the REGENIE app

  • Perform LD clumping on imputed data in JupyterLab

    • Extract variant allele counts from imputed data in JupyterLab

  • Run PheWAS analysis in JupyterLab

    • Create a phenome dataset from an ICD10 field

This analysis was done as part of the UKB 46926 research application.

Data

Genome Data

For this GWAS, two types of genomic data were used: array data (field 22418) and imputed data (Genomics England) (field 21008). For linkage disequilibrium (LD) clumping, only imputed data were used.

Phenotype Data

First, ischaemic heart disease was chosen as the phenotype of interest (ICD10 codes I20-I25). Fields covering sex, genetic sex, ethnic background, sex chromosome aneuploidy, and whether a participant was used in the PCA calculation were also retrieved for the sample QC (see the criteria in the Sample QC section below).

Covariates were added to both the GWAS and PheWAS analyses to increase power and reduce confounding. The covariates used were sex, age, BMI, ever smoked, HDL cholesterol, LDL cholesterol, hypertension, and the first ten genetic principal components (pc1-pc10).

The phenotype (ischaemic heart disease) being used in this analysis and other phenotypes for PheWAS testing were taken from the 41270 - Diagnoses - ICD10 field.

Creating the Cohort

Once the cohorts are created, they can be used for subsequent analysis (GWAS). In the present example the phenotype definition is simple, but phenotype definitions are often quite complex and can include multiple fields. The defined cohorts will be combined in the QC step.

Overview

First, the Cohort Browser is used to create the selected cohort. The "ischemic_cases" cohort was created by selecting participants that have I20-I25 Ischaemic heart disease in field 41270. The "ischemic_controls" cohort was created by applying "Cohorts compare: not" to "ischemic_cases". This results in 57,383 "ischemic_cases" and 445,027 "ischemic_controls".

Steps (UI)

  1. On RAP, go to the Manage tab and click on the name of the dataset you wish to use. This will take you to the Cohort Browser page.

  2. Then, click the Add Filter button. Type in “ischaemic heart disease” into the search bar and select “Diagnoses - ICD10”, which has data-field 41270. Confirm your selection by clicking on Add Cohort Filter.

  3. In the modal prompt for “Includes any of”, type “I20-I25 Ischaemic heart disease” and select that option. Then click the Apply Filter button.

  4. Name the cohort. In this example, the name is “ischemic_cases” (Note: The example cohort name uses the American spelling, however mentions of the specific UKB data-field will use the UK spelling).

Creating the Control

  1. To create the control group, click on the “+” (mouse-over: “Compare/combine cohort”) button.

  2. From the drop-down menu, select the “not in ischemic_cases” option and apply this filter.

  3. Name and save this cohort. In this example, the "ischemic_cases" cohort had 57,383 samples and the "ischemic_controls" cohort had 445,027 samples.

For additional information about creating cohorts, see this detailed tutorial on how to explore data in Cohort Browser.

Sample QC

Cleaning samples decreases noise in the data and increases the accuracy of the GWAS results. For example, checking for sex discordance and sex chromosome aneuploidy removes possible sample swaps and genotyping errors. Population substructure is minimized by selecting just one population (White British). To avoid confounding from relatedness, only unrelated participants were kept, i.e. the samples that were used to calculate the PCA.

Overview

The code for the sample QC can be found in gwas-phenotype-samples-qc.ipynb. Phenotypic data was retrieved for each cohort as a table using dx extract_dataset (a minimal command sketch follows the criteria below). Then, samples were selected using the following criteria:

  • Sex and genetic sex are the same

  • Participant has White British ancestry

  • Not sex chromosome aneuploidy

  • Participant was used to calculate PCA (only non-relatives were included)

After filtering, 38,197 and 298,886 samples remained in "ischemic_cases" and "ischemic_controls", respectively.
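As a rough illustration of the retrieval step, a cohort's phenotype fields can be pulled from the command line with dx extract_dataset. This is a minimal sketch: the record ID and the field names (sex, genetic sex, sex chromosome aneuploidy, used-in-PCA) are placeholders, not the notebook's exact inputs.

# Hypothetical field list: p31 (sex), p22001 (genetic sex),
# p22019 (sex chromosome aneuploidy), p22020 (used in PCA calculation)
dx extract_dataset record-xxxx \
  --fields "participant.eid,participant.p31,participant.p22001,participant.p22019,participant.p22020" \
  --delimiter "," \
  --output ischemic_cases.csv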

Sample Phenotype Imputation

After QC, missing covariate values were imputed as part of the GWAS data preparation. The covariates, and a summary of how the notebook code cleans each of them, are listed below:

  • Age high blood pressure diagnosed in participant (2966) field

    • The code combines all instances and creates a boolean hypertension variable

  • Participant ever smoked

    • The code sets all missing values to 0

  • Participant body mass index (BMI), HDL cholesterol, direct LDL

    • The code replaces missing values with the mean for that variable

Phenotypic data was then merged with the list of samples for which imputed data is available. The resulting table contained 38,197 cases and 298,886 controls.

JupyterLab Configuration

The next step is to use JupyterLab to run the GWAS step. The JupyterLab configuration used when running this example GWAS is listed below, however, individual user preferences may cause costs and runtime to vary.

  • Cluster configuration: Single node

  • Recommended instance: mem1_ssd1_v2_x36

  • Run time: Approximately 5 minutes

Steps (UI)

  1. On your local machine, clone the repository for shared UKB Jupyter notebooks here.

  2. Upload the Jupyter notebook to your project directory on RAP: dx upload gwas-phenotype-samples-qc.ipynb --destination <directory path>

  3. On RAP, launch JupyterLab. The recommended configuration is listed above.

  4. Once JupyterLab is open, click on the DNAnexus tab on the left hand side, and open the gwas-phenotype-samples-qc.ipynb notebook.

  5. Once the notebook is open, go to the JupyterLab menu located at the top of the page, click the Run tab and click Run All Cells.

Array Data LiftOver

Array data provided by UKB is mapped to an older version of the reference genome (GRCh37), whereas the released WES and WGS data are mapped to the current version, GRCh38. In order to perform association testing alongside the sequencing data, the array data must first be lifted over to GRCh38 using the liftOver script.

Overview

For this step, the liftOver WDL script created by Yih-Chii Hwang was used. This script lifts the data to the newer reference genome and merges all chromosomes. The liftOver itself is performed using Picard LiftoverVcf. This step took approximately 34 hours.
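For reference, the core Picard operation that the workflow wraps looks roughly like the command below; the chain file and file names are illustrative placeholders rather than the workflow's exact inputs:

# Lift variants from GRCh37 to GRCh38; records that fail to lift are
# written to the REJECT file for inspection.
java -jar picard.jar LiftoverVcf \
  I=genotypes_grch37.vcf.gz \
  O=genotypes_grch38.vcf.gz \
  CHAIN=GRCh37_to_GRCh38.chain.gz \
  REJECT=rejected_records.vcf.gz \
  R=GRCh38_full_analysis_set_plus_decoy_hla.fa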

Steps (CLI/UI)

  1. On your local machine’s terminal, install Java:

    1. (Mac OS): brew install openjdk

    2. (Linux OS): apt install default-jre

  2. Download the latest JAR compiler here.

  3. Log in to the Platform using the dx login command. Then, use the following command to compile the workflow and make it available for use on the Platform: java -jar dxCompiler-2.10.8.jar compile liftover_plink_beds.wdl -project project-xxxx -folder <directory path>

  4. On RAP you will now find:

    1. A workflow called liftover_plink_beds

    2. Applets created that correspond to the steps of the workflow

  5. In the UI, launch the liftover_plink_beds workflow and use the following input parameters:

    1. Common parameters to specify:

      1. plink_beds: /Bulk/Genotype Results/Genotype calls/*.bed (22 files)

      2. plink_bims: /Bulk/Genotype Results/Genotype calls/*.bim (22 files)

      3. plink_fams: /Bulk/Genotype Results/Genotype calls/*.fam (22 files)

      4. reference_fastagz: /Bulk/Exome sequences/Exome OQFE CRAM files/helper_files/GRCh38_full_analysis_set_plus_decoy_hla.fa

See a detailed tutorial on how to work with WDL on DNAnexus.

Variant QC

Cleaning genotypic data, as in the case of sample QC, reduces noise and increases the accuracy of the GWAS results. In this step we also check for deviation from Hardy-Weinberg equilibrium in order to detect genotyping errors. To clean the data, variants and samples with high missingness rates are excluded. The GWAS is performed on autosomes only, so variants on sex chromosomes are excluded.

Array Data QC

Code for the array data QC is in the run_array_qc.sh script. For QC filtering, array data (field 22418) was used. Variants in the array data were filtered with the Swiss Army Knife (SAK) app using the following options:

plink2 --bfile ukb_c1-22_merged --keep ischemia_df.phe --autosome --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-15 --mind 0.1 --write-snplist --write-samples --no-id-header --out imputed_array_snps_qc_pass

Below are criteria used for the filtering:

  • Keep only samples contained in the ischemia_df.phe file: --keep ischemia_df.phe

  • Keep only variants located on autosomes: --autosome

  • Minor allele frequency is greater than 0.01: --maf 0.01

  • Minor allele count is greater than 100: --mac 100

  • The missing call rate per variant does not exceed 0.1: --geno 0.1

  • The missing call rate per sample does not exceed 0.1: --mind 0.1

  • The Hardy-Weinberg equilibrium exact test p-value for the variant is greater than 1e-15: --hwe 1e-15

    • This threshold is used because serious genotyping errors often yield extreme p-values.

For more information on filtering, see PLINK2 documentation.
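After the job finishes, a quick sanity check (assuming the commands below are run in the folder containing the outputs) is to count the variants and samples that passed QC:

# Variants passing QC (one ID per line)
wc -l imputed_array_snps_qc_pass.snplist
# Samples passing QC (written by --write-samples; --no-id-header omits the header line)
wc -l imputed_array_snps_qc_pass.id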

SAK Configuration

  • Priority: Normal

  • Recommended instance: mem1_ssd1_v2_x36

  • Run time: Approximately 10 minutes

Steps (CLI/UI)

For the CLI, use the command below:

  1. On your local machine run: sh run_array_qc.sh

In the UI:

  1. After logging into the Research Analysis Platform, navigate to the Tools Library.

  2. Select the “Swiss Army Knife” app and click Run to run it in your desired project.

  3. Copy and paste the following into the command prompt:

    1. plink2 --bfile "/mnt/project/<file path>/ukb_c1-22_merged" --keep "/mnt/project/<file path>/ischemia_df.phe" --autosome --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-15 --mind 0.1 --write-snplist --write-samples --no-id-header --out imputed_array_snps_qc_pass

Imputed Data QC

The code for the imputed data QC is in the bgens_qc.wdl script created by Yih-Chii Hwang. For QC filtering, imputation from genotype (GEL) data (field 21008) was used. Variants were filtered using the following options: --mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1

Below are the criteria used for the filtering:

  • Keep only samples contained in the ischemia_df.phe file: --keep ischemia_df.phe

  • Minor allele frequency is greater than 0.0001: --maf 0.0001

  • Minor allele count is greater than 10: --mac 10

  • The missing call rate per variant does not exceed 0.1: --geno 0.1

  • The missing call rate per sample does not exceed 0.1: --mind 0.1

  • The Hardy-Weinberg equilibrium exact test p-value for the variant is greater than 1e-15: --hwe 1e-15

    • Deviations from Hardy-Weinberg equilibrium often indicate genotyping error.

For more details on filtering, see PLINK2 documentation. This script ran for approximately 10 hours when normal priority was used.
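For orientation, the per-chromosome filtering inside the WDL task is roughly equivalent to the plink2 call sketched below; the file names and the ref-first allele ordering are assumptions rather than confirmed settings of the script:

# Per-chromosome QC on BGEN input (chromosome 1 shown)
plink2 \
  --bgen ukb21008_c1_b0_v1.bgen ref-first \
  --sample ukb21008_c1_b0_v1.sample \
  --keep ischemia_df.phe \
  --mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1 \
  --write-snplist --write-samples --no-id-header \
  --out c1_qc_pass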

Steps (CLI/UI)

  1. The installation of Java is needed to run dxCompiler. See above for Java and dxCompiler download instructions.

  2. Log in to the Platform. Then, compile the workflow to make it available on DNAnexus: java -jar dxCompiler-2.10.8.jar compile bgens_qc.wdl -project project-xxxx -folder <directory path>

  3. On RAP, you’ll now find:

    1. A workflow called bgens_qc

    2. Applets created that correspond to the tasks/steps of the workflow.

  4. In the UI, launch the bgens_qc workflow and use the following input parameters:

    1. Common:

      1. geno_bgen_files: /Bulk/Imputation/Imputation from genotype (GEL)/*.bgen (22 files)

      2. geno_sample_files: /Bulk/Imputation/Imputation from genotype (GEL)/*.sample (22 files)

        1. Note that you will need to correct the sample file header.

      3. output_prefix: gel_imputed_snps_data_qc_pass

      4. Keep_file: ischemia_df.phe

      5. plink2_options: --mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1

See a detailed tutorial on how to work with WDL.

CLI Steps

  1. On your local machine, clone the UKB JupyterLab notebooks repository.

  2. Upload the Jupyter notebook to your project directory on RAP: dx upload generate_inputs.ipynb --destination <directory path>

  3. From the Research Analysis Platform, launch JupyterLab. The recommended configuration is listed above.

  4. Once JupyterLab is open, click on the DNAnexus tab in the left menu and open the generate_inputs.ipynb notebook. Once the notebook is open, go to the JupyterLab menu at the top of the page, click the Run tab, and then click Run All Cells.

  5. Running the generate_inputs.ipynb notebook will generate bgens_qc_input.json (a sketch of its shape follows these steps). You may need to generate new sample files; see the discussion in the community forum for more details.

  6. Log into the DNAnexus Platform and navigate to the UKB_RAP repository (see Step 1) to compile the workflow: java -jar dxCompiler-2.10.8.jar compile bgens_qc.wdl -project project-xxxx -folder <directory path>

  7. Compile the WDL workflow into a DNAnexus native workflow and load it onto the Platform: java -jar <path_to_downloaded_dxCompiler>dxCompiler-2.XX.X.jar compile research_project/bgens_qc/bgens_qc.wdl -project project-XXX -inputs research_project/bgens_qc/bgens_qc_input.json -archive -folder '<path_to_folder_on_UKB_RAP>'

    1. The -folder option takes a file path, which is where the workflow files will be stored. This code assumes that you are in the UKB_RAP directory in the terminal on your local machine. Compiling with -inputs will generate bgens_qc_input.dx.json in the working directory.

  8. To run the workflow, type the following command in the terminal: dx run <path_to_folder_on_UKB_RAP>/bgens_qc -f research_project/bgens_qc/bgens_qc_input.dx.json
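For reference, the input JSON generated in step 5 has roughly the following shape. The parameter names follow the inputs listed in the UI section above, but the project ID, file paths, and exact key spelling are placeholders:

# Hypothetical bgens_qc_input.json (dxCompiler accepts dx:// URIs for files)
cat > bgens_qc_input.json <<'EOF'
{
  "bgens_qc.geno_bgen_files": [
    "dx://project-xxxx:/Bulk/Imputation/Imputation from genotype (GEL)/ukb21008_c1_b0_v1.bgen"
  ],
  "bgens_qc.geno_sample_files": [
    "dx://project-xxxx:/Bulk/Imputation/Imputation from genotype (GEL)/ukb21008_c1_b0_v1.sample"
  ],
  "bgens_qc.output_prefix": "gel_imputed_snps_data_qc_pass",
  "bgens_qc.keep_file": "dx://project-xxxx:/ischemia_df.phe",
  "bgens_qc.plink2_options": "--mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1"
}
EOF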

GWAS

A GWAS checks for association between variants and the selected phenotype. Firth correction is used because the dataset is unbalanced (i.e., there are fewer cases than controls). The additive test is selected as the recommended first approach when running GWAS using REGENIE.

Overview

The GWAS analysis was done using the REGENIE app. The quality-controlled array data was used for step 1, and the quality-controlled imputed data was used for step 2. When running the REGENIE app, the following options were used:

  • Step 2

    • SPA instead of Firth approximation? False

    • Test type: Additive

    • Run time: approximately 7 hours

After the analysis, 2,549 of the 36,695,747 tested variants had a significant association with the phenotype.
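For context, the two steps the REGENIE app runs correspond roughly to the regenie CLI calls sketched below. The flags are standard regenie options matching the settings above (binary trait, Firth approximation rather than SPA, covariates from the Sample QC step), but the file names, phenotype file layout, and block sizes are assumptions:

# Step 1: whole-genome regression on the QC-passed, lifted array data
regenie --step 1 \
  --bed ukb_c1-22_merged_grch38 \
  --extract imputed_array_snps_qc_pass.snplist \
  --phenoFile ischemia_df.phe --phenoCol ischemia_cc \
  --covarFile ischemia_df.phe \
  --covarColList sex,age,bmi,ever_smoked,hdl_cholesterol,ldl_cholesterol,hypertension,pc{1:10} \
  --bt --bsize 1000 \
  --out step1

# Step 2: per-variant association tests on the QC-passed imputed data,
# with Firth correction for the unbalanced case/control ratio
regenie --step 2 \
  --bgen ukb21008_c1_b0_v1.bgen --sample ukb21008_c1_b0_v1.sample \
  --extract gel_imputed_snps_data_qc_pass.snplist \
  --phenoFile ischemia_df.phe --phenoCol ischemia_cc \
  --covarFile ischemia_df.phe \
  --covarColList sex,age,bmi,ever_smoked,hdl_cholesterol,ldl_cholesterol,hypertension,pc{1:10} \
  --bt --firth --approx \
  --pred step1_pred.list \
  --bsize 400 \
  --out gwas_ischemia_c1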

Steps (UI)

  1. Log in to the Research Analysis Platform and run the REGENIE app.

  2. In the Analysis Settings tab, select the "Execution Output Folder". Then set the following inputs:

    1. Genotype BED for Step 1: path to array data after liftOver

    2. Genotype BIM for Step 1: path to array data after liftOver

    3. Genotype FAM for Step 1: path to array data after liftOver

    4. Genotype BGEN files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/*.bgen (22 files)

    5. Genotype BGI index files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/*.bgi (22 files)

    6. Sample files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/*.sample (22 files)

      1. Note: You will need to correct the sample file header.

    7. Phenotypes file: phenotype file generated in the Sample QC step

    8. Variant IDs to extract (Step 1): snplist file generated in the Array data QC step (1 file)

    9. Variant IDs to extract (Step 2): snplist file generated in the Impute data QC step (1 file)

    10. In the ASSOCIATION TESTING (STEP 2):

      1. Ensure that "SPA instead of Firth approximation?" is set to False

    11. In the COMMON section:

      1. Quantitative traits?: False

      2. Phenotypes to include: ischemia_cc

      3. Array of covariates to apply: sex, age, bmi, ever_smoked, hdl_cholesterol, ldl_cholesterol, hypertension, pc1, pc2, pc3, pc4, pc5, pc6, pc7, pc8, pc9, pc10

    12. In the App Settings tab, select the instance type. The recommended instance is mem1_ssd1_v2_x36

    13. Then click Start Analysis to begin.

Example results of this GWAS are shown in the table below.

Table 1. Number of significant GWAS variants divided by chromosome

Chromosome | Number of significant variants
1  | 115
2  | 598
4  | 1
6  | 658
7  | 3
8  | 27
9  | 204
10 | 11
11 | 13
12 | 104
13 | 1
15 | 437
16 | 63
17 | 140
19 | 149
21 | 7

LD Clumping

LD clumping is the next step in the process; it reduces the number of significant GWAS variants by clumping genetically linked variants together and reporting only the most significant variant from each clump.

Overview

Code for the LD clumping is in run_ld_clumping.ipynb. In this notebook, significant variants are extracted from the GWAS report. Then, for each chromosome, the significant variants are extracted from the imputed BGEN files and converted to PLINK files, and LD clumping is performed using PLINK (a command sketch is shown below).

Out of 2,531 variants, 82 remained as index variants after clumping.
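A sketch of the per-chromosome clumping call is shown below. It uses PLINK 1.9's --clump; the column names assume the GWAS report provides variant IDs in an ID column and p-values in a P column, and the thresholds shown are common defaults rather than the notebook's confirmed values:

# Clump significant variants on chromosome 6 (file names are placeholders)
plink \
  --bfile chr6_significant \
  --clump gwas_report_chr6.txt \
  --clump-snp-field ID --clump-field P \
  --clump-p1 5e-8 --clump-r2 0.1 --clump-kb 250 \
  --out chr6_clumped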

JupyterLab Configuration

  • Cluster configuration: Single node

  • Recommended instance: mem2_ssd1_v2_x32

  • Run time: approximately 20 minutes

Steps:

  1. On your local machine, clone the UKB JupyterLab notebooks repository.

  2. Upload the Jupyter notebook to your project directory on RAP: dx upload run_ld_clumping.ipynb --destination <directory path>

  3. From the Research Analysis Platform, launch JupyterLab. The recommended configuration is listed above.

  4. Once your JupyterLab session is open, click on the DNAnexus tab on the left hand side and open the run_ld_clumping.ipynb notebook.

  5. From the JupyterLab menu at the top of the page, click the Run tab and click Run All Cells.

An example of a table of index variants after LD clumping is shown below.

Table 2. Number of index variants after LD clumping by chromosome

Chromosome | Number of index variants
1  | 8
2  | 9
4  | 1
6  | 29
7  | 1
8  | 1
9  | 4
10 | 2
11 | 2
12 | 8
13 | 1
15 | 5
16 | 2
17 | 2
19 | 6
21 | 1

PheWAS

A PheWAS examines associations between a variant and many phenotypes. Conducting a PheWAS can help distinguish whether comorbidities are caused by horizontal pleiotropy (one locus influencing multiple diseases) or by causality/vertical pleiotropy (one disease causing another).

Overview

Code for preparing the phenotype data can be found in get-phewas-data.ipynb. Code for running the PheWAS analysis is in run-phewas.ipynb. The phenome-wide association study was performed using the PheWAS R package. For this analysis, genotypic data for each of the variants selected by LD clumping (see the extraction sketch below), the covariates, and ICD10 phenotypes were used. To create the ICD10 phenotypes, the UKB Diagnoses - ICD10 (41270) field was used. A table was then created in which each row contains one phenotype (ICD10 diagnosis) for one participant; when a participant has multiple diagnoses, the participant's eid appears multiple times.
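As an illustration of the genotype-extraction step, additive allele counts for the clumped index variants can be exported from the imputed data with plink2's --export A, which writes a sample-by-variant allele-count matrix (.raw). The file names below are placeholders:

# Export allele counts for the chromosome 6 index variants
plink2 \
  --bgen ukb21008_c6_b0_v1.bgen ref-first \
  --sample ukb21008_c6_b0_v1.sample \
  --extract index_variants_chr6.txt \
  --export A \
  --out index_variant_counts_chr6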

The PheWAS was run in parallel for each variant with the following options:

  • Use p-value, Bonferroni, and FDR to calculate significance thresholds: significance.threshold = c("p-value", "bonferroni", "fdr")

  • Additive genotypes are not being supplied: additive.genotypes = FALSE

JupyterLab Configuration

  • Cluster configuration: Single node

  • Recommended instance: mem2_ssd1_v2_x32

  • Run time for data preparation: Approximately 20 min

  • Run time for PheWAS: Approximately 6 hours

Steps

  1. On your local machine, clone the UKB JupyterLab notebooks repository.

  2. Upload the Jupyter notebook to your project directory on RAP:

    1. dx upload get-phewas-data.ipynb --destination <directory path>

    2. dx upload run-phewas.ipynb --destination <directory path>

  3. On the Research Analysis Platform, launch JupyterLab. The recommended configuration is listed above.

  4. Once the JupyterLab session is open, click on the DNAnexus tab on the left hand side and open the get-phewas-data.ipynb and run-phewas.ipynb notebooks.

  5. From the JupyterLab menu at the top of the screen, click the Run tab and click Run All Cells.

The results from the PheWAS are summarized in the table below.

Table 3. Number of significant associations for each phenotype

Phenotype | Number of significant PheWAS associations
Essential hypertension | 17
Hypertension | 17
Hematuria | 4
Delirium dementia and amnestic and other cognitive disorders | 3
Dementias | 3
Colorectal cancer | 2
Disorders of fluid, electrolyte, and acid-base balance | 2
Hypovolemia | 2
Malignant neoplasm of rectum, rectosigmoid junction, and anus | 2
Cancer of prostate | 2
Malignant neoplasm of other and ill-defined sites within the digestive organs and peritoneum | 1
Atrial fibrillation and flutter | 1
Malignant neoplasm of gallbladder and extrahepatic bile ducts | 1
Cardiac dysrhythmias | 1
Urinary incontinence | 1
Hyperplasia of prostate | 1
Helminthiases | 1
Other symptoms/disorders of the urinary system | 1
Intestinal helminthiases | 1

26 of the 82 variants had a significant association with at least one phenotype. The phenotypes with the most significant PheWAS associations were essential hypertension and hypertension.

Conclusion

This tutorial demonstrated how UKB-RAP can be used for an end-to-end genomic target discovery pipeline. The analysis started with creating the ischemia cohorts using the Cohort Browser. QC was done using JupyterLab (samples), PLINK in Swiss Army Knife (array data), and a WDL script (imputed data). GWAS was performed using REGENIE, with the array and imputed data used for steps 1 and 2, respectively. Linkage disequilibrium clumping then extracted the most informative variant from each group of significant GWAS variants. In the final step, a PheWAS analysis was performed for each variant, and the phenotypes with the most significant PheWAS associations were found to be essential hypertension and hypertension.

Name | Description
gwas-phenotype-samples-qc.ipynb | Performs sample QC on the case and control cohorts and selects only those samples for which imputed data are available.
liftover_plink_beds.wdl | Lifts array data to the newer version of the reference genome (GRCh38) and merges the per-chromosome files into one file.
run_array_qc.sh | Performs QC on the lifted array PLINK files.
bgens_qc.wdl | Performs QC on the imputed data BGEN files and merges the results for individual chromosomes into one file.
run_ld_clumping.ipynb | Performs LD clumping on significant GWAS variants.
get-phewas-data.ipynb | Creates phenotype data in long format suitable for the PheWAS R package and extracts genotype data for each variant selected by LD clumping.
run-phewas.ipynb | Runs the PheWAS analysis using the PheWAS R package.
