AD-by-proxy GWAS Guide
Explore one of the ways to conduct a GWAS on the Research Analysis Platform.
The code provided in the following tutorial is delivered "As-Is." Notwithstanding anything to the contrary, DNAnexus will have no warranty, support or other obligations with respect to Materials provided hereunder. The MIT License applies to this notebook.


This tutorial will provide the Python code needed to conduct the QC portion of a genome-wide association study (GWAS) focusing on discovering variants associated with Alzheimer's Disease (AD) in JupyterLab. For this GWAS, we used UK Biobank data to search for genetic loci associated with AD. For more detailed information about the original AD study and how to use the Research Analysis Platform, see this blog post.
One of the motivations for performing this study was to demonstrate the process of running an end-to-end GWAS on the UK Biobank Research Analysis Platform. While researchers can bring their own programming languages and tools to the Platform, the Platform's user interface enables researchers to run analyses easily and quickly. We wanted to demonstrate that a GWAS can be conducted without knowledge of sophisticated command-line methods.


Here is a brief overview of the steps that we conducted to perform this GWAS:
  1. In JupyterLab:
    1. Access data and construct phenotype.
    2. Perform sample QC.
  2. In CLI or UI:
    1. Conduct LiftOver.
    2. Conduct variant QC using Swiss Army Knife.
    3. Conduct GWAS using regenie through Swiss Army Knife.
  3. In UI:
    1. Visualize analysis results using LocusZoom.

Accessing the Data

The genetic data needed to run this GWAS is stored in multiple folders in the Research Analysis Platform’s file structure, which is detailed here. For this analysis, we use the array genotype and whole exome sequencing data stored in the “/Bulk/Genotype Results/Genotype calls/” and “/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format 0 interim 200k release/” folders, respectively. Phenotypic data needed to collect the AD phenotype is located in the SQL database named “app<APPLICATION-ID>_<CREATION-TIME>”, and can be accessed via the corresponding dataset named “app<APPLICATION-ID>_<CREATION-TIME>.dataset” in the root (“/”) folder.

Pre-GWAS Steps

The first part of this analysis consists of conducting sample QC and processing the data fields needed to construct the desired phenotype file, which in turn is provided as an input to the GWAS. We then use the UK Biobank data fields to calculate each individual's AD risk by proxy.

Sample QC Steps and Derived Case/Control Phenotype

Getting the Data

We conducted our QC in a Jupyter notebook using Python and Spark. We first imported dxdata to manage the UKB datasets and cohorts on the Research Analysis Platform, and dxpy for other file interactions with Python on the Platform.
dxdata is used to first load the dataset named “app<APPLICATION-ID>_<CREATION-TIME>.dataset” from your project to your notebook:
# Import packages
import dxpy
import dxdata
# Automatically discover dispensed dataset ID and load the dataset
dispensed_dataset_id = dxpy.find_one_data_object(typename='Dataset', name='app*.dataset', folder='/', name_mode='glob')['id']
dataset = dxdata.load_dataset(id=dispensed_dataset_id)
For UKB data, the participant is the main entity, and most phenotype fields belong to it. An entity is equivalent to a table in a database, and its associated fields are equivalent to the columns of that table. In the UKB data schema, entities are linked tables, and can be listed with the command dataset.entities. See the dxdata documentation for more details.
To access the participant entity, use the command:
participant = dataset['participant']

Selecting Participant Fields by Field Index, Instance Index, or Array Index

For the main participant phenotype entity, the Research Analysis Platform uses field names with the following convention, where YYY is the instance index and ZZZ is the array index:
  • Neither instanced nor arrayed: p<field_id>
  • Instanced but not arrayed: p<field_id>_iYYY
  • Arrayed but not instanced: p<field_id>_aZZZ
  • Instanced and arrayed: p<field_id>_iYYY_aZZZ
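The convention above can be illustrated with a small helper (the function name is ours, for illustration only; it is not part of dxdata):

```python
def field_name(field_id, instance=None, array=None):
    """Build a participant field name such as 'p31', 'p41202_i0', or 'p22009_i0_a1'."""
    name = 'p' + str(field_id)
    if instance is not None:
        name += '_i' + str(instance)  # instance index (assessment visit)
    if array is not None:
        name += '_a' + str(array)     # array index within an instance
    return name
```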

Data-Field ID and Corresponding Description

Using the UKB showcase, we located the following field IDs of interest. We note them below so we can easily refer to them throughout the rest of the notebook.
The full list of all data-fields used in our case study:
  • 31: Sex
  • 1807: Father's age at death (units: years; data-coding 100291)
  • 1845: Mother's age (units: years; data-coding 100291)
  • 2946: Father's age (units: years; data-coding 100291)
  • 3526: Mother's age at death (units: years; data-coding 100291)
  • 20107: Illnesses of father
  • 20110: Illnesses of mother
  • 21022: Age at recruitment (units: years)
  • 22001: Genetic sex
  • 22006: Genetic ethnic grouping
  • 22009: Genetic principal components (array indices run from 1 to 40)
  • 22019: Sex chromosome aneuploidy
  • 22021: Genetic kinship to other participants
  • 22027: Outliers for heterozygosity or missing rate
  • 41202: Diagnoses - main ICD10
  • 41204: Diagnoses - secondary ICD10

Extract Data-Fields Used for Deriving Phenotype and Sample QC

The fields we want include principal components as well as the fields we need to construct our phenotype. We use the following function to collect all the field names and create a list of field IDs:
# This function is used to grab all field names (e.g. "p<field_id>_iYYY_aZZZ") of a list of field IDs
def field_names_for_ids(field_ids):
    from distutils.version import LooseVersion
    fields = []
    for field_id in field_ids:
        field_id = 'p' + str(field_id)
        fields += participant.find_fields(name_regex=r'^{}(_i\d+)?(_a\d+)?$'.format(field_id))
    return sorted([field.name for field in fields], key=lambda n: LooseVersion(n))
field_ids = ['31', '21022', '22001', '22006', '22009', '22019', '22021', '22027',
'41202', '41204',
'20107', '2946', '1807',
'20110', '3526', '1845']
field_names = ['eid'] + field_names_for_ids(field_ids)
If you view these field names now, each will look something like this: p<field_id>_iYYY_aZZZ.
Now we connect to Spark so we can retrieve our necessary fields as a DataFrame, and convert it to a Pandas DataFrame:
df = participant.retrieve_fields(names=field_names, engine=dxdata.connect())
pdf = df.toPandas()
From this point onward, we are interacting with the data in a Pandas DataFrame.
Use this command to visualize your table:
# Displays first three rows of table
pdf.head(3)

Extract ICD-10 Codes for AD by Patterns (G30 and F00)

Here we resolve the ICD-10 codes for AD from the dataset. This list of codes will be used later for defining the phenotype:
# This function is used to look up a field's codes by keyword
def field_codes_by_keyword(field_name, keyword):
    # coding.codes maps each code to its meaning (see the dxdata documentation)
    return dict([(k, v) for (k, v) in participant[field_name].coding.codes.items() if keyword.lower() in v.lower()])
# Collect ICD-10 codes for Alzheimer's disease (G30* and F00*)
ad_icd_codes = list({**field_codes_by_keyword("p41202", "g30."), **field_codes_by_keyword("p41202", "f00.")})
Your output may look like the following:
['G308', 'G300', 'G309', 'G301', 'F009', 'F001', 'F000', 'F002']

Derive the Phenotype (AD-by-Proxy)

The AD phenotype is generated using both a participant’s disease status and that participant's parental disease status and age. We are following the definition of AD-by-proxy described in this article by Jansen et al.
Here is a summary of the steps of the process. There are two calculation methods:
  1. Participant's disease status:
    • If the participant has been diagnosed with AD, count the risk as the maximum value, 2.
  2. Sum of the parents' risks, determined by each parent's disease status and age:
    • If a parent has been diagnosed with AD, count that parent's risk as 1. The risk is 2 if both parents were diagnosed with AD.
    • Otherwise, derive the parent's risk from the parent's age (recorded age or age at death, whichever is older):
      • risk = max( 0.32, (100 - age)/100 )
      • 0.32 is the general risk for an adult to be diagnosed with AD
  3. Take the maximum of steps 1 and 2.
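The calculation above can be sketched as a standalone function (the function and argument names are ours, for illustration only; the notebook derives the same quantities column-by-column below):

```python
def ad_risk_by_proxy(has_ad, father_has_ad, mother_has_ad, father_age, mother_age):
    """Toy illustration of the AD-by-proxy risk calculation described above."""
    # Step 1: a participant diagnosed with AD gets the maximum risk, 2
    own_risk = 2 if has_ad else 0
    # Step 2: each parent contributes 1 if diagnosed, else an age-based risk
    father_risk = 1 if father_has_ad else max(0.32, (100 - father_age) / 100)
    mother_risk = 1 if mother_has_ad else max(0.32, (100 - mother_age) / 100)
    # Step 3: take the maximum of the participant's own risk and the parental sum
    return max(own_risk, father_risk + mother_risk)
```

For example, an undiagnosed participant whose undiagnosed parents are aged 90 and 50 gets a risk of 0.32 + 0.50 = 0.82.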
To get the participant's disease status, we look at the participant's primary and secondary ICD-10 codes. If a participant has any ICD-10 codes for AD, we record their AD risk as "2":
# Get each participant's hospital inpatient records in either the main or secondary position
def icd10_codes(row):
    main_icd10_codes = row['p41202'] or []
    secondary_icd10_codes = row['p41204'] or []
    return list(set(main_icd10_codes + secondary_icd10_codes))
pdf['icd10_codes'] = pdf.apply(icd10_codes, axis=1)
# If the participant has any of the ICD-10 codes for AD, record the risk as "2"
def has_ad_icd10(row):
    return 0 if set(row['icd10_codes']).isdisjoint(ad_icd_codes) else 2
pdf['has_ad_icd10'] = pdf.apply(has_ad_icd10, axis=1)
Now we look at the illness statuses of the participant’s mother and father. We first get a list of illnesses with which the participants' parents have been diagnosed.
Data-fields 20107 and 20110 record illnesses of each participant's father and mother. If diagnosed with AD, assign the parent's risk as "1":
# Get distinct list of illnesses for father and mother
pdf['illnesses_of_father'] = pdf.filter(regex=('p20107')).apply(
    lambda x: list(set(x.dropna().astype('int8'))), axis=1)
pdf['illnesses_of_mother'] = pdf.filter(regex=('p20110')).apply(
    lambda x: list(set(x.dropna().astype('int8'))), axis=1)
UKB participants attended multiple measurement-taking sessions, so there are multiple fields for the participant’s parents’ ages. Since the risk of developing AD increases by age, we only need the maximum value for each parent’s age to calculate the AD risk:
# Get the max age between age at death and recorded age
pdf['father_age'] = pdf.filter(regex=(r'(p1807_*|p2946_*)')).max(axis=1)
pdf['mother_age'] = pdf.filter(regex=(r'(p3526_*|p1845_*)')).max(axis=1)
If the parent does not have an AD diagnosis, calculate the parent's risk from their age, as shown in the following, then organize the data into a table:
# If the parent was diagnosed with AD (code 10), record the risk as 1;
# else assign the parent's AD risk based on age, (100 - age)/100, with a minimum risk of 0.32
def parents_ad_risk(row):
    import numpy as np
    father_ad_risk = 1 if 10 in row['illnesses_of_father'] else np.maximum(0.32, (100 - row['father_age'])/100)
    mother_ad_risk = 1 if 10 in row['illnesses_of_mother'] else np.maximum(0.32, (100 - row['mother_age'])/100)
    return father_ad_risk + mother_ad_risk
pdf['parents_ad_risk'] = pdf.apply(parents_ad_risk, axis=1)
pdf['ad_risk_by_proxy'] = pdf[['has_ad_icd10','parents_ad_risk']].max(axis=1)

Perform Sample QC and Determine Phenotype

Now we conduct the sample QC steps. We filter for participants whose reported sex and genetic sex match, whose ancestry is listed as "White British," and who have no sex chromosome aneuploidy. We filter out heterozygosity/missing-rate outliers and participants flagged for excessive kinship, i.e., ten or more third-degree relatives. Participants whose parental age or parental illness fields were recorded as "do not know" or "prefer not to answer" were also removed from the cohort:
pdf_qced = pdf[
(pdf['p31'] == pdf['p22001']) & # Filter in sex and genetic sex are the same
(pdf['p22006']==1) & # in_white_british_ancestry_subset
(pdf['p22019'].isnull()) & # Not Sex chromosome aneuploidy
(pdf['p22021']!=10) & # Not Ten or more third-degree relatives identified (not 'excess_relatives')
(pdf['p22027'].isnull()) & # Not het_missing_outliers
((pdf['father_age'].notnull()) & (pdf['father_age']>0)) & # There is father's age
((pdf['mother_age'].notnull()) & (pdf['mother_age']>0)) & # There is mother's age
(pdf['illnesses_of_father'].apply(lambda x:(-11 not in x and -13 not in x))) & # Filter out "do not know" or "prefer not to answer" father illness
(pdf['illnesses_of_mother'].apply(lambda x:(-11 not in x and -13 not in x))) # Filter out "do not know" or "prefer not to answer" mother illness
]
Use this command to check the resulting number of samples:
# This is the number of samples we have in our table after QC
len(pdf_qced)

Visualizing the Distribution and Setting the Case/Control Threshold

Since GWAS analyses for AD have been run before, and the results have been documented, we took the extra step of visualizing the distribution of cases with a histogram, to double-check that our results looked reasonable for a general population. Our histogram looked reasonable, so we continued with our analysis:
# Visualize the distribution of the phenotype we derived
pdf_qced['ad_risk_by_proxy'].hist()
Seeing the distribution, we decided to set the case/control threshold at an AD risk of 1, with the range going from 0 to 2. All participants with a value of 1 or above for "AD-by-proxy" are considered cases, and those below 1 are considered controls:
# All participants with ad_risk_by_proxy >= 1 are considered cases (1) and the rest are considered controls (0)
import numpy as np
pdf_qced['ad_by_proxy'] = np.where(pdf_qced['ad_risk_by_proxy']>= 1, 1, 0)

Cleaning up the Pandas DataFrame and Saving to Your Project

Next we cleaned up the DataFrame so that it can be used with regenie. We renamed the columns for human readability and added an FID column, which regenie requires. We also converted the lists of parental illnesses to strings, for ease of formatting:
# Rename columns for better human readability
import re
pdf_qced = pdf_qced.rename(columns=lambda x: re.sub('p22009_a','pc',x))
pdf_qced = pdf_qced.rename(columns={'eid':'IID', 'p31': 'sex', 'p21022': 'age',
'p22006': 'ethnic_group',
'p22019': 'sex_chromosome_aneuploidy',
'p22021': 'kinship_to_other_participants',
'p22027': 'outliers_for_heterozygosity_or_missing'})
# Add FID column -- required input format for regenie
pdf_qced['FID'] = pdf_qced['IID']
# Join list of parent's illness -- making it a single string format rather than list format
pdf_qced['illnesses_of_father'] = pdf_qced['illnesses_of_father'].apply(lambda x: [str(i) for i in x if i]).str.join(",").replace(r'^\s*$', np.nan, regex=True)
pdf_qced['illnesses_of_mother'] = pdf_qced['illnesses_of_mother'].apply(lambda x: [str(i) for i in x if i]).str.join(",").replace(r'^\s*$', np.nan, regex=True)
# Join list of ICD10 code -- making it a single string format rather than list format
pdf_qced['icd10_codes'] = pdf_qced['icd10_codes'].apply(lambda x: 'NA' if x is None else ",".join([str(i) for i in x]))
# Create a phenotype table from our QCed data
pdf_phenotype = pdf_qced[['FID', 'IID', 'sex', 'age',
'ad_by_proxy', 'ad_risk_by_proxy',
'icd10_codes', 'has_ad_icd10',
'ethnic_group', 'sex_chromosome_aneuploidy',
'illnesses_of_father', 'illnesses_of_mother',
'father_age', 'mother_age', 'parents_ad_risk'] + [f'pc{i}' for i in range(1, 41)]]
At this point, your table should look something like this (the displayed data values are placeholders, so as not to disclose actual values):

Whole Exome Sequence Results

When the initial paper was published, the authors did not have access to the 200,000 whole exome sequencing (WES) results that are currently available. Therefore, another reason we chose to replicate this study was to confirm the accuracy of the WES compared to the imputed genotypes, and to build on the existing study.
We first load the PLINK WES .fam file into a DataFrame, then intersect it with our phenotype file to keep only participants who have WES data:
import pandas as pd
# Get WES
path_to_family_file = '/mnt/project/Bulk/<Path to WES files>' #replace this with actual path to the WES file
plink_fam_df = pd.read_csv(path_to_family_file, delimiter=r'\s+', dtype='object',
names = ['FID','IID','Father ID','Mother ID', 'sex', 'Pheno'], engine='python')
# Intersect the phenotype file and the 200K WES .fam file, to generate phenotype DataFrame for the 200K participants
ad_risk_by_proxy_wes_200k_df = pdf_phenotype.join(plink_fam_df.set_index('IID'), on='IID', rsuffix='_fam', how='inner')
# Drop unneeded columns from the .fam file
ad_risk_by_proxy_wes_200k_df.drop(columns=['FID_fam','Father ID','Mother ID','sex_fam', 'Pheno'], axis=1, inplace=True, errors='ignore')
Finally we save the phenotype as a TSV (.phe) file:
# Write phenotype files to a TSV file
ad_risk_by_proxy_wes_200k_df.to_csv('ad_risk_by_proxy_wes_200k.phe', sep='\t', na_rep='NA', index=False, quoting=3)
pdf_phenotype.to_csv('ad_risk_by_proxy_wes.phe', sep='\t', na_rep='NA', index=False, quoting=3)
And re-upload to the project:
# Upload the geno-pheno intersect phenotype file back to the RAP project
dx upload ad_risk_by_proxy_wes_200k.phe -p --path /path/to/ad_by_proxy_GWAS_200K/ --brief
dx upload ad_risk_by_proxy_wes.phe -p --path /path/to/ad_by_proxy_GWAS_500K1/ --brief
Now the QC steps are complete.
The next steps for the GWAS are to conduct variant QC and the array-genotype LiftOver.

Variant QC Steps and Array-Genotype LiftOver


Note that the UKB SNP array genotype calls (bulk, genotype calls) are mapped to GRCh37, while the whole-exome sequences are mapped to GRCh38, the most recent version. We chose to lift the coordinates of genotype calls over to GRCh38 using LiftOver before conducting the rest of the analysis, so that our genotype calls are consistent.
The following pipeline may be useful in conducting this part of the analysis.

Variant QC

To conduct variant QC, we use PLINK2, available as part of Swiss Army Knife, to create a list of variants that pass the QC threshold for the array genotype and WES variant data. You can use Swiss Army Knife as part of a larger workflow, or on its own as an analysis. To access it from the UK Biobank Research Analysis Platform, open your project, then click Start Analysis. From the list of tools that appears, click Swiss Army Knife. The following shows the command-line options needed to run this step. For more information about PLINK2 functions, see the official PLINK2 documentation.

Array Genotype Data

The input files to our variant QC job for the array genotype data are:
  • Our phenotype file, ad_risk_by_proxy_wes_200k.phe.
  • The UK Biobank genotype call files in .bed, .bim, and .fam formats. We merged the chromosomes together into one file for each file format in preparation for running the whole genome regression model using step 1 of regenie.
Below are the command line options we use to run our job:
plink2 \
--bfile ukb_c1-22_hs38DH_merged \
--out 200K_array_snps_CRCh38_qc_pass \
--mac 100 --maf 0.01 --hwe 1e-15 --mind 0.1 --geno 0.1 \
--write-snplist --write-samples --no-id-header \
--threads $(nproc)
The outputs we get from this step are:
  • A list of SNPs that pass QC, 200K_array_snps_CRCh38_qc_pass.snplist
  • A list of sample IDs that pass QC,
  • A log file (.log).
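To build intuition for what the --maf and --geno thresholds do, here is a sketch of the two filters applied to a toy genotype matrix with numpy (rows are samples, columns are variants, values are 0/1/2 alternate-allele counts or NaN for missing). This is an illustration of the filtering logic only, not a reimplementation of PLINK2:

```python
import numpy as np

def qc_pass_variants(geno, maf_min=0.01, geno_max=0.1):
    """Return a boolean mask of variants passing MAF and missingness thresholds."""
    n_called = np.sum(~np.isnan(geno), axis=0)           # non-missing calls per variant
    miss_rate = 1 - n_called / geno.shape[0]             # --geno: per-variant missingness
    alt_freq = np.nansum(geno, axis=0) / (2 * n_called)  # alternate-allele frequency
    maf = np.minimum(alt_freq, 1 - alt_freq)             # minor-allele frequency
    return (maf >= maf_min) & (miss_rate <= geno_max)
```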

WES Data

The input files to our variant QC job for the WES variant data are:
  • Our phenotype file, ad_risk_by_proxy_wes_200k.phe
  • The UK Biobank WES joint call files in .bgen and .sample formats. For the WES data, we went through each chromosome individually.
Below are the command line options we use to run our job:
plink2 \
--bgen ukb23155_c22_b0_v1.bgen ref-first \
--sample ukb23155_c22_b0_v1.sample \
--out 200K_WES_variants_CRCh38_qc_pass_c22 \
--mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1 \
--write-snplist --write-samples --no-id-header \
--threads $(nproc)
The outputs we got from this step, per chromosome (chromosome 22 in this case), are:
  • A list of SNP IDs that pass QC, 200K_WES_variants_CRCh38_qc_pass_c22.snplist
  • A list of sample IDs that pass QC,
  • A log file (.log).
See the Swiss Army Knife documentation for more information about running this app and the inputs/outputs.

GWAS using regenie

regenie Step 1

We will be using regenie to conduct our GWAS. Note that for regenie we used the merged chromosome file since this is a requirement for the algorithm to run.
This is a two-step process, and the inputs for step 1 are:
  • Chromosome genotype calls (bed, fam, bim) that we also used as our input in the variant QC step above.
  • Our phenotype file, ad_risk_by_proxy_wes_200k.phe
  • SNPs that pass QC list, 200K_array_snps_CRCh38_qc_pass.snplist
  • SNP ID file,
We use the following command options to run this step:
regenie \
--step 1 \
--bed ukb_c1-22_hs38DH_merged \
--phenoFile ad_risk_by_proxy_wes_200k.phe \
--covarFile ad_risk_by_proxy_wes_200k.phe \
--extract 200K_array_snps_CRCh38_qc_pass.snplist \
--phenoCol ad_by_proxy \
--covarCol age --covarCol sex --covarCol pc{1:10} \
--bsize 1000 \
--out ad_by_proxy \
--bt --loocv --threads $(nproc) --gz
The outputs of this step are the necessary files for regenie Step 2. These include:
  • LOCO file, ad_by_proxy_1.loco.gz
  • Prediction list, ad_by_proxy_pred.list
  • Job log, ad_by_proxy.log
  • regenie zip files.
For more information about different options to run regenie, see the official regenie documentation.

regenie Step 2

In this step, we run each chromosome individually, distributing the work across multiple sub-jobs to reduce the wall-clock turnaround time.
The inputs to the second step are:
  • .bgen, .sample, and .bgen.bgi genotype files from UKB.
  • ID file of SNPs that pass QC,
  • Phenotype file, ad_risk_by_proxy_wes_200k.phe
  • Prediction list (from Step 1)
  • LOCO file (from Step 1).
The command-line options we use are as follows:
regenie \
--step 2 \
--bgen ukb23155_c22_b0_v1.bgen --ref-first \
--sample ukb23155_c22_b0_v1.sample \
--phenoFile ad_risk_by_proxy_wes_200k.phe \
--covarFile ad_risk_by_proxy_wes_200k.phe \
--phenoCol ad_by_proxy \
--bt --approx --firth-se --firth \
--extract 200K_WES_variants_CRCh38_qc_pass_c22.snplist \
--covarCol age --covarCol sex --covarCol pc{1:10} \
--out assoc.20 \
--pred ad_by_proxy_pred.list \
--bsize 200 --pThresh 0.05 --minMAC 3 --threads $(nproc) --gz
Since we input each chromosome individually, the outputs from Step 2 are a job log and a .regenie summary file for each chromosome (chromosome 22 in the example above). Once Step 2 is complete, the actual work of conducting the GWAS is done, and you can visualize your results using LocusZoom. You may want to concatenate your individual chromosome results into one file for easier visualization and further analysis.
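The per-chromosome results can be concatenated with pandas; this is a sketch that assumes the per-chromosome summary files are space-delimited with a single header row (the glob pattern is a placeholder for your actual output names):

```python
import glob
import pandas as pd

def concat_regenie(pattern, out_path):
    """Concatenate per-chromosome regenie summary files into one tab-delimited file."""
    paths = sorted(glob.glob(pattern))
    # regenie summary statistics are space-delimited with a single header row
    frames = [pd.read_csv(p, sep=' ') for p in paths]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_path, sep='\t', index=False)
    return merged
```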


Next you can use LocusZoom to visualize the results of your analysis. The LocusZoom app is available in the Tools Library of the Research Analysis Platform. It generates Manhattan plots, QQ plots, and a table of top loci.
To visualize your data, convert your regenie output file (the .regenie file from Step 2 above) to a tab-delimited file, and then compress it into a gzipped file. Note that LocusZoom only accepts tab-delimited files. From these results you can conduct further analyses and look into variant effects, or you may find new loci to explore.
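The re-delimiting and compression can be done in a few lines of pandas; a minimal sketch, assuming a space-delimited input file (the filenames in the usage comment are placeholders):

```python
import pandas as pd

def regenie_to_locuszoom(in_path, out_path):
    """Rewrite a space-delimited regenie file as a gzipped, tab-delimited file."""
    df = pd.read_csv(in_path, sep=' ')
    # gzip compression is inferred from the .gz suffix of out_path
    df.to_csv(out_path, sep='\t', index=False)

# e.g. regenie_to_locuszoom('assoc.c22_ad_by_proxy.regenie', 'assoc.c22_ad_by_proxy.tsv.gz')
```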
Examples of these visualizations are shown below.

How to Cite

If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (" in your work.