GWAS Guide Using Alzheimer's Disease

Explore one of the ways to conduct a GWAS on the Research Analysis Platform.

The code provided in the following tutorial is delivered "As-Is." Notwithstanding anything to the contrary, DNAnexus will have no warranty, support or other obligations with respect to Materials provided hereunder. The MIT License applies to this guide.

Introduction

This tutorial will provide the Python code needed to conduct the QC portion of a genome-wide association study (GWAS) focusing on discovering variants associated with Alzheimer's Disease (AD) in JupyterLab. For this GWAS, we used UK Biobank data to search for genetic loci associated with AD. For more detailed information about the AD study, see this blog post.

One of the motivations for performing this study was to demonstrate the process of running an end-to-end GWAS on the UK Biobank Research Analysis Platform. While researchers can bring their own programming languages and tools to the Platform, the Platform’s user interface enables researchers to run analyses easily and quickly. We wanted to demonstrate that conducting a GWAS could be done without the knowledge of sophisticated command-line methods.

Overview

Here is a brief overview of the steps that we conducted to perform this GWAS:

In JupyterLab:
1. Access data and construct phenotype.
2. Perform sample QC
In CLI or UI:
1. Conduct LiftOver.
2. Variant QC using Swiss Army Knife.
3. Conduct GWAS using regenie.
In UI:
1. Visualize analysis results using LocusZoom.

Accessing the Data

The genetic data needed to run this GWAS is stored in multiple folders in the Research Analysis Platform’s file structure, which is detailed here. For this analysis, we use the array genotype and whole exome sequencing data stored in the “/Bulk/Genotype Results/Genotype calls/” and “/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/” folders, respectively. Phenotypic data needed to collect the AD phenotype is located in the SQL database named “app<APPLICATION-ID>_<CREATION-TIME>”, and can be accessed via the corresponding dataset named “app<APPLICATION-ID>_<CREATION-TIME>.dataset” in the root (“/”) folder.

Pre-GWAS Steps

The first part of this analysis will consist of conducting sample QC and processing the UK Biobank data fields to calculate the individual’s AD risk by proxy.

Sample QC Steps and Derived Case/Control Phenotype

We conducted our QC using a Jupyter Lab spark node. Here is the following configuration:

Cluster configuration: Spark Cluster
Instance type: mem1_hdd1_v2_x4
Number of Nodes: 2

Getting the Data

We first imported dxpy to find and extract the UKB datasets on the Research Analysis Platform.

# Import packages 
import dxpy
import subprocess

# Automatically discover dispensed dataset ID and load the dataset 
dispensed_dataset_id = dxpy.find_one_data_object(typename='Dataset', name='app*.dataset', folder='/', name_mode='glob')['id']

# Get project ID
project_id = dxpy.find_one_project()["id"]
dataset = (':').join([project_id, dispensed_dataset_id])

cmd = ["dx", "extract_dataset", dataset, "-ddd", "--delimiter", ","]
subprocess.check_call(cmd)

The extracted UKB data is returned as 3 .csv files. The data_dictionary.csv contains a table with participants as the rows and metadata along the columns, including field names (see table below for naming convention). The codings.csv contains a lookup table for the different medical codes, including ICD-10 codes that will be displayed in the diagnosis field column (p41270). The entity_dictionary.csv contains the different entities that can be queried, where participant is the main entity that corresponds to most phenotype fields.

To access the data_dictionary.csv, use the command:

import glob
import os
import pandas as pd

path = os.getcwd()
data_dict_csv = glob.glob(os.path.join(path, "*.data_dictionary.csv"))[0]
data_dict_df = pd.read_csv(data_dict_csv)
data_dict_df.head()

Selecting Participant Fields by Field Index, Instance Index, or Array Index¶

For the main participant phenotype entity, the Research Analysis Platform uses field names with the following convention:

Type of field

Syntax for field name

Example

Neither instanced nor arrayed

p<FIELD>

p31

Instanced but not arrayed

p<FIELD>_i<INSTANCE>

p40005_i0

Arrayed but not instanced

p<FIELD>_a<ARRAY>

p41262_a0

Instanced and arrayed

p<FIELD>_i<INSTANCE>_a<ARRAY>

p93_i0_a0

Data-Field ID and Corresponding Description

Using the UKB showcase, we located the following field IDs of interest from UKB showcase. We note them below so we can easily refer to them throughout the rest of the notebook.

The full list of all data-fields used in our case study:

Data-Field ID	Description	Data-Coding
31	Sex	9
1807	Father's age at death	Units of measurement are years or 100291
1845	Mother's age	Units of measurement are years or 100291
2946	Father's age	Units of measurement are years or 100291
3526	Mother's age at death	Units of measurement are years or 100291
20107	Illnesses of father	1010
20110	Illnesses of mother	1010
21022	Age at recruitment	Units of measurement are years
22001	Genetic sex	9
22006	Genetic ethnic grouping	1002
22009	Genetic principal components	Units of measurement are Units; Array indices run from 1 to 40
22019	Sex chromosome aneuploidy	1
22021	Genetic kinship to other participants	682
22027	Outliers for heterozygosity or missing rate	1
41202	Diagnoses - main ICD10	19
41204	Diagnoses - secondary ICD10	19

Extract Data-Fields Used for Deriving Phenotype and Sample QC

The fields we want include principal components as well as the fields we need to construct our phenotype. We use the following function to collect all the field names and create a list of field IDs:

from distutils.version import LooseVersion

def field_names_for_ids(field_id):
    field_names = ["eid"]
    for _id in field_id:
        select_field_names = list(data_dict_df[data_dict_df.name.str.match(r'^p{}(_i\d+)?(_a\d+)?$'.format(_id))].name.values)
        field_names += select_field_names
    field_names = sorted([field for field in field_names], key=lambda n: LooseVersion(n))
        
    field_names = [f"participant.{f}" for f in field_names]
    return ",".join(field_names)

field_ids = ['31', '21022', '22001', '22006', '22009', '22019', '22021', '22027',
             '41270', '20107', '2946', '1807',
             '20110', '3526', '1845']
field_names = field_names_for_ids(field_ids)

If you view these field names now, each will look something like this: p<field_id>_iYYY_aZZZ.

Now we can retrieve our necessary fields as a Pandas DataFrame:

cmd = ["dx", "extract_dataset", dataset, "--fields", field_names, "--delimiter", ",", "--output", "extracted_data.sql", "--sql"]
subprocess.check_call(cmd)

import pyspark

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

with open("extracted_data.sql", "r") as file:
    retrieve_sql=""
    for line in file: 
        retrieve_sql += line.strip()  
             
temp_df = spark.sql(retrieve_sql.strip(";"))
pdf = temp_df.toPandas()

The dispensed_dictionary.csv file contains a table with participants as rows and the fields of interest as columns. From this point onward, we are interacting with the data in a Pandas DataFrame.

Use this command to visualize your table:

#Displays first three rows of table
pdf.head(3)

We will format the column headers to drop the "participant" prefix for the field names:

import re
pdf = pdf.rename(columns=lambda x: re.sub('participant.','',x))

Extract ICD-10 Codes for AD by Patterns (G30 and F00)

Here we resolve the ICD-10 codes for AD from the dataset. This list of codes will be used later for defining the phenotype.

codings_csv = glob.glob(os.path.join(path, "*.codings.csv"))[0]
codings_df = pd.read_csv(codings_csv)
codings_df.head()

# Collapse ICD-10 codes for Alzheimer's disease (G30* and F00*)
ad_icd_codes = list(
    codings_df[(codings_df["coding_name"] == "data_coding_19") & ((codings_df["parent_code"] == "G30") | (codings_df["parent_code"] == "F00"))]["code"])
ad_icd_codes

Your output may look like the following:

['G308', 'G300', 'G309', 'G301', 'F009', 'F001', 'F000', 'F002']

Derive the Phenotype (AD-by-Proxy)

The AD phenotype is generated using both a participant’s disease status and that participant's parental disease status and age. We are following the definition of AD-by-proxy described in this article by Jansen et al.

Here is a summary of the steps of the process. There are two calculation methods:

Participant's disease status.
- If the participant has been diagnosed with AD. Count risk as the maximum value, 2.
Additive of participant's parents' risk, determined by disease status along with parent's age.
- If a biological parent has been diagnosed with AD, count the risk as a 1. The risk is set to 2 if both biological parents were diagnosed with AD.
- Otherwise, get the parent' risk by the parents age (recorded age or age at death, whichever is older):
  - risk = max( 0.32, (100 - age)/100 )
  - 0.32 is the general risk for an adult to be diagnosed with AD
Take the maximum of steps 1 and 2.

To get the participant's disease status, we look for the participant’s primary or secondary ICD-10 codes. If a participant has ICD-10 codes for AD, we record their AD risk as “2”. Note that we first need to format the values in the columns to be a list object instead of a string.

import ast
import numpy as np

# Replace NaN with string None for p41270

# Note: eval will return Nonetype for string "None"
pdf["p41270"] = pdf["p41270"].replace(np.nan, "None")


# Get each participant's hospital inpatient records in ICD10 Diagnoses
def icd10_codes(row):
    icd10_codes = row['p41270'] or []
    return list( set(icd10_codes) )

pdf['icd10_codes'] = pdf.apply(icd10_codes, axis=1)

# If the participant has any of the ICD-10 codes for AD, record the risk to "2" 
def has_ad_icd10(row): 
    return 0 if set(row['icd10_codes']).isdisjoint(ad_icd_codes) else 2 
pdf['has_ad_icd10'] = pdf.apply(has_ad_icd10, axis=1)

Now we look at the illness statuses of the participant’s mother and father. We first get a list of illnesses with which the participants' biological parents have been diagnosed.

Data-fields 20107 and 20110 record illnesses of each participant's father and mother. If diagnosed with AD, assign the parent's risk as "1":

pdf['illnesses_of_father'] = pdf.filter(regex=('p20107*')).apply(
    lambda x: list(set(eval(x.any() or "[]"))),  
    axis=1)

pdf['illnesses_of_mother'] = pdf.filter(regex=('p20110*')).apply(
    lambda x: list(set(eval(x.any() or "[]"))), 
    axis=1)

UKB participants attended multiple measurement-taking sessions, so there are multiple fields for the participant’s parents’ ages. Since the risk of developing AD increases by age, we only need the maximum value for each parent’s age to calculate the AD risk:

# Get the max age between age at death and recorded age

pdf['father_age'] = pdf.filter(regex=(r'(p1807_*|p2946_*)')).max(axis=1)
pdf['mother_age'] = pdf.filter(regex=(r'(p3526_*|p1845_*)')).max(axis=1)

If the parent does not have any AD diagnosis, calculate the parent’s risk using their age, as shown in the following, then organize the data into a table:

# If the parent has diagnosed with AD (code 10), record it as 1; 
# else assign parent's AD risk with their risk, which is their age (proportional to diff of age of 100) with minimum risk at 0.32 

def parents_ad_risk(row): 
import numpy as np 

father_ad_risk = 1 if 10 in row['illnesses_of_father'] else np.maximum(0.32, (100 - row['father_age'])/100)
mother_ad_risk = 1 if 10 in row['illnesses_of_mother'] else np.maximum(0.32, (100 - row['mother_age'])/100)
return father_ad_risk + mother_ad_risk 

pdf['parents_ad_risk'] = pdf.apply(parents_ad_risk, axis=1) 
pdf['ad_risk_by_proxy'] = pdf[['has_ad_icd10','parents_ad_risk']].max(axis=1) 
pdf[['ad_risk_by_proxy','parents_ad_risk','has_ad_icd10']].head()

Perform Sample QC and Determine Phenotype

Now we conduct the Sample QC steps. We filtered for participants for whom reported sex and genetic sex are the same, ancestry is listed as "White British," and there is no sex chromosome aneuploidy. We filter out for outliers that are heterozygous and missing rates, or are distinguished excessive kinship, i.e. have ten or more third-degree relatives. Participants who had parental ages as “do not know" or "prefer not to answer" were also removed from the cohort:

pdf_qced = pdf[
           (pdf['p31'] == pdf['p22001']) & # Filter in sex and genetic sex are the same
           (pdf['p22006']==1) &            # in_white_british_ancestry_subset
           (pdf['p22019'].isnull()) &      # Not Sex chromosome aneuploidy
           (pdf['p22021']!=10) &           # Not Ten or more third-degree relatives identified (not 'excess_relatives')
           (pdf['p22027'].isnull()) &      # Not het_missing_outliers
           ((pdf['father_age'].notnull()) & (pdf['father_age']>0)) &  # There is father's age
           ((pdf['mother_age'].notnull()) & (pdf['mother_age']>0)) &  # There is mother's age
           (pdf['illnesses_of_father'].apply(lambda x:(-11 not in x and -13 not in x))) &  # Filter out "do not know" or "prefer not to answer" father illness
           (pdf['illnesses_of_mother'].apply(lambda x:(-11 not in x and -13 not in x)))    # Filter out "do not know" or "prefer not to answer" mother illness
]

Use this command to check the resulting number of samples:

# This is the number of samples we have in our table 

pdf_qced.shape[0]

Visualizing the Distribution and Setting the Case/Control Threshold

Since GWAS analyses have been run before for AD, and the results have been documented, we took the extra step of visualizing the distribution of cases with a histogram, to double check that our results looked reasonable in a general population. Our histogram looked reasonable, so we continued with our analysis:

# Visualize the distribution of the phenotype we derived
pdf_qced['ad_risk_by_proxy'].hist(bins=30)

Seeing the distribution, we decide to set the case/control threshold with AD risk at 1, with the range going from 0 to 2. All participants with a value for 1 or above for “AD-by-proxy” are considered cases, and those below 1 are considered to be controls:

# All participants have ad_risk_by_proxy are considered cases (1) and the rest are considered as controls (0)

pdf_qced['ad_by_proxy'] = 
np.where(pdf_qced['ad_risk_by_proxy']>= 1, 1, 0)
pdf_qced['ad_by_proxy'].value_counts()

Cleaning up the Pandas DataFrame and Saving to Your Project

Next we cleaned up the DataFrame to set up as input with regenie. We renamed the columns for human-readability and added an FID column, which is the required format for regenie. We also convert the list of parental illnesses to a string, for ease of formatting:

# Rename columns for better human readability 
import re
 
pdf_qced = pdf_qced.rename(columns=lambda x: re.sub('p22009_a','pc',x)) 
pdf_qced = pdf_qced.rename(columns={'eid':'IID', 'p31': 'sex', 'p21022': 'age', 'p22006': 'ethnic_group', 
'p22019': 'sex_chromosome_aneuploidy', 
'p22021': 'kinship_to_other_participants', 
'p22027': 'outliers_for_heterozygosity_or_missing'})
 
# Add FID column -- required input format for regenie 
pdf_qced['FID'] = pdf_qced['IID'] 

# Join list of parent's illness -- making it a single string format rather than list format 
pdf_qced['illnesses_of_father'] = pdf_qced['illnesses_of_father'].apply(lambda x: [str(i) for i in x if i]).str.join(",").replace(r'^\s*$', np.nan, regex=True)
pdf_qced['illnesses_of_mother'] = pdf_qced['illnesses_of_mother'].apply(lambda x: [str(i) for i in x if i]).str.join(",").replace(r'^\s*$', np.nan, regex=True)

# Join list of ICD10 code -- making it a single string format rather than list format
pdf_qced['icd10_codes'] = pdf_qced['icd10_codes'].apply(lambda x: 'NA' if x is None else ",".join([str(i) for i in x]))
 
# Create a phenotype table from our QCed data 
pdf_phenotype = pdf_qced[['FID', 'IID', 'sex', 'age', 
'ad_by_proxy', 'ad_risk_by_proxy', 
'icd10_codes', 'has_ad_icd10', 
'ethnic_group', 'sex_chromosome_aneuploidy', 
'kinship_to_other_participants', 
'outliers_for_heterozygosity_or_missing', 
'illnesses_of_father', 'illnesses_of_mother', 
'father_age', 'mother_age', 'parents_ad_risk'] + [f'pc{i}' for i in range(1, 41)]]

pdf_phenotype.to_csv('ad_risk_by_proxy_wes.phe', sep='\t', na_rep='NA', index=False, quoting=3 )

At this point, your table should look something like this, noting that displayed data values are placeholders, so to not disclose actual values:

	IID	sex	age	missingness	ethnic_group	sex_chromosome_aneuploidy
0	0000011	1	45	0.003	1	None
1	0000022	0	67	0.022	1	None
2	0000003	1	50	0.001	1	None

To upload your phenotype file back to the project:

%%bash 
# Upload the phenotype file back to the RAP project 
dx upload ad_risk_by_proxy_wes.phe -p --path /output directory path/ad_by_proxy_GWAS_500K/ --brief

Now the QC steps are complete.

The next steps for the GWAS is to conduct variant QC and the array-genotype LiftOver.

Variant QC Steps and Array-Genotype LiftOver

LiftOver

Note that the UKB SNP array genotype calls (bulk, genotype calls) are mapped to GRCh37, while the whole-exome sequences are mapped to GRCh38, the most recent version. We chose to lift the coordinates of genotype calls over to GRCh38 using LiftOver before conducting the rest of the analysis, so that our genotype calls are consistent.

The following pipeline may be useful in conducting this part of the analysis.

Variant QC

To conduct variant QC, we use PLINK2, available as part of the Swiss Army Knife app, to create a list of variants that pass the QC threshold for array genotype and WES variant data. To access it from the UK Biobank Research Analysis Program, open your project, then click Start Analysis. From the list of tools that appear, click Swiss Army Knife. The following shows what you will put as input string under the cmd option of the Swiss Army Knife app.

For more information about PLINK2 see the official PLINK2 documentation.

Array Genotype Data

The input files to our variant QC job for running the array genotype data is:

The UK Biobank genotype call files in .bed, .bim, and .fam formats. We merged the chromosomes together into one file for each file format in preparation for running the whole genome regression model using Step 1 of regenie.

Below are the Swiss Army Knife cmd inputs used to run the job:

plink2 \ 
--bfile ukb_c1-22_hs38DH_merged \ 
--out final_array_snps_CRCh38_qc_pass \ 
--mac 100 --maf 0.01 --hwe 1e-15 --mind 0.1 --geno 0.1 \ 
--write-snplist --write-samples --no-id-header \ 
--threads $(nproc)

The outputs we get from this step are:

A list of SNPs that pass QC, final_array_snps_CRCh38_qc_pass.snplist
A list of sample IDs that pass QC, final_array_snps_CRCh38_qc_pass.id
A log file (.log)

WES Data

The WES QC is performed using the bgens_qc.wdl script found here. The variants are QC’d per chromosome and then concatenated at the end to create a single list of variants to keep. The variants were filtered using following options:

--mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1

The input file to run the bgens_qc.wdl was created by first running the generate_inputs.ipynb notebook shown here. Within this notebook you’ll need to update the following parameters:

path_to_data: /Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/

phenotype_folder: /output directory path/ad_by_proxy_GWAS_500K/

phenotype_file: ad_risk_by_proxy_wes.phe

Run the WDL workflow using the .json file created in the generate_inputs.ipynb notebook by running the command: dx run <path_to_folder_on_UKB_RAP>/bgens_qc -f research_project/bgens_qc/bgens_qc_input.dx.json. Otherwise you can run the WDL workflow in the UI and specify the inputs manually instead of using the Jupyter notebook.

The outputs we get from this step are:

A list of SNPs that pass QC, final_WES_snps_CRCh38_qc_pass.snplist
A list of sample IDs that pass QC, final_WES_snps_CRCh38_qc_pass.id
A log file (.log)

GWAS using regenie

We will be using regenie to conduct our GWAS. DNAnexus provides a suite of apps to run the analysis as a whole. The Regenie orchestration app is used to facilitate the run of two required steps as well as input preprocessing, validation, resource allocation and post-processing (including creation of Manhattan plots per trait).

regenie Step 1

For Step 1, we will estimate how background SNPs contribute to the phenotype.

The inputs used for Step 1 are:

A merged array genotype calls: ukb_c1-22_hs38DH_merged.[bed,bim,fam] that we also used as our input in the variant QC step above.
Our phenotype file: ad_risk_by_proxy_wes.phe
SNPs that pass QC list: final_array_snps_CRCh38_qc_pass.snplist and the SNP ID file, final_array_snps_CRCh38_qc_pass.id

The following outputs from Step 1 are the necessary inputs for Step 2:

LOCO file: ukb_c1-22_hs38DH_merged_1.loco
Prediction list: ukb_c1-22_hs38DH_merged_pred.list

regenie Step 2

Then in Step 2, linear or logistic regression is used to test association between the variants and phenotype conditional upon the prediction from the regression model in Step 1. In Step 2, the orchestration app distributes each set of BGEN files into a separate job, as each variant is independently tested to the phenotype, and this parallelization helps accelerate the end-to-end runtime.

The inputs to the second step are:

23 sets of WES genotype files (ukb23159_c[1-22,X]_b0_v1.[bgen,bgi,sample]).
ID file of SNPs that pass QC: final_WES_snps_CRCh38_qc_pass.id
Phenotype file: ad_risk_by_proxy_wes.phe
Prediction list: ukb_c1-22_hs38DH_merged_pred.list (from Step 1)
LOCO file: ukb_c1-22_hs38DH_merged_1.loco (from Step 1).

After the completion of Step 2, a set of association plots is generated. This concatenation is done per phenotype. On the output of the orchestration app are concatenated Manhattan plots as well as statistic in lmm.tsv.gz (Extra output files/extra_outputs ) that can be investigated with LocusZoom.

Running regenie in the UI

We used the following input options to run the regenie app using the UI. If an input option isn’t specified, then the default option was used.

Inputs:

Genotype BED for Step 1: ukb_c1-22_hs38DH_merged.bed

Genotype BIM for Step 1: ukb_c1-22_hs38DH_merged.bim

Genotype FAM for Step 1: ukb_c1-22_hs38DH_merged.fam

Genotype BGEN files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen

Genotype BGI index files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen.bgi

Sample files for Step 2: ukb23159_c[1-22,X]_b0_v1.sample

Phenotypes file: ad_risk_by_proxy_wes.phe

Covariates file: ad_risk_by_proxy_wes.phe

Variant IDs to extract (Step 1): final_array_snps_CRCh38_qc_pass.snplist

Variant IDs to extract (Step 2): final_WES_snps_CRCh38_qc_pass.snplist

Common

Quantitative traits: FALSE

Phenotypes to include: ad_by_proxy

Array of covariates to apply: age,sex,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10

WGR (Step 1)

Size of the genotype blocks (Step 1): 1000

First allele as reference (Step 1): TRUE

Association Testing (Step 2)

Size of the genotype blocks (Step 2): 200

Firth approximation instead of SPA: TRUE

Minimum allele count (MAC): 3

Running regenie in the CLI

To run regenie in the CLI, do not directly copy and paste this portion of code. Fill in the file paths specified by <> and add file paths for the chromosomes you want to analyze using the igenotype_bgens/bgis/samples.

dx run app-regenie \
-iwgr_genotype_bed="<file path>/<PATH_TO_LIFTEDOVER_DATA>/ukb_c1-22_hs38DH_merged.bed" \
-iwgr_genotype_bim="<file path>/<PATH_TO_LIFTEDOVER_DATA>/ukb_c1-22_hs38DH_merged.bim" \
-iwgr_genotype_fam="<file path>/<PATH_TO_LIFTEDOVER_DATA>/ukb_c1-22_hs38DH_merged.fam" \
-igenotype_bgens="/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c1_b0_v1.bgen" \
-igenotype_bgens="/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c2_b0_v1.bgen" \
…
-igenotype_bgens="/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c22_b0_v1.bgen" \
-igenotype_bgis="Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c1_b0_v1.bgen.bgi" \
-igenotype_bgis="Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c2_b0_v1.bgen.bgi" \
…
-igenotype_bgis="Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c22_b0_v1.bgen.bgi" \
-igenotype_samples="/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c1_b0_v1.sample" \
-igenotype_samples="/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c2_b0_v1.sample" \
…
-igenotype_samples="/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c22_b0_v1.sample" \
-ipheno_txt="<file path>/ad_risk_by_proxy_wes.phe" \
-iquant_traits=False \
-ipheno_names=ad_by_proxy \
-icovar_names=age,sex,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \
-istep1_block_size=1000 \
-istep2_block_size=200 \
-imin_mac=3 \
-istep1_extract_txt="<file path>/Final_WES_array_snps_CRCh38_qc_pass.snplist" \
-istep2_extract_txt="<file path>/Final_WES_snps_CRCh38_qc_pass.snplist" \
-istep1_ref_first=1 \
-icovar_txt="<file path>/ad_risk_by_proxy_wes.phe" \
-iuse_firth_approx=True \
--destination "<file path>"

LocusZoom

Next you can use LocusZoom to visualize the results of your analysis. The LocusZoom app is available in the Tools Library of the Research Analysis Platform. It generates Manhattan plots, QQ plots, and a table of high p-value loci.

To visualize your regenie outputs (e.g. the lmm.tsv.gz file), provide it as input to LocusZoom, and follow the app's instructions. From these results you can conduct further analyses and look into variant effects, or you may find new loci to explore.

Examples of these visualizations are shown below.

Benchmarking

A few example cases that might be interesting were tested to provide expectations on the Regenie runtime.

Step 1

The runtime and resources of various scenarios for Step 1 execution are summarized in the following table. Please note that this is only for orientational purposes and analysis-specific factors can influence the final results.

Step 2

Focusing on chromosome 1 (99236 variants after QC) as we analyze all chromosomes in parallel. The runtime of Step 2 varied from 7 to 140 minutes for mentioned use cases. The resources used are relatively stable across the tested scenarios and the default values are established to 3700 MB RAM and 40 GB disk.

How to Cite

If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.

PreviousBurden Testing with regenie Using WES Annotation Files NextTechnical Support

Last updated 9 months ago