Burden Testing with regenie Using WES Annotation Files


This article describes the format of the annotation files for the 500k WES dataset, and demonstrates how to use regenie on the UK Biobank Research Analysis Platform for generating variant masks on which to perform association tests.

Specifically, this article will go over how to use the following helper files:




A typical application of this tool would be in rare variant analyses where single variant tests have lower power, and combining variants into masks can boost association power. The commands shown below are executed in the JupyterLab environment with Bash kernel. For more on how to use JupyterLab on the Research Analysis Platform, see these tutorials. Alternatively, you can use ttyd (web-based terminal) or Cloud Workstation.

We will go over building masks on the fly in regenie, covering the following:

  1. The format of the input files

  2. How to run regenie to build and test masks

  3. The LOVO scheme

We assume that the annotation files are located along with the dispensed files in the user’s project. We also assume the RAP project is mounted on “/mnt/project” on the worker as per the approach detailed here.

The path to the annotation file:

path_to_500kwes_helper_files="/mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - final release/helper_files/"

Input File Format

For this tutorial, we need the following files:

  • Annotation file

  • Masks definition file

  • Set list file

Use the following command to view the directory structure of the files:

ls -1 "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.sets.txt.gz" \
   "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.annotations.txt.gz" \
   "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.masks" | xargs -L1 -I{} basename "{}"

The following files should be in your directory:

  1. ukb23158_500k_OQFE.annotations.txt.gz

  2. ukb23158_500k_OQFE.masks

  3. ukb23158_500k_OQFE.sets.txt.gz

Set list file

The file ukb23158_500k_OQFE.sets.txt.gz defines how sets - i.e. genes - are to be constructed, by listing variants corresponding to each gene.

Each line contains the gene name, followed by a chromosome and physical position - to be used in the association result file - then by a comma-separated list of variants included in the gene, in the format CHR:POS:REF:ALT ID.

The variant IDs correspond to those in the genotype file (if not in the file, they will be ignored when running regenie). To view the file:

( zcat "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.sets.txt.gz" 2>/dev/null ) | head -n 2 | cut -d',' -f1-3 | awk '{print $0"..."}'

The file details will appear as the following:

A1BG(ENSG00000121410)	19	58347026	19:58347026:C:G,19:58347026:C:T,19:58347030:C:G...
A1CF(ENSG00000148584)	10	50806736	10:50806736:G:A,10:50806738:G:A,10:50806740:C:T...

In this file there are almost 19,000 defined genes. To see the count, use the following command:

zcat "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.sets.txt.gz" | wc -l

Which will give you:


Annotation file

The file ukb23158_500k_OQFE.annotations.txt.gz defines a functional annotation for each variant given a gene set. Each line contains the variant name, the gene name (corresponds to the name in the set list file above), and a single annotation label.

To view this file:


(zcat "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.annotations.txt.gz" 2>/dev/null) | head -n 5

The file details will appear as the following:

1:69134:A:G	OR4F5(ENSG00000186092) missense(0/5)
1:69144:C:T	OR4F5(ENSG00000186092) synonymous
1:69149:T:A	OR4F5(ENSG00000186092) missense(>=1/5)
1:69217:G:A	OR4F5(ENSG00000186092) missense(0/5)
1:69224:A:T	OR4F5(ENSG00000186092) missense(>=1/5)

There are a total of 5 annotation labels in the file. Check the file ukb23158_helper_files.pdf that is available on the UK Biobank showcase here for more details on the labels. Variants in the set list file which don't have annotations will be assigned to a default NULL category in regenie.

zcat "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.annotations.txt.gz" | cut -d' ' -f3 | sort | uniq



Mask definition file

The file ukb23158_500k_OQFE.masks specifies which annotation labels should be combined into masks. Each line contains a mask name, followed by a comma-separated list of the annotations included in the mask (i.e. taking a union over the annotation categories).


cat "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.masks"


M1 LoF

The specified M1 mask includes only loss-of-function annotated variants in the mask. You can easily add new masks by selecting the annotations you want to combine. When doing so, make sure the annotation labels match the formatting of those in the annotation file above.

For example, to have a mask called M2 that includes LoF and missense annotated variants, you would generate a mask definition file as:

echo "M2 LoF,missense(0/5),missense(5/5),missense(>=1/5)" > ldl_custom_masks.txt

For more on these input files, refer to the file ukb23158_helper_files.pdf here.

Running regenie to Build and Test Masks

From this point on, you will need to have regenie installed in your working environment.

Here we show how to use annotation files for burden testing using regenie. For more detailed instructions on using regenie, refer to the regenie documentation page.

To install regenie in your JupyterLab environment:

# Download and expand the pre-compiled regenie from github
wget https://github.com/rgcgithub/regenie/releases/download/v2.2.4/regenie_v2.2.4.gz_x86_64_Linux_mkl.zip

unzip regenie_v2.2.4.gz_x86_64_Linux_mkl.zip 

# Now we can start use it on the notebook

regenie is also available as part of the Swiss Army Knife app, if you want to run an analysis as part of a larger workflow on the Research Analysis Platform.

In regenie, we will use the following options:

--aaf-bins to specify AAF cutoffs to use when building masks (the singleton class of masks is always included)

e.g. `--aaf-bins 0.05,0.01` will build masks using singletons, 1% and 5% AAF cutoffs.

In the examples below, we will use the OQFE genotype data in BGEN format. For the purposes of this tutorial, we will skip regenie step 1 and use --ignore-pred to bypass specifying the LOCO PRS.

Note that when running an actual analysis, it is highly recommended that you run Step 1 to control for relatedness, population structure and polygenicity.

We will focus on a specific gene, PCSK9. Start by getting its ID from the annotation files:

echo "name: `zgrep \"PCSK9(\" "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.sets.txt.gz" | awk '{print $1}'`"
echo "chr: `zgrep \"PCSK9(\" "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.sets.txt.gz" | awk '{print $2}'`"

The output should be:

name: PCSK9(ENSG00000169174)
chr: 1

We will use a phenotype file containing LDL measurements, and will carry out burden tests using variant masks built on-the-fly in regenie:

# Specify phenotype file and BGEN genotype file prefix

We will first run regenie building variant masks using a 0.1% and 1% AAF cutoff. We use option --extract-setlist to specify a subset of the 19K genes to test. Alternatively, you can provide a file named like "extract_gene_names.txt", with a single column containing names of genes to analyze, then pass this file to regenie using --extract-sets extract_gene_names.txt:

./regenie_v2.2.4.gz_x86_64_Linux_mkl \
    --step 2 \
    --ignore-pred \
    --bgen ${genotype_prefix}.bgen \
    --ref-first \
    --sample ${genotype_prefix}.sample \
    --phenoFile $phenotype_file \
    --covarFile $phenotype_file \
    --phenoCol LDL \
    --covarColList Sex,Age,Age_Sq,Age_x_Sex,PC{1:10} \
   --set-list "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.sets.txt.gz" \
    --anno-file "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.annotations.txt.gz" \
    --mask-def "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.masks" \
    --nauto 23 \
    --aaf-bins 0.01,0.001 \
    --bsize 200 \
    --extract-setlist "PCSK9(ENSG00000169174)" \
    --out ldl_pcsk9_test

regenie will output a summary statistics file ldl_pcsk9_test_LDL.regenie, containing association results for the built masks, whose name will be PCSK9(ENSG00000169174).M1.*, split between the 3 AAF cutoffs (0.01, 0.001, or singletons).

Saving masks to a file

You have the option to save the built masks to a file, which will be useful and save compute, if you want to use them again in another analysis. To do so, use --write-mask, which will store the masks in PLINK BED format, in files ldl_pcsk9_test_masks.{bed,bim,fam}.

Note that if you want to build masks and save them to file without testing them for association, you can use option --skip-test.

Using different rules to build masks

By default, masks are created by taking the maximum ALT allele count across sites included in the mask - so it will take values "0/1/2" or "NA" for missing. Alternatively, you can specify different rules to build masks:

  • --build-mask sum will take the sum of the ALT allele count across sites

  • --build-mask comphet is used to identify compound heterozygotes, which is defined as carriers of 2 or more ALT alleles across any site included in the mask

Note that building masks using the “sum” rule is not compatible with the use of --write-mask as detailed above.

Getting map composition

To obtain the list of single variants that went into each mask, you can use the option --write-mask-snplist when building masks. This will generate a file where each row has the mask name followed by the list of variants that went into the mask.

Checking input files

If you want to make sure that you’re correctly specifying input annotation files, you can use the option --check-burden-files, which will create a report that checks the input files for concordance. It will check that the same annotation labels are used in all files, and check that there are variants in the set list file which are not in the input genotype file.

Leave-one-variant-out (LOVO) Scheme

If there are built masks that come up as significant, it is usually of interest to determine which of the single variants in the mask is driving the signal. This is what the LOVO scheme, which stands for leave-one-variant-out, aims to do.

For each variant included in a mask, the LOVO scheme will build the mask excluding the variant. Hence, if there are MM variants included in a mask, there will be MM LOVO masks generated. Thus, if a variant highly contributes to the full mask signal the corresponding LOVO mask should have weaker association signal.

To specify LOVO, you need to use option --mask-lovo followed by the gene name, the mask name, and the AAF cutoff (either 'singleton' or a value in (0,1)).

For example, if we wanted to apply LOVO to the M1 mask with 1% AAF cutoff for PCSK9, we would use:

./regenie_v2.2.4.gz_x86_64_Linux_mkl \
    --step 2 \
    --ignore-pred \
    --bgen ${genotype_prefix}.bgen \
    --ref-first \
    --sample ${genotype_prefix}.sample \
    --phenoFile $phenotype_file \
    --covarFile $phenotype_file \
    --phenoCol LDL \
    --covarColList Sex,Age,Age_Sq,Age_x_Sex,PC{1:10} \
    --set-list "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.sets.txt.gz" \
    --anno-file "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.annotations.txt.gz" \
    --mask-def "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.masks" \
    --nauto 23 \
    --bsize 200 \
    --mask-lovo "PCSK9(ENSG00000169174),M1,0.01" \
    --out ldl_pcsk9_test_lovo

This will generate the result output file ldl_pcsk9_test_lovo_LDL.regenie, which will contain association results for each LOVO mask, as well as results for the full mask that considers all sites.

How to Cite

If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.

Last updated