Details on Processing Whole Exome Datasets to Generate the Quality Control Set

The processing methodology described on this page applies to all OQFE datasets available on the Research Analysis Platform, including the 300k, 450k, and the final release OQFE whole exome sequencing datasets.

Introduction

The current release of the UK Biobank (UKB) whole exome sequencing (WES) data on 302,333 participants comprises single- and multi-sample variant data generated via the same protocols that were applied to the UKB 200k WES release in early 2021. All samples are processed with the OQFE mapping protocol, and variants are called with DeepVariant and aggregated into a multi-sample VCF with GLnexus. The multi-sample VCF contains per-genotype metrics including depth and genotype qualities, allowing researchers to perform custom variant- and genotype-level filtering as appropriate for their desired analyses. As such, a single unfiltered multi-sample VCF was provided for the 300k WES release along with the derived PLINK files. In response to feedback from the UK Biobank community, the 300k WES release also includes an auxiliary file, ukb23145_300k_OQFE.90pct10dp_qc_variants.txt, to aid researchers in implementing basic best practices for genotype-phenotype association analyses.

Please note that only users with approved access to the 300k WES data can view the sequencing data and this auxiliary file from their approved project. Access to the 300k WES data must be requested from, and approved by, UK Biobank.

UKB WES Filtering for Genotype-Phenotype Association Analyses

The breadth and depth of UKB phenotypes offer researchers a broad landscape of possibilities for association analyses. These range from single-variant tests to gene burden testing, across individual and aggregated phenotypes. While no single set of filtered genotypes can be optimized for all possible analyses, there are features fundamental to the UKB WES data that can lead to spurious association results if not accounted for.

Specifically, the UKB WES data was generated in two phases: the first 50k participants (Phase 1) and then the balance of the total 500k cohort (Phase 2). As described in the Phase 1 release manuscript, the 50k release participants were selected to enrich for specific phenotypes. Given the non-random order of participant sequencing, variations in sequencing coverage that occur over long-term projects can manifest as spurious association results. The UKB community reported such spurious hits when single-variant tests were run on the unfiltered UKB WES 200k genotypes. As an example, Figure 1A shows all single-variant hits of the UKB WES 200k unfiltered genotypes tested against an asthma phenotype (PHE10_J45), indicating a large number of likely spurious variants with significant or near-significant P-values. Examination of these spurious hits in the UKB WES 200k unfiltered set indicates that these variants tend to be enriched for sample-genotypes with low per-genotype read depth.

As noted in the UKB WES 200k FAQ (here, section 23.d), we suggest the inclusion of a batch covariate in association tests on these data, to account for differences in oligo lots between Phase 1 and Phase 2. These coverage heterogeneities can also be mitigated by a single variant-level filter requiring that at least 90% of all genotypes for a given variant - independent of variant allele zygosity - have a read depth of at least 10 (i.e. DP>=10). When this filter is applied to the UKB WES 200k data prior to association analysis, the results are largely devoid of the spurious hits (Fig. 1B).

Application of this depth filter (“90pct10dp”) is consistent across the UKB 200k and UKB 300k WES sets with respect to numbers of variants removed (Table 1). The filtering can also be performed directly on the multi-sample VCF with the bcftools commands below:

bcftools norm -m - -f <reference> -Oz -o <normVCF> <inputVCF>
bcftools view -i 'F_PASS(DP>=10 & GT!="mis")> 0.9' -Oz -o <filtered_normVCF> <normVCF>

Details on the <reference> above can be found in the reference document linked here.
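To make the F_PASS expression above more concrete, the following is a minimal sketch of the same per-variant logic applied to a toy depth table with awk. The table layout (variant ID followed by one DP value per sample, with "." for a missing genotype), the temporary file, and the function name failing_variants are assumptions for illustration only and are not part of the UKB release. The sketch mirrors the strict "> 0.9" comparison used in the bcftools command.

```shell
# Toy depth table (assumed layout): variant ID, then one DP value per sample.
# "." stands for a missing genotype. This is NOT a UKB file format.
depth_table=$(mktemp)
printf '%s\n' \
  '1:1000:A:G 12 15 30 11 10 22 18 14 25 16' \
  '1:2000:C:T 12 3 30 11 . 22 18 14 25 16' \
  '1:3000:G:A 40 35 30 11 10 22 18 14 25 19' > "$depth_table"

# Hypothetical helper: print the IDs of variants for which the fraction of
# non-missing genotypes with DP >= 10 is not strictly greater than 0.9.
failing_variants() {
  awk -v min_dp=10 -v min_frac=0.9 '{
    pass = 0
    for (i = 2; i <= NF; i++)
      if ($i != "." && $i + 0 >= min_dp) pass++
    if (pass / (NF - 1) <= min_frac) print $1
  }' "$1"
}

failing=$(failing_variants "$depth_table")
echo "$failing"   # prints 1:2000:C:T (8 of 10 genotypes pass, i.e. 0.8)
rm -f "$depth_table"
```

In this toy example only the second variant fails, because one genotype has DP=3 and one is missing. For real data the bcftools commands above remain the appropriate tool; the sketch is only meant to show what the filter counts.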

Alternatively, use the provided helper file named ukb23145_300k_OQFE.90pct10dp_qc_variants.txt, which is a single-column text file containing variants failing the “90pct10dp” depth filter in the CHR:POS:REF:ALT format. The following command using PLINK 1.9 can be used to remove the filtered variants from the UKB 300k WES PLINK files:

plink --bfile <original> --out <filtered> --exclude ukb23145_300k_OQFE.90pct10dp_qc_variants.txt --keep-allele-order

The exact file path may change depending on where the file is located or mounted from the RAP project.
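Before running the PLINK command, it can be useful to check how many variants in a .bim file actually appear in the exclusion list. The sketch below does this with awk on toy files; it assumes the variant IDs in column 2 of the .bim file use the same CHR:POS:REF:ALT format as the helper file. The file contents and the function name count_excluded are hypothetical and chosen only for illustration.

```shell
# Toy exclusion list (one variant ID per line) and toy .bim file.
# Both are made-up stand-ins for the real helper file and PLINK .bim file.
excl_list=$(mktemp)
bim_file=$(mktemp)
printf '%s\n' '1:2000:C:T' '2:500:G:A' > "$excl_list"
printf '%s\n' \
  '1 1:1000:A:G 0 1000 G A' \
  '1 1:2000:C:T 0 2000 T C' \
  '2 2:500:G:A 0 500 A G' \
  '2 2:900:T:C 0 900 C T' > "$bim_file"

# Hypothetical helper: count .bim variants whose ID (column 2) is in the list.
# $1 = exclusion list, $2 = .bim file
count_excluded() {
  awk 'FILENAME == ARGV[1] { bad[$1] = 1; next }
       ($2 in bad) { n++ }
       END { print n + 0 }' "$1" "$2"
}

n_excluded=$(count_excluded "$excl_list" "$bim_file")
n_total=$(wc -l < "$bim_file" | tr -d ' ')
echo "$n_excluded of $n_total variants would be excluded"   # prints 2 of 4 variants would be excluded
rm -f "$excl_list" "$bim_file"
```

A count of zero on real data would suggest that the ID formats do not match or that the wrong file path was supplied.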

bcftools and plink can be used as part of the Swiss Army Knife app found in the Tools Library on the Research Analysis Platform. See here for a detailed tutorial on how to use Swiss Army Knife.

For the Swiss Army Knife app documentation, see here (requires Research Analysis Platform login).

Figure 1.

The above figure shows pre- and post-filtering UKB WES 200k association results with an asthma phenotype (PHE10_J45). Subfigures A and B (top and bottom, respectively) show results on the unfiltered UKB WES 200k genotypes and on the genotypes filtered at the variant level for 90% DP>=10. The tests were logistic regressions performed with standard covariates (10 PCs, age, sex, age^2, age_x_sex).

Table 1.

Learn More

For more on the 300k WES dataset and how to work with it, watch this section of the UK Biobank dataset overview webinar.

How to Cite

If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
