Details on Processing Whole Exome Datasets to Generate the Quality Control Set

Last updated 2 years ago

The processing methodology described on this page applies to all OQFE datasets available on the Research Analysis Platform, including the 300k, 450k, and the final release OQFE whole exome sequencing datasets.

Introduction

The current release of the UK Biobank (UKB) whole exome sequencing (WES) data on 302,333 participants comprises single- and multi-sample variant data generated via the same protocols that were applied to the UKB 200k WES release in early 2021. All samples are processed with the OQFE mapping protocol, and variants are called with DeepVariant and aggregated into a multi-sample VCF with GLnexus. The multi-sample VCF contains per-genotype metrics, including depth and genotype quality, allowing researchers to perform custom variant- and genotype-level filtering as appropriate for their analyses. For this reason, a single unfiltered multi-sample VCF was provided for the 300k WES release along with the derived PLINK files. In response to feedback from the UK Biobank community, the 300k WES release also includes an auxiliary file, ukb23145_300k_OQFE.90pct10dp_qc_variants.txt, to aid researchers in implementing basic best practices for genotype-phenotype association analyses.

Please note that only users with approved access to the 300k WES data can view the sequencing data and this auxiliary file, and only from their approved project. Users must request access to the 300k WES data and be approved by UK Biobank before the sequencing data and auxiliary files become visible.

UKB WES Filtering for Genotype-Phenotype Association Analyses

The breadth and depth of UKB phenotypes provide researchers a broad landscape of possibilities for association analyses. These range from single-variant tests to gene burden testing, across individual and aggregated phenotypes. While no singular set of filtered genotypes can be optimized for all possible analyses, there are features fundamental to the UKB WES data that can lead to spurious association results if not accounted for.

Specifically, the UKB WES data was generated in two phases: the first 50k participants (Phase 1) and then the balance of the total 500k cohort (Phase 2). As described in the Phase 1 release manuscript, the 50k release participants were selected to enrich for specific phenotypes. Given the non-random order of participant sequencing, variations in sequencing coverage that occur over long-term projects can manifest as spurious association results. The UKB community reported such spurious hits when single-variant tests were run on the unfiltered UKB WES 200k genotypes. As an example, Figure 1A shows all single-variant hits of the UKB WES 200k unfiltered genotypes tested against an asthma phenotype (PHE10_J45), indicating a large number of likely spurious variants with significant or near-significant P-values. Examination of these spurious hits in the UKB WES 200k unfiltered set indicates that these variants tend to be enriched for sample-genotypes with low per-genotype read depth.

As noted in the UKB WES 200k FAQ (section 23.d), we suggest the inclusion of a batch covariate in association tests on these data, to account for differences in oligo lots between Phase 1 and Phase 2. These coverage heterogeneities can also be mitigated by a single variant-level filter requiring that at least 90% of all genotypes for a given variant, independent of variant allele zygosity, have a read depth of at least 10 (i.e. DP>=10). When this filter is applied to the UKB WES 200k data prior to association analysis, the results are largely devoid of the spurious hits (Figure 1B).

Application of this depth filter (“90pct10dp”) is consistent across the UKB 200k and UKB 300k WES sets with respect to numbers of variants removed (Table 1). The filtering can also be performed directly on the multi-sample VCF with the commands below:

bcftools norm -m - -f <reference> -Oz -o <normVCF> <inputVCF>
bcftools view -i 'F_PASS(DP>=10 & GT!="mis")> 0.9' -Oz -o <filtered_normVCF> <normVCF>

Details on the <reference> above can be found in this reference document.
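The per-variant decision made by the filter expression above can be illustrated with a small Python sketch. This is a toy model of the logic only, not how bcftools implements it: the function name and the use of None to represent a missing genotype are assumptions made for illustration. It mirrors the strict comparison (> 0.9) used in the command.

```python
from typing import Optional, Sequence


def passes_90pct10dp(
    depths: Sequence[Optional[int]],
    min_dp: int = 10,
    min_fraction: float = 0.9,
) -> bool:
    """Decide whether one variant passes the depth filter.

    depths holds one read depth per sample; None marks a missing genotype
    (an illustrative convention, not a bcftools one). A genotype counts as
    passing when it is non-missing and has depth >= min_dp. The variant is
    kept when the passing fraction over all samples exceeds min_fraction.
    """
    if not depths:
        return False
    n_pass = sum(1 for dp in depths if dp is not None and dp >= min_dp)
    return n_pass / len(depths) > min_fraction


# Toy example: 19 of 20 genotypes are well covered, one is missing.
print(passes_90pct10dp([12] * 19 + [None]))
```

Because the comparison is strict, a variant with exactly 90% of genotypes passing would be removed under this sketch.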

Alternatively, use the provided helper file named ukb23145_300k_OQFE.90pct10dp_qc_variants.txt, which is a single-column text file containing variants failing the “90pct10dp” depth filter in the CHR:POS:REF:ALT format. The following command using PLINK 1.9 can be used to remove the filtered variants from the UKB 300k WES PLINK files:

plink --bfile <original> --out <filtered> --exclude ukb23145_300k_OQFE.90pct10dp_qc_variants.txt --keep-allele-order

The exact file path may change depending on where the file is located or mounted from the RAP project.
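Before running the exclusion, it can be useful to check how many variants in the PLINK .bim file actually match the identifiers in the helper file. The sketch below is one way to do that, assuming the .bim variant IDs (second column) use the same CHR:POS:REF:ALT format as the helper file; the function names and toy records are hypothetical.

```python
from typing import Iterable, Set, Tuple


def read_exclusion_ids(lines: Iterable[str]) -> Set[str]:
    """Collect variant IDs from a single-column list, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def count_kept_and_removed(
    bim_lines: Iterable[str], exclude_ids: Set[str]
) -> Tuple[int, int]:
    """Count .bim variants that would be kept or removed by the ID list.

    A .bim record has six whitespace-separated fields; the variant ID is
    the second one. Malformed lines are ignored.
    """
    kept = removed = 0
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        if fields[1] in exclude_ids:
            removed += 1
        else:
            kept += 1
    return kept, removed


# Toy data standing in for the helper file and a .bim file.
exclude_ids = read_exclusion_ids(["1:100:A:G\n", "2:200:CT:C\n", "\n"])
bim = [
    "1\t1:100:A:G\t0\t100\tG\tA\n",
    "1\t1:150:C:T\t0\t150\tT\tC\n",
    "2\t2:200:CT:C\t0\t200\tC\tCT\n",
]
kept, removed = count_kept_and_removed(bim, exclude_ids)
print(f"kept={kept} removed={removed}")
```

With real data, both functions accept open file handles in place of the toy lists. A removed count of zero would suggest that the ID formats do not match.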

bcftools and plink can be used as part of the Swiss Army Knife app found in the Tools Library on the Research Analysis Platform. See here for a detailed tutorial on how to use Swiss Army Knife.

Figure 1.

The above figure shows pre- and post-filtering UKB WES 200k association results with the asthma phenotype (PHE10_J45). Subfigures A and B (top and bottom, respectively) show results on the unfiltered UKB WES 200k genotypes and the 90% DP>=10 variant-filtered genotypes. The tests were logistic regressions performed with standard covariates (10 PCs, age, sex, age^2, age_x_sex).
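The derived covariates named in the caption (age^2 and age_x_sex) are simple functions of age and sex. A minimal sketch of how they might be computed for one participant follows; the 0/1 coding of sex and the output key names are assumptions, since the source does not specify them.

```python
from typing import Dict


def derived_covariates(age: float, sex: int) -> Dict[str, float]:
    """Build age, sex, age^2 and age-by-sex terms for one participant.

    sex is assumed to be coded 0 or 1 (an illustrative choice); the
    principal components would be added to this record separately.
    """
    if sex not in (0, 1):
        raise ValueError("sex is expected to be coded 0 or 1 in this sketch")
    age = float(age)
    return {
        "age": age,
        "sex": float(sex),
        "age2": age * age,        # squared age term
        "age_x_sex": age * sex,   # age-by-sex interaction term
    }


print(derived_covariates(50, 1))
```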

Table 1.

% Filtered Variants    SNP      Indel
UKB 200k               1.52%    5.54%
UKB 300k               1.57%    5.20%

Learn More

For more information, see the app documentation for Swiss Army Knife (requires Research Analysis Platform login).

For more on the 300k WES dataset and how to work with it, watch this section of the UK Biobank dataset overview webinar.

How to Cite

If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.