This document describes the UKB whole exome sequencing (WES) protocol of the single- and aggregate-sample processing, including read alignment, variant calling, joint genotyping and post aggregation reformatting, employed by the Regeneron Genetics Center to generate the UK Biobank WES data set (category 170) for public release. Following this protocol, researchers can aggregate their own sequencing data with the UK Biobank single sample data, enabling mega-analysis.
This protocol applies to all UK Biobank whole exome sequencing data sets, including 200k, 300k, 450k, and the final release OQFE data sets.
The protocol is as follows:
The UKB WES data are reference-aligned with the OQFE protocol, which employs BWA-MEM to map all reads to the GRCh38 reference in an alt-aware manner, marks read duplicates, and adds additional per-read tags. The OQFE protocol retains all reads and original quality scores such that the original FASTQ is completely recoverable from the resulting CRAM file. All constituent steps of the OQFE protocol are executed with open-source software and described in detail in the OQFE manuscript linked above. Given the impact even small changes to the protocol can introduce into large analyses, the OQFE protocol is available as a Docker file and as an app on the DNAnexus Research Analysis Platform that executes the OQFE Docker file.
The OQFE Docker file takes either FASTQ or CRAM files as inputs and outputs an OQFE CRAM, ensuring that all steps are executed exactly as specified in the OQFE protocol. We strongly recommend that users seeking to harmonize their data with UKB WES data execute the OQFE protocol via these implementations.
To use the OQFE app, run the following command from the CLI:
If you are using the UI, start from the app.
This section of the UKB WES protocol describes the variant calling using DeepVariant. To call variants in WES data, either the default DeepVariant WES model or a custom model can be used. The custom model that is trained on WES data generated by Regeneron Genetics Center and used for the generation of UK Biobank data is available as supplementary materials in Krasheninina et al., 2020.
Prerequisites:
DeepVariant v0.10.0 docker file
OQFE CRAM generated by following section 1 of this protocol (e.g. aligned.cram)
GRCh38 reference sequences (which can be found under the folder containing the exome sequences in your project, detailed in OQFE protocol)
Calling regions in BED format (which can be found in the folder containing exome sequences in your project)
A custom model file, if applicable
The UKB WES variant data set (category 170) was generated using NVIDIA Clara Parabricks Pipelines accelerated version of DeepVariant, which is a faithful reproduction of Google’s DeepVariant (with the additional commit), and is an order of magnitude faster. The results of the two implementations are equivalent when tested across many samples. Across 100 test samples, 7.5 million sample level variants were called with 100% concordance between the two implementations of DeepVariant, with 1 zygosity mismatch and MEAN GQ difference of 0.43 on Phred scale. This protocol first outlines how to run using the GPU accelerated DeepVariant provided by NVIDIA Parabricks, and made available on the DNAnexus Research Analysis Platform as an app.
The details of the applet used to run the dataset are as follows:
Input Files:
Reference FASTA
Input BAM/CRAM
Input BAM/CRAM index
Interval file
Parabricks license file
Parabricks engine file for custom model
Input Options:
--disable-use-window-selector-mode : True
--include-med-dp : True
--gvcf : True
--gzip-output : True
Output Files:
Output gVCF file in bgzipped format
Index of output gVCF file
One sample is expected to run in around 5 minutes including the data upload and download time. The DNAnexus command line input for the run is:
If you are using the UI, start from the app.
This section of the UKB WES protocol describes the aggregation of variant genotypes and variant allele harmonization across sample-level gVCFs into multi-sample project-level VCF file (pVCF) organized in chromosomes or genomic segments using GLnexus, containing a row for every set of overlapping variants and each sample’s genotype for every variant allele. This pVCF is “squared-off”, in that for samples that do not contain an alternative allele genotype for a given variant, genotypes are derived from the gVCF reference blocks, reporting the read depth and most likely genotype (i.e. 0/0 or missing) for that sample at that variant position.
The GLnexus aggregation process requires only the gVCFs and the desired aggregation regions in the BED format (BED format specifications, https://m.ensembl.org/info/website/upload/bed.html) as inputs. The BED file for UKB WES data is the exome capture region buffered by 100 bp on each side of each target, with overlapping buffered regions merged. Users who are applying this protocol to a set of gVCFs derived from sequencing across multiple capture designs will need to generate a unified BED file for the regions of interest. The protocol here describes the BED generation process for aggregating variants across multiple capture designs: the intersection of all capture designs and the union of all capture designs.
The BCFtools and BEDtools commands mentioned in this section of the UKB WES protocol can be executed via command line using the Swiss Army Knife app. Please refer to the Swiss Army Knife documentation for more information about running the app with the command along with managing the inputs/outputs.
Prerequisites:
BEDtools
Custom target regions in BED format (e.g. targets.bed).
FASTA sequences (e.g. references.fa) of the reference index file (e.g. references.fa.fai). For the corresponding FASTA and index files used for UKB WES data, can be found in the dispensed UKB project within the folder containing exome sequences.
Steps:
Generate 100 bp buffer regions of each side of the target regions.
bedtools flank -i targets.bed -g references.fa.fai -b 100 > buffers.bed
Command section
Annotation
bedtools flank
To create flanking intervals for each BED feature
-i targets.bed
The input BED file
-g references.fa.fai
The genome file defining chromosome bounds
-b 100
The number of base pairs in each direction to add to the input BED file
> buffers.bed
Redirect the output to the buffers.bed
Combine the target and buffer regions and sort based on chromosome and start coordinates.
cat targets.bed buffers.bed | sort -k1,1 -k2,2n > target_buffers.bed
Command section
Annotation
cat targets.bed buffers.bed
Concatenate the target and buffer BED files to the standard output
| sort -k1,1 -k2,2n
Pipe to a sort command to sort the BED by chromosome first and then by the start coordinates in the numeric order
> target_buffers.bed
Redirect the output to a target_buffers.bed file
Merge overlapping regions.
bedtools merge -i target_buffers.bed > calling_regions.bed
Command section
Annotation
bedtools merge
Merge overlapping BED features into a single interval
-i target_buffers.bed
The input BED file
> calling_regions.bed
Redirect the output to a calling_regions.bed file
Prerequisites:
BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.
The following commands mentioned in this section of the UKB WES protocol can be executed in the Swiss Army Knife app. Please refer to the Swiss Army Knife documentation for more information about running the app with the command along with managing the inputs/outputs.
Combine the two calling regions and sort based on chromosome and start coordinates.
cat calling_regions_1.bed calling_regions_2.bed | sort -k1,1 -k2,2n > combined.bed
Command section
Annotation
cat calling_regions_1.bed calling_regions_2.bed
Concatenate the calling_regions_1 and calling_regions_2 BED files to the standard output
| sort -k1,1 -k2,2n
Pipe to a sort command to sort the BED by chromosome first and then by the start coordinates in the numeric order
> combined.bed
Redirect the output to a combined.bed file
Merge overlapping regions.
bedtools merge -i combined.bed > combined_calling_regions.bed
Command section
Annotation
bedtools merge
Merge overlapping BED features into a single interval
-i combined.bed
The input BED file
> combined_calling_regions.bed
Redirect the output to a combined_calling_regions.bed file
Prerequisites:
BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.
Steps:
Get the intersect of the two calling regions with BEDtools.
bedtools intersect -a calling_regions_1.bed -b calling_regions_2.bed > intersect.bed
Command section
Annotation
bedtools intersect
Bedtools intersect
-a calling_regions_1.bed
First calling regions in BED format
-b calling_regions_2.bed
Section calling regions in BED format
> intersect.bed
Redirect the output to a intersect.bed file
Prerequisites:
Access to a DNAnexus Research Analysis Platform folder containing the single sample genomic VCF (gVCF) for all samples to be included in pVCF.
A BED format file containing the genomic regions to be aggregated and reported in the pVCF.
Steps:
Generation of a manifest file containing the DNAnexus file ids of the gVCF files by matching the naming convention of the files (e.g. *.gvcf.gz) to be aggregated.
dx find data --project=<rap_project_name> --folder=<rap_folder_name> --name="*.g.vcf.gz" --brief | cut -d\: -f2 > <manifest_file>
Command section
Annotation
dx find data
Find data objects subject to the given search parameters
--project=<dx_project_name>
The DNAnexus project name where the data is in
--folder=<folder_name>
The folder where the data is in the DNAnexus project
--name="*.gvcf.gz"
The files to look for using a wildcard “*”
--brief
Print a DNAnexus ID per line for all matching gVCF files
| cut -d\: -f2
Pipe to the cut command to extract the DNAnexus file IDs only
> <manifest_file>
Redirect the output to a file
Upload the <manifest_file> to the DNAnexus Research Analysis platform.
dx upload --path <dx_project>:/<dx_folder>/ <manifest_file> --brief
Command section
Annotation
dx upload
Upload local file to DNAnexus
--path <dx_project>:/<dx_folder>/
The folder <dx_folder> in the DNAnexus project <dx_project> that the manifest file is uploaded to
<manifest_file>
The local manifest file that needs to be uploaded
--brief
Return a DNAnexus file ID for the uploaded manifest file
Launch GLnexus app to generate pVCF files.
dx run glnexus -y --brief -ivariants_gvcfgz_manifest_csv=<manifest_file_location> -igenotype_range_bed=<calling_regions.bed> -ioutput_prefix=<base_name> --folder=<output_folder_in_RAP>
If you are using the UI, start from the app.
This section of the UKB WES protocol describes how UKB WES PLINK and BGEN format files are derived from the pVCF, including the decomposition of multi-allelic variants into biallelic variants and variant normalization prior to format conversion to PLINK and BGEN. BGEN is recommended for performing GWAS using Regenie (https://doi.org/10.1038/s41588-021-00870-7).
Prerequisites:
PLINK 1.9
PLINK 2.0 (for BGEN conversion)
Reference sequences in FASTA format (e.g. references.fa)
The following commands mentioned in this section of the UKB WES protocol can be executed in the Swiss Army Knife app. Please refer to the Swiss Army Knife documentation for more information about running the app with the command along with managing the inputs/outputs.
Steps:
Split multiallelic variants in pVCF file and variant normalization.
bcftools norm -f references.fa -m -any -Oz -o pvcf.norm.vcf.gz pvcf.vcf.gz
Command section
Annotation
bcftools norm
Split multiallelic variants and normalization
-f references.fa
Specify the reference sequences (required for left-alignment and normalization)
-m -any
Split any multiallelic variants
-Oz
The output type is ‘compressed VCF’
-o pvcf.norm.vcf.gz
Write output to a file named ‘pvcf.norm.vcf.gz’
pvcf.vcf.gz
The input pVCF file
Convert normalized biallelic pVCF to PLINK files.
plink --vcf pvcf.norm.vcf.gz --keep-allele-order --vcf-idspace-to _ --double-id --allow-extra-chr 0 --make-bed --vcf-half-call m --out pvcf.norm
Command section
Annotation
plink
PLINK 1.9
--vcf pvcf.norm.vcf.gz
Specify the input pVCF file
--keep-allele-order
Keep the allele order
--vcf-idspace-to _
Change spaces in the variant IDs to underscore (_)
--double-id
Set both family ID and within-family ID to the same sample ID
--allow-extra-chr 0
Treat the unrecognized chromosome codes as if they have been set to zero
--make-bed
Creates a new PLINK 1 binary fileset
--vcf-half-call m
Convert half calls (./1) in the pVCF to missing in PLINK
--out pvcf.norm
Specify the prefix of the output PLINK
Convert PLINK files to BGEN files and prepare BGEN files.
First convert PLINK to a zlib-compressed BGEN file:
plink2 --bfile pvcf.norm --export bgen-1.2 bits=8 id-paste=iid ref-first --out pvcf.norm_zlib
Command section
Annotation
plink2
PLINK2
--bfile pvcf.norm
The input genotype file
--export bgen-1.2 bits=8 id-paste=iid ref-first
The output format BGEN with 8-bits probability precision, iid used to construct ID column and correct allele order
--out pvcf.norm_zlib
The output genotype file
Convert zlib-compressed BGEN to zstd-compressed BGEN file and omitting the sample identifier block in .bgen file using QCTOOL.
qctool -g pvcf.norm_zlib.bgen -s pvcf.norm_zlib.sample -og pvcf.norm.bgen -os pvcf.norm.sample -ofiletype bgen -bgen-bits 8 -bgen-compression zstd -bgen-omit-sample-identifier-block
Command section
Annotation
qctool
QCTOOL v2
-g pvcf.norm_zlib.bgen
The input genotype file
-s pvcf.norm_zlib.sample
The input sample file
-og pvcf.norm.bgen
The output genotype file
-os pvcf.norm.sample
The output sample file
-ofiletype bgen
The filetype of the output genotype file specified by -og
-bits 8
Store each probability in 8 bits
-bgen-compression zstd
Use zstd algorithm for BGEN compression
-bgen-omit-sample-identifier-block
Omit the sample identifier block
Generate BGEN index.
bgenix -g pvcf.norm.bgen -index -clobber
Command section
Annotation
bgenix
bgenix
-g pvcf.norm.bgen
Specify the input genotype file
-index
Create an index file for the given bgen file
-clobber
bgenix will overwrite existing index file if it exists