1 of 1

Protocol for Processing UKB Whole Exome Sequencing Data Sets

This document describes the UKB whole exome sequencing (WES) protocol of the single- and aggregate-sample processing, including read alignment, variant calling, joint genotyping and post aggregation reformatting, employed by the Regeneron Genetics Center to generate the UK Biobank WES data set (category 170) for public release. Following this protocol, researchers can aggregate their own sequencing data with the UK Biobank single sample data, enabling mega-analysis.

This protocol applies to all UK Biobank whole exome sequencing data sets, including 200k, 300k, 450k, and the final release OQFE data sets.

The protocol is as follows:

Section 1: Aligning Whole Exome Sequencing (WES) Reads to a Reference Genome

The UKB WES data are reference-aligned with the OQFE protocol, which employs BWA-MEM to map all reads to the GRCh38 reference in an alt-aware manner, marks read duplicates, and adds additional per-read tags. The OQFE protocol retains all reads and original quality scores such that the original FASTQ is completely recoverable from the resulting CRAM file. All constituent steps of the OQFE protocol are executed with open-source software and described in detail in the OQFE manuscript linked above. Given the impact even small changes to the protocol can introduce into large analyses, the OQFE protocol is available as a Docker file and as an app on the DNAnexus Research Analysis Platform that executes the OQFE Docker file.

The OQFE Docker file takes either FASTQ or CRAM files as inputs and outputs an OQFE CRAM, ensuring that all steps are executed exactly as specified in the OQFE protocol. We strongly recommend that users seeking to harmonize their data with UKB WES data execute the OQFE protocol via these implementations.

To use the OQFE app, run the following command from the CLI:

$ dx run oqfe -ireads_fastqgzs=<first_end_reads> -ireads2_fastqgzs=<second_end_reads> -igenome_fastagz=<genome_ file>.fa.gz -isample=<sample_name>

If you are using the UI, start from the app.

Section 2: Variant Calling Using DeepVariant

This section of the UKB WES protocol describes the variant calling using DeepVariant. To call variants in WES data, either the default DeepVariant WES model or a custom model can be used. The custom model that is trained on WES data generated by Regeneron Genetics Center and used for the generation of UK Biobank data is available as supplementary materials in Krasheninina et al., 2020.

Prerequisites:

DeepVariant v0.10.0 docker file
OQFE CRAM generated by following section 1 of this protocol (e.g. aligned.cram)
GRCh38 reference sequences (which can be found under the folder containing the exome sequences in your project, detailed in OQFE protocol)
Calling regions in BED format (which can be found in the folder containing exome sequences in your project)
A custom model file, if applicable

The UKB WES variant data set (category 170) was generated using NVIDIA Clara Parabricks Pipelines accelerated version of DeepVariant, which is a faithful reproduction of Google’s DeepVariant (with the additional commit), and is an order of magnitude faster. The results of the two implementations are equivalent when tested across many samples. Across 100 test samples, 7.5 million sample level variants were called with 100% concordance between the two implementations of DeepVariant, with 1 zygosity mismatch and MEAN GQ difference of 0.43 on Phred scale. This protocol first outlines how to run using the GPU accelerated DeepVariant provided by NVIDIA Parabricks, and made available on the DNAnexus Research Analysis Platform as an app.

The details of the applet used to run the dataset are as follows:

Input Files:
- Reference FASTA
- Input BAM/CRAM
- Input BAM/CRAM index
- Interval file
- Parabricks license file
- Parabricks engine file for custom model
Input Options:
- --disable-use-window-selector-mode : True
- --include-med-dp : True
- --gvcf : True
- --gzip-output : True
Output Files:
- Output gVCF file in bgzipped format
- Index of output gVCF file

One sample is expected to run in around 5 minutes including the data upload and download time. The DNAnexus command line input for the run is:

$ dx run pbdeepvariant_ukb -igenome_fastagz=<genome_file>.fa.gz -iinterval_file=<interval_file>.bed  -imappings_sorted_bam <oqfe_cram>These files can be found on the DNAnexus Research Analysis Platform under the same folder containing the helper files, along with Exome Sequences CRAM Files for Category 170.

If you are using the UI, start from the app.

Section 3: Aggregation and genotype harmonization using GLnexus to generate a multi-sample project-level VCF (pVCF)

This section of the UKB WES protocol describes the aggregation of variant genotypes and variant allele harmonization across sample-level gVCFs into multi-sample project-level VCF file (pVCF) organized in chromosomes or genomic segments using GLnexus, containing a row for every set of overlapping variants and each sample’s genotype for every variant allele. This pVCF is “squared-off”, in that for samples that do not contain an alternative allele genotype for a given variant, genotypes are derived from the gVCF reference blocks, reporting the read depth and most likely genotype (i.e. 0/0 or missing) for that sample at that variant position.

The GLnexus aggregation process requires only the gVCFs and the desired aggregation regions in the BED format (BED format specifications, https://m.ensembl.org/info/website/upload/bed.html) as inputs. The BED file for UKB WES data is the exome capture region buffered by 100 bp on each side of each target, with overlapping buffered regions merged. Users who are applying this protocol to a set of gVCFs derived from sequencing across multiple capture designs will need to generate a unified BED file for the regions of interest. The protocol here describes the BED generation process for aggregating variants across multiple capture designs: the intersection of all capture designs and the union of all capture designs.

The BCFtools and BEDtools commands mentioned in this section of the UKB WES protocol can be executed via command line using the Swiss Army Knife app. Please refer to the Swiss Army Knife documentation for more information about running the app with the command along with managing the inputs/outputs.

Preparation of calling regions in BED format

Scenario 1: Adding 100 bp buffer on each side of the custom target regions in BED format.

Prerequisites:

BEDtools
Custom target regions in BED format (e.g. targets.bed).
FASTA sequences (e.g. references.fa) of the reference index file (e.g. references.fa.fai). For the corresponding FASTA and index files used for UKB WES data, can be found in the dispensed UKB project within the folder containing exome sequences.

Steps:

Generate 100 bp buffer regions of each side of the target regions.
- bedtools flank -i targets.bed -g references.fa.fai -b 100 > buffers.bed
Combine the target and buffer regions and sort based on chromosome and start coordinates.
- cat targets.bed buffers.bed | sort -k1,1 -k2,2n > target_buffers.bed
Merge overlapping regions.
- bedtools merge -i target_buffers.bed > calling_regions.bed

Scenario 2: Merging to create the union of two different calling regions in BED format.

Prerequisites:

BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.

The following commands mentioned in this section of the UKB WES protocol can be executed in the Swiss Army Knife app. Please refer to the Swiss Army Knife documentation for more information about running the app with the command along with managing the inputs/outputs.

Combine the two calling regions and sort based on chromosome and start coordinates.
- cat calling_regions_1.bed calling_regions_2.bed | sort -k1,1 -k2,2n > combined.bed
Merge overlapping regions.
- bedtools merge -i combined.bed > combined_calling_regions.bed

Scenario 3: Create the intersection of two different calling regions in BED format.

Prerequisites:

BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.

Steps:

Get the intersect of the two calling regions with BEDtools.
- bedtools intersect -a calling_regions_1.bed -b calling_regions_2.bed > intersect.bed

Running GLnexus workflow in DNAnexus Research Analysis Platform to generate pVCF

Prerequisites:

GLnexus app
Access to a DNAnexus Research Analysis Platform folder containing the single sample genomic VCF (gVCF) for all samples to be included in pVCF.
A BED format file containing the genomic regions to be aggregated and reported in the pVCF.

Steps:

Generation of a manifest file containing the DNAnexus file ids of the gVCF files by matching the naming convention of the files (e.g. *.gvcf.gz) to be aggregated.
- dx find data --project=<rap_project_name> --folder=<rap_folder_name> --name="*.g.vcf.gz" --brief | cut -d\: -f2 > <manifest_file>
Upload the <manifest_file> to the DNAnexus Research Analysis platform.
- dx upload --path <dx_project>:/<dx_folder>/ <manifest_file> --brief
Launch GLnexus app to generate pVCF files.
- dx run glnexus -y --brief -ivariants_gvcfgz_manifest_csv=<manifest_file_location> -igenotype_range_bed=<calling_regions.bed> -ioutput_prefix=<base_name> --folder=<output_folder_in_RAP>

If you are using the UI, start from the app.

Conversion of pVCF to PLINK and BGEN files

This section of the UKB WES protocol describes how UKB WES PLINK and BGEN format files are derived from the pVCF, including the decomposition of multi-allelic variants into biallelic variants and variant normalization prior to format conversion to PLINK and BGEN. BGEN is recommended for performing GWAS using Regenie (https://doi.org/10.1038/s41588-021-00870-7).

Prerequisites:

BCFtools v1.9
PLINK 1.9
PLINK 2.0 (for BGEN conversion)
QCTOOL v2
Reference sequences in FASTA format (e.g. references.fa)

The following commands mentioned in this section of the UKB WES protocol can be executed in the Swiss Army Knife app. Please refer to the Swiss Army Knife documentation for more information about running the app with the command along with managing the inputs/outputs.

Steps:

Split multiallelic variants in pVCF file and variant normalization.
- bcftools norm -f references.fa -m -any -Oz -o pvcf.norm.vcf.gz pvcf.vcf.gz
Convert normalized biallelic pVCF to PLINK files.
- plink --vcf pvcf.norm.vcf.gz --keep-allele-order --vcf-idspace-to _ --double-id --allow-extra-chr 0 --make-bed --vcf-half-call m --out pvcf.norm
Convert PLINK files to BGEN files and prepare BGEN files.
- First convert PLINK to a zlib-compressed BGEN file:
  - plink2 --bfile pvcf.norm --export bgen-1.2 bits=8 id-paste=iid ref-first --out pvcf.norm_zlib
- Convert zlib-compressed BGEN to zstd-compressed BGEN file and omitting the sample identifier block in .bgen file using QCTOOL.
  - qctool -g pvcf.norm_zlib.bgen -s pvcf.norm_zlib.sample -og pvcf.norm.bgen -os pvcf.norm.sample -ofiletype bgen -bgen-bits 8 -bgen-compression zstd -bgen-omit-sample-identifier-block
Generate BGEN index.
- bgenix -g pvcf.norm.bgen -index -clobber

Protocol for Processing UKB Whole Exome Sequencing Data Sets

This protocol applies to all UK Biobank whole exome sequencing data sets, including 200k, 300k, 450k, and the final release OQFE data sets.

The protocol is as follows:

Section 1: Aligning Whole Exome Sequencing (WES) Reads to a Reference Genome

To use the OQFE app, run the following command from the CLI:

$ dx run oqfe -ireads_fastqgzs=<first_end_reads> -ireads2_fastqgzs=<second_end_reads> -igenome_fastagz=<genome_ file>.fa.gz -isample=<sample_name>

If you are using the UI, start from the app.

Section 2: Variant Calling Using DeepVariant

Prerequisites:

DeepVariant v0.10.0 docker file
OQFE CRAM generated by following section 1 of this protocol (e.g. aligned.cram)
GRCh38 reference sequences (which can be found under the folder containing the exome sequences in your project, detailed in OQFE protocol)
Calling regions in BED format (which can be found in the folder containing exome sequences in your project)
A custom model file, if applicable

The details of the applet used to run the dataset are as follows:

Input Files:
- Reference FASTA
- Input BAM/CRAM
- Input BAM/CRAM index
- Interval file
- Parabricks license file
- Parabricks engine file for custom model
Input Options:
- --disable-use-window-selector-mode : True
- --include-med-dp : True
- --gvcf : True
- --gzip-output : True
Output Files:
- Output gVCF file in bgzipped format
- Index of output gVCF file

One sample is expected to run in around 5 minutes including the data upload and download time. The DNAnexus command line input for the run is:

$ dx run pbdeepvariant_ukb -igenome_fastagz=<genome_file>.fa.gz -iinterval_file=<interval_file>.bed  -imappings_sorted_bam <oqfe_cram>These files can be found on the DNAnexus Research Analysis Platform under the same folder containing the helper files, along with Exome Sequences CRAM Files for Category 170.

If you are using the UI, start from the app.

Section 3: Aggregation and genotype harmonization using GLnexus to generate a multi-sample project-level VCF (pVCF)

Preparation of calling regions in BED format

Scenario 1: Adding 100 bp buffer on each side of the custom target regions in BED format.

Prerequisites:

BEDtools
Custom target regions in BED format (e.g. targets.bed).
FASTA sequences (e.g. references.fa) of the reference index file (e.g. references.fa.fai). For the corresponding FASTA and index files used for UKB WES data, can be found in the dispensed UKB project within the folder containing exome sequences.

Steps:

Generate 100 bp buffer regions of each side of the target regions.
- bedtools flank -i targets.bed -g references.fa.fai -b 100 > buffers.bed
Combine the target and buffer regions and sort based on chromosome and start coordinates.
- cat targets.bed buffers.bed | sort -k1,1 -k2,2n > target_buffers.bed
Merge overlapping regions.
- bedtools merge -i target_buffers.bed > calling_regions.bed

Scenario 2: Merging to create the union of two different calling regions in BED format.

Prerequisites:

BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.

Combine the two calling regions and sort based on chromosome and start coordinates.
- cat calling_regions_1.bed calling_regions_2.bed | sort -k1,1 -k2,2n > combined.bed
Merge overlapping regions.
- bedtools merge -i combined.bed > combined_calling_regions.bed

Scenario 3: Create the intersection of two different calling regions in BED format.

Prerequisites:

BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.

Steps:

Get the intersect of the two calling regions with BEDtools.
- bedtools intersect -a calling_regions_1.bed -b calling_regions_2.bed > intersect.bed

Running GLnexus workflow in DNAnexus Research Analysis Platform to generate pVCF

Prerequisites:

GLnexus app
Access to a DNAnexus Research Analysis Platform folder containing the single sample genomic VCF (gVCF) for all samples to be included in pVCF.
A BED format file containing the genomic regions to be aggregated and reported in the pVCF.

Steps:

Generation of a manifest file containing the DNAnexus file ids of the gVCF files by matching the naming convention of the files (e.g. *.gvcf.gz) to be aggregated.
- dx find data --project=<rap_project_name> --folder=<rap_folder_name> --name="*.g.vcf.gz" --brief | cut -d\: -f2 > <manifest_file>
Upload the <manifest_file> to the DNAnexus Research Analysis platform.
- dx upload --path <dx_project>:/<dx_folder>/ <manifest_file> --brief
Launch GLnexus app to generate pVCF files.
- dx run glnexus -y --brief -ivariants_gvcfgz_manifest_csv=<manifest_file_location> -igenotype_range_bed=<calling_regions.bed> -ioutput_prefix=<base_name> --folder=<output_folder_in_RAP>

If you are using the UI, start from the app.

Conversion of pVCF to PLINK and BGEN files

Prerequisites:

BCFtools v1.9
PLINK 1.9
PLINK 2.0 (for BGEN conversion)
QCTOOL v2
Reference sequences in FASTA format (e.g. references.fa)

Steps:

Split multiallelic variants in pVCF file and variant normalization.
- bcftools norm -f references.fa -m -any -Oz -o pvcf.norm.vcf.gz pvcf.vcf.gz
Convert normalized biallelic pVCF to PLINK files.
- plink --vcf pvcf.norm.vcf.gz --keep-allele-order --vcf-idspace-to _ --double-id --allow-extra-chr 0 --make-bed --vcf-half-call m --out pvcf.norm
Convert PLINK files to BGEN files and prepare BGEN files.
- First convert PLINK to a zlib-compressed BGEN file:
  - plink2 --bfile pvcf.norm --export bgen-1.2 bits=8 id-paste=iid ref-first --out pvcf.norm_zlib
- Convert zlib-compressed BGEN to zstd-compressed BGEN file and omitting the sample identifier block in .bgen file using QCTOOL.
  - qctool -g pvcf.norm_zlib.bgen -s pvcf.norm_zlib.sample -og pvcf.norm.bgen -os pvcf.norm.sample -ofiletype bgen -bgen-bits 8 -bgen-compression zstd -bgen-omit-sample-identifier-block
Generate BGEN index.
- bgenix -g pvcf.norm.bgen -index -clobber