Protocol for Processing UKB Whole Exome Sequencing Data Sets

This document describes the UKB whole exome sequencing (WES) protocol of the single- and aggregate-sample processing, including read alignment, variant calling, joint genotyping and post aggregation reformatting, employed by the Regeneron Genetics Center to generate the UK Biobank WES data set (category 170) for public release. Following this protocol, researchers can aggregate their own sequencing data with the UK Biobank single sample data, enabling mega-analysis.

This protocol applies to all UK Biobank whole exome sequencing data sets, including 200k, 300k, 450k, and the final release OQFE data sets.

The protocol is as follows:

Section 1: Aligning Whole Exome Sequencing (WES) Reads to a Reference Genome

The UKB WES data are reference-aligned with the OQFE protocol, which employs BWA-MEM to map all reads to the GRCh38 reference in an alt-aware manner, marks read duplicates, and adds additional per-read tags. The OQFE protocol retains all reads and original quality scores such that the original FASTQ is completely recoverable from the resulting CRAM file. All constituent steps of the OQFE protocol are executed with open-source software and described in detail in the OQFE manuscript linked above. Given the impact even small changes to the protocol can introduce into large analyses, the OQFE protocol is available as a Docker file and as an app on the DNAnexus Research Analysis Platform that executes the OQFE Docker file.

The OQFE Docker file takes either FASTQ or CRAM files as inputs and outputs an OQFE CRAM, ensuring that all steps are executed exactly as specified in the OQFE protocol. We strongly recommend that users seeking to harmonize their data with UKB WES data execute the OQFE protocol via these implementations.

To use the OQFE app, run the following command from the CLI:

$ dx run oqfe -ireads_fastqgzs=<first_end_reads> -ireads2_fastqgzs=<second_end_reads> -igenome_fastagz=<genome_ file>.fa.gz -isample=<sample_name>

If you are using the UI, start from the app.

Section 2: Variant Calling Using DeepVariant

This section of the UKB WES protocol describes the variant calling using DeepVariant. To call variants in WES data, either the default DeepVariant WES model or a custom model can be used. The custom model that is trained on WES data generated by Regeneron Genetics Center and used for the generation of UK Biobank data is available as supplementary materials in Krasheninina et al., 2020.

Prerequisites:

  • DeepVariant v0.10.0 docker file

  • OQFE CRAM generated by following section 1 of this protocol (e.g. aligned.cram)

  • GRCh38 reference sequences (which can be found under the folder containing the exome sequences in your project, detailed in OQFE protocol)

  • Calling regions in BED format (which can be found in the folder containing exome sequences in your project)

  • A custom model file, if applicable

The UKB WES variant data set (category 170) was generated using NVIDIA Clara Parabricks Pipelines accelerated version of DeepVariant, which is a faithful reproduction of Google’s DeepVariant (with the additional commit), and is an order of magnitude faster. The results of the two implementations are equivalent when tested across many samples. Across 100 test samples, 7.5 million sample level variants were called with 100% concordance between the two implementations of DeepVariant, with 1 zygosity mismatch and MEAN GQ difference of 0.43 on Phred scale. This protocol first outlines how to run using the GPU accelerated DeepVariant provided by NVIDIA Parabricks, and made available on the DNAnexus Research Analysis Platform as an app.

The details of the applet used to run the dataset are as follows:

  • Input Files:

    • Reference FASTA

    • Input BAM/CRAM

    • Input BAM/CRAM index

    • Interval file

    • Parabricks license file

    • Parabricks engine file for custom model

  • Input Options:

    • --disable-use-window-selector-mode : True

    • --include-med-dp : True

    • --gvcf : True

    • --gzip-output : True

  • Output Files:

    • Output gVCF file in bgzipped format

    • Index of output gVCF file

One sample is expected to run in around 5 minutes including the data upload and download time. The DNAnexus command line input for the run is:

$ dx run pbdeepvariant_ukb -igenome_fastagz=<genome_file>.fa.gz -iinterval_file=<interval_file>.bed  -imappings_sorted_bam <oqfe_cram>These files can be found on the DNAnexus Research Analysis Platform under the same folder containing the helper files, along with Exome Sequences CRAM Files for Category 170.

If you are using the UI, start from the app.

Section 3: Aggregation and genotype harmonization using GLnexus to generate a multi-sample project-level VCF (pVCF)

This section of the UKB WES protocol describes the aggregation of variant genotypes and variant allele harmonization across sample-level gVCFs into multi-sample project-level VCF file (pVCF) organized in chromosomes or genomic segments using GLnexus, containing a row for every set of overlapping variants and each sample’s genotype for every variant allele. This pVCF is “squared-off”, in that for samples that do not contain an alternative allele genotype for a given variant, genotypes are derived from the gVCF reference blocks, reporting the read depth and most likely genotype (i.e. 0/0 or missing) for that sample at that variant position.

The GLnexus aggregation process requires only the gVCFs and the desired aggregation regions in the BED format (BED format specifications, https://m.ensembl.org/info/website/upload/bed.html) as inputs. The BED file for UKB WES data is the exome capture region buffered by 100 bp on each side of each target, with overlapping buffered regions merged. Users who are applying this protocol to a set of gVCFs derived from sequencing across multiple capture designs will need to generate a unified BED file for the regions of interest. The protocol here describes the BED generation process for aggregating variants across multiple capture designs: the intersection of all capture designs and the union of all capture designs.

Preparation of calling regions in BED format

Scenario 1: Adding 100 bp buffer on each side of the custom target regions in BED format.

Prerequisites:

  • BEDtools

  • Custom target regions in BED format (e.g. targets.bed).

  • FASTA sequences (e.g. references.fa) of the reference index file (e.g. references.fa.fai). For the corresponding FASTA and index files used for UKB WES data, can be found in the dispensed UKB project within the folder containing exome sequences.

Steps:

  1. Generate 100 bp buffer regions of each side of the target regions.

    • bedtools flank -i targets.bed -g references.fa.fai -b 100 > buffers.bed

    Command section

    Annotation

    bedtools flank

    To create flanking intervals for each BED feature

    -i targets.bed

    The input BED file

    -g references.fa.fai

    The genome file defining chromosome bounds

    -b 100

    The number of base pairs in each direction to add to the input BED file

    > buffers.bed

    Redirect the output to the buffers.bed

  2. Combine the target and buffer regions and sort based on chromosome and start coordinates.

    • cat targets.bed buffers.bed | sort -k1,1 -k2,2n > target_buffers.bed

    Command section

    Annotation

    cat targets.bed buffers.bed

    Concatenate the target and buffer BED files to the standard output

    | sort -k1,1 -k2,2n

    Pipe to a sort command to sort the BED by chromosome first and then by the start coordinates in the numeric order

    > target_buffers.bed

    Redirect the output to a target_buffers.bed file

  3. Merge overlapping regions.

    • bedtools merge -i target_buffers.bed > calling_regions.bed

    Command section

    Annotation

    bedtools merge

    Merge overlapping BED features into a single interval

    -i target_buffers.bed

    The input BED file

    > calling_regions.bed

    Redirect the output to a calling_regions.bed file

Scenario 2: Merging to create the union of two different calling regions in BED format.

Prerequisites:

  • BEDtools

  • Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.

  1. Combine the two calling regions and sort based on chromosome and start coordinates.

    • cat calling_regions_1.bed calling_regions_2.bed | sort -k1,1 -k2,2n > combined.bed

    Command section

    Annotation

    cat calling_regions_1.bed calling_regions_2.bed

    Concatenate the calling_regions_1 and calling_regions_2 BED files to the standard output

    | sort -k1,1 -k2,2n

    Pipe to a sort command to sort the BED by chromosome first and then by the start coordinates in the numeric order

    > combined.bed

    Redirect the output to a combined.bed file

  2. Merge overlapping regions.

    • bedtools merge -i combined.bed > combined_calling_regions.bed

    Command section

    Annotation

    bedtools merge

    Merge overlapping BED features into a single interval

    -i combined.bed

    The input BED file

    > combined_calling_regions.bed

    Redirect the output to a combined_calling_regions.bed file

Scenario 3: Create the intersection of two different calling regions in BED format.

Prerequisites:

  • BEDtools

  • Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.

Steps:

  1. Get the intersect of the two calling regions with BEDtools.

    • bedtools intersect -a calling_regions_1.bed -b calling_regions_2.bed > intersect.bed

    Command section

    Annotation

    bedtools intersect

    Bedtools intersect

    -a calling_regions_1.bed

    First calling regions in BED format

    -b calling_regions_2.bed

    Section calling regions in BED format

    > intersect.bed

    Redirect the output to a intersect.bed file

Running GLnexus workflow in DNAnexus Research Analysis Platform to generate pVCF

Prerequisites:

  • Access to a DNAnexus Research Analysis Platform folder containing the single sample genomic VCF (gVCF) for all samples to be included in pVCF.

  • A BED format file containing the genomic regions to be aggregated and reported in the pVCF.

Steps:

  1. Generation of a manifest file containing the DNAnexus file ids of the gVCF files by matching the naming convention of the files (e.g. *.gvcf.gz) to be aggregated.

    • dx find data --project=<rap_project_name> --folder=<rap_folder_name> --name="*.g.vcf.gz" --brief | cut -d\: -f2 > <manifest_file>

    Command section

    Annotation

    dx find data

    Find data objects subject to the given search parameters

    --project=<dx_project_name>

    The DNAnexus project name where the data is in

    --folder=<folder_name>

    The folder where the data is in the DNAnexus project

    --name="*.gvcf.gz"

    The files to look for using a wildcard “*”

    --brief

    Print a DNAnexus ID per line for all matching gVCF files

    | cut -d\: -f2

    Pipe to the cut command to extract the DNAnexus file IDs only

    > <manifest_file>

    Redirect the output to a file

  2. Upload the <manifest_file> to the DNAnexus Research Analysis platform.

    • dx upload --path <dx_project>:/<dx_folder>/ <manifest_file> --brief

    Command section

    Annotation

    dx upload

    Upload local file to DNAnexus

    --path <dx_project>:/<dx_folder>/

    The folder <dx_folder> in the DNAnexus project <dx_project> that the manifest file is uploaded to

    <manifest_file>

    The local manifest file that needs to be uploaded

    --brief

    Return a DNAnexus file ID for the uploaded manifest file

  3. Launch GLnexus app to generate pVCF files.

    • dx run glnexus -y --brief -ivariants_gvcfgz_manifest_csv=<manifest_file_location> -igenotype_range_bed=<calling_regions.bed> -ioutput_prefix=<base_name> --folder=<output_folder_in_RAP>

If you are using the UI, start from the app.

This section of the UKB WES protocol describes how UKB WES PLINK and BGEN format files are derived from the pVCF, including the decomposition of multi-allelic variants into biallelic variants and variant normalization prior to format conversion to PLINK and BGEN. BGEN is recommended for performing GWAS using Regenie (https://doi.org/10.1038/s41588-021-00870-7).

Prerequisites:

  • PLINK 1.9

  • PLINK 2.0 (for BGEN conversion)

  • Reference sequences in FASTA format (e.g. references.fa)

Steps:

  1. Split multiallelic variants in pVCF file and variant normalization.

    • bcftools norm -f references.fa -m -any -Oz -o pvcf.norm.vcf.gz pvcf.vcf.gz

      Command section

      Annotation

      bcftools norm

      Split multiallelic variants and normalization

      -f references.fa

      Specify the reference sequences (required for left-alignment and normalization)

      -m -any

      Split any multiallelic variants

      -Oz

      The output type is ‘compressed VCF’

      -o pvcf.norm.vcf.gz

      Write output to a file named ‘pvcf.norm.vcf.gz’

      pvcf.vcf.gz

      The input pVCF file

  2. Convert normalized biallelic pVCF to PLINK files.

    • plink --vcf pvcf.norm.vcf.gz --keep-allele-order --vcf-idspace-to _ --double-id --allow-extra-chr 0 --make-bed --vcf-half-call m --out pvcf.norm

    Command section

    Annotation

    plink

    PLINK 1.9

    --vcf pvcf.norm.vcf.gz

    Specify the input pVCF file

    --keep-allele-order

    Keep the allele order

    --vcf-idspace-to _

    Change spaces in the variant IDs to underscore (_)

    --double-id

    Set both family ID and within-family ID to the same sample ID

    --allow-extra-chr 0

    Treat the unrecognized chromosome codes as if they have been set to zero

    --make-bed

    Creates a new PLINK 1 binary fileset

    --vcf-half-call m

    Convert half calls (./1) in the pVCF to missing in PLINK

    --out pvcf.norm

    Specify the prefix of the output PLINK

  3. Convert PLINK files to BGEN files and prepare BGEN files.

    • First convert PLINK to a zlib-compressed BGEN file:

      • plink2 --bfile pvcf.norm --export bgen-1.2 bits=8 id-paste=iid ref-first --out pvcf.norm_zlib

        Command section

        Annotation

        plink2

        PLINK2

        --bfile pvcf.norm

        The input genotype file

        --export bgen-1.2 bits=8 id-paste=iid ref-first

        The output format BGEN with 8-bits probability precision, iid used to construct ID column and correct allele order

        --out pvcf.norm_zlib

        The output genotype file

    • Convert zlib-compressed BGEN to zstd-compressed BGEN file and omitting the sample identifier block in .bgen file using QCTOOL.

      • qctool -g pvcf.norm_zlib.bgen -s pvcf.norm_zlib.sample -og pvcf.norm.bgen -os pvcf.norm.sample -ofiletype bgen -bgen-bits 8 -bgen-compression zstd -bgen-omit-sample-identifier-block

      Command section

      Annotation

      qctool

      QCTOOL v2

      -g pvcf.norm_zlib.bgen

      The input genotype file

      -s pvcf.norm_zlib.sample

      The input sample file

      -og pvcf.norm.bgen

      The output genotype file

      -os pvcf.norm.sample

      The output sample file

      -ofiletype bgen

      The filetype of the output genotype file specified by -og

      -bits 8

      Store each probability in 8 bits

      -bgen-compression zstd

      Use zstd algorithm for BGEN compression

      -bgen-omit-sample-identifier-block

      Omit the sample identifier block

  4. Generate BGEN index.

    • bgenix -g pvcf.norm.bgen -index -clobber

    Command section

    Annotation

    bgenix

    bgenix

    -g pvcf.norm.bgen

    Specify the input genotype file

    -index

    Create an index file for the given bgen file

    -clobber

    bgenix will overwrite existing index file if it exists

Last updated

Was this helpful?