Returning pVCF Files to UK Biobank

Learn how to prepare pVCF files and return them to UK Biobank.

Obligation to Return Research Results

Researchers using UK Biobank data are obliged to return their research results to UK Biobank, in keeping with the terms detailed here. To return your research results, share the project containing the results with UK Biobank staff as per these instructions.

Preparing pVCF Files for Return

When users access a pVCF file on the Research Analysis Platform, each sample has an identifier that is unique to the version of the file dispensed to projects linked to a particular UK Biobank access application. Giving samples unique, access-application-specific identifiers helps ensure the anonymity of UK Biobank participants. See this presentation for more on this technique.

Before returning pVCF files to UK Biobank, researchers must format file headers in such a way as to support the use of this type of identifier. The portion of each file's header referring to samples must be formatted as zero-level-compression bgzip blocks.

To do this, run each pVCF file through the Bash script included below. The script takes a VCF file as input, then uses bcftools, bgzip, and tabix to modify the relevant part of the header. The script then outputs a VCF and a TBI index file. It also prints the byte coordinates of the zero-level-compressed blocks to stdout.
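As a rough illustration of what the printed coordinates refer to, the sketch below extracts a byte range from a repackaged file and decompresses it. It assumes, from reading the script, that lo and hi are 1-based, inclusive byte offsets; the function name show_sample_block is hypothetical and not part of any UK Biobank tooling.

```shell
#!/bin/bash
# Hypothetical helper (not part of the official script): print the decompressed
# content of bytes lo..hi of a repackaged file.
# Assumption: lo and hi are 1-based, inclusive offsets, as the script appears to emit.
show_sample_block() {
  local file="$1" lo="$2" hi="$3"
  # Keep the first hi bytes, skip ahead to byte lo, then decompress.
  # BGZF blocks are valid gzip members, so gzip -dc can read them.
  head -c "$hi" "$file" | tail -c +"$lo" | gzip -dc
}

# Example (file name and offsets are placeholders):
#   show_sample_block input.repackaged.vcf.gz "$lo" "$hi"
```

If the coordinates are right, the output should be the tab-separated sample identifiers from the final header line and nothing else.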

The output VCF file, when uncompressed, will be identical to the input file, so zcat input.vcf.gz | md5sum will return the same checksum as zcat input.repackaged.vcf.gz | md5sum.
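The checksum comparison can be wrapped in a small function so that it is easy to run on every file before returning it. This is a minimal sketch; the function name same_uncompressed is hypothetical, and gzip -dc is used here as an equivalent of zcat.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if two gzip/bgzip files decompress to the same bytes.
# The body runs in a subshell so that pipefail does not leak into the caller.
same_uncompressed() (
  set -o pipefail
  a=$(gzip -dc "$1" | md5sum | cut -d' ' -f1) || exit 2   # exit 2: could not read first file
  b=$(gzip -dc "$2" | md5sum | cut -d' ' -f1) || exit 2   # exit 2: could not read second file
  [[ "$a" == "$b" ]]
)

# Example:
#   same_uncompressed input.vcf.gz input.repackaged.vcf.gz && echo "contents match"
```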

What to Include

When returning pVCF files to UK Biobank, include all of the following:

  • The repackaged VCF files, processed as per the instructions above

  • The accompanying TBI files

  • The byte-level coordinates of the zero-level-compressed blocks, as printed to stdout when processing the original VCF files. These coordinates are required for validation.
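When several pVCF files are being returned, it may help to record the coordinates for each file as the files are processed. The sketch below assumes the repackaging script has been saved locally and that it prints a single "lo hi" line to stdout. The function name collect_coordinates and the tab-separated output layout are illustrative assumptions, not a format specified by UK Biobank.

```shell
#!/bin/bash
# Hypothetical wrapper: run the repackaging script on each input file and print
# one line per file in the form "<repackaged file><TAB><lo><TAB><hi>".
# $1 = path to the locally saved repackaging script; remaining args = pVCF files.
collect_coordinates() {
  local script="$1" f lo hi
  shift
  for f in "$@"; do
    # The script is assumed to print exactly one "lo hi" line to stdout.
    read -r lo hi < <(bash "$script" "$f") || return 1
    printf '%s\t%s\t%s\n' "${f%.vcf.gz}.repackaged.vcf.gz" "$lo" "$hi"
  done
}

# Example (script and file names are placeholders):
#   collect_coordinates ./repackage.sh file1.vcf.gz file2.vcf.gz > coordinates.tsv
```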

Bash Script for Preparing pVCF Files for Return

#!/bin/bash

set -e -o pipefail

if [[ "$1" == "" ]]
then
  echo "Usage: $0 file.vcf.gz"
  echo ""
  echo "Repackages the VCF so that the samples in the header are contained in a separate, 0-compressed bgzip block."
  echo "Requires bcftools, bgzip, tabix."
  echo "Outputs file.repackaged.vcf.gz+tbi (to disk), and header coordinates (lo,hi) to stdout."
  exit 1
fi

vcfgz_path="$1"
vcfgz_prefix="${vcfgz_path%.vcf.gz}"
outfile="${vcfgz_prefix}.repackaged.vcf.gz"

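# Get header up until the sample names, bgzip regularly.
# head -c -28 drops the 28-byte empty BGZF block (EOF marker) that bgzip appends, so more blocks can follow.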
( bcftools view --no-version -h "$vcfgz_path" | head -n -1; bcftools view --no-version -h "$vcfgz_path" | tail -n 1 | awk '{for (i=1; i<=9; i++) { printf("%s\t", $i) }}' ) | bgzip | head -c -28 > "$outfile"

# Mark the starting byte range
lo=$(stat -c '%s' "$outfile")
lo=$((lo + 1))

# Reprint sample ids with 7 digits and bgzip with 0-compression
bcftools view --no-version -h "$vcfgz_path" | tail -n 1 | awk '{for (i=10; i<=NF; i++) {printf "%07d%c", $i, (i==NF?"\n":"\t")}}' | bgzip -l 0 | head -c -28 >> "$outfile"

# Mark the ending byte range
hi=$(stat -c '%s' "$outfile")

# Bgzip the rest of the file
bcftools view -H "$vcfgz_path" | bgzip -@ "$(nproc)" >> "$outfile"

# Tabix the output file
tabix -p vcf "$outfile"

# Output the byte range of the sample names:
echo "$lo $hi"
