Returning pVCF Files to UK Biobank
Learn how to prepare pVCF files and return them to UK Biobank.
Obligation to Return Research Results
Researchers using UK Biobank data are obliged to return their research results to UK Biobank, in keeping with the terms detailed here. To return your research results, share the project containing the results with UK Biobank staff as per these instructions.
Preparing pVCF Files for Return
When users access a pVCF file on the Research Analysis Platform, each sample has an identifier that is unique to the version of the file dispensed to projects linked to a particular UK Biobank access application. Giving samples unique, access-application-specific identifiers helps ensure the anonymity of UK Biobank participants. See this presentation for more on this technique.
Before returning pVCF files to UK Biobank, researchers must format file headers in such a way as to support the use of this type of identifier. The portion of each file's header referring to samples must be formatted as zero-level-compression bgzip blocks.
To do this, run each pVCF file through the Bash script included below. The script takes a VCF file as input, then uses bcftools, bgzip, and tabix to modify the relevant part of the header. The script then outputs a VCF and a TBI index file. It also prints the byte coordinates of the zero-level-compressed blocks to stdout.
The output VCF file, when decompressed, is identical to the input file. Only the compression layout changes, so zcat input.vcf.gz | md5sum returns the same checksum as zcat input.repackaged.vcf.gz | md5sum.
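The checksum comparison above can be wrapped in a small helper. This is a minimal sketch, not part of the UK Biobank script: the function name same_uncompressed_content is hypothetical, and it assumes gzip and md5sum are available on the PATH. It uses gzip -dc, which reads bgzip output because bgzip files are valid gzip streams.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the UK Biobank script): succeeds only if
# two gzip- or bgzip-compressed files decompress to byte-identical content.
# The body runs in a subshell so that pipefail does not leak to the caller.
same_uncompressed_content() (
    set -o pipefail
    # Exit status 2 means one of the files could not be decompressed.
    sum_original=$(gzip -dc "$1" | md5sum | cut -d' ' -f1) || exit 2
    sum_repackaged=$(gzip -dc "$2" | md5sum | cut -d' ' -f1) || exit 2
    [ "$sum_original" = "$sum_repackaged" ]
)

# Example call, using the file names from the text above:
#   same_uncompressed_content input.vcf.gz input.repackaged.vcf.gz \
#       && echo "uncompressed content is identical"
```

A non-zero exit status means either that the checksums differ or that a file could not be read, so the helper may be convenient as a guard in a loop over many pVCF files.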
What to Include
When returning pVCF files to UK Biobank, include all of the following:
The repackaged VCF files, processed as per the instructions above
The accompanying TBI files
The byte-level coordinates of the zero-level-compressed blocks, as printed to stdout when processing the original VCF files. These coordinates are required for validation.
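Before sharing the project, it may help to confirm that every repackaged file has its companions. The following is a rough sketch under stated assumptions: the function name check_return_bundle is hypothetical, the repackaged files are assumed to end in .repackaged.vcf.gz (as in the example above), each index is assumed to sit next to its VCF with a .tbi suffix, and the stdout coordinates are assumed to have been saved to a file with a .coords.txt suffix. That last naming is purely illustrative and is not prescribed by UK Biobank.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (file naming is an assumption, see lead-in):
# for each *.repackaged.vcf.gz in a directory, require a non-empty .tbi index
# and a non-empty .coords.txt file holding the saved stdout coordinates.
check_return_bundle() {
    local dir="$1" missing=0 found=0 vcf
    for vcf in "$dir"/*.repackaged.vcf.gz; do
        [ -e "$vcf" ] || continue   # skip the unexpanded glob pattern
        found=1
        [ -s "$vcf.tbi" ] || { echo "missing index: $vcf.tbi" >&2; missing=1; }
        [ -s "$vcf.coords.txt" ] || { echo "missing coordinates: $vcf.coords.txt" >&2; missing=1; }
    done
    [ "$found" -eq 1 ] || { echo "no repackaged VCF files in $dir" >&2; return 1; }
    return "$missing"
}

# Example call:
#   check_return_bundle ./to_return && echo "bundle looks complete"
```

The check only looks at file presence and size. It does not validate the coordinates themselves, which remains UK Biobank's step.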
Bash Script for Preparing pVCF Files for Return