UK Biobank Data on the Research Analysis Platform
Learn how UK Biobank data is organized and named on the Research Analysis Platform, and how to find and access bulk files and tabular data.

Overview

This recorded webinar provides an in-depth overview of the UK Biobank dataset and its component elements.

How the Data is Organized

EIDs and Data-fields

UK Biobank contains data collected from approximately 500,000 volunteer participants. Within an access application, each participant is identified by a unique, 7-digit number, or EID. An EID is typically a number between 1,000,000 and 6,000,000.
Note that each access application receives a different set of randomized EIDs, unique to the application. This EID randomization process - also known as "pseudonymization" - is managed by UK Biobank and is automatically applied to the data by the Research Analysis Platform. (For a given access application, data on the Research Analysis Platform contain the same EIDs as data directly downloaded from UK Biobank's website.)
When you create a project on the Research Analysis Platform, the system contacts UK Biobank to get your application's EIDs, then uses them to pseudonymize your dataset. The pseudonymized EIDs are used, for example, to populate the "eid" column in the database, to name per-participant files, to generate the EID-specific content of FAM files for genotyping fields, and to adjust pVCF headers.
All data in the UK Biobank resource are organized into data-fields. Your access application is approved for a precise subset of those data-fields.
The UK Biobank Showcase provides an in-depth look into the types of data stored in the UK Biobank, how it's collected, and how it's organized. You can find more information about data-fields, broken down by type, on the UK Biobank Field Listing page.

Project Data

When you create a project on the UK Biobank Research Analysis Platform, the system dispenses the data corresponding to the data-fields listed in the access application associated with the project.
  • Bulk data-fields are dispensed as files. See Bulk Data Files below for more.
  • Tabular data-fields and linked health data are placed into a Spark SQL database and an associated dataset. See Tabular Data below for more.
The dispensed data correspond to a specific data release version. See Data Release Versions for more.

Bulk Data Files

Overview

Within your project, the Bulk folder contains files associated with UK Biobank data-fields of type "bulk." These are particularly large and/or complex items, such as genotyping array data, genome sequencing data, imaging data, and fitness data.

Folder Conventions

The Bulk folder uses the following subfolder structure:
  • There is a subfolder for each UK Biobank bulk field category. For example, whole genome CRAM files are stored in the subfolder named Whole genome sequences. These categories are defined by the UK Biobank, specifically for the Research Analysis Platform.
  • Within each category subfolder, there is a subfolder for each bulk field (or group of related fields). For example, a subfolder named Whole genome CRAM files would contain files for that field.
  • Within each field folder, files related to an individual participant are grouped in subfolders named using the prefix of the participant's EID. Typically these are two-digit names, ranging between "10" and "60."
In certain cases, the system dispenses related files of different types into the same folder, to improve usability. For example, whole genome CRAM index files (field ID #23194) are dispensed into the same folder as whole genome CRAM files (field ID #23193), rather than into their own folder. Similarly, the folder /Bulk/Brain MRI/dMRI includes data from fields #20218 ("Multiband diffusion brain images - DICOM") and #20250 ("Multiband diffusion brain images - NIFTI").
For a full list of folders, see Bulk Fields in the Latest Release below.
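The folder convention above can be sketched as a small path builder. This is an illustration only: the function name and the example EID are hypothetical, and the two-digit prefix is the typical case described above rather than a guarantee for every field.

```python
from pathlib import PurePosixPath


def participant_bulk_folder(category: str, field_folder: str, eid: str,
                            prefix_len: int = 2) -> str:
    """Return the folder expected to hold one participant's files for a bulk field.

    Assumes the typical layout /Bulk/<category>/<field folder>/<EID prefix>,
    where the prefix is the first two digits of the EID.
    """
    if not eid.isdigit() or len(eid) <= prefix_len:
        raise ValueError(f"unexpected EID: {eid!r}")
    return str(PurePosixPath("/Bulk") / category / field_folder / eid[:prefix_len])


# "1234567" is a made-up EID used purely for illustration.
print(participant_bulk_folder("Whole genome sequences",
                              "Whole genome CRAM files",
                              "1234567"))
```

Because related fields may share a folder (as with CRAM files and their indices), treat the result as a starting point for browsing rather than as a definitive location.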

Filename Conventions

The Research Analysis Platform uses the following naming conventions for bulk data files:
  • Files that contain data on an individual participant are named in this fashion:
    <EID>_<FIELD-ID>_<INSTANCE-ID>_<ARRAY-ID>.<SUFFIX>
    For example, whole genome CRAM files (field ID #23193) are named like so:
    <EID>_23193_0_0.cram
    Some exceptions apply to this rule. When a field is meant as a companion to a main field, such as a CRAI index accompanying a CRAM file, or a TBI index accompanying a VCF file, the system uses the prefix of the main field. For example, whole genome CRAM indices (field ID #23194) are named like so:
    <EID>_23193_0_0.cram.crai
  • Files that contain data on a cohort of participants (such as PLINK, BGEN or pVCF files) are named in this fashion:
    ukb<FIELD-ID>_c<CHROM>_b<BLOCK>_v<VERSION>.<SUFFIX>
    where <CHROM> represents the chromosome (such as "1", "2" or "X"), <BLOCK> represents an index (starting from "0") for datasets that have been split into multiple pieces, and <VERSION> represents a dataset version assigned by UK Biobank.
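As a rough sketch, the two naming patterns above can be written as formatting helpers, with a parser for per-participant names. The function names, the example EID, and the cohort values (chromosome, block, version, suffix) are hypothetical and chosen only to show the shape of the names.

```python
import re


def participant_filename(eid, field_id, instance_id, array_id, suffix):
    """Build <EID>_<FIELD-ID>_<INSTANCE-ID>_<ARRAY-ID>.<SUFFIX>."""
    return f"{eid}_{field_id}_{instance_id}_{array_id}.{suffix}"


def cohort_filename(field_id, chrom, block, version, suffix):
    """Build ukb<FIELD-ID>_c<CHROM>_b<BLOCK>_v<VERSION>.<SUFFIX>."""
    return f"ukb{field_id}_c{chrom}_b{block}_v{version}.{suffix}"


_PARTICIPANT_RE = re.compile(
    r"^(?P<eid>\d+)_(?P<field_id>\d+)_(?P<instance_id>\d+)_(?P<array_id>\d+)"
    r"\.(?P<suffix>.+)$"
)


def parse_participant_filename(name):
    """Split a per-participant filename into its named parts (all strings)."""
    match = _PARTICIPANT_RE.match(name)
    if match is None:
        raise ValueError(f"not a per-participant filename: {name!r}")
    return match.groupdict()


cram = participant_filename("1234567", 23193, 0, 0, "cram")
crai = cram + ".crai"  # a companion index reuses the main field's prefix
print(cram, crai)
print(parse_participant_filename(crai))
print(cohort_filename(23196, "X", 0, 1, "vcf.gz"))  # illustrative values only
```

Note that parsing a companion file such as a CRAI index yields the main field's ID (23193 here), not the companion field's ID, which follows from the exception described above.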

Pseudonymization of pVCF headers

The Research Analysis Platform pseudonymizes the content of pVCF headers for the following fields:
Field ID  Description
23146     Population level exome OQFE variants, pVCF format - interim 300k release
23148     Population level exome OQFE variants, pVCF format - interim 450k release
23156     Population level exome OQFE variants, pVCF format - interim 200k release
23195     Whole genome GraphTyper joint call pVCF (deprecated)
23196     Whole genome GATK joint call pVCF
23352     Whole genome GraphTyper joint call pVCF
23353     Whole genome GraphTyper SV data
24068     Exome variant call files (gnomAD) (VCFs)
24304     Population level WGS variants pVCF format - interim 200k release
When you access pVCF files in these fields, the header is already pseudonymized: the sample IDs in the header are EIDs that correspond to your access application. If a participant has withdrawn, the corresponding sample ID is replaced with "W000001" (for the first sample encountered that belongs to a withdrawn participant), "W000002" (for the second such sample), and so on. Overall, the non-withdrawn EIDs in the pVCF header are expected to match the set of application EIDs used elsewhere, such as in the "eid" column of the phenotypic data and in the FAM files of genotyping array fields. This allows you to conduct analyses that combine phenotypic, genotyping array, and whole exome pVCF (or whole genome pVCF) data without having to translate any EIDs.
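To illustrate how the pseudonymized header can be used, the sketch below splits the sample columns of a pVCF "#CHROM" header line into application EIDs and withdrawn placeholders, then intersects the EIDs with a phenotype table. It relies on the standard VCF layout (nine fixed columns before the sample IDs) and assumes withdrawn placeholders always look like "W" followed by digits, as in the examples above. The header line and EIDs are made up.

```python
import re

# Assumption: withdrawn placeholders are "W" followed by digits (e.g. "W000001").
WITHDRAWN_RE = re.compile(r"^W\d+$")


def split_pvcf_samples(header_line):
    """Return (eids, withdrawn) from a tab-separated '#CHROM ...' header line."""
    columns = header_line.rstrip("\n").split("\t")
    if columns[0] != "#CHROM":
        raise ValueError("expected the '#CHROM' column header line")
    samples = columns[9:]  # sample IDs follow the 9 fixed VCF columns
    eids = [s for s in samples if not WITHDRAWN_RE.match(s)]
    withdrawn = [s for s in samples if WITHDRAWN_RE.match(s)]
    return eids, withdrawn


# Made-up header line with two participants and one withdrawn sample.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "1234567", "W000001", "2345678"])
eids, withdrawn = split_pvcf_samples(header)

# Hypothetical values from the "eid" column of the phenotypic data.
pheno_eids = {"1234567", "2345678", "3456789"}
shared = sorted(set(eids) & pheno_eids)
print(shared, withdrawn)
```

Because the same EIDs are used across data types within an application, the intersection needs no ID translation step.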

File Properties

The Research Analysis Platform supports file properties. These are key-value pairs of strings that are attached to files. When bulk files are dispensed to a project, the Platform adds some initial file properties, as below:
Key          Value                                                                       Which files have this property?
eid          The corresponding participant EID.                                          Files that correspond to a single participant.
field_id     The corresponding data-field ID.                                            All files.
instance_id  The corresponding instance ID (typically a visit to an assessment centre).  Files that correspond to data-fields with multiple instances.
array_id     The corresponding array index.                                              Files that correspond to array data-fields.
resource_id  The corresponding UK Biobank resource ID.                                   Auxiliary files to a resource on the UK Biobank Showcase.
These properties are searchable both via the Web UI and CLI. Refer to the following section for an example.
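For programmatic searches, one option is the DNAnexus Python SDK (dxpy), whose find_data_objects function accepts a properties filter. The sketch below is a minimal illustration, not an official recipe: the helper names are hypothetical, the project ID is a placeholder you would supply, and the search function only works where dxpy is installed and you are logged in.

```python
def property_filter(field_id, eid=None, instance_id=None, array_id=None):
    """Build a properties filter; property values are stored as strings."""
    props = {"field_id": str(field_id)}
    if eid is not None:
        props["eid"] = str(eid)
    if instance_id is not None:
        props["instance_id"] = str(instance_id)
    if array_id is not None:
        props["array_id"] = str(array_id)
    return props


def find_bulk_files(project_id, folder="/Bulk", **filters):
    """Search a project for files whose properties match the given filters."""
    import dxpy  # imported lazily; requires the DNAnexus SDK and a login

    return list(dxpy.find_data_objects(
        classname="file",
        project=project_id,
        folder=folder,
        recurse=True,
        properties=property_filter(**filters),
        describe=True,
    ))


# Only the filter is built here; no Platform call is made.
print(property_filter(23193, eid="1234567"))
```

A call such as find_bulk_files("project-xxxx", field_id=23193, eid="1234567") would then return the matching file records, where "project-xxxx" and the EID stand in for your own values.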

Working with Bulk Data Files

See these instructions for in-depth guidance on searching and analyzing UK Biobank bulk data files.

Tabular Data

Database and Dataset

Tabular data-fields and linked health data are stored in a SQL database. This database is based on Spark SQL, a technology designed to scale better than classic relational database management systems (RDBMS). The database is located in the root folder of your project, and is typically named according to this pattern:
app<APPLICATION-ID>_<CREATION-TIME> (e.g. app12345_20210101123456)
In the same folder, there is an associated dataset named after the database but with the .dataset suffix appended. This dataset is a higher-level construct, using technology that is unique to the Research Analysis Platform. It combines the low-level SQL columns with field-level metadata from the UK Biobank Showcase, and presents a collection of rich fields that can be explored visually in the Cohort Browser, or programmatically in JupyterLab. For general information on the underlying technology, see the DNAnexus Platform documentation overview of Datasets.
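The naming pattern can be unpacked with a short parser, shown here as a sketch. It assumes the creation time is a 14-digit YYYYMMDDhhmmss stamp, which is inferred from the example name rather than stated by the Platform, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Assumption: <CREATION-TIME> is 14 digits in YYYYMMDDhhmmss order.
_DB_NAME_RE = re.compile(r"^app(?P<application_id>\d+)_(?P<creation_time>\d{14})$")


def parse_database_name(name):
    """Split a database name into application ID, creation time, dataset name."""
    match = _DB_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected database name: {name!r}")
    return {
        "application_id": int(match["application_id"]),
        "created": datetime.strptime(match["creation_time"], "%Y%m%d%H%M%S"),
        # The dataset is named after the database, with ".dataset" appended.
        "dataset_name": f"{name}.dataset",
    }


info = parse_database_name("app12345_20210101123456")
print(info["application_id"], info["created"].isoformat(), info["dataset_name"])
```

This can be handy when a project's root folder contains several databases from successive data updates and you want to pick the most recent one.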

Browsing Dataset Fields Using the Cohort Browser

To launch the Cohort Browser, navigate to the project's root folder and click the dataset (or tick the dataset and click Explore Data).
To explore what fields are available in your dataset, click Add Tile. The system will present all available fields, organized in a folder structure inspired by the UK Biobank Showcase. You can search this list by folder name, field name, or field value (for categorical fields).
Click a field to see more information. The Data Field Details pane contains the field title (such as Type of accommodation lived in | Instance 0), and the Link label contains the field name (such as p670_i0). These field names and titles can be used to retrieve data programmatically using JupyterLab.
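Field names such as p670_i0 encode the data-field ID and instance, so they can be decoded before use in a script. The parser below is a sketch: the "p<FIELD-ID>_i<INSTANCE>" shape comes from the example above, while treating the instance part as optional and accepting an "_a<ARRAY>" part for array fields are assumptions to verify against your own dataset.

```python
import re

# p<FIELD-ID>, then optional _i<INSTANCE> and _a<ARRAY> parts.
# The optional parts are assumptions; only "p670_i0" is shown in the text.
_FIELD_NAME_RE = re.compile(
    r"^p(?P<field_id>\d+)(?:_i(?P<instance>\d+))?(?:_a(?P<array>\d+))?$"
)


def parse_field_name(name):
    """Decode a dataset field name into integer parts (None where absent)."""
    match = _FIELD_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized field name: {name!r}")
    return {key: int(value) if value is not None else None
            for key, value in match.groupdict().items()}


print(parse_field_name("p670_i0"))
```

The decoded field ID (670 in this example) matches the data-field ID used on the UK Biobank Showcase.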
Using the Cohort Browser features, including the "Export sample IDs" option or the "Download" option in the Data Preview tab, will not lead to any charges.
The Cohort Browser can be used to further explore the data, create charts, or define and compare cohorts. For details, refer to the DNAnexus Platform documentation on the Cohort Browser.
If your access application has been approved for field #23146 and/or #23148, the Cohort Browser automatically includes a "GENOMICS" section, where you can browse variants in your cohort. The data backing this section depends on the dataset version dispensed: field #23148 for version 7 and later, field #23146 for earlier versions. The variants are sourced from the pVCF files of that field and annotated with snpEff GRCh38.92, dbSNP b154 and gnomAD r2.1.1. You can also use these variants to apply genomic filters. For details, refer to the DNAnexus Platform documentation on genomic filtering in the Cohort Browser.
When you perform a data update, create a new project for the same application, or add data to a dataset using Table Exporter, a new dataset is created. To migrate saved cohorts to the new dataset, use the Rebase Cohorts and Dashboards app. Note that all fields defined in the cohort must exist in the new dataset.

Analyzing Tabular Data as a File

If you are used to working with tabular data as a TSV file - a format used by UK Biobank in distributing tabular data directly via its website - see Accessing Phenotypic Data as a File.

Analyzing Tabular Data Using Spark in JupyterLab

Apache Spark is a modern, scalable framework for parallel processing of big data. Follow these instructions to analyze UK Biobank tabular data using Spark in JupyterLab.