UK Biobank Data on the Research Analysis Platform

Learn how UK Biobank data is organized and named on the Research Analysis Platform. Learn how to find and access bulk files and tabular data.

How the Data is Organized

EIDs and Data-fields

UK Biobank contains data collected from approximately 500,000 volunteer participants. Within an access application, each participant is identified by a unique, 7-digit number, or EID. An EID is typically a number between 1,000,000 and 6,000,000.

Note that each access application receives a different set of randomized EIDs, unique to the application. This EID randomization process - also known as "pseudonymization" - is managed by UK Biobank and is automatically applied to the data by the Research Analysis Platform. (For a given access application, data on the Research Analysis Platform contain the same EIDs as data directly downloaded from UK Biobank's website.)

When you create a project on the Research Analysis Platform, the system contacts UK Biobank to get your application's EIDs, then uses them to pseudonymize your dataset. The pseudonymized EIDs are used, for example, to populate the "eid" column in the database, to name per-participant files, to generate the EID-specific content of FAM files for genotyping fields, and to adjust pVCF headers.

All data in the UK Biobank resource are organized into data-fields. Your access application is approved for a precise subset of those data-fields.

The UK Biobank Showcase provides an in-depth look into the types of data stored in the UK Biobank, how it's collected, and how it's organized. You can find more information about data-fields, broken down by type, on the UK Biobank Field Listing page.

Project Data

When you create a project on the UK Biobank Research Analysis Platform, the system dispenses the data corresponding to the data-fields listed in the access application associated with the project.

  • Bulk data-fields are dispensed as files. See Bulk Data Files below for more.

  • Tabular data-fields and linked health data are placed into a Spark SQL database and an associated dataset. See Tabular Data below for more.

The dispensed data correspond to a specific data release version. See Data Release Versions for more.

Bulk Data Files

Overview

Within your project, the Bulk folder contains files associated with UK Biobank data-fields of type "bulk." These are particularly large and/or complex items, such as genotyping array data, genome sequencing data, imaging data, and fitness data.

Folder Conventions

The Bulk folder uses the following subfolder structure:

  • There is a subfolder for each UK Biobank bulk field category. For example, whole genome CRAM files are stored in the subfolder named Whole genome sequences. These categories are defined by the UK Biobank, specifically for the Research Analysis Platform.

  • Within each category subfolder, there is a subfolder for each bulk field (or group of related fields). For example, a subfolder named Whole genome CRAM files would contain files for that field.

  • Within each field folder, files related to an individual participant are grouped in subfolders named using the prefix of the participant's EID. Typically these are two-digit names, ranging between "10" and "60."
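The folder layout above can be sketched as a small path builder. This is a minimal illustration, not Platform code: the helper names are hypothetical, and it assumes the "prefix" is the first two digits of a 7-digit EID, which matches the typical "10" to "60" folder names described above.

```python
from pathlib import PurePosixPath


def eid_prefix(eid: int) -> str:
    """Return the two-digit folder name for a 7-digit EID (assumed: its first two digits)."""
    text = str(eid)
    if len(text) != 7 or not text.isdigit():
        raise ValueError(f"expected a 7-digit EID, got {eid!r}")
    return text[:2]


def participant_folder(category: str, field_folder: str, eid: int) -> str:
    """Build /Bulk/<category>/<field folder>/<EID prefix> as a POSIX-style path."""
    return str(PurePosixPath("/Bulk", category, field_folder, eid_prefix(eid)))


# Folder names taken from the examples in the text; the EID is made up.
print(participant_folder("Whole genome sequences", "Whole genome CRAM files", 1234567))
```

Because related fields are sometimes dispensed into a shared folder, the field folder name is best looked up in the project rather than derived from the field ID.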

In certain cases, the system may dispense related files of different types into the same folder, to improve usability. For example, whole genome CRAM indices files (field ID #23194) would be dispensed into the same folder as whole genome CRAM files (field ID #23193), rather than into their own folder. Similarly, the folder /Bulk/Brain MRI/dMRI includes data from fields #20218 ("Multiband diffusion brain images - DICOM") and #20250 ("Multiband diffusion brain images - NIFTI").

For a full list of folders, see Bulk Fields in the Latest Release below.

Filename Conventions

The Research Analysis Platform uses the following naming conventions for bulk data files:

  • Files that contain data on an individual participant are named in this fashion:

    <EID>_<FIELD-ID>_<INSTANCE-ID>_<ARRAY-ID>.<SUFFIX>

    For example, whole genome CRAM files (field ID #23193) are named like so:

    <EID>_23193_0_0.cram

    Some exceptions apply to this rule. When a field is meant as a companion to a main field, such as a CRAI index accompanying a CRAM file, or a TBI index accompanying a VCF file, the system uses the prefix of the main field. For example, whole genome CRAM indices (field ID #23194) are named like so:

    <EID>_23193_0_0.cram.crai

  • Files that contain data on a cohort of participants (such as PLINK, BGEN or pVCF files) are named in this fashion:

    ukb<FIELD-ID>_c<CHROM>_b<BLOCK>_v<VERSION>.<SUFFIX>

    Here <CHROM> represents the chromosome (such as "1", "2" or "X"), <BLOCK> represents an index (starting from "0") for datasets that have been split into multiple pieces, and <VERSION> represents a dataset version assigned by UK Biobank.
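As an illustration of the two naming patterns, the sketch below parses a bulk filename into its components. The function name and regular expressions are hypothetical; they assume a 7-digit EID, a numeric <VERSION>, and a chromosome label without underscores. The cohort filename in the example is made up to fit the pattern.

```python
import re

# <EID>_<FIELD-ID>_<INSTANCE-ID>_<ARRAY-ID>.<SUFFIX>
PARTICIPANT_RE = re.compile(
    r"^(?P<eid>\d{7})_(?P<field_id>\d+)_(?P<instance_id>\d+)_(?P<array_id>\d+)"
    r"\.(?P<suffix>.+)$"
)
# ukb<FIELD-ID>_c<CHROM>_b<BLOCK>_v<VERSION>.<SUFFIX>
COHORT_RE = re.compile(
    r"^ukb(?P<field_id>\d+)_c(?P<chrom>[^_]+)_b(?P<block>\d+)_v(?P<version>\d+)"
    r"\.(?P<suffix>.+)$"
)


def parse_bulk_filename(name: str) -> dict:
    """Return the filename components plus a 'kind' key ('participant' or 'cohort')."""
    for kind, pattern in (("participant", PARTICIPANT_RE), ("cohort", COHORT_RE)):
        match = pattern.match(name)
        if match:
            return {"kind": kind, **match.groupdict()}
    raise ValueError(f"unrecognized bulk filename: {name!r}")


# A companion index file carries the field ID of its main field (23193), not 23194.
print(parse_bulk_filename("1234567_23193_0_0.cram.crai"))
print(parse_bulk_filename("ukb23157_cX_b0_v1.vcf.gz"))
```

All components are returned as strings, so chromosome labels such as "X" need no special handling.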

Pseudonymization of pVCF headers

The Research Analysis Platform pseudonymizes the content of pVCF headers for the following fields:

Field ID   Description
20278      BEAGLE Phased VCFs Whole genome sequences
20279      SHAPEIT Phased VCFs Whole genome sequences
23146      Population level exome OQFE variants, pVCF format - interim 300k release
23148      Population level exome OQFE variants, pVCF format - interim 450k release
23156      Population level exome OQFE variants, pVCF format - interim 200k release
23157      Population level exome OQFE variants, pVCF format - 500k release
23195      Whole genome GraphTyper joint call pVCF (deprecated)
23196      Whole genome GATK joint call pVCF
23352      Whole genome GraphTyper joint call pVCF
23353      Whole genome GraphTyper SV data
23354      GraphTyper WGS 500k SV
23374      Population level WGS variants, pVCF format - 500k release
24068      Exome variant call files (gnomAD) (VCFs)
24304      Population level WGS variants pVCF format - interim 200k release
24310      DRAGEN population level WGS variants, pVCF format [500k release]

The headers of pVCF files in these fields are pseudonymized: the sample IDs in the header are the EIDs that correspond to your access application. If a participant has withdrawn, the corresponding sample ID is replaced with "W000001" (for the first encountered sample that belongs to a withdrawn participant), "W000002" (for the second), and so on. The non-withdrawn EIDs in the pVCF header are expected to match the set of application EIDs used elsewhere, such as in the "eid" column of the phenotypic data and the FAM files of genotyping array fields. This allows you to conduct analyses that combine phenotypic, genotyping array, and whole exome pVCF (or whole genome pVCF) data without having to translate any EIDs.
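A minimal sketch of how the sample IDs in a pseudonymized header might be handled is shown below. It uses only the standard library and a made-up header line; the function name and the phenotype EID set are hypothetical. It relies on the VCF convention that sample IDs follow the nine fixed columns of the "#CHROM" line, and assumes withdrawn placeholders always look like "W" followed by digits.

```python
import re

WITHDRAWN_RE = re.compile(r"^W\d+$")  # e.g. "W000001"


def split_samples(chrom_line: str) -> tuple[list[str], list[str]]:
    """Split a VCF '#CHROM' header line into (EIDs, withdrawn placeholders)."""
    columns = chrom_line.rstrip("\n").split("\t")
    samples = columns[9:]  # sample IDs come after the 9 fixed VCF columns
    withdrawn = [s for s in samples if WITHDRAWN_RE.match(s)]
    eids = [s for s in samples if not WITHDRAWN_RE.match(s)]
    return eids, withdrawn


# Made-up header line with two participants and one withdrawn sample.
header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "1234567", "W000001", "2345678"]
)
eids, withdrawn = split_samples(header)

# Hypothetical EIDs taken from the "eid" column of the phenotypic data.
phenotype_eids = {"1234567", "2345678", "3456789"}
shared = sorted(set(eids) & phenotype_eids)
print(shared, withdrawn)
```

Because the EIDs are the same across data types, the intersection can be used directly to select samples present in both the pVCF and the phenotypic data.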

File Properties

The Research Analysis Platform supports file properties. These are key-value pairs of strings that are attached to files. When bulk files are dispensed to a project, the Platform adds some initial file properties, as below:

Key          Value                                                                       Which files have this property?
eid          The corresponding participant EID.                                          Files that correspond to a single participant.
field_id     The corresponding data-field id.                                            All files.
instance_id  The corresponding instance id (typically a visit to an assessment centre).  Files that correspond to data-fields with multiple instances.
array_id     The corresponding array index.                                              Files that correspond to array data-fields.
resource_id  The corresponding UK Biobank resource id.                                   Auxiliary files to a resource on the UK Biobank Showcase.

These properties are searchable both via the Web UI and CLI. Refer to the following section for an example.
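As a rough sketch of a CLI search by property, the snippet below assembles a `dx find data` command line without running it. It assumes the `--class`, `--path` and `--property KEY=VALUE` options of the DNAnexus `dx` client; confirm them with `dx find data --help` before relying on this. The helper name is hypothetical, and "project-xxxx" is a placeholder, not a real project ID.

```python
import shlex


def dx_find_by_properties(project_folder: str, **properties: str) -> list[str]:
    """Build (but do not run) a 'dx find data' command filtering files by properties."""
    cmd = ["dx", "find", "data", "--class", "file", "--path", project_folder]
    for key, value in sorted(properties.items()):
        # One --property option per key/value pair (assumed to be repeatable).
        cmd += ["--property", f"{key}={value}"]
    return cmd


# Placeholder project ID and a made-up EID.
cmd = dx_find_by_properties("project-xxxx:/Bulk", field_id="23193", eid="1234567")
print(shlex.join(cmd))
```

Property values are strings, so field IDs and EIDs are passed as text rather than numbers.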

Working with Bulk Data Files

See these instructions for in-depth guidance on searching and analyzing UK Biobank bulk data files.

About Proteomics Data

Proteomics data stored in UK Biobank Showcase data-field 30900 has been transformed to enable users to visualize it using the Cohort Browser. The transformation produces the per-instance entities stored in the dispensed database. As a result of this transformation, the data can be visualized, at the individual protein level, on a per-instance basis. For example, each protein for each instance can then be added as a tile within the Cohort Browser, and used as a filter along with other data modalities.

The transformation involves the following steps:

  1. Abbreviated protein names are substituted for protein ID codes - protein ID code "3", for example, becomes "aarsd1":

  2. The file is then split into new files by instance, with each new file containing participant records that share the same value in the ins_index field:

  3. Each instance file is then pivoted by creating a column for each protein, with cells filled in by the value in the result column. This value is the NPX value, for that protein, for the participant in question. For example, for the file containing "instance 0" data:

If a participant doesn’t have NPX data for a given protein - like participant 2345678, for protein "abca2," in the example shown above - the cell in question will not contain a value.

Note the column order in the last illustration, showing the fully transformed table. The eid column is first, followed by the columns for protein data, sorted in alphabetical order, by column name.
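The three steps can be sketched in plain Python on a few made-up records. This is an illustration under stated assumptions, not the Platform's implementation: the "protein_id" column name and the code 9 for "abca2" are invented for the example (only the "3" to "aarsd1" mapping, and the eid, ins_index and result names, come from the text), and each instance table here only includes proteins that appear in its records.

```python
from collections import defaultdict

# Hypothetical long-format records, one row per participant, instance and protein.
records = [
    {"eid": 1234567, "ins_index": 0, "protein_id": 3, "result": 0.52},
    {"eid": 1234567, "ins_index": 0, "protein_id": 9, "result": -1.10},
    {"eid": 2345678, "ins_index": 0, "protein_id": 3, "result": 0.07},
    {"eid": 1234567, "ins_index": 2, "protein_id": 3, "result": 0.33},
]
# Code 9 for "abca2" is made up for this sketch.
protein_names = {3: "aarsd1", 9: "abca2"}


def transform(records, protein_names):
    """Return {instance: (columns, rows)} with one pivoted table per instance."""
    # Steps 1 and 2: substitute protein names for codes and group records by instance.
    by_instance = defaultdict(lambda: defaultdict(dict))
    for rec in records:
        name = protein_names[rec["protein_id"]]
        by_instance[rec["ins_index"]][rec["eid"]][name] = rec["result"]

    tables = {}
    for instance, participants in by_instance.items():
        # Step 3: "eid" first, then one column per protein in alphabetical order.
        proteins = sorted({p for row in participants.values() for p in row})
        columns = ["eid"] + proteins
        rows = [
            [eid] + [values.get(p) for p in proteins]  # None marks a missing NPX value
            for eid, values in sorted(participants.items())
        ]
        tables[instance] = (columns, rows)
    return tables


tables = transform(records, protein_names)
columns, rows = tables[0]
print(columns)
for row in rows:
    print(row)
```

In the "instance 0" table, participant 2345678 has no value for "abca2", which mirrors the empty cell described above.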

About Instance Values in Proteomics Data

For details on instance data and the meaning of each instance value, see the "3 Instances" tab of the UK Biobank Data Showcase page covering data stored in Showcase data-field 30900.

Tabular Data

Database and Dataset

Tabular data-fields and linked health data are stored in a SQL database. This database is based on Spark SQL, which scales better than classic relational database management systems (RDBMS). The database is located in the root folder of your project, and is typically named according to this pattern:

app<APPLICATION-ID>_<CREATION-TIME> (e.g. app12345_20210101123456)

In the same folder, there is an associated dataset named after the database but with the .dataset suffix appended. This dataset is a higher-level construct, using technology that is unique to the Research Analysis Platform. It combines the low-level SQL columns with field-level metadata from the UK Biobank Showcase, and presents a collection of rich fields that can be explored visually in the Cohort Browser, or programmatically in JupyterLab. For general information on the underlying technology, see the DNAnexus Platform documentation overview of Datasets.
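The naming pattern can be illustrated with a small parser. This sketch is hypothetical: it assumes the creation time is a 14-digit YYYYMMDDhhmmss timestamp, as the example name suggests, and it derives the dataset name by appending the .dataset suffix.

```python
import re
from datetime import datetime

# app<APPLICATION-ID>_<CREATION-TIME>, e.g. app12345_20210101123456
DB_NAME_RE = re.compile(r"^app(?P<app_id>\d+)_(?P<created>\d{14})$")


def parse_database_name(name: str) -> dict:
    """Split a database name into application ID, creation time and dataset name."""
    match = DB_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"name does not follow the typical pattern: {name!r}")
    return {
        "application_id": int(match["app_id"]),
        # Assumption: the timestamp layout is YYYYMMDDhhmmss.
        "created": datetime.strptime(match["created"], "%Y%m%d%H%M%S"),
        "dataset_name": f"{name}.dataset",
    }


info = parse_database_name("app12345_20210101123456")
print(info["application_id"], info["created"].isoformat(), info["dataset_name"])
```

Since the name is only "typically" in this form, callers should be prepared for the ValueError raised on names that do not match.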

Browsing Dataset Fields Using the Cohort Browser

To launch the Cohort Browser, navigate to the project's root folder and click the dataset (or tick the dataset and click Explore Data).

To explore what fields are available in your dataset, click Add Tile. The system will present all available fields, organized in a folder structure inspired by the UK Biobank Showcase. You can search this list by folder name, field name, or field value (for categorical fields).

Click a field to see more information. The Data Field Details pane contains the field title (such as Type of accommodation lived in | Instance 0), and the Link label contains the field name (such as p670_i0). These field names and titles can be used to retrieve data programmatically using JupyterLab.
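When working with field names programmatically, it can help to split them into their parts. The sketch below parses names such as p670_i0. The only form confirmed by the text is p<FIELD-ID>_i<INSTANCE>; treating the instance part as optional and accepting an "_a<N>" array suffix are assumptions of this example, and the function name is hypothetical.

```python
import re
from typing import Optional

# Confirmed by the text: "p670_i0". The optional "_i" and "_a" parts are assumptions.
FIELD_NAME_RE = re.compile(
    r"^p(?P<field_id>\d+)(?:_i(?P<instance>\d+))?(?:_a(?P<array>\d+))?$"
)


def parse_field_name(name: str) -> dict:
    """Return the field ID, instance and array index encoded in a dataset field name."""

    def to_int(value: Optional[str]) -> Optional[int]:
        return int(value) if value is not None else None

    match = FIELD_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized field name: {name!r}")
    return {
        "field_id": int(match["field_id"]),
        "instance": to_int(match["instance"]),
        "array": to_int(match["array"]),
    }


print(parse_field_name("p670_i0"))
```

Names that do not fit the pattern, such as "eid", raise a ValueError and need to be handled separately.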

Using the Cohort Browser features, including the "Export sample IDs" option or the "Download" option in the Data Preview tab, will not lead to any charges.

The Cohort Browser can be used to further explore the data, create charts, or define and compare cohorts. Refer to the DNAnexus Platform documentation for details.

If your access application has been approved for data-fields 23146, 23148 and/or 23157, the Cohort Browser will automatically include a "GENOMICS" section, where you can browse variants in your cohort. The data backing the section depends on the dataset version dispensed: 23157 for version 11 and later, 23148 for version 7 and later, 23146 for previous versions. These variants are sourced from the pVCF files of fields #23146, #23148 and/or #23157, after annotating with snpEff GRCh38.92, dbSNP b154 and gnomAD r2.1.1. You can also use these variants to apply genomic filters. Refer to the DNAnexus Platform documentation for details.
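The version rule above can be written as a small lookup. This sketch reads "version 7 and later" as versions 7 through 10, since version 11 and later use field 23157; the function name is hypothetical, and the lookup does not check whether your application is approved for the field.

```python
def genomics_source_field(dataset_version: int) -> int:
    """Return the field ID backing the GENOMICS section for a dataset version."""
    if dataset_version >= 11:
        return 23157  # 500k release
    if dataset_version >= 7:
        return 23148  # interim 450k release
    return 23146  # interim 300k release


for version in (6, 7, 11):
    print(version, genomics_source_field(version))
```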

When you perform a data update, create a new project for the same application, or add data to a Dataset using Table Exporter, a new Dataset is created. To migrate saved cohorts to the new Dataset, use the Rebase Cohorts and Dashboards app. Note that all Fields defined in the cohort must exist in the new Dataset.

Analyzing Tabular Data as a File

If you are used to working with tabular data as a TSV file - a format used by UK Biobank in distributing tabular data directly via its website - see Accessing Phenotypic Data as a File.

Analyzing Tabular Data Using Spark in JupyterLab

Apache Spark is a modern, scalable framework for parallel processing of big data. Follow these instructions to analyze UK Biobank tabular data using Spark in JupyterLab.
