Accessing phenotype data
Learn how to export selected phenotypic fields into a TSV or CSV file, for easy browsing and analysis.
If you've worked with UK Biobank data prior to using the Research Analysis Platform, you may be aware that UK Biobank distributes the main tabular dataset in a large encoded file with the extension .enc_ukb. To work with the dataset, you first convert this file to TSV or CSV format.
On the Research Analysis Platform (RAP), this dataset is dispensed into your project as a database, in Parquet format.
There are a few approaches to accessing and extracting this tabular phenotype data on the RAP.
Manually Using the Cohort Browser
This approach requires you to manually add columns (data-fields) of interest to the table one-by-one. Therefore, this approach is suitable when you have a small number of fields.
Select Fields of Interest in the Cohort Browser
Start by navigating to your project and clicking on the name of the dispensed dataset. The Cohort Browser will launch.
In the Cohort Browser:
1. Open the Data Preview tab.
2. Click the "grid" icon at the right end of the Participant ID header row, then click Add Columns. The Add Columns to Table dialog will open.
3. Navigate to any field, either directly or via search. Once you've found the field you're looking for, click Add as Column.
4. Continue locating the fields you're interested in and adding them as columns. Note that as you add additional fields as columns, you do not have to wait for the Data Preview to finish loading.
5. Once you've finished, close the dialog by clicking the X to the right of the Add Columns to Table title. In the Data Preview tab, you'll see the first few rows of the data.
6. In the upper right corner of the screen, click Views, then click Save View. Enter a name for the view, then save it.
Create a TSV or CSV File Using Table Exporter
Now convert your saved view into a TSV or CSV file, using the Table Exporter app.
Navigate back to your project and click the Start Analysis button in the upper right corner of the screen. In the Start New Analysis dialog, select the Table Exporter app, then click Run Selected. Note that if this is the first time you've run Table Exporter, you'll be prompted to install it first.
Select Input
Within the Table Exporter app, open the Analysis Inputs tab on the right side of the screen. Then click the Dataset or Cohort or Dashboard tile.
A modal window will open. Select the view that you created and saved in the Cohort Browser.
Configure Output Options
Within the Options section, configure your output options.
In the Output File Name field, enter a filename prefix. In the Output File Format field, select "CSV" or "TSV." You may find it easier to work with a TSV file downstream, because the values in certain fields contain commas, complicating the parsing of a CSV file.
In the Coding Option field, select "RAW" so that you can work with the original UK Biobank data, as you would get them from the Biobank. (For example, in the Sex field, you will see the coded value "0" rather than "Female.")
In the Header Style field, select "UKB-FORMAT" to get headers that match the original UK Biobank format (e.g. 123-4.5).
Launch the Table Exporter App and View the Converted File
Click Start Analysis. Once the conversion finishes and the file is ready, you will be notified via email. To access the file, either return to your project, or click the link in the email.
Programmatically
Alternatively, you can extract a large number of data-fields programmatically using either of the following approaches:
Using Table Exporter
Step 1. Get the file containing all available data-fields in your Dataset by running the following command on your local terminal (make sure that dx-toolkit is installed on your machine) or in a ttyd terminal:
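A typical invocation looks like the following; the record ID is a placeholder for your dispensed dataset (find it with dx ls in your project).

```bash
# Dump the dataset's data dictionary, codings, and entity dictionary as CSV files.
# Replace the placeholder with your dataset's record ID or path, as shown by `dx ls`.
dx extract_dataset record-XXXXXXXXXXXXXXXXXXXXXXXX -ddd
```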
The above command will generate three *.csv files; the *.dataset.data_dictionary.csv file contains full information about all available data-fields (the field names are in the second column, called name).
Step 2. Collect all data-field names of interest and write them into a field_name.txt file. The file should have a single column, with exactly one data-field name per line. The file should contain only data-fields from the same entity (e.g. "participant", "olink_instance_0", etc.); data-fields from different entities should be written into separate files. The value to use for the entity can be found in the first column of the *.dataset.data_dictionary.csv file, called entity. (Note: this is not the same as entity_title, found in the *.entity_dictionary.csv file.)
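For illustration, a field_name.txt restricted to the participant entity might look like this (the field names are examples only):

```
eid
p31
p21022
p53_i0
```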
Step 3. Upload the field_name.txt file to the RAP project:
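With dx-toolkit, for example:

```bash
# Upload the field list into the current RAP project
dx upload field_name.txt
```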
Step 4. Use the Table Exporter app to extract the data.
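You can run the app from the UI as described above, or from the command line. The sketch below assumes the app name app-table-exporter and the input names shown; confirm both with dx run app-table-exporter --help before relying on them.

```bash
# Sketch of a command-line Table Exporter run; the app name and input names
# are assumptions - verify them with: dx run app-table-exporter --help
dx run app-table-exporter \
  -idataset_or_cohort_or_dashboard=record-XXXXXXXXXXXXXXXXXXXXXXXX \
  -ifield_names_file_txt=field_name.txt \
  -ientity=participant \
  -ioutput=extracted_data
```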
Users often run into problems when extracting the "eid" field without an entity specification, so do not forget to specify the corresponding entity when extracting the "eid" field.
For more information about how to use the Table Exporter app, see the DNAnexus documentation page.
Using JupyterLab
Step 1. Get the file containing all available data-fields in your Dataset by running the following command:
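In a notebook cell, this can be done by shelling out to dx-toolkit, which is preinstalled in the JupyterLab environment; the record ID below is a placeholder for your dispensed dataset.

```python
import subprocess

# Dump the data dictionary, codings, and entity dictionary CSV files into
# the working directory (replace the placeholder record ID with your dataset's)
subprocess.run(
    ["dx", "extract_dataset", "record-XXXXXXXXXXXXXXXXXXXXXXXX", "-ddd"],
    check=True,
)
```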
The above command will generate three *.csv files; the *.dataset.data_dictionary.csv file contains full information about all available data-fields (the field names are in the second column, called name).
Step 2. Collect all data-field names of interest, and write them into a variable field_names_protein. Here is an example collecting all proteins.
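A minimal sketch, assuming the data dictionary dumped in Step 1 (the filename prefix depends on your dataset's name):

```python
import pandas as pd

# Load the data dictionary produced in Step 1 (the filename is a placeholder)
data_dict = pd.read_csv("my_dataset.dataset.data_dictionary.csv")

# Collect every field belonging to the olink_instance_0 entity,
# i.e. all instance-0 protein measurements
field_names_protein = data_dict.loc[
    data_dict["entity"] == "olink_instance_0", "name"
].tolist()
```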
Step 3. Extract the data using the dx extract_dataset command.
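A sketch of the extraction; the record ID and output filename are placeholders, and dx extract_dataset expects each field name to be prefixed with its entity:

```python
import subprocess

# Prefix each field name with its entity, as dx extract_dataset expects
fields = ",".join(f"olink_instance_0.{name}" for name in field_names_protein)

subprocess.run(
    ["dx", "extract_dataset", "record-XXXXXXXXXXXXXXXXXXXXXXXX",
     "--fields", fields,
     "--output", "olink_instance_0.csv"],
    check=True,
)
```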
Depending on the number of fields you're trying to extract, you may need to use a Spark environment, for example by querying the data from inside a Spark JupyterLab session. We recommend using a Spark instance if you are attempting to extract more than 30 fields. Your Step 3 will instead look like the following:
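A sketch of the Spark variant, assuming a Spark session named spark initialized as shown in the Spark JupyterLab section below: the --sql flag asks dx extract_dataset for the SQL statement instead of the data, and the query is then executed on the cluster.

```python
import subprocess

# Ask dx extract_dataset for the SQL statement rather than the data itself
result = subprocess.run(
    ["dx", "extract_dataset", "record-XXXXXXXXXXXXXXXXXXXXXXXX",
     "--fields", fields, "--sql"],
    check=True, capture_output=True, text=True,
)
sql = result.stdout.strip().rstrip(";")  # Spark SQL rejects a trailing semicolon

# Run the query on the Spark cluster and write the result as a single CSV
df = spark.sql(sql)
df.coalesce(1).write.mode("overwrite").option("header", True).csv("olink_instance_0_csv")
```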
Using Spark JupyterLab
Apache Spark is a modern, scalable framework for parallel processing of big data. To analyze tabular data using Spark in JupyterLab, you first need to launch JupyterLab in a Spark cluster configuration. For information on how to use Hail with JupyterLab, see the example notebooks here.
When launching a JupyterLab instance:
Click Spark Cluster under Cluster Configuration.
Select an instance type and number of nodes. This will affect how powerful the Spark cluster will be. The default settings allow for casual interrogation of the data. If you will be running complex queries or analyzing a large amount of data in memory, you may need to select a larger instance type. To increase parallelization efficiency and reduce processing time, you may need to select more nodes.
To begin, import relevant Spark and DNAnexus libraries, and instantiate a Spark context and Spark session at the very top of your notebook, as shown below.
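A conventional setup cell looks like this (assuming the standard Spark JupyterLab image, where dxpy and dxdata are preinstalled):

```python
import pyspark
import dxpy    # DNAnexus platform bindings
import dxdata  # DNAnexus dataset access library

# Instantiate the Spark context and session once per JupyterLab session
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
```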
Ensure that your Spark session is only initialized once per JupyterLab session. If you try to evaluate this cell multiple times (for example, by selecting "Run All Cells" to rerun a notebook after it's already run, or by opening and running multiple notebooks in the same JupyterLab session), you may encounter errors or your notebook may hang. If that happens, you may need to restart the specific notebook's kernel.
As a best practice, shut down the kernel of any notebook you are not using, before running a second notebook in the same session.
Accessing the Database Directly Using SQL
To evaluate SQL, you can use the spark.sql("...") function, which returns a Spark DataFrame. You can view the contents of a DataFrame (in full width) by calling .show(truncate=False) on it.
The following example lists the tables in the database:
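For example (the database name is a placeholder; use the name reported by SHOW DATABASES for your dispensed database):

```python
# Find the dispensed database, then list the tables it contains
spark.sql("SHOW DATABASES").show(truncate=False)
spark.sql("SHOW TABLES IN database_xxxx__yyyy").show(truncate=False)
```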
Database Tables
The database contains the following tables:
| Table name | Description |
| --- | --- |
| participant_0001, ..., participant_9999 | These tables contain the main UK Biobank participant data. Each participant is represented as one row, and each data-field is represented as one or more columns. For scalability reasons, the data-fields are horizontally split across multiple tables, starting from table participant_0001 (which contains the first few hundred columns for all participants), followed by participant_0002 (which contains the next few hundred columns), etc. The exact number of tables depends on how many data-fields your application is approved for. |
| hesin | Hospitalization records. This table is only included if your application is approved for data-field #41259. |
| hesin_critical | Hospital critical care records. This table is only included if your application is approved for data-field #41290. |
| hesin_delivery | Hospital delivery records. This table is only included if your application is approved for data-field #41264. |
| hesin_diag | Hospital diagnosis records. This table is only included if your application is approved for data-field #41234. |
| hesin_maternity | Hospital maternity records. This table is only included if your application is approved for data-field #41261. |
| hesin_oper | Hospital operation records. This table is only included if your application is approved for data-field #41149. |
| hesin_psych | Hospital psychiatric records. This table is only included if your application is approved for data-field #41289. |
| death | Death records. This table is only included if your application is approved for data-field #40023. |
| death_cause | Death cause records. This table is only included if your application is approved for data-field #40023. |
| gp_clinical | GP clinical event records. This table is only included if your application is approved for data-field #42040. |
| gp_registrations | GP registration records. This table is only included if your application is approved for data-field #42038. |
| gp_scripts | GP prescription records. This table is only included if your application is approved for data-field #42039. |
| covid19_tpp_gp_clinical | GP clinical event records (COVID TPP). This table is only included if your application is approved for data-field #40101. |
| covid19_tpp_gp_scripts | GP prescription records (COVID TPP). This table is only included if your application is approved for data-field #40102. |
| covid19_emis_gp_clinical | GP clinical event records (COVID EMIS). This table is only included if your application is approved for data-field #40103. |
| covid19_emis_gp_scripts | GP prescription records (COVID EMIS). This table is only included if your application is approved for data-field #40104. |
| covid19_result_england | COVID-19 test result records (England). This table is only included if your application is approved for data-field #40100. |
| covid19_result_scotland | COVID-19 test result records (Scotland). This table is only included if your application is approved for data-field #40100. |
| covid19_result_wales | COVID-19 test result records (Wales). This table is only included if your application is approved for data-field #40100. |
| covid19_vaccination | COVID-19 vaccination data. This table is only included if your application is approved for data-field #32040. |
| olink_instance_0 | Olink NPX values for the instance 0 visit. This table is only included if your application is approved for data-field #30900. For scalability reasons, the protein columns are horizontally split across multiple tables, starting from table olink_instance_0_0001 (which contains the first few hundred columns for all participants), followed by olink_instance_0_0002 (which contains the next few hundred columns), etc. The same splitting applies to the olink_instance_2 and olink_instance_3 tables below. |
| olink_instance_2 | Olink NPX values for the instance 2 visit. This table is only included if your application is approved for data-field #30900. |
| olink_instance_3 | Olink NPX values for the instance 3 visit. This table is only included if your application is approved for data-field #30900. |
| omop_condition_era | OMOP Condition Era. This table is only included if your application is approved for data-field #20142. |
| omop_condition_occurrence | OMOP Condition Occurrence. This table is only included if your application is approved for data-field #20142. |
| omop_death | OMOP Death. This table is only included if your application is approved for data-field #20142. |
| omop_device_exposure | OMOP Device Exposure. This table is only included if your application is approved for data-field #20142. |
| omop_dose_era | OMOP Dose Era. This table is only included if your application is approved for data-field #20142. |
| omop_drug_era | OMOP Drug Era. This table is only included if your application is approved for data-field #20142. |
| omop_drug_exposure | OMOP Drug Exposure. This table is only included if your application is approved for data-field #20142. |
| omop_measurement | OMOP Measurement. This table is only included if your application is approved for data-field #20142. |
| omop_note | OMOP Note. This table is only included if your application is approved for data-field #20142. |
| omop_observation | OMOP Observation. This table is only included if your application is approved for data-field #20142. |
| omop_observation_period | OMOP Observation Period. This table is only included if your application is approved for data-field #20142. |
| omop_person | OMOP Person. This table is only included if your application is approved for data-field #20142. |
| omop_procedure_occurrence | OMOP Procedure Occurrence. This table is only included if your application is approved for data-field #20142. |
| omop_specimen | OMOP Specimen. This table is only included if your application is approved for data-field #20142. |
| omop_visit_detail | OMOP Visit Detail. This table is only included if your application is approved for data-field #20142. |
| omop_visit_occurrence | OMOP Visit Occurrence. This table is only included if your application is approved for data-field #20142. |
When listing tables in SQL, you may notice each table appearing twice, under a regular name and a versioned name, such as "gp_clinical" and "gp_clinical_v4_0_9b7a7f3". This naming scheme is part of the system's architecture, supporting data refreshes and participant withdrawals. The "regularly named" table is actually a SQL VIEW pointing to the versioned table. When data is updated, the VIEW is switched to point to a new versioned table, and the old versioned table is deleted. Due to this behavior, always use the regularly named tables, such as "gp_clinical", because the versioned tables do not persist over time.
If your access application has been approved for data-fields 23146, 23148, and/or 23157, you will also see the following tables:
allele_23146, allele_23148, allele_23157, annotation_23146, annotation_23148, annotation_23157, assay_eid_map_23146, assay_eid_map_23148, assay_eid_map_23157, genotype_23146, genotype_23148, genotype_23157, pheno_assay_23146_link, pheno_assay_23148_link, pheno_assay_23157_link, rsid_lookup_r81_23146, rsid_lookup_r81_23148, and rsid_lookup_r81_23157.
These tables contain limited information about alleles and genotypes, transcribed into SQL from the pVCF files of data-fields 23146, 23148, and/or 23157 (along with added annotations). They are used by the Cohort Browser to create the "GENOMICS" tab. They have not been optimized for direct SQL querying, and their schema and conventions are subject to change. For this reason, accessing these tables directly is not recommended; access the bulk files instead.
Database Columns
For the main UK Biobank participant tables, the column-naming convention is generally as follows:
p<FIELD-ID>_i<INSTANCE-ID>_a<ARRAY-ID>
However, the following additional rules apply:
- If a field is not instanced, the _i<INSTANCE-ID> piece is skipped altogether.
- If a field is not arrayed, the _a<ARRAY-ID> piece is skipped altogether.
- If a field is arrayed due to being multi-select, the field is converted into a single column of type "embedded array", and the _a<ARRAY-ID> piece is skipped altogether.
Examples:
- Age at recruitment: p21022
- Date of attending assessment centre: p53_i0, p53_i1, ...
- Diagnoses - ICD10 (converted into embedded array): p41270
For all other tables - such as hospital records, GP records, death records, and COVID-19 records - the column names are identical to what UK Biobank provides in its Showcase. For more information on the columns of these tables, consult Resource #138483 (hospital records), Resource #591 (GP records), Resource #115559 (death records), Resource #3151 (COVID-19 GP records), or Resource #1758 (COVID-19 test results).
Tips for Using SQL
The main participant data is horizontally split into multiple tables, and you may find that SQL is less than suitable for querying those tables directly. To access main participant data, consider using the dataset construct as discussed below.
For linked health care tables, it is easier to use SQL directly to extract data as a Spark DataFrame. The following example retrieves all GP records related to serum HDL cholesterol levels.
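A minimal sketch: the database name is a placeholder, and the Read codes shown are illustrative only; verify the appropriate codes for your analysis.

```python
# GP clinical events matching serum HDL cholesterol Read codes
# (database name and codes are placeholders; verify the codes before use)
df = spark.sql("""
    SELECT *
    FROM database_xxxx__yyyy.gp_clinical
    WHERE read_2 = '44P5.' OR read_3 = 'XaEVr'
""")
```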
Spark DataFrames are lazily evaluated. In the code block above, the command will return right away, assigning the variable df without executing the query. The query is only evaluated when needed, potentially with additional transformations. For example, calling df.count() later will evaluate an equivalent SELECT COUNT(*) query.
For a list of all DataFrame functions, consult the PySpark DataFrame Documentation.
Troubleshooting
For general tips on troubleshooting, see the troubleshooting guide.
Data not exported

Warning: Out of memory

Invalid characters found in field names on line number(s) 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, ...
Note: if you don't provide an entity value, Table Exporter will use the "participant" entity table by default.

Failed to export data: An error occurred while calling o305.csv. : org.apache.spark.SparkException: Job aborted
Make sure you specified the entity to use.
Export participant ID (EID)
By default, the participant identifier (EID) is no longer extracted. In the Table Exporter app, you'll need to add "eid" to the File containing Field Names parameter, and to specify the entity parameter in the Advanced Options. The entity refers to the entity table from which the data is being extracted, e.g. "participant" or "olink_instance_0".
Are there spaces in the input argument "output" (Example: "physical activity")?
table_exporter.py: error: unrecognized arguments: activity
Remove the spaces or replace them with underscores, e.g. "physical_activity".
Is there a file containing a list of field names for the proteomics dataset?