Accessing Phenotypic Data as a File

Learn how to export selected phenotypic fields into a TSV or CSV file, for easy browsing and analysis.

If you've worked with UK Biobank data prior to using the Research Analysis Platform, you may be aware that UK Biobank distributes the main tabular dataset in a large encoded file with the extension .enc_ukb. To work with the dataset, you first convert this file to TSV or CSV format.

On the Research Analysis Platform, this dataset is dispensed into your project as a database, in Parquet format. You can access this database within a Spark environment - for example, by querying it from inside a Spark JupyterLab session.

If you have existing code that relies on reading just a handful of fields from a file, you may find it easier to extract those fields from the database and dump them into a TSV or CSV file. You can then run your code, or otherwise work with the file, without needing a Spark environment.

Selecting Fields of Interest in the Cohort Browser

Start by navigating to your project and clicking on the name of the dispensed dataset. The Cohort Browser will launch.

In the Cohort Browser, open the Data Preview tab:

Click the "grid" icon at the right end of the Participant ID header row. Then click Add Columns. The Add Columns to Table dialog will open:

Navigate to any field, either directly or via search. Once you've found the field you're looking for, click Add as Column:

Continue locating the fields you're interested in, and adding them as columns. Note that as you add additional fields as columns, you do not have to wait for the Data Preview to finish loading.

Once you've finished, close the dialog by clicking the X to the right of the Add Columns to Table title. In the Data Preview tab, you'll see the first few rows of the data.

In the upper right corner of the screen, click Views, then click Save View. Enter a name for the view, then save it.

Creating a TSV or CSV File Using Table Exporter

Now convert your saved view into a TSV or CSV file, using the Table Exporter app.

Navigate back to your project and click the Start Analysis button in the upper right corner of the screen. In the Start New Analysis dialog, select the Table Exporter app, then click Run Selected. Note that if this is the first time you've run Table Exporter, you'll be prompted to install it first.

Selecting an Input

Within the Table Exporter app, open the Analysis Inputs tab on the right side of the screen. Then click the Dataset or Cohort or Dashboard tile:

A modal window will open. Select the view that you created and saved in the Cohort Browser.

Configuring Output Options

Within the Options section, configure your output options.

In the Output File Name field, enter a filename prefix. In the Output File Format field, select "CSV" or "TSV." You may find it easier to work with a TSV file downstream, because the values in certain fields may contain commas, which complicates parsing a CSV file.
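To illustrate the comma issue, here is a minimal sketch: a naive comma split mangles a value that contains a comma, while the same record round-trips cleanly through tab-separated output. The record below is a made-up example, not real UK Biobank data.

```python
import csv
import io

# Hypothetical record: the second value contains a comma, as the
# values of some free-text fields do.
row = ["1000001", "Hypertension, essential"]

# Splitting the CSV line naively on "," yields three pieces, not two.
naive = "1000001,Hypertension, essential".split(",")
print(len(naive))  # 3

# The same record round-trips cleanly through tab-separated output.
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerow(row)
parsed = next(csv.reader(io.StringIO(buf.getvalue()), delimiter="\t"))
print(parsed == row)  # True
```

A strict CSV parser that honors quoting can also cope with embedded commas, but tabs sidestep the issue entirely for quick downstream scripts.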

In the Coding Option field, select "RAW" so that you can work with the original UK Biobank data, as you would get them from the Biobank. (For example, in the Sex field, you will see the coded value "0" rather than "Female.")
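With RAW output, decoding those values is up to your downstream code. Below is a sketch of one way to do it; the mapping assumes the standard UK Biobank coding for the Sex field (0 = Female, 1 = Male), and you should verify the data-coding for each field you export.

```python
# Hypothetical decoding step for RAW-coded values. The mapping assumes
# UK Biobank's coding for the Sex field; check the coding per field.
SEX_CODING = {0: "Female", 1: "Male"}

def decode(value, coding):
    # Fall back to the raw value if it has no label in the coding.
    return coding.get(value, value)

print(decode(0, SEX_CODING))  # Female
```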

In the Header Style field, select "UKB-FORMAT" to get headers that match the original UK Biobank format (e.g. 123-4.5).
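Assuming the 123-4.5 pattern stands for field ID 123, instance 4, array index 5, such headers can be split programmatically. A sketch:

```python
import re

# UKB-FORMAT headers follow FIELD-INSTANCE.ARRAY, e.g. "123-4.5".
# (A column like "eid" does not follow this pattern and is left alone.)
HEADER_RE = re.compile(r"^(?P<field>\d+)-(?P<instance>\d+)\.(?P<array>\d+)$")

def parse_ukb_header(header):
    m = HEADER_RE.match(header)
    if m is None:
        return None  # e.g. "eid"
    return {k: int(v) for k, v in m.groupdict().items()}

print(parse_ukb_header("123-4.5"))  # {'field': 123, 'instance': 4, 'array': 5}
```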

Launching the Table Exporter App and Viewing the Converted File

Click Start Analysis. Once the conversion finishes and the file is ready, you will be notified via email. To access the file, either return to your project, or click the link in the email.
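Once downloaded, the file can be read with ordinary tooling, no Spark required. Here is a sketch using Python's standard library; the headers and values below are hypothetical stand-ins for an exported TSV.

```python
import csv
import io

# Stand-in for the exported TSV; in practice you would use
# open("my_export.tsv", newline="") instead of StringIO.
sample = (
    "eid\t31-0.0\t21022-0.0\n"
    "1000001\t0\t52\n"
    "1000002\t1\t61\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
print(rows[0]["31-0.0"])  # 0
```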

Troubleshooting

For general troubleshooting tips, see the troubleshooting guide.

| Issue | Example error message | What to do |
| --- | --- | --- |
| Data not exported | Warning: Out of memory | Adjust the instance type to one with more memory/storage and re-run your Table Exporter job. Alternatively, try the dx extract_dataset command within Spark JupyterLab (example code here). |
| | Invalid characters found in field names on line number(s) 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13,... | Check that you provided your inputs correctly, following the documentation. Note: if you don't provide an entity value, Table Exporter will use the "Participant" entity table by default. |
| | Failed to export data: An error occurred while calling o305.csv. : org.apache.spark.SparkException: Job aborted | Make sure you specified the entity to use. |
| Export participant ID (EID) | By default, the participant identifier (EID) is no longer extracted. | In the Table Exporter app, add "eid" to the File Containing Field Names parameter and specify the entity parameter in the Advanced Options. Entity refers to the entity table from which you are extracting data, e.g. "participant" or "olink_instance_0". Alternatively, if using the dx extract_dataset command, specify `<entity>.eid` as one of the field names in your query (see example). |
| Are there spaces in the input argument "output" (e.g. "physical activity")? | table_exporter.py: error: unrecognized arguments: activity | Remove spaces or replace them with underscores, e.g. "physical_activity". |
| Is there a file containing a list of field names for the proteomics dataset? | | All protein fields can be found here. |
