Get answers to common questions about the Research Analysis Platform, and about UK Biobank data and systems.
Registration, Login, and Linking with the UK Biobank Access Management System.
Who can sign up for the UK Biobank Research Analysis Platform?
The Research Analysis Platform is open to researchers who are listed as collaborators on UK Biobank-approved access applications.
How do I register for the Research Analysis Platform?
Registration is a two-step process. You must first create a Research Analysis Platform user account, and then you must link it to your UK Biobank Access Management System (AMS) account.
How do I create a Research Analysis Platform account?
If your organization has been set up for Single Sign On (SSO), follow internal procedures specific to your organization. Otherwise:
If you already have a DNAnexus account, you do not need to create a separate Research Analysis Platform account. You can use your existing DNAnexus account on the Research Analysis Platform.
If you do not have an account, visit the Research Analysis Platform homepage and select Create an account. You will need to provide your full name and email, as well as a username and password that you want to use.
Does my Research Analysis Platform info (username and email address) need to match my AMS info?
No, your username and email address on Research Analysis Platform can be different from those you use on the AMS.
I tried creating a Research Analysis Platform account and got an error "Email Already Registered."
It looks like you already have a DNAnexus account. If your organization has been set up for SSO, follow your organization's internal procedures. Otherwise, visit the Research Analysis Platform Set Password page and enter your email address. You will receive an email with a password reset link, which you can use to reset your account password.
How do I log in to the Research Analysis Platform?
To log in:
If your organization has been set up for SSO, follow internal procedures.
How do I link my Research Analysis Platform account to my AMS account?
This process happens automatically upon first login (see "How do I log in to the Research Analysis Platform?"). You will be presented with the Research Analysis Platform Terms of Service, and once you read them (by scrolling down) and accept them, you will be taken to the AMS website, where you must enter your UK Biobank credentials.
What are my AMS credentials?
If you have forgotten your AMS username or password, you can retrieve them via the AMS Login page.
Can I access the Research Analysis Platform without an AMS account?
No. To access the Research Analysis Platform you must have an AMS account, and you must be listed as a collaborator on at least one UK Biobank-approved access application.
I entered my valid AMS credentials but got an approval error.
You must finish the AMS registration process, and be approved by UK Biobank. For more information, see the UK Biobank User Guide.
Can I link multiple Research Analysis Platform accounts to the same AMS account?
No, an AMS account may be linked to only one Research Analysis Platform account.
Can I link multiple AMS accounts to the same Research Analysis Platform account?
No, a Research Analysis Platform account may be linked to only one AMS account.
Can I "unlink" my Research Analysis Platform account from my AMS account?
No, this operation is not supported.
I previously linked my Research Analysis Platform account to my AMS account, but during a subsequent Research Analysis Platform login, I was asked to do so again.
Occasionally the platform may ask you to refresh your link for security reasons. Among other cases, this can happen if your state on the AMS changes for any reason (e.g. if you update your contact details on the AMS).
Projects and Files
What is a project?
On the UK Biobank Research Analysis Platform, all work takes place in the context of a project. Projects allow a defined set of users to:
What is an access application?
An access application is a research application submitted to UK Biobank by a Principal Investigator. It includes a written research proposal and a set of UK Biobank data-fields to which access is requested. UK Biobank assigns a unique numeric identifier to each application. All activity on the Research Analysis Platform needs to be done within the context of such an access application.
When making a project, I get the error "Application does not belong to this user or is not an approved application".
Please ensure that you are listed as a collaborator in the access application on the AMS.
Can I have a project associated with multiple applications?
No, each project is tied to one application only.
After a project has been created, can I change its application?
No, the application is set at project creation time and cannot be changed.
Can I have multiple projects associated with the same application?
Yes, you can make multiple projects using the same application id.
Data has been dispensed to my project, but not all the data I am expecting is there
The most common reason for data not appearing on the Research Analysis Platform is that your UK Biobank access application has not been fully completed.
For new applications, please check your application in the AMS. If the project's status is “Underway,” then your data should be ready to be dispensed to your project on the Platform.
Upgraded / Additional Data Requested
If you have applied to move your application to a tiered application, or have requested further data, the request must go through several steps, including quotation, new MTAs, and payment. You will receive an email notifying you that the MTA has been executed; MTA execution is the final step in the process. Once you have received this email, your data will be ready to be dispensed. If data has already been dispensed to a project on the Platform, you will need to have data dispensed to a new project in order to receive any new data.
I created a project and chose to dispense data, but I don't see any data.
The process of dispensing data happens over a short period of time. When you first select the Create Project button to submit the New Project dialog, the new project will appear empty. Subsequently, it will begin to get populated with files and other data. You can monitor the process by going back to the project list, where you can see the project status, including what percentage of the data has been populated.
How long does it take for the data to appear in a new project?
The process can take as little as 20 minutes or as long as 2 hours, depending on the scope of the access application.
Do I need to remain logged in while the data is being dispensed in a new project?
No, the process happens in the background, even if you are logged out.
I created a project but it's stuck at "0%".
Your request to dispense data may be queued behind that of other users. The system will service your request in the order it was received. We appreciate your patience during that time.
What kind of data is dispensed when I create a project?
The system dispenses the data that correspond to the approved data-fields of the access application associated with the project. Tabular data-fields and linked health data are dispensed into a SQL database, and bulk data-fields are dispensed as files.
Can I access and use Research Analysis Platform projects on the DNAnexus Platform?
If you use the same account on both platforms, you will be able to access and use Research Analysis Platform projects on the DNAnexus Platform. Note, however, that:
You will only be able to access and use tools that are hosted in the London AWS region.
All sharing, download, and other data-use restrictions apply fully to UK Biobank data, on both platforms.
What is a data-field?
All data in the UK Biobank resource are organized into data-fields. Your access application is approved for a precise subset of those data-fields. You can find more information about data-fields, broken down by type, on the UK Biobank Field Listing page.
What data-fields is my access application approved for?
You can get more information on your access application by logging into the AMS and selecting the Applications tab on the left.
What is in the "Bulk" folder?
The "Bulk" folder contains files associated with data-fields of type "bulk". These are data items that are particularly large and/or complex and are therefore made available as files, such as genome sequencing files.
What is an EID?
UK Biobank is a resource compiled from approximately 500,000 volunteer participants. Each participant is uniquely identified by a 7-digit numeric identifier (EID), typically in the 1,000,000 - 6,000,000 range. These identifiers are scrambled for each access application, hence the EIDs will not match across applications. For more information, refer to https://www.ukbiobank.ac.uk/media/5bvp0vqw/de-identification-protocol.pdf.
How are EIDs used on the Research Analysis Platform?
When you create a project on the Research Analysis Platform, the system contacts UK Biobank to get your application's EIDs, then uses them to pseudonymize your dataset. The pseudonymized EIDs are used, for example, to populate the "eid" column in the database, to name per-participant files, to generate the EID-specific content of FAM files for genotyping fields, and to adjust pVCF headers.
For a given application, is the Research Analysis Platform using the same EIDs as data on UK Biobank's website?
Yes, for a given application, data on the Research Analysis Platform contain the same EIDs as data directly downloaded from UK Biobank's website.
I am interested in a specific participant EID. What files are available?
You can find all files associated with a specific participant EID using the Web UI. Inside a project, select the filter icon on the right, then select the filter settings icon below and choose Properties. A new property filter will appear in the filter bar. Select Any properties and type "eid" (lowercase, without the quotes) in the left textbox that is labeled "Any Key". Type the numeric 7-digit value in the right textbox that is labeled "Any Value", and select Apply. To search across all folders, make sure the search scope is set to Entire Project instead of the default Current Folder Only.
You can also do the same using the CLI. Type:
dx find data --property eid=1234567
Note that these methods find participant-specific files (like individual VCF or CRAM files) and not cohort-wide files (like PLINK or pVCF files).
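If you want to script the CLI search above, the following is a minimal Python sketch. It only assembles the same `dx find data --property eid=...` command shown above and, optionally, runs it. The function names are hypothetical, the 7-digit check reflects how this FAQ describes EIDs, and running the command assumes the dx CLI is installed, you are logged in, and a project is selected.

```python
import re
import subprocess


def build_find_command(eid):
    """Return the argument list for `dx find data --property eid=<EID>`."""
    eid = str(eid)
    # This FAQ describes EIDs as 7-digit numeric identifiers.
    if not re.fullmatch(r"\d{7}", eid):
        raise ValueError(f"expected a 7-digit EID, got {eid!r}")
    return ["dx", "find", "data", "--property", f"eid={eid}"]


def find_participant_files(eid):
    """Run the command and return its output lines (needs a working dx CLI)."""
    result = subprocess.run(
        build_find_command(eid), capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works offline.
    print(" ".join(build_find_command(1234567)))
```

Validating the EID before calling the CLI catches typos early, which matters because a mistyped value simply returns no files rather than an error.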
In the header of pVCF files, why are there samples named "W000001", "W000002", etc.?
These samples correspond to participants that have withdrawn, and the Research Analysis Platform uses this convention to denote them in the header, to help you exclude them from your research.
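As an illustration of how you might exclude these placeholder samples, here is a small Python sketch that splits the sample names of a pVCF `#CHROM` header line into kept and withdrawn groups. It assumes withdrawn samples always follow the "W" plus digits pattern shown above; the function name and the example header are hypothetical.

```python
import re

# Assumption: withdrawn samples are named "W" followed only by digits.
WITHDRAWN = re.compile(r"W\d+")


def split_samples(header_line):
    """Split a VCF '#CHROM' header line into (kept, withdrawn) sample names."""
    columns = header_line.rstrip("\n").split("\t")
    # The first 9 VCF columns are fixed (#CHROM ... FORMAT); samples follow.
    samples = columns[9:]
    kept = [s for s in samples if not WITHDRAWN.fullmatch(s)]
    withdrawn = [s for s in samples if WITHDRAWN.fullmatch(s)]
    return kept, withdrawn


if __name__ == "__main__":
    fixed = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    line = "\t".join(fixed + ["1234567", "W000001", "7654321", "W000002"])
    print(split_samples(line))
```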
Since the Research Analysis Platform pseudonymizes pVCF headers, does that mean that different researchers see different content when accessing the same file?
The pseudonymized pVCF headers are specific to a particular access application. Researchers who work on different applications will encounter different headers for each, just as they encounter different content for the FAM files of PLINK fields.
Are the headers of gVCF or CRAM files pseudonymized?
No, the content of these files is not pseudonymized. However, the names of these files are pseudonymized accordingly. Therefore, we recommend relying on the filename prefix for determining the EID corresponding to a gVCF or CRAM file, and discarding any identifiers found in the gVCF or CRAM header.
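The recommendation above can be sketched in Python as follows. This is an illustration only: it assumes the filename starts with the 7-digit EID followed by a non-digit character, and both the example filename and the function name are hypothetical. Check the actual names in your project before relying on it.

```python
import re


def eid_from_filename(name):
    """Return the leading 7-digit EID of a per-participant filename, or None."""
    # Exactly 7 digits at the start, not followed by another digit.
    match = re.match(r"(\d{7})(?=\D|$)", name)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical filename, used only to show the idea.
    print(eid_from_filename("1234567_example.cram"))
```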
What is in the "Showcase metadata" folder?
This folder contains all the files published by UK Biobank, as described on the UK Biobank Showcase Schema page. These files describe aspects of the UK Biobank Showcase, including all fields available in the UK Biobank resource.
The files under "Showcase metadata" are different from what's on the UK Biobank showcase website.
The files in the "Showcase metadata" folder represent the showcase metadata at the time that the data was ingested in the system, and may not reflect the latest showcase updates.
How did UK Biobank generate the bulk files? What instruments, assays or scientific workflows were used?
For information about data provenance, please consult the UK Biobank website or contact UK Biobank directly.
I found my bulk data-fields under the "Bulk" folder. Where are the rest of the non-bulk data-fields?
All other non-bulk data-fields (for which UK Biobank defines the item type as "data", "sample", or "record") are dispensed into a SQL database and associated Research Analysis Platform dataset.
Can data be downloaded, exported or otherwise egressed out of the Research Analysis Platform?
From a policy standpoint, you are responsible for complying with the Material Transfer Agreement (MTA) and with any other rules set forth by UK Biobank. As of June 2021, Annex 1 of the MTA states that "any WGS (whole genome sequence) or WES (whole exome sequence) files [..] must not be transmitted or downloaded from the research analysis platform". In addition, depending on the tier of your access application, you may or may not be allowed to egress certain other data.
To help you comply, the Platform may restrict external downloads of certain original files, using rules specific to your application tier. These restrictions are not comprehensive, and it is your responsibility to refrain from actions that would violate the MTA even if the Platform does not technically restrict those actions.
Project sharing and cross-project collaboration
How do I see who has access to a project?
In the projects list, under the "Members" column, select the number corresponding to the row of interest. Alternatively, from inside a project, select the Share icon on the upper right (next to the "Access:" label).
I just created a project, and the system says there are two users. Who is the other user?
While the project is being populated with data, the system adds a service user called "UK Biobank Robot" (username: ukb.robot).
Why is the user "ukb.robot" in my project?
The system automatically adds this service user to a project whenever the project is being edited or updated, such as when data is being dispensed in a newly created project. The system uses that user to perform any necessary data manipulations in an automated manner.
Can I remove or alter the access of the user "ukb.robot" in my project?
No, but the system will automatically remove that user once any necessary data manipulations are completed.
How do I share a project with other users?
If you are a project administrator, from inside a project select the Share icon on the upper right to launch the sharing dialog. Enter the username or email of the user you want to share the project with, select their access level, then select Add User.
I tried sharing a project with someone and got an error.
You can only share a project with Research Analysis Platform users who are listed as collaborators in the project's access application on the AMS. If you receive an error, please ensure the following:
The username or email you are entering exists. You cannot share a project with someone if they have not yet signed up for an account.
You are sharing with a linked Research Analysis Platform account. You cannot share a project with an account if they have not yet logged into the Research Analysis Platform and linked their account to the AMS (or if their link needs to be refreshed).
You are sharing with someone on the same application. You cannot share a project with a linked Research Analysis Platform account unless they are listed as collaborators in the project's access application on the AMS.
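The three checks above can be read as a simple decision procedure. The Python sketch below models them with a hypothetical in-memory `Account` record. It is not how the Platform implements sharing, and none of these names correspond to a real API; it only makes the order of the checks explicit.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """Hypothetical stand-in for a Research Analysis Platform account."""
    username: str
    linked_to_ams: bool = False
    applications: set = field(default_factory=set)  # application IDs on the AMS


def sharing_blockers(accounts, username, project_application):
    """Return the reasons (possibly none) why sharing would be refused."""
    account = accounts.get(username)
    if account is None:
        return ["no account exists for that username or email"]
    problems = []
    if not account.linked_to_ams:
        problems.append("account is not linked to the AMS, or the link needs refreshing")
    if project_application not in account.applications:
        problems.append("user is not a collaborator on the project's access application")
    return problems


if __name__ == "__main__":
    accounts = {"alice": Account("alice", linked_to_ams=True, applications={68444})}
    print(sharing_blockers(accounts, "alice", 68444))  # empty list: sharing allowed
    print(sharing_blockers(accounts, "bob", 68444))
```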
Can I share a project with a group of people at once?
No, you must share with each person individually, as the platform needs to enforce AMS permissions at the user level.
Can I share a project with Customer Support?
Yes. By default, Customer Support does not have access to any projects, unless you explicitly share a project with them. To do that, in the project sharing dialog enter "org-support" (without the quotes) as the username, select Viewer as the access level, and select Add User.
Can I share a project with UK Biobank staff?
Yes. The system supports a special alias that you can use to share a project with UK Biobank. In the project sharing dialog enter "org-ukb_reviewers" (without the quotes) as the username, select Viewer as the access level, and select Add User. This action shares your project with a specific UK Biobank team, managed by UK Biobank themselves. The purpose of this team is to receive your research results.
Can I share just a subset of my data, instead of the whole project?
Sharing is on a project basis. If you need to share a subset of data, such as the files in one folder, we recommend copying them to a new project and then sharing that project, as follows:
In the project list page, select New Project. Enter the same application id as your existing project, and deselect the option Dispense data to the project. Select Create Project. This will create a new empty project, associated with the same application as your existing project.
In your existing project, tick the items you want to share, and select Copy. Select the new project, then select Copy Selected.
Share the new project.
Are there any restrictions in copying data across projects?
You may only copy data across projects associated with the same access application. If you have uploaded a file in a project associated with one application, and you need to use it in a second project associated with a different application, you must re-upload it in the second project.
Which job priority do I choose for my analysis?
You can assign each job a different priority, depending on whether you want to prioritize job execution speed or cost control. See the page Managing Job Priority.
What compute instance types are available for running my analysis?
See the Platform rate card for a full list of available AWS instance types, including detailed specs for each on number of cores, amount of RAM, storage memory type and size, and cost.
How do I visualize a CRAM or VCF file using IGV.js?
To visualize a CRAM or VCF file in the IGV.js genome browser, follow these steps:
Navigate to the project containing the files you want to visualize.
Select the VISUALIZE tab.
Select the option "IGV.js Genome Browser v2.6.6 (*.bam+bai, *.cram+crai, *vcf.gz+tbi)".
Select the files you want to visualize.
If you are looking for a specific participant, enter the EID in the Search Project textbox, to quickly locate any CRAM or VCF files related to that participant.
For CRAM files, you must select both the CRAM and the associated CRAI file.
For VCF files, you must select both the VCF and the associated TBI file.
IMPORTANT: Note that IGV.js cannot visualize extremely large pVCF files, such as those provided for the 200k WES, 300k WES or 150k WGS releases. If you want to visualize variants in the 150k WGS cohort, type either "qc_metrics_graphtyper" or "qc_metrics_gatk" into the Search Project textbox and select the resulting pair of *.tab.gz and *.tab.gz.tbi files.
Select Launch Viewer.
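When preparing a file selection for the viewer, it can help to confirm that each data file has its index alongside it. The sketch below assumes the index is named by appending `.crai` or `.tbi` to the data filename, in line with the `*.tab.gz` and `*.tab.gz.tbi` pairing mentioned above. The filenames and the function name are hypothetical.

```python
# Assumed convention: index filename = data filename + suffix.
INDEX_SUFFIX = {".cram": ".crai", ".vcf.gz": ".tbi"}


def with_index(data_file, available):
    """Return (data_file, index_file), checking the index is in `available`."""
    for extension, suffix in INDEX_SUFFIX.items():
        if data_file.endswith(extension):
            index_file = data_file + suffix
            if index_file not in available:
                raise FileNotFoundError(f"missing index {index_file}")
            return data_file, index_file
    raise ValueError(f"unsupported file type: {data_file}")


if __name__ == "__main__":
    files = {"sample.cram", "sample.cram.crai"}
    print(with_index("sample.cram", files))
```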
Databases and datasets
What is the database found in the root folder of the project?
This is a database containing tables, columns, and rows that correspond to the approved data-fields of the access application associated with the project. It is a SQL database based on Spark SQL, a more modern and scalable technology than classic relational database management systems (RDBMS).
See this page for more information about databases and datasets.
What tables are included in the dispensed database?
The database contains the following tables:
participant_0001, ..., participant_9999
These tables contain the main UK Biobank participant data. Each participant is represented as one row, and each data-field is represented as one or more columns. For scalability reasons, the data-fields are horizontally split across multiple tables, starting from table participant_0001 (which contains the first few hundred columns for all participants), followed by participant_0002 (which contains the next few hundred columns), etc. The exact number of tables depends on how many data-fields your application is approved for.
Hospitalization records. This table is only included if your application is approved for data-field #41259.
Hospital critical care records. This table is only included if your application is approved for data-field #41290.
Hospital delivery records. This table is only included if your application is approved for data-field #41264.
Hospital diagnosis records. This table is only included if your application is approved for data-field #41234.
Hospital maternity records. This table is only included if your application is approved for data-field #41261.
Hospital operation records. This table is only included if your application is approved for data-field #41149.
Hospital psychiatric records. This table is only included if your application is approved for data-field #41289.
Death records. This table is only included if your application is approved for data-field #40023.
Death cause records. This table is only included if your application is approved for data-field #40023.
GP clinical event records. This table is only included if your application is approved for data-field #42040.
GP registration records. This table is only included if your application is approved for data-field #42038.
GP prescription records. This table is only included if your application is approved for data-field #42039.
GP clinical event records (COVID TPP). This table is only included if your application is approved for data-field #40101.
GP prescription records (COVID TPP). This table is only included if your application is approved for data-field #40102.
GP clinical event records (COVID EMIS). This table is only included if your application is approved for data-field #40103.
GP prescription records (COVID EMIS). This table is only included if your application is approved for data-field #40104.
COVID19 Test Result Record (England). This table is only included if your application is approved for data-field #40100.
COVID19 Test Result Record (Scotland). This table is only included if your application is approved for data-field #40100.
COVID19 Test Result Record (Wales). This table is only included if your application is approved for data-field #40100.
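Because the participant data is split horizontally across the participant_0001, participant_0002, ... tables, reading columns from more than one table means joining them. The Python sketch below builds such a query as a string. It assumes every participant table carries the same "eid" column and can be joined on it, and the placement of columns in particular tables in the example is made up; look up where your columns actually live before using anything like this.

```python
def participant_table(index):
    """Return the table name for a 1-based index, e.g. 1 -> participant_0001."""
    if not 1 <= index <= 9999:
        raise ValueError("table index must be between 1 and 9999")
    return f"participant_{index:04d}"


def join_query(columns_by_table):
    """Build a SQL string joining participant tables on eid.

    columns_by_table maps a table index to the column names wanted from it.
    """
    indexes = sorted(columns_by_table)
    first = participant_table(indexes[0])
    select = [f"{first}.eid"]
    for i in indexes:
        select += [f"{participant_table(i)}.{c}" for c in columns_by_table[i]]
    sql = f"SELECT {', '.join(select)} FROM {first}"
    for i in indexes[1:]:
        table = participant_table(i)
        # Assumption: one row per participant in every table, keyed by eid.
        sql += f" JOIN {table} ON {table}.eid = {first}.eid"
    return sql


if __name__ == "__main__":
    # Hypothetical placement of columns across tables.
    print(join_query({1: ["p21022"], 2: ["p53_i0"]}))
```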
How are column names determined for the dispensed database?
For the main UK Biobank participant tables, the column naming convention is generally as follows:
p<FIELD-ID>_i<INSTANCE-ID>_a<ARRAY-ID>
However, the following additional rules apply:
If a field is not instanced, the _i<INSTANCE-ID> piece is skipped altogether.
If a field is not arrayed, the _a<ARRAY-ID> piece is skipped altogether.
If a field is arrayed due to being multi-select, the field is converted into a single column of type "embedded array", and the _a<ARRAY-ID> piece is skipped altogether.
Examples:
Age at recruitment: p21022
Date of attending assessment centre: p53_i0, p53_i1, ...
Diagnoses - ICD10 (converted into embedded array): p41270
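The naming rules can be expressed as a short Python helper. This is a sketch inferred from the rules and examples above (a "p" prefix, then optional `_i` and `_a` pieces); the function name is hypothetical, and the caller is expected to omit `array` for multi-select fields that are stored as a single embedded-array column.

```python
def column_name(field_id, instance=None, array=None):
    """Build a participant-table column name from its parts.

    Pass instance=None for fields that are not instanced, and array=None for
    fields that are not arrayed or that are converted into an embedded array.
    """
    name = f"p{field_id}"
    if instance is not None:
        name += f"_i{instance}"
    if array is not None:
        name += f"_a{array}"
    return name


if __name__ == "__main__":
    print(column_name(21022))            # Age at recruitment
    print(column_name(53, instance=0))   # Date of attending assessment centre
    print(column_name(41270))            # Diagnoses - ICD10 (embedded array)
```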
For all other tables (such as hospital records, GP records, death records, or COVID-19 records), the column names are identical to what UK Biobank provides in the data showcase. For more information on the columns of these tables, consult Resource #138483 (hospital records), Resource #591 (GP records), Resource #115559 (death records), Resource #3151 (COVID-19 GP records), or Resource #1758 (COVID-19 test results).
What is a data release version?
The Research Analysis Platform holds a copy of all UK Biobank data. All projects are created using this copy of UK Biobank data. As UK Biobank updates the data on their end, the copy held by the Research Analysis Platform is periodically updated to reflect these upstream updates. Whenever the Research Analysis Platform updates its copy of the data, it will be indicated by a new data release version.
I am about to make a new project and choose the option to dispense data. What data release version will my data correspond to?
The data in your project will be dispensed out of whatever copy is held by the Research Analysis Platform at the time that you create the project. Therefore, your data will correspond to the latest data release version at that time.
I previously created a project. How is my project affected by new data releases?
Your existing project is not affected, and will continue to reflect the data release version from the time that the project was created.
The data in my project is not up to date. What can I do?
To take advantage of new data releases, you will need to make a new project. In the future, we plan to launch additional features to allow you to opt into in-place upgrades for existing projects.
How can I find out what data release version was used, for an existing project?
In a project, locate and tick the dataset that was dispensed in the root folder. Click the info icon on the upper right to open the info panel. Scroll to the bottom to reveal the "Details" section. The value of the "Description" key contains the original version, e.g.
"Description" = "Dataset: app68444_202101290057.dataset Original Version: v3.0+ae7924f"
Example of where to find the data release version for an existing project. Data version is shown in the "Details" section in the info panel.
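If you copy the "Description" value out of the info panel, the version can be pulled out with a small regular expression. The Python sketch below assumes the description always contains the label "Original Version:" followed by the version token, as in the example above; the function name is hypothetical.

```python
import re


def original_version(description):
    """Return the token after 'Original Version:' in a description, or None."""
    match = re.search(r"Original Version:\s*(\S+)", description)
    return match.group(1) if match else None


if __name__ == "__main__":
    text = "Dataset: app68444_202101290057.dataset Original Version: v3.0+ae7924f"
    print(original_version(text))
```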
The scope of my UK Biobank access application has been expanded. Will new data automatically appear in my projects?
No. If you have been approved for new fields, this change will not apply to existing projects. To take advantage of the increased application scope (regardless of whether there has been a new data release) you need to make a new project.