The UK Biobank Research Analysis Platform is built on the DNAnexus Platform.
This documentation provides descriptions and instructions for accessing UK Biobank data within the Research Analysis Platform.
For a detailed rundown of DNAnexus Platform features and how to leverage them in your work, refer to the Platform documentation.
UK Biobank is a national and international health resource with unparalleled research opportunities, open to all bona fide health researchers. UK Biobank aims to improve the prevention, diagnosis and treatment of a wide range of serious and life-threatening illnesses – including cancer, heart diseases, stroke, diabetes, arthritis, osteoporosis, eye disorders, depression and forms of dementia. It is following the health and well-being of 500,000 volunteer participants and provides health information, which does not identify them, to approved researchers in the UK and overseas, from academia and industry.
The UK Biobank Research Analysis Platform is an informatics platform that provides access to, and analysis of, UK Biobank data by its registered researcher community. Read the announcement.
See the DNAnexus website for more information on the Research Analysis Platform, including details on applying for access.
This FAQ addresses questions related to the new data dispensing functionality that allows users to select which elements of the data to dispense. If you would like more information on the new 500k WGS data release, visit the UK Biobank FAQ.
You can subscribe at https://status.dnanexus.com/
The refresh feature is currently unavailable, to ensure that the maximum number of users can get access to the new data via dispensal as soon as possible.
We recommend that users dispense a new project to get the 500k WGS data, and migrate data analysis workflows from existing projects to the new project. We will enable the “refresh” feature again in the future and send notifications out once it is available.
We recommend that each research application dispense data to only one project to be considerate to other researchers who would like to access the data.
Each dispense request will take about 4-8 hours once your project starts dispensing. However, due to the large number of people interested in 500k WGS data and the size of this data, you might experience a long waiting time for your project dispensal to start due to the queue of requests. Please do not dispense more than one project.
Your request to dispense data may be queued behind that of other users. The system will service your request in the order it was received. We appreciate your patience during that time. For more information, see https://dnanexus.gitbook.io/uk-biobank-rap/frequently-asked-questions#i-created-a-project-but-its-stuck-at-0-.
You will need to create a new project in order to access the data. Note that you will not be able to refresh an already existing project. On the project creation screen, users will now see a new section with the different data types available to dispense. For a faster dispensal time, only select what data you’ll need. You will have the option to dispense the data on project creation or later in the project settings of that new project.
If you are interested in accessing the updated phenotypic, health care and proteomics data, select structured tabular data. This option is selected by default, but can be unselected if the data is not necessary for your project.
If you are interested in accessing the updated imaging data or the population-level WGS pVCF data, select unstructured bulk data files. This option will dispense population-level WGS pVCF data (600,000 files), but not individual-level WGS data such as CRAM or gVCF files. This was decided in order to streamline the new project experience for all users. If your research requires access to the individual-level WGS data (18 million files), return to the project once the initial dispensing is completed and request an additional dispensing of these data files.
Due to the size of the dispensal, we recommend waiting until demand for the WGS data has decreased.
Due to the size of the data (18 million files), we recommend waiting until demand for the WGS data has decreased. If your research requires access to the individual-level WGS data, you will have to request "Additional Bulk Data Files" after your first request has been completed. You can make the request in your project settings by selecting the "Dispense More Data" button.
You can create an empty project without dispensing data by deselecting both checkboxes on the project creation screen.
See the below table for details.
They can be found at the two locations below:
/Bulk/GATK and GraphTyper WGS/GraphTyper population level WGS variants, pVCF format [500k release]/
/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]
Get started using the Research Analysis Platform.
Data dispensal can take over an hour, or even longer, in some cases. You can monitor the process by going back to the project list, where you can see the project status, including what percentage of the data has been populated.
To collaborate with colleagues who are named on the same approved access application, add them as members to the project associated with this application.
Get answers to common questions about the Research Analysis Platform, and about UK Biobank data and systems.
The Research Analysis Platform is open to researchers who are listed as collaborators on UK Biobank-approved access applications.
Registration is a two-step process. You must first create a Research Analysis Platform user account, and then you must link it to your UK Biobank Access Management System (AMS) account.
If your organization has been set up for Single Sign On (SSO), follow internal procedures specific to your organization. Otherwise:
If you already have a DNAnexus account, you do not need to create a separate Research Analysis Platform account. You can use your existing DNAnexus account on the Research Analysis Platform.
No, your username and email address on the Research Analysis Platform can be different from those you use on the AMS.
To log in:
If your organization has been set up for SSO, follow internal procedures.
This process happens automatically upon first login (see "How do I log in to the Research Analysis Platform?"). You will be presented with the Research Analysis Platform Terms of Service, and once you read them (by scrolling down) and accept them, you will be taken to the AMS website, where you must enter your UK Biobank credentials.
No. To access the Research Analysis Platform you must have an AMS account, and you must be listed as a collaborator in one of the Research Analysis Platform-approved applications.
No, an AMS account may be linked to only one Research Analysis Platform account.
No, a Research Analysis Platform account may be linked to only one AMS account.
No, this operation is not supported.
Occasionally the platform may ask you to refresh your link, for security reasons. Among other reasons, this can happen if your state on the AMS changes (e.g. if you update your contact details on the AMS).
Currently, there is no limit on how many projects users can create. However, we recommend that everyone under the same research application use a single shared project. This allows better coordination when there is a new data release, as well as better reuse of tools, workflows, and data that users have generated.
Projects on UKB-RAP that are considered inactive or unfunded are eligible for deletion and will be removed if one of the following criteria is met. This helps ensure the best user experience for active projects and helps optimize use of the platform.
The project has not been accessed in the last 60 days, meaning that no one with access to the project has made requests to browse its folders and files. In addition, the project contains only the dispensed UK Biobank data and no derived data generated by the user or others.
OR
The project is billed to a wallet that has no funds available. In addition, the project contains generated data resulting in ongoing storage charges.
If your project is considered inactive and for any reason you would like to keep it, simply access the project again. If the project is no longer inactive, it will not be deleted. If your project is considered unfunded and for any reason you would like to keep it, take one of the following actions:
Add funds to the wallet that the project is billed to.
Transfer the project to another wallet that has funds.
Delete all generated data and ensure that no user-generated data remains in the project.
On the UK Biobank Research Analysis Platform, all work takes place in the context of a project. Projects allow a defined set of users to:
Access specific datasets
Conduct analyses on these datasets
Please ensure that you are listed as a collaborator in the access application on the AMS.
No, each project is tied to one application only.
No, the application is set at project creation time and cannot be changed.
Yes, you can make multiple projects using the same application id.
The most common reason for data not showing in the Research Analysis Platform is that your UKB Access Application has not been fully completed.
If you have applied to move your application to a tiered application, or requested further data, this will need to go through a number of steps: quotation, new MTAs, payment, etc. You will receive an email notifying you that the MTA has been executed. MTA execution is the final step in the process. Once you have received this email, your data will be ready for dispensal. If you have already had data dispensed to a project on the Platform, you will need to have data dispensed to a new project in order to receive any new data.
The process of dispensing data happens over a period of time. When you first select the Create Project button to submit the New Project dialog, the new project will appear empty. Subsequently, it will begin to get populated with files and other data. You can monitor the process by going back to the project list, where you can see the project status, including what percentage of the data has been populated.
The process can take as little as 20 minutes or as long as a full day, depending on the scope of the access application.
No, the process happens in the background, even if you are logged out.
Your request to dispense data may be queued behind that of other users. The system will service your request in the order it was received. We appreciate your patience during that time.
The system dispenses the data that correspond to the approved data-fields of the access application associated with the project. Tabular data-fields and linked health data are dispensed into a SQL database, and bulk data-fields are dispensed as files.
If you use the same account on both platforms, you will be able to access and use Research Analysis Platform projects on the DNAnexus Platform. Note, however, that:
You will only be able to access and use tools that are hosted in the London AWS region.
All sharing, download, and other data-use restrictions apply fully to UK Biobank data, on both platforms.
You can get more information on your access application by logging into the AMS and selecting the Applications tab on the left.
The "Bulk" folder contains files associated with data-fields of type "bulk". These are data items that are particularly large and/or complex and are therefore made available as files, such as genome sequencing files.
When you create a project on the Research Analysis Platform, the system contacts UK Biobank to get your application's EIDs, then uses them to pseudonymize your dataset. The pseudonymized EIDs are used, for example, to populate the "eid" column in the database, to name per-participant files, to generate the EID-specific content of FAM files for genotyping fields, and to adjust pVCF headers.
Yes, for a given application, data on the Research Analysis Platform contain the same EIDs as data directly downloaded from UK Biobank's website.
You can also do the same using the CLI. Type:
dx find data --property eid=1234567
Note that these methods find participant-specific files (like individual VCF or CRAM files) and not cohort-wide files (like PLINK or pVCF files).
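If you prefer to script this lookup, the dxpy Python bindings offer an equivalent search. Below is a minimal sketch, assuming dxpy is installed and you are logged in; the project ID and EID value are placeholders for illustration:

import dxpy

# Find files that carry the "eid" property for one (placeholder) participant.
results = dxpy.find_data_objects(
    classname="file",
    properties={"eid": "1234567"},   # placeholder EID
    project="project-XXXX",          # placeholder project ID
    describe=True,
)
for result in results:
    print(result["describe"]["name"])

As with the CLI command above, this returns participant-specific files rather than cohort-wide files.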
These samples correspond to participants that have withdrawn, and the Research Analysis Platform uses this convention to denote them in the header, to help you exclude them from your research.
The pseudonymized pVCF headers are specific to each access application. Researchers who work on different applications will encounter different headers for each, just as they encounter different content for the FAM files of PLINK fields.
No, the content of these files is not pseudonymized. However, the names of these files are pseudonymized accordingly. Therefore, we recommend relying on the filename prefix for determining the EID corresponding to a gVCF or CRAM file, and discarding any identifiers found in the gVCF or CRAM header.
The files in the "Showcase metadata" folder represent the showcase metadata at the time that the data was ingested in the system, and may not reflect the latest showcase updates.
For information about data provenance, please consult the UK Biobank website or contact UK Biobank directly.
All other non-bulk data-fields (for which UK Biobank defines the item type as "data", "sample", or "record") are dispensed into a SQL database and associated Research Analysis Platform dataset.
To help you comply, the Platform may restrict external downloads of certain original files, using rules specific to your application tier. These restrictions are not comprehensive, and it is your responsibility to refrain from actions that would violate the MTA even if the Platform does not technically restrict those actions.
In the projects list, under the "Members" column, select the number corresponding to the row of interest. Alternatively, from inside a project, select the Share icon on the upper right (next to the "Access:" label).
While the project is being populated with data, the system adds a service user called "UK Biobank Robot" (username: ukb.robot).
The system automatically adds this service user to a project whenever the project is being edited or updated, such as when data is being dispensed in a newly created project. The system uses that user to perform any necessary data manipulations in an automated manner.
No, but the system will automatically remove that user once any necessary data manipulations are completed.
If you are a project administrator, from inside a project select the Share icon on the upper right to launch the sharing dialog. Enter the username or email of the user you want to share the project with, select their access level, then select Add User.
You can only share a project with Research Analysis Platform users who are listed as collaborators in the project's access application on the AMS. If you receive an error, please ensure the following:
The username or email you are entering exists. You cannot share a project with someone if they have not yet signed up for an account.
You are sharing with a linked Research Analysis Platform account. You cannot share a project with an account if they have not yet logged into the Research Analysis Platform and linked their account to the AMS (or if their link needs to be refreshed).
You are sharing with someone on the same application. You cannot share a project with a linked Research Analysis Platform account unless they are listed as collaborators in the project's access application on the AMS.
No, you must share with each person individually, as the platform needs to enforce AMS permissions at the user level.
Yes. By default, Customer Support does not have access to any projects, unless you explicitly share a project with them. To do that, in the project sharing dialog enter "org-support" (without the quotes) as the username, select Viewer as the access level, and select Add User.
Yes. The system supports a special alias that you can use to share a project with UK Biobank. In the project sharing dialog enter "org-ukb_reviewers" (without the quotes) as the username, select Viewer as the access level, and select Add User. This action shares your project with a specific UK Biobank team, managed by UK Biobank themselves. The purpose of this team is to receive your research results.
Sharing is on a project basis. If you need to share a subset of data, such as the files in one folder, we recommend copying them to a new project and then sharing that project, as follows:
In the project list page, select New Project. Enter the same application id as your existing project, and deselect the option Dispense data to the project. Select Create Project. This will create a new empty project, associated with the same application as your existing project.
In your existing project, tick the items you want to share, and select Copy. Select the new project, then select Copy Selected.
Share the new project.
You may only copy data across projects associated with the same access application. If you have uploaded a file in a project associated with one application, and you need to use it in a second project associated with a different application, you must re-upload it in the second project.
To visualize a CRAM or VCF file in the IGV.js genome browser, follow these steps:
Navigate to the project containing the files you want to visualize.
Select the VISUALIZE tab.
Select the option "IGV.js Genome Browser v2.6.6 (*.bam+bai, *.cram+crai, *vcf.gz+tbi)".
Select the files you want to visualize.
If you are looking for a specific participant, enter the EID in the Search Project textbox, to quickly locate any CRAM or VCF files related to that participant.
For CRAM files, you must select both the CRAM and the associated CRAI file.
For VCF files, you must select both the VCF and the associated TBI file.
IMPORTANT: Note that IGV.js cannot visualize extremely large pVCF files, such as those provided for the 200k WES, 300k WES or 150k WGS releases. If you want to visualize variants in the 150k WGS cohort, type either "qc_metrics_graphtyper" or "qc_metrics_gatk" into the Search Project textbox and select the resulting pair of *.tab.gz and *.tab.gz.tbi files.
Select Launch Viewer.
This is a database containing tables, columns, and rows that correspond to the approved data-fields of the access application associated with the project. It is a SQL database based on Spark SQL technology, which is a more modern and scalable technology than classic relational database management systems (RDBMS).
The database contains the following tables:
For the main UK Biobank participant tables, the column naming convention is generally as follows:
p<FIELD-ID>_i<INSTANCE-ID>_a<ARRAY-ID>
However, the following additional rules apply:
If a field is not instanced, the _i<INSTANCE-ID> piece is skipped altogether.
If a field is not arrayed, the _a<ARRAY-ID> piece is skipped altogether.
If a field is arrayed due to being multi-select, the field is converted into a single column of type "embedded array", and the _a<ARRAY-ID> piece is skipped altogether.
Examples:
Age at recruitment: p21022
Date of attending assessment centre: p53_i0, p53_i1, ...
Diagnoses - ICD10 (converted into embedded array): p41270
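To illustrate how these column names are used in practice, here is a minimal Spark SQL sketch you might run in a Spark-enabled JupyterLab session. The database and table names are placeholders; substitute the ones dispensed to your project:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder database name; use the one in your project's root folder.
spark.sql("USE app12345_20210101123456")

# p21022 is not instanced, p53_i0 is instance 0 of an instanced field,
# and p41270 is an embedded array of ICD10 codes.
spark.sql("SELECT eid, p21022, p53_i0, p41270 FROM participant_0001 LIMIT 10").show(truncate=False)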
The Research Analysis Platform holds a copy of all UK Biobank data. All projects are created using this copy of UK Biobank data. As UK Biobank updates the data on their end, the copy held by the Research Analysis Platform is periodically updated to reflect these upstream updates. Whenever the Research Analysis Platform updates its copy of the data, it will be indicated by a new data release version.
The data in your project will be dispensed out of whatever copy is held by the Research Analysis Platform at the time that you create the project. Therefore, your data will correspond to the latest data release version at that time.
In a project, locate and tick the dataset that was dispensed in the root folder. Click the info icon on the upper right to open the info panel. Scroll to the bottom to reveal the "Details" section. The value of the "Description" key contains the original version, e.g.
"Description" = "Dataset: app68444_202101290057.dataset Original Version: v3.0+ae7924f"
After each data release, the data needs to be unrestricted by UK Biobank before it is available to researchers. If you begin the process of data dispensal during this time, the project will be set to the latest version, but the restricted data will still not be available.
You can re-dispense data after it has been unrestricted by UK Biobank. UK Biobank will typically send out an email after the initial data release to notify users when data has been unrestricted. The version numbers of data releases can be the same; however, the version signature will change when restricted data becomes available. For example, a user whose current data is "v3.0+ae7924f" could re-dispense data to version "v3.0+ae9999f", even though both versions start with "v3.0". The version signatures, "ae7924f" and "ae9999f", differ between these two data dispensal batches.
No. The data update process will update all the data fields that your application is eligible for.
The update process will take some time to complete. Your request to update the data may be queued behind that of other users. The system will service your request in the order it was received. We appreciate your patience during that time.
If ongoing jobs use files that need to be removed as part of the data update, these jobs may fail. We recommend starting the update process when there are no jobs running and waiting for the update process to complete before starting any new jobs.
If you have received confirmation from UK Biobank that your grant application has been approved, the next step is to create a new grant org on the Research Analysis Platform. This will enable you to receive the grant. See the next question for more information.
Log onto the Research Analysis Platform.
From the main Platform menu, select Org Admin.
From the dropdown menu, select All Orgs.
On the Organizations list page, click the New Organization button.
A New Organization form will open in a modal window. Enter the following information in the form:
Org Name: Enter a name of your choice for this organization.
Org ID: You can edit the default value, as long as you preserve the prefix "ukbgrant_yymm_". Note that this field has a character limit. A valid org ID would be, for example, "ukbgrant_yymm_FirstNameLastName".
Click Create Organization.
The Set Up Billing modal window will open, and you will be prompted to set up billing for your new organization. Do not set up billing. Instead, click Exit to close the modal window.
From the main Platform menu, select Projects, then All Projects.
In the projects list, find the row for the project in question.
Click the vertical "..." icon at the end of the row, then select Project Settings from the More Actions menu.
Locate the Billed To field, in the Billing section of the Settings screen.
Click on the downward-facing caret at the right end of the Billed To field.
Select the grant org from the list of available billing accounts.
Credits are issued quarterly at the beginning of March, June, September, and December. Your approval email from UK Biobank will specify when your credits will be issued, pending the creation of an org to receive the credits.
Yes, you must create a new grant org to receive enhanced credits.
No, funds from the grant org cannot be transferred to other accounts.
Check the billing account used by your project or projects. Be sure that the grant org is set as the "Billed To" account for each.
Field Category | Field ID | Field Title |
---|---|---|
to create a Research Analysis Platform account, access the Platform, and connect your account to the UK Biobank Access Management System.
On the Research Analysis Platform, you'll be able to work with UK Biobank data from data-fields associated with your approved access application. To access this data, .
Each account has an initial credit of £40 toward usage charges. Before depleting this credit, , in order to continue using the Platform.
for help with accessing and using the Platform.
The gives Platform users a forum to share what you've learned and done on the Platform, and learn from one another.
If you do not have an account, visit the and select Create an account. You will need to provide your full name and email, as well as a username and password that you want to use.
It looks like you already have a DNAnexus account. If your organization has been set up for SSO, follow your organization's internal procedures. Otherwise visit the Research Analysis Platform and enter your email address. You will receive an email with a password reset link, which you can use to reset the password of your account.
Otherwise, visit the and select Log In to log in with your Research Analysis Platform account.
If you have forgotten your AMS username or password, you can retrieve them via the .
Create an AMS account via the .
You must finish the AMS registration process, and be approved by UK Biobank. For more information, see the .
See detailed instructions on the page.
An access application is a research application submitted to UK Biobank by a Principal Investigator. It includes a written research proposal and a set of to which access is requested. UK Biobank assigns a unique numeric identifier to each application. All activity on the Research Analysis Platform needs to be done within the context of such an access application.
For more information on your access application, log into the (AMS) and select the Applications tab.
For new applications, please check your application in the . If the application's status is “Underway,” then your data should be ready to be dispensed to your project on the Platform.
All data in the UK Biobank resource are organized into data-fields. Your access application is approved for a precise subset of those data-fields. You can find more information about data-fields, broken down by type, on the .
See , for data within the "Bulk" folder.
See , for data within the "Bulk" folder.
UK Biobank is a resource compiled from approximately 500,000 volunteer participants. Each participant is uniquely identified by a 7-digit numeric identifier (EID), typically in the 1,000,000 - 6,000,000 range. These identifiers are scrambled for each access application, hence the EIDs will not match across applications. For more information, refer to .
You can find all files associated with a specific participant EID using the Web UI. Inside a project, select the filter icon on the right, then select the filter settings icon below and choose Properties. A new property filter will appear in the filter bar. Select Any properties and type "eid" (lowercase, without the quotes) in the left textbox that is labeled "Any Key". Type the numeric 7-digit value in the right textbox that is labeled "Any Value", and select Apply. To search across all folders, make sure the search scope is set to Entire Project instead of the default Current Folder Only.
This folder contains all the files published by UK Biobank, as described on the . These files describe aspects of the UK Biobank Showcase, including all fields available in the UK Biobank resource.
From a policy standpoint, you are responsible for complying with the Material Transfer Agreement (MTA) and with any other rules set forth by UK Biobank. As of June 2021, Annex 1 of the MTA states that "any WGS (whole genome sequence) or WES (whole exome sequence) files [..] must not be transmitted or downloaded from the research analysis platform". In addition, depending on the of your access application, you may or may not be allowed to egress certain other data.
.
You can assign each job a different priority, depending on whether you want to prioritize job execution speed or cost control. See the page .
See the for a full list of available AWS instance types, including detailed specs for each on number of cores, amount of RAM, storage memory type and size, and cost.
See for more information about databases and datasets.
For all other tables (such as hospital records, GP records, death records, or COVID-19 records), the column names are identical to what UK Biobank provides in the data showcase. For more information on the columns of these tables, consult (hospital records), (GP records), (death records), (COVID-19 GP records), or (COVID-19 test results).
See .
Your existing project is not affected, and will continue to reflect the data release version from the time that the project was created. Data updates will not happen automatically and you have the choice to decide whether or not you want to update your project data. If you choose to update your dispensed projects, the files and tabular data in your project will be updated. See the details .
To learn how to get the most recent data update, see the page .
No. If you have been approved for new fields, this change will not apply to existing projects automatically. To get access to the new data you are approved for, you will need to perform a data update. To learn how to get the most recent data update, see the page .
Previously-generated result files are not affected. Cohorts and dashboards that point to the previous dataset will be evaluated against the updated database. To migrate these to the latest dataset, run the "” app.
Make sure that you've logged into your . Once you've done so, you should be able to create a new org, following the instructions just above.
Do not upgrade your grant org to a billable account, or add your credit card to a grant org account. Grant orgs are only for receiving and using grants. If funds in the grant org account are running low, change the billing account used by any affected project, so that it uses a personal billing account linked to a credit card. You can also apply for additional credits. .
Exome sequences
Exome OQFE variant call files (VCFs)
Exome OQFE variant call file (VCFs) indices
Exome OQFE CRAM files
Exome OQFE CRAM indices
Exome sequences - Previous exome releases
Exome OQFE variant call files (VCFs) - interim 200k release
Exome OQFE variant call file (VCFs) indices - interim 200k release
Exome OQFE CRAM files - interim 200k release
Exome OQFE CRAM indices - interim 200k release
Exome sequences - Alternative exome processing
Exome variant call files (DRAGEN) (VCFs)
Exome variant call file (DRAGEN) (VCFs) indices
Whole genome sequences - GATK and GraphTyper
Whole genome GATK variant call files (VCFs) and indices [500k release]
Whole genome GATK CRAM files and indices [500k release]
BQSR - GATK BaseRecalibrator [500k release]
Concatenated QC Metrics [500k release]
Genotype Concordance [500k release]
Genotype Concordance - Contingency Metrics [500k release]
Genotype Concordance - Detail Metrics [500k release]
Genotype Concordance - Summary Metrics (Picard) [500k release]
Sample Contamination (ReadHaps) [500k release]
Sample Contamination (verifyBamID) - depthSM [500k release]
Sample Contamination (verifyBamID) - selfSM [500k release]
Whole genome sequences - Dragen WGS
Whole genome CNV call files (DRAGEN) [500k release]
Whole genome CNV supplementary files (DRAGEN) [500k release]
Whole genome CRAM files (DRAGEN) [500k release]
Whole genome CYP2D6 genotype calls (DRAGEN) [500k release]
Whole genome STR call files (DRAGEN) [500k release]
Whole genome STR supplementary files (DRAGEN) [500k release]
Whole genome SV call files (DRAGEN) [500k release]
Whole genome SV supplementary files (DRAGEN) [500k release]
Whole genome diagnostics files (DRAGEN) [500k release]
Whole genome supplementary files (DRAGEN) [500k release]
Whole genome variant call files (GVCFs) (DRAGEN) [500k release]
Whole genome variant call files (VCFs) (DRAGEN) [500k release]
Genotypes - Genotyping process and sample
CEL files
Whole genome sequences - Previous WGS releases - WGS pilot studies
BGI WGS CRAM files
Broad WGS CRAM files
Table name | Description |
| These tables contain the main UK Biobank participant data. Each participant is represented as one row, and each data-field is represented as one or more columns. For scalability reasons, the data-fields are horizontally split across multiple tables, starting from table participant_0001 (which contains the first few hundred columns for all participants), followed by participant_0002 (which contains the next few hundred columns), etc. The exact number of tables depends on how many data-fields your application is approved for. |
| Hospitalization records. This table is only included if your application is approved for data-field #41259. |
| Hospital critical care records. This table is only included if your application is approved for data-field #41290. |
| Hospital delivery records. This table is only included if your application is approved for data-field #41264. |
| Hospital diagnosis records. This table is only included if your application is approved for data-field #41234. |
| Hospital maternity records. This table is only included if your application is approved for data-field #41261. |
| Hospital operation records. This table is only included if your application is approved for data-field #41149. |
| Hospital psychiatric records. This table is only included if your application is approved for data-field #41289. |
| Death records. This table is only included if your application is approved for data-field #40023. |
| Death cause records. This table is only included if your application is approved for data-field #40023. |
| GP clinical event records. This table is only included if your application is approved for data-field #42040. |
| GP registration records. This table is only included if your application is approved for data-field #42038. |
| GP prescription records. This table is only included if your application is approved for data-field #42039. |
| GP clinical event records (COVID TPP). This table is only included if your application is approved for data-field #40101. |
| GP prescription records (COVID TPP). This table is only included if your application is approved for data-field #40102. |
| GP clinical event records (COVID EMIS). This table is only included if your application is approved for data-field #40103. |
| GP prescription records (COVID EMIS). This table is only included if your application is approved for data-field #40104. |
| COVID19 Test Result Record (England). This table is only included if your application is approved for data-field #40100. |
| COVID19 Test Result Record (Scotland). This table is only included if your application is approved for data-field #40100. |
| COVID19 Test Result Record (Wales). This table is only included if your application is approved for data-field #40100. |
| Olink NPX values for the instance 0 visit. This table is only included if your application is approved for data-field #30900. For scalability reasons, the protein columns are horizontally split across multiple tables, starting from table |
| Olink NPX values for the instance 2 visit. This table is only included if your application is approved for data-field #30900. |
| Olink NPX values for the instance 3 visit. This table is only included if your application is approved for data-field #30900. |
Learn how to create a Research Analysis Platform account, access the Platform, and connect your account to the UK Biobank Access Management System.
Before you can use the Research Analysis Platform, you need to:
Create a UK Biobank Access Management System (AMS) account, via the AMS signup page.
Ensure that you’ve received UK Biobank access approval from the UK Biobank Access Management Team (AMT). If you haven’t, log into the AMS and follow the directions to complete your registration and get AMT approval.
Ensure that you are listed as a collaborator on a UK Biobank-approved access application by that application's Principal Investigator (PI).
See the AMS user guide for more on creating and managing an AMS account.
If you already have a DNAnexus account, log into the Research Analysis Platform using your existing username and password. Then skip to Connecting Your Account to UK Biobank below.
If your organization uses SSO with DNAnexus, follow your organization's login procedures to access the Research Analysis Platform. Note that your AMS and DNAnexus accounts do not need to use the same email address.
If you need to create an account, follow the instructions in the next section.
If you don’t already have a Research Analysis Platform account, you’ll need to create one. Here’s how:
Navigate to the Research Analysis Platform. Click Create an Account.
Fill out the Create New Account form, then click Create Account. Note that in selecting a username, you don’t need to use the same username you use on the UK Biobank AMS.
You'll receive an email with a link to click, to activate your account. Click the link.
Log into the Platform.
Complete your profile, then click Access Platform.
Once you have a Research Analysis Platform account, you need to connect it to your AMS account. You need to do this only once.
The first time you access the Research Analysis Platform, you’ll see a prompt to Connect Your Account to UK Biobank.
Read the Terms of Service, then click Accept & Continue.
You'll be redirected to the AMS. Log into the AMS.
Your AMS and Research Analysis Platform accounts are now connected.
You'll be redirected once again, this time back to the Research Analysis Platform. You can now use the Platform.
You can log in without having to provide a username and password for a certain amount of time by using authentication tokens (henceforth referred to as tokens). Broadly, authentication tokens are generated by providing your username and password to the platform and specifying a time period for which the token may be used to log in.
If you provide your token to a third party, they may access the Research Analysis Platform with your token, effectively impersonating you as a user. They will have the same access level as you for any projects to which the token has access, potentially allowing them to run jobs and incur charges to your account. Please keep your token safe and secure.
In order to generate a token:
After logging into the Research Analysis Platform, access your profile by clicking on your user icon in the top right side of the screen and click Profile.
Once you are on your profile page, click on the API Tokens tab.
The New Token modal will appear on screen. Fill out the required fields and then click Generate Token.
A pop-up will then appear saying “Your token has been generated. Please copy it for later use; for security purposes, this is the only time you will see it.” with a 32-character token composed of letters and numbers on the line below. Copy this token to a secure location and save it for later.
Some examples in which tokens might be used include the following:
Writing a script: Tokens can be helpful when writing scripts that require logging in to the platform. However, if you incorporate a token into a script, the token should only be valid for as long as the script requires access to the platform; the "Expiration Date" field should be modified accordingly. Furthermore, the token should only be granted access to the projects required by the script. If your script uploads data to only one project, the "Token Scope" should reflect the limited access (see the sketch after this list).
Logging in to the command line for interactive use: If for some reason you don't wish to log in to the command line with your username and password, tokens are quite useful. This is the only scenario in which you should use a full-scope token, thereby allowing you to access all of your available projects.
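As a minimal sketch of how a script might authenticate with such a token using the dxpy Python bindings (the environment variable name is illustrative, and dxpy must be installed):

import os
import dxpy

# Read a limited-scope token from the environment instead of hard-coding it (illustrative variable name).
token = os.environ["RAP_API_TOKEN"]

# Authenticate the dxpy session with the token.
dxpy.set_security_context({"auth_token_type": "Bearer", "auth_token": token})

# Simple check that the token works: print the authenticated user ID.
print(dxpy.api.system_whoami()["id"])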
In order to use a token to log in on the command line, you must use dx login with the --token flag, for example: dx login --token <TOKEN>
You must navigate to the API tokens tab of your profile in order to revoke a token.
Once in the API tokens tab, select which token you wish to revoke and then click the Revoke button:
Once you confirm that you wish to revoke the token, the token will be revoked.
Some examples in which tokens might be revoked include the following:
Token accidentally shared too widely: If more people have access to your token than you would like, revocation of the token will cut off access to your account by unwanted parties.
Token no longer needed: For example, if the script utilizing the token is no longer in use, if the group granted access to the platform using the token no longer exists, or in any other instance in which the token is no longer necessary, you should revoke it to restrict access to your account.
See Step 3 in the Before You Begin section above. You must complete the AMS registration process, and get the approval of the AMT.
For additional information, see the Creating an Account and Registration sections of the AMS Getting Started guide.
You already have a DNAnexus account. Use your existing username and password to log in.
Occasionally the platform may ask you to refresh your account association, for security reasons. Among other reasons, this can happen if your state on the AMS changes - if you update your contact details on the AMS, for example.
If you update your AMS account information, you may experience access issues until UK Biobank reviews and re-approves your account.
One of two issues is likely the reason why you've lost access to a project.
You may have been removed from the UK Biobank-approved access application. In this case, you will not be able to regain access to projects linked to that application.
To check the access applications in which you're included, log onto the AMS.
UK Biobank may have temporarily suspended the access application to which the project is linked. If this is the case, you can be added back to the project when the suspension is lifted.
If you are the admin of the wallet to which the project is billed, follow these steps to add yourself back to the project:
Select Org Admin from the main Platform menu, then select the org wallet in question.
Open the Projects tab
Select the project to which you need to regain access
Click Grant Permission
If you are not the wallet admin, but the admin is also in the access application to which the project is linked, have the wallet admin:
If they are not already in the project in question, follow the steps above to add themselves to the project
Share the project with you
Optionally, remove themselves from the project
If you are not the wallet admin, and the wallet admin is not in the access application to which the project is linked, have the wallet admin transfer project ownership to you.
Note that:
An AMS account may be associated to exactly one Research Analysis Platform account
A Research Analysis Platform account may be associated to exactly one AMS account
The Research Analysis Platform holds a copy of all UK Biobank data that is used to create all of the platform's projects. The UK Biobank periodically releases new data and makes updates to the existing data. Whenever the Research Analysis Platform updates its copy of the data, it will be indicated by a new data release version. The UK Biobank might expand or remove certain eligible fields within users’ applications at times. To get up-to-date data in their projects, users can check if an update is available and perform a data refresh to update the dispensed data.
The data update process will synchronize the project against the latest data release. This affects both tabular data and file data.
For file data, the update process will dispense any new files, potentially rearranging previous folders if folder names have changed. It will also re-generate *.fam, *.sample, and ukb_rel.dat files, removing any instances of previous ones (even if the previous ones had been copied to other projects).
Users can also dispense a new project if they want the latest version of the data in a separate area, instead of updating a project in-place.
To check for updates, go to the "Settings" page of a dispensed project and click the "Check for Updates" button in the "UK Biobank" section.
By clicking "Check for Updates", if the dispensed data is already up-to-date, you should see the below:
If there is a data update available for the dispensed data, you should see the below:
When data updates are available, you can click the "Show Update" button to see more information about the latest data update.
NOTE: We highly recommend not launching a large number of jobs or performing clone/copy operations on a large number of objects in the project while the data update is in progress.
By clicking the "Start Update" button, you kick off the data refresh process. You can check the progress of the update process from the "status" section.
After the refresh is done, the status will return to "Ready".
Learn how to create a project and populate it with UK Biobank data.
On the UK Biobank Research Analysis Platform, all work takes place in the context of a project. Projects allow a defined set of users to:
Access specific datasets
Conduct analyses on these datasets
Once you have access to the Platform, start by creating a new project. You can follow the step-by-step instructions in the Setting Up Your Project section below.
On the Research Analysis Platform Projects screen, click the New Project button. The New Project wizard will open in a modal window.
In the Project Name field, enter a name for your project.
In the Application ID field, enter the number of the approved UK Biobank access application from which you'll draw the data to be used in this project.
Check the Dispense data to the project option to populate the project with the data specified on the linked access application.
In the Billed To field, choose a wallet to which project billable activities should be charged.
In the Access section, specify who will be able to Copy Data, Delete Data, and Download Data.
Within the Research Analysis Platform, every project must be linked to one and only one access application. A project cannot be linked to multiple access applications.
Dispensing data to your new project will take some time. Depending on the type of data being dispensed, this process can take over an hour, or even longer, in some cases. You can monitor the process by going back to the project list, where you can see the project status, including what percentage of the data has been populated.
Ensure that:
The UK Biobank access application lists you as a collaborator
You're using an application that has received UK Biobank approval
As noted above, the process of dispensing data happens over a period of time. When you first create a new project, you won’t see the data right away. You can monitor the process by checking the status of the project in the project list. The process can take as little as 20 minutes or as long as 2 hours, depending on the scope of the access application.
Your request to dispense data may be queued behind that of other users. The system will service your request in the order it was received. We appreciate your patience during that time.
Learn how UK Biobank data is organized and named on the Research Analysis Platform. Learn how to find and access bulk files and tabular data.
UK Biobank contains data collected from approximately 500,000 volunteer participants. Within an access application, each participant is identified by a unique, 7-digit number, or EID. An EID is typically a number between 1,000,000 and 6,000,000.
Note that each access application receives a different set of randomized EIDs, unique to the application. This EID randomization process - also known as "pseudonymization" - is managed by UK Biobank and is automatically applied to the data by the Research Analysis Platform. (For a given access application, data on the Research Analysis Platform contain the same EIDs as data directly downloaded from UK Biobank's website.)
When you create a project on the Research Analysis Platform, the system contacts UK Biobank to get your application's EIDs, then uses them to pseudonymize your dataset. The pseudonymized EIDs are used, for example, to populate the "eid" column in the database, to name per-participant files, to generate the EID-specific content of FAM files for genotyping fields, and to adjust pVCF headers.
When you create a project on the UK Biobank Research Analysis Platform, the system dispenses the data corresponding to the data-fields listed in the access application associated with the project.
The dispensed data correspond to a specific data release version. See Data Release Versions for more.
Within your project, the Bulk folder contains files associated with UK Biobank data-fields of type "bulk." These are particularly large and/or complex items, such as genotyping array data, genome sequencing data, imaging data, and fitness data.
The Bulk folder uses the following subfolder structure:
There is a subfolder for each UK Biobank bulk field category. For example, whole genome CRAM files are stored in the subfolder named Whole genome sequences. These categories are defined by the UK Biobank, specifically for the Research Analysis Platform.
Within each category subfolder, there is a subfolder for each bulk field (or group of related fields). For example, a subfolder named Whole genome CRAM files would contain files for that field.
Within each field folder, files related to an individual participant are grouped in subfolders named using the prefix of the participant's EID. Typically these are two-digit names, ranging between "10" and "60."
In certain cases, the system may dispense related files of different types into the same folder, to improve usability. For example, whole genome CRAM indices files (field ID #23194) would be dispensed into the same folder as whole genome CRAM files (field ID #23193), rather than into their own folder. Similarly, the folder /Bulk/Brain MRI/dMRI includes data from fields #20218 ("Multiband diffusion brain images - DICOM") and #20250 ("Multiband diffusion brain images - NIFTI").
The Research Analysis Platform uses the following naming conventions for bulk data files:
Files that contain data on an individual participant are named in this fashion:
<EID>_<FIELD-ID>_<INSTANCE-ID>_<ARRAY-ID>.<SUFFIX>
For example, whole genome CRAM files (field ID #23193) are named like so:
<EID>_23193_0_0.cram
Some exceptions apply to this rule. When a field is meant as a companion to a main field, such as a CRAI index accompanying a CRAM file, or a TBI index accompanying a VCF file, the system uses the prefix of the main field. For example, whole genome CRAM indices (field ID #23194) are named like so:
<EID>_23193_0_0.cram.crai
Files that contain data on a cohort of participants (such as PLINK, BGEN or pVCF files) are named in this fashion:
ukb<FIELD-ID>_c<CHROM>_b<BLOCK>_v<VERSION>.<SUFFIX>
Where <CHROM> represents the chromosome (such as "1", "2" or "X"), <BLOCK> represents an index (starting from "0") for datasets that have been split into multiple pieces, and <VERSION> represents a dataset version assigned by UK Biobank.
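As an illustration of these conventions, the following Python sketch parses both naming patterns; the filenames are made up for the example:

import re

# Per-participant files: <EID>_<FIELD-ID>_<INSTANCE-ID>_<ARRAY-ID>.<SUFFIX>
per_participant = re.compile(r"^(\d{7})_(\d+)_(\d+)_(\d+)\.(.+)$")
match = per_participant.match("1234567_23193_0_0.cram")   # illustrative filename
if match:
    eid, field_id, instance_id, array_id, suffix = match.groups()
    print(eid, field_id, instance_id, array_id, suffix)

# Cohort-wide files: ukb<FIELD-ID>_c<CHROM>_b<BLOCK>_v<VERSION>.<SUFFIX>
cohort_wide = re.compile(r"^ukb(\d+)_c([^_]+)_b(\d+)_v(\d+)\.(.+)$")
match = cohort_wide.match("ukb23156_c1_b0_v1.vcf.gz")      # illustrative filename
if match:
    field_id, chrom, block, version, suffix = match.groups()
    print(field_id, chrom, block, version, suffix)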
The Research Analysis Platform pseudonymizes the content of pVCF headers for the following fields:
When accessing pVCF files in these fields, the header is pseudonymized. The sample ids in the header are EIDs that correspond to the access application. If a participant has withdrawn, the corresponding sample id is marked as "W000001" (for the first encountered sample that belongs to a withdrawn participant), "W000002" (for the second encountered sample that belongs to a withdrawn participant), etc. Overall, the non-withdrawn EIDs in the pVCF header are expected to match the set of application EIDs used elsewhere, such as in the "eid" column of the phenotypic data and the FAM files of genotyping array fields. This allows you to conduct analyses that combine phenotypic, genotyping array, and whole exome pVCF (or whole genome pVCF) data without having to translate any EIDs.
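For example, once you have the list of sample IDs from a pVCF header (using whichever VCF library you prefer), a simple filter on the "W" prefix removes withdrawn participants before any EID-based joins. The sample IDs below are made up for illustration:

# Sample IDs as they might appear in a pseudonymized pVCF header (illustrative values).
header_samples = ["1234567", "2345678", "W000001", "3456789", "W000002"]

# Withdrawn participants are marked with a "W" prefix; keep only application EIDs.
active_eids = [sample for sample in header_samples if not sample.startswith("W")]
print(active_eids)   # ['1234567', '2345678', '3456789']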
The Research Analysis Platform supports file properties. These are key-value pairs of strings that are attached to files. When bulk files are dispensed to a project, the Platform adds some initial file properties, as below:
These properties are searchable both via the Web UI and CLI. Refer to the following section for an example.
Proteomics data stored in UK Biobank Showcase data-field 30900 has been transformed to enable users to visualize it using the Cohort Browser. The transformation produces the per-instance entities stored in the dispensed database. As a result of this transformation, the data can be visualized, at the individual protein level, on a per-instance basis. For example, each protein for each instance can then be added as a tile within the Cohort Browser, and used as a filter along with other data modalities.
The transformation involves the following steps:
Abbreviated protein names are substituted for protein ID codes - protein ID code "3", for example, becomes "aarsd1":
The file is then split into new files by instance, with each new file containing participant records that share the same value in the ins_index field:
Each instance file is then pivoted by creating a column for each protein, with cells filled in by the value in the result column. This value is the NPX value, for that protein, for the participant in question. For example, for the file containing "instance 0" data:
If a participant doesn’t have NPX data for a given protein - like participant 2345678, for protein "abca2," in the example shown above - the cell in question will not contain a value.
Note the column order in the last illustration, showing the fully transformed table. The eid column is first, followed by the columns for protein data, sorted in alphabetical order, by column name.
Tabular data-fields and linked health data are stored in a SQL database. This database is based on Spark SQL technology, a modern and more scalable technology than that used by classic relational database management systems (RDBMS). The database is located in the root folder of your project and is typically named according to this pattern:
app<APPLICATION-ID>_<CREATION-TIME>
(e.g. app12345_20210101123456)
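If you want to inspect the database directly from a Spark-enabled JupyterLab session, a quick sketch for listing the dispensed database and its tables (the database name below is the illustrative one from the pattern above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List the databases visible to your project, then the tables in the dispensed one.
spark.sql("SHOW DATABASES").show(truncate=False)
spark.sql("SHOW TABLES IN app12345_20210101123456").show(truncate=False)   # illustrative name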
To launch the Cohort Browser, navigate to the project's root folder and click the dataset (or tick the dataset and click Explore Data).
To explore what fields are available in your dataset, click Add Tile. The system will present all available fields, organized in a folder structure inspired by the UK Biobank Showcase. You can search this list by folder name, field name, or field value (for categorical fields).
Click a field to see more information. The Data Field Details pane contains the field title (such as Type of accommodation lived in | Instance 0), and the Link label contains the field name (such as p670_i0). These field names and titles can be used to retrieve data programmatically using JupyterLab.
Using the Cohort Browser features, including the "Export sample IDs" option or the "Download" option in the Data Preview tab, will not lead to any charges.
The Cohort Browser can be used to further explore the data, create charts, or define and compare cohorts. Refer to the following DNAnexus Platform documentation entries:
If your access application has been approved for data-field 23146, 23148 and/or 23157, the Cohort Browser will automatically include a "GENOMICS" section, where you can browse variants in your cohort. The data backing the section depends on the dataset version dispensed: 23157 for version 11 and later, 23148 for version 7 and later, 23146 for previous versions. These variants are sourced from the pVCF files of field #23146, 23148 and/or 23157, after annotating with snpEff GRCh38.92, dbSNP b154 and gnomAD r2.1.1. You can also use these variants to apply genomic filters. Refer to the following DNAnexus Platform documentation entries:
Go in-depth with DNAnexus experts, who'll show you how to get the most out of the Research Analysis Platform.
For tabular data, the update process will make in-place changes to the previously-dispensed database, including any schema changes (such as for new fields) and row updates. A new associated dataset will be generated. The previously-dispensed dataset will remain in the project, along with any previously-saved cohorts or dashboards. Although the previous dataset, cohorts, and dashboards will persist, they will be evaluated against the updated database; therefore cohort counts and field distributions may change. The new dataset will contain updated metadata as well as any new fields; to migrate old cohorts or dashboards to the new dataset to take advantage of updated metadata and new fields, please run the "" app.
Updates to existing tabular data can be done using the app.
For additional questions about data updates, see the list of .
Once data has been dispensed to your project, see the page to learn more about how this data is organized, and how to access and use it.
All data in the UK Biobank resource are organized into data-fields. Your access application is approved for a precise subset of those data-fields.
The provides an in-depth look into the types of data stored in the UK Biobank, how it's collected, and how it's organized. You can find more information about data-fields, broken down by type, on the .
Bulk data-fields are dispensed as files. See below for more.
Tabular data-fields and linked health data are placed into a Spark SQL database and an associated dataset. See below for more.
For a full list of folders, see below.
Key | Value | Which files have this property? |
---|---|---|
for in-depth guidance on searching and analyzing UK Biobank bulk data files.
For details on instance data and the meaning of each instance value, see the .
In the same folder, there is an associated dataset named after the database but with the .dataset
suffix appended. This dataset is a higher-level construct, using technology that is unique to the Research Analysis Platform. It combines the low-level SQL columns with field-level metadata from the UK Biobank Showcase, and presents a collection of rich fields that can be explored visually in the Cohort Browser, or programmatically in JupyterLab. For general information on the underlying technology, see the DNAnexus Platform documentation overview of .
When performing a data update, creating a new project for the same application, or adding data to a Dataset using , a new Dataset is created. To migrate saved cohorts to the new Datasets, use the app. Note that all Fields defined in the cohort must exist in the new Dataset.
If you are used to working with tabular data as a TSV file - a format used by UK Biobank in distributing tabular data directly via its website - see .
Apache Spark is a modern, scalable framework for parallel processing of big data, and can be used to analyze UK Biobank tabular data in JupyterLab.
Field Id | Description |
---|---|
20278 | BEAGLE Phased VCFs Whole genome sequences |
20279 | SHAPEIT Phased VCFs Whole genome sequences |
23146 | Population level exome OQFE variants, pVCF format - interim 300k release |
23148 | Population level exome OQFE variants, pVCF format - interim 450k release |
23156 | Population level exome OQFE variants, pVCF format - interim 200k release |
23157 | Population level exome OQFE variants, pVCF format - 500k release |
23195 | Whole genome GraphTyper joint call pVCF (deprecated) |
23196 | Whole genome GATK joint call pVCF |
23352 | Whole genome GraphTyper joint call pVCF |
23353 | Whole genome GraphTyper SV data |
23354 | GraphTyper WGS 500k SV |
23374 | Population level WGS variants, pVCF format - 500k release |
24068 | Exome variant call files (gnomAD) (VCFs) |
24304 | Population level WGS variants pVCF format - interim 200k release |
24310 | DRAGEN population level WGS variants, pVCF format [500k release] |
eid | The corresponding participant EID | Files that correspond to a single participant. |
field_id | The corresponding data-field id. | All files. |
instance_id | The corresponding instance id (typically a visit to an assessment centre). | Files that correspond to data-fields with instances. |
array_id | The corresponding array index. | Files that correspond to arrayed data-fields. |
resource_id | The corresponding UK Biobank resource id. | Auxiliary files to a resource on the UK Biobank Showcase. |
This document provides a general guide for troubleshooting when you encounter an error running an application (app) from the Tool Library.
If you are having issues with a specific tool, try navigating to the “Troubleshooting” section on the specific tool page. For example, here is the troubleshooting section for JupyterLab.
When running an app on the UKB-RAP you can check the status of this job by navigating to the Monitor tab. If the job you ran failed, you’ll see an entry in the table indicating that the Status is Failed.
Alternatively, if you set your account preferences to receive email notifications*, you will be sent a notification email when the job completes - whether it finished successfully or failed. If the job failed, you will get an email that looks like the following:
* To set your email notifications:
Select My profile from the drop-down menu on the top right
In the Email section, set Job Email Notification to “Always” if you want to receive notifications for all jobs run, or to “Only on Failure” if you only want to receive a notification email when a job fails.
If your job failed, you will want to understand what went wrong. A good place to start is the error messages in the log file, which can show where the app stopped working.
Navigate to the Monitor Tab
Click on the job that failed
Open log file by clicking the View Log button
If there is a job tree, click View Failure Source, which will bring you to the subjob that returned the error.
Then click on its log to view the log file.
The log file is a record of what the app did behind the scenes. The file includes information about the system the app is running on:
There is also information about the steps the app has performed. Below, the log file describes the files that were downloaded onto the worker, as well as the steps performed, such as filtering.
The error message can usually be found at the end of the log file. Here is an example of what an error message looks like:
The error can be found in the last line of the error message. In this case, an input was not provided for the parameter Sample ID file.
Alternatively, the error can be found right before the error message, as seen below. In this case, there is low variance, likely due to not having enough input samples.
Here are some common error messages and tips on what to check:
To learn more about other error types, please visit this article.
After understanding the problem, you can re-launch the same job, making any needed adjustments. To re-launch a job, use the Launch as new job button, which allows you to update the required inputs and re-run. If optional parameters need to be updated, you will need to submit a new job from the app's original execution page.
A helpful resource when troubleshooting is the tool documentation, which can be found on the info page after selecting your tool of choice.
Alternatively, you can view the documentation on the tool’s execution page by selecting the icon in the top right corner.
It's also good practice to document the steps you took, for your own future reference in case you run into the same issue again; this can also help when reporting the issue to DNAnexus support. You can use the dx describe command to get metadata associated with the job, including the instance type used, the runtime, and the input parameters.
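For example, assuming a failed job with ID job-xxxx (shown on the Monitor tab and in the job's URL):

```
# Print the job's metadata, including instance type, runtime, and input parameters
dx describe job-xxxx
```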
If you continue to have issues and are unsure how to proceed, you can email ukbiobank-support@dnanexus.com if you have paid support. Otherwise, you can post on the community forum at community.ukbiobank.ac.uk to get advice and help from peers working on the UKB-RAP.
Custom apps and applets are useful for when you have a custom analysis script or pipeline that you plan to run repeatedly.
If you are just getting started, here is a quick way to submit a custom script to analyze UKB data.
Documentation on how to build an applet can be found here.
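As a rough sketch of that flow from the CLI (the applet name my_analysis and the input name input_file are illustrative, and depend on how you answer the dx-app-wizard prompts):

```
# Scaffold an applet directory interactively, then build and run it in the current project
dx-app-wizard                                         # answer the prompts; creates ./my_analysis/
dx build my_analysis                                  # build the applet from the generated directory
dx run my_analysis -iinput_file=file-xxxx --watch     # run it and stream the job log
```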
For general tips on troubleshooting, see guide.
Learn how to use RStudio Workbench, an interactive R development environment, on the DNAnexus Platform.
The RStudio app runs the commercial edition of RStudio Workbench, an interactive R development environment, on the DNAnexus Platform. Users of this app can analyze, visualize and gain insights into data, and interactively run commands in a cloud based terminal.
From the "Tools" tab in the upper menu, click on “RStudio”. This RStudio Sessions page shows the previously launched RStudio Workbench sessions, allows you to stop a running session, relaunch an ended session or to start a new session by clicking on the "New RStudio" button in the upper right.
When you click on the "New RStudio" button, the RStudio Workbench setup modal will appear. Fill out the optional "Environment Name" field and override the instance type if you need a smaller or larger instance. It is strongly recommended to use the default "High" priority to avoid losing your interactive work due to spot instance interruptions. Then click “Start Environment”.
You will then see your new session appear on the RStudio Session page. This entry’s status will be set to "Initializing".
Once the status of the launched RStudio Workbench session changes to "Ready", click on the Session name or "Open" link next to the status to open the RStudio environment.
You can stop a session from the Sessions page by hovering over the icon of the three dots on the right side of the session row. This icon also allows you to launch a new session with the same settings as a previously ended session (Figure 1).
Clicking on the "i" next to the "New Rstudio" button will display more info for each session (Figure 2).
Inside the working environment, the Terminal tab allows you to download DNAnexus project files to the RStudio environment using dx download, and upload RStudio files to the DNAnexus project with dx upload.
You can also execute commands with root privileges by prefixing them with sudo. For example, to install the wget package, use the following commands in the RStudio Terminal window:
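A minimal sketch of such an installation, assuming the worker's Ubuntu base image and the apt package manager:

```
sudo apt-get update
sudo apt-get install -y wget
```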
Any changes you make to the RStudio environment (e.g. adding files, installing packages, building projects) are limited to the DNAnexus worker on which the current DNAnexus job is running, and thus will not be saved when the job (and hence the worker) is terminated. Always save scripts and any data you want to keep by uploading them to the Platform. See the “Working with Data” section for additional details.
To make data from a DNAnexus project available for processing in RStudio, you need to download the data into the RStudio worker execution environment. In the Terminal tab, run:
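For example (FILE is a placeholder for the file name or ID, as described below):

```
# Download a project file into the current working directory of the RStudio worker
dx download FILE
```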
where FILE is the name or ID of a file in a DNAnexus project. The file will be downloaded to your current working directory. You may download multiple files and whole folders at once; for more information, please check dx download -h. To see a listing of the project files, use dx ls.
If your input files are large and you only need to scan their content once, or to read a small fraction of a file's content, you may consider reading files from the /mnt/project folder. The project in which the app is running is mounted read-only at /mnt/project. Reading the content of files under /mnt/project dynamically fetches the content from the DNAnexus Platform, so this method uses minimal disk space in the RStudio execution environment, but uses more API calls to fetch the content.
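For example, to peek at a file through the mount without downloading it (the path below is purely illustrative):

```
# Read the first few lines of a project file via the read-only project mount
head -n 5 "/mnt/project/my_folder/my_table.csv"
```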
The app runs in a temporary worker execution environment, and any outputs generated in an RStudio session will not persist when the job running the app stops. If you would like to save individual result files to a DNAnexus project, you can upload them from the RStudio Terminal using the dx upload command, for example dx upload FILE, where FILE is the file to be uploaded to the current project. You may upload multiple files and whole folders at once; for more information, please check dx upload -h. The app has VIEW access to all the projects accessible to the launching user and CONTRIBUTE access to the project in which the app is running, which makes it possible to upload files from the RStudio session to the DNAnexus project.
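For instance (the file and folder names are illustrative):

```
# Upload a single result file, then a whole folder (recursively), to the current project
dx upload results.csv
dx upload -r results/
```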
To back up a folder to your project, use the dx-backup-folder command.
For example, to back up the current folder to /.Backups/rstudio_workbench_ukbrap.testuser.2022-03-21T16-32-59.tar.gz, use the following command:
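A minimal sketch, assuming the default behaviour of backing up the current folder to the default destination described below (see dx-backup-folder -h for the available options):

```
# Back up the current working folder to the default .Backups/ location in the project
dx-backup-folder
```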
The name of the backup file in the current DNAnexus project defaults to .Backups/<rstudio_workbench_ukbrap>.<dnanexus-username>.tar.gz
To back up the workspace folder to the /.Backups/workspace1.tar.gz platform file, use:
To back up the current folder, excluding the R subfolder and any .RData files in any subfolder, to a platform file at /small_backup/rstudio_workbench_ukbrap.testuser.2022-03-14T20-47-03.tar.gz, use:
The optional --exclude and --exclude-from arguments work the same way as in the GNU tar command, to exclude the specified files and directories from the backup.
To restore the content of a previously created backup file into the current folder, without overwriting any local files:
You can also overwrite local files with restored files by specifying the optional --overwrite command line flag, and specify a local folder to restore the backup into using the --output argument. For more information, please check dx-restore-folder -h.
To terminate the session, select the “Terminate” button on the upper right side of the RStudio Environment header.
Your RStudio sessions can be viewed by selecting the “RStudio” option from the global Tools menu. You can also end a session from this page by clicking the three vertical dots next to the “Open” option of a session that is currently running and selecting “End environment”.
Note that closing the browser does not stop the app. A running app will continue accruing compute charges until it is terminated. As long as the job is running, you can return to the RStudio environment by loading the job-xxxx.dnanexus.cloud URL mentioned above.
Example error message | What to check |
---|---|
You can export selected phenotypic fields for your UKB study into a TSV or CSV file using the Table Exporter app, as described . You can then use dx download to fetch the CSV file into the RStudio worker execution environment and read it into R with the read.csv() command.
Error: Failed to open <filename> : No such file or directory
Check that you provided the correct file path as input - there are no typos in your file path or file name
Check if you need to use a mounted (/mnt/project/) file path
Out of memory: Killed process 51323
You need to update your instance type selection to make sure you have enough memory (see the “Memory (GiB)” column in the rate card).
Could not find input
Could not find index of <file name>
Make sure that your input file is specified in the inputSpec section of the dxapp.json file.
Also make sure that you add these lines to the echo and dx download sections in code.sh.
Timeout exceeded
job timeout 48.000h has been exceeded
If the analysis/process needs a larger number of cores or requires a longer processing time, there are a couple of options:
At the time of execution, you can extend the timeout policy or specify the instance type (see example for extending the timeout for the REGENIE app):
dx run app-regenie --instance-type mem2_ssd1_x2 --extra-args '{
"timeoutPolicyByExecutable": {
"app-MAIN_REGENIE_APP_ID": {
"*": { "hours": 12 }
},
"app-STEP1_APP_ID": {
"*": {"hours": 200}
},
"app-STEP2_APP_ID": {
"*": {"hours": 200 }}}}'
Alternatively, you can set the timeout policy and instance type in the dxapp.json file when you build the applet:
"timeoutPolicy": {
"main": {
"hours": 72 },
…
"systemRequirements": {"*": {"instanceType": "mem2_ssd1_x2"}}
More about dxapp.json parameters can be found here.
Learn about the particularities of the Research Analysis Platform, and where to find detailed guidance on using its features.
The UK Biobank Research Analysis Platform is built on the DNAnexus Platform, and using the two is largely the same experience. See the DNAnexus Platform documentation for detailed information on Platform features and how to use them. Of particular note:
Key Concepts - Learn about projects, organizations, apps, and workflows, and how to create and use each.
User Interface Quickstart - Learn to access and use Platform features via the web user interface (UI).
Command Line Quickstart - Learn to access and use Platform features via the command-line interface (CLI), using the dx command-line client, available for download as part of the DNAnexus SDK.
Cohort Browser - Learn to explore related genomic and phenotypic datasets, and easily create subject cohorts for analysis.
Introduction to Building Apps - Learn to build and deploy custom analysis applications.
Introduction to Building Workflows - Learn to build and deploy custom workflows for processing and analyzing data.
Uploading and Sharing Files - Learn to upload files to use in analyses, via either the dx client or, for multiple or large files, the Upload Agent. See also this step-by-step walkthrough on uploading data to a project, then sharing the project with other project members. Note that to access a project on the Research Analysis Platform, a user needs to be listed on the linked access application.
Running Apps and Workflows - Learn to configure and run apps and workflows, monitor progress, and view results, both from the UI and the CLI.
Platform API - Learn to access Platform features and capabilities programmatically.
JupyterLab - Learn to use Jupyter notebooks to create, hone, and run sophisticated custom analyses, in your preferred programming language.
HAIL with JupyterLab - Learn some of the capabilities of HAIL with these example notebooks and basic guidelines.
Stata - Learn how to access and use Stata, a popular statistical software package for data science.
Integrated Genomics Viewer - Learn to use the Integrated Genomics Viewer (IGV) with genomic data files stored on the Platform.
The first time you access the Research Analysis Platform, the system will create a "wallet" (i.e. a "personal billing account") for you, funded in the amount of £40, courtesy of DNAnexus. Note that this wallet is separate from, and unrelated to, the "Trial Subscription" mentioned in section 2.8 of the DNAnexus Terms of Service.
This section describes functionality and limitations specific to the Research Analysis Platform.
Note that as a rule, Research Analysis Platform access and sharing restrictions are tied to UK Biobank access applications. All Research Analysis Platform projects are linked to a UK Biobank-approved access application. To access a project, a user must be named as a Principal Investigator or collaborator on the linked access application.
For more detail on UK Biobank access applications, see Creating a Project.
On the Research Analysis Platform, projects can only be created via the UI. A project cannot be created via the CLI.
You can collaborate on the DNAnexus Platform by giving project access to other users. For details, refer to the Platform documentation sections covering Project Sharing and Project Access.
On the Research Analysis Platform, you can only share a project with another user if they are named on the linked UK Biobank access application and have Research Analysis Platform access.
On the DNAnexus Platform, you can easily copy files from one project to another.
On the Research Analysis Platform, you can only copy files from one project to another if both projects are linked to the same UK Biobank access application.
To enable shared billing for a group of users, follow these instructions.
When accessing the Research Analysis Platform, note the following default settings:
For logging in via the command line (with 'dx login'): auth.dnanexus.com:443
For interacting with the Platform API to perform uploads: api.dnanexus.com:443
For the actual data transfer that the client will perform to S3: dnanexus-eu-west-2-platform-upload-ukb-prod.s3-eu-west-2.amazonaws.com:443
For connecting to the Platform Thrift server, follow these instructions, substituting this JDBC URL: jdbc:hive2://query.eu-west-2.apollo.dnanexus.com:10000/\;ssl=true
On the DNAnexus Platform, you can easily transfer project ownership to another user.
On the Research Analysis Platform, project ownership can only be transferred to a user who is listed on the UK Biobank access application linked to the project.
Participants are free to withdraw from UK Biobank at any time and request that their data no longer be used.
If a participant’s withdrawal affects your dispensed data, you will receive an email from UK Biobank containing the anonymized IDs of these participants and any others who have withdrawn previously. It is possible that this list will contain IDs which you have never received as they may have withdrawn before your datasets were generated.
For any data, files, or objects generated on the Platform through usage, you are responsible for removing the records corresponding to all withdrawn participants from further analyses. Any files you download or manipulate (e.g. a results file from a GWAS, or a CSV file created using Table Exporter) will not be changed by the withdrawal of participant data.
The Research Analysis Platform is hosted in the AWS "aws:eu-west-2" or "Europe (London)" region.
Issue with input | | Check that the input file path you specified is correct |
Error fitting null model (Step 1) | | In SAIGE GWAS GRM app, set |
SPAGMMAT test error (Step 2) | | Specify "--chr" under the input "Extra Options". |
Get full details on data included in each UK Biobank Research Analysis Platform data release.
The Research Analysis Platform holds a copy of all UK Biobank data. All projects are created using this copy.
As UK Biobank updates the data on its end, the copy held by the Research Analysis Platform is periodically updated to reflect these upstream updates. Whenever this happens, this change will be indicated by a new data release version.
Each Showcase release includes both newly released data, and all Showcase data included in previous releases. So the June 2021 Showcase release, for example, includes all data released in all previous Showcase releases, as detailed on the UK Biobank website.
The following table lists all the bulk fields, along with their folders and suffixes, as of the releases (v17.1, v18.1).
This list follows the same order in which the subfolders appear in projects on the Platform.
These short tutorial videos provide an in-depth guide to accessing and using the Research Analysis Platform via the command-line interface (CLI).
In many cases the Platform allows you to use apps and workflows you’ve developed in another environment. This page provides a rundown on how to import apps or workflows of a wide variety of types.
To facilitate reproducibility in working with data, the DNAnexus Platform supports the use of WDL and Nextflow workflows.
Nextflow workflows can be run from the command line in a Cloud Workstation session, or in a Jupyter notebook.
See the following .
Field ID | Data Type | Subfolder | Suffixes |
---|---|---|---|
For information on how to use HAIL with Jupyterlab, see example notebooks .
For general tips on troubleshooting, see .
Issue | Example error message | What to do |
---|
If you haven’t already, download and install the dx-toolkit, and familiarize yourself with basic dx commands. When working from the CLI, you’ll use the dx-toolkit to perform various tasks in the course of importing, configuring, and running apps and workflows.
Your apps and workflows must have access to any files they need as inputs. If any input file is not already in the project in which you intend to use a particular app or workflow, upload it to that project.
The Platform supports executing Docker images. Using a Docker image to package and run your app ensures your app can access, and you have full control over, dependencies, configuration settings, and all needed files.
If your app can run on 64-bit Ubuntu Linux, you can package it to run on the Platform. Follow the guidance in the DNAnexus Platform documentation to import, configure, test, and run your app. You’ll use the dx-app-wizard to set up the proper directory structure to run your app on the Platform.
The “” section of the DNAnexus Platform documentation provides further guidance on how to configure your app, and how to upload and ensure it has access to dependencies and input files.
You can run your R/Shiny app on the Platform by wrapping it as a special DNAnexus web app, then accessing it using a web browser.
The Stata app itself does not run on the Platform. If you have a Stata license, you can use stata_kernel to access Stata commands and functionality within a Jupyter notebook running on the Platform.
To SSH into your Platform project and then run a script on the command line, use the Cloud Workstation app.
There are third-party tools available that can compile a C++ app to run on 64-bit Ubuntu Linux. Once you’ve compiled it, you can wrap the newly created Linux app as a Platform app or applet, following the guidance above.
Converting scripts to WDL workflows ensures reproducibility and allows your workflows to be easily published and shared. Use dxCompiler to compile WDL workflows so that they can be run on the Platform.
Following the approach described above, you can configure and package your Bash script, along with its dependencies and inputs, then run it on the Platform.
Bash scripts can also be run on the Platform using the Cloud Workstation app, which provides command-line access via SSH, or in a JupyterLab session. To ensure reproducibility of results and make your workflow more easily shareable, consider packaging your script as an app or applet before uploading it to the Platform.
Field ID
Data Type
Subfolder
Suffixes
Acceleration intensity time-series
/Bulk/Activity/Epoch/
.csv
Acceleration data - cwa format
/Bulk/Activity/Raw/
.cwa
Arterial spin labelling brain images - DICOM
/Bulk/Brain MRI/ASL/
.zip
Multiband diffusion brain images - DICOM
/Bulk/Brain MRI/dMRI/
.zip
Multiband diffusion brain images - NIFTI
/Bulk/Brain MRI/dMRI/
.zip
Functional brain images - resting - DICOM
/Bulk/Brain MRI/rfMRI/
.zip
Functional brain images - resting - NIFTI
/Bulk/Brain MRI/rfMRI/
.zip
rfMRI full correlation matrix, dimension 25
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI full correlation matrix, dimension 100
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI partial correlation matrix, dimension 25
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI partial correlation matrix, dimension 100
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI component amplitudes, dimension 25
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI component amplitudes, dimension 100
/Bulk/Brain MRI/rfMRI/
.txt
Scout images for brain scans - DICOM
/Bulk/Brain MRI/Scout/
.zip
Phoenix - DICOM
/Bulk/Brain MRI/Scout/
.zip
Susceptibility weighted brain images - DICOM
/Bulk/Brain MRI/SWI/
.zip
Susceptibility weighted brain images - NIFTI
/Bulk/Brain MRI/SWI/
.zip
T1 structural brain images - DICOM
/Bulk/Brain MRI/T1/
.zip
T1 structural brain images - NIFTI
/Bulk/Brain MRI/T1/
.zip
T1 surface model files and additional structural segmentations
/Bulk/Brain MRI/T1/
.zip
T2 FLAIR structural brain images - DICOM
/Bulk/Brain MRI/T2 FLAIR/
.zip
T2/PD brain images - DICOM
/Bulk/Brain MRI/T2 FLAIR/
.zip
T2 FLAIR structural brain images - NIFTI
/Bulk/Brain MRI/T2 FLAIR/
.zip
Functional brain images - task - DICOM
/Bulk/Brain MRI/tfMRI/
.zip
Functional brain images - task - NIFTI
/Bulk/Brain MRI/tfMRI/
.zip
Eprime advisor file
/Bulk/Brain MRI/tfMRI/
.adv
Eprime txt file
/Bulk/Brain MRI/tfMRI/
.txt
Eprime ed2 file
/Bulk/Brain MRI/tfMRI/
.ed2
Cardiac monitoring phase 1 - Acceleration
/Bulk/Cardiac monitoring phase 1/Acceleration/
.zacl
Cardiac monitoring phase 1 - ECG trace
/Bulk/Cardiac monitoring phase 1/ECG trace/
.zip
Cardiac monitoring phase 1 - Impedance
/Bulk/Cardiac monitoring phase 1/Impedance/
.zimp
Cardiac monitoring phase 2 - Episodic data for specific arrhythmia
/Bulk/Cardiac monitoring phase 2/Episodic data for specific arrhythmia (result file xml)/
.xml
Cardiac monitoring phase 2 - Hourly summary and QT
/Bulk/Cardiac monitoring phase 2/Hourly summary and QT (result file xml)/
.report
Cardiac monitoring phase 2 - Human-readable analysis report
/Bulk/Cardiac monitoring phase 2/Human-readable analysis report (result file pdf)/
Cardiac monitoring phase 2 - Raw ECG data from monitor
/Bulk/Cardiac monitoring phase 2/Raw ECG data from monitor (result file edf)/
.edf
Cardiac monitoring phase 2 - Summary analysis data
/Bulk/Cardiac monitoring phase 2/Summary analysis data (result file csv)/
.csv
Carotid artery ultrasound image (left)
/Bulk/Carotid Ultrasound/Carotid artery (left)/
.zip
Carotid artery ultrasound image (right)
/Bulk/Carotid Ultrasound/Carotid artery (right)/
.zip
Raw carotid device data
/Bulk/Carotid Ultrasound/Raw data/
.zip
Carotid artery ultrasound report
/Bulk/Carotid Ultrasound/Report/
.zip
Whole genome CRAM files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CRAM files (DRAGEN) [200k]/
dragen.cram
Whole genome CRAM indices (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CRAM indices (DRAGEN) [200k]/
dragen.cram.crai
Whole genome supplementary files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome supplementary files (DRAGEN) [200k]/
dragen.sample-level-supplementary.zip
Whole genome variant call files (GVCFs) (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome variant call files (GVCFs) (DRAGEN) [200k]/
dragen.hard-filtered.gvcf.gz
Whole genome variant call files (GVCFs) (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome variant call files (GVCFs) (DRAGEN) [200k]/
dragen.hard-filtered.gvcf.gz.tbi
Whole genome variant call files (GVCFs) (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome variant call files (GVCFs) (DRAGEN) [200k]/
dragen.hard-filtered.vcf.gz
Whole genome variant call files (VCFs) (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [200k]/
dragen.hard-filtered.vcf.gz.tbi
Whole genome diagnostics files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome diagnostics files (DRAGEN) [200k]/
dragen.diagnostics.zip
Whole genome CNV call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CNV call files (DRAGEN) [200k]/
dragen.cnv.vcf.gz
Whole genome CNV call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CNV call files (DRAGEN) [200k]/
dragen.cnv.vcf.gz.tbi
Whole genome CNV supplementary files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CNV supplementary files (DRAGEN) [200k]/
dragen.cnv-supplementary.zip
Whole genome SV call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome SV call files (DRAGEN) [200k]/
dragen.sv.vcf.gz
Whole genome SV call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome SV call files (DRAGEN) [200k]/
dragen.sv.vcf.gz.tbi
Whole genome SV supplementary files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome SV supplementary files (DRAGEN) [200k]/
dragen.sv-supplementary.zip
Whole genome STR call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome STR call files (DRAGEN) [200k]/
dragen.repeats.vcf.gz
Whole genome STR call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome STR call files (DRAGEN) [200k]/
dragen.repeats.vcf.gz.tbi
Whole genome STR supplementary files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome STR supplementary files (DRAGEN) [200k]/
dragen.str-supplementary.zip
Whole genome CYP2D6 genotype calls (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CYP2D6 genotype calls (DRAGEN) [200k]/
dragen.cyp2d6.tsv
Fitness test results, including ECG data
/Bulk/Electrocardiogram/Fitness/
.xml
ECG datasets
/Bulk/Electrocardiogram/Resting/
.xml
Exome OQFE CRAM files
/Bulk/Exome sequences/Exome OQFE CRAM files/
.cram
Exome OQFE CRAM indices
/Bulk/Exome sequences/Exome OQFE CRAM files/
.cram.crai
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (DRAGEN) (VCFs)/
.vcf.gz
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (DRAGEN) (VCFs)/
vcf.gz.tbi
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (gnomAD) (VCFs)/
.ukb24068_c10_b0_v1.vcf.gz, ukb24068_c10_b1_v1.vcf.gz...
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (gnomAD) (VCFs)/helper_files/
.Broad_455k_exome_gnomAD_QC_summary.md
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (gnomAD) (VCFs)/
.ukb24068_c10_b0_v1.vcf.gz.tbi,ukb24068_c10_b1_v1.vcf.gz.tbi
Exome OQFE CRAM files
/Bulk/Exome sequences_Previous exome releases/Exome OQFE CRAM files - interim 200k release/
.cram
Exome OQFE CRAM indices
/Bulk/Exome sequences_Previous exome releases/Exome OQFE CRAM files - interim 200k release/
.cram.crai
Exome OQFE variant call files (VCFs)
/Bulk/Exome sequences/Exome OQFE variant call files (VCFs)/
.g.vcf.gz
Exome OQFE variant call file (VCF) indices
/Bulk/Exome sequences/Exome OQFE variant call files (VCFs)/
.g.vcf.gz.tbi
Exome OQFE variant call files (VCFs)
/Bulk/Exome sequences_Previous exome releases/Exome OQFE variant call files (VCFs) - interim 200k release/
.g.vcf.gz
Exome OQFE variant call file (VCF) indices
/Bulk/Exome sequences_Previous exome releases/Exome OQFE variant call files (VCFs) - interim 200k release/
.g.vcf.gz.tbi
Population level exome OQFE variants, BGEN format - interim 300k release
/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, BGEN format - interim 300k release/
.bgen, .bgi, .sample
Population level exome OQFE variants, BGEN format - interim 450k release
/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - interim 450k release/
.bgen,
.bgi,
.sample
Population level exome OQFE variants, BGEN format - 500k release
/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - 500k release/
ukb23159_c1_b0_v1.bgen, ukb23159_c1_b0_v1.bgen.bgi, ukb23159_c1_b0_v1.sample
Population level exome OQFE variants, PLINK format
/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, PLINK format - interim 200k release/
.bed, .bim, .fam
Population level exome OQFE variants, PLINK format - interim 300k release
/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, PLINK format - interim 300k release/
.bed, .bim, .fam, .masks, .txt, .txt.gz
Population level exome OQFE variants, PLINK format - interim 450k release
/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - interim 450k release/
.bed, .bim, .fam, .masks, .txt, .txt.gz
Population level exome OQFE variants, PLINK format - 500k release
/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - 500k release/
ukb23158_c1_b0_v1.bed, ukb23158_c1_b0_v1.bim, ukb23158_c1_b0_v1.fam, ...
Population level exome OQFE variants, PLINK format - 500k release helper files
/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - 500k release/helper_files/
ukb23158_500k_OQFE.sets.txt.gz, ukb23158_500k_OQFE.masks, ukb23158_500k_OQFE.annotations.txt.gz, ukb23158_500k_OQFE.90pct10dp_qc_variants.txt, ukb23158_500k_OQFE.variant_ID_mappings.txt
Population level exome OQFE variants, pVCF format - 500k release
/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - 500k release/
ukb23158_500k_OQFE.sets.txt.gz, ukb23158_500k_OQFE.masks, ukb23158_500k_OQFE.annotations.txt.gz, ukb23158_500k_OQFE.90pct10dp_qc_variants.txt, ukb23158_500k_OQFE.variant_ID_mappings.txt
Population level exome OQFE variants, pVCF format
/Bulk/Previous exome releases/Population level exome OQFE variants, pVCF format - interim 200k release/
.vcf.gz, .vcf.gz.tbi
Population level exome OQFE variants, pVCF format - interim 300k release
/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, pVCF format - interim 300k release/
.vcf.gz, .vcf.gz.tbi
Population level exome OQFE variants, pVCF format - interim 450k release
/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - interim 450k release/
.vcf.gz, .vcf.gz.tbi
CEL files
/Bulk/Genotype Results/Genotype CEL files/
.cel
Genotype calls
/Bulk/Genotype Results/Genotype calls/
.bed, .bim, .dat, .fam, .txt
Genotype calls
/Bulk/Genotype Results/Genotype calls/posteriors/
.batch, .bim, .bin
Genotype confidences
/Bulk/Genotype Results/Genotype confidences/
.txt
Genotype copy number variants, log2ratios
/Bulk/Genotype Results/Genotype copy number variants, log2ratios
.txt
Genotype copy number variants B-allele frequencies
/Bulk/Genotype Results/Genotype copy number variants B-allele frequencies/
.txt
Genotype intensities
/Bulk/Genotype Results/Genotype intensities/
.bin
Aortic distensibilty images - DICOM
/Bulk/Heart MRI/Aortic distensibility/
.zip
Blood flow images - DICOM
/Bulk/Heart MRI/Blood flow/
.zip
Cine tagging images - DICOM
/Bulk/Heart MRI/CINE tagging/
.zip
Left ventricular outflow tract images - DICOM
/Bulk/Heart MRI/Left ventricular outflow tract/
.zip
Long axis heart images - DICOM
/Bulk/Heart MRI/Long axis/
.zip
Scout images for heart MRI - DICOM
/Bulk/Heart MRI/Scout/
.zip
Experimental shMOLLI sequence images - DICOM
/Bulk/Heart MRI/ShMOLLI/
.zip
Short axis heart images - DICOM
/Bulk/Heart MRI/Short axis/
.zip
Haplotypes (WTCHG)
/Bulk/Imputation/Haplotypes/
.bgen, .bgi
21008
Imputation from genotype (GEL)
/Bulk/Imputation/Imputation from genotype (GEL)/
ukb21008_c1_b0_v1.bgen, ukb21008_c1_b0_v1.bgen.bgi, ukb21008_c1_b0_v1.sample, ...
21008
Imputation from genotype (GEL) helper files
/Bulk/Imputation/Imputation from genotype (GEL)/helper_files/
caution_sites.tsv.gz, ukb21008_c1_b0_v1.bgen.pos, ...
21007
Imputation from genotype (TOPmed)
/Bulk/Imputation/Imputation from genotype (TOPmed)/
ukb21007_c1_b0_v1.bgen, ukb21007_c1_b0_v1.bgen.bgi, ukb21007_c1_b0_v1.sample, ...
21007
Imputation from genotype (TOPmed) helper files
/Bulk/Imputation/Imputation from genotype (TOPmed)/helper_files
ukb21007_c1_b0_v1.sites.vcf.gz, ukb21007_c1_b0_v1.sites.vcf.gz.csi, ...
Imputation from genotype (WTCHG)
/Bulk/Imputation/UKB imputation from genotype/
.bgen, .bgi, .sample, .txt
Kidney Imaging - gradient echo - DICOM
/Bulk/Kidney MRI/Gradient echo/
.zip
Kidney Imaging - T1 ShMOLLI - DICOM
/Bulk/Kidney MRI/ShMOLLI/
.zip
Kidney Imaging - T2 haste - DICOM
/Bulk/Kidney MRI/T2 HASTE/
.zip
Kidney imaging - T2 Vibe - DICOM
/Bulk/Kidney MRI/T2 VIBE/
.zip
Liver images - gradient echo - DICOM
/Bulk/Liver MRI/Gradient echo/
.zip
Liver images - IDEAL protocol - DICOM
/Bulk/Liver MRI/IDEAL/
.zip
Liver Imaging - T1 ShMoLLI - DICOM
/Bulk/Liver MRI/ShMOLLI/
.zip
Pancreas Images - gradient echo - DICOM
/Bulk/Pancreas MRI/Gradient echo/
.zip
Pancreatic fat - DICOM
/Bulk/Pancreas MRI/Pancreatic fat/
.zip
Measurements of pancreas volume - DICOM
/Bulk/Pancreas MRI/Pancreatic volume/
.zip
Pancreas Images - ShMoLLI - DICOM
/Bulk/Pancreas MRI/ShMOLLI/
.zip
Protein biomarkers - Olink helper files
/Bulk/Protein biomarkers/Olink/helper_files/
.pdf, .dat
FDA data file (left)
/Bulk/Retinal Optical Coherence Tomography/FDA (left)/
.fda
FDA data file (right)
/Bulk/Retinal Optical Coherence Tomography/FDA (right)/
.fda
FDS data file (left)
/Bulk/Retinal Optical Coherence Tomography/FDS (left)/
.fds
FDS data file (right)
/Bulk/Retinal Optical Coherence Tomography/FDS (right)/
.fds
Fundus retinal eye image (left)
/Bulk/Retinal Optical Coherence Tomography/Fundus (left)/
.png
Fundus retinal eye image (right)
/Bulk/Retinal Optical Coherence Tomography/Fundus (right)/
.png
OCT image slices (left)
/Bulk/Retinal Optical Coherence Tomography/Slices (left)/
.zip
OCT image slices (right)
/Bulk/Retinal Optical Coherence Tomography/Slices (right)/
.zip
DXA images
/Bulk/Whole Body DXA/DXA/
.zip
Dixon technique for internal fat - DICOM
/Bulk/Whole Body MRI/Dixon/
.zip
BGI WGS CRAM files
/Bulk/Whole genome sequences/BGI WGS CRAM files/
.cram
BGI WGS CRAM indices
/Bulk/Whole genome sequences/BGI WGS CRAM files/
.cram.crai
BQSR - GATK BaseRecalibrator
/Bulk/Whole genome sequences/BQSR - GATK BaseRecalibrator/
.recal_table
Broad WGS CRAM files
/Bulk/Whole genome sequences/Broad WGS CRAM files/
.cram
Broad WGS CRAM indices
/Bulk/Whole genome sequences/Broad WGS CRAM files/
.cram.crai
Concatenated QC Metrics
/Bulk/Whole genome sequences/Concatenated QC Metrics/
.qaqc_metrics
Genotype Concordance - Contingency Metrics
/Bulk/Whole genome sequences/Genotype Concordance - Contingency Metrics/
.genotype_concordance_contingency_metrics
23319
Genotype Concordance - Detail Metrics
/Bulk/Whole genome sequences/Genotype Concordance - Detail Metrics/
.genotype_concordance_detail_metrics
Genotype Concordance - Summary Metrics (Picard)
/Bulk/Whole genome sequences/Genotype Concordance - Summary Metrics (Picard)/
.genotype_concordance_summary_metrics
Genotype Concordance
/Bulk/Whole genome sequences/Genotype Concordance/
.nrd.stats
Manta-called scored structural variant and indel candidates
/Bulk/Whole genome sequences/Manta-called scored structural variant and indel candidates/
.diploidSV.vcf.gz, .diploidSV.vcf.gz.tbi
Manta-called unscored structural variant and indel candidates
/Bulk/Whole genome sequences/Manta-called unscored structural variant and indel candidates/
.candidateSV.vcf.gz, .candidateSV.vcf.gz.tbi
Microsatellite data generated using a specially optimized tool (popSTR) for microsatellite (STR) marker calling in whole-genome sequencing studies.
/Bulk/Whole genome sequences/Microsatellites - 150k release/
ukb23365_c10_b0_v1.vcf.gz.tbi,
ukb23365_c10_b1_v1.vcf.gz.tbi, ukb23365_c10_b2_v1.vcf.gz.tbi, ukb23365_c9_b2406_v1.vcf.gz.tbi, ukb23365_c9_b2407_v1.vcf.gz.tbi, ukb23365_c9_b2408_v1.vcf.gz.tbi
Population level genome variants, BGEN format - interim 200k release
/Bulk/Whole genome sequences/Population level genome variants, BGEN format - interim 200k release/
ukb24306_c10_b0_v1.bgen, ukb24306_c11_b0_v1.bgen 'ukb24306_c12_b0_v1.bgen, ukb24306_c13_b0_v1.bgen, ukb24306_c14_b0_v1.bgen, ukb24306_c15_b0_v1.bgen, ukb24306_c16_b0_v1.bgen, ukb24306_c17_b0_v1.bgen, ukb24306_c18_b0_v1.bgen, ukb24306_c19_b0_v1.bgen, ukb24306_c1_b0_v1.bgen, ukb24306_c20_b0_v1.bgen, ukb24306_c21_b0_v1.bgen, ukb24306_c22_b0_v1.bgen, ukb24306_c2_b0_v1.bgen, ukb24306_c3_b0_v1.bgen, ukb24306_c4_b0_v1.bgen, ukb24306_c5_b0_v1.bgen, ukb24306_c6_b0_v1.bgen, ukb24306_c7_b0_v1.bgen, ukb24306_c8_b0_v1.bgen, ukb24306_c9_b0_v1.bgen, ukb24306_cX_b0_v1.bgen, ukb24306_c10_b0_v1.sample, ukb24306_c11_b0_v1.sample, ukb24306_c12_b0_v1.sample, ukb24306_c13_b0_v1.sample, ukb24306_c14_b0_v1.sample, ukb24306_c15_b0_v1.sample, ukb24306_c16_b0_v1.sample, ukb24306_c17_b0_v1.sample, ukb24306_c18_b0_v1.sample, ukb24306_c19_b0_v1.sample, ukb24306_c1_b0_v1.sample, ukb24306_c20_b0_v1.sample, ukb24306_c21_b0_v1.sample, ukb24306_c22_b0_v1.sample, ukb24306_c2_b0_v1.sample, ukb24306_c3_b0_v1.sample, ukb24306_c4_b0_v1.sample, ukb24306_c5_b0_v1.sample, ukb24306_c6_b0_v1.sample, ukb24306_c7_b0_v1.sample, ukb24306_c8_b0_v1.sample, ukb24306_c9_b0_v1.sample, ukb24306_cX_b0_v1.sample
Population level WGS variants, PLINK format - interim 200k release
/Bulk/Whole genome sequences/Population level WGS variants, PLINK format - interim 200k release/
ukb24305_c10_b0_v1.bed, ukb24305_c11_b0_v1.bed, ukb24305_c12_b0_v1.bed, ukb24305_c13_b0_v1.bed, ukb24305_c14_b0_v1.bed, ukb24305_c15_b0_v1.bed, ukb24305_c16_b0_v1.bed, ukb24305_c17_b0_v1.bed, ukb24305_c18_b0_v1.bed, ukb24305_c19_b0_v1.bed, ukb24305_c1_b0_v1.bed, ukb24305_c20_b0_v1.bed, ukb24305_c21_b0_v1.bed, ukb24305_c22_b0_v1.bed, ukb24305_c2_b0_v1.bed, ukb24305_c3_b0_v1.bed, ukb24305_c4_b0_v1.bed, ukb24305_c5_b0_v1.bed, ukb24305_c6_b0_v1.bed, ukb24305_c7_b0_v1.bed, ukb24305_c8_b0_v1.bed, ukb24305_c9_b0_v1.bed, ukb24305_cX_b0_v1.bed, ukb24305_c10_b0_v1.bim, ukb24305_c11_b0_v1.bim, ukb24305_c12_b0_v1.bim, ukb24305_c13_b0_v1.bim, ukb24305_c14_b0_v1.bim, ukb24305_c15_b0_v1.bim, ukb24305_c16_b0_v1.bim, ukb24305_c17_b0_v1.bim, ukb24305_c18_b0_v1.bim, ukb24305_c19_b0_v1.bim, ukb24305_c1_b0_v1.bim, ukb24305_c20_b0_v1.bim, ukb24305_c21_b0_v1.bim, ukb24305_c22_b0_v1.bim, ukb24305_c2_b0_v1.bim, ukb24305_c3_b0_v1.bim, ukb24305_c4_b0_v1.bim, ukb24305_c5_b0_v1.bim, ukb24305_c6_b0_v1.bim, ukb24305_c7_b0_v1.bim, ukb24305_c8_b0_v1.bim, ukb24305_c9_b0_v1.bim, ukb24305_cX_b0_v1.bim, ukb24305_c10_b0_v1.fam, ukb24305_c11_b0_v1.fam, ukb24305_c12_b0_v1.fam, ukb24305_c13_b0_v1.fam, ukb24305_c14_b0_v1.fam, ukb24305_c15_b0_v1.fam, ukb24305_c16_b0_v1.fam, ukb24305_c17_b0_v1.fam, ukb24305_c18_b0_v1.fam, ukb24305_c19_b0_v1.fam, ukb24305_c1_b0_v1.fam, ukb24305_c20_b0_v1.fam, ukb24305_c21_b0_v1.fam, ukb24305_c22_b0_v1.fam, ukb24305_c2_b0_v1.fam, ukb24305_c3_b0_v1.fam, ukb24305_c4_b0_v1.fam, ukb24305_c5_b0_v1.fam, ukb24305_c6_b0_v1.fam, ukb24305_c7_b0_v1.fam, ukb24305_c8_b0_v1.fam, ukb24305_c9_b0_v1.fam, ukb24305_cX_b0_v1.fam
Population level WGS variants, pVCF format - interim 200k release
/Bulk/Whole genome sequences/Population level WGS variants, pVCF format - interim 200k release/
ukb24304_c10_b0_v1.vcf.gz.tbi, ukb24304_c10_b1_v1.vcf.gz.tbi, ukb24304_c10_b2_v1.vcf.gz.tbi, ukb24304_cX_b3118_v1.vcf.gz.tbi, ukb24304_cX_b3119_v1.vcf.gz.tbi, ukb24304_cX_b3120_v1.vcf.gz.tbi
Population level WGS variants, pVCF format - 500k release
/Bulk/Whole genome sequences/Population level WGS variants, pVCF format - 500k release/
ukb23374_c10_b0_v1.vcf.gz.tbi, ukb23374_c10_b1_v1.vcf.gz.tbi, ukb23374_c10_b2_v1.vcf.gz.tbi, ukb23374_cX_b7800_v1.vcf.gz.tbi, ukb23374_cX_b7801_v1.vcf.gz.tbi, ukb23374_cX_b7802_v1.vcf.gz.tbi
Sample Contamination (ReadHaps)
/Bulk/Whole genome sequences/Sample Contamination (ReadHaps)/
.contamination
Sample Contamination (verifyBamID) - depthSM
/Bulk/Whole genome sequences/Sample Contamination (verifyBamID) - depthSM/
.verifyBamID.depthSM
Sample Contamination (verifyBamID) - selfSM
/Bulk/Whole genome sequences/Sample Contamination (verifyBamID) - selfSM/
.verifyBamID.selfSM
Whole genome CRAM files
/Bulk/Whole genome sequences/Whole genome CRAM files/
.cram
Whole genome CRAM files (reserved)
/Bulk/Whole genome sequences/Whole genome CRAM files (reserved)/
.cram
Whole genome CRAM indices (reserved)
/Bulk/Whole genome sequences/Whole genome CRAM files (reserved)/
.cram.crai
Whole genome CRAM indices
/Bulk/Whole genome sequences/Whole genome CRAM files/
.cram.crai
Whole genome GATK joint call pVCF
/Bulk/Whole genome sequences/Whole genome GATK joint call pVCF/
.vcf.gz, .vcf.gz.tbi, qc_metrics_gatk_variant_qc.tab.gz, qc_metrics_gatk_variant_qc.tab.gz.tbi, qc_metrics_GATK_version.txt, qc_metrics_README.pdf
Whole genome GraphTyper joint call pVCF (deprecated)
/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF (deprecated)/
.vcf.gz, .vcf.gz.tbi,
qc_metrics_graphtyper_variant_qc.tab.gz,
qc_metrics_graphtyper_variant_qc.tab.gz.tbi,
qc_metrics_README.pdf
Whole genome GraphTyper joint call pVCF
/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/
.vcf.gz, .vcf.gz.tbi,
qc_metrics_graphtyper_v2.7.1_qc.tab.gz,
qc_metrics_graphtyper_v2.7.1_qc.tab.gz.tbi,
qc_metrics_graphtyper_v2.7.1_README.pdf
Whole genome GraphTyper SV data
/Bulk/Whole genome sequences/Whole genome GraphTyper SV data/
.vcf.gz, .vcf.gz.tbi
Whole genome variant call files (VCFs)
/Bulk/Whole genome sequences/Whole genome variant call files (VCFs)/
.g.vcf.gz
Whole genome variant call files (VCFs)
/Bulk/Whole genome sequences/Whole genome variant call files (VCFs)/
.g.vcf.gz.tbi
Whole genome variant call files (VCFs)
/Bulk/Whole genome sequences/Whole genome variant call files (VCFs) (reserved)/
.g.vcf.gz
Whole genome variant call files (VCFs)
/Bulk/Whole genome sequences/Whole genome variant call files (VCFs) (reserved)/
.g.vcf.gz.tbi
BQSR - GATK BaseRecalibrator
/Bulk/Whole genome sequences/BQSR - GATK BaseRecalibrator (reserved)/
.recal_table
Sample Contamination (ReadHaps)
/Bulk/Whole genome sequences/Sample Contamination (ReadHaps) (reserved)/
.contamination
Genotype Concordance - Contingency Metrics
/Bulk/Whole genome sequences/Genotype Concordance - Contingency Metrics (reserved)/
.genotype_concordance_contingency_metrics
Genotype Concordance - Detail Metrics
/Bulk/Whole genome sequences/Genotype Concordance - Detail Metrics (reserved)/
.genotype_concordance_detail_metrics
Genotype Concordance - Summary Metrics (Picard)
/Bulk/Whole genome sequences/Genotype Concordance - Summary Metrics (Picard) (reserved)/
.genotype_concordance_summary_metrics
Genotype Concordance
/Bulk/Whole genome sequences/Genotype Concordance (reserved)/
.nrd.stats
Concatenated QC Metrics
/Bulk/Whole genome sequences/
Concatenated QC Metrics (reserved)/
.qaqc_metrics
Sample Contamination (verifyBamID) - depthSM
/Bulk/Whole genome sequences/Sample Contamination (verifyBamID) - depthSM (reserved)/
.verifyBamID.depthSM
Sample Contamination (verifyBamID) - selfSM
/Bulk/Whole genome sequences/Sample Contamination (verifyBamID) - selfSM (reserved)/
.verifyBamID.selfSM
Manta-called scored structural variant and indel candidates (Vanguard)
/Bulk/Whole genome sequences/Manta-called scored structural variant and indel candidates (Vanguard)/
.diploidSV.vcf.gz.tbi, .diploidSV.vcf.gz
Manta-called unscored structural variant and indel candidates (Vanguard)
/Bulk/Whole genome sequences/Manta-called unscored structural variant and indel candidates (Vanguard)/
.candidateSV.vcf.gz.tbi, .candidateSV.vcf.gz
DRAGEN population level WGS variants, pVCF format [500k release]
/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr{1-22,X,Y,M}/
vcf.gz, vcf.gz.tbi
MNI Native Transform
/Bulk/Brain MRI/Native atlases/
zip
Native aparc a2009s dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native aparc dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Glasser dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Schaefer7n200p dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Schaefer7n500p dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Tian Subcortex S1 3T dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Tian Subcortex S4 3T dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Schaefer7n1000p dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native aparc a2009s SF
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
Native aparc SF
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
Native Glasser SF
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
Native Schaefer7n100p to 1000p SF
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
Native Tian Subcortex S1 to S4 3T
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
fMRI timeseries aparc a2009s
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries aparc
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries Glasser
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries global signal
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries Schaefer7ns 100p to 1000p
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries Tian Subcortex S1 to S4 3T
/Bulk/Brain MRI/Functional time series/
zip
Connectome aparc a2009s and Tian Subcortex S1 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome aparc and Tian Subcortex S1 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Glasser and Tian Subcortex S1 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Glasser and Tian Subcortex S4 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Schaefer7n1000p and Tian Subcortex S4 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Schaefer7n200p and Tian Subcortex S1 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Schaefer7n500p and Tian Subcortex S4 3T
/Bulk/Brain MRI/Connectomes/
zip
Tractography endpoints coordinates
/Bulk/Brain MRI/Connectomes/
zip
Tractography quality metrics
/Bulk/Brain MRI/Connectomes/
zip
Data release version
Tabular participant data
Bulk Data
Released on the Research Analysis Platform
v18.1
This release includes all fields from previous releases, plus these new fields:
Note: As a part of this release, the individual-level data, such as CRAM and VCF files, from the first 200k participant WGS release has been merged into the enduring 500k release fields.
November 30 2023
v17.1
November 30 2023
v16.1
This release includes all fields from previous releases, plus these new fields:
26301, 24048, 24050, 24051, 24053, 24055, 24056, 24058, 24059, 24061, 24062, 24064, 24065, 20278, 20279.
Note: As part of this release, data were added covering additional participants to the following fields:
20216, 20220, 21011, 21013, 21014, 21012, 21016, 21015, 20222, 20223, 20226, 20211, 20212, 20213, 20214, 20207, 20208, 20209, 20210, 20225, 20217, 20218, 20219, 20224, 20201, 20204, 20202, 20241, 20158, 20260, 20259, 20254, 20205, 20266, 20243, 20264, 20267, 24664, 24662, 24663, 24661, 24665, 24030, 24031, 24032, 24033, 24034, 24035, 24036, 24037, 24038, 24039, 24040, 24041, 24042, 24043, 24044, 24045, 24046, 24047
Note: As a part of this release corrected version of files were issued for fields 23374, 21007 and 21008.
August 2 2023
v15.1
This release includes all fields from previous releases, plus these new fields: 23365, 24305, 24306, 30900
Note: As part of this release, imaging data were added covering additional participants to the following fields: 20216, 20220, 21011, 21013, 21014, 21012, 21016, 21015, 20212, 20213, 20214, 20207, 20208, 20209, 20210, 20225, 20217, 20218, 20219, 20224, 20201, 20204, 20202, 20241, 20158, 20252, 20253, 20249, 20227, 25750, 25752, 25751, 20250, 25753, 20251, 20260, 20259, 20254, 25754, 25755, 20266, 20243, 20264, 20267
April 12 2023
v14.1
This release includes all fields from previous releases, plus these new fields: 23374, 24664, 22300, 24662, 22298, 22299, 24663, 24661,
Note: As part of this release, imaging data were added covering additional participants to the following fields: 20216, 20220, 20222, 20223, 20226, 20211,
20212, 20213, 20214, 20207, 20208, 20209, 20210, 20225, 20217, 20218, 20219, 20224, 20201, 20204, 20202, 20241, 20158, 20260, 20259, 20254, 20205, 20266, 20243, 20264,
v13.1
This release includes all fields from previous releases, plus these new fields: 21007 and 21008.
Note: As part of this release, for fields 23372, 23373, 23370, 23371, 23376, 23377, 23383, 23384, 23380, 23379, 23378, 23382, and 23381 data was added for any remaining WGS participants.
Note: As part of this release, for fields 23192, 23193, and 23194, the corrected version of CRAMs/CRAIs/gVCF.tbi files were issued for 95 files.
Note: As part of this release, for fields 23376, 23377, 23378, 23379, 23380, 23382, 23383, and 23384, corrected version of supplemental files were issued.
December 7 2022
v12.1
This release includes all fields from previous releases.
Note: As part of this release, imaging data were added, covering additional participants, to fields 20158, 20201, 20202, 20204, 20205, 20207, 20208, 20209, 20210, 20211, 20212, 20213, 20214, 20216, 20217, 20218, 20219, 20220, 20222, 20223, 20224, 20225,
20226, 20227, 20241, 20243, 20249, 20250, 20251, 20252, 20253, 20254, 20259, 20260, 20263, 20264, 20266, 20267, 25750, 25751, 25752, 25753, 25754, and 25755.
Note: As part of this release, folder names were updated for the following fields: 23157, 23158, and 23159.
June 29 2022
v11.1
This release includes all fields from previous releases, plus these additional fields: 24030, 24031, 23157, 23158, 23159
Note: As part of this release, for fields 23141, 23142, 23143, and 23144, data were added for an additional 20k participants, and data were updated for 51 existing participants.
Note: As part of this release, for fields 23372, 23373, data were added for an additional 20k participants.
Note: As part of this release, for fields 23370, 23371, 23376, 23377, 23378, 23379, 23380, 23381, 23382, 23383, 23384, data were added for an additional 90k participants.
May 25 2022
v10.1
April 14 2022
v9.1
Same as v6.1
This release includes all fields from previous releases, plus these new fields:
As part of this release, imaging data were added, covering additional participants, to fields 20158, 20201, 20202, 20204, 20205, 20207, 20208, 20209, 20210, 20211, 20212, 20213, 20214, 20215, 20216, 20217, 20218, 20219, 20220, 20222, 20223, 20224, 20225,
20226, 20227, 20241, 20243, 20249, 20250, 20251, 20252, 20253, 20254, 20259, 20260, 20263, 20264, 20265, 20266, 20267, 21014, 21015, 22002, 25750, 25751, 25752, 25753, 25754, and 25755. WGS data covering additional participants were added to fields 23372 and 23373.
Feb 9 2022
v8.1
Same as v6.1
This release includes all fields from previous releases, with these updates:
As part of this release, for fields 23191, 23193, 23194, 23197, 23346, 23348, 23349, 23350, and 23351, data were added for an additional 50k participants, bringing the total for these fields to 200k participants.
As part of this release, fields 23370, 23371, 23372, 23373, 23376, 23377, 23378, 23379, 23380, 23381, 23382, 23383, and 23384 were updated to contain data for an additional 200k participants (over and above the 200k participants of fields 23191 etc.). These fields are reserved for the WGS consortium to house data prior to it becoming public.
Nov 15 2021
v7.1
Same as v6.1
This release includes all fields from previous releases, plus these additional fields:
Note: Starting from this release, FAM files for field 23145 and SAMPLE files for field 23147 contain gender info.
Note: As part of this release, for fields 23141, 23142, 23143, and 23144, data were added for an additional 150k participants, and data were updated for 44 existing participants.
Oct 29 2021
v6.1
All fields from previous releases
Sept 22 2021
v5.0
Same as v4.0
Sept 3 2021
v4.0
July 27 2021
v3.0
This release includes all fields from v1.0 and v2.0, plus these additional fields, for participants for which there were data, as of March 2021:
6025, 20158, 20201, 20202, 20203, 20204, 20205, 20206, 20207, 20208, 20209, 20210, 20211, 20212, 20213, 20214, 20215, 20216, 20217, 20218, 20219, 20220, 20221, 20222, 20223, 20224, 20225, 20226, 20227, 20241, 20243, 20249, 20250, 20251, 20252, 20253, 20254, 20259, 20260, 20263, 20264, 20265, 20266, 20267, 21011, 21012, 21013, 21014, 21015, 21016, 21017, 21018, 22002, 23181, 23182, 23183, 23184, 25747, 25748, 25749, 25750, 25751, 25752, 25753, 25754, 25755, 90001, 90004
June 4 2021
v2.0
Jan 28 2021
v1.0
Nov 19 2020
Issue | Example error message | What to do |
---|---|---|
Cannot open or launch JupyterLab session | | After the job is launched, it may take ~10-15 minutes before the JupyterLab server is accessible. Note: this wait time applies to all cloud applications, including RStudio. If waiting doesn't work, try adding the port number to the address; currently this helps for ports 8080 and 8081. |
Timeout error when working with a Spark object | | Try using the latest version of JupyterLab, available via the Tools > JupyterLab tab, where you can launch a new environment using the New JupyterLab button or re-launch an old JupyterLab session. By default, the JupyterLab environment will use the latest version available. If the issue is limited memory, you may need to use an instance type with a larger memory allocation. |
Issue accessing a large dataset using Spark | | You have exceeded the allowable buffer limit size for Kryo serialization. Adjust the buffer using the code below. |
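A minimal sketch of the Kryo buffer adjustment is shown below; the 1024m value is an example and should be tuned to your workload.

```python
# Raise the Kryo serializer buffer limit; set this before the Spark session is first created.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.kryoserializer.buffer.max", "1024m")
    .getOrCreate()
)
```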
Learn how to export selected phenotypic fields into a TSV or CSV file, for easy browsing and analysis.
If you've worked with UK Biobank data prior to using the Research Analysis Platform, you may be aware that UK Biobank distributes the main tabular dataset in a large encoded file with the extension .enc_ukb
. To work with the dataset, you first convert this file to TSV or CSV format.
On the Research Analysis Platform, this dataset is dispensed into your project as a database, in Parquet format. You can access this database within a Spark environment - for example, by querying it from inside a Spark JupyterLab session.
If you have existing code that relies on reading just a handful of fields from a file, you may find it easier to extract those fields from the database and dump them into a TSV or CSV file. You can then run your code or otherwise work with the file, without having to do so within a Spark environment.
Start by navigating to your project and clicking on the name of the dispensed dataset. The Cohort Browser will launch.
In the Cohort Browser, open the Data Preview tab:
Click the "grid" icon at the right end of the Participant ID header row. Then click Add Columns. The Add Columns to Table dialog will open:
Navigate to any field, either directly or via search. Once you've found the field you're looking for, click Add as Column:
Continue locating the fields you're interested in, and adding them as columns. Note that as you add additional fields as columns, you do not have to wait for the Data Preview to finish loading.
Once you've finished, close the dialog by clicking the X to the right of the Add Column to Table title. In the Data Preview tab, you'll see the first few rows of the data.
In the upper right corner of the screen, click Views, then click Save View. Enter a name for the view, then save it.
Now convert your saved view into a TSV or CSV file, using the Table Exporter app.
Navigate back to your project and click the Start Analysis button in the upper right corner of the screen. In the Start New Analysis dialog, select the Table Exporter app, then click Run Selected. Note that if this is the first time you've run Table Exporter, you'll be prompted to install it first.
Within the Table Exporter app, open the Analysis Inputs tab on the right side of the screen. Then click the Dataset or Cohort or Dashboard tile:
A modal window will open. Select the view that you created and saved in the Cohort Browser.
Within the Options section, configure your output options.
In the Output File Name field, enter a filename prefix. In the Output File Format field, select "CSV" or "TSV." You may find it easier to work with a TSV file downstream, because the values in certain fields contain commas, complicating the parsing of a CSV file.
In the Coding Option field, select "RAW" so that you can work with the original UK Biobank data, as you would get them from the Biobank. (For example, in the Sex field, you will see the coded value "0" rather than "Female.")
In the Header Style field, select "UKB-FORMAT" to get headers that match the original UK Biobank format (e.g. 123-4.5).
Click Start Analysis. Once the conversion finishes and the file is ready, you will be notified via email. To access the file, either return to your project, or click the link in the email.
For general tips on troubleshooting, see the troubleshooting guide.
Learn how to search and analyze UK Biobank bulk data files.
This section provides a detailed breakdown of how to search for an EID in participant-specific files, such as individual VCF and CRAM files. Note that these methods won't work for cohort-wide files, such as PLINK and pVCF files.
Turn on the filters in your project, by clicking on the filter icon.
Use the filter picker to open the Properties filter.
Select Any Properties and type "eid" (without the quotes, in lower-case letters) in the Any Key textbox.
In the Any Value textbox, enter the 7-digit EID you're trying to locate.
Select Apply.
To search across all folders, set the search scope to Entire Project.
Search for an EID as follows, replacing "1234567" with the EID you're trying to find: dx find data --property eid=1234567
To visualize a CRAM or VCF file in the IGV.js genome browser, follow these steps:
Navigate to the project containing the files you want to visualize.
Select the Visualize tab.
Select the option "IGV.js Genome Browser v2.6.6 (*.bam+bai, *.cram+crai, *vcf.gz+tbi)".
Select the files you want to visualize.
If you are looking for a specific participant, enter the EID in the Search Project textbox, to quickly locate any CRAM or VCF files related to that participant.
For CRAM files, you must select both the CRAM and the associated CRAI file.
For VCF files, you must select both the VCF and the associated TBI file.
IMPORTANT: Note that IGV.js cannot visualize extremely large pVCF files, such as those provided for the 200k WES, 300k WES or 150k WGS releases. If you want to visualize variants in the 150k WGS cohort, type either "qc_metrics_graphtyper" or "qc_metrics_gatk" into the Search Project textbox and select the resulting pair of *.tab.gz and *.tab.gz.tbi files.
Select Launch Viewer.
The Research Analysis Platform provides many different tools for analyzing files. The Swiss Army Knife app is a simple starting point for many common bioinformatics manipulations. Launching the app will instantiate a Linux VM on the cloud with several preinstalled tools, and run a user-provided command. For more information about this app and its possibilities, visit its entry in the Tools Library.
To launch Swiss Army Knife, navigate to your project and click Start Analysis. Select Swiss Army Knife and click Run Selected. Select the Analysis Inputs tab. You can choose between specifying explicit inputs or using a mounted project folder.
Explicit Inputs. Use this strategy to analyze files that will be first downloaded on the local disk of the cloud VM.
Click Input files. Navigate to a folder of interest (for example, Bulk
> Genotype Results
> Genotype calls
), and tick the files of interest (for example, <Chromosome 21 file>.bed
, <Chromosome 21 file>.bim
and <Chromosome 21 file>.fam
). Click Select as Input.
In the Command line textbox, enter a command, referring to files directly with their names (for example, plink --bfile <Chromosome 21 file> --maf 0.1 --out filtered_chr21
)
Mounted project folder. Use this strategy to analyze files that will be streamed directly without first writing them on disk.
In the Command line textbox, enter a command, referring to any files in the project using the prefix /mnt/project
(for example, plink --bfile "/mnt/project/Bulk/<Path to chromosome calls>" --maf 0.1 --out filtered_chr21
).
It is also possible to combine the two strategies. For example, you can provide an R script as explicit input (such as statistics.r
), a command to run the script (such as Rscript statistics.r
) , and inside the script you can read any project files by opening them from the /mnt/project
folder (such as fields <- read.csv("/mnt/project/<Path to project files>", sep="\t")
)
For general tips on troubleshooting, see the troubleshooting guide.
Learn how to determine what priority level is suited for your analysis, and how the different options affect job execution.
On the Research Analysis Platform, analyses are executed on Virtual Machines (VMs) using the Amazon Elastic Compute Cloud (Amazon EC2).
When a new job is submitted, the Platform requests a new VM from EC2. Two types of VMs can be used:
On-demand VMs: These VMs cost more, but are always available.
Spot VMs: These VMs cost less, but may not be available at the time of the request, so the system may have to wait until they become available. Note also that even after becoming available, the availability of spot VMs can be interrupted, which leads to interruptions in your job execution.
On the Research Analysis Platform, each job has a priority level. The priority level of a job can be set in either of the following ways.
To set a job’s priority level when launching JupyterLab from the Tools menu:
Click on the Tools tab from the global menu on the top of your screen on the Platform and click on JupyterLab.
In the New JupyterLab modal that appears, fill out the required fields for your JupyterLab job. For the Priority field, choose a priority level for your job (high, normal, or low).
After filling out the rest of the required fields, click Start Environment.
To select a job’s priority level from the Manage tab:
From your project’s Manage tab, click Start Analysis.
Select the tool you want to run and click the Run Selected button.
In the Analysis Settings tab, fill out the required fields. For the Priority field, choose a priority level for your job (high, normal, or low).
Fill out the rest of the steps and then click Start Analysis.
High priority jobs run with an on-demand VM. The use of high priority is recommended for workloads that need to be executed as soon as possible including any interactive web applications such as JupyterLab or Cloud Workstation. Since the JupyterLab app takes time to initiate, users who wish to execute their analyses without interruptions or restarts may want to use high priority to execute these jobs.
For normal priority jobs, the Platform first requests spot VMs from EC2, then waits up to 15 minutes for these to become available. If spot VMs become available within that time frame, the job will start executing on those spot VMs, but this execution may be interrupted. If spot VMs do not become available within 15 minutes, the Platform requests on-demand VMs from EC2.
For normal priority jobs, the final outcome (on-demand vs. spot) depends on spot availability.
Use normal priority for analyses for which you are willing to tolerate spot risks in exchange for a lower price, but don't want to wait more than 15 minutes for spot VMs.
For a low priority job, the Platform always asks EC2 for spot VMs. If spot VMs are available, the job will start executing right away. Otherwise, the job will remain in a "runnable" state for an indefinite period while waiting for a spot VM to become available. Once running, jobs can become interrupted if a spot VM becomes unavailable in the middle of execution, usually due to increased cloud demand.
Use low priority for nonurgent workloads that do not have to run right away, or for workloads that do not need to be uninterrupted.
When jobs run on spot VMs, there is a small risk of interruption. An interruption occurs when a spot VM becomes unavailable in the middle of execution, usually due to increased cloud demand. In that case, the job is marked as failed due to an unresponsive VM. What happens next depends on the job's "restart policy." By default, the Platform retries any job that fails due to unresponsive VMs, up to two times. When a job is retried, a new VM is allocated depending on priority. Low priority jobs are retried again on spot, whereas normal priority jobs are retried on spot for 15 minutes and then switched to on-demand.
Note that if your job is interrupted, you will still be billed for usage charges incurred prior to the interruption.
Learn to use Spark in JupyterLab to analyze UK Biobank tabular data.
In the platform select Tools, then JupyterLab.
Click New JupyterLab.
Type a descriptive title for your JupyterLab session in the Environment Name textbox.
Select a project where you want to run JupyterLab.
Click Spark Cluster under Cluster Configuration.
Select an instance type and number of nodes. This will affect how powerful the Spark cluster will be. The default settings allow for casual interrogation of the data. If you will be running complex queries or analyzing a large amount of data in memory, you may need to select a larger instance type. To increase parallelization efficiency and reduce processing time, you may need to select more nodes.
Click Start Environment. A new row will appear and the environment will begin initializing.
Once the status becomes Ready, click the name to connect to JupyterLab.
Inside JupyterLab, select the DNAnexus menu, then select New Notebook.
Click the DNAnexus tab on the left and locate the new notebook (Untitled_<DATE>.ipynb
).
Double-click the notebook name to open it.
Select Python 3 as the kernel. You are now ready to begin using the notebook.
To begin, import relevant Spark and DNAnexus libraries, and instantiate a Spark context and Spark session at the very top of your notebook, as shown below.
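A minimal initialization cell might look like the following sketch, assuming the standard pyspark, dxpy, and dxdata packages available in the Spark JupyterLab environment:

```python
# Import Spark and DNAnexus libraries, then create the Spark context and session.
import pyspark
import dxpy
import dxdata

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
```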
Ensure that your Spark session is only initialized once per JupyterLab session. If you try to evaluate this cell multiple times (for example, by selecting "Run All Cells" to rerun a notebook after it's already run, or by opening and running multiple notebooks in the same JupyterLab session), you may encounter errors or your notebook may hang. If that happens, you may need to restart the specific notebook's kernel.
As a best practice, shut down the kernel of any notebook you are not using, before running a second notebook in the same session.
To improve the reproducibility of your notebooks, and ensure they are portable across projects, it is better not to hardcode any database or dataset names. Instead, you can use the following code to automatically discover the database and dataset:
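For example, the following sketch uses dxpy to locate the dispensed database and dataset; the "app*" name patterns assume the default names given to dispensed objects in your project.

```python
import dxpy

# Find the dispensed database and remember its name for SQL queries.
dispensed_database = dxpy.find_one_data_object(
    classname="database", name="app*", folder="/", name_mode="glob", describe=True)
dispensed_database_name = dispensed_database["describe"]["name"]

# Find the dispensed dataset and remember its ID for loading with dxdata.
dispensed_dataset = dxpy.find_one_data_object(
    typename="Dataset", name="app*.dataset", folder="/", name_mode="glob", describe=True)
dispensed_dataset_id = dispensed_dataset["id"]
```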
To evaluate SQL, you can use the spark.sql("...")
function, which returns a Spark DataFrame.
You can view the contents of a DataFrame (in full width) by calling .show(truncate=False)
on it.
The following example lists the tables in the database:
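For example, assuming dispensed_database_name was discovered as shown above:

```python
# List all tables in the dispensed database.
spark.sql(f"SHOW TABLES IN {dispensed_database_name}").show(truncate=False)
```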
The database contains the following tables:
When listing tables in SQL, you may notice each table appearing twice, under a regular name and a versioned name, such as "gp_clinical" and "gp_clinical_v4_0_9b7a7f3". This naming scheme is part of the system's architecture, supporting data refreshes and participant withdrawals.
The "regularly named" table is actually a SQL VIEW pointing to the versioned table. When data is updated, the VIEW is switched to point to a new versioned table, and the old versioned table is deleted. Due to this behavior, please make sure to always use the regularly named tables - such as "gp_clinical" - because the versioned tables do not persist over time.
If your access application has been approved for Data-field 23146, 23148, and/or 23157 you will also see the following tables:
allele_23146, allele_23148, allele_23157, annotation_23146, annotation_23148, annotation_23157, assay_eid_map_23146, assay_eid_map_23148, assay_eid_map_23157, genotype_23146, genotype_23148, genotype_23157, pheno_assay_23146_link, pheno_assay_23148_link, pheno_assay_23157_link, rsid_lookup_r81_23146, rsid_lookup_r81_23148, and rsid_lookup_r81_23157.
These tables contain limited information about alleles and genotypes, transcribed into SQL from the pVCF files of Data-field 23146 and/or 23148 and/or 23157 (along with added annotations). These tables are used by the Cohort Browser in the creation of the "GENOMICS" tab. They have not been optimized for direct SQL querying, and their schema and conventions are subject to change. For this reason, it is not recommended to access these tables on your own but to access the bulk files instead.
For the main UK Biobank participant tables, the column-naming convention is generally as follows:
p<FIELD-ID>_i<INSTANCE-ID>_a<ARRAY-ID>
However, the following additional rules apply:
If a field is not instanced, the _i<INSTANCE-ID>
piece is skipped altogether.
If a field is not arrayed, the _a<ARRAY-ID>
piece is skipped altogether.
If a field is arrayed due to being multi-select, the field is converted into a single column of type "embedded array", and the _a<ARRAY-ID>
piece is skipped altogether.
Examples:
Age at recruitment: p21022
Date of attending assessment centre: p53_i0
, p53_i1
, ...
Diagnoses - ICD10 (converted into embedded array): p41270
For linked health care tables, it is easier to use SQL directly to extract data as a Spark DataFrame. The following example retrieves all GP records related to serum HDL cholesterol levels.
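A sketch of such a query is shown below. The Read codes used here to identify serum HDL cholesterol records, and the column names, are assumptions based on the UK Biobank primary care data layout; check the coding dictionaries for your own study.

```python
# Retrieve GP clinical records whose Read v2/v3 code matches serum HDL cholesterol.
df = spark.sql(f"""
    SELECT eid, event_dt, read_2, read_3, value1, value2, value3
    FROM {dispensed_database_name}.gp_clinical
    WHERE read_2 LIKE '44P5%' OR read_3 LIKE '44P5%'
""")
```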
Spark DataFrames are lazy-evaluated. In the code block above, the command will return right away, assigning the variable df
without executing the query. The query is only evaluated when needed, potentially with additional transformations.
For example, typing df.count()
later will evaluate an equivalent SELECT COUNT(*)
...
As mentioned above, the dataset combines the low-level database structure with metadata from the UK Biobank Showcase. Database tables are exposed as virtual entities, and database columns are exposed as fields. The split participant tables are all combined into a single entity called participant
.
To fetch participant fields, you must first make a list of field names of interest. There are three main ways to look up field names:
Once you have gathered field names of interest, you can create a Spark DataFrame of corresponding participant data using the retrieve_fields
function:
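A minimal sketch, assuming the participant entity was loaded as described elsewhere in this guide and using a few example fields:

```python
import dxdata

# EID, sex, age at recruitment, and weight at the first assessment visit.
field_names = ["eid", "p31", "p21022", "p21002_i0"]

df = participant.retrieve_fields(names=field_names, engine=dxdata.connect())
df.show(5, truncate=False)
```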
This function automatically joins across the split participant tables as needed, returning the requested columns as a Spark DataFrame.
If a particular Spark command is taking too long to evaluate, you can monitor the Spark status by visiting the Spark console page. To do that, copy the URL of your current JupyterLab session (typically ending in ".dnanexus.cloud/lab?"
), open a new browser tab, and paste the URL. Replace "/lab?"
with ":8081/jobs/"
and press Enter.
It is a good idea to always include the participant EID ("eid"
) as the first field name, so that it is returned as the first column. If you don't include it, the system will not return it automatically.
By default, the system returns the data as encoded by UK Biobank. For example, field p31
(participant sex) will be returned as an integer column with values of 0 and 1. To receive decoded values, supply the coding_values="replace"
argument.
The returned DataFrame uses the field names as the column titles. If you prefer to give them some human-readable names, you can provide a mapping from field names to your own names using the column_aliases={"p21002_i0": "weight", ...}
argument.
To retrieve the EIDs of a cohort that you previously saved via the Cohort Browser - in this example, a cohort named "controls" in the root folder - use the following code:
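For example, assuming the cohort record is named "controls" and saved in the project root:

```python
# Load the saved cohort; cohort.sql holds the SQL filter defining its participants.
cohort = dxdata.load_cohort("/controls")
```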
To retrieve participant fields for a cohort, supply the filter_sql=cohort.sql
argument:
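For example, reusing the field list and cohort from above:

```python
# Retrieve fields only for participants belonging to the saved cohort.
df = participant.retrieve_fields(
    names=field_names,
    filter_sql=cohort.sql,
    engine=dxdata.connect(),
)
```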
Learn about how usage is billed on the Research Analysis Platform, how to set up a billing account to cover charges you’ll incur, your initial credit, and more.
Users of the Research Analysis Platform are billed according to usage. Platform users incur costs for:
Using compute resources
Storing data other than that dispensed to a project by UK Biobank. This includes uploaded data, or data generated in the course of work on the Platform.
Data egress
Users are not charged for the cost of storing UK Biobank data that has been dispensed to a project. The cost of storing this data is sponsored by Amazon Web Services.
Each new user receives a £40 credit toward covering usage costs. Setting up a billing account will not impact your initial usage credit and you will only be invoiced for usage beyond this initial £40.
To ensure uninterrupted access to all features, you must provide a valid billing account before your credit runs out. See the next section for info on how to set up a new billing account.
On the Research Analysis Platform, you can enable shared billing for a group of users, by doing either of the following:
Set up billing for your personal billing account - i.e. your "wallet" - and add users to this account.
Create and set up billing for a new organization, then add users to it.
To set up billing for, and add members to your personal billing account:
From the global menu, select Org Admin.
From the dropdown menu, select All Orgs.
On the Organizations list page, select your personal billing account.
Open the Billing tab, then click Set Up Billing.
Follow the steps in the Set Up Billing wizard.
Open the Members tab for the new organization.
Repeat this for each new person you'd like to add.
On the Research Analysis Platform, users can create an organization for the specific purpose of enabling shared billing for a group of users. To do this:
From the global menu, select Org Admin.
From the dropdown menu, select All Orgs.
On the Organizations list page, click the New Organization button.
A New Organization form will open in a dialog. In the form, enter a unique name and ID for your new organization.
Click Create Organization.
You'll be prompted to set up billing for your new organization. Click Start Setup in the Set Up Billing modal window.
Follow the steps in the Set Up Billing wizard.
Open the Members tab for the new organization.
Repeat this for each new person you'd like to add.
Every billing account has a spending limit - a limit on unpaid charges that can be incurred, before access is limited to key features, including the ability to perform billable activities. See the next section for more information.
Note that you are responsible for paying all charges you incur on the Platform.
When a billing account’s spending limit is exceeded, functionality is restricted for those whose usage is billed to the account. They temporarily lose the ability to launch new analyses, upload data, or egress data.
To restore full functionality, payment must be made to cover incurred charges, or the billing account’s spending limit must be raised. For users working in a project linked to the restricted billing account, the project admin may also link the project to a different billing account, whose spending limit has not been exceeded.
Learn how to work cost-effectively on the Research Analysis Platform
Platform users incur costs for:
Using compute resources
Storing data other than that dispensed to a project by UK Biobank. This includes uploaded data, or data generated in the course of work on the Platform.
Data egress
Users are not charged for the cost of storing UK Biobank data that has been dispensed to a project. The cost of storing this data is sponsored by Amazon Web Services.
Smart Reuse is available to all Research Analysis Platform users.
When running a job on the Platform, you must select a compute instance on which to execute the job. It’s difficult to make general recommendations about which instances are best in each situation, and how to balance speed and cost-efficiency.
It can be helpful to log into a running instance using dx ssh
and check how the CPU and memory of the machine are being utilized, using a utility such as htop. If CPUs are idle or memory is under-utilized, you may be able to save money by selecting a smaller instance type for that job, or by changing the configuration of the tool being run to use more threads (if applicable to that specific tool).
Each analysis job is run with a priority setting.
High priority jobs use on-demand virtual machines, i.e. compute instances that are immediately available. This costs more than running a job at low priority, which uses spot instances, i.e. virtual machines that may or may not be immediately available. Running a job at normal priority, meanwhile, means the system will first try, for 15 minutes, to secure a spot instance or instances, only using more expensive on-demand instances if spot instances are unavailable.
Storage costs can add up, if you create or upload large files, particularly if you store them for long periods of time. For this reason, proper file management is essential to using RAP in a cost-efficient fashion. For example, if, in the course of running an analysis, you generate intermediate files, you should consider carefully whether these are worth saving. They may be useful for future analyses. But if the compute cost and effort needed to generate them is low, you might consider re-creating them rather than storing them until you need them again.
Be aware that when users are added to a billing account, they can incur costs that must be paid by the person or entity responsible for that account. Note as well that incurring such costs will affect the billing account’s spending limit, and thus other users’ ability to run jobs billed to that account.
Note that when a billing account’s spending limit is exceeded, an email notification is sent to the account owner, and functionality is restricted for all those whose usage is billed to the account. They temporarily lose the ability to launch new analyses, upload data, or egress data. To restore full functionality, payment must be made to cover incurred charges or the billing account’s spending limit must be raised. For users working in a project linked to the restricted billing account, the project admin may also link the project to a different billing account, whose spending limit has not been exceeded.
The Research Analysis Platform provides a full-featured toolkit for preparing and analyzing a wide range of datatypes.
The Tools Library under the Tools tab of the platform shows a complete list of apps and workflows available to you. Use filters to quickly find items by name, category, etc.
Clicking on the name of an app will open up a separate information page which contains details about the app's inputs, outputs, and other documentation details. For apps encapsulating existing bioinformatics tools this page also contains licensing information, links to the website for that software, and citations to any related publications. This page also shows the version history for the app and developer documentation which describes the inner workings of the app in detail.
BCFtools
BEDtools
VCFtools
vcflib
PLINK
PLINK2
Sambamba
SAMtools
Picard
Tabix
Seqtk
bgzip
bgenix
GATK4 apps
Germline Best Practice BAM/CRAM to VCF workflow
PLINK
PLINK2
PLATO
BOLT-LMM
REGENIE
QCTool
SAIGE apps
BiocManager
MendelianRandomization
coloc
HyPrColoc
epiR
prevalence
incidence
outbreak
Hail
VEP
NumPy
SciPy
Matplotlib
Seaborn
Pandas
JupyterLab app with STATA feature provides access to
Stata (stata license to be provided by the user)
nipype
FreeSurfer
FSL
tensorflow
torch
cntk
keras
scikit-Learn
MLlib
Navigate to the Settings tab. Check that the Delete Access policy is set to 'Contributors & Admins'. If it is set to ‘Admins only’, the project will be considered protected. Information on protected projects can be found in the DNAnexus documentation.
Try monitoring the job using the Spark UI:
Pricing for these jobs can be found in the "On-demand GBP/hr" column on the Research Analysis Platform .
Pricing depends on which type of VM is used to execute the job. See pricing details in the "On-demand GBP/hr" and "Spot GBP/hr" columns of the Research Analysis Platform .
Pricing for these jobs can be found in the "Spot GBP/hr" column named on the Research Analysis Platform .
Apache Spark is a modern, scalable framework for parallel processing of big data. To analyze tabular data using Spark in JupyterLab, you first need to launch JupyterLab in a Spark cluster configuration. For information on how to use Hail with JupyterLab, see the example notebooks.
For all other tables - such as hospital records, GP records, death records, and COVID-19 records - the column names are identical to what UK Biobank provides in its Showcase. For more information on the columns of these tables, consult the UK Biobank documentation for hospital records, GP records, death records, COVID-19 GP records, and COVID-19 test results.
The main participant data is horizontally split into multiple tables, and you may find that SQL is less than suitable for querying those tables directly. To access main participant data, consider using the dxdata approach described below.
For a list of all DataFrame functions, consult the Spark DataFrame API documentation.
For an example Jupyter notebook that demonstrates how to extract data, see the example notebooks provided for the Research Analysis Platform.
After discovering the dataset ID (as shown earlier), you can load the dataset and access the participant entity as follows:
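A minimal sketch, assuming dispensed_dataset_id was discovered as shown earlier:

```python
import dxdata

# Load the dispensed dataset and access the combined participant entity.
dataset = dxdata.load_dataset(id=dispensed_dataset_id)
participant = dataset["participant"]
```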
If you already know the UK Biobank data-field ID, or if you navigate to the UK Biobank Showcase and browse or search for data-fields, you can construct the field name using the naming convention described above. For example, data-field 21002 (participant weight) corresponds to field names p21002_i0
through p21002_i3
.
You can look up field names in the Cohort Browser. The following screenshot shows an example of searching for the "weight" keyword and locating the name of a field ( p21002_i0
, shown next to the Link label).
You can look up fields programmatically, by iterating over all fields in the participant.fields
array, or by using the function participant.find_fields
. Refer to the dxdata documentation for more information. The following example finds all fields with a matching case-insensitive keyword "weight" in their titles:
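A simple sketch of this lookup, iterating over participant.fields (participant.find_fields offers similar functionality):

```python
# Find participant fields whose titles contain "weight" (case-insensitive).
matching_fields = [f for f in participant.fields if "weight" in f.title.lower()]
for field in matching_fields:
    print(field.name, field.title)
```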
You can continue to work with the Spark DataFrame and leverage Spark functions for counting, filtering, aggregations, or statistics. Consult the Spark documentation for more information. Spark functions are executed in a distributed manner across the Spark cluster.
If you prefer to load all the results in memory, instead of keeping them in a parallelized and decentralized Spark DataFrame, simply convert the Spark DataFrame to a Pandas DataFrame by calling .toPandas()
. This will return a Pandas DataFrame in memory, which you can manipulate further using other Pandas functions. Pandas functionality runs in the same VM as JupyterLab and does not leverage the Spark cluster.
For detailed guidance on controlling costs, see the guide to working cost-effectively on the Research Analysis Platform.
Setting up a billing account is the same on the Research Analysis Platform as on the DNAnexus Platform. See the DNAnexus documentation for details.
Click the Invite New Member button. Complete the form in the dialog to add a new member to your organization. More about project access levels can be found in the DNAnexus documentation.
In the Access section of the form, select your preferred options in the Project Transfer Access and Member List Access fields. For more information on the available options, see the DNAnexus Platform documentation on organizations.
Click the Invite New Member button. Complete the form in the dialog to add a new member to your organization. More about project access can be found in the DNAnexus documentation.
When you first set up a billing account, your account spending limit is set to £250.
All Platform rates are quoted in British pounds (£). See the Platform rate card for detailed information on rates for data storage, data egress, and use of compute resources.
To set up DNAnexus as a vendor, contact the DNAnexus team.
As detailed in this guide, when your analysis is particularly complex, submit jobs in small batches, to check for errors and ensure that you’re achieving the balance you’re trying to strike between speed and cost-effectiveness.
Smart Reuse is a feature that can enable significant cost savings. Smart Reuse enables the testing of complex workflows in a maximally resource-efficient fashion, before they’re run in a production environment. For a full description of Smart Reuse and how to use it, refer to the DNAnexus Platform documentation.
When first running a workflow, one useful approach is to select an instance that meets your cost standard, then set a limit on the job, to prevent it from running too long and thus incurring too high a usage charge.
See the available instance types and how to choose the right one for your purposes.
Consult the rate card for details on rates for using different types of instances.
In some situations, you might prefer to egress data from the Platform, then analyze it on your computer or local cluster. But be aware of the associated egress charges. It’s almost always more cost-efficient to use the Platform for all data processing, relying on local resources only for post-processing.
When creating an app or a global workflow for use on the Platform, you can set a cost limit, to ensure that running the app or workflow does not incur charges above a set amount. See the DNAnexus Platform documentation for details on how to set such cost limits.
Shared billing can be enabled by a user adding others to his or her “wallet,” i.e. personal billing account, or by setting up a new organization with billing and adding users to it.
Spending limits can be used to control spending by users on a common billing account. See the section on spending limits, including the default limit, how to raise it as needed, and the usage limitations that follow when an organization exceeds its limit.
app provides access to the following tools
provides access to the following tools
and apps provide access to the following R packages
app provides access to
app and app provide access to
app with IMAGE_PROCESSING feature provides access to
app with ML feature provides access to
app provides access to
fetches and uploads a file from a remote URL to the platform
transfers files from S3 bucket to the platform
provides a web-based terminal to a platform virtual machine
provides an ssh-accessible workstation
exports a specific entity or the cohort table to CSV
ingests a data file and creates a new, superset Dataset
moves cohorts from one Dataset to another in your application
February 1 2023
Data not exported
Warning: Out of memory
Try adjusting the instance type to use one with more memory/storage and re-run your Table Exporter query. Alternatively, you could try using the dx extract_dataset command within Spark JupyterLab; a sketch is shown below.
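A hypothetical invocation is sketched below; the record ID and field names are placeholders.

```bash
# Extract a few participant fields from the dispensed dataset into a CSV file.
dx extract_dataset record-XXXX \
  --fields "participant.eid,participant.p31,participant.p21022" \
  -o extracted_fields.csv
```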
Invalid characters found in field names on line number(s) 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13,...
Check that you provided your inputs correctly following the documentation.
Note: If you don’t provide an entity value, then by default Table exporter will use the “Participant” entity table
Failed to export data: An error occurred while calling o305.csv. : org.apache.spark.SparkException: Job aborted
Make sure you specified the entity to use
Export participant id (EID)
By default the participant identifier (EID) is no longer extracted.
In the Table Exporter app you’ll need to add “eid” to the File containing Field Names
parameter, as well as specify the entity
parameter in the Advanced Options. Entity refers to the entity table from which we are extracting data - e.g. “participant” or “olink_instance_0”.
Alternatively, if using dx extract_dataset
command, then you’ll need to specify <entity>.eid as one of the field names in your query. See example.
Are there spaces in the input argument "output" (Example: "physical activity")?
table_exporter.py: error: unrecognized arguments: activity
Remove spaces or replace them with underscores. Example: “physical_activity”
Is there a file containing a list of field names for the proteomics dataset?
All protein fields can be found here.
Failed job
Error while running the command (please refer to the job log for more information)
See troubleshooting guide to understand the problem
Error while mounting project-GKv25k0Jv6jzF4Gz2z0pxzzJ in /mnt/project (please refer to the job log for more information)
Error opening file: Bulk/Exome\sequences/Exome\OQ
Check if file path works using dx describe /project-xxx/file-xxx
or dx describe <filename>
(with a relative path)
Check that there are no typos in your file path
Issue with instance type
The machine running the job was terminated by the cloud provider", "try": 0
The spot instance that you’re running your job on may become unavailable in the middle of your execution during a period of increased demand for cloud computing capacity, which means that you will lose the work done on this instance and will have to restart the job.
To avoid this you have a few options:
1) Add SpotInstanceInterruption to your application restart-on policy. That will automatically restart the job in case of spot instance interruption. See documentation for more information.
2) Use the "High" priority settings in order to run your jobs on-demand and thus avoid SpotInstanceInterruption errors entirely, but these instances have a higher price.
3) Restart your job on a spot instance manually
Warning: Low disk space during this job
You need to update your instance type selection to make sure you have enough memory or storage (see the “Memory (GiB)” or “Storage (GiB)” columns in the rate card).
Cannot find input files or directory
No such file or directory
There are 2 ways to specify input files (see tool documentation):
Select inputs from the drop down menu
Specify input paths in the command prompt using mounting - i.e. add “/mnt/project/” as a prefix to file path
How can I run my command across all chromosomes?
Use the following bash script if data is already separated per chromosome:
for chr in {1..22}; do \
dx run app-swiss-army-knife --instance-type mem1_ssd1_v2_x8 -y \
-iin="project-xxxx:/path/to/file/ukb#####_c${chr}_b#_v#.bgen" \
… ;
done
If you need to separate by chromosome you can use the plink or bcftools command in swiss-army knife.
Table name | Description |
| These tables contain the main UK Biobank participant data. Each participant is represented as one row, and each data-field is represented as one or more columns. For scalability reasons, the data-fields are horizontally split across multiple tables, starting from table participant_0001 (which contains the first few hundred columns for all participants), followed by participant_0002 (which contains the next few hundred columns), etc. The exact number of tables depends on how many data-fields your application is approved for. |
| Hospitalization records. This table is only included if your application is approved for data-field #41259. |
| Hospital critical care records. This table is only included if your application is approved for data-field #41290. |
| Hospital delivery records. This table is only included if your application is approved for data-field #41264. |
| Hospital diagnosis records. This table is only included if your application is approved for data-field #41234. |
| Hospital maternity records. This table is only included if your application is approved for data-field #41261. |
| Hospital operation records. This table is only included if your application is approved for data-field #41149. |
| Hospital psychiatric records. This table is only included if your application is approved for data-field #41289. |
| Death records. This table is only included if your application is approved for data-field #40023. |
| Death cause records. This table is only included if your application is approved for data-field #40023. |
| GP clinical event records. This table is only included if your application is approved for data-field #42040. |
| GP registration records. This table is only included if your application is approved for data-field #42038. |
| GP prescription records. This table is only included if your application is approved for data-field #42039. |
| GP clinical event records (COVID TPP). This table is only included if your application is approved for data-field #40101. |
| GP prescription records (COVID TPP). This table is only included if your application is approved for data-field #40102. |
| GP clinical event records (COVID EMIS). This table is only included if your application is approved for data-field #40103. |
| GP prescription records (COVID EMIS). This table is only included if your application is approved for data-field #40104. |
| COVID19 Test Result Record (England). This table is only included if your application is approved for data-field #40100. |
| COVID19 Test Result Record (Scotland). This table is only included if your application is approved for data-field #40100. |
| COVID19 Test Result Record (Wales). This table is only included if your application is approved for data-field #40100. |
| COVID-19 vaccination data. This table is only included if your application is approved for data-field #32040. |
| Olink NPX values for the instance 0 visit. This table is only included if your application is approved for data-field #30900. For scalability reasons, the protein columns are horizontally split across multiple tables, starting from table |
| Olink NPX values for the instance 2 visit. This table is only included if your application is approved for data-field #30900. |
| Olink NPX values for the instance 3 visit. This table is only included if your application is approved for data-field #30900. |
| OMOP Condition Era. This table is only included if your application is approved for data-field #20142. |
| OMOP Condition Occurrence. This table is only included if your application is approved for data-field #20142. |
| OMOP Death. This table is only included if your application is approved for data-field #20142. |
| OMOP Device Exposure. This table is only included if your application is approved for data-field #20142. |
| OMOP Dose Era. This table is only included if your application is approved for data-field #20142. |
| OMOP Drug Era. This table is only included if your application is approved for data-field #20142. |
| OMOP Drug Exposure. This table is only included if your application is approved for data-field #20142. |
| OMOP Measurement. This table is only included if your application is approved for data-field #20142. |
| OMOP Note. This table is only included if your application is approved for data-field #20142. |
| OMOP Observation. This table is only included if your application is approved for data-field #20142. |
| OMOP Observation Period. This table is only included if your application is approved for data-field #20142. |
| OMOP Person. This table is only included if your application is approved for data-field #20142. |
| OMOP Procedure Occurrence. This table is only included if your application is approved for data-field #20142. |
| OMOP Specimen. This table is only included if your application is approved for data-field #20142. |
| OMOP Visit Detail. This table is only included if your application is approved for data-field #20142. |
| OMOP Visit Occurrence. This table is only included if your application is approved for data-field #20142. |
Learn how to prepare pVCF files and return them to UK Biobank.
Researchers using UK Biobank data are obliged to return their research results to UK Biobank, in keeping with the terms detailed here. To return your research results, share the project containing the results with UK Biobank staff as per these instructions.
When users access a pVCF file on the Research Analysis Platform, each sample has an identifier that is unique to the version of the file dispensed to projects linked to a particular UK Biobank access application. Giving samples unique, access-application specific identifiers helps ensure the anonymity of UK Biobank participants. See this presentation for more on this technique.
Before returning pVCF files to UK Biobank, researchers must format file headers in such a way as to support the use of this type of identifier. The portion of each file's header referring to samples must be formatted as zero-level-compression bgzip blocks.
To do this, run each pVCF file through the Bash script included below. The script takes a VCF file as input, then uses bcftools
, bgzip
, and tabix
to modify the relevant part of the header. The script then outputs a VCF and a TBI index file. It also prints the byte coordinates of the zero-level-compressed blocks to stdout
.
The output VCF file, when uncompressed, will be identical to the input file. So zcat input.vcf.gz | md5sum
will return identical results to zcat input.repackaged.vcf.gz | md5sum
.
When returning pVCF files to UK Biobank, include all of the following:
The repackaged VCF files, processed as per the instructions above
The accompanying TBI files
The byte-level coordinates of the zero-level-compressed blocks, as printed to stdout
, when processing the original VCF files. These coordinates are required for validation.
This FAQ addresses what's included with Standard Support on UKB-RAP, what's included with purchases of a UKB-RAP Service Package and how DNAnexus handles more complex queries.
Email support (ukbiobank-support@dnanexus.com) is available for users to submit questions about billing and administration issues, to report platform performance issues, or to report bugs in the DNAnexus-provided tools. You can also learn more in our blog post announcing the new model.
UK Biobank’s Community is also there to answer any questions you might have about accessing data, using it, navigating and working on the UK Biobank Research Analysis Platform (UKB-RAP).
This free site contains articles, guides, webinars, and videos to help you – along with a Forum where you and other researchers can ask and answer questions. The search function gathers answers from all these sources to help you.
Additionally, all DNAnexus tutorials and webinars on YouTube, as well as the documentation, will remain available for users to access.
Users can now purchase service packages that provide 1:1 expert guidance from DNAnexus, answering questions ranging from troubleshooting custom applets to scientific guidance on using the UKB-RAP. Note that complex questions may require more than one ticket and that tickets expire 12 months from purchase. The DNAnexus team will assess the complexity of your question(s) and advise how many tickets will be required to help.
It should also be noted that some requests, for example where you wish DNAnexus to run an analytical pipeline on your behalf, would require DNAnexus to become a Third-Party Processor[1] on your research application. DNAnexus would be open to supporting you in this capacity and, if this is of interest, please get in touch with us directly at ukbiobankrap@dnanexus.com. DNAnexus will work with you to ensure compliance with UK Biobank policies.
With the new service package, the DNAnexus team will walk you through solving your more complex bioinformatics queries in the UKB-RAP, equipping you with the right understanding to succeed.
Our DNAnexus team will work directly with users, to understand their objectives and develop a plan to achieve early wins ensuring their success using UKB-RAP.
Learn first-hand from our team’s deep experience working with UKB data. Learn best practices for working with large multi-omics sets and complex data structures.
Receive answers to your specific questions on research topics such as genetic association, clinical data, imaging analysis, multimodal data analysis, integrated analysis or machine learning.
Please fill in the form here. You will then be contacted by our DNAnexus team to understand your needs and help direct you to the most appropriate package. Once the form is completed and submitted, our DNAnexus support team will be able to answer your questions.
You can find the pricing documentation here.
Service packages come in different bundles of service tickets: 5, 20, 50 and 100. You can choose a smaller bundle if you’re unsure of how many service tickets you will actually need or purchase a larger one and save the service tickets for future inquiries. Service Packages are valid for 12 months after purchase.
If your support inquiry requires DNAnexus to run a pipeline or test a pipeline within your workspace, then you will be required to name us as a 3rd Party Data Processor in your annual report
Examples of questions that require DNAnexus to be a 3rd Party Processor:
Set-up, optimization, or execution of pipelines where access to the data is required
Running of scripts to convert standards or provide mapping
Set-up and hands-on training of tooling/pipelines within the customer’s project
Yes. DNAnexus can advise on the best way to approach analysis in the UKB-RAP, help build pipelines or help users optimize their current workflow processes. Note that if you are interested in having DNAnexus experts directly access and/or process UK Biobank data, please reach out to ukbiobankrap@dnanexus.com.
DNAnexus will ensure that your inquiry is resolved satisfactorily and in a timely fashion.
[1] For more information on guidelines for Third Party Processors, please reference the UK Biobank MTA.
The UK Biobank Research Analysis Platform (UKB RAP) hosts a wide array of biomedical data sampled from hundreds of thousands of individuals across many years, and contains varied types of data ranging from MRI imaging to accelerometer measures. The platform provides the opportunity for researchers to conduct analyses on an increasingly large scale in varied ways (e.g QC-ing the sequencing data, performing whole genome variant calling, or genotyping a particular gene). However, processing data at this magnitude presents RAP researchers with multiple challenges, including how to:
encapsulate the analysis algorithm so it runs efficiently on the platform
break up the processing of the large data sets into parallel jobs
submit and monitor multiple job executions
identify and resubmit failed jobs.
In this guide, we will go over an example of how to perform HLA typing on 200K exome samples on the UKB RAP platform in a cost-efficient way. We will then provide guidelines for extending the techniques used in the example to other types of analyses that users may use.
This guide assumes the user has:
familiarity with the DNAnexus command line interface and UI features
the ability to write simple Bash and Python scripts
a high-level understanding of concepts behind DNAnexus applets and the Workflow Description Language (WDL).
In our example, we'll perform HLA typing on 200K exome samples on the UKB RAP platform. The HLA (human leukocyte antigen) complex is one of the most diverse gene complexes found in humans and plays a central role in human immunity. Mutations in this complex may be linked to autoimmune disorders. Researchers are often interested in identifying mutations in this complex as they can be used to learn more about treatment for various autoimmune conditions like type I diabetes or rheumatoid arthritis.
For this tutorial, the location of the files we need are as follows:
The inputs to our HLA typing analysis are
1: The 200K read-mapped samples in a UKB RAP-dispensed project with access to UKB Data-Field 23153, which is found in the folder containing the exome OQFE CRAM files.
2: The reference genome that can be fetched from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
using the url_fetcher app and stored in genome_reference
folder in your project.
Instance types
HLA typing runs independently on each sample, and each sample corresponds to one individual. Doing HLA typing on 1 exome sample takes 11 minutes on a mem1_ssd1_v2_x2 instance.
Outputs
We will store the output (HLA type in a file with .genotype.json
extension and HLA expression level in a file with .gene.json
extension) of the analysis in the "/HLA_process" folder of the RAP-dispensed project.
Since the UK Biobank contains so many samples, the naive approach of running one job for each of the 200K samples is inefficient, because of the overhead of submitting, scheduling, and managing 200,000 jobs. We therefore suggest reducing the total number of jobs by processing a batch of 100 samples in each job.
We recommend structuring the computation in such a way that the runtime of each job is less than a day, to decrease the chances of job failure due to spot termination. In the example below, this is achieved by using mem1_ssd1_v2_x2 instances to process a batch of 100 samples in about 19 hours.
Here is a brief overview of the steps to our analysis. Later in this tutorial, we will describe each step in greater detail:
Prepare the applet:
Package the HLA analysis tools into a Docker image and upload the image to RAP.
Create an applet using WDL (see documentation here) that performs HLA typing using the Docker image from the previous step. The applet takes an array of exome sample files as input.
Compile the WDL task using dxCompiler to a DNAnexus applet.
Generate job submission script:
Efficiently fetch the 200K input file names from RAP.
Create job submissions where each job processes 100 samples.
Submit and monitor jobs:
Run one job to make sure there are no errors in the code.
Run the rest of the jobs.
Monitor job execution.
Resubmit any jobs that failed due to external factors such as spot instance termination.
First, launch the cloud_workstation app to leverage cloud-scale network bandwidth when uploading large Docker images to DNAnexus, since data moves from AWS EC2 instances to AWS S3 buckets in the same region. Make sure that you have the Docker command line tool pre-installed. The command is:
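One way to launch the workstation is sketched below; adjust the instance type and session options to your needs.

```bash
# Launch a cloud workstation and connect to it over SSH.
dx run app-cloud_workstation --ssh -y
```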
At the cloud_workstation prompt, create a docker folder containing a Dockerfile with the content available for download in this GitHub repository, which describes the installation of the samtools, bedtools, kallisto, and arcasHLA tools in the Docker image.
Build the Docker image:
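For example (the image name arcas-hla is a placeholder; the docker folder is the one created above):

```bash
docker build -t arcas-hla:latest docker/
```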
Save Docker image as a tarball:
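For example, continuing with the placeholder image name:

```bash
docker save arcas-hla:latest | gzip > arcas-hla.tar.gz
```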
Upload Docker image to the parent project:
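For example. Inside a job such as cloud_workstation, the parent project can be addressed via the DX_PROJECT_CONTEXT_ID environment variable; the /docker destination folder is a placeholder.

```bash
dx upload arcas-hla.tar.gz --path "$DX_PROJECT_CONTEXT_ID:/docker/arcas-hla.tar.gz"
```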
We recommend encapsulating analysis tools in a Docker image to preserve reproducibility. We also recommend storing Docker images on the platform instead of external Docker registries such as Docker Hub and Quay.io, for better reliability.
On your local computer, create an applet using Workflow Description Language (WDL) that executes using the Docker image from the previous step. The applet takes an array of sample files as input. Below is the code for our applet:
Line 18: Remove intermediate outputs after each sample is processed in a batch for greater storage efficiency.
Line 27: Public docker registries have a daily pull limit. We save the Docker image on RAP for better reliability.
Line 28: Use the appropriate timeout policy based on the expected runtime of your job to ensure job costs remain under control on the off chance that your job hangs. While rare, running at scale on AWS virtual machines increases the chances that at least one job will need to time out and be restarted.
Line 35: Streaming allows the system to avoid downloading the entire batch of inputs, and instead streams each input when it is read in the samtools view
line.
Install dxCompiler on your local machine following the installation instructions.
Compile WDL code into a DNAnexus applet using dxCompiler:
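For example (the WDL file name, project ID, and destination folder are placeholders; point to the versioned dxCompiler jar you downloaded):

```bash
java -jar dxCompiler.jar compile hla_typing.wdl \
  -project project-XXXX \
  -folder /applets/
```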
Note that by default, compiled DNAnexus applets are configured to auto-restart on transient failures, as documented in the restartOn field of the executionPolicy argument in the DNAnexus documentation. You can adjust the restart policy by providing an extras.json input to dxCompiler, as shown in the dxCompiler documentation.
To sort your input file list by name, use the standard bash sort command:
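For example (file names are placeholders; the list of CRAM file paths is the one gathered for the 200K samples):

```bash
sort cram_paths.txt > cram_paths.sorted.txt
```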
This command sorts files by the full path of the file.
If you need other information that is not present with default dx find data
, you can use --json
with dx find data
and use jq
to extract the fields you need.
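An illustrative sketch is shown below; the folder path and extracted fields are examples, and the exact JSON structure of the dx find data --json output may vary slightly between dx-toolkit versions.

```bash
# Emit file ID and file name as tab-separated values.
dx find data --name "*.cram" --folder "/Bulk/Exome sequences/" --json \
  | jq -r '.[] | [.id, .describe.name] | @tsv'
```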
We create job submission commands using the following script (the latest version can be found here):
Line 30: Store output for each batch (from 100 samples in this case) in a dedicated folder. This will avoid the problem of creating too many files in a specific directory and make tracking errors easier when some jobs produce unexpected numbers of output files.
Line 38-40: We tag each job with 3 tags:
200K_exome_HLA_analysis
represents the name of study and will help us distinguish jobs from this analysis from other work you may be doing in the same project.
original
indicates that this is the first (original) attempt at running a job. Subsequent reruns of failed jobs will be tagged with rerun{rerun_attempt}.
batch_n_{batch_number}
records a particular batch of 100 jobs.
These tags illustrate the use of execution metadata to help track the progress of your analysis, identify which studies had all their jobs complete successfully, and restart any failed jobs. Metadata consisting of tags and properties can be associated with DNAnexus objects such as files and executions and is documented here.
The command below shows the dx run invocation for the first job, then submits it:
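A sketch, assuming the generated commands were written to a file named submission_commands.sh:
head -n 1 submission_commands.sh            # inspect the first dx run command
head -n 1 submission_commands.sh | bash     # submit it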
Monitor the rest of the jobs with dx watch.
We recommend submitting jobs gradually, rather than all at once. Submit the first job and see if it produces the expected output in the right location. After that, submit another 500 jobs and see if the variation of running time and cost among these jobs is within the expected range before submitting the rest of your jobs.
The code below creates a new list of submission commands by removing the previously launched first submission, then splits the remaining 1999 submissions into batches of 500:
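A sketch of that step with standard shell tools, under the same submission_commands.sh assumption:
tail -n +2 submission_commands.sh > remaining_commands.sh   # drop the already-submitted first command
split -l 500 remaining_commands.sh batch_                   # write batches of 500 commands: batch_aa, batch_ab, ...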
We can monitor the execution of the 200K_exome_HLA_analysis analysis using the dx command-line tool to search for jobs tagged with 200K_exome_HLA_analysis and display only the last n jobs that we've submitted:
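For example, to list the 10 most recently created jobs from this analysis (a sketch):
dx find jobs --tag 200K_exome_HLA_analysis -n 10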
Similarly, you can view the jobs corresponding to your analysis in the web browser UI by filtering on the 200K_exome_HLA_analysis tag value from the Monitor page in your project.
If you decide not to use a retry policy, some jobs may occasionally fail due to sample-specific issues or external factors such as spot instance termination or other intermittent system errors. You can find failed jobs by setting the job state filter to failed in the Monitor tab of the web browser UI, or by using the dx command line tool as shown below:
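For example:
dx find jobs --tag 200K_exome_HLA_analysis --state failed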
After fixing the issues associated with a particular failed job, resubmit the job using a distinguishable tag so you can track which batch has already been analyzed. For example, if the original jobs/analyses have the tags 200K_exome_HLA_analysis, batch_n_0, original, you may resubmit the job using --name 200K_exome_HLA_analysis --tag batch_n_0 --tag rerun1.
To retry a job that failed due to an intermittent system error such as spot instance termination or a network connectivity problem, you can use:
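One option (a sketch; job-xxxx stands for the failed job's ID) is to clone the failed job's executable, inputs and settings with dx run --clone:
dx run --clone job-xxxx --tag 200K_exome_HLA_analysis --tag batch_n_0 --tag rerun1 -y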
If you had to fix your analysis code and want to rerun a failed job with a new applet, you can use the following:
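A sketch, assuming the fixed applet is named hla_typing_applet_v2; passing an executable together with --clone reuses the original job's inputs and settings but runs the new executable:
dx run hla_typing_applet_v2 --clone job-xxxx --tag 200K_exome_HLA_analysis --tag batch_n_0 --tag rerun1 -y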
To guard against an internet disconnection when submitting a large batch of jobs, you can upload the submission file to the project and use Swiss Army Knife to submit. In that case, make sure you use --detach for each job; otherwise each (sub)job inherits its priority from the main Swiss Army Knife job, so all of those batch jobs might run on-demand. Another option is to use --head-job-on-demand to request that the head job of an app or applet runs on an on-demand instance, which is an especially good option for workflows. Note that the --head-job-on-demand option will override the --priority setting for the head job.
Before developing your analysis, define what each “unit” of independent work consists of so you can break your overall analysis down into multiple smaller sections for parallel processing. For example, for the HLA typing example or for variant calling, each individual sample can be considered as one independent unit. For joining variant calling, your "unit" might consist of a small genomic region for all samples.
Plan for your batch run using an end-to-end approach that covers naming of the analysis, preparing and submitting jobs, and organizing output files.
It is good practice to use a human readable name like <sample ID>.<type of file or processing>.<file format extension> for ease of reviewing or troubleshooting your work. For example, in HLA typing, we name the output 12345_6789_0.genotype.json, which represents <sample ID>.<type of file>.<file format extension>.
Keep the number of files per folder under 10,000 to make viewing and querying more efficient.
To analyze hundreds of thousands of units, analyze multiple units in a single job to reduce the total number of jobs to submit and manage. To limit the impact of spot instance termination, we recommend limiting the runtime of each job to about a day by selecting an appropriate instance type. Executing jobs on larger instances with more CPUs can be used to decrease job execution time.
A large number of jobs is hard to manage or modify. If you have more than 5,000 inputs to analyze, consider combining multiple inputs per job or scaling up your job submission gradually. If you have solid control over the input data and the gradual submission process, you do not need to group inputs.
Encapsulate your analysis tools in a Docker image for better reproducibility. A Docker image includes an operating system version and specific versions of your tools and their dependencies. Specify an explicit (rather than latest or default) version of external tools such as samtools or bamtools in the Dockerfile so the Docker image can be recreated reproducibly from it. Store Docker images on the platform for use in DNAnexus apps instead of in external Docker registries such as Docker Hub and Quay.io, for better reliability and to avoid the pull limits imposed by public Docker registries.
Optimize applet execution
Use available CPUs: In the HLA example, each execution of the applet processed 100 samples. It is important that each applet execution uses the available CPUs efficiently, as the applet will be executed 2,000 times. During the applet's execution, the 100 input samples were analyzed serially, as shown in lines 13-19 of the WDL_APPLET code snippet above (i.e. the second sample was analyzed after the analysis of the first sample finished). Since samtools and arcasHLA are multi-threaded, the sequential processing of samples still resulted in high CPU utilization. If your analysis tools are not multi-threaded, you may consider processing multiple units in parallel (e.g. using xargs, as sketched below) for better CPU utilization.
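A sketch of parallelizing a single-threaded per-sample command with xargs; process_one_sample.sh and sample_list.txt are hypothetical names:
# run up to one process per available CPU core
cat sample_list.txt | xargs -P "$(nproc)" -I {} bash process_one_sample.sh {}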
Manage disk space: If your app processes units sequentially and can stream the input, you can avoid downloading all the inputs at the start of the applet's execution by using the "stream" WDL input option, as shown on line 35 of the WDL_APPLET code snippet above. We also recommend removing unnecessary intermediate files to save disk space, as shown in line 18 of WDL_APPLET.
The outputs produced by the applet in the HLA example above required little disk space, so preserving all 100 outputs until the end of the job did not exhaust the disk space on the instance running the job. If your analysis produces large outputs, you can either select an instance with lots of disk space (such as mem3_ssd3 family), or implement your processing step using a native DNAnexus applet instead of WDL that allows for more control over the upload of output files from the worker to the DNAnexus platform.
Select an instance type that balances the CPU, memory, and disk requirements of your analysis. For example, if your analysis requires 2 GB of memory per core you can use the mem1 instance family, while a requirement of 7 GB of memory per core calls for the mem3 instance family.
For more general considerations for large-scale data analysis, please refer to the peer-reviewed publication “Ten Simple Rules for Large-scale Data Processing” published in PLOS Computational Biology.
If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
The code provided in the following tutorial is delivered "As-Is." Notwithstanding anything to the contrary, DNAnexus will have no warranty, support or other obligations with respect to Materials provided hereunder. The MIT License applies to this tutorial.
Note that all runtime estimates and configurations given in this tutorial are not a guarantee; they are provided only as a reference.
This research project provides an example of performing end-to-end genomic target discovery on UK Biobank data using ischaemic heart disease as an example phenotype. For this tutorial, ischaemic heart disease was chosen because of the large cohort size and because of the existence of previous GWAS studies with this phenotype, allowing for simpler comparison of results. The first step in this analysis was to create the case and control cohorts. After that, sample data (phenotypic data) from cohorts and genomic data (array and imputed) were cleaned. Then a genome-wide association studies (GWAS) analysis was performed and significant GWAS variants were aggregated using the linkage disequilibrium (LD) clumping approach. Lastly, a phenome-wide association study (PheWAS) was performed for each variant.
The original analysis and guide were created by Anastazie Sedlakova, and there is a corresponding YouTube tutorial here.
This tutorial demonstrates how to:
Create control and cases cohorts
Perform sample QC in JupyterLab
Lift over array data by compiling and running a WDL workflow
Perform quality control of array data using Swiss Army Knife
Perform quality control of imputed data by compiling and running WDL workflow
Run GWAS analysis using REGENIE app
Perform LD clumping on imputed data in JupyterLab
Extract variant allele counts from imputed data in JupyterLab
Run PheWAS analysis in JupyterLab
Create a phenome dataset from an ICD10 field
This analysis was done as part of the UKB 46926 research application.
For this GWAS, two types of genomic data were used: array data (field 22418) and imputed data (Genomics England) (field 21008). For linkage disequilibrium (LD) clumping, only imputed data were used.
First, ischaemic heart disease was chosen as a phenotype of interest (ICD 10 code I20-I25). The following fields were also retrieved for the sample QC:
31 - Sex
22001 - Genetic sex
22006 - Genetic ethnic grouping
22019 - Sex chromosome aneuploidy
Covariates were added both to GWAS and PheWAS analysis to increase power and reduce confounding. Here is the list of covariates used:
31 - Sex
2966 - Age high blood pressure diagnosed (not used in PheWAS)
21022 - Age at recruitment
23104 - Body mass index (BMI)
20160 - Ever smoked
30760 - HDL cholesterol (not used in PheWAS)
30780 - LDL direct (not used in PheWAS)
22009 - Genetic principal components
The phenotype (ischaemic heart disease) being used in this analysis and other phenotypes for PheWAS testing were taken from the 41270 - Diagnoses - ICD10 field.
Once the cohorts are created they can be used for subsequent analysis (GWAS). In the present example, the phenotype definition is simple, but often the phenotype definition is quite complex and can include multiple fields. Defined cohorts will be combined in the QC step.
First, the Cohort Browser is used to create the selected cohort. The "ischemic_cases" cohort was created by selecting participants that have I20-I25 Ischaemic heart disease in the field 41270. The "ischemic_controls" cohort is created by using “Cohorts compare: not” in "ischemic_cases". This results in a sample of 57,383 "ischemic_cases" and 445,027 "ischemic_controls".
On RAP, click on the dataset you wish to use from the Manage tab. Clicking on the dataset name, as shown in the image below, will take you to the Cohort Browser page.
Then, click the Add Filter button. Type in “ischaemic heart disease” into the search bar and select “Diagnoses - ICD10”, which has data-field 41270. Confirm your selection by clicking on Add Cohort Filter.
In the modal prompt for “Includes any of”, type “I20-I25 Ischaemic heart disease” and select that option. Then click the Apply Filter button.
Name the cohort. In this example, the name is “ischemic_cases” (Note: The example cohort name uses the American spelling, however mentions of the specific UKB data-field will use the UK spelling).
To create the control group, click on the “+” (mouse-over: “Compare/combine cohort”) button.
From the drop down menu select the “not in ischemic_cases” option and apply this filter.
Name and save this cohort. In this example, the "ischemic_cases" cohort had 57,383 samples and the "ischemic_controls" cohort had 445,027 samples.
For additional information about creating cohorts, see this detailed tutorial on how to explore data in Cohort Browser.
Cleaning samples will decrease noise in the data and increase the accuracy of the GWAS analysis results. For example, checking for sex discordance and sex chromosome aneuploidy removes possible sample swaps and genotyping errors. Population substructure is minimized by selecting just one population (White British). To check phenotype-genotype associations, only samples from non-related participants (those used to calculate the PCA) were selected.
The code for the sample QC can be found in gwas-phenotype-samples-qc.ipynb. Phenotypic data was retrieved for each cohort using dx extract_dataset as a table. Then, samples were selected using the following criteria:
Sex and genetic sex are the same
Participant has White British ancestry
Not sex chromosome aneuploidy
Participant was used to calculate PCA (only non-relatives were included)
After filtering, 38,197 and 298,886 samples remained in "ischemic_cases" and "ischemic_controls", respectively.
After QC, covariate variables were then imputed as part of the GWAS process. The covariates and a summary of what the above notebook code does to clean the data are listed below:
Age high blood pressure diagnosed in participant (2966) field
The code combines all instances and creates the hypertension boolean variable
Participant ever smoked
The code sets all missing values to 0
Participant body mass index (BMI), HDL cholesterol, direct LDL
The code replaces missing values with the mean for that variable
Phenotypic data was then merged with the list of samples for which imputed data is available. The resulting table contained 38,197 cases and 298,886 controls.
The next step is to use JupyterLab to run the GWAS step. The JupyterLab configuration used when running this example GWAS is listed below, however, individual user preferences may cause costs and runtime to vary.
Cluster configuration: Single node
Recommended instance: mem1_ssd1_v2_x36
Run time: Approximately 5 minutes
On your local machine, clone the repository for shared UKB Jupyter notebooks here.
Upload the Jupyter notebook to your project directory on RAP: dx upload gwas-phenotype-samples-qc.ipynb --destination <directory path>
On RAP, launch JupyterLab. The recommended configuration is listed below:
Once JupyterLab is open, click on the DNAnexus tab on the left hand side, and open the gwas-phenotype-samples-qc.ipynb
notebook.
Once the notebook is open, go to the JupyterLab menu located at the top of the page, click the Run tab and click Run All Cells.
Array data that is provided by UKB is mapped to the older version of the reference genome (GRCh37). However, WES and WGS data that is released is mapped onto the current version of the reference genome, GRCh38. In order to perform association testing with the sequencing data, we need to ensure the data is mapped to the current version of the reference genome first using the liftOver script.
For this step, the liftOver WDL script created by Yih-Chii Hwang was used. As a result of this script, data was lifted to the newer reference genome and all chromosomes were merged. LiftOver was performed using Picard LiftoverVcf. This step took approximately 34 hours.
On your local machine’s terminal, install Java:
(Mac OS): brew install openjdk
(Linux OS): apt install default-jre
Download the latest JAR compiler here.
Login to the Platform using the dx login command. Then, use the following command to compile the workflow and make it available for use on the Platform: java -jar dxCompiler-2.10.8.jar compile liftover_plink_beds.wdl -project project-xxxx -folder <directory path>
On RAP you will now find:
A workflow called liftover_plink_beds
Applets created that correspond to the steps of the workflow
In the UI, launch the liftover_plink_beds
workflow and use the following input parameters:
Common parameters to specify:
plink_beds: /Bulk/Genotype Results/Genotype calls/*.bed
(22 files)
plink_bims: /Bulk/Genotype Results/Genotype calls/*.bim
(22 files)
plink_fams: /Bulk/Genotype Results/Genotype calls/*.fam
(22 files)
reference_fastagz: /Bulk/Exome sequences/Exome OQFE CRAM files/helper_files/GRCh38_full_analysis_set_plus_decoy_hla.fa
ucsc_chain: b37ToHg38.over.chain
See a detailed tutorial on how to work with WDL on DNAnexus.
Cleaning the genotypic data, as in the case of sample QC, will reduce noise and increase the accuracy of the GWAS analysis results. In this step we also check for deviation from Hardy-Weinberg equilibrium in order to detect genotyping errors. To clean the data, variants and samples with a high missing rate are excluded. The GWAS will be performed on autosomes only, so variants on the sex chromosomes are excluded.
Code for array data QC is in the run_array_qc.sh script. For QC filtering, array data (field 22418) was used. Variants in array data were filtered with the Swiss Army Knife (SAK) app using following options:
plink2 --bfile ukb_c1-22_merged --keep ischemia_df.phe --autosome --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-15 --mind 0.1 --write-snplist --write-samples --no-id-header --out imputed_array_snps_qc_pass
Below are criteria used for the filtering:
Use only samples that are contained in the ischemia_df.phe file: --keep ischemia_df.phe
Keep only variants located on autosomes: --autosome
Minor allele frequency is greater than 0.01: --maf 0.01
Minor allele count is greater than 100: --mac 100
Missing call rate for variant is not exceeding 0.1: --geno 0.1
Missing call rate for sample is not exceeding 0.1: --mind 0.1
Hardy-Weinberg equilibrium exact test p-value for the variant is greater than 1e-15: --hwe 1e-15
This is due to the fact that serious genotyping errors often yield extreme p-values.
For more information on filtering, see PLINK2 documentation.
Priority: Normal
Recommended instance: mem1_ssd1_v2_x36
Run time: Approximately 10 minutes
For the CLI, use the command below:
On your local machine run: sh run_array_qc.sh
In the UI:
After logging into the Research Analysis Platform, navigate to the Tools Library.
Select the “Swiss Army Knife” app and click Run to run it in your desired project.
Copy and paste the following into the command prompt:
plink2 --bfile “/mnt/project/<file path>/ukb_c1-22_merged” --keep “/mnt/project/<file path>/ischemia_df.phe” --autosome --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-15 --mind 0.1 --write-snplist --write-samples --no-id-header --out imputed_array_snps_qc_pass
The code for the imputed data QC is in bgens_qc.wdl script created by Yih-Chii Hwang. For QC filtering, imputation from genotype (GEL) (field 21008) was used. Variants were filtered using following options: --mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1
Below are the criteria used for the filtering:
Use only samples that are contained in the ischemia_df.phe file: --keep ischemia_df.phe
Minor allele frequency is greater than 0.0001: --maf 0.0001
Minor allele count is greater than 10: --mac 10
Missing call rate for variant is not exceeding 0.1: --geno 0.1
Missing call rate for sample is not exceeding 0.1: --mind 0.1
Hardy-Weinberg equilibrium exact test p-value for the variant is greater than 1e-15: --hwe 1e-15
Deviations from Hardy-Weinberg can often indicate genotyping error.
For more details on filtering, see PLINK2 documentation. This script ran for approximately 10 hours when normal priority was used.
The installation of Java is needed to run dxCompiler. See above for Java and dxCompiler download instructions.
Log in to the Platform. Then, compile the workflow to make it available on DNAnexus: java -jar dxCompiler-2.10.8.jar compile bgens_qc.wdl -project project-xxxx -folder <directory path>
On RAP, you’ll now find
A workflow called bgens_qc
Applets created that correspond to the tasks/steps of the workflow.
In the UI, launch the bgens_qc
workflow and use the following input parameters:
Common:
geno_bgen_files: /Bulk/Imputation/Imputation from genotype (GEL)/*.bgen
(22 files)
geno_sample_files: /Bulk/Imputation/Imputation from genotype (GEL)/*.sample
(22 files)
Note that you will need to correct the sample file header.
output_prefix: gel_imputed_snps_data_qc_pass
Keep_file: ischemia_df.phe
plink2_options: --mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1
See a detailed tutorial on how to work with WDL.
On your local machine, clone the UKB JupyterLab notebooks repository.
Upload the Jupyter notebook to your project directory on RAP: dx upload generate_inputs.ipynb --destination <directory path>
From the Research Analysis Platform, launch JupyterLab. The recommended configuration is listed above.
Once JupyterLab is open, click on the DNAnexus tab in the left menu, and then click on the generate_inputs.ipynb notebook. Once the notebook is open, go to the JupyterLab menu at the top of the page, click the Run tab, and then click Run All Cells.
Running the generate_inputs.ipynb notebook will generate bgens_qc_input.json. You may need to generate new sample files; see the discussion in the community for more details.
Log into the DNAnexus Platform and navigate to the UKB_RAP repository (see Step 1) to compile the workflow: java -jar dxCompiler-2.10.8.jar compile bgens_qc.wdl -project project-xxxx -folder <directory path>
Compile the WDL workflow into DNAnexus native workflow and load it into the Platform.
java -jar <path_to_downloaded_dxCompiler>dxCompiler-2.XX.X.jar compile research_project/bgens_qc/bgens_qc.wdl -project project-XXX -inputs research_project/bgens_qc/bgens_qc_input.json -archive -folder '<path_to_folder_on_UKB_RAP>'
The folder takes a file path input, which is where you want to store the workflow files. This code assumes that you are in the UKB_RAP directory in the terminal on your local machine. It will generate bgens_qc_input.dx.json
in the working directory.
To run the workflow, type the following command in the terminal:
dx run <path_to_folder_on_UKB_RAP>/bgens_qc -f research_project/bgens_qc/bgens_qc_input.dx.json
A GWAS analysis checks for associations between variants and the selected phenotype. Firth correction is used because of the unbalanced dataset (i.e. there are fewer cases than controls). The additive test is selected as the recommended first approach when running GWAS using REGENIE.
The GWAS analysis was done using the REGENIE app. For the first step the quality-controlled array data was used, for the second step the quality-controlled imputed data was used. When running REGENIE app, the following options were used:
Step 2
SPA instead of Firth approximation? False
Test type: Additive
Run time: approximately 7 hours
After analysis, 2549 variants out of 36,695,747 tested variants had significant association with the phenotype.
Log in to the Research Analysis Platform and run the REGENIE app.
In the Analysis Settings tab, select “Execution Output Folder". In the "Analysis Outputs 7" field, select the following options:
Genotype BED for Step 1: path to array data after liftOver
Genotype BIM for Step 1: path to array data after liftOver
Genotype FAM for Step 1: path to array data after liftOver
Genotype BGEN files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/*.bgen
(22 files)
Genotype BGI index files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/*.bgi
(22 files)
Sample files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/*.sample
(22 files)
Note: You will need to correct the sample file header.
Phenotypes file: phenotype file generated in the Sample QC step
Variant IDs to extract (Step 1): snplist file generated in the Array data QC step (1 file)
Variant IDs to extract (Step 2): snplist file generated in the Impute data QC step (1 file)
In the ASSOCIATION TESTING (STEP 2):
Ensure that "SPA" instead of "Firth approximation?" Is set to False
In the COMMON section:
Quantitative traits?: False
Phenotypes to include: ischemia_cc
Array of covariates to apply: sex, age, bmi, ever_smoked, hdl_cholesterol, ldl_cholesterol, hypertension, pc1, pc2, pc3, pc4, pc5, pc6, pc7, pc8, pc9, pc10
In the App Settings tab, select the instance type. The recommended instance is mem1_ssd1_v2_x36
Then click Start Analysis to begin.
Figures showing the example results of this GWAS are shown below.
Table 1. Number of significant GWAS variants divided by chromosome
LD clumping is the next step in the process in order to reduce the number of significant GWAS variants by clumping genetically linked variants together. This will then report only the most significant variant from each clump.
Code for the LD clumping is in run_ld_clumping.ipynb. In this notebook, significant variants are extracted from the GWAS report. Then, for each chromosome, significant variants are extracted from the imputed BGEN files, converted to PLINK files and then LD clumping is performed using PLINK software.
Out of 2531 variants, 82 remained as index variants after clumping.
Cluster configuration: Single node
Recommended instance: mem2_ssd1_v2_x32
Run time: approximately 20 minutes
On your local machine, clone the UKB JupyterLab notebooks repository.
Upload the Jupyter notebook to your project directory on RAP: dx upload
run_ld_clumping.ipynb
—-destination <directory path>
From the Research Analysis Platform, launch JupyterLab. The recommended configuration is listed above.
Once your JupyterLab session is open, click on the DNAnexus tab on the left hand side and open the run_ld_clumping.ipynb
notebook.
From the JupyterLab menu at the top of the page, click the Run tab and click Run All Cells.
An example of a table of index variants after LD clumping is shown below.
Table 2. Number of index variants after LD clumping by chromosome
PheWAS studies examine causal associations with other phenotypes. Conducting a PheWAS can help to distinguish when comorbidities are caused by horizontal pleiotropy (one locus influencing multiple diseases) and causality or vertical pleiotropy (one disease is causing another disease).
Code for preparing phenotype data can be found in get-phewas-data.ipynb. Code for running the PheWAS analysis is in run-phewas.ipynb. The phenome-wide association study was performed using the PheWAS R package. For this analysis, genotypic data for each of the variants selected by the LD clumping method, covariates, and ICD10 phenotypes were used. To create the ICD10 phenotypes, the UKB Diagnoses - ICD10 (41270) field was used. A table was then created where each row contains one phenotype (ICD10 diagnosis) for one participant. Therefore, when a participant has multiple diagnoses, the participant's eid will appear multiple times.
The PheWAS was run in parallel for each variant with the following options:
Use p-value, Bonferroni and FDR to calculate significance thresholds: significance.threshold = c("p-value", "bonferroni", "fdr")
Additive genotypes are not being supplied: additive.genotypes = FALSE
Cluster configuration: Single node
Recommended instance: mem2_ssd1_v2_x32
Run time for data preparation: Approximately 20 min
Run time for PheWAS: Approximately 6 hours
On your local machine, clone the UKB JupyterLab notebooks repository.
Upload the Jupyter notebook to your project directory on RAP:
dx upload
get-phewas-data.ipynb
--destination <directory path>
dx upload
run-phewas.ipynb
--destination <directory path>
On the Research Analysis Platform, launch JupyterLab. The recommended configuration is listed above.
Once the JupyterLab session is open, click on the DNAnexus tab on the left hand side and open the get-phewas-data.ipynb and run-phewas.ipynb notebooks.
From the JupyterLab menu at the top of the screen, click the Run tab and click Run All Cells.
The results from the PheWAS are summarized in the below figures.
Table 3. Number of significant associations for each phenotype
26 out of 82 variants had significant association with phenotype. The phenotypes with the most significant PheWAS associations were essential hypertension or hypertension.
This tutorial demonstrated how UKB-RAP can be used for an end-to-end genomic target discovery pipeline. The analysis started with ischemia cohort creation using the Cohort Browser. QC was done using JupyterLab (samples), PLINK in Swiss Army Knife (array data), or inside a WDL script (imputed data). GWAS was performed using REGENIE, where array and imputed data were used for steps 1 and 2, respectively. Linkage disequilibrium clumping then extracted the most informative variant among the significant GWAS variants. In the final step, a PheWAS analysis was performed for each variant, and the phenotypes with the most significant PheWAS associations were found to be essential hypertension and hypertension.
Single unfiltered multi-sample VCF files were provided for all UK Biobank whole exome sequencing (WES) releases (200k, 300k, 450k, and the final exome release). To help researchers generate a quality-controlled data set for genotype-phenotype association analyses, a “90pct10dp” QC filter was applied to all UK Biobank aggregate data sets, based on analyses of the UK Biobank 200k data release.
The processing methodology described on this page applies to all OQFE datasets available on the Research Analysis Platform, including the 300k, 450k, and the final release OQFE whole exome sequencing datasets.
Please note that only users with approved access to the 300k WES data will be able to view both the sequencing data and this auxiliary file from the user’s approved project. Users must request access to the 300k WES data and be approved by UK Biobank for access before they can view sequencing data and auxiliary files.
The breadth and depth of UKB phenotypes provide researchers a broad landscape of possibilities for association analyses. These range from single-variant tests to gene burden testing, across individual and aggregated phenotypes. While no singular set of filtered genotypes can be optimized for all possible analyses, there are features fundamental to the UKB WES data that can lead to spurious association results if not accounted for.
Alternatively, use the provided helper file named ukb23145_300k_OQFE.90pct10dp_qc_variants.txt, which is a single-column text file containing the variants failing the “90pct10dp” depth filter in the CHR:POS:REF:ALT format. The following command using PLINK 1.9 can be used to remove the filtered variants from the UKB 300k WES PLINK files:
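A sketch of such a command; the PLINK file prefix and output name are placeholders, and it assumes the variant IDs in the PLINK .bim file match the CHR:POS:REF:ALT IDs in the helper file:
plink --bfile ukb23145_cY_b0_v1 --exclude ukb23145_300k_OQFE.90pct10dp_qc_variants.txt --make-bed --out ukb23145_cY_b0_v1_90pct10dp_filtered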
The exact file path may change depending on where the file is located or mounted from the RAP project.
The above figure shows pre- and post-filtering UKB WES 200k association results with the asthma phenotype (Phe10_J45). Subfigures A and B (top and bottom, respectively) show results on the unfiltered UKB WES 200k genotypes and on the 90% DP>10 variant-filtered genotypes. The tests were logistic regressions performed with standard covariates (10 PCs, age, sex, age^2, age_x_sex).
This document describes the UKB whole exome sequencing (WES) protocol of the single- and aggregate-sample processing, including read alignment, variant calling, joint genotyping and post aggregation reformatting, employed by the Regeneron Genetics Center to generate the UK Biobank WES data set (category 170) for public release. Following this protocol, researchers can aggregate their own sequencing data with the UK Biobank single sample data, enabling mega-analysis.
This protocol applies to all UK Biobank whole exome sequencing data sets, including 200k, 300k, 450k, and the final release OQFE data sets.
The protocol is as follows:
The OQFE Docker file takes either FASTQ or CRAM files as inputs and outputs an OQFE CRAM, ensuring that all steps are executed exactly as specified in the OQFE protocol. We strongly recommend that users seeking to harmonize their data with UKB WES data execute the OQFE protocol via these implementations.
To use the OQFE app, run the following command from the CLI:
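A sketch only; confirm the exact app name and its input fields on the Platform before running:
dx find apps --name "*oqfe*"   # locate the OQFE app
dx run app-oqfe -h             # list its inputs, then run it with the appropriate -i<input>= values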
Prerequisites:
DeepVariant v0.10.0 docker file
OQFE CRAM generated by following section 1 of this protocol (e.g. aligned.cram)
Calling regions in BED format (which can be found in the folder containing exome sequences in your project)
A custom model file, if applicable
The details of the applet used to run the dataset are as follows:
Input Files:
Reference FASTA
Input BAM/CRAM
Input BAM/CRAM index
Interval file
Parabricks license file
Parabricks engine file for custom model
Input Options:
--disable-use-window-selector-mode : True
--include-med-dp : True
--gvcf : True
--gzip-output : True
Output Files:
Output gVCF file in bgzipped format
Index of output gVCF file
One sample is expected to run in around 5 minutes including the data upload and download time. The DNAnexus command line input for the run is:
This section of the UKB WES protocol describes the aggregation of variant genotypes and variant allele harmonization across sample-level gVCFs into multi-sample project-level VCF file (pVCF) organized in chromosomes or genomic segments using GLnexus, containing a row for every set of overlapping variants and each sample’s genotype for every variant allele. This pVCF is “squared-off”, in that for samples that do not contain an alternative allele genotype for a given variant, genotypes are derived from the gVCF reference blocks, reporting the read depth and most likely genotype (i.e. 0/0 or missing) for that sample at that variant position.
Prerequisites:
BEDtools
Custom target regions in BED format (e.g. targets.bed).
FASTA sequences of the reference (e.g. references.fa) and the reference index file (e.g. references.fa.fai). The corresponding FASTA and index files used for the UKB WES data can be found in the dispensed UKB project within the folder containing exome sequences.
Steps:
Generate 100 bp buffer regions on each side of the target regions.
bedtools flank -i targets.bed -g references.fa.fai -b 100 > buffers.bed
Combine the target and buffer regions and sort based on chromosome and start coordinates.
cat targets.bed buffers.bed | sort -k1,1 -k2,2n > target_buffers.bed
Merge overlapping regions.
bedtools merge -i target_buffers.bed > calling_regions.bed
Prerequisites:
BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.
Combine the two calling regions and sort based on chromosome and start coordinates.
cat calling_regions_1.bed calling_regions_2.bed | sort -k1,1 -k2,2n > combined.bed
Merge overlapping regions.
bedtools merge -i combined.bed > combined_calling_regions.bed
Prerequisites:
BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.
Steps:
Get the intersect of the two calling regions with BEDtools.
bedtools intersect -a calling_regions_1.bed -b calling_regions_2.bed > intersect.bed
Prerequisites:
Access to a DNAnexus Research Analysis Platform folder containing the single sample genomic VCF (gVCF) for all samples to be included in pVCF.
A BED format file containing the genomic regions to be aggregated and reported in the pVCF.
Steps:
Generate a manifest file containing the DNAnexus file IDs of the gVCF files by matching the naming convention of the files to be aggregated (e.g. *.gvcf.gz).
dx find data --project=<rap_project_name> --folder=<rap_folder_name> --name="*.g.vcf.gz" --brief | cut -d\: -f2 > <manifest_file>
Upload the <manifest_file> to the DNAnexus Research Analysis platform.
dx upload --path <dx_project>:/<dx_folder>/ <manifest_file> --brief
dx run glnexus -y --brief -ivariants_gvcfgz_manifest_csv=<manifest_file_location> -igenotype_range_bed=<calling_regions.bed> -ioutput_prefix=<base_name> --folder=<output_folder_in_RAP>
Prerequisites:
PLINK 1.9
PLINK 2.0 (for BGEN conversion)
Reference sequences in FASTA format (e.g. references.fa)
Steps:
Split multiallelic variants in pVCF file and variant normalization.
bcftools norm -f references.fa -m -any -Oz -o pvcf.norm.vcf.gz pvcf.vcf.gz
Convert normalized biallelic pVCF to PLINK files.
plink --vcf pvcf.norm.vcf.gz --keep-allele-order --vcf-idspace-to _ --double-id --allow-extra-chr 0 --make-bed --vcf-half-call m --out pvcf.norm
Convert PLINK files to BGEN files and prepare BGEN files.
First convert PLINK to a zlib-compressed BGEN file:
plink2 --bfile pvcf.norm --export bgen-1.2 bits=8 id-paste=iid ref-first --out pvcf.norm_zlib
Convert the zlib-compressed BGEN to a zstd-compressed BGEN file and omit the sample identifier block in the .bgen file using QCTOOL.
qctool -g pvcf.norm_zlib.bgen -s pvcf.norm_zlib.sample -og pvcf.norm.bgen -os pvcf.norm.sample -ofiletype bgen -bgen-bits 8 -bgen-compression zstd -bgen-omit-sample-identifier-block
Generate BGEN index.
bgenix -g pvcf.norm.bgen -index -clobber
Only variant sites with at least 90% of the genotypes having DP>10 were retained by this filter. The filtered data sets are provided in the “helper_files” folders of all UK Biobank WES releases. For details on the analysis and other considerations, please refer to the UKB-RAP documentation.
The current release of the UK Biobank (UKB) whole exome sequencing (WES) data on 302,333 participants comprises single- and multi-sample variant data generated via the same protocols that were applied to the UKB 200k WES release in early 2021. All samples are processed with the OQFE mapping protocol, and variants are called with DeepVariant and aggregated into a multi-sample VCF. The multi-sample VCF contains per-genotype metrics including depth and genotype qualities, allowing researchers to perform custom variant- and genotype-level filtering as appropriate for their desired analyses. As such, a single unfiltered multi-sample VCF was provided for the 300k WES release along with the derived PLINK files. In response to feedback from the UK Biobank community, the 300k WES release also includes an auxiliary file, ukb23145_300k_OQFE.90pct10dp_qc_variants.txt, to aid researchers in implementing basic best practices for genotype-phenotype association analyses.
Specifically, the UKB WES data was generated in two phases: the first 50k participants (Phase 1) and then the balance of the total 500k cohort (Phase 2). As described in the , the 50k release participants were selected to enrich for specific phenotypes. Given the non-random order of participant sequencing, variations in sequencing coverage that occur over long-term projects can manifest as spurious association results. The UKB community reported such spurious hits when single-variant tests were run on the unfiltered UKB WES 200k genotypes. As an example, Figure 1A shows all single-variant hits of the UKB WES 200k unfiltered genotypes tested against an asthma phenotype (PHE10_J45), indicating a large number of likely spurious variants with significant or near-significant P-values. Examination of these spurious hits in the UKB WES 200k unfiltered set indicates that these variants tend to be enriched for sample-genotypes with low per-genotype read depth.
As noted in the UKB WES 200k FAQ (section 23.d), we suggest the inclusion of a batch covariate in association tests on these data to account for differences in oligo lots between Phase 1 and Phase 2. These coverage heterogeneities can also be mitigated by a single variant-level filter requiring that at least 90% of all genotypes for a given variant - independent of variant allele zygosity - have a read depth of at least 10 (i.e. DP>=10). When this filter is applied to the UKB WES 200k data prior to association analysis, the results are largely devoid of the spurious hits (Fig. 1B).
Application of this depth filter (“90pct10dp”) is consistent across the UKB 200k and UKB 300k WES sets with respect to numbers of variants removed (Table 1). The filtering can also be performed directly on the multi-sample VCF with the commands below:
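One way to express this filter with bcftools (a sketch; the input pVCF name is a placeholder, <reference> is the GRCh38 reference FASTA, and the exact expression used for the released helper files may differ):
# normalize against the reference, then keep sites where at least 90% of genotypes have DP>=10
bcftools norm -f <reference> -m -any input_pvcf.vcf.gz -Ou | bcftools view -i 'COUNT(FMT/DP>=10) >= 0.9*N_SAMPLES' -Oz -o input_pvcf.90pct10dp.vcf.gz -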
Details on the <reference> above can be found in this reference document.
bcftools and plink can be used as part of the Swiss Army Knife app found in the Tools Library on the Research Analysis Platform. See here for a detailed tutorial on how to use it.
For more information about app documentation for Swiss Army Knife, see (requires Research Analysis Platform login).
For more on the 300k WES dataset and how to work with it, .
If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
The UKB WES data are reference-aligned with the , which employs BWA-MEM to map all reads to the GRCh38 reference in an alt-aware manner, marks read duplicates, and adds additional per-read tags. The OQFE protocol retains all reads and original quality scores such that the original FASTQ is completely recoverable from the resulting CRAM file. All constituent steps of the OQFE protocol are executed with open-source software and described in detail in the OQFE manuscript linked above. Given the impact even small changes to the protocol can introduce into large analyses, the OQFE protocol is available as a and as an on the DNAnexus Research Analysis Platform that executes the OQFE Docker file.
If you are using the UI, start from .
This section of the UKB WES protocol describes the variant calling using . To call variants in WES data, either the default DeepVariant WES model or a custom model can be used. The custom model that is trained on WES data generated by and used for the generation of UK Biobank data is available as supplementary materials in .
(which can be found under the folder containing the exome sequences in your project)
The UKB WES variant data set () was generated using NVIDIA Clara Parabricks Pipelines accelerated version of DeepVariant, which is a faithful reproduction of Google’s DeepVariant (with the additional ), and is an order of magnitude faster. The results of the two implementations are equivalent when tested across many samples. Across 100 test samples, 7.5 million sample level variants were called with 100% concordance between the two implementations of DeepVariant, with 1 zygosity mismatch and MEAN GQ difference of 0.43 on Phred scale. This protocol first outlines how to run using the GPU accelerated DeepVariant provided by , and made available on the DNAnexus Research Analysis Platform as an .
If you are using the UI, start from .
The GLnexus aggregation process requires only the gVCFs and the desired aggregation regions in BED format (BED format specifications) as inputs. The BED file for UKB WES data is the exome capture region buffered by 100 bp on each side of each target, with overlapping buffered regions merged. Users who are applying this protocol to a set of gVCFs derived from sequencing across multiple capture designs will need to generate a unified BED file for the regions of interest. The protocol here describes the BED generation process for aggregating variants across multiple capture designs: the intersection of all capture designs and the union of all capture designs.
The BCFtools and commands mentioned in this section of the UKB WES protocol can be executed via command line using the Swiss Army Knife app. Please refer to the for more information about running the app with the command along with managing the inputs/outputs.
The following commands mentioned in this section of the UKB WES protocol can be executed in the Swiss Army Knife app. Please refer to the for more information about running the app with the command along with managing the inputs/outputs.
Launch the GLnexus app to generate pVCF files.
If you are using the UI, start from .
This section of the UKB WES protocol describes how the UKB WES PLINK and BGEN format files are derived from the pVCF, including the decomposition of multi-allelic variants into biallelic variants and variant normalization prior to format conversion to PLINK and BGEN. BGEN is recommended for performing GWAS using REGENIE.
The following commands mentioned in this section of the UKB WES protocol can be executed in the Swiss Army Knife app. Please refer to the for more information about running the app with the command along with managing the inputs/outputs.
Examples of Standard Support Questions
Examples of Questions that Require Service Packages
“We need help setting up a purchase order.”
“I am having trouble logging into the platform”
“I have opened a new RStudio session but I get an error message when I try to open it.”
“I would like to request access to the whole exome and whole genome sequence data through the RAP”
“What happens when my initial credit expires?”
“My team is trying to implement a pipeline analyzing rare variants in exome data but need some guidance setting it up efficiently”
“We want to query WES genotypes across a large selection of UKB samples. Can you provide recommendations on the best way to do this?”
“What’s the best way to merge genomic & proteomic data?”
 | Recommended for efficiency | HLA Example |
Number of concurrent jobs | <=5,000 | 2,000 |
Number of inputs per job | <=1,000 | 100 |
Job run time | ~1 day | 19 hours |
Chromosome | Number of significant variants |
1 | 115 |
2 | 598 |
4 | 1 |
6 | 658 |
7 | 3 |
8 | 27 |
9 | 204 |
10 | 11 |
11 | 13 |
12 | 104 |
13 | 1 |
15 | 437 |
16 | 63 |
17 | 140 |
19 | 149 |
21 | 7 |
Chromosome | Number of index variants |
1 | 8 |
2 | 9 |
4 | 1 |
6 | 29 |
7 | 1 |
8 | 1 |
9 | 4 |
10 | 2 |
11 | 2 |
12 | 8 |
13 | 1 |
15 | 5 |
16 | 2 |
17 | 2 |
19 | 6 |
21 | 1 |
Phenotype | Number of significant PheWAS associations |
Essential hypertension | 17 |
Hypertension | 17 |
Hematuria | 4 |
Delirium dementia and amnestic and other cognitive disorders | 3 |
Dementias | 3 |
Colorectal cancer | 2 |
Disorders of fluid, electrolyte, and acid-base balance | 2 |
Hypovolemia | 2 |
Malignant neoplasm of rectum, rectosigmoid junction, and anus | 2 |
Cancer of prostate | 2 |
Malignant neoplasm of other and ill-defined sites within the digestive organs and peritoneum | 1 |
Atrial fibrillation and flutter | 1 |
Malignant neoplasm of gallbladder and extrahepatic bile ducts | 1 |
Cardiac dysrhythmias | 1 |
Urinary incontinence | 1 |
Hyperplasia of prostate | 1 |
Helminthiases | 1 |
Other symptoms/disorders of the urinary system | 1 |
Intestinal helminthiases | 1 |
Name | Description |
gwas-phenotype-samples-qc.ipynb | Performs sample QC on the cases and controls cohorts and selects only those samples for which imputed data are available. |
liftover_plink_beds.wdl | Lifts array data to the newer version of the reference genome (GRCh38) and merges the per-chromosome files into one file. |
run_array_qc.sh | Performs QC on the lifted array PLINK files. |
bgens_qc.wdl | Performs QC on the imputed data BGEN files and merges the results for individual chromosomes into one file. |
run_ld_clumping.ipynb | Performs LD clumping on significant GWAS variants. |
get-phewas-data.ipynb | Creates phenotype data in long format suitable for the PheWAS R package and extracts genotype data for each variant selected by LD clumping. |
run-phewas.ipynb | Runs the PheWAS analysis using the PheWAS R package. |
% Filtered Variants | SNP | Indel |
UKB 200k | 1.52% | 5.54% |
UKB 300k | 1.57% | 5.20% |
Command section | Annotation |
bedtools flank | To create flanking intervals for each BED feature |
-i targets.bed | The input BED file |
-g references.fa.fai | The genome file defining chromosome bounds |
-b 100 | The number of base pairs in each direction to add to the input BED file |
> buffers.bed | Redirect the output to the buffers.bed |
Command section | Annotation |
cat targets.bed buffers.bed | Concatenate the target and buffer BED files to the standard output |
| sort -k1,1 -k2,2n | Pipe to a sort command to sort the BED by chromosome first and then by the start coordinates in the numeric order |
> target_buffers.bed | Redirect the output to a target_buffers.bed file |
Command section | Annotation |
bedtools merge | Merge overlapping BED features into a single interval |
-i target_buffers.bed | The input BED file |
> calling_regions.bed | Redirect the output to a calling_regions.bed file |
Command section | Annotation |
cat calling_regions_1.bed calling_regions_2.bed | Concatenate the calling_regions_1 and calling_regions_2 BED files to the standard output |
| sort -k1,1 -k2,2n | Pipe to a sort command to sort the BED by chromosome first and then by the start coordinates in the numeric order |
> combined.bed | Redirect the output to a combined.bed file |
Command section | Annotation |
bedtools merge | Merge overlapping BED features into a single interval |
-i combined.bed | The input BED file |
> combined_calling_regions.bed | Redirect the output to a combined_calling_regions.bed file |
Command section | Annotation |
bedtools intersect | Bedtools intersect |
-a calling_regions_1.bed | First calling regions in BED format |
-b calling_regions_2.bed | Section calling regions in BED format |
> intersect.bed | Redirect the output to a intersect.bed file |
Command section | Annotation |
dx find data | Find data objects subject to the given search parameters |
--project=<dx_project_name> | The DNAnexus project name where the data is in |
--folder=<folder_name> | The folder where the data is in the DNAnexus project |
--name="*.gvcf.gz" | The files to look for using a wildcard “*” |
--brief | Print a DNAnexus ID per line for all matching gVCF files |
| cut -d\: -f2 | Pipe to the cut command to extract the DNAnexus file IDs only |
> <manifest_file> | Redirect the output to a file |
Command section | Annotation |
dx upload | Upload local file to DNAnexus |
--path <dx_project>:/<dx_folder>/ | The folder <dx_folder> in the DNAnexus project <dx_project> that the manifest file is uploaded to |
<manifest_file> | The local manifest file that needs to be uploaded |
--brief | Return a DNAnexus file ID for the uploaded manifest file |
Command section | Annotation |
bcftools norm | Split multiallelic variants and normalization |
-f references.fa | Specify the reference sequences (required for left-alignment and normalization) |
-m -any | Split any multiallelic variants |
-Oz | The output type is ‘compressed VCF’ |
-o pvcf.norm.vcf.gz | Write output to a file named ‘pvcf.norm.vcf.gz’ |
pvcf.vcf.gz | The input pVCF file |
Command section | Annotation |
plink | PLINK 1.9 |
--vcf pvcf.norm.vcf.gz | Specify the input pVCF file |
--keep-allele-order | Keep the allele order |
--vcf-idspace-to _ | Change spaces in the variant IDs to underscore (_) |
--double-id | Set both family ID and within-family ID to the same sample ID |
--allow-extra-chr 0 | Treat the unrecognized chromosome codes as if they have been set to zero |
--make-bed | Creates a new PLINK 1 binary fileset |
--vcf-half-call m | Convert half calls (./1) in the pVCF to missing in PLINK |
--out pvcf.norm | Specify the prefix of the output PLINK |
Command section | Annotation |
plink2 | PLINK2 |
--bfile pvcf.norm | The input genotype file |
--export bgen-1.2 bits=8 id-paste=iid ref-first | The output format BGEN with 8-bits probability precision, iid used to construct ID column and correct allele order |
--out pvcf.norm_zlib | The output genotype file |
Command section | Annotation |
qctool | QCTOOL v2 |
-g pvcf.norm_zlib.bgen | The input genotype file |
-s pvcf.norm_zlib.sample | The input sample file |
-og pvcf.norm.bgen | The output genotype file |
-os pvcf.norm.sample | The output sample file |
-ofiletype bgen | The filetype of the output genotype file specified by -og |
-bits 8 | Store each probability in 8 bits |
-bgen-compression zstd | Use zstd algorithm for BGEN compression |
-bgen-omit-sample-identifier-block | Omit the sample identifier block |
Command section | Annotation |
bgenix | bgenix |
-g pvcf.norm.bgen | Specify the input genotype file |
-index | Create an index file for the given bgen file |
-clobber | bgenix will overwrite existing index file if it exists |
Contact the Research Analysis Platform Support team for help in using the Platform.
You can contact the UK Biobank Research Analysis Platform Support team in one of two ways:
In the menu at the top of the UI, click Help, then select Contact Support from the dropdown menu. A contact form will open in a modal window. Fill out the form and click Send.
After you submit a request, you'll receive an automated reply, confirming that your request has been logged in the Research Analysis Platform Support tracking system.
This article describes the format of the annotation files for the 500k WES dataset, and demonstrates how to use regenie on the UK Biobank Research Analysis Platform for generating variant masks on which to perform association tests.
Specifically, this article will go over how to use the following helper files:
ukb23158_500k_OQFE.sets.txt.gz
ukb23158_500k_OQFE.annotations.txt.gz
ukb23158_500k_OQFE.masks
A typical application of this tool would be in rare variant analyses where single variant tests have lower power, and combining variants into masks can boost association power. The commands shown below are executed in the JupyterLab environment with Bash kernel. For more on how to use JupyterLab on the Research Analysis Platform, see these tutorials. Alternatively, you can use ttyd (web-based terminal) or Cloud Workstation.
We will go over building masks on the fly in regenie, covering the following:
The format of the input files
How to run regenie to build and test masks
The LOVO scheme
We assume that the annotation files are located along with the dispensed files in the user’s project. We also assume the RAP project is mounted on “/mnt/project” on the worker as per the approach detailed here.
The path to the annotation file:
For this tutorial, we need the following files:
Annotation file
Masks definition file
Set list file
Use the following command to view the directory structure of the files:
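A sketch; the helper-file folder path in your dispensed project may differ, so set it once as a convenience variable:
HELPER_DIR="/mnt/project/<path to helper files>"
ls -1 "$HELPER_DIR"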
The following files should be in your directory:
ukb23158_500k_OQFE.annotations.txt.gz
ukb23158_500k_OQFE.masks
ukb23158_500k_OQFE.sets.txt.gz
The file ukb23158_500k_OQFE.sets.txt.gz defines how sets - i.e. genes - are to be constructed, by listing the variants corresponding to each gene.
Each line contains the gene name, followed by a chromosome and physical position - to be used in the association result file - then by a comma-separated list of variants included in the gene, in the format CHR:POS:REF:ALT ID.
The variant IDs correspond to those in the genotype file (if not in the file, they will be ignored when running regenie). To view the file:
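For example, truncating long lines for readability (HELPER_DIR is the convenience variable defined above):
zcat "$HELPER_DIR/ukb23158_500k_OQFE.sets.txt.gz" | head -n 2 | cut -c 1-120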
The file details will appear as the following:
In this file there are almost 19,000 defined genes. To see the count, use the following command:
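For example:
zcat "$HELPER_DIR/ukb23158_500k_OQFE.sets.txt.gz" | wc -l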
Which will give you:
The file ukb23158_500k_OQFE.annotations.txt.gz defines a functional annotation for each variant given a gene set. Each line contains the variant name, the gene name (corresponding to the name in the set list file above), and a single annotation label.
To view this file:
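For example:
zcat "$HELPER_DIR/ukb23158_500k_OQFE.annotations.txt.gz" | head -n 5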
The file details will appear as the following:
There are a total of 5 annotation labels in the file. Check the file ukb23158_helper_files.pdf that is available on the UK Biobank showcase here for more details on the labels. Variants in the set list file which don't have annotations will be assigned to a default NULL category in regenie. The distinct labels can be listed as sketched below:
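A sketch, assuming the annotation label is the third whitespace-separated column:
zcat "$HELPER_DIR/ukb23158_500k_OQFE.annotations.txt.gz" | awk '{print $3}' | sort -u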
Output:
The file ukb23158_500k_OQFE.masks specifies which annotation labels should be combined into masks. Each line contains a mask name, followed by a comma-separated list of the annotations included in the mask (i.e. taking a union over the annotation categories).
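To view it:
cat "$HELPER_DIR/ukb23158_500k_OQFE.masks"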
Output:
The specified M1 mask includes only loss-of-function annotated variants in the mask. You can easily add new masks by selecting the annotations you want to combine. When doing so, make sure the annotation labels match the formatting of those in the annotation file above.
For example, to have a mask called M2 that includes LoF and missense annotated variants, you would generate a mask definition file as:
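A sketch of such a mask definition file; the label names LoF and missense are illustrative and must be replaced by the exact label strings used in the annotation file:
M1 LoF
M2 LoF,missense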
For more on these input files, refer to the file ukb23158_helper_files.pdf here.
From this point on, you will need to have regenie installed in your working environment.
Here we show how to use annotation files for burden testing using regenie. For more detailed instructions on using regenie, refer to the regenie documentation page.
To install regenie in your JupyterLab environment:
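One way to do this (a sketch; check the regenie releases page on GitHub for the current version and the exact archive name before running):
wget https://github.com/rgcgithub/regenie/releases/download/v3.2.8/regenie_v3.2.8.gz_x86_64_Linux.zip
unzip regenie_v3.2.8.gz_x86_64_Linux.zip
mv regenie_v3.2.8.gz_x86_64_Linux regenie && chmod +x regenie
./regenie --version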
regenie is also available as part of the Swiss Army Knife app, if you want to run an analysis as part of a larger workflow on the Research Analysis Platform.
In regenie, we will use the following options:
--aaf-bins to specify the AAF cutoffs to use when building masks (the singleton class of masks is always included); e.g. --aaf-bins 0.05,0.01 will build masks using singletons, 1% and 5% AAF cutoffs.
In the examples below, we will use the OQFE genotype data in BGEN format. For the purposes of this tutorial, we will skip regenie step 1 and use --ignore-pred to bypass specifying the LOCO PRS.
Note that when running an actual analysis, it is highly recommended that you run Step 1 to control for relatedness, population structure and polygenicity.
We will focus on a specific gene, PCSK9. Start by getting its ID from the annotation files:
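For example (a sketch; it assumes the gene symbol appears in the set names, and prints the set name, chromosome and position):

```bash
# Look up the PCSK9 entry in the set list file.
zcat ukb23158_500k_OQFE.sets.txt.gz | awk '$1 ~ /PCSK9/ {print $1, $2, $3}'
```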
The output should be:
We will use a phenotype file containing LDL measurements, and will carry out burden tests using variant masks built on-the-fly in regenie:
We will first run regenie, building variant masks using 0.1% and 1% AAF cutoffs. We use the option `--extract-setlist` to specify a subset of the ~19K genes to test. Alternatively, you can provide a file (named, for example, "extract_gene_names.txt") with a single column containing the names of the genes to analyze, and pass this file to regenie using `--extract-sets extract_gene_names.txt`:
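A sketch of what this call can look like is shown below. The helper file names are those dispensed with the data; the BGEN/sample paths, the phenotype file and the output prefix are placeholders to adapt to your project (PCSK9 is on chromosome 1, hence the chromosome 1 files).

```bash
# Burden testing for PCSK9, with masks built on the fly from the helper files.
regenie \
  --step 2 \
  --bgen /mnt/project/<path_to_wes_bgen>/ukb23159_c1_b0_v1.bgen \
  --sample /mnt/project/<path_to_wes_bgen>/ukb23159_c1_b0_v1.sample \
  --ref-first \
  --phenoFile ldl_phenotypes.txt \
  --phenoCol LDL \
  --set-list ukb23158_500k_OQFE.sets.txt.gz \
  --anno-file ukb23158_500k_OQFE.annotations.txt.gz \
  --mask-def ukb23158_500k_OQFE.masks \
  --extract-setlist "PCSK9(ENSG00000169174)" \
  --aaf-bins 0.01,0.001 \
  --bsize 200 \
  --ignore-pred \
  --out ldl_pcsk9_test
```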
regenie will output a summary statistics file ldl_pcsk9_test_LDL.regenie containing the association results for the built masks, whose names will be PCSK9(ENSG00000169174).M1.*, split across the 3 AAF cutoffs (0.01, 0.001, or singletons).
You have the option to save the built masks to a file, which can be useful and save compute if you want to use them again in another analysis. To do so, use `--write-mask`, which will store the masks in PLINK BED format, in the files ldl_pcsk9_test_masks.{bed,bim,fam}.
Note that if you want to build masks and save them to file without testing them for association, you can use the option `--skip-test`.
By default, masks are created by taking the maximum ALT allele count across the sites included in the mask, so each mask takes values 0/1/2, or NA when missing. Alternatively, you can specify different rules to build masks:
`--build-mask sum` will take the sum of the ALT allele counts across sites.
`--build-mask comphet` is used to identify compound heterozygotes, defined as carrying 2 or more ALT alleles across the sites included in the mask.
Note that building masks using the "sum" rule is not compatible with the use of `--write-mask` as detailed above.
To obtain the list of single variants that went into each mask, you can use the option `--write-mask-snplist` when building masks. This will generate a file where each row has the mask name followed by the list of variants included in that mask.
If you want to make sure that you're correctly specifying the input annotation files, you can use the option `--check-burden-files`, which will create a report checking the input files for concordance. It verifies that the same annotation labels are used across all files and flags any variants in the set list file that are not present in the input genotype file.
If there are built masks that come up as significant, it is usually of interest to determine which of the single variants in the mask is driving the signal. This is what the LOVO scheme, which stands for leave-one-variant-out, aims to do.
To specify LOVO, you need to use the option `--mask-lovo` followed by the gene name, the mask name, and the AAF cutoff (either 'singleton' or a value in (0,1)).
For example, if we wanted to apply LOVO to the M1 mask with 1% AAF cutoff for PCSK9, we would use:
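A sketch (same placeholders as in the command above; note the comma-separated triplet passed to `--mask-lovo` and the new output prefix):

```bash
# Re-run the burden test with leave-one-variant-out masks for M1 at the 1% AAF cutoff.
regenie \
  --step 2 \
  --bgen /mnt/project/<path_to_wes_bgen>/ukb23159_c1_b0_v1.bgen \
  --sample /mnt/project/<path_to_wes_bgen>/ukb23159_c1_b0_v1.sample \
  --ref-first \
  --phenoFile ldl_phenotypes.txt \
  --phenoCol LDL \
  --set-list ukb23158_500k_OQFE.sets.txt.gz \
  --anno-file ukb23158_500k_OQFE.annotations.txt.gz \
  --mask-def ukb23158_500k_OQFE.masks \
  --aaf-bins 0.01,0.001 \
  --bsize 200 \
  --ignore-pred \
  --mask-lovo "PCSK9(ENSG00000169174),M1,0.01" \
  --out ldl_pcsk9_test_lovo
```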
This will generate the result output file ldl_pcsk9_test_lovo_LDL.regenie, which will contain association results for each LOVO mask, as well as results for the full mask that considers all sites.
If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
Explore one of the ways to conduct a GWAS on the Research Analysis Platform.
The code provided in the following tutorial is delivered "As-Is." Notwithstanding anything to the contrary, DNAnexus will have no warranty, support or other obligations with respect to Materials provided hereunder. The MIT License applies to this guide.
This tutorial will provide the Python code needed to conduct the QC portion of a genome-wide association study (GWAS) focusing on discovering variants associated with Alzheimer's Disease (AD) in JupyterLab. For this GWAS, we used UK Biobank data to search for genetic loci associated with AD. For more detailed information about the AD study, see this blog post.
One of the motivations for performing this study was to demonstrate the process of running an end-to-end GWAS on the UK Biobank Research Analysis Platform. While researchers can bring their own programming languages and tools to the Platform, the Platform's user interface enables researchers to run analyses easily and quickly. We wanted to demonstrate that a GWAS can be conducted without knowledge of sophisticated command-line methods.
Here is a brief overview of the steps that we conducted to perform this GWAS:
In JupyterLab:
Access data and construct phenotype.
Perform sample QC.
In CLI or UI:
Conduct LiftOver.
Conduct variant QC using Swiss Army Knife.
Conduct GWAS using regenie.
In UI:
Visualize analysis results using LocusZoom.
The genetic data needed to run this GWAS is stored in multiple folders in the Research Analysis Platform’s file structure, which is detailed here. For this analysis, we use the array genotype and whole exome sequencing data stored in the “/Bulk/Genotype Results/Genotype calls/” and “/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/” folders, respectively. Phenotypic data needed to collect the AD phenotype is located in the SQL database named “app<APPLICATION-ID>_<CREATION-TIME>”, and can be accessed via the corresponding dataset named “app<APPLICATION-ID>_<CREATION-TIME>.dataset” in the root (“/”) folder.
The first part of this analysis consists of conducting sample QC and processing the UK Biobank data fields to calculate each individual's AD risk by proxy.
We conducted our QC using a JupyterLab Spark cluster with the following configuration:
Cluster configuration: Spark Cluster
Instance type: mem1_hdd1_v2_x4
Number of Nodes: 2
We first imported dxpy to find and extract the UKB datasets on the Research Analysis Platform.
The extracted UKB data dictionaries are returned as 3 .csv files. The data_dictionary.csv contains one row per field, with metadata along the columns, including the field names (see the table below for the naming convention). The codings.csv contains a lookup table for the different medical codes, including the ICD-10 codes that will be displayed in the diagnosis field column (p41270). The entity_dictionary.csv contains the different entities that can be queried, where participant is the main entity and corresponds to most phenotype fields.
To access the data_dictionary.csv, use the command:
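The notebook itself drives this through dxpy; as an alternative sketch, the same three dictionary files can be produced from a terminal with the dx command-line client (the dataset record name below is the placeholder used elsewhere in this guide):

```bash
# Dump the dataset dictionaries, then peek at the data dictionary.
dx extract_dataset "app<APPLICATION-ID>_<CREATION-TIME>.dataset" -ddd
head -n 5 *data_dictionary.csv
```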
For the main participant phenotype entity, the Research Analysis Platform uses field names with the following convention:
Using the UKB showcase, we located the following field IDs of interest. We note them below so we can easily refer to them throughout the rest of the notebook.
The full list of all data-fields used in our case study:
The fields we want include principal components as well as the fields we need to construct our phenotype. We use the following function to collect all the field names and create a list of field IDs:
If you view these field names now, each will look something like this: p<field_id>_iYYY_aZZZ.
Now we can retrieve our necessary fields as a Pandas DataFrame:
The retrieved .csv file contains a table with participants as rows and the fields of interest as columns. From this point onward, we are interacting with the data in a Pandas DataFrame.
Use this command to visualize your table:
We will format the column headers to drop the "participant" prefix for the field names:
Here we resolve the ICD-10 codes for AD from the dataset. This list of codes will be used later for defining the phenotype.
Your output may look like the following:
The AD phenotype is generated using both a participant’s disease status and that participant's parental disease status and age. We are following the definition of AD-by-proxy described in this article by Jansen et al.
Here is a summary of the steps of the process. There are two calculation methods:
Participant's disease status.
If the participant has been diagnosed with AD, count the risk as the maximum value, 2.
Sum of the parents' risk, determined by each parent's disease status and age.
If a biological parent has been diagnosed with AD, count that parent's risk as 1. The risk is set to 2 if both biological parents were diagnosed with AD.
Otherwise, derive the parent's risk from the parent's age (recorded age or age at death, whichever is older):
risk = max( 0.32, (100 - age)/100 )
where 0.32 is the general risk for an adult to be diagnosed with AD (e.g. a 60-year-old parent without an AD diagnosis contributes max(0.32, 0.40) = 0.40).
Take the maximum of steps 1 and 2.
To get the participant's disease status, we look for the participant’s primary or secondary ICD-10 codes. If a participant has ICD-10 codes for AD, we record their AD risk as “2”. Note that we first need to format the values in the columns to be a list object instead of a string.
Now we look at the illness statuses of the participant’s mother and father. We first get a list of illnesses with which the participants' biological parents have been diagnosed.
Data-fields 20107 and 20110 record illnesses of each participant's father and mother. If diagnosed with AD, assign the parent's risk as "1":
UKB participants attended multiple measurement-taking sessions, so there are multiple fields for the participant’s parents’ ages. Since the risk of developing AD increases by age, we only need the maximum value for each parent’s age to calculate the AD risk:
If the parent does not have any AD diagnosis, calculate the parent’s risk using their age, as shown in the following, then organize the data into a table:
Now we conduct the sample QC steps. We filter for participants whose reported sex and genetic sex match, whose ancestry is listed as "White British," and who have no sex chromosome aneuploidy. We filter out outliers for heterozygosity or missing rate, as well as participants flagged for excessive kinship, i.e. those with ten or more third-degree relatives. Participants whose parental ages were recorded as "do not know" or "prefer not to answer" were also removed from the cohort:
Use this command to check the resulting number of samples:
Since GWAS analyses have been run before for AD, and the results have been documented, we took the extra step of visualizing the distribution of cases with a histogram, to double check that our results looked reasonable in a general population. Our histogram looked reasonable, so we continued with our analysis:
Based on the distribution, we set the case/control threshold at an AD risk of 1 (the risk ranges from 0 to 2). All participants with an "AD-by-proxy" value of 1 or above are considered cases, and those below 1 are considered controls:
Next, we cleaned up the DataFrame to use as input for regenie. We renamed the columns for human readability and added an FID column, which is required by regenie's input format. We also converted the list of parental illnesses to a string, for ease of formatting:
At this point, your table should look something like this (the displayed data values are placeholders, so as not to disclose actual values):
To upload your phenotype file back to the project:
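For example, with the dx client available in the JupyterLab session (a sketch; add a destination folder if you do not want the file in the project root):

```bash
# Upload the phenotype file from the local JupyterLab filesystem to the project.
dx upload ad_risk_by_proxy_wes.phe
```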
Now the QC steps are complete.
The next steps for the GWAS are to conduct variant QC and the array-genotype LiftOver.
Note that the UKB SNP array genotype calls (bulk, genotype calls) are mapped to GRCh37, while the whole-exome sequences are mapped to GRCh38, the most recent version. We chose to lift the coordinates of genotype calls over to GRCh38 using LiftOver before conducting the rest of the analysis, so that our genotype calls are consistent.
The following pipeline may be useful in conducting this part of the analysis.
To conduct variant QC, we use PLINK2, available as part of the Swiss Army Knife app, to create a list of variants that pass the QC threshold for the array genotype and WES variant data. To access it on the UK Biobank Research Analysis Platform, open your project, then click Start Analysis. From the list of tools that appears, click Swiss Army Knife. The following shows what to put as the input string under the cmd option of the Swiss Army Knife app.
For more information about PLINK2 see the official PLINK2 documentation.
The input files to our variant QC job on the array genotype data are:
The UK Biobank genotype call files in .bed, .bim, and .fam formats. We merged the chromosomes together into one file for each file format in preparation for running the whole genome regression model using Step 1 of regenie.
Below are the Swiss Army Knife cmd inputs used to run the job:
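A sketch of such a cmd string is shown below; the QC thresholds are illustrative placeholders rather than the exact values used in the original study, while the input and output prefixes match the files listed in this section.

```bash
# Variant QC on the merged array genotypes with PLINK2 (run inside Swiss Army Knife).
plink2 --bfile ukb_c1-22_hs38DH_merged \
       --maf 0.01 --mac 20 --geno 0.1 --mind 0.1 --hwe 1e-15 \
       --write-snplist --write-samples --no-id-header \
       --out final_array_snps_CRCh38_qc_pass
```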
The outputs we get from this step are:
A list of SNPs that pass QC, final_array_snps_CRCh38_qc_pass.snplist
A list of sample IDs that pass QC, final_array_snps_CRCh38_qc_pass.id
A log file (.log)
The WES QC is performed using the bgens_qc.wdl script found here. The variants are QC'd per chromosome and then concatenated at the end to create a single list of variants to keep. The variants were filtered using the following options:
The input file to run the bgens_qc.wdl was created by first running the generate_inputs.ipynb notebook shown here. Within this notebook you'll need to update the following parameters:
Run the WDL workflow using the .json file created in the generate_inputs.ipynb notebook by running the command: dx run <path_to_folder_on_UKB_RAP>/bgens_qc -f research_project/bgens_qc/bgens_qc_input.dx.json. Alternatively, you can run the WDL workflow in the UI and specify the inputs manually instead of using the Jupyter notebook.
The outputs we get from this step are:
A list of SNPs that pass QC, final_WES_snps_CRCh38_qc_pass.snplist
A list of sample IDs that pass QC, final_WES_snps_CRCh38_qc_pass.id
A log file (.log)
We will be using regenie to conduct our GWAS. DNAnexus provides a suite of apps to run the analysis as a whole. The Regenie orchestration app is used to facilitate running the two required steps, as well as input preprocessing, validation, resource allocation and post-processing (including creation of Manhattan plots per trait).
For Step 1, we will estimate how background SNPs contribute to the phenotype.
The inputs used for Step 1 are:
Merged array genotype calls: ukb_c1-22_hs38DH_merged.[bed,bim,fam], which we also used as our input in the variant QC step above.
Our phenotype file: ad_risk_by_proxy_wes.phe
The list of SNPs that pass QC, final_array_snps_CRCh38_qc_pass.snplist, and the list of sample IDs that pass QC, final_array_snps_CRCh38_qc_pass.id
The following outputs from Step 1 are the necessary inputs for Step 2:
LOCO file: ukb_c1-22_hs38DH_merged_1.loco
Prediction list: ukb_c1-22_hs38DH_merged_pred.list
Then in Step 2, linear or logistic regression is used to test the association between the variants and the phenotype, conditional upon the predictions from the Step 1 regression model. In Step 2, the orchestration app distributes each set of BGEN files into a separate job; since each variant is tested independently against the phenotype, this parallelization helps accelerate the end-to-end runtime.
The inputs to the second step are:
23 sets of WES genotype files (ukb23159_c[1-22,X]_b0_v1.[bgen,bgi,sample]).
ID file of SNPs that pass QC: final_WES_snps_CRCh38_qc_pass.id
Phenotype file: ad_risk_by_proxy_wes.phe
Prediction list: ukb_c1-22_hs38DH_merged_pred.list (from Step 1)
LOCO file: ukb_c1-22_hs38DH_merged_1.loco (from Step 1)
After the completion of Step 2, a set of association plots is generated; results are concatenated per phenotype. The outputs of the orchestration app include the concatenated Manhattan plots as well as summary statistics in lmm.tsv.gz (under Extra output files/extra_outputs) that can be investigated with LocusZoom.
We used the following input options to run the regenie app using the UI. If an input option isn’t specified, then the default option was used.
Inputs:
Genotype BED for Step 1: ukb_c1-22_hs38DH_merged.bed
Genotype BIM for Step 1: ukb_c1-22_hs38DH_merged.bim
Genotype FAM for Step 1: ukb_c1-22_hs38DH_merged.fam
Genotype BGEN files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen
Genotype BGI index files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen.bgi
Sample files for Step 2: ukb23159_c[1-22,X]_b0_v1.sample
Phenotypes file: ad_risk_by_proxy_wes.phe
Covariates file: ad_risk_by_proxy_wes.phe
Variant IDs to extract (Step 1): final_array_snps_CRCh38_qc_pass.snplist
Variant IDs to extract (Step 2): final_WES_snps_CRCh38_qc_pass.snplist
Common
Quantitative traits: FALSE
Phenotypes to include: ad_by_proxy
Array of covariates to apply: age,sex,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10
WGR (Step 1)
Size of the genotype blocks (Step 1): 1000
First allele as reference (Step 1): TRUE
Association Testing (Step 2)
Size of the genotype blocks (Step 2): 200
Firth approximation instead of SPA: TRUE
Minimum allele count (MAC): 3
To run regenie in the CLI, do not directly copy and paste this portion of code. Fill in the file paths specified by <>, and add file paths for the chromosomes you want to analyze using the igenotype_bgens/bgis/samples inputs.
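As a purely hypothetical sketch (the app name and input names below are assumptions and must be checked against the app's own help, e.g. with dx run <regenie app> -h; only chromosome 1 inputs are shown):

```bash
# Hypothetical launch of the regenie app from the CLI; input names are assumptions.
dx run <regenie_app> \
  -igenotype_bgens="<path>/ukb23159_c1_b0_v1.bgen" \
  -igenotype_bgis="<path>/ukb23159_c1_b0_v1.bgen.bgi" \
  -igenotype_samples="<path>/ukb23159_c1_b0_v1.sample" \
  <remaining inputs, matching the UI options listed above>
```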
Next, you can use LocusZoom to visualize the results of your analysis. The LocusZoom app is available in the Tools Library of the Research Analysis Platform. It generates Manhattan plots, QQ plots, and a table of the most significant loci.
To visualize your regenie outputs (e.g. the lmm.tsv.gz file), provide it as input to LocusZoom and follow the app's instructions. From these results you can conduct further analyses and look into variant effects, or you may find new loci to explore.
Examples of these visualizations are shown below.
A few example cases that might be of interest were tested to provide expectations for the regenie runtime.
The runtime and resources for various Step 1 scenarios are summarized in the following table. Please note that these figures are for orientation only; analysis-specific factors can influence the final results.
For Step 2, we focus on chromosome 1 (99,236 variants after QC), since all chromosomes are analyzed in parallel. The Step 2 runtime varied from 7 to 140 minutes across the use cases mentioned above. The resources used are relatively stable across the tested scenarios, and the defaults are set to 3700 MB RAM and 40 GB of disk.
If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
For each variant included in a mask, the LOVO scheme will build the mask excluding that variant. Hence, if there are K variants included in a mask, K LOVO masks will be generated. Thus, if a variant contributes strongly to the full mask signal, the corresponding LOVO mask (which excludes it) should show a weaker association signal.
| Type of field | Syntax for field name | Example |
| --- | --- | --- |
| Neither instanced nor arrayed | p<FIELD> | p31 |
| Instanced but not arrayed | p<FIELD>_i<INSTANCE> | p40005_i0 |
| Arrayed but not instanced | p<FIELD>_a<ARRAY> | p41262_a0 |
| Instanced and arrayed | p<FIELD>_i<INSTANCE>_a<ARRAY> | p93_i0_a0 |
| Data-Field ID | Description | Data-Coding |
| --- | --- | --- |
|  | Sex |  |
|  | Father's age at death | Units of measurement are years or 100291 |
|  | Mother's age | Units of measurement are years or 100291 |
|  | Father's age | Units of measurement are years or 100291 |
|  | Mother's age at death | Units of measurement are years or 100291 |
|  | Illnesses of father |  |
|  | Illnesses of mother |  |
|  | Age at recruitment | Units of measurement are years |
|  | Genetic sex |  |
|  | Genetic ethnic grouping |  |
|  | Genetic principal components | Units of measurement are Units; Array indices run from 1 to 40 |
|  | Sex chromosome aneuploidy |  |
|  | Genetic kinship to other participants |  |
|  | Outliers for heterozygosity or missing rate |  |
|  | Diagnoses - main ICD10 |  |
|  | Diagnoses - secondary ICD10 |  |
|   | IID | sex | age | missingness | ethnic_group | sex_chromosome_aneuploidy |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0000011 | 1 | 45 | 0.003 | 1 | None |
| 1 | 0000022 | 0 | 67 | 0.022 | 1 | None |
| 2 | 0000003 | 1 | 50 | 0.001 | 1 | None |