The UK Biobank Research Analysis Platform is built on the DNAnexus Platform.
This documentation provides descriptions and instructions for accessing UK Biobank data within the Research Analysis Platform.
For a detailed rundown of DNAnexus Platform features and how to leverage them in your work, refer to the Platform documentation.
UK Biobank is a national and international health resource with unparalleled research opportunities, open to all bona fide health researchers. UK Biobank aims to improve the prevention, diagnosis and treatment of a wide range of serious and life-threatening illnesses – including cancer, heart diseases, stroke, diabetes, arthritis, osteoporosis, eye disorders, depression and forms of dementia. It is following the health and well-being of 500,000 volunteer participants and provides health information, which does not identify them, to approved researchers in the UK and overseas, from academia and industry.
The UK Biobank Research Analysis Platform is an informatics platform that provides access to, and analysis of, UK Biobank data by its registered researcher community. Read the announcement.
See the DNAnexus website for more information on the Research Analysis Platform, including details on applying for access.
This FAQ addresses questions related to the new data dispensing functionality that allows users to select which elements of the data to dispense. If you would like more information on the new 500k WGS data release, visit the UK Biobank FAQ.
You can subscribe at https://status.dnanexus.com/
The refresh feature is currently unavailable, to ensure that the maximum number of users can get access to the new data via dispensal as soon as possible.
We recommend that users dispense a new project to get the 500k WGS data, and migrate data analysis workflows from existing projects to the new project. We will enable the “refresh” feature again in the future and send notifications out once it is available.
We recommend that each research application dispense data to only one project to be considerate to other researchers who would like to access the data.
Each dispense request will take about 4-8 hours once your project starts dispensing. However, due to the large number of people interested in 500k WGS data and the size of this data, you might experience a long waiting time for your project dispensal to start due to the queue of requests. Please do not dispense more than one project.
Your request to dispense data may be queued behind that of other users. The system will service your request in the order it was received. We appreciate your patience during that time. For more information, see https://dnanexus.gitbook.io/uk-biobank-rap/frequently-asked-questions#i-created-a-project-but-its-stuck-at-0-.
You will need to create a new project in order to access the data. Note that you will not be able to refresh an already existing project. On the project creation screen, users will now see a new section with the different data types available to dispense. For a faster dispensal time, only select what data you’ll need. You will have the option to dispense the data on project creation or later in the project settings of that new project.
If you are interested in accessing the updated phenotypic, health care and proteomics data, select structured tabular data. This option is selected by default, but can be unselected if the data is not necessary for your project.
If you are interested in accessing the updated imaging data or the population-level WGS pVCF data, select unstructured bulk data files. This option will dispense population-level WGS pVCF data (600,000 files), but not individual-level WGS data such as CRAM or gVCF files. This was decided in order to streamline the new project experience for all users. If your research requires access to the individual-level WGS data (18 million files), return to the project once the initial dispensing is completed and request an additional dispensing of these data files.
Due to the size of the dispensal, we recommend waiting until demand for the WGS data has decreased.
Due to the size of the data (18 million files), we recommend waiting until demand for the WGS data has decreased. If your research requires access to the individual-level WGS data, you will have to request "Additional Bulk Data Files" after your first request has been completed. You can make the request in your project settings by selecting the "Dispense More Data" button.
You can create an empty project without dispensing data by deselecting both checkboxes on the project creation screen.
See the below table for details.
They can be found at the two locations below:
/Bulk/GATK and GraphTyper WGS/GraphTyper population level WGS variants, pVCF format [500k release]/
/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]
Get started using the Research Analysis Platform.
Data dispensal can take over an hour, or even longer, in some cases. You can monitor the process by going back to the project list, where you can see the project status, including what percentage of the data has been populated.
To collaborate with colleagues who are named on the same approved access application, add them as members to the project associated with this application.
Get answers to common questions about the Research Analysis Platform, and about UK Biobank data and systems.
The Research Analysis Platform is open to researchers who are listed as collaborators on UK Biobank-approved access applications.
Registration is a two-step process. You must first create a Research Analysis Platform user account, and then you must link it to your UK Biobank Access Management System (AMS) account.
If your organization has been set up for Single Sign On (SSO), follow internal procedures specific to your organization. Otherwise:
If you already have a DNAnexus account, you do not need to create a separate Research Analysis Platform account. You can use your existing DNAnexus account on the Research Analysis Platform.
No, your username and email address on the Research Analysis Platform can be different from those you use on the AMS.
To log in:
If your organization has been set up for SSO, follow internal procedures.
This process happens automatically upon first login (see "How do I log in to the Research Analysis Platform?"). You will be presented with the Research Analysis Platform Terms of Service, and once you read them (by scrolling down) and accept them, you will be taken to the AMS website, where you must enter your UK Biobank credentials.
No. To access the Research Analysis Platform you must have an AMS account, and you must be listed as a collaborator in one of the Research Analysis Platform-approved applications.
No, an AMS account may be linked to only one Research Analysis Platform account.
No, a Research Analysis Platform account may be linked to only one AMS account.
No, this operation is not supported.
Occasionally the platform may ask you to refresh your link, for security reasons. Among other reasons, this can happen if your state on the AMS changes (e.g. if you update your contact details on the AMS).
Currently, there is no limit on how many projects users can create. However, we recommend that everyone under the same research application use a single shared project. This allows better coordination when there is a new data release, as well as better reuse of tools, workflows, and data that users have generated.
Projects on UKB-RAP that are considered inactive or unfunded are eligible for deletion and will be removed if one of the following criteria is met. This helps ensure the best user experience for active projects and helps optimize use of the platform.
The project has not been accessed in the last 60 days, meaning that no one with access to the project has made requests to browse its folders and files. In addition, the project contains only the dispensed UK Biobank data and no derived data generated by the user or others.
OR
The project is billed to a wallet that has no funds available. In addition, the project contains generated data resulting in ongoing storage charges.
If your project is considered inactive and for any reason you would like to keep it, simply access the project again. If the project is no longer inactive, it will not be deleted. If your project is considered unfunded and for any reason you would like to keep it, take one of the following actions:
Add funds to the wallet that the project is billed to.
Transfer the project to another wallet that has funds.
Delete all generated data and ensure that no user-generated data remains in the project.
On the UK Biobank Research Analysis Platform, all work takes place in the context of a project. Projects allow a defined set of users to:
Access specific datasets
Conduct analyses on these datasets
Please ensure that you are listed as a collaborator in the access application on the AMS.
No, each project is tied to one application only.
No, the application is set at project creation time and cannot be changed.
Yes, you can make multiple projects using the same application id.
The most common reason for data not showing in the Research Analysis Platform is that your UKB Access Application has not been fully completed.
If you have applied to move your application to a tiered application, or requested further data, this will need to go through a number of steps: quotation, new MTAs, payment, etc. You will receive an email notifying you that the MTA has been executed. MTA execution is the final step in the process. Once you have received this email, your data will be ready for dispensal. If you have already had data dispensed to a project on the Platform, you will need to have data dispensed to a new project in order to receive any new data.
The process of dispensing data happens over a period of time. When you first select the Create Project button to submit the New Project dialog, the new project will appear empty. Subsequently, it will begin to get populated with files and other data. You can monitor the process by going back to the project list, where you can see the project status, including what percentage of the data has been populated.
The process can take as little as 20 minutes or as long as a full day, depending on the scope of the access application.
No, the process happens in the background, even if you are logged out.
Your request to dispense data may be queued behind that of other users. The system will service your request in the order it was received. We appreciate your patience during that time.
The system dispenses the data that correspond to the approved data-fields of the access application associated with the project. Tabular data-fields and linked health data are dispensed into a SQL database, and bulk data-fields are dispensed as files.
If you use the same account on both platforms, you will be able to access and use Research Analysis Platform projects on the DNAnexus Platform. Note, however, that:
You will only be able to access and use tools that are hosted in the London AWS region.
All sharing, download, and other data-use restrictions apply fully to UK Biobank data, on both platforms.
You can get more information on your access application by logging into the AMS and selecting the Applications tab on the left.
The "Bulk" folder contains files associated with data-fields of type "bulk". These are data items that are particularly large and/or complex and are therefore made available as files, such as genome sequencing files.
When you create a project on the Research Analysis Platform, the system contacts UK Biobank to get your application's EIDs, then uses them to pseudonymize your dataset. The pseudonymized EIDs are used, for example, to populate the "eid" column in the database, to name per-participant files, to generate the EID-specific content of FAM files for genotyping fields, and to adjust pVCF headers.
Yes, for a given application, data on the Research Analysis Platform contain the same EIDs as data directly downloaded from UK Biobank's website.
You can also do the same using the CLI. Type:
dx find data --property eid=1234567
Note that these methods find participant-specific files (like individual VCF or CRAM files) and not cohort-wide files (like PLINK or pVCF files).
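If you prefer to script this lookup, the dxpy Python bindings offer an equivalent search. Below is a minimal sketch, assuming dxpy is installed and you are logged in; the project ID and EID value are placeholders for illustration:

import dxpy

# Find files that carry the "eid" property for one (placeholder) participant.
results = dxpy.find_data_objects(
    classname="file",
    properties={"eid": "1234567"},   # placeholder EID
    project="project-XXXX",          # placeholder project ID
    describe=True,
)
for result in results:
    print(result["describe"]["name"])

As with the CLI command above, this returns participant-specific files rather than cohort-wide files.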
These samples correspond to participants that have withdrawn, and the Research Analysis Platform uses this convention to denote them in the header, to help you exclude them from your research.
The pseudonymized pVCF headers are specific to each access application. Researchers who work on different applications will encounter different headers for each, just as they encounter different content for the FAM files of PLINK fields.
No, the content of these files is not pseudonymized. However, the names of these files are pseudonymized accordingly. Therefore, we recommend relying on the filename prefix for determining the EID corresponding to a gVCF or CRAM file, and discarding any identifiers found in the gVCF or CRAM header.
The files in the "Showcase metadata" folder represent the showcase metadata at the time that the data was ingested in the system, and may not reflect the latest showcase updates.
For information about data provenance, please consult the UK Biobank website or contact UK Biobank directly.
All other non-bulk data-fields (for which UK Biobank defines the item type as "data", "sample", or "record") are dispensed into a SQL database and associated Research Analysis Platform dataset.
To help you comply, the Platform may restrict external downloads of certain original files, using rules specific to your application tier. These restrictions are not comprehensive, and it is your responsibility to refrain from actions that would violate the MTA even if the Platform does not technically restrict those actions.
In the projects list, under the "Members" column, select the number corresponding to the row of interest. Alternatively, from inside a project, select the Share icon on the upper right (next to the "Access:" label).
While the project is being populated with data, the system adds a service user called "UK Biobank Robot" (username: ukb.robot).
The system automatically adds this service user to a project whenever the project is being edited or updated, such as when data is being dispensed in a newly created project. The system uses that user to perform any necessary data manipulations in an automated manner.
No, but the system will automatically remove that user once any necessary data manipulations are completed.
If you are a project administrator, from inside a project select the Share icon on the upper right to launch the sharing dialog. Enter the username or email of the user you want to share the project with, select their access level, then select Add User.
You can only share a project with Research Analysis Platform users who are listed as collaborators in the project's access application on the AMS. If you receive an error, please ensure the following:
The username or email you are entering exists. You cannot share a project with someone if they have not yet signed up for an account.
You are sharing with a linked Research Analysis Platform account. You cannot share a project with an account if they have not yet logged into the Research Analysis Platform and linked their account to the AMS (or if their link needs to be refreshed).
You are sharing with someone on the same application. You cannot share a project with a linked Research Analysis Platform account unless they are listed as collaborators in the project's access application on the AMS.
No, you must share with each person individually, as the platform needs to enforce AMS permissions at the user level.
Yes. By default, Customer Support does not have access to any projects, unless you explicitly share a project with them. To do that, in the project sharing dialog enter "org-support" (without the quotes) as the username, select Viewer as the access level, and select Add User.
Yes. The system supports a special alias that you can use to share a project with UK Biobank. In the project sharing dialog enter "org-ukb_reviewers" (without the quotes) as the username, select Viewer as the access level, and select Add User. This action shares your project with a specific UK Biobank team, managed by UK Biobank themselves. The purpose of this team is to receive your research results.
Sharing is on a project basis. If you need to share a subset of data, such as the files in one folder, we recommend copying them to a new project and then sharing that project, as follows:
In the project list page, select New Project. Enter the same application id as your existing project, and deselect the option Dispense data to the project. Select Create Project. This will create a new empty project, associated with the same application as your existing project.
In your existing project, tick the items you want to share, and select Copy. Select the new project, then select Copy Selected.
Share the new project.
You may only copy data across projects associated with the same access application. If you have uploaded a file in a project associated with one application, and you need to use it in a second project associated with a different application, you must re-upload it in the second project.
To visualize a CRAM or VCF file in the IGV.js genome browser, follow these steps:
Navigate to the project containing the files you want to visualize.
Select the VISUALIZE tab.
Select the option "IGV.js Genome Browser v2.6.6 (*.bam+bai, *.cram+crai, *vcf.gz+tbi)".
Select the files you want to visualize.
If you are looking for a specific participant, enter the EID in the Search Project textbox, to quickly locate any CRAM or VCF files related to that participant.
For CRAM files, you must select both the CRAM and the associated CRAI file.
For VCF files, you must select both the VCF and the associated TBI file.
IMPORTANT: Note that IGV.js cannot visualize extremely large pVCF files, such as those provided for the 200k WES, 300k WES or 150k WGS releases. If you want to visualize variants in the 150k WGS cohort, type either "qc_metrics_graphtyper" or "qc_metrics_gatk" into the Search Project textbox and select the resulting pair of *.tab.gz and *.tab.gz.tbi files.
Select Launch Viewer.
This is a database containing tables, columns, and rows that correspond to the approved data-fields of the access application associated with the project. It is a SQL database based on Spark SQL technology, which is a more modern and scalable technology than classic relational database management systems (RDBMS).
The database contains the following tables:
For the main UK Biobank participant tables, the column naming convention is generally as follows:
p<FIELD-ID>_i<INSTANCE-ID>_a<ARRAY-ID>
However, the following additional rules apply:
If a field is not instanced, the _i<INSTANCE-ID> piece is skipped altogether.
If a field is not arrayed, the _a<ARRAY-ID> piece is skipped altogether.
If a field is arrayed due to being multi-select, the field is converted into a single column of type "embedded array", and the _a<ARRAY-ID> piece is skipped altogether.
Examples:
Age at recruitment: p21022
Date of attending assessment centre: p53_i0, p53_i1, ...
Diagnoses - ICD10 (converted into embedded array): p41270
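To illustrate how these column names are used in practice, here is a minimal Spark SQL sketch you might run in a Spark-enabled JupyterLab session. The database and table names are placeholders; substitute the ones dispensed to your project:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder database name; use the one in your project's root folder.
spark.sql("USE app12345_20210101123456")

# p21022 is not instanced, p53_i0 is instance 0 of an instanced field,
# and p41270 is an embedded array of ICD10 codes.
spark.sql("SELECT eid, p21022, p53_i0, p41270 FROM participant_0001 LIMIT 10").show(truncate=False)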
The Research Analysis Platform holds a copy of all UK Biobank data. All projects are created using this copy of UK Biobank data. As UK Biobank updates the data on their end, the copy held by the Research Analysis Platform is periodically updated to reflect these upstream updates. Whenever the Research Analysis Platform updates its copy of the data, it will be indicated by a new data release version.
The data in your project will be dispensed out of whatever copy is held by the Research Analysis Platform at the time that you create the project. Therefore, your data will correspond to the latest data release version at that time.
In a project, locate and tick the dataset that was dispensed in the root folder. Click the info icon on the upper right to open the info panel. Scroll to the bottom to reveal the "Details" section. The value of the "Description" key contains the original version, e.g.
"Description" = "Dataset: app68444_202101290057.dataset Original Version: v3.0+ae7924f"
After each data release, the data needs to be unrestricted by UK Biobank before it is available to researchers. If you begin the process of data dispensal during this time, the project will be set to the latest version, but the restricted data will still not be available.
You can re-dispense data after it has been unrestricted by UK Biobank. UK Biobank will typically send out an email after the initial data release to notify users when data has been unrestricted. The version numbers of data releases can be the same; however, the version signature will change when restricted data becomes available. For example, a user whose current data is "v3.0+ae7924f" could re-dispense data to version "v3.0+ae9999f", even though both versions start with "v3.0". The version signatures, "ae7924f" and "ae9999f", differ between these two data dispensal batches.
No. The data update process will update all the data fields that your application is eligible for.
The update process will take some time to complete. Your request to update the data may be queued behind that of other users. The system will service your request in the order it was received. We appreciate your patience during that time.
If ongoing jobs use files that need to be removed as part of the data update, these jobs may fail. We recommend starting the update process when there are no jobs running and waiting for the update process to complete before starting any new jobs.
If you have received confirmation from UK Biobank that your grant application has been approved, the next step is to create a new grant org on the Research Analysis Platform. This will enable you to receive the grant. See the next question for more information.
Log onto the Research Analysis Platform.
From the main Platform menu, select Org Admin.
From the dropdown menu, select All Orgs.
On the Organizations list page, click the New Organization button.
A New Organization form will open in a modal window. Enter the following information in the form:
Org Name: Enter a name of your choice for this organization.
Org ID: You can edit the default value, as long as you preserve the prefix "ukbgrant_yymm_". Note that this field has a character limit. A valid org ID would be, for example, "ukbgrant_yymm_FirstNameLastName".
Click Create Organization.
The Set Up Billing modal window will open, and you will be prompted to set up billing for your new organization. Do not set up billing. Instead, click Exit to close the modal window.
From the main Platform menu, select Projects, then All Projects.
In the projects list, find the row for the project in question.
Click the vertical "..." icon at the end of the row, then select Project Settings from the More Actions menu.
Locate the Billed To field, in the Billing section of the Settings screen.
Click on the downward-facing caret at the right end of the Billed To field.
Select the grant org from the list of available billing accounts.
Credits are issued quarterly at the beginning of March, June, September, and December. Your approval email from UK Biobank will specify when your credits will be issued, pending the creation of an org to receive the credits.
Yes, you must create a new grant org to receive enhanced credits.
No, funds from the grant org cannot be transferred to other accounts.
Check the billing account used by your project or projects. Be sure that the grant org is set as the "Billed To" account for each.
Field Category | Field ID | Field Title |
---|---|---|
to create a Research Analysis Platform account, access the Platform, and connect your account to the UK Biobank Access Management System.
On the Research Analysis Platform, you'll be able to work with UK Biobank data from data-fields associated with your approved access application. To access this data, .
Each account has an initial credit of £40 toward usage charges. Before depleting this credit, , in order to continue using the Platform.
for help with accessing and using the Platform.
The gives Platform users a forum to share what you've learned and done on the Platform, and learn from one another.
If you do not have an account, visit the and select Create an account. You will need to provide your full name and email, as well as a username and password that you want to use.
It looks like you already have a DNAnexus account. If your organization has been set up for SSO, follow your organization's internal procedures. Otherwise visit the Research Analysis Platform and enter your email address. You will receive an email with a password reset link, which you can use to reset the password of your account.
Otherwise, visit the and select Log In to log in with your Research Analysis Platform account.
If you have forgotten your AMS username or password, you can retrieve them via the .
Create an AMS account via the .
You must finish the AMS registration process, and be approved by UK Biobank. For more information, see the .
See detailed instructions on the page.
An access application is a research application submitted to UK Biobank by a Principal Investigator. It includes a written research proposal and a set of to which access is requested. UK Biobank assigns a unique numeric identifier to each application. All activity on the Research Analysis Platform needs to be done within the context of such an access application.
For more information on your access application, log into the (AMS) and select the Applications tab.
For new applications, please check your application in the . If the application's status is “Underway,” then your data should be ready to be dispensed to your project on the Platform.
All data in the UK Biobank resource are organized into data-fields. Your access application is approved for a precise subset of those data-fields. You can find more information about data-fields, broken down by type, on the .
See , for data within the "Bulk" folder.
See , for data within the "Bulk" folder.
UK Biobank is a resource compiled from approximately 500,000 volunteer participants. Each participant is uniquely identified by a 7-digit numeric identifier (EID), typically in the 1,000,000 - 6,000,000 range. These identifiers are scrambled for each access application, hence the EIDs will not match across applications. For more information, refer to .
You can find all files associated with a specific participant EID using the Web UI. Inside a project, select the filter icon on the right, then select the filter settings icon below and choose Properties. A new property filter will appear in the filter bar. Select Any properties and type "eid" (lowercase, without the quotes) in the left textbox that is labeled "Any Key". Type the numeric 7-digit value in the right textbox that is labeled "Any Value", and select Apply. To search across all folders, make sure the search scope is set to Entire Project instead of the default Current Folder Only.
This folder contains all the files published by UK Biobank, as described on the . These files describe aspects of the UK Biobank Showcase, including all fields available in the UK Biobank resource.
From a policy standpoint, you are responsible for complying with the Material Transfer Agreement (MTA) and with any other rules set forth by UK Biobank. As of June 2021, Annex 1 of the MTA states that "any WGS (whole genome sequence) or WES (whole exome sequence) files [..] must not be transmitted or downloaded from the research analysis platform". In addition, depending on the of your access application, you may or may not be allowed to egress certain other data.
.
You can assign each job a different priority, depending on whether you want to prioritize job execution speed or cost control. See the page .
See the for a full list of available AWS instance types, including detailed specs for each on number of cores, amount of RAM, storage memory type and size, and cost.
See for more information about databases and datasets.
For all other tables (such as hospital records, GP records, death records, or COVID-19 records), the column names are identical to what UK Biobank provides in the data showcase. For more information on the columns of these tables, consult (hospital records), (GP records), (death records), (COVID-19 GP records), or (COVID-19 test results).
See .
Your existing project is not affected, and will continue to reflect the data release version from the time that the project was created. Data updates will not happen automatically and you have the choice to decide whether or not you want to update your project data. If you choose to update your dispensed projects, the files and tabular data in your project will be updated. See the details .
To learn how to get the most recent data update, see the page .
No. If you have been approved for new fields, this change will not apply to existing projects automatically. To get access to the new data you are approved for, you will need to perform a data update. To learn how to get the most recent data update, see the page .
Previously-generated result files are not affected. Cohorts and dashboards that point to the previous dataset will be evaluated against the updated database. To migrate these to the latest dataset, run the "” app.
Make sure that you've logged into your . Once you've done so, you should be able to create a new org, following the instructions just above.
Do not upgrade your grant org to a billable account, or add your credit card to a grant org account. Grant orgs are only for receiving and using grants. If funds in the grant org account are running low, change the billing account used by any affected project, so that it uses a personal billing account linked to a credit card. You can also apply for additional credits. .
Exome sequences
Exome OQFE variant call files (VCFs)
Exome OQFE variant call file (VCFs) indices
Exome OQFE CRAM files
Exome OQFE CRAM indices
Exome sequences - Previous exome releases
Exome OQFE variant call files (VCFs) - interim 200k release
Exome OQFE variant call file (VCFs) indices - interim 200k release
Exome OQFE CRAM files - interim 200k release
Exome OQFE CRAM indices - interim 200k release
Exome sequences - Alternative exome processing
Exome variant call files (DRAGEN) (VCFs)
Exome variant call file (DRAGEN) (VCFs) indices
Whole genome sequences - GATK and GraphTyper
Whole genome GATK variant call files (VCFs) and indices [500k release]
Whole genome GATK CRAM files and indices [500k release]
BQSR - GATK BaseRecalibrator [500k release]
Concatenated QC Metrics [500k release]
Genotype Concordance [500k release]
Genotype Concordance - Contingency Metrics [500k release]
Genotype Concordance - Detail Metrics [500k release]
Genotype Concordance - Summary Metrics (Picard) [500k release]
Sample Contamination (ReadHaps) [500k release]
Sample Contamination (verifyBamID) - depthSM [500k release]
Sample Contamination (verifyBamID) - selfSM [500k release]
Whole genome sequences - Dragen WGS
Whole genome CNV call files (DRAGEN) [500k release]
Whole genome CNV supplementary files (DRAGEN) [500k release]
Whole genome CRAM files (DRAGEN) [500k release]
Whole genome CYP2D6 genotype calls (DRAGEN) [500k release]
Whole genome STR call files (DRAGEN) [500k release]
Whole genome STR supplementary files (DRAGEN) [500k release]
Whole genome SV call files (DRAGEN) [500k release]
Whole genome SV supplementary files (DRAGEN) [500k release]
Whole genome diagnostics files (DRAGEN) [500k release]
Whole genome supplementary files (DRAGEN) [500k release]
Whole genome variant call files (GVCFs) (DRAGEN) [500k release]
Whole genome variant call files (VCFs) (DRAGEN) [500k release]
Genotypes - Genotyping process and sample
CEL files
Whole genome sequences - Previous WGS releases - WGS pilot studies
BGI WGS CRAM files
Broad WGS CRAM files
Table name | Description |
| These tables contain the main UK Biobank participant data. Each participant is represented as one row, and each data-field is represented as one or more columns. For scalability reasons, the data-fields are horizontally split across multiple tables, starting from table participant_0001 (which contains the first few hundred columns for all participants), followed by participant_0002 (which contains the next few hundred columns), etc. The exact number of tables depends on how many data-fields your application is approved for. |
| Hospitalization records. This table is only included if your application is approved for data-field #41259. |
| Hospital critical care records. This table is only included if your application is approved for data-field #41290. |
| Hospital delivery records. This table is only included if your application is approved for data-field #41264. |
| Hospital diagnosis records. This table is only included if your application is approved for data-field #41234. |
| Hospital maternity records. This table is only included if your application is approved for data-field #41261. |
| Hospital operation records. This table is only included if your application is approved for data-field #41149. |
| Hospital psychiatric records. This table is only included if your application is approved for data-field #41289. |
| Death records. This table is only included if your application is approved for data-field #40023. |
| Death cause records. This table is only included if your application is approved for data-field #40023. |
| GP clinical event records. This table is only included if your application is approved for data-field #42040. |
| GP registration records. This table is only included if your application is approved for data-field #42038. |
| GP prescription records. This table is only included if your application is approved for data-field #42039. |
| GP clinical event records (COVID TPP). This table is only included if your application is approved for data-field #40101. |
| GP prescription records (COVID TPP). This table is only included if your application is approved for data-field #40102. |
| GP clinical event records (COVID EMIS). This table is only included if your application is approved for data-field #40103. |
| GP prescription records (COVID EMIS). This table is only included if your application is approved for data-field #40104. |
| COVID19 Test Result Record (England). This table is only included if your application is approved for data-field #40100. |
| COVID19 Test Result Record (Scotland). This table is only included if your application is approved for data-field #40100. |
| COVID19 Test Result Record (Wales). This table is only included if your application is approved for data-field #40100. |
| Olink NPX values for the instance 0 visit. This table is only included if your application is approved for data-field #30900. For scalability reasons, the protein columns are horizontally split across multiple tables, starting from table |
| Olink NPX values for the instance 2 visit. This table is only included if your application is approved for data-field #30900. |
| Olink NPX values for the instance 3 visit. This table is only included if your application is approved for data-field #30900. |
Learn how to create a Research Analysis Platform account, access the Platform, and connect your account to the UK Biobank Access Management System.
Before you can use the Research Analysis Platform, you need to:
Create a UK Biobank Access Management System (AMS) account, via the AMS signup page.
Ensure that you’ve received UK Biobank access approval from the UK Biobank Access Management Team (AMT). If you haven’t, log into the AMS and follow the directions to complete your registration and get AMT approval.
Ensure that you are listed as a collaborator on a UK Biobank-approved access application by that application's Principal Investigator (PI).
See the AMS user guide for more on creating and managing an AMS account.
If you already have a DNAnexus account, log into the Research Analysis Platform using your existing username and password. Then skip to Connecting Your Account to UK Biobank below.
If your organization uses SSO with DNAnexus, follow your organization's login procedures to access the Research Analysis Platform. Note that your AMS and DNAnexus accounts do not need to use the same email address.
If you need to create an account, follow the instructions in the next section.
If you don’t already have a Research Analysis Platform account, you’ll need to create one. Here’s how:
Navigate to the Research Analysis Platform. Click Create an Account.
Fill out the Create New Account form, then click Create Account. Note that in selecting a username, you don’t need to use the same username you use on the UK Biobank AMS.
You'll receive an email with a link to click, to activate your account. Click the link.
Log into the Platform.
Complete your profile, then click Access Platform.
Once you have a Research Analysis Platform account, you need to connect it to your AMS account. You need to do this only once.
The first time you access the Research Analysis Platform, you’ll see a prompt to Connect Your Account to UK Biobank.
Read the Terms of Service, then click Accept & Continue.
You'll be redirected to the AMS. Log into the AMS.
Your AMS and Research Analysis Platform accounts are now connected.
You'll be redirected once again, this time back to the Research Analysis Platform. You can now use the Platform.
You can log in without having to provide a username and password for a certain amount of time by using authentication tokens (henceforth referred to as tokens). Broadly, authentication tokens are generated by providing your username and password to the platform and specifying a time period for which the token may be used to log in.
If you provide your token to a third party, they may access the Research Analysis Platform with your token, effectively impersonating you as a user. They will have the same access level as you for any projects to which the token has access, potentially allowing them to run jobs and incur charges to your account. Please keep your token safe and secure.
In order to generate a token:
After logging into the Research Analysis Platform, access your profile by clicking on your user icon in the top right side of the screen and click Profile.
Once you are on your profile page, click on the API Tokens tab.
The New Token modal will appear on screen. Fill out the required fields and then click Generate Token.
A pop-up will then appear saying “Your token has been generated. Please copy it for later use; for security purposes, this is the only time you will see it.” with a 32-character token composed of letters and numbers on the line below. Copy this token to a secure location and save it for later.
Some examples in which tokens might be used include the following:
Writing a script: Tokens can be helpful when writing scripts that require logging in to the platform. However, if you incorporate a token into a script, the token should only be valid for as long as the script requires access to the platform; the "Expiration Date" field should be modified accordingly. Furthermore, the token should only be granted access to the projects required by the script. If your script uploads data to only one project, the "Token Scope" should reflect the limited access (see the sketch after this list).
Logging in to the command line for interactive use: If for some reason you don't wish to log in to the command line with your username and password, tokens are quite useful. This is the only scenario in which you should use a full-scope token, thereby allowing you to access all of your available projects.
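As a minimal sketch of how a script might authenticate with such a token using the dxpy Python bindings (the environment variable name is illustrative, and dxpy must be installed):

import os
import dxpy

# Read a limited-scope token from the environment instead of hard-coding it (illustrative variable name).
token = os.environ["RAP_API_TOKEN"]

# Authenticate the dxpy session with the token.
dxpy.set_security_context({"auth_token_type": "Bearer", "auth_token": token})

# Simple check that the token works: print the authenticated user ID.
print(dxpy.api.system_whoami()["id"])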
In order to use a token to log in on the command line, you must use dx login with the --token flag, for example: dx login --token <TOKEN>
You must navigate to the API tokens tab of your profile in order to revoke a token.
Once in the API tokens tab, select which token you wish to revoke and then click the Revoke button:
Once you confirm that you wish to revoke the token, the token will be revoked.
Some examples in which tokens might be revoked include the following:
Token accidentally shared too widely: If more people have access to your token than you would like, revocation of the token will cut off access to your account by unwanted parties.
Token no longer needed: For example, if the script utilizing the token is no longer in use, if the group granted access to the platform using the token no longer exists, or in any other instance in which the token is no longer necessary, you should revoke it to restrict access to your account.
See Step 3 in the Before You Begin section above. You must complete the AMS registration process, and get the approval of the AMT.
For additional information, see the Creating an Account and Registration sections of the AMS Getting Started guide.
You already have a DNAnexus account. Use your existing username and password to log in.
Occasionally the platform may ask you to refresh your account association, for security reasons. Among other reasons, this can happen if your state on the AMS changes - if you update your contact details on the AMS, for example.
If you update your AMS account information, you may experience access issues until UK Biobank reviews and re-approves your account.
One of two issues is likely the reason why you've lost access to a project.
You may have been removed from the UK Biobank-approved access application. In this case, you will not be able to regain access to projects linked to that application.
To check the access applications in which you're included, log onto the AMS.
UK Biobank may have temporarily suspended the access application to which the project is linked. If this is the case, you can be added back to the project when the suspension is lifted.
If you are the admin of the wallet to which the project is billed, follow these steps to add yourself back to the project:
Select Org Admin from the main Platform menu, then select the org wallet in question.
Open the Projects tab
Select the project to which you need to regain access
Click Grant Permission
If you are not the wallet admin, but the admin is also in the access application to which the project is linked, have the wallet admin:
If they are not already in the project in question, follow the steps above to add themselves to the project
Share the project with you
Optionally, remove themselves from the project
If you are not the wallet admin, and the wallet admin is not in the access application to which the project is linked, have the wallet admin transfer project ownership to you.
Note that:
An AMS account may be associated to exactly one Research Analysis Platform account
A Research Analysis Platform account may be associated to exactly one AMS account
The Research Analysis Platform holds a copy of all UK Biobank data that is used to create all of the platform's projects. The UK Biobank periodically releases new data and makes updates to the existing data. Whenever the Research Analysis Platform updates its copy of the data, it will be indicated by a new data release version. The UK Biobank might expand or remove certain eligible fields within users’ applications at times. To get up-to-date data in their projects, users can check if an update is available and perform a data refresh to update the dispensed data.
The data update process will synchronize the project against the latest data release. This affects both tabular data and file data.
For file data, the update process will dispense any new files, potentially rearranging previous folders if folder names have changed. It will also re-generate *.fam, *.sample, and ukb_rel.dat files, removing any instances of previous ones (even if the previous ones had been copied to other projects).
Users can also dispense a new project if they want the latest version of the data in a separate area, instead of updating a project in-place.
To check for updates, go to the "Settings" page of a dispensed project and click the "Check for Updates" button in the "UK Biobank" section.
By clicking "Check for Updates", if the dispensed data is already up-to-date, you should see the below:
If there is a data update available for the dispensed data, you should see the below:
When data updates are available, you can click the "Show Update" button to see more information about the latest data update.
NOTE: We highly recommend not launching a large number of jobs or performing clone/copy operations on a large number of objects in the project while the data update is in progress.
By clicking the "Start Update" button, you kick off the data refresh process. You can check the progress of the update process from the "status" section.
After the refresh is done, the status will return to "Ready".
Learn how to create a project and populate it with UK Biobank data.
On the UK Biobank Research Analysis Platform, all work takes place in the context of a project. Projects allow a defined set of users to:
Access specific datasets
Conduct analyses on these datasets
Once you have access to the Platform, start by creating a new project. You can follow the step-by-step instructions in the Setting Up Your Project section below.
On the Research Analysis Platform Projects screen, click the New Project button. The New Project wizard will open in a modal window.
In the Project Name field, enter a name for your project.
In the Application ID field, enter the number of the approved UK Biobank access application from which you'll draw the data to be used in this project.
Check the Dispense data to the project option to populate the project with the data specified on the linked access application.
In the Billed To field, choose a wallet to which project billable activities should be charged.
In the Access section, specify who will be able to Copy Data, Delete Data, and Download Data.
Within the Research Analysis Platform, every project must be linked to one and only one access application. A project cannot be linked to multiple access applications.
Dispensing data to your new project will take some time. Depending on the type of data being dispensed, this process can take over an hour, or even longer, in some cases. You can monitor the process by going back to the project list, where you can see the project status, including what percentage of the data has been populated.
Ensure that:
The UK Biobank access application lists you as a collaborator
You're using an application that has received UK Biobank approval
As noted above, the process of dispensing data happens over a period of time. When you first create a new project, you won’t see the data right away. You can monitor the process by checking the status of the project in the project list. The process can take as little as 20 minutes or as long as 2 hours, depending on the scope of the access application.
Your request to dispense data may be queued behind that of other users. The system will service your request in the order it was received. We appreciate your patience during that time.
Learn how UK Biobank data is organized and named on the Research Analysis Platform. Learn how to find and access bulk files and tabular data.
UK Biobank contains data collected from approximately 500,000 volunteer participants. Within an access application, each participant is identified by a unique, 7-digit number, or EID. An EID is typically a number between 1,000,000 and 6,000,000.
Note that each access application receives a different set of randomized EIDs, unique to the application. This EID randomization process - also known as "pseudonymization" - is managed by UK Biobank and is automatically applied to the data by the Research Analysis Platform. (For a given access application, data on the Research Analysis Platform contain the same EIDs as data directly downloaded from UK Biobank's website.)
When you create a project on the Research Analysis Platform, the system contacts UK Biobank to get your application's EIDs, then uses them to pseudonymize your dataset. The pseudonymized EIDs are used, for example, to populate the "eid" column in the database, to name per-participant files, to generate the EID-specific content of FAM files for genotyping fields, and to adjust pVCF headers.
When you create a project on the UK Biobank Research Analysis Platform, the system dispenses the data corresponding to the data-fields listed in the access application associated with the project.
The dispensed data correspond to a specific data release version. See Data Release Versions for more.
Within your project, the Bulk folder contains files associated with UK Biobank data-fields of type "bulk." These are particularly large and/or complex items, such as genotyping array data, genome sequencing data, imaging data, and fitness data.
The Bulk folder uses the following subfolder structure:
There is a subfolder for each UK Biobank bulk field category. For example, whole genome CRAM files are stored in the subfolder named Whole genome sequences. These categories are defined by the UK Biobank, specifically for the Research Analysis Platform.
Within each category subfolder, there is a subfolder for each bulk field (or group of related fields). For example, a subfolder named Whole genome CRAM files would contain files for that field.
Within each field folder, files related to an individual participant are grouped in subfolders named using the prefix of the participant's EID. Typically these are two-digit names, ranging between "10" and "60."
In certain cases, the system may dispense related files of different types into the same folder, to improve usability. For example, whole genome CRAM indices files (field ID #23194) would be dispensed into the same folder as whole genome CRAM files (field ID #23193), rather than into their own folder. Similarly, the folder /Bulk/Brain MRI/dMRI includes data from fields #20218 ("Multiband diffusion brain images - DICOM") and #20250 ("Multiband diffusion brain images - NIFTI").
The Research Analysis Platform uses the following naming conventions for bulk data files:
Files that contain data on an individual participant are named in this fashion:
<EID>_<FIELD-ID>_<INSTANCE-ID>_<ARRAY-ID>.<SUFFIX>
For example, whole genome CRAM files (field ID #23193) are named like so:
<EID>_23193_0_0.cram
Some exceptions apply to this rule. When a field is meant as a companion to a main field, such as a CRAI index accompanying a CRAM file, or a TBI index accompanying a VCF file, the system uses the prefix of the main field. For example, whole genome CRAM indices (field ID #23194) are named like so:
<EID>_23193_0_0.cram.crai
Files that contain data on a cohort of participants (such as PLINK, BGEN or pVCF files) are named in this fashion:
ukb<FIELD-ID>_c<CHROM>_b<BLOCK>_v<VERSION>.<SUFFIX>
Where <CHROM> represents the chromosome (such as "1", "2" or "X"), <BLOCK> represents an index (starting from "0") for datasets that have been split into multiple pieces, and <VERSION> represents a dataset version assigned by UK Biobank.
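As an illustration of these conventions, the following Python sketch parses both naming patterns; the filenames are made up for the example:

import re

# Per-participant files: <EID>_<FIELD-ID>_<INSTANCE-ID>_<ARRAY-ID>.<SUFFIX>
per_participant = re.compile(r"^(\d{7})_(\d+)_(\d+)_(\d+)\.(.+)$")
match = per_participant.match("1234567_23193_0_0.cram")   # illustrative filename
if match:
    eid, field_id, instance_id, array_id, suffix = match.groups()
    print(eid, field_id, instance_id, array_id, suffix)

# Cohort-wide files: ukb<FIELD-ID>_c<CHROM>_b<BLOCK>_v<VERSION>.<SUFFIX>
cohort_wide = re.compile(r"^ukb(\d+)_c([^_]+)_b(\d+)_v(\d+)\.(.+)$")
match = cohort_wide.match("ukb23156_c1_b0_v1.vcf.gz")      # illustrative filename
if match:
    field_id, chrom, block, version, suffix = match.groups()
    print(field_id, chrom, block, version, suffix)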
The Research Analysis Platform pseudonymizes the content of pVCF headers for the following fields:
When accessing pVCF files in these fields, the header is pseudonymized. The sample ids in the header are EIDs that correspond to the access application. If a participant has withdrawn, the corresponding sample id is marked as "W000001" (for the first encountered sample that belongs to a withdrawn participant), "W000002" (for the second encountered sample that belongs to a withdrawn participant), etc. Overall, the non-withdrawn EIDs in the pVCF header are expected to match the set of application EIDs used elsewhere, such as in the "eid" column of the phenotypic data and the FAM files of genotyping array fields. This allows you to conduct analyses that combine phenotypic, genotyping array, and whole exome pVCF (or whole genome pVCF) data without having to translate any EIDs.
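For example, once you have the list of sample IDs from a pVCF header (using whichever VCF library you prefer), a simple filter on the "W" prefix removes withdrawn participants before any EID-based joins. The sample IDs below are made up for illustration:

# Sample IDs as they might appear in a pseudonymized pVCF header (illustrative values).
header_samples = ["1234567", "2345678", "W000001", "3456789", "W000002"]

# Withdrawn participants are marked with a "W" prefix; keep only application EIDs.
active_eids = [sample for sample in header_samples if not sample.startswith("W")]
print(active_eids)   # ['1234567', '2345678', '3456789']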
The Research Analysis Platform supports file properties. These are key-value pairs of strings that are attached to files. When bulk files are dispensed to a project, the Platform adds some initial file properties, as below:
These properties are searchable both via the Web UI and CLI. Refer to the following section for an example.
Proteomics data stored in UK Biobank Showcase data-field 30900 has been transformed to enable users to visualize it using the Cohort Browser. The transformation produces the per-instance entities stored in the dispensed database. As a result of this transformation, the data can be visualized, at the individual protein level, on a per-instance basis. For example, each protein for each instance can then be added as a tile within the Cohort Browser, and used as a filter along with other data modalities.
The transformation involves the following steps:
Abbreviated protein names are substituted for protein ID codes - protein ID code "3", for example, becomes "aarsd1":
The file is then split into new files by instance, with each new file containing participant records that share the same value in the ins_index field:
Each instance file is then pivoted by creating a column for each protein, with cells filled in by the value in the result column. This value is the NPX value, for that protein, for the participant in question. For example, for the file containing "instance 0" data:
If a participant doesn’t have NPX data for a given protein - like participant 2345678, for protein "abca2," in the example shown above - the cell in question will not contain a value.
Note the column order in the last illustration, showing the fully transformed table. The eid column is first, followed by the columns for protein data, sorted in alphabetical order, by column name.
Tabular data-fields and linked health data are stored in a SQL database. This database is based on Spark SQL technology, a modern and more scalable technology than that used by classic relational database management systems (RDBMS). The database is located in the root folder of your project and is typically named according to this pattern:
app<APPLICATION-ID>_<CREATION-TIME>
(e.g. app12345_20210101123456)
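If you want to inspect the database directly from a Spark-enabled JupyterLab session, a quick sketch for listing the dispensed database and its tables (the database name below is the illustrative one from the pattern above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List the databases visible to your project, then the tables in the dispensed one.
spark.sql("SHOW DATABASES").show(truncate=False)
spark.sql("SHOW TABLES IN app12345_20210101123456").show(truncate=False)   # illustrative name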
To launch the Cohort Browser, navigate to the project's root folder and click the dataset (or tick the dataset and click Explore Data).
To explore what fields are available in your dataset, click Add Tile. The system will present all available fields, organized in a folder structure inspired by the UK Biobank Showcase. You can search this list by folder name, field name, or field value (for categorical fields).
Click a field to see more information. The Data Field Details pane contains the field title (such as Type of accommodation lived in | Instance 0), and the Link label contains the field name (such as p670_i0). These field names and titles can be used to retrieve data programmatically using JupyterLab.
Using the Cohort Browser features, including the "Export sample IDs" option or the "Download" option in the Data Preview tab, will not lead to any charges.
The Cohort Browser can be used to further explore the data, create charts, or define and compare cohorts. Refer to the following DNAnexus Platform documentation entries:
If your access application has been approved for data-field 23146, 23148 and/or 23157, the Cohort Browser will automatically include a "GENOMICS" section, where you can browse variants in your cohort. The data backing the section depends on the dataset version dispensed: 23157 for version 11 and later, 23148 for version 7 and later, 23146 for previous versions. These variants are sourced from the pVCF files of field #23146, 23148 and/or 23157, after annotating with snpEff GRCh38.92, dbSNP b154 and gnomAD r2.1.1. You can also use these variants to apply genomic filters. Refer to the following DNAnexus Platform documentation entries:
Go in-depth with DNAnexus experts, who'll show you how to get the most out of the Research Analysis Platform.
For tabular data, the update process will make in-place changes to the previously-dispensed database, including any schema changes (such as for new fields) and row updates. A new associated dataset will be generated. The previously-dispensed dataset will remain in the project, along with any previously-saved cohorts or dashboards. Although the previous dataset, cohorts, and dashboards will persist, they will be evaluated against the updated database; therefore cohort counts and field distributions may change. The new dataset will contain updated metadata as well as any new fields; to migrate old cohorts or dashboards to the new dataset to take advantage of updated metadata and new fields, please run the "" app.
Updates to existing tabular data can be done using the app.
For additional questions about data updates, see the list of .
Once data has been dispensed to your project, see the page to learn more about how this data is organized, and how to access and use it.
All data in the UK Biobank resource are organized into data-fields. Your access application is approved for a precise subset of those data-fields.
The provides an in-depth look into the types of data stored in the UK Biobank, how it's collected, and how it's organized. You can find more information about data-fields, broken down by type, on the .
Bulk data-fields are dispensed as files. See below for more.
Tabular data-fields and linked health data are placed into a Spark SQL database and an associated dataset. See below for more.
For a full list of folders, see below.
Key | Value | Which files have this property? |
---|---|---|
for in-depth guidance on searching and analyzing UK Biobank bulk data files.
For details on instance data and the meaning of each instance value, see the .
In the same folder, there is an associated dataset named after the database but with the .dataset
suffix appended. This dataset is a higher-level construct, using technology that is unique to the Research Analysis Platform. It combines the low-level SQL columns with field-level metadata from the UK Biobank Showcase, and presents a collection of rich fields that can be explored visually in the Cohort Browser, or programmatically in JupyterLab. For general information on the underlying technology, see the DNAnexus Platform documentation overview of .
When performing a data update, creating a new project for the same application, or adding data to a Dataset using , a new Dataset is created. To migrate saved cohorts to the new Datasets, use the app. Note that all Fields defined in the cohort must exist in the new Dataset.
If you are used to working with tabular data as a TSV file - a format used by UK Biobank in distributing tabular data directly via its website - see .
Apache Spark is a modern, scalable framework for parallel processing of big data, and can be used to analyze UK Biobank tabular data in JupyterLab.
Field Id | Description |
---|---|
20278 | BEAGLE Phased VCFs Whole genome sequences |
20279 | SHAPEIT Phased VCFs Whole genome sequences |
23146 | Population level exome OQFE variants, pVCF format - interim 300k release |
23148 | Population level exome OQFE variants, pVCF format - interim 450k release |
23156 | Population level exome OQFE variants, pVCF format - interim 200k release |
23157 | Population level exome OQFE variants, pVCF format - 500k release |
23195 | Whole genome GraphTyper joint call pVCF (deprecated) |
23196 | Whole genome GATK joint call pVCF |
23352 | Whole genome GraphTyper joint call pVCF |
23353 | Whole genome GraphTyper SV data |
23354 | GraphTyper WGS 500k SV |
23374 | Population level WGS variants, pVCF format - 500k release |
24068 | Exome variant call files (gnomAD) (VCFs) |
24304 | Population level WGS variants pVCF format - interim 200k release |
24310 | DRAGEN population level WGS variants, pVCF format [500k release] |
eid | The corresponding participant EID | Files that correspond to a single participant. |
field_id | The corresponding data-field id. | All files. |
instance_id | The corresponding instance id (typically a visit to an assessment centre). | Files that correspond to data-fields with instances. |
array_id | The corresponding array index. | Files that correspond to arrayed data-fields. |
resource_id | The corresponding UK Biobank resource id. | Auxiliary files to a resource on the UK Biobank Showcase. |
This document provides a general guide for troubleshooting when you encounter an error running an application (app) from the Tool Library.
If you are having issues with a specific tool, try navigating to the “Troubleshooting” section on the specific tool page. For example, here is the troubleshooting section for JupyterLab.
When running an app on the UKB-RAP you can check the status of this job by navigating to the Monitor tab. If the job you ran failed, you’ll see an entry in the table indicating that the Status is Failed.
Alternatively, if you set your account preferences to receive email notifications*, you will be sent a notification email when the job completes - whether it finished successfully or failed. If the job failed, you will get an email that looks like the following:
* To set your email notifications:
Select My profile from the drop-down menu on the top right
In the Email section, set Job Email Notification to “Always” if you want to receive notifications for all jobs run, or to “Only on Failure” if you only want to receive a notification email when a job fails.
If your job failed, you will want to understand what went wrong. A good place to start is the error messages in the log file, which can show where the app stopped working.
Navigate to the Monitor Tab
Click on the job that failed
Open log file by clicking the View Log button
If there is a job tree, click View Failure Source, which will bring you to the subjob that returned the error.
Then click on its log to view the log file.
The log file is a record of what the app did behind the scenes. The file includes information about the system the app is running on:
There is also information about the steps the app has performed. Below, the log file describes the files that were downloaded onto the worker, as well as the steps performed, such as filtering.
The error message can usually be found at the end of the log file. Here is an example of what an error message looks like:
The error can be found in the last line of the error message. In this case, an input was not provided for the parameter Sample ID file.
Alternatively, the error can be found right before the error message, as seen below. In this case, there is low variance, likely due to not having enough input samples.
Here are some common error messages and tips on what to check:
To learn more about other error types, please visit this article.
After understanding the problem, you can re-launch the same job, making any needed adjustments. To re-launch a job, use the Launch as new job button, which allows you to update the required inputs and re-run. If optional parameters need to be updated, you will need to submit a new job from the app's original execution page.
A helpful resource when troubleshooting is the tool documentation, which can be found on the info page after selecting your tool of choice.
Alternatively, you can view the documentation on the tool’s execution page by selecting the icon in the top right corner.
It's also good practice to document the steps you took, for your own future reference in case you run into the same issue again; this can also help when reporting the issue to DNAnexus support. You can use the dx describe command to get metadata associated with the job, including the instance type used, the runtime, and the input parameters.
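For example, assuming a failed job with ID job-xxxx (shown on the Monitor tab and in the job's URL):

```
# Print the job's metadata, including instance type, runtime, and input parameters
dx describe job-xxxx
```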
If you continue to have issues and are unsure how to proceed, you can email ukbiobank-support@dnanexus.com if you have paid support. Otherwise, you can post on the community forum at community.ukbiobank.ac.uk to get advice and help from peers working on the UKB-RAP.
Custom apps and applets are useful for when you have a custom analysis script or pipeline that you plan to run repeatedly.
If you are just getting started, here is a quick way to submit a custom script to analyze UKB data.
Documentation on how to build an applet can be found here.
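As a rough sketch of that flow from the CLI (the applet name my_analysis and the input name input_file are illustrative, and depend on how you answer the dx-app-wizard prompts):

```
# Scaffold an applet directory interactively, then build and run it in the current project
dx-app-wizard                                         # answer the prompts; creates ./my_analysis/
dx build my_analysis                                  # build the applet from the generated directory
dx run my_analysis -iinput_file=file-xxxx --watch     # run it and stream the job log
```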
For general tips on troubleshooting, see guide.
Learn how to use RStudio Workbench, an interactive R development environment, on the DNAnexus Platform.
The RStudio app runs the commercial edition of RStudio Workbench, an interactive R development environment, on the DNAnexus Platform. Users of this app can analyze, visualize and gain insights into data, and interactively run commands in a cloud based terminal.
From the "Tools" tab in the upper menu, click on “RStudio”. This RStudio Sessions page shows the previously launched RStudio Workbench sessions, allows you to stop a running session, relaunch an ended session or to start a new session by clicking on the "New RStudio" button in the upper right.
When you click on the "New RStudio" button, the RStudio Workbench setup modal will appear. Fill out the optional "Environment Name" field and override the instance type if you need a smaller or larger instance. It is strongly recommended to use the default "High" priority to avoid losing your interactive work due to spot instance interruptions. Then click “Start Environment”.
You will then see your new session appear on the RStudio Session page. This entry’s status will be set to "Initializing".
Once the status of the launched RStudio Workbench session changes to "Ready", click on the Session name or "Open" link next to the status to open the RStudio environment.
You can stop a session from the Sessions page by hovering over the icon of the three dots on the right side of the session row. This icon also allows you to launch a new session with the same settings as a previously ended session (Figure 1).
Clicking on the "i" next to the "New Rstudio" button will display more info for each session (Figure 2).
Inside the working environment, the Terminal tab allows you to download DNAnexus project files to the RStudio environment using dx download, and upload RStudio files to the DNAnexus project with dx upload.
You can also execute commands with root privileges by prefixing them with sudo. For example, to install the wget package, use the following commands in the RStudio Terminal window:
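A minimal sketch of such an installation, assuming the worker's Ubuntu base image and the apt package manager:

```
sudo apt-get update
sudo apt-get install -y wget
```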
Any changes you make to the RStudio environment (e.g. adding files, installing packages, building projects) are limited to the DNAnexus worker on which the current DNAnexus job is running, and thus will not be saved when the job (and hence the worker) is terminated. Always save scripts and any data you want to keep by uploading them to the Platform. See the “Working with Data” section for additional details.
To make data from a DNAnexus project available for processing in RStudio, you need to download the data into the RStudio worker execution environment. In the Terminal tab, run:
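For example (FILE is a placeholder for the file name or ID, as described below):

```
# Download a project file into the current working directory of the RStudio worker
dx download FILE
```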
where FILE is the name or ID of a file in a DNAnexus project. The file will be downloaded to your current working directory. You may download multiple files and whole folders at once; for more information, please check dx download -h. To see a listing of the project files, use dx ls.
If your input files are large and you only need to scan their content once, or to read a small fraction of a file's content, you may consider reading files from the /mnt/project folder. The project in which the app is running is mounted read-only at /mnt/project. Reading the content of files under /mnt/project dynamically fetches the content from the DNAnexus Platform, so this method uses minimal disk space in the RStudio execution environment, but uses more API calls to fetch the content.
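For example, to peek at a file through the mount without downloading it (the path below is purely illustrative):

```
# Read the first few lines of a project file via the read-only project mount
head -n 5 "/mnt/project/my_folder/my_table.csv"
```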
The app runs in a temporary worker execution environment, and any outputs generated in an RStudio session will not persist when the job running the app stops. If you would like to save individual result files to a DNAnexus project, you can upload them from the RStudio Terminal using the dx upload command, for example dx upload FILE, where FILE is the file to be uploaded to the current project. You may upload multiple files and whole folders at once; for more information, please check dx upload -h. The app has VIEW access to all the projects accessible to the launching user and CONTRIBUTE access to the project in which the app is running, which makes it possible to upload files from the RStudio session to the DNAnexus project.
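For instance (the file and folder names are illustrative):

```
# Upload a single result file, then a whole folder (recursively), to the current project
dx upload results.csv
dx upload -r results/
```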
To back up a folder to your project, use the dx-backup-folder command.
For example, to back up the current folder to /.Backups/rstudio_workbench_ukbrap.testuser.2022-03-21T16-32-59.tar.gz, use the following command:
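A minimal sketch, assuming the default behaviour of backing up the current folder to the default destination described below (see dx-backup-folder -h for the available options):

```
# Back up the current working folder to the default .Backups/ location in the project
dx-backup-folder
```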
The name of the backup file in the current DNAnexus project defaults to .Backups/<rstudio_workbench_ukbrap>.<dnanexus-username>.tar.gz
To back up the workspace folder to the /.Backups/workspace1.tar.gz platform file, use:
To back up the current folder, excluding the R subfolder and any .RData files in any subfolder, to a platform file at /small_backup/rstudio_workbench_ukbrap.testuser.2022-03-14T20-47-03.tar.gz, use:
The optional --exclude and --exclude-from arguments work the same way as in the GNU tar command, to exclude the specified files and directories from the backup.
To restore the content of a previously created backup file into the current folder, without overwriting any local files:
You can also overwrite local files with restored files by specifying the optional --overwrite command line flag, and specify a local folder to restore the backup into using the --output argument. For more information, please check dx-restore-folder -h.
To terminate the session, select the “Terminate” button on the upper right side of the RStudio Environment header.
Your RStudio sessions can be viewed by selecting the “RStudio” option from the global Tools menu. You can also end a session from this page by clicking the three vertical dots next to the “Open” option of a session that is currently running and selecting “End environment”.
Note that closing the browser does not stop the app. A running app will continue accruing compute charges until it is terminated. As long as the job is running, you can return to the RStudio environment by loading the job-xxxx.dnanexus.cloud URL mentioned above.
Example error message | What to check |
---|---|
You can export selected phenotypic fields for your UKB study into a TSV or CSV file using the Table Exporter app, as described . You can then use dx download to fetch the CSV file into the RStudio worker execution environment and read it into R with the read.csv() command.
Error: Failed to open <filename> : No such file or directory
Check that you provided the correct file path as input - there are no typos in your file path or file name
Check if you need to use a mounted (/mnt/project/) file path
Out of memory: Killed process 51323
You need to update your instance type selection to make sure you have enough memory (see the “Memory (GiB)” column in the rate card).
Could not find input
Could not find index of <file name>
Make sure that your input file is specified in the inputSpec section of the dxapp.json file.
Also make sure that you add these lines to the echo and dx download sections in code.sh.
Timeout exceeded
job timeout 48.000h has been exceeded
If the analysis/process needs a larger number of cores or requires a longer processing time, there are a couple of options:
At the time of execution, you can extend the timeout policy or specify the instance type (see example for extending the timeout for the REGENIE app):
dx run app-regenie --instance-type mem2_ssd1_x2 --extra-args '{
"timeoutPolicyByExecutable": {
"app-MAIN_REGENIE_APP_ID": {
"*": { "hours": 12 }
},
"app-STEP1_APP_ID": {
"*": {"hours": 200}
},
"app-STEP2_APP_ID": {
"*": {"hours": 200 }}}}'
Alternatively, you can set the timeout policy and instance type in the dxapp.json file when you build the applet:
"timeoutPolicy": {
"main": {
"hours": 72 },
…
"systemRequirements": {"*": {"instanceType": "mem2_ssd1_x2"}}
More about dxapp.json parameters can be found here.
Learn about the particularities of the Research Analysis Platform, and where to find detailed guidance on using its features.
The UK Biobank Research Analysis Platform is built on the DNAnexus Platform, and using the two is largely the same experience. See the DNAnexus Platform documentation for detailed information on Platform features and how to use them. Of particular note:
Key Concepts - Learn about projects, organizations, apps, and workflows, and how to create and use each.
User Interface Quickstart - Learn to access and use Platform features via the web user interface (UI).
Command Line Quickstart - Learn to access and use Platform features via the command-line interface (CLI), using the dx command-line client, available for download as part of the DNAnexus SDK.
Cohort Browser - Learn to explore related genomic and phenotypic datasets, and easily create subject cohorts for analysis.
Introduction to Building Apps - Learn to build and deploy custom analysis applications.
Introduction to Building Workflows - Learn to build and deploy custom workflows for processing and analyzing data.
Uploading and Sharing Files - Learn to upload files to use in analyses, via either the dx client or, for multiple or large files, the Upload Agent. See also this step-by-step walkthrough on uploading data to a project, then sharing the project with other project members. Note that to access a project on the Research Analysis Platform, a user needs to be listed on the linked access application.
Running Apps and Workflows - Learn to configure and run apps and workflows, monitor progress, and view results, both from the UI and the CLI.
Platform API - Learn to access Platform features and capabilities programmatically.
JupyterLab - Learn to use Jupyter notebooks to create, hone, and run sophisticated custom analyses, in your preferred programming language.
HAIL with JupyterLab - Learn some of the capabilities of HAIL with these example notebooks and basic guidelines.
Stata - Learn how to access and use Stata, a popular statistical software package for data science.
Integrated Genomics Viewer - Learn to use the Integrated Genomics Viewer (IGV) with genomic data files stored on the Platform.
The first time you access the Research Analysis Platform, the system will create a "wallet" (i.e. a "personal billing account") for you, funded in the amount of £40, courtesy of DNAnexus. Note that this wallet is separate from, and unrelated to, the "Trial Subscription" mentioned in section 2.8 of the DNAnexus Terms of Service.
This section describes functionality and limitations specific to the Research Analysis Platform.
Note that as a rule, Research Analysis Platform access and sharing restrictions are tied to UK Biobank access applications. All Research Analysis Platform projects are linked to a UK Biobank-approved access application. To access a project, a user must be named as a Principal Investigator or collaborator on the linked access application.
For more detail on UK Biobank access applications, see Creating a Project.
On the Research Analysis Platform, projects can only be created via the UI. A project cannot be created via the CLI.
You can collaborate on the DNAnexus Platform by giving project access to other users. For details, refer to the Platform documentation sections covering Project Sharing and Project Access.
On the Research Analysis Platform, you can only share a project with another user if they are named on the linked UK Biobank access application and have Research Analysis Platform access.
On the DNAnexus Platform, you can easily copy files from one project to another.
On the Research Analysis Platform, you can only copy files from one project to another if both projects are linked to the same UK Biobank access application.
To enable shared billing for a group of users, follow these instructions.
When accessing the Research Analysis Platform, note the following default settings:
For logging in via the command line (with 'dx login'): auth.dnanexus.com:443
For interacting with the Platform API to perform uploads: api.dnanexus.com:443
For the actual data transfer that the client will perform to S3: dnanexus-eu-west-2-platform-upload-ukb-prod.s3-eu-west-2.amazonaws.com:443
For connecting to the Platform Thrift server, follow these instructions, substituting this JDBC URL: jdbc:hive2://query.eu-west-2.apollo.dnanexus.com:10000/\;ssl=true
On the DNAnexus Platform, you can easily transfer project ownership to another user.
On the Research Analysis Platform, project ownership can only be transferred to a user who is listed on the UK Biobank access application linked to the project.
Participants are free to withdraw from UK Biobank at any time and request that their data no longer be used.
If a participant’s withdrawal affects your dispensed data, you will receive an email from UK Biobank containing the anonymized IDs of these participants and any others who have withdrawn previously. It is possible that this list will contain IDs which you have never received as they may have withdrawn before your datasets were generated.
For any data, files, or objects generated on the Platform through usage, you are responsible for removing the records corresponding to all withdrawn participants from further analyses. Any files you download or manipulate (e.g. a results file from a GWAS, or a CSV file created using Table Exporter) will not be changed by the withdrawal of participant data.
The Research Analysis Platform is hosted in the AWS "aws:eu-west-2" or "Europe (London)" region.
Issue with input | | Check that the input file path you specified is correct |
Error fitting null model (Step 1) | | In SAIGE GWAS GRM app, set |
SPAGMMAT test error (Step 2) | | Specify "--chr" under the input "Extra Options". |
Get full details on data included in each UK Biobank Research Analysis Platform data release.
The Research Analysis Platform holds a copy of all UK Biobank data. All projects are created using this copy.
As UK Biobank updates the data on its end, the copy held by the Research Analysis Platform is periodically updated to reflect these upstream updates. Whenever this happens, this change will be indicated by a new data release version.
Each Showcase release includes both newly released data, and all Showcase data included in previous releases. So the June 2021 Showcase release, for example, includes all data released in all previous Showcase releases, as detailed on the UK Biobank website.
The following table lists all the bulk fields, along with their folders and suffixes, as of the releases (v17.1, v18.1).
This list follows the same order in which the subfolders appear in projects on the Platform.
These short tutorial videos provide an in-depth guide to accessing and using the Research Analysis Platform via the command-line interface (CLI).
In many cases the Platform allows you to use apps and workflows you’ve developed in another environment. This page provides a rundown on how to import apps or workflows of a wide variety of types.
To facilitate reproducibility in working with data, the DNAnexus Platform supports the use of WDL and Nextflow workflows.
Nextflow workflows can be run from the command line in a Cloud Workstation session, or in a Jupyter notebook.
See the following .
Field ID | Data Type | Subfolder | Suffixes |
---|---|---|---|
For information on how to use HAIL with Jupyterlab, see example notebooks .
For general tips on troubleshooting, see .
Issue | Example error message | What to do |
---|
If you haven’t already, download and install the dx-toolkit, and familiarize yourself with basic dx commands. When working from the CLI, you’ll use the dx-toolkit to perform various tasks in the course of importing, configuring, and running apps and workflows.
Your apps and workflows must have access to any files they need as inputs. If any input file is not already in the project in which you intend to use a particular app or workflow, upload it to that project.
The Platform supports executing Docker images. Using a Docker image to package and run your app ensures your app can access, and you have full control over, dependencies, configuration settings, and all needed files.
If your app can run on 64-bit Ubuntu Linux, you can package it to run on the Platform. Follow the guidance in the DNAnexus Platform documentation to import, configure, test, and run your app. You’ll use the dx-app-wizard to set up the proper directory structure to run your app on the Platform.
The “” section of the DNAnexus Platform documentation provides further guidance on how to configure your app, and how to upload and ensure it has access to dependencies and input files.
You can run your R/Shiny app on the Platform by wrapping it as a special DNAnexus web app, then accessing it using a web browser.
The Stata app itself does not run on the Platform. If you have a Stata license, you can use stata_kernel to access Stata commands and functionality within a Jupyter notebook running on the Platform.
To SSH into your Platform project and then run a script on the command line, use the Cloud Workstation app.
There are third-party tools available that can compile a C++ app to run on 64-bit Ubuntu Linux. Once you’ve compiled it, you can wrap the newly created Linux app as a Platform app or applet, following the guidance above.
Converting scripts to WDL workflows ensures reproducibility and allows your workflows to be easily published and shared. Use dxCompiler to compile WDL workflows so that they can be run on the Platform.
Following the approach described above, you can configure and package your Bash script, along with its dependencies and inputs, then run it on the Platform.
Bash scripts can also be run on the Platform using the Cloud Workstation app, which provides command-line access via SSH, or in a JupyterLab session. To ensure reproducibility of results and make your workflow more easily shareable, consider packaging your script as an app or applet before uploading it to the Platform.
Field ID
Data Type
Subfolder
Suffixes
Acceleration intensity time-series
/Bulk/Activity/Epoch/
.csv
Acceleration data - cwa format
/Bulk/Activity/Raw/
.cwa
Arterial spin labelling brain images - DICOM
/Bulk/Brain MRI/ASL/
.zip
Multiband diffusion brain images - DICOM
/Bulk/Brain MRI/dMRI/
.zip
Multiband diffusion brain images - NIFTI
/Bulk/Brain MRI/dMRI/
.zip
Functional brain images - resting - DICOM
/Bulk/Brain MRI/rfMRI/
.zip
Functional brain images - resting - NIFTI
/Bulk/Brain MRI/rfMRI/
.zip
rfMRI full correlation matrix, dimension 25
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI full correlation matrix, dimension 100
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI partial correlation matrix, dimension 25
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI partial correlation matrix, dimension 100
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI component amplitudes, dimension 25
/Bulk/Brain MRI/rfMRI/
.txt
rfMRI component amplitudes, dimension 100
/Bulk/Brain MRI/rfMRI/
.txt
Scout images for brain scans - DICOM
/Bulk/Brain MRI/Scout/
.zip
Phoenix - DICOM
/Bulk/Brain MRI/Scout/
.zip
Susceptibility weighted brain images - DICOM
/Bulk/Brain MRI/SWI/
.zip
Susceptibility weighted brain images - NIFTI
/Bulk/Brain MRI/SWI/
.zip
T1 structural brain images - DICOM
/Bulk/Brain MRI/T1/
.zip
T1 structural brain images - NIFTI
/Bulk/Brain MRI/T1/
.zip
T1 surface model files and additional structural segmentations
/Bulk/Brain MRI/T1/
.zip
T2 FLAIR structural brain images - DICOM
/Bulk/Brain MRI/T2 FLAIR/
.zip
T2/PD brain images - DICOM
/Bulk/Brain MRI/T2 FLAIR/
.zip
T2 FLAIR structural brain images - NIFTI
/Bulk/Brain MRI/T2 FLAIR/
.zip
Functional brain images - task - DICOM
/Bulk/Brain MRI/tfMRI/
.zip
Functional brain images - task - NIFTI
/Bulk/Brain MRI/tfMRI/
.zip
Eprime advisor file
/Bulk/Brain MRI/tfMRI/
.adv
Eprime txt file
/Bulk/Brain MRI/tfMRI/
.txt
Eprime ed2 file
/Bulk/Brain MRI/tfMRI/
.ed2
Cardiac monitoring phase 1 - Acceleration
/Bulk/Cardiac monitoring phase 1/Acceleration/
.zacl
Cardiac monitoring phase 1 - ECG trace
/Bulk/Cardiac monitoring phase 1/ECG trace/
.zip
Cardiac monitoring phase 1 - Impedance
/Bulk/Cardiac monitoring phase 1/Impedance/
.zimp
Cardiac monitoring phase 2 - Episodic data for specific arrhythmia
/Bulk/Cardiac monitoring phase 2/Episodic data for specific arrhythmia (result file xml)/
.xml
Cardiac monitoring phase 2 - Hourly summary and QT
/Bulk/Cardiac monitoring phase 2/Hourly summary and QT (result file xml)/
.report
Cardiac monitoring phase 2 - Human-readable analysis report
/Bulk/Cardiac monitoring phase 2/Human-readable analysis report (result file pdf)/
Cardiac monitoring phase 2 - Raw ECG data from monitor
/Bulk/Cardiac monitoring phase 2/Raw ECG data from monitor (result file edf)/
.edf
Cardiac monitoring phase 2 - Summary analysis data
/Bulk/Cardiac monitoring phase 2/Summary analysis data (result file csv)/
.csv
Carotid artery ultrasound image (left)
/Bulk/Carotid Ultrasound/Carotid artery (left)/
.zip
Carotid artery ultrasound image (right)
/Bulk/Carotid Ultrasound/Carotid artery (right)/
.zip
Raw carotid device data
/Bulk/Carotid Ultrasound/Raw data/
.zip
Carotid artery ultrasound report
/Bulk/Carotid Ultrasound/Report/
.zip
Whole genome CRAM files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CRAM files (DRAGEN) [200k]/
dragen.cram
Whole genome CRAM indices (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CRAM indices (DRAGEN) [200k]/
dragen.cram.crai
Whole genome supplementary files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome supplementary files (DRAGEN) [200k]/
dragen.sample-level-supplementary.zip
Whole genome variant call files (GVCFs) (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome variant call files (GVCFs) (DRAGEN) [200k]/
dragen.hard-filtered.gvcf.gz
Whole genome variant call files (GVCFs) (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome variant call files (GVCFs) (DRAGEN) [200k]/
dragen.hard-filtered.gvcf.gz.tbi
Whole genome variant call files (GVCFs) (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome variant call files (GVCFs) (DRAGEN) [200k]/
dragen.hard-filtered.vcf.gz
Whole genome variant call files (VCFs) (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [200k]/
dragen.hard-filtered.vcf.gz.tbi
Whole genome diagnostics files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome diagnostics files (DRAGEN) [200k]/
dragen.diagnostics.zip
Whole genome CNV call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CNV call files (DRAGEN) [200k]/
dragen.cnv.vcf.gz
Whole genome CNV call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CNV call files (DRAGEN) [200k]/
dragen.cnv.vcf.gz.tbi
Whole genome CNV supplementary files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CNV supplementary files (DRAGEN) [200k]/
dragen.cnv-supplementary.zip
Whole genome SV call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome SV call files (DRAGEN) [200k]/
dragen.sv.vcf.gz
Whole genome SV call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome SV call files (DRAGEN) [200k]/
dragen.sv.vcf.gz.tbi
Whole genome SV supplementary files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome SV supplementary files (DRAGEN) [200k]/
dragen.sv-supplementary.zip
Whole genome STR call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome STR call files (DRAGEN) [200k]/
dragen.repeats.vcf.gz
Whole genome STR call files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome STR call files (DRAGEN) [200k]/
dragen.repeats.vcf.gz.tbi
Whole genome STR supplementary files (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome STR supplementary files (DRAGEN) [200k]/
dragen.str-supplementary.zip
Whole genome CYP2D6 genotype calls (DRAGEN) [200k]
/Bulk/DRAGEN WGS/Whole genome CYP2D6 genotype calls (DRAGEN) [200k]/
dragen.cyp2d6.tsv
Fitness test results, including ECG data
/Bulk/Electrocardiogram/Fitness/
.xml
ECG datasets
/Bulk/Electrocardiogram/Resting/
.xml
Exome OQFE CRAM files
/Bulk/Exome sequences/Exome OQFE CRAM files/
.cram
Exome OQFE CRAM indices
/Bulk/Exome sequences/Exome OQFE CRAM files/
.cram.crai
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (DRAGEN) (VCFs)/
.vcf.gz
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (DRAGEN) (VCFs)/
vcf.gz.tbi
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (gnomAD) (VCFs)/
.ukb24068_c10_b0_v1.vcf.gz, ukb24068_c10_b1_v1.vcf.gz...
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (gnomAD) (VCFs)/helper_files/
.Broad_455k_exome_gnomAD_QC_summary.md
Exome sequences
/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (gnomAD) (VCFs)/
.ukb24068_c10_b0_v1.vcf.gz.tbi,ukb24068_c10_b1_v1.vcf.gz.tbi
Exome OQFE CRAM files
/Bulk/Exome sequences_Previous exome releases/Exome OQFE CRAM files - interim 200k release/
.cram
Exome OQFE CRAM indices
/Bulk/Exome sequences_Previous exome releases/Exome OQFE CRAM files - interim 200k release/
.cram.crai
Exome OQFE variant call files (VCFs)
/Bulk/Exome sequences/Exome OQFE variant call files (VCFs)/
.g.vcf.gz
Exome OQFE variant call file (VCF) indices
/Bulk/Exome sequences/Exome OQFE variant call files (VCFs)/
.g.vcf.gz.tbi
Exome OQFE variant call files (VCFs)
/Bulk/Exome sequences_Previous exome releases/Exome OQFE variant call files (VCFs) - interim 200k release/
.g.vcf.gz
Exome OQFE variant call file (VCF) indices
/Bulk/Exome sequences_Previous exome releases/Exome OQFE variant call files (VCFs) - interim 200k release/
.g.vcf.gz.tbi
Population level exome OQFE variants, BGEN format - interim 300k release
/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, BGEN format - interim 300k release/
.bgen, .bgi, .sample
Population level exome OQFE variants, BGEN format - interim 450k release
/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - interim 450k release/
.bgen,
.bgi,
.sample
Population level exome OQFE variants, BGEN format - 500k release
/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - 500k release/
ukb23159_c1_b0_v1.bgen, ukb23159_c1_b0_v1.bgen.bgi, ukb23159_c1_b0_v1.sample
Population level exome OQFE variants, PLINK format
/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, PLINK format - interim 200k release/
.bed, .bim, .fam
Population level exome OQFE variants, PLINK format - interim 300k release
/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, PLINK format - interim 300k release/
.bed, .bim, .fam, .masks, .txt, .txt.gz
Population level exome OQFE variants, PLINK format - interim 450k release
/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - interim 450k release/
.bed, .bim, .fam, .masks, .txt, .txt.gz
Population level exome OQFE variants, PLINK format - 500k release
/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - 500k release/
ukb23158_c1_b0_v1.bed, ukb23158_c1_b0_v1.bim, ukb23158_c1_b0_v1.fam, ...
Population level exome OQFE variants, PLINK format - 500k release helper files
/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - 500k release/helper_files/
ukb23158_500k_OQFE.sets.txt.gz, ukb23158_500k_OQFE.masks, ukb23158_500k_OQFE.annotations.txt.gz, ukb23158_500k_OQFE.90pct10dp_qc_variants.txt, ukb23158_500k_OQFE.variant_ID_mappings.txt
Population level exome OQFE variants, pVCF format - 500k release
/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - 500k release/
ukb23158_500k_OQFE.sets.txt.gz, ukb23158_500k_OQFE.masks, ukb23158_500k_OQFE.annotations.txt.gz, ukb23158_500k_OQFE.90pct10dp_qc_variants.txt, ukb23158_500k_OQFE.variant_ID_mappings.txt
Population level exome OQFE variants, pVCF format
/Bulk/Previous exome releases/Population level exome OQFE variants, pVCF format - interim 200k release/
.vcf.gz, .vcf.gz.tbi
Population level exome OQFE variants, pVCF format - interim 300k release
/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, pVCF format - interim 300k release/
.vcf.gz, .vcf.gz.tbi
Population level exome OQFE variants, pVCF format - interim 450k release
/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - interim 450k release/
.vcf.gz, .vcf.gz.tbi
CEL files
/Bulk/Genotype Results/Genotype CEL files/
.cel
Genotype calls
/Bulk/Genotype Results/Genotype calls/
.bed, .bim, .dat, .fam, .txt
Genotype calls
/Bulk/Genotype Results/Genotype calls/posteriors/
.batch, .bim, .bin
Genotype confidences
/Bulk/Genotype Results/Genotype confidences/
.txt
Genotype copy number variants, log2ratios
/Bulk/Genotype Results/Genotype copy number variants, log2ratios
.txt
Genotype copy number variants B-allele frequencies
/Bulk/Genotype Results/Genotype copy number variants B-allele frequencies/
.txt
Genotype intensities
/Bulk/Genotype Results/Genotype intensities/
.bin
Aortic distensibilty images - DICOM
/Bulk/Heart MRI/Aortic distensibility/
.zip
Blood flow images - DICOM
/Bulk/Heart MRI/Blood flow/
.zip
Cine tagging images - DICOM
/Bulk/Heart MRI/CINE tagging/
.zip
Left ventricular outflow tract images - DICOM
/Bulk/Heart MRI/Left ventricular outflow tract/
.zip
Long axis heart images - DICOM
/Bulk/Heart MRI/Long axis/
.zip
Scout images for heart MRI - DICOM
/Bulk/Heart MRI/Scout/
.zip
Experimental shMOLLI sequence images - DICOM
/Bulk/Heart MRI/ShMOLLI/
.zip
Short axis heart images - DICOM
/Bulk/Heart MRI/Short axis/
.zip
Haplotypes (WTCHG)
/Bulk/Imputation/Haplotypes/
.bgen, .bgi
21008
Imputation from genotype (GEL)
/Bulk/Imputation/Imputation from genotype (GEL)/
ukb21008_c1_b0_v1.bgen, ukb21008_c1_b0_v1.bgen.bgi, ukb21008_c1_b0_v1.sample, ...
21008
Imputation from genotype (GEL) helper files
/Bulk/Imputation/Imputation from genotype (GEL)/helper_files/
caution_sites.tsv.gz, ukb21008_c1_b0_v1.bgen.pos, ...
21007
Imputation from genotype (TOPmed)
/Bulk/Imputation/Imputation from genotype (TOPmed)/
ukb21007_c1_b0_v1.bgen, ukb21007_c1_b0_v1.bgen.bgi, ukb21007_c1_b0_v1.sample, ...
21007
Imputation from genotype (TOPmed) helper files
/Bulk/Imputation/Imputation from genotype (TOPmed)/helper_files
ukb21007_c1_b0_v1.sites.vcf.gz, ukb21007_c1_b0_v1.sites.vcf.gz.csi, ...
Imputation from genotype (WTCHG)
/Bulk/Imputation/UKB imputation from genotype/
.bgen, .bgi, .sample, .txt
Kidney Imaging - gradient echo - DICOM
/Bulk/Kidney MRI/Gradient echo/
.zip
Kidney Imaging - T1 ShMOLLI - DICOM
/Bulk/Kidney MRI/ShMOLLI/
.zip
Kidney Imaging - T2 haste - DICOM
/Bulk/Kidney MRI/T2 HASTE/
.zip
Kidney imaging - T2 Vibe - DICOM
/Bulk/Kidney MRI/T2 VIBE/
.zip
Liver images - gradient echo - DICOM
/Bulk/Liver MRI/Gradient echo/
.zip
Liver images - IDEAL protocol - DICOM
/Bulk/Liver MRI/IDEAL/
.zip
Liver Imaging - T1 ShMoLLI - DICOM
/Bulk/Liver MRI/ShMOLLI/
.zip
Pancreas Images - gradient echo - DICOM
/Bulk/Pancreas MRI/Gradient echo/
.zip
Pancreatic fat - DICOM
/Bulk/Pancreas MRI/Pancreatic fat/
.zip
Measurements of pancreas volume - DICOM
/Bulk/Pancreas MRI/Pancreatic volume/
.zip
Pancreas Images - ShMoLLI - DICOM
/Bulk/Pancreas MRI/ShMOLLI/
.zip
Protein biomarkers - Olink helper files
/Bulk/Protein biomarkers/Olink/helper_files/
.pdf, .dat
FDA data file (left)
/Bulk/Retinal Optical Coherence Tomography/FDA (left)/
.fda
FDA data file (right)
/Bulk/Retinal Optical Coherence Tomography/FDA (right)/
.fda
FDS data file (left)
/Bulk/Retinal Optical Coherence Tomography/FDS (left)/
.fds
FDS data file (right)
/Bulk/Retinal Optical Coherence Tomography/FDS (right)/
.fds
Fundus retinal eye image (left)
/Bulk/Retinal Optical Coherence Tomography/Fundus (left)/
.png
Fundus retinal eye image (right)
/Bulk/Retinal Optical Coherence Tomography/Fundus (right)/
.png
OCT image slices (left)
/Bulk/Retinal Optical Coherence Tomography/Slices (left)/
.zip
OCT image slices (right)
/Bulk/Retinal Optical Coherence Tomography/Slices (right)/
.zip
DXA images
/Bulk/Whole Body DXA/DXA/
.zip
Dixon technique for internal fat - DICOM
/Bulk/Whole Body MRI/Dixon/
.zip
BGI WGS CRAM files
/Bulk/Whole genome sequences/BGI WGS CRAM files/
.cram
BGI WGS CRAM indices
/Bulk/Whole genome sequences/BGI WGS CRAM files/
.cram.crai
BQSR - GATK BaseRecalibrator
/Bulk/Whole genome sequences/BQSR - GATK BaseRecalibrator/
.recal_table
Broad WGS CRAM files
/Bulk/Whole genome sequences/Broad WGS CRAM files/
.cram
Broad WGS CRAM indices
/Bulk/Whole genome sequences/Broad WGS CRAM files/
.cram.crai
Concatenated QC Metrics
/Bulk/Whole genome sequences/Concatenated QC Metrics/
.qaqc_metrics
Genotype Concordance - Contingency Metrics
/Bulk/Whole genome sequences/Genotype Concordance - Contingency Metrics/
.genotype_concordance_contingency_metrics
23319
Genotype Concordance - Detail Metrics
/Bulk/Whole genome sequences/Genotype Concordance - Detail Metrics/
.genotype_concordance_detail_metrics
Genotype Concordance - Summary Metrics (Picard)
/Bulk/Whole genome sequences/Genotype Concordance - Summary Metrics (Picard)/
.genotype_concordance_summary_metrics
Genotype Concordance
/Bulk/Whole genome sequences/Genotype Concordance/
.nrd.stats
Manta-called scored structural variant and indel candidates
/Bulk/Whole genome sequences/Manta-called scored structural variant and indel candidates/
.diploidSV.vcf.gz, .diploidSV.vcf.gz.tbi
Manta-called unscored structural variant and indel candidates
/Bulk/Whole genome sequences/Manta-called unscored structural variant and indel candidates/
.candidateSV.vcf.gz, .candidateSV.vcf.gz.tbi
Microsatellite data generated using a specially optimized tool (popSTR) for microsatellite (STR) marker calling in whole-genome sequencing studies.
/Bulk/Whole genome sequences/Microsatellites - 150k release/
ukb23365_c10_b0_v1.vcf.gz.tbi,
ukb23365_c10_b1_v1.vcf.gz.tbi, ukb23365_c10_b2_v1.vcf.gz.tbi, ukb23365_c9_b2406_v1.vcf.gz.tbi, ukb23365_c9_b2407_v1.vcf.gz.tbi, ukb23365_c9_b2408_v1.vcf.gz.tbi
Population level genome variants, BGEN format - interim 200k release
/Bulk/Whole genome sequences/Population level genome variants, BGEN format - interim 200k release/
ukb24306_c10_b0_v1.bgen, ukb24306_c11_b0_v1.bgen 'ukb24306_c12_b0_v1.bgen, ukb24306_c13_b0_v1.bgen, ukb24306_c14_b0_v1.bgen, ukb24306_c15_b0_v1.bgen, ukb24306_c16_b0_v1.bgen, ukb24306_c17_b0_v1.bgen, ukb24306_c18_b0_v1.bgen, ukb24306_c19_b0_v1.bgen, ukb24306_c1_b0_v1.bgen, ukb24306_c20_b0_v1.bgen, ukb24306_c21_b0_v1.bgen, ukb24306_c22_b0_v1.bgen, ukb24306_c2_b0_v1.bgen, ukb24306_c3_b0_v1.bgen, ukb24306_c4_b0_v1.bgen, ukb24306_c5_b0_v1.bgen, ukb24306_c6_b0_v1.bgen, ukb24306_c7_b0_v1.bgen, ukb24306_c8_b0_v1.bgen, ukb24306_c9_b0_v1.bgen, ukb24306_cX_b0_v1.bgen, ukb24306_c10_b0_v1.sample, ukb24306_c11_b0_v1.sample, ukb24306_c12_b0_v1.sample, ukb24306_c13_b0_v1.sample, ukb24306_c14_b0_v1.sample, ukb24306_c15_b0_v1.sample, ukb24306_c16_b0_v1.sample, ukb24306_c17_b0_v1.sample, ukb24306_c18_b0_v1.sample, ukb24306_c19_b0_v1.sample, ukb24306_c1_b0_v1.sample, ukb24306_c20_b0_v1.sample, ukb24306_c21_b0_v1.sample, ukb24306_c22_b0_v1.sample, ukb24306_c2_b0_v1.sample, ukb24306_c3_b0_v1.sample, ukb24306_c4_b0_v1.sample, ukb24306_c5_b0_v1.sample, ukb24306_c6_b0_v1.sample, ukb24306_c7_b0_v1.sample, ukb24306_c8_b0_v1.sample, ukb24306_c9_b0_v1.sample, ukb24306_cX_b0_v1.sample
Population level WGS variants, PLINK format - interim 200k release
/Bulk/Whole genome sequences/Population level WGS variants, PLINK format - interim 200k release/
ukb24305_c10_b0_v1.bed, ukb24305_c11_b0_v1.bed, ukb24305_c12_b0_v1.bed, ukb24305_c13_b0_v1.bed, ukb24305_c14_b0_v1.bed, ukb24305_c15_b0_v1.bed, ukb24305_c16_b0_v1.bed, ukb24305_c17_b0_v1.bed, ukb24305_c18_b0_v1.bed, ukb24305_c19_b0_v1.bed, ukb24305_c1_b0_v1.bed, ukb24305_c20_b0_v1.bed, ukb24305_c21_b0_v1.bed, ukb24305_c22_b0_v1.bed, ukb24305_c2_b0_v1.bed, ukb24305_c3_b0_v1.bed, ukb24305_c4_b0_v1.bed, ukb24305_c5_b0_v1.bed, ukb24305_c6_b0_v1.bed, ukb24305_c7_b0_v1.bed, ukb24305_c8_b0_v1.bed, ukb24305_c9_b0_v1.bed, ukb24305_cX_b0_v1.bed, ukb24305_c10_b0_v1.bim, ukb24305_c11_b0_v1.bim, ukb24305_c12_b0_v1.bim, ukb24305_c13_b0_v1.bim, ukb24305_c14_b0_v1.bim, ukb24305_c15_b0_v1.bim, ukb24305_c16_b0_v1.bim, ukb24305_c17_b0_v1.bim, ukb24305_c18_b0_v1.bim, ukb24305_c19_b0_v1.bim, ukb24305_c1_b0_v1.bim, ukb24305_c20_b0_v1.bim, ukb24305_c21_b0_v1.bim, ukb24305_c22_b0_v1.bim, ukb24305_c2_b0_v1.bim, ukb24305_c3_b0_v1.bim, ukb24305_c4_b0_v1.bim, ukb24305_c5_b0_v1.bim, ukb24305_c6_b0_v1.bim, ukb24305_c7_b0_v1.bim, ukb24305_c8_b0_v1.bim, ukb24305_c9_b0_v1.bim, ukb24305_cX_b0_v1.bim, ukb24305_c10_b0_v1.fam, ukb24305_c11_b0_v1.fam, ukb24305_c12_b0_v1.fam, ukb24305_c13_b0_v1.fam, ukb24305_c14_b0_v1.fam, ukb24305_c15_b0_v1.fam, ukb24305_c16_b0_v1.fam, ukb24305_c17_b0_v1.fam, ukb24305_c18_b0_v1.fam, ukb24305_c19_b0_v1.fam, ukb24305_c1_b0_v1.fam, ukb24305_c20_b0_v1.fam, ukb24305_c21_b0_v1.fam, ukb24305_c22_b0_v1.fam, ukb24305_c2_b0_v1.fam, ukb24305_c3_b0_v1.fam, ukb24305_c4_b0_v1.fam, ukb24305_c5_b0_v1.fam, ukb24305_c6_b0_v1.fam, ukb24305_c7_b0_v1.fam, ukb24305_c8_b0_v1.fam, ukb24305_c9_b0_v1.fam, ukb24305_cX_b0_v1.fam
Population level WGS variants, pVCF format - interim 200k release
/Bulk/Whole genome sequences/Population level WGS variants, pVCF format - interim 200k release/
ukb24304_c10_b0_v1.vcf.gz.tbi, ukb24304_c10_b1_v1.vcf.gz.tbi, ukb24304_c10_b2_v1.vcf.gz.tbi, ukb24304_cX_b3118_v1.vcf.gz.tbi, ukb24304_cX_b3119_v1.vcf.gz.tbi, ukb24304_cX_b3120_v1.vcf.gz.tbi
Population level WGS variants, pVCF format - 500k release
/Bulk/Whole genome sequences/Population level WGS variants, pVCF format - 500k release/
ukb23374_c10_b0_v1.vcf.gz.tbi, ukb23374_c10_b1_v1.vcf.gz.tbi, ukb23374_c10_b2_v1.vcf.gz.tbi, ukb23374_cX_b7800_v1.vcf.gz.tbi, ukb23374_cX_b7801_v1.vcf.gz.tbi, ukb23374_cX_b7802_v1.vcf.gz.tbi
Sample Contamination (ReadHaps)
/Bulk/Whole genome sequences/Sample Contamination (ReadHaps)/
.contamination
Sample Contamination (verifyBamID) - depthSM
/Bulk/Whole genome sequences/Sample Contamination (verifyBamID) - depthSM/
.verifyBamID.depthSM
Sample Contamination (verifyBamID) - selfSM
/Bulk/Whole genome sequences/Sample Contamination (verifyBamID) - selfSM/
.verifyBamID.selfSM
Whole genome CRAM files
/Bulk/Whole genome sequences/Whole genome CRAM files/
.cram
Whole genome CRAM files (reserved)
/Bulk/Whole genome sequences/Whole genome CRAM files (reserved)/
.cram
Whole genome CRAM indices (reserved)
/Bulk/Whole genome sequences/Whole genome CRAM files (reserved)/
.cram.crai
Whole genome CRAM indices
/Bulk/Whole genome sequences/Whole genome CRAM files/
.cram.crai
Whole genome GATK joint call pVCF
/Bulk/Whole genome sequences/Whole genome GATK joint call pVCF/
.vcf.gz, .vcf.gz.tbi, qc_metrics_gatk_variant_qc.tab.gz, qc_metrics_gatk_variant_qc.tab.gz.tbi, qc_metrics_GATK_version.txt, qc_metrics_README.pdf
Whole genome GraphTyper joint call pVCF (deprecated)
/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF (deprecated)/
.vcf.gz, .vcf.gz.tbi,
qc_metrics_graphtyper_variant_qc.tab.gz,
qc_metrics_graphtyper_variant_qc.tab.gz.tbi,
qc_metrics_README.pdf
Whole genome GraphTyper joint call pVCF
/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/
.vcf.gz, .vcf.gz.tbi,
qc_metrics_graphtyper_v2.7.1_qc.tab.gz,
qc_metrics_graphtyper_v2.7.1_qc.tab.gz.tbi,
qc_metrics_graphtyper_v2.7.1_README.pdf
Whole genome GraphTyper SV data
/Bulk/Whole genome sequences/Whole genome GraphTyper SV data/
.vcf.gz, .vcf.gz.tbi
Whole genome variant call files (VCFs)
/Bulk/Whole genome sequences/Whole genome variant call files (VCFs)/
.g.vcf.gz
Whole genome variant call files (VCFs)
/Bulk/Whole genome sequences/Whole genome variant call files (VCFs)/
.g.vcf.gz.tbi
Whole genome variant call files (VCFs)
/Bulk/Whole genome sequences/Whole genome variant call files (VCFs) (reserved)/
.g.vcf.gz
Whole genome variant call files (VCFs)
/Bulk/Whole genome sequences/Whole genome variant call files (VCFs) (reserved)/
.g.vcf.gz.tbi
BQSR - GATK BaseRecalibrator
/Bulk/Whole genome sequences/BQSR - GATK BaseRecalibrator (reserved)/
.recal_table
Sample Contamination (ReadHaps)
/Bulk/Whole genome sequences/Sample Contamination (ReadHaps) (reserved)/
.contamination
Genotype Concordance - Contingency Metrics
/Bulk/Whole genome sequences/Genotype Concordance - Contingency Metrics (reserved)/
.genotype_concordance_contingency_metrics
Genotype Concordance - Detail Metrics
/Bulk/Whole genome sequences/Genotype Concordance - Detail Metrics (reserved)/
.genotype_concordance_detail_metrics
Genotype Concordance - Summary Metrics (Picard)
/Bulk/Whole genome sequences/Genotype Concordance - Summary Metrics (Picard) (reserved)/
.genotype_concordance_summary_metrics
Genotype Concordance
/Bulk/Whole genome sequences/Genotype Concordance (reserved)/
.nrd.stats
Concatenated QC Metrics
/Bulk/Whole genome sequences/
Concatenated QC Metrics (reserved)/
.qaqc_metrics
Sample Contamination (verifyBamID) - depthSM
/Bulk/Whole genome sequences/Sample Contamination (verifyBamID) - depthSM (reserved)/
.verifyBamID.depthSM
Sample Contamination (verifyBamID) - selfSM
/Bulk/Whole genome sequences/Sample Contamination (verifyBamID) - selfSM (reserved)/
.verifyBamID.selfSM
Manta-called scored structural variant and indel candidates (Vanguard)
/Bulk/Whole genome sequences/Manta-called scored structural variant and indel candidates (Vanguard)/
.diploidSV.vcf.gz.tbi, .diploidSV.vcf.gz
Manta-called unscored structural variant and indel candidates (Vanguard)
/Bulk/Whole genome sequences/Manta-called unscored structural variant and indel candidates (Vanguard)/
.candidateSV.vcf.gz.tbi, .candidateSV.vcf.gz
DRAGEN population level WGS variants, pVCF format [500k release]
/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr{1-22,X,Y,M}/
vcf.gz, vcf.gz.tbi
MNI Native Transform
/Bulk/Brain MRI/Native atlases/
zip
Native aparc a2009s dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native aparc dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Glasser dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Schaefer7n200p dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Schaefer7n500p dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Tian Subcortex S1 3T dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Tian Subcortex S4 3T dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native Schaefer7n1000p dMRI
/Bulk/Brain MRI/Native atlases (Diffusion)/
zip
Native aparc a2009s SF
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
Native aparc SF
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
Native Glasser SF
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
Native Schaefer7n100p to 1000p SF
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
Native Tian Subcortex S1 to S4 3T
/Bulk/Brain MRI/Native atlases (Structural and functional)/
zip
fMRI timeseries aparc a2009s
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries aparc
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries Glasser
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries global signal
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries Schaefer7ns 100p to 1000p
/Bulk/Brain MRI/Functional time series/
zip
fMRI timeseries Tian Subcortex S1 to S4 3T
/Bulk/Brain MRI/Functional time series/
zip
Connectome aparc a2009s and Tian Subcortex S1 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome aparc and Tian Subcortex S1 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Glasser and Tian Subcortex S1 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Glasser and Tian Subcortex S4 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Schaefer7n1000p and Tian Subcortex S4 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Schaefer7n200p and Tian Subcortex S1 3T
/Bulk/Brain MRI/Connectomes/
zip
Connectome Schaefer7n500p and Tian Subcortex S4 3T
/Bulk/Brain MRI/Connectomes/
zip
Tractography endpoints coordinates
/Bulk/Brain MRI/Connectomes/
zip
Tractography quality metrics
/Bulk/Brain MRI/Connectomes/
zip
Data release version
Tabular participant data
Bulk Data
Released on the Research Analysis Platform
v18.1
This release includes all fields from previous releases, plus these new fields:
Note: As a part of this release, the individual-level data, such as CRAM and VCF files, from the first 200k participant WGS release has been merged into the enduring 500k release fields.
November 30 2023
v17.1
November 30 2023
v16.1
This release includes all fields from previous releases, plus these new fields:
26301, 24048, 24050, 24051, 24053, 24055, 24056, 24058, 24059, 24061, 24062, 24064, 24065, 20278, 20279.
Note: As part of this release, data were added covering additional participants to the following fields:
20216, 20220, 21011, 21013, 21014, 21012, 21016, 21015, 20222, 20223, 20226, 20211, 20212, 20213, 20214, 20207, 20208, 20209, 20210, 20225, 20217, 20218, 20219, 20224, 20201, 20204, 20202, 20241, 20158, 20260, 20259, 20254, 20205, 20266, 20243, 20264, 20267, 24664, 24662, 24663, 24661, 24665, 24030, 24031, 24032, 24033, 24034, 24035, 24036, 24037, 24038, 24039, 24040, 24041, 24042, 24043, 24044, 24045, 24046, 24047
Note: As a part of this release corrected version of files were issued for fields 23374, 21007 and 21008.
August 2 2023
v15.1
This release includes all fields from previous releases, plus these new fields: 23365, 24305, 24306, 30900
Note: As part of this release, imaging data were added covering additional participants to the following fields: 20216, 20220, 21011, 21013, 21014, 21012, 21016, 21015, 20212, 20213, 20214, 20207, 20208, 20209, 20210, 20225, 20217, 20218, 20219, 20224, 20201, 20204, 20202, 20241, 20158, 20252, 20253, 20249, 20227, 25750, 25752, 25751, 20250, 25753, 20251, 20260, 20259, 20254, 25754, 25755, 20266, 20243, 20264, 20267
April 12 2023
v14.1
This release includes all fields from previous releases, plus these new fields: 23374, 24664, 22300, 24662, 22298, 22299, 24663, 24661,
Note: As part of this release, imaging data were added covering additional participants to the following fields: 20216, 20220, 20222, 20223, 20226, 20211,
20212, 20213, 20214, 20207, 20208, 20209, 20210, 20225, 20217, 20218, 20219, 20224, 20201, 20204, 20202, 20241, 20158, 20260, 20259, 20254, 20205, 20266, 20243, 20264,
v13.1
This release includes all fields from previous releases, plus these new fields: 21007 and 21008.
Note: As part of this release, for fields 23372, 23373, 23370, 23371, 23376, 23377, 23383, 23384, 23380, 23379, 23378, 23382, and 23381 data was added for any remaining WGS participants.
Note: As part of this release, for fields 23192, 23193, and 23194, the corrected version of CRAMs/CRAIs/gVCF.tbi files were issued for 95 files.
Note: As part of this release, for fields 23376, 23377, 23378, 23379, 23380, 23382, 23383, and 23384, corrected version of supplemental files were issued.
December 7 2022
v12.1
This release includes all fields from previous releases.
Note: As part of this release, imaging data were added, covering additional participants, to fields 20158, 20201, 20202, 20204, 20205, 20207, 20208, 20209, 20210, 20211, 20212, 20213, 20214, 20216, 20217, 20218, 20219, 20220, 20222, 20223, 20224, 20225,
20226, 20227, 20241, 20243, 20249, 20250, 20251, 20252, 20253, 20254, 20259, 20260, 20263, 20264, 20266, 20267, 25750, 25751, 25752, 25753, 25754, and 25755.
Note: As part of this release, folder names were updated for the following fields: 23157, 23158, and 23159.
June 29 2022
v11.1
This release includes all fields from previous releases, plus these additional fields: 24030, 24031, 23157, 23158, 23159
Note: As part of this release, for fields 23141, 23142, 23143, and 23144, data were added for an additional 20k participants, and data were updated for 51 existing participants.
Note: As part of this release, for fields 23372, 23373, data were added for an additional 20k participants.
Note: As part of this release, for fields 23370, 23371, 23376, 23377, 23378, 23379, 23380, 23381, 23382, 23383, 23384, data were added for an additional 90k participants.
May 25 2022
v10.1
April 14 2022
v9.1
Same as v6.1
This release includes all fields from previous releases, plus these new fields:
As part of this release, imaging data were added, covering additional participants, to fields 20158, 20201, 20202, 20204, 20205, 20207, 20208, 20209, 20210, 20211, 20212, 20213, 20214, 20215, 20216, 20217, 20218, 20219, 20220, 20222, 20223, 20224, 20225,
20226, 20227, 20241, 20243, 20249, 20250, 20251, 20252, 20253, 20254, 20259, 20260, 20263, 20264, 20265, 20266, 20267, 21014, 21015, 22002, 25750, 25751, 25752, 25753, 25754, and 25755. WGS data covering additional participants were added to fields 23372 and 23373.
Feb 9 2022
v8.1
Same as v6.1
This release includes all fields from previous releases, with these updates:
As part of this release, for fields 23191, 23193, 23194, 23197, 23346, 23348, 23349, 23350, and 23351, data were added for an additional 50k participants, bringing the total for these fields to 200k participants.
As part of this release, fields 23370, 23371, 23372, 23373, 23376, 23377, 23378, 23379, 23380, 23381, 23382, 23383, and 23384 were updated to contain data for an additional 200k participants (over and above the 200k participants of fields 23191 etc.). These fields are reserved for the WGS consortium to house data prior to it becoming public.
Nov 15 2021
v7.1
Same as v6.1
This release includes all fields from previous releases, plus these additional fields:
Note: Starting from this release, FAM files for field 23145 and SAMPLE files for field 23147 contain gender info.
Note: As part of this release, for fields 23141, 23142, 23143, and 23144, data were added for an additional 150k participants, and data were updated for 44 existing participants.
Oct 29 2021
v6.1
All fields from previous releases
Sept 22 2021
v5.0
Same as v4.0
Sept 3 2021
v4.0
July 27 2021
v3.0
This release includes all fields from v1.0 and v2.0, plus these additional fields, for participants for which there were data, as of March 2021:
6025, 20158, 20201, 20202, 20203, 20204, 20205, 20206, 20207, 20208, 20209, 20210, 20211, 20212, 20213, 20214, 20215, 20216, 20217, 20218, 20219, 20220, 20221, 20222, 20223, 20224, 20225, 20226, 20227, 20241, 20243, 20249, 20250, 20251, 20252, 20253, 20254, 20259, 20260, 20263, 20264, 20265, 20266, 20267, 21011, 21012, 21013, 21014, 21015, 21016, 21017, 21018, 22002, 23181, 23182, 23183, 23184, 25747, 25748, 25749, 25750, 25751, 25752, 25753, 25754, 25755, 90001, 90004
June 4 2021
v2.0
Jan 28 2021
v1.0
Nov 19 2020
Issue | Example error message | What to do |
---|---|---|
Cannot open or launch JupyterLab session | | After the job is launched, it may take ~10-15 minutes before the JupyterLab server is accessible. Note: this wait time applies to all cloud applications, including RStudio. If waiting doesn't work, try adding the port number to the address; currently this helps for ports 8080 and 8081. |
Timeout error when working with a Spark object | | Try using the latest version of JupyterLab, available via the Tools > JupyterLab tab, where you can launch a new environment using the New JupyterLab button or re-launch an old JupyterLab session. By default, the JupyterLab environment will use the latest version available. If the issue is limited memory, you may need to use an instance type with a larger memory allocation. |
Issue accessing a large dataset using Spark | | You have exceeded the allowable buffer limit size for Kryo serialization. Adjust the buffer using the code below. |
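A minimal sketch of the Kryo buffer adjustment is shown below; the 1024m value is an example and should be tuned to your workload.

```python
# Raise the Kryo serializer buffer limit; set this before the Spark session is first created.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.kryoserializer.buffer.max", "1024m")
    .getOrCreate()
)
```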
Learn how to export selected phenotypic fields into a TSV or CSV file, for easy browsing and analysis.
If you've worked with UK Biobank data prior to using the Research Analysis Platform, you may be aware that UK Biobank distributes the main tabular dataset in a large encoded file with the extension .enc_ukb
. To work with the dataset, you first convert this file to TSV or CSV format.
On the Research Analysis Platform, this dataset is dispensed into your project as a database, in Parquet format. You can access this database within a Spark environment - for example, by querying it from inside a Spark JupyterLab session.
If you have existing code that relies on reading just a handful of fields from a file, you may find it easier to extract those fields from the database and dump them into a TSV or CSV file. You can then run your code or otherwise work with the file, without having to do so within a Spark environment.
Start by navigating to your project and clicking on the name of the dispensed dataset. The Cohort Browser will launch.
In the Cohort Browser, open the Data Preview tab:
Click the "grid" icon at the right end of the Participant ID header row. Then click Add Columns. The Add Columns to Table dialog will open:
Navigate to any field, either directly or via search. Once you've found the field you're looking for, click Add as Column:
Continue locating the fields you're interested in, and adding them as columns. Note that as you add additional fields as columns, you do not have to wait for the Data Preview to finish loading.
Once you've finished, close the dialog by clicking the X to the right of the Add Column to Table title. In the Data Preview tab, you'll see the first few rows of the data.
In the upper right corner of the screen, click Views, then click Save View. Enter a name for the view, then save it.
Now convert your saved view into a TSV or CSV file, using the Table Exporter app.
Navigate back to your project and click the Start Analysis button in the upper right corner of the screen. In the Start New Analysis dialog, select the Table Exporter app, then click Run Selected. Note that if this is the first time you've run Table Exporter, you'll be prompted to install it first.
Within the Table Exporter app, open the Analysis Inputs tab on the right side of the screen. Then click the Dataset or Cohort or Dashboard tile:
A modal window will open. Select the view that you created and saved in the Cohort Browser.
Within the Options section, configure your output options.
In the Output File Name field, enter a filename prefix. In the Output File Format field, select "CSV" or "TSV." You may find it easier to work with a TSV file downstream, because the values in certain fields contain commas, complicating the parsing of a CSV file.
In the Coding Option field, select "RAW" so that you can work with the original UK Biobank data, as you would get them from the Biobank. (For example, in the Sex field, you will see the coded value "0" rather than "Female.")
In the Header Style field, select "UKB-FORMAT" to get headers that match the original UK Biobank format (e.g. 123-4.5).
Click Start Analysis. Once the conversion finishes and the file is ready, you will be notified via email. To access the file, either return to your project, or click the link in the email.
For general tips on troubleshooting, see the troubleshooting guide.
Learn how to search and analyze UK Biobank bulk data files.
This section provides a detailed breakdown of how to search for an EID in participant-specific files, such as individual VCF and CRAM files. Note that these methods won't work for cohort-wide files, such as PLINK and pVCF files.
Turn on the filters in your project, by clicking on the filter icon.
Use the filter picker to open the Properties filter.
Select Any Properties and type "eid" (without the quotes, in lower-case letters) in the Any Key textbox.
In the Any Value textbox, enter the 7-digit EID you're trying to locate.
Select Apply.
To search across all folders, set the search scope to Entire Project.
Search for an EID as follows, replacing "1234567" with the EID you're trying to find: dx find data --property eid=1234567
To visualize a CRAM or VCF file in the IGV.js genome browser, follow these steps:
Navigate to the project containing the files you want to visualize.
Select the Visualize tab.
Select the option "IGV.js Genome Browser v2.6.6 (*.bam+bai, *.cram+crai, *vcf.gz+tbi)".
Select the files you want to visualize.
If you are looking for a specific participant, enter the EID in the Search Project textbox, to quickly locate any CRAM or VCF files related to that participant.
For CRAM files, you must select both the CRAM and the associated CRAI file.
For VCF files, you must select both the VCF and the associated TBI file.
IMPORTANT: Note that IGV.js cannot visualize extremely large pVCF files, such as those provided for the 200k WES, 300k WES or 150k WGS releases. If you want to visualize variants in the 150k WGS cohort, type either "qc_metrics_graphtyper" or "qc_metrics_gatk" into the Search Project textbox and select the resulting pair of *.tab.gz and *.tab.gz.tbi files.
Select Launch Viewer.
The Research Analysis Platform provides many different tools for analyzing files. The Swiss Army Knife app is a simple starting point for many common bioinformatics manipulations. Launching the app will instantiate a Linux VM on the cloud with several preinstalled tools, and run a user-provided command. For more information about this app and its possibilities, visit its entry in the Tools Library.
To launch Swiss Army Knife, navigate to your project and click Start Analysis. Select Swiss Army Knife and click Run Selected. Select the Analysis Inputs tab. You can choose between specifying explicit inputs or using a mounted project folder.
Explicit Inputs. Use this strategy to analyze files that will be first downloaded on the local disk of the cloud VM.
Click Input files. Navigate to a folder of interest (for example, Bulk
> Genotype Results
> Genotype calls
), and tick the files of interest (for example, <Chromosome 21 file>.bed
, <Chromosome 21 file>.bim
and <Chromosome 21 file>.fam
). Click Select as Input.
In the Command line textbox, enter a command, referring to files directly with their names (for example, plink --bfile <Chromosome 21 file> --maf 0.1 --out filtered_chr21
)
Mounted project folder. Use this strategy to analyze files that will be streamed directly without first writing them on disk.
In the Command line textbox, enter a command, referring to any files in the project using the prefix /mnt/project
(for example, plink --bfile "/mnt/project/Bulk/<Path to chromosome calls>" --maf 0.1 --out filtered_chr21
).
It is also possible to combine the two strategies. For example, you can provide an R script as explicit input (such as statistics.r
), a command to run the script (such as Rscript statistics.r
) , and inside the script you can read any project files by opening them from the /mnt/project
folder (such as fields <- read.csv("/mnt/project/<Path to project files>", sep="\t")
)
For general tips on troubleshooting, see the troubleshooting guide.
Learn how to determine what priority level is suited for your analysis, and how the different options affect job execution.
On the Research Analysis Platform, analyses are executed on Virtual Machines (VMs) using the Amazon Elastic Compute Cloud (Amazon EC2).
When a new job is submitted, the Platform requests a new VM from EC2. Two types of VMs can be used:
On-demand VMs: These VMs cost more, but are always available.
Spot VMs: These VMs cost less, but may not be available at the time of the request, so the system may have to wait until they become available. Note also that even after becoming available, the availability of spot VMs can be interrupted, which leads to interruptions in your job execution.
On the Research Analysis Platform, each job has a priority level. The priority level of a job can be set in either of the following ways.
To set a job’s priority level when launching JupyterLab from the Tools menu:
Click on the Tools tab from the global menu on the top of your screen on the Platform and click on JupyterLab.
In the New JupyterLab modal that appears, fill out the required fields for your JupyterLab job. For the Priority field, choose a priority level for your job (high, normal, or low).
After filling out the rest of the required fields, click Start Environment.
To select a job’s priority level from the Manage tab:
From your project’s Manage tab, click Start Analysis.
Select the tool you want to run and click the Run Selected button.
In the Analysis Settings tab, fill out the required fields. For the Priority field, choose a priority level for your job (high, normal, or low).
Fill out the rest of the steps and then click Start Analysis.
High priority jobs run with an on-demand VM. The use of high priority is recommended for workloads that need to be executed as soon as possible including any interactive web applications such as JupyterLab or Cloud Workstation. Since the JupyterLab app takes time to initiate, users who wish to execute their analyses without interruptions or restarts may want to use high priority to execute these jobs.
For normal priority jobs, the Platform first requests spot VMs from EC2, then waits up to 15 minutes for these to become available. If spot VMs become available within that time frame, the job will start executing on those spot VMs, but this execution may be interrupted. If spot VMs do not become available within 15 minutes, the Platform requests on-demand VMs from EC2.
For normal priority jobs, the final outcome (on-demand vs. spot) depends on spot availability.
Use normal priority for analyses for which you are willing to tolerate spot risks in exchange for a lower price, but don't want to wait more than 15 minutes for spot VMs.
For a low priority job, the Platform always asks EC2 for spot VMs. If spot VMs are available, the job will start executing right away. Otherwise, the job will remain in a "runnable" state for an indefinite period while waiting for a spot VM to become available. Once running, jobs can become interrupted if a spot VM becomes unavailable in the middle of execution, usually due to increased cloud demand.
Use low priority for nonurgent workloads that do not have to run right away, or for workloads that do not need to be uninterrupted.
When jobs run on spot VMs, there is a small risk of interruption. An interruption occurs when a spot VM becomes unavailable in the middle of execution, usually due to increased cloud demand. In that case, the job is marked as failed due to an unresponsive VM. What happens next depends on the job's "restart policy." By default, the Platform retries any job that fails due to unresponsive VMs, up to two times. When a job is retried, a new VM is allocated depending on priority. Low priority jobs are retried again on spot, whereas normal priority jobs are retried on spot for 15 minutes and then switched to on-demand.
Note that if your job is interrupted, you will still be billed for usage charges incurred prior to the interruption.
Learn to use Spark in JupyterLab to analyze UK Biobank tabular data.
In the platform select Tools, then JupyterLab.
Click New JupyterLab.
Type a descriptive title for your JupyterLab session in the Environment Name textbox.
Select a project where you want to run JupyterLab.
Click Spark Cluster under Cluster Configuration.
Select an instance type and number of nodes. This will affect how powerful the Spark cluster will be. The default settings allow for casual interrogation of the data. If you will be running complex queries or analyzing a large amount of data in memory, you may need to select a larger instance type. To increase parallelization efficiency and reduce processing time, you may need to select more nodes.
Click Start Environment. A new row will appear and the environment will begin initializing.
Once the status becomes Ready, click the name to connect to JupyterLab.
Inside JupyterLab, select the DNAnexus menu, then select New Notebook.
Click the DNAnexus tab on the left and locate the new notebook (Untitled_<DATE>.ipynb
).
Double-click the notebook name to open it.
Select Python 3 as the kernel. You are now ready to begin using the notebook.
To begin, import relevant Spark and DNAnexus libraries, and instantiate a Spark context and Spark session at the very top of your notebook, as shown below.
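A minimal initialization cell might look like the following sketch, assuming the standard pyspark, dxpy, and dxdata packages available in the Spark JupyterLab environment:

```python
# Import Spark and DNAnexus libraries, then create the Spark context and session.
import pyspark
import dxpy
import dxdata

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
```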
Ensure that your Spark session is only initialized once per JupyterLab session. If you try to evaluate this cell multiple times (for example, by selecting "Run All Cells" to rerun a notebook after it's already run, or by opening and running multiple notebooks in the same JupyterLab session), you may encounter errors or your notebook may hang. If that happens, you may need to restart the specific notebook's kernel.
As a best practice, shut down the kernel of any notebook you are not using, before running a second notebook in the same session.
To improve the reproducibility of your notebooks, and ensure they are portable across projects, it is better not to hardcode any database or dataset names. Instead, you can use the following code to automatically discover the database and dataset:
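For example, the following sketch uses dxpy to locate the dispensed database and dataset; the "app*" name patterns assume the default names given to dispensed objects in your project.

```python
import dxpy

# Find the dispensed database and remember its name for SQL queries.
dispensed_database = dxpy.find_one_data_object(
    classname="database", name="app*", folder="/", name_mode="glob", describe=True)
dispensed_database_name = dispensed_database["describe"]["name"]

# Find the dispensed dataset and remember its ID for loading with dxdata.
dispensed_dataset = dxpy.find_one_data_object(
    typename="Dataset", name="app*.dataset", folder="/", name_mode="glob", describe=True)
dispensed_dataset_id = dispensed_dataset["id"]
```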
To evaluate SQL, you can use the spark.sql("...")
function, which returns a Spark DataFrame.
You can view the contents of a DataFrame (in full width) by calling .show(truncate=False)
on it.
The following example lists the tables in the database:
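For example, assuming dispensed_database_name was discovered as shown above:

```python
# List all tables in the dispensed database.
spark.sql(f"SHOW TABLES IN {dispensed_database_name}").show(truncate=False)
```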
The database contains the following tables:
When listing tables in SQL, you may notice each table appearing twice, under a regular name and a versioned name, such as "gp_clinical" and "gp_clinical_v4_0_9b7a7f3". This naming scheme is part of the system's architecture, supporting data refreshes and participant withdrawals.
The "regularly named" table is actually a SQL VIEW pointing to the versioned table. When data is updated, the VIEW is switched to point to a new versioned table, and the old versioned table is deleted. Due to this behavior, please make sure to always use the regularly named tables - such as "gp_clinical" - because the versioned tables do not persist over time.
If your access application has been approved for Data-field 23146, 23148, and/or 23157 you will also see the following tables:
allele_23146, allele_23148, allele_23157, annotation_23146, annotation_23148, annotation_23157, assay_eid_map_23146, assay_eid_map_23148, assay_eid_map_23157, genotype_23146, genotype_23148, genotype_23157, pheno_assay_23146_link, pheno_assay_23148_link, pheno_assay_23157_link, rsid_lookup_r81_23146, rsid_lookup_r81_23148, and rsid_lookup_r81_23157.
These tables contain limited information about alleles and genotypes, transcribed into SQL from the pVCF files of Data-field 23146 and/or 23148 and/or 23157 (along with added annotations). These tables are used by the Cohort Browser in the creation of the "GENOMICS" tab. They have not been optimized for direct SQL querying, and their schema and conventions are subject to change. For this reason, it is not recommended to access these tables on your own but to access the bulk files instead.
For the main UK Biobank participant tables, the column-naming convention is generally as follows:
p<FIELD-ID>_i<INSTANCE-ID>_a<ARRAY-ID>
However, the following additional rules apply:
If a field is not instanced, the _i<INSTANCE-ID>
piece is skipped altogether.
If a field is not arrayed, the _a<ARRAY-ID>
piece is skipped altogether.
If a field is arrayed due to being multi-select, the field is converted into a single column of type "embedded array", and the _a<ARRAY-ID>
piece is skipped altogether.
Examples:
Age at recruitment: p21022
Date of attending assessment centre: p53_i0
, p53_i1
, ...
Diagnoses - ICD10 (converted into embedded array): p41270
For linked health care tables, it is easier to use SQL directly to extract data as a Spark DataFrame. The following example retrieves all GP records related to serum HDL cholesterol levels.
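A sketch of such a query is shown below. The Read codes used here to identify serum HDL cholesterol records, and the column names, are assumptions based on the UK Biobank primary care data layout; check the coding dictionaries for your own study.

```python
# Retrieve GP clinical records whose Read v2/v3 code matches serum HDL cholesterol.
df = spark.sql(f"""
    SELECT eid, event_dt, read_2, read_3, value1, value2, value3
    FROM {dispensed_database_name}.gp_clinical
    WHERE read_2 LIKE '44P5%' OR read_3 LIKE '44P5%'
""")
```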
Spark DataFrames are lazy-evaluated. In the code block above, the command will return right away, assigning the variable df
without executing the query. The query is only evaluated when needed, potentially with additional transformations.
For example, typing df.count()
later will evaluate an equivalent SELECT COUNT(*)
...
As mentioned above, the dataset combines the low-level database structure with metadata from the UK Biobank Showcase. Database tables are exposed as virtual entities, and database columns are exposed as fields. The split participant tables are all combined into a single entity called participant
.
To fetch participant fields, you must first make a list of field names of interest. There are three main ways to look up field names:
Once you have gathered field names of interest, you can create a Spark DataFrame of corresponding participant data using the retrieve_fields
function:
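A minimal sketch, assuming the participant entity was loaded as described elsewhere in this guide and using a few example fields:

```python
import dxdata

# EID, sex, age at recruitment, and weight at the first assessment visit.
field_names = ["eid", "p31", "p21022", "p21002_i0"]

df = participant.retrieve_fields(names=field_names, engine=dxdata.connect())
df.show(5, truncate=False)
```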
This function automatically joins across the split participant tables as needed, returning the requested columns as a Spark DataFrame.
If a particular Spark command is taking too long to evaluate, you can monitor the Spark status by visiting the Spark console page. To do that, copy the URL of your current JupyterLab session (typically ending in ".dnanexus.cloud/lab?"
), open a new browser tab, and paste the URL. Replace "/lab?"
with ":8081/jobs/"
and press Enter.
It is a good idea to always include the participant EID ("eid"
) as the first field name, so that it is returned as the first column. If you don't include it, the system will not return it automatically.
By default, the system returns the data as encoded by UK Biobank. For example, field p31
(participant sex) will be returned as an integer column with values of 0 and 1. To receive decoded values, supply the coding_values="replace"
argument.
The returned DataFrame uses the field names as the column titles. If you prefer to give them some human-readable names, you can provide a mapping from field names to your own names using the column_aliases={"p21002_i0": "weight", ...}
argument.
To retrieve the EIDs of a cohort that you previously saved via the Cohort Browser - in this example, a cohort named "controls" in the root folder - use the following code:
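For example, assuming the cohort record is named "controls" and saved in the project root:

```python
# Load the saved cohort; cohort.sql holds the SQL filter defining its participants.
cohort = dxdata.load_cohort("/controls")
```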
To retrieve participant fields for a cohort, supply the filter_sql=cohort.sql
argument:
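For example, reusing the field list and cohort from above:

```python
# Retrieve fields only for participants belonging to the saved cohort.
df = participant.retrieve_fields(
    names=field_names,
    filter_sql=cohort.sql,
    engine=dxdata.connect(),
)
```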
Learn about how usage is billed on the Research Analysis Platform, how to set up a billing account to cover charges you’ll incur, your initial credit, and more.
Users of the Research Analysis Platform are billed according to usage. Platform users incur costs for:
Using compute resources
Storing data other than that dispensed to a project by UK Biobank. This includes uploaded data, or data generated in the course of work on the Platform.
Data egress
Users are not charged for the cost of storing UK Biobank data that has been dispensed to a project. The cost of storing this data is sponsored by Amazon Web Services.
Each new user receives a £40 credit toward covering usage costs. Setting up a billing account will not impact your initial usage credit and you will only be invoiced for usage beyond this initial £40.
To ensure uninterrupted access to all features, you must provide a valid billing account before your credit runs out. See the next section for info on how to set up a new billing account.
On the Research Analysis Platform, you can enable shared billing for a group of users, by doing either of the following:
Set up billing for your personal billing account - i.e. your "wallet" - and add users to this account.
Create and set up billing for a new organization, then add users to it.
To set up billing for, and add members to your personal billing account:
From the global menu, select Org Admin.
From the dropdown menu, select All Orgs.
On the Organizations list page, select your personal billing account.
Open the Billing tab, then click Set Up Billing.
Follow the steps in the Set Up Billing wizard.
Open the Members tab for the new organization.
Repeat this for each new person you'd like to add.
On the Research Analysis Platform, users can create an organization for the specific purpose of enabling shared billing for a group of users. To do this:
From the global menu, select Org Admin.
From the dropdown menu, select All Orgs.
On the Organizations list page, click the New Organization button.
A New Organization form will open in a dialog. In the form, enter a unique name and ID for your new organization.
Click Create Organization.
You'll be prompted to set up billing for your new organization. Click Start Setup in the Set Up Billing modal window.
Follow the steps in the Set Up Billing wizard.
Open the Members tab for the new organization.
Repeat this for each new person you'd like to add.
Every billing account has a spending limit - a limit on unpaid charges that can be incurred, before access is limited to key features, including the ability to perform billable activities. See the next section for more information.
Note that you are responsible for paying all charges you incur on the Platform.
When a billing account’s spending limit is exceeded, functionality is restricted for those whose usage is billed to the account. They temporarily lose the ability to launch new analyses, upload data, or egress data.
To restore full functionality, payment must be made to cover incurred charges, or the billing account’s spending limit must be raised. For users working in a project linked to the restricted billing account, the project admin may also link the project to a different billing account, whose spending limit has not been exceeded.
Learn how to work cost-effectively on the Research Analysis Platform
Platform users incur costs for:
Using compute resources
Storing data other than that dispensed to a project by UK Biobank. This includes uploaded data, or data generated in the course of work on the Platform.
Data egress
Users are not charged for the cost of storing UK Biobank data that has been dispensed to a project. The cost of storing this data is sponsored by Amazon Web Services.
Smart Reuse is available to all Research Analysis Platform users.
When running a job on the Platform, you must select a compute instance on which to execute the job. It’s difficult to make general recommendations about which instances are best in each situation, and how to balance speed and cost-efficiency.
It can be helpful to log into a running instance using dx ssh
and check how the CPU and memory of the machine are being utilized, using a utility such as htop. If CPUs are idle or memory is under-utilized, you may be able to save money by selecting a smaller instance type for that job, or by changing the configuration of the tool being run to use more threads (if applicable to that specific tool).
Each analysis job is run with a priority setting.
High priority jobs use on-demand virtual machines, i.e. compute instances that are immediately available. This costs more than running a job at low priority, which uses spot instances, i.e. virtual machines that may or may not be immediately available. Running a job at normal priority, meanwhile, means the system will first try, for 15 minutes, to secure a spot instance or instances, only using more expensive on-demand instances if spot instances are unavailable.
Storage costs can add up, if you create or upload large files, particularly if you store them for long periods of time. For this reason, proper file management is essential to using RAP in a cost-efficient fashion. For example, if, in the course of running an analysis, you generate intermediate files, you should consider carefully whether these are worth saving. They may be useful for future analyses. But if the compute cost and effort needed to generate them is low, you might consider re-creating them rather than storing them until you need them again.
Be aware that when users are added to a billing account, they can incur costs that must be paid by the person or entity responsible for that account. Note as well that incurring such costs will affect the billing account’s spending limit, and thus other users’ ability to run jobs billed to that account.
Note that when a billing account’s spending limit is exceeded, an email notification is sent to the account owner, and functionality is restricted for all those whose usage is billed to the account. They temporarily lose the ability to launch new analyses, upload data, or egress data. To restore full functionality, payment must be made to cover incurred charges or the billing account’s spending limit must be raised. For users working in a project linked to the restricted billing account, the project admin may also link the project to a different billing account, whose spending limit has not been exceeded.
The Research Analysis Platform provides a full-featured toolkit for preparing and analyzing a wide range of datatypes.
The Tools Library under the Tools tab of the platform shows a complete list of apps and workflows available to you. Use filters to quickly find items by name, category, etc.
Clicking on the name of an app will open up a separate information page which contains details about the app's inputs, outputs, and other documentation details. For apps encapsulating existing bioinformatics tools this page also contains licensing information, links to the website for that software, and citations to any related publications. This page also shows the version history for the app and developer documentation which describes the inner workings of the app in detail.
BCFtools
BEDtools
VCFtools
vcflib
PLINK
PLINK2
Sambamba
SAMtools
Picard
Tabix
Seqtk
bgzip
bgenix
GATK4 apps
Germline Best Practice BAM/CRAM to VCF workflow
PLINK
PLINK2
PLATO
BOLT-LMM
REGENIE
QCTool
SAIGE apps
BiocManager
MendelianRandomization
coloc
HyPrColoc
epiR
prevalence
incidence
outbreak
Hail
VEP
NumPy
SciPy
Matplotlib
Seaborn
Pandas
JupyterLab app with STATA feature provides access to
Stata (stata license to be provided by the user)
nipype
FreeSurfer
FSL
tensorflow
torch
cntk
keras
scikit-Learn
MLlib
Navigate to the Settings tab. Check that the Delete Access policy is set to 'Contributors & Admins'. If it is set to ‘Admins only’, the project will be considered protected. Information on protected projects can be found in the DNAnexus documentation.
Try monitoring the job using the Spark UI:
Pricing for these jobs can be found in the "On-demand GBP/hr" column on the Research Analysis Platform .
Pricing depends on which type of VM is used to execute the job. See pricing details in the "On-demand GBP/hr" and "Spot GBP/hr" columns of the Research Analysis Platform .
Pricing for these jobs can be found in the "Spot GBP/hr" column named on the Research Analysis Platform .
Apache Spark is a modern, scalable framework for parallel processing of big data. To analyze tabular data using Spark in JupyterLab, you first need to launch JupyterLab in a Spark cluster configuration. For information on how to use Hail with JupyterLab, see the example notebooks.
For all other tables - such as hospital records, GP records, death records, and COVID-19 records - the column names are identical to what UK Biobank provides in its Showcase. For more information on the columns of these tables, consult the UK Biobank documentation for hospital records, GP records, death records, COVID-19 GP records, and COVID-19 test results.
The main participant data is horizontally split into multiple tables, and you may find that SQL is less than suitable for querying those tables directly. To access main participant data, consider using the dxdata approach described below.
For a list of all DataFrame functions, consult the Spark DataFrame API documentation.
For an example Jupyter notebook that demonstrates how to extract data, see the example notebooks provided for the Research Analysis Platform.
After discovering the dataset ID (as shown earlier), you can load the dataset and access the participant entity as follows:
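A minimal sketch, assuming dispensed_dataset_id was discovered as shown earlier:

```python
import dxdata

# Load the dispensed dataset and access the combined participant entity.
dataset = dxdata.load_dataset(id=dispensed_dataset_id)
participant = dataset["participant"]
```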
If you already know the UK Biobank data-field ID, or if you navigate to the UK Biobank Showcase and browse or search for data-fields, you can construct the field name using the naming convention described above. For example, data-field 21002 (participant weight) corresponds to field names p21002_i0
through p21002_i3
.
You can look up field names in the Cohort Browser. The following screenshot shows an example of searching for the "weight" keyword and locating the name of a field ( p21002_i0
, shown next to the Link label).
You can look up fields programmatically, by iterating over all fields in the participant.fields
array, or by using the function participant.find_fields
. Refer to the dxdata documentation for more information. The following example finds all fields with a matching case-insensitive keyword "weight" in their titles:
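A simple sketch of this lookup, iterating over participant.fields (participant.find_fields offers similar functionality):

```python
# Find participant fields whose titles contain "weight" (case-insensitive).
matching_fields = [f for f in participant.fields if "weight" in f.title.lower()]
for field in matching_fields:
    print(field.name, field.title)
```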
You can continue to work with the Spark DataFrame and leverage Spark functions for counting, filtering, aggregations, or statistics. Consult the Spark documentation for more information. Spark functions are executed in a distributed manner across the Spark cluster.
If you prefer to load all the results in memory, instead of keeping them in a parallelized and decentralized Spark DataFrame, simply convert the Spark DataFrame to a Pandas DataFrame by calling .toPandas()
. This will return a Pandas DataFrame in memory, which you can manipulate further using other Pandas functions. Pandas functionality runs in the same VM as JupyterLab and does not leverage the Spark cluster.
For detailed guidance on controlling costs, see the guide to working cost-effectively on the Research Analysis Platform.
Setting up a billing account is the same on the Research Analysis Platform as on the DNAnexus Platform. See the DNAnexus documentation for details.
Click the Invite New Member button. Complete the form in the dialog to add a new member to your organization. More about project access levels can be found in the DNAnexus documentation.
In the Access section of the form, select your preferred options in the Project Transfer Access and Member List Access fields. For more information on the available options, see the DNAnexus Platform documentation on organizations.
Click the Invite New Member button. Complete the form in the dialog to add a new member to your organization. More about project access can be found in the DNAnexus documentation.
When you first set up a billing account, your account spending limit is set to £250.
All Platform rates are quoted in British pounds (£). See the Platform rate card for detailed information on rates for data storage, data egress, and use of compute resources.
To set up DNAnexus as a vendor, contact the DNAnexus team.
As detailed in this guide, when your analysis is particularly complex, submit jobs in small batches, to check for errors and ensure that you’re achieving the balance you’re trying to strike between speed and cost-effectiveness.
Smart Reuse is a feature that can enable significant cost savings. Smart Reuse enables the testing of complex workflows in a maximally resource-efficient fashion, before they’re run in a production environment. For a full description of Smart Reuse and how to use it, refer to the DNAnexus Platform documentation.
When first running a workflow, one useful approach is to select an instance that meets your cost standard, then set a limit on the job, to prevent it from running too long and thus incurring too high a usage charge.
See the available instance types and how to choose the right one for your purposes.
Consult the rate card for details on rates for using different types of instances.
In some situations, you might prefer to egress data from the Platform, then analyze it on your computer or local cluster. But be aware of the associated egress charges. It’s almost always more cost-efficient to use the Platform for all data processing, relying on local resources only for post-processing.
When creating an app or a global workflow for use on the Platform, you can set a cost limit, to ensure that running the app or workflow does not incur charges above a set amount. See the DNAnexus Platform documentation for details on how to set such cost limits.
Shared billing can be enabled by a user adding others to his or her “wallet,” i.e. personal billing account, or by setting up a new organization with billing and adding users to it.
Spending limits can be used to control spending by users on a common billing account. See the section on spending limits, including the default limit, how to raise it as needed, and the usage limitations that follow when an organization exceeds its limit.
app provides access to the following tools
provides access to the following tools
and apps provide access to the following R packages
app provides access to
app and app provide access to
app with IMAGE_PROCESSING feature provides access to
app with ML feature provides access to
app provides access to
fetches and uploads a file from a remote URL to the platform
transfers files from S3 bucket to the platform
provides a web-based terminal to a platform virtual machine
provides an ssh-accessible workstation
exports a specific entity or the cohort table to CSV
ingests a data file and creates a new, superset Dataset
moves cohorts from one Dataset to another in your application
February 1 2023
Data not exported
Warning: Out of memory
Try adjusting the instance type to use one with more memory/storage and re-run your Table Exporter query. Alternatively, you could try using the dx extract_dataset command within Spark JupyterLab; a sketch is shown below.
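A hypothetical invocation is sketched below; the record ID and field names are placeholders.

```bash
# Extract a few participant fields from the dispensed dataset into a CSV file.
dx extract_dataset record-XXXX \
  --fields "participant.eid,participant.p31,participant.p21022" \
  -o extracted_fields.csv
```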
Invalid characters found in field names on line number(s) 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13,...
Check that you provided your inputs correctly following the documentation.
Note: If you don’t provide an entity value, then by default Table exporter will use the “Participant” entity table
Failed to export data: An error occurred while calling o305.csv. : org.apache.spark.SparkException: Job aborted
Make sure you specified the entity to use
Export participant id (EID)
By default the participant identifier (EID) is no longer extracted.
In the Table Exporter app you’ll need to add “eid” to the File containing Field Names
parameter, as well as specify the entity
parameter in the Advanced Options. Entity refers to the entity table from which we are extracting data - e.g. “participant” or “olink_instance_0”.
Alternatively, if using dx extract_dataset
command, then you’ll need to specify <entity>.eid as one of the field names in your query. See example.
Are there spaces in the input argument "output" (Example: "physical activity")?
table_exporter.py: error: unrecognized arguments: activity
Remove spaces or replace them with underscores. Example: “physical_activity”
Is there a file containing a list of field names for the proteomics dataset?
All protein fields can be found here.
Failed job
Error while running the command (please refer to the job log for more information)
See troubleshooting guide to understand the problem
Error while mounting project-GKv25k0Jv6jzF4Gz2z0pxzzJ in /mnt/project (please refer to the job log for more information)
Error opening file: Bulk/Exome\sequences/Exome\OQ
Check if file path works using dx describe /project-xxx/file-xxx
or dx describe <filename>
(with a relative path)
Check that there are no typos in your file path
Issue with instance type
The machine running the job was terminated by the cloud provider", "try": 0
The spot instance that you’re running your job on may become unavailable in the middle of your execution during a period of increased demand for cloud computing capacity, which means that you will lose the work done on this instance and will have to restart the job.
To avoid this you have a few options:
1) Add SpotInstanceInterruption to your application restart-on policy. That will automatically restart the job in case of spot instance interruption. See documentation for more information.
2) Use the "High" priority settings in order to run your jobs on-demand and thus avoid SpotInstanceInterruption errors entirely, but these instances have a higher price.
3) Restart your job on a spot instance manually
Warning: Low disk space during this job
You need to update your instance type selection to make sure you have enough memory or storage (see the “Memory (GiB)” or “Storage (GiB)” columns in the rate card).
Cannot find input files or directory
No such file or directory
There are 2 ways to specify input files (see tool documentation):
Select inputs from the drop down menu
Specify input paths in the command prompt using mounting - i.e. add “/mnt/project/” as a prefix to file path
How can I run my command across all chromosomes?
Use the following bash script if data is already separated per chromosome:
for chr in {1..22}; do \
dx run app-swiss-army-knife --instance-type mem1_ssd1_v2_x8 -y \
-iin="project-xxxx:/path/to/file/ukb#####_c${chr}_b#_v#.bgen" \
… ;
done
If you need to separate by chromosome you can use the plink or bcftools command in swiss-army knife.
Table name | Description |
| These tables contain the main UK Biobank participant data. Each participant is represented as one row, and each data-field is represented as one or more columns. For scalability reasons, the data-fields are horizontally split across multiple tables, starting from table participant_0001 (which contains the first few hundred columns for all participants), followed by participant_0002 (which contains the next few hundred columns), etc. The exact number of tables depends on how many data-fields your application is approved for. |
| Hospitalization records. This table is only included if your application is approved for data-field #41259. |
| Hospital critical care records. This table is only included if your application is approved for data-field #41290. |
| Hospital delivery records. This table is only included if your application is approved for data-field #41264. |
| Hospital diagnosis records. This table is only included if your application is approved for data-field #41234. |
| Hospital maternity records. This table is only included if your application is approved for data-field #41261. |
| Hospital operation records. This table is only included if your application is approved for data-field #41149. |
| Hospital psychiatric records. This table is only included if your application is approved for data-field #41289. |
| Death records. This table is only included if your application is approved for data-field #40023. |
| Death cause records. This table is only included if your application is approved for data-field #40023. |
| GP clinical event records. This table is only included if your application is approved for data-field #42040. |
| GP registration records. This table is only included if your application is approved for data-field #42038. |
| GP prescription records. This table is only included if your application is approved for data-field #42039. |
| GP clinical event records (COVID TPP). This table is only included if your application is approved for data-field #40101. |
| GP prescription records (COVID TPP). This table is only included if your application is approved for data-field #40102. |
| GP clinical event records (COVID EMIS). This table is only included if your application is approved for data-field #40103. |
| GP prescription records (COVID EMIS). This table is only included if your application is approved for data-field #40104. |
| COVID19 Test Result Record (England). This table is only included if your application is approved for data-field #40100. |
| COVID19 Test Result Record (Scotland). This table is only included if your application is approved for data-field #40100. |
| COVID19 Test Result Record (Wales). This table is only included if your application is approved for data-field #40100. |
| COVID-19 vaccination data. This table is only included if your application is approved for data-field #32040. |
| Olink NPX values for the instance 0 visit. This table is only included if your application is approved for data-field #30900. For scalability reasons, the protein columns are horizontally split across multiple tables, starting from table |
| Olink NPX values for the instance 2 visit. This table is only included if your application is approved for data-field #30900. |
| Olink NPX values for the instance 3 visit. This table is only included if your application is approved for data-field #30900. |
| OMOP Condition Era. This table is only included if your application is approved for data-field #20142. |
| OMOP Condition Occurrence. This table is only included if your application is approved for data-field #20142. |
| OMOP Death. This table is only included if your application is approved for data-field #20142. |
| OMOP Device Exposure. This table is only included if your application is approved for data-field #20142. |
| OMOP Dose Era. This table is only included if your application is approved for data-field #20142. |
| OMOP Drug Era. This table is only included if your application is approved for data-field #20142. |
| OMOP Drug Exposure. This table is only included if your application is approved for data-field #20142. |
| OMOP Measurement. This table is only included if your application is approved for data-field #20142. |
| OMOP Note. This table is only included if your application is approved for data-field #20142. |
| OMOP Observation. This table is only included if your application is approved for data-field #20142. |
| OMOP Observation Period. This table is only included if your application is approved for data-field #20142. |
| OMOP Person. This table is only included if your application is approved for data-field #20142. |
| OMOP Procedure Occurrence. This table is only included if your application is approved for data-field #20142. |
| OMOP Specimen. This table is only included if your application is approved for data-field #20142. |
| OMOP Visit Detail. This table is only included if your application is approved for data-field #20142. |
| OMOP Visit Occurrence. This table is only included if your application is approved for data-field #20142. |
Learn how to prepare pVCF files and return them to UK Biobank.
Researchers using UK Biobank data are obliged to return their research results to UK Biobank, in keeping with the terms detailed here. To return your research results, share the project containing the results with UK Biobank staff as per these instructions.
When users access a pVCF file on the Research Analysis Platform, each sample has an identifier that is unique to the version of the file dispensed to projects linked to a particular UK Biobank access application. Giving samples unique, access-application specific identifiers helps ensure the anonymity of UK Biobank participants. See this presentation for more on this technique.
Before returning pVCF files to UK Biobank, researchers must format file headers in such a way as to support the use of this type of identifier. The portion of each file's header referring to samples must be formatted as zero-level-compression bgzip blocks.
To do this, run each pVCF file through the Bash script included below. The script takes a VCF file as input, then uses bcftools
, bgzip
, and tabix
to modify the relevant part of the header. The script then outputs a VCF and a TBI index file. It also prints the byte coordinates of the zero-level-compressed blocks to stdout
.
The output VCF file, when uncompressed, will be identical to the input file. So zcat input.vcf.gz | md5sum
will return identical results to zcat input.repackaged.vcf.gz | md5sum
.
When returning pVCF files to UK Biobank, include all of the following:
The repackaged VCF files, processed as per the instructions above
The accompanying TBI files
The byte-level coordinates of the zero-level-compressed blocks, as printed to stdout
, when processing the original VCF files. These coordinates are required for validation.
This FAQ addresses what's included with Standard Support on UKB-RAP, what's included with purchases of a UKB-RAP Service Package and how DNAnexus handles more complex queries.
Email support (ukbiobank-support@dnanexus.com) is available for users to submit questions about billing and administration issues, to report platform performance issues, or to report bugs in the DNAnexus-provided tools. You can also learn more in our blog post announcing the new model.
UK Biobank’s Community is also there to answer any questions you might have about accessing data, using it, navigating and working on the UK Biobank Research Analysis Platform (UKB-RAP).
This free site contains articles, guides, webinars, and videos to help you – along with a Forum where you and other researchers can ask and answer questions. The search function gathers answers from all these sources to help you.
Additionally, all DNAnexus tutorials and webinars on YouTube, as well as the documentation, will remain available for users to access.
Users can now purchase service packages that provide 1:1 expert guidance from DNAnexus, answering questions ranging from troubleshooting custom applets to scientific guidance on using the UKB-RAP. Note that complex questions may require more than one ticket and that tickets expire 12 months from purchase. The DNAnexus team will assess the complexity of your question(s) and advise how many tickets will be required to help.
It should also be noted that some requests, for example where you wish DNAnexus to run an analytical pipeline on your behalf, would require DNAnexus to become a Third-Party Processor[1] on your research application. DNAnexus would be open to supporting you in this capacity and, if this is of interest, please get in touch with us directly at ukbiobankrap@dnanexus.com. DNAnexus will work with you to ensure compliance with UK Biobank policies.
With the new service package, the DNAnexus team will walk you through solving your more complex bioinformatics queries in the UKB-RAP, equipping you with the right understanding to succeed.
Our DNAnexus team will work directly with users, to understand their objectives and develop a plan to achieve early wins ensuring their success using UKB-RAP.
Learn first-hand from our team’s deep experience working with UKB data. Learn best practices for working with large multi-omics sets and complex data structures.
Receive answers to your specific questions on research topics such as genetic association, clinical data, imaging analysis, multimodal data analysis, integrated analysis or machine learning.
Please fill in the form here. You will then be contacted by our DNAnexus team to understand your needs and help direct you to the most appropriate package. Once the form is completed and submitted, our DNAnexus support team will be able to answer your questions.
You can find the pricing documentation here.
Service packages come in different bundles of service tickets: 5, 20, 50 and 100. You can choose a smaller bundle if you’re unsure of how many service tickets you will actually need or purchase a larger one and save the service tickets for future inquiries. Service Packages are valid for 12 months after purchase.
If your support inquiry requires DNAnexus to run a pipeline or test a pipeline within your workspace, then you will be required to name us as a 3rd Party Data Processor in your annual report
Examples of questions that require DNAnexus to be a 3rd Party Processor:
Set-up, optimization, or execution of pipelines where access to the data is required
Running of scripts to convert standards or provide mapping
Set-up and hands-on training of tooling/pipelines within the customer’s project
Yes. DNAnexus can advise on the best way to approach analysis in the UKB-RAP, help build pipelines or help users optimize their current workflow processes. Note that if you are interested in having DNAnexus experts directly access and/or process UK Biobank data, please reach out to ukbiobankrap@dnanexus.com.
DNAnexus will ensure that your inquiry is resolved satisfactorily and in a timely fashion.
[1] For more information on guidelines for Third Party Processors, please reference the UK Biobank MTA.
The UK Biobank Research Analysis Platform (UKB RAP) hosts a wide array of biomedical data sampled from hundreds of thousands of individuals across many years, and contains varied types of data ranging from MRI imaging to accelerometer measures. The platform provides the opportunity for researchers to conduct analyses on an increasingly large scale in varied ways (e.g QC-ing the sequencing data, performing whole genome variant calling, or genotyping a particular gene). However, processing data at this magnitude presents RAP researchers with multiple challenges, including how to:
encapsulate the analysis algorithm so it runs efficiently on the platform
break up the processing of the large data sets into parallel jobs
submit and monitor multiple job executions
identify and resubmit failed jobs.
In this guide, we will go over an example of how to perform HLA typing on 200K exome samples on the UKB RAP platform in a cost-efficient way. We will then provide guidelines for extending the techniques used in the example to other types of analyses that users may use.
This guide assumes the user has:
familiarity with the DNAnexus command line interface and UI features
the ability to write simple Bash and Python scripts
a high-level understanding of concepts behind DNAnexus applets and the Workflow Description Language (WDL).
In our example, we'll perform HLA typing on 200K exome samples on the UKB RAP platform. The HLA (human leukocyte antigen) complex is one of the most diverse gene complexes found in humans and plays a central role in human immunity. Mutations in this complex may be linked to autoimmune disorders. Researchers are often interested in identifying mutations in this complex as they can be used to learn more about treatment for various autoimmune conditions like type I diabetes or rheumatoid arthritis.
For this tutorial, the location of the files we need are as follows:
The inputs to our HLA typing analysis are
1: The 200K read-mapped samples in a UKB RAP-dispensed project with access to UKB Data-Field 23153, which is found in the folder containing the exome OQFE CRAM files.
2: The reference genome that can be fetched from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
using the url_fetcher app and stored in genome_reference
folder in your project.
Instance types
HLA typing runs independently on each sample, and each sample corresponds to one individual. Doing HLA typing on 1 exome sample takes 11 minutes on a mem1_ssd1_v2_x2 instance.
Outputs
We will store the output (HLA type in a file with .genotype.json
extension and HLA expression level in a file with .gene.json
extension) of the analysis in the "/HLA_process" folder of the RAP-dispensed project.
Since the UK Biobank contains so many samples, the naive approach of running one job for each of the 200K samples is inefficient, because of the overhead of submitting, scheduling, and managing 200,000 jobs. We therefore suggest reducing the total number of jobs by processing a batch of 100 samples in each job.
We recommend structuring the computation in such a way that the runtime of each job is less than a day, to decrease the chances of job failure due to spot termination. In the example below, this is achieved by using mem1_ssd1_v2_x2 instances to process a batch of 100 samples in about 19 hours.
Here is a brief overview of the steps to our analysis. Later in this tutorial, we will describe each step in greater detail:
Prepare the applet:
Package the HLA analysis tools into a Docker image and upload the image to RAP.
Create an applet using WDL (see documentation here) that performs HLA typing using the Docker image from the previous step. The applet takes an array of exome sample files as input.
Compile the WDL task using dxCompiler to a DNAnexus applet.
Generate job submission script:
Efficiently fetch the 200K input file names from RAP.
Create job submissions where each job processes 100 samples.
Submit and monitor jobs:
Run one job to make sure there are no errors in the code.
Run the rest of the jobs.
Monitor job execution.
Resubmit any jobs that failed due to external factors such as spot instance termination.
First, launch the cloud_workstation app to leverage cloud-scale network bandwidth when uploading large Docker images to DNAnexus, since data moves from AWS EC2 instances to AWS S3 buckets in the same region. Make sure that you have the Docker command line tool pre-installed. The command is:
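One way to launch the workstation is sketched below; adjust the instance type and session options to your needs.

```bash
# Launch a cloud workstation and connect to it over SSH.
dx run app-cloud_workstation --ssh -y
```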
At the cloud_workstation prompt, create a docker folder containing a Dockerfile with the content available for download in this GitHub repository, which describes the installation of the samtools, bedtools, kallisto, and arcasHLA tools in the Docker image.
Build the Docker image:
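For example (the image name arcas-hla is a placeholder; the docker folder is the one created above):

```bash
docker build -t arcas-hla:latest docker/
```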
Save Docker image as a tarball:
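For example, continuing with the placeholder image name:

```bash
docker save arcas-hla:latest | gzip > arcas-hla.tar.gz
```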
Upload Docker image to the parent project:
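For example. Inside a job such as cloud_workstation, the parent project can be addressed via the DX_PROJECT_CONTEXT_ID environment variable; the /docker destination folder is a placeholder.

```bash
dx upload arcas-hla.tar.gz --path "$DX_PROJECT_CONTEXT_ID:/docker/arcas-hla.tar.gz"
```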
We recommend encapsulating analysis tools in a Docker image to preserve reproducibility. We also recommend storing Docker images on the platform instead of external Docker registries such as Docker Hub and Quay.io, for better reliability.
On your local computer, create an applet using Workflow Description Language (WDL) that executes using the Docker image from the previous step. The applet takes an array of sample files as input. Below is the code for our applet:
Line 18: Remove intermediate outputs after each sample is processed in a batch for greater storage efficiency.
Line 27: Public docker registries have a daily pull limit. We save the Docker image on RAP for better reliability.
Line 28: Use the appropriate timeout policy based on the expected runtime of your job to ensure job costs remain under control on the off chance that your job hangs. While rare, running at scale on AWS virtual machines increases the chances that at least one job will need to time out and be restarted.
Line 35: Streaming allows the system to avoid downloading the entire batch of inputs, and instead streams each input when it is read in the samtools view
line.
Install dxCompiler on your local machine following the installation instructions.
Compile WDL code into a DNAnexus applet using dxCompiler:
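For example (the WDL file name, project ID, and destination folder are placeholders; point to the versioned dxCompiler jar you downloaded):

```bash
java -jar dxCompiler.jar compile hla_typing.wdl \
  -project project-XXXX \
  -folder /applets/
```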
Note that by default, compiled DNAnexus applets are configured to auto-restart on transient failures, as documented in the restartOn field of the executionPolicy argument in the DNAnexus documentation. You can adjust the restart policy by providing an extras.json input to dxCompiler, as shown in the dxCompiler documentation.
To sort your input file list by name, use the standard bash sort command:
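For example (file names are placeholders; the list of CRAM file paths is the one gathered for the 200K samples):

```bash
sort cram_paths.txt > cram_paths.sorted.txt
```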
This command sorts files by the full path of the file.
If you need other information that is not present with default dx find data
, you can use --json
with dx find data
and use jq
to extract the fields you need.
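An illustrative sketch is shown below; the folder path and extracted fields are examples, and the exact JSON structure of the dx find data --json output may vary slightly between dx-toolkit versions.

```bash
# Emit file ID and file name as tab-separated values.
dx find data --name "*.cram" --folder "/Bulk/Exome sequences/" --json \
  | jq -r '.[] | [.id, .describe.name] | @tsv'
```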
We create job submission commands using the following script (the latest version can be found here):
Line 30: Store output for each batch (from 100 samples in this case) in a dedicated folder. This will avoid the problem of creating too many files in a specific directory and make tracking errors easier when some jobs produce unexpected numbers of output files.
Line 38-40: We tag each job with 3 tags:
200K_exome_HLA_analysis
represents the name of study and will help us distinguish jobs from this analysis from other work you may be doing in the same project.
original
indicates that this is the first (original) attempt at running a job. Subsequent reruns of failed jobs will be tagged with rerun{rerun_attempt}.
batch_n_{batch_number}
records a particular batch of 100 jobs.
These tags illustrate the use of execution metadata to help track the progress of your analysis, identify which studies had all their jobs complete successfully, and restart any failed jobs. Metadata consisting of tags and properties can be associated with DNAnexus objects such as files and executions and is documented here.
The command below shows the dx run invocation for the first job, then submits it:
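A sketch, assuming the generated commands were written to a file named submission_commands.sh:
head -n 1 submission_commands.sh            # inspect the first dx run command
head -n 1 submission_commands.sh | bash     # submit it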
Monitor the rest of the jobs with dx watch.
We recommend submitting jobs gradually, rather than all at once. Submit the first job and see if it produces the expected output in the right location. After that, submit another 500 jobs and see if the variation of running time and cost among these jobs is within the expected range before submitting the rest of your jobs.
The code below creates a new list of submission commands by removing the previously launched first submission, then splits the remaining 1999 submissions into batches of 500:
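A sketch of that step with standard shell tools, under the same submission_commands.sh assumption:
tail -n +2 submission_commands.sh > remaining_commands.sh   # drop the already-submitted first command
split -l 500 remaining_commands.sh batch_                   # write batches of 500 commands: batch_aa, batch_ab, ...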
We can monitor the execution of the 200K_exome_HLA_analysis analysis using the dx command-line tool to search for jobs tagged with 200K_exome_HLA_analysis and display only the last n jobs that we've submitted:
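For example, to list the 10 most recently created jobs from this analysis (a sketch):
dx find jobs --tag 200K_exome_HLA_analysis -n 10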
Similarly, you can view the jobs corresponding to your analysis in the web browser UI by filtering on the 200K_exome_HLA_analysis tag value from the Monitor page in your project.
If you decide not to use a retry policy, some jobs may occasionally fail due to sample-specific issues or external factors such as spot instance termination or other intermittent system errors. You can find failed jobs by setting the job state filter to failed in the Monitor tab of the web browser UI, or by using the dx command line tool as shown below:
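For example:
dx find jobs --tag 200K_exome_HLA_analysis --state failed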
After fixing the issues associated with a particular failed job, resubmit the job using a distinguishable tag so you can track which batch has already been analyzed. For example, if the original jobs/analyses have the tags 200K_exome_HLA_analysis, batch_n_0, original, you may resubmit the job using --name 200K_exome_HLA_analysis --tag batch_n_0 --tag rerun1.
To retry a job that failed due to an intermittent system error such as spot instance termination or a network connectivity problem, you can use:
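One option (a sketch; job-xxxx stands for the failed job's ID) is to clone the failed job's executable, inputs and settings with dx run --clone:
dx run --clone job-xxxx --tag 200K_exome_HLA_analysis --tag batch_n_0 --tag rerun1 -y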
If you had to fix your analysis code and want to rerun a failed job with a new applet, you can use the following:
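A sketch, assuming the fixed applet is named hla_typing_applet_v2; passing an executable together with --clone reuses the original job's inputs and settings but runs the new executable:
dx run hla_typing_applet_v2 --clone job-xxxx --tag 200K_exome_HLA_analysis --tag batch_n_0 --tag rerun1 -y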
To guard against an internet disconnection when submitting a large batch of jobs, you can upload the submission file to the project and use Swiss Army Knife to submit. In that case, make sure you use --detach for each job; otherwise each (sub)job inherits its priority from the main Swiss Army Knife job, so all of those batch jobs might run on-demand. Another option is to use --head-job-on-demand to request that the head job of an app or applet runs on an on-demand instance, which is an especially good option for workflows. Note that the --head-job-on-demand option will override the --priority setting for the head job.
Before developing your analysis, define what each “unit” of independent work consists of so you can break your overall analysis down into multiple smaller sections for parallel processing. For example, for the HLA typing example or for variant calling, each individual sample can be considered as one independent unit. For joining variant calling, your "unit" might consist of a small genomic region for all samples.
Plan for your batch run using an end-to-end approach that covers naming of the analysis, preparing and submitting jobs, and organizing output files.
It is good practice to use a human readable name like <sample ID>.<type of file or processing>.<file format extension> for ease of reviewing or troubleshooting your work. For example, in HLA typing, we name the output 12345_6789_0.genotype.json, which represents <sample ID>.<type of file>.<file format extension>.
Keep the number of files per folder under 10,000 to make viewing and querying more efficient.
To analyze hundreds of thousands of units, analyze multiple units in a single job to reduce the total number of jobs to submit and manage. To limit the impact of spot instance termination, we recommend limiting the runtime of each job to about a day by selecting an appropriate instance type. Executing jobs on larger instances with more CPUs can be used to decrease job execution time.
A large number of jobs is hard to manage or modify. If you have more than 5,000 inputs to analyze, consider combining multiple inputs per job or scaling up your job submission gradually. If you have solid control over the input data and the gradual submission process, you do not need to group inputs.
Encapsulate your analysis tools in a Docker image for better reproducibility. A Docker image includes an operating system version and specific versions of your tools and their dependencies. Specify an explicit (rather than latest or default) version of external tools such as samtools or bamtools in the Dockerfile so the Docker image can be recreated reproducibly from it. Store Docker images on the platform for use in DNAnexus apps instead of in external Docker registries such as Docker Hub and Quay.io, for better reliability and to avoid the pull limits imposed by public Docker registries.
Optimize applet execution
Use available CPUs: In the HLA example, each execution of the applet processed 100 samples. It is important that each applet execution uses the available CPUs efficiently, as the applet will be executed 2,000 times. During the applet's execution, the 100 input samples were analyzed serially, as shown in lines 13-19 of the WDL_APPLET code snippet above (i.e. the second sample was analyzed after the analysis of the first sample finished). Since samtools and arcasHLA are multi-threaded, the sequential processing of samples still resulted in high CPU utilization. If your analysis tools are not multi-threaded, you may consider processing multiple units in parallel (e.g. using xargs, as sketched below) for better CPU utilization.
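A sketch of parallelizing a single-threaded per-sample command with xargs; process_one_sample.sh and sample_list.txt are hypothetical names:
# run up to one process per available CPU core
cat sample_list.txt | xargs -P "$(nproc)" -I {} bash process_one_sample.sh {}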
Manage disk space: If your app processes units sequentially and can stream the input, you can avoid downloading all the inputs at the start of the applet's execution by using the "stream" WDL input option, as shown on line 35 of the WDL_APPLET code snippet above. We also recommend removing unnecessary intermediate files to save disk space, as shown in line 18 of WDL_APPLET.
The outputs produced by the applet in the HLA example above required little disk space, so preserving all 100 outputs until the end of the job did not exhaust the disk space on the instance running the job. If your analysis produces large outputs, you can either select an instance with lots of disk space (such as mem3_ssd3 family), or implement your processing step using a native DNAnexus applet instead of WDL that allows for more control over the upload of output files from the worker to the DNAnexus platform.
Select an instance type that balances the CPU, memory, and disk requirements of your analysis. For example, if your analysis requires 2 GB of memory per core you can use the mem1 instance family, while a requirement of 7 GB of memory per core calls for the mem3 instance family.
For more general considerations for large-scale data analysis, please refer to the peer-reviewed publication “Ten Simple Rules for Large-scale Data Processing” published in PLOS Computational Biology.
If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
The code provided in the following tutorial is delivered "As-Is." Notwithstanding anything to the contrary, DNAnexus will have no warranty, support or other obligations with respect to Materials provided hereunder. The MIT License applies to this tutorial.
Note that all runtime estimates and configurations given in this tutorial are not a guarantee; they are provided only as a reference.
This research project provides an example of performing end-to-end genomic target discovery on UK Biobank data using ischaemic heart disease as an example phenotype. For this tutorial, ischaemic heart disease was chosen because of the large cohort size and because of the existence of previous GWAS studies with this phenotype, allowing for simpler comparison of results. The first step in this analysis was to create the case and control cohorts. After that, sample data (phenotypic data) from cohorts and genomic data (array and imputed) were cleaned. Then a genome-wide association studies (GWAS) analysis was performed and significant GWAS variants were aggregated using the linkage disequilibrium (LD) clumping approach. Lastly, a phenome-wide association study (PheWAS) was performed for each variant.
The original analysis and guide were created by Anastazie Sedlakova, and there is a corresponding YouTube tutorial here.
This tutorial demonstrates how to:
Create control and cases cohorts
Perform sample QC in JupyterLab
Lift over array data by compiling and running a WDL workflow
Perform quality control of array data using Swiss Army Knife
Perform quality control of imputed data by compiling and running WDL workflow
Run GWAS analysis using REGENIE app
Perform LD clumping on imputed data in JupyterLab
Extract variant allele counts from imputed data in JupyterLab
Run PheWAS analysis in JupyterLab
Create a phenome dataset from an ICD10 field
This analysis was done as part of the UKB 46926 research application.
For this GWAS, two types of genomic data were used: array data (field 22418) and imputed data (Genomics England) (field 21008). For linkage disequilibrium (LD) clumping, only imputed data were used.
First, ischaemic heart disease was chosen as a phenotype of interest (ICD 10 code I20-I25). The following fields were also retrieved for the sample QC:
31 - Sex
22001 - Genetic sex
22006 - Genetic ethnic grouping
22019 - Sex chromosome aneuploidy
Covariates were added both to GWAS and PheWAS analysis to increase power and reduce confounding. Here is the list of covariates used:
31 - Sex
2966 - Age high blood pressure diagnosed (not used in PheWAS)
21022 - Age at recruitment
23104 - Body mass index (BMI)
20160 - Ever smoked
30760 - HDL cholesterol (not used in PheWAS)
30780 - LDL direct (not used in PheWAS)
22009 - Genetic principal components
The phenotype (ischaemic heart disease) being used in this analysis and other phenotypes for PheWAS testing were taken from the 41270 - Diagnoses - ICD10 field.
Once the cohorts are created they can be used for subsequent analysis (GWAS). In the present example, the phenotype definition is simple, but often the phenotype definition is quite complex and can include multiple fields. Defined cohorts will be combined in the QC step.
First, the Cohort Browser is used to create the selected cohort. The "ischemic_cases" cohort was created by selecting participants that have I20-I25 Ischaemic heart disease in the field 41270. The "ischemic_controls" cohort is created by using “Cohorts compare: not” in "ischemic_cases". This results in a sample of 57,383 "ischemic_cases" and 445,027 "ischemic_controls".
On RAP, click on the dataset you wish to use from the Manage tab. Clicking on the dataset name, as shown in the image below, will take you to the Cohort Browser page.
Then, click the Add Filter button. Type in “ischaemic heart disease” into the search bar and select “Diagnoses - ICD10”, which has data-field 41270. Confirm your selection by clicking on Add Cohort Filter.
In the modal prompt for “Includes any of”, type “I20-I25 Ischaemic heart disease” and select that option. Then click the Apply Filter button.
Name the cohort. In this example, the name is “ischemic_cases” (Note: The example cohort name uses the American spelling, however mentions of the specific UKB data-field will use the UK spelling).
To create the control group, click on the “+” (mouse-over: “Compare/combine cohort”) button.
From the drop down menu select the “not in ischemic_cases” option and apply this filter.
Name and save this cohort. In this example, the "ischemic_cases" cohort had 57,383 samples and the "ischemic_controls" cohort had 445,027 samples.
For additional information about creating cohorts, see this detailed tutorial on how to explore data in Cohort Browser.
Cleaning samples will decrease noise in the data and increase the accuracy of the GWAS analysis results. For example, checking for sex discordance and sex chromosome aneuploidy removes possible sample swaps and genotyping errors. Population substructure is minimized by selecting just one population (White British). To check phenotype-genotype associations, only samples from non-related participants (those used to calculate the PCA) were selected.
The code for the sample QC can be found in gwas-phenotype-samples-qc.ipynb. Phenotypic data was retrieved for each cohort using dx extract_dataset as a table. Then, samples were selected using the following criteria:
Sex and genetic sex are the same
Participant has White British ancestry
Not sex chromosome aneuploidy
Participant was used to calculate PCA (only non-relatives were included)
After filtering, 38,197 and 298,886 samples remained in "ischemic_cases" and "ischemic_controls", respectively.
After QC, covariate variables were then imputed as part of the GWAS process. The covariates and a summary of what the above notebook code does to clean the data are listed below:
Age high blood pressure diagnosed in participant (2966) field
The code combines all instances and creates the hypertension boolean variable
Participant ever smoked
The code sets all missing values to 0
Participant body mass index (BMI), HDL cholesterol, direct LDL
The code replaces missing values with the mean for that variable
Phenotypic data was then merged with the list of samples for which imputed data is available. The resulting table contained 38,197 cases and 298,886 controls.
The next step is to use JupyterLab to run the GWAS step. The JupyterLab configuration used when running this example GWAS is listed below, however, individual user preferences may cause costs and runtime to vary.
Cluster configuration: Single node
Recommended instance: mem1_ssd1_v2_x36
Run time: Approximately 5 minutes
On your local machine, clone the repository for shared UKB Jupyter notebooks here.
Upload the Jupyter notebook to your project directory on RAP: dx upload gwas-phenotype-samples-qc.ipynb --destination <directory path>
On RAP, launch JupyterLab. The recommended configuration is listed below:
Once JupyterLab is open, click on the DNAnexus tab on the left hand side, and open the gwas-phenotype-samples-qc.ipynb
notebook.
Once the notebook is open, go to the JupyterLab menu located at the top of the page, click the Run tab and click Run All Cells.
Array data that is provided by UKB is mapped to the older version of the reference genome (GRCh37). However, WES and WGS data that is released is mapped onto the current version of the reference genome, GRCh38. In order to perform association testing with the sequencing data, we need to ensure the data is mapped to the current version of the reference genome first using the liftOver script.
For this step, the liftOver WDL script created by Yih-Chii Hwang was used. As a result of this script, data was lifted to the newer reference genome and all chromosomes were merged. LiftOver was performed using Picard LiftoverVcf. This step took approximately 34 hours.
On your local machine’s terminal, install Java:
(Mac OS): brew install openjdk
(Linux OS): apt install default-jre
Download the latest JAR compiler here.
Login to the Platform using the dx login command. Then, use the following command to compile the workflow and make it available for use on the Platform: java -jar dxCompiler-2.10.8.jar compile liftover_plink_beds.wdl -project project-xxxx -folder <directory path>
On RAP you will now find:
A workflow called liftover_plink_beds
Applets created that correspond to the steps of the workflow
In the UI, launch the liftover_plink_beds
workflow and use the following input parameters:
Common parameters to specify:
plink_beds: /Bulk/Genotype Results/Genotype calls/*.bed
(22 files)
plink_bims: /Bulk/Genotype Results/Genotype calls/*.bim
(22 files)
plink_fams: /Bulk/Genotype Results/Genotype calls/*.fam
(22 files)
reference_fastagz: /Bulk/Exome sequences/Exome OQFE CRAM files/helper_files/GRCh38_full_analysis_set_plus_decoy_hla.fa
ucsc_chain: b37ToHg38.over.chain
See a detailed tutorial on how to work with WDL on DNAnexus.
Cleaning the genotypic data, as in the case of sample QC, will reduce noise and increase the accuracy of the GWAS analysis results. In this step we also check for deviation from Hardy-Weinberg equilibrium in order to detect genotyping errors. To clean the data, variants and samples with a high missing rate are excluded. The GWAS will be performed on autosomes only, so variants on the sex chromosomes are excluded.
Code for array data QC is in the run_array_qc.sh script. For QC filtering, array data (field 22418) was used. Variants in array data were filtered with the Swiss Army Knife (SAK) app using following options:
plink2 --bfile ukb_c1-22_merged --keep ischemia_df.phe --autosome --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-15 --mind 0.1 --write-snplist --write-samples --no-id-header --out imputed_array_snps_qc_pass
Below are criteria used for the filtering:
Use only samples that are contained in the ischemia_df.phe file: --keep ischemia_df.phe
Keep only variants located on autosomes: --autosome
Minor allele frequency is greater than 0.01: --maf 0.01
Minor allele count is greater than 100: --mac 100
Missing call rate for variant is not exceeding 0.1: --geno 0.1
Missing call rate for sample is not exceeding 0.1: --mind 0.1
Hardy-Weinberg equilibrium exact test p-value for the variant is greater than 1e-15: --hwe 1e-15
This is due to the fact that serious genotyping errors often yield extreme p-values.
For more information on filtering, see PLINK2 documentation.
Priority: Normal
Recommended instance: mem1_ssd1_v2_x36
Run time: Approximately 10 minutes
For the CLI, use the command below:
On your local machine run: sh run_array_qc.sh
In the UI:
After logging into the Research Analysis Platform, navigate to the Tools Library.
Select the “Swiss Army Knife” app and click Run to run it in your desired project.
Copy and paste the following into the command prompt:
plink2 --bfile “/mnt/project/<file path>/ukb_c1-22_merged” --keep “/mnt/project/<file path>/ischemia_df.phe” --autosome --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-15 --mind 0.1 --write-snplist --write-samples --no-id-header --out imputed_array_snps_qc_pass
The code for the imputed data QC is in bgens_qc.wdl script created by Yih-Chii Hwang. For QC filtering, imputation from genotype (GEL) (field 21008) was used. Variants were filtered using following options: --mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1
Below are the criteria used for the filtering:
Use only samples that are contained in the ischemia_df.phe file: --keep ischemia_df.phe
Minor allele frequency is greater than 0.0001: --maf 0.0001
Minor allele count is greater than 10: --mac 10
Missing call rate for variant is not exceeding 0.1: --geno 0.1
Missing call rate for sample is not exceeding 0.1: --mind 0.1
Hardy-Weinberg equilibrium exact test p-value for the variant is greater than 1e-15: --hwe 1e-15
Deviations from Hardy-Weinberg can often indicate genotyping error.
For more details on filtering, see PLINK2 documentation. This script ran for approximately 10 hours when normal priority was used.
The installation of Java is needed to run dxCompiler. See above for Java and dxCompiler download instructions.
Log in to the Platform. Then, compile the workflow to make it available on DNAnexus: java -jar dxCompiler-2.10.8.jar compile bgens_qc.wdl -project project-xxxx -folder <directory path>
On RAP, you’ll now find
A workflow called bgens_qc
Applets created that correspond to the tasks/steps of the workflow.
In the UI, launch the bgens_qc
workflow and use the following input parameters:
Common:
geno_bgen_files: /Bulk/Imputation/Imputation from genotype (GEL)/*.bgen
(22 files)
geno_sample_files: /Bulk/Imputation/Imputation from genotype (GEL)/*.sample
(22 files)
Note that you will need to correct the sample file header.
output_prefix: gel_imputed_snps_data_qc_pass
Keep_file: ischemia_df.phe
plink2_options: --mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1
See a detailed tutorial on how to work with WDL.
On your local machine, clone the UKB JupyterLab notebooks repository.
Upload the Jupyter notebook to your project directory on RAP: dx upload generate_inputs.ipynb --destination <directory path>
From the Research Analysis Platform, launch JupyterLab. The recommended configuration is listed above.
Once JupyterLab is open, click on the DNAnexus tab in the left menu, and then click on the generate_inputs.ipynb notebook. Once the notebook is open, go to the JupyterLab menu at the top of the page, click the Run tab, and then click Run All Cells.
Running the generate_inputs.ipynb notebook will generate bgens_qc_input.json. You may need to generate new sample files; see the discussion in the community for more details.
Log into the DNAnexus Platform and navigate to the UKB_RAP repository (see Step 1) to compile the workflow: java -jar dxCompiler-2.10.8.jar compile bgens_qc.wdl -project project-xxxx -folder <directory path>
Compile the WDL workflow into DNAnexus native workflow and load it into the Platform.
java -jar <path_to_downloaded_dxCompiler>dxCompiler-2.XX.X.jar compile research_project/bgens_qc/bgens_qc.wdl -project project-XXX -inputs research_project/bgens_qc/bgens_qc_input.json -archive -folder '<path_to_folder_on_UKB_RAP>'
The folder takes a file path input, which is where you want to store the workflow files. This code assumes that you are in the UKB_RAP directory in the terminal on your local machine. It will generate bgens_qc_input.dx.json
in the working directory.
To run the workflow, type the following command in the terminal:
dx run <path_to_folder_on_UKB_RAP>/bgens_qc -f research_project/bgens_qc/bgens_qc_input.dx.json
A GWAS analysis checks for associations between variants and the selected phenotype. Firth correction is used because of the unbalanced dataset (i.e. there are fewer cases than controls). The additive test is selected as the recommended first approach when running GWAS using REGENIE.
The GWAS analysis was done using the REGENIE app. For the first step the quality-controlled array data was used, for the second step the quality-controlled imputed data was used. When running REGENIE app, the following options were used:
Step 2
SPA instead of Firth approximation? False
Test type: Additive
Run time: approximately 7 hours
After analysis, 2549 variants out of 36,695,747 tested variants had significant association with the phenotype.
Log in to the Research Analysis Platform and run the REGENIE app.
In the Analysis Settings tab, select “Execution Output Folder". In the "Analysis Outputs 7" field, select the following options:
Genotype BED for Step 1: path to array data after liftOver
Genotype BIM for Step 1: path to array data after liftOver
Genotype FAM for Step 1: path to array data after liftOver
Genotype BGEN files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/*.bgen
(22 files)
Genotype BGI index files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/*.bgi
(22 files)
Sample files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/*.sample
(22 files)
Note: You will need to correct the sample file header.
Phenotypes file: phenotype file generated in the Sample QC step
Variant IDs to extract (Step 1): snplist file generated in the Array data QC step (1 file)
Variant IDs to extract (Step 2): snplist file generated in the Impute data QC step (1 file)
In the ASSOCIATION TESTING (STEP 2):
Ensure that "SPA" instead of "Firth approximation?" Is set to False
In the COMMON section:
Quantitative traits?: False
Phenotypes to include: ischemia_cc
Array of covariates to apply: sex, age, bmi, ever_smoked, hdl_cholesterol, ldl_cholesterol, hypertension, pc1, pc2, pc3, pc4, pc5, pc6, pc7, pc8, pc9, pc10
In the App Settings tab, select the instance type. The recommended instance is mem1_ssd1_v2_x36
Then click Start Analysis to begin.
Figures showing the example results of this GWAS are shown below.
Table 1. Number of significant GWAS variants divided by chromosome
LD clumping is the next step in the process in order to reduce the number of significant GWAS variants by clumping genetically linked variants together. This will then report only the most significant variant from each clump.
Code for the LD clumping is in run_ld_clumping.ipynb. In this notebook, significant variants are extracted from the GWAS report. Then, for each chromosome, significant variants are extracted from the imputed BGEN files, converted to PLINK files and then LD clumping is performed using PLINK software.
Out of 2531 variants, 82 remained as index variants after clumping.
Cluster configuration: Single node
Recommended instance: mem2_ssd1_v2_x32
Run time: approximately 20 minutes
On your local machine, clone the UKB JupyterLab notebooks repository.
Upload the Jupyter notebook to your project directory on RAP: dx upload
run_ld_clumping.ipynb
—-destination <directory path>
From the Research Analysis Platform, launch JupyterLab. The recommended configuration is listed above.
Once your JupyterLab session is open, click on the DNAnexus tab on the left hand side and open the run_ld_clumping.ipynb
notebook.
From the JupyterLab menu at the top of the page, click the Run tab and click Run All Cells.
An example of a table of index variants after LD clumping is shown below.
Table 2. Number of index variants after LD clumping by chromosome
PheWAS studies examine causal associations with other phenotypes. Conducting a PheWAS can help to distinguish when comorbidities are caused by horizontal pleiotropy (one locus influencing multiple diseases) and causality or vertical pleiotropy (one disease is causing another disease).
Code for preparing phenotype data can be found in get-phewas-data.ipynb. Code for running the PheWAS analysis is in run-phewas.ipynb. The phenome-wide association study was performed using the PheWAS R package. For this analysis, genotypic data for each of the variants selected by the LD clumping method, covariates, and ICD10 phenotypes were used. To create the ICD10 phenotypes, the UKB Diagnoses - ICD10 (41270) field was used. A table was then created where each row contains one phenotype (ICD10 diagnosis) for one participant. Therefore, when a participant has multiple diagnoses, the participant's eid will appear multiple times.
The PheWAS was run in parallel for each variant with the following options:
Use p-value, Bonferroni and FDR to calculate significance thresholds: significance.threshold = c("p-value", "bonferroni", "fdr")
Additive genotypes are not being supplied: additive.genotypes = FALSE
Cluster configuration: Single node
Recommended instance: mem2_ssd1_v2_x32
Run time for data preparation: Approximately 20 min
Run time for PheWAS: Approximately 6 hours
On your local machine, clone the UKB JupyterLab notebooks repository.
Upload the Jupyter notebook to your project directory on RAP:
dx upload
get-phewas-data.ipynb
--destination <directory path>
dx upload
run-phewas.ipynb
--destination <directory path>
On the Research Analysis Platform, launch JupyterLab. The recommended configuration is listed above.
Once the JupyterLab session is open, click on the DNAnexus tab on the left hand side and open the get-phewas-data.ipynb and run-phewas.ipynb notebooks.
From the JupyterLab menu at the top of the screen, click the Run tab and click Run All Cells.
The results from the PheWAS are summarized in the below figures.
Table 3. Number of significant associations for each phenotype
26 out of 82 variants had significant association with phenotype. The phenotypes with the most significant PheWAS associations were essential hypertension or hypertension.
This tutorial demonstrated how UKB-RAP can be used for an end-to-end genomic target discovery pipeline. The analysis started with ischemia cohort creation using the Cohort Browser. QC was done using JupyterLab (samples), PLINK in Swiss Army Knife (array data), or inside a WDL script (imputed data). GWAS was performed using REGENIE, where array and imputed data were used for steps 1 and 2, respectively. Linkage disequilibrium clumping then extracted the most informative variant among the significant GWAS variants. In the final step, a PheWAS analysis was performed for each variant, and the phenotypes with the most significant PheWAS associations were found to be essential hypertension and hypertension.
Single unfiltered multi-sample VCF files were provided for all UK Biobank whole exome sequencing (WES) releases (200k, 300k, 450k, and the final exome release). To help researchers generate a quality-controlled data set for genotype-phenotype association analyses, a “90pct10dp” QC filter was applied to all UK Biobank aggregate data sets, based on analyses of the UK Biobank 200k data release.
The processing methodology described on this page applies to all OQFE datasets available on the Research Analysis Platform, including the 300k, 450k, and the final release OQFE whole exome sequencing datasets.
Please note that only users with approved access to the 300k WES data will be able to view both the sequencing data and this auxiliary file from the user’s approved project. Users must request access to the 300k WES data and be approved by UK Biobank for access before they can view sequencing data and auxiliary files.
The breadth and depth of UKB phenotypes provide researchers a broad landscape of possibilities for association analyses. These range from single-variant tests to gene burden testing, across individual and aggregated phenotypes. While no singular set of filtered genotypes can be optimized for all possible analyses, there are features fundamental to the UKB WES data that can lead to spurious association results if not accounted for.
Alternatively, use the provided helper file named ukb23145_300k_OQFE.90pct10dp_qc_variants.txt, which is a single-column text file containing the variants failing the “90pct10dp” depth filter in the CHR:POS:REF:ALT format. The following command using PLINK 1.9 can be used to remove the filtered variants from the UKB 300k WES PLINK files:
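A sketch of such a command; the PLINK file prefix and output name are placeholders, and it assumes the variant IDs in the PLINK .bim file match the CHR:POS:REF:ALT IDs in the helper file:
plink --bfile ukb23145_cY_b0_v1 --exclude ukb23145_300k_OQFE.90pct10dp_qc_variants.txt --make-bed --out ukb23145_cY_b0_v1_90pct10dp_filtered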
The exact file path may change depending on where the file is located or mounted from the RAP project.
The above figure shows pre- and post-filtering UKB WES 200k association results with the asthma phenotype (Phe10_J45). Subfigures A and B (top and bottom, respectively) show results on the unfiltered UKB WES 200k genotypes and on the 90% DP>10 variant-filtered genotypes. The tests were logistic regressions performed with standard covariates (10 PCs, age, sex, age^2, age_x_sex).
This document describes the UKB whole exome sequencing (WES) protocol of the single- and aggregate-sample processing, including read alignment, variant calling, joint genotyping and post aggregation reformatting, employed by the Regeneron Genetics Center to generate the UK Biobank WES data set (category 170) for public release. Following this protocol, researchers can aggregate their own sequencing data with the UK Biobank single sample data, enabling mega-analysis.
This protocol applies to all UK Biobank whole exome sequencing data sets, including 200k, 300k, 450k, and the final release OQFE data sets.
The protocol is as follows:
The OQFE Docker file takes either FASTQ or CRAM files as inputs and outputs an OQFE CRAM, ensuring that all steps are executed exactly as specified in the OQFE protocol. We strongly recommend that users seeking to harmonize their data with UKB WES data execute the OQFE protocol via these implementations.
To use the OQFE app, run the following command from the CLI:
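A sketch only; confirm the exact app name and its input fields on the Platform before running:
dx find apps --name "*oqfe*"   # locate the OQFE app
dx run app-oqfe -h             # list its inputs, then run it with the appropriate -i<input>= values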
Prerequisites:
DeepVariant v0.10.0 docker file
OQFE CRAM generated by following section 1 of this protocol (e.g. aligned.cram)
Calling regions in BED format (which can be found in the folder containing exome sequences in your project)
A custom model file, if applicable
The details of the applet used to run the dataset are as follows:
Input Files:
Reference FASTA
Input BAM/CRAM
Input BAM/CRAM index
Interval file
Parabricks license file
Parabricks engine file for custom model
Input Options:
--disable-use-window-selector-mode : True
--include-med-dp : True
--gvcf : True
--gzip-output : True
Output Files:
Output gVCF file in bgzipped format
Index of output gVCF file
One sample is expected to run in around 5 minutes including the data upload and download time. The DNAnexus command line input for the run is:
This section of the UKB WES protocol describes the aggregation of variant genotypes and variant allele harmonization across sample-level gVCFs into multi-sample project-level VCF file (pVCF) organized in chromosomes or genomic segments using GLnexus, containing a row for every set of overlapping variants and each sample’s genotype for every variant allele. This pVCF is “squared-off”, in that for samples that do not contain an alternative allele genotype for a given variant, genotypes are derived from the gVCF reference blocks, reporting the read depth and most likely genotype (i.e. 0/0 or missing) for that sample at that variant position.
Prerequisites:
BEDtools
Custom target regions in BED format (e.g. targets.bed).
FASTA sequences of the reference (e.g. references.fa) and the reference index file (e.g. references.fa.fai). The corresponding FASTA and index files used for the UKB WES data can be found in the dispensed UKB project within the folder containing exome sequences.
Steps:
Generate 100 bp buffer regions on each side of the target regions.
bedtools flank -i targets.bed -g references.fa.fai -b 100 > buffers.bed
Combine the target and buffer regions and sort based on chromosome and start coordinates.
cat targets.bed buffers.bed | sort -k1,1 -k2,2n > target_buffers.bed
Merge overlapping regions.
bedtools merge -i target_buffers.bed > calling_regions.bed
Prerequisites:
BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.
Combine the two calling regions and sort based on chromosome and start coordinates.
cat calling_regions_1.bed calling_regions_2.bed | sort -k1,1 -k2,2n > combined.bed
Merge overlapping regions.
bedtools merge -i combined.bed > combined_calling_regions.bed
Prerequisites:
BEDtools
Two different calling regions in BED format (e.g. calling_regions_1.bed, calling_regions_2.bed) Note: both BED files need to have coordinates of the same genome build and the same chromosome naming convention as the reference sequences.
Steps:
Get the intersect of the two calling regions with BEDtools.
bedtools intersect -a calling_regions_1.bed -b calling_regions_2.bed > intersect.bed
Prerequisites:
Access to a DNAnexus Research Analysis Platform folder containing the single sample genomic VCF (gVCF) for all samples to be included in pVCF.
A BED format file containing the genomic regions to be aggregated and reported in the pVCF.
Steps:
Generate a manifest file containing the DNAnexus file IDs of the gVCF files by matching the naming convention of the files to be aggregated (e.g. *.gvcf.gz).
dx find data --project=<rap_project_name> --folder=<rap_folder_name> --name="*.g.vcf.gz" --brief | cut -d\: -f2 > <manifest_file>
Upload the <manifest_file> to the DNAnexus Research Analysis platform.
dx upload --path <dx_project>:/<dx_folder>/ <manifest_file> --brief
dx run glnexus -y --brief -ivariants_gvcfgz_manifest_csv=<manifest_file_location> -igenotype_range_bed=<calling_regions.bed> -ioutput_prefix=<base_name> --folder=<output_folder_in_RAP>
Prerequisites:
PLINK 1.9
PLINK 2.0 (for BGEN conversion)
Reference sequences in FASTA format (e.g. references.fa)
Steps:
Split multiallelic variants in pVCF file and variant normalization.
bcftools norm -f references.fa -m -any -Oz -o pvcf.norm.vcf.gz pvcf.vcf.gz
Convert normalized biallelic pVCF to PLINK files.
plink --vcf pvcf.norm.vcf.gz --keep-allele-order --vcf-idspace-to _ --double-id --allow-extra-chr 0 --make-bed --vcf-half-call m --out pvcf.norm
Convert PLINK files to BGEN files and prepare BGEN files.
First convert PLINK to a zlib-compressed BGEN file:
plink2 --bfile pvcf.norm --export bgen-1.2 bits=8 id-paste=iid ref-first --out pvcf.norm_zlib
Convert the zlib-compressed BGEN to a zstd-compressed BGEN file and omit the sample identifier block in the .bgen file using QCTOOL.
qctool -g pvcf.norm_zlib.bgen -s pvcf.norm_zlib.sample -og pvcf.norm.bgen -os pvcf.norm.sample -ofiletype bgen -bgen-bits 8 -bgen-compression zstd -bgen-omit-sample-identifier-block
Generate BGEN index.
bgenix -g pvcf.norm.bgen -index -clobber
Only variant sites with at least 90% of the genotypes having DP>10 were retained by this filter. The filtered data sets are provided in the “helper_files” folders of all UK Biobank WES releases. For details on the analysis and other considerations, please refer to the UKB-RAP documentation.
The current release of the UK Biobank (UKB) whole exome sequencing (WES) data on 302,333 participants comprises single- and multi-sample variant data generated via the same protocols that were applied to the UKB 200k WES release in early 2021. All samples are processed with the OQFE mapping protocol, and variants are called with DeepVariant and aggregated into a multi-sample VCF. The multi-sample VCF contains per-genotype metrics including depth and genotype qualities, allowing researchers to perform custom variant- and genotype-level filtering as appropriate for their desired analyses. As such, a single unfiltered multi-sample VCF was provided for the 300k WES release along with the derived PLINK files. In response to feedback from the UK Biobank community, the 300k WES release also includes an auxiliary file, ukb23145_300k_OQFE.90pct10dp_qc_variants.txt, to aid researchers in implementing basic best practices for genotype-phenotype association analyses.
Specifically, the UKB WES data was generated in two phases: the first 50k participants (Phase 1) and then the balance of the total 500k cohort (Phase 2). As described in the , the 50k release participants were selected to enrich for specific phenotypes. Given the non-random order of participant sequencing, variations in sequencing coverage that occur over long-term projects can manifest as spurious association results. The UKB community reported such spurious hits when single-variant tests were run on the unfiltered UKB WES 200k genotypes. As an example, Figure 1A shows all single-variant hits of the UKB WES 200k unfiltered genotypes tested against an asthma phenotype (PHE10_J45), indicating a large number of likely spurious variants with significant or near-significant P-values. Examination of these spurious hits in the UKB WES 200k unfiltered set indicates that these variants tend to be enriched for sample-genotypes with low per-genotype read depth.
As noted in the UKB WES 200k FAQ (section 23.d), we suggest the inclusion of a batch covariate in association tests on these data to account for differences in oligo lots between Phase 1 and Phase 2. These coverage heterogeneities can also be mitigated by a single variant-level filter requiring that at least 90% of all genotypes for a given variant - independent of variant allele zygosity - have a read depth of at least 10 (i.e. DP>=10). When this filter is applied to the UKB WES 200k data prior to association analysis, the results are largely devoid of the spurious hits (Fig. 1B).
Application of this depth filter (“90pct10dp”) is consistent across the UKB 200k and UKB 300k WES sets with respect to numbers of variants removed (Table 1). The filtering can also be performed directly on the multi-sample VCF with the commands below:
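One way to express this filter with bcftools (a sketch; the input pVCF name is a placeholder, <reference> is the GRCh38 reference FASTA, and the exact expression used for the released helper files may differ):
# normalize against the reference, then keep sites where at least 90% of genotypes have DP>=10
bcftools norm -f <reference> -m -any input_pvcf.vcf.gz -Ou | bcftools view -i 'COUNT(FMT/DP>=10) >= 0.9*N_SAMPLES' -Oz -o input_pvcf.90pct10dp.vcf.gz -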
Details on the <reference> above can be found in this reference document.
bcftools and plink can be used as part of the Swiss Army Knife app found in the Tools Library on the Research Analysis Platform. See here for a detailed tutorial on how to use it.
For more information about app documentation for Swiss Army Knife, see (requires Research Analysis Platform login).
For more on the 300k WES dataset and how to work with it, .
If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
The UKB WES data are reference-aligned with the , which employs BWA-MEM to map all reads to the GRCh38 reference in an alt-aware manner, marks read duplicates, and adds additional per-read tags. The OQFE protocol retains all reads and original quality scores such that the original FASTQ is completely recoverable from the resulting CRAM file. All constituent steps of the OQFE protocol are executed with open-source software and described in detail in the OQFE manuscript linked above. Given the impact even small changes to the protocol can introduce into large analyses, the OQFE protocol is available as a and as an on the DNAnexus Research Analysis Platform that executes the OQFE Docker file.
If you are using the UI, start from .
This section of the UKB WES protocol describes the variant calling using . To call variants in WES data, either the default DeepVariant WES model or a custom model can be used. The custom model that is trained on WES data generated by and used for the generation of UK Biobank data is available as supplementary materials in .
(which can be found under the folder containing the exome sequences in your project)
The UKB WES variant data set () was generated using NVIDIA Clara Parabricks Pipelines accelerated version of DeepVariant, which is a faithful reproduction of Google’s DeepVariant (with the additional ), and is an order of magnitude faster. The results of the two implementations are equivalent when tested across many samples. Across 100 test samples, 7.5 million sample level variants were called with 100% concordance between the two implementations of DeepVariant, with 1 zygosity mismatch and MEAN GQ difference of 0.43 on Phred scale. This protocol first outlines how to run using the GPU accelerated DeepVariant provided by , and made available on the DNAnexus Research Analysis Platform as an .
If you are using the UI, start from .
The GLnexus aggregation process requires only the gVCFs and the desired aggregation regions in BED format (BED format specifications) as inputs. The BED file for UKB WES data is the exome capture region buffered by 100 bp on each side of each target, with overlapping buffered regions merged. Users who are applying this protocol to a set of gVCFs derived from sequencing across multiple capture designs will need to generate a unified BED file for the regions of interest. The protocol here describes the BED generation process for aggregating variants across multiple capture designs: the intersection of all capture designs and the union of all capture designs.
The BCFtools and commands mentioned in this section of the UKB WES protocol can be executed via command line using the Swiss Army Knife app. Please refer to the for more information about running the app with the command along with managing the inputs/outputs.
The following commands mentioned in this section of the UKB WES protocol can be executed in the Swiss Army Knife app. Please refer to the for more information about running the app with the command along with managing the inputs/outputs.
Launch the GLnexus app to generate pVCF files.
If you are using the UI, start from .
This section of the UKB WES protocol describes how the UKB WES PLINK and BGEN format files are derived from the pVCF, including the decomposition of multi-allelic variants into biallelic variants and variant normalization prior to format conversion to PLINK and BGEN. BGEN is recommended for performing GWAS using REGENIE.
The following commands mentioned in this section of the UKB WES protocol can be executed in the Swiss Army Knife app. Please refer to the for more information about running the app with the command along with managing the inputs/outputs.
Examples of Standard Support Questions
Examples of Questions that Require Service Packages
“We need help setting up a purchase order.”
“I am having trouble logging into the platform”
“I have opened a new RStudio session but I get an error message when I try to open it.”
“I would like to request access to the whole exome and whole genome sequence data through the RAP”
“What happens when my initial credit expires?”
“My team is trying to implement a pipeline analyzing rare variants in exome data but need some guidance setting it up efficiently”
“We want to query WES genotypes across a large selection of UKB samples. Can you provide recommendations on the best way to do this?”
“What’s the best way to merge genomic & proteomic data?”
 | Recommended for efficiency | HLA Example |
Number of concurrent jobs | <=5,000 | 2,000 |
Number of inputs per job | <=1,000 | 100 |
Job run time | ~1 day | 19 hours |
Chromosome | Number of significant variants |
1 | 115 |
2 | 598 |
4 | 1 |
6 | 658 |
7 | 3 |
8 | 27 |
9 | 204 |
10 | 11 |
11 | 13 |
12 | 104 |
13 | 1 |
15 | 437 |
16 | 63 |
17 | 140 |
19 | 149 |
21 | 7 |
Chromosome | Number of index variants |
1 | 8 |
2 | 9 |
4 | 1 |
6 | 29 |
7 | 1 |
8 | 1 |
9 | 4 |
10 | 2 |
11 | 2 |
12 | 8 |
13 | 1 |
15 | 5 |
16 | 2 |
17 | 2 |
19 | 6 |
21 | 1 |
Phenotype | Number of significant PheWAS associations |
Essential hypertension | 17 |
Hypertension | 17 |
Hematuria | 4 |
Delirium dementia and amnestic and other cognitive disorders | 3 |
Dementias | 3 |
Colorectal cancer | 2 |
Disorders of fluid, electrolyte, and acid-base balance | 2 |
Hypovolemia | 2 |
Malignant neoplasm of rectum, rectosigmoid junction, and anus | 2 |
Cancer of prostate | 2 |
Malignant neoplasm of other and ill-defined sites within the digestive organs and peritoneum | 1 |
Atrial fibrillation and flutter | 1 |
Malignant neoplasm of gallbladder and extrahepatic bile ducts | 1 |
Cardiac dysrhythmias | 1 |
Urinary incontinence | 1 |
Hyperplasia of prostate | 1 |
Helminthiases | 1 |
Other symptoms/disorders of the urinary system | 1 |
Intestinal helminthiases | 1 |
Name | Description |
gwas-phenotype-samples-qc.ipynb | Performs sample QC on the cases and controls cohorts and selects only those samples for which imputed data are available. |
liftover_plink_beds.wdl | Lifts array data to the newer version of the reference genome (GRCh38) and merges the per-chromosome files into one file. |
run_array_qc.sh | Performs QC on the lifted array PLINK files. |
bgens_qc.wdl | Performs QC on the imputed data BGEN files and merges the results for individual chromosomes into one file. |
run_ld_clumping.ipynb | Performs LD clumping on significant GWAS variants. |
get-phewas-data.ipynb | Creates phenotype data in long format suitable for the PheWAS R package and extracts genotype data for each variant selected by LD clumping. |
run-phewas.ipynb | Runs the PheWAS analysis using the PheWAS R package. |
% Filtered Variants | SNP | Indel |
UKB 200k | 1.52% | 5.54% |
UKB 300k | 1.57% | 5.20% |
Command section | Annotation |
bedtools flank | To create flanking intervals for each BED feature |
-i targets.bed | The input BED file |
-g references.fa.fai | The genome file defining chromosome bounds |
-b 100 | The number of base pairs in each direction to add to the input BED file |
> buffers.bed | Redirect the output to the buffers.bed |
Command section | Annotation |
cat targets.bed buffers.bed | Concatenate the target and buffer BED files to the standard output |
| sort -k1,1 -k2,2n | Pipe to a sort command to sort the BED by chromosome first and then by the start coordinates in the numeric order |
> target_buffers.bed | Redirect the output to a target_buffers.bed file |
Command section | Annotation |
bedtools merge | Merge overlapping BED features into a single interval |
-i target_buffers.bed | The input BED file |
> calling_regions.bed | Redirect the output to a calling_regions.bed file |
Command section | Annotation |
cat calling_regions_1.bed calling_regions_2.bed | Concatenate the calling_regions_1 and calling_regions_2 BED files to the standard output |
| sort -k1,1 -k2,2n | Pipe to a sort command to sort the BED by chromosome first and then by the start coordinates in the numeric order |
> combined.bed | Redirect the output to a combined.bed file |
Command section | Annotation |
bedtools merge | Merge overlapping BED features into a single interval |
-i combined.bed | The input BED file |
> combined_calling_regions.bed | Redirect the output to a combined_calling_regions.bed file |
Command section | Annotation |
bedtools intersect | Bedtools intersect |
-a calling_regions_1.bed | First calling regions in BED format |
-b calling_regions_2.bed | Section calling regions in BED format |
> intersect.bed | Redirect the output to a intersect.bed file |
Command section | Annotation |
dx find data | Find data objects subject to the given search parameters |
--project=<dx_project_name> | The DNAnexus project name where the data is in |
--folder=<folder_name> | The folder where the data is in the DNAnexus project |
--name="*.gvcf.gz" | The files to look for using a wildcard “*” |
--brief | Print a DNAnexus ID per line for all matching gVCF files |
| cut -d\: -f2 | Pipe to the cut command to extract the DNAnexus file IDs only |
> <manifest_file> | Redirect the output to a file |
Command section | Annotation |
dx upload | Upload local file to DNAnexus |
--path <dx_project>:/<dx_folder>/ | The folder <dx_folder> in the DNAnexus project <dx_project> that the manifest file is uploaded to |
<manifest_file> | The local manifest file that needs to be uploaded |
--brief | Return a DNAnexus file ID for the uploaded manifest file |
Command section | Annotation |
bcftools norm | Split multiallelic variants and normalization |
-f references.fa | Specify the reference sequences (required for left-alignment and normalization) |
-m -any | Split any multiallelic variants |
-Oz | The output type is ‘compressed VCF’ |
-o pvcf.norm.vcf.gz | Write output to a file named ‘pvcf.norm.vcf.gz’ |
pvcf.vcf.gz | The input pVCF file |
Command section | Annotation |
plink | PLINK 1.9 |
--vcf pvcf.norm.vcf.gz | Specify the input pVCF file |
--keep-allele-order | Keep the allele order |
--vcf-idspace-to _ | Change spaces in the variant IDs to underscore (_) |
--double-id | Set both family ID and within-family ID to the same sample ID |
--allow-extra-chr 0 | Treat the unrecognized chromosome codes as if they have been set to zero |
--make-bed | Creates a new PLINK 1 binary fileset |
--vcf-half-call m | Convert half calls (./1) in the pVCF to missing in PLINK |
--out pvcf.norm | Specify the prefix of the output PLINK |
Command section | Annotation |
plink2 | PLINK2 |
--bfile pvcf.norm | The input genotype file |
--export bgen-1.2 bits=8 id-paste=iid ref-first | The output format BGEN with 8-bits probability precision, iid used to construct ID column and correct allele order |
--out pvcf.norm_zlib | The output genotype file |
Command section | Annotation |
qctool | QCTOOL v2 |
-g pvcf.norm_zlib.bgen | The input genotype file |
-s pvcf.norm_zlib.sample | The input sample file |
-og pvcf.norm.bgen | The output genotype file |
-os pvcf.norm.sample | The output sample file |
-ofiletype bgen | The filetype of the output genotype file specified by -og |
-bits 8 | Store each probability in 8 bits |
-bgen-compression zstd | Use zstd algorithm for BGEN compression |
-bgen-omit-sample-identifier-block | Omit the sample identifier block |
Command section | Annotation |
bgenix | bgenix |
-g pvcf.norm.bgen | Specify the input genotype file |
-index | Create an index file for the given bgen file |
-clobber | bgenix will overwrite existing index file if it exists |
Contact the Research Analysis Platform Support team for help in using the Platform.
You can contact the UK Biobank Research Analysis Platform Support team in one of two ways:
In the menu at the top of the UI, click Help, then select Contact Support from the dropdown menu. A contact form will open in a modal window. Fill out the form and click Send.
After you submit a request, you'll receive an automated reply, confirming that your request has been logged in the Research Analysis Platform Support tracking system.
This article describes the format of the annotation files for the 500k WES dataset, and demonstrates how to use regenie on the UK Biobank Research Analysis Platform for generating variant masks on which to perform association tests.
Specifically, this article will go over how to use the following helper files:
ukb23158_500k_OQFE.sets.txt.gz
ukb23158_500k_OQFE.annotations.txt.gz
ukb23158_500k_OQFE.masks
A typical application of this tool would be in rare variant analyses where single variant tests have lower power, and combining variants into masks can boost association power. The commands shown below are executed in the JupyterLab environment with Bash kernel. For more on how to use JupyterLab on the Research Analysis Platform, see these tutorials. Alternatively, you can use ttyd (web-based terminal) or Cloud Workstation.
We will go over building masks on the fly in regenie, covering the following:
The format of the input files
How to run regenie to build and test masks
The LOVO scheme
We assume that the annotation files are located along with the dispensed files in the user’s project. We also assume the RAP project is mounted on “/mnt/project” on the worker as per the approach detailed here.
The path to the annotation file:
For this tutorial, we need the following files:
Annotation file
Masks definition file
Set list file
Use the following command to view the directory structure of the files:
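A sketch; the helper-file folder path in your dispensed project may differ, so set it once as a convenience variable:
HELPER_DIR="/mnt/project/<path to helper files>"
ls -1 "$HELPER_DIR"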
The following files should be in your directory:
ukb23158_500k_OQFE.annotations.txt.gz
ukb23158_500k_OQFE.masks
ukb23158_500k_OQFE.sets.txt.gz
The file ukb23158_500k_OQFE.sets.txt.gz defines how sets - i.e. genes - are to be constructed, by listing the variants corresponding to each gene.
Each line contains the gene name, followed by a chromosome and physical position - to be used in the association result file - then by a comma-separated list of variants included in the gene, in the format CHR:POS:REF:ALT ID.
The variant IDs correspond to those in the genotype file (if not in the file, they will be ignored when running regenie). To view the file:
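For example, truncating long lines for readability (HELPER_DIR is the convenience variable defined above):
zcat "$HELPER_DIR/ukb23158_500k_OQFE.sets.txt.gz" | head -n 2 | cut -c 1-120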
The file details will appear as the following:
In this file there are almost 19,000 defined genes. To see the count, use the following command:
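For example:
zcat "$HELPER_DIR/ukb23158_500k_OQFE.sets.txt.gz" | wc -l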
Which will give you:
The file ukb23158_500k_OQFE.annotations.txt.gz defines a functional annotation for each variant given a gene set. Each line contains the variant name, the gene name (corresponding to the name in the set list file above), and a single annotation label.
To view this file:
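For example:
zcat "$HELPER_DIR/ukb23158_500k_OQFE.annotations.txt.gz" | head -n 5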
The file details will appear as the following:
There are a total of 5 annotation labels in the file. Check the file ukb23158_helper_files.pdf that is available on the UK Biobank showcase here for more details on the labels. Variants in the set list file which don't have annotations will be assigned to a default NULL category in regenie. The distinct labels can be listed as sketched below:
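A sketch, assuming the annotation label is the third whitespace-separated column:
zcat "$HELPER_DIR/ukb23158_500k_OQFE.annotations.txt.gz" | awk '{print $3}' | sort -u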
Output:
The file ukb23158_500k_OQFE.masks specifies which annotation labels should be combined into masks. Each line contains a mask name, followed by a comma-separated list of the annotations included in the mask (i.e. taking a union over the annotation categories).
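To view it:
cat "$HELPER_DIR/ukb23158_500k_OQFE.masks"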
Output:
The specified M1 mask includes only loss-of-function annotated variants in the mask. You can easily add new masks by selecting the annotations you want to combine. When doing so, make sure the annotation labels match the formatting of those in the annotation file above.
For example, to have a mask called M2 that includes LoF and missense annotated variants, you would generate a mask definition file as:
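A sketch of such a mask definition file; the label names LoF and missense are illustrative and must be replaced by the exact label strings used in the annotation file:
M1 LoF
M2 LoF,missense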
For more on these input files, refer to the file ukb23158_helper_files.pdf here.
From this point on, you will need to have regenie installed in your working environment.
Here we show how to use annotation files for burden testing using regenie. For more detailed instructions on using regenie, refer to the regenie documentation page.
To install regenie in your JupyterLab environment:
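One way to do this (a sketch; check the regenie releases page on GitHub for the current version and the exact archive name before running):
wget https://github.com/rgcgithub/regenie/releases/download/v3.2.8/regenie_v3.2.8.gz_x86_64_Linux.zip
unzip regenie_v3.2.8.gz_x86_64_Linux.zip
mv regenie_v3.2.8.gz_x86_64_Linux regenie && chmod +x regenie
./regenie --version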
regenie is also available as part of the Swiss Army Knife app, if you want to run an analysis as part of a larger workflow on the Research Analysis Platform.
In regenie, we will use the following options:
--aaf-bins to specify the AAF cutoffs to use when building masks (the singleton class of masks is always included); e.g. --aaf-bins 0.05,0.01 will build masks using singletons, 1% and 5% AAF cutoffs.
In the examples below, we will use the OQFE genotype data in BGEN format. For the purposes of this tutorial, we will skip regenie step 1 and use --ignore-pred to bypass specifying the LOCO PRS.
Note that when running an actual analysis, it is highly recommended that you run Step 1 to control for relatedness, population structure and polygenicity.
We will focus on a specific gene, PCSK9. Start by getting its ID from the annotation files:
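For example (a sketch; it assumes the gene symbol appears in the set names, and prints the set name, chromosome and position):

```bash
# Look up the PCSK9 entry in the set list file.
zcat ukb23158_500k_OQFE.sets.txt.gz | awk '$1 ~ /PCSK9/ {print $1, $2, $3}'
```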
The output should be:
We will use a phenotype file containing LDL measurements, and will carry out burden tests using variant masks built on-the-fly in regenie:
We will first run regenie, building variant masks using 0.1% and 1% AAF cutoffs. We use the option `--extract-setlist` to specify a subset of the ~19K genes to test. Alternatively, you can provide a file (named, for example, "extract_gene_names.txt") with a single column containing the names of the genes to analyze, and pass this file to regenie using `--extract-sets extract_gene_names.txt`:
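A sketch of what this call can look like is shown below. The helper file names are those dispensed with the data; the BGEN/sample paths, the phenotype file and the output prefix are placeholders to adapt to your project (PCSK9 is on chromosome 1, hence the chromosome 1 files).

```bash
# Burden testing for PCSK9, with masks built on the fly from the helper files.
regenie \
  --step 2 \
  --bgen /mnt/project/<path_to_wes_bgen>/ukb23159_c1_b0_v1.bgen \
  --sample /mnt/project/<path_to_wes_bgen>/ukb23159_c1_b0_v1.sample \
  --ref-first \
  --phenoFile ldl_phenotypes.txt \
  --phenoCol LDL \
  --set-list ukb23158_500k_OQFE.sets.txt.gz \
  --anno-file ukb23158_500k_OQFE.annotations.txt.gz \
  --mask-def ukb23158_500k_OQFE.masks \
  --extract-setlist "PCSK9(ENSG00000169174)" \
  --aaf-bins 0.01,0.001 \
  --bsize 200 \
  --ignore-pred \
  --out ldl_pcsk9_test
```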
regenie will output a summary statistics file ldl_pcsk9_test_LDL.regenie containing the association results for the built masks, whose names will be PCSK9(ENSG00000169174).M1.*, split across the 3 AAF cutoffs (0.01, 0.001, or singletons).
You have the option to save the built masks to a file, which can be useful and save compute if you want to use them again in another analysis. To do so, use `--write-mask`, which will store the masks in PLINK BED format, in the files ldl_pcsk9_test_masks.{bed,bim,fam}.
Note that if you want to build masks and save them to file without testing them for association, you can use the option `--skip-test`.
By default, masks are created by taking the maximum ALT allele count across the sites included in the mask, so each mask takes values 0/1/2, or NA when missing. Alternatively, you can specify different rules to build masks:
`--build-mask sum` will take the sum of the ALT allele counts across sites.
`--build-mask comphet` is used to identify compound heterozygotes, defined as carrying 2 or more ALT alleles across the sites included in the mask.
Note that building masks using the "sum" rule is not compatible with the use of `--write-mask` as detailed above.
To obtain the list of single variants that went into each mask, you can use the option `--write-mask-snplist` when building masks. This will generate a file where each row has the mask name followed by the list of variants included in that mask.
If you want to make sure that you're correctly specifying the input annotation files, you can use the option `--check-burden-files`, which will create a report checking the input files for concordance. It verifies that the same annotation labels are used across all files and flags any variants in the set list file that are not present in the input genotype file.
If there are built masks that come up as significant, it is usually of interest to determine which of the single variants in the mask is driving the signal. This is what the LOVO scheme, which stands for leave-one-variant-out, aims to do.
To specify LOVO, you need to use the option `--mask-lovo` followed by the gene name, the mask name, and the AAF cutoff (either 'singleton' or a value in (0,1)).
For example, if we wanted to apply LOVO to the M1 mask with 1% AAF cutoff for PCSK9, we would use:
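A sketch (same placeholders as in the command above; note the comma-separated triplet passed to `--mask-lovo` and the new output prefix):

```bash
# Re-run the burden test with leave-one-variant-out masks for M1 at the 1% AAF cutoff.
regenie \
  --step 2 \
  --bgen /mnt/project/<path_to_wes_bgen>/ukb23159_c1_b0_v1.bgen \
  --sample /mnt/project/<path_to_wes_bgen>/ukb23159_c1_b0_v1.sample \
  --ref-first \
  --phenoFile ldl_phenotypes.txt \
  --phenoCol LDL \
  --set-list ukb23158_500k_OQFE.sets.txt.gz \
  --anno-file ukb23158_500k_OQFE.annotations.txt.gz \
  --mask-def ukb23158_500k_OQFE.masks \
  --aaf-bins 0.01,0.001 \
  --bsize 200 \
  --ignore-pred \
  --mask-lovo "PCSK9(ENSG00000169174),M1,0.01" \
  --out ldl_pcsk9_test_lovo
```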
This will generate the result output file ldl_pcsk9_test_lovo_LDL.regenie, which will contain association results for each LOVO mask, as well as results for the full mask that considers all sites.
If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
Explore one of the ways to conduct a GWAS on the Research Analysis Platform.
The code provided in the following tutorial is delivered "As-Is." Notwithstanding anything to the contrary, DNAnexus will have no warranty, support or other obligations with respect to Materials provided hereunder. The MIT License applies to this guide.
This tutorial will provide the Python code needed to conduct the QC portion of a genome-wide association study (GWAS) focusing on discovering variants associated with Alzheimer's Disease (AD) in JupyterLab. For this GWAS, we used UK Biobank data to search for genetic loci associated with AD. For more detailed information about the AD study, see this blog post.
One of the motivations for performing this study was to demonstrate the process of running an end-to-end GWAS on the UK Biobank Research Analysis Platform. While researchers can bring their own programming languages and tools to the Platform, the Platform's user interface enables researchers to run analyses easily and quickly. We wanted to demonstrate that a GWAS can be conducted without knowledge of sophisticated command-line methods.
Here is a brief overview of the steps that we conducted to perform this GWAS:
In JupyterLab:
Access data and construct phenotype.
Perform sample QC.
In CLI or UI:
Conduct LiftOver.
Conduct variant QC using Swiss Army Knife.
Conduct GWAS using regenie.
In UI:
Visualize analysis results using LocusZoom.
The genetic data needed to run this GWAS is stored in multiple folders in the Research Analysis Platform’s file structure, which is detailed here. For this analysis, we use the array genotype and whole exome sequencing data stored in the “/Bulk/Genotype Results/Genotype calls/” and “/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/” folders, respectively. Phenotypic data needed to collect the AD phenotype is located in the SQL database named “app<APPLICATION-ID>_<CREATION-TIME>”, and can be accessed via the corresponding dataset named “app<APPLICATION-ID>_<CREATION-TIME>.dataset” in the root (“/”) folder.
The first part of this analysis consists of conducting sample QC and processing the UK Biobank data fields to calculate each individual's AD risk by proxy.
We conducted our QC using a JupyterLab Spark cluster with the following configuration:
Cluster configuration: Spark Cluster
Instance type: mem1_hdd1_v2_x4
Number of Nodes: 2
We first imported dxpy to find and extract the UKB datasets on the Research Analysis Platform.
The extracted UKB data dictionaries are returned as 3 .csv files. The data_dictionary.csv contains one row per field, with metadata along the columns, including the field names (see the table below for the naming convention). The codings.csv contains a lookup table for the different medical codes, including the ICD-10 codes that will be displayed in the diagnosis field column (p41270). The entity_dictionary.csv contains the different entities that can be queried, where participant is the main entity and corresponds to most phenotype fields.
To access the data_dictionary.csv, use the command:
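The notebook itself drives this through dxpy; as an alternative sketch, the same three dictionary files can be produced from a terminal with the dx command-line client (the dataset record name below is the placeholder used elsewhere in this guide):

```bash
# Dump the dataset dictionaries, then peek at the data dictionary.
dx extract_dataset "app<APPLICATION-ID>_<CREATION-TIME>.dataset" -ddd
head -n 5 *data_dictionary.csv
```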
For the main participant phenotype entity, the Research Analysis Platform uses field names with the following convention:
Using the UKB showcase, we located the following field IDs of interest. We note them below so we can easily refer to them throughout the rest of the notebook.
The full list of all data-fields used in our case study:
The fields we want include principal components as well as the fields we need to construct our phenotype. We use the following function to collect all the field names and create a list of field IDs:
If you view these field names now, each will look something like this: p<field_id>_iYYY_aZZZ.
Now we can retrieve our necessary fields as a Pandas DataFrame:
The retrieved .csv file contains a table with participants as rows and the fields of interest as columns. From this point onward, we are interacting with the data in a Pandas DataFrame.
Use this command to visualize your table:
We will format the column headers to drop the "participant" prefix for the field names:
Here we resolve the ICD-10 codes for AD from the dataset. This list of codes will be used later for defining the phenotype.
Your output may look like the following:
The AD phenotype is generated using both a participant’s disease status and that participant's parental disease status and age. We are following the definition of AD-by-proxy described in this article by Jansen et al.
Here is a summary of the steps of the process. There are two calculation methods:
Participant's disease status.
If the participant has been diagnosed with AD, count the risk as the maximum value, 2.
Sum of the parents' risk, determined by each parent's disease status and age.
If a biological parent has been diagnosed with AD, count that parent's risk as 1. The risk is set to 2 if both biological parents were diagnosed with AD.
Otherwise, derive the parent's risk from the parent's age (recorded age or age at death, whichever is older):
risk = max( 0.32, (100 - age)/100 )
where 0.32 is the general risk for an adult to be diagnosed with AD (e.g. a 60-year-old parent without an AD diagnosis contributes max(0.32, 0.40) = 0.40).
Take the maximum of steps 1 and 2.
To get the participant's disease status, we look for the participant’s primary or secondary ICD-10 codes. If a participant has ICD-10 codes for AD, we record their AD risk as “2”. Note that we first need to format the values in the columns to be a list object instead of a string.
Now we look at the illness statuses of the participant’s mother and father. We first get a list of illnesses with which the participants' biological parents have been diagnosed.
Data-fields 20107 and 20110 record illnesses of each participant's father and mother. If diagnosed with AD, assign the parent's risk as "1":
UKB participants attended multiple measurement-taking sessions, so there are multiple fields for the participant’s parents’ ages. Since the risk of developing AD increases by age, we only need the maximum value for each parent’s age to calculate the AD risk:
If the parent does not have any AD diagnosis, calculate the parent’s risk using their age, as shown in the following, then organize the data into a table:
Now we conduct the sample QC steps. We filter for participants whose reported sex and genetic sex match, whose ancestry is listed as "White British," and who have no sex chromosome aneuploidy. We filter out outliers for heterozygosity or missing rate, as well as participants flagged for excessive kinship, i.e. those with ten or more third-degree relatives. Participants whose parental ages were recorded as "do not know" or "prefer not to answer" were also removed from the cohort:
Use this command to check the resulting number of samples:
Since GWAS analyses have been run before for AD, and the results have been documented, we took the extra step of visualizing the distribution of cases with a histogram, to double check that our results looked reasonable in a general population. Our histogram looked reasonable, so we continued with our analysis:
Based on the distribution, we set the case/control threshold at an AD risk of 1 (the risk ranges from 0 to 2). All participants with an "AD-by-proxy" value of 1 or above are considered cases, and those below 1 are considered controls:
Next, we cleaned up the DataFrame to use as input for regenie. We renamed the columns for human readability and added an FID column, which is required by regenie's input format. We also converted the list of parental illnesses to a string, for ease of formatting:
At this point, your table should look something like this (the displayed data values are placeholders, so as not to disclose actual values):
To upload your phenotype file back to the project:
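For example, with the dx client available in the JupyterLab session (a sketch; add a destination folder if you do not want the file in the project root):

```bash
# Upload the phenotype file from the local JupyterLab filesystem to the project.
dx upload ad_risk_by_proxy_wes.phe
```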
Now the QC steps are complete.
The next steps for the GWAS are to conduct variant QC and the array-genotype LiftOver.
Note that the UKB SNP array genotype calls (bulk, genotype calls) are mapped to GRCh37, while the whole-exome sequences are mapped to GRCh38, the most recent version. We chose to lift the coordinates of genotype calls over to GRCh38 using LiftOver before conducting the rest of the analysis, so that our genotype calls are consistent.
The following pipeline may be useful in conducting this part of the analysis.
To conduct variant QC, we use PLINK2, available as part of the Swiss Army Knife app, to create a list of variants that pass the QC threshold for the array genotype and WES variant data. To access it on the UK Biobank Research Analysis Platform, open your project, then click Start Analysis. From the list of tools that appears, click Swiss Army Knife. The following shows what to put as the input string under the cmd option of the Swiss Army Knife app.
For more information about PLINK2 see the official PLINK2 documentation.
The input files to our variant QC job on the array genotype data are:
The UK Biobank genotype call files in .bed, .bim, and .fam formats. We merged the chromosomes together into one file for each file format in preparation for running the whole genome regression model using Step 1 of regenie.
Below are the Swiss Army Knife cmd inputs used to run the job:
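A sketch of such a cmd string is shown below; the QC thresholds are illustrative placeholders rather than the exact values used in the original study, while the input and output prefixes match the files listed in this section.

```bash
# Variant QC on the merged array genotypes with PLINK2 (run inside Swiss Army Knife).
plink2 --bfile ukb_c1-22_hs38DH_merged \
       --maf 0.01 --mac 20 --geno 0.1 --mind 0.1 --hwe 1e-15 \
       --write-snplist --write-samples --no-id-header \
       --out final_array_snps_CRCh38_qc_pass
```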
The outputs we get from this step are:
A list of SNPs that pass QC, final_array_snps_CRCh38_qc_pass.snplist
A list of sample IDs that pass QC, final_array_snps_CRCh38_qc_pass.id
A log file (.log)
The WES QC is performed using the bgens_qc.wdl script found here. The variants are QC'd per chromosome and then concatenated at the end to create a single list of variants to keep. The variants were filtered using the following options:
The input file to run the bgens_qc.wdl was created by first running the generate_inputs.ipynb notebook shown here. Within this notebook you'll need to update the following parameters:
Run the WDL workflow using the .json file created in the generate_inputs.ipynb notebook by running the command: dx run <path_to_folder_on_UKB_RAP>/bgens_qc -f research_project/bgens_qc/bgens_qc_input.dx.json. Alternatively, you can run the WDL workflow in the UI and specify the inputs manually instead of using the Jupyter notebook.
The outputs we get from this step are:
A list of SNPs that pass QC, final_WES_snps_CRCh38_qc_pass.snplist
A list of sample IDs that pass QC, final_WES_snps_CRCh38_qc_pass.id
A log file (.log)
We will be using regenie to conduct our GWAS. DNAnexus provides a suite of apps to run the analysis as a whole. The Regenie orchestration app is used to facilitate running the two required steps, as well as input preprocessing, validation, resource allocation and post-processing (including creation of Manhattan plots per trait).
For Step 1, we will estimate how background SNPs contribute to the phenotype.
The inputs used for Step 1 are:
Merged array genotype calls: ukb_c1-22_hs38DH_merged.[bed,bim,fam], which we also used as our input in the variant QC step above.
Our phenotype file: ad_risk_by_proxy_wes.phe
The list of SNPs that pass QC, final_array_snps_CRCh38_qc_pass.snplist, and the list of sample IDs that pass QC, final_array_snps_CRCh38_qc_pass.id
The following outputs from Step 1 are the necessary inputs for Step 2:
LOCO file: ukb_c1-22_hs38DH_merged_1.loco
Prediction list: ukb_c1-22_hs38DH_merged_pred.list
Then in Step 2, linear or logistic regression is used to test the association between the variants and the phenotype, conditional upon the predictions from the Step 1 regression model. In Step 2, the orchestration app distributes each set of BGEN files into a separate job; since each variant is tested independently against the phenotype, this parallelization helps accelerate the end-to-end runtime.
The inputs to the second step are:
23 sets of WES genotype files (ukb23159_c[1-22,X]_b0_v1.[bgen,bgi,sample]).
ID file of SNPs that pass QC: final_WES_snps_CRCh38_qc_pass.id
Phenotype file: ad_risk_by_proxy_wes.phe
Prediction list: ukb_c1-22_hs38DH_merged_pred.list (from Step 1)
LOCO file: ukb_c1-22_hs38DH_merged_1.loco (from Step 1)
After the completion of Step 2, a set of association plots is generated; results are concatenated per phenotype. The outputs of the orchestration app include the concatenated Manhattan plots as well as summary statistics in lmm.tsv.gz (under Extra output files/extra_outputs) that can be investigated with LocusZoom.
We used the following input options to run the regenie app using the UI. If an input option isn’t specified, then the default option was used.
Inputs:
Genotype BED for Step 1: ukb_c1-22_hs38DH_merged.bed
Genotype BIM for Step 1: ukb_c1-22_hs38DH_merged.bim
Genotype FAM for Step 1: ukb_c1-22_hs38DH_merged.fam
Genotype BGEN files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen
Genotype BGI index files for Step 2: ukb23159_c[1-22,X]_b0_v1.bgen.bgi
Sample files for Step 2: ukb23159_c[1-22,X]_b0_v1.sample
Phenotypes file: ad_risk_by_proxy_wes.phe
Covariates file: ad_risk_by_proxy_wes.phe
Variant IDs to extract (Step 1): final_array_snps_CRCh38_qc_pass.snplist
Variant IDs to extract (Step 2): final_WES_snps_CRCh38_qc_pass.snplist
Common
Quantitative traits: FALSE
Phenotypes to include: ad_by_proxy
Array of covariates to apply: age,sex,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10
WGR (Step 1)
Size of the genotype blocks (Step 1): 1000
First allele as reference (Step 1): TRUE
Association Testing (Step 2)
Size of the genotype blocks (Step 2): 200
Firth approximation instead of SPA: TRUE
Minimum allele count (MAC): 3
To run regenie in the CLI, do not directly copy and paste this portion of code. Fill in the file paths specified by <>, and add file paths for the chromosomes you want to analyze using the igenotype_bgens/bgis/samples inputs.
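As a purely hypothetical sketch (the app name and input names below are assumptions and must be checked against the app's own help, e.g. with dx run <regenie app> -h; only chromosome 1 inputs are shown):

```bash
# Hypothetical launch of the regenie app from the CLI; input names are assumptions.
dx run <regenie_app> \
  -igenotype_bgens="<path>/ukb23159_c1_b0_v1.bgen" \
  -igenotype_bgis="<path>/ukb23159_c1_b0_v1.bgen.bgi" \
  -igenotype_samples="<path>/ukb23159_c1_b0_v1.sample" \
  <remaining inputs, matching the UI options listed above>
```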
Next, you can use LocusZoom to visualize the results of your analysis. The LocusZoom app is available in the Tools Library of the Research Analysis Platform. It generates Manhattan plots, QQ plots, and a table of the most significant loci.
To visualize your regenie outputs (e.g. the lmm.tsv.gz file), provide it as input to LocusZoom and follow the app's instructions. From these results you can conduct further analyses and look into variant effects, or you may find new loci to explore.
Examples of these visualizations are shown below.
A few example cases that might be of interest were tested to provide expectations for the regenie runtime.
The runtime and resources for various Step 1 scenarios are summarized in the following table. Please note that these figures are for orientation only; analysis-specific factors can influence the final results.
For Step 2, we focus on chromosome 1 (99,236 variants after QC), since all chromosomes are analyzed in parallel. The Step 2 runtime varied from 7 to 140 minutes across the use cases mentioned above. The resources used are relatively stable across the tested scenarios, and the defaults are set to 3700 MB RAM and 40 GB of disk.
If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
For each variant included in a mask, the LOVO scheme will build the mask excluding that variant. Hence, if there are K variants included in a mask, K LOVO masks will be generated. Thus, if a variant contributes strongly to the full mask signal, the corresponding LOVO mask (which excludes it) should show a weaker association signal.
| Type of field | Syntax for field name | Example |
| --- | --- | --- |
| Neither instanced nor arrayed | p<FIELD> | p31 |
| Instanced but not arrayed | p<FIELD>_i<INSTANCE> | p40005_i0 |
| Arrayed but not instanced | p<FIELD>_a<ARRAY> | p41262_a0 |
| Instanced and arrayed | p<FIELD>_i<INSTANCE>_a<ARRAY> | p93_i0_a0 |
| Data-Field ID | Description | Data-Coding |
| --- | --- | --- |
|  | Sex |  |
|  | Father's age at death | Units of measurement are years or 100291 |
|  | Mother's age | Units of measurement are years or 100291 |
|  | Father's age | Units of measurement are years or 100291 |
|  | Mother's age at death | Units of measurement are years or 100291 |
|  | Illnesses of father |  |
|  | Illnesses of mother |  |
|  | Age at recruitment | Units of measurement are years |
|  | Genetic sex |  |
|  | Genetic ethnic grouping |  |
|  | Genetic principal components | Units of measurement are Units; Array indices run from 1 to 40 |
|  | Sex chromosome aneuploidy |  |
|  | Genetic kinship to other participants |  |
|  | Outliers for heterozygosity or missing rate |  |
|  | Diagnoses - main ICD10 |  |
|  | Diagnoses - secondary ICD10 |  |
|   | IID | sex | age | missingness | ethnic_group | sex_chromosome_aneuploidy |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0000011 | 1 | 45 | 0.003 | 1 | None |
| 1 | 0000022 | 0 | 67 | 0.022 | 1 | None |
| 2 | 0000003 | 1 | 50 | 0.001 | 1 | None |