Working with Bulk Data Files

Learn how to search and analyze UK Biobank bulk data files.

This section provides a detailed breakdown of how to search for an EID in participant-specific files, such as individual VCF and CRAM files. Note that these methods won't work for cohort-wide files, such as PLINK and pVCF files.

Web UI

  1. Turn on the filters in your project by clicking the filter icon.

  2. Use the filter picker to open the Properties filter.

  3. Select Any Properties and type "eid" (without the quotes, in lower-case letters) in the Any Key textbox.

  4. In the Any Value textbox, enter the 7-digit EID you're trying to locate.

  5. Select Apply.

  6. To search across all folders, set the search scope to Entire Project.

CLI

Search for an EID as follows, replacing "1234567" with the EID you're trying to find:

  dx find data --property eid=1234567
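For scripted lookups, the same search can be wrapped in a small shell function. This is a minimal sketch: the function name find_by_eid and the DRY_RUN switch are hypothetical and not part of the dx toolkit, and the 7-digit check only mirrors the EID format described in the Web UI steps.

```shell
# Hypothetical helper (not part of dx): search a project for one participant's files.
find_by_eid() {
  # Reject anything that is not exactly 7 digits before calling dx.
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "error: expected a 7-digit EID, got '$1'" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = 1 ]; then
    # Illustrative switch: print the command instead of running it.
    echo "dx find data --property eid=$1"
  else
    dx find data --property "eid=$1"
  fi
}

# Preview the command for EID 1234567 without contacting the platform.
DRY_RUN=1 find_by_eid 1234567
```

Validating the EID first means a mistyped value fails immediately rather than returning an empty search result.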

Visualizing a CRAM or VCF file using IGV.js

To visualize a CRAM or VCF file in the IGV.js genome browser, follow these steps:

  • Navigate to the project containing the files you want to visualize.

  • Select the Visualize tab.

  • Select the option "IGV.js Genome Browser v2.6.6 (*.bam+bai, *.cram+crai, *vcf.gz+tbi)".

  • Select the files you want to visualize.

    • If you are looking for a specific participant, enter the EID in the Search Project textbox, to quickly locate any CRAM or VCF files related to that participant.

    • For CRAM files, you must select both the CRAM and the associated CRAI file.

    • For VCF files, you must select both the VCF and the associated TBI file.

    • IMPORTANT: Note that IGV.js cannot visualize extremely large pVCF files, such as those provided for the 200k WES, 300k WES or 150k WGS releases. If you want to visualize variants in the 150k WGS cohort, type either "qc_metrics_graphtyper" or "qc_metrics_gatk" into the Search Project textbox and select the resulting pair of *.tab.gz and *.tab.gz.tbi files.

  • Select Launch Viewer.
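Because IGV.js needs each data file together with its index, it can help to work out the index name you should be looking for. The sketch below is illustrative only: index_name_for is a hypothetical helper, the sample file names are made up, and it assumes that an index is named by appending its extension to the data file name (as in the *.tab.gz and *.tab.gz.tbi pair above). Check the actual names in your project.

```shell
# Hypothetical helper: print the index file name expected alongside a data file.
# Assumption: the index name is the data file name plus the index extension.
index_name_for() {
  case "$1" in
    *.cram)   printf '%s.crai\n' "$1" ;;
    *.bam)    printf '%s.bai\n' "$1" ;;
    *.vcf.gz) printf '%s.tbi\n' "$1" ;;
    *.tab.gz) printf '%s.tbi\n' "$1" ;;
    *) echo "error: no known index type for '$1'" >&2; return 1 ;;
  esac
}

# Made-up file names, for illustration only.
index_name_for "sample.cram"
index_name_for "sample.vcf.gz"
```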

Analyzing Files with Swiss Army Knife

The Research Analysis Platform provides many different tools for analyzing files. The Swiss Army Knife app is a simple starting point for many common bioinformatics manipulations. Launching the app will instantiate a Linux VM on the cloud with several preinstalled tools, and run a user-provided command. For more information about this app and its possibilities, visit its entry in the Tools Library.

To launch Swiss Army Knife, navigate to your project and click Start Analysis. Select Swiss Army Knife and click Run Selected. Select the Analysis Inputs tab. You can choose between specifying explicit inputs or using a mounted project folder.

  • Explicit Inputs. Use this strategy to analyze files that will be first downloaded on the local disk of the cloud VM.

    • Click Input files. Navigate to a folder of interest (for example, Bulk > Genotype Results > Genotype calls), and tick the files of interest (for example, <Chromosome 21 file>.bed, <Chromosome 21 file>.bim and <Chromosome 21 file>.fam). Click Select as Input.

    • In the Command line textbox, enter a command, referring to files directly by name (for example, plink --bfile <Chromosome 21 file> --maf 0.1 --out filtered_chr21).

  • Mounted project folder. Use this strategy to analyze files that will be streamed directly without first writing them on disk.

    • In the Command line textbox, enter a command, referring to any files in the project using the prefix /mnt/project (for example, plink --bfile "/mnt/project/Bulk/<Path to chromosome calls>" --maf 0.1 --out filtered_chr21).

It is also possible to combine the two strategies. For example, you can provide an R script as an explicit input (such as statistics.r) and a command to run the script (such as Rscript statistics.r). Inside the script, you can read any project file by opening it from the /mnt/project folder (such as fields <- read.csv("/mnt/project/<Path to project files>", sep="\t")).
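The two strategies differ only in how the command refers to its files. As a rough sketch, the hypothetical helper below builds the PLINK command from the examples above for either strategy. The placeholders in angle brackets are kept from the text and must be replaced. The commented dx run line assumes the app's inputs are named in and cmd; confirm this in the app's entry in the Tools Library.

```shell
# Hypothetical helper: build the example PLINK command for either input strategy.
# $1 = "explicit" or "mounted"
# $2 = file prefix (explicit) or path inside the project (mounted)
plink_cmd() {
  case "$1" in
    explicit) printf 'plink --bfile "%s" --maf 0.1 --out filtered_chr21\n' "$2" ;;
    mounted)  printf 'plink --bfile "/mnt/project/%s" --maf 0.1 --out filtered_chr21\n' "$2" ;;
    *) echo "error: strategy must be 'explicit' or 'mounted'" >&2; return 1 ;;
  esac
}

# Print both variants; placeholders are left unfilled on purpose.
plink_cmd explicit "<Chromosome 21 file>"
plink_cmd mounted "Bulk/<Path to chromosome calls>"

# Launching from the CLI (commented out: needs a dx login and real paths;
# the input names "in" and "cmd" are assumptions to verify):
# dx run app-swiss-army-knife -y \
#   -iin="project-xxxx:/path/to/<Chromosome 21 file>.bed" \
#   -iin="project-xxxx:/path/to/<Chromosome 21 file>.bim" \
#   -iin="project-xxxx:/path/to/<Chromosome 21 file>.fam" \
#   -icmd="$(plink_cmd explicit '<Chromosome 21 file>')"
```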

Troubleshooting

For general tips on troubleshooting, see the troubleshooting guide.

Each issue below is listed with an example error message and what to do.

Issue: Failed job

  • Example error message: Error while running the command (please refer to the job log for more information)

    What to do: See the troubleshooting guide to understand the problem.

  • Example error messages:

    Error while mounting project-GKv25k0Jv6jzF4Gz2z0pxzzJ in /mnt/project (please refer to the job log for more information)

    Error opening file: Bulk/Exome\sequences/Exome\OQ

    What to do: Check whether the file path works using dx describe /project-xxx/file-xxx or dx describe <filename> (with a relative path). Check that there are no typos in your file path.
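The dx describe check can be run before a job is launched, so that a bad path is caught early. The following is a sketch only: check_dx_path is a hypothetical wrapper, the path in the usage comment is a placeholder, and it relies on dx describe returning a non-zero exit status when a path does not resolve.

```shell
# Hypothetical helper: confirm that a platform path resolves before launching a job.
check_dx_path() {
  if dx describe "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "not found: $1 (check spelling, spaces, and the project prefix)" >&2
    return 1
  fi
}

# Usage (requires a dx login; the path is a placeholder):
# check_dx_path "project-xxxx:/path/to/file" && echo "safe to launch"
```

Quoting the whole path, as in the usage comment, keeps folder names that contain spaces intact.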

Issue: Issue with instance type

  • Example error message: The machine running the job was terminated by the cloud provider", "try": 0

    What to do: The spot instance running your job may become unavailable mid-execution during a period of increased demand for cloud computing capacity. When that happens, the work done on the instance is lost and the job has to be restarted. To avoid this, you have a few options:

    1) Add SpotInstanceInterruption to your application's restart-on policy. The job is then restarted automatically after a spot instance interruption. See documentation for more information.

    2) Use the "High" priority setting to run your jobs on demand and avoid SpotInstanceInterruption errors entirely. Note that these instances have a higher price.

    3) Restart your job on a spot instance manually.

  • Example error message: Warning: Low disk space during this job

    What to do: Update your instance type selection to make sure you have enough memory or storage (see the “Memory (GiB)” or “Storage (GiB)” column in the rate card).
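For jobs launched from the CLI, the first two spot-instance options above map onto dx run flags. The sketch below is an assumption-laden illustration: spot_flags is a hypothetical helper, the retry count of 2 is arbitrary, and passing the restart policy through --extra-args as an executionPolicy object should be checked against the platform documentation before use.

```shell
# Option 1 as JSON: restart automatically after a spot interruption.
# The retry count (2) is illustrative, not a recommendation.
RESTART_POLICY='{"executionPolicy": {"restartOn": {"SpotInstanceInterruption": 2}}}'

# Hypothetical helper: print the dx run flags for the chosen mitigation.
spot_flags() {
  case "$1" in
    restart)   printf '%s\n' "--extra-args '$RESTART_POLICY'" ;;
    on-demand) printf '%s\n' "--priority high" ;;
    *) echo "error: choose 'restart' or 'on-demand'" >&2; return 1 ;;
  esac
}

# Show the flags to add to a dx run command for each option.
spot_flags restart
spot_flags on-demand
```

Option 1 keeps spot pricing at the cost of possible reruns, while option 2 avoids interruptions at a higher price.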

Issue: Cannot find input files or directory

Example error message: No such file or directory

What to do: There are two ways to specify input files (see tool documentation):

  1. Select inputs from the drop-down menu.

  2. Specify input paths in the command using mounting, that is, add “/mnt/project/” as a prefix to the file path.
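Whichever way the inputs are specified, a short guard at the start of the command can turn a late "No such file or directory" into an early, explicit message. This is a sketch under assumptions: require_inputs is a hypothetical function that you would paste into the command yourself, and the path in the usage comment is the placeholder from the text.

```shell
# Hypothetical guard: stop with a clear message as soon as one input is missing.
require_inputs() {
  for f in "$@"; do
    if [ ! -e "$f" ]; then
      echo "missing input: $f" >&2
      return 1
    fi
  done
}

# Usage inside the command (placeholder path; replace before running):
# require_inputs "/mnt/project/<Path to project files>" && Rscript statistics.r
```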

Issue: How can I run my command across all chromosomes?

What to do: Use the following bash script if the data is already separated per chromosome:

  for chr in {1..22}; do
    dx run app-swiss-army-knife --instance-type mem1_ssd1_v2_x8 -y \
      -iin="project-xxxx:/path/to/file/ukb#####_c${chr}_b#_v#.bgen" \
      … ;
  done

If you need to separate the data by chromosome, you can use the plink or bcftools command in Swiss Army Knife.
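Before submitting 22 jobs, it can be worth previewing what a per-chromosome loop would launch. The sketch below is a hypothetical dry run: preview_chr_jobs only prints commands, its three arguments (folder, file-name prefix, file-name suffix) are illustrative, and the remaining arguments that the script above elides are left out here as well.

```shell
# Hypothetical dry run: print one dx run command per autosome, submit nothing.
# $1 = project folder, $2 = file-name part before the chromosome number,
# $3 = file-name part after it.
preview_chr_jobs() {
  chr=1
  while [ "$chr" -le 22 ]; do
    printf 'dx run app-swiss-army-knife --instance-type mem1_ssd1_v2_x8 -y -iin="%s"\n' \
      "$1/$2${chr}$3"
    chr=$((chr + 1))
  done
}

# Placeholders are kept from the script above; replace them with real values.
preview_chr_jobs "project-xxxx:/path/to/file" "ukb#####_c" "_b#_v#.bgen"
```

Checking the printed paths first (for example with dx describe) avoids launching jobs that fail on a wrong file name.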
