Accessing bulk data
Learn how to search and analyze UK Biobank bulk data files.
Finding Files Related to a Specific Participant
This section provides a detailed breakdown of how to search for an EID in participant-specific files, such as individual VCF and CRAM files. Note that these methods won't work for cohort-wide files, such as PLINK and pVCF files.
Web UI
Turn on the filters in your project by clicking the filter icon.
Use the filter picker to open the Properties filter.
Select Any Properties and type "eid" (without the quotes, in lower-case letters) in the Any Key textbox.
In the Any Value textbox, enter the 7-digit EID you're trying to locate.
Select Apply.
To search across all folders, set the search scope to Entire Project.
CLI
Search for an EID as follows, replacing "1234567" with the EID you're trying to find:
dx find data --property eid=1234567
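To look up several participants, the search command can be wrapped in a small loop. The following is a minimal sketch rather than an official recipe: the helper name eid_search_cmd and the sample EIDs are hypothetical, and the function only prints the dx find data command for each EID so that you can review it before running it.

```shell
#!/bin/sh
# Print the dx command that searches for one participant's files.
# eid_search_cmd is a hypothetical helper name, not part of the dx toolkit.
eid_search_cmd() {
    eid="$1"
    # EIDs are 7 digits long; reject anything else before building the command.
    case "$eid" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "invalid EID: $eid" >&2; return 1 ;;
    esac
    echo "dx find data --property eid=$eid"
}

# Placeholder EIDs for illustration only.
for eid in 1234567 7654321; do
    eid_search_cmd "$eid"
done
```

Once the printed commands look right, they can be run in a terminal where you are logged in with the dx toolkit and have selected your project.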
Visualizing a CRAM or VCF file using IGV.js
To visualize a CRAM or VCF file in the IGV.js genome browser, follow these steps:
Navigate to the project containing the files you want to visualize.
Select the Visualize tab.
Select the option "IGV.js Genome Browser v2.6.6 (*.bam+bai, *.cram+crai, *vcf.gz+tbi)".
Select the files you want to visualize.
If you are looking for a specific participant, enter the EID in the Search Project textbox, to quickly locate any CRAM or VCF files related to that participant.
For CRAM files, you must select both the CRAM and the associated CRAI file.
For VCF files, you must select both the VCF and the associated TBI file.
IMPORTANT: Note that IGV.js cannot visualize extremely large pVCF files, such as those provided for the 200k WES, 300k WES or 150k WGS releases. If you want to visualize variants in the 150k WGS cohort, type either "qc_metrics_graphtyper" or "qc_metrics_gatk" into the Search Project textbox and select the resulting pair of *.tab.gz and *.tab.gz.tbi files.
Select Launch Viewer.
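Because each data file must be selected together with its index, it can help to work out the index name you are looking for before opening the viewer. The sketch below assumes that an index is named by appending .crai, .tbi or .bai to the data file name; that naming convention and the helper name index_name are assumptions, so check them against the files in your own project.

```shell
#!/bin/sh
# Print the index file name expected alongside a CRAM, VCF or BAM file.
# Assumption: the index name is the data file name plus an extra extension.
# index_name is a hypothetical helper name.
index_name() {
    case "$1" in
        *.cram)   echo "$1.crai" ;;
        *.vcf.gz) echo "$1.tbi" ;;
        *.bam)    echo "$1.bai" ;;
        *) echo "no known index type for: $1" >&2; return 1 ;;
    esac
}

index_name "sample.cram"     # prints sample.cram.crai
index_name "sample.vcf.gz"   # prints sample.vcf.gz.tbi
```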
Analyzing Files with Swiss Army Knife
The Research Analysis Platform provides many different tools for analyzing files. The Swiss Army Knife app is a simple starting point for many common bioinformatics manipulations. Launching the app will instantiate a Linux VM on the cloud with several preinstalled tools, and run a user-provided command. For more information about this app and its possibilities, visit its entry in the Tools Library.
To launch Swiss Army Knife, navigate to your project and click Start Analysis. Select Swiss Army Knife and click Run Selected. Select the Analysis Inputs tab. You can choose between specifying explicit inputs or using a mounted project folder.
Explicit Inputs. Use this strategy to analyze files that will be first downloaded on the local disk of the cloud VM.
Click Input files. Navigate to a folder of interest (for example, Bulk > Genotype Results > Genotype calls), and tick the files of interest (for example, <Chromosome 21 file>.bed, <Chromosome 21 file>.bim and <Chromosome 21 file>.fam). Click Select as Input.
In the Command line textbox, enter a command, referring to files directly by their names, for example:
plink --bfile <Chromosome 21 file> --maf 0.1 --out filtered_chr21
Mounted project folder. Use this strategy to analyze files that will be streamed directly without first writing them on disk.
In the Command line textbox, enter a command, referring to any files in the project using the prefix /mnt/project, for example:
plink --bfile "/mnt/project/Bulk/<Path to chromosome calls>" --maf 0.1 --out filtered_chr21
It is also possible to combine the two strategies. For example, you can provide an R script as explicit input (such as statistics.r) and a command to run the script (such as Rscript statistics.r), and inside the script you can read any project files by opening them from the /mnt/project folder (such as fields <- read.csv("/mnt/project/<Path to project files>", sep="\t")).
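A job using the mounted project folder can also be started from the command line instead of the web interface. The sketch below only prints the command so it can be inspected first. It assumes that the Swiss Army Knife app takes its command through an input named cmd (passed as -icmd); that input name, the helper name sak_mounted_cmd and the example path are assumptions to verify against the app's entry in the Tools Library.

```shell
#!/bin/sh
# Print a dx run command for Swiss Army Knife that reads PLINK files
# through the mounted project folder.
# sak_mounted_cmd is a hypothetical helper; the -icmd input name is an assumption.
sak_mounted_cmd() {
    bfile="$1"   # PLINK file prefix, relative to the project root
    out="$2"     # output prefix
    echo "dx run app-swiss-army-knife -y -icmd='plink --bfile \"/mnt/project/$bfile\" --maf 0.1 --out $out'"
}

# The path below is a placeholder, not a real project path.
sak_mounted_cmd "Bulk/path/to/chr21" "filtered_chr21"
```

Printing the command keeps the example safe to try; the inner double quotes around the /mnt/project path are there so that folder names containing spaces survive intact.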
Troubleshooting
For general tips on troubleshooting, see guide.
Failed job
Error while running the command (please refer to the job log for more information)
Error while mounting project-GKv25k0Jv6jzF4Gz2z0pxzzJ in /mnt/project (please refer to the job log for more information)
Error opening file: Bulk/Exome\sequences/Exome\OQ
Check that the file path works using dx describe project-xxx:file-xxx or dx describe <filename> (with a relative path)
Check that there are no typos in your file path
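For files read through the mounted project folder, a path can also be tested inside the job before the main command runs. This is a minimal sketch under the assumption that the command runs in a POSIX shell; check_path is a hypothetical helper name and the path shown is a placeholder. Unquoted spaces in folder names may be one source of path errors, which is why the argument is quoted throughout.

```shell
#!/bin/sh
# Report whether a path exists. Quoting "$1" keeps spaces in folder names intact.
# check_path is a hypothetical helper name.
check_path() {
    if [ -e "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Placeholder path; substitute the path used in your own command.
check_path "/mnt/project/Bulk/some folder/some file.txt" || true
```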
Issue with instance type
The machine running the job was terminated by the cloud provider
The spot instance that you’re running your job on may become unavailable in the middle of your execution during a period of increased demand for cloud computing capacity, which means that you will lose the work done on this instance and will have to restart the job.
To avoid this you have a few options:
2) Use the "High" priority settings in order to run your jobs on-demand and thus avoid SpotInstanceInterruption errors entirely, but these instances have a higher price.
3) Restart your job on a spot instance manually
Warning: Low disk space during this job
Cannot find input files or directory
No such file or directory
Select inputs from the drop down menu
Specify input paths in the command using the mounted project folder, that is, add “/mnt/project/” as a prefix to the file path
How can I run my command across all chromosomes?
Use the following bash script if data is already separated per chromosome:
for chr in {1..22}; do
  dx run app-swiss-army-knife --instance-type mem1_ssd1_v2_x8 -y \
    -iin="project-xxxx:/path/to/file/ukb#####_c${chr}_b#_v#.bgen" \
    … ;
done
If you need to separate the data by chromosome, you can use the plink or bcftools command in Swiss Army Knife.
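As an illustration of splitting with plink, the sketch below prints one command per autosome that extracts that chromosome from a combined fileset using --chr and --make-bed. The helper name split_cmd, the fileset prefix all_chromosomes and the output naming are hypothetical; the commands are printed rather than run so that they can be pasted into a Swiss Army Knife command once the names match your data.

```shell
#!/bin/sh
# Print a plink command that extracts one chromosome from a combined fileset.
# split_cmd is a hypothetical helper name.
split_cmd() {
    split_prefix="$1"   # PLINK fileset prefix (placeholder)
    split_chr="$2"      # chromosome number
    echo "plink --bfile $split_prefix --chr $split_chr --make-bed --out ${split_prefix}_c${split_chr}"
}

# Loop over the 22 autosomes and print one command for each.
chr=1
while [ "$chr" -le 22 ]; do
    split_cmd "all_chromosomes" "$chr"
    chr=$((chr + 1))
done
```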