Guide to Analyzing Large Sample Sets

Introduction

The UK Biobank Research Analysis Platform (UKB RAP) hosts a wide array of biomedical data sampled from hundreds of thousands of individuals across many years, containing varied types of data ranging from MRI images to accelerometer measurements. The platform gives researchers the opportunity to conduct analyses at an increasingly large scale in varied ways (e.g. QC-ing the sequencing data, performing whole genome variant calling, or genotyping a particular gene). However, processing data at this magnitude presents RAP researchers with multiple challenges, including how to:

  • encapsulate the analysis algorithm so it runs efficiently on the platform

  • break up the processing of the large data sets into parallel jobs

  • submit and monitor multiple job executions

  • identify and resubmit failed jobs.

In this guide, we will go over an example of how to perform HLA typing on 200K exome samples on the UKB RAP platform in a cost-efficient way. We will then provide guidelines for extending the techniques used in the example to other types of analyses that users may wish to perform.

This guide assumes the user has:

HLA Typing Overview

In our example, we'll perform HLA typing on 200K exome samples on the UKB RAP platform. The HLA (human leukocyte antigen) complex is one of the most diverse gene complexes found in humans and plays a central role in human immunity. Mutations in this complex may be linked to autoimmune disorders, and researchers are often interested in identifying them because they can provide insight into treatments for various autoimmune conditions such as type 1 diabetes or rheumatoid arthritis.

For this tutorial, the locations of the files we need are as follows:

  • The inputs to our HLA typing analysis are:

    • 1: The 200K read-mapped samples, available in a UKB RAP-dispensed project with access to UKB Data-Field 23153; these files are found in the folder containing the exome OQFE CRAM files.

    • 2: The reference genome, which can be fetched from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa using the url_fetcher app and stored in the genome_reference folder in your project (see the sketch below).
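
A sketch of fetching the reference with the url_fetcher app, assuming the app's URL input is named url (check dx run url_fetcher -h for the exact input name):

$ dx run url_fetcher \
    -iurl="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa" \
    --folder=/genome_reference \
    -y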

Instance types

  • HLA typing runs independently on each sample, and each sample corresponds to one individual. HLA typing of one exome sample takes about 11 minutes on a mem1_ssd1_v2_x2 instance.

Outputs

  • We will store the output of the analysis (the HLA type in a file with the .genotype.json extension and the HLA expression level in a file with the .genes.json extension) in the "/HLA_process" folder of the RAP-dispensed project.

Since the UK Biobank contains so many samples, the naive approach of running one job per sample is inefficient because of the overhead of submitting, scheduling, and managing 200,000 jobs. We therefore suggest reducing the total number of jobs by processing a batch of 100 samples in each job.

We recommend structuring the computation so that the runtime of each job is less than a day, to decrease the chance of job failure due to spot termination. In the example below, this is achieved by using mem1_ssd1_v2_x2 instances to process a batch of 100 samples in about 19 hours.
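
As a quick back-of-the-envelope check of this sizing, based on the per-sample runtime quoted above:

$ echo "$(( 200000 / 100 )) jobs of 100 samples each"                      # 2000 jobs
$ awk 'BEGIN { printf "%.1f hours of compute per job\n", 100 * 11 / 60 }'  # ~18.3 hours, under the one-day target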

Here is a brief overview of the steps to our analysis. Later in this tutorial, we will describe each step in greater detail:

  1. Prepare the applet:

    1. Package the HLA analysis tools into a Docker image and upload the image to RAP.

    2. Create an applet using WDL (see documentation here) that performs HLA typing using the Docker image from the previous step. The applet takes an array of exome sample files as input.

    3. Compile the WDL task using dxCompiler to a DNAnexus applet.

  2. Generate job submission script:

    1. Efficiently fetch the 200K input file names from RAP.

    2. Create job submissions where each job processes 100 samples.

  3. Submit and monitor jobs:

    1. Run one job to make sure there are no errors in the code.

    2. Run the rest of the jobs.

    3. Monitor job execution.

    4. Resubmit any jobs that failed due to external factors such as spot instance termination.

Preparing the Applet

Packaging HLA tools into a Docker image and uploading to RAP

  1. First, launch the cloud_workstation app so that the large Docker image is uploaded to DNAnexus over the high network bandwidth available between AWS EC2 instances and AWS S3 buckets in the same region. Make sure that the Docker command line tool is installed on the workstation. The command is:

    dx run cloud_workstation --ssh
  2. At the cloud_workstation prompt, create a docker folder containing a Dockerfile with the content available for download in this github repository; the Dockerfile describes the installation of the samtools, bedtools, kallisto, and arcasHLA tools in the Docker image.

  3. Build the Docker image:

    sudo docker build -t arcas_hla:0.0.1 docker/
  4. Save Docker image as a tarball:

    sudo docker save arcas_hla:0.0.1 | gzip -c > arcas_hla_0.0.1.tar.gz
  5. Upload Docker image to the parent project:

    dx upload -p arcas_hla_0.0.1.tar.gz --path hla_project:/Docker/arcas_hla_0.0.1.tar.gz

We recommend encapsulating analysis tools in a Docker image to preserve reproducibility. We also recommend storing Docker images on the platform, rather than in external Docker registries such as Docker Hub and Quay.io, for better reliability.

Creating an applet in WDL

On your local computer, create an applet using Workflow Description Language (WDL) that executes using the Docker image from the previous step. The applet takes an array of sample files as input. Below is the code for our applet:

WDL_APPLET
$ cat arcas_hla.wdl 
version 1.0

task arcas_hla_cram_instance_bundle{
    input {
        Array[File]+ mapped_read
        File reference
    }

    command <<<
        set -x -e -o pipefail
        mkdir output
        for input in ~{sep=" " mapped_read}; do
            file_prefix=$( basename $input ".cram")
            time samtools view -b -h ${input} -T ~{reference} > ${file_prefix}.bam
            time arcasHLA extract ${file_prefix}.bam -o output --paired -t $(nproc) -v
            time arcasHLA genotype output/${file_prefix}.extracted.1.fq.gz output/${file_prefix}.extracted.2.fq.gz -g A,B,C,DPB1,DQB1,DQA1,DRB1 -o output -t 8 -v
            rm output/*.fq.gz output/*.alignment.p ${file_prefix}.bam
        done
    >>>
    output {
        Array[File] genotype = glob("output/*.genotype.json")
        Array[File] gene = glob("output/*.genes.json")
        Array[File] log_files = glob("output/*.log")
    }
    runtime {
        docker: "dx://hla_project:/Docker/arcas_hla_0.0.1.tar.gz"
        dx_timeout: "48H"
        dx_instance_type: "mem1_ssd1_v2_x2"
    }
    parameter_meta {
    mapped_read: {
        description: "mapped short read data",
        patterns: ["*.cram"],
        stream: true
    }
    reference: {
        description: "reference genome",
        patterns: ["*.fa","*.fasta"]
    }
    }
}

Line annotations:

Line 18: Remove intermediate outputs after each sample is processed in a batch for greater storage efficiency.

Line 27: Public docker registries have a daily pull limit. We save the Docker image on RAP for better reliability.

Line 28: Use the appropriate timeout policy based on the expected runtime of your job to ensure job costs remain under control on the off chance that your job hangs. While rare, running at scale on AWS virtual machines increases the chances that at least one job will need to time out and be restarted.

Line 35: Streaming allows the system to avoid downloading the entire batch of inputs, and instead streams each input when it is read in the samtools view line.

Compile the WDL task using dxCompiler to a DNAnexus applet

  1. Install dxCompiler on your local machine following the installation instructions.

  2. Compile WDL code into a DNAnexus applet using dxCompiler:

 $ java -jar dxCompiler-2.4.10.jar compile arcas_hla.wdl
 ...
 // builds applet arcas_hla_cram_instance_bundle in the DNAnexus project

Note that by default, compiled DNAnexus applets are configured to auto-restart on transient failures, as documented in the restartOn field of the executionPolicy argument in the DNAnexus documentation. You can adjust the restart policy by providing an extras.json file to dxCompiler, as shown in the dxCompiler documentation here and in the example below.
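
For example (a sketch; the contents of extras.json follow the schema in the dxCompiler documentation), the extras file is passed to the compiler with the -extras flag:

$ java -jar dxCompiler-2.4.10.jar compile arcas_hla.wdl -extras extras.json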

Generating Job Submission Script

Efficiently fetch the 200K input file names from RAP

$ dx find data --path "<path to 200k exome files>" --name "*.cram" --delim > inputfile.txt

To sort your input file, use the standard bash sort:

sort -t$'\t' -k 4 -o inputfile.txt inputfile.txt

This command sorts the entries by the fourth tab-delimited field, which is the full path of each file.

If you need other information that is not present in the default dx find data output, you can add --json to dx find data and use jq to extract the fields you need.
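
For example, the sketch below (assuming each result in the --json output carries an id field and a describe block with folder and name fields) writes the folder, name, and file ID as tab-separated columns:

$ dx find data --path "<path to 200k exome files>" --name "*.cram" --json \
    | jq -r '.[] | [.describe.folder, .describe.name, .id] | @tsv' > inputfile_custom.txt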

Create job submissions where each job processes 100 samples

We create job submission commands using the following script (the latest version can be found here):

JOB_SUBMISSION
$ cat create_job_submission.py
import sys
import math

input_file=sys.argv[1]
batch_size=int(sys.argv[2])

def _parse_dx_delim(delim_line):
    '''parse each list of delim output from dx find into NAME, ID, SIZE, and FOLDER'''
    size=delim_line[2]  # human-readable size string from dx find; not used downstream
    id=delim_line[-1]
    split_path=delim_line[3].split('/')
    folder='/'+'/'.join(split_path[:-1])
    name=split_path[-1]

    return name,id,size,folder

fd=open(input_file)
lines=fd.readlines()
sample_number=len(lines)
batch_mapped_files=''
input_number=0
number_of_batch = int(math.ceil(sample_number*1.0/batch_size))
for batch_number in range(number_of_batch):
    batch_mapped_files=''
    for member in range(batch_size):
        delim_line = lines[input_number].strip().split('\t')
        name, id, size, folder = _parse_dx_delim(delim_line)
        batch_mapped_files += '-imapped_read={} '.format(id)
        final_folder='/HLA_process/' + str(batch_number)
        input_number+=1
        if input_number == sample_number:
            break

    print('dx run /arcas_hla_cram_instance_bundle \
     -ireference=genome_reference/GRCh38_full_analysis_set_plus_decoy_hla.fa {batch_mapped_files} \
     --folder="{final_folder}" \
     --tag 200K_exome_HLA_analysis \
     --tag original \
     --tag batch_n_{batch_number} \
     --priority normal \
     -y \
     --brief \
          '.format(
  batch_mapped_files=batch_mapped_files,
  batch_number=batch_number,
  final_folder=final_folder))

$ python create_job_submission.py inputfile.txt 100 > submission_command.txt

Line annotations:

Line 30: Store the output for each batch (100 samples in this case) in a dedicated folder. This avoids creating too many files in a single directory and makes it easier to track errors when some jobs produce an unexpected number of output files.

Lines 38-40: We tag each job with 3 tags:

  • 200K_exome_HLA_analysis is the name of the study and helps distinguish jobs from this analysis from other work you may be doing in the same project.

  • original indicates that this is the first (original) attempt at running a job. Subsequent reruns of failed jobs will be tagged with rerun{rerun_attempt}.

  • batch_n_{batch_number} identifies the particular batch of 100 samples processed by the job.

These tags illustrate the use of execution metadata to help track the progress of your analysis, identify which studies had all their jobs complete successfully, and restart any failed jobs. Metadata consisting of tags and properties can be associated with DNAnexus objects such as files and executions and is documented here.
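
As a quick check before submitting anything: 200,000 samples in batches of 100 should produce 2,000 submission commands, one per line:

$ wc -l submission_command.txt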

Submitting and Monitoring Jobs

Running one job to check for errors

The command below shows the dx run invocation for the first job (wrapped across several lines here for readability; in submission_command.txt each command occupies a single line), then submits it:

$ head -1 submission_command.txt
dx run /arcas_hla_cram_instance_bundle  \
  -ireference=genome_reference/GRCh38_full_analysis_set_plus_decoy_hla.fa
 -imapped_read=<file-id0> … -imapped_read=<file-id99> \
 --folder="/HLA_process/0" \
 --tag 200K_exome_HLA_analysis --tag original --tag batch_n_0 \
 --priority normal -y --brief
$ head -1 submission_command.txt | sh

Monitor the job's progress with dx watch.
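
Since each generated command includes -y and --brief, dx run prints only the new job's ID. As an alternative to piping the first command directly to sh as shown above, you can capture that ID and follow the job's log (a sketch):

$ job_id=$(head -1 submission_command.txt | sh)   # --brief makes dx run print just the job ID
$ dx watch "$job_id"                              # stream the job's log until it finishes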

Running remaining jobs

We recommend submitting jobs gradually rather than all at once. Submit the first job and confirm that it produces the expected output in the right location. After that, submit another 500 jobs and check that the variation in running time and cost among these jobs is within the expected range before submitting the rest of your jobs.

The code below creates a new list of submission commands by removing the previously launched first submission, then splits the remaining 1999 submissions into batches of 500:

$ tail -n +2 submission_command.txt > submission_command_remainder.txt
$ split -l 500 submission_command_remainder.txt submission_command_new
# This generates submission_command_newaa, submission_command_newab, submission_command_newac, and submission_command_newad. Submit each split file using the commands below:
$ sh submission_command_newaa  
$ sh submission_command_newab  
$ sh submission_command_newac  
$ sh submission_command_newad

Monitoring job executions

We can monitor the execution of the analysis using the dx command-line tool to search for jobs tagged with 200K_exome_HLA_analysis and display only the last n jobs that we've submitted:

$ dx find jobs --tag 200K_exome_HLA_analysis --origin-jobs -n <number of total batches we've submitted>

Similarly, you can view the jobs corresponding to your analysis in the web browser UI by filtering on the 200K_exome_HLA_analysis tag on the Monitor page of your project.

Resubmitting any jobs that failed due to external factors

If you decide not to use a retry policy, some jobs may occasionally fail due to sample-specific issues or due to external factors such as spot instance termination or other intermittent system errors. You can find failed jobs by setting the job state filter to failed in the Monitor tab of the web browser UI, or by using the dx command line tool as shown below:

$ dx find jobs --state failed --tag 200K_exome_HLA_analysis --origin-jobs -n 10

After fixing the issues associated with a particular failed job, resubmit the job using a distinguishable tag so that you can track which batches have already been analyzed. For example, if the original jobs have the tags 200K_exome_HLA_analysis, batch_n_0, and original, you may resubmit the job using --tag 200K_exome_HLA_analysis --tag batch_n_0 --tag rerun1.

To retry a job that failed due to an intermittent system error such as spot instance termination or a network connectivity problem, you can use:

$ dx run --clone <job-id> --tag 200K_exome_HLA_analysis --tag batch_n_0 --tag rerun1 -y --brief

If you had to fix your analysis code and want to rerun a failed job with a new applet, you can use the following:

$ dx run new_applet_executable --clone <job-id> --tag 200K_exome_HLA_analysis --tag batch_n_0 --tag rerun1 -y --brief

Tip: Submitting a large batch of jobs via Swiss Army Knife

To avoid problems caused by an internet disconnection while submitting a large batch of jobs, you can upload the submission file to your project and use the swiss-army-knife app to submit the jobs from within the platform. In that case, make sure each dx run command uses --detach; otherwise every subjob inherits its priority from the main swiss-army-knife job, so all of the batch jobs might run on-demand. Another option is to use --head-job-on-demand to request that the head job of an app or applet run on an on-demand instance, which is an especially good option for workflows. Note that the --head-job-on-demand option overrides the --priority setting for the head job.
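
A minimal sketch of this approach, assuming the swiss-army-knife app's in and cmd inputs and that --detach has been appended to every dx run line in the submission file:

$ dx upload submission_command_remainder.txt --path hla_project:/submission_command_remainder.txt
$ dx run swiss-army-knife \
    -iin=hla_project:/submission_command_remainder.txt \
    -icmd="sh submission_command_remainder.txt" \
    --tag 200K_exome_HLA_analysis -y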

General Guidelines

  1. Before developing your analysis, define what each "unit" of independent work consists of so you can break your overall analysis down into multiple smaller pieces for parallel processing. For example, for the HLA typing example or for per-sample variant calling, each individual sample can be considered one independent unit. For joint variant calling, your "unit" might consist of a small genomic region across all samples.

  2. Plan for your batch run using an end-to-end approach that covers naming of the analysis, preparing and submitting jobs, and organizing output files.

    1. It is good practice to use a human-readable name like <sample ID>.<type of file or processing>.<file format extension> for ease of reviewing or troubleshooting your work. For example, in HLA typing, we name the output 12345_6789_0.genotype.json, which follows <sample ID>.<type of file>.<file format extension>.

    2. Keep the number of files per folder under 10,000 to make viewing and querying more efficient.

  3. To analyze hundreds of thousands of units, analyze multiple units in a single job to reduce the total number of jobs to submit and manage. To limit the impact of spot instance termination, we recommend limiting the runtime of each job to about a day by selecting an appropriate instance type. Executing jobs on larger instances with more CPUs can be used to decrease job execution time.

    1. Recommended job sizing, with the HLA example for comparison:

                                     Recommended for efficiency   HLA Example
       Number of concurrent jobs     <=5,000                      2,000
       Number of inputs per job      <=1,000                      100
       Job run time                  ~1 day                       19 hours

    2. A large number of jobs is hard to manage or modify. If you have more than 5,000 inputs to analyze, consider combining multiple inputs per job or scaling up your job submissions gradually. If you have solid control over the input data and the gradual submission process, you do not need to group inputs.

  4. Encapsulate your analysis tools in a Docker image for better reproducibility. A Docker image includes an operating system version, a specific version of your tools, and their dependencies. Specify an explicit version (instead of latest or the default) of external tools such as samtools or bamtools in the Dockerfile so that the image can be rebuilt reproducibly from the Dockerfile. Store Docker images on the platform for use in DNAnexus apps, rather than in external Docker registries such as Docker Hub and Quay.io, for better reliability and to avoid the pull limits imposed by public Docker registries.

  5. Optimize applet execution

    1. Use available CPUs: In the HLA example, each execution of the applet processed 100 samples. It's important for the applet to use the available CPUs efficiently, as it will be executed 2,000 times. During the applet's execution, the 100 input samples were analyzed serially, as shown on lines 13-19 of the WDL_APPLET code snippet above (i.e. the second sample was analyzed only after the analysis of the first sample finished). Since samtools and arcasHLA are multi-threaded, the sequential processing of samples still resulted in high CPU utilization. If your analysis tools are not multi-threaded, you may consider processing multiple units in parallel (e.g. using xargs, as in the sketch after this list) for better CPU utilization.

    2. Manage disk space: If your app processes units sequentially and can stream the input, you can avoid downloading all the inputs at the start of the applet's execution by using the "stream" WDL input option, as shown on line 35 of the WDL_APPLET code snippet above. We also recommend removing unnecessary intermediate files to save disk space, as shown on line 18 of WDL_APPLET.

      The outputs produced by the applet in the HLA example above required little disk space, so preserving all 100 outputs until the end of the job did not exhaust the disk space on the instance running the job. If your analysis produces large outputs, you can either select an instance with more disk space (such as the mem3_ssd3 family) or implement your processing step as a native DNAnexus applet instead of WDL, which allows more control over uploading output files from the worker to the DNAnexus platform.

    3. Select an instance type that balances the CPU, memory, and disk requirements of your analysis. For example, if your analysis requires 2 GB of memory per core, you can use the mem1 instance family, while a requirement of 7 GB of memory per core calls for the mem3 instance family.

  6. For more general considerations for large-scale data analysis, please refer to the peer-reviewed publication “Ten Simple Rules for Large-scale Data Processing” published in PLOS Computational Biology.
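
As referenced in guideline 5.1, a minimal sketch of using xargs to process multiple units in parallel when a tool is single-threaded (single_threaded_tool is a hypothetical placeholder):

# Run the (hypothetical) tool on every BAM in the working directory, up to $(nproc) files at a time
$ find . -maxdepth 1 -name '*.bam' -print0 \
    | xargs -0 -P "$(nproc)" -I{} sh -c 'single_threaded_tool "{}" > "{}.out"'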

How to Cite

If you use the Research Analysis Platform for performing your analysis, please cite or acknowledge us by including the following statement "The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com)" in your work.
