Introduction

With the ever-increasing volume of next-generation sequencing data, the life sciences have undergone a revolution and are shifting to large-scale, systems-based molecular studies. Bioinformaticians are now required to develop and run complex pipelines on large high-performance computing (HPC) resources.

Scientific workflow systems are powerful tools that simplify pipeline development, promote modularization and reuse of common workflows, and support automation, reproducibility, provenance tracking, and integration with various HPC environments.

A number of open source workflow systems are available for bioinformatics, including Taverna, Galaxy, Bpipe, and Snakemake. With so many different systems, portability has become a significant problem: workflows defined in one system cannot be used in another, which is a major barrier to collaboration among bioinformaticians in the life science community.

The Common Workflow Language (CWL) is a specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to clusters, cloud, and HPC environments. It is developed by an informal, multi-vendor working group of organizations and individuals who aim to enable scientists to share data analysis workflows.

Since its inception, many open source workflow systems have added support for CWL. One of them is Toil, which can run CWL workflows on an IBM Spectrum LSF cluster. This blog covers some of the strengths and weaknesses of using Toil to run CWL workflows on an LSF cluster.

Overview of Toil

Developed by the UCSC Computational Genomics Lab, Toil is an open source workflow engine aimed primarily at large-scale biomedical computation. Its development was motivated by the shift to cloud platforms and the advent of standard workflow languages.

Toil supports workflows defined in CWL, in addition to its own native workflow definitions, which are written as Python scripts.
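
For illustration, here is a minimal sketch of a native Toil workflow in Python, loosely modeled on the Toil quickstart; the job store path and option values are arbitrary choices for this sketch, and the exact API should be checked against the installed Toil version.

from toil.common import Toil
from toil.job import Job

def hello(job, name):
    # A Toil job function: Toil injects the job object as the first argument.
    job.log("Hello, %s" % name)
    return "Hello, %s" % name

if __name__ == "__main__":
    # "./jobstore" is an arbitrary local path used only for this sketch.
    options = Job.Runner.getDefaultOptions("./jobstore")
    options.logLevel = "INFO"
    options.clean = "always"

    root = Job.wrapJobFn(hello, "world")

    with Toil(options) as toil:
        toil.start(root)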

Toil supports running workflows on a single node, on commercial clouds (Amazon Web Services, Microsoft Azure, Google Compute Engine), or on batch systems including LSF, without requiring any modification of the workflow definition.
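
As a rough, hedged illustration of that portability (the attribute name below is an assumption based on how Toil exposes its --batchSystem option, not a verified recipe), the Python sketch above could be retargeted at LSF by changing only the batch-system option, leaving the job functions untouched:

# Assumed to be equivalent to passing --batchSystem=lsf on the command line;
# leaving it unset keeps Toil's default of running on the local machine.
options.batchSystem = "lsf"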

By supporting the CWL workflow specification and a variety of computing environments, Toil achieves a high degree of workflow portability, freeing users to move their computation according to cost, time, and where their data already resides.

Running CWL flows on LSF

The “cwltoil” command with the “--batchSystem=lsf” option can be used to run CWL flows on LSF:

cwltoil --batchSystem=lsf --disableCaching --jobStore /home/qiwang/scratch/tmp --defaultMemory 200000000 --logDebug 1st-workflow.cwl 1st-workflow-job.yml

The bjobs output below shows how a job in the flow is run:

[Figure: bjobs output from a job run by Toil]

As can be seen, it is quite straightforward to run a CWL flow on LSF with Toil. It is also possible to specify memory and CPU requirements for the jobs in the flow (for example, the --defaultMemory option above sets a default memory limit for the jobs).

Some observations on how the flow is run with Toil:

  • Toil must be installed on all compute nodes since /usr/bin/_toil_worker is used to launch jobs.
  • To determine job status, Toil polls with the bjobs and bacct commands every 10 seconds, which is not very efficient.
  • Files in the job work directory are copied to the job store before the job runs. The default job store is under /tmp, so if the compute node and the submission node (where the cwltoil command is issued) are different, the job store should be placed in a shared location (as with --jobStore in the command above).
Some of the limitations of running CWL flows on LSF with Toil:

  • Only CPU, memory, and disk are supported as resource requirements, even though LSF provides much more comprehensive resource requirement support.

    This is understandable: because Toil supports running workflows across a variety of computing platforms, only the least common denominator of resource requirements across those platforms can be supported. This also explains some of the other limitations listed below.

  • Resource requirements are specified on the cwltoil command line and therefore apply to all jobs within the flow; they cannot be specified for individual jobs. Doing so would require defining the jobs in Python rather than in CWL (see the sketch after this list).
  • No bsub options are supported other than resource requirements. New parameters could be introduced on the cwltoil command line to pass other LSF bsub options, but for portability the cwltoil parameters are meant to be generic and applicable to all batch systems.
  • For each workflow run, only a single computing environment can be specified; there is no support for hybrid environments or cloud bursting.
  • These limitations apply to other supported batch systems as well, such as Sun Grid Engine.
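
To illustrate what per-job resource requirements look like in Toil's native Python interface (a hedged sketch: the job functions and resource values here are invented for illustration, and the keyword arguments should be verified against the installed Toil version):

from toil.common import Toil
from toil.job import Job

def align(job, sample):
    job.log("aligning %s" % sample)

def summarize(job):
    job.log("summarizing results")

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")

    # Each job carries its own memory/cores/disk request, which the
    # cwltoil command-line route described above does not allow per job.
    root = Job.wrapJobFn(align, "sampleA", memory="8G", cores=4, disk="20G")
    root.addChildJobFn(summarize, memory="2G", cores=1, disk="1G")

    with Toil(options) as toil:
        toil.start(root)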

Conclusion

As an open community workflow standard, CWL is gaining popularity in the life science community, with more and more participating organizations and implementations. It is worth noting that with Toil (and possibly other tools), it is possible to run CWL flows on an LSF cluster while taking advantage of LSF's enterprise capabilities, such as efficient workload scheduling and fault tolerance.

As a side note, IBM Spectrum LSF Process Manager is a workflow system running on top of LSF. Research is currently underway to make LSF Process Manager work with CWL; the first step is to convert LSF Process Manager flows to CWL. Stay tuned.
