With the ever-increasing amount of next-generation sequencing data, the life sciences have undergone a revolution and are shifting to large-scale, complex, systems-based molecular studies. Bioinformaticians are now required to develop and run complex pipelines on large high performance computing resources.
Scientific workflow systems are powerful tools to simplify development of pipelines, promote modularization and reuse of common workflows, and support automation, reproducibility, provenance tracking, and integration with various HPC environments.
A number of open source workflow systems exist for use in bioinformatics, among them Taverna, Galaxy, Bpipe and Snakemake. With so many different workflow systems, portability has become a significant problem: workflows defined in one system cannot be used in another. This has become a major barrier to collaboration among bioinformaticians within the life science community.
The Common Workflow Language, or CWL, is a specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. It is developed by an informal, multi-vendor working group of organizations and individuals who aim to enable scientists to share data analysis workflows.
Since CWL's inception, many open source workflow systems have been implementing support for it. One of them is Toil. With Toil, we can run CWL workflows on an IBM Spectrum LSF cluster. This blog covers some of the strengths and weaknesses of using Toil to run CWL workflows on an LSF cluster.
Overview of Toil
Developed by the UCSC Computational Genomics Lab, Toil is an open source workflow engine intended primarily for large-scale biomedical computation. Its development is motivated by the shift to cloud platforms and the advent of standard workflow languages.
Toil supports workflows defined in CWL, in addition to its own workflow definitions, which are written as Python scripts.
Toil supports running workflows on a single node, on commercial clouds (Amazon Web Services, Microsoft Azure, Google Compute Engine), or on batch systems including LSF, without requiring any modification of the workflow definition.
By supporting the CWL workflow specification and a variety of computing environments, Toil makes workflows highly portable, freeing users to move their computation according to cost, time, and existing data location.
Running CWL flows on LSF
The command "cwltoil" with the option "--batchSystem=lsf" can be used to run CWL flows with LSF.
cwltoil --batchSystem=lsf --disableCaching --jobStore /home/qiwang/scratch/tmp --defaultMemory 200000000 --logDebug 1st-workflow.cwl 1st-workflow-job.yml
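The same command line can also be assembled programmatically, which may help when one flow is launched repeatedly with different job stores or memory defaults. The Python sketch below is only an illustration: the helper name build_cwltoil_command and its parameters are hypothetical, and it uses only the options that appear in the command above.

```python
import shlex


def build_cwltoil_command(workflow, job_file, job_store, batch_system="lsf",
                          default_memory=None, disable_caching=True,
                          log_debug=False):
    """Assemble a cwltoil argument list (hypothetical helper, not part of Toil)."""
    cmd = ["cwltoil", f"--batchSystem={batch_system}"]
    if disable_caching:
        cmd.append("--disableCaching")
    cmd += ["--jobStore", job_store]
    if default_memory is not None:
        # Passed through unchanged; the command above uses 200000000.
        cmd += ["--defaultMemory", str(default_memory)]
    if log_debug:
        cmd.append("--logDebug")
    # The workflow description and its job (input) file come last.
    cmd += [workflow, job_file]
    return cmd


command = build_cwltoil_command(
    "1st-workflow.cwl",
    "1st-workflow-job.yml",
    job_store="/home/qiwang/scratch/tmp",
    default_memory=200_000_000,
    log_debug=True,
)
# Print the command as a shell-quoted string for inspection.
print(shlex.join(command))
```

To actually submit the flow, the list could be passed to subprocess.run(command, check=True) on a host where both Toil and LSF are available.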
bjobs output shows how a job in the flow is run:
As can be seen, running a CWL flow on LSF with Toil is quite straightforward. It is also possible to specify memory and CPU requirements for the jobs in the flow.
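One way to state such requirements is CWL's standard ResourceRequirement, where coresMin is a core count and ramMin is given in mebibytes. The sketch below builds a small tool description as a Python dictionary and writes it out as JSON, which CWL accepts as a document format. The helper with_resources and the echo tool are hypothetical examples, and how Toil translates these values into an LSF submission is not covered here.

```python
import json


def with_resources(tool, cores, ram_mib):
    """Return a copy of a CWL tool description with a ResourceRequirement set.

    Hypothetical helper for illustration; the input dictionary is not modified.
    """
    updated = dict(tool)
    requirements = dict(updated.get("requirements", {}))
    requirements["ResourceRequirement"] = {
        "coresMin": cores,   # minimum number of CPU cores
        "ramMin": ram_mib,   # minimum memory, in mebibytes
    }
    updated["requirements"] = requirements
    return updated


# A minimal command line tool that echoes its single string input.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {"type": "string", "inputBinding": {"position": 1}},
    },
    "outputs": [],
}

sized_tool = with_resources(echo_tool, cores=2, ram_mib=4096)
print(json.dumps(sized_tool, indent=2))
```

Keeping the requirement in the tool description, rather than in scheduler-specific submit options, is what lets the same flow move between batch systems unchanged.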
Some observations on how the flow is run with Toil:
Some limitations of running CWL flows on LSF with Toil:
This is understandable because Toil supports running workflows across a variety of computing platforms, and only the least common denominator of the resource requirements of the different computing platforms can be supported. This can also explain some of the other limitations listed below.
These limitations apply to other supported batch systems as well, such as Sun Grid Engine.
As an open community workflow standard, CWL is gaining popularity in the life science community, with a growing number of participating organizations and implementations. It is important to note that with Toil (and possibly other tools), CWL flows can be run in an LSF cluster, taking advantage of LSF's enterprise capabilities such as efficient workload scheduling and fault tolerance.
As a side note, IBM Spectrum LSF Process Manager is a workflow system that runs on top of LSF. Research is currently underway on making LSF Process Manager work with CWL.