Dr Andreas Moll1, Dr Stephen Mudie2, Mr Robbie Clarken3
1Australian Nuclear Science and Technology Organisation, Melbourne, Australia, firstname.lastname@example.org
2Australian Nuclear Science and Technology Organisation, Melbourne, Australia, email@example.com
3Australian Nuclear Science and Technology Organisation, Melbourne, Australia, firstname.lastname@example.org
The Australian Synchrotron, located in Clayton, Melbourne, is one of Australia’s most important pieces of research infrastructure. Light emitted from accelerated electrons, travelling at nearly the speed of light, is utilised by 10 beamlines in order to conduct a very diverse range of research. After more than 10 years of operation, the beamlines at the Australian Synchrotron are well established and the demand for automation of research tasks is growing. Such tasks routinely involve the processing of TB-scale data, online (realtime) analysis of the recorded data to guide experiments, and fully automated data management workflows. In order to meet these demands we have developed Lightflow , a generic, distributed workflow system. It has been released as open source software on GitHub and can be found at: https://github.com/AustralianSynchrotron/Lightflow
Lightflow models a workflow as a set of individual tasks arranged as a directed acyclic graph (DAG). This specification encodes the direction that data flows as well as dependencies between tasks. Each workflow consists of one or more DAGs. While the arrangement of tasks within a DAG cannot be changed at runtime, other DAGs can be triggered from within a task, therefore enabling a workflow to be adapted to varying inputs or changing conditions during runtime.
Lightflow employs a worker-based queuing system, in which workers consume individual tasks. This allows the processing of workflows to be distributed. Such a scheme has multiple benefits: It is easy to scale horizontally; tasks that can be executed in parallel are executed on available workers at the same time; tasks that require specialised hardware or software environments can be routed to dedicated workers; and it simplifies the integration into existing container based cloud environments.
In order to avoid single points of failure, such as a central daemon often found in other workflow tools, the queuing system is also used to manage and monitor workflows and DAGs. When a new workflow is started, it is placed in a special queue and is eventually consumed by a worker. A workflow is executed by sending its DAGs to their respective queues. Each DAG will then start and monitor the execution of its tasks. The diagram in Figure 1 depicts the worker-based architecture of Lightflow.
Figure 1: Worker based architecture of Lightflow
Lightflow is written in Python 3 and supports Python 3.5 and higher. It uses the Celery  library for queuing tasks and the NetworkX  module for managing the directed acyclic graphs. As redis  is a common database found at many beamlines at the Australian Synchrotron, it is the default backend for Celery in Lightflow. However, any other Celery backend can be used as well. In addition to redis, Lightflow uses MongoDB  in order to store data that is persistent during a workflow run. Examples include the aggregation of values, calculation of running averages, or the storage of flags.
Tasks can receive data from upstream tasks and send data to downstream tasks. Any data that can be serialised can be shared between tasks. Typical examples for data flowing from task to task are file paths, pandas  DataFrames or numpy  arrays. The exchange of data across a distributed system is accomplished by using cloudpickle  in order to serialise and deserialise the data. Lightflow provides a fully featured command line interface for starting, stopping and monitoring workflows and workers. The command line interface is based on the click  Python module. An API is also available, in order to integrate Lightflow with existing tools and software.
In order to keep Lightflow lightweight, the core library focuses on the essential functionality of a distributed workflow system and only implements two tasks, a generic Python task and a bash task for calling arbitrary bash commands. Specialised tasks and functionality is implemented in extensions. Currently there are three extensions to Lightflow available: The filesystem extension offers specialised tasks for watching directories for file changes and tasks covering basic file operations; the EPICS  extension offers tasks that hook into EPICS, a control system used at the Australian Synchrotron for operating the hardware devices of the accelerator and the beamlines; and the REST extension provides a RESTful interface for starting, stopping and monitoring workflows via HTTP calls.
Lightflow at the MX Beamline
The two Crystallography beamlines (MX1, MX2) at the Australian Synchrotron have employed a custom made data management workflow for a number of years. Both the raw and reconstructed data of an experiment is compressed into squashfs files, verified and stored in the central storage system of the Australian Synchrotron. Recently this workflow has been upgraded to use Lightflow in order to take advantage of a distributed system to compress multiple experiments at the same time. The updated setup consists of a management virtual machine that hosts the workflow and DAG queues as well as acting as a REST endpoint for starting the squashfs workflow. Three physical servers act as squashfs nodes. The workflow is triggered by a HTTP REST call from the experiment change management system at the Crystallography beamlines.
Lightflow at the SAXS/WAXS Beamline
Several data processing pipelines are implemented using Lightflow for the SAXS/WAXS beamline. An example is the phaseID pipeline. This pipeline identifies diffraction peak positions within SAXS profiles and infers the most likely Space Group. This pipeline enables researchers to rapidly determine phase diagrams for self-assembled lyotropic liquid crystal systems. These systems are important for drug delivery and controlled release.
Summary and Outlook
Lightflow is a lightweight and distributed workflow system written in Python and has been released as open source software on GitHub. It is currently used at several beamlines at the Australian Synchrotron for managing data or implementing data processing pipelines. The next steps are to extend the use of Lightflow at the Australian Synchrotron to the experiment change management at beamlines, complex data management workflows and auto processing workflows at the Crystallography beamlines.
- Lightflow. Available from https://github.com/AustralianSynchrotron/Lightflow, accessed 22 Jun 2017.
- Celery. Available from http://www.celeryproject.org, accessed 25 June 2017.
- NetworkX. Available from https://networkx.github.io, accessed 28 June 2017.
- redis. Available from https://redis.io, accessed 15 June 2017
- MongoDB. Available from https://www.mongodb.com, accessed 25 June 2017
- pandas. Available from http://pandas.pydata.org, accessed 17 June 2017
- numpy. Available from http://www.numpy.org, accessed 25 June 2017
- cloudpickle. Available from https://github.com/cloudpipe/cloudpickle, accessed 23 June 2017
- click. Available from http://click.pocoo.org, accessed 25 June 2017
- EPICS. Available from http://www.aps.anl.gov/epics, accessed 25 June 2017
Andreas is the leader of the software engineering team at the Australian Synchrotron in Melbourne. His and his team’s work comprises the development of experiment control systems, scientific software, data pipelines and data management tools. Before being allowed to spend his days writing Python code and learning about microservices, he had to go through a 6 year Fortran and C++ bootcamp in his PhD.