Dr. Cornelis Drost¹, Augustus Ellerm², Prof. Mark Gahagen¹, Prof. Benjamin Adams², Dr. Ryan Chard³, Prof. Ian Foster³
¹Centre for eResearch, University of Auckland, Auckland, New Zealand; ²Department of Computer Science and Software Engineering, University of Canterbury, Christchurch, New Zealand; ³Data Science and Learning Division, Argonne National Laboratory, Chicago, USA
Biography:
Cornelis is a complex systems modeller with experience in fields including ecology, archaeology, oncology, and epidemiology. Having developed models and analyses in academic settings and production software in a commercial setting, he now works as a Senior Solutions Specialist at the Centre for eResearch, where he supports researchers with modelling, data visualization, and software development. https://orcid.org/0000-0002-7355-9978
Abstract:
The comprehensive capture of data provenance information, at every stage of the scientific process, greatly enhances the findability, accessibility, interoperability, and reusability (FAIR) of that data. Extending provenance to include computational and analytical steps carried out during a scientific workflow (often automated through a workflow management system) further allows us to both reuse workflows and reproduce results.
When analyses are carried out in a distributed environment, additional challenges arise in both capturing provenance information from remote executions and representing that information robustly in a standardized format.
We discuss the development of tooling built on top of Globus Compute and the Gladier SDK, enabling the automated generation, transfer and recording of provenance information during a Globus Flow execution.
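To make the idea of automated provenance capture concrete, the sketch below shows one minimal way a remote task could be instrumented so that each invocation emits a provenance record alongside its result. The decorator, function names, and record fields here are illustrative assumptions, not the authors' actual tooling; in practice the wrapped function would be registered and executed through Globus Compute within a Globus Flow, with the records transferred back for recording.

```python
import datetime
import functools
import json
import platform

def with_provenance(record_sink):
    """Decorator that wraps a task function so each invocation appends a
    provenance record (inputs, timing, host) to record_sink. Hypothetical
    sketch: the real tooling built on Globus Compute and Gladier differs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.datetime.now(datetime.timezone.utc)
            result = func(*args, **kwargs)
            ended = datetime.datetime.now(datetime.timezone.utc)
            record_sink.append({
                "function": func.__name__,
                # Round-trip through JSON so the record is serializable
                "args": json.loads(json.dumps(args, default=str)),
                "kwargs": json.loads(json.dumps(kwargs, default=str)),
                "startTime": started.isoformat(),
                "endTime": ended.isoformat(),
                "host": platform.node(),
            })
            return result
        return wrapper
    return decorator

# Illustrative task; remotely this would run on a Globus Compute endpoint.
records = []

@with_provenance(records)
def normalise(values):
    total = sum(values)
    return [v / total for v in values]

normalise([1.0, 3.0])  # appends one provenance record to `records`
```

Capturing provenance at the task boundary like this keeps the instrumentation independent of where the function actually executes, which matters when the same task may run on several heterogeneous endpoints.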
We also introduce extensions to the Provenance Run RO-Crate specification, allowing representation of the heterogeneous execution environments and authentication requirements of each computation.
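As a rough illustration of what such an extended record might look like, the fragment below builds the JSON-LD entity for a single remote computation step. The property names `executionEnvironment` and `authenticationScope` are hypothetical stand-ins for the kinds of extension terms described, not terms from the actual specification; the file paths and scope URL are likewise invented.

```python
import json

# Hedged sketch of one computation step in an RO-Crate metadata file.
# Provenance Run Crate represents executed steps as CreateAction entities;
# the two extension properties below are illustrative names only.
compute_action = {
    "@id": "#compute-step-1",
    "@type": "CreateAction",
    "name": "Remote analysis step",
    "instrument": {"@id": "workflow/analyse.py"},
    "object": [{"@id": "data/input.csv"}],
    "result": [{"@id": "data/output.csv"}],
    # Illustrative extension: where the step ran...
    "executionEnvironment": {
        "@id": "#endpoint-a",
        "@type": "SoftwareApplication",
        "name": "HPC Globus Compute endpoint",
    },
    # ...and under which delegated identity/scope it was authorized
    "authenticationScope": "https://auth.globus.org/scopes/example/compute",
}

print(json.dumps(compute_action, indent=2))
```

Recording the environment and authorization context per step, rather than once per workflow, is what allows a single crate to describe computations spread across endpoints with differing capabilities and credentials.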
We discuss the challenges encountered in both capturing and representing provenance information, and lessons learned about how to make this process easier.
Finally, we explore the opportunities presented by having a comprehensive record of how scientific results are derived from data and analyses, including the potential to remix and reuse workflows, reproduce results, and standardize and automate parts of the scientific publishing process.