Provenance in Practice

Conveners: Andrew Treloar1, Mingfang Wu2

1 Australian National Data Service, Melbourne, Australia, andrew.treloar@ands.org.au

2 Australian National Data Service, Melbourne, Australia, mingfang.wu@ands.org.au

Presenters: Steve Cassidy3, Hamish Holewa4, Lance Wilson5, Stuart Woodman6

3 Macquarie University, Sydney, Australia, steve.cassidy@mq.edu.au

4 Griffith University, Nathan Campus, Australia, hholewa@quadrant.org.au

5 Monash University, Clayton Campus, Australia, lance.wilson@monash.edu

6 CSIRO, Clayton, Australia, Stuart.Woodman@data61.csiro.au

 

DESCRIPTION

Provenance is an essential enabler of reproducible research. It is increasingly important in the eResearch community, where research is often data intensive and involves complex data transformations and procedures. In such research, data producers may configure an instrument or simulation in a particular way to collect primary data, or apply particular methodologies and processes to extract, transform and analyse input data to produce output data. Providing provenance as part of published data (whether primary or secondary) is therefore important for determining the quality of that data, the amount of trust that can be placed in results derived from it, the reproducibility of those results and, ultimately, the re-usability of the data.

In practice, approaches to capturing and representing provenance can be described along a number of dimensions; for example, provenance may be:

  • internal or external;
  • described using a generic or a discipline-specific schema;
  • represented in machine-readable form or as a human-readable report; and
  • data-centred, process-centred or agent-centred.

Examples of the internal approach are workflow systems such as Kepler [1], Galaxy [3] or Taverna [2], which capture provenance trails inside their own infrastructure during their processing activity. Such provenance information is typically only available to other users of the same system. The external alternative is to export provenance to a separate provenance store [6].

Systems that adopt the internal approach tend to capture provenance in proprietary ways. Systems that adopt the external approach often do so in a standards-based way, such as W3C PROV-O [5]; external provenance stores use a standard because they need to interact with many different kinds of systems.
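
By way of illustration, the following Python sketch shows what the external approach can look like in practice: a small PROV-O document, serialised as Turtle, is pushed over HTTP to a provenance store. The endpoint URL, resource names and payload are hypothetical illustrations, not the documented API of any particular store such as PROMS [6].

# Minimal sketch of the "external" approach: serialise a small PROV-O
# graph as Turtle and POST it to a provenance store. The endpoint URL
# and resource names are hypothetical; a real store defines its own
# ingest API.
import requests

PROV_TTL = """
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

ex:outputDataset a prov:Entity ;
    prov:wasGeneratedBy ex:analysisRun .

ex:analysisRun a prov:Activity ;
    prov:used ex:inputDataset .
"""

# Hypothetical ingest endpoint, for illustration only.
STORE_URL = "https://provenance.example.org/reports"

response = requests.post(
    STORE_URL,
    data=PROV_TTL.encode("utf-8"),
    headers={"Content-Type": "text/turtle"},
    timeout=30,
)
response.raise_for_status()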

Provenance information can be described in a range of ways: from very generic metadata standards such as Dublin Core or the Registry Interchange Format – Collections and Services (RIF-CS), through disciplinary metadata standards such as the lineage sections of ISO 19115-2 [4], to customised schemas developed to fit particular use cases. The last two approaches offer greater disciplinary richness. Alternatively, provenance information can be described directly in the W3C Provenance Data Model (PROV-DM) and Provenance Ontology (PROV-O) [2]. Provenance captured in Dublin Core, RIF-CS or domain-specific metadata can be mapped to a PROV-O representation [1,7,8], so that it can be analysed both at a domain-specific level and at the more abstract PROV-O level.
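
As a concrete, if minimal, sketch of such a mapping, the Python fragment below uses rdflib to turn a single Dublin Core dct:creator statement into PROV-O triples. The record and agent URIs are invented, and only the simplest correspondence from the W3C Dublin Core to PROV mapping [7] is shown, not the full algorithm.

# Sketch: map a Dublin Core creator statement onto PROV-O terms.
# URIs under http://example.org/ are invented for illustration.
from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

dc_graph = Graph()
dc_graph.add((EX.dataset1, DCTERMS.creator, EX.researcherA))

prov_graph = Graph()
prov_graph.bind("prov", PROV)
for record, _, agent in dc_graph.triples((None, DCTERMS.creator, None)):
    # Each DC record becomes a prov:Entity attributed to a prov:Agent.
    prov_graph.add((record, RDF.type, PROV.Entity))
    prov_graph.add((agent, RDF.type, PROV.Agent))
    prov_graph.add((record, PROV.wasAttributedTo, agent))

print(prov_graph.serialize(format="turtle"))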

Provenance information can be captured in a way that supports machine-machine interaction (for instance, to allow resources to be identified and located, and workflows to be rerun) and/or at a higher level that allows human users (without a computing or semantics background) to more easily read the provenance trail of a data product or data-processing workflow. In some cases this might just be a textual description, but it might also involve a visualisation of the machine-readable representation, as in VisTrails [3].
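
A minimal sketch of the textual option follows: a machine-readable PROV-O graph is rendered as a short English provenance trail. The resources and sentence templates are illustrative only.

# Sketch: render a machine-readable PROV-O graph as plain English.
from rdflib import Graph, Namespace

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.plot, PROV.wasGeneratedBy, EX.analysis))
g.add((EX.analysis, PROV.used, EX.rawData))
g.add((EX.plot, PROV.wasAttributedTo, EX.alice))

# Turn selected PROV relations into short English sentences.
TEMPLATES = {
    PROV.wasGeneratedBy: "{s} was generated by {o}",
    PROV.used: "{s} used {o}",
    PROV.wasAttributedTo: "{s} is attributed to {o}",
}

def local(term):
    """Strip the namespace so the trail reads naturally."""
    return str(term).rsplit("/", 1)[-1]

for s, p, o in g:
    if p in TEMPLATES:
        print(TEMPLATES[p].format(s=local(s), o=local(o)))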

Finally, there may be different types of provenance to accommodate different use cases [2]. For example, process-centred provenance captures the actions and steps taken to generate the data in question (e.g. workflow provenance); data-centred provenance captures the history of a data collection being created, processed and accessed (e.g. data access history from an RDS node); and agent-centred provenance captures the people (researchers) or organisations that were involved in generating or manipulating a data collection. There are also cases where all three types of provenance are captured within one system and can be queried from each of the three perspectives.
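
The sketch below illustrates this last point: one small rdflib graph, populated with assumed resource names, queried with SPARQL from each of the three perspectives.

# Sketch: one provenance graph queried from the three perspectives
# named above. All resource names are invented for illustration.
from rdflib import Graph, Namespace

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.collection, PROV.wasGeneratedBy, EX.workflowRun))
g.add((EX.workflowRun, PROV.used, EX.instrumentOutput))
g.add((EX.workflowRun, PROV.wasAssociatedWith, EX.researcher))
g.add((EX.collection, PROV.wasAttributedTo, EX.researcher))

QUERIES = {
    # Process-centred: which activity produced the collection?
    "process": "SELECT ?a WHERE { ex:collection prov:wasGeneratedBy ?a }",
    # Data-centred: which inputs fed into the workflow run?
    "data": "SELECT ?d WHERE { ex:workflowRun prov:used ?d }",
    # Agent-centred: who was involved?
    "agent": "SELECT ?p WHERE { ex:collection prov:wasAttributedTo ?p }",
}

for view, q in QUERIES.items():
    rows = g.query(q, initNs={"prov": PROV, "ex": EX})
    print(view, [str(r[0]) for r in rows])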

While there are various venues for academic research on provenance, such as the International Provenance and Annotation Workshop [4], there are few venues for practitioners to share their experiences and practices in provenance management. This BoF session will provide a venue for projects to present their experiences and practices in capturing, storing and retrieving provenance information to make research outputs more trustworthy, and for communities at an early stage of managing provenance information to ask questions.

The targeted projects are the six Trusted Research Output projects [5] that ANDS funded last year. The aim of this set of projects was to improve the trustworthiness of research outputs that are not “publications”. The projects have been exploring the dimensions discussed above in differing ways, and this BoF will enable a subset of them to share their experiences and reflect on how to generalise the lessons learned.

DELIVERY

The proposed outline of this session is:

  • Introduction to the session (5 mins)
  • Four speakers share practices from their own projects (10 minutes presentation + 5 minutes Q&A each)
  • Panel discussion (20 minutes)
  • Wrap up and actions (5 minutes)

REFERENCES

  1. Feng, C. C. (2013). Mapping geospatial metadata to Open Provenance Model. IEEE Transactions on Geoscience and Remote Sensing, 51(11), 5073-5081. DOI: 10.1109/TGRS.2013.2252181
  2. Gil, Y. and Miles, S. (eds.) (2013). PROV Model Primer. Accessed on 29 May 2017 from https://www.w3.org/TR/prov-primer/
  3. Goecks, J., Nekrutenko, A., Taylor, J. and The Galaxy Team (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8), R86. DOI: 10.1186/gb-2010-11-8-r86
  4. ISO (2009). ISO 19115-2: Geographic information – Metadata – Part 2: Extensions for imagery and gridded data. Accessed on 18 January 2016 from http://www.iso.org/iso/catalogue_detail.htm?csnumber=39229
  5. Lebo, T., Sahoo, S. and McGuinness, D. (2013). PROV-O: The PROV Ontology. Accessed on 18 January 2016 from http://www.w3.org/TR/prov-o/
  6. Car, N. (2013). A method and example system for managing provenance information in a heterogeneous process environment – a provenance architecture containing the Provenance Management System (PROMS). In MODSIM2013, 20th International Congress on Modelling and Simulation, Adelaide, Australia.
  7. W3C (2013). Dublin Core to PROV Mapping. Accessed on 28 June from https://www.w3.org/TR/prov-dc/
  8. Wu, M. and Treloar, A. (2015). Metadata in Research Data Australia and the Open Provenance Model: A proposed mapping. In MODSIM2015, 21st International Congress on Modelling and Simulation, Gold Coast, Australia.

 

[1] Getting Started with Kepler Provenance 2.5 (Nov. 2015): https://code.kepler-project.org/code/kepler/trunk/modules/provenance/docs/provenance.pdf

[2] Provenance Management | Taverna: http://www.taverna.org.uk/documentation/taverna-2-x/provenance/

[3] VisTrails main page: http://www.vistrails.org/index.php/Main_Page

[4] International Provenance and Annotation Workshop Series (IPAW): http://www.ipaw.info/

[5] ANDS Project Registry: https://projects.ands.org.au/getAllProjects.php?start=tro

 


Biographies

Andrew Treloar is the Director of Technology for ANDS, with particular responsibility for international engagement. In 2008 he led the project to establish ANDS. Prior to that he was associated with a number of e-research projects as Director or Technical Architect, as well as the development of an Information Management Strategy for Monash University. His research interests include data management, institutional repositories and scholarly communication. Andrew holds a Bachelor of Arts with first-class honours, a Graduate Diploma in Computer Science, a Master of Arts in English Literature and a PhD with the thesis topic Hypermedia Online Publishing: The Transformation of the Scholarly Journal. More information about Andrew is available at http://andrew.treloar.net/, or follow him on Twitter as @atreloar.

Mingfang Wu has been a senior business analyst at ANDS since 2011. She has worked on a range of ANDS programs such as data capture, data applications, data eInfrastructure connectivity, and trusted research outputs. Mingfang co-chairs the Research Data Alliance Data Discovery Paradigms Interest Group and two Australian interest groups: Data Provenance and Software Citation. She received her PhD from the School of Computer Science at RMIT University in 2002, was a senior research fellow at RMIT from 2006 to 2011, and was a research scientist at CSIRO from 1999 to 2006, all in the area of information retrieval.

Hamish Holewa is the Project Manager of the Biodiversity and Climate Change Virtual Laboratory. Hamish has led a variety of national collaborative research eInfrastructure projects, including the Quadrant health participant management tool, the AeRO Tick Maturity Model, the ANDS MODC Open River project and the national user support framework for NeCTAR. He has worked on research-related eInfrastructure projects for 3 years and held research, management and policy development roles for the previous 10 years. He has an extensive research background, having been involved in over 16 research projects across Australia, New Zealand, the UK, China, India and Bangladesh, and is a named author on over 35 research publications.

Robert (Xiaobin) Shen joined the Astronomy Australia Ltd team in September 2016 as a program manager. Before that, he worked at the Australian National Data Service (ANDS) as a senior research analyst for 7.5 years and at the University of Melbourne as a research fellow for 3 years.

Lance Wilson is a mechanical engineer who has been making tools to break things for the last 20 years. His career has moved through a number of engineering subdisciplines, from manufacturing to bioengineering. He now supports the national characterisation research community in Melbourne, Australia, using the national research cloud, NeCTAR, to create HPC systems that solve problems too large for a laptop. The tools he makes now assist researchers in the provenance space by providing integrated data management systems and software tool version control.

Stuart Woodman is a software engineer with Data61 at CSIRO. He has designed and developed software at CSIRO since 2000, working across multiple fields and disciplines and utilising a range of different technologies. During his time at CSIRO, Stuart has also lectured at the Royal Melbourne Institute of Technology (RMIT), teaching engineering students the fundamentals of Java programming. In 2014 Stuart began assisting with the PROvenance Management System (PROMS), a collection of tools and methodologies for managing provenance information, continuing the work for CSIRO when the system's creator left the organisation. In this role he has integrated PROMS into a number of Virtual Laboratories (VLs) created at CSIRO, allowing scientists to track how information they have created, used or transformed is utilised by other parties.

