Lesley Wyborn1, Benjamin Evans1, Clare Richards1, Carina Wyborn2
1National Computational Research Infrastructure, Canberra, Australia, Firstname.Lastname@anu.edu.au
2Luc Hoffmann Institute, School of Forestry & Conservation, University of Montana, Montana, USA firstname.lastname@example.org
Exciting opportunities have emerged to undertake new scientific research across multiple domains at scales and/or resolutions never before possible. Two factors in particular are driving this: the growing alignment of large data with powerful, intelligent processing capability, and the emergence of large national facilities that are flexible enough to address the challenges of several domains simultaneously. As more data becomes accessible on these new research infrastructures, innovative ways of combining data and software are being trialled that were hitherto difficult because research infrastructure and data provision were separated and facility-based.
The National Computational Infrastructure (NCI) has championed a transdisciplinary approach to data, data services and analysis tools, offering significant opportunities for integrative research platforms across multiple domains. This work has demonstrated that integrating data of any type, from any source and across different scales, is becoming a reality. Realising this vision requires research domains to better address fundamental issues such as transforming data from incompatible formats; evaluating data for both direct access and access via network protocols; better defining their vocabularies, semantics and data structures; and updating their software to take advantage of these improvements.
Many of these time-consuming activities are beyond the reasonable timeframe of an individual research project or even a single research community. Instead, they require a concerted effort by national collaborations working within and across domains to improve both the quality of the data from individual domains and the shared capability across domain silos.
A CASE STUDY: THE NCI HIGH PERFORMANCE DATA INTEROPERABILITY PLATFORM
NCI has assembled over 10 Petabytes of reference data collections that span the Earth System Sciences, Environmental Sciences, Climate and Weather, Geosciences, Astronomy, Genomics, and Social Sciences. These disparate collections have been increasingly harmonised under NCI’s National Environmental Research Data Interoperability Platform (NERDIP) [1]. The data are sourced from major government research agencies (Bureau of Meteorology, CSIRO, Geoscience Australia), as well as from national and international research institutions, and then transformed with the aim of providing an integrated platform for research across multiple domains.
The key to enabling next-generation, data-intensive transdisciplinary research is interoperability. In developing NERDIP, a range of standards-based data management policies and procedures were developed for organising the data, together with a research environment and software practices that enable common access, via both in-situ and remotely connected computer programs, while still supporting domain-specific requirements.
VARIOUS CATEGORIES OF INTEROPERABILITY
The terms intradisciplinary, multidisciplinary, crossdisciplinary, interdisciplinary, and transdisciplinary are often used loosely and interchangeably within the physical sciences. In contrast, in the social sciences, each of these terms has a more precise definition to categorise approaches to collaborative research projects and programs, so that individual researchers can improve common understandings (e.g., [2, 3]). At NCI, we are seeing various interactions within and between research groups and domains over data sharing, and this has led us to characterise a series of behaviours that relate to how separate groups working on the same project share and interface their data. We also see these terms as part of a spectrum that defines an evolutionary pathway of increasing complexity of data integration.
We propose to adapt/extend terms as used by the social sciences (e.g., [2, 3]) for research data integration as follows:
- Intradisciplinary: Researchers work within a single discipline or data silo with all participants using the same standard and hence no reformatting or translation of data is required;
- Multidisciplinary: Researchers from different discipline silos work together and share knowledge and results, but are not actually integrating at the data level – outputs are combined at the research paper/report level;
- Crossdisciplinary: Researchers participating on a project to integrate data across the groups decide to reformat their datasets to a single agreed suite of specific standards and formats;
- Interdisciplinary: Researchers from each domain integrate their data using customized brokers that crosswalk between the different domain silos: the data of each participant remains unchanged in the back-end; and
- Transdisciplinary: Data is born connected to international standards that enable online interaction across the discipline boundaries and beyond academia: researchers participate with stakeholders who can also contribute data.
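The interdisciplinary case above can be pictured with a minimal sketch in Python. Everything in it is hypothetical: the two domain records, the field names, the units and the mapping tables are invented for illustration and do not describe NERDIP or any real broker. It only aims to show that a broker translates on read, so each group's back-end record stays unchanged.

```python
from types import MappingProxyType

# Hypothetical back-end records from two domain silos (invented fields and units).
# MappingProxyType makes them read-only, mimicking data the broker must not alter.
GEOSCIENCE_RECORD = MappingProxyType(
    {"lat_dd": -35.28, "lon_dd": 149.13, "elev_m": 578.0}
)
CLIMATE_RECORD = MappingProxyType(
    {"latitude": -35.28, "longitude": 149.13, "altitude_km": 0.578}
)

# Hypothetical crosswalks: source field -> (shared field, unit conversion).
CROSSWALKS = {
    "geoscience": {
        "lat_dd": ("latitude_deg", lambda v: v),
        "lon_dd": ("longitude_deg", lambda v: v),
        "elev_m": ("elevation_m", lambda v: v),
    },
    "climate": {
        "latitude": ("latitude_deg", lambda v: v),
        "longitude": ("longitude_deg", lambda v: v),
        "altitude_km": ("elevation_m", lambda v: v * 1000.0),
    },
}


def broker_view(domain, record):
    """Return a translated copy of `record` in the shared vocabulary.

    The source record is only read, never modified.
    """
    crosswalk = CROSSWALKS[domain]
    return {
        shared_name: convert(record[source_name])
        for source_name, (shared_name, convert) in crosswalk.items()
    }


if __name__ == "__main__":
    print(broker_view("geoscience", GEOSCIENCE_RECORD))
    print(broker_view("climate", CLIMATE_RECORD))
```

In this sketch every additional domain needs its own crosswalk table, which hints at why broker-based integration becomes harder to maintain as more groups join.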
Today, most projects conducting research that requires data integration across one or more domains are either crossdisciplinary or interdisciplinary integrations. However, as the number of participating research groups increases, the limits of these approaches become apparent. Transdisciplinary data integration is clearly the way forward, but it will depend on the adoption of international standards that increase data interoperability (e.g., W3C, OGC, IEEE), on groups supporting integration (e.g., Research Data Alliance) and on improvements in software reusability.
APPLYING FAIR DATA PRINCIPLES: A PATHWAY TO TRANSDISCIPLINARY RESEARCH
Major international infrastructure investments have promoted the development of FAIR data. The FAIR guiding principles for Findable, Accessible, Interoperable and Reusable data publishing [4] were developed by the FORCE11 community in 2016 to enable optimal use of research data across multiple stakeholders. The following describes how the FAIR principles have been applied at NCI:
- Findable: The datasets on the NCI NERDIP have catalogue entries that are accessible via human and machine harvestable interfaces. The metadata standard used is conformant with the ISO 19115 standard for discovery of geospatial information and can be cross-walked with the RIF-CS profile of ISO 2146 used by ANDS Research Data Australia, as well as the Dublin Core and Data Catalog (DCAT) metadata standards used by data.gov.au. Conforming with multiple metadata standards and profiles significantly increases the discoverability of NCI datasets, both nationally and internationally.
- Accessible: Datasets on the NCI platform are made accessible for general research access (e.g., data download for small file sizes), as well as being suitable for advanced techniques and multiple applications, including virtual laboratories, portals, common desktop tools, and programmatic access via well-known network protocols.
- Interoperable: Wherever possible, international data standards for interoperability are applied, including metadata standards at both the data service and data levels; controlled vocabularies; interchangeable, self-describing data formats (e.g., NetCDF4/HDF5); and access via network protocols and community-standard APIs.
- Reusable: Rigorous QA/QC procedures are used to validate the data against standards so that users are assured that the data can be accessed in consistent ways. The QA/QC validation also demonstrates that the data work across the different (non-domain-specific) packages, tools and programming languages deployed by the various user communities, thus extending the use of the data across domain silos.
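As a rough illustration of the kind of automated check that validation against standards implies, the following Python sketch tests whether a dataset's global metadata attributes are present and non-blank. It is an assumption-laden example, not NCI's procedure: the list of required attributes and the example record are hypothetical, and a plain dictionary stands in for attributes that would in practice be read from a self-describing file.

```python
# Hypothetical checklist of global attributes; not NCI's actual QA/QC rules.
REQUIRED_GLOBAL_ATTRS = ("title", "summary", "Conventions", "license")


def check_global_attributes(attrs, required=REQUIRED_GLOBAL_ATTRS):
    """Return a sorted list of required attributes that are missing or blank."""
    problems = []
    for name in required:
        value = attrs.get(name)
        # Treat absent, None, and whitespace-only values as failures.
        if value is None or not str(value).strip():
            problems.append(name)
    return sorted(problems)


# Invented example record: 'summary' is blank and 'license' is absent.
example_attrs = {
    "title": "Example gridded dataset",
    "summary": "",
    "Conventions": "CF-1.6",
}

if __name__ == "__main__":
    print(check_global_attributes(example_attrs))  # ['license', 'summary']
```

A check like this only covers metadata completeness; confirming that data can be read consistently by different tools and languages would need additional tests.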
At NCI we are steadily building a trustworthy, transdisciplinary High Performance Data Platform. Researchers are able to share, use and reuse significant data collections that were previously difficult to both discover and access. Users can access these datasets in a consistent manner, which supports both cross-domain and discipline-specific access. The data collections are suitable for use within a high-end computational and data-intensive environment, with programmatic access enabling new analysis techniques while still supporting more traditional analytical techniques.
Transdisciplinary research is increasingly required for high impact research. How quickly it can progress will depend on (1) national funding to support cross-domain infrastructure development, (2) the ongoing adoption and improvement of international standards to support such research, and (3) continual software improvements to take advantage of these infrastructures. In the interim, organization of data around the FAIR principles [4] is enabling new and innovative data-intensive research. The NCI NERDIP infrastructure is a well-used example of this, freeing researchers from time-consuming data wrangling and thus enabling them to spend more effort on groundbreaking, integrative research.
REFERENCES
[1] The NCI National Environmental Research Data Interoperability Platform. Available from https://nci.org.au/services/vdi/nerdip/, accessed 30 June 2017.
[2] Stember, M., 1990. Advancing the Social Sciences Through the Interdisciplinary Enterprise. The Social Science Journal, 28, 1-14.
[3] Stock, P., and Burton, R.J.F., 2011. Defining Terms for Integrated (Multi-Inter-Trans-Disciplinary) Sustainability Research. Sustainability, 3, 1090-1113. doi:10.3390/su3081090
[4] The Force 11 FAIR data principles. Available from https://www.force11.org/fairprinciples, accessed 30 June 2017.
Lesley Wyborn is a geochemist by training and worked for BMR/AGSO/GA for 42 years in a variety of geoscience and geoinformatics positions. In 2014 she joined the ANU and currently holds a joint adjunct fellowship with the National Computational Infrastructure and the Research School of Earth Sciences. She has been involved in many NCRIS-funded eResearch projects over the years. She is Deputy Chair of the Australian Academy of Science ‘Data for Science Committee’, co-chair of several RDA Interest Groups, and a member of the AGU Earth and Space Science Executive Committee.