Les Kneebone1, Steven McEachern2, Janet McDougal2
1Analysis & Policy Observatory, Melbourne, Australia
2Australian Data Archive, Canberra, Australia
The research community has been witnessing new and innovative approaches to making data objects discoverable, outside of traditional scholarly publishing contexts. Datasets referenced within scholarly publications can be made persistently identifiable using the same identifying approaches used for the publications themselves. Datasets can be stored, discovered and reused via specialist dataset repository platforms. Therefore, creating graph databases of interlinked research data and publications is now a reality.
Linking grey literature publications to datasets presents special challenges. Analysis & Policy Observatory (APO) has, since 2004, collected grey policy literature and organized its collection with ubiquitous and emerging metadata standards. APO now focuses on expanding the reach of its collection by establishing links with other research objects. Datasets from which policy reports are derived are of special interest. APO is therefore working with the Australian Data Archive (ADA) to connect ADA datasets with APO grey literature.
The challenge for grey literature and dataset linking
Persistent identifiers (PIDs) for digital information objects are well recognized as key data points that form the basis for links between objects. PIDs such as Digital Object Identifiers (DOIs) have enjoyed significant uptake in traditional academic publishing contexts. Minting DOIs for grey literature, in contrast, is an exceptional practice in the policy sector. APO is taking a lead in promoting the use of DOIs for grey literature; nonetheless, DOI coverage remains sporadic in policy collections. A similar situation exists for datasets: DOIs are often minted after original publication, and only once the datasets have been harvested and curated within specialist data repositories. ADA, like APO, has undertaken the significant challenge of assigning PIDs to datasets. The challenge for linking grey literature, then, is one in which structured publication data is not always available to work with.
The response from ADA and APO
Researchers cannot wait for all research objects to become persistently identified entities. As research repository custodians, we will miss opportunities to combine our collections in ways that help researchers if we wait for complete, or near-complete, PID coverage. Therefore, ADA and APO are piloting approaches to linking objects using a combination of unstructured, semi-structured and structured data:
- Text mining and natural language processing, to help predict semantic and logical links
- Leveraging metadata, such as controlled vocabularies, to improve link prediction
- Locating and matching existing PIDs in each repository
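As an illustration of the third approach only, the following is a minimal sketch of how existing DOIs might be located in report text and matched against dataset records. It is not the pilot's implementation: the function names, the input shapes (plain dictionaries keyed by record ID) and the loose DOI pattern are all assumptions made for this example.

```python
import re

# Loose pattern for DOI-like strings. This is an assumption for illustration,
# not an official DOI grammar; it may over- or under-match in real text.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(text):
    """Return the set of lower-cased DOI-like strings found in free text."""
    found = set()
    for match in DOI_PATTERN.findall(text):
        # Drop trailing punctuation that usually belongs to the sentence.
        found.add(match.rstrip(".,;:)]").lower())
    return found


def link_by_doi(reports, datasets):
    """Pair report IDs with dataset IDs whose DOI appears in the report text.

    reports:  dict mapping a hypothetical report ID to its text or metadata
    datasets: dict mapping a hypothetical dataset ID to its DOI string
    """
    doi_to_dataset = {doi.lower(): ds_id for ds_id, doi in datasets.items()}
    links = []
    for report_id, text in reports.items():
        for doi in sorted(extract_dois(text)):
            if doi in doi_to_dataset:
                links.append((report_id, doi_to_dataset[doi]))
    return links
```

Exact matching of this kind only finds links where a PID is already present on both sides. Where it is not, the text mining and controlled vocabulary approaches listed above would have to supply candidate links.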
From the pilot, APO and ADA hope to learn the following:
- What is the PID coverage in our repositories?
- At what aggregation level should links be made, i.e. collection vs item level?
- What commonalities and opportunities exist in the respective metadata approaches?
- How can taxonomies be leveraged to improve predictions and matches?
- What interfaces between the repositories are scalable, reusable and in scope?
This research was funded by the Australian Research Council Linkage Infrastructure, Equipment and Facilities grant Linked Semantic Platforms (LE180100094).
Les Kneebone has worked in information management roles in government, school, community and research sectors since 2002. He has mainly contributed to managing metadata, taxonomies and cataloging standards used in these sectors. Les is currently supporting the Analysis & Policy Observatory by developing and refining metadata standards and services that will help to link policy literature with datasets.
Steven McEachern is Director and Manager of the Australian Data Archive at the Australian National University, where he is responsible for the daily operations and technical and strategic development of the data archive. He has high-level expertise in survey methodology and data archiving, and has been actively involved in development and application of survey research methodology and technologies over 15 years in the Australian university sector. https://orcid.org/0000-0001-7848-4912
Janet McDougall is a Senior Data Archivist at the Australian Data Archive, with a background in systems IT, data management, GIS, and social research. Her role includes outreach and curation of research data for preservation, archiving and publication. She is also involved in the ongoing implementation of metadata and standards focussed mainly in the social sciences and humanities, but also has experience with long-term ecological data from curation and procedural perspectives.