Re-publication and Duplication of Digital Datasets: raising concerns on emerging issues of authority, identity and ethics.
Dr Lesley Wyborn1, Dr Jens Klump2, Dr Mingfang Wu3, Dr Kirsten Elger4
1Australian National University, Acton, Australia
2Mineral Resources, CSIRO, Kensington, Australia
3Australian Research Data Commons, Clayton, Australia
4Helmholtz Centre Potsdam GFZ German Research Centre for Geosciences, Potsdam, Germany
As the FAIR principles become more widely implemented, digital datasets are becoming easier to discover through metadata aggregators such as Research Data Australia and Google Dataset Search. The downside is that digital datasets can also easily be republished by multiple other portals and stored by more than one repository. Automated crawlers can operate across catalogues to find and then republish web service endpoints, often without the knowledge or authorisation of the creator, owner or publisher of the original dataset. On these new sites, republished datasets can be given new identifiers, and information about the original owner, licence, accreditation, citation, etc. is often not carried with the individual replicated dataset or with any derivatives of that dataset. Users of online datasets frequently have no idea whether they are accessing the authoritative version of the data. Dataset creators can lose their identity in the republication process, so measuring citation and impact becomes difficult, if not impossible.
Increasingly, funders are asking for information about the usage and impact of the datasets and data acquisition campaigns they funded, whilst journal publishers now require that appropriate credit be given to whoever collected, curated and/or preserved the source data used in a publication.
There is clearly a need for community-agreed documentation of best practices for the republication and mirroring of data to multiple sites. For ethical scientific research, there is an urgent need to be able to identify the authoritative or canonical version of a dataset and to ensure correct attribution and citation of any data source. This paper will make suggestions on how this can be achieved.
Lesley Wyborn is an Honorary Professor at the Research School of Earth Sciences and at the National Computational Infrastructure, ANU; she also works part time for ARDC. She had 42 years’ experience at Geoscience Australia (GA) in research and data management. She is Chair of the Academy of Science ‘National Data in Science Committee’ and is on the AGU Data Management Advisory Board and the ESIP Executive Board.