Dr Jonathan Yu1, Dr Simon Cox2, Mr Benjamin Leighton3, Mr Hendra Wijaya4, Mr Qifeng Bai5
1CSIRO Land and Water, Clayton, Victoria, Australia, firstname.lastname@example.org
2CSIRO Land and Water, Clayton, Victoria, Australia, email@example.com
3CSIRO Land and Water, Clayton, Victoria, Australia, firstname.lastname@example.org
4CSIRO/Data61, North Ryde, NSW, Australia, email@example.com
5CSIRO Land and Water, Black Mountain, ACT, Australia, firstname.lastname@example.org
Australia is currently ranked 2nd in the OKFN Global Open Data Index, and since 2013 over 7000 datasets have been published through data.gov.au. An increasing amount of data is also being published through state-based open data portals such as data.nsw.gov.au and data.vic.gov.au. More recently, thematic and agency-based data portals have been established, such as the Sharing and Enabling Environmental Data (SEED) data portal and the NSW Office of Environment and Heritage (OEH) data portal. Alongside these open government data initiatives, various NCRIS facilities provide many data collections, including AuScope, TERN, ALA, IMOS, and NCI, covering primarily earth and environmental science data. A number of institutional repositories also provide access to research datasets (CSIRO’s Data Access Portal, Research Data Australia through ANDS). Given the range of data being published online through these government and NCRIS initiatives, a challenge is to understand the current state of the data landscape in Australia and to measure its complexity. Open questions include: how much data is available, how varied is it, is it interoperable, and which data is being used where?
Through the OzNome initiative, our team has been developing tools to (a) understand information infrastructures across Australia in greater detail and (b) enable researchers, industry and key partners to achieve productivity gains in the discovery, access and use of data. As part of this initiative, the CSIRO Knowledge Network (KN) provides a gateway to data held across multiple, heterogeneous data repositories in Australia. KN harvests and indexes each resource and registers it with a KN identifier. These identifiers enable improved linkages between data resources, adding significant value to the information in the various source repositories and systems. KN currently provides search and discovery across 70k data collections and 175k spatial objects from 16 open government and research data repositories in Australia. Figure 1 shows a screenshot of KN analytics for data.gov.au, including the top 100 datasets by keyword (available here: http://kn.csiro.au/about-dataset-list/data-gov-au).
Figure 1. Top 100 datasets by keyword for datasets in data.gov.au
Using KN, we carried out a preliminary survey of open data and research data in the Australian context across 18 initiatives (9 CKAN, 2 Socrata, 2 Geonetwork instances, plus CSIRO DAP, eReefs, OzNome). These include federal, state and capital city data initiatives, NCRIS facilities and CSIRO. The data formats listed in these catalogues are quite heterogeneous, with CKAN portals providing the most straightforward source of format information. Across the 9 CKAN instances, there were 182 different formats. Figure 2 shows the distribution of these formats and the long tail of less frequently published formats. Table 1 shows the top 5 data providers by number of data resources published (Table 1a) and the top 5 data formats by number of resources in CKAN-based data portals (Table 1b).
Figure 2. Distribution of CKAN based data resource formats
Table 1: a) Top 5 CKAN resources per provider; b) Top 5 formats from CKAN instances
|Data provider||No. Resources||Format||Count|
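The format tally described above can be illustrated with a minimal sketch. It assumes each dataset record follows the standard CKAN package layout, in which a package carries a `resources` list and each resource has a `format` field. The sample records, the function names and the normalisation step are hypothetical and are not taken from the survey, which may have counted raw format strings as published.

```python
from collections import Counter


def normalise_format(raw):
    """Collapse case, surrounding whitespace and a leading dot.

    Assumption for illustration only: 'csv', ' CSV' and '.csv' are treated
    as one format; empty or missing values are counted as 'UNKNOWN'.
    """
    return (raw or "").strip().lstrip(".").upper() or "UNKNOWN"


def tally_formats(packages):
    """Count resources per declared format across CKAN-style package dicts."""
    counts = Counter()
    for package in packages:
        for resource in package.get("resources", []):
            counts[normalise_format(resource.get("format"))] += 1
    return counts


# Hypothetical sample records, shaped like CKAN package metadata.
sample_packages = [
    {"name": "dataset-a",
     "resources": [{"format": "CSV"}, {"format": "csv"}, {"format": "WMS"}]},
    {"name": "dataset-b",
     "resources": [{"format": ".csv"}, {"format": ""}, {"format": "GeoJSON"}]},
]

format_counts = tally_formats(sample_packages)

# Print the most common formats, analogous to a "top 5 formats" table.
for fmt, count in format_counts.most_common(5):
    print(fmt, count)
```

Sorting the full counter by count would give the kind of long-tailed distribution referred to in Figure 2. How much normalisation to apply is a design choice: stricter normalisation shortens the tail but can hide genuine differences between formats.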
Most open data portals provide semantic annotation through subject or keyword level metadata, but they use different field labels and draw values from a variety of non-aligned subject vocabularies. For example, what CKAN calls “tags” (uncontrolled vocabulary), Geonetwork calls “subjects” (many from the GCMD keyword list); see Table 2. It is unclear whether these keywords are granular enough to describe the resources adequately.
Table 2: a) Top 5 CKAN tags; b) Top 5 Geonetwork Subject Keywords
|CKAN tags||Geonetwork subjects|
|“Earth Sciences”, 15868||“environment”, 346|
|“Oceans”, 15591||“EARTH SCIENCES”, 192|
|“GA Publication”, 13636||“National Computational Infrastructure (NCI)”, 168|
|“Ocean Temperature”, 12052||“climatologyMeteorologyAtmosphere”, 99|
|“Water Temperature”, 11827||“ATMOSPHERIC SCIENCES”, 94|
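To illustrate why non-aligned keyword vocabularies complicate indexing, the sketch below merges the two columns of Table 2 after a simple case and whitespace normalisation. The counts are copied from Table 2; the function names and the normalisation rule are assumptions made for illustration and are not how KN is stated to align keywords.

```python
from collections import Counter


def normalise_keyword(label):
    """Collapse internal whitespace and case (illustrative rule only)."""
    return " ".join(label.split()).casefold()


def merge_keyword_counts(*sources):
    """Sum keyword counts from several catalogues under normalised labels."""
    merged = Counter()
    for source in sources:
        for label, count in source.items():
            merged[normalise_keyword(label)] += count
    return merged


# Values copied from Table 2.
ckan_tags = {
    "Earth Sciences": 15868,
    "Oceans": 15591,
    "GA Publication": 13636,
    "Ocean Temperature": 12052,
    "Water Temperature": 11827,
}
geonetwork_subjects = {
    "environment": 346,
    "EARTH SCIENCES": 192,
    "National Computational Infrastructure (NCI)": 168,
    "climatologyMeteorologyAtmosphere": 99,
    "ATMOSPHERIC SCIENCES": 94,
}

merged = merge_keyword_counts(ckan_tags, geonetwork_subjects)

# "Earth Sciences" and "EARTH SCIENCES" now share one entry.
print(merged["earth sciences"])
```

Case folding only catches trivially different spellings. Near-synonyms such as “Ocean Temperature” and “Water Temperature” stay separate, and aligning them would need a mapping between vocabularies rather than string normalisation.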
While Australia ranks highly on the open data scale, semantic annotation is relatively anarchic, and indexing is therefore incomplete. In this preliminary scan a large proportion of the data came from the environmental and earth sciences, but even here there is a large variation in formats (at least as declared in the metadata). Future work is needed to gain better insight into the variety of data being published, and to develop tools for understanding how the data is actually used, which is not yet well understood.
- OKFN Global Open Data Index, https://index.okfn.org/place, accessed 16 June 2017
- data.gov.au, http://data.gov.au/, accessed 16 June 2017
- data.nsw.gov.au, https://data.nsw.gov.au/, accessed 16 June 2017
- data.vic.gov.au, https://www.data.vic.gov.au, accessed 16 June 2017
- SEED data portal, https://www.seed.nsw.gov.au/, accessed 16 June 2017
- NSW OEH data portal, http://data.environment.nsw.gov.au/, accessed 16 June 2017
- OzNome initiative, https://research.csiro.au/oznome/, accessed 16 June 2017
Dr Jonathan Yu is an information specialist in the Environmental Informatics group in CSIRO Land and Water. He is currently leading and supporting the development of new approaches, methods and tools for transforming and connecting information flows across the environmental domain and the broader digital economy within Australia and internationally. His research interests range from understanding information supply chains in various environmental domains to developing new methods and tools for streamlining and enhancing interoperability between them. http://orcid.org/0000-0002-2237-0091