Taking stock of biological data resources

Ms. Keeva Connolly1, Matt Andrews2, Peter Brenton2, Simon Checksfield2, Christopher Mangion1, Winnie Mok1, Caitlin Ramsay2, Sarah Richmond3, Goran Sterjov2, Nigel Ward1, Kathryn Hall2

1Australian BioCommons, Australia, 2Atlas of Living Australia, Australia, 3Bioplatforms Australia, Australia

Biography:

Keeva Connolly works for the Australian BioCommons and is the scientific business analyst for the Australian Reference Genome Atlas (ARGA), an online data indexing platform that aims to improve the discoverability of genomic data for Australian and Australian-relevant species.

Abstract:

Biological data are being generated and made available online at a rapidly accelerating rate. While some databases and infrastructures, such as the International Nucleotide Sequence Database Collaboration (INSDC) repositories and the Global Biodiversity Information Facility (GBIF), are long-established and widely used by researchers, other resources are less well known. The total number of resources providing access to biological data remains difficult to determine, and neither the types of data they contain nor the research contexts in which they could be used are well understood.

To survey the availability of infrastructure supporting biological data, the Global Biodata Coalition (GBC) compiled an inventory of over 3000 data resources identified from abstracts published between 2011 and 2021. We sought to further characterise these resources by assessing whether they were still accessible, identifying the type of data hosted and taxonomic focus, and categorising their general research area. We focussed on non-human genomic data resources and investigated how they sourced or generated their data, and whether they employed community standards or vocabularies.

During this assessment, we encountered a large number of resources that were inaccessible, highlighting broader issues in infrastructure sustainability. For the accessible resources, we found that a majority contained data relating to genes and/or gene products, and observed that resource numbers were skewed towards clinical research areas and model organism groups. For the non-clinical genomic resources, we found that most either curated or analysed data extracted from literature and/or other resources, and that very few hosted novel sequence data.
