Dr Sebastian Haan¹, Dr Januar Harianto¹, Dr Nathanial Butterworth¹, Prof Thomas Bishop¹
¹The University of Sydney, Sydney Informatics Hub, Sydney, Australia
The Sydney Informatics Hub brings together experts from diverse analytical and technical backgrounds to enable excellence in data- and compute-intensive research. Both presenters endeavour to empower researchers with modern machine learning and data workflows, and contribute open-source software for research and education.
While an enormous amount of national and global space-time data is free and accessible, ranging from numerous satellite platforms to weather and environmental data, extracting and post-processing all data layers for a specific timescale and region is typically a non-trivial and time-consuming task for a researcher. To jumpstart this analysis, we have developed an open-source software package, Data-Harvester, in collaboration with multiple national research facilities and universities as part of the Agricultural Research Federation.
The main goal of the Data-Harvester is to provide researchers with reusable workflows and tools for automatic data download, feature extraction and spatial-temporal processing. The result is a set of spatially and temporally aligned features that serves as a ready-made dataset for machine learning and geospatial analysis. In addition, all requested data layer maps are automatically extracted and aligned for a specific region and time period.
Data-Harvester is designed as a modular and maintainable multi-stage pipeline with explicit boundaries between tasks. To encourage interaction and experimentation with the pipeline, we provide multiple frontend notebooks and use-case scenarios as Jupyter and R notebooks, as well as standalone Python and R packages. The Data-Harvester supports an extensive range of geospatial data sources, such as weather, multiple satellite platforms, soil, and terrain data, and its modular structure and programmable interface allow it to be extended or integrated into other processes. Data-Harvester seeks to be an example of best practice in data sharing to power future reusable workflows and modern data pipelines.
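As an illustration of what a multi-stage pipeline with explicit task boundaries can look like, the following is a minimal Python sketch and not the Data-Harvester API. All names (Grid, download, clip_dates, extract, run_pipeline) are hypothetical, the download stage is stubbed with synthetic values instead of a real data source, and the feature extraction is assumed to be a simple temporal mean at the grid cell containing each sample point.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


@dataclass
class Grid:
    """Toy raster layer (hypothetical): one 2D array per date on a regular grid."""
    name: str
    lon_min: float
    lat_min: float
    cell_size: float
    frames: Dict[str, List[List[float]]]  # ISO date -> rows of values


def download(layer_names: List[str]) -> List[Grid]:
    """Stage 1 (stub): return synthetic layers; a real stage would fetch remote data."""
    frames = {
        "2021-01-01": [[1.0, 2.0], [3.0, 4.0]],
        "2021-01-02": [[3.0, 4.0], [5.0, 6.0]],
    }
    return [Grid(name, 150.0, -34.0, 0.5, frames) for name in layer_names]


def clip_dates(grid: Grid, start: str, end: str) -> Grid:
    """Stage 2: keep frames within [start, end]; ISO dates sort correctly as strings."""
    kept = {d: f for d, f in grid.frames.items() if start <= d <= end}
    return Grid(grid.name, grid.lon_min, grid.lat_min, grid.cell_size, kept)


def extract(grid: Grid, points: List[Point]) -> List[float]:
    """Stage 3: temporal mean of the cell containing each point (no bounds checking)."""
    values = []
    for lon, lat in points:
        col = int((lon - grid.lon_min) / grid.cell_size)
        row = int((lat - grid.lat_min) / grid.cell_size)
        values.append(mean(frame[row][col] for frame in grid.frames.values()))
    return values


def run_pipeline(layer_names: List[str], points: List[Point],
                 start: str, end: str) -> List[dict]:
    """Chain the stages and return one feature row per sample point."""
    table = [{"lon": lon, "lat": lat} for lon, lat in points]
    for grid in download(layer_names):
        clipped = clip_dates(grid, start, end)
        for row, value in zip(table, extract(clipped, points)):
            row[grid.name] = value
    return table


result = run_pipeline(
    ["rainfall", "ndvi"],
    [(150.1, -33.9), (150.6, -33.4)],
    "2021-01-01",
    "2021-01-02",
)
```

Because each stage only consumes and returns plain data objects, any one stage could in principle be replaced (for example, swapping the stub for a real downloader) without touching the others. The output is one row per sample point with one column per layer, which mirrors the idea of a spatially and temporally aligned feature table.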
Biographies:
Dr. Januar Harianto is a data scientist at the Sydney Informatics Hub at the University of Sydney with a research background in climate change biology. Previously, he was a Lecturer and Co-ordinator for data science subjects in the Schools of Agriculture and Mathematics & Statistics, and a data consultant for the Australian Academy of Science and the NSW Department of Primary Industries.
Dr. Sebastian Haan is a data scientist at the Sydney Informatics Hub at the University of Sydney with a background in machine learning, computational physics, and astrophysics. He has previously held international research positions at CSIRO, Caltech, and the Max-Planck-Institute.