Machine learning for the rest of us

Dr Chris Hines1

1Monash eResearch Centre, Clayton, Australia

 

Neural networks are the new hawtness in machine learning and, more generally, in any field that relies heavily on computers and automation. Many people feel their promise is overhyped, but there is no denying that the automated image processing now available is astounding compared to ten years ago. While the premise of machine learning is simple, obtaining a large enough labeled dataset, creating a network and waiting for it to converge before you see a modicum of progress is beyond most of us. In this talk I consider a hypothetical automated kiosk called “Beerbot”. Beerbot’s premise is stated simply: keep a database of how many beers each person has taken from the beer fridge. I show how existing open source published networks can be chained together to create a “good enough” solution for a real-world situation, with little data collection or labeling required by the developer and no more skill than a bit of basic Python. I then consider a number of research areas where further automation could significantly improve “time to science” and encourage all eResearch practitioners to have a go.
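The chaining idea can be sketched in a few lines of Python. This is a minimal illustration only: the abstract does not name the networks used, so `identify_person` and `count_beers_taken` are hypothetical stand-ins for pretrained models, and the frames are plain dictionaries rather than images.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

# Hypothetical placeholder for an image; a real kiosk would pass pixel arrays.
Frame = dict


def make_beerbot(
    identify_person: Callable[[Frame], Optional[str]],
    count_beers_taken: Callable[[Frame], int],
) -> Callable[[Iterable[Frame]], Counter]:
    """Chain two (assumed pretrained) models into a per-person beer tally."""

    def run(frames: Iterable[Frame]) -> Counter:
        tally: Counter = Counter()
        for frame in frames:
            person = identify_person(frame)   # stage 1: who is at the fridge?
            if person is None:
                continue                      # nobody recognised, skip the frame
            beers = count_beers_taken(frame)  # stage 2: how many beers were taken?
            if beers > 0:
                tally[person] += beers
        return tally

    return run


# Stub "networks" so the sketch runs without any model weights.
def fake_face_model(frame: Frame) -> Optional[str]:
    return frame.get("face")


def fake_beer_model(frame: Frame) -> int:
    return frame.get("beers", 0)


beerbot = make_beerbot(fake_face_model, fake_beer_model)
frames = [
    {"face": "chris", "beers": 1},
    {"face": None, "beers": 2},   # unrecognised person, ignored
    {"face": "alex", "beers": 2},
    {"face": "chris", "beers": 1},
]
tally = beerbot(frames)
print(dict(tally))
```

Because each stage is just a callable, a published face-recognition or object-detection model could in principle be dropped in without changing the tallying logic.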


Biography:

Chris has been kicking around the eResearch sector for over a decade. He has a background in quantum physics and, with the arrogance of physicists everywhere, thinks this qualifies him to stick his big nose into topics he knows nothing about.

Requirements On a Group Registry Service in Support of Research Activity Identifiers (RAiDs)

Dr Scott Koranda1, Dr Andrew Janke2, Ms Heather Flanagan1, Ms Siobhann McCafferty3, Mr Benjamin Oshrin1, Mr Terry Smith3

1Spherical Cow Group, Wauwatosa, United States, skoranda@sphericalcowgroup.com, hlflanagan@sphericalcowgroup.com, benno@sphericalcowgroup.com

2Research Data Services, Brisbane, Australia, andrew.janke@uq.edu.au

3Australian Access Federation, Brisbane, Australia, siobhann.mccafferty@aaf.edu.au, t.smith@aaf.edu.au

 

Persistent Identifiers (PIDs) are an essential tool of digital research data management and the evolving data management ecosystem. They allow for a clear line of sight along data management processes and workflows, more efficient collaboration and more precise measures of cooperation, impact, value and outputs. The Research Activity Identifier (RAiD) [1] was developed by the Australian Data Life Cycle Framework Project (DLCF) [2] in response to this need and is a freely available service and API. A RAiD is a handle (string of numbers) that is minted via the RAiD API. The handle is persistent and can have other digital identifiers associated with it such as ORCiDs [3], DOIs [4], and Group Identifiers (GiDs).
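The relationship between a RAiD and its associated identifiers can be pictured as a simple record. The sketch below is an in-memory illustration only and does not call or reproduce the RAiD API: the `ActivityRecord` class, its field names, the handle value, the DOI and the group name are all hypothetical. The ORCID shown is the one listed in the biographies for this abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ActivityRecord:
    """Hypothetical stand-in for a RAiD and the identifiers linked to it."""

    handle: str  # the persistent handle; the value used below is made up
    linked: Dict[str, List[str]] = field(default_factory=dict)

    def associate(self, kind: str, identifier: str) -> None:
        """Attach an identifier of the given kind, ignoring duplicates."""
        ids = self.linked.setdefault(kind, [])
        if identifier not in ids:
            ids.append(identifier)


record = ActivityRecord(handle="00.0000/example")
record.associate("orcid", "https://orcid.org/0000-0003-4478-9026")
record.associate("doi", "10.0000/placeholder")    # placeholder DOI
record.associate("gid", "example-imaging-group")  # placeholder group identifier
record.associate("gid", "example-imaging-group")  # duplicate, ignored

print(record.linked)
```

The handle stays fixed while the set of linked identifiers grows over the life of the research activity, which is the property the abstract relies on.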

The minting, structure, management, and curation of GiDs are open and evolving issues. We present the program of work and results from a study of these issues around GiDs undertaken as a collaboration between the Research Data Services (RDS) [5] project and the Australian Access Federation (AAF) [6]. The study focused on supporting the group management needs of Australian research collaborations, services, and infrastructure and included use cases and user stories from the National Imaging Facility (NIF) [7], AARNet CloudStor [8], the UQ Data Management Planning system (UQ RDM) [9], and the Research Data Box (ReDBox) [10] from the Queensland Cyber Infrastructure Foundation (QCIF).

We report on requirements for a group registry service to serve as the foundation for a GiD API and detail what future enhancements to the group registry service will be necessary to support collaboration across international boundaries via services federated with eduGAIN through AAF subscriptions.

REFERENCES

  1. Available at: https://www.raid.org.au/, accessed 06 June 2018.
  2. Data Life Cycle Framework Project. Available at: https://www.dlc.edu.au/, accessed 06 June 2018.
  3. Available at: https://orcid.org/, accessed 06 June 2018.
  4. Available at: https://www.doi.org/, accessed 06 June 2018.
  5. Available at: https://www.rds.edu.au/, accessed 06 June 2018.
  6. Available at: https://aaf.edu.au/, accessed 06 June 2018.
  7. Available at: http://anif.org.au/, accessed 06 June 2018.
  8. Available at: https://www.aarnet.edu.au/network-and-services/cloud-services-applications/cloudstor, accessed 06 June 2018.
  9. Available at: https://research.uq.edu.au/project/research-data-manager-uqrdm, accessed 06 June 2018.
  10. Available at: https://www.qcif.edu.au/services/redbox, accessed 06 June 2018.

Biographies:

Andrew Janke is the Informatics Fellow for the National Imaging Facility (NIF), Systems Architect, DLCF, Research Data Services (RDS) and Senior Research Fellow for the Centre for Advanced Imaging (CAI), University of Queensland. https://orcid.org/0000-0003-0547-5171

Scott Koranda specializes in identity management architectures that streamline and enhance collaboration for research organizations. https://orcid.org/0000-0003-4478-9026

Siobhann McCafferty is a Brisbane-based Research Data Management specialist. She is the Project Manager for the Data LifeCycle Project (https://www.dlc.edu.au/) and part of the RAiD Research Activity Identifier Project (https://www.raid.org.au/). https://orcid.org/0000-0002-2491-0995

Terry Smith is responsible for providing support and training activities to the AAF subscriber community and international engagement across the Asia Pacific region as chair of the APAN Identity and Access management working groups and globally through eduGAIN. https://orcid.org/0000-0001-5971-4735

Towards ‘end-to-end’ research data management support

Mrs Cassandra Sims1

1Elsevier, Chatswood, Australia, c.sims@elsevier.com 

 

Information systems supporting science have come a long way and include solutions that address many research data management needs faced by researchers, as well as their institutions. Yet, due to a fragmented landscape and even with the best solutions available, researchers and institutions are sometimes missing crucial insights and spending too much time searching, combining and analysing research data [1].

With this in mind, we are working on holistically addressing all aspects of the research lifecycle shown in Figure 1. The lifecycle starts with the design phase, when researchers decide on a new project, prepare their experiments and collect initial data. It then moves into the execution phase, when experiments are run and research data are collected, shared within the research group, processed, analysed and enriched. Finally, research results are published and the main outcomes are shared within scientific community networks.

Figure 1: Research lifecycle

Throughout this process researchers use a variety of tools, both within the lab as well as to share their results. Research processes like this happen every day. However, there are no current solutions that enable end-to-end support of this process for researchers and institutions.

Many institutes have established internal repositories, which have their own limitations. At the same time, various open data repositories [2] have grown with their own set of data and storage/retrieval options, and many scholarly publishers now offer services to deposit and reference research datasets in conjunction with the article publication.

One challenge often faced by research institutes is developing and implementing solutions to ensure that researchers can find each other’s research in the various data silos in the ecosystem (i.e. assigning appropriate ontologies, metadata, researcher associations). Another challenge is to increase research impact and collaboration both inside and outside their institution to improve quantity and quality of their research output.

Making data available online can enhance the discovery and impact of research. The ability to reference details, such as ownership and content, about research data could assist in improved citation statistics for published research [3]. In addition, many funders increasingly require that data from supported projects is placed in an online repository. So research institutes need to ensure that their researchers comply with these requirements.

This talk covers a suite of tools and services developed to assist researchers and institutions with their research data management needs [4]. The suite spans the entire spectrum, from data capture through to making data comprehensible and trusted, enabling researchers to get proper recognition and institutions to improve their overall ranking by going “beyond the mandates”.

I will explain how it integrates through open application programming interfaces with the global ecosystem for research data management (shown in Figure 2), including:

  • DANS [7] for long-term data preservation,
  • DataCite [5] for DOIs and indexed metadata to help with data publication and inventory,
  • Scholix [6] for support of links between published articles and datasets,
  • More than 30 open data repositories for data discovery.

Figure 2: Integration with the global research data management ecosystem

The talk will conclude with an overview of current data sharing practices and a short demonstration of how we incorporate feedback from our development partners: University of Manchester, Rensselaer Polytechnic Institute, Monash University and Nanyang Technological University.

REFERENCES

  1. de Waard, A., Cousijn, H., and Aalbersberg IJ. J., 10 aspects of highly effective research data. Elsevier Connect. Available from https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data, accessed 15 June 2018.
  2. Registry of research data repositories. Available from: https://www.re3data.org/, accessed 15 June 2018.
  3. Vines, T.H. et al., The Availability of Research Data Declines Rapidly with Article Age. Current Biology, 2014, 24(1): p. 94-97.
  4. Elsevier research data management tools and services. Available from: https://www.elsevier.com/solutions/mendeley-data-platform, accessed 15 June 2018.
  5. DataCite. Available from: https://www.datacite.org/, accessed 15 June 2018.
  6. Scholix: a framework for scholarly link exchange. Available from http://www.scholix.org/, accessed 15 June 2018.
  7. Data Archiving and Networked Service (DANS). Available from: https://dans.knaw.nl/en, accessed 15 June 2018.

Biography:

Senior Research Solutions Manager ANZ

Cassandra has worked for Elsevier for over 6 years, as Product Solutions Manager APAC and currently as Senior Research Solutions Manager ANZ. Cassandra has demonstrated experience and engagement across the Academic, Government and Health Science segments in the region, working with Universities, Government Organisations, Local Area Health Districts, Funders and Industry to assist in the development of business strategies, data asset management and core enterprise objectives. Specialising in detailed Analytics, Collaboration Mapping and Bibliometric Data, Cassandra builds on her wealth of knowledge in these areas to assist our customer base with innovative and superior solutions to meet their ever-changing needs. Cassandra has worked with the NHMRC, ARC, MBIE, RSNZ, AAMRI and every university in the ANZ region. Cassandra is responsible for all new business initiatives in ANZ and for supporting strategic initiatives across APAC.

CILogon 2.0: An Integrated Identity and Access Management Platform for Science

Dr Jim Basney2, Ms Heather Flanagan1, Mr Terry Fleury2, Dr Scott Koranda1, Dr Jeff Gaynor2, Mr Benjamin Oshrin1

1Spherical Cow Group, Wauwatosa, United States, hlflanagan@sphericalcowgroup.com, skoranda@sphericalcowgroup.com, benno@sphericalcowgroup.com

2University of Illinois, Urbana, United States, jbasney@illinois.edu, tfleury@illinois.edu, gaynor@illinois.edu

 

When scientists work together, they use web sites and other software to share their ideas and data. To ensure the integrity of their work, these systems require the scientists to log in and verify that they are part of the team working on a particular science problem.  Too often, the identity and access verification process is a stumbling block for the scientists. Scientific research projects are forced to invest time and effort into developing and supporting Identity and Access Management (IAM) services, distracting them from the core goals of their research collaboration.

CILogon 2.0 provides a software platform that enables scientists to work together to meet their IAM needs more effectively so they can allocate more time and effort to their core mission of scientific research. The platform builds on prior work from the CILogon [1] and COmanage [2] projects to provide an integrated IAM platform for cyberinfrastructure, federated worldwide via InCommon [3] and eduGAIN [4]. CILogon 2.0 serves the unique needs of research collaborations, namely the need to dynamically form collaboration groups across organizations and countries, sharing access to data, instruments, compute clusters, and other resources to enable scientific discovery.

We operate CILogon 2.0 via a software-as-a-service model to ease integration with cyberinfrastructure, while making all software components publicly available under open source licenses to enable reuse. We present the design and implementation of CILogon 2.0, along with operational performance results from our experience supporting over four thousand active users.

REFERENCES

  1. Available at: http://www.cilogon.org/, accessed 07 June 2018.
  2. Available at: https://spaces.internet2.edu/display/COmanage/Home, accessed 07 June 2018.
  3. Available at: https://www.incommon.org/, accessed 07 June 2018.
  4. Available at: https://edugain.org/, accessed 07 June 2018.

Biography:

Scott Koranda specializes in identity management architectures that streamline and enhance collaboration for research organizations. https://orcid.org/0000-0003-4478-9026

Managing Your Data Explosion

Michael Cocks1

1Country Manager – ANZ, Spectra Logic, mikec@spectralogic.com

 

As high performance computing (HPC) environments, universities, and research organizations continually test the limits of technology and require peak performance from their equipment, the volume of data created each day will continue to grow exponentially. It is essential for these organizations to consider future needs when examining storage options. Short-term fixes to store and manage data are appealing due to their low entry point, but often worsen long-term storage challenges associated with performance, scalability, cost, and floor space. A future-looking data storage solution for HPC requires:

  • A multi-tier architecture to disk, tape, and cloud
  • Fully integrated clients that are easy to use and support the seamless transfer, sharing and publication of very large data sets from online, nearline and offline storage across diverse sites and systems
  • The capability to plan for growth, scale incrementally, and span the entire data lifecycle
This presentation will go over the advantages of a fully integrated multi-tier HPC data storage architecture and how these types of solutions help organizations dealing with massive storage management push the boundaries of their operational objectives, providing cost-effective storage that meets all of their performance, growth, and environmental needs.

Figure 1: A multi-tier hybrid storage ecosystem


Biography:

Michael Cocks is the Country Sales Manager for Spectra Logic in Australia and New Zealand. With more than 25 years of experience in the industry, Michael has held various roles within computing and data storage companies such as Silicon Graphics (SGI), Hitachi Data Systems (HDS), Olivetti and Spectra Logic. At Spectra, he manages relations with several customers in the Australia and NZ area, including Fox Sports, Foxtel, Weta Digital, Post Op Group, TVNZ, Australian Square Kilometre Array Pathfinder, Australian National University, UTAS, CSIRO and many others. Michael graduated from Southampton University in the UK where he studied Electronics Engineering.

Bending the Rules of Reality for Improved Collaboration and Faster Data Access

Mr Dave Hiatt1

1WekaIO, San Jose, CA, United States, Dave@Weka.IO

 

The popular belief is that research data is heavy and that data locality is therefore an important factor in designing a data storage system to support research workloads. The solution is often to locate data near compute and depend on a local file system or block storage for performance. This tactic results in a compromise that severely limits the ability to scale these systems with data growth or provide shared access to data.

Advances in technology such as NVMe flash, virtualization, distributed parallel file systems, and low latency networks leverage parallelism to bend the rules of reality and provide faster than local file system performance with cloud scalability. The impact on research is to greatly simplify and reduce the cost of HPC class storage, meaning researchers spend less time waiting on results and more of their grant money goes to research rather than specialty hardware.

Program Links

Solutions Showcase – Accelerate Scientific Discovery Using Modern Storage Infrastructure

Presentation – Evaluating Emerging Flash Storage Architectures for Research Computing


Biography:

David Hiatt is the Director of Strategic Market Development at WekaIO, where he is responsible for developing business opportunities within the research and high-performance computing communities. Previously, Mr. Hiatt led market development activities in healthcare and life sciences at HGST’s Cloud Storage Business Unit and Violin Memory. He has been a featured speaker on data storage related topics at numerous industry events. Mr. Hiatt earned an MBA from the Booth School of Management at the University of Chicago and a BSBA from the University of Central Florida.

Software Engineering – Visualisation of a complex model. Using CSIRO’s TAPPAS as an example, present the key challenges and success factors in engineering data visualisations.

Mr Craig Hamilton1

1Intersect Australia, Sydney, Australia, Craig.Hamilton@intersect.org.au

 

DESCRIPTION

TAPPAS (Tool for Assessing Pest and Pathogen Airborne Spread) is an online tool for modelling the dispersal of living organisms, developed through a CSIRO, Bureau of Meteorology (BOM) and Intersect partnership. TAPPAS takes global air circulation data from the BOM’s numerical weather prediction model and feeds it to the HYSPLIT dispersion system to compute simple air parcel trajectories. TAPPAS combines these with knowledge of the organism’s biology and delivers the results through an easy-to-use interface as risk maps.

In five minutes we will cover some of the key challenges and successes of this project from an engineering perspective, and show a couple of the dispersion visualisations.

With the growth in demand and importance of data visualisation, the aim of this presentation is to help other delegates understand some of the key success factors in engineering visual data from complex models.


Biography

Craig has over 20 years’ experience in software engineering, architecture and product management in higher education as well as local and global private companies. From architecting and building the number one Australian online shopping site in the early 2000s to developing global identity management programs for over 20 million users, Craig has designed and built systems that solve unique and complex problems with adoption, scalability and security. As engineering manager of Intersect Australia for the last year, Craig has overseen the delivery and development of a number of research software engineering products such as TAPPAS and CloudStor Collections.

Meeting the Big Science Needs of the SKA: What NRENs Can Do and the Internet Cannot

Mr Peter Elford1, Mr Tim Rayner2, Mr Chris Myers3

1AARNet, Canberra, Australia, Peter.Elford@AARNet.edu.au

2AARNet, Canberra, Australia, Tim.Rayner@AARNet.edu.au

3AARNet, Melbourne, Australia, Chris.Myers@AARNet.edu.au

 

DESCRIPTION

The scale of the SKA [1] represents a huge leap forward in the engineering needed to deliver a unique instrument (a radio telescope) as part of an international collaboration. The SKA will generate, process and store enormous quantities of data, and AARNet has been working with several efforts to ensure this volume of data gets into the hands and systems of the science community. This talk will focus on work undertaken in partnership with GEANT and others [2] to prove network throughput from the AARNet backbone and the MRO [3] in Australia to important research facilities in Europe, such as the GEANT backbone and ASTRON, as well as to the USA. The tests have been conducted with hosts connected at 10Gbps and 100Gbps, and prove the network throughput capabilities between AARNet and the wider NREN [4] community. Notably, testing conducted over network paths through the commercial Internet demonstrated very poor results.
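The practical difference between 10Gbps and 100Gbps hosts can be seen with some back-of-envelope arithmetic. The sketch below is illustrative only: the 100 TB dataset size and the `efficiency` parameter are assumptions for the example, not figures from the tests described above.

```python
def transfer_hours(size_terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move a dataset over a link at a given achieved efficiency.

    Uses decimal units: 1 TB = 1e12 bytes, 1 Gbps = 1e9 bits per second.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = size_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# Hypothetical 100 TB dataset at the two host speeds mentioned in the abstract.
for gbps in (10, 100):
    print(f"{gbps} Gbps: {transfer_hours(100, gbps):.1f} hours at line rate")
```

At line rate the same dataset takes roughly a day at 10Gbps and a couple of hours at 100Gbps; any path that achieves only a fraction of line rate stretches these times proportionally.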

This lightning talk specifically relates to the Generating, Collecting and Moving Data theme.

[1] Square Kilometre Array – www.skatelescope.org

[2] “Taking it to the limit – testing the performance of R&E networking” – https://blog.geant.org/15/05/2017/taking-it-to-the-limit-testing-the-performance-of-re-networking/

[3] Murchison Radio Observatory – http://www.ska.gov.au/Observatory/Pages/MRO.aspx

[4] NREN – National Research and Education Network


Biography

Peter Elford manages AARNet’s relationships across a broad range of Federal and state government agencies, and AARNet’s engagement with the Australian research community. He is a strong and passionate advocate for the role Information and Communications Technology (ICT) plays in enabling globally collaborative and competitive research through ultra-high speed broadband connectivity. Peter is an ICT professional with over 30 years’ experience within the government, education, research and industry sectors having worked at the Australian National University, AARNet (twice) and Cisco. In his first stint at AARNet (in 1990) he engineered much of the original Internet in Australia.

The Indigo Subsea Fibre System: eResearch Infrastructure in the Asian Century

Mr Peter Elford1

1AARNet, Yarralumla, Australia, Peter.Elford@AARNet.edu.au

 

DESCRIPTION

AARNet has entered into a consortium with Google, Indosat Ooredoo, Singtel, SubPartners, and Telstra to build a new international subsea cable system that will connect Singapore and Australia. Known as Indigo, the system will use coherent optical technology and spectrum sharing to deliver a minimum capacity of 18 terabits per second on each of two fibre pairs. The broadband capacity that has been secured will meet the future growth in collaborative research and transnational education between Australia and our Asian partners for decades to come. This is the first time a National Research and Education Network (NREN) has entered into direct subsea ownership, and it has been achieved without direct Commonwealth funding.

This lightning talk specifically relates to the Generating, Collecting and Moving Data theme, and highlights an outstanding example of national, sustainable, underpinning e-Infrastructure.


Biography

Peter Elford manages AARNet’s relationships across a broad range of Federal and state government agencies, and AARNet’s engagement with the Australian research community. He is a strong and passionate advocate for the role Information and Communications Technology (ICT) plays in enabling globally collaborative and competitive research through ultra-high speed broadband connectivity. Peter is an ICT professional with over 30 years’ experience within the government, education, research and industry sectors having worked at the Australian National University, AARNet (twice) and Cisco. In his first stint at AARNet (in 1990) he engineered much of the original Internet in Australia.

Calcyte: A simple tool for describing, packaging and publishing data collections

Dr Peter Sefton1

1University Of Technology Sydney, Ultimo, Australia, peter.sefton@uts.edu.au

 

ABSTRACT

Calcyte is a toolkit for managing metadata for collections of any kind of file-based data using spreadsheets – automatically generated from templates – for data entry (other methods may be supported in future). After the data owner enters information about the files and directories, Calcyte generates a static webpage and metadata files that describe the data in both human and machine-readable formats. Calcyte’s output can be published on a webserver, or zipped for distribution. Calcyte implements the proposed DataCrate format. Calcyte is a Python program, which can be run from the command line or via automated processes that detect changes in data on file shares.
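The general idea of turning spreadsheet rows into paired human and machine-readable descriptions can be sketched as follows. This is not Calcyte's code and does not produce the DataCrate format: the column names, the `describe` function and the plain JSON layout are assumptions made for illustration, and an in-memory CSV stands in for a spreadsheet.

```python
import csv
import html
import io
import json
from typing import Dict, Tuple

# Hypothetical sheet; real templates may use different columns.
SHEET = """path,title,description
data/run1.csv,Run 1,First calibration run
data/run2.csv,Run 2,Second run <repeat>
"""


def render_item(row: Dict[str, str]) -> str:
    """Render one file's metadata as an HTML list item, escaping user text."""
    path = html.escape(row["path"], quote=True)
    title = html.escape(row["title"])
    description = html.escape(row["description"])
    return f'<li><a href="{path}">{title}</a>: {description}</li>'


def describe(sheet_text: str) -> Tuple[str, str]:
    """Return (machine-readable JSON, human-readable HTML) for the sheet."""
    rows = list(csv.DictReader(io.StringIO(sheet_text)))
    machine = json.dumps({"files": rows}, indent=2)
    items = "\n".join(render_item(row) for row in rows)
    human = f"<html><body><ul>\n{items}\n</ul></body></html>"
    return machine, human


machine, human = describe(SHEET)
print(machine)
print(human)
```

Both outputs are generated from the same rows, so the webpage and the metadata file cannot drift apart, which is the main appeal of a single data-entry source.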

MORE DETAIL

The presentation will include a demo of using Calcyte to describe a small data set, with commentary on its important features, and a demonstration of how it has been used to publish data at UTS.

Calcyte produces human and machine-readable metadata in a format with the working title “DataCrate”. The UTS team is planning a beta release of both Calcyte and DataCrate for eResearch Australasia.

Calcyte is available from: https://codeine.research.uts.edu.au/eresearch/calcyte/tree/ac4daf6508957d6b98fbb8add15833d270584c28

ACKNOWLEDGEMENTS

Calcyte has been programmed by Peter Sefton and Michael Lake, and tested by the team at UTS eResearch, including Sharyn Wise and Michael Lynch.


Biography

Peter Sefton is the Manager, eResearch Support at the University of Technology, Sydney (UTS). Before that he was in a similar role at the University of Western Sydney (UWS). Previously he ran the Software Research and Development Laboratory at the Australian Digital Futures Institute at the University of Southern Queensland. Following a PhD in computational linguistics in the mid-nineties, he has gained extensive experience in the higher education sector in leading the development of IT and business systems to support both learning and research.

While at USQ, Peter was involved in the development of institutional repository infrastructure in Australia via the federally funded RUBRIC (http://rubric.edu.au/) project and was a senior advisor to the CAIRSS repository support service (http://cairss.caul.edu.au/cairss/) from 2009 to 2011. He oversaw the creation of one of the core pieces of research data management infrastructure to be funded by the Australian National Data Service, consulting widely with libraries, IT, research offices and eResearch departments at a variety of institutions in the process. The resulting Open Source research data catalogue application ReDBox is now being widely deployed at Australian universities.

At UTS Peter is leading a team which is working with key stakeholders to implement university-wide eResearch infrastructure, including an institutional data repository, as well as collaborating widely with research communities at the institution on specific research challenges. His research interests include repositories, digital libraries, and the use of The Web in scholarly communication.

    About the conference

    eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.

    © 2018 - 2019 Conference Design Pty Ltd