Rapid solution prototyping with open data and Jupyter notebook

Ms Kerri Wait1

1Monash University, Clayton, Australia, kerri.wait@monash.edu 

 

Open data initiatives have the potential to accelerate research activities, but with the sheer number of data formats, tools, and platforms available, it can be difficult to know where to begin and which approach to take. In this talk I’ll consider a hyperthetical[1] research project to acquire data on the quality of lamingtons in each Victorian local government area. I’ll show how Python scripting inside a Jupyter notebook can retrieve and combine open data, such as council boundaries and office locations, to produce an optimised research path (i.e. where to drive to minimise distance and maximise lamington research benefits), and how much faster this approach is than manually wrangling data in spreadsheets and text files.

[1] Exaggerated; excessive; hyperbolical.
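The route-optimisation step described above can be sketched in a few lines of notebook Python. The office coordinates below are illustrative placeholders, not real open-data values, and the greedy nearest-neighbour heuristic stands in for whatever optimiser the talk actually uses:

```python
import math

# Hypothetical (name, lat, lon) office locations for a few Victorian LGAs.
# In the real workflow these would be retrieved from open-data portals.
OFFICES = [
    ("Melbourne", -37.8136, 144.9631),
    ("Monash", -37.9150, 145.1300),
    ("Geelong", -38.1499, 144.3617),
    ("Ballarat", -37.5622, 143.8503),
]

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def greedy_route(offices, start_index=0):
    """Nearest-neighbour heuristic: repeatedly drive to the closest unvisited office."""
    remaining = list(offices)
    current = remaining.pop(start_index)
    route = [current]
    while remaining:
        current = min(remaining, key=lambda o: haversine_km(current[1:], o[1:]))
        remaining.remove(current)
        route.append(current)
    return route

route = greedy_route(OFFICES)
```

A dozen lines like these replace hours of copying coordinates between spreadsheets, which is the core point of the talk.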


Biography:

Kerri Wait is an HPC Consultant at Monash University. As an engineer, Kerri has a keen interest in pulling things apart and reassembling them in novel ways. She applies the same principles to her work in eResearch, and is passionate about making scientific research faster, more robust, and repeatable by upskilling user communities and removing entry barriers. Kerri currently works with the neuroscience and bioinformatics communities.

Machine learning for the rest of us

Dr Chris Hines1

1Monash Eresearch Centre, Clayton, Australia

 

Neural networks are the new hawtness in machine learning and more generally in any field that relies heavily on computers and automation. Many people feel their promise is overhyped, but there is no denying that the automated image processing available is astounding compared to ten years ago. While the premise of machine learning is simple, obtaining a large enough labeled dataset, creating a network and waiting for it to converge before you see a modicum of progress is beyond most of us. In this talk I consider a hypothetical automated kiosk called “Beerbot”. Beerbot’s premise is stated simply: keep a database of how many beers each person has taken from the beer fridge. I show how existing open source published networks can be chained together to create a “good enough” solution for a real-world situation, with little data collection or labeling required by the developer and no more skill than a bit of basic Python. I then consider a number of research areas where further automation could significantly improve “time to science” and encourage all eResearch practitioners to have a go.
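The chaining idea can be sketched without any real networks at all. In the stub below the two detector functions are hypothetical stand-ins for pretrained open-source models (e.g. an off-the-shelf face recogniser and an object detector); only the glue logic that tallies beers per person is real:

```python
from collections import Counter

# Hypothetical stand-ins for pretrained open-source networks.
def detect_person(frame):
    """Return an identity label for the face in the frame, or None."""
    return frame.get("person")

def detect_beer_taken(frame):
    """Return True if the frame shows a beer leaving the fridge."""
    return frame.get("beer_taken", False)

def run_beerbot(frames):
    """Chain the two detectors and keep a per-person tally of beers taken."""
    tally = Counter()
    for frame in frames:
        person = detect_person(frame)
        if person is not None and detect_beer_taken(frame):
            tally[person] += 1
    return tally

# Toy "camera feed": dicts standing in for decoded video frames.
frames = [
    {"person": "chris", "beer_taken": True},
    {"person": "chris", "beer_taken": True},
    {"person": "kerri", "beer_taken": False},
]
tally = run_beerbot(frames)
```

Swapping the stubs for published networks is the only hard part, and that is exactly the part the pretrained models give you for free.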


Biography:

Chris has been kicking around the eResearch sector for over a decade. He has a background in quantum physics and, with the arrogance of physicists everywhere, thinks this qualifies him to stick his big nose into topics he knows nothing about.

Bending the Rules of Reality for Improved Collaboration and Faster Data Access

David Hiatt

WekaIO, San Jose, CA, United States, Dave@Weka.IO

 

The popular belief is that research data is heavy; therefore, data locality is an important factor in designing an appropriate data storage system to support research workloads. The solution is often to locate data near compute and depend on a local file system or block storage for performance. This tactic results in a compromise that severely limits the ability to scale these systems with data growth or to provide shared access to data.

Advances in technology such as NVMe flash, virtualization, distributed parallel file systems, and low-latency networks leverage parallelism to bend the rules of reality, providing faster-than-local file system performance with cloud scalability. The impact on research is to greatly simplify and reduce the cost of HPC-class storage: researchers spend less time waiting on results, and more of their grant money goes to research rather than specialty hardware.


Biography:

David Hiatt is the Director of Strategic Market Development at WekaIO, where he is responsible for developing business opportunities within the research and high-performance computing communities. Previously, Mr. Hiatt led market development activities in healthcare and life sciences at HGST’s Cloud Storage Business Unit and Violin Memory. He has been a featured speaker on data storage related topics at numerous industry events. Mr. Hiatt earned an MBA from the Booth School of Management at the University of Chicago and a BSBA from the University of Central Florida.

Requirements On a Group Registry Service in Support of Research Activity Identifiers (RAiDs)

Dr Scott Koranda1, Dr Andrew Janke2, Ms Heather Flanagan1, Ms Siobhann Mccafferty3, Mr Benjamin Oshrin1, Mr Terry Smith3

1Spherical Cow Group, Wauwatosa, United States, skoranda@sphericalcowgroup.com, hlflanagan@sphericalcowgroup.com, benno@sphericalcowgroup.com

2Research Data Services, Brisbane, Australia, andrew.janke@uq.edu.au

3Australian Access Federation, Brisbane, Australia, siobhann.mccafferty@aaf.edu.au, t.smith@aaf.edu.au

 

Persistent Identifiers (PIDs) are an essential tool of digital research data management and the evolving data management ecosystem. They allow for a clear line of sight along data management processes and workflows, more efficient collaboration, and more precise measures of cooperation, impact, value and outputs. The Research Activity Identifier (RAiD) [1] was developed by the Australian Data Life Cycle Framework Project (DLCF) [2] in response to this need and is a freely available service and API. A RAiD is a handle (a string of numbers) minted via the RAiD API. The handle is persistent and can have other digital identifiers associated with it, such as ORCiDs [3], DOIs [4], and Group Identifiers (GiDs).
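The association described above — one persistent handle linked to contributor, output, and group identifiers — can be sketched as a simple data structure. The field names and the example handle are illustrative only, not the actual RAiD API schema:

```python
from dataclasses import dataclass, field

@dataclass
class Raid:
    """A minted RAiD handle plus the identifiers associated with it."""
    handle: str                                   # persistent handle minted via the RAiD API
    orcids: list = field(default_factory=list)    # contributor ORCiDs
    dois: list = field(default_factory=list)      # related dataset/article DOIs
    gids: list = field(default_factory=list)      # group identifiers (GiDs)

    def associate(self, kind, identifier):
        """Attach another digital identifier to this research activity."""
        getattr(self, kind).append(identifier)

# Hypothetical example: link an ORCiD and a DOI to a freshly minted handle.
raid = Raid(handle="10378.1/1234567")
raid.associate("orcids", "0000-0003-0547-5171")
raid.associate("dois", "10.1000/example")
```

The open question the study addresses is precisely how the `gids` list should be minted, structured, and curated by a group registry service.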

The minting, structure, management, and curation of GiDs are open and evolving issues. We present the program of work and results from a study of these issues around GiDs undertaken by collaboration between the Research Data Services (RDS) [5] project and the Australian Access Federation (AAF) [6]. The study focused on supporting the group management needs of Australian research collaborations, services, and infrastructure and included use cases and user stories from the National Imaging Facility (NIF) [7], AARNET CloudStor [8], the UQ Data Management Planning system (UQ RDM) [9], and the Research Data Box (ReDBox) [10] from the Queensland Cyber Infrastructure Foundation (QCIF).

We report on requirements for a group registry service to serve as the foundation for a GiD API and detail what future enhancements to the group registry service will be necessary to support collaboration across international boundaries via services federated with eduGAIN through AAF subscriptions.

REFERENCES

  1. Available at: https://www.raid.org.au/, accessed 06 June 2018.
  2. Data Life Cycle Framework Project. Available at: https://www.dlc.edu.au/, accessed 06 June 2018.
  3. Available at: https://orcid.org/, accessed 06 June 2018.
  4. Available at: https://www.doi.org/, accessed 06 June 2018.
  5. Available at: https://www.rds.edu.au/, accessed 06 June 2018.
  6. Available at: https://aaf.edu.au/, accessed 06 June 2018.
  7. Available at: http://anif.org.au/, accessed 06 June 2018.
  8. Available at: https://www.aarnet.edu.au/network-and-services/cloud-services-applications/cloudstor, accessed 06 June 2018.
  9. Available at https://research.uq.edu.au/project/research-data-manager-uqrdm, accessed 06 June 2018.
  10. Available at https://www.qcif.edu.au/services/redbox, accessed 06 June 2018.

Biographies:

Andrew Janke is the Informatics Fellow for the National Imaging Facility (NIF), Systems Architect for the DLCF and Research Data Services (RDS), and Senior Research Fellow at the Centre for Advanced Imaging (CAI), University of Queensland. https://orcid.org/0000-0003-0547-5171

Scott Koranda specializes in identity management architectures that streamline and enhance collaboration for research organizations. https://orcid.org/0000-0003-4478-9026

Siobhann McCafferty is a Brisbane based Research Data Management specialist. She is the Project Manager for the Data LifeCycle Project (https://www.dlc.edu.au/) and part of the RAiD Research Activity Identifier Project (https://www.raid.org.au/).  https://orcid.org/0000-0002-2491-0995

Terry Smith provides support and training to the AAF subscriber community, and leads international engagement across the Asia-Pacific region as chair of the APAN Identity and Access Management working group and globally through eduGAIN. https://orcid.org/0000-0001-5971-4735

Towards ‘end-to-end’ research data management support

Mrs Cassandra Sims1

1Elsevier, Chatswood, Australia, c.sims@elsevier.com 

 

Information systems supporting science have come a long way and include solutions that address many research data management needs faced by researchers, as well as their institutions. Yet, because the landscape is fragmented, even with the best solutions available researchers and institutions sometimes miss crucial insights and spend too much time searching, combining and analysing research data [1].

With this in mind, we are working to address all aspects of the research lifecycle holistically, as shown in Figure 1. The lifecycle starts with the design phase, in which researchers decide on their next project, prepare their experiments and collect initial data. It then moves into the execution phase, in which experiments are run and research data are collected, shared within the research group, processed, analysed and enriched. Finally, the results are published and the main research outcomes are shared within scientific community networks.

Figure 1: Research lifecycle

Throughout this process researchers use a variety of tools, both within the lab and to share their results. Research processes like this happen every day, yet no current solution enables end-to-end support of this process for researchers and institutions.

Many institutes have established internal repositories, which have their own limitations. At the same time, various open data repositories [2] have grown with their own set of data and storage/retrieval options, and many scholarly publishers now offer services to deposit and reference research datasets in conjunction with the article publication.

One challenge often faced by research institutes is developing and implementing solutions to ensure that researchers can find each other’s research in the various data silos in the ecosystem (i.e. assigning appropriate ontologies, metadata, and researcher associations). Another challenge is to increase research impact and collaboration both inside and outside the institution, to improve the quantity and quality of research output.

Making data available online can enhance the discovery and impact of research. The ability to reference details about research data, such as ownership and content, could improve citation statistics for published research [3]. In addition, many funders increasingly require that data from supported projects be placed in an online repository, so research institutes need to ensure that their researchers comply with these requirements.

This talk presents a suite of tools and services developed to assist researchers and institutions with their research data management needs [4]. It covers the entire spectrum, from data capture through to making data comprehensible and trusted, enabling researchers to gain proper recognition and institutions to improve their overall ranking by going “beyond the mandates”.

I will explain how it integrates through open application programming interfaces with the global ecosystem for research data management (shown in Figure 2), including:

  1. DANS [7] for long-term data preservation,
  2. DataCite [5] for DOIs and indexed metadata to help with data publication and inventory,
  3. Scholix [6] for support of links between published articles and datasets,
  4. More than 30 open data repositories for data discovery.
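As an illustration of the DataCite integration point, a tool registering a dataset DOI prepares a metadata payload and posts it to the registration endpoint. The sketch below loosely follows DataCite's JSON:API style, but the exact field names and values are a hypothetical example, not a verified schema:

```python
import json

def datacite_payload(doi, title, creators, publisher, year):
    """Build an illustrative DataCite-style DOI registration payload."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "doi": doi,
                "titles": [{"title": title}],
                "creators": [{"name": c} for c in creators],
                "publisher": publisher,
                "publicationYear": year,
            },
        }
    }

# Hypothetical dataset registration.
payload = datacite_payload(
    "10.1000/example-dataset",
    "Example dataset",
    ["Doe, Jane"],
    "Example University",
    2018,
)
body = json.dumps(payload)  # serialised body, ready to POST to the DOI service
```

An open API of this shape is what lets the repository, the preservation service, and the publication record stay linked without manual re-entry.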

Figure 2: Integration with the global research data management ecosystem

The talk will conclude with an overview of current data sharing practices and a short demonstration of how we incorporate feedback from our development partners: University of Manchester, Rensselaer Polytechnic Institute, Monash University and Nanyang Technological University.

REFERENCES

  1. de Waard, A., Cousijn, H., and Aalbersberg IJ. J., 10 aspects of highly effective research data. Elsevier Connect. Available from https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data, accessed 15 June 2018.
  2. Registry of research data repositories. Available from: https://www.re3data.org/, accessed 15 June 2018.
  3. Vines, T.H. et al., The Availability of Research Data Declines Rapidly with Article Age. Current Biology, 2014, 24(1): p. 94-97.
  4. Elsevier research data management tools and services. Available from: https://www.elsevier.com/solutions/mendeley-data-platform, accessed 15 June 2018.
  5. DataCite. Available from: https://www.datacite.org/, accessed 15 June 2018.
  6. Scholix: a framework for scholarly link exchange. Available from http://www.scholix.org/, accessed 15 June 2018.
  7. Data Archiving and Networked Service (DANS). Available from: https://dans.knaw.nl/en, accessed 15 June 2018.

Biography:

Senior Research Solutions Manager ANZ

Cassandra has worked for Elsevier for over six years, first as Product Solutions Manager APAC and currently as Senior Research Solutions Manager ANZ. She has extensive experience across the academic, government and health science segments in the region, working with universities, government organisations, local area health districts, funders and industry to assist in the development of business strategies, data asset management and core enterprise objectives. Specialising in detailed analytics, collaboration mapping and bibliometric data, Cassandra draws on her knowledge in these areas to provide customers with innovative solutions that meet their changing needs. She has worked with the NHMRC, ARC, MBIE, RSNZ, AAMRI and every university in the ANZ region, and is responsible for all new business initiatives in ANZ and for supporting strategic initiatives across APAC.

CILogon 2.0: An Integrated Identity and Access Management Platform for Science

Dr Jim Basney2, Ms Heather Flanagan1, Mr Terry Fleury2, Dr Scott Koranda1, Dr Jeff Gaynor2, Mr Benjamin Oshrin1

1Spherical Cow Group, Wauwatosa, United States, hlflanagan@sphericalcowgroup.com, skoranda@sphericalcowgroup.com, benno@sphericalcowgroup.com

2University of Illinois, Urbana, United States, jbasney@illinois.edu, tfleury@illinois.edu, gaynor@illinois.edu

 

When scientists work together, they use web sites and other software to share their ideas and data. To ensure the integrity of their work, these systems require the scientists to log in and verify that they are part of the team working on a particular science problem. Too often, the identity and access verification process is a stumbling block for the scientists. Scientific research projects are forced to invest time and effort into developing and supporting Identity and Access Management (IAM) services, distracting them from the core goals of their research collaboration.

CILogon 2.0 provides a software platform that enables scientists to work together to meet their IAM needs more effectively so they can allocate more time and effort to their core mission of scientific research. The platform builds on prior work from the CILogon [1] and COmanage [2] projects to provide an integrated IAM platform for cyberinfrastructure, federated worldwide via InCommon [3] and eduGAIN [4]. CILogon 2.0 serves the unique needs of research collaborations, namely the need to dynamically form collaboration groups across organizations and countries, sharing access to data, instruments, compute clusters, and other resources to enable scientific discovery.

We operate CILogon 2.0 via a software-as-a-service model to ease integration with cyberinfrastructure, while making all software components publicly available under open source licenses to enable reuse. We present the design and implementation of CILogon 2.0, along with operational performance results from our experience supporting over four thousand active users.
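Because CILogon exposes a standard OpenID Connect flow, a service integrating with the platform typically just redirects users to an authorization URL. The sketch below builds such a URL with standard OIDC parameters; the client_id and redirect_uri are hypothetical placeholders, and the endpoint and scope names are assumptions based on common OIDC practice rather than verified CILogon configuration:

```python
from urllib.parse import urlencode

def authorize_url(base, client_id, redirect_uri, scopes):
    """Build a standard OIDC authorization-code request URL."""
    params = {
        "response_type": "code",          # authorization-code flow
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),        # space-separated OIDC scopes
    }
    return base + "?" + urlencode(params)

# Hypothetical registered client redirecting a user to log in via their
# home organization (federated through InCommon/eduGAIN).
url = authorize_url(
    "https://cilogon.org/authorize",
    "cilogon:/client_id/example",
    "https://myportal.example.org/callback",
    ["openid", "email", "org.cilogon.userinfo"],
)
```

The point of the software-as-a-service model is that this redirect, plus a token exchange on the callback, is all the IAM code a research portal has to write.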

REFERENCES

  1. Available at: http://www.cilogon.org/, accessed 07 June 2018.
  2. Available at: https://spaces.internet2.edu/display/COmanage/Home, accessed 07 June 2018.
  3. Available at: https://www.incommon.org/, accessed 07 June 2018.
  4. Available at: https://edugain.org/, accessed 07 June 2018.

Biography:

Scott Koranda specializes in identity management architectures that streamline and enhance collaboration for research organizations. https://orcid.org/0000-0003-4478-9026

Managing Your Data Explosion

Michael Cocks1

1Country Manager – ANZ, Spectra Logic, mikec@spectralogic.com

 

As high performance computing (HPC) environments, universities, and research organizations continually test the limits of technology and demand peak performance from their equipment, the volume of data created each day continues to grow exponentially. It is essential for these organizations to consider future needs when examining storage options. Short-term fixes to store and manage data are appealing due to their low entry point, but often worsen long-term storage challenges associated with performance, scalability, cost, and floor space. A future-looking data storage solution for HPC requires:

  1. A multi-tier architecture spanning disk, tape, and cloud
  2. Fully integrated clients that are easy to use and support the seamless transfer, sharing and publication of very large data sets from online, nearline and offline storage across diverse sites and systems
  3. The capability to plan for growth, scale incrementally, and span the entire data lifecycle
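The tiering decision at the heart of such a multi-tier architecture can be sketched as a simple policy: place each file on disk, tape, or cloud according to its age and access pattern. The thresholds and tier names below are illustrative, not any particular vendor's policy engine:

```python
def choose_tier(age_days, accesses_last_30d):
    """Assign a file to a storage tier by age and recent access count."""
    if accesses_last_30d > 10:
        return "disk"    # hot data stays on the fast tier
    if age_days > 365:
        return "tape"    # cold archival data moves offline
    return "cloud"       # lukewarm data goes to elastic nearline storage

# Hypothetical file inventory: name -> (age in days, accesses in last 30 days).
files = {
    "results.h5": (5, 40),
    "raw_2016.tar": (700, 0),
    "refs.csv": (90, 2),
}
placement = {name: choose_tier(*stats) for name, stats in files.items()}
```

A fully integrated solution runs this kind of policy continuously and moves data between tiers without the researcher having to notice.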

This presentation will cover the advantages of a fully integrated multi-tier HPC data storage architecture and show how such solutions help organizations managing massive data stores push the boundaries of their operational objectives, providing cost-effective storage that meets their performance, growth, and environmental needs.

Figure 1: A multi-tier hybrid storage ecosystem


Biography:

Michael Cocks is the Country Sales Manager for Spectra Logic in Australia and New Zealand. With more than 25 years of experience in the industry, Michael has held various roles within computing and data storage companies such as Silicon Graphics (SGI), Hitachi Data Systems (HDS), Olivetti and Spectra Logic. At Spectra, he manages relations with several customers in the Australia and NZ area, including Fox Sports, Foxtel, Weta Digital, Post Op Group, TVNZ, Australian Square Kilometre Array Pathfinder, Australian National University, UTAS, CSIRO and many others. Michael graduated from Southampton University in the UK where he studied Electronics Engineering.


About the conference

eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.
