Matt Andrews1, David Martin2, Mahmoud Sadeghi3, Peter Ansell4, Adam Collins5, Miles Nicholls6
1Atlas of Living Australia, Canberra, Australia, firstname.lastname@example.org
2Atlas of Living Australia, UK, email@example.com
3Atlas of Living Australia, Canberra, Australia, firstname.lastname@example.org
4Atlas of Living Australia, Canberra, Australia, email@example.com
5Atlas of Living Australia, Canberra, Australia, firstname.lastname@example.org
6Atlas of Living Australia, Canberra, Australia, email@example.com
The Atlas of Living Australia, as a project being committed to the principles of open source software and open data, has made all of its operating code available to the public. ALA software is used by many projects around the world to build national databases of biodiversity. Running production infrastructure to meet the needs of researchers and the general public is a significant challenge, with an ever-growing dataset both in terms of the number of records and the width of the data, particularly with the prospect of adding trait information.
In early 2017, the Atlas of Living Australia faced a series of problems with its core infrastructure. While the Atlas had been well designed from the start to operate as an ecosystem of independent services, several of the key services were looking increasingly unsustainable. Almost all were running on single large servers: a fragile position which meant that any significant maintenance or performance issue would take down the whole service. One of the primary components of the Atlas, which we call “biocache”, is responsible for the search, display, and download of occurrence records. It was running on very old versions of Cassandra and Solr. A full update of data, to process and index the full dataset and make it available for search, was taking several days to complete, and if it hit an error, we had to run the whole thing again. This reduced the number of times that new data quality rules, spatial information, and new taxonomic names were applied to the Atlas records. We were in an uncomfortable position, and hitting the limits of what we had.
To solve this combination of problems – old unsupported software versions, single monolithic servers, very slow data processing – we decided to make a fundamental change in the internal architecture of each of the services within the biocache component.
With the two primary data stores, Cassandra and Solr, this meant not only migrating from old unsupported versions to more recent versions, but also moving each to a clustered pool of servers. This would add significant resilience to the system by adding data replication, so the pool can tolerate the loss of a server, and allow for much less disruptive system maintenance which makes the system as a whole much more robust. The splitting of data into smaller segments meant that processing and indexing can be performed using more parallelism due to the increased IO bandwidth, making these tasks dramatically faster.
Apart from the large data stores, the biocache component contains user-facing web applications. These had been running on single servers, which was inflexible, more difficult to maintain and entailed a higher risk of downtime. We decided to move to a model where each of these web applications would be run in a pool of servers behind load balancers. With this approach, a server can be removed from a pool for maintenance with no loss of service, and the system is resilient to a significant range of disaster scenarios.
The development effort in building these new systems largely focused on the new clustered versions of Cassandra and Solr. For Cassandra, the move from a single large server running the long-unsupported version 1 of Cassandra to an up-to-date setup running in a replicated cluster, across several nodes, represented a fundamental shift. We started in Britain, with the Atlas closely involved in building the infrastructure for the new National Biodiversity Network (NBN), which uses ALA software to host data for the UK.
After months of development effort, a new version of the core Cassandra and Solr systems was in place. With a cluster of four Cassandra nodes, and a Solr Cloud environment with eight nodes, the UK infrastructure was performing exceptionally well, querying a large dataset successfully, and with far better fault tolerance, system resilience and much faster data ingestion and processing.
Once the new approach had been proven in the UK development, an effort began in the Australian team to build an equivalent set of systems for the ALA, and implementing the new approach of running web applications as pools behind load balancers. Around twelve months of effort from several developers went into planning, developing and testing the project, using the UK work as a reference point.
FLEXIBILITY AND RESILIENCE: BETTER SUPPORT FOR RESEARCH
With the new clustered infrastructure in place, this core component of the Atlas is now considerably more resilient to unexpected outages or spikes in demand, while delivering dramatically better results for us in the data ingestion, processing and indexing area.
In the old systems, indexing our full dataset of around 75 million records previously took 24 hours or more. In the new clustered infrastructure, this operation takes around three and a half hours. Similarly, full processing and sampling of the ingested data used to take up to six days; in the new system it takes about 11 hours. These spectacular performance improvements give us the space to speed up our data ingestion cycle.
By running all our public-facing services in the biocache component as pools of servers behind load balancers, we are considerably more robust in being able to keep the services running even if a single server goes down, or indeed even if a whole data centre goes down.
The other significant advantage of running a clustered infrastructure is scalability: adding capacity to handle more records, or a wider range of fields for each record, can be achieved without pain and with little or no disruption of service. This will be valuable as we prepare to add many new fields to handle trait information. There are many more areas of improvement that lie ahead for the Atlas’ infrastructure, but overall this fundamental shift in how we run our biocache component has been a success.
- Global Biodiversity Information Facility website, Living Atlases page: https://living-atlases.gbif.org/#participants Accessed 7 June 2018.
- National Biodiversity Network website, About the NBN Atlas page: https://nbnatlas.org/about-nbn-atlas/ Accessed 7 June 2018.
Matt Andrews has been involved in herding software for several years now. His past lives include being chauffered between cities handcuffed to a briefcase, making TV ads for contraceptives in Papua New Guinea, training programmers in French bars, being escorted around China by a posse of secret police, and going to school in an Italian village. After running tech operations for a network of European dating sites, and then for various corners of the Federal Government, he’s now managing DevOps for the Atlas of Living Australia.