Ian Foster1, Kyle Chard1, Eli Dart2, Steven Tuecke1, Jason Williams1
1The University of Chicago and Argonne National Laboratory, Chicago, Illinois, USA
2Energy Sciences Network, Berkeley, California, USA
We describe best practices for providing convenient, high-speed, secure access to large data via research data portals. We capture these best practices in a new design pattern, the Modern Research Data Portal, that disaggregates the traditional monolithic web-based data portal to achieve orders-of-magnitude increases in data transfer performance, support new deployment architectures that decouple control logic from data storage, and reduce development and operations costs. We introduce the design pattern; explain how it leverages high-performance Science DMZs and cloud-based data management services; review representative examples at research laboratories and universities, including both experimental facilities and supercomputer sites; describe how to leverage Python APIs for authentication, authorization, data transfer, and data sharing; and use coding examples to demonstrate how these APIs can be used to implement a range of research data portal capabilities. Companion sample code provides application skeletons that interested parties can adapt to realize their own research data portals.
THE LEGACY RESEARCH DATA PORTAL
The need for scientists to exchange data has led to an explosion over recent decades in the number and variety of research data portals: systems that provide remote access to data repositories for such purposes as discovery and distribution of reference data, the upload of new data for analysis and/or integration, and data sharing for collaborative analysis. Most such systems implement variants of a design pattern that we term the legacy research data portal (LRDP), in which a web server reads and writes a directly connected data repository in response to client requests.
The relative simplicity of this structure has allowed it to persist largely unchanged from the first days of the web. However, its monolithic architecture—in particular, its tight integration of control channel processing (request processing, user authentication) and data channel processing (routing of data to/from remote sources and data repositories)—has become an obstacle to performance, usability, and security, for reasons discussed below.
THE MODERN RESEARCH DATA PORTAL
An alternative architecture re-imagines the data portal in a much more scalable and performant form. In the modern research data portal (MRDP) design pattern, portal functionality is decomposed along two distinct but complementary dimensions. First, control channel communications and data channel communications are separated, with the former handled by a web server computer deployed (most often) in the institution’s enterprise network and the latter performed via specialized data servers connected directly to high-speed networks and storage systems. Second, responsibility for managing data transfers, data access, and sometimes also authentication is outsourced to external, often cloud-hosted, services. The design pattern thus defines distinct roles for the web server, which manages who is allowed to do what; data servers, where authorized operations are performed on data; and external services, which orchestrate data access.
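The division of roles described above can be illustrated with a small, self-contained Python sketch. It is a toy model only, not the reference implementation and not the Globus API: the class names (PortalWebServer, TransferService, DataServer), the endpoint names, and the file path are all hypothetical, and the "transfer" is an in-memory copy. The point it tries to show is that the web server makes only the authorization decision, while the bytes move between data servers under the direction of an external service.

```python
from dataclasses import dataclass, field


@dataclass
class DataServer:
    """Hypothetical data server in the Science DMZ; maps paths to file contents."""
    name: str
    files: dict = field(default_factory=dict)


@dataclass
class TransferService:
    """Hypothetical stand-in for an external, cloud-hosted transfer service."""
    servers: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def register(self, server):
        self.servers[server.name] = server

    def submit(self, source, destination, path):
        # Data channel: bytes go directly from one data server to the other.
        # The caller (the web server) never handles the file contents.
        src = self.servers[source]
        dst = self.servers[destination]
        dst.files[path] = src.files[path]
        task_id = f"task-{len(self.log) + 1}"
        self.log.append((task_id, source, destination, path))
        return task_id


@dataclass
class PortalWebServer:
    """Control channel only: decides who may do what, then delegates."""
    service: TransferService
    permissions: dict = field(default_factory=dict)  # user -> set of readable paths

    def request_download(self, user, source, destination, path):
        if path not in self.permissions.get(user, set()):
            raise PermissionError(f"{user} may not read {path}")
        return self.service.submit(source, destination, path)


def demo():
    """Wire the three roles together and run one authorized transfer."""
    repo = DataServer("repository", {"/climate/run1.nc": b"\x00" * 8})
    user_endpoint = DataServer("user-endpoint")
    service = TransferService()
    service.register(repo)
    service.register(user_endpoint)
    portal = PortalWebServer(service, {"alice": {"/climate/run1.nc"}})
    task_id = portal.request_download(
        "alice", "repository", "user-endpoint", "/climate/run1.nc"
    )
    return task_id, user_endpoint.files
```

In this sketch, PortalWebServer never reads or writes file contents; it only checks permissions and hands the request to the service. That is the property that, in a real deployment, lets the web server sit in the enterprise network while the data servers sit on the high-speed network path.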
In this talk, we first define the problems that research data portals address, introduce the legacy approach, and examine its limitations. We then introduce the MRDP design pattern and describe its realization via the integration of high-performance network architectures (we use the Science DMZ, which connects data stores to streamlined end-to-end network paths, as a canonical example) and cloud-based data management and authentication services (we use Globus as a canonical example). Figure 1 illustrates the architecture. We also provide a reference implementation online that the reader can deploy and adapt to build their own MRDP. A preliminary version is at https://github.com/globus/globus-sample-data-portal. Figure 2 illustrates the reference implementation.
We also review various deployments to show how the MRDP approach has been applied in practice: examples like the National Center for Atmospheric Research’s Research Data Archive, which provides for high-speed data delivery to thousands of geoscientists; the Sanger Imputation Service, which provides for online analysis of user-provided genomic data; the Globus data publication service, which provides for interactive data publication and discovery; and the DMagic data sharing system for data distribution from light sources. We present performance data that demonstrate the benefits of the MRDP approach.
Figure 1: MRDP basics, showing (a) the institutional firewall, behind which sits the MRDP web server, which implements the portal logic; and (b) the Science DMZ, within which sit research data. The domain-specific portal logic uses REST APIs to direct Globus services to operate on data in the Science DMZ.
Figure 2: Example of the MRDP reference implementation, showing a list of computed graphs for a user and (inset) one of these graphs.
K. Chard, S. Tuecke, and I. Foster. 2014. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. IEEE Cloud Computing 1, 3 (Sept 2014), 46–55. DOI: http://dx.doi.org/10.1109/MCC.2014.52
E. Dart, L. Rotman, B. Tierney, M. Hester, and J. Zurawski. 2013. The Science DMZ: A Network Design Pattern for Data-intensive Science. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’13). ACM, New York, NY, USA, Article 85, 10 pages. DOI: http://dx.doi.org/10.1145/2503210.2503245
Ian Foster is a Professor of Computer Science at the University of Chicago and a Senior Scientist and Distinguished Fellow at Argonne National Laboratory. Originally from New Zealand, he has lived in Chicago for longer than he likes to admit. Ian has a long record of research contributions in high-performance computing, distributed systems, and data-driven discovery. He has also led US and international projects that have produced widely used software systems and scientific computing infrastructures. He has published hundreds of scientific papers and six books on these and other topics. Ian is an elected fellow of the American Association for the Advancement of Science, the Association for Computing Machinery, and the British Computer Society. His awards include the British Computer Society’s Lovelace Medal, the IEEE Tsutomu Kanai award, and honorary doctorates from CINVESTAV, Mexico, and the University of Canterbury, New Zealand.