Dr Ian Foster1, Ben Blaiszik1, Logan Ward1
1Argonne National Laboratory & University of Chicago, Chicago, United States
The Materials Data Facility (MDF: materialsdatafacility.org) is a set of data services built to support materials science researchers. MDF consists of two synergistic services, data publication and data discovery. Its data publication service offers a scalable repository where materials scientists can publish, preserve, and share research data. Its repository provides a focal point for the materials community, enabling publication and discovery of materials data of all sizes. Its discovery service indexes data from many different public sources, enabling rapid discovery of data regardless of location and integrated analysis of data from multiple sources. Our goal in presenting MDF in this context is to solicit feedback and to seek collaborations within the Australian materials research community.
CONTEXT AND GOALS
Scientific researchers are increasingly limited not by a lack of data but by their ability to integrate and act on data: i.e., to analyze, comprehend, synthesize and combine, track, share, model, and mine myriad data sources to derive new insights and technical knowledge. This shift is now particularly apparent in materials science, where scientists are generating vast amounts of computational and experimental data from a wide set of user facilities (e.g., light sources), from simulations at supercomputing centers, from individual research labs, and from high-throughput experiments. We are developing MDF both to (a) provide ready access to these large quantities of often untapped data and (b) enable the ready application of new analysis methods, such as deep learning, to guide and, indeed, lead discovery.
Figure 1: MDF schematic, as described in text.
Figure 1 provides a summary of the MDF service ecosystem, showing how the MDF data publication (DPS) and data discovery (DDS) services are connected, and the actions users can perform in these services. Using MDF services, researchers can publish data to the DPS as bundles of data and metadata, leveraging distributed storage across endpoints. When a data publication is added, the metadata is automatically synced with the discovery service and deep indexing of materials-specific file contents also occurs. Users may query, browse, and aggregate data and metadata from the DDS through a web UI or through the API. Importantly, we are also investigating the harvesting and deep indexing of datasets external to the MDF ecosystem to bootstrap the index with scientifically relevant data.
As of June 2017, MDF has been used to publish around 11 TB of materials science data from a variety of experimental and simulation studies. The data discovery service has indexed data from more than 50 other repositories and datasets comprising 200 TB, for a total of more than 1.8M records. Leveraging indexing and search capabilities provided by the Globus cloud service, MDF supports powerful faceted search. API access makes it easy to develop applications that query and analyze MDF content, for example to combine data from multiple sources to train machine learning models, and to implement “bots” that query, analyze, and dynamically update MDF content.
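To make the idea of faceted search concrete, the following is a minimal sketch in plain Python. It does not use the MDF or Globus APIs: the records, the field names ("source", "elements", "kind"), and the helper functions facet_counts and refine are all hypothetical and serve only to show how facet counts are computed and how selecting a facet value narrows a result set.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative,
# not the actual MDF metadata schema.
RECORDS = [
    {"source": "oqmd", "elements": ["Al", "Cu"], "kind": "simulation"},
    {"source": "oqmd", "elements": ["Fe", "O"], "kind": "simulation"},
    {"source": "lab_upload", "elements": ["Al", "Ni"], "kind": "experiment"},
]


def _as_values(value):
    """Normalize a field value to a list so scalars and lists are handled alike."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def facet_counts(records, field):
    """Count how many records carry each distinct value of `field`."""
    counts = Counter()
    for record in records:
        # set() ensures a record is counted once per value, even if repeated.
        counts.update(set(_as_values(record.get(field))))
    return dict(counts)


def refine(records, field, wanted):
    """Keep only the records whose `field` equals or contains `wanted`."""
    return [r for r in records if wanted in _as_values(r.get(field))]


if __name__ == "__main__":
    print(facet_counts(RECORDS, "source"))
    aluminium = refine(RECORDS, "elements", "Al")
    print(facet_counts(aluminium, "kind"))
```

A search interface would typically show the counts from facet_counts next to each filter option and call something like refine when the user selects one.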
Figure 2 shows an example of MDF in action: a researcher is looking for data about nearly stable compounds, as determined by computational results in the Open Quantum Materials Database (OQMD). Because these data are indexed in MDF, it is straightforward to write a few lines of Python that first query for, and then download, the desired data.
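The query-then-download pattern described here can be sketched as follows. This is not the code from Figure 2 and it does not call the MDF API: the in-memory index, the field names ("source", "composition", "stability_ev", "url"), the example.org URLs, the 0.025 eV/atom threshold, and the functions nearly_stable and download are all assumptions made for illustration.

```python
# Hypothetical index entries. "stability_ev" stands in for an energy relative
# to the convex hull in eV/atom; values and URLs are invented for illustration.
INDEX = [
    {"source": "oqmd", "composition": "Al2Cu", "stability_ev": 0.000,
     "url": "https://example.org/data/al2cu.json"},
    {"source": "oqmd", "composition": "AlCu3", "stability_ev": 0.012,
     "url": "https://example.org/data/alcu3.json"},
    {"source": "oqmd", "composition": "Al3Cu", "stability_ev": 0.310,
     "url": "https://example.org/data/al3cu.json"},
    {"source": "other_repo", "composition": "FeO", "stability_ev": 0.005,
     "url": "https://example.org/data/feo.json"},
]


def nearly_stable(index, source, max_ev=0.025):
    """Step 1 (query): select records from `source` at or below the threshold."""
    return [
        record for record in index
        if record["source"] == source and record["stability_ev"] <= max_ev
    ]


def download(records, fetch):
    """Step 2 (download): call `fetch(url)` per record, keyed by composition."""
    return {record["composition"]: fetch(record["url"]) for record in records}


if __name__ == "__main__":
    hits = nearly_stable(INDEX, "oqmd")
    # A stub fetcher keeps the sketch offline; a real one would retrieve the file.
    files = download(hits, fetch=lambda url: f"<contents of {url}>")
    for composition, payload in files.items():
        print(composition, payload)
```

Passing the fetch function in as a parameter keeps the selection logic independent of how the data are actually transferred.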
This research was supported in part by NIST as part of the CHiMaD project funded by the U.S. Department of Commerce, National Institute of Standards and Technology, under financial assistance Award Number 70NANB14H012, and by the U.S. Department of Energy under Contract DE-AC02-06CH11357. We are grateful to our partners at the National Center for Supercomputing Applications, CHiMaD, and NIST for their assistance with this project.
- Blaiszik, B., K. Chard, J. Pruyne, R. Ananthakrishnan, S. Tuecke, and I. Foster. “The Materials Data Facility: Data services to advance materials science research.” JOM 68, no. 8 (2016): 2045-2052.
- Saal, J., S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton. “Materials design and discovery with high-throughput density functional theory: the open quantum materials database (OQMD).” JOM 65, no. 11 (2013): 1501.