The Data Directory Cataloger – It's Just a Simple Python Program

Dr. Michael Lake1

1University of Technology Sydney, Sydney, Australia

Biography:

Mike Lake runs a high performance computer cluster for researchers within eResearch at UTS. He enjoys working with researchers to help them process their huge genomics data sets and to help them learn a bit about Linux, Open Source and reproducible research.

Abstract:

The Data Directory Cataloger is a Python program that helps research groups to document their data. The University of Technology Sydney currently uses it to help four research groups document their data on the storage system of our High Performance Computer Cluster. The Python program is fewer than 500 lines long.

Our researchers have over 3,200 terabytes of data, and the volume is increasing. This is mostly genome data, with file sizes in the range of several hundred gigabytes. To optimise storage management, those managing the server aim to prevent duplication of large files, identify the researcher accountable for the data, and determine its retention period or deletion date.

Some current research data management systems are complex and have several flaws in their design. I wanted a system that was simple, robust and would also be of benefit to the researchers. The Data Directory Cataloger fulfils those requirements.

Researchers just drop a README text file containing a few metadata fields into each directory that holds their data. Once a day the program scans the directories for those README files and collates the metadata into Markdown documents. A static website generator then creates a website from those documents. Researchers, postdocs and IT staff can view that website to find out what data is available, where it resides and who manages it. It has been particularly valuable to new researchers joining a research group.
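The scan-and-collate step described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the Data Directory Cataloger's actual code: it assumes each README holds one "Key: value" pair per line, and the function names (parse_readme, find_readmes, collate_markdown) and the field names in the usage note are hypothetical.

```python
import os
from pathlib import Path


def parse_readme(path):
    """Read 'Key: value' lines from a README into a dict.

    The 'Key: value' layout is an assumption made for this sketch.
    Lines without a colon are ignored.
    """
    metadata = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            metadata[key.strip()] = value.strip()
    return metadata


def find_readmes(root, filename="README"):
    """Yield the path of every file named `filename` below `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        if filename in filenames:
            yield Path(dirpath) / filename


def collate_markdown(root):
    """Build one Markdown document listing the metadata of each data directory."""
    lines = ["# Data directory catalogue", ""]
    for readme in find_readmes(root):
        lines.append(f"## {readme.parent}")
        lines.append("")
        for key, value in parse_readme(readme).items():
            lines.append(f"- **{key}**: {value}")
        lines.append("")
    return "\n".join(lines)
```

Given a README containing, say, "Title: Wheat genomes" and "Data Manager: Jane Example" (hypothetical fields), collate_markdown would emit a heading for that directory followed by one bullet per field. The resulting Markdown text could then be written to a file and passed to a static website generator, and the whole script run once a day from a scheduler such as cron.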
