High Throughput File Transport over Wide Area Networks

Chao Jin1, Prof David Abramson1, Jake Carroll1

1The University of Queensland

Globally distributed computing infrastructures, such as clouds and supercomputers, are now used to manage data generated at unprecedented speed from a variety of sources. As a result of this architectural shift, massive amounts of data are distributed geographically, manipulated across organizational boundaries, and moved across wide area networks (WANs). To accelerate data transfer, high-speed networks connect remote sites. Most existing data movement solutions, such as GridFTP, mdtmFTP, FDT, XDD, AWS DataSync and Snowball, and IBM Aspera, are optimized mainly for moving large files. However, a recent study of international data movement for scientific computing shows that 70% of files moved between distant facilities are smaller than 1MB, and file size has a substantial impact on overall performance. Transferring lots of small files (LOSF) across networks remains challenging.

This weakness not only lowers data transfer performance but also decreases overall system utilization. We find that moving small files is constrained mainly by degraded file system throughput, not just by network performance as might be suspected. We analyze the impact of small network I/O and storage I/O on data movement using a data transfer pipeline model, and demonstrate engineering approaches that mitigate the data transfer bottleneck.
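The pipeline view described above can be illustrated with a minimal sketch. It assumes a three-stage pipeline (source file system read, network transfer, destination file system write) whose stages overlap fully, so steady-state throughput is set by the slowest stage. Each stage is assumed to pay a fixed per-file overhead plus a size-dependent cost. The stage names, the `Stage` and `pipeline_throughput` helpers, and every number below are hypothetical, chosen for illustration only and not taken from the measurements in this work.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline stage with a fixed per-file cost and a streaming bandwidth."""
    name: str
    per_file_overhead_s: float    # assumed fixed cost per file (open/close, metadata, round trips)
    bandwidth_bytes_per_s: float  # assumed streaming bandwidth of the stage

    def time_per_file(self, file_size_bytes: float) -> float:
        # Time the stage spends on one file: fixed overhead plus transfer time.
        return self.per_file_overhead_s + file_size_bytes / self.bandwidth_bytes_per_s


def pipeline_throughput(stages, file_size_bytes):
    """Return (bytes per second, bottleneck stage name) for a fully overlapped pipeline."""
    slowest = max(stages, key=lambda s: s.time_per_file(file_size_bytes))
    return file_size_bytes / slowest.time_per_file(file_size_bytes), slowest.name


# Hypothetical parameters: file systems with millisecond-scale per-file overhead
# and a 10 Gb/s network link with small per-file overhead.
STAGES = [
    Stage("source read", 0.002, 2e9),
    Stage("network", 0.0002, 1.25e9),
    Stage("destination write", 0.003, 2e9),
]

if __name__ == "__main__":
    for size in (64 * 1024, 1024 ** 2, 1024 ** 3):
        tput, bottleneck = pipeline_throughput(STAGES, size)
        print(f"{size:>12} B  {tput / 1e6:10.1f} MB/s  bottleneck: {bottleneck}")
```

Under these assumed parameters, the per-file overhead of the file system stages dominates for small files, while the network link becomes the limit only once files are large enough to amortize that overhead.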

In this presentation, we will discuss the following issues:

  • What file sizes lower the efficiency of moving data across networks;
  • Why moving LOSF is mainly constrained by file system performance;
  • Potential approaches that can mitigate the LOSF bottleneck.

Biography:

Dr. Chao Jin’s research interests include distributed systems, parallel computing and storage systems. At the Research Computing Centre, The University of Queensland, he currently focuses on scalable, high-throughput distributed systems, energy-efficient computing, and performance optimization of shared memory systems.
