Introduction

Opportunities from Burst Buffers: Burst buffers have been stipulated as mandatory hardware in recent calls for leadership supercomputers [25] to cope with the I/O pressure from extreme-scale applications. They are typically composed of high-bandwidth persistent memory devices for fast caching of application datasets while achieving data persistence at the same time.

ECC for I/O Containerization and Disaggregation: In this project, we introduce the Ephemeral Coherence Cohort (ECC) as a novel abstraction that can unify the node-local and global shared burst buffers and reconcile the two seemingly opposite architectural choices. In addition, we propose to broaden the concept of coherence domains from processors working on shared cache lines to parallel processes performing I/O to files shared across burst buffers. It will bridge the coherence domain and the underlying persistence domain. With its high bandwidth memory devices, ECC can represent burst buffers as disaggregated resources from CPU cores and memory from a single compute node to a cohort of burst buffers, thus supporting job-specific resource allocation and configuration. Furthermore, similar to coherence domain restriction on shared memory systems [32], by holding an application’s collection of active files in burst buffers, ECC temporarily containerizes its I/O activities from another job or ensemble, and insulates them from any variations in the rest of parallel file systems, thereby improving the performance and scalability of I/O. Together, by taking advantage of high-bandwidth burst buffers, ECC will offer a research vehicle to explore I/O containerization and disaggregation for extreme-scale systems.

A Collaborative Project from UIUC and FSU: In this project, we form a team of researchers from the University of Illinois Urbana-Champaign (UIUC) and Florida State University (FSU) to design and implement ECC as a software framework for burst buffers and to support new usage models. Our collaborative project builds on recent studies from UIUC and FSU on burst buffer management, I/O scheduling, performance optimization, and extreme-scale application optimization. With two complementary organizations, this project can expand the research breadth and depth beyond what a single PI can achieve. Specifically, we plan to carry out the following research and education activities.

  • To introduce ECC as a new containerization model to represent and host the active files of an application using a cohort of ephemeral burst buffers, and develop it as a service-oriented key-value (KV) store with a simple API and a collection of asymmetric I/O services.
  • To enable containerized delegation with relaxed consistency for scalable metadata management within an ECC and directory-based eventual consistency between an ECC and the PFS.
  • To enable storage disaggregation by creating virtual partitions in burst buffers for fine-grained storage allocation. A plugin component will be introduced in popular HPC schedulers such as SLURM [47] to expose the disaggregated burst buffers to the users.
  • To characterize various sources of I/O variability caused by contention and congestion on internal I/O processing components in the ECC framework, and accordingly, mitigate the variability by dynamically adjusting the data and metadata management services.
  • To enhance the effective capacity of burst buffers using data compression, accompanying garbage collection, and intelligent replacement.
  • To assess the benefits of ECC with a carefully selected set of representative applications on DOE (Department of Energy) and NSF (National Science Foundation) leadership computing facilities.
  • To disseminate research results as publications, presentations, and conference tutorials, strengthen the partnership between UIUC and FSU and their collaborations with national research labs, and release open-source software to the public. In particular, we will organize panels and birds-of-a-feather sessions to review burst buffer research efforts at upcoming supercomputing conferences.
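
As a rough illustration of the first activity above (hosting an application's active files in a key-value store spread over a cohort of burst buffers), the following toy sketch may help. It is not the ECC design or API: the names `BurstBuffer`, `Cohort`, `write`, and `read`, the fixed chunk size, and the hash-based placement are all assumptions made for this example, and each burst buffer is simulated by an in-memory dictionary.

```python
import hashlib
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to trace


@dataclass
class BurstBuffer:
    """Stand-in for one burst buffer: an in-memory key-value map (assumption)."""
    name: str
    kv: dict = field(default_factory=dict)


class Cohort:
    """Hypothetical cohort that spreads file chunks over several burst buffers."""

    def __init__(self, names):
        self.buffers = [BurstBuffer(n) for n in names]

    def _owner(self, key):
        # A stable hash of (path, chunk index) picks the buffer owning the chunk.
        digest = hashlib.sha256(repr(key).encode()).digest()
        return self.buffers[int.from_bytes(digest[:8], "big") % len(self.buffers)]

    def write(self, path, offset, data):
        # Only chunk-aligned offsets are handled, to keep the sketch short.
        if offset % CHUNK_SIZE != 0:
            raise ValueError("offset must be chunk-aligned in this sketch")
        for i in range(0, len(data), CHUNK_SIZE):
            key = (path, (offset + i) // CHUNK_SIZE)
            self._owner(key).kv[key] = data[i:i + CHUNK_SIZE]

    def read(self, path, offset, length):
        if offset % CHUNK_SIZE != 0:
            raise ValueError("offset must be chunk-aligned in this sketch")
        out = bytearray()
        index = offset // CHUNK_SIZE
        while len(out) < length:
            key = (path, index)
            chunk = self._owner(key).kv.get(key)
            if chunk is None:
                break  # nothing stored at this position
            out += chunk
            if len(chunk) < CHUNK_SIZE:
                break  # a short chunk marks the end of the written data
            index += 1
        return bytes(out[:length])
```

A job would create one `Cohort` for its allocated buffers, for example `Cohort(["bb0", "bb1", "bb2"])`, and then call `write` and `read` with a file path and a chunk-aligned offset. A real system would additionally need metadata management, consistency with the parallel file system, and persistence, none of which this sketch attempts.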

Related Papers

People

Prof. Marc Snir

Prof. Weikuan Yu