1.1. Project Objectives
The main goal of this project is to characterize and model block IO workloads, and to use a synthetic workload generator framework to generate workloads that have similar caching behavior, temporal and spatial locality, read/write ratio, and block addressing to the original trace. Understanding workload characteristics is crucial for avoiding design inefficiencies in a storage system. Using this framework, one can run performance tests and tune existing storage systems to suit the needs of a workload and get the best performance. Additionally, we want to study how caching behavior changes across the layers of a multi-level storage hierarchy, and to study the impact of different data migration policies.
It is common practice to use benchmarks to characterize block IO workloads, but these benchmarks fail to describe the properties of real workloads. Also, trace files can be large, and storing all the traces can get costly. A better practice is to generate similar traces that mimic the properties of the original trace. Hence, a synthetic workload generator that captures the characteristics of the real workload is desired.
1.3. The Workload
We define the workload as a tuple $W = ((t_1, (o_1, l_1, s_1)), \ldots, (t_n, (o_n, l_n, s_n)))$, where $t_i$ represents the request time, $o_i$ the operation, $l_i$ the location, and $s_i$ the size of the $i$-th request. Each triple $(o_i, l_i, s_i) = r_i$ is a request, so $W = ((t_1, r_1), \ldots, (t_n, r_n))$. Since each $t_i$ is unique, $W$ can also be represented as a function with $W(t_i) = r_i$. The workload is also a time series, $r_t = f(r_{t-1}, \ldots, r_1)$. It has been suggested that the IO workload is a long-memory process, as opposed to a short-memory process such as a Poisson process.
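As a minimal sketch of this definition, the workload can be held as a time-ordered list of (t, r) pairs and looked up as a function of t. The names Request, Workload, at, and read_ratio are hypothetical and chosen for illustration only; the operation encoding ("R"/"W") and the units of location and size are assumptions, not part of the definition above.

```python
from typing import Dict, List, NamedTuple, Tuple


class Request(NamedTuple):
    """One request r = (o, l, s). Field encodings are assumptions."""
    op: str        # operation o, assumed to be "R" or "W"
    location: int  # location l, assumed to be a block address
    size: int      # size s, assumed to be in blocks


class Workload:
    """W as a time-ordered sequence of (t, r) pairs with unique t."""

    def __init__(self, entries: List[Tuple[float, Request]]):
        self.entries = sorted(entries, key=lambda e: e[0])
        self._by_time: Dict[float, Request] = {}
        for t, r in self.entries:
            if t in self._by_time:
                raise ValueError(f"duplicate timestamp {t}")
            self._by_time[t] = r

    def at(self, t: float) -> Request:
        """W(t) = r: the request issued at time t."""
        return self._by_time[t]

    def read_ratio(self) -> float:
        """Fraction of requests that are reads."""
        reads = sum(1 for _, r in self.entries if r.op == "R")
        return reads / len(self.entries)


# A made-up four-request workload, for illustration only.
w = Workload([
    (0.0, Request("R", 100, 8)),
    (0.5, Request("W", 108, 8)),
    (0.9, Request("R", 100, 8)),
    (1.4, Request("R", 4096, 16)),
])
```

Rejecting duplicate timestamps enforces the uniqueness of t that makes the function view W(t) = r well defined.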
1.4. Design of Experiments
We assume that if the workload is correctly and fully characterized, any variance in performance for a given system (in a specific state) must be due to variance in the workload characteristics. To simplify the experiments, we ensure that the system always starts in its initial state with no data. This is the main difference between our approach and previous approaches. Previously, a characterization was deemed successful if the synthetic workload derived from the characterization of the real workload resulted in similar performance. We argue that this is a necessary but not sufficient condition. To validate a characterization, we must also ensure that different characteristics result in different performance. The experiment flow of both approaches is the same. However, our approach requires k experiments with different real workloads to reach the required confidence level.
To characterize and generate block IO workloads effectively, we use the concepts of phase detection, probabilistic modeling, modular development, different caching policies, and optimizations. Our framework takes a block IO trace as input and extracts parameters, which then feed into the synthetic workload generator tool. The synthetic trace is refined repeatedly until an acceptable cache hit/miss ratio is reached. The trace then feeds into a workload replay tool that has tunable parameters such as queue depth, drive size, and inter-arrival time. The results can then be analyzed to compare the performance of the storage device or system.
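The acceptance check on caching behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the framework's actual implementation: it assumes an LRU cache, compares only the hit ratio at a single cache size, and uses a hypothetical tolerance of 0.05. The function names lru_hit_ratio and is_acceptable are invented for this example.

```python
from collections import OrderedDict
from typing import Iterable, Sequence


def lru_hit_ratio(blocks: Iterable[int], cache_size: int) -> float:
    """Replay block addresses through an LRU cache and return the hit ratio."""
    cache: "OrderedDict[int, None]" = OrderedDict()
    hits = 0
    total = 0
    for block in blocks:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / total if total else 0.0


def is_acceptable(original: Sequence[int], synthetic: Sequence[int],
                  cache_size: int, tolerance: float = 0.05) -> bool:
    """True when the two traces' hit ratios differ by at most `tolerance`."""
    diff = abs(lru_hit_ratio(original, cache_size)
               - lru_hit_ratio(synthetic, cache_size))
    return diff <= tolerance


# Made-up block address sequences with the same reuse pattern.
original = [1, 2, 3, 1, 2, 3, 4, 1]
synthetic = [7, 8, 9, 7, 8, 9, 5, 7]
print(lru_hit_ratio(original, 3), is_acceptable(original, synthetic, 3))
```

In a refinement loop, the generator parameters would be adjusted and the synthetic trace regenerated until a check like is_acceptable passes; comparing hit ratios across several cache sizes would be a stricter variant of the same idea.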
2.1. Project Objectives
The project aims to develop a novel modeling framework for parallel file system I/O workloads, as well as to predict the access patterns of parallel I/O workloads at Exascale. Based on the modeling framework, we are going to develop a synthetic workload generator that can be used for multiple purposes, including performance testing of new parallel file systems. In addition, we want to study the potential architectural changes needed to adapt to Exascale I/O workloads, as well as their impact on I/O workload patterns.
2.2. Background and Project Overview
A typical high performance computing system environment, as Figure 1 shows, usually includes system components such as compute nodes, a high-speed interconnect infrastructure, and I/O nodes with a backend storage infrastructure. I/O workloads can be captured and studied at different levels, such as the application level, the file system level, and the disk storage device level. This project targets parallel I/O workloads at the application level, because they truly represent the application I/O behavior before it is transformed by the various I/O optimizations along the I/O path, including I/O forwarding mechanisms.
The I/O software stack grows deeper as the computing scale and I/O scale increase. For example, typical HPC and computational science applications take advantage of high-level I/O libraries such as HDF5 and netCDF. These high-level I/O requests are first translated into MPI-IO requests, which then initiate the necessary inter-process communication and I/O optimizations. Some MPI-IO requests may further need to invoke POSIX I/O system calls to access data that is ready on the native compute nodes. Figure 2 shows a typical HPC system environment.
Figure 2. HPC environment abstraction
2.3. Our Solution to Parallel I/O Workload Modeling
Our modeling framework mainly involves two aspects of modeling: the arrival pattern and the file access pattern. The arrival pattern describes not only temporal properties such as burstiness, but must also account for the feedback between compute nodes and I/O nodes. Inter-request dependencies and inter-process dependencies become quite important in parallel I/O workloads. For this reason, we also have to evaluate both open mode and closed mode.
The file access pattern, on the other hand, needs to depict the way that processes share files and access their private files. We are essentially addressing the question of "Which file is accessed by whom, at what time, and in which way". No single mathematical model answers this question comprehensively. Instead, we are creating a dedicated modeling framework to overcome this shortcoming of a single mathematical model.
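The difference between the two modes can be sketched with a toy model. This assumes that "open mode" and "closed mode" refer to the usual open-loop and closed-loop arrival models: in open mode, arrival times are independent of completions, while in closed mode a process issues its next request only after the previous one completes. The single FIFO server, the constant service time, and the function names are simplifying assumptions made for illustration.

```python
from typing import List


def open_mode_completions(arrivals: List[float], service: float) -> List[float]:
    """Open mode: arrival times are fixed in advance; a request waits in a
    FIFO queue if the single server is still busy when it arrives."""
    done = []
    free_at = 0.0
    for t in arrivals:
        start = max(t, free_at)   # wait for the server if it is busy
        free_at = start + service
        done.append(free_at)
    return done


def closed_mode_completions(n: int, think: float, service: float) -> List[float]:
    """Closed mode: one process issues request i+1 only after request i
    completes and a think (compute) time has elapsed."""
    done = []
    t = 0.0
    for _ in range(n):
        t += service   # the request is served
        done.append(t)
        t += think     # the process computes before issuing the next request
    return done


# With a slow server, open-mode requests pile up in the queue, while the
# closed-mode process slows its own arrival rate to match the server.
print(open_mode_completions([0.0, 1.0, 2.0], service=2.0))
print(closed_mode_completions(3, think=1.0, service=2.0))
```

The closed-mode function is where the feedback between compute nodes and I/O nodes shows up: the arrival time of each request depends on the completion time of the one before it.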
Benchmarking storage systems with IO workloads is common practice for performance evaluation and debugging of emerging storage systems. It is very important that the IO workload represents realistic IO patterns with high fidelity. Otherwise, the performance evaluation cannot represent the actual performance of the storage system or produce meaningful metrics for performance comparison. To accomplish these goals, much thought must go into choosing suitable benchmarks. We classify these benchmarks into the following two categories:
a) Macrobenchmarks: Performance is tested with a synthetic workload that is meant to represent some real-world workload.
b) Trace replays: A replayer application replays a trace of operations recorded from a real-world workload, and storage system performance is measured during the replay.
Macrobenchmarks (e.g., the FIO benchmark) are not fully representative of the variety of real-world workloads. However, they are popular in the storage community because benchmarking with a synthetic workload requires little setup cost and yields predictable behavior with better flexibility.
On the other hand, replaying a trace file recorded from a real-world workload can potentially preserve important characteristics of that workload. However, reconstructing the same workload requires a high-fidelity replay tool that faithfully issues IOs to the target storage system. Otherwise, an inaccurate replay tool may introduce substantial errors into the performance metrics observed during the evaluation.
Developing a high-fidelity replay tool is quite difficult for high performance storage systems. Emerging storage systems are normally connected to tens of hosts, may have response times of tens of microseconds, and can handle 100s of IOs per second. Moreover, the actual workload that arrives at the storage system controller is a mix of IO operations from different hosts. Therefore, rather than being collected on each host, a real-world trace of IO operations is usually recorded at the storage system controller. Although it is common to replay traces at the level at which they were captured, replaying the trace at the controller level is not feasible because of the hardware limitations of the storage system controller. On the other hand, faithfully replaying a trace from as few hosts as possible is quite challenging.
The goal of this study is to develop a scalable, timing-accurate block trace replay engine that can faithfully preserve the important characteristics of the original application workload.
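To make "timing accurate" concrete, the following is a minimal single-threaded sketch of a replay loop that issues each request at its original offset from the start of the trace and records how late each one was issued. It is not the replay engine itself: the trace layout, the replay function, and the issue callback are hypothetical, and a real engine would likely rely on asynchronous IO and multiple threads or hosts. The clock and sleep functions are injectable so the timing logic can be checked without real delays.

```python
import time
from typing import Callable, List, Tuple

# Assumed trace entry layout: (timestamp in seconds, operation, location, size).
TraceEntry = Tuple[float, str, int, int]


def replay(trace: List[TraceEntry],
           issue: Callable[[str, int, int], None],
           clock: Callable[[], float] = time.monotonic,
           sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Issue each request at its original offset from the first request.

    Returns the lateness in seconds of every request, i.e. how long after
    its target time it was actually issued.
    """
    if not trace:
        return []
    first_ts = trace[0][0]
    start = clock()
    lateness = []
    for ts, op, location, size in trace:
        target = start + (ts - first_ts)   # when this request should be issued
        delay = target - clock()
        if delay > 0:
            sleep(delay)                   # wait until the target time
        lateness.append(max(0.0, clock() - target))
        issue(op, location, size)          # hypothetical IO submission callback
    return lateness


# Example: print the requests instead of submitting real IOs.
trace = [(5.0, "R", 0, 8), (5.01, "W", 8, 8), (5.02, "R", 16, 8)]
late = replay(trace, lambda op, loc, size: print(op, loc, size))
```

The returned lateness values give a simple fidelity metric: the closer they stay to zero, the more faithfully the inter-arrival times of the original trace are preserved.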