Shingled Magnetic Recording (SMR) technology builds on the conventional hard disk drive (HDD) but writes sequentially to partially overlapped tracks, which raises areal density beyond the limit of conventional HDDs. Such SMR drives can benefit large-scale storage systems by reducing the Total Cost of Ownership (TCO) and helping to meet the challenge of explosive data growth.
Figure 1. Roadmap of SMR Research
To embrace the challenges and opportunities brought by SMR technology, our group is conducting SMR-related research (Figure 1) along the following dimensions:
SMR Drive Internal Data Layout Designs. To extend the applicability of shingled write disks (SWDs) beyond archival and backup systems, we have designed both static and dynamic track-level mapping schemes that reduce write amplification and garbage collection overhead. Furthermore, we have implemented a simulator that enables us to evaluate various SMR drive data layout designs under different workloads.
SMR Drive Performance Evaluations. Building SMR-drive-based storage systems calls for a deep understanding of the drives' performance characteristics. To this end, we have carried out in-depth performance evaluations of host-aware SMR (HA-SMR) drives, with special emphasis on the performance implications of the SMR-specific APIs and on how these drives can be deployed in large storage systems.
SMR Drive Based Applications. Based on our experience with SMR layout design and performance evaluation, we are investigating potential SMR-drive-based applications such as an SMR RAID system, an SMR-based key-value store, an SMR-backed deduplication system, and an SMR-friendly file system.
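As a rough illustration of why in-place updates on shingled tracks cause the write amplification mentioned above, the following minimal sketch assumes a simplified band model in which each track overlaps the next, so updating one track forces every later track in the same band to be rewritten. The band size, function names, and the model itself are assumptions for illustration; they are not the mapping schemes or the simulator developed in this project.

```python
def tracks_rewritten(band_size: int, track: int) -> int:
    """Tracks physically written when `track` is updated in place.

    Simplified model (assumption): track i partially overlaps track i+1,
    so updating track i forces tracks i..band_size-1 to be rewritten.
    """
    if not 0 <= track < band_size:
        raise ValueError("track index outside band")
    return band_size - track


def write_amplification(band_size: int, updates: list[int]) -> float:
    """Ratio of physical track writes to logical track updates."""
    if not updates:
        return 0.0
    physical = sum(tracks_rewritten(band_size, t) for t in updates)
    return physical / len(updates)
```

Under this model, updating the first track of a four-track band costs four physical writes, while updating the last track costs one, which is why a mapping scheme that steers updates away from the start of a band can lower the amplification.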
The traditional memory/storage hierarchy mainly encompasses DRAM and magnetic disk storage. It delivers high-performance I/O by placing frequently accessed blocks in DRAM and emulating the block interface with fast DRAM. However, this traditional hierarchy has started to face significant scaling issues. Moreover, it incurs significant logging overhead to maintain file system and application I/O consistency.
One promising way to improve the existing memory/storage hierarchy is to take advantage of new technologies and fully integrate them into it. A number of non-volatile memory (NVM) technologies, such as PCM, can address the scalability issues of the memory hierarchy. Moreover, flash-based storage devices (SSDs) and shingled write disks (SWDs) are other emerging storage technologies that show tremendous promise for improving the performance and space efficiency of the current storage hierarchy. These new technologies fundamentally change the traditional hierarchy. However, many challenging research issues must be investigated before we can fully integrate them. Figure 2 illustrates one possible new memory/storage hierarchy.
Figure 2. Possible Combination of New Memory/Storage Hierarchy.
The main goal of this project is to understand the limitations that prevent the existing memory/storage hierarchy from fully utilizing these new technologies, in both hardware and software architectures.
In terms of hardware architecture, we are investigating the device-specific properties of the new storage hierarchy in order to devise efficient data placement and migration techniques. For example, the existing storage hierarchy assumes that data is promoted or demoted one level at a time. However, we believe that with additional levels of storage devices in the hierarchy, such as SWDs, this assumption no longer yields the right migration decisions. Depending on the properties of each device in the hierarchy, a bypassing strategy that skips levels is required as well. Therefore, a new set of caching and replacement policies should be integrated into the current hardware architecture to take advantage of emerging storage devices.
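The difference between one-level-at-a-time migration and a bypassing strategy can be sketched as follows. This is only an illustration under stated assumptions: the tier ordering, the access-rate thresholds, and both function names are hypothetical and are not the policies developed in this project.

```python
# Hypothetical tier ordering, fastest to slowest (assumption for illustration).
TIERS = ["NVM", "SSD", "HDD", "SWD"]


def one_level_move(current: int, hot: bool) -> int:
    """Conventional policy: move at most one level up (hot) or down (cold)."""
    step = -1 if hot else 1
    return min(max(current + step, 0), len(TIERS) - 1)


def bypass_move(accesses_per_hour: float,
                thresholds=(100.0, 10.0, 1.0)) -> int:
    """Bypassing policy: jump straight to the tier that matches the access rate.

    `thresholds` holds made-up minimum access rates for NVM, SSD, and HDD;
    anything colder goes directly to the SWD tier.
    """
    for tier, limit in enumerate(thresholds):
        if accesses_per_hour >= limit:
            return tier
    return len(TIERS) - 1
```

With these placeholder thresholds, a block on the SWD tier that suddenly becomes very hot is placed on NVM in a single move under `bypass_move`, whereas `one_level_move` would need three successive promotions.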
In terms of software architecture, we are looking at possible modifications to Linux file systems such as ext3 so that they can take advantage of NVM in the memory hierarchy. We believe an appropriate way to integrate NVM into current systems is through a file system and memory mapping techniques. However, existing file systems and mapping software are optimized to accelerate block I/O access using DRAM. Therefore, they cannot fully exploit the promising features of emerging NVMs, such as scalability and non-volatility. For this reason, we are developing a new kernel API, along with the necessary changes to the current Linux file system software architecture, to accommodate these new assumptions.
Deduplication is the process of dividing a large data stream into multiple chunks and eliminating the duplicate chunks by storing only the unique chunks. The other redundant chunks are replaced with pointers to the stored/unique chunks. This technique has been widely adopted in not only backup systems but also primary storage. We have three research projects related to deduplication.
(1) TDDFS: A Tier-aware Data Deduplication based File System
In this project, a unified file system manages several storage tiers (e.g., HDD and flash-based SSD). Data deduplication is applied when cold data is migrated from tier 1 (T1) to tier 2 (T2). The deduplication structure is also maintained in T1 for reloaded files to improve migration efficiency and space utilization. The architecture is shown in Figure 3.
Figure 3. The TDDFS Architecture
(2) SMRTS: An SMR-based Tiered Store
SMR drives are well suited to storing the containers of a data deduplication system, because containers are written sequentially and never updated. We designed an SMR-based tiered store that reduces the storage cost of the tiered architecture while maintaining its high performance. Data is read and written on SSD; cold data is deduplicated and migrated to the SMR drives. The architecture is shown in Figure 4.
Figure 4. The SMRTS Architecture
(3) Optismr: Restore-Performance Optimization for Deduplication Systems Using SMR Drives
In this project, the restore performance is optimized by allocating or replicating containers to different SMR zones, such that the disk head moving distance is reduced during the restore process. The initial design is shown in Figure 5.
Figure 5. The Optismr Architecture
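The deduplication process described at the start of this section (divide a stream into chunks, store each unique chunk once, and replace duplicates with pointers) can be sketched as follows. This is a minimal sketch that assumes fixed-size chunking and SHA-256 fingerprints; the function names and the in-memory dictionary are illustrative only and do not represent the chunking method or container format used in TDDFS, SMRTS, or Optismr.

```python
import hashlib


def deduplicate(stream: bytes, chunk_size: int = 4096):
    """Split `stream` into fixed-size chunks and keep each unique chunk once.

    Returns (store, recipe): `store` maps fingerprint -> chunk bytes, and
    `recipe` is the ordered list of fingerprints (the "pointers") needed
    to rebuild the original stream.
    """
    store = {}
    recipe = []
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # store only the first copy
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original stream by following the recipe's pointers."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

For example, a 16-byte stream made of the 4-byte chunks `AAAA`, `AAAA`, `BBBB`, `AAAA` yields a recipe of four pointers but only two stored chunks.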
A thorough understanding of I/O workload characteristics is the key to improving overall system performance and maximizing the utilization of available system resources. I/O workload characterization can be carried out at different system levels, such as the disk I/O level, the file system level, and the application level. Accurate workload modeling at these layers is important for the following reasons. First, suitable caching policies can be applied at both the device level and the file system level to better serve application I/O. Second, realistic system testing tools and synthetic workload generators can be developed based on the workload models, instead of relying on rigid benchmark tools. Third, workload modeling and characterization can shed light on storage architecture as well as storage device design.
Three projects are currently ongoing in CRIS:
- Reproduce Cache Behavior in Synthetic Disk I/O Workload
- Exascale I/O and Parallel File System I/O Workload Characterization
- High Fidelity Disk I/O Workload Replayer
Previous research has shown that performing computation close to the data improves system performance in terms of response time and energy consumption, especially for I/O-intensive applications. As the role of flash memory in storage architectures grows, solid-state drives (SSDs) have gradually displaced hard disk drives, offering much shorter access latency and lower power consumption. Building on SSDs, some researchers have proposed active flash architectures that run I/O-intensive applications inside the storage device on the SSD's embedded controller. However, that controller must implement the flash translation layer, which lets the SSD emulate an HDD, and must also communicate with the host interface to transfer the requested data. The computation capacity left over for other applications is therefore quite limited. To maximize the computation capacity of the SSD, we propose a multi-processor design called the storage processing unit (SPU).
Figure 6. The SPU-based SSD.
Figure 6 shows a block diagram of an SSD based on the proposed SPU. In addition to the SSD controller, the SPU integrates a processor into each flash channel, where N is the total number of flash channels. Because each flash channel and its processor are independent of the others, the SPU can compute in parallel, which improves both computation throughput and response time.
To evaluate the proposed SPU, we run MySQL with the TPC-H benchmark on two systems: a conventional SSD-based system (the baseline) and the SPU-based system. In the baseline system, the database application runs on the host CPU. Because a conventional database system uses a row-oriented data layout and must transfer entire database tables to the host for processing, transferring unneeded tuples consumes I/O bandwidth and degrades overall performance. In the SPU-based system, we offload part of the computation to the SPU using the Flash Scan/Join algorithm and return only the processed result to the host. Because the SPU transfers only the required data to the host and computes in parallel, both I/O time and computation time improve significantly.
Compared with conventional systems, the SPU-based system reduces both computation time and energy consumption. The reduced energy allows for more scalability in a cloud-like environment: the high power cost of a CPU is amortized over a large population of inexpensive processors in the SPU, which allows significantly more compute resources within the footprint of a given system power budget. In summary, the proposed SPU will significantly benefit I/O-intensive applications in both response time and energy consumption.
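The idea of filtering inside each flash channel and returning only the processed result to the host can be sketched as follows. This is a host-side simulation under stated assumptions: each inner list stands for the rows stored behind one flash channel, threads stand in for the per-channel processors, and `channel_scan` and `spu_scan` are hypothetical names. It is a plain parallel filter, not the Flash Scan/Join algorithm used in the evaluation.

```python
from concurrent.futures import ThreadPoolExecutor


def channel_scan(rows, predicate):
    """Work done by one per-channel processor: keep only matching rows."""
    return [row for row in rows if predicate(row)]


def spu_scan(channels, predicate):
    """Scan every channel in parallel and return only the matching rows.

    `channels` is a list of row lists, one per flash channel (assumption).
    Results are concatenated in channel order.
    """
    if not channels:
        return []
    with ThreadPoolExecutor(max_workers=len(channels)) as pool:
        parts = pool.map(channel_scan, channels, [predicate] * len(channels))
        return [row for part in parts for row in part]
```

In this model the host receives only the rows that satisfy the predicate, whereas the baseline system would transfer every row of every channel before filtering on the host CPU.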
Flash memory is an emerging storage technology that shows tremendous promise to compensate for the limitations of current storage devices. The wakeup/recovery time of SSDs, such as those used in ultrabooks, is a critical parameter. The many available alternatives for address mapping and journaling schemes affect this parameter in different ways. This project aims to investigate and evaluate the tradeoffs that address mapping and journaling schemes impose on wakeup/recovery time while maintaining the reliability and consistency of the data and metadata stored in flash memory.
SSD wakeup/recovery time: the time required for the SSD controller to bring data and metadata into a consistent state upon startup or wakeup.
This process involves the following steps:
- Updating the on-disk copy of the address mapping table for the data pages whose mappings had not yet reached it.
- Reconstructing the upper levels of the address mapping table (the SRAM copy) from the leaf level of the on-disk copy.
- Reconstructing other metadata (free-space information, etc.) from its on-disk copy.
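The three recovery steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the leaf level is modeled as a dictionary of small per-leaf maps, the upper level as a sorted directory of leaf IDs, and the free-space information as a set of physical page numbers. How unflushed mappings are discovered, and the rule that their target pages are no longer free, are assumptions for illustration and not the scheme under study.

```python
def recover(on_disk_leaves, unflushed, on_disk_free, entries_per_leaf=4):
    """Rebuild mapping state after wakeup (illustrative model only).

    on_disk_leaves: {leaf_id: {logical_page: physical_page}}
    unflushed:      [(logical_page, physical_page)] that never reached
                    the on-disk mapping table
    on_disk_free:   iterable of physical pages recorded as free on disk
    """
    # Work on a copy so the caller's on-disk image is left untouched.
    leaves = {leaf: dict(entries) for leaf, entries in on_disk_leaves.items()}

    # Step 1: apply mappings that had not reached the on-disk copy.
    for lpn, ppn in unflushed:
        leaves.setdefault(lpn // entries_per_leaf, {})[lpn] = ppn

    # Step 2: rebuild the upper level (SRAM copy) from the leaf level.
    directory = sorted(leaves)

    # Step 3: reload free-space info; pages consumed by unflushed writes
    # are treated as no longer free (assumption).
    free_pages = set(on_disk_free) - {ppn for _, ppn in unflushed}
    return leaves, directory, free_pages
```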
The multi-channel, multi-chip architecture of an SSD offers parallelism at the channel, chip, die, and plane levels. Coupled with the use of advanced commands, this parallelism can greatly reduce the wakeup/recovery time. The project also investigates the impact of this approach on wear leveling, garbage collection, and read/write performance.
This research focuses on cloud-scale backup systems in which a single provider offers services to a very large number of clients, on the order of hundreds of thousands. The customers sign a Service Level Agreement (SLA) with the provider to define Service Level Objectives (SLOs) that specify the type of service expected for a given cost. Based on the SLOs and the budget, the intelligent system we are building constructs an optimized policy to satisfy the SLOs, including the backup frequency, the priority assignment for associated data flow operations, the selection of the restore scheme, resource allocations, and so on. Figure 7 shows the basic backup environment used in the research.
Figure 7. Basic Backup Environment
Every client in the global backup system is self-interested, meaning that it is only concerned with its own benefits or losses. Consequently, designing an efficient and effective scheduler to perform scheduling and resource allocation for all of the various data flow operations requested by a vast number of clients is extremely challenging. The overall goal of this project is to design an efficient algorithm to manage the backup system so that it can meet the SLOs of every client at the lowest cost.
As cloud storage becomes more important, many data-intensive applications are gaining a foothold in research and industry, and the volume of unstructured data is rising sharply. The International Data Corporation predicted that 80% of the 133 exabytes of global data growth in 2017 would come from unstructured data.
To manage these huge data storage requirements, Seagate launched Kinetic direct-access-over-Ethernet hard drives. These drives incorporate a LevelDB key-value store inside each drive. A Kinetic Drive can be considered an independent server and can be accessed via Ethernet, as shown in Figure 8.
Figure 8. Management Configurations for Kinetic Drives
In this project, we are looking at several research issues of Kinetic Drives in the architecture of Figure 9.
1. We employ micro and macro benchmarks to help understand the performance limits, trade-offs, and implications of replacing traditional hard drives with Kinetic Drives in data centers and high-performance systems. We measure latency, throughput, and other relevant metrics using different benchmarks, including the Yahoo! Cloud Serving Benchmark (YCSB).
2. We compare these results with those of SATA-based and SAS-based traditional servers running LevelDB. We find that the Kinetic Drives are CPU-bound but deliver an average sequential write throughput of 63 MB/sec and a sequential read throughput of 78 MB/sec for 1 MB values. Our tests also demonstrate unique Kinetic features such as direct disk-to-disk (P2P) data transfer.
3. In a large-scale key-value store system, there are many Kinetic Drives and outside users, as shown in Figure 9. There are also metadata servers that manage the Kinetic Drives. In this project, we design key indexing tables on the metadata servers and key-value allocation schemes on the Kinetic Drives in order to map key-value pairs to disk locations.
4. We are also looking at search requests in a large-scale key-value store built on Kinetic Drives. Given attributes of the data from the users, the key-value store system should quickly find the Kinetic Drives that store the data.
Figure 9. Large-Scale Kinetic Drive Architecture
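One simple way a metadata server could map keys to Kinetic Drives, as in item 3 above, is a range-partitioned index. The sketch below assumes range partitioning over sorted string keys; the `KeyIndex` class, its methods, and the drive names are hypothetical and do not describe the indexing tables or allocation schemes designed in this project.

```python
import bisect


class KeyIndex:
    """Range-partitioned key index kept on a metadata server (illustrative).

    boundaries[i] is the smallest key assigned to drives[i]; boundaries
    must be sorted in ascending order.
    """

    def __init__(self, boundaries, drives):
        if not drives or len(boundaries) != len(drives):
            raise ValueError("need one boundary per drive")
        self.boundaries = list(boundaries)
        self.drives = list(drives)

    def locate(self, key: str) -> str:
        """Return the drive responsible for `key`."""
        i = bisect.bisect_right(self.boundaries, key) - 1
        # Keys smaller than the first boundary fall back to the first drive.
        return self.drives[max(i, 0)]
```

A lookup costs one binary search over the boundary list, so the metadata server can direct a client to the right drive without contacting any drive itself.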
Cloud storage has gained great prominence in both academia and industry.
Google, Amazon, Microsoft, HP, IBM, Salesforce, Dell, and almost every other infrastructure-related company
provide their own cloud storage. Cloud storage is also a very cost-effective option for users who
need to store data. Cloud storage stores data in logical storage pools that span multiple physical
disks in one data center or in multiple geographically distributed data centers. How data is stored and
served is managed entirely by the service provider.
Cloud storage draws on many backgrounds and crosses multiple areas. In our group, we mainly focus on the following areas.
Data management. Cloud storage has availability, consistency, and scalability requirements.
Availability means that data must be available whenever a user accesses it. Consistency means that the data a user
reads must be the right version. Scalability means that cloud storage should scale easily as the volume of
data increases. In this research, we are exploring how to design cloud storage systems and manage data so that
we can ensure high availability, consistency, and scalability.
Virtualization. Virtualization is a building block for cloud services. Those who choose
to deploy their services in the cloud often use both server and storage services: they deploy their applications
in the cloud and at the same time store data in cloud storage. Traditionally, their applications are deployed in virtual machines (VMs).
Cloud providers install hypervisors on their servers to virtualize each physical server into multiple VMs, achieving higher density
and flexibility. In this environment, managing data access from VMs to cloud storage for better performance is an open problem.
For example, after an application issues an I/O request to cloud storage, the VM where the application resides may replicate that
request so that copies can be sent to different replicas of the data. The application then receives the fastest response.
The trend in virtualization now favors a more lightweight solution: the container. Among container management systems, Docker is one of the most
successful. Docker containers achieve much higher density and are much more flexible than traditional VMs. They consume less CPU and
memory resources. They can boot in 0.1s and reboot in 2s, a tremendous improvement. On storage, they can achieve close-to-native
access performance. We are exploring how Docker containers are going to change the traditional cloud.
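The request replication described above for VMs, where the same I/O request is sent to several replicas and the application takes the fastest response, can be sketched as follows. This is a simulation under stated assumptions: `make_replica` fabricates replicas with fixed artificial delays, and `read_fastest` is a hypothetical helper, not part of any cloud storage API. If the first replica to finish fails, the sketch simply propagates that error rather than waiting for another replica.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def make_replica(name, delay_s):
    """Hypothetical replica that answers a read after a simulated delay."""
    def read(key):
        time.sleep(delay_s)
        return f"{name}:{key}"
    return read


def read_fastest(replicas, key):
    """Send the same read to every replica and return the first response."""
    if not replicas:
        raise ValueError("at least one replica is required")
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica, key) for replica in replicas]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        # Do not block on the slower replicas; their results are discarded.
        pool.shutdown(wait=False)
```

The trade-off is extra load on the storage back end: every request is served by all replicas even though only one response is used.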
Local and Cloud Storage. In this research, we fully consider the difference between local and cloud storage, e.g.,
capacity, speed, reliability, cost. We are seeking to combine local with cloud storage and access data with thorough consideration of
Ceph Bluestore is the latest object store backend that solves the "double journaling" problem of existing backend designs.
Bluestore only writes out the objects once and stores the location of the object separately as the metadata in RocksDB.
NVMe SSD. The increasing I/O speed of flash led to the NVMe protocol, which supports a large number of queues and large queue depths. Further,
as the kernel NVMe driver cannot keep up with the response time of NVMe SSDs, user-space poll-mode NVMe drivers (e.g., Intel SPDK) have been
proposed to fully unleash the power of NVMe SSDs.
RocksDB on NVMe. When directly using SPDK-driven NVMe SSDs as the storage backend of RocksDB, we found that the interaction between
RocksDB and SPDK cannot fully utilize the power of the NVMe SSDs. The focus of our project is therefore to pinpoint the performance bottlenecks
of the metadata store stack in Ceph Bluestore (especially RocksDB) and to propose optimizations accordingly.
In other words, RocksDB, the metadata store in Ceph Bluestore,
was not initially designed for user-space NVMe drivers, and we will investigate potential bottlenecks in the Ceph Bluestore metadata path,
especially in RocksDB, when integrating SPDK.
In the current HBase architecture, the HBase RegionServer is deployed on the same host as the HDFS DataNode.
Data locality is therefore very good and the network traffic between hosts is light.
However, most HBase users have their own OSD server pool, and the RegionServer hosts are separated from the storage nodes.
Therefore, a huge amount of RPC I/O is required between the HBase RegionServers and the OSD servers during GET/SCAN operations.
We deploy a LocalScanner on the OSD servers, and some of the I/O-intensive requests are processed directly by the LocalScanner
so that only the requested data is transmitted to the RegionServer over the network. However, managing the LocalScanner
and processing requests through it remain challenging. With the LocalScanner, some GET, SCAN, and compaction operations can be processed
locally to improve performance when network bandwidth is low.
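The benefit of processing a SCAN on the storage node can be illustrated with the sketch below, which compares the rows that cross the network with and without a storage-side filter. Everything here is an assumption for illustration: rows are modeled as a plain dictionary, and `remote_scan` and `local_scan` are hypothetical functions that do not use the HBase API or reflect the actual LocalScanner implementation.

```python
def remote_scan(rows, start_key, stop_key):
    """Without a local scanner: every row in the key range crosses the
    network and is filtered later on the RegionServer."""
    return [(k, v) for k, v in sorted(rows.items())
            if start_key <= k < stop_key]


def local_scan(rows, start_key, stop_key, predicate):
    """With a local scanner: the filter runs on the storage node, so only
    matching rows are sent to the RegionServer."""
    return [(k, v) for k, v in remote_scan(rows, start_key, stop_key)
            if predicate(v)]
```

When the predicate is selective, the number of rows returned by `local_scan` is much smaller than the number returned by `remote_scan`, which is where the savings in network traffic come from.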