
Helping data centers deliver high performance with minimal hardware | MIT News

To improve data center efficiency, multiple storage devices are often pooled together so that many applications can share them. But even when pooled, the devices' maximum capacity routinely goes underutilized because performance varies from device to device.

MIT researchers have now developed a system that improves the performance of pooled storage devices by managing three major sources of variability simultaneously. Their method delivers significant speedups over traditional approaches that handle only one source of variation at a time.

The system uses a two-tier architecture: a central controller makes big-picture decisions about which tasks each storage device should handle, while a local controller on each device quickly redirects requests if that device is struggling.

The method, which adapts in real time to changing workloads, requires no special hardware. When the researchers tested the system on real-world tasks such as AI model training and image compression, it nearly doubled the performance delivered by traditional methods. By intelligently balancing workloads across multiple storage devices, the system can increase data center efficiency.

“There is a tendency to want to throw more resources at a problem to solve it, but that is unsustainable in many ways. We want to be able to increase the longevity of these very expensive and carbon-intensive resources,” said Gohar Chaudhry, a graduate student in electrical engineering and computer science (EECS) and lead author of the paper on this approach. “With our flexible software solution, you can still squeeze more functionality out of your existing devices before you have to ditch them and buy new ones.”

Chaudhry was joined on the paper by Ankit Bharwaj, an assistant professor at Tufts University; Zhenyuan Ruan PhD ’24; and senior author Adam Belay, associate professor of EECS and member of the MIT Computer Science and Artificial Intelligence Laboratory. The research will be presented at the USENIX Symposium on Networked Systems Design and Implementation.

Unlocking untapped performance

Solid-state drives (SSDs) are high-performance digital storage devices that applications use to read and write data. For example, an SSD can store multiple datasets and quickly stream that data to a processor to train a machine-learning model.

Pooling multiple SSDs so that many applications can share them improves efficiency, since not every application needs its full share of SSD capacity at any given time. But not all SSDs perform equally, and a single slower device can drag down the pool's overall performance.

This inefficiency stems from the diversity of SSD hardware and the operations the drives must perform.

To tap this unused SSD performance, the researchers developed Sandook, a software-based system that simultaneously addresses three major sources of performance-disrupting variation. "Sandook" is the Urdu word for a box or chest used for storage.

The first type of variation stems from differences in the age, wear, and capacity of SSDs, which may have been purchased at different times and from multiple vendors.

The second type of variation comes from interference between reads and writes on the same SSD. Before writing new data, an SSD must erase existing data, and this process can limit how many reads, or data retrievals, the device can serve at the same time.

A third source of variability is garbage collection, the process of clearing out stale data to free up space. Garbage collection slows an SSD's performance and kicks in at times the data center operator cannot control.

“I can’t imagine that all SSDs will behave the same throughout my deployment cycle. Even if I give them all the same workload, some of them will be stragglers, which hurts the total I can get,” explains Chaudhry.

Plan globally, react locally

To handle all three sources of variability, Sandook uses a two-tier structure: a global scheduler optimizes the distribution of jobs across the pool as a whole, while fast local schedulers on each SSD react to urgent events and steer jobs away from congested devices.
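The article does not publish Sandook's code, but the two-tier split it describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class names, the queue-length load signal, and the latency-budget congestion test are all hypothetical.

```python
class LocalScheduler:
    """Per-SSD controller: reacts quickly to congestion on its own device."""
    def __init__(self, device_id, latency_budget_us):
        self.device_id = device_id
        self.latency_budget_us = latency_budget_us  # hypothetical threshold
        self.queue = []

    def is_congested(self, observed_latency_us):
        # A simple congestion signal: latency exceeds the device's budget.
        return observed_latency_us > self.latency_budget_us

    def submit(self, request):
        self.queue.append(request)


class GlobalScheduler:
    """Pool-wide controller: makes big-picture placement decisions."""
    def __init__(self, local_schedulers):
        self.locals = local_schedulers

    def place(self, request, latencies_us):
        # Prefer devices that are not congested; among them, pick the
        # one with the shortest queue (a stand-in for "least loaded").
        candidates = [l for l in self.locals
                      if not l.is_congested(latencies_us[l.device_id])]
        target = min(candidates or self.locals, key=lambda l: len(l.queue))
        target.submit(request)
        return target.device_id
```

In this sketch the global tier only consults coarse signals (queue lengths, recent latencies), while each local tier can veto placement on its own device, mirroring the slow/fast division of labor the article describes.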

The system sidesteps delays from read/write interference by rotating which SSDs the operating system may use for reads and which for writes. This reduces the chance of reads and writes landing on the same device at the same time.
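One simple way to rotate roles, assuming a fixed scheduling epoch, might look like the sketch below. The epoch-based rotation and the one-third write fraction are illustrative assumptions, not details from the paper.

```python
def rotate_roles(devices, epoch):
    """Assign each SSD a role for this scheduling epoch, rotating the
    assignment over time so reads and writes tend not to collide on
    the same device."""
    n = len(devices)
    # In each epoch, a sliding subset of devices absorbs writes
    # (here: roughly a third of the pool); the rest serve reads.
    writers = {devices[(epoch + i) % n] for i in range(max(1, n // 3))}
    return {d: ("write" if d in writers else "read") for d in devices}
```

As the epoch counter advances, the write-heavy role moves around the pool, so no single SSD suffers sustained read/write contention.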

Sandook also profiles each SSD's typical performance and uses that profile to detect when garbage collection is dragging a device down. Once it detects a slowdown, Sandook lightens that SSD's load by diverting tasks elsewhere until garbage collection completes.
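Detecting garbage collection from a latency profile could be sketched as comparing recent tail latency against a profiled baseline. The percentile choice, the `factor` threshold, and the throttling fraction below are hypothetical parameters, not values from the paper.

```python
import statistics

def detect_gc(latency_samples_us, baseline_p99_us, factor=2.0):
    """Flag likely garbage collection: the device's recent tail latency
    is far above its profiled baseline (hypothetical heuristic)."""
    recent_p99 = statistics.quantiles(latency_samples_us, n=100)[98]
    return recent_p99 > factor * baseline_p99_us

def throttle(weights, device_id, reduced_share=0.25):
    """Shrink one device's share of new work while GC is suspected,
    diverting the rest of the load to its peers."""
    w = dict(weights)
    w[device_id] *= reduced_share
    return w
```

The key design point the article highlights is that throttling is partial rather than total: the device keeps receiving some work during garbage collection, matching the "sweet spot" Chaudhry describes below.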

“If that SSD is doing garbage collection and can no longer handle the same load, I want to give it a smaller workload and slow things down a little bit. We want to find the sweet spot where it is still doing some work and contributing some performance,” Chaudhry said.

SSD profiles also let Sandook’s global controller assign workloads in a weighted manner that accounts for each device’s characteristics and capabilities.
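Weighted assignment of this kind can be illustrated by splitting a batch of requests in proportion to per-device capability weights. The weights themselves are hypothetical stand-ins for whatever Sandook derives from its device profiles.

```python
def split_workload(num_requests, weights):
    """Split a batch of requests across devices in proportion to each
    device's profiled capability weight (weights are illustrative)."""
    total = sum(weights.values())
    shares = {d: int(num_requests * w / total) for d, w in weights.items()}
    # Hand any rounding remainder to the most capable device.
    remainder = num_requests - sum(shares.values())
    best = max(weights, key=weights.get)
    shares[best] += remainder
    return shares
```

For example, a drive profiled as three times as capable as its peer would receive three quarters of the batch, so faster or younger devices naturally absorb more of the load.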

Because the global controller sees the whole picture while the local controllers react quickly, Sandook can simultaneously manage types of variability that unfold on very different time scales. Garbage-collection delays, for example, strike suddenly, while aging-related slowdowns develop over many months.

The researchers evaluated Sandook on a pool of 10 SSDs running four workloads: a database, machine-learning model training, image compression, and user data storage. Sandook improved each application’s performance by 12 to 94 percent compared with static methods, and improved overall utilization of SSD capacity by 23 percent.

The system enabled SSDs to reach 95 percent of their theoretical maximum performance, without the need for special hardware or application-specific updates.

“Our flexible solution can unlock more performance from all SSDs and really squeeze them. Every bit of capacity that can be saved is important at this scale,” said Chaudhry.

In the future, the researchers want to incorporate new protocols, available in the latest SSDs, that give operators more control over data placement. They also want to predict AI workloads in advance to make SSD operations more efficient.

“Flash storage is a powerful technology that supports modern datacenter applications, but sharing this resource across workloads with highly variable performance requirements is still a major challenge. This work moves the needle forward with an efficient and effective solution that is ready to be deployed, bringing flash storage closer to its full potential in the production cloud,” said Josh Fried, a software engineer at Google and incoming assistant professor at the University of Pennsylvania, who was not involved with this work.

This research was funded, in part, by the National Science Foundation, the US Defense Advanced Research Projects Agency, and the Semiconductor Research Corporation.
