Artificial Intelligence Could Help Data Centers Run Far More Efficiently
August 23, 2019 | MIT | Estimated reading time: 6 minutes

A novel system developed by MIT researchers automatically “learns” how to schedule data-processing operations across thousands of servers — a task traditionally reserved for imprecise, human-designed algorithms. Doing so could help today’s power-hungry data centers run far more efficiently.
Data centers can contain tens of thousands of servers, which constantly run data-processing tasks from developers and users. Cluster scheduling algorithms allocate the incoming tasks across the servers, in real time, to efficiently utilize all available computing resources and get jobs done fast.
Traditionally, however, humans fine-tune those scheduling algorithms based on some basic guidelines (policies) and various tradeoffs. They may, for instance, code the algorithm to get certain jobs done quickly or to split resources equally between jobs. But workloads — meaning groups of combined tasks — come in all sizes. Therefore, it’s virtually impossible for humans to optimize their scheduling algorithms for specific workloads, and, as a result, the algorithms often fall short of their true efficiency potential.
The MIT researchers instead offloaded all of the manual coding to machines. In a paper being presented at SIGCOMM, they describe a system that leverages “reinforcement learning” (RL), a trial-and-error machine-learning technique, to tailor scheduling decisions to specific workloads in specific server clusters.
To do so, they built novel RL techniques that could train on complex workloads. In training, the system tries many possible ways to allocate incoming workloads across the servers, eventually finding an optimal tradeoff in utilizing computation resources and quick processing speeds. No human intervention is required beyond a simple instruction, such as, “minimize job-completion times.”
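To make the trial-and-error idea concrete, here is a toy, self-contained sketch (our illustration, not the researchers’ code) that blindly tries many assignments of tasks to two servers and keeps whichever schedule best satisfies an instruction like “minimize job-completion times”:

```python
# Toy illustration of trial-and-error scheduling (not Decima itself):
# randomly try many task-to-server assignments and keep the one with
# the lowest average completion time. Task durations are made up.
import random

task_durations = [3.0, 1.0, 4.0, 2.0, 5.0]  # hypothetical workload

def avg_completion_time(assignment):
    """assignment[i] is the server (0 or 1) that runs task i, in arrival order."""
    server_clock = [0.0, 0.0]
    finish_times = []
    for duration, server in zip(task_durations, assignment):
        server_clock[server] += duration
        finish_times.append(server_clock[server])
    return sum(finish_times) / len(finish_times)

candidates = (tuple(random.randrange(2) for _ in task_durations)
              for _ in range(1000))
best = min(candidates, key=avg_completion_time)
print(best, avg_completion_time(best))
```

An RL agent replaces this blind random search with a policy that improves from a reward signal, but the objective it optimizes is exactly this kind of completion-time measure.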
Compared to the best handwritten scheduling algorithms, the researchers’ system completes jobs about 20 to 30 percent faster, and twice as fast during high-traffic times. Much of the gain comes from the system learning to compact workloads efficiently, leaving little wasted capacity. The results indicate the system could enable data centers to handle the same workload at higher speeds, using fewer resources.
“If you have a way of doing trial and error using machines, they can try different ways of scheduling jobs and automatically figure out which strategy is better than others,” says Hongzi Mao, a PhD student in the Department of Electrical Engineering and Computer Science (EECS). “That can improve the system performance automatically. And any slight improvement in utilization, even 1 percent, can save millions of dollars and a lot of energy in data centers.”
“There’s no one-size-fits-all to making scheduling decisions,” adds co-author Mohammad Alizadeh, an EECS professor and researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL). “In existing systems, these are hard-coded parameters that you have to decide up front. Our system instead learns to tune its schedule policy characteristics, depending on the data center and workload.”
Joining Mao and Alizadeh on the paper are postdocs Malte Schwarzkopf and Shaileshh Bojja Venkatakrishnan, and graduate research assistant Zili Meng, all of CSAIL.
RL for Scheduling
Typically, data-processing jobs come into data centers represented as graphs of “nodes” and “edges.” Each node represents a computation task that needs to be done, where the larger the node, the more computation power needed. The edges connect nodes whose tasks depend on one another. Scheduling algorithms assign nodes to servers, based on various policies.
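A minimal sketch of that representation, using illustrative names (Stage, JobDAG) rather than anything from the paper:

```python
# Minimal sketch of a job as a DAG of computation stages. Names and
# fields are illustrative; a node's "size" here is tasks * per-task work.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Stage:
    stage_id: int                # index of this node in JobDAG.stages
    num_tasks: int               # degree of parallelism in the stage
    task_duration: float         # estimated work per task
    parents: List[int] = field(default_factory=list)  # upstream stage ids

@dataclass
class JobDAG:
    stages: List[Stage]

    def runnable(self, finished: Set[int]) -> List[Stage]:
        """Stages whose parents have all completed and may be scheduled now."""
        return [s for s in self.stages
                if s.stage_id not in finished
                and all(p in finished for p in s.parents)]
```

A scheduler repeatedly picks one of the runnable stages and a server to place it on; the dependency structure is what makes the problem harder than scheduling independent tasks.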
But traditional RL systems are not accustomed to processing such dynamic graphs. These systems use a software “agent” that makes decisions and receives a feedback signal as a reward. Essentially, the agent tries to maximize its rewards for its actions, learning an ideal behavior in a given context. Such systems can, for instance, help robots learn to perform a task like picking up an object by interacting with the environment, but that involves processing video or images through a simpler, fixed grid of pixels.
To build their RL-based scheduler, called Decima, the researchers had to develop a model that could process graph-structured jobs, and scale to a large number of jobs and servers. Their system’s “agent” is a scheduling algorithm that leverages a graph neural network, commonly used to process graph-structured data. To come up with a graph neural network suitable for scheduling, they implemented a custom component that aggregates information across paths in the graph — such as quickly estimating how much computation is needed to complete a given part of the graph. That’s important for job scheduling, because “child” (lower) nodes cannot begin executing until their “parent” (upper) nodes finish, so anticipating future work along different paths in the graph is central to making good scheduling decisions.
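Building on the JobDAG sketch above, the flavor of that aggregation can be shown with a hand-coded stand-in (Decima’s network learns these summaries rather than hard-coding them):

```python
# Rough, hand-coded stand-in for one thing the learned graph network can
# capture: for a given stage, total the estimated work on every stage
# reachable below it, i.e., the computation still needed once this part
# of the graph starts executing.
def children_of(dag: JobDAG, stage_id: int) -> list:
    return [s for s in dag.stages if stage_id in s.parents]

def remaining_work(dag: JobDAG, stage_id: int) -> float:
    seen, stack = set(), [stage_id]
    while stack:  # depth-first walk over all descendants
        sid = stack.pop()
        if sid in seen:
            continue
        seen.add(sid)
        stack.extend(c.stage_id for c in children_of(dag, sid))
    return sum(dag.stages[sid].num_tasks * dag.stages[sid].task_duration
               for sid in seen)
```

The learned version replaces this fixed sum with trainable aggregation functions, letting the network pick up subtler path properties than raw work totals.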
To train their RL system, the researchers simulated many different graph sequences that mimic workloads coming into data centers. The agent then makes decisions about how to allocate the nodes of each graph across the servers. For each decision, a component computes a reward based on how well the agent did at a specific task — such as minimizing the average time it took to process a single job. The agent keeps going, improving its decisions, until it gets the highest reward possible.
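One standard way to encode “minimize the average time to process a job” as a per-decision reward (a common construction, not a quote of the paper’s exact formula) is to penalize each stretch of wall-clock time by the number of jobs still in the system:

```python
# Each unfinished job contributes -1 per unit of time it spends in the
# system, so the summed reward over an episode equals the negative total
# (and hence average) job completion time. An agent that maximizes this
# reward is therefore directly minimizing average completion time.
def step_reward(num_unfinished_jobs: int, time_since_last_decision: float) -> float:
    return -num_unfinished_jobs * time_since_last_decision
```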