Timing Trick Cuts Energy Used in LLM Training by Up to 14 Percent
OpenAI’s fourth large language model (LLM), GPT-4, took an estimated 50 gigawatt-hours to train, equivalent to the yearly electricity consumption of 5,000 American homes. That was in 2023. Since then, the computational resources used to train frontier LLMs have only increased, though direct power usage numbers are hard to come by.

Now, a research group at the University of Twente in the Netherlands has shown that you can save up to 14 percent of the energy used in LLM training without sacrificing speed by cleverly adjusting the clock frequency of the GPU during computation. Jeffrey Spaan, a Ph.D. candidate at the University of Twente and lead author on the article, presented the results at the Computing Frontiers conference in Catania, Sicily last month.

“My research is about finding computing waste,” Spaan says. “It’s similar to underutilization of the hardware, but instead of optimizing the software for the hardware, we try to optimize the hardware for the software.”

Making the GPU tick

Spaan and his collaborators accomplished this by using a technique known as dynamic voltage-frequency scaling (DVFS). Every chip, including the GPUs commonly used for training frontier models, uses at least one clock to orchestrate computations. Each operation in the chip is triggered by a clock pulse. The frequency with which that clock ticks controls how fast the chip operates and how much power it draws.

Modern GPUs have two clocks, one for the computational core and one for the memory. When the core is hard at work crunching numbers, its clock frequency is kept high to ensure speedy calculation. With DVFS, however, the memory clock can slow down during that time, so the chip draws less power. It is possible in principle to just turn off the memory part of the chip, but GPU designs don’t give software control over that off switch, and it would take too long to turn back on mid-calculation anyway.
Similarly, when the core is waiting for data to be loaded from memory, the core clock frequency can be slowed to a crawl while the memory clock frequency ramps up.

DVFS is a well-known technique that dates back to at least the 1990s. But Spaan says other researchers haven’t been able to usefully apply it to LLM training because their methods either slowed down calculations too much or were not fine-grained enough to improve energy usage.

Previous DVFS attempts adjusted the frequency at each iteration of the training process. In LLM training, each iteration consists of two parts: the forward pass, in which data is run forward through the layers of the model with the weights as they are, and backpropagation, in which the weights are adjusted layer by layer based on the results of the forward pass. So, prior work kept one value of the frequency for the forward pass and switched to another for backpropagation.

Spaan and co-workers tuned the clock frequencies on a shorter time scale. GPU workloads are broken down into tiny computational nuggets known as kernels. For example, a single vector-vector multiplication can make up a single kernel. The kernels are fed to the GPU to be processed many times in parallel. In Spaan’s implementation, the computation of a single layer of a deep neural network is broken up into approximately 40 kernels. By adjusting the clock frequencies on a per-kernel level, the team was able to find much greater energy savings.

The GPU also does DVFS automatically when the chip’s internal systems detect higher or lower demand, Spaan notes. “Some people might therefore think: we’ll just let the GPU handle it,” he says. “However, because the GPU doesn’t have the foresight we have of what kernels will run, it has to work with an on-the-fly best-effort guess, and can therefore never attain the same savings.” That’s where the manual adjustments come in.
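To make the per-kernel idea concrete, here is a minimal Python sketch of how such a selection could work. It is not the team's implementation: the kernel names, clock values, timings, and energy figures are invented for illustration, and the rule of capping each kernel's slowdown separately is an assumption made to keep the example short. For every kernel, it picks the profiled clock setting that uses the least energy while staying within a small slowdown budget relative to the default setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One profiled run of a kernel at a fixed pair of clock frequencies."""
    core_mhz: int
    mem_mhz: int
    time_ms: float
    energy_mj: float


# Hypothetical profile data, NOT measurements from the study.
# The first entry for each kernel is the default (baseline) clock setting.
PROFILE = {
    "matmul": [  # compute-bound: slowing the memory clock costs little time
        Measurement(1800, 9500, 10.00, 3000.0),
        Measurement(1800, 5000, 10.05, 2600.0),
        Measurement(1200, 9500, 14.00, 2500.0),
    ],
    "layernorm": [  # memory-bound: slowing the core clock costs little time
        Measurement(1800, 9500, 2.00, 500.0),
        Measurement(900, 9500, 2.01, 380.0),
        Measurement(1800, 5000, 3.50, 450.0),
    ],
}


def plan_frequencies(profile, max_slowdown):
    """Pick, per kernel, the lowest-energy setting within the slowdown budget."""
    plan = {}
    for kernel, runs in profile.items():
        baseline = runs[0]
        limit = baseline.time_ms * (1.0 + max_slowdown)
        # The baseline always satisfies the limit, so this list is never empty.
        allowed = [m for m in runs if m.time_ms <= limit]
        plan[kernel] = min(allowed, key=lambda m: m.energy_mj)
    return plan


def summarize(profile, plan):
    """Return (energy saving, slowdown) of the plan as fractions of the baseline."""
    base_energy = sum(runs[0].energy_mj for runs in profile.values())
    base_time = sum(runs[0].time_ms for runs in profile.values())
    plan_energy = sum(m.energy_mj for m in plan.values())
    plan_time = sum(m.time_ms for m in plan.values())
    saving = (base_energy - plan_energy) / base_energy
    slowdown = (plan_time - base_time) / base_time
    return saving, slowdown


if __name__ == "__main__":
    plan = plan_frequencies(PROFILE, max_slowdown=0.01)
    for kernel, m in plan.items():
        print(f"{kernel}: core {m.core_mhz} MHz, memory {m.mem_mhz} MHz")
    saving, slowdown = summarize(PROFILE, plan)
    print(f"energy saving {saving:.1%}, slowdown {slowdown:.1%}")
```

Because the sequence of kernels is known in advance, a plan like this can be computed offline, which is the foresight the GPU's built-in DVFS lacks. The sketch ignores the time it takes to switch frequencies between kernels.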
Less energy, same time

The team performed their experiment by training GPT-3-xl, a 1.3-billion-parameter model, on an Nvidia RTX 3080 Ti GPU. To save time, they focused on training a single layer of the model. In this setting, they found a set of frequency adjustments that gave them 14 percent energy savings while slowing the training time by only 0.6 percent. Performance of the model depends on both computing speed and energy usage.

There is one challenge: Ramping down the clock frequency is much faster than turning a core off and on, but it’s still not instantaneous. In their experiment, the researchers evaluated one kernel at a time, not taking into account the frequency switching speed. So, 14 percent energy savings is a best-case scenario. How much of an issue it would be in practice, Spaan says, depends heavily on the GPU being used. Newer hardware, like the Blackwell GPUs, has much faster switching speeds than older versions, and should be able to harness the full energy savings.

Now, the team is developing a tool that would be able to implement optimal frequenc
Source: This summary was compiled automatically from spectrumieee. See the original article for the full text.