Ultra-Low-Precision Training of Deep Neural Networks
May 13, 2019 | IBM Research

Over the past few years, reduced precision techniques have proven exceptionally effective in accelerating deep learning training and inference applications. IBM Research has played a leadership role in developing reduced precision technologies and pioneered a number of key breakthroughs, including the first 16-bit reduced-precision systems for deep learning training (presented at ICML 2015), the first 8-bit training techniques (presented recently at NeurIPS 2018), and state-of-the-art 2-bit inference results (published at SysML 2019). Following this line of work, we now introduce a new breakthrough which solves a long-ignored, yet important problem in reduced-precision deep learning: accumulation bit-width scaling for ultra-low-precision training of deep neural networks (DNNs).
The most commonly used arithmetic function in deep learning is the dot product, the building block of generalized matrix multiplication (GEMM) and convolution computations. A dot product is computed using multiply-accumulate (MAC) operations and thus requires two floating-point computations: multiplying two numbers and accumulating the product into a partial sum. Today, much of the effort on reduced-precision deep learning focuses solely on quantizing representations, i.e., the input operands to the multiplication. The other major portion of the dot product computation, the partial-sum accumulation, has always been kept in full (32-bit) precision. The reason is that reduced-precision accumulation can cause severe training instability and degradation in model accuracy, as shown in Fig. 1a. This is especially unfortunate because, as precisions are aggressively reduced, the area and power of the hardware become dominated by the accumulator bit-width. As shown in Fig. 1b, accumulating in high precision severely limits the hardware benefits of reduced-precision data representations and computation. The absence of any framework for analyzing the precision requirements of partial-sum accumulation inevitably leads to very conservative design choices.
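The MAC decomposition above can be made concrete with a small sketch (illustrative only, using numpy's float16/float32 as stand-ins for the reduced-precision formats discussed here; the function name `dot_mac` and its signature are our own):

```python
import numpy as np

def dot_mac(a, b, acc_dtype=np.float32):
    """Dot product written as explicit multiply-accumulate (MAC) steps.

    The multiplier operands are quantized to fp16 (the "representation"
    precision); each product is then accumulated into a partial sum kept
    in acc_dtype (the "accumulation" precision), which is the knob the
    blog post is about.
    """
    acc = acc_dtype(0)
    for ai, bi in zip(a, b):
        prod = np.float16(ai) * np.float16(bi)  # reduced-precision multiply
        acc = acc_dtype(acc + prod)             # partial-sum accumulation
    return acc

# Small values accumulate exactly in either precision.
print(float(dot_mac([1, 2, 3], [4, 5, 6])))  # 32.0
```

Keeping `acc_dtype=np.float32` mirrors today's conservative practice; passing `np.float16` models the reduced-precision accumulation studied in the paper.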
Figure 1. The importance of accumulation precision. (a) Convergence curves of an ImageNet ResNet18 experiment using reduced precision accumulation. The current practice is to keep the accumulation in full precision to avoid such divergence. (b) Estimated area benefits when reducing the precision of a floating-point unit (FPU). The terminology FPa/b denotes an FPU whose multiplier and adder use a and b bits, respectively. Our work enables convergence in reduced precision accumulation and gains an extra 1.5–2.2× area reduction.
In our paper published at ICLR 2019, we take a step forward and present a comprehensive statistical model to analyze the impact of reduced-precision accumulation in deep learning training. We learned two critical insights. First, when a dot product is accumulated in reduced precision, the loss of information is primarily due to a so-called "swamping" error: in floating-point arithmetic, when a large number is added to a small number, the small number is completely or partially truncated out of the addition. Second, swamping harms the statistics (i.e., the variance) of a dot product. To ensure stable convergence of DNNs, it is necessary to preserve the variance of dot products under reduced precision.
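Swamping is easy to reproduce with numpy's float16 (used here purely as an illustration; the paper's analysis covers custom reduced-precision formats, not just IEEE fp16):

```python
import numpy as np

# In fp16, the spacing between adjacent representable values at
# magnitude 2048 is 2, so adding 1.0 is completely truncated:
big = np.float16(2048.0)
small = np.float16(1.0)
swamped = big + small
print(float(swamped))  # 2048.0 -- the small addend vanished

# The same addition in fp32 is still exact:
exact = np.float32(2048.0) + np.float32(1.0)
print(float(exact))    # 2049.0
```

When many small products are repeatedly swamped like this, the accumulated dot product systematically loses magnitude, which is what distorts its variance.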
Using these insights, we derived a set of equations and introduced a new metric, the variance retention ratio (VRR) of a reduced-precision accumulation, in the context of the three deep learning GEMM functions. The VRR is a function of only the accumulation length and the accumulation bit-width, so it can be computed without any simulation (Fig. 2). The VRR can be used to assess the suitability (or lack thereof) of a precision configuration and lets us determine accumulation bit-widths for precisely tailored deep learning hardware. From the VRR calculations in Fig. 2, it is easy to see that chunk-based accumulation for typical deep learning computations preserves accuracy down to an accumulation precision, macc, of 9 bits (corresponding to fp16 accumulation), while traditional non-chunked addition needs a much higher macc (of up to 15 bits, corresponding to fp32 accumulation). This reduced accumulation bit-width translates directly into a 1.5–2.2× improvement in hardware energy efficiency (as indicated in Fig. 1).
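The benefit of chunk-based accumulation can be demonstrated with a minimal numpy sketch (again using IEEE fp16 as a stand-in for a reduced-precision accumulator, and the chunk size of 64 from Fig. 2; the variable names are our own):

```python
import numpy as np

x = np.ones(4096, dtype=np.float16)

# Naive fp16 accumulation: once the partial sum reaches 2048, every
# further +1.0 is swamped and the sum stops growing.
naive = np.float16(0)
for v in x:
    naive = np.float16(naive + v)

# Chunk-based accumulation (chunk size 64): intra-chunk partial sums
# stay small, so swamping never occurs within a chunk; the chunk sums
# (multiples of 64) are then combined exactly.
chunked = np.float16(0)
for i in range(0, len(x), 64):
    c = np.float16(0)
    for v in x[i:i + 64]:
        c = np.float16(c + v)
    chunked = np.float16(chunked + c)

print(float(naive), float(chunked))  # 2048.0 4096.0
```

By keeping every individual addend close in magnitude to its running partial sum, chunking preserves the variance of the result at a much lower accumulator bit-width, which is exactly the effect the VRR quantifies analytically.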
Figure 2. Normalized variance lost as a function of accumulation length for different values of the accumulation bit-width, macc, for (a) a normal accumulation (no chunking) and (b) a chunk-based accumulation (chunk size of 64). The "knees" in each plot mark the maximum accumulation length supported at a given precision, illustrating how the VRR is used to select a suitable precision.
Using the analysis, we successfully predicted and experimentally verified the minimum accumulation precisions required by the three GEMM functions across three popular benchmarking networks (CIFAR-10 ResNet34, ImageNet ResNet18, and ImageNet AlexNet). Our results prove that our method is able to accurately pinpoint the minimum precision needed for the convergence of benchmark networks to the full-precision baseline. On the practical side, this analysis is a useful tool for hardware designers implementing reduced precision processing hardware. We believe this work addresses a critical missing link on the path to ultra-low-precision hardware for DNN training.