ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory
Published in NeurIPS workshop, 2024; arXiv preprint, 2025
ZO2 (Zeroth-Order Offloading) is a novel framework designed for efficient zeroth-order fine-tuning of large language models with limited GPU memory. While traditional first-order optimizers such as SGD require substantial memory to store activations and gradients across the forward and backward passes, our zeroth-order approach estimates gradients from forward passes alone, eliminating the need to store activations.
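To make the forward-only gradient estimation concrete, here is a minimal MeZO-style sketch in PyTorch. It is not the ZO2 implementation; the helper `zo_step` and the `loss_fn(model, batch)` interface are hypothetical, and the key trick shown is reusing a random seed so the perturbation direction never has to be stored.

```python
import torch

@torch.no_grad()
def zo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6):
    """One zeroth-order update: two forward passes, no backward pass.

    `loss_fn(model, batch)` is assumed to run a forward pass and return a
    scalar loss. The same seed regenerates the perturbation z on demand.
    """
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Re-seed so every call draws the identical perturbation z.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    perturb(+1)                          # theta + eps * z
    loss_plus = loss_fn(model, batch)
    perturb(-2)                          # theta - eps * z
    loss_minus = loss_fn(model, batch)
    perturb(+1)                          # restore theta

    # Projected gradient estimate along z.
    grad_scale = (loss_plus - loss_minus) / (2 * eps)

    # Apply the update, regenerating the same z per parameter.
    torch.manual_seed(seed)
    for p in model.parameters():
        z = torch.randn_like(p)
        p.data.add_(-lr * grad_scale * z)

    return loss_plus
```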
Our framework dynamically shifts model parameters between CPU and GPU as needed, scheduling computation to keep the GPU busy and minimize idle time. By interleaving parameter offloading with the two forward passes of zeroth-order optimization, we avoid redundant data movement and improve fine-tuning throughput. Additionally, ZO2 supports low-bit precision in AMP mode to reduce the volume of data exchanged between CPU and GPU.
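The following is a simplified sketch of the offloading idea, not the ZO2 scheduler itself: each transformer block is assumed to live on CPU and is copied to the GPU only for the duration of its forward computation, so at most one block occupies GPU memory at a time. ZO2 additionally overlaps these transfers with computation (e.g., via CUDA streams), which is omitted here for clarity.

```python
import torch

@torch.no_grad()
def offloaded_forward(blocks, hidden, device="cuda"):
    """Forward pass over a list of transformer blocks kept on CPU.

    `blocks` is assumed to be an iterable of nn.Module layers stored on CPU,
    and `hidden` the activations already on the GPU.
    """
    for block in blocks:
        block.to(device, non_blocking=True)   # upload this block's parameters
        hidden = block(hidden)                # compute on GPU
        block.to("cpu")                       # evict to free GPU memory
    return hidden
```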
Using ZO2, we can fine-tune models as large as OPT-175B (175 billion parameters) on a single GPU with only 18 GB of memory, which is beyond the reach of traditional methods, with almost no additional time overhead and no accuracy loss compared to standard zeroth-order methods.
Recommended citation: Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, and Di Wang. (2025). "ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory." arXiv preprint arXiv:2503.12668
Download Paper