ZO2: Zeroth-Order Offloading
A framework that enables fine-tuning of extremely large language models (like OPT-175B) on limited GPU memory through zeroth-order optimization and CPU-GPU offloading.
Overview
Large Language Models (LLMs) with billions of parameters have shown remarkable capabilities, but fine-tuning these models is challenging due to extensive GPU memory requirements. Traditional first-order optimizers such as SGD compute gradients via backpropagation, which requires storing intermediate activations from the forward pass and gradients during the backward pass, making them impractical for extremely large models on consumer hardware.
ZO2 solves this problem by combining two key innovations:
Zeroth-Order Optimization: Instead of computing gradients through backpropagation, ZO2 uses a forward-pass-only gradient approximation, eliminating the need to store activations (sketched below this list).
CPU-GPU Offloading: ZO2 dynamically shifts model parameters between CPU and GPU as needed, optimizing memory usage and computation flow.
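To make the first idea concrete, here is a minimal PyTorch sketch of MeZO-style zeroth-order (SPSA) gradient estimation, the kind of two-forward-pass estimator ZO2 builds on. The `zo_step` function and the `loss_fn(model, batch)` callable are illustrative stand-ins, not part of the ZO2 API.

```python
# Minimal sketch of MeZO-style zeroth-order (SPSA) estimation:
# two forward passes, no backpropagation, no stored activations.
import torch

def zo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Regenerate the same random direction z from the seed,
        # so it never has to be stored alongside the parameters.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                       # theta + eps * z
        loss_pos = loss_fn(model, batch)
        perturb(-2.0)                       # theta - eps * z
        loss_neg = loss_fn(model, batch)
        perturb(+1.0)                       # restore theta

        # Projected gradient estimate along direction z
        grad_est = (loss_pos - loss_neg) / (2 * eps)

        # SGD-style update along the same direction z
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_est * z)
    return loss_pos
```

Because the random direction is regenerated from a seed, nothing beyond the parameters themselves has to persist between the two forward passes.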
Key Features
- Fine-tune 175B parameter models on a single consumer GPU with as little as 18GB of memory
- No accuracy loss compared to standard zeroth-order methods
- Minimal time overhead through optimized CPU-GPU transfer scheduling
- Low-bit precision support in AMP mode for efficient CPU-GPU data exchange (sketched after this list)
- Compatible with popular LLMs including OPT, LLaMA, and others
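As a rough picture of the low-bit AMP-mode data exchange, a block's weights can be cast to a reduced-precision dtype before being uploaded, roughly halving CPU-GPU transfer volume versus float32. This is only an illustrative sketch under that assumption, not ZO2's implementation:

```python
# Illustrative only: upload a block in bfloat16 to cut transfer volume,
# then run its forward pass in reduced precision under autocast.
import torch
import torch.nn as nn

block = nn.Linear(4096, 4096)                          # stand-in for one transformer block
block = block.to(dtype=torch.bfloat16, device="cuda")  # low-bit CPU -> GPU upload

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = block(x)                                       # compute in reduced precision
```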
How It Works
ZO2 integrates parameter offloading with the two forward passes required by zeroth-order optimization:
- Dynamic Parameter Management: Parameters are stored primarily in CPU memory and loaded to GPU only when needed
- Optimized Gradient Approximation: Uses efficient zeroth-order methods that require only forward passes
- Intelligent Scheduling: Minimizes unnecessary data transfers between CPU and GPU
```python
# Example ZO2 usage
from zo2 import ZO2Optimizer, CPUOffloader

model = get_large_language_model()  # Your 175B parameter model
offloader = CPUOffloader(model)
optimizer = ZO2Optimizer(model.parameters(), lr=1e-4)

for inputs, targets in dataloader:
    # Parameters automatically managed between CPU/GPU
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

    # Zeroth-order gradient computation and update
    optimizer.zero_grad()
    optimizer.step(loss)
```
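The scheduling idea can be pictured as a simple pipeline: while the current transformer block computes on the GPU, the next block is uploaded on a separate CUDA stream, and finished blocks return to CPU memory. The sketch below illustrates that overlap under these assumptions (per-block offloading, one side stream); the function names are illustrative, not the ZO2 API.

```python
# Rough sketch of the overlap behind block-wise offloading: prefetch the
# next block on a side stream while the current block computes, then
# return the finished block to CPU. Illustrative only, not the ZO2 API.
import torch
import torch.nn as nn

def forward_with_offload(blocks, hidden, device="cuda"):
    upload_stream = torch.cuda.Stream()
    blocks[0].to(device)                                     # first block uploaded up front
    for i, block in enumerate(blocks):
        if i + 1 < len(blocks):
            with torch.cuda.stream(upload_stream):
                blocks[i + 1].to(device, non_blocking=True)  # prefetch the next block
        hidden = block(hidden)                               # compute on the default stream
        torch.cuda.current_stream().wait_stream(upload_stream)
        block.to("cpu")                                      # offload the finished block
    return hidden

# Toy usage: small linear layers standing in for transformer blocks.
blocks = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(4)])
with torch.no_grad():
    out = forward_with_offload(blocks, torch.randn(2, 1024, device="cuda"))
```

Because each zeroth-order step reuses the same parameters for both forward passes, an uploaded block can serve both passes before it is returned to CPU, which keeps the transfer overhead low.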
Results
ZO2 dramatically reduces GPU memory requirements across different model sizes. For example, when fine-tuning OPT-6.7B, ZO2 requires only 4GB of GPU memory compared to 68GB for AdamW, 32GB for SGD, and 16GB for MeZO. Similarly, for OPT-13B, ZO2 needs only 6GB compared to 57GB for SGD and 29GB for MeZO.
For larger models, the memory savings are even more significant. ZO2 can fine-tune OPT-30B with just 8GB of memory (compared to 63GB for MeZO), and most impressively, enables fine-tuning of OPT-175B with only 18GB of GPU memory - a model that is impossible to fine-tune with traditional methods on consumer hardware.
The “x” markers in the chart indicate cases where the model couldn’t be fine-tuned with the corresponding method due to excessive memory requirements. ZO2 is the only method capable of fine-tuning all model sizes, including the massive 175B parameter model, on consumer-grade GPUs.
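A back-of-envelope check helps explain these numbers: even when no gradients or optimizer states are stored, fp16 weights take roughly 2 bytes per parameter, so keeping the full model on the GPU stops scaling long before 175B. A rough weights-only estimate:

```python
# Rough estimate: fp16 weights only, ~2 bytes per parameter.
for name, n_params in [("OPT-6.7B", 6.7e9), ("OPT-13B", 13e9),
                       ("OPT-30B", 30e9), ("OPT-175B", 175e9)]:
    print(f"{name}: ~{n_params * 2 / 1e9:.0f} GB for weights alone")
# OPT-6.7B: ~13 GB   OPT-13B: ~26 GB   OPT-30B: ~60 GB   OPT-175B: ~350 GB
```

These weight-only figures roughly track the reported MeZO numbers above, and they show why OPT-175B cannot fit on any single GPU without offloading the weights to CPU memory.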
Resources
Citation
@article{wang2025zo2,
title={ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory},
author={Wang, Liangyu and Ren, Jie and Xu, Hang and Wang, Junxiao and Xie, Huanyi and Keyes, David E and Wang, Di},
journal={arXiv preprint arXiv:2503.12668},
year={2025}
}