Canzona: Bringing Matrix-based Optimizers to Large-Scale Distributed Training
We recently released our paper Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers.
The goal is straightforward: make matrix-based optimizers (e.g., Shampoo, Muon, SOAP) run efficiently in mainstream distributed training stacks such as Megatron.
What is the core challenge?
Matrix-based optimizers typically need to operate on whole parameter or gradient matrices, while distributed LLM training relies heavily on fragmenting those tensors across devices.
This creates a structural mismatch:
- Synchronous solutions usually incur significant computation overhead.
- Layer-wise partitioning alone cannot resolve the mismatch while preserving communication-efficient geometry.
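To make the mismatch concrete, here is a minimal sketch (not taken from the paper) of a Muon-style whole-matrix update: the simplified Newton-Schulz orthogonalization below needs the full 2-D gradient, so tensor-parallel shards must first be reconstructed. The `orthogonalize` function, the shapes, and the cubic iteration are illustrative assumptions, not Canzona's or Muon's exact implementation.

```python
# Minimal sketch of why whole-matrix updates clash with tensor-parallel sharding.
# Names, shapes, and the simplified iteration below are illustrative assumptions.
import torch

def orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Cubic Newton-Schulz iteration toward the orthogonal polar factor of G.
    Muon-style optimizers use a tuned variant of this idea; the point here is
    only that the iteration needs the *full* 2-D gradient matrix."""
    X = G / (G.norm() + 1e-7)              # normalize so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Pretend this row-sharded gradient lives on two tensor-parallel ranks.
full_grad = torch.randn(1024, 4096)
shards = torch.chunk(full_grad, chunks=2, dim=0)   # what each TP rank actually holds

# The whole-matrix update cannot be computed shard-by-shard: orthogonalizing
# each shard independently is not the same as orthogonalizing the full matrix.
# In a real run this concatenation is an all-gather across TP ranks, which is
# exactly the reconstruction cost a scheduler has to hide.
reconstructed = torch.cat(shards, dim=0)
update = orthogonalize(reconstructed)
print(update.shape)    # torch.Size([1024, 4096])
```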
Key idea of Canzona
Canzona introduces two key designs:
- Decoupling logical optimizer assignment from physical parameter distribution.
- Parallelism-aware optimization:
  - For Data Parallelism: alpha-Balanced Static Partitioning to mitigate load imbalance (a minimal load-balancing sketch follows this list).
  - For Tensor Parallelism: an asynchronous compute pipeline with Micro-Group Scheduling that batches fragmented updates and hides reconstruction overhead (a scheduling sketch follows as well).
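For the data-parallel side, the sketch below shows the general shape of static load balancing of per-layer optimizer work across ranks. It is a plain greedy assignment, not Canzona's alpha-balanced formulation (see the paper for that), and the layer names and cost numbers are made up.

```python
# A minimal sketch, not Canzona's implementation: statically assigning per-layer
# optimizer work to data-parallel ranks so that no rank becomes a straggler.
# The cost model and the exact alpha-balanced criterion are assumptions.
import heapq

def partition_layers(layer_costs: dict[str, float], num_ranks: int) -> list[list[str]]:
    """Greedy longest-processing-time assignment: hand each layer to the
    currently least-loaded rank, heaviest layers first."""
    heap = [(0.0, rank) for rank in range(num_ranks)]   # (accumulated cost, rank id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for name, cost in sorted(layer_costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(name)
        heapq.heappush(heap, (load + cost, rank))
    return assignment

# Toy example: cost roughly tracks preconditioner work, which grows with matrix size.
costs = {"embed": 8.0, "attn.qkv": 4.0, "attn.out": 2.0, "mlp.up": 6.0, "mlp.down": 6.0}
print(partition_layers(costs, num_ranks=2))
```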
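For the tensor-parallel side, the next sketch shows the overlap pattern in the abstract: while the matrix update for one micro-group of layers is being computed, the reconstruction (all-gather) for the next micro-group is already in flight. The grouping, the stand-in `gather_group`/`update_group` functions, and the thread-based overlap are assumptions for illustration only; presumably the real scheduler overlaps GPU communication and compute rather than Python threads.

```python
# A minimal sketch, not Canzona's scheduler: overlap the gather that reconstructs
# TP-sharded gradients for the next micro-group with the matrix update of the
# current one. All functions and groupings below are stand-ins.
from concurrent.futures import ThreadPoolExecutor
import time

def gather_group(group):          # stand-in for an async all-gather across TP ranks
    time.sleep(0.1)               # pretend communication latency
    return [f"full_grad({layer})" for layer in group]

def update_group(full_grads):     # stand-in for the matrix-based optimizer step
    time.sleep(0.1)               # pretend preconditioning cost
    return [g.replace("full_grad", "update") for g in full_grads]

micro_groups = [["attn.qkv", "attn.out"], ["mlp.up", "mlp.down"], ["embed"]]

with ThreadPoolExecutor(max_workers=1) as comm:
    inflight = comm.submit(gather_group, micro_groups[0])     # prefetch group 0
    for nxt in micro_groups[1:] + [None]:
        grads = inflight.result()                             # wait for current gather
        if nxt is not None:
            inflight = comm.submit(gather_group, nxt)         # start next gather early
        updates = update_group(grads)                         # compute overlaps the gather
        print(updates)
```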
Results (reported in the paper)
On Qwen3 models (up to 32B) with 256 GPUs, Canzona achieves:
- 1.57x speedup in end-to-end iteration time.
- 5.8x reduction in optimizer step latency.
One-sentence takeaway
Canzona is less about proposing yet another optimizer, and more about making existing matrix-based optimizers truly practical in modern distributed LLM training systems.
