Canzona: Bringing Matrix-based Optimizers to Large-Scale Distributed Training

We recently released our paper Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers.

The goal is straightforward: make matrix-based optimizers (e.g., Shampoo, Muon, SOAP) run efficiently in mainstream distributed training stacks such as Megatron.

What is the core challenge?

Matrix-based optimizers typically need to operate on whole parameter or gradient matrices, while distributed LLM training shards those same tensors across devices; a short sketch after the list below illustrates the problem.

This creates a structural mismatch:

  • Synchronous solutions usually incur significant computation overhead.
  • Layer-wise partitioning alone cannot resolve the mismatch without giving up the communication-efficient sharding layout.
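
To make the mismatch concrete, here is a minimal, single-process sketch (not Canzona's code, and all names are illustrative): an SVD stands in for the usual Newton-Schulz orthogonalization, and `torch.cat` stands in for an all-gather. The point is that the update of a full matrix is not the concatenation of updates computed shard by shard, so a naive synchronous fix has to gather and recompute the whole matrix on every rank.

```python
# Hypothetical sketch: why per-shard updates break matrix-based optimizers.
# A Muon-style step orthogonalizes the whole 2D gradient (SVD here for
# clarity; real implementations use Newton-Schulz iterations), so a rank
# holding only a row shard cannot compute its slice of the update locally.
import torch

def orthogonalized_update(g: torch.Tensor) -> torch.Tensor:
    """Whole-matrix update: U V^T from the SVD of the full gradient."""
    u, _, vh = torch.linalg.svd(g, full_matrices=False)
    return u @ vh

world_size = 4
full_grad = torch.randn(1024, 512)

# What sharded training gives each rank: a row slice of the gradient.
shards = list(torch.chunk(full_grad, world_size, dim=0))

# Naive "synchronous" fix: gather the full matrix on every rank,
# recompute the whole update, then keep only the local slice.
gathered = torch.cat(shards, dim=0)                        # stands in for an all-gather
update = orthogonalized_update(gathered)                   # redundant work on every rank
local_update = torch.chunk(update, world_size, dim=0)[0]   # rank 0's slice

# The mismatch: the shard of the update is NOT the update of the shard.
wrong = orthogonalized_update(shards[0])
print(torch.allclose(local_update, wrong))  # False
```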

Key idea of Canzona

Canzona introduces two key designs:

  1. Decoupling logical optimizer assignment from physical parameter distribution.
  2. Parallelism-aware optimization.
    • For Data Parallelism: alpha-Balanced Static Partitioning to mitigate load imbalance (a toy load-balancing sketch follows this list).
    • For Tensor Parallelism: asynchronous compute pipeline with Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead.
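
To illustrate the load-balancing problem on the data-parallel side, here is a hypothetical greedy sketch that assigns layers to the least-loaded rank using an assumed per-layer cost. This is a generic heuristic, not Canzona's alpha-Balanced Static Partitioning; the layer names and cost model are placeholders.

```python
# Hypothetical sketch of cost-balanced static assignment of parameter matrices
# to data-parallel ranks. A generic greedy (longest-processing-time-first)
# heuristic shown only to illustrate the problem; the paper's algorithm
# is its own method and may weigh costs differently.
import heapq

def assign_layers(costs: dict[str, float], num_ranks: int) -> list[list[str]]:
    """Assign each named layer to the currently least-loaded rank."""
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (load, rank)
    heapq.heapify(heap)
    buckets: list[list[str]] = [[] for _ in range(num_ranks)]
    # Place the most expensive layers first so big items don't land late.
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        buckets[rank].append(name)
        heapq.heappush(heap, (load + cost, rank))
    return buckets

# Toy cost model: optimizer-step cost grows with matrix size (placeholder values).
layer_costs = {"embed": 8.0, "attn.qkv": 4.0, "attn.out": 2.0,
               "mlp.up": 6.0, "mlp.down": 6.0, "lm_head": 8.0}
print(assign_layers(layer_costs, num_ranks=4))
```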

Results (reported in the paper)

On Qwen3 models (up to 32B) with 256 GPUs, Canzona achieves:

  • 1.57x speedup in end-to-end iteration time.
  • 5.8x reduction in optimizer step latency.

One-sentence takeaway

Canzona is less about proposing yet another optimizer, and more about making existing matrix-based optimizers truly practical in modern distributed LLM training systems.