Canzona: Bringing Matrix-based Optimizers to Large-Scale Distributed Training

We recently released our paper Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers.

The goal is straightforward: make matrix-based optimizers (e.g., Shampoo, Muon, SOAP) run efficiently in mainstream distributed training stacks such as Megatron.

What is the core challenge?

Matrix-based optimizers typically need to operate on whole parameter or gradient matrices, while distributed LLM training shards those same tensors across devices; a short sketch after the list below illustrates the problem.

This creates a structural mismatch:

  • Synchronous solutions usually incur significant computation overhead.
  • Layer-wise partitioning alone cannot resolve the mismatch without giving up the communication-efficient sharding layout.
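
To make the mismatch concrete, here is a minimal, single-process sketch (not Canzona's code, and all names are illustrative): an SVD stands in for the usual Newton-Schulz orthogonalization, and `torch.cat` stands in for an all-gather. The point is that the update of a full matrix is not the concatenation of updates computed shard by shard, so a naive synchronous fix has to gather and recompute the whole matrix on every rank.

```python
# Hypothetical sketch: why per-shard updates break matrix-based optimizers.
# A Muon-style step orthogonalizes the whole 2D gradient (SVD here for
# clarity; real implementations use Newton-Schulz iterations), so a rank
# holding only a row shard cannot compute its slice of the update locally.
import torch

def orthogonalized_update(g: torch.Tensor) -> torch.Tensor:
    """Whole-matrix update: U V^T from the SVD of the full gradient."""
    u, _, vh = torch.linalg.svd(g, full_matrices=False)
    return u @ vh

world_size = 4
full_grad = torch.randn(1024, 512)

# What sharded training gives each rank: a row slice of the gradient.
shards = list(torch.chunk(full_grad, world_size, dim=0))

# Naive "synchronous" fix: gather the full matrix on every rank,
# recompute the whole update, then keep only the local slice.
gathered = torch.cat(shards, dim=0)                        # stands in for an all-gather
update = orthogonalized_update(gathered)                   # redundant work on every rank
local_update = torch.chunk(update, world_size, dim=0)[0]   # rank 0's slice

# The mismatch: the shard of the update is NOT the update of the shard.
wrong = orthogonalized_update(shards[0])
print(torch.allclose(local_update, wrong))  # False
```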

Key idea of Canzona

Canzona introduces two key designs:

  1. Decoupling logical optimizer assignment from physical parameter distribution.
  2. Parallelism-aware optimization.
    • For Data Parallelism: alpha-Balanced Static Partitioning to mitigate load imbalance (a toy load-balancing sketch follows this list).
    • For Tensor Parallelism: asynchronous compute pipeline with Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead.
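
To illustrate the load-balancing problem on the data-parallel side, here is a hypothetical greedy sketch that assigns layers to the least-loaded rank using an assumed per-layer cost. This is a generic heuristic, not Canzona's alpha-Balanced Static Partitioning; the layer names and cost model are placeholders.

```python
# Hypothetical sketch of cost-balanced static assignment of parameter matrices
# to data-parallel ranks. A generic greedy (longest-processing-time-first)
# heuristic shown only to illustrate the problem; the paper's algorithm
# is its own method and may weigh costs differently.
import heapq

def assign_layers(costs: dict[str, float], num_ranks: int) -> list[list[str]]:
    """Assign each named layer to the currently least-loaded rank."""
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (load, rank)
    heapq.heapify(heap)
    buckets: list[list[str]] = [[] for _ in range(num_ranks)]
    # Place the most expensive layers first so big items don't land late.
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        buckets[rank].append(name)
        heapq.heappush(heap, (load + cost, rank))
    return buckets

# Toy cost model: optimizer-step cost grows with matrix size (placeholder values).
layer_costs = {"embed": 8.0, "attn.qkv": 4.0, "attn.out": 2.0,
               "mlp.up": 6.0, "mlp.down": 6.0, "lm_head": 8.0}
print(assign_layers(layer_costs, num_ranks=4))
```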

Results (reported in the paper)

On Qwen3 models (up to 32B) with 256 GPUs, Canzona achieves:

  • 1.57x speedup in end-to-end iteration time.
  • 5.8x reduction in optimizer step latency.

One-sentence takeaway

Canzona is less about proposing yet another optimizer, and more about making existing matrix-based optimizers truly practical in modern distributed LLM training systems.