Projects

A curated list of my open-source research projects on efficient LLM training and inference systems.

Canzona

Unified, asynchronous, and load-balanced matrix-based optimization for distributed training, with implementations for different sharding stacks.

Megatron-Canzona

Megatron integration with load-balanced DP partitioning, async TP micro-group scheduling, parameter splitting, and optimizer plugin support.

Code|Paper
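To make "load-balanced DP partitioning" concrete, here is a minimal sketch of one common balancing heuristic: greedy longest-processing-time assignment. It illustrates the general idea only and is not taken from the Megatron-Canzona code; the function name `balance_partition`, the parameter names, and the cost values are all hypothetical.

```python
import heapq

def balance_partition(costs, num_ranks):
    """Greedy LPT: give each task to the currently least-loaded rank.

    costs: dict mapping a (hypothetical) parameter name to an estimated
    optimizer cost. Returns (assignment, loads), both keyed by rank.
    """
    heap = [(0, rank) for rank in range(num_ranks)]  # (current load, rank)
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}
    # Place the most expensive tasks first so small ones fill the gaps.
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(name)
        heapq.heappush(heap, (load + cost, rank))
    loads = {r: sum(costs[n] for n in names) for r, names in assignment.items()}
    return assignment, loads

if __name__ == "__main__":
    # Made-up costs for five weight matrices split across two DP ranks.
    demo = {"wq": 8, "wk": 7, "wv": 6, "wo": 5, "mlp": 4}
    print(balance_partition(demo, num_ranks=2))
```

A naive contiguous split of the same five tasks (first three on one rank, last two on the other) would give loads of 21 and 9; the greedy pass narrows that to 17 and 13.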

FSDP-Canzona

FSDP-oriented executor for full-matrix optimizer tasks with logical host assignment, shard gather, full update compute, and update scatter.

Code|Paper
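The four stages named above (host assignment, shard gather, full update compute, update scatter) can be sketched as a single-process simulation. This is an illustrative toy, not the FSDP-Canzona executor: ranks are plain Python lists, no real collectives are issued, round-robin host selection is an assumption, and `full_matrix_update` is a stand-in for whatever matrix-based optimizer is plugged in.

```python
import math

def assign_hosts(task_names, world_size):
    # Assumed policy: round-robin logical host per task.
    return {name: i % world_size for i, name in enumerate(task_names)}

def full_matrix_update(grad):
    # Placeholder update that needs the whole matrix: scale by its
    # Frobenius norm. A real matrix-based optimizer would go here.
    norm = math.sqrt(sum(x * x for row in grad for x in row))
    return [[x / norm for x in row] for row in grad]

def run_task(shards):
    """shards[r] holds the gradient rows owned by rank r."""
    # 1) Gather: the host collects every rank's rows into one matrix.
    full = [row for shard in shards for row in shard]
    # 2) Compute: the host runs the full-matrix update once.
    update = full_matrix_update(full)
    # 3) Scatter: each rank receives the rows matching its own shard.
    out, start = [], 0
    for shard in shards:
        out.append(update[start:start + len(shard)])
        start += len(shard)
    return out

if __name__ == "__main__":
    print(assign_hosts(["layer0.w", "layer1.w", "layer2.w"], world_size=2))
    print(run_task([[[3.0, 0.0]], [[0.0, 4.0]]]))
```

The point of the pattern is that the update is computed once on the full matrix rather than independently per shard, which matters for any optimizer whose step is not elementwise.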

ZO2: Full-Parameter Fine-Tuning 175B LLMs with 18GB GPU Memory

Zeroth-order offloading framework that enables memory-efficient full-parameter fine-tuning for extremely large LLMs.
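For readers unfamiliar with zeroth-order optimization, the sketch below shows a generic two-point (SPSA-style) update that estimates a directional gradient from two forward evaluations, with no backward pass. It is a toy on a list of floats and does not reproduce ZO2's implementation or its offloading logic; `zo_step`, its defaults, and the quadratic loss are assumptions made for illustration.

```python
import random

def zo_step(params, loss_fn, lr=0.01, eps=1e-3, seed=0):
    """One in-place zeroth-order SGD step on a list of floats."""
    def perturb(scale):
        # Re-seeding replays the same random direction, so the
        # perturbation vector never has to be stored.
        rng = random.Random(seed)
        for i in range(len(params)):
            params[i] += scale * rng.gauss(0.0, 1.0)

    perturb(+eps)
    loss_plus = loss_fn(params)
    perturb(-2 * eps)
    loss_minus = loss_fn(params)
    perturb(+eps)  # restore the original parameters

    # Scalar estimate of the gradient projected onto the direction.
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] -= lr * projected_grad * rng.gauss(0.0, 1.0)
    return projected_grad

def quadratic(p):
    return sum(x * x for x in p)

if __name__ == "__main__":
    weights = [1.0, -2.0, 0.5]
    for step in range(200):
        zo_step(weights, quadratic, seed=step)
    print(quadratic(weights))
```

Because only forward evaluations are needed, no activations or gradient tensors are kept for backpropagation, which is the general property that makes zeroth-order methods attractive when GPU memory is the limiting factor.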

Tinytron

A minimal yet practical pre-training stack for GPT-style models, with FlashAttention (FA), GQA, and MoE support plus distributed training utilities (ZeRO-1, Sequence-Expert Joint Parallelism).

Tiny-LLM-Libs

Educational mini-replicas of major distributed training stacks, kept small so the core mechanisms can be read quickly.

Tiny-FSDP

Side-by-side DDP, ZeRO-3, and FSDP implementations for studying communication and memory trade-offs.

Tiny-DeepSpeed

Minimal DDP + ZeRO-1/2/3 training stack with meta initialization and overlap primitives.
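As a rough guide to what the ZeRO stages buy, the sketch below computes per-GPU model-state memory using the standard accounting for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. It ignores activations, buffers, and fragmentation, and the function name `model_state_bytes` is hypothetical rather than part of Tiny-DeepSpeed.

```python
def model_state_bytes(num_params, world_size, stage):
    """Per-GPU bytes for weights, gradients, and Adam states.

    stage 0 is plain DDP (everything replicated); stages 1 to 3
    shard optimizer states, then gradients, then parameters.
    """
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 master weights, momentum, variance
    if stage >= 1:
        optim /= world_size
    if stage >= 2:
        grads /= world_size
    if stage >= 3:
        params /= world_size
    return params + grads + optim

if __name__ == "__main__":
    # Example: a 7.5B-parameter model on 64 GPUs (1 GB = 1e9 bytes).
    for stage in range(4):
        gb = model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For this example the model states shrink from 120 GB per GPU under DDP to about 31.4 GB, 16.6 GB, and 1.9 GB under stages 1, 2, and 3 respectively.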

Tiny-Megatron

Educational TP/DP/2D hybrid pipeline with custom modules and runtime auto-tuning.

Liangyu Wang
