About Me

I am Liangyu Wang, a Ph.D. candidate in Computer Science at King Abdullah University of Science and Technology (KAUST), specializing in efficient training and inference for large language models (LLMs) through distributed computing and advanced GPU programming. Before that, I completed my master's degree at The Chinese University of Hong Kong, focusing on multimodal machine learning. I am now conducting LLM pretraining research with the Alibaba Qwen Team. My research interests include optimizing distributed training and inference of LLMs, improving multi-threaded and multi-stream scheduling, and enhancing privacy-preserving methods for LLMs. I have interned as an LLM Pretraining Engineer at Aramco, working with large-scale GPU clusters to boost training throughput and model scalability. Currently, I am working on:

Efficient reinforcement learning (RL) for LLM reasoning
Distributed training and inference of LLMs
Efficient algorithm and infrastructure design for LLMs
Efficient privacy-preserving methods

🔥 News

02/2026: One paper was accepted to ICLR 2026.
02/2026: Released Canzona (arXiv).
09/2025: FlashDP was accepted to NeurIPS 2025.
07/2025: ZO2 was accepted to COLM 2025.
07/2025: Released Infinite-Sampling (arXiv).
06/2025: Joined the Alibaba Qwen Team for LLM pretraining.
03/2025: Released ZO2 (arXiv, code).

☃️ Projects

ZO2 (Zeroth-Order Offloading): Full Parameter Fine-Tuning 175B LLMs with 18GB GPU Memory

A framework that enables fine-tuning of extremely large language models (such as OPT-175B) on limited GPU memory by combining zeroth-order optimization with CPU-GPU offloading.
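To give a feel for the zeroth-order idea, here is a minimal NumPy sketch of a generic two-point (SPSA-style) gradient estimate on a toy quadratic. It is an illustration under stated assumptions, not the ZO2 implementation: the names `zo_sgd_step` and `quadratic_loss`, the hyperparameters, and the toy objective are all hypothetical, and CPU-GPU offloading is not shown. The point it makes is that each step needs only two forward evaluations of the loss and no backpropagation.

```python
import numpy as np


def zo_sgd_step(params, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One zeroth-order SGD step using a two-point gradient estimate.

    Hypothetical helper for illustration only; not taken from ZO2.
    """
    # Sample a random perturbation direction. Because it is derived from
    # a seed, it can be regenerated on demand instead of being stored.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)

    # Two forward evaluations; no gradients are computed.
    loss_plus = loss_fn(params + eps * z)
    loss_minus = loss_fn(params - eps * z)

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # Move against the estimated gradient direction.
    return params - lr * projected_grad * z


def quadratic_loss(w):
    """Toy objective (an assumption for this sketch): minimum at w = 3."""
    return float(np.sum((w - 3.0) ** 2))


w = np.zeros(4)
initial_loss = quadratic_loss(w)
for step in range(500):
    w = zo_sgd_step(w, quadratic_loss, seed=step)
final_loss = quadratic_loss(w)
```

In this sketch the only per-step state besides the parameters is a scalar and a seed, which is the property that makes zeroth-order methods attractive when memory is the bottleneck.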

Tiny-LLM-Libs: Minimalistic Re-Implementations of Popular LLM Libraries

Minimal replicas that map major distributed-training stacks to tiny, readable implementations.

Original Stack	Tiny Replica	Description
FSDP	Tiny-FSDP	Side-by-side implementations of DDP, ZeRO-3, and FSDP, including their distinct communication patterns and memory trade-offs.
DeepSpeed	Tiny-DeepSpeed	A minimal DeepSpeed-style training stack with DDP + ZeRO-1/2/3, plus meta initialization, rank mapping, and compute-communication overlap.
Megatron-LM	Tiny-Megatron	Educational TP, DP, and 2D TP+DP hybrid training with custom modules and a runtime auto-tuner for kernel selection.

Tinytron: A Minimal Pre-Training Stack for GPT-Style Language Models

A hackable, developer-friendly framework featuring a modular GPT architecture with FA/GQA/MoE support, distributed training (ZeRO-1, Sequence-Expert Joint Parallelism), mixed-precision training, and comprehensive profiling utilities for efficient large-model pre-training.

📝 Publications

Preprint 2026 Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers
Liangyu Wang *, Siqi Zhang *, Junjie Wang, Yiming Dong, Bo Zheng, Zihan Qiu, Shengkun Tang, Di Wang, Rui Men, and Dayiheng Liu
Paper

COLM 2025 ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory
Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, and Di Wang
Paper | Code

NeurIPS 2025 FlashDP: Memory-Efficient and High-Throughput DP-SGD Training for Large Language Models
Liangyu Wang, Junxiao Wang, Jie Ren, Zihang Xiang, David E. Keyes, and Di Wang
Paper | Code

Preprint 2025 Infinite-Sampling: Efficient and Stable Grouped RL Training for Large Language Models
Liangyu Wang, Huanyi Xie, Xinhai Wang, Tianjin Huang, Mengdi Li, and Di Wang
Paper

Preprint 2025 DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing
Liangyu Wang, Huanyi Xie, and Di Wang
Paper | Code

📋 Academic Service

Conference: COLM 2026, ECCV 2026, ICML 2026, ACL 2026, CVPR 2026, AAAI 2026, ICLR 2025, COLM 2025