Education
Work experience
- July 2025 - Present: Research Intern
- Alibaba Qwen Team
- Duties included: Designed and implemented Canzona, a unified, asynchronous, and load-balanced framework that enables distributed matrix-based optimizers (e.g., Muon, Shampoo, SOAP) in large-scale LLM pretraining under Megatron with ZeRO-1 and tensor parallelism; an illustrative sketch of the optimizer step follows this list.
- Summer 2024: LLM Pretraining Engineer (Intern)
- Aramco
- Duties included: Conducted large-scale LLM pretraining and improved training efficiency through CUDA kernel fusion, asynchronous checkpointing, and distributed parallelism (ZeRO, DP, TP, PP); an asynchronous-checkpointing sketch also follows this list.
- Fall 2022: Research Assistant
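
A minimal, single-GPU sketch of the Muon-style orthogonalized momentum step that a framework like Canzona distributes. The Newton-Schulz coefficients follow the public Muon reference implementation; `newton_schulz_orthogonalize` and `muon_step` are illustrative names and not Canzona's actual API:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2-D matrix to the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic Newton-Schulz coefficients
    X = G / (G.norm() + 1e-7)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param: torch.Tensor, momentum: torch.Tensor, grad: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One hypothetical Muon update on plain tensors (wrap in torch.no_grad()
    when applied to nn.Parameters)."""
    momentum.mul_(beta).add_(grad)                  # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum)  # orthogonalized direction
    param.add_(update, alpha=-lr)
```

In a distributed setting, the per-matrix orthogonalizations are independent, so a framework of this kind can shard them across ranks and overlap them with communication, which is where load balancing matters.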
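And a hedged sketch of the asynchronous-checkpointing pattern referenced above: snapshot the weights to CPU on the training thread, then let a background thread do the slow `torch.save`. The helper name and interface are assumptions for illustration:

```python
import threading
import torch

def async_checkpoint(model: torch.nn.Module, path: str) -> threading.Thread:
    # Snapshot to CPU synchronously so later optimizer steps cannot mutate
    # the tensors while they are being serialized.
    cpu_state = {k: v.detach().to("cpu", copy=True)
                 for k, v in model.state_dict().items()}
    # The slow disk write happens off the training thread.
    writer = threading.Thread(target=torch.save, args=(cpu_state, path))
    writer.start()
    return writer  # caller should join() before relying on the file
```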
Skills
- PyTorch / LibTorch: In-depth knowledge of PyTorch operators’ workflow and implementation, including the distributed training packages and multi-threaded / stream-based programming (see the CUDA-stream sketch after this list).
- CUDA programming / Triton: Intermediate proficiency in CUDA stream and kernel programming, with a solid understanding of CUDA execution principles (see the Triton kernel sketch after this list).
- DeepSpeed / Megatron: Experience using DeepSpeed and Megatron for distributed training, including manually implemented optimizations on top of both (a minimal ZeRO-1 setup is sketched after this list).
- Programming Languages: Python (mainly for PyTorch), C/C++ (mainly for multi-threading, CUDA programming, and LibTorch).
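
To illustrate the stream programming listed above, a side CUDA stream can overlap a pinned host-to-device copy with compute on the default stream; this is a generic PyTorch pattern, not code from any of the projects listed:

```python
import torch

assert torch.cuda.is_available()
copy_stream = torch.cuda.Stream()
weights = torch.randn(4096, 4096, device="cuda")
host_batch = torch.randn(4096, 4096, pin_memory=True)  # pinned => true async copy

with torch.cuda.stream(copy_stream):
    # Runs concurrently with the matmul issued on the default stream below.
    device_batch = host_batch.to("cuda", non_blocking=True)

activations = weights @ weights  # default stream, overlaps with the copy

# Order the default stream after the copy before consuming device_batch.
torch.cuda.current_stream().wait_stream(copy_stream)
out = activations + device_batch
```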
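A small Triton kernel fusing an elementwise add with a ReLU, the kind of kernel fusion referred to above; names and block size are arbitrary:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                       # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    fused_add_relu_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out
```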
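And a minimal DeepSpeed ZeRO-1 setup of the kind referenced above; the model and config values are placeholders, and the script is assumed to run under the `deepspeed` launcher:

```python
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)  # placeholder model
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},  # shard optimizer states across ranks
    "bf16": {"enabled": True},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

# In the training loop, engine.backward(loss) and engine.step() replace the
# usual loss.backward() / optimizer.step() pair.
```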