CV

Education

Work experience

  • July 2025 - Present: Research Intern
    • Alibaba Qwen Team
    • Designed and implemented Canzona, a unified, asynchronous, load-balanced framework enabling distributed matrix-based optimizers (e.g., Muon, Shampoo, SOAP) in large-scale LLM pretraining under Megatron with ZeRO-1 and tensor parallelism.
  • Summer 2024: LLM Pretraining Engineer (Intern)
    • Aramco
    • Conducted large-scale LLM pretraining and improved training efficiency through CUDA kernel fusion, asynchronous checkpointing, and distributed parallelism (ZeRO, DP, TP, PP).
  • Fall 2022: Research Assistant
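The asynchronous checkpointing mentioned above can be sketched in plain Python. This is an illustrative, hypothetical helper (not the actual Aramco implementation): the state is snapshotted synchronously so later training updates cannot corrupt the checkpoint, while the slow disk write happens on a background thread.

```python
import copy
import pickle
import tempfile
import threading
from pathlib import Path

def async_checkpoint(state: dict, path: Path) -> threading.Thread:
    # Blocking part: snapshot the state on the caller's thread so that
    # subsequent training updates cannot race with the write.
    snapshot = copy.deepcopy(state)

    def _write() -> None:
        # Slow part: serialize to disk off the training critical path.
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer

# Usage: training mutates the live state while the checkpoint is written.
state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
ckpt = Path(tempfile.mkdtemp()) / "ckpt.pkl"
writer = async_checkpoint(state, ckpt)
state["weights"][0] = 9.9          # training continues immediately
writer.join()                      # only needed before reading the file
with open(ckpt, "rb") as f:
    restored = pickle.load(f)      # holds the pre-mutation snapshot
```

A real training loop would snapshot a PyTorch `state_dict` into pinned CPU memory instead of using `copy.deepcopy`, but the overlap structure is the same.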

Skills

  • PyTorch / LibTorch: In-depth knowledge of PyTorch operator workflow and implementation, including the distributed training packages and multi-threaded / CUDA-stream programming.
  • CUDA programming / Triton: Intermediate proficiency in CUDA stream and kernel programming, with a solid understanding of GPU execution principles.
  • DeepSpeed / Megatron: Experienced in using DeepSpeed and Megatron for distributed training, including manually implementing optimizations.
  • Programming Languages: Python (mainly for PyTorch), C/C++ (mainly for multi-threading, CUDA programming, and LibTorch).
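As a minimal sketch of the ZeRO-1 idea behind the DeepSpeed / Megatron bullet (function names are hypothetical; a real implementation shards flat fp32 optimizer states and all-gathers updated parameters over NCCL):

```python
def shard_params(num_params: int, world_size: int) -> list[list[int]]:
    # ZeRO-1 style partition: each rank owns optimizer state for roughly
    # 1/world_size of the parameters (round-robin assignment here).
    shards: list[list[int]] = [[] for _ in range(world_size)]
    for i in range(num_params):
        shards[i % world_size].append(i)
    return shards

def zero1_sgd_step(params, grads, world_size, lr=0.1):
    # Each rank applies SGD only to the parameters in its shard; the
    # final all-gather is simulated by writing into one shared list.
    updated = list(params)
    for shard in shard_params(len(params), world_size):
        for i in shard:                      # work done by one rank
            updated[i] = params[i] - lr * grads[i]
    return updated

new = zero1_sgd_step([1.0, 2.0, 3.0, 4.0], [1.0] * 4, world_size=2)
```

Gradients and parameters stay replicated as in plain data parallelism; only the optimizer state (and the update work) is partitioned, which is what makes ZeRO-1 cheap to layer under tensor parallelism.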