1. bw-1000 (Hygon)

Timeline: machines that already have the liuda image: 11, 12, 23

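  • With FlagGems enabled: most likely runs, but with a pile of assorted errors/warnings. Hit this once by chance; the log is in the chat with Jia Pengfei (贾鹏飞), 6.25 at 18:13.
  • With FlagGems disabled: does not run. Two different kinds of errors showed up: /Users/joker/Desktop/project/baai/vllm-plugin-FL/tmp/error-3.log (run with VLLM_FL_PREFER_ENABLED=false) and /Users/joker/Desktop/project/baai/vllm-plugin-FL/tmp/error-4.log (run with USE_FLAGGEMS=0). Reference: /workspace/liuda/pd_disaggregation/one_node/serve.sh
  • FlagCX compiled successfully on the Hygon machine ✅ 2026-06-25. Pitfall:
  • You have to locate cuda.h and nccl.h on the machine first, then set USE_DU, before make will work.
  • Migrate the FlagCX connector into vllm-plugin-FL.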
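
To compare the two failing runs at a glance, it can help to tally how many lines in each log mention an error or a warning. A minimal sketch; the helper name count_log_issues is hypothetical, and matching on the plain words "error" and "warning" is an assumption about how the logs are formatted:

```shell
# Hypothetical helper: count lines mentioning "error" or "warning" (case-insensitive).
count_log_issues() {
    local log_file="$1"
    local errors
    local warnings
    # grep -c prints 0 and exits 1 when nothing matches; "|| true" ignores that status.
    errors=$(grep -ci 'error' "$log_file" || true)
    warnings=$(grep -ci 'warning' "$log_file" || true)
    printf '%s: errors=%s warnings=%s\n' "$log_file" "$errors" "$warnings"
}

# Example with the log paths from the notes above (not executed here):
# count_log_issues /Users/joker/Desktop/project/baai/vllm-plugin-FL/tmp/error-3.log
# count_log_issues /Users/joker/Desktop/project/baai/vllm-plugin-FL/tmp/error-4.log
```

A line that contains both words is counted once in each tally, so the numbers are only a rough signal.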

Machine commands, environment variables, and proxy

hy-smi

export HIP_VISIBLE_DEVICES=2,3
proxy
export http_proxy=http://10.11.2.2:1080
export https_proxy=http://10.11.2.2:1080
 
git config --global http.proxy http://10.11.2.2:1080
git config --global https.proxy http://10.11.2.2:1080
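
Since curl honors http_proxy, a request to a server on localhost may end up going through the proxy unless loopback addresses are excluded. A small sketch of a toggle, using the proxy address from the lines above; the function names proxy_on and proxy_off are hypothetical, and the no_proxy list is an assumption about which hosts should bypass the proxy:

```shell
# Proxy address taken from the notes above; the function names are hypothetical.
PROXY_URL="http://10.11.2.2:1080"

proxy_on() {
    export http_proxy="$PROXY_URL"
    export https_proxy="$PROXY_URL"
    # Keep loopback traffic (e.g. a local server on localhost:8000) off the proxy.
    export no_proxy="localhost,127.0.0.1"
}

proxy_off() {
    unset http_proxy https_proxy no_proxy
}
```

Calling proxy_on before downloads and proxy_off before local tests keeps the two use cases apart.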

Image and mounts

Image
docker pull harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.20.0-ubuntu22.04-dtk26.04-py3.10-MiniCPM-V-4.6
 
docker run \
--name liuda \
--network=host \
--ipc=host \
--device=/dev/kfd \
--device=/dev/mkfd \
--device=/dev/dri \
-v /opt/hyhal:/opt/hyhal \
-v /public-flash:/workspace \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-itd harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.20.0-ubuntu22.04-dtk26.04-py3.10-MiniCPM-V-4.6-liuda \
/bin/bash
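
Before launching the container, it may be worth checking that the device nodes and the host directory passed to docker run actually exist on this machine. A minimal sketch; the helper name check_device_nodes is hypothetical, and the paths in the usage comment are the ones from the command above:

```shell
# Hypothetical helper: print each path that is missing; return non-zero if any is.
check_device_nodes() {
    local missing=0
    local node
    for node in "$@"; do
        if [ ! -e "$node" ]; then
            echo "missing: $node"
            missing=1
        fi
    done
    return "$missing"
}

# Paths taken from the docker run command above (not executed here):
# check_device_nodes /dev/kfd /dev/mkfd /dev/dri /opt/hyhal
```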
 

Resource preparation

# Download the earlier command bundle
modelscope download leodaaa/leoda pd_shell.tar.gz .
 
# Install a few tools I need
apt-get update && \
    apt-get install -y --no-install-recommends \
        net-tools \
        vim \
        git \
        zoxide \
        wget
 
# Install vllm-plugin-FL and FlagGems
pip install -U scikit-build-core==0.11 pybind11 ninja cmake
 
git clone https://github.com/leoda1/vllm-plugin-FL
# Enter the repo before checking out the branch and installing
cd vllm-plugin-FL
git checkout optim
pip install --no-build-isolation -e .
git clone https://github.com/flagos-ai/FlagGems
cd FlagGems
pip install --no-build-isolation -e .
 
# Download FlagCX
git clone https://github.com/flagos-ai/FlagCX.git
# Submodules must be initialized from inside the cloned repo
cd FlagCX
git submodule update --init --recursive
 
find /opt -name "cuda.h" 2>/dev/null
find /opt -name "nccl.h" 2>/dev/null
 
# Point make at the directories that contain cuda.h and nccl.h
make USE_DU=1 \
  DEVICE_HOME=/opt/dtk-26.04/cuda/cuda \
  CCL_HOME=/opt/dtk-26.04/cuda/cuda \
  -j$(nproc)
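
The DEVICE_HOME and CCL_HOME values above were found by hand. A small sketch of deriving such a directory from a header path returned by find; the helper name home_from_header is hypothetical, and it assumes the header sits directly in an include/ directory under the toolkit root, which the notes do not state explicitly:

```shell
# Hypothetical helper: map <home>/include/<header> to <home>.
# Assumes the header sits directly in an include/ directory under the toolkit root.
home_from_header() {
    dirname "$(dirname "$1")"
}

# Example wiring (not executed here):
# DEVICE_HOME=$(home_from_header "$(find /opt -name cuda.h 2>/dev/null | head -n 1)")
# CCL_HOME=$(home_from_header "$(find /opt -name nccl.h 2>/dev/null | head -n 1)")
```

If find returns several matches, head -n 1 simply takes the first, so the result should be checked against the expected DTK version.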
 
# Toggle FlagGems (0 = off)
export USE_FLAGGEMS=0
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "/models/Qwen3.6-35B-A3B",
  "messages": [
    {
      "role": "user",
      "content": "介绍一下北京"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 256
}'

2. mc550 (MetaX)

mx-smi