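1. bw-1000 (Hygon)
Timeline: machines that already have the liuda image: 11, 12, 23
- With FlagGems enabled, it most likely runs, but with a pile of assorted errors/warnings. Hit this once by chance; the log is in the chat with Jia Pengfei, 6.25 at 18:13.
- With FlagGems disabled, it does not run. Two different types of errors appeared: /Users/joker/Desktop/project/baai/vllm-plugin-FL/tmp/error-3.log (with VLLM_FL_PREFER_ENABLED=false) and /Users/joker/Desktop/project/baai/vllm-plugin-FL/tmp/error-4.log (with USE_FLAGGEMS=0). Reference: /workspace/liuda/pd_disaggregation/one_node/serve.sh
- FlagCX compiled successfully on the Hygon machine ✅ 2026-06-25. Pitfall:
- cuda.h and nccl.h have to be located on the machine first, and USE_DU has to be specified, before make works
- Migrate the flagcx connector into vllm-plugin-fl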
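To compare the two FlagGems-off runs side by side, one option is to launch the same command under each setting and keep one log per case. A minimal sketch, assuming bash: `run_case` and `LOG_DIR` are hypothetical names and not part of vllm-plugin-FL; only the two environment variables come from the notes above.

```shell
# Hypothetical helper: run a command with one extra environment setting and
# capture stdout and stderr in a log file named after the case.
LOG_DIR="${LOG_DIR:-./tmp}"

run_case() {
    # $1 = case name (log file stem), $2 = VAR=value, rest = command to run
    local name="$1" setting="$2" rc
    shift 2
    mkdir -p "$LOG_DIR"
    env "$setting" "$@" >"$LOG_DIR/$name.log" 2>&1
    rc=$?
    echo "$name: exit status $rc, log at $LOG_DIR/$name.log"
    return "$rc"
}

# Example (not executed here; serve.sh stands for the script referenced above):
# run_case prefer-off VLLM_FL_PREFER_ENABLED=false bash serve.sh
# run_case gems-off   USE_FLAGGEMS=0               bash serve.sh
```

Keeping the setting as a single `VAR=value` argument means the variable is only visible to that one run and does not leak into the shell session.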
Machine commands, environment variables, and proxy
hy-smi
export HIP_VISIBLE_DEVICES=2,3
export http_proxy=http://10.11.2.2:1080
export https_proxy=http://10.11.2.2:1080
git config --global http.proxy http://10.11.2.2:1080
git config --global https.proxy http://10.11.2.2:1080
Image and mounts
docker pull harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.20.0-ubuntu22.04-dtk26.04-py3.10-MiniCPM-V-4.6
docker run \
--name liuda \
--network=host \
--ipc=host \
--device=/dev/kfd \
--device=/dev/mkfd \
--device=/dev/dri \
-v /opt/hyhal:/opt/hyhal \
-v /public-flash:/workspace \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-itd harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.20.0-ubuntu22.04-dtk26.04-py3.10-MiniCPM-V-4.6-liuda \
/bin/bash
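Before starting the container it can help to confirm that the device nodes and mount sources actually exist on the host. A small sketch, assuming bash; `missing_paths` is a hypothetical helper name, and the paths in the example are the ones passed to docker run above.

```shell
# Hypothetical pre-flight check: print every argument that does not exist
# on the host, so a missing device node or mount source is spotted before
# the container is started.
missing_paths() {
    local p
    for p in "$@"; do
        [ -e "$p" ] || printf '%s\n' "$p"
    done
}

# Example (not executed here):
# missing_paths /dev/kfd /dev/mkfd /dev/dri /opt/hyhal /public-flash
```

Empty output means every path was found; anything printed is a path to fix first.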
Resource preparation
# Download the command package from earlier
modelscope download leodaaa/leoda pd_shell.tar.gz .
# Install some tools for personal use
apt-get update && \
apt-get install -y --no-install-recommends \
net-tools \
vim \
git \
zoxide \
wget
# Install vllm-plugin-FL and FlagGems
pip install -U scikit-build-core==0.11 pybind11 ninja cmake
git clone https://github.com/leoda1/vllm-plugin-FL
# Enter the repo before switching branches and installing
cd vllm-plugin-FL
git checkout optim
pip install --no-build-isolation -e .
git clone https://github.com/flagos-ai/FlagGems
cd FlagGems
pip install --no-build-isolation -e .
# Download FlagCX
git clone https://github.com/flagos-ai/FlagCX.git
# Submodules and make must be run from inside the repo
cd FlagCX
git submodule update --init --recursive
find /opt -name "cuda.h" 2>/dev/null
find /opt -name "nccl.h" 2>/dev/null
# Point make at the locations of cuda.h and nccl.h
make USE_DU=1 \
DEVICE_HOME=/opt/dtk-26.04/cuda/cuda \
CCL_HOME=/opt/dtk-26.04/cuda/cuda \
-j$(nproc)
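The `find` commands return header paths, while make expects home directories. A sketch of deriving one from the other, assuming the header sits directly under `<home>/include/` (an assumption about the layout, not confirmed by the notes); `home_from_header` is a hypothetical helper name.

```shell
# Hypothetical helper: turn a header path such as <home>/include/cuda.h
# into <home>. Fails if the header is not directly inside an include/
# directory, because the layout assumption then does not hold.
home_from_header() {
    local header="$1" incdir
    incdir="$(dirname "$header")"
    if [ "$(basename "$incdir")" != "include" ]; then
        echo "unexpected layout: $header" >&2
        return 1
    fi
    dirname "$incdir"
}

# Example (not executed here):
# DEVICE_HOME="$(home_from_header "$(find /opt -name cuda.h 2>/dev/null | head -n 1)")"
```

If `find` reports several copies of a header, the first hit may not be the right one, so the derived value is worth checking by eye before passing it to make.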
# Toggle FlagGems
export USE_FLAGGEMS=0
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/models/Qwen3.6-35B-A3B",
"messages": [
{
"role": "user",
"content": "Tell me about Beijing"
}
],
"temperature": 0.7,
"max_tokens": 256
}'
2. mc550 (MetaX)
mx-smi