2. Running DeepSeek V3.2 on vLLM with vllm-plugin-FL, FlagGems, and FlagCX

0. Preface

Core question 1: how does vLLM move the KV cache over NCCL? See "0. How vLLM uses NCCL to transfer the KV cache" ✅ 2026-03-18

1. Progress

Test report: https://infrawaves.feishu.cn/wiki/S5NzwFiuni0vbukR0vQceVKKnIe
Code PR: https://github.com/flagos-ai/vllm-plugin-FL/pull/100

2. Plans

Plan A

Set the following environment variables:

FLAGCX_PATH
USE_FLAGGEMS
VLLM_NCCL_SO_PATH

Go through the NCCL connector. vLLM's upper layers are unaware of the underlying communication library; we simply point VLLM_NCCL_SO_PATH at the libnccl.so built by the NCCL wrapper inside FlagCX.
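As a minimal sketch of the Plan A setup, the three variables could be assembled in one place before vLLM is launched. The values here are assumptions for illustration only: `/opt/FlagCX` is a placeholder checkout path, `"1"` is assumed to be the value that enables FlagGems, the location `plugin/nccl/libnccl.so` under the FlagCX root is taken from the flow diagram rather than from a verified build output, and `plan_a_env` is a hypothetical helper.

```python
import os
from pathlib import Path


def plan_a_env(flagcx_root: str) -> dict:
    """Build the Plan A environment. All values are illustrative assumptions."""
    root = Path(flagcx_root)
    return {
        # Root of the FlagCX checkout/build (placeholder path).
        "FLAGCX_PATH": str(root),
        # Assumed on/off switch; "1" is a guess at the enabling value.
        "USE_FLAGGEMS": "1",
        # vLLM loads whatever shared object this points at as "NCCL".
        "VLLM_NCCL_SO_PATH": str(root / "plugin" / "nccl" / "libnccl.so"),
    }


if __name__ == "__main__":
    env = plan_a_env("/opt/FlagCX")
    os.environ.update(env)  # must happen before vLLM is imported or launched
    for key, value in env.items():
        print(f"{key}={value}")
```

Keeping the three values in one helper makes it harder to end up with FLAGCX_PATH and VLLM_NCCL_SO_PATH pointing at different FlagCX builds.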

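Current behavior of plugin/nccl/nccl_wrapper.cc:

flagcxUniqueId is 256 bytes
The ncclUniqueId used by vLLM's NCCL connector is 128 bytes
The wrapper copies the first 128 bytes to the NCCL side
In ncclCommInitRank, it fills those 128 bytes back into a flagcxUniqueId

Plan A holds only if:

The first 128 bytes of FlagCX's current bootstrap/comm unique ID are sufficient
Or, put another way, this truncation is safe under the current adaptor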
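The truncation and back-fill described above can be modelled on plain byte strings to make the Plan A precondition concrete. This is a sketch, not the wrapper's code (the real logic lives in C++ in nccl_wrapper.cc). It assumes the 128 bytes that are not carried across are zero-filled on the way back, and the helper names are hypothetical.

```python
FLAGCX_UNIQUE_ID_BYTES = 256
NCCL_UNIQUE_ID_BYTES = 128


def to_nccl_id(flagcx_id: bytes) -> bytes:
    """Keep only the first 128 bytes, as the wrapper does toward the NCCL side."""
    if len(flagcx_id) != FLAGCX_UNIQUE_ID_BYTES:
        raise ValueError("flagcxUniqueId must be exactly 256 bytes")
    return flagcx_id[:NCCL_UNIQUE_ID_BYTES]


def to_flagcx_id(nccl_id: bytes) -> bytes:
    """Back-fill a 256-byte ID; the tail is zero-filled (an assumption)."""
    if len(nccl_id) != NCCL_UNIQUE_ID_BYTES:
        raise ValueError("ncclUniqueId must be exactly 128 bytes")
    return nccl_id + bytes(FLAGCX_UNIQUE_ID_BYTES - NCCL_UNIQUE_ID_BYTES)


def truncation_is_lossless(flagcx_id: bytes) -> bool:
    """True when the round trip reproduces the original 256 bytes."""
    return to_flagcx_id(to_nccl_id(flagcx_id)) == flagcx_id


if __name__ == "__main__":
    safe_id = bytes([7]) * 128 + bytes(128)       # tail is all zeros
    unsafe_id = bytes([7]) * 128 + bytes([1]) * 128
    print(truncation_is_lossless(safe_id))
    print(truncation_is_lossless(unsafe_id))
```

A lossless round trip is a sufficient condition for Plan A. It may be stricter than necessary if the active adaptor never reads the trailing bytes, which is the second form of the precondition above.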

Flow diagram

[vLLM unchanged]
vLLM P2pNcclConnector
        |
        v
vLLM P2pNcclEngine
        |
        v
vLLM NCCLLibrary(ctypes)
        |
        v
FlagCX plugin/nccl/libnccl.so   <- NCCL ABI compatibility is provided here
        |
        v
libflagcx.so
        |
        v
FlagCX device adaptor / net adaptor

Plan B

Layer 1: shared logic layer

Create a new shared base class, e.g. P2pConnectorCommon.

This layer holds only backend-independent logic:

ReqMeta and metadata organization
parse_request_id
chunked_prefill
_requests_need_load
build_connector_meta
get_num_new_matched_tokens
update_state_after_alloc
KV layout handling for DeepSeek V3.2
extract_kv_from_layer / inject_kv_into_layer

This layer does not touch NCCL/FlagCX.

Layer 2: backend connector layer

Keep P2pNcclConnector.
Add P2pFlagcxConnector.

Both inherit from the shared layer and replace only the data plane on the worker side:

start_load_kv
save_kv_layer
wait_for_save
get_finished

Scheduler-side logic is mostly reused.
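The split between layer 1 and layer 2 might look roughly like the skeleton below. It is a sketch under several assumptions: the method signatures are heavily simplified and are not vLLM's real connector signatures, scheduler and worker roles are collapsed into one object for brevity, and `RecordingEngine` with its `recv`/`send`/`flush`/`finished` methods is a hypothetical stand-in for the engine layer.

```python
from abc import ABC, abstractmethod


class P2pConnectorCommon(ABC):
    """Layer 1: backend-independent logic (simplified, hypothetical skeleton)."""

    def __init__(self, engine):
        self.engine = engine            # layer-3 object, injected by the caller
        self._requests_need_load = {}   # request_id -> number of external tokens

    def update_state_after_alloc(self, request_id, num_external_tokens):
        # Scheduler side: remember which requests still need their KV loaded.
        if num_external_tokens > 0:
            self._requests_need_load[request_id] = num_external_tokens

    # Worker-side data plane: the only part each backend overrides.
    @abstractmethod
    def start_load_kv(self): ...

    @abstractmethod
    def save_kv_layer(self, layer_name, kv_layer): ...

    @abstractmethod
    def wait_for_save(self): ...

    @abstractmethod
    def get_finished(self, finished_req_ids): ...


class P2pFlagcxConnector(P2pConnectorCommon):
    """Layer 2: forwards data-plane calls to a FlagCX-backed engine."""

    def start_load_kv(self):
        for request_id in list(self._requests_need_load):
            self.engine.recv(request_id)
            del self._requests_need_load[request_id]

    def save_kv_layer(self, layer_name, kv_layer):
        self.engine.send(layer_name, kv_layer)

    def wait_for_save(self):
        self.engine.flush()

    def get_finished(self, finished_req_ids):
        return self.engine.finished(finished_req_ids)


class RecordingEngine:
    """Stand-in for the engine layer; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def recv(self, request_id):
        self.calls.append(("recv", request_id))

    def send(self, layer_name, kv_layer):
        self.calls.append(("send", layer_name))

    def flush(self):
        self.calls.append(("flush",))

    def finished(self, finished_req_ids):
        return set(finished_req_ids)
```

With this shape, a P2pNcclConnector would be a second subclass of the same base, and swapping backends would not touch the scheduler-side bookkeeping.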

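Layer 3: backend engine layer

Keep P2pNcclEngine.
Add P2pFlagcxEngine.

P2pFlagcxEngine is responsible for:

Connection setup / handshake
Communicator management
Send queue / receive queue
Tensor memory pool
dtype mapping
Stream adaptation
send/recv scheduling
Resource teardown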

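How to implement `P2pFlagcxEngine`

The control plane keeps reusing the existing ZMQ router/dealer design
The data plane switches to calling FLAGCXLibrary directly

That is:

ZMQ handles:

Peer discovery
Exchanging the unique ID
Passing shape/dtype/tensor_id
Sending the NEW/PUT/GET control commands

FlagCX handles:

comm init
The actual data send
The actual data receive

This keeps the behavior as close as possible to the current P2pNcclEngine.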
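To illustrate what travels on the control plane versus the data plane, here is a small sketch of a control message carrying the fields listed above. The wire format is not specified in these notes, so JSON is used purely as a stand-in, and `ControlMessage`, `encode`, `decode`, and the `unique_id_hex` field are hypothetical names. The tensor bytes themselves never appear in this message; they would go through FlagCX send/recv.

```python
import json
from dataclasses import asdict, dataclass

CONTROL_COMMANDS = ("NEW", "PUT", "GET")


@dataclass(frozen=True)
class ControlMessage:
    """Metadata sent over ZMQ; field names are illustrative assumptions."""

    cmd: str                 # one of NEW / PUT / GET
    tensor_id: str = ""
    shape: tuple = ()
    dtype: str = ""
    unique_id_hex: str = ""  # hex of the 256-byte flagcxUniqueId, used with NEW

    def __post_init__(self):
        if self.cmd not in CONTROL_COMMANDS:
            raise ValueError(f"unknown control command: {self.cmd!r}")


def encode(msg: ControlMessage) -> bytes:
    """Serialize a control message to bytes (JSON stand-in)."""
    return json.dumps(asdict(msg)).encode("utf-8")


def decode(payload: bytes) -> ControlMessage:
    """Parse bytes back into a control message."""
    fields = json.loads(payload.decode("utf-8"))
    fields["shape"] = tuple(fields["shape"])  # JSON turns tuples into lists
    return ControlMessage(**fields)


if __name__ == "__main__":
    new_msg = ControlMessage(cmd="NEW", unique_id_hex=bytes(256).hex())
    put_msg = ControlMessage(cmd="PUT", tensor_id="req-0#layer.0",
                             shape=(2, 16, 128), dtype="bfloat16")
    for msg in (new_msg, put_msg):
        assert decode(encode(msg)) == msg
    print(len(bytes.fromhex(new_msg.unique_id_hex)))
```

Carrying the full 256-byte ID in the NEW message is what separates Plan B from Plan A: nothing has to fit into a 128-byte ncclUniqueId.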

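How to implement `flagcx_wrapper.py`

The current FlagCX/plugin/interservice/flagcx_wrapper.py needs at least the following:

Remove the hard-coded dependency on old_stream.cuda_stream and replace it with a platform bridge layer.
Define explicit serialization/deserialization interfaces for flagcxUniqueId (256 B).
Add lifecycle interfaces such as communicator finalize/destroy/error query.
Turn stream create/copy/free/sync into a unified abstraction instead of implicitly assuming CUDA.
Expose a more explicit runtime config interface rather than just a thin ctypes wrapper.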
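One possible shape for the platform bridge in the first item is a small registry that maps a platform name to a function returning the raw stream handle. This is a sketch only: `register_stream_bridge` and `raw_stream_handle` are hypothetical names, and the only platform registered is CUDA, using the `cuda_stream` attribute the wrapper already relies on. Other platforms would register their own accessor.

```python
from typing import Any, Callable, Dict

# platform name -> function that extracts a raw stream handle (an integer)
_STREAM_BRIDGES: Dict[str, Callable[[Any], int]] = {}


def register_stream_bridge(platform: str, get_handle: Callable[[Any], int]) -> None:
    """Register how to obtain a raw stream handle on a given platform."""
    _STREAM_BRIDGES[platform] = get_handle


def raw_stream_handle(platform: str, stream: Any) -> int:
    """Return the raw handle for `stream`, or fail loudly if unsupported."""
    try:
        bridge = _STREAM_BRIDGES[platform]
    except KeyError:
        raise RuntimeError(f"no stream bridge registered for {platform!r}") from None
    return int(bridge(stream))


# CUDA keeps today's behavior: read the `cuda_stream` attribute of the stream.
register_stream_bridge("cuda", lambda stream: stream.cuda_stream)


if __name__ == "__main__":
    class FakeCudaStream:
        """Test double exposing the same attribute as a CUDA stream object."""
        cuda_stream = 0x1234

    print(hex(raw_stream_handle("cuda", FakeCudaStream())))
```

The wrapper would then call `raw_stream_handle(platform, stream)` wherever it currently reads `old_stream.cuda_stream`, so adding a platform means adding one registration rather than editing call sites.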

Flow diagram

Scheduler side
  Shared P2P Connector Logic
        |
        v
Worker side P2pFlagcxConnector
        |
        v
P2pFlagcxEngine
   |                |
   |                +--> TensorMemoryPool
   |
   +--> ZMQ control plane
   |
   +--> FLAGCXLibrary (Python binding / runtime binding)
                     |
                     v
                 libflagcx.so
                     |
                     v
        FlagCX device adaptor / net adaptor / protocol
