enqueue struct design

1. Entrance

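ncclKernelPlan needs one new field: a pointer to a struct that holds a pre-condition, ready (the condition under which the kernel-less progress may start executing), and a post-condition, proxyOpCount (a counter that is decremented as the kernel-less progress finishes work). finishPlan allocates this struct with new, as shown below, and stores it in the ncclKernelPlan pointer so that it is not reclaimed while it is still in use.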

struct psmSyncCondition {
  std::atomic<int>* proxyOpCount;    // The total number of proxyOp's that have been enqueued in this plan.
  std::atomic<int>* proxyReadyEvent; // Event that the proxy thread polls before it starts progressing.
  bool proxyReadyEventSet;           // true if the proxyReadyEvent has been set
};

The field added to ncclKernelPlan is:

struct ncclKernelPlan {
  ...
  struct psmSyncCondition* syncCondition;
  ...
};

In finishPlan, the sync condition is allocated with new for kernel-less tasks:

if (ncclParamPassSm()) {
  plan->syncCondition = new psmSyncCondition;

  plan->syncCondition->proxyReadyEvent = new std::atomic<int>(0);
  plan->syncCondition->proxyOpCount = new std::atomic<int>(proxyOpCnt);
  plan->syncCondition->proxyReadyEventSet = false;
}
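Because the sync condition is built from three separate new expressions, it needs a matching release path. The following is a minimal sketch of one way to pair the two. allocSyncCondition and freeSyncCondition are hypothetical helper names that do not appear in this design, and the point at which the plan would call the free helper is an assumption (some time after the host callback has returned).

```cpp
#include <atomic>
#include <cassert>

// Same layout as the struct defined above.
struct psmSyncCondition {
  std::atomic<int>* proxyOpCount;
  std::atomic<int>* proxyReadyEvent;
  bool proxyReadyEventSet;
};

// Hypothetical helper: mirrors the allocation done in finishPlan.
psmSyncCondition* allocSyncCondition(int proxyOpCnt) {
  auto* cond = new psmSyncCondition;
  cond->proxyReadyEvent = new std::atomic<int>(0);
  cond->proxyOpCount = new std::atomic<int>(proxyOpCnt);
  cond->proxyReadyEventSet = false;
  return cond;
}

// Hypothetical counterpart: releases all three allocations and clears the
// caller's pointer so the plan cannot reuse a dangling sync condition.
void freeSyncCondition(psmSyncCondition*& cond) {
  if (cond == nullptr) return;
  // Assumption: by the time the plan is reclaimed, every proxy op is done.
  assert(cond->proxyOpCount->load(std::memory_order_acquire) == 0);
  delete cond->proxyReadyEvent;
  delete cond->proxyOpCount;
  delete cond;
  cond = nullptr;
}
```

Taking the pointer by reference lets the helper reset plan->syncCondition to nullptr, which makes a second call harmless.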

2. Launch stage

In ncclLaunchKernel, kernel-less tasks are handled as follows:

if (ncclParamPassSm() &&
    plan->kernelFn == ncclDevKernelForFunc[ncclDevFuncId_P2p()]) {
  // Kernel-less path: enqueue a host function with cudaLaunchHostFunc() instead of launching a kernel.
  CUDACHECKGOTO(cudaLaunchHostFunc(launchStream, hostProxySyncCallback, plan->syncCondition), ret, do_return);
  goto do_return;
}

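The host function stores 1 into the proxyReadyEvent atomic inside plan->syncCondition. The underlying progress functions then see the 1 and start moving data. Each time one args completes, proxyOpCount is decremented by that args' nsubs. The host function waits until proxyOpCount reaches 0 and then returns, which tells launchStream that this host task has finished.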
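The handshake described above can be sketched without CUDA as follows. The body of hostProxySyncCallback is an assumption reconstructed from the description, not the actual implementation. simulateLaunch is a hypothetical test harness in which a std::thread stands in for the proxy thread and the callback is called directly instead of through cudaLaunchHostFunc. The sketch also assumes that the initial proxyOpCount equals the sum of nsubs over all args.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Same layout as the struct in section 1.
struct psmSyncCondition {
  std::atomic<int>* proxyOpCount;
  std::atomic<int>* proxyReadyEvent;
  bool proxyReadyEventSet;  // left untouched here; this note does not say who updates it
};

// Assumed callback body. The signature, void fn(void* userData), is the one
// cudaLaunchHostFunc expects for its host function argument.
void hostProxySyncCallback(void* userData) {
  auto* cond = static_cast<psmSyncCondition*>(userData);
  assert(cond != nullptr);
  // Open the gate: progress may start moving data.
  cond->proxyReadyEvent->store(1, std::memory_order_release);
  // Hold the stream until every proxy op has been counted down.
  while (cond->proxyOpCount->load(std::memory_order_acquire) > 0) {
    std::this_thread::yield();
  }
}

// Hypothetical harness: returns the counter value left after the callback returns.
int simulateLaunch(int numArgs, int nsubs) {
  psmSyncCondition cond{new std::atomic<int>(numArgs * nsubs),
                        new std::atomic<int>(0), false};
  std::thread proxy([&cond, numArgs, nsubs] {
    // Stand-in for progress: wait for the gate, then complete each args.
    while (!cond.proxyReadyEvent->load(std::memory_order_acquire)) {
      std::this_thread::yield();
    }
    for (int i = 0; i < numArgs; ++i) {
      cond.proxyOpCount->fetch_sub(nsubs, std::memory_order_acq_rel);
    }
  });
  hostProxySyncCallback(&cond);  // stands in for the stream running the host function
  proxy.join();
  int remaining = cond.proxyOpCount->load();
  delete cond.proxyOpCount;
  delete cond.proxyReadyEvent;
  return remaining;
}
```

The release store paired with the acquire load in progress ensures that anything written before the gate opened is visible to the proxy thread once it sees the 1. The yield loop is the simplest choice for a sketch; a real callback would block the stream for as long as the transfer takes.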

3. Progress

When the proxy actually moves data, whether over net or p2p, progress only needs the following check on entry:

if(!args->readyEvent->load(std::memory_order_acquire)) return ncclSuccess;
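To show where this gate sits relative to the counter update from section 2, here is a minimal single-threaded sketch. DemoProxyArgs and demoProgress are hypothetical names, the done flag is an illustrative simplification of the real completion logic, and ncclResult_t is redefined locally only so that the example compiles without NCCL headers.

```cpp
#include <atomic>
#include <cassert>

enum ncclResult_t { ncclSuccess = 0 };  // local stand-in for NCCL's result type

// Hypothetical, trimmed-down proxy args holding only what this note mentions.
struct DemoProxyArgs {
  std::atomic<int>* readyEvent;    // assumed to point at syncCondition->proxyReadyEvent
  std::atomic<int>* proxyOpCount;  // assumed to point at syncCondition->proxyOpCount
  int nsubs;                       // number of sub-operations carried by this args
  bool done;                       // illustrative: set once this args has completed
};

ncclResult_t demoProgress(DemoProxyArgs* args) {
  assert(args != nullptr);
  // Gate: do nothing until the host function has stored 1.
  if (!args->readyEvent->load(std::memory_order_acquire)) return ncclSuccess;
  if (args->done) return ncclSuccess;  // count each args only once
  // ... the actual net or p2p data movement would happen here ...
  args->done = true;
  // Report completion: subtract this args' nsubs from the plan-wide counter.
  args->proxyOpCount->fetch_sub(args->nsubs, std::memory_order_release);
  return ncclSuccess;
}
```

Returning ncclSuccess while the gate is closed lets the proxy loop simply call progress again later, so no extra waiting primitive is needed on the proxy side.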
