Fault Tolerance

VCCL’s fault-tolerance mechanism ensures that, in the event of NIC down or switch failures, distributed training can be recovered and continue within a single iteration, significantly improving the reliability and availability of large-scale clusters.

Overview

Fault-tolerance Capabilities

  • Failure Detection: Automatically detects node and link failures.
  • Automatic Recovery: Transparent failure recovery mechanisms.
  • Zero Downtime: In-place recovery within a single iteration.
  • High Compatibility: Highly compatible with traditional solutions.

Supported Failure Types

Failure TypeRecovery StrategyRecovery Time
NIC downFault toleranceWithin 1 iteration
Switch failureFault toleranceWithin 1 iteration
NIC flapAvoid excessive re-attachmentHandled by hardware retransmission mechanisms
GPU failureNode isolationCheckpoint-based recovery

Configuration

Basic Enablement

# Enable fault-tolerance feature (disabled by default)
export NCCL_ENABLE_FAULT_TOLERANCE=<0, 1>, default is 0 (disabled).
 
# NIC configuration must be specified
export NCCL_IB_HCA=="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1" according to runtime environment.

Advanced Configuration

# Set retry count (default 7)
export NCCL_IB_RETRY_COUNT=7
 
# Set timeout in seconds (default 18)
export NCCL_IB_TIMEOUT=18

!!! warning “NIC configuration requirement” The fault-tolerance feature requires the NCCL_IB_HCA environment variable to be specified; otherwise it will not function correctly.

!!! info “Advanced configuration” Setting advanced parameters beyond reasonable ranges may affect behavior.