AI Workloads & Collective-Communication Mathematics
Expert fabric design starts from the traffic, not the topology. This module dissects distributed training at the wire level: data, tensor, pipeline, and expert parallelism and the collectives each one generates (AllReduce, AllGather, ReduceScatter, All-to-All). It then builds the quantitative toolkit: the ring-AllReduce per-rank data volume 2S(N-1)/N for an S-byte message across N ranks, algorithm bandwidth versus bus bandwidth, halving-doubling and tree variants, and how NCCL selects algorithm and protocol (Simple/LL/LL128) per message size. You finish able to convert a training-job spec into a fabric bandwidth and latency requirement with defensible arithmetic.
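As a minimal sketch of the arithmetic named above, the following Python computes the per-rank ring-AllReduce wire volume 2S(N-1)/N and converts an algorithm bandwidth into the bus bandwidth that nccl-tests reports for AllReduce (busbw = algbw × 2(N-1)/N). The function names and the 50 ms timing are hypothetical values chosen for illustration, not measurements.

```python
def ring_allreduce_bytes_per_rank(size_bytes: float, n_ranks: int) -> float:
    """Bytes each rank sends (and receives) in a ring AllReduce: 2S(N-1)/N."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    return 2 * size_bytes * (n_ranks - 1) / n_ranks


def algbw(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth: message size divided by completion time."""
    return size_bytes / seconds


def allreduce_busbw(algbw_value: float, n_ranks: int) -> float:
    """Bus bandwidth for AllReduce: algbw scaled by 2(N-1)/N."""
    return algbw_value * 2 * (n_ranks - 1) / n_ranks


S = 1 << 30            # 1 GiB message
N = 8                  # ranks
per_rank = ring_allreduce_bytes_per_rank(S, N)

assumed_time_s = 0.05  # hypothetical completion time, for illustration only
a = algbw(S, assumed_time_s)
b = allreduce_busbw(a, N)

print(f"per-rank wire volume: {per_rank / 2**30:.3f} GiB")
print(f"algbw: {a / 1e9:.2f} GB/s, busbw: {b / 1e9:.2f} GB/s")
```

The 2(N-1)/N factor approaches 2 as N grows, which is why busbw, not algbw, is the number to compare against link rate.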
- Learner can derive the wire-data volume and theoretical completion time of ring, tree, and halving-doubling AllReduce for a given message size, GPU count, and link rate, and identify the crossover points between algorithms.
- Learner can map data, tensor, pipeline, and expert (MoE) parallelism strategies to their dominant collective patterns and predict which fabric dimension (rail bandwidth, bisection, latency) each one stresses.
- Learner can interpret nccl-tests output (algbw vs busbw) and NCCL debug logs (NCCL_DEBUG=INFO) to determine the algorithm, protocol, and channel count actually selected on a live cluster.
- Learner can quantify the job-completion-time impact of a single degraded link or straggler node in a synchronous training job and justify fabric-level SLOs from that model.
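The degraded-link objective above can be sketched with a deliberately simple model. It assumes a ring AllReduce whose throughput is gated by the slowest link in the ring, a fully synchronous step, and no overlap between compute and communication. All of these are simplifying assumptions; real jobs overlap communication with the backward pass, so treat the result as an upper bound on the penalty. The names and numbers are hypothetical.

```python
def ring_effective_bandwidth(link_rates_bytes_per_s):
    """A ring moves data through every link, so the slowest link sets the pace."""
    return min(link_rates_bytes_per_s)


def step_time(compute_s, comm_bytes_per_rank, link_rates_bytes_per_s):
    """Synchronous step time, assuming no compute/communication overlap."""
    bw = ring_effective_bandwidth(link_rates_bytes_per_s)
    return compute_s + comm_bytes_per_rank / bw


def slowdown_from_one_bad_link(compute_s, comm_bytes_per_rank,
                               healthy_rate, n_links, degraded_fraction):
    """Ratio of step time with one degraded link to the healthy step time."""
    healthy = [healthy_rate] * n_links
    degraded = healthy[:-1] + [healthy_rate * degraded_fraction]
    return (step_time(compute_s, comm_bytes_per_rank, degraded)
            / step_time(compute_s, comm_bytes_per_rank, healthy))


# Hypothetical job: 300 ms compute, 5 GB per-rank wire volume, 50 GB/s links,
# 512 links in the ring, one of them running at half rate.
ratio = slowdown_from_one_bad_link(
    compute_s=0.3,
    comm_bytes_per_rank=5e9,
    healthy_rate=50e9,
    n_links=512,
    degraded_fraction=0.5,
)
print(f"step-time inflation from one half-rate link: {ratio:.2f}x")
```

Under these assumptions one link out of 512 at half rate lengthens every step, and therefore the whole job, by the same factor, which is the kind of arithmetic a fabric-level SLO can be justified from.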
Collective cost-model workbook against live nccl-tests
Build a spreadsheet or Python cost model for ring and tree AllReduce across 8 to 512 ranks, then validate its predictions against all_reduce_perf runs on the RKR GPU pod, reconciling modelled and measured busbw to within 10%.
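One possible starting point for the workbook is the textbook alpha-beta model sketched below, where alpha is a per-step latency and B a per-link bandwidth. The tree formula assumes a pipelined double binary tree with a bandwidth term of roughly 2S/B, and halving-doubling assumes a power-of-two rank count. These are common approximations, not NCCL's internal tuning model, and the alpha and B values are hypothetical placeholders to be replaced with fitted ones.

```python
import math


def t_ring(S, N, alpha, B):
    """Ring AllReduce: 2(N-1) steps, each moving S/N bytes."""
    return 2 * (N - 1) * alpha + 2 * S * (N - 1) / (N * B)


def t_tree(S, N, alpha, B):
    """Pipelined tree: log-depth latency, ~2S/B bandwidth term (assumption)."""
    return 2 * math.ceil(math.log2(N)) * alpha + 2 * S / B


def t_halving_doubling(S, N, alpha, B):
    """Halving-doubling, valid for power-of-two N."""
    return 2 * math.log2(N) * alpha + 2 * S * (N - 1) / (N * B)


def ring_tree_crossover(N, alpha, B):
    """Message size where t_ring == t_tree; tree wins below, ring above."""
    return alpha * N * B * ((N - 1) - math.ceil(math.log2(N)))


def busbw(S, N, seconds):
    """AllReduce bus bandwidth as defined by nccl-tests."""
    return (S / seconds) * 2 * (N - 1) / N


def within_tolerance(modelled, measured, tol=0.10):
    """True if the model is within tol (default 10%) of the measurement."""
    return abs(modelled - measured) <= tol * measured


ALPHA = 5e-6   # hypothetical per-step latency, seconds
B = 50e9       # hypothetical link bandwidth, bytes/s (400 Gb/s)

for n in (8, 16, 32, 64, 128, 256, 512):
    s_star = ring_tree_crossover(n, ALPHA, B)
    print(f"N={n:4d}  ring/tree crossover ~ {s_star / 2**20:10.1f} MiB")
```

Fitting alpha and B to a few measured points before comparing the remaining points keeps the 10% reconciliation honest: the model is then tested on data it was not tuned against.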
NCCL behaviour forensics
Using NCCL_DEBUG=INFO, NCCL_ALGO, and NCCL_PROTO overrides on a multi-node run, capture and explain three distinct algorithm/protocol selections, and demonstrate a message-size regime where forcing the wrong algorithm degrades busbw by >30%.
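The override sweep can be organised with a small helper like the sketch below. It only builds the environment and command line for each run and prints them; it does not launch anything. The binary path is a hypothetical placeholder, and the multi-node launcher (for example mpirun or srun) is left out because it depends on the cluster. Some algorithm/protocol pairs may be rejected or ignored by a given NCCL version, so check the NCCL_DEBUG=INFO log rather than assuming the override took effect.

```python
import itertools
import shlex

ALGOS = ["Ring", "Tree"]            # values accepted by NCCL_ALGO
PROTOS = ["Simple", "LL", "LL128"]  # values accepted by NCCL_PROTO


def build_sweep(binary="./build/all_reduce_perf",   # hypothetical path
                min_bytes="8", max_bytes="1G"):
    """Return (env, argv) pairs: one untuned baseline plus every override pair."""
    # -b/-e: min/max message size, -f: size multiplier, -g: GPUs per process
    argv = [binary, "-b", min_bytes, "-e", max_bytes, "-f", "2", "-g", "1"]
    runs = [({"NCCL_DEBUG": "INFO"}, argv)]  # baseline: NCCL picks for itself
    for algo, proto in itertools.product(ALGOS, PROTOS):
        env = {"NCCL_DEBUG": "INFO", "NCCL_ALGO": algo, "NCCL_PROTO": proto}
        runs.append((env, argv))
    return runs


def format_run(env, argv):
    """Render one run as a shell-style line for the lab notebook."""
    env_part = " ".join(f"{k}={v}" for k, v in env.items())
    return f"{env_part} {shlex.join(argv)}"


runs = build_sweep()
for env, argv in runs:
    print(format_run(env, argv))
```

Comparing each forced run's busbw column against the baseline at the same message size shows where the forced choice costs more than 30%.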