RKR Certified DataCenter Expert
Expert-tier AI/GPU fabric design: lossless Ethernet at cluster scale, from collective-communication math to a defended production architecture
The blueprint
RCDE builds the engineer who can stand in front of a 1,024-GPU training cluster RFP and own every layer of the answer: the AllReduce traffic model that sizes the fabric, the rail-optimized Clos that carries it, the PFC/ECN/DCQCN tuning that keeps it lossless at 95% load, the InfiniBand-vs-Ethernet/UEC decision defended with data, and the migration plan that gets a live estate there without an outage. Every domain terminates in a graded lab artifact — configs, telemetry captures, and a design document an architecture review board would sign.
Skill domains
6 assessed domains
Collective Communication & Traffic Engineering
- Ring/tree/halving-doubling AllReduce cost models: 2S(N-1)/N data-volume math and bus-vs-algorithm bandwidth
- NCCL topology awareness: NCCL_ALGO, NCCL_PROTO (Simple/LL/LL128), channel and rail mapping
- Job-completion-time sensitivity: tail-latency, straggler, and incast modelling for training vs inference
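The ring AllReduce data-volume term and the bus-vs-algorithm bandwidth distinction named above can be sketched in a few lines. This is a minimal illustration, not course material: the function names are hypothetical, S is the buffer size in bytes, N is the rank count, and the bus-bandwidth scaling follows the convention nccl-tests uses for AllReduce (algorithm bandwidth multiplied by 2(N-1)/N).

```python
def ring_allreduce_bytes_per_rank(size_bytes: float, n_ranks: int) -> float:
    """Bytes each rank sends (and receives) in a ring AllReduce: 2*S*(N-1)/N."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    return 2.0 * size_bytes * (n_ranks - 1) / n_ranks


def algorithm_bandwidth(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth: buffer size divided by wall-clock time (bytes/s)."""
    return size_bytes / seconds


def bus_bandwidth(size_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth for AllReduce: algorithm bandwidth scaled by 2*(N-1)/N."""
    return algorithm_bandwidth(size_bytes, seconds) * 2.0 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    S = 1 << 30  # 1 GiB buffer, an arbitrary example size
    for n in (8, 64, 1024):
        sent = ring_allreduce_bytes_per_rank(S, n)
        print(f"N={n:5d}  per-rank traffic = {sent / S:.4f} x S")
```

Per-rank traffic approaches 2S as N grows, which is why bus bandwidth, rather than algorithm bandwidth, is the figure to compare against link rate when sizing the fabric.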
Rail-Optimized Clos & Scale-Out Architecture
- 8-rail leaf design for HGX-class nodes: rail-aligned NIC-to-leaf wiring, 1:1 non-blocking spine sizing
- 51.2T-generation platform selection (Tomahawk 5, Jericho3-AI, Spectrum-4 class) and 400G/800G OSFP optics budgets
- Bisection-bandwidth, oversubscription, and failure-domain math from 256 to 32k GPUs, including multi-plane spine growth
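As a rough illustration of the rail-aligned sizing arithmetic in this domain, the sketch below counts leaves and spines for a two-tier, 1:1 non-blocking fabric. It rests on simplifying assumptions that are not from the blueprint: one NIC per GPU, one rail per GPU slot in the node, every port at the same speed, and each leaf splitting its ports evenly between downlinks and uplinks. The names `FabricPlan` and `size_rail_optimized_fabric` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricPlan:
    nodes: int
    leaves_per_rail: int
    total_leaves: int
    uplinks_per_leaf: int
    spines: int


def size_rail_optimized_fabric(gpus: int, gpus_per_node: int,
                               leaf_ports: int, spine_ports: int) -> FabricPlan:
    """Count switches for a two-tier, 1:1 rail-optimized Clos (simplified)."""
    if gpus % gpus_per_node:
        raise ValueError("gpus must be a multiple of gpus_per_node")
    nodes = gpus // gpus_per_node
    rails = gpus_per_node               # assumption: one rail per GPU/NIC slot
    down_per_leaf = leaf_ports // 2     # 1:1 means half the ports face the NICs
    uplinks_per_leaf = down_per_leaf    # and the same number face the spines
    # NIC k of every node lands on a rail-k leaf, so each rail needs
    # enough leaves to terminate one NIC per node.
    leaves_per_rail = math.ceil(nodes / down_per_leaf)
    total_leaves = rails * leaves_per_rail
    spines = math.ceil(total_leaves * uplinks_per_leaf / spine_ports)
    if spines > uplinks_per_leaf:
        # A leaf cannot reach every spine: add a tier or grow spine planes.
        raise ValueError("two tiers are not enough for this GPU count")
    return FabricPlan(nodes, leaves_per_rail, total_leaves,
                      uplinks_per_leaf, spines)


def oversubscription(downlinks: int, uplinks: int) -> float:
    """Ratio of host-facing to spine-facing capacity; 1.0 is non-blocking."""
    return downlinks / uplinks


if __name__ == "__main__":
    print(size_rail_optimized_fabric(gpus=512, gpus_per_node=8,
                                     leaf_ports=64, spine_ports=64))
```

A real design would also account for breakout ports, spare uplinks for failure domains, and uneven leaf fill, none of which this sketch models.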
Lossless Transport: RoCEv2, PFC, ECN, DCQCN
- PFC headroom derivation from cable length, link rate, and MTU; per-priority buffer carving on shared-buffer ASICs
- DCQCN parameter engineering: Kmin/Kmax/Pmax WRED-ECN curves, alpha update, rate-increase timers, CNP pacing
- PFC storm and cyclic-buffer-dependency deadlock analysis, watchdog design, and blast-radius containment
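Two of the calculations in this domain can be sketched under stated assumptions. The headroom estimate below uses a simplified model: bytes in flight over the cable round trip, plus one MTU the sender has already started transmitting, plus one MTU that delays the outgoing pause frame, plus an optional lumped `response_ns` for peer processing delay (a hypothetical parameter). It assumes roughly 5 ns/m propagation in fiber. The marking function is the RED-style curve commonly described for DCQCN: no marking below Kmin, a linear ramp to Pmax at Kmax, and every packet marked above Kmax. Real ASICs add vendor-specific terms, so treat both as starting points rather than vendor formulas.

```python
import math

FIBER_DELAY_NS_PER_M = 5  # approximate propagation delay in optical fiber


def pfc_headroom_bytes(cable_m: float, rate_gbps: float, mtu_bytes: int,
                       response_ns: float = 0.0) -> int:
    """Simplified per-priority PFC headroom estimate, in bytes."""
    round_trip_ns = 2 * cable_m * FIBER_DELAY_NS_PER_M
    # ns * Gb/s = bits, so no unit conversion factor is needed here.
    in_flight_bits = (round_trip_ns + response_ns) * rate_gbps
    # Two MTUs: one frame mid-transmission at the sender when the pause
    # arrives, one frame that held up the pause frame at the receiver.
    return math.ceil(in_flight_bits / 8 + 2 * mtu_bytes)


def ecn_mark_probability(queue_bytes: float, kmin: float, kmax: float,
                         pmax: float) -> float:
    """RED-style ECN marking probability for a given queue depth."""
    if not kmin < kmax:
        raise ValueError("kmin must be below kmax")
    if not 0.0 <= pmax <= 1.0:
        raise ValueError("pmax must be in [0, 1]")
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes >= kmax:
        return 1.0
    return pmax * (queue_bytes - kmin) / (kmax - kmin)


if __name__ == "__main__":
    # Example inputs only: 100 m cable, 400 Gb/s link, 4096-byte MTU.
    print(pfc_headroom_bytes(cable_m=100, rate_gbps=400, mtu_bytes=4096))
```

Headroom grows linearly with both cable length and link rate, which is why buffer carving that worked at 100G often needs revisiting at 400G/800G.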
InfiniBand vs Ethernet & the UEC Horizon
- NDR/XDR InfiniBand: subnet manager, adaptive routing, SHARP in-network reduction — where it genuinely wins
- Ultra Ethernet Consortium transport: packet spraying, out-of-order delivery, modern congestion control vs DCQCN
- Total-cost and operability decision framework defended with benchmark data, not vendor slides
Fabric Services, Telemetry & Validation
- EVPN-VXLAN multi-tenancy for frontend networks and NVMe-oF/storage backend isolation
- gNMI streaming telemetry, What-Just-Happened-class drop forensics, and PFC-pause/ECN-mark counters as SLOs
- perftest (ib_write_bw/ib_send_lat) and nccl-tests as acceptance gates with pass/fail thresholds
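An acceptance gate of the kind described above reduces to comparing each measured figure against a threshold in the right direction: bandwidth must stay at or above a floor, latency at or below a ceiling. The sketch below is a hypothetical harness, not part of perftest or nccl-tests; the `Gate` class, the gate names, and all numbers in the example are placeholders invented for illustration, and measured values would come from your own benchmark runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    measured: float
    threshold: float
    higher_is_better: bool  # True for bandwidth, False for latency

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.measured >= self.threshold
        return self.measured <= self.threshold


def failed_gates(gates: list[Gate]) -> list[str]:
    """Names of the gates that did not meet their threshold."""
    return [g.name for g in gates if not g.passed()]


if __name__ == "__main__":
    # Placeholder numbers, not real measurements or recommended thresholds.
    gates = [
        Gate("write_bw_gbps", measured=390.0, threshold=380.0,
             higher_is_better=True),
        Gate("send_lat_us", measured=2.4, threshold=2.0,
             higher_is_better=False),
    ]
    failures = failed_gates(gates)
    print("PASS" if not failures else f"FAIL: {failures}")
```

Keeping the direction of each metric explicit avoids the common mistake of applying a single "greater than" check to latency figures.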
Migration, Scale & Design Defense
- Brownfield-to-AI-fabric migration runbooks: parallel fabrics, workload cutover, rollback criteria
- Capacity roadmaps from 4 MW to 30 MW halls: power, cooling, and fabric co-planning
- Architecture-review-board defense: writing and orally defending a full cluster design under challenge
Signature labs
Rack time, not watch time
L-SIGNATURE 01 — Build a rail-optimized 3-stage Clos for a 512-GPU cluster in the RKR virtual fabric: rail-aligned underlay, BGP unnumbered, 1:1 spine sizing, verified with synthetic AllReduce traffic
L-SIGNATURE 02 — Tune a lossy fabric to lossless: derive PFC headroom from measured RTT, carve buffers, set DCQCN Kmin/Kmax/Pmax, and prove zero drops at 90%+ offered load under incast
L-SIGNATURE 03 — Break/fix under the clock: injected PFC storm, ECN mis-marking, and a rail-miswire — diagnose from telemetry alone and restore nccl-tests baseline within SLA
L-SIGNATURE 04 — InfiniBand vs Ethernet bake-off: run identical collective benchmarks on both stacks, produce a costed recommendation memo with measured busbw deltas
L-SIGNATURE 05 — Live migration: cut a running tenant from a legacy L2 fabric onto the new AI fabric with documented rollback gates and zero training-job restarts
L-CAPSTONE — The 8-hour RCDE practical: design, build, tune, break/fix, and defend a complete GPU fabric end to end
How you are examined
Two-stage: 120-minute proctored theory (scenario-heavy design questions, cost-model calculations) followed by the 8-hour RCDE Practical — build, tune, and defend a working GPU fabric on live virtual and physical gear, graded on a published rubric plus a 30-minute oral design defense.
Career ladder
- Entry point (RCDP + RCDE in progress): Senior Datacenter Network Engineer — AI fabrics, Rs 26-38 LPA
- 1-2 years post-RCDE: AI Fabric Architect / Lead Network Engineer, GPU cloud, Rs 38-55 LPA
- 3-4 years post-RCDE: Principal Network Engineer — AI Infrastructure, Rs 55-75 LPA
- Senior track: Distinguished Engineer / Head of Network Architecture, Rs 75-90+ LPA
Rs 26-90+ LPA (senior to principal, datacenter stream; AI-fabric specialisation carries a ~1.7x niche premium over generic DC roles)