RCDE — A0 Certification Blueprint

RKR NETWORKS AI-Readiness Academy

DataCenter · Expert · Certification Blueprint · A0

RCDE

RKR Certified DataCenter Expert

Expert-tier AI/GPU fabric design: lossless Ethernet at cluster scale, from collective-communication math to a defended production architecture

Duration: 20 weeks

Effort: 12 hrs/week

Curriculum: 9 modules

Hands-on: 19 graded labs

Comparable in rigor to Juniper JNCIE-DC and Cisco CCIE Data Center

The blueprint

RCDE builds the engineer who can stand in front of a 1,024-GPU training cluster RFP and own every layer of the answer: the AllReduce traffic model that sizes the fabric, the rail-optimized Clos that carries it, the PFC/ECN/DCQCN tuning that keeps it lossless at 95% load, the InfiniBand-vs-Ethernet/UEC decision defended with data, and the migration plan that gets a live estate there without an outage. Every domain terminates in a graded lab artifact — configs, telemetry captures, and a design document an architecture review board would sign.

Skill domains

6 assessed domains

Collective Communication & Traffic Engineering

Ring/tree/halving-doubling AllReduce cost models: the 2S(N-1)/N per-rank data volume of a ring AllReduce (message size S, N ranks) and bus-vs-algorithm bandwidth
NCCL topology awareness: NCCL_ALGO, NCCL_PROTO (Simple/LL/LL128), channel and rail mapping
Job-completion-time sensitivity: tail-latency, straggler, and incast modelling for training vs inference
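
As an illustration of the cost-model arithmetic in this domain, the following is a minimal sketch of the ring AllReduce volume formula 2S(N-1)/N and the bus-vs-algorithm bandwidth relationship. The function names are hypothetical and not part of the blueprint. The bus-bandwidth scaling factor 2(N-1)/N follows the convention used by nccl-tests for AllReduce. Tree and halving-doubling algorithms have different latency terms that this sketch does not model.

```python
def ring_allreduce_bytes_per_rank(size_bytes: float, n_ranks: int) -> float:
    """Bytes each rank sends (and receives) in a ring AllReduce: 2*S*(N-1)/N."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    return 2.0 * size_bytes * (n_ranks - 1) / n_ranks


def algorithm_bandwidth(size_bytes: float, elapsed_s: float) -> float:
    """Algorithm bandwidth: message size divided by wall-clock time (bytes/s)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return size_bytes / elapsed_s


def bus_bandwidth(size_bytes: float, elapsed_s: float, n_ranks: int) -> float:
    """Bus bandwidth for AllReduce: algbw scaled by 2*(N-1)/N.

    The scaling makes the figure comparable to per-link hardware bandwidth
    regardless of how many ranks take part.
    """
    factor = 2.0 * (n_ranks - 1) / n_ranks
    return algorithm_bandwidth(size_bytes, elapsed_s) * factor


if __name__ == "__main__":
    size = 1 << 30  # 1 GiB gradient buffer (illustrative value)
    ranks = 8
    sent = ring_allreduce_bytes_per_rank(size, ranks)
    print(f"per-rank traffic: {sent / (1 << 30):.2f} GiB")
    print(f"busbw at 10 ms: {bus_bandwidth(size, 0.010, ranks) / 1e9:.1f} GB/s")
```

Because the factor 2(N-1)/N approaches 2 as N grows, per-rank traffic is nearly independent of cluster size for large rings, which is why the fabric is sized on per-NIC line rate rather than on rank count.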

Rail-Optimized Clos & Scale-Out Architecture

8-rail leaf design for HGX-class nodes: rail-aligned NIC-to-leaf wiring, 1:1 non-blocking spine sizing
51.2T-generation platform selection (Tomahawk 5, Jericho3-AI, Spectrum-4 class) and 400G/800G OSFP optics budgets
Bisection-bandwidth, oversubscription, and failure-domain math from 256 to 32k GPUs, including multi-plane spine growth
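
The sizing arithmetic behind a 1:1 non-blocking leaf-spine (3-stage Clos) fabric can be sketched as below. This is a simplified model under stated assumptions: one NIC per GPU, every switch has the same port radix, all ports run at the same speed, and leaf ports are split evenly between downlinks and uplinks. The names `ClosPlan` and `size_two_tier_clos` are hypothetical, and the default radix of 64 is only an illustrative value, not a statement about any specific platform.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClosPlan:
    leaves: int
    spines: int
    downlinks_per_leaf: int
    uplinks_per_leaf: int
    max_gpus: int  # largest GPU count this radix supports in two tiers


def size_two_tier_clos(gpus: int, radix: int = 64, rails: int = 8) -> ClosPlan:
    """Size a 1:1 non-blocking two-tier Clos (assumption: one NIC per GPU)."""
    if gpus <= 0 or gpus % rails:
        raise ValueError("gpus must be a positive multiple of rails")
    down = radix // 2          # leaf ports facing NICs
    up = radix - down          # leaf ports facing spines (1:1 when radix is even)
    max_gpus = radix * down    # at most `radix` leaves, one spine port per uplink
    if gpus > max_gpus:
        raise ValueError("GPU count exceeds two-tier capacity; add a tier or plane")
    leaves = math.ceil(gpus / down)
    # Round up so every rail owns a whole number of leaves (rail alignment).
    leaves = math.ceil(leaves / rails) * rails
    spines = math.ceil(leaves * up / radix)
    return ClosPlan(leaves, spines, down, up, max_gpus)


if __name__ == "__main__":
    for n in (256, 512, 1024):
        print(n, size_two_tier_clos(n))
```

Under these assumptions a 512-GPU cluster needs 16 leaves and 8 spines. Beyond `max_gpus` the model stops, since larger clusters need a third tier or multiple spine planes.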

Lossless Transport: RoCEv2, PFC, ECN, DCQCN

PFC headroom derivation from cable length, link rate, and MTU; per-priority buffer carving on shared-buffer ASICs
DCQCN parameter engineering: Kmin/Kmax/Pmax WRED-ECN curves, alpha update, rate-increase timers, CNP pacing
PFC storm and cyclic-buffer-dependency deadlock analysis, watchdog design, and blast-radius containment
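
The two calculations named in this domain, PFC headroom and the WRED-ECN marking curve, can be sketched as follows. Both are simplified models rather than any vendor's formula. The headroom estimate assumes about 5 ns/m propagation delay, one maximum-size frame in flight at each end, and a caller-supplied peer response time. Real ASICs add cell-size and pipeline terms that are not modelled here. The marking curve assumes a RED-style ramp: no marking at or below Kmin, linear up to Pmax at Kmax, and marking of every packet above Kmax. All function and parameter names are hypothetical.

```python
def pfc_headroom_bytes(
    cable_m: float,
    rate_gbps: float,
    mtu_bytes: int,
    response_ns: float = 0.0,
    prop_ns_per_m: float = 5.0,  # assumption: typical fibre/copper propagation delay
) -> float:
    """Estimate per-priority PFC headroom for one port (simplified model)."""
    if cable_m < 0 or rate_gbps <= 0 or mtu_bytes <= 0 or response_ns < 0:
        raise ValueError("invalid link parameters")
    # Bytes in flight over the round trip: ns * Gb/s gives bits.
    in_flight = 2 * cable_m * prop_ns_per_m * rate_gbps / 8
    # One frame may delay the pause leaving; one may already be leaving the peer.
    frames = 2 * mtu_bytes
    # Bytes the peer keeps sending while it reacts to the pause frame.
    reaction = response_ns * rate_gbps / 8
    return in_flight + frames + reaction


def ecn_mark_probability(queue_bytes: float, kmin: float, kmax: float, pmax: float) -> float:
    """RED-style ECN marking probability for a given queue depth."""
    if not (0 <= kmin < kmax) or not (0.0 <= pmax <= 1.0):
        raise ValueError("require 0 <= kmin < kmax and 0 <= pmax <= 1")
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes > kmax:
        return 1.0
    return pmax * ((queue_bytes - kmin) / (kmax - kmin))


if __name__ == "__main__":
    # Illustrative values only: 100 m link, 400 Gb/s, 9216-byte MTU.
    print(pfc_headroom_bytes(100, 400, 9216))
    print(ecn_mark_probability(200_000, 100_000, 300_000, 0.2))
```

The headroom grows linearly with both cable length and link rate, so a value derived for 100G links cannot be reused unchanged on 400G or 800G ports.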

InfiniBand vs Ethernet & the UEC Horizon

NDR/XDR InfiniBand: subnet manager, adaptive routing, SHARP in-network reduction — where it genuinely wins
Ultra Ethernet Consortium transport: packet spraying, out-of-order delivery, modern congestion control vs DCQCN
Total-cost and operability decision framework defended with benchmark data, not vendor slides

Fabric Services, Telemetry & Validation

EVPN-VXLAN multi-tenancy for frontend networks and NVMe-oF/storage backend isolation
gNMI streaming telemetry, What-Just-Happened-class drop forensics, and PFC-pause/ECN-mark counters as SLOs
perftest (ib_write_bw/ib_send_lat) and nccl-tests as acceptance gates with pass/fail thresholds
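
An acceptance gate of the kind described above can be sketched as a table of thresholds checked against measured values. This is an illustration only: the metric names and threshold numbers are hypothetical placeholders, not RKR pass marks, and the sketch assumes the perftest and nccl-tests results have already been parsed into a dictionary by other tooling.

```python
import operator
from typing import Dict, List, Tuple

# Comparison operators a gate may use.
COMPARATORS = {">=": operator.ge, "<=": operator.le}

# Hypothetical thresholds: metric name -> (operator, limit). Placeholder values.
EXAMPLE_THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "ib_write_bw_gbps": (">=", 360.0),
    "ib_send_lat_usec": ("<=", 5.0),
    "allreduce_busbw_gbps": (">=", 340.0),
    "pfc_pause_frames": ("<=", 0.0),
}


def evaluate_gates(
    measured: Dict[str, float],
    thresholds: Dict[str, Tuple[str, float]],
) -> List[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures: List[str] = []
    for name, (op, limit) in thresholds.items():
        value = measured.get(name)
        if value is None:
            # A missing measurement is treated as a failure, not a silent pass.
            failures.append(f"{name}: no measurement")
        elif not COMPARATORS[op](value, limit):
            failures.append(f"{name}: measured {value} violates {op} {limit}")
    return failures


if __name__ == "__main__":
    run = {
        "ib_write_bw_gbps": 372.4,
        "ib_send_lat_usec": 3.1,
        "allreduce_busbw_gbps": 331.0,
        "pfc_pause_frames": 0.0,
    }
    problems = evaluate_gates(run, EXAMPLE_THRESHOLDS)
    print("PASS" if not problems else "FAIL: " + "; ".join(problems))
```

Treating a missing metric as a failure keeps a broken collector from producing a false pass.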

Migration, Scale & Design Defense

Brownfield-to-AI-fabric migration runbooks: parallel fabrics, workload cutover, rollback criteria
Capacity roadmaps from 4 MW to 30 MW halls: power, cooling, and fabric co-planning
Architecture-review-board defense: writing and orally defending a full cluster design under challenge

Signature labs

Rack time, not watch time

L-SIGNATURE 01 — Build a rail-optimized 3-stage Clos for a 512-GPU cluster in the RKR virtual fabric: rail-aligned underlay, BGP unnumbered, 1:1 spine sizing, verified with synthetic AllReduce traffic

L-SIGNATURE 02 — Tune a lossy fabric to lossless: derive PFC headroom from measured RTT, carve buffers, set DCQCN Kmin/Kmax/Pmax, and prove zero drops at 90%+ offered load under incast

L-SIGNATURE 03 — Break/fix under the clock: injected PFC storm, ECN mis-marking, and a rail-miswire — diagnose from telemetry alone and restore nccl-tests baseline within SLA

L-SIGNATURE 04 — InfiniBand vs Ethernet bake-off: run identical collective benchmarks on both stacks, produce a costed recommendation memo with measured busbw deltas

L-SIGNATURE 05 — Live migration: cut a running tenant from a legacy L2 fabric onto the new AI fabric with documented rollback gates and zero training-job restarts

L-CAPSTONE — The 8-hour RCDE practical: design, build, tune, break/fix, and defend a complete GPU fabric end to end

How you are examined

Two-stage: 120-minute proctored theory (scenario-heavy design questions, cost-model calculations) followed by the 8-hour RCDE Practical — build, tune, and defend a working GPU fabric on live virtual and physical gear, graded on a published rubric plus a 30-minute oral design defense.

Career ladder

Entry point (RCDP + RCDE in progress)
Senior Datacenter Network Engineer, AI fabrics: Rs 26-38 LPA
1-2 years post-RCDE
AI Fabric Architect / Lead Network Engineer, GPU cloud: Rs 38-55 LPA
3-4 years post-RCDE
Principal Network Engineer, AI Infrastructure: Rs 55-75 LPA
Senior track
Distinguished Engineer / Head of Network Architecture: Rs 75-90+ LPA

Rs 26-90+ LPA (senior to principal, datacenter stream; AI-fabric specialisation carries a ~1.7x niche premium over generic DC roles)
