
RCDE Certification Blueprint

Large-format A0 poster · prints edge-to-edge at 841 × 1189 mm

RKR NETWORKS · AI-Readiness Academy
DataCenter · Expert · Certification Blueprint · A0
RCDE

RKR Certified DataCenter Expert

Expert-tier AI/GPU fabric design: lossless Ethernet at cluster scale, from collective-communication math to a defended production architecture

Duration: 20 weeks
Effort: 12 hrs/week
Curriculum: 9 modules
Hands-on: 19 graded labs
Comparable in rigor to Juniper JNCIE-DC and Cisco CCIE Data Center

The blueprint

RCDE builds the engineer who can stand in front of a 1,024-GPU training cluster RFP and own every layer of the answer: the AllReduce traffic model that sizes the fabric, the rail-optimized Clos that carries it, the PFC/ECN/DCQCN tuning that keeps it lossless at 95% load, the InfiniBand-vs-Ethernet/UEC decision defended with data, and the migration plan that gets a live estate there without an outage. Every domain terminates in a graded lab artifact — configs, telemetry captures, and a design document an architecture review board would sign.

Skill domains

6 assessed domains
01

Collective Communication & Traffic Engineering

  • Ring/tree/halving-doubling AllReduce cost models: 2S(N-1)/N data-volume math and bus-vs-algorithm bandwidth
  • NCCL topology awareness: NCCL_ALGO, NCCL_PROTO (Simple/LL/LL128), channel and rail mapping
  • Job-completion-time sensitivity: tail-latency, straggler, and incast modelling for training vs inference
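
The ring AllReduce data-volume math above can be sketched in a few lines. This is a minimal illustration, not course material: the function names are hypothetical, S is the per-rank message size in bytes, N is the number of ranks, and the bus-bandwidth scaling factor 2(N-1)/N follows the convention used by nccl-tests for AllReduce. The message size and timing in the example are made-up inputs.

```python
import math


def ring_allreduce_bytes_per_rank(size_bytes: float, n_ranks: int) -> float:
    """Bytes each rank sends (and receives) in a ring AllReduce: 2*S*(N-1)/N."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    return 2.0 * size_bytes * (n_ranks - 1) / n_ranks


def algorithm_bandwidth(size_bytes: float, elapsed_s: float) -> float:
    """Algorithm bandwidth: message size divided by wall-clock time (bytes/s)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return size_bytes / elapsed_s


def bus_bandwidth(size_bytes: float, elapsed_s: float, n_ranks: int) -> float:
    """Bus bandwidth: algorithm bandwidth scaled by 2*(N-1)/N for AllReduce.

    The scaling makes the figure comparable to per-link hardware bandwidth
    regardless of how many ranks took part.
    """
    factor = 2.0 * (n_ranks - 1) / n_ranks
    return algorithm_bandwidth(size_bytes, elapsed_s) * factor


# Hypothetical inputs: a 1 GB AllReduce across 8 ranks finishing in 50 ms.
S, N, T = 1e9, 8, 0.05
print(f"bytes sent per rank: {ring_allreduce_bytes_per_rank(S, N):.3e}")
print(f"algbw: {algorithm_bandwidth(S, T) / 1e9:.1f} GB/s")
print(f"busbw: {bus_bandwidth(S, T, N) / 1e9:.1f} GB/s")
```

As N grows, the factor 2(N-1)/N approaches 2, so per-rank traffic approaches twice the message size and bus bandwidth approaches twice the algorithm bandwidth.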
02

Rail-Optimized Clos & Scale-Out Architecture

  • 8-rail leaf design for HGX-class nodes: rail-aligned NIC-to-leaf wiring, 1:1 non-blocking spine sizing
  • 51.2T-generation platform selection (Tomahawk 5, Jericho3-AI, Spectrum-4 class) and 400G/800G OSFP optics budgets
  • Bisection-bandwidth, oversubscription, and failure-domain math from 256 to 32k GPUs, including multi-plane spine growth
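
A rough sizing sketch for the rail-optimized, 1:1 non-blocking design described above. Everything here is an assumption for illustration: one NIC per GPU, 8 rails per node, a hypothetical 64-port leaf and 64-port spine at a single link speed, half of each leaf's ports facing down, and a two-tier (3-stage) fabric only. It does not check that uplinks divide evenly across spines, and real platform selection also depends on breakout modes and optics.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClosPlan:
    nodes: int
    rail_groups: int
    leaves: int
    spines: int
    downlinks_per_leaf: int
    uplinks_per_leaf: int


def size_rail_optimized_clos(gpus: int, rails: int = 8,
                             leaf_ports: int = 64,
                             spine_ports: int = 64) -> ClosPlan:
    """Size a two-tier rail-optimized Clos with 1:1 leaf oversubscription.

    Leaf k in a rail group connects NIC k of every node in that group, so a
    group holds `downlinks_per_leaf` nodes and needs `rails` leaves.
    """
    if gpus <= 0 or gpus % rails:
        raise ValueError("gpus must be a positive multiple of rails")
    nodes = gpus // rails
    down = leaf_ports // 2            # node-facing ports per leaf
    up = leaf_ports - down            # spine-facing ports per leaf (1:1)
    groups = math.ceil(nodes / down)  # rail groups needed
    leaves = groups * rails
    if leaves > spine_ports:
        # Every leaf needs at least one link to every spine.
        raise ValueError("exceeds two-tier scale; add a spine plane or tier")
    spines = math.ceil(leaves * up / spine_ports)
    return ClosPlan(nodes, groups, leaves, spines, down, up)


# Hypothetical example: a 512-GPU cluster on 64-port switches.
plan = size_rail_optimized_clos(512)
print(plan)
```

Under these assumptions a 512-GPU cluster needs 64 nodes in 2 rail groups, 16 leaves, and 8 spines; the two-tier limit is reached at 64 leaves, which is why larger clusters move to multi-plane or three-tier designs.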
03

Lossless Transport: RoCEv2, PFC, ECN, DCQCN

  • PFC headroom derivation from cable length, link rate, and MTU; per-priority buffer carving on shared-buffer ASICs
  • DCQCN parameter engineering: Kmin/Kmax/Pmax WRED-ECN curves, alpha update, rate-increase timers, CNP pacing
  • PFC storm and cyclic-buffer-dependency deadlock analysis, watchdog design, and blast-radius containment
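
The headroom and ECN-curve items above lend themselves to a small worked sketch. It is a simplified model under stated assumptions, not a vendor formula: headroom is taken as the bytes in flight over one cable round trip plus an optional response delay, plus one maximum-size frame at each end; fibre propagation is assumed to be about 5 ns per metre; and the marking curve is the usual RED-style ramp (0 below Kmin, linear up to Pmax at Kmax, 1 above Kmax). Function and parameter names are hypothetical, and the example values are not recommendations.

```python
import math


def pfc_headroom_bytes(cable_m: float, link_gbps: float, mtu_bytes: int,
                       response_ns: float = 0.0,
                       prop_ns_per_m: float = 5.0) -> int:
    """Worst-case bytes that can still arrive after a PFC pause is triggered.

    Simplified: round-trip wire delay plus sender response delay at line
    rate, plus one MTU already being sent at each end of the link.
    """
    bytes_per_ns = link_gbps / 8.0              # 1 Gb/s = 0.125 bytes/ns
    rtt_ns = 2.0 * cable_m * prop_ns_per_m      # pause out, data back
    in_flight = (rtt_ns + response_ns) * bytes_per_ns
    return math.ceil(in_flight + 2 * mtu_bytes)


def ecn_mark_probability(queue_bytes: float, kmin: float, kmax: float,
                         pmax: float) -> float:
    """RED-style ECN marking probability used as the DCQCN switch-side curve."""
    if not 0 <= kmin < kmax:
        raise ValueError("require 0 <= kmin < kmax")
    if not 0.0 <= pmax <= 1.0:
        raise ValueError("pmax must be in [0, 1]")
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes > kmax:
        return 1.0
    return pmax * (queue_bytes - kmin) / (kmax - kmin)


# Hypothetical link: 100 m of fibre at 400 Gb/s with a 9216-byte MTU.
print(pfc_headroom_bytes(cable_m=100, link_gbps=400, mtu_bytes=9216))

# Hypothetical curve: Kmin 100 KB, Kmax 200 KB, Pmax 10%.
for q in (50_000, 150_000, 250_000):
    print(q, ecn_mark_probability(q, 100_000, 200_000, 0.10))
```

Headroom scales linearly with both cable length and link rate, which is why per-priority buffer carving gets tight on long 400G/800G runs.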
04

InfiniBand vs Ethernet & the UEC Horizon

  • NDR/XDR InfiniBand: subnet manager, adaptive routing, SHARP in-network reduction — where it genuinely wins
  • Ultra Ethernet Consortium transport: packet spraying, out-of-order delivery, modern congestion control vs DCQCN
  • Total-cost and operability decision framework defended with benchmark data, not vendor slides
05

Fabric Services, Telemetry & Validation

  • EVPN-VXLAN multi-tenancy for frontend networks and NVMe-oF/storage backend isolation
  • gNMI streaming telemetry, What-Just-Happened-class drop forensics, and PFC-pause/ECN-mark counters as SLOs
  • perftest (ib_write_bw/ib_send_lat) and nccl-tests as acceptance gates with pass/fail thresholds
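
One way to picture benchmark results acting as acceptance gates is a small pass/fail evaluator like the sketch below. The gate names, measured values, and thresholds are all hypothetical; the measurements are assumed to have been collected separately from ib_write_bw, ib_send_lat, and nccl-tests runs, and nothing here parses tool output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    measured: float
    threshold: float
    passed: bool


def at_least(name: str, measured: float, threshold: float) -> GateResult:
    """Gate for metrics where higher is better (bandwidth)."""
    return GateResult(name, measured, threshold, measured >= threshold)


def at_most(name: str, measured: float, threshold: float) -> GateResult:
    """Gate for metrics where lower is better (latency, drop counters)."""
    return GateResult(name, measured, threshold, measured <= threshold)


def fabric_accepted(results: list[GateResult]) -> bool:
    """The fabric passes only if every individual gate passes."""
    return all(r.passed for r in results)


# Hypothetical measurements and thresholds, for illustration only.
results = [
    at_least("ib_write_bw (Gb/s)", measured=392.0, threshold=0.95 * 400),
    at_most("ib_send_lat (us)", measured=2.1, threshold=3.0),
    at_least("nccl-tests all_reduce busbw (GB/s)", measured=41.0, threshold=45.0),
    at_most("fabric drops during run", measured=0, threshold=0),
]

for r in results:
    status = "PASS" if r.passed else "FAIL"
    print(f"{status}  {r.name}: {r.measured} (threshold {r.threshold})")
print("accepted" if fabric_accepted(results) else "rejected")
```

Expressing thresholds as a fraction of line rate, as in the first gate, keeps the same gate definition usable across 400G and 800G links.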
06

Migration, Scale & Design Defense

  • Brownfield-to-AI-fabric migration runbooks: parallel fabrics, workload cutover, rollback criteria
  • Capacity roadmaps from 4 MW to 30 MW halls: power, cooling, and fabric co-planning
  • Architecture-review-board defense: writing and orally defending a full cluster design under challenge

Signature labs

Rack time, not watch time

L-SIGNATURE 01: Build a rail-optimized 3-stage Clos for a 512-GPU cluster in the RKR virtual fabric: rail-aligned underlay, BGP unnumbered, 1:1 spine sizing, verified with synthetic AllReduce traffic

L-SIGNATURE 02: Tune a lossy fabric to lossless: derive PFC headroom from measured RTT, carve buffers, set DCQCN Kmin/Kmax/Pmax, and prove zero drops at 90%+ offered load under incast

L-SIGNATURE 03: Break/fix under the clock: injected PFC storm, ECN mis-marking, and a rail miswire; diagnose from telemetry alone and restore the nccl-tests baseline within SLA

L-SIGNATURE 04: InfiniBand vs Ethernet bake-off: run identical collective benchmarks on both stacks, produce a costed recommendation memo with measured busbw deltas

L-SIGNATURE 05: Live migration: cut a running tenant from a legacy L2 fabric onto the new AI fabric with documented rollback gates and zero training-job restarts

L-CAPSTONE: The 8-hour RCDE practical: design, build, tune, break/fix, and defend a complete GPU fabric end to end

How you are examined

Two-stage: 120-minute proctored theory (scenario-heavy design questions, cost-model calculations) followed by the 8-hour RCDE Practical — build, tune, and defend a working GPU fabric on live virtual and physical gear, graded on a published rubric plus a 30-minute oral design defense.

Career ladder

  1. Entry point (RCDP + RCDE in progress)
    Senior Datacenter Network Engineer, AI fabrics · Rs 26-38 LPA
  2. 1-2 years post-RCDE
    AI Fabric Architect / Lead Network Engineer, GPU cloud · Rs 38-55 LPA
  3. 3-4 years post-RCDE
    Principal Network Engineer, AI Infrastructure · Rs 55-75 LPA
  4. Senior track
    Distinguished Engineer / Head of Network Architecture · Rs 75-90+ LPA

Rs 26-90+ LPA (senior to principal, datacenter stream; AI-fabric specialisation carries a ~1.7x niche premium over generic DC roles)

India is building 5-6.5 GW of AI datacenters. Someone has to design the fabrics inside them. Become that engineer.

RKR NETWORKS · Networks First, Networks Last · RCDE · DataCenter stream · Expert tier · training.rkr-networks.com
