Expert tier · DataCenter stream · Lab-first · Rubric-graded

RCDE · RKR Certified DataCenter Expert

Design the fabrics that train India's AI — and defend every decision under fire.

20 weeks · 12 hrs / week · 9 modules · 19 labs · Prerequisite: RCDP

Overview

What the RCDE certifies.

The RKR Certified DataCenter Expert (RCDE) is the summit of RKR's datacenter stream and the most demanding credential the Academy issues. It certifies the rarest capability in Indian infrastructure today: the ability to design, build, tune, and defend the lossless network fabrics that GPU clusters live or die on. Where the Associate tier proved you can operate a datacenter network and the Professional tier (RCDP) proved you can build EVPN-VXLAN fabrics, RCDE proves you can own the whole problem — starting from the AllReduce traffic mathematics of a distributed training job, through rail-optimized Clos topology and 51.2T-generation platform selection, into the deep transport engineering of RoCEv2, PFC headroom, and DCQCN congestion tuning, and out the other side to migration plans, multi-site architectures, and a design document that survives an adversarial review board. Every claim in the program is exercised on real tooling — perftest, nccl-tests, gNMI telemetry, RDMA-capable hardware pods — because RKR's thesis is that the AI build-out rewards demonstrable competence, not watch-time.

The timing is deliberate. India's datacenter capacity is scaling from roughly 1,700 MW toward a 5-6.5 GW pipeline, nearly all of it AI-driven, while operators report that 73% of datacenter operations roles are hard to fill: the exact inverse of the generic IT roles automation is hollowing out. The engineers who can answer 'InfiniBand or Ultra Ethernet?', 'why is PFC firing on rail 5?', and 'how do we cut over a live estate without killing a three-week training run?' are the ones hyperscalers, GPU clouds, and global capability centres are bidding for at architect and principal compensation. RCDE is comparable in rigor to Juniper JNCIE-DC and Cisco CCIE Data Center, with a punishing theory exam plus a full-day proctored practical and an oral defense, but it is aimed squarely at the AI-fabric skill set those vendor tracks have not yet caught up to, and it is RKR-owned, lab-first, and verifiable end to end.

Measurable outcomes

Walk out able to do this — on record.

Design a complete rail-optimized Clos fabric for GPU clusters from 256 to 32k accelerators, with defensible bisection, radix, optics, and BOM math.

Engineer genuinely lossless Ethernet: derive PFC headroom from first principles, carve shared buffers, and tune ECN/DCQCN so the fabric is ECN-governed under 32:1 incast.

Convert training-job specifications into fabric requirements using collective-communication cost models, and validate delivered performance with perftest and nccl-tests acceptance gates.

Make and defend the InfiniBand vs Ethernet/UEC decision with measured benchmark data and a five-year TCO model, not vendor claims.

Architect the full three-plane GPU cloud — compute, storage (NVMe-oF/RDMA), and EVPN-VXLAN multi-tenant frontend — with proven isolation and QoS coexistence.

Operate and evolve live AI fabrics: streaming telemetry, drop forensics, non-disruptive upgrades under running training jobs, and phased brownfield migration with rollback gates.

Pass an expert-grade practical: build, tune, break/fix, and orally defend a working GPU fabric in a single 8-hour proctored session.

Who it’s for

Built for these starting lines.

RCDP graduates and senior datacenter network engineers (6+ years) ready to move from building fabrics to owning cluster-scale architecture decisions.

Network architects at GPU clouds, hyperscaler-adjacent operators, and colocation providers who must specify, tune, and defend AI backend fabrics.

HPC and InfiniBand engineers who need to master lossless Ethernet, DCQCN, and the UEC transition to stay ahead of the market's direction.

JNCIP-DC/CCNP-DC-level professionals targeting expert-tier recognition with an AI-infrastructure focus that single-vendor tracks do not yet offer.

Consultants and pre-sales architects who must produce costed, benchmark-backed IB-vs-Ethernet recommendations for eight-figure cluster procurements.

The syllabus

9 modules. 19 graded labs. No filler.

Every module terminates in a graded lab — theory is never left unproven. This is the full RCDE module sequence, exactly as delivered.

RCDE-M01

AI Workloads & Collective-Communication Mathematics

Expert fabric design starts from the traffic, not the topology. This module dissects distributed training at the wire level: data/tensor/pipeline/expert parallelism and the collectives each one generates (AllReduce, AllGather, ReduceScatter, All-to-All), then builds the quantitative toolkit — ring-AllReduce data volume 2S(N-1)/N, algorithm-vs-bus bandwidth, halving-doubling and tree variants, and how NCCL selects algorithm and protocol (Simple/LL/LL128) per message size. You finish able to convert a training-job spec into a fabric bandwidth and latency requirement with defensible arithmetic.

You will be able to
  • Learner can derive the wire-data volume and theoretical completion time of ring, tree, and halving-doubling AllReduce for a given message size, GPU count, and link rate, and identify the crossover points between algorithms.
  • Learner can map data, tensor, pipeline, and expert (MoE) parallelism strategies to their dominant collective patterns and predict which fabric dimension (rail bandwidth, bisection, latency) each one stresses.
  • Learner can interpret nccl-tests output (algbw vs busbw) and NCCL debug logs (NCCL_DEBUG=INFO) to determine the algorithm, protocol, and channel count actually selected on a live cluster.
  • Learner can quantify the job-completion-time impact of a single degraded link or straggler node in a synchronous training job and justify fabric-level SLOs from that model.
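
The ring-AllReduce arithmetic named in this module can be sketched in a few lines of Python. This is a minimal illustration, not the RKR workbook: the function names are hypothetical, the per-step latency `alpha_s` is an assumed input, and the busbw scaling factor 2(N-1)/N is the one nccl-tests documents for AllReduce.

```python
def ring_wire_bytes(size_bytes: float, ranks: int) -> float:
    """Bytes each rank sends in a ring AllReduce: 2*S*(N-1)/N."""
    return 2.0 * size_bytes * (ranks - 1) / ranks


def ring_allreduce_time(size_bytes: float, ranks: int,
                        link_gbps: float, alpha_s: float = 0.0) -> float:
    """Alpha-beta estimate: 2(N-1) steps of latency plus wire bytes over link rate.

    alpha_s is an assumed per-step latency in seconds (0 means bandwidth-only).
    """
    bytes_per_s = link_gbps * 1e9 / 8
    steps = 2 * (ranks - 1)
    return steps * alpha_s + ring_wire_bytes(size_bytes, ranks) / bytes_per_s


def busbw_from_algbw(algbw: float, ranks: int) -> float:
    """Convert algorithm bandwidth (S / t) to bus bandwidth for AllReduce."""
    return algbw * 2 * (ranks - 1) / ranks


# Example: 1 GB message, 8 ranks, 400 Gb/s links, latency ignored.
t = ring_allreduce_time(1e9, 8, 400)
algbw = 1e9 / t                      # bytes per second as seen by the application
busbw = busbw_from_algbw(algbw, 8)   # approaches the 50 GB/s link rate when alpha is 0
```

With latency ignored, the modelled busbw equals the link rate regardless of rank count, which is why busbw rather than algbw is the figure to compare against hardware limits.
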
Graded labs
Lab

Collective cost-model workbook against live nccl-tests

Build a spreadsheet/Python cost model for ring and tree AllReduce across 8-512 ranks, then validate its predictions against all_reduce_perf runs on the RKR GPU pod, reconciling modelled vs measured busbw within 10%.

Lab

NCCL behaviour forensics

Using NCCL_DEBUG=INFO, NCCL_ALGO, and NCCL_PROTO overrides on a multi-node run, capture and explain three distinct algorithm/protocol selections, and demonstrate a message-size regime where forcing the wrong algorithm degrades busbw by >30%.
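
A forced-selection run of this kind could be prepared as below. This is a sketch only: the binary path is a placeholder, the multi-node launcher (for example an MPI wrapper) is deliberately omitted, and the allowed values are limited to the algorithms and protocols this syllabus names. The `-b`, `-e`, `-f`, and `-g` options are the standard nccl-tests size-sweep and GPU-count flags.

```python
import os

# Values restricted to those named in this module (assumption: Ring/Tree spelling).
ALGOS = {"Ring", "Tree"}
PROTOS = {"Simple", "LL", "LL128"}


def nccl_run_spec(binary: str, algo: str, proto: str,
                  min_bytes: str = "8", max_bytes: str = "1G",
                  gpus_per_proc: int = 1):
    """Return (env, argv) for one all_reduce_perf run with forced algo/proto."""
    if algo not in ALGOS or proto not in PROTOS:
        raise ValueError(f"unsupported combination: {algo}/{proto}")
    env = dict(os.environ)
    env.update({
        "NCCL_DEBUG": "INFO",   # log which algorithm/protocol NCCL reports
        "NCCL_ALGO": algo,      # override the algorithm choice
        "NCCL_PROTO": proto,    # override the protocol choice
    })
    argv = [binary, "-b", min_bytes, "-e", max_bytes,
            "-f", "2", "-g", str(gpus_per_proc)]
    return env, argv


# Hypothetical binary location; pass env and argv to subprocess.run() on a real pod.
env, argv = nccl_run_spec("./build/all_reduce_perf", "Tree", "LL128")
```
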

RCDE-M02

Rail-Optimized Clos & Scale-Out Topology Design

The reference architecture for modern GPU backends is the rail-optimized Clos: each of the 8 NICs on an HGX-class node is wired to its own rail leaf so intra-rail collectives complete in one hop. This module covers the full design discipline: rail-to-leaf maps, 1:1 non-blocking spine sizing, radix math on 51.2T ASICs (Tomahawk 5 and Spectrum-4 class) and on the 28.8T Jericho3-AI, 400G/800G OSFP/QSFP-DD optics and DAC/AOC reach budgets, multi-plane growth from 256 to 32k GPUs, and where rail-optimized breaks down (inter-rail traffic, MoE All-to-All) versus a conventional fat-tree.

You will be able to
  • Learner can produce a complete rail map, cabling schedule, and switch BOM for a rail-optimized fabric serving 512 HGX-class GPUs at 400G per NIC with 1:1 spine oversubscription.
  • Learner can compute bisection bandwidth, radix-limited maximum cluster size, and failure-domain blast radius for 2-tier and 3-tier designs on a 51.2T/64x800G switch generation.
  • Learner can select optics and cabling (400G-DR4/FR4, 800G-2xDR4, DAC/AOC) against row-length and power budgets and defend the per-port cost delta in a BOM review.
  • Learner can articulate, with traffic math, when a rail-optimized design underperforms a full fat-tree (heavy All-to-All / MoE workloads) and design the hybrid alternative.
  • Learner can design a BGP-unnumbered eBGP underlay (RFC 8950 extended next hop, which obsoletes RFC 5549) with ECMP tuned for the fabric's path count.
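
The radix and spine-sizing arithmetic in these outcomes can be sketched as follows. The formulas are the textbook limits for a non-blocking folded Clos built from identical switches (half the leaf ports face down); treating a 64x800G switch as 128x400G logical ports is an assumption about breakout support, and the function names are hypothetical.

```python
import math


def max_endpoints(radix: int, tiers: int) -> int:
    """Largest non-blocking Clos from identical radix-k switches."""
    if tiers == 2:
        return radix * radix // 2     # k leaves, each with k/2 endpoint ports
    if tiers == 3:
        return radix ** 3 // 4        # classic k-ary fat-tree limit
    raise ValueError("only 2-tier and 3-tier designs are modelled")


def two_tier_plan(endpoints: int, radix: int):
    """Leaf and spine counts for a 1:1 (non-blocking) two-tier fabric."""
    down = radix // 2                             # endpoint-facing ports per leaf
    leaves = math.ceil(endpoints / down)
    uplinks = leaves * (radix - down)             # one uplink per downlink
    spines = math.ceil(uplinks / radix)
    return leaves, spines


def bisection_tbps(endpoints: int, link_gbps: float) -> float:
    """Bisection bandwidth of a non-blocking fabric, in Tb/s."""
    return endpoints / 2 * link_gbps / 1000


# 512 GPUs at 400G on switches assumed to run as 128 x 400G ports.
leaves, spines = two_tier_plan(512, 128)   # 8 leaves (one per rail), 4 spines
```

For the 512-GPU case this yields one 64-down/64-up leaf per rail, which lines up with the 8-rail layout described above; a real BOM would still add spares and failure-domain headroom.
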
Graded labs
Lab

512-GPU rail-optimized fabric build

In containerlab/SONiC-VS, deploy an 8-rail, 2-tier Clos for a simulated 64-node HGX cluster: BGP unnumbered underlay, ECMP verification across all spine paths, and a generated rail-map document cross-checked by automated LLDP audit.

Lab

Scale-out stress design review

Take a 1,024-GPU expansion brief and produce the 3-tier growth plan — plane count, spine radix consumption, re-cabling minimisation — then defend it in a recorded peer design review against an RKR rubric.

RCDE-M03

RoCEv2 Internals & Engineering Lossless Ethernet with PFC

RoCEv2 moves RDMA verbs over UDP/4791 and assumes the fabric will not drop — a promise Ethernet only keeps if you engineer it. This module goes to the ASIC: shared-buffer architecture and per-priority buffer carving, DSCP-to-TC-to-priority-group mapping, PFC (802.1Qbb) frame mechanics, headroom calculation from cable propagation delay, link rate, and MTU (2 x flight-time + worst-case in-flight frames), PFC watchdogs, and the failure modes — pause storms, head-of-line blocking, and cyclic-buffer-dependency deadlock in Clos fabrics.

You will be able to
  • Learner can derive PFC headroom buffer requirements from first principles (cable length, serialisation delay, MAC/PHY latency, MTU) for 100 m AOC and 2 km FR4 links at 400G, and configure the result on a shared-buffer ASIC.
  • Learner can design an end-to-end QoS map — DSCP marking at the RDMA NIC, TC classification, priority-group buffer allocation, and a dedicated CNP queue — consistent across a multi-vendor fabric.
  • Learner can reproduce, detect, and mitigate a PFC pause storm using pause-frame counters, PFC watchdog thresholds, and storm-domain isolation.
  • Learner can explain cyclic buffer dependency deadlock in a Clos with up-down-up routing violations and demonstrate the conditions that prevent it.
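
The headroom derivation can be approximated with the sketch below. It follows the shape given in the module description (round-trip flight time plus worst-case in-flight frames), but the specifics are assumptions: roughly 5 ns of propagation per metre of cable, a caller-supplied `response_ns` standing in for combined MAC/PHY and pause-reaction latency, and two maximum-size frames (one already being sent at each end). Real ASICs also round to cell sizes, which is not modelled.

```python
import math


def pfc_headroom_bytes(cable_m: float, rate_gbps: float, mtu_bytes: int,
                       response_ns: float, ns_per_m: float = 5.0) -> int:
    """Estimate per-priority PFC headroom for one port, in bytes."""
    bytes_per_ns = rate_gbps / 8.0                 # 400 Gb/s is 50 bytes per ns
    round_trip_ns = 2.0 * cable_m * ns_per_m       # pause out, data still arriving
    in_flight = bytes_per_ns * (round_trip_ns + response_ns)
    # One max-size frame may delay the pause; another may already be on the wire.
    return math.ceil(in_flight + 2 * mtu_bytes)


# response_ns=0 isolates the cable contribution; it is not a vendor figure.
short_link = pfc_headroom_bytes(100, 400, 4096, response_ns=0)    # 58192
long_link = pfc_headroom_bytes(2000, 400, 4096, response_ns=0)    # 1008192
```

The cable term grows linearly with both length and link rate, which is why a 2 km link at 400G needs roughly a megabyte of headroom per lossless priority before any device latency is added.
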
Graded labs
Lab

Headroom from first principles

Measure link RTT with hardware timestamping, compute headroom for three link profiles (3 m DAC, 100 m AOC, 2 km 400G-FR4), configure per-PG buffers, then prove correctness by driving RDMA traffic at line rate with zero drops while a smaller headroom setting demonstrably drops.

Lab

Pause-storm break/fix

An instructor-injected misbehaving NIC floods PFC pauses into the fabric; diagnose from switch counters alone (no host access), contain with PFC watchdog and per-port isolation, and write the RCA.

Lab

QoS map conformance audit

Given a 3-vendor fabric (SONiC, Junos-evo-style, NX-OS-style configs) with two deliberate DSCP/TC mapping inconsistencies, find and fix them using only packet captures and interface QoS counters.

RCDE-M04

ECN, DCQCN & Congestion Control Tuning at Cluster Scale

PFC is the safety net; congestion control is the strategy. This module masters DCQCN end to end: WRED-ECN marking curves (Kmin/Kmax/Pmax) on the switch, CNP generation and reaction at the NIC, the alpha rate-reduction/recovery state machine, and the timer/byte-counter increase phases — then how to tune all of it for incast-heavy collective traffic so ECN does the work and PFC almost never fires. Covers interaction between marking thresholds and buffer occupancy, per-queue vs per-port marking, and the emerging alternatives (HPCC-style INT-based control, receiver-driven credit schemes) you must evaluate as an expert.

You will be able to
  • Learner can tune Kmin/Kmax/Pmax and NIC-side DCQCN parameters (CNP interval, alpha g, rate-increase timers) to hold P99 queue depth below a target while sustaining >90% link utilisation under a 32:1 incast.
  • Learner can instrument and interpret the diagnostic hierarchy — ECN-marked packet counters, CNPs sent/received, PFC pauses per priority — to prove whether a fabric is ECN-governed or PFC-governed.
  • Learner can explain the DCQCN alpha update and rate-recovery state machine precisely enough to predict throughput behaviour when a parameter is misconfigured.
  • Learner can compare DCQCN against INT-based (HPCC-class) and receiver-driven congestion control and state the conditions under which each wins for AI collectives.
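
The two mechanisms these outcomes depend on, the switch marking curve and the sender's alpha state machine, can be sketched as below. This follows the publicly described DCQCN algorithm (rate cut by alpha/2 on a CNP, alpha decaying when no CNP arrives, fast recovery halving the gap to the target rate); NIC implementations differ in detail, the byte-counter and additive-increase phases are omitted, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """WRED-ECN curve: 0 below Kmin, linear up to Pmax at Kmax, 1 above Kmax."""
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb > kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


@dataclass
class DcqcnSender:
    rate_gbps: float
    target_gbps: float
    alpha: float = 1.0          # starts fully conservative
    g: float = 1 / 256          # alpha gain; a commonly quoted default

    def on_cnp(self) -> None:
        """CNP received: remember the old rate, cut, then raise alpha."""
        self.target_gbps = self.rate_gbps
        self.rate_gbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No CNP in the last interval: decay alpha toward zero."""
        self.alpha *= 1 - self.g

    def fast_recovery_step(self) -> None:
        """Move halfway back toward the rate held before the cut."""
        self.rate_gbps = (self.target_gbps + self.rate_gbps) / 2


sender = DcqcnSender(rate_gbps=400.0, target_gbps=400.0)
sender.on_cnp()               # rate drops to 200, target stays at 400
sender.fast_recovery_step()   # rate recovers to 300
```

Stepping the model by hand shows why a large `g` or a short CNP interval makes senders over-react to a brief incast, while a Kmin set too high leaves PFC to absorb the burst instead.
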
Graded labs
Lab

The incast tuning gauntlet

On the RKR RDMA pod, an untuned fabric fails a 32:1 ib_write_bw incast scenario with PFC storms; iterate ECN curves and NIC DCQCN settings until the graded target is met — zero drops, PFC pause time under threshold, aggregate goodput above 90% — documenting each iteration's counter evidence.

Lab

ECN-vs-PFC governance proof

Design and run an experiment that quantitatively demonstrates the same workload under (a) ECN-dominant and (b) PFC-dominant regimes, producing a report with marked-packet/CNP/pause time-series that an architecture board could act on.

RCDE-M05

InfiniBand vs Ethernet & the Ultra Ethernet Transition

The defining architecture decision of the decade: NDR/XDR InfiniBand with its subnet manager, credit-based flow control, adaptive routing, and SHARP in-network reduction — versus Ethernet fabrics (standards-based RoCEv2, NVIDIA Spectrum-X adaptive-routing/telemetry enhancements) and the Ultra Ethernet Consortium's UET transport with packet spraying, out-of-order delivery, and modern sender/receiver congestion control. This module builds the evidence-based decision framework: performance per collective type, operability, multi-vendor leverage, failure behaviour, and five-year TCO, so your recommendation survives both the CFO and the ML platform team.

You will be able to
  • Learner can explain InfiniBand fabric operation — subnet manager LID assignment, credit-based flow control, adaptive routing, SHARP v3 in-network AllReduce — at a depth sufficient to size and troubleshoot an NDR fabric.
  • Learner can articulate the UEC architecture (UET transport, packet spraying across ECMP, out-of-order packet delivery with in-order message completion, link-level and end-to-end congestion control) and what it changes versus RoCEv2/DCQCN.
  • Learner can run and normalise equivalent collective benchmarks across IB and Ethernet test beds and present the busbw/latency deltas honestly, including variance and tail behaviour.
  • Learner can produce a costed 5-year TCO and risk comparison (optics, switches, NICs, licences, ops skills pool) for IB vs Ethernet for a 1,024-GPU cluster and defend a recommendation.
Graded labs
Lab

The bake-off

Run an identical nccl-tests suite (AllReduce, AllGather, All-to-All sweeps 8B-8GB) on the RKR InfiniBand pod and the tuned Ethernet pod; produce a normalised comparison report with busbw curves, P99 latency, and an explicit statement of where each fabric wins.

Lab

Board-ready decision memo

Given a realistic Indian GPU-cloud RFP (capex ceiling, ops team profile, growth plan), write a 4-page IB-vs-Ethernet/UEC recommendation with TCO model and risk register, graded against the RKR expert rubric.

RCDE-M06

Frontend, Storage & Multi-Tenant Fabric Services

A GPU cluster is three networks: the backend compute fabric, the storage fabric feeding checkpoints and datasets (NVMe-oF/RDMA to parallel filesystems like Lustre/GPFS-class or object stores), and the frontend/management network carrying orchestration, in-band telemetry, and tenant access. This module designs all three coherently: EVPN-VXLAN (RFC 8365) multi-tenancy with symmetric IRB on the frontend, lossless-class treatment for storage RDMA without starving compute collectives, DPU/SmartNIC-based tenant isolation, and how GPU-as-a-Service providers carve one physical cluster into secure tenant slices.

You will be able to
  • Learner can design a three-plane network architecture (compute, storage, frontend/management) for a GPU cloud, with explicit isolation, QoS, and failure-independence guarantees for each plane.
  • Learner can configure EVPN-VXLAN with symmetric IRB, route-target-based tenant separation, and DCI handoff for the frontend plane, and articulate why the backend compute fabric deliberately avoids overlay encapsulation.
  • Learner can engineer QoS coexistence when checkpoint storage bursts (NVMe-oF over RDMA) share infrastructure with training collectives — separate priorities, buffer carving, and admission strategy.
  • Learner can evaluate DPU-based (BlueField-class) isolation and telemetry offload against switch-based enforcement for multi-tenant GPU clusters.
Graded labs
Lab

Three-plane build

Extend the M02 fabric with an EVPN-VXLAN frontend (two tenants, symmetric IRB, verified route-target isolation) and a storage class with its own PFC priority; prove a saturating storage burst cannot push compute-collective P99 latency past budget.

Lab

Tenant-isolation red team

Attempt three specified cross-tenant leakage paths (route leaking, shared-buffer starvation, management-plane pivot) against a peer's build; document which are blocked by design and fix any that are not.

RCDE-M07

Telemetry, Performance Validation & Fabric Operations

Expert fabrics are operated from evidence. This module builds the observability and acceptance discipline: gNMI/OpenConfig streaming telemetry into a Prometheus/Grafana stack, ASIC-level drop forensics (What-Just-Happened-class mirroring, mirror-on-drop), the counter hierarchy that matters (per-PG buffer watermarks, ECN marks, CNPs, PFC pause duration, out-of-sequence NAKs), and turning perftest and nccl-tests into automated acceptance gates — the pass/fail wall between 'cabled' and 'production'. Includes intent-based fabric validation (batfish-style pre-checks, automated LLDP/rail audits) and change-management for live training estates where a 3-week job must not die for a spine upgrade.

You will be able to
  • Learner can deploy gNMI streaming telemetry with sub-second sampling of buffer watermarks, ECN, CNP, and PFC counters, and build the four dashboards RKR defines as minimum viable AI-fabric observability.
  • Learner can perform drop forensics on a lossless fabric using mirror-on-drop/WJH-class tooling and classify the root cause (buffer, ACL, MTU, routing) from the evidence trail.
  • Learner can author an automated acceptance suite — LLDP rail audit, per-link ib_write_bw threshold, full-fabric nccl-tests busbw gate — that blocks handover on any regression.
  • Learner can design a maintenance strategy (drain via BGP, per-plane isolation, rollback gates) that upgrades a spine under a live synchronous training job.
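
The LLDP rail audit at the heart of the acceptance suite reduces to comparing an intended rail map with observed neighbours. The sketch below assumes the LLDP data has already been collected into a dictionary; the leaf naming scheme (`rail<N>-leaf`) and both function names are hypothetical.

```python
def expected_rail_map(nodes, rails: int = 8):
    """Intended wiring: NIC i of every node lands on the leaf for rail i."""
    return {(node, nic): f"rail{nic}-leaf"
            for node in nodes for nic in range(rails)}


def audit_rails(expected, observed):
    """Return (port, wanted_leaf, seen_leaf) for every mismatch.

    observed maps (node, nic_index) to the leaf name learned via LLDP;
    a missing entry is reported with seen_leaf set to None.
    """
    findings = []
    for port in sorted(expected):
        wanted, seen = expected[port], observed.get(port)
        if seen != wanted:
            findings.append((port, wanted, seen))
    return findings


# Seed one miswire: NIC 5 of gpu02 is cabled to the rail-4 leaf.
expected = expected_rail_map(["gpu01", "gpu02"])
observed = dict(expected)
observed[("gpu02", 5)] = "rail4-leaf"
findings = audit_rails(expected, observed)
handover_ok = not findings      # any finding blocks handover
```
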
Graded labs
Lab

Minimum viable observability

Stand up gNMI collection into Prometheus/Grafana for the module fabric and build the graded dashboard set: fabric health, congestion (ECN/CNP/PFC), buffer watermarks, and per-rail collective performance trend.

Lab

Acceptance-gate automation

Write the Python/Ansible acceptance pipeline that audits rail wiring via LLDP, runs per-link perftest and whole-fabric nccl-tests, and emits a signed pass/fail handover report; it must correctly fail a fabric with one seeded miswire and one degraded optic.

Lab

Live-estate spine upgrade

Execute a spine OS upgrade under continuous synthetic collective load using BGP drain and per-plane isolation, keeping busbw degradation within the stated maintenance budget for the full window.

RCDE-M08

Migration, Capacity Roadmaps & Multi-Site AI Fabrics

Most RCDE holders will not build greenfield — they will transform a live estate. This module covers brownfield-to-AI-fabric migration (parallel-fabric builds, workload cohort cutover, rollback criteria, cable-plant reuse analysis), capacity roadmapping where fabric, power, and cooling are co-designed (kW per rack from 15 to 130+, liquid-cooling implications for switch placement), and multi-site scale: cross-DC training feasibility, bandwidth-delay-product limits on synchronous collectives, checkpoint-replication design over DCI/DWDM, and hierarchical/asynchronous training topologies when the speed of light says no.

You will be able to
  • Learner can author a phased migration runbook from a legacy L2/vPC-style estate to a rail-optimized AI fabric, with per-phase rollback gates, cutover windows, and risk register.
  • Learner can quantify why synchronous AllReduce across a 40 km DCI fails (bandwidth-delay product, alpha-beta cost model with 400 microsecond RTT) and design the hierarchical or checkpoint-replication alternative.
  • Learner can build a 3-year capacity roadmap coupling GPU procurement waves to fabric plane additions, power/cooling milestones, and optics refresh (400G to 800G to 1.6T).
  • Learner can design checkpoint and dataset replication over DCI (EVPN/IP over DWDM, RPO targets, throughput sizing) between two Indian availability zones.
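
The 40 km figure can be checked with a few lines of arithmetic. This sketch assumes about 5 ns per metre of fibre, takes the per-step latency of a ring spanning both sites as the one-way DCI delay (every step is gated by the slowest hop), and ignores equipment latency; the function names are hypothetical.

```python
def fibre_rtt_s(distance_km: float, ns_per_m: float = 5.0) -> float:
    """Round-trip propagation delay over fibre, in seconds."""
    return 2 * distance_km * 1000 * ns_per_m * 1e-9


def bdp_bytes(rate_gbps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return rate_gbps * 1e9 / 8 * rtt_s


def ring_latency_term_s(ranks: int, step_latency_s: float) -> float:
    """Latency (alpha) part of a ring AllReduce: 2(N-1) sequential steps."""
    return 2 * (ranks - 1) * step_latency_s


rtt = fibre_rtt_s(40)                               # about 400 microseconds
bdp = bdp_bytes(400, rtt)                           # about 20 MB per 400G link
cross_site = ring_latency_term_s(1024, rtt / 2)     # about 0.41 s per AllReduce
```

Under these assumptions a 1,024-rank ring pays roughly 0.4 s of pure latency per AllReduce before any data moves, which is the quantitative case for hierarchical or asynchronous designs across sites.
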
Graded labs
Lab

Brownfield cutover execution

Migrate a running two-tenant workload from a simulated legacy fabric to the AI fabric in three rehearsed phases, executing one planned rollback at a seeded failure gate, with zero unplanned service loss measured by continuous synthetic probes.

Lab

Two-site design study

Given RTT, DCI bandwidth, and job specs for Mumbai and Chennai facilities, produce the multi-site training architecture — what runs synchronous, what runs hierarchical, checkpoint-replication sizing — with the alpha-beta math shown.

RCDE-M09

Expert Capstone: Full-Cluster Design, Build & Defense

The finishing module is a compressed rehearsal of the RCDE Practical. Working from a realistic Indian GPU-cloud customer brief (capex ceiling, tenancy model, growth trajectory), you produce the complete design package — traffic model, rail-optimized topology, BOM, QoS/buffer plan, telemetry and acceptance strategy, migration plan — then build and tune the fabric on the RKR lab pod, survive a timed break/fix gauntlet, and defend the whole package orally before an RKR expert panel. Graded on the same 100-point rubric as the certification lab, so exam day holds no surprises.

You will be able to
  • Learner can produce a complete, internally consistent design document (15-25 pages) covering traffic model through migration plan, at a quality an architecture review board would approve.
  • Learner can build and tune the designed fabric to pass the full RKR acceptance suite within a timed window.
  • Learner can diagnose and repair three unseen injected faults using telemetry evidence alone within the break/fix SLA.
  • Learner can defend design decisions orally under adversarial questioning, conceding and correcting weaknesses without losing the architecture's coherence.
Graded labs
Lab

Mock RCDE Practical (full 8-hour dress rehearsal)

The complete exam experience — design, build, tune, break/fix, oral defense — proctored and scored on the certification rubric, with a written debrief identifying the candidate's weakest rubric domains.

Lab

Rubric-gap remediation sprint

A targeted 4-hour lab regenerated from the candidate's two weakest mock-exam domains (e.g., DCQCN tuning under incast, EVPN tenant isolation), repeated until both domains score above 70%.

How you’re examined

The RCDE exam format.

RCDE assessment is two-stage and fully proctored.

Stage 1: Theory (120 minutes, remote-proctored). 70 scenario-based questions spanning collective-communication cost math (compute the ring-AllReduce completion time for a given message size, GPU count, and busbw), PFC headroom and buffer-carving calculations, DCQCN parameter reasoning, and design trade-off judgment (IB vs Ethernet/UEC, rail-optimized vs conventional Clos); passing score 75%.

Stage 2: The RCDE Practical (8 hours, proctored at an RKR lab centre or via supervised remote pod). Candidates receive a customer brief for a 512-GPU training cluster and must (a) produce the design: rail map, spine/leaf BOM, oversubscription and bisection math, buffer/QoS plan; (b) build the fabric on the RKR lab pod (containerlab/SONiC-VS + physical 400G leaf pair with RDMA-capable NICs), including BGP unnumbered underlay, PFC/ECN configuration, and DCQCN tuning verified with perftest and nccl-tests against published busbw thresholds; (c) survive a timed break/fix section with injected faults (PFC storm, rail miswire, ECN mis-marking); and (d) defend the design in a 30-minute oral review against an RKR expert panel.

The practical is graded on a published 100-point rubric (design 30, build 30, tune/verify 20, break/fix 10, defense 10); minimum 70 overall with no domain below 50%. One free retake of the practical within 12 months.
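
The practical's pass rule can be expressed as a short check. This sketch reads "no domain below 50%" as 50% of that domain's own points, which is an interpretation rather than published RKR wording; the dictionary keys are hypothetical labels for the five rubric domains.

```python
# Domain weights from the published 100-point rubric.
RUBRIC = {"design": 30, "build": 30, "tune_verify": 20,
          "break_fix": 10, "defense": 10}


def practical_passes(scores: dict) -> bool:
    """True if total >= 70 and every domain reaches half of its weight."""
    total = sum(scores[domain] for domain in RUBRIC)
    no_weak_domain = all(scores[domain] >= 0.5 * weight
                         for domain, weight in RUBRIC.items())
    return total >= 70 and no_weak_domain


# 77 points overall, but break/fix scores 4 of 10, so the attempt fails.
attempt = {"design": 25, "build": 25, "tune_verify": 15,
           "break_fix": 4, "defense": 8}
result = practical_passes(attempt)
```
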

Career plan

Where the RCDE takes you.

RCDE is engineered for the top of the Indian datacenter ladder: engineers who move from operating fabrics to owning cluster-scale architecture decisions. Expert-tier holders are positioned for senior fabric engineering, AI-infrastructure architecture, and principal/distinguished tracks at hyperscalers, GPU-cloud providers (Yotta, E2E, Neysa, Sify class), colocation operators, and the AI platform teams of global capability centres — roles where a single fabric decision moves eight-figure capex.

Roles unlocked
  • AI Fabric Architect / GPU Cluster Network Architect
  • Senior Datacenter Network Engineer (AI/HPC fabrics)
  • Principal Network Engineer - AI Infrastructure
  • Technical Lead, Hyperscale / GPU-Cloud Network Engineering
  • Datacenter Infrastructure Consultant (AI-readiness practice)
Salary band
Rs 26-90+ LPA (senior to principal, datacenter stream; AI-fabric specialisation carries a ~1.7x niche premium over generic DC roles)
Entry point (RCDP + RCDE in progress)
Senior Datacenter Network Engineer — AI fabrics
Rs 26-38 LPA
1-2 years post-RCDE
AI Fabric Architect / Lead Network Engineer, GPU cloud
Rs 38-55 LPA
3-4 years post-RCDE
Principal Network Engineer — AI Infrastructure
Rs 55-75 LPA
Senior track
Distinguished Engineer / Head of Network Architecture
Rs 75-90+ LPA
Demand signal

As of mid-2026, India's datacenter build-out is racing from roughly 1,700 MW of installed capacity toward an announced 5-6.5 GW pipeline by 2030 — a trajectory expected to create ~100,000 datacenter jobs — yet operators report 73% of datacenter operations roles are hard to fill and India faces a ~53% shortfall in AI-infrastructure skills against 2026 demand. Engineers with verified lossless-fabric and GPU-cluster design skills command a ~1.7x salary premium over generic network roles.

9 modules. 19 graded labs. One verifiable credential.

20 weeks at 12 hours a week — proven at the lab pod, scored against a published rubric.
