Study Guide
NVIDIA Certified Professional: AI Infrastructure Study Guide
Use the saved domain outline to connect the exam's main areas to scenario-based questions and explanations: system bring-up, hardware management, and control plane installation; GPU configuration, partitioning, and lifecycle management; cluster scheduling, containers, and AI workload runtime; and network fabric, InfiniBand, and distributed communication performance.
How the Exam Is Structured
NVIDIA Certified Professional: AI Infrastructure (NCP-AII) validates skills in system bring-up, hardware management, and control plane installation; GPU configuration, partitioning, and lifecycle management; cluster scheduling, containers, and AI workload runtime; and network fabric, InfiniBand, and distributed communication performance. The ExamPal practice bank includes 281 premium questions and 40 free questions mapped across the official blueprint.
| Domain | Weight | Focus |
|---|---|---|
| Domain 1: System Bring-up, Hardware Management, and Control Plane Installation | 24% | 1.1 Validate server hardware readiness and perform initial bring-up; Verify POST completion and BIOS/UEFI status |
| Domain 2: GPU Configuration, Partitioning, and Lifecycle Management | 18% | 2.1 Manage GPU operational state and persistence; Inspect GPU inventory and utilization |
| Domain 3: Cluster Scheduling, Containers, and AI Workload Runtime | 18% | 3.1 Configure and validate GPU scheduling with Slurm; Explain Slurm GRES for GPUs |
| Domain 4: Network Fabric, InfiniBand, and Distributed Communication Performance | 22% | 4.1 Validate InfiniBand and high-speed network configuration; Identify common network technologies |
| Domain 5: Monitoring, Diagnostics, Troubleshooting, and Performance Verification | 18% | 5.1 Monitor GPU health and performance in real time; Collect GPU health metrics |
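One way to use the weights in the table is to split a study budget in proportion to them. The sketch below is only an illustration: the 40-hour budget and the function name `allocate_hours` are assumptions, not part of the exam blueprint.

```python
# Domain weights copied from the table above (percent of exam).
WEIGHTS = {
    "Domain 1": 24,
    "Domain 2": 18,
    "Domain 3": 18,
    "Domain 4": 22,
    "Domain 5": 18,
}


def allocate_hours(total_hours, weights):
    """Split total_hours across domains in proportion to their weights."""
    total_weight = sum(weights.values())
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    return {name: round(total_hours * w / 100, 2) for name, w in weights.items()}


if __name__ == "__main__":
    # 40 hours is a hypothetical study budget, not a recommendation.
    for name, hours in allocate_hours(40, WEIGHTS).items():
        print(f"{name}: {hours} h")
```

The sum check guards against a mistyped weight before any time is allocated.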
24% of exam
Domain 1: System Bring-up, Hardware Management, and Control Plane Installation
Covers initial server validation, out-of-band management, host software prerequisites, and installation/validation of NVIDIA AI infrastructure control plane components. It also includes platform connectivity, device topology, and physical rack-level checks needed to bring systems into a ready state.
18% of exam
Domain 2: GPU Configuration, Partitioning, and Lifecycle Management
Covers GPU operational state, persistence, MIG-enabled environments, Fabric Manager, and compatibility across firmware, driver, CUDA, and management tools. It also includes interpreting GPU error and event conditions to support lifecycle management and troubleshooting.
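As a rough illustration of inspecting GPU inventory and persistence state, the sketch below parses CSV output of the kind produced by `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`. The sample text is made up for illustration, the helper names are hypothetical, and the parser assumes every field is populated (some systems report `[N/A]` for certain fields).

```python
import csv
import io

# Command whose output this parser expects (not executed here):
#   nvidia-smi --query-gpu=index,name,persistence_mode,utilization.gpu \
#              --format=csv,noheader,nounits
FIELDS = ["index", "name", "persistence_mode", "utilization_gpu"]

# Illustrative sample output, not captured from a real system.
SAMPLE = """\
0, NVIDIA H100 80GB HBM3, Enabled, 97
1, NVIDIA H100 80GB HBM3, Disabled, 0
"""


def parse_gpu_inventory(text):
    """Turn the CSV text into a list of dicts, one per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        row = dict(zip(FIELDS, (value.strip() for value in record)))
        row["index"] = int(row["index"])
        row["utilization_gpu"] = int(row["utilization_gpu"])
        rows.append(row)
    return rows


def gpus_without_persistence(rows):
    """Return the indices of GPUs whose persistence mode is not enabled."""
    return [r["index"] for r in rows if r["persistence_mode"] != "Enabled"]


if __name__ == "__main__":
    print(gpus_without_persistence(parse_gpu_inventory(SAMPLE)))
```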
18% of exam
Domain 3: Cluster Scheduling, Containers, and AI Workload Runtime
Covers GPU scheduling with Slurm, GPU resource requests, containerized AI workloads, NVIDIA-optimized AI software stacks, and operational job control. It emphasizes validating resource allocation, runtime integration, and cluster utility output for workload execution and troubleshooting.
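A GPU resource request in Slurm is typically expressed through GRES, for example `--gres=gpu:2` for two GPUs per node. The sketch below only assembles such a command line; the function name and defaults are hypothetical, and it assumes the cluster has a GRES type named `gpu` configured.

```python
import shlex


def srun_gpu_command(gpus_per_node, nodes=1, command=("nvidia-smi", "-L")):
    """Build an srun argument list that requests GPUs through GRES.

    Assumes a GRES type called "gpu" is defined on the cluster.
    """
    if gpus_per_node < 1 or nodes < 1:
        raise ValueError("gpus_per_node and nodes must be positive")
    return ["srun", f"--nodes={nodes}", f"--gres=gpu:{gpus_per_node}", *command]


if __name__ == "__main__":
    # Print the command instead of running it, so no cluster is needed.
    print(shlex.join(srun_gpu_command(2)))
```

Running `nvidia-smi -L` inside the allocation is one simple way to confirm that the job sees the number of GPUs it asked for.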
22% of exam
Domain 4: Network Fabric, InfiniBand, and Distributed Communication Performance
Covers validation of InfiniBand and high-speed network configuration, diagnostic tools, NCCL communication topology and behavior, communication performance measurement, distributed training troubleshooting, and cluster-level communication readiness. It emphasizes fabric health, topology selection, and performance verification for AI/HPC communication paths.
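For communication performance measurement, the nccl-tests benchmarks report both an algorithm bandwidth and a bus bandwidth. A minimal sketch of the all-reduce relationship, busbw = algbw × 2(n − 1)/n for n ranks, is shown below; the function name and the numbers in the usage line are illustrative assumptions, not measured results.

```python
def allreduce_bandwidth(size_bytes, seconds, ranks):
    """Return (algbw, busbw) in GB/s for an all-reduce.

    algbw is message size divided by elapsed time; busbw applies the
    all-reduce correction factor 2 * (ranks - 1) / ranks.
    """
    if seconds <= 0 or ranks < 1:
        raise ValueError("seconds must be > 0 and ranks must be >= 1")
    algbw = size_bytes / seconds / 1e9
    busbw = algbw * 2 * (ranks - 1) / ranks
    return algbw, busbw


if __name__ == "__main__":
    # Hypothetical run: 1 GB all-reduce across 8 ranks in 10 ms.
    algbw, busbw = allreduce_bandwidth(1e9, 0.010, 8)
    print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
```

Bus bandwidth is the figure usually compared against link speed, because the correction factor removes the dependence on the number of ranks.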
18% of exam
Domain 5: Monitoring, Diagnostics, Troubleshooting, and Performance Verification
Covers real-time GPU health monitoring, DCGM diagnostics, Xid and driver-related faults, thermal and power reliability issues, cluster test and performance verification, and interconnect or topology degradation. It focuses on using telemetry and benchmarks to isolate root cause and confirm production readiness.
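Xid events appear in the kernel log as `NVRM: Xid (PCI:...): <number>, ...` lines. The sketch below extracts them from log text; the sample lines are invented for illustration, the helper names are hypothetical, and the lookup table covers only three well-known codes rather than the full Xid catalog.

```python
import re

# Matches the PCI address and Xid number in an NVRM kernel log line.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)")

# A small subset of Xid codes; consult NVIDIA's Xid documentation for the rest.
KNOWN_XIDS = {
    13: "Graphics Engine Exception",
    48: "Double Bit ECC Error",
    79: "GPU has fallen off the bus",
}

# Illustrative log lines, not captured from a real system.
SAMPLE_LOG = """\
[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.
[ 1300.000001] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[ 1400.123456] NVRM: Xid (PCI:0000:af:00): 13, pid=5555, Graphics Exception
"""


def find_xid_events(log_text):
    """Return (pci_address, xid, description) for each Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            xid = int(match.group("xid"))
            events.append((match.group("pci"), xid, KNOWN_XIDS.get(xid, "unknown")))
    return events


if __name__ == "__main__":
    for pci, xid, description in find_xid_events(SAMPLE_LOG):
        print(f"{pci}: Xid {xid} ({description})")
```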
Key Terms to Know
These terms are loaded from the shared terminology pack and appear across the question explanations.
- --gpus flag: A container runtime option used to assign GPU resources to a container.
- CUDA_ERROR_NO_DEVICE: A CUDA error code (cudaErrorNoDevice in the runtime API) indicating that no visible or usable GPU device is available to the application.
- Communication hang: A condition where distributed communication stalls and processes stop making progress, often due to transport or synchronization issues.
- Container runtime GPU configuration: Settings that control whether and how GPUs are exposed to containers, including runtime flags and environment variables.
- DCGM: NVIDIA Data Center GPU Manager, a toolset for discovering, monitoring, diagnosing, and managing GPUs in data center environments.
- DCGM diagnostic Level 2: The minimum DCGM diagnostic level that includes memory stress testing in addition to basic health checks.
- ECC page retirement: The permanent removal of faulty GPU memory pages from use after ECC detects repeated or severe errors.
- Enroot: An unprivileged container runtime commonly used on HPC systems to run containerized workloads without Docker or root access.
- GPU discovery: The process of detecting and enumerating GPU devices available on a system.
- GRES: Generic RESources in Slurm, a mechanism for scheduling specialized resources such as GPUs.
- Graphics engine exception: A fault reported by the GPU graphics or compute engine when executing invalid or problematic workload instructions.
- H100 SXM: An NVIDIA Hopper-generation SXM-form-factor GPU accelerator designed for high-performance AI and HPC workloads.
- HBM3: High Bandwidth Memory generation 3, a stacked memory technology providing very high throughput for GPUs.
- InfiniBand: A high-performance networking technology commonly used in HPC and AI clusters for low-latency, high-throughput communication.
- Memory bandwidth: The rate at which data can be read from or written to GPU memory, typically expressed in GB/s or TB/s.
- Memory stress testing: A diagnostic procedure that exercises GPU memory heavily to detect stability or reliability issues.
- NCCL: NVIDIA Collective Communications Library, used for multi-GPU and multi-node communication primitives such as all-reduce and broadcast.
- NCCL debug environment variables: Configuration variables such as NCCL_DEBUG and related settings used to troubleshoot communication failures and hangs.
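Several of these terms meet in a single container launch: the `--gpus` flag exposes GPUs, and environment variables such as `NCCL_DEBUG` are passed with `-e`. The sketch below only builds a `docker run` argument list; the function name and the image name `example/cuda-image:latest` are hypothetical, and it assumes the NVIDIA Container Toolkit is installed on the host.

```python
import shlex


def docker_gpu_command(image, gpus="all", env=None, command=("nvidia-smi",)):
    """Build a docker run argument list that exposes GPUs to the container.

    gpus may be "all" or a positive integer count.
    """
    is_count = isinstance(gpus, int) and not isinstance(gpus, bool) and gpus > 0
    if not (gpus == "all" or is_count):
        raise ValueError("gpus must be 'all' or a positive integer")
    args = ["docker", "run", "--rm", "--gpus", str(gpus)]
    for key, value in (env or {}).items():
        args += ["-e", f"{key}={value}"]  # pass each variable into the container
    return [*args, image, *command]


if __name__ == "__main__":
    # The image name is a placeholder, not a real published image.
    cmd = docker_gpu_command("example/cuda-image:latest", env={"NCCL_DEBUG": "INFO"})
    print(shlex.join(cmd))
```

If an application in the container still fails with a no-device error, checking whether `nvidia-smi` works inside the same container is a common first step.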
Official Materials and Guidance
This page is built from official NVIDIA materials and the ExamPal shared release pack: the shared syllabus, topic tree, terminology pack, free pack, and premium pack.
- Guidance: NVIDIA official certification page and outline, saved locally
- Domain outline: System/server bring-up 31%; Physical layer management 5%; Control plane install/config 19%; Cluster test/verification 33%; Troubleshoot/optimize 12%.