Systalyze: open-source GPU monitor exposing the 100% utilization lie

gpu-monitoring · cuda · llm-inference · performance-optimization · nvidia · observability · open-source
Originally from systalyze.com

My notes

Summary

Standard GPU utilization metrics (nvidia-smi, nvtop, cloud dashboards) only show whether the GPU is doing anything, not how hard it is actually working. Real arithmetic throughput can be 1-7% while dashboards read 100%, driving massive over-purchasing of GPU capacity. Systalyze open-sourced Utilyze (Apache 2.0), a near-zero-overhead tool that measures true Compute SOL % and Memory SOL % via hardware counters in real time.

Key Insights

  • The “100% GPU” lie is industry-wide. nvidia-smi, nvtop, Weights & Biases, CloudWatch, GCP Monitoring, and Azure Monitor all surface the same misleading metric. It tracks “is anything running”, not “how much arithmetic is happening”.
  • Even DCGM’s “SM Active” is broken. It reads 99% on a memory-bound H200 workload whose true compute throughput is 6%. Warps resident on SMs can be waiting on memory rather than computing; SM Active cannot tell the difference.
  • Real measured gaps in production:
    • LoRA fine-tune of Llama-3.1-8B on 2x H200: nvidia-smi reads 80-100% while true Compute SOL is 1-7%. Optimization brought it to 40-55%, a 6-8x throughput improvement.
    • Prefill-heavy Llama-3.1-8B inference (vLLM, 2x H200): 45% Compute SOL against an 89% attainable ceiling; optimizing to ~89% gave a ~40% tokens/sec increase (52,298 to 73,903 tokens/s).
    • gpt-oss-20b MoE full fine-tune on 4x H200: nvidia-smi reads 100%, true Compute SOL is 3-15%. Optimization brought it to 30-60%.
  • Two-number framework: Speed-of-Light (SOL). Every kernel is bound by either compute throughput or memory bandwidth. Compute SOL % = achieved FLOPs / peak FLOPs; Memory SOL % = achieved bandwidth / peak bandwidth. The higher of the two identifies the binding constraint (see the sketch after this list).
  • 100% SOL is unreachable. Kernel launch overhead, memory hierarchy traversal, multi-GPU communication, and MoE routing all set a structural ceiling below 100%. Utilyze estimates this as “Attainable SOL %”, the realistic upper bound for your specific model+hardware+parallelism combo. The gap between current and Attainable is your optimization budget; the gap between Attainable and 100% is physics and can’t be tuned away.
  • Why no one had this before. Nsight Compute (ncu) replays each kernel many times to gather counters, a 10-100x slowdown that makes it unusable in production. Nsight Systems (nsys) shows a timeline but no throughput. Utilyze samples GPU performance counters via NVIDIA’s Nsight Perf SDK in rolling windows: negligible overhead, continuous measurement.
  • Strategic implication: “default configurations consistently leave 2-10x performance on the table” across deployments. Procurement decisions made on nvidia-smi data are systematically wrong. The vendor incentive to fix this is, as the Systalyze CEO puts it, “complicated”.
  • Decode-heavy LLM inference is memory-bound, not compute-bound. Each decode step must reload the model weights and the KV cache from HBM. Higher batch sizes amortize this: going from concurrency 2 to 1024 pushed Compute SOL to ~46% (see the back-of-envelope arithmetic below).
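
A minimal Python sketch of the SOL bookkeeping described above, assuming approximate H200 SXM peak numbers and made-up achieved values; the function and field names are illustrative, not Utilyze’s API.

```python
from dataclasses import dataclass

@dataclass
class GpuPeaks:
    # Approximate H200 SXM datasheet peaks; substitute your card's numbers.
    peak_flops: float = 989e12      # dense BF16 tensor-core FLOP/s (approx.)
    peak_bandwidth: float = 4.8e12  # HBM3e bytes/s (approx.)

def sol_percentages(achieved_flops: float, achieved_bytes_per_s: float, peaks: GpuPeaks):
    """Compute SOL % and Memory SOL % for one measurement window."""
    compute_sol = 100.0 * achieved_flops / peaks.peak_flops
    memory_sol = 100.0 * achieved_bytes_per_s / peaks.peak_bandwidth
    return compute_sol, memory_sol

def binding_constraint(compute_sol: float, memory_sol: float) -> str:
    # The higher of the two numbers names the resource the workload is actually pushing.
    return "compute-bound" if compute_sol >= memory_sol else "memory-bound"

def optimization_budget(current_sol: float, attainable_sol: float) -> float:
    """Gap between where you are and the structural ceiling; the rest is physics."""
    return max(0.0, attainable_sol - current_sol)

# Made-up window: 60 TFLOP/s of math and 4.3 TB/s of HBM traffic.
peaks = GpuPeaks()
c, m = sol_percentages(achieved_flops=60e12, achieved_bytes_per_s=4.3e12, peaks=peaks)
print(f"Compute SOL {c:.1f}% | Memory SOL {m:.1f}% -> {binding_constraint(c, m)}")
print(f"Budget vs. an assumed 89% attainable ceiling: {optimization_budget(c, 89.0):.1f} points")
```

With these numbers the window comes out at roughly 6% Compute SOL and 90% Memory SOL, i.e. memory-bound, the same shape as the DCGM example above where SM Active reads 99%.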
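
And a back-of-envelope sketch of the decode bullet, assuming an 8B-parameter BF16 model and the same approximate H200 peaks; none of these numbers come from the article’s measurements.

```python
# Rough roofline arithmetic for decode (assumed numbers, not measurements).
params          = 8e9
bytes_per_param = 2                                   # BF16
weight_bytes    = params * bytes_per_param            # ~16 GB streamed per decode step
hbm_bandwidth   = 4.8e12                              # bytes/s, approx. H200 HBM3e
peak_flops      = 989e12                              # FLOP/s, approx. dense BF16 tensor
flops_per_token = 2 * params                          # ~2 FLOPs per parameter per token

for batch in (1, 2, 64, 1024):
    mem_time = weight_bytes / hbm_bandwidth           # one sweep of the weights from HBM
    compute_time = flops_per_token * batch / peak_flops
    step_time = max(mem_time, compute_time)           # whichever resource binds the step
    compute_sol = 100 * flops_per_token * batch / (step_time * peak_flops)
    bound = "memory" if mem_time >= compute_time else "compute"
    print(f"batch {batch:>4}: ~{compute_sol:5.1f}% Compute SOL ({bound}-bound)")
```

Because the toy model ignores KV-cache traffic, attention FLOPs, launch overhead, and multi-GPU communication, its large-batch ceiling sits well above the ~46% the article reports at concurrency 1024; the shape is the point: well under 1% Compute SOL at concurrency 1-2 versus tens of percent once the weight reload is amortized across a large batch.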