Systalyze: open-source GPU monitor exposing the 100% utilization lie

gpu-monitoring · cuda · llm-inference · performance-optimization · nvidia · observability · open-source
Originally from systalyze.com

My notes

Summary

Standard GPU utilization metrics (nvidia-smi, nvtop, cloud dashboards) only show whether the GPU is doing anything, not how hard it is actually working. Real arithmetic throughput can be 1-7% while dashboards read 100%, driving massive over-purchasing of GPU capacity. Systalyze open-sourced Utilyze (Apache 2.0), a near-zero-overhead tool that measures true Compute SOL % and Memory SOL % via hardware counters in real time.

Key Insights

  • The “100% GPU” lie is industry-wide. nvidia-smi, nvtop, Weights & Biases, CloudWatch, GCP Monitoring, and Azure Monitor all surface the same misleading metric. It tracks “is anything running”, not “how much arithmetic is happening”.
  • Even DCGM’s “SM Active” is broken. It reads 99% on a memory-bound H200 workload whose true compute throughput is 6%. Warps resident on SMs can be waiting on memory rather than computing; SM Active cannot tell the difference.
  • Real measured gaps in production:
    • LoRA fine-tune of Llama-3.1-8B on 2x H200: nvidia-smi reads 80-100% while true Compute SOL is 1-7%. Optimization brought it to 40-55%, a 6-8x throughput improvement.
    • Prefill-heavy Llama-3.1-8B inference (vLLM, 2x H200): 45% Compute SOL against an 89% attainable ceiling; optimizing to ~89% gave a ~40% tokens/sec increase (52,298 to 73,903 tokens/s).
    • gpt-oss-20b MoE full fine-tune on 4x H200: nvidia-smi reads 100%, true Compute SOL is 3-15%. Optimization brought it to 30-60%.
  • Two-number framework: Speed-of-Light (SOL). Every kernel is bound by either compute throughput or memory bandwidth. Compute SOL % = achieved FLOPs / peak FLOPs; Memory SOL % = achieved bandwidth / peak bandwidth. The higher of the two identifies the binding constraint (see the sketch after this list).
  • 100% SOL is unreachable. Kernel launch overhead, memory hierarchy traversal, multi-GPU communication, and MoE routing all set a structural ceiling below 100%. Utilyze estimates this as “Attainable SOL %”, the realistic upper bound for your specific model+hardware+parallelism combo. The gap between current and Attainable is your optimization budget; the gap between Attainable and 100% is physics and can’t be tuned away.
  • Why no one had this before. Nsight Compute (ncu) replays each kernel many times to gather counters, a 10-100x slowdown that makes it unusable in production. Nsight Systems (nsys) shows a timeline but no throughput. Utilyze samples GPU performance counters via NVIDIA’s Nsight Perf SDK in rolling windows: negligible overhead, continuous measurement.
  • Strategic implication: “default configurations consistently leave 2-10x performance on the table” across deployments. Procurement decisions made on nvidia-smi data are systematically wrong. The vendor incentive to fix this is, as the Systalyze CEO puts it, “complicated”.
  • Decode-heavy LLM inference is memory-bound, not compute-bound. Each decode step must reload the model weights and the KV cache from HBM. Higher batch sizes amortize this: going from concurrency 2 to 1024 pushed Compute SOL to ~46% (see the back-of-envelope arithmetic below).
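
A minimal Python sketch of the SOL bookkeeping described above, assuming approximate H200 SXM peak numbers and made-up achieved values; the function and field names are illustrative, not Utilyze’s API.

```python
from dataclasses import dataclass

@dataclass
class GpuPeaks:
    # Approximate H200 SXM datasheet peaks; substitute your card's numbers.
    peak_flops: float = 989e12      # dense BF16 tensor-core FLOP/s (approx.)
    peak_bandwidth: float = 4.8e12  # HBM3e bytes/s (approx.)

def sol_percentages(achieved_flops: float, achieved_bytes_per_s: float, peaks: GpuPeaks):
    """Compute SOL % and Memory SOL % for one measurement window."""
    compute_sol = 100.0 * achieved_flops / peaks.peak_flops
    memory_sol = 100.0 * achieved_bytes_per_s / peaks.peak_bandwidth
    return compute_sol, memory_sol

def binding_constraint(compute_sol: float, memory_sol: float) -> str:
    # The higher of the two numbers names the resource the workload is actually pushing.
    return "compute-bound" if compute_sol >= memory_sol else "memory-bound"

def optimization_budget(current_sol: float, attainable_sol: float) -> float:
    """Gap between where you are and the structural ceiling; the rest is physics."""
    return max(0.0, attainable_sol - current_sol)

# Made-up window: 60 TFLOP/s of math and 4.3 TB/s of HBM traffic.
peaks = GpuPeaks()
c, m = sol_percentages(achieved_flops=60e12, achieved_bytes_per_s=4.3e12, peaks=peaks)
print(f"Compute SOL {c:.1f}% | Memory SOL {m:.1f}% -> {binding_constraint(c, m)}")
print(f"Budget vs. an assumed 89% attainable ceiling: {optimization_budget(c, 89.0):.1f} points")
```

With these numbers the window comes out at roughly 6% Compute SOL and 90% Memory SOL, i.e. memory-bound, the same shape as the DCGM example above where SM Active reads 99%.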
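
And a back-of-envelope sketch of the decode bullet, assuming an 8B-parameter BF16 model and the same approximate H200 peaks; none of these numbers come from the article’s measurements.

```python
# Rough roofline arithmetic for decode (assumed numbers, not measurements).
params          = 8e9
bytes_per_param = 2                                   # BF16
weight_bytes    = params * bytes_per_param            # ~16 GB streamed per decode step
hbm_bandwidth   = 4.8e12                              # bytes/s, approx. H200 HBM3e
peak_flops      = 989e12                              # FLOP/s, approx. dense BF16 tensor
flops_per_token = 2 * params                          # ~2 FLOPs per parameter per token

for batch in (1, 2, 64, 1024):
    mem_time = weight_bytes / hbm_bandwidth           # one sweep of the weights from HBM
    compute_time = flops_per_token * batch / peak_flops
    step_time = max(mem_time, compute_time)           # whichever resource binds the step
    compute_sol = 100 * flops_per_token * batch / (step_time * peak_flops)
    bound = "memory" if mem_time >= compute_time else "compute"
    print(f"batch {batch:>4}: ~{compute_sol:5.1f}% Compute SOL ({bound}-bound)")
```

Because the toy model ignores KV-cache traffic, attention FLOPs, launch overhead, and multi-GPU communication, its large-batch ceiling sits well above the ~46% the article reports at concurrency 1024; the shape is the point: well under 1% Compute SOL at concurrency 1-2 versus tens of percent once the weight reload is amortized across a large batch.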