Stop paying for stalls.
The same eBPF engine that powers the Sentry, retargeted at the memory wall. We hook sys_exit_ioctl, measure the nanosecond gaps between successive MEMCPYs, classify each workload as compute-bound or memory-bound on a roofline, and tell your CFO how much VRAM you can reclaim. Kernel-grade FinOps.
The Memory Wall.
Most GPU dollars are wasted not on the wrong model, but on the wrong regime. Below the knee of the roofline, your H100s sit waiting on PCIe.
Roofline classification puts every workload on a single chart: operational intensity (FLOP per byte) on the x-axis, achievable GFLOP/s on the y-axis. Two ceilings — bandwidth (sloped) and compute (flat) — meet at the knee. Workloads under the bandwidth roof are memory-bound: more silicon won’t help, but quantizing or evicting will. Workloads against the compute roof are compute-bound: that’s real work being done.
// illustrative · KV-prefill is the canonical memory-bound regime; GEMM saturates the compute roof
// each tick is a MEMCPY ioctl · red shading is GPU-stalled time on PCIe · the auditor totals it per-PID, per-shard
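Here is the classification rule in miniature, as a Python sketch. The two peak constants are nominal H100 SXM figures (dense FP16 compute, HBM3 bandwidth) used for illustration only; swap in your own hardware ceilings.

```python
# Illustrative roofline classifier. PEAK_FLOPS and PEAK_BW are nominal
# H100 SXM figures (FP16 dense, HBM3), not measured values.

PEAK_FLOPS = 989e12   # FLOP/s, compute roof (flat ceiling)
PEAK_BW = 3.35e12     # byte/s, bandwidth roof (sloped ceiling)
KNEE = PEAK_FLOPS / PEAK_BW  # FLOP/byte where the two roofs meet (~295)

def classify(flops: float, bytes_moved: float) -> tuple[str, float]:
    """Place one workload on the roofline.

    Returns its regime and the achievable FLOP/s ceiling."""
    oi = flops / bytes_moved                  # operational intensity, FLOP/byte
    if oi < KNEE:
        return "memory-bound", oi * PEAK_BW   # under the sloped bandwidth roof
    return "compute-bound", PEAK_FLOPS        # pinned against the compute roof

# KV-prefill streams far more bytes than it computes on:
print(classify(flops=2e12, bytes_moved=4e10))    # ('memory-bound', ...)
# A large GEMM reuses each byte many times:
print(classify(flops=8e14, bytes_moved=1e12))    # ('compute-bound', ...)
```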
Compute / stall ratio.
The single most truthful number on a GPU bill. The share of GPU time spent doing math rather than waiting on memory.
We hook sys_exit_ioctl at the driver boundary and record a kernel timestamp at every MEMCPY. The nanosecond gap between successive copies is GPU-blocked time — dollars walking out the door. We aggregate per-PID, per-GPU, per-shard and report it as a single ratio. Above 0.9: you’re compute-bound and the cluster is sized correctly. Below 0.7: there’s capital trapped in the memory wall.
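A minimal sketch of the probe, in BCC-style eBPF. The shipped auditor additionally filters ioctls down to the driver's MEMCPY command codes (matched on the corresponding sys_enter_ioctl); those codes are vendor-specific, so this sketch times every ioctl exit per PID.

```python
# Minimal per-PID inter-ioctl gap accumulator (BCC). Run as root.
import time
from bcc import BPF

prog = r"""
BPF_HASH(last_ns, u32, u64);    // PID -> timestamp of previous ioctl exit
BPF_HASH(stall_ns, u32, u64);   // PID -> accumulated inter-ioctl gap

TRACEPOINT_PROBE(syscalls, sys_exit_ioctl) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 now = bpf_ktime_get_ns();          // kernel-side timestamp, ns

    u64 *prev = last_ns.lookup(&pid);
    if (prev) {
        u64 gap = now - *prev;             // gap between successive copies
        u64 zero = 0;
        u64 *total = stall_ns.lookup_or_try_init(&pid, &zero);
        if (total) __sync_fetch_and_add(total, gap);
    }
    last_ns.update(&pid, &now);
    return 0;
}
"""

b = BPF(text=prog)
print("accumulating per-PID inter-MEMCPY gaps, Ctrl-C to report")
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    for pid, ns in b["stall_ns"].items():
        print(f"pid={pid.value}  blocked ≈ {ns.value / 1e9:.3f} s")
```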
Console alerts.
Specific recommendations.
The auditor doesn't just say 'memory-bound.' It tells you what to quantize, to what dtype, and what you'll get back — based on observed VRAM pressure, model architecture, and the KV-cache profile we measured live.
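A hypothetical sketch of the advisory's shape (the thresholds, field names, and dtype ladder below are illustrative, not the shipped heuristics):

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    pid: int
    stall_ratio: float        # math time as a share of total GPU time
    vram_pressure: float      # fraction of VRAM resident
    kv_cache_fraction: float  # share of MEMCPY traffic that is KV-cache
    dtype: str                # current weight dtype

def advise(w: WorkloadProfile) -> str:
    if w.stall_ratio >= 0.9:
        return f"pid {w.pid}: compute-bound, no action"
    if w.kv_cache_fraction > 0.5:
        # KV-cache dominated: quantizing the cache frees recurring traffic
        return f"pid {w.pid}: quantize KV-cache to FP8, review eviction policy"
    if w.dtype in ("fp16", "bf16") and w.vram_pressure > 0.8:
        return f"pid {w.pid}: quantize weights {w.dtype} -> INT4"
    return f"pid {w.pid}: memory-bound, inspect batching before quantizing"

print(advise(WorkloadProfile(4112, 0.69, 0.86, 0.61, "bf16")))
```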
Potential monthly savings.
$39,250.
Computed from a sample 64×H100 vLLM cluster the auditor observed for seven days. 11 PIDs classified memory-bound · stall ratio 0.69 · KV-cache misuse on 3 shards · INT4 advisory issued. The full report names the workloads, the dtypes, the eviction policies, and the reclaim plan.
Try the math on your fleet.
A 20% efficiency gain is the standard benchmark we publish from completed Efficiency Auditor engagements. Plug in your fleet and see what reclaimable capital looks like at your scale.
Annual reclaimed capital.
@ 20% efficiency gain · Arca standard benchmark
// illustrative · standard benchmark from Arca audits · actual savings depend on roofline classification, quantization headroom, and existing scheduler hygiene · requires Arca Efficiency Auditor engagement
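The calculator's arithmetic is deliberately simple. A sketch, with a hypothetical fleet spend plugged in:

```python
# The 20% figure is Arca's published benchmark, not a guarantee;
# the $250k/month fleet spend is an example, not real data.

EFFICIENCY_GAIN = 0.20   # Arca standard benchmark from completed audits

def annual_reclaimed_capital(monthly_fleet_spend_usd: float) -> float:
    return monthly_fleet_spend_usd * 12 * EFFICIENCY_GAIN

print(f"${annual_reclaimed_capital(250_000):,.0f} / year")  # $600,000 / year
```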
Three things you can verify.
The auditor ships with a technical appendix that names every probe, every measurement window, and every assumption. Read the spec.
Kernel-level timing · zero observer effect
We hook sys_exit_ioctl to measure driver-blocked time at the kernel boundary. The probe lives in the kernel; the workload runs in user space and never sees it. Sub-1% overhead target: the auditor itself isn't taxing the GPU you're trying to measure.
Sharding-aware · auto-detects TP
The auditor walks the per-PID NCCL collective fingerprint and infers Tensor Parallelism degree across multi-GPU nodes. Roofline classification is computed per shard — TP=8 looks nothing like TP=1, and the report reflects that.
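A sketch of the inference idea, under an assumed input: per-PID participant sets for recurring all-reduce groups, already extracted from the collective fingerprint. The extraction itself is the auditor's job and is not shown here.

```python
from collections import Counter

def infer_tp_degree(allreduce_groups: list[frozenset[int]]) -> int:
    """Infer Tensor Parallelism degree as the size of the dominant
    recurring all-reduce group: TP shards all-reduce every forward pass,
    so that group size dwarfs occasional DP/PP collectives."""
    if not allreduce_groups:
        return 1  # no collectives observed: single-GPU workload
    sizes = Counter(len(g) for g in allreduce_groups)
    dominant_size, _ = sizes.most_common(1)[0]
    return dominant_size

# e.g. a TP=8 shard all-reducing across GPUs 0-7 on every layer,
# plus two stray data-parallel syncs:
groups = [frozenset(range(8))] * 96 + [frozenset({0, 8})] * 2
print(infer_tp_degree(groups))  # 8
```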
vLLM-aware · KV-cache vs weight-load
A logic gate distinguishes between weight loading (one-time MEMCPY at model load) and KV-cache pre-allocation (recurring per-request). Without this, the roofline lies. With it, you see exactly which workload class is memory-bound and what to quantize.
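The gate in miniature. A sketch assuming per-PID MEMCPY timestamps are already in hand; the five-second window is an illustrative threshold, not the shipped one.

```python
def classify_memcpy(events_ns: list[int], window_ns: int = 5_000_000_000) -> str:
    """One burst of copies near process start = weight load;
    copies that keep recurring afterwards = KV-cache traffic."""
    if not events_ns:
        return "none"
    start = events_ns[0]
    recurring = [t for t in events_ns if t - start > window_ns]
    return "kv-cache (recurring)" if recurring else "weight-load (one-time)"

# Weights stream in once, in the first seconds of the process:
print(classify_memcpy([0, 1_000_000, 2_000_000]))
# KV-cache pre-allocation keeps firing per request, minutes in:
print(classify_memcpy([0, 60_000_000_000, 120_000_000_000]))
```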