ARCA.VISION
// FEATURE — THE EFFICIENCY AUDITOR · PHASE 5

Stop paying for stalls.

The same eBPF engine that powers the Sentry, retargeted at the memory wall. We hook sys_exit_ioctl, measure the nanosecond gaps between successive MEMCPYs, classify each workload as compute-bound or memory-bound on a roofline, and tell your CFO how much VRAM you can reclaim. Kernel-grade FinOps.

// SECTION 01 · ROOFLINE · COMPUTE-BOUND vs MEMORY-BOUND

The Memory Wall.

Most GPU dollars are wasted not on the wrong model, but on the wrong regime. Below the knee of the roofline, your H100s sit waiting on PCIe.

Roofline classification puts every workload on a single chart: operational intensity (FLOP per byte) on the x-axis, achievable GFLOPs on the y-axis. Two ceilings — bandwidth (sloped) and compute (flat) — meet at the knee. Workloads under the bandwidth roof are memory-bound: more silicon won’t help, but quantizing or evicting will. Workloads against the compute roof are compute-bound: that’s real work being done.

Classification · per-PID, per-GPU, per-shard
Update cadence · 1 s rolling window
Memory roof · PCIe + HBM bandwidth, measured live
Compute roof · vendor-published peak per dtype
// roofline chart · x: FLOP / byte · y: GFLOPs · regions: MEMORY WALL, compute roof · workloads: KV-prefill, MEMCPY, attn, GEMM

// illustrative · KV-prefill is the canonical memory-bound regime; GEMM saturates the compute roof
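The classification rule itself is simple enough to sketch. A minimal illustration in Python, assuming H100-class roof values (990 TFLOPS FP16 peak, 3.35 TB/s HBM3); the function and the workload numbers are illustrative, not the production probe:

```python
# Roofline classification sketch (illustrative; roof values and workload
# figures are assumptions, not measurements from the Arca probe).

def classify(flops: float, bytes_moved: float,
             mem_bw_gbps: float, peak_gflops: float) -> str:
    """Classify a workload window as memory- or compute-bound.

    Operational intensity (FLOP/byte) below the knee means the bandwidth
    roof caps throughput; above it, the compute roof does.
    """
    intensity = flops / bytes_moved       # FLOP per byte
    knee = peak_gflops / mem_bw_gbps      # where the two roofs meet
    return "memory-bound" if intensity < knee else "compute-bound"

# KV-prefill: heavy byte traffic per FLOP -> under the bandwidth roof
print(classify(flops=2e12, bytes_moved=4e11,
               mem_bw_gbps=3350, peak_gflops=990e3))   # memory-bound
# Dense GEMM: high intensity -> against the compute roof
print(classify(flops=8e14, bytes_moved=1e12,
               mem_bw_gbps=3350, peak_gflops=990e3))   # compute-bound
```

With these roofs the knee sits near 296 FLOP/byte, which is why prefill-style traffic lands far under the bandwidth roof while GEMM presses the compute roof.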

// timeline chart · sys_exit_ioctl · MEMCPY · t (ns) · ▎ MEMCPY tick · ▮ STALLED = waste

// each tick is a MEMCPY ioctl · red shading is GPU-stalled time on PCIe · the auditor totals it per-PID, per-shard

// SECTION 02 · NANOSECOND TIMING · sys_exit_ioctl

Compute / stall ratio.

The single most truthful number on a GPU bill. Time spent doing math, as a fraction of total GPU wall time.

We hook sys_exit_ioctl at the driver boundary and record a kernel timestamp at every MEMCPY. The nanosecond gap between successive copies is GPU-blocked time — dollars walking out the door. We aggregate per-PID, per-GPU, per-shard and report it as a single ratio. Above 0.9: you’re compute-bound and the cluster is sized correctly. Below 0.7: there’s capital trapped in the memory wall.

$arca_gpu_compute_efficiency_ratio · 1m window · per-PID
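A minimal sketch of the aggregation, assuming the probe already exports a kernel timestamp at each MEMCPY ioctl exit plus the useful-work time until the next copy (the event shape here is hypothetical):

```python
# Compute/stall ratio sketch: busy time divided by total wall time over the
# window. The gap not covered by useful work is GPU-stalled time.
# Illustrative only; field names are assumptions, not the probe's format.

from dataclasses import dataclass

@dataclass
class Memcpy:
    ts_ns: int     # kernel timestamp recorded at sys_exit_ioctl
    busy_ns: int   # useful GPU work between this copy and the next

def compute_stall_ratio(events: list[Memcpy]) -> float:
    """Fraction of wall time spent computing rather than blocked on memory."""
    window = events[-1].ts_ns - events[0].ts_ns
    busy = sum(e.busy_ns for e in events[:-1])
    return busy / window

events = [Memcpy(0, 600), Memcpy(1_000, 500), Memcpy(2_000, 700), Memcpy(3_000, 0)]
ratio = compute_stall_ratio(events)
print(ratio)   # 0.6 -> below 0.7: capital trapped in the memory wall
```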
// SECTION 03 · QUANTIZATION ADVISOR

Console alerts.
Specific recommendations.

The auditor doesn't just say 'memory-bound.' It tells you what to quantize, to what dtype, and what you'll get back — based on observed VRAM pressure, model architecture, and the KV-cache profile we measured live.

VRAM PRESSURE · 86% · CRITICAL
// RECOMMENDATION · FP16 → INT4 (GGUF Q4_K_M)
+38% headroom · −2.1% perplexity

VRAM PRESSURE · 71% · ADVISORY
// RECOMMENDATION · FP16 → INT8 (AWQ)
+22% headroom · −0.6% perplexity

VRAM PRESSURE · 92% · CRITICAL
// RECOMMENDATION · FP16 → INT4 + KV-cache eviction policy
+44% headroom · zero accuracy delta
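The advisory logic can be sketched as a threshold table. The thresholds, dtypes, and labels below are illustrative assumptions that mirror the sample alerts, not Arca's actual policy:

```python
# Advisory rule sketch mapping observed VRAM pressure (0..1) to a dtype
# recommendation. Thresholds and recommendation strings are illustrative.

def advise(vram_pressure: float, kv_cache_heavy: bool = False) -> str:
    if vram_pressure >= 0.90 and kv_cache_heavy:
        return "FP16 -> INT4 + KV-cache eviction policy"
    if vram_pressure >= 0.85:
        return "FP16 -> INT4 (GGUF Q4_K_M)"
    if vram_pressure >= 0.70:
        return "FP16 -> INT8 (AWQ)"
    return "no action"

print(advise(0.86))                        # FP16 -> INT4 (GGUF Q4_K_M)
print(advise(0.71))                        # FP16 -> INT8 (AWQ)
print(advise(0.92, kv_cache_heavy=True))   # FP16 -> INT4 + KV-cache eviction policy
```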
// SECTION 04 · THE WASTE SCORE

Potential monthly savings.
$39,250.

Computed from a sample 64×H100 vLLM cluster the auditor observed for seven days. 11 PIDs classified memory-bound · stall ratio 0.69 · KV-cache misuse on 3 shards · INT4 advisory issued. The full report names the workloads, the dtypes, the eviction policies, and the reclaim plan.

MEMORY-BOUND PIDs · 11
STALL RATIO · p50 · 0.69
RECLAIMABLE VRAM · 31%
ANNUAL · @ 20% gain · $471k
// SECTION 05 · ROI CALCULATOR

Try the math on your fleet.

A 20% efficiency gain is the standard benchmark we publish from completed Efficiency Auditor engagements. Plug in your fleet, see what reclaimable capital looks like at your scale.

// SAVINGS CALCULATOR

Annual reclaimed capital.

@ 20% efficiency gain · Arca standard benchmark

// ESTIMATED ANNUAL RECLAIMED CAPITAL · $470,938 / yr · 64 × Nvidia H100 (80 GB) · $4 / hr · 20% reclaimable

// illustrative · standard benchmark from Arca audits · actual savings depend on roofline classification, quantization headroom, and existing scheduler hygiene · requires Arca Efficiency Auditor engagement
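The calculator is plain arithmetic: GPUs × hourly rate × 8,760 hours × efficiency gain. A sketch (note that the published $470,938 figure corresponds to an effective rate of about $4.20/hr at a 20% gain; a flat $4/hr yields $448,512):

```python
# ROI arithmetic sketch: annual reclaimed capital from a fleet at a given
# hourly rate and efficiency gain. Rates shown are illustrative inputs.

HOURS_PER_YEAR = 24 * 365  # 8760

def annual_savings(gpus: int, rate_per_hr: float, gain: float) -> float:
    return gpus * rate_per_hr * HOURS_PER_YEAR * gain

print(round(annual_savings(64, 4.20, 0.20)))  # 470938 -> the published figure
print(round(annual_savings(64, 4.00, 0.20)))  # 448512 -> at a flat $4/hr
```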

// SECTION 06 · TECHNICAL PROOF

Three things you can verify.

The auditor ships with a technical appendix that names every probe, every measurement window, and every assumption. Read the spec.

PROOF 01

Kernel-level timing · zero observer effect

We hook sys_exit_ioctl to measure driver-blocked time at the kernel boundary. The audited workload runs in user space and never sees the probe. Sub-1% overhead target — the auditor itself isn't taxing the GPU you're trying to measure.

PROOF 02

Sharding-aware · auto-detects TP

The auditor walks the per-PID NCCL collective fingerprint and infers Tensor Parallelism degree across multi-GPU nodes. Roofline classification is computed per shard — TP=8 looks nothing like TP=1, and the report reflects that.
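A sketch of the inference, under the simplifying assumption that a shard's recurring all-reduces span a stable set of GPU ranks whose size is the TP degree (the trace format is hypothetical, not NCCL's API):

```python
# TP-degree inference sketch: the most common participant-set size across a
# PID's observed collectives is taken as its tensor parallelism degree.
# Event shape is an illustrative fingerprint, not real NCCL telemetry.

from collections import Counter

def infer_tp_degree(collectives: list[frozenset[int]]) -> int:
    """Most frequent participant-set size across observed collectives."""
    sizes = Counter(len(ranks) for ranks in collectives)
    return sizes.most_common(1)[0][0]

# A TP=8 shard: steady 8-rank all-reduces, plus a couple of stray 2-rank ops
trace = [frozenset({0, 1, 2, 3, 4, 5, 6, 7})] * 40 + [frozenset({0, 1})] * 2
print(infer_tp_degree(trace))   # 8 -> roofline computed per shard of 8
```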

PROOF 03

vLLM-aware · KV-cache vs weight-load

A logic gate distinguishes between weight loading (one-time MEMCPY at model load) and KV-cache pre-allocation (recurring per-request). Without this, the roofline lies. With it, you see exactly which workload class is memory-bound and what to quantize.
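A sketch of the gate, under the assumption that weight loading clusters inside a startup window while KV-cache traffic recurs for the life of the process (the window length and function names are illustrative):

```python
# Weight-load vs KV-cache sketch: a one-time burst of MEMCPYs at model load
# looks very different from recurring per-request pre-allocation, so this
# gate classifies a PID's copy stream by recurrence. Illustrative only.

def classify_memcpy_stream(timestamps_s: list[float],
                           startup_window_s: float = 30.0) -> str:
    """'weight-load' if every copy lands in the startup window, else 'kv-cache'."""
    first = timestamps_s[0]
    recurring = [t for t in timestamps_s if t - first > startup_window_s]
    return "kv-cache" if recurring else "weight-load"

print(classify_memcpy_stream([0.1, 0.5, 2.0, 4.3]))       # weight-load
print(classify_memcpy_stream([0.1, 45.0, 90.2, 133.7]))   # kv-cache
```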

// SCOPE THE AUDIT

Reclaim your VRAM.
Shrink your cluster,
not your performance.