# Marcus Tran, ML Infrastructure Engineer at Steadypath AI (82 people) — read of local-llm-inference-optimizer, June 16 2026

> 7 years doing infra for ML teams. I run a 4090 at home and manage 6 A100s at work. I benchmark things for fun and professionally. I have opinions about quantization.

## How I got here

Googled "llama.cpp batch size tuning RTX 4090 throughput" last Tuesday during lunch. Third result. I was genuinely trying to solve something: we're running Qwen 27B on two 4090s for an internal tool and the latency is embarrassing given the hardware. Clicked expecting a blog post. Got this instead.

## What I clicked first

The benchmark table. I skimmed past the hero headline and landed there immediately. "Qwen 27B RTX 4090 8 Q5_K_M 156 tok/sec 24 GB" -- that number is plausible, which is the first thing I check. If they'd said 400 tok/sec I would have closed the tab. The RTX 5080 row gave me slight pause since that card barely exists yet, but 92 tok/sec on Llama 70B Q4 is in the right ballpark. The before/after block is also genuinely good: 34 tok/sec default, 94 after tuning, VRAM dropping from 48GB to 22GB. I've seen exactly this kind of gain manually. That part felt real.

## Where I paused

The "How It Works" section. Specifically: "Point the profiler at your local LLM server. It auto-detects GPU, VRAM, and current model." I stopped here and thought about what "your local LLM server" means. Does this work with ollama? llama.cpp's server mode? LM Studio? vLLM? The example uses `--connect localhost:5000`, which doesn't settle it: llama.cpp's server defaults to port 8080 and ollama to 11434, so 5000 isn't the stock port for either. I'd want to know if this breaks with ollama's API format before I pip install anything. The step says "auto-detects," but how? That detail matters a lot when you're running non-standard configs.
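For what it's worth, here is roughly what I'd expect "auto-detect" to mean for the server type: a minimal sketch, assuming the tool probes a few well-known HTTP paths and takes the first one that answers. The probe list and the `detect_backend` and `http_status` names are my own guesses, not anything the page documents, and the real tool may do something entirely different.

```python
from typing import Callable, Optional
import urllib.error
import urllib.request

# Assumed probe paths, most specific first. Several backends also expose the
# OpenAI-style /v1/models route, so that check has to come last.
PROBES = [
    ("ollama", "/api/tags"),
    ("llama.cpp", "/props"),
    ("openai-compatible", "/v1/models"),
]


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status of a GET request, or None if nothing answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def detect_backend(
    base_url: str,
    fetch: Callable[[str], Optional[int]] = http_status,
) -> str:
    """Guess the backend behind base_url from the first probe returning 200."""
    root = base_url.rstrip("/")
    for name, path in PROBES:
        if fetch(root + path) == 200:
            return name
    return "unknown"
```

Calling `detect_backend("http://localhost:5000")` against whatever is listening there would tell me which of these the endpoint resembles. The `fetch` parameter exists only so the logic can be exercised without a live server. If the actual detection is this shallow, any non-standard config falls into "unknown", which is exactly the case I care about.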

## What I distrusted

Two things, one bigger than the other.

Small one: "Join 200+ developers squeezing more tokens per dollar." 200 is a suspiciously round and suspiciously small number for anything live. Could mean 200 GitHub stars. Could mean 200 Discord signups. Could mean nothing.

Big one: the whole bottom half of the page. I scrolled past the code snippet and suddenly I'm reading about "Adoptability scores" and "Fermi math" and "Unlock the dossier $5" and "Adopt the build $99-$199." What? Is this a product I can install, or is this a business idea being sold to me? The page literally says "Honest disclosure: we don't have live customers on this idea yet." So who collected those benchmark numbers? The table says "Collected from community runs on actual hardware" but if there are no live customers, what community ran these? I'm not calling the numbers fake -- they might be from the builder's own testing -- but that sentence now means something different than it did when I first read it.

The product confused me. I arrived looking for software. I found what looks like a product idea marketplace with a demo concept attached.

## What would convince me

If this is actual software: a real GitHub repo with commit history, open issues, and someone complaining about it not working with their specific setup. That's what a real tool looks like. Plus at least two community benchmark submissions with different configs, and numbers that don't all come out looking perfectly round.

If the pip install actually works: a 2-minute Loom of someone running it against a real ollama or llama.cpp instance showing the sweep happen and the dashboard populate. Not a mockup. Terminal output I can read.

On the benchmark numbers specifically: show me the collection method. Were these self-reported? Automated? What's the variance? Run to run, Qwen 27B Q5 can vary by 15-20 tok/sec depending on thermal state and what else is running. Single-number tables feel clean but hide real noise.
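To make the ask concrete, this is the kind of per-row summary I'd want next to each number in the table. It's a minimal sketch, assuming you already have a list of tok/sec readings from repeated runs of the same config; `summarize_runs` is my own helper name, not part of the tool.

```python
import statistics


def summarize_runs(tok_per_sec: list[float]) -> dict[str, float]:
    """Summarize repeated throughput measurements for one model/GPU config."""
    if len(tok_per_sec) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(tok_per_sec)
    stdev = statistics.stdev(tok_per_sec)  # sample standard deviation
    return {
        "runs": len(tok_per_sec),
        "mean": mean,
        "stdev": stdev,
        "min": min(tok_per_sec),
        "max": max(tok_per_sec),
        # Coefficient of variation: spread as a percentage of the mean.
        "cv_pct": 100 * stdev / mean,
    }
```

Feed it five or ten runs per config and publish the mean with the min/max range or the standard deviation. A row that reads "156 tok/sec" with no spread tells me less than one that admits how far the runs drifted.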

## What I'd ask in an email reply

1. Does this connect to ollama's API or only to llama.cpp server mode? What's the detection logic when the endpoint format differs?

2. The benchmark table says community-collected -- can I see the raw submissions or a link to where those live? I want to know if the RTX 5080 numbers are from your test rig or someone else's.

3. Is this actually installable today or is the pip package a placeholder? I'm happy to be an early tester but I need to know what I'm walking into before I point it at my work cluster.

## Verdict: on-the-fence

The technical content is specific enough that I don't dismiss this outright -- whoever built it knows the problem space. But the page structure genuinely confused me: I showed up looking for a tool and found a pitch deck for a tool idea. If the GitHub link goes somewhere real with working code I'd probably spend an hour on it this weekend.

---
*Memo by skeptic persona, generated 2026-06-16. Studio breaks own self-grading loop.*
