The Buyer’s Problem: Revenue Grows Linearly. GPU Spend Doesn’t.
If you’re running an AI product past Series A, you’ve likely seen this pattern: usage doubles, revenue improves modestly, and your cloud bill explodes.
Early prototypes run on hosted APIs or a small dedicated GPU instance. Everything looks fine at 1,000 daily users. At 50,000? You’re suddenly managing GPU clusters, queuing latency, token-level cost leakage, and model version sprawl.
The hidden cost isn’t just raw compute. It’s:
- Low GPU utilization (30–50% in poorly tuned clusters)
- Over-provisioned context windows
- Redundant model calls across microservices
- Retry storms under concurrent load
- Idle capacity during time-of-day troughs
We’ve seen healthcare AI teams assume inference cost scales linearly with usage. It rarely does. Once concurrency climbs and latency targets tighten, you start over-provisioning to protect user experience. That’s when margin erosion begins.
Why LLM Infrastructure Fails at Scale
Most teams optimize for model accuracy first. That’s correct. But once you approach product-market fit, the constraint shifts from quality to cost-per-request.
In one ambient clinical documentation deployment our team supported, inference costs initially modeled at $0.18 per encounter ballooned past $0.44 when concurrency rose across 100+ facilities. The problem wasn’t the model. It was orchestration, batching inefficiency, and long context retention.
The scaling issue typically falls into four architectural blind spots.
Four Approaches to Controlling LLM Infrastructure Costs
1. Smarter Model Routing
Not every request needs your largest model. A routing layer can dynamically select between:
- High-capability model for edge cases
- Mid-tier model for routine flows
- Lightweight distilled model for simple transformations
Architecturally, this requires a policy engine in front of the inference services. It is often deployed either as a sidecar within Kubernetes or as an independent gateway that classifies each request before forwarding it.
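As a rough illustration of the policy engine, here is a minimal rule-based router. The tier names, task labels, and token thresholds are all hypothetical placeholders, and a production classifier would likely use richer signals than task type and prompt length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    task: str           # hypothetical task label attached by the caller
    prompt_tokens: int  # token count of the assembled prompt


# Assumed policy tables; these labels are illustrative, not a standard.
SIMPLE_TASKS = {"reformat", "extract_field", "classify"}
ROUTINE_TASKS = SIMPLE_TASKS | {"summarize", "draft_note"}

# Assumed thresholds; tune them against your own traffic.
SHORT_PROMPT_TOKENS = 1000
LONG_PROMPT_TOKENS = 6000


def route(req: Request) -> str:
    """Return the model tier for a request: 'lightweight', 'mid', or 'high'."""
    # Simple transformations on short prompts go to the distilled model.
    if req.task in SIMPLE_TASKS and req.prompt_tokens <= SHORT_PROMPT_TOKENS:
        return "lightweight"
    # Unknown task types and very long prompts are treated as edge cases.
    if req.task not in ROUTINE_TASKS or req.prompt_tokens > LONG_PROMPT_TOKENS:
        return "high"
    # Everything else is a routine flow.
    return "mid"
```

Keeping the policy in one pure function makes it easy to unit-test and to swap for a learned classifier later without touching the inference services.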
2. Context Optimization & Retrieval Discipline
Most teams over-attach documents into prompts “just in case.” Token growth becomes invisible until invoices hit.
Use structured retrieval pipelines with hard token caps. Apply semantic chunk ranking and aggressive truncation. Cache embeddings. Normalize document structures before retrieval.
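A hard token cap can be enforced at the point where the prompt context is assembled. The sketch below assumes chunks arrive with a relevance score from an upstream ranker, and it uses a whitespace split as a stand-in for a real tokenizer; both `build_context` and `count_tokens` are hypothetical names.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    # Stand-in tokenizer: whitespace split. Replace with your model's tokenizer.
    return len(text.split())


def build_context(ranked_chunks: List[Tuple[float, str]], max_tokens: int) -> List[str]:
    """Select the highest-scoring chunks without exceeding max_tokens.

    The last chunk that fits only partially is truncated rather than dropped.
    """
    selected: List[str] = []
    used = 0
    for _score, chunk in sorted(ranked_chunks, key=lambda p: p[0], reverse=True):
        remaining = max_tokens - used
        if remaining <= 0:
            break  # budget exhausted; ignore lower-ranked chunks
        words = chunk.split()
        if len(words) > remaining:
            words = words[:remaining]  # aggressive truncation at the cap
        selected.append(" ".join(words))
        used += len(words)
    return selected
```

Because the cap is a function argument rather than a convention, it can be set centrally and logged alongside each request.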
An unmanaged long context window is a liability, not a feature: every token you attach is billed and processed whether or not the model needed it.
3. GPU Utilization & Batch Scheduling
Raw GPU capacity is expensive. Idle GPU capacity is worse, because you pay for it and serve nothing with it.
At scale, inference workloads should implement:
- Dynamic batching across concurrent requests
- Token streaming to reduce blocking time
- Autoscaling based on queue depth, not CPU metrics
- Horizontal pod autoscaling tuned to GPU memory constraints
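To make the dynamic batching item concrete, here is a minimal sketch of a batch collector built on the standard library. It is an illustration only: inference servers typically implement batching internally, and the `max_batch` and `max_wait_s` values used here are assumptions you would tune against your latency target.

```python
import queue
import time
from typing import Any, List


def collect_batch(q: "queue.Queue[Any]", max_batch: int, max_wait_s: float) -> List[Any]:
    """Gather up to max_batch requests, waiting at most max_wait_s for stragglers."""
    batch: List[Any] = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        # Once the deadline has passed, only take what is already queued.
        timeout = max(0.0, deadline - time.monotonic())
        try:
            batch.append(q.get(timeout=timeout))
        except queue.Empty:
            break
    return batch
```

A larger `max_wait_s` tends to raise throughput per GPU at the cost of added latency for the first request in each batch, so the two limits are best chosen together.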
Most cloud defaults don’t optimize for GPU inference. You must deliberately configure throughput-based scaling.
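The arithmetic behind queue-depth scaling can be sketched as follows. This is not the Kubernetes autoscaler's own formula; it is an assumed sizing rule in which each replica has a measured throughput and the queue should drain within a target time. In a cluster, a value like this would usually be exposed as a custom or external metric for the autoscaler to act on.

```python
import math


def desired_replicas(
    queue_depth: int,
    per_replica_rps: float,       # measured requests/second one replica sustains
    target_drain_seconds: float,  # how quickly the backlog should clear
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Size the replica count from backlog rather than CPU utilization."""
    if per_replica_rps <= 0 or target_drain_seconds <= 0:
        raise ValueError("throughput and drain target must be positive")
    # Requests one replica can clear within the drain window.
    capacity_per_replica = per_replica_rps * target_drain_seconds
    needed = math.ceil(queue_depth / capacity_per_replica)
    # Clamp to the configured bounds (max_replicas reflects the GPU budget).
    return max(min_replicas, min(max_replicas, needed))
```

For example, with a backlog of 120 requests, 4 requests per second per replica, and a 5-second drain target, the rule asks for 6 replicas.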
4. Caching, Determinism & Idempotency
A surprising share of LLM calls are duplicates: the same input, with the same output expected. Without request fingerprinting and response caching, you pay repeatedly for answers you have already generated.
This includes:
- Prompt normalization before hash generation
- Embedding-level similarity caching
- Workflow-level memoization
- Retry suppression logic
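Prompt normalization and hash-based response caching can be sketched in a few lines. The class and function names below are hypothetical, the in-memory dictionary stands in for whatever cache store you run, and the normalization (whitespace collapsing only) is an assumption that may be too weak or too strong for your prompts.

```python
import hashlib
import json
import re
from typing import Any, Callable, Dict


def normalize_prompt(prompt: str) -> str:
    # Collapse runs of whitespace so trivially different prompts hash alike.
    return re.sub(r"\s+", " ", prompt).strip()


def fingerprint(model: str, prompt: str, params: Dict[str, Any]) -> str:
    """Stable hash over everything that can change the output."""
    payload = json.dumps(
        {"model": model, "prompt": normalize_prompt(prompt), "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, model: str, prompt: str, params: Dict[str, Any],
        compute: Callable[[str], str],
    ) -> str:
        key = fingerprint(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(prompt)  # the expensive model call
        self._store[key] = result
        return result
```

Including the model name and sampling parameters in the fingerprint keeps a cached answer from one configuration from being served for another. Caching of this kind is only appropriate for calls where a repeated input really should return the same output.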
| Approach | Cost Impact | Operational Complexity |
|---|---|---|
| Model Routing Layer | High | Medium |
| Context Optimization | High | Medium |
| GPU Batching & Autoscaling | Very High | High |
| Request Caching | Medium | Low–Medium |
How AST Designs LLM Infrastructure for Cost Stability
At AST, we treat inference economics as a first-class architectural concern — not a DevOps afterthought.
Our integrated pod teams typically model projected concurrency curves during sprint 1. Before shipping production AI systems, we simulate cost-per-request under 5x expected growth conditions.
In a recent AI clinical workflow platform, restructuring the inference layer reduced monthly GPU spend by 38% without downgrading model quality. The change was architectural — batching and routing — not model level.
Our pods always include DevOps engineers who understand CUDA-based workloads, containerized GPU scheduling, and production observability. This is not generic cloud engineering. GPU workloads behave differently under sustained load.
An Engineering Decision Framework for LLM Cost Control
- Quantify True Unit Economics: Calculate cost per request including embeddings, retries, streaming overhead, and orchestration latency, not just model pricing.
- Introduce a Routing Layer Early: Even if you start with one model, design an abstraction for multi-model selection.
- Lock Context Budgets: Enforce token caps in code. Do not allow feature teams to override limits casually.
- Design for Observability: Instrument token usage, queue depth, GPU memory allocation, and cache hits in production dashboards.
- Simulate 5x Growth Before You Need It: Load-test concurrency under realistic user behavior, not synthetic isolated prompts.
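The unit-economics step in the framework above can start as a small cost model. The sketch below is one possible shape, with every field name, price, and rate a hypothetical input; it covers model tokens, embeddings, and retries for token-priced APIs, plus a utilization-adjusted figure for self-hosted GPUs. Streaming overhead and orchestration latency are left out and would need their own terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestProfile:
    prompt_tokens: int
    completion_tokens: int
    embedding_tokens: int
    retry_rate: float  # average extra model calls per request, e.g. 0.1


@dataclass(frozen=True)
class TokenPrices:
    prompt_per_1k: float      # USD per 1,000 prompt tokens
    completion_per_1k: float  # USD per 1,000 completion tokens
    embedding_per_1k: float   # USD per 1,000 embedding tokens


def token_cost_per_request(profile: RequestProfile, prices: TokenPrices) -> float:
    model_call = (
        profile.prompt_tokens / 1000 * prices.prompt_per_1k
        + profile.completion_tokens / 1000 * prices.completion_per_1k
    )
    embeddings = profile.embedding_tokens / 1000 * prices.embedding_per_1k
    # Retries repeat the model call; embeddings are assumed to be computed once.
    return model_call * (1 + profile.retry_rate) + embeddings


def gpu_cost_per_request(
    gpu_hourly_usd: float,
    peak_requests_per_hour: float,  # throughput at full utilization
    utilization: float,             # fraction of peak actually used, 0 < u <= 1
    retry_rate: float,
) -> float:
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Retries consume capacity without serving additional user requests.
    served_per_hour = peak_requests_per_hour * utilization / (1 + retry_rate)
    return gpu_hourly_usd / served_per_hour
```

In the GPU model, cost per request is inversely proportional to utilization, so moving a cluster from 40% to 80% utilization halves the per-request figure for the same hardware.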
Infrastructure mistakes don’t show up in demos. They show up in margin.
Why Delivery Model Matters: AST’s Pod Approach to AI Infrastructure
Scaling LLM systems requires coordination across ML engineering, backend, DevOps, and product. Fragmented staff augmentation rarely solves systemic inefficiencies.
AST’s pod model embeds a cross-functional team — backend, ML engineer, DevOps, QA — that owns the full lifecycle. That continuity matters when optimizing GPU scheduling or refactoring retrieval pipelines.
We currently support AI-enabled clinical systems across 160+ facilities, where latency and cost constraints directly impact care workflows. Infrastructure decisions aren’t theoretical for us. They determine operational viability.
Watching Your AI Margins Shrink as Usage Grows?
If your LLM product is scaling but your infrastructure spend is accelerating faster than revenue, we can help you diagnose and redesign the architecture before costs lock in. Book a free 15-minute discovery call — no pitch, just straight answers from engineers who have done this.


