Category: AI Infrastructure
-
How I Built llm0: An LLM Gateway Architecture Walk-Through
From a weekend Python prototype to a 3 ms p50 Go gateway with Redis Lua scripts, pgvector semantic caching, and cross-provider failover. Why I built it: A team I follow…
-
Benchmarking Mistral-7B on RTX 4090: vLLM Concurrency, Latency, and Capacity Planning
Most LLM benchmarks highlight peak tokens per second. That metric is incomplete. For production systems, especially agentic workloads, what actually matters is: • Where is the latency–throughput knee? •…