How I Built llm0: An LLM Gateway Architecture Walk-Through

From a weekend Python prototype to a 3 ms p50 Go gateway with Redis Lua scripts, pgvector semantic caching, and cross-provider failover.


Why I built it

A team I follow on LinkedIn shared their LLM bill story not long ago: it hit five figures in three days because every user query was routed straight to the model. Their fix was a semantic cache.

I had already built something similar a few months earlier — mostly for fun. I was prototyping a few AI apps on the side and wanted to answer two questions:

  1. How fast can cache hits be if the gateway hot path is treated like infrastructure, not just a thin proxy?
  2. Can semantic caching make LLM cost management practical without bolting on a separate vector database?

The result is llm0: an open-source, MIT-licensed LLM gateway written in Go that puts one OpenAI-compatible endpoint in front of OpenAI, Anthropic, Gemini, and local Ollama.

It delivers roughly 3 ms p50 cache-hit latency, per-customer spend caps, and automatic provider failover, and it deploys as a single binary backed by Postgres and Redis.

This post is the architecture walk-through.

Repo: github.com/llm0ai/llm0

Starting in Python

The first version was Python: FastAPI, redis-py, asyncpg. It got me to a working prototype in about a weekend — auth, basic rate limiting, a single provider, and a SHA-256 exact-match cache.

The prototype worked well at low concurrency, but p99 latency started climbing as I pushed traffic up. The realistic remedies were horizontal — more resources, more containers behind a load balancer — and I wanted to see how far a single instance could go before reaching for that.

So I rewrote the hot path in Go: single-binary deploy, real concurrency via goroutines, and a Redis client ecosystem that made batch operations and EVALSHA script loading straightforward.

The embedding service is still Python, because the Hugging Face transformers ecosystem is the right tool for that job. Python is great where it fits — for the latency-sensitive hot path of this particular gateway, Go was the better match.

Architecture overview

llm0 runs as four containers via Docker Compose:

  • Gateway — Go binary, around 30 MB, roughly 50 MB RSS under load.
  • Postgres + pgvector — API keys, projects, customer limits, exact-cache warm tier, semantic-cache vectors, and request logs.
  • Redis — auth cache, rate-limit token buckets, spend counters, and exact-cache hot tier.
  • Embedding service — optional Python sidecar running all-MiniLM-L6-v2 for semantic caching.

The gateway exposes a single OpenAI-compatible endpoint:

/v1/chat/completions

Requests flow through:

auth → rate limit → spend cap → exact cache → semantic cache → provider call

The architectural work is making those steps fast.
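To make the flow concrete, here is a minimal Go sketch of a staged pipeline that stops at the first stage able to answer. It is an illustration rather than the code in the repo: the `Request`, `Stage`, `runPipeline`, and `demoStages` names are hypothetical, and only auth, the exact cache, and the provider call are stubbed in.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical, trimmed-down view of an incoming chat request.
type Request struct {
	APIKey string
	Prompt string
}

// Stage is one step of the pipeline. It returns done=true when it can
// answer the request itself (for example, on a cache hit).
type Stage func(r Request) (resp string, done bool, err error)

// runPipeline walks the stages in order and stops at the first stage that
// either fails or produces a response.
func runPipeline(r Request, stages ...Stage) (string, error) {
	for _, stage := range stages {
		resp, done, err := stage(r)
		if err != nil {
			return "", err
		}
		if done {
			return resp, nil
		}
	}
	return "", errors.New("no stage produced a response")
}

// demoStages builds stand-ins for auth, the exact cache, and the provider
// call. Rate limiting, spend caps, and the semantic cache would slot in the
// same way.
func demoStages(validKey string, cache map[string]string) []Stage {
	auth := func(r Request) (string, bool, error) {
		if r.APIKey != validKey {
			return "", false, errors.New("unauthorized")
		}
		return "", false, nil
	}
	exactCache := func(r Request) (string, bool, error) {
		resp, ok := cache[r.Prompt]
		return resp, ok, nil
	}
	provider := func(r Request) (string, bool, error) {
		return "upstream answer for: " + r.Prompt, true, nil
	}
	return []Stage{auth, exactCache, provider}
}

func main() {
	stages := demoStages("key-1", map[string]string{"hi": "cached hello"})
	for _, r := range []Request{
		{APIKey: "key-1", Prompt: "hi"},
		{APIKey: "key-1", Prompt: "new question"},
		{APIKey: "bad", Prompt: "hi"},
	} {
		resp, err := runPipeline(r, stages...)
		fmt.Printf("resp=%q err=%v\n", resp, err)
	}
}
```

The ordering matters: cheap rejections (auth, limits) come first, and the upstream call is reached only when every cache has missed.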

The hot path: fewer Redis round trips

The biggest single optimization was moving hot-path state checks into Redis Lua scripts.

Lua scripts run inside Redis atomically. The entire script executes as a single command, and no other operations interleave while it is running.

That means the gateway can do multiple operations — read, calculate, update, and return a decision — in one round trip.

Token-bucket rate limiting

The gateway uses a classic token-bucket rate limiter.

Each API key has a bucket with a capacity and a refill rate. For example, a bucket might hold 60 tokens and refill at one token per second. Each request consumes one token. If the bucket is empty, the request is denied.

The naive implementation looks like this:

read tokens → check threshold → write tokens

That is multiple Redis operations and it introduces a race condition. Two simultaneous requests can both read the same token count, both pass the check, and both continue.

The Lua version reads the bucket state, refills based on elapsed time, conditionally decrements, and writes the updated state back — all in one atomic operation.

One round trip. No race.
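The arithmetic inside such a script is small. The sketch below models it as a pure Go function so it can run without Redis; in the gateway this logic would live in the Lua script, and the `Bucket` fields and `take` signature here are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Bucket is the per-key state the script would keep in Redis.
type Bucket struct {
	Tokens     float64 // tokens currently available
	LastRefill float64 // unix seconds of the last update
}

// take refills the bucket for the elapsed time, then tries to consume one
// token. It returns the new state and whether the request is allowed.
func take(b Bucket, capacity, refillPerSec, now float64) (Bucket, bool) {
	elapsed := math.Max(0, now-b.LastRefill)
	b.Tokens = math.Min(capacity, b.Tokens+elapsed*refillPerSec)
	b.LastRefill = now
	if b.Tokens < 1 {
		return b, false // bucket empty: deny without consuming
	}
	b.Tokens--
	return b, true
}

func main() {
	// The example from the text: 60 tokens, refilled at one per second.
	b := Bucket{Tokens: 60, LastRefill: 0}
	allowed := 0
	for i := 0; i < 61; i++ {
		var ok bool
		b, ok = take(b, 60, 1, 0)
		if ok {
			allowed++
		}
	}
	fmt.Println("allowed in the first instant:", allowed)

	_, ok := take(b, 60, 1, 1)
	fmt.Println("allowed one second later:", ok)
}
```

Because read, refill, decrement, and write all happen inside one function call, there is no window between the check and the write. Running the same steps inside a Redis script gives that property across gateway instances.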

Spend-cap enforcement

Spend-cap enforcement uses the same pattern.

The naive sequence is:

read spend → check cap → increment spend

Under burst traffic, two requests can sneak past the cap before either write is visible.

The Lua version checks and increments atomically.

This matters because the core purpose of the gateway is cost control. If a user has a daily or monthly budget, the gateway should enforce that limit even during bursts.
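Here is a small in-process model of that check-and-increment step, assuming spend is tracked in integer cents and that each request carries a known cost. A mutex stands in for the atomicity that Redis gives a Lua script; `SpendCounter`, `TryCharge`, and `burst` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// SpendCounter models one spend counter. The mutex plays the role of the
// Lua script: check and increment happen as one indivisible step.
type SpendCounter struct {
	mu         sync.Mutex
	spentCents int64
}

// TryCharge adds costCents only if the result stays within capCents.
func (c *SpendCounter) TryCharge(costCents, capCents int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.spentCents+costCents > capCents {
		return false
	}
	c.spentCents += costCents
	return true
}

// Spent returns the total charged so far.
func (c *SpendCounter) Spent() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.spentCents
}

// burst fires n concurrent charges and reports how many were accepted.
func burst(c *SpendCounter, n int, costCents, capCents int64) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	accepted := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if c.TryCharge(costCents, capCents) {
				mu.Lock()
				accepted++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return accepted
}

func main() {
	c := &SpendCounter{}
	accepted := burst(c, 100, 10, 500)
	fmt.Println("accepted:", accepted, "spent:", c.Spent())
}
```

With a 500-cent cap and 10-cent requests, a burst of 100 concurrent calls accepts 50 and never overshoots, which is the guarantee the naive read-check-write sequence cannot give.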

Customer daily and monthly tracking

Customer-level spend tracking writes to both daily and monthly counters in one Lua call.

The keys expire automatically after their rollover window, so daily and monthly counters reset cleanly without a scheduled cleanup job.
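One way to get that behavior is to put the period in the key name and set the TTL to the time remaining until rollover. The sketch below assumes UTC rollover and a made-up `spend:<customer>:day|month:<period>` naming scheme; the repo may do this differently.

```go
package main

import (
	"fmt"
	"time"
)

// Window is a counter key plus the TTL that makes it expire at rollover.
type Window struct {
	Key string
	TTL time.Duration
}

// spendWindows returns the daily and monthly counter keys for a customer,
// each with a TTL that runs out at the next UTC day or month boundary.
func spendWindows(customerID string, now time.Time) (daily, monthly Window) {
	now = now.UTC()
	y, m, d := now.Date()
	// time.Date normalizes overflow, so d+1 and m+1 roll over correctly.
	nextDay := time.Date(y, m, d+1, 0, 0, 0, 0, time.UTC)
	nextMonth := time.Date(y, m+1, 1, 0, 0, 0, 0, time.UTC)

	daily = Window{
		Key: fmt.Sprintf("spend:%s:day:%s", customerID, now.Format("2006-01-02")),
		TTL: nextDay.Sub(now),
	}
	monthly = Window{
		Key: fmt.Sprintf("spend:%s:month:%s", customerID, now.Format("2006-01")),
		TTL: nextMonth.Sub(now),
	}
	return daily, monthly
}

// demoNow is a fixed timestamp so the example is reproducible.
func demoNow() time.Time {
	return time.Date(2025, time.March, 14, 18, 0, 0, 0, time.UTC)
}

func main() {
	daily, monthly := spendWindows("cust_42", demoNow())
	fmt.Println(daily.Key, daily.TTL)
	fmt.Println(monthly.Key, monthly.TTL)
}
```

Since the next period uses a different key, an old counter can never leak into a new window even if its expiry fires a little late.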

EVALSHA pre-loading

Redis caches Lua scripts by their SHA-1 digest.

Calling a script via EVALSHA is cheaper than sending the full script with EVAL every time, because the gateway sends only the 40-character digest instead of the whole script body on every request.

On gateway startup, llm0 pre-loads the Lua scripts and stores their hashes. If Redis no longer has a script cached, which usually happens only after SCRIPT FLUSH or a restart, EVALSHA returns a NOSCRIPT error and the gateway falls back to EVAL.

That gives the gateway EVALSHA speed with EVAL reliability.
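The fallback can be sketched in a few lines of Go. The `scriptRunner` interface and `fakeRedis` type below are hypothetical stand-ins so the example runs without a server; real client libraries expose EVALSHA and EVAL under their own method names, and some already bundle this fallback.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// scriptRunner is a hypothetical, minimal view of a Redis client.
type scriptRunner interface {
	EvalSha(sha string, keys []string, args ...any) (any, error)
	Eval(script string, keys []string, args ...any) (any, error)
}

// Script pairs a Lua source with its SHA-1 digest, computed once.
type Script struct{ src, sha string }

func newScript(src string) Script {
	sum := sha1.Sum([]byte(src))
	return Script{src: src, sha: hex.EncodeToString(sum[:])}
}

// runScript tries EVALSHA first and falls back to EVAL on NOSCRIPT.
func runScript(r scriptRunner, s Script, keys []string, args ...any) (any, error) {
	res, err := r.EvalSha(s.sha, keys, args...)
	if err != nil && strings.HasPrefix(err.Error(), "NOSCRIPT") {
		// EVAL runs the script and caches it again for later EVALSHA calls.
		return r.Eval(s.src, keys, args...)
	}
	return res, err
}

// fakeRedis imitates the script cache: EVALSHA fails until EVAL has run.
type fakeRedis struct {
	cached       map[string]bool
	evalShaCalls int
	evalCalls    int
}

func newFakeRedis() *fakeRedis { return &fakeRedis{cached: map[string]bool{}} }

func (f *fakeRedis) EvalSha(sha string, keys []string, args ...any) (any, error) {
	f.evalShaCalls++
	if !f.cached[sha] {
		return nil, errors.New("NOSCRIPT No matching script. Please use EVAL.")
	}
	return "ok", nil
}

func (f *fakeRedis) Eval(script string, keys []string, args ...any) (any, error) {
	f.evalCalls++
	f.cached[newScript(script).sha] = true
	return "ok", nil
}

const demoScript = `return 1`

func main() {
	r := newFakeRedis()
	s := newScript(demoScript)
	for i := 0; i < 3; i++ {
		res, err := runScript(r, s, nil)
		fmt.Println(res, err, "evalsha:", r.evalShaCalls, "eval:", r.evalCalls)
	}
}
```

After a flush, only the first call pays for the extra round trip and the full script body; every later call goes back to the short EVALSHA form.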

Result

A fully configured cache-hit request runs through one or two Redis round trips on the hot path:

  1. One for the auth cache lookup
  2. One for the rate-limit and spend-cap Lua script

Numbers from a DigitalOcean 4 vCPU shared Linux droplet, measured server-side from the gateway’s own log table:

  • Cache-hit p50: 3 ms
  • Cache-hit p99: 23 ms
  • Rate-limit fast-fail p50: around 2 ms
  • Sustained throughput: around 1,672 requests per second on a single instance

These are server-side measurements: request arrival at the Go handler to response write.

Network round-trip time is excluded by design because it is external noise the gateway cannot control.

Two-tier caching: Redis hot, Postgres warm

LLM responses are often deterministic enough that caching is useful.

Same prompt, same model, same parameters, same answer. Why pay the upstream provider twice?

The exact-match cache uses a SHA-256 hash of the request shape as the key:

SHA256(prompt + model + temperature + ...)

The cache value is the full JSON response.
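As a rough illustration of how such a key could be derived, the sketch below hashes a canonical JSON encoding of three fields. The `CacheKeyInput` struct is an assumption: it models only prompt, model, and temperature, and whatever else goes into the real key is not shown here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// CacheKeyInput lists the fields this sketch hashes. The real request
// shape includes more parameters than these three.
type CacheKeyInput struct {
	Model       string  `json:"model"`
	Prompt      string  `json:"prompt"`
	Temperature float64 `json:"temperature"`
}

// cacheKey hashes a JSON encoding of the input. Struct fields are encoded
// in declaration order, so equal inputs always produce the same bytes.
func cacheKey(in CacheKeyInput) (string, error) {
	canonical, err := json.Marshal(in)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	in := CacheKeyInput{
		Model:       "gpt-4o-mini",
		Prompt:      "what is the capital of france?",
		Temperature: 0,
	}
	key, err := cacheKey(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(key)
}
```

Hashing a structured encoding instead of a plain string concatenation avoids collisions between inputs such as "ab" + "c" and "a" + "bc".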

llm0 stores the cache in two tiers:

  • Redis hot tier — 15-minute TTL for recent prompts and sub-millisecond reads.
  • Postgres warm tier — 24-hour TTL for older prompts and durable cache recovery.

On read, the gateway checks Redis first and falls back to Postgres.

On a Postgres hit, the response is promoted back into Redis.

On write, Redis is updated synchronously, while the Postgres insert happens asynchronously off the hot path.

The two-tier design lets Redis stay small and fast without losing warm cache hits when Redis evicts data or restarts.

It also makes the cache naturally self-warming. Production traffic rebuilds the Redis hot tier as users repeat queries.
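The read and write paths described above can be sketched with two in-memory tiers standing in for Redis and Postgres. The `Tier` interface, `memTier`, and `TwoTier` are hypothetical names, and the in-memory stand-in ignores TTLs; only the ordering, the promotion on a warm hit, and the asynchronous warm write are modeled.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	hotTTL  = 15 * time.Minute // Redis tier in the text
	warmTTL = 24 * time.Hour   // Postgres tier in the text
)

// Tier is the minimal interface both cache layers share in this sketch.
type Tier interface {
	Get(key string) (string, bool)
	Set(key, value string, ttl time.Duration)
}

// memTier is an in-memory stand-in that does not enforce the TTL.
type memTier struct {
	mu   sync.Mutex
	data map[string]string
}

func newMemTier() *memTier { return &memTier{data: map[string]string{}} }

func (m *memTier) Get(key string) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.data[key]
	return v, ok
}

func (m *memTier) Set(key, value string, _ time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
}

// Clear simulates the hot tier losing its data, as after a restart.
func (m *memTier) Clear() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data = map[string]string{}
}

type TwoTier struct {
	hot, warm Tier
	pending   sync.WaitGroup
}

// Get checks the hot tier first and promotes warm hits back into it.
func (c *TwoTier) Get(key string) (string, bool) {
	if v, ok := c.hot.Get(key); ok {
		return v, true
	}
	v, ok := c.warm.Get(key)
	if ok {
		c.hot.Set(key, v, hotTTL)
	}
	return v, ok
}

// Set writes the hot tier synchronously and the warm tier in the background.
func (c *TwoTier) Set(key, value string) {
	c.hot.Set(key, value, hotTTL)
	c.pending.Add(1)
	go func() {
		defer c.pending.Done()
		c.warm.Set(key, value, warmTTL)
	}()
}

// Wait blocks until background warm writes finish (useful in tests).
func (c *TwoTier) Wait() { c.pending.Wait() }

func main() {
	hot, warm := newMemTier(), newMemTier()
	c := &TwoTier{hot: hot, warm: warm}
	c.Set("k", "cached response")
	c.Wait()
	hot.Clear() // simulate a Redis restart
	v, ok := c.Get("k")
	fmt.Println(v, ok)
}
```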

Semantic cache: pgvector instead of a separate vector database

The exact-match cache catches repeated prompts like:

what is the capital of france?

But it misses semantically equivalent prompts like:

tell me france's capital city

Same intent. Different hash.

For that, you need vectors.

llm0 runs all-MiniLM-L6-v2 in a sidecar container. It is a small CPU-friendly embedding model: around 90 MB of model weights, roughly 30 ms per embedding, and 384-dimensional vectors.

Storage is handled by pgvector, the Postgres extension.

That means no separate vector database is required.

A typical semantic-cache lookup compares the new query embedding to cached embeddings using cosine similarity, with a configurable threshold. The default threshold is 0.85.

In testing, paraphrased queries hit at around 0.954 similarity in roughly 41 ms, with zero upstream LLM cost.

The important detail is that most of that 41 ms is embedding inference. The pgvector lookup itself is usually only a few milliseconds.
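For readers who want the comparison spelled out, here is a small Go sketch of cosine similarity and the threshold check. In the gateway the comparison runs inside Postgres; pgvector's `<=>` operator returns cosine distance, which is 1 minus the similarity, so a 0.85 similarity threshold corresponds to a maximum distance of 0.15. The function names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) (float64, error) {
	if len(a) != len(b) || len(a) == 0 {
		return 0, errors.New("vectors must be non-empty and the same length")
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("cosine similarity is undefined for a zero vector")
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB)), nil
}

// isHit reports whether a similarity clears the configured threshold.
func isHit(similarity, threshold float64) bool {
	return similarity >= threshold
}

func main() {
	// Toy 3-dimensional vectors; real embeddings here have 384 dimensions.
	query := []float64{0.9, 0.1, 0.4}
	cached := []float64{0.8, 0.2, 0.5}
	sim, err := cosine(query, cached)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("similarity=%.3f hit=%v\n", sim, isHit(sim, 0.85))
}
```

A lower threshold raises the hit rate but also the chance of returning an answer to a question the user did not quite ask, which is why it is configurable.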

A note on being corrected in public

When I posted the architecture on LinkedIn, I said Redis vector search would not work well because Redis is single-threaded and vector search could block rate-limit Lua scripts.

A Redis engineer corrected me.

RediSearch runs indexing and query execution off the main thread, using worker threads. Redis has also supported multi-threaded vector search in RediSearch for a while.

So my original explanation was wrong.

The better argument for pgvector in this gateway is different:

  1. The latency floor is the embedding step. If embedding inference takes roughly 30 ms, making vector search a few milliseconds faster does not change the overall shape much.
  2. The operational footprint is smaller. The gateway already needs Postgres for metadata, API keys, logs, and cache persistence. Using pgvector avoids adding a separate vector datastore.
  3. The vector-store layer is decoupled. If pgvector becomes the bottleneck later, switching to RediSearch or another vector backend is a migration, not a rewrite.

I am planning to benchmark RediSearch against pgvector in a follow-up post.

Failover: keeping requests served when providers go down

Provider failure is normal.

OpenAI can return 429s. Anthropic can return 5xxs. Gemini can have temporary errors. Local Ollama can be unavailable.

llm0 supports configurable failover across providers, with four modes:

  • cloud_first — try cloud providers first, then fall back to Ollama.
  • local_first — try Ollama first, then fall back to cloud providers.
  • cloud_only — use cloud providers only.
  • local_only — use Ollama only.

Each provider has a tiered model mapping.

For example, if the client asks for gpt-4o-mini and OpenAI is unavailable, the gateway can route to an equivalent Anthropic or Gemini model, then fall back to a local Ollama model if configured.
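The four modes boil down to an ordered list of providers to try. The Go sketch below shows one way to express that; the function names, the `Result` type, and the `flakyCall` helper are hypothetical, and per-provider model mapping is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// providerOrder turns a failover mode into the order in which providers
// are tried. It copies the cloud slice so the caller's data is not modified.
func providerOrder(mode string, cloud []string, local string) ([]string, error) {
	switch mode {
	case "cloud_first":
		return append(append([]string{}, cloud...), local), nil
	case "local_first":
		return append([]string{local}, cloud...), nil
	case "cloud_only":
		return append([]string{}, cloud...), nil
	case "local_only":
		return []string{local}, nil
	}
	return nil, fmt.Errorf("unknown failover mode %q", mode)
}

// Result records which provider answered and whether failover happened.
type Result struct {
	Body       string
	Provider   string
	FailedOver bool
}

// callWithFailover tries each provider in order until one succeeds.
func callWithFailover(order []string, call func(provider string) (string, error)) (Result, error) {
	lastErr := errors.New("no providers configured")
	for i, p := range order {
		body, err := call(p)
		if err == nil {
			return Result{Body: body, Provider: p, FailedOver: i > 0}, nil
		}
		lastErr = err
	}
	return Result{}, lastErr
}

// flakyCall returns a stub provider call that fails for providers marked down.
func flakyCall(down map[string]bool) func(string) (string, error) {
	return func(provider string) (string, error) {
		if down[provider] {
			return "", errors.New(provider + " unavailable")
		}
		return "answer from " + provider, nil
	}
}

func main() {
	cloud := []string{"openai", "anthropic", "gemini"}
	order, err := providerOrder("cloud_first", cloud, "ollama")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	res, err := callWithFailover(order, flakyCall(map[string]bool{"openai": true}))
	fmt.Printf("%+v err=%v\n", res, err)
}
```

In a real gateway the decision to move on would depend on the error type: a 429 or 5xx is worth retrying elsewhere, while a malformed request would fail on every provider.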

The interesting part is request and response translation.

The request body is OpenAI-shaped, but Anthropic and Gemini expect different schemas.

Provider-specific request and response transformers normalize everything back into OpenAI’s response format.

From the client’s point of view, the API remains OpenAI-compatible.
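To give a flavor of what a request transformer does, here is a hedged Go sketch of the OpenAI-to-Anthropic direction. It relies on two facts about Anthropic's Messages API: the system prompt is a top-level `system` field, not a message, and `max_tokens` is required. The struct names, the default of 1024, and the placeholder model name are assumptions, and real requests carry many more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// OpenAIRequest is a trimmed-down chat completion request.
type OpenAIRequest struct {
	Model     string    `json:"model"`
	Messages  []Message `json:"messages"`
	MaxTokens int       `json:"max_tokens,omitempty"`
}

// AnthropicRequest is a trimmed-down Messages API request.
type AnthropicRequest struct {
	Model     string    `json:"model"`
	System    string    `json:"system,omitempty"`
	Messages  []Message `json:"messages"`
	MaxTokens int       `json:"max_tokens"`
}

// defaultMaxTokens is an assumed fallback, since the target API requires a value.
const defaultMaxTokens = 1024

// toAnthropic maps the model name, lifts system messages into the top-level
// system field, and fills in max_tokens when the client did not set it.
func toAnthropic(in OpenAIRequest, modelMap map[string]string) (AnthropicRequest, error) {
	target, ok := modelMap[in.Model]
	if !ok {
		return AnthropicRequest{}, fmt.Errorf("no mapping for model %q", in.Model)
	}
	out := AnthropicRequest{Model: target, MaxTokens: in.MaxTokens}
	if out.MaxTokens == 0 {
		out.MaxTokens = defaultMaxTokens
	}
	var system []string
	for _, m := range in.Messages {
		if m.Role == "system" {
			system = append(system, m.Content)
			continue
		}
		out.Messages = append(out.Messages, m)
	}
	out.System = strings.Join(system, "\n")
	return out, nil
}

// demoRequest and demoModelMap are illustrative inputs; the target model
// name is a placeholder, not a real model identifier.
func demoRequest() OpenAIRequest {
	return OpenAIRequest{
		Model: "gpt-4o-mini",
		Messages: []Message{
			{Role: "system", Content: "Be brief."},
			{Role: "user", Content: "what is the capital of france?"},
		},
	}
}

func demoModelMap() map[string]string {
	return map[string]string{"gpt-4o-mini": "example-anthropic-model"}
}

func main() {
	out, err := toAnthropic(demoRequest(), demoModelMap())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	body, _ := json.MarshalIndent(out, "", "  ")
	fmt.Println(string(body))
}
```

The response side runs the same idea in reverse, reshaping the provider's reply into an OpenAI-style completion before it reaches the client.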

The only visible difference is in response headers such as:

X-Provider
X-Failover

Those headers make the actual upstream provider visible for debugging and observability.

What is still missing

llm0 is still early, and I am building it in public.

There are real gaps:

  • Only four providers today. Groq, Together, DeepSeek, and Bedrock are on the roadmap.
  • No Prometheus endpoint yet. Observability currently relies on gateway logs in Postgres. A /metrics endpoint is planned.
  • Some logging still needs cleanup. A move to structured logging with Go’s log/slog is planned.
  • Unit-test coverage is incomplete. Benchmarks cover the hot path, but mock-based handler tests are still in progress.

The roadmap and known limitations are tracked in the repo README.

What I learned

I learned three main things building this.

1. Lua is the most underrated part of Redis

Most people use Redis as a fast key-value store.

That is useful, but Lua scripts are the real power tool.

They let you collapse multi-step state transitions — read, decide, write — into a single atomic round trip.

Once you start using Redis Lua this way, you stop reaching for distributed locks for everything.

2. Two-tier caching beats picking one tier

Redis is fast but small and ephemeral.

Postgres is slower but durable and cheap.

Together, they work well.

The mental model is not “cache or database.” It is “hot tier and warm tier.” Each layer does the job it is best suited for.

3. Open-sourcing infrastructure attracts useful feedback

I had Redis engineers correct me on RediSearch threading.

I had infrastructure engineers ask about AWS deployment patterns.

Those conversations would not have happened if I had built this privately.

The cost of being wrong in public is that you become right slightly faster.

Closing thoughts

If you want to try the llm0 gateway, the repo is here:

github.com/llm0ai/llm0

It is MIT licensed, runs with Docker Compose, and exposes an OpenAI-compatible API in front of multiple providers.

If you want to follow along as I build toward v1.0 — Prometheus metrics, structured logging, more providers, RediSearch vs pgvector benchmarks, and deeper failover support — you can find me on LinkedIn or watch the repo.

Comments, corrections, and architecture critiques are welcome.
