Engineering · March 10, 2026 · 9 min read

Building a Regionally-Aware Model Router

By Sev Geraskin

During a recent technical evaluation, an NVIDIA infrastructure engineer asked us: “How do you failover once the request has been routed to the edge?”

The question was probing for the classic mistake of putting a smart router in the hot path. We walked through the architecture: a DNS-layer decision on first contact, edge metrics driving the selection, subsequent requests going direct to the node, and the SDK handling failover with a cached candidate list. The response was a nod.

Most inference platforms make the same architectural mistake. Every request transits a central load balancer or API gateway before reaching a GPU. The centralized gateway considers load, health, and geography. And it adds 50–200ms of dead time before inference even starts.

For batch workloads and chatbots, that’s fine. For voice agents operating on human conversational timing, where the modal gap between turns is 200 to 300ms, a centralized routing hop can consume 30 to 60 percent of your entire latency budget before a single token is generated.

At PolarGrid, we started from a different design constraint: model routing at the edge means the routing decision happens once. After that, the client talks directly to the edge node. No round-trip to a central router on every request. No additional network hops between the user and the GPU running their inference.

Why DNS-Only Routing Falls Short

Centralized DNS is the obvious solution. Configure Route 53 with latency-based routing, point a single hostname at your edge fleet, and let DNS resolvers return the IP of the lowest-latency region. Clean, simple, and almost entirely server-side.

We started here. The first problem is measurement accuracy. Route 53 measures latency from the DNS resolver’s location, not the client’s location. When a user in downtown Toronto is routed through a corporate DNS server in Chicago, or when a mobile carrier’s resolver sits in a data center three states away, the latency estimate describes the wrong network path.

The second problem is time. DNS resolution adds 10 to 50ms before the first packet flies. On a 300ms latency budget, that’s significant. And TTL caching creates a staleness window. Even a 30-second TTL sounds responsive until a node goes unhealthy, and clients keep hitting it for another half-minute.

The third problem is awareness. DNS knows geography, but not GPU queue depth, model availability, or per-node load. A node can be geographically closest to the client and also completely saturated. DNS will keep sending traffic there.

The Three-Layer Routing Architecture

Our routing logic rests on a three-layer architecture, where each layer operates at a different timescale and granularity.

Layer 1: DNS (Coarse Geography)

Anycast DNS gets the client to the right continent and region. This is the blunt instrument. It decides whether to send the request to Toronto or Vancouver. It does not decide which node within that region. Anycast DNS resolves in roughly 10ms globally.

Layer 2: Client-Side Latency Probing (Fine Selection)

On session initialization, the SDK pings all candidate nodes in parallel using a lightweight /v1/models endpoint. Round-trip time is measured from the client. The lowest-latency node wins. This result is cached for 5 minutes — long enough to avoid probe overhead, short enough to respond to node failures.

This is the key insight: measuring from the client is the only way to measure what actually matters. Server-side routing measures server-to-server latency. Client-side probing measures user-to-GPU latency, which is what determines whether the voice agent feels responsive.
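The probing step can be sketched roughly as follows. This is a minimal illustration, not the actual SDK: the function and class names (measure_rtt, rank_nodes, ProbeCache, http_probe) are hypothetical, the probe is injected so the ranking logic can be exercised without a network, and http_probe assumes each candidate is a hostname that answers an unauthenticated GET on /v1/models.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute cache window described above


def http_probe(node: str) -> None:
    """Hypothetical probe: GET /v1/models on a node (hostname scheme is assumed)."""
    with urllib.request.urlopen(f"https://{node}/v1/models", timeout=2.0) as resp:
        resp.read()


def measure_rtt(probe: Callable[[str], None], node: str) -> Optional[float]:
    """Return the probe's round-trip time in seconds, or None if it failed."""
    start = time.perf_counter()
    try:
        probe(node)
    except Exception:
        return None  # unreachable nodes are dropped from the candidate list
    return time.perf_counter() - start


def rank_nodes(nodes: List[str], probe: Callable[[str], None]) -> List[Tuple[str, float]]:
    """Probe all nodes in parallel and return (node, rtt) pairs, fastest first."""
    if not nodes:
        return []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        rtts = list(pool.map(lambda n: measure_rtt(probe, n), nodes))
    reachable = [(n, r) for n, r in zip(nodes, rtts) if r is not None]
    return sorted(reachable, key=lambda pair: pair[1])


class ProbeCache:
    """Keeps the ranked candidate list and re-probes once it is older than the TTL."""

    def __init__(self, nodes, probe, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._nodes, self._probe, self._ttl, self._clock = nodes, probe, ttl, clock
        self._ranked: List[Tuple[str, float]] = []
        self._probed_at: Optional[float] = None

    def candidates(self) -> List[str]:
        now = self._clock()
        if self._probed_at is None or now - self._probed_at >= self._ttl:
            self._ranked = rank_nodes(self._nodes, self._probe)
            self._probed_at = now
        return [node for node, _ in self._ranked]
```

Keeping the whole ranked list, rather than only the winner, is what lets the client fall back to the next-best node later without probing again.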

Layer 3: Per-Request Direct Routing

After node selection, all requests go directly to the selected node. No central gateway in the hot path. The node’s API gateway handles authentication, rate limiting, and TLS termination. The routing decision has already been made.

Failover

Failover happens at the SDK layer. If a request to the selected node fails (connection refused, timeout, 5xx), the SDK falls back to the next-best node from the cached probe results. The failover decision is local to the client — no central coordinator to query, no additional round-trip.

The result: failover adds one retry latency to a single request. It does not require a health check to propagate through a centralized routing layer.
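A minimal sketch of that client-side failover loop, under stated assumptions: send_with_failover, ServerError, and AllNodesFailed are hypothetical names, the transport is passed in as a plain callable, and the retryable set simply mirrors the three failure classes listed above (connection refused, timeout, 5xx).

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


class ServerError(Exception):
    """Hypothetical exception a transport would raise for a 5xx response."""

    def __init__(self, status: int):
        super().__init__(f"server returned {status}")
        self.status = status


class AllNodesFailed(Exception):
    """Raised when every cached candidate has been tried without success."""


# ConnectionRefusedError is a subclass of ConnectionError.
RETRYABLE = (ConnectionError, TimeoutError, ServerError)


def send_with_failover(candidates: List[str], send: Callable[[str], T]) -> T:
    """Try each candidate in ranked order; return the first successful result."""
    errors: List[Tuple[str, Exception]] = []
    for node in candidates:
        try:
            return send(node)
        except RETRYABLE as exc:
            errors.append((node, exc))  # move on to the next-best node
    raise AllNodesFailed(errors)
```

Errors outside the retryable set (a malformed request, for example) propagate immediately, since retrying them on another node would not help.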

What We’re Still Solving

Regional routing works well when nodes are healthy and load is uniform. The harder problems:

Node saturation at the edge is more visible than at hyperscale. When a node’s GPU queue fills, latency spikes immediately. We’re building per-node load signaling into the probe endpoint so clients can factor queue depth into node selection alongside raw RTT.

Model availability varies by node. Not every node runs every model. The router needs to factor in whether the requested model is loaded, not just whether the node is reachable.

Multi-turn session affinity. For voice conversations, routing the second turn of a conversation to a different node than the first introduces KV cache misses and context reload overhead. We want to maintain session affinity to a node while still failing over gracefully if that node becomes unavailable mid-conversation.
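One possible shape for a selection function that combines these three concerns is sketched below. It is purely illustrative and not the design in progress: NodeStatus, choose_node, the pinned parameter, and the QUEUE_PENALTY_MS weight are all assumptions made up for the example, including the idea that the probe response would report queue depth and loaded models.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

QUEUE_PENALTY_MS = 20.0  # assumed latency cost per queued request; a made-up value


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical per-node probe result."""
    name: str
    rtt_ms: float
    queue_depth: int
    models: FrozenSet[str]


def score(node: NodeStatus) -> float:
    """Lower is better: raw RTT plus a penalty for work already queued."""
    return node.rtt_ms + QUEUE_PENALTY_MS * node.queue_depth


def choose_node(nodes: List[NodeStatus], model: str,
                pinned: Optional[str] = None) -> Optional[str]:
    """Pick a node for a request, or return None if no node serves the model."""
    # Model availability: only nodes with the requested model loaded qualify.
    eligible = [n for n in nodes if model in n.models]
    if not eligible:
        return None
    # Session affinity: stay on the pinned node while it is still a candidate.
    if pinned is not None and any(n.name == pinned for n in eligible):
        return pinned
    # Otherwise choose by load-adjusted latency.
    return min(eligible, key=score).name
```

In this sketch a pinned node that drops out of the candidate list falls through to the scored choice, which is the graceful mid-conversation failover described above, at the cost of one cold KV cache.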

These are the problems we’re actively working on. The routing architecture described above is what’s in production today.
