Engineering · June 1, 2026 · 9 min read

Our inference observability stack: telemetry, benchmarks, and model cards

By Sev Geraskin

How do we keep tabs on our growing network of edge inference servers with a small team of engineers?

PolarGrid runs inference on edge nodes across six cities, each node determining which models to load and which requests to serve. Three systems keep that fleet measurable:

Per-request telemetry is captured at the gateway on every node.
Synthetic benchmarks run against the same metric definitions.
Model cards are generated from the benchmark output.

The data path connecting them is unidirectional. A request produces a telemetry event. A benchmark reproduces the same measurement under fixed inputs and writes a JSON artifact. The ship-model workflow reads that artifact and rewrites the model card. The next region cutover produces a new artifact, and the card is rewritten again.

What we measure on every request

Every inference request that crosses a PolarGrid gateway produces one structured event when it finishes. The event is built in BillingContext and carries a fixed schema.

The identity fields record which request ran and where: request_id (a UUID minted at the start of the call), model_id (normalized to the canonical name, for example qwen-3.5-27b), service_type (one of llm, stt, tts, tts_stream, voice_agent, vision), and the locality fields node_id, gpu_id, and region. The accounting fields record what the request consumed: input_tokens, output_tokens, and, for audio traffic, audio_duration_ms. The timing fields are duration_ms (gateway wall-clock) and ttfb_ms (time to first byte). The outcome fields are status_code and a separate response_delivered boolean, because a 200 status does not establish that bytes reached the client over a streaming connection.
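The field groups above can be sketched as a single record type. This is a minimal illustration, not the actual BillingContext code: the class name RequestEvent, the Python types, the validation, and the example node, GPU, and region values are all assumptions; only the field names and the service_type values come from the description above.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Optional

# Service types listed in the text.
SERVICE_TYPES = {"llm", "stt", "tts", "tts_stream", "voice_agent", "vision"}


@dataclass(frozen=True)
class RequestEvent:  # hypothetical name; types are assumed
    # Identity: which request ran and where
    request_id: str
    model_id: str
    service_type: str
    node_id: str
    gpu_id: str
    region: str
    # Accounting: what the request consumed
    input_tokens: int
    output_tokens: int
    audio_duration_ms: Optional[int]  # only set for audio traffic
    # Timing: gateway wall clock
    duration_ms: float
    ttfb_ms: float
    # Outcome: status and delivery are tracked separately
    status_code: int
    response_delivered: bool

    def __post_init__(self) -> None:
        if self.service_type not in SERVICE_TYPES:
            raise ValueError(f"unknown service_type: {self.service_type}")

    def to_jsonl(self) -> str:
        """Serialize to one compact JSON line."""
        return json.dumps(asdict(self), separators=(",", ":"))


# Illustrative values only; node, GPU, and region names are made up.
example = RequestEvent(
    request_id=str(uuid.uuid4()),
    model_id="qwen-3.5-27b",
    service_type="llm",
    node_id="example-node",
    gpu_id="gpu-0",
    region="example-region",
    input_tokens=120,
    output_tokens=64,
    audio_duration_ms=None,
    duration_ms=2400.0,
    ttfb_ms=205.0,
    status_code=200,
    response_delivered=True,
)
```

Keeping response_delivered as its own field means a 200 with an abandoned stream still serializes as a distinct, queryable case.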

The gateway records ttfb_ms and duration_ms from its own wall clock. It also attaches a second set of timings to the response, inside an SSE event named pg_metadata that streams alongside the model's content. That event carries inference_ttft_ms and inference_total_ms, measured from when the request lands at the gateway to when the model produces tokens, excluding time spent receiving the request and returning bytes over the public internet. The client receives both end-to-end wall-clock and server-only inference timing, as separate structured fields on the same response.

These two timings are the inputs to the two-row table on every model card. End-to-end comes from the client's wall clock; server-only comes from pg_metadata.
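On the client side, pulling the server-only timings out of the stream might look like the sketch below. It assumes the pg_metadata event uses standard SSE framing (an event: line, one or more data: lines, a blank line) and that its payload is a JSON object; the function name and the sample payload values are illustrative, not taken from the gateway.

```python
import json
from typing import Optional


def parse_pg_metadata(sse_text: str) -> Optional[dict]:
    """Return the JSON payload of the first pg_metadata event, or None."""
    event_name = None
    data_lines = []
    # A trailing "" sentinel flushes a final event that lacks a blank line.
    for line in sse_text.splitlines() + [""]:
        if line == "":
            # A blank line terminates the current SSE event.
            if event_name == "pg_metadata" and data_lines:
                return json.loads("\n".join(data_lines))
            event_name, data_lines = None, []
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    return None


# Illustrative stream: one content chunk, then the metadata event.
sample_stream = (
    'data: {"delta": "Hello"}\n'
    "\n"
    "event: pg_metadata\n"
    'data: {"inference_ttft_ms": 97, "inference_total_ms": 1450}\n'
    "\n"
)
metadata = parse_pg_metadata(sample_stream)
```

The client then has both numbers for the same request: its own wall-clock measurement and the server-only timing from metadata.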

Getting events off the node

Each gateway runs a BillingTelemetryBuffer: an in-memory list guarded by an asyncio.Lock. Events are appended to the buffer under the lock. Every FLUSH_INTERVAL_SECONDS (5 seconds), a background loop swaps the buffer for an empty list under the lock, gzips the drained batch as JSONL, and uploads it to S3. The S3 key is partitioned by event timestamp:

events/year=2026/month=05/day=03/hour=14/region=<region>/events-143215-9f3a2c1d.jsonl.gz

The buffer is bounded at MAX_BUFFER_SIZE (10,000 events), and the flush runs independently of request handling. The buffer also emits a heartbeat. Every HEARTBEAT_EVERY_N_FLUSHES (6 flush cycles, roughly 30 seconds), the loop writes a separate event to a heartbeats/ prefix carrying flush_sequence, buffer_depth, retry_queue_depth, and events_emitted_since_boot. The heartbeat fires whether or not traffic flows, so a node with no requests is distinguishable from a node that has gone silent.
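The append-and-drain pattern described above can be sketched as follows. This is a simplified stand-in, not the real BillingTelemetryBuffer: the class and method names are hypothetical, the S3 upload and heartbeat are omitted, and rejecting an append when the buffer is full is an assumed policy.

```python
import asyncio
import gzip
import json
from typing import List, Optional

MAX_BUFFER_SIZE = 10_000  # bound stated in the text


class TelemetryBufferSketch:  # hypothetical name
    def __init__(self, max_size: int = MAX_BUFFER_SIZE) -> None:
        self._events: List[dict] = []
        self._lock = asyncio.Lock()
        self._max_size = max_size

    async def append(self, event: dict) -> bool:
        """Append under the lock; return False if the buffer is full."""
        async with self._lock:
            if len(self._events) >= self._max_size:
                return False  # the caller would log the drop
            self._events.append(event)
            return True

    async def drain(self) -> Optional[bytes]:
        """Swap in an empty list under the lock, then gzip the batch as JSONL."""
        async with self._lock:
            batch, self._events = self._events, []
        if not batch:
            return None
        # Compression happens outside the lock so appends are not blocked.
        jsonl = "\n".join(json.dumps(e) for e in batch) + "\n"
        return gzip.compress(jsonl.encode("utf-8"))
```

Holding the lock only for the list swap keeps the critical section short; serialization and compression run on the drained batch while new events land in the fresh list.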

A failed upload is placed in a retry queue and retried on the next flush, up to MAX_RETRY_ATTEMPTS. After the third failure, the compressed batch spills to a local disk cache capped at MAX_DISK_CACHE_MB (100 MB); when the cache exceeds the cap, the oldest spill files are evicted. On startup, the buffer uploads any leftover spill files before it accepts new traffic. Every irrecoverable drop (a malformed event, a full buffer, a disk-spill failure, a cache eviction) is logged with the sentinel string CRITICAL_BILLING_DROP, which a CloudWatch alarm matches.
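The oldest-first eviction of the disk cache can be illustrated with a short sketch. The function name, the use of file modification time to define "oldest", the *.jsonl.gz glob, and the log message format are assumptions; the sentinel string and the 100 MB cap come from the text.

```python
import logging
from pathlib import Path
from typing import List

logger = logging.getLogger("billing")

MAX_DISK_CACHE_MB = 100  # cap stated in the text


def evict_oldest_spills(cache_dir: Path, cap_bytes: int) -> List[str]:
    """Delete the oldest spill files until the cache fits under cap_bytes."""
    entries = []
    for path in cache_dir.glob("*.jsonl.gz"):
        stat = path.stat()
        entries.append((stat.st_mtime, stat.st_size, path))
    entries.sort(key=lambda e: e[0])  # oldest first (by mtime, assumed)

    total = sum(size for _, size, _ in entries)
    evicted = []
    for _, size, path in entries:
        if total <= cap_bytes:
            break
        path.unlink()
        total -= size
        evicted.append(path.name)
        # An eviction is an irrecoverable drop, so it carries the sentinel.
        logger.critical("CRITICAL_BILLING_DROP reason=cache_eviction file=%s", path.name)
    return evicted
```

A caller would pass cap_bytes=MAX_DISK_CACHE_MB * 1024 * 1024 (treating MB as MiB, which is also an assumption).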

The S3 partition key is built from the event timestamp, not the upload time. A node that loses its uplink and catches up in a single later batch still lands each event in the partition for the hour it occurred.
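A key builder along these lines shows why a late upload still lands in the right hour. It is a sketch under stated assumptions: the function name is hypothetical, partitions are assumed to be in UTC, and the file-name suffix is read as HHMMSS plus a short batch identifier, which the example key suggests but the text does not confirm.

```python
from datetime import datetime, timezone


def partition_key(event_ts: datetime, region: str, batch_id: str) -> str:
    """Build the S3 key from the event's own timestamp, not the upload time."""
    if event_ts.tzinfo is None:
        raise ValueError("event_ts must be timezone-aware")
    ts = event_ts.astimezone(timezone.utc)  # UTC partitioning is assumed
    return (
        f"events/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
        f"region={region}/events-{ts:%H%M%S}-{batch_id}.jsonl.gz"
    )


# An event from 14:32:15 UTC keeps hour=14 no matter when it is uploaded.
key = partition_key(
    datetime(2026, 5, 3, 14, 32, 15, tzinfo=timezone.utc),
    region="example-region",
    batch_id="9f3a2c1d",
)
```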

BillingContext.emit() is wrapped so it never raises on a full queue, a serialization error, or a canceled task; any such exception is caught and logged, and the request handler continues. The same buffer carries billing events and operational heartbeats.
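The never-raise guarantee can be pictured as a thin wrapper around the emit call. This is a generic sketch, not the BillingContext code: emit_safely and its signature are hypothetical, and the real wrapper may log differently. asyncio.CancelledError is listed explicitly because it does not derive from Exception on Python 3.8 and later.

```python
import asyncio
import logging
from typing import Callable

logger = logging.getLogger("billing")


def emit_safely(emit_fn: Callable[[dict], None], event: dict) -> bool:
    """Call emit_fn(event); log and swallow any failure so the request continues."""
    try:
        emit_fn(event)
        return True
    except (Exception, asyncio.CancelledError):
        # Telemetry failure must not fail the customer's request.
        logger.exception("telemetry emit failed; request handler continues")
        return False
```

Returning a boolean rather than raising lets the request handler ignore the result entirely, which matches the behavior described above.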

Synthetic benchmarks

Production telemetry records how the fleet performed on whatever prompts customers happened to send. The bench harness in backend/edge-production-setup/bench/ holds the inputs fixed so a measurement is comparable across releases: same prompt, same percentiles, run against the public DNS for the target region. Fixing the inputs removes three sources of variation that production-only measurement cannot separate: week-over-week drift in the customer prompt mix, day-to-day traffic variance, and the coupling of model behavior to prompt mix.

Three scripts cover the production modalities: bench_chat_streaming for text, bench_multimodal for image-plus-text, and bench_lifecycle for cold-load, warm-reload, and unload timing. Each uses the same definitions as the gateway. TTFT is the elapsed time from the start of the request until the first content delta arrives. Throughput is tokens / (t_end - t_first), which excludes TTFT, so the generation rate is not contaminated by connection setup. Each script reports p50 and p95 across 100 streaming runs, with a single region per run and a 300 ms sleep between runs.
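The two definitions translate directly into code. The sketch below is illustrative rather than the harness itself: the function names are hypothetical, timestamps are assumed to be in seconds from a monotonic clock, and the nearest-rank percentile is one common method that may differ from what the bench scripts use.

```python
import math
from typing import List, Tuple


def run_metrics(t_start: float, t_first: float, t_end: float, tokens: int) -> Tuple[float, float]:
    """Return (ttft_ms, tokens_per_second) for one streaming run."""
    ttft_ms = (t_first - t_start) * 1000.0
    generation_s = t_end - t_first  # excludes TTFT by construction
    tokens_per_s = tokens / generation_s if generation_s > 0 else float("nan")
    return ttft_ms, tokens_per_s


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile (assumed method) for an integer pct in 1..100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# One run: first delta 250 ms after start, 60 tokens over the next 2 s.
ttft_ms, tokens_per_s = run_metrics(t_start=10.0, t_first=10.25, t_end=12.25, tokens=60)
```

Because the denominator starts at t_first, a slow connection setup raises TTFT but leaves the reported generation rate unchanged.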

The output of each run is saved to benchmarks/qwen-<YYYY-MM-DD>/<label>/<kind>_bench.json (the date defaults to the UTC run date and can be overridden with BENCH_DATE for a historical folder). The files are committed to the same repository as the gateway code and are never overwritten.

Promoting the bench into the model card

The headline numbers on the qwen-3.5-27b model card come from one file: benchmarks/yvr-02-2026-05-27/27b/llm_bench.json. That file records 205 ms end-to-end TTFT at p50, 97 ms server-only, and 29.4 tokens per second.

Phase 7 of the ship-model workflow reads the bench artifact's summary block, formats the two-row table, adds a network_overhead_ms_p50 row (108 ms for this capture), replaces everything between the markers, and links the raw JSON below the table with the capture date and hardware. The card and the artifact land in one commit. The 27B card was updated this way during the YVR-02 cutover, which runs on an RTX 6000 Pro Blackwell with 96 GB of memory.
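The table rewrite can be sketched as a marker-bounded replacement. Everything specific here is an assumption made for illustration: the marker strings, the summary key names other than network_overhead_ms_p50, the Markdown table layout, and the function names. The numbers are the ones quoted above for the 27B capture.

```python
# Hypothetical marker strings; the real card's markers are not shown in the text.
START = "<!-- bench:start -->"
END = "<!-- bench:end -->"


def render_table(summary: dict) -> str:
    """Format the two timing rows plus the network-overhead row."""
    rows = [
        "| Measurement | TTFT p50 (ms) |",
        "| --- | --- |",
        f"| End-to-end | {summary['e2e_ttft_ms_p50']} |",
        f"| Server-only | {summary['server_ttft_ms_p50']} |",
        f"| network_overhead_ms_p50 | {summary['network_overhead_ms_p50']} |",
    ]
    return "\n".join(rows)


def rewrite_card(card_text: str, table: str) -> str:
    """Replace everything between the markers, keeping the markers themselves."""
    start = card_text.index(START) + len(START)
    end = card_text.index(END, start)
    return card_text[:start] + "\n" + table + "\n" + card_text[end:]


# Values quoted in the text; key names (except the last) are assumed.
summary = {
    "e2e_ttft_ms_p50": 205,
    "server_ttft_ms_p50": 97,
    "network_overhead_ms_p50": 108,
}
card = f"# qwen-3.5-27b\n\n{START}\nold table\n{END}\n\nNotes.\n"
updated = rewrite_card(card, render_table(summary))
```

In this capture the overhead row equals the difference between the two p50 values (205 - 97 = 108). Replacing only the marked region makes the rewrite idempotent, so rerunning it on the same artifact leaves the card unchanged.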

GPU health

Each backend pod declares its VRAM budget in a manifest at config/gpu_config_*.json. The operator reconciles the manifest when a model is added, and the gateway uses the routing configmap to decide which pod a request lands on. The manifests are reviewable in pull requests.

On a 1-GPU host, --enable-time-slicing is required to co-schedule the three Triton pods, and qwen-3.5-27b and the voice models cannot both stay resident. The bench_lifecycle script loads, unloads, and reloads a model end-to-end, which exercises the eviction path that runs whenever a node is reshuffled between layouts.

Methodology constraints

Bench-client locality. The machine that runs the bench is in the same metro as the node it measures. A bench run from another metro would change the network row. Each card reports both the server-only and end-to-end rows, and the artifact records network_overhead_ms_p50 (108 ms for the 27B yvr-02 capture). A reader in another metro can take the server-only number as a floor and add their own network leg.

Hardware drift between artifact and card. A bench JSON records a measurement against one GPU SKU at one time. The card names the hardware on the same line as the numbers (RTX 6000 Pro Blackwell 96 GB, 2026-05-27) and links the raw JSON. The prior L40S artifact stays in the repo.

Drop semantics. CRITICAL_BILLING_DROP is the only sentinel wired to the alarm. A request the client abandoned midstream is recorded with response_delivered = false rather than being flagged as a drop, so the alarm tracks lost revenue rather than normal client behavior.

Clock partitioning. Partitioning by event timestamp handles a node that uploads late. It does not handle a node whose own clock is wrong: NTP drift or a bad container clock can land events in the wrong hour, and the heartbeat timestamp is the signal used to detect such a node.
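One way to use the heartbeat timestamp as a clock check is to compare it with a trusted receive time, such as the storage service's own last-modified time for the heartbeat object. The text does not describe how the comparison is actually done, so the sketch below is an assumption throughout: the function name, the two-minute tolerance, and the choice of reference clock are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

TOLERANCE = timedelta(minutes=2)  # hypothetical threshold


def classify_heartbeat(heartbeat_ts: datetime, received_ts: datetime,
                       tolerance: timedelta = TOLERANCE) -> str:
    """Compare the node's self-reported time with a trusted receive time."""
    skew = received_ts - heartbeat_ts
    if skew < -tolerance:
        # The node claims a time in the future: its clock is wrong.
        return "clock_ahead"
    if skew > tolerance:
        # Either the clock is behind or the upload was delayed;
        # the two cases cannot be told apart from this signal alone.
        return "late_or_clock_behind"
    return "ok"


received = datetime(2026, 5, 3, 14, 0, 0, tzinfo=timezone.utc)
status = classify_heartbeat(received + timedelta(hours=1), received)
```

A timestamp ahead of the receive time is unambiguous, while a lagging one needs a second signal (for example, a gap in flush_sequence) to separate a late upload from a slow clock.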

Every number on polargrid.mintlify.app/models traces to a dated JSON file in the same repository as the gateway code. Telemetry captures every request, the bench reproduces the measurement under fixed inputs, and Phase 7 of ship-model writes the result onto the card.
