Polaris Platform — Operations Handbook

Audience: on-call engineers and release managers. Status: living document, reviewed every quarter.

Polaris is the internal service that receives, renders, and delivers documentation builds to every downstream team. This handbook is the single reference for operating it in production: how it is structured, how to deploy a change, and what to do when something breaks at 3 a.m.

Overview
Architecture
Deploying a release
Configuration
Provisioning channels
Scheduler parameters
Capacity planning
Cache warm-up
Render benchmarks
Incident response
CLI reference
Service tiers
Observability
Glossary

Overview

Polaris ingests Markdown from roughly forty repositories, renders it to HTML, and serves the result behind a small caching layer. The render path is deliberately stateless so that any worker can pick up any job from the shared queue, and a build is never tied to the machine that produced it.

The platform has three moving parts: an ingest worker that polls each source repository for new commits, a render pool that turns Markdown into HTML, and an edge tier that keeps rendered pages hot in cache and hands them straight back to every reader on request. Each part can be operated, scaled, and rolled back on its own. Splitting them this way keeps a renderer crash from ever touching ingest.

Reader traffic never reaches a renderer when the cache is warm, which is why the edge tier, not the render pool, is sized for the worst case.

The edge tier is what soaks up the Monday surge, and it does so without ever waking a renderer. A warm cache turns almost every request into a single in-memory lookup, so the median page leaves the edge in well under a millisecond. When a purge clears part of the cache, the affected routes briefly fall through to the render pool until they are warmed again. That brief window is the only time readers and renderers share a fate, which is why purges are scheduled and never casual. Capacity is therefore sized around the cache, not the render pool, and the forecast is revisited at every release review.

Architecture

Every request arrives at the edge tier, which consults its cache first and only drops through to a renderer on a miss. Renderers are entirely fungible; they keep no session state, so the fleet can be grown or drained on demand without any coordination between nodes or warming work done beforehand.

The edge tier is the only component readers ever talk to directly, and it is sized for peak fan-out rather than average load so a thundering herd after a cache purge never reaches the renderers.

The ingest worker is the only stateful component, and it keeps nothing more than a cursor into each repository's commit history.

Deploying a release

Deployment is a strictly ordered process. Do not skip steps, and never deploy to all regions at once.

Cut a release branch and tag it polaris-vX.Y.Z.
Run the full test suite and confirm the smoke check is green.
Replay the last quarter of production traffic against the candidate build and block the release on any regression it surfaces.
Promote the build to the staging edge and watch the dashboards.
Promote to production one region at a time, starting with us-west, and pause for a full health check between regions before continuing.
Tag the release as verified once the canary has cleared everywhere.
Update the status page once every region reports healthy.

Promotion to a single region has two phases, with a standing rollback rule:

Shift ten percent of traffic onto the new build as a canary.
Hold for fifteen minutes, then shift the remaining traffic.
Roll back instantly if the error rate rises above baseline.
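
The per-region sequence can be sketched as a small function. This is an illustration only, not how pol performs a promotion: `promote_region`, the `set_traffic` callback, and the list of error-rate samples standing in for the fifteen-minute hold are all hypothetical.

```python
from typing import Callable, Iterable


def promote_region(
    set_traffic: Callable[[int], None],
    error_samples: Iterable[float],
    baseline: float,
) -> bool:
    """Canary at 10%, hold, then shift the rest; roll back on a raised error rate.

    set_traffic is a hypothetical hook that routes the given percentage of
    traffic to the new build. error_samples stands in for readings taken
    during the hold.
    """
    set_traffic(10)  # phase 1: canary slice
    for rate in error_samples:
        if rate > baseline:
            set_traffic(0)  # roll back as soon as errors exceed baseline
            return False
    set_traffic(100)  # phase 2: shift the remaining traffic
    return True
```

Passing the traffic hook in as a parameter keeps the sketch testable without touching a real load balancer.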

A region only counts as promoted once its gate clears:

Definition of STABLE. STABLE is not a status the platform reports but our own composite gate, owned by release-gate.py (--require-stable mode). A deploy is ✅ STABLE only when all four checks pass:

the new revision answers /healthz on every renderer;

error rate over the last 5 minutes is below 0.1%;

p99 latency is within 10% of the previous release;

no renderer has restarted in the trailing 10 minutes.

It means the rollout is stable enough to leave unattended. The soak window and the slow-burn alert threshold must also clear first, so STABLE marks the fully settled end state, not the first passing check. Nothing downstream re-checks once it is STABLE.
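
The four checks can be read as a single predicate. The sketch below is illustrative and is not the logic inside release-gate.py: `GateInputs` and its field names are assumptions, "within 10%" is interpreted as "no more than 10% slower", and the soak window and slow-burn alert are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GateInputs:
    """Hypothetical snapshot of one region's rollout; field names are assumptions."""
    healthz_ok: List[bool]   # one entry per renderer on the new revision
    error_rate_5m: float     # fraction of failed requests over the last 5 minutes
    p99_ms: float            # p99 latency of the new release
    prev_p99_ms: float       # p99 latency of the previous release
    restarts_10m: int        # renderer restarts in the trailing 10 minutes


def is_stable(g: GateInputs) -> bool:
    """Return True only when all four documented checks pass."""
    all_healthy = len(g.healthz_ok) > 0 and all(g.healthz_ok)
    low_errors = g.error_rate_5m < 0.001            # below 0.1%
    latency_ok = g.p99_ms <= g.prev_p99_ms * 1.10   # assumed reading of "within 10%"
    no_restarts = g.restarts_10m == 0
    return all_healthy and low_errors and latency_ok and no_restarts
```

Each check is computed separately so a failing gate can report which condition blocked it.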

Configuration

The render pool is tuned with a small set of environment variables. The defaults are conservative; change them only after a load test, never on a hunch.

Variable	Default	Purpose	Restart needed?
`RENDER_CONCURRENCY`	`8`	Renders running per node	Yes
`CACHE_TTL_SECONDS`	`600`	How long the edge keeps a page	No
`INGEST_INTERVAL`	`30`	Seconds between repository polls	No
`LOG_LEVEL`	`info`	Verbosity of render logs	No

Provisioning channels

A render setting reaches a node through one of two channels. Both carry the same keys; they differ only in how far the value reaches once it lands.

Channel	Where the value lives	Scope
Cluster default	`polaris.yaml` under `render.*` (in our deployment: `ops/polaris/regions/us-west/polaris.yaml` → `render.concurrency`)	Cluster-wide default for every renderer in the region.
Per-deploy override	Passed to `pol deploy --set render.concurrency=N` at promote time, then recorded in the deploy log under that release tag	Per-deploy / per-region. Overrides the cluster default for that one promotion.
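
The precedence between the two channels can be sketched as a lookup that prefers the per-deploy value. The function name and the dictionary shapes are assumptions made for illustration, not a description of how pol resolves settings internally.

```python
from typing import Dict, Optional


def resolve_setting(
    key: str,
    cluster_defaults: Dict[str, int],
    deploy_overrides: Optional[Dict[str, int]] = None,
) -> int:
    """Return the per-deploy override when present, else the cluster default."""
    overrides = deploy_overrides or {}
    if key in overrides:
        return overrides[key]      # applies to this one promotion only
    return cluster_defaults[key]   # value from polaris.yaml under render.*
```

For example, `resolve_setting("render.concurrency", {"render.concurrency": 8}, {"render.concurrency": 16})` yields the override, while omitting the third argument yields the cluster default.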

Scheduler parameters

The render scheduler decides how many builds a node works on at once and how it sheds load when the queue backs up. These knobs are read at startup; a couple are re-read on SIGHUP.

Parameter Default Hot-reload Description

render_batch_window 250ms yes How long the scheduler waits to accumulate jobs before dispatching a batch to a worker. A longer window improves batching efficiency but adds latency to the first render in the batch.

max_render_batch 24 Maximum number of builds dispatched to a single worker in one batch. Bounds peak memory per worker; raising it trades memory for throughput. What changes as the batch fills:

Phase	What still happens	What changes
Filling	The scheduler keeps adding jobs to the open batch as workers report free slots.	Instead of dispatching when the time window expires, it now waits until the batch reaches its max size or the window expires, whichever comes first.
Draining	Workers render the batch and report progress back to the scheduler.	No longer capped at the window. The scheduler holds new jobs until at least one worker frees a slot, so a single slow build no longer stalls the whole node.

queue_backpressure soft yes How the scheduler reacts when the pending queue passes its high-water mark. soft slows ingest polling; hard rejects new jobs outright until the queue drains.
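
The dispatch rule for an open batch (send it once it reaches max_render_batch or once render_batch_window expires, whichever comes first) can be sketched as follows. This is a minimal illustration, not the scheduler's code: `Batch` and `should_dispatch` are hypothetical names, and timestamps are plain floats in seconds.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Batch:
    """Hypothetical open batch; opened_at is a timestamp in seconds."""
    opened_at: float
    jobs: List[str] = field(default_factory=list)


def should_dispatch(batch: Batch, now: float, window_s: float, max_size: int) -> bool:
    """Dispatch on a full batch or an expired window, whichever comes first."""
    if not batch.jobs:
        return False  # nothing to send yet
    full = len(batch.jobs) >= max_size
    expired = (now - batch.opened_at) >= window_s
    return full or expired
```

The window and size limits are passed in rather than hard-coded so the same check works after a SIGHUP changes a hot-reloadable value.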

Capacity planning

Capacity is owned by the on-call lead and revisited at every release review.

Headroom is tracked against the rolling Monday peak; we provision for twice that peak, which has absorbed every surge so far this year.

Scaling is still manual: an operator adds nodes to the render pool and then waits a good while for them to warm.

The underlying forecast is rough, only loosely tuned, and optimistic about cache behaviour.

Storage grows slowly but predictably and for the most part needs almost no attention day to day.

For now the review itself is formal, scheduled, and surprisingly thorough in practice.

Cache warm-up

Before a deploy drains the old pool, warm the new one. The runner walks a fallback chain top-down and stops at the first source that answers.

Warm set size — how many routes to pre-render into the edge cache. Walk the chain in strict top-down order and never skip a step you have not actually checked. The deploy report records only the bare number; the source (which step answered) goes in the run log.
1. Pinned override — read warmup.count from the channel config if an operator set it for this deploy.
  - Use it if the key is present.
  - Fall through only if the key is absent (the common case — most deploys inherit the default).
2. Last good run — read the warmed total from the previous deploy's report for this channel.
  - Use it if a report is on disk for this channel.
  - Fall through only if no report exists. "Probably none" is not enough — list the directory and confirm before moving on.
3. Live traffic sample — query the edge for the routes served in the last hour and take the working set:
  - For both pools, weight by hit count, not by raw route count.
  - Capture the staleness flag; do not discard it. The sampler always runs the freshness gate, so it emits a FRESH / STALE verdict alongside the set. A STALE verdict means the sample window was too short to trust for sizing and MUST be surfaced (see the staleness callout below); it is not noise to ignore. The working set itself is still emitted regardless of the verdict, so for the warm-count purpose a present set is success.
  - Use the set if the sample returns one, regardless of the staleness verdict.
  - Fall through (for the warm count only) only if the query times out or the sampler errors out. A STALE verdict with a set present is not a warm-count fall-through reason.
  Staleness callout (when the sampler ran). The sampler owns the freshness judgement; the runner only surfaces it. Branch on the verdict:
  - FRESH → no caveat needed.
  - STALE → the window was too short to be representative, so the warm set may miss routes that matter under real load. When STALE:
    1. Deploy report — record the flag verbatim and set the warm-up step to Succeeded with caveats (not Succeeded) with a one-line reason.
    2. Dashboard — annotate the warm-up tile so the number is not read as a clean result.
4. Static guess — fall back to the tier's default warm count from the service table. Build this only when step 3 has demonstrably failed.
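
The chain above can be sketched as a function that tries each source in order and reports which step answered. This is an illustration under stated assumptions: the callables standing in for the report lookup and the edge sampler, the `SampleResult` type, and the shape of the return value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class SampleResult:
    """Hypothetical sampler output: working-set size plus freshness verdict."""
    count: int
    verdict: str  # "FRESH" or "STALE"


def warm_set_size(
    channel_config: Dict[str, int],
    last_good: Callable[[], Optional[int]],
    sample: Callable[[], SampleResult],
    tier_default: int,
) -> Tuple[int, str, Optional[str]]:
    """Return (count, answering step, staleness verdict or None)."""
    # Step 1: pinned override, used whenever the key is present.
    if "warmup.count" in channel_config:
        return channel_config["warmup.count"], "pinned override", None
    # Step 2: last good run, used whenever a report exists for the channel.
    previous = last_good()
    if previous is not None:
        return previous, "last good run", None
    # Step 3: live sample. A STALE verdict is surfaced, never a fall-through.
    try:
        result = sample()
        return result.count, "live traffic sample", result.verdict
    except (TimeoutError, RuntimeError):
        # Step 4: static guess, only after the sampler timed out or errored.
        return tier_default, "static guess", None
```

Returning the verdict alongside the count lets the caller write the "Succeeded with caveats" status without re-running the sampler.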

The warm-up runner keeps a small self-audit so a route is never pre-rendered into the wrong bucket. It walks the fallback chain in strict top-down order, records which step answered, and scores every route by hit rate before the deploy report is written, so a misfiled route is caught here rather than chased down later under real load. The ledger has two buckets: the clean bucket (hit rate above ninety percent), which holds the pinned-override set, the last-good set, and the five sampler.* tiers (window length, route weighting, freshness flag, edge skew, and cold-start bias); and STALE_GUARD, the sampled tiers still re-checked by hand, which is now empty. When a tier turns flaky, add it to STALE_GUARD pinned at hitRate < 0.5; once it clears, fold it back into the clean bucket, so coverage is driven through this ledger rather than chased by hand.

Render benchmarks

Throughput is measured per node at steady state, p50 renders per second. These profiles anchor the capacity forecast above.

Each profile is captured by the same overnight sweep:

Sweep pass
Warm the caches, then ramp to the target concurrency.
Hold at the target for a full ten-minute window.
Drain, then record steady-state renders per second.

Profile	Concurrency	Renders/sec
Baseline	8	120
Tuned	16	240
Burst	24	300
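
One way to read the throughput table is renders per second per unit of concurrency. The short calculation below uses only the figures in the table above; `PROFILES` and `renders_per_slot` are illustrative names.

```python
# (concurrency, renders/sec) per profile, copied from the table above
PROFILES = {
    "Baseline": (8, 120),
    "Tuned": (16, 240),
    "Burst": (24, 300),
}


def renders_per_slot(profile: str) -> float:
    """Throughput divided by concurrency for one profile."""
    concurrency, renders_per_sec = PROFILES[profile]
    return renders_per_sec / concurrency
```

On these numbers Baseline and Tuned both deliver 15 renders per second per slot, while Burst delivers 12.5, so the last step up in concurrency buys throughput at a lower per-slot efficiency.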

Client-observed latency from the same load test; the Burst profile is now our pick, so Tuned is no longer highlighted:

Metric	Baseline (c=8)	Tuned (c=16)	Burst (c=24)
p50 latency	180 ms	96 ms	71 ms
p95 latency	540 ms	310 ms	228 ms
Peak memory	2.1 GB	3.4 GB	5.2 GB

Baseline profile

The conservative default that ships out of the box — safe on the smallest node size and the right place to start a new region before there is any load history.

Setting	Value
`RENDER_CONCURRENCY`	`8`
`CACHE_TTL_SECONDS`	`300`

Tuned profile

Raised concurrency for the Monday peak, paired with a longer cache TTL so the edge absorbs more of the surge before a renderer is ever touched.

Setting	Value
`RENDER_CONCURRENCY`	`16`
`CACHE_TTL_SECONDS`	`900`

Incident response

When paged, work the list from the top — the first matching cause is almost always the real one.

Edge returning stale pages. Confirm the TTL, then purge the affected route.
Renders timing out. Pull the slowest renderer's trace, look for a runaway Markdown table or a pathological regex in a code fence, and quarantine that one source repo rather than scaling the whole pool blindly.
Ingest lag climbing. Inspect the cursor; a poisoned commit stalls the queue.
Total outage. Page the platform lead and fail over to the static mirror.

Open the incident channel before you start poking — a silent fix that works is still an incident nobody can learn from later.

CLI reference

Operators drive Polaris through the pol command. The most common calls:

pol status --region us-west
pol drain renderer-12
pol cache purge /docs/handbook
pol tail renderer-12 --since 5m

A dry run prints the plan without executing it:

pol deploy v1.4.0 --dry-run

Service tiers

Not every consumer gets the same guarantees. Tiers are assigned at onboarding and reviewed whenever a team's traffic profile changes.

Tier	Availability	Support
Platinum	99.95%	24/7 paging
Gold	99.9%	Extended hours

Observability

Every render emits a structured span, so a slow page can be traced end to end from the edge hit down to the Markdown lexer. Spans are sampled at one percent in steady state and at one hundred percent during a deploy window.

polctl --legacy-status
pol trace --slowest 10 --window 15m
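
The two sampling rates described in this section can be sketched as a per-span decision. This is an illustration, not the tracing pipeline's code: the function names and the `in_deploy_window` flag are assumptions.

```python
import random


def sample_rate(in_deploy_window: bool) -> float:
    """One hundred percent during a deploy window, one percent otherwise."""
    return 1.0 if in_deploy_window else 0.01


def should_sample(in_deploy_window: bool, rng: random.Random) -> bool:
    """Decide whether to keep one span; rng.random() returns a value in [0.0, 1.0)."""
    return rng.random() < sample_rate(in_deploy_window)
```

Passing the random generator in explicitly makes the decision reproducible in tests by seeding it.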

Glossary

A few terms recur throughout this handbook, and it is worth pinning them down so that an incident call does not stall on vocabulary:

Build — one rendered snapshot of a source repository at a given commit.
Canary — the small slice of live traffic a new build serves first.
Cursor — the per-repository commit pointer the ingest worker advances.
Drain — removing a node from rotation so it finishes in-flight work only.
Edge — the caching tier that readers actually connect to.
Mirror — the read-only static fallback served during a total outage.

References

Polaris Platform — Operations Handbook

Audience: on-call engineers and release managers. Status: living document, reviewed ~~each~~every quarter.

Overview
Architecture
Deploying a ~~change~~release
Configuration
Provisioning channels
Scheduler parameters
Capacity planning
Cache warm-up
Render benchmarks
Incident response
CLI reference
Service tiers
Deprecations
Observability
Glossary

Overview

Architecture

~~Renderers are cattle, not pets. If one misbehaves, terminate it — the pool replaces it within a minute and no request is lost.~~

The edge tier is the only component readers ever talk to directly, and it is sized for peak fan-out rather than average load so a thundering herd after a cache purge never reaches the renderers.

Deploying a changerelease

Deployment is a strictly ordered process. Do not skip steps, and never deploy to all regions at once.

Cut a release branch and tag it polaris-vX.Y.Z.
Run the full test suite and confirm the smoke check is green.
Replay the last quarter of production traffic against the candidate build and block the release on any regression it surfaces.
Promote the build to the staging edge and watch the dashboards.
Promote to production one region at a time, starting with us-west., and pause for a full health check between each region before continuing.
Tag the release as verified once the canary has cleared everywhere.
Update the status page once every region reports healthy.

Promotion to a single region itself has two phases:

Drain ten percent of traffic onto the new build as a canary.
Hold for fifteen minutes, then shift the remaining traffic.
Roll back instantly if the error rate rises above baseline.

A region only counts as promoted once its gate clears:

Definition of GREENSTABLE. A deploy is GREENSTABLE — not a status the platform reports, but our own composite gate, owned by release-gate.py (--require-greenstable mode). It is ✅ GREENSTABLE only when all four checks pass:

the new revision answers /healthz on every renderer;

error rate over the last 5 minutes is below 0.1%;

p99 latency is within 10% of the previous release;

no renderer has restarted in the trailing 10 minutes.

It means the rollout is GREENSTABLE enough to leave unattended. The soak window and the slow-burn alert threshold must also clear first, so STABLE marks the fully-settled end state, not the first passing check. Nothing downstream re-checks once it is GREENSTABLE.

Configuration

The render pool is tuned with a small set of environment variables. The defaults are conservative; ~~raise~~change them only ~~with evidence from the dashboards.~~after a load test, never on a hunch.

Variable	Default	Purpose	Restart needed?
`RENDER_CONCURRENCY`	`8`	Renders running per node	Yes
`CACHE_TTL_SECONDS`	`300600`	How long the edge keeps a page	No
`INGEST_INTERVAL`	`30`	Seconds between repository polls	No
`LOG_LEVEL`	`info`	Verbosity of render logs	No

Provisioning channels

A render setting reaches a node through one of two channels. Both carry the same keys; they differ only in how far the value reaches once it lands.

Channel	Where the value lives	Scope
Cluster default	`polaris.yaml` under `render.*` (in our deployment: `ops/polaris/regions/us-west/polaris.yaml` → `render.concurrency`)	Cluster-wide default for every renderer in the region.
Per-deploy override	Passed to `pol deploy --set render.concurrency=N` at promote time, then recorded in the deploy log under that release tag	Per-deploy / per-region. ~~Overrides the cluster default~~Overrides the cluster default for that one promotion.

Scheduler parameters

The render scheduler decides how many builds a node works on at once and how it sheds load when the queue backs up. These knobs are read at startup; a couple are re-read on SIGHUP.

Parameter	Default	Hot-reload	Description
`max_render_batch`	`24`		Maximum number of builds dispatched to a single worker in one batch. Bounds peak memory per worker; raising it trades memory for throughput.
`queue_backpressure`	`soft`	yes	How the scheduler reacts when the pending queue passes its high-water mark. `soft` slows ingest polling; `hard` rejects new jobs outright until the queue drains.

What changes as the batch fills (`max_render_batch`):

Phase	What still happens	What changes
Filling	The scheduler keeps adding jobs to the open batch as workers report free slots.	Instead of dispatching when the time window expires, it now waits until the batch reaches its max size or the window expires, whichever comes first.
Draining	Workers render the batch and report progress back to the scheduler.	No longer capped at the window. The scheduler holds new jobs until at least one worker frees a slot, so a single slow build no longer stalls the whole node.
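
The dispatch rule and the two backpressure modes can be modelled in a few lines. This is a sketch under stated assumptions, not the real scheduler: `BatchWindow`, `window_s`, `backpressure_action`, and the returned action strings are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class BatchWindow:
    """Hypothetical model of one open batch."""
    max_size: int        # plays the role of max_render_batch
    window_s: float      # assumed length of the dispatch window, in seconds
    opened_at: float     # time at which the batch was opened
    jobs: list = field(default_factory=list)

    def add(self, job: str) -> None:
        if len(self.jobs) >= self.max_size:
            raise ValueError("batch is full; dispatch it first")
        self.jobs.append(job)

    def should_dispatch(self, now: float) -> bool:
        """Dispatch when the batch is full or the window expired, whichever comes first."""
        if not self.jobs:
            return False  # assumption: an empty batch is never dispatched
        full = len(self.jobs) >= self.max_size
        expired = now - self.opened_at >= self.window_s
        return full or expired


def backpressure_action(pending: int, high_water: int, mode: str) -> str:
    """Map queue depth to the reaction described for queue_backpressure."""
    if pending <= high_water:
        return "accept"
    if mode == "soft":
        return "slow-ingest"   # soft slows ingest polling
    if mode == "hard":
        return "reject"        # hard rejects new jobs until the queue drains
    raise ValueError(f"unknown mode: {mode!r}")
```

A batch with `max_size=2` and one job dispatches once the window runs out; a second job makes it dispatch immediately, which is the "whichever comes first" rule from the Filling row.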

Capacity planning

Capacity is owned by the on-call lead and revisited at every release review.

Headroom is tracked against the rolling Monday peak; we provision for twice that peak, which has absorbed every surge so far this year.

Scaling is still manual: an operator adds nodes to the render pool and then waits a good while for them to warm.

The underlying forecast is rough, only tuned, and optimistic about cache behaviour.

Storage grows slowly but predictably and for the most part needs almost no attention day to day.

For now the review itself is formal, scheduled, and surprisingly thorough in practice.

Cache warm-up

Before a deploy drains the old pool, warm the new one. The runner walks a fallback chain top-down and stops at the first source that answers.

Warm set size — how many routes to pre-render into the edge cache. Walk the chain in strict top-down order and never skip a step you have not actually checked. The deploy report records only the bare number; the source (which step answered) goes in the run log.
1. Pinned override — read warmup.count from the channel config if an operator set it for this deploy.
  - Use it if the key is present.
  - Fall through only if the key is absent (the common case — most deploys inherit the default).
2. Last good run — read the warmed total from the previous deploy's report for this channel.
  - Use it if a report is on disk for this channel.
  - Fall through only if no report exists. "Probably none" is not enough — list the directory and confirm before moving on.
3. Live traffic sample — query the edge for the routes served in the last hour and take the working set:
  - For both pools, weight by hit count, not by raw route count.
  - Capture the staleness flag; do not discard it. The sampler always runs the freshness gate, so it emits a FRESH / STALE verdict alongside the set. A STALE verdict means the sample window was too short to trust for sizing and MUST be surfaced (see the staleness callout below); it is not noise to ignore. The working set itself is still emitted regardless of the verdict, so for the warm-count purpose a present set is success.
  - Use the set if the sample returns one, regardless of the staleness verdict.
  - Fall through (for the warm count only) only if the query times out or the sampler errors out. A STALE verdict with a set present is not a warm-count fall-through reason.
  Staleness callout (when the sampler ran). The sampler owns the freshness judgement; the runner only surfaces it. Branch on the verdict:
  - FRESH → no caveat needed.
  - STALE → the window was too short to be representative, so the warm set may miss routes that matter under real load. When STALE:
    1. Deploy report — record the flag verbatim and set the warm-up step to Succeeded with caveats (not Succeeded) with a one-line reason.
    2. Dashboard — annotate the warm-up tile so the number is not read as a clean result.
4. Static guess — fall back to the tier's default warm count from the service table. Build this only when step 3 has demonstrably failed.
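
The chain above can be sketched as a single function. Everything named here is hypothetical (`resolve_warm_count`, the source callables, the source labels), and the sampler is assumed to return the size of the working set together with its verdict; treat it as an outline of the control flow rather than the runner's actual code.

```python
from typing import Callable, Optional, Tuple

# Each simple source returns a warm count, or None when it has nothing to offer.
Source = Callable[[], Optional[int]]
# The sampler returns (count, verdict) with verdict "FRESH" or "STALE".
Sampler = Callable[[], Tuple[int, str]]


def resolve_warm_count(
    pinned_override: Source,
    last_good_run: Source,
    live_sample: Sampler,
    static_guess: int,
) -> Tuple[int, str, Optional[str]]:
    """Walk the chain top-down and return (count, source, caveat)."""
    count = pinned_override()           # step 1: warmup.count, if present
    if count is not None:
        return count, "pinned-override", None

    count = last_good_run()             # step 2: previous deploy report, if on disk
    if count is not None:
        return count, "last-good-run", None

    try:
        count, verdict = live_sample()  # step 3: live traffic sample
    except (TimeoutError, RuntimeError):
        # Only a timeout or sampler error falls through to step 4.
        return static_guess, "static-guess", None

    # A STALE verdict never causes a fall-through; it is surfaced as a caveat.
    caveat = "STALE" if verdict == "STALE" else None
    return count, "live-sample", caveat
```

The returned source label is what would go in the run log, and a non-empty caveat is the signal to mark the warm-up step as Succeeded with caveats.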

Render benchmarks

Throughput is measured per node at steady state, p50 renders per second. These profiles anchor the capacity forecast above.

Each profile is captured by the same overnight sweep:

Sweep pass
1. Warm the caches, then ramp to the target concurrency.
2. Hold at the target for a full ten-minute window.
3. Drain, then record steady-state renders per second.

Profile	Concurrency	Renders/sec
Baseline	8	120
Tuned	16	240
Burst	24	300

Client-observed latency from the same load test. The new Burst profile is now our pick, so Tuned is no longer highlighted:

Metric	Baseline (c=8)	Tuned (c=16)	Burst (c=24)
p50 latency	180 ms	96 ms	71 ms
p95 latency	540 ms	310 ms	228 ms
Peak memory	2.1 GB	3.4 GB	5.2 GB

Baseline profile

The conservative default that ships out of the box — safe on the smallest node size and the right place to start a new region before there is any load history.

Setting	Value
`RENDER_CONCURRENCY`	`8`
`CACHE_TTL_SECONDS`	`300`

Tuned profile

Raised concurrency for the Monday peak, paired with a longer cache TTL so the edge absorbs more of the surge before a renderer is ever touched.

Setting	Value
`RENDER_CONCURRENCY`	`16`
`CACHE_TTL_SECONDS`	`900`

Incident response

When paged, work the list from the top — the first matching cause is almost always the real one.

Edge returning stale pages. Confirm the TTL, then purge the affected route.
Renders timing out. Pull the slowest renderer's trace, look for a runaway Markdown table or a pathological regex in a code fence, and quarantine that one source repo rather than scaling the whole pool blindly.
Ingest lag climbing. Inspect the cursor; a poisoned commit stalls the queue.
Total outage. Page the platform lead and fail over to the static mirror.

Open the incident channel before you start poking — a silent fix that works is still an incident nobody can learn from later.

CLI reference

Operators drive Polaris through the pol command. The most common calls:

pol status --region us-west
pol drain renderer-12
pol cache purge /docs/handbook
pol tail renderer-12 --since 5m

A dry run prints the plan without executing it:

pol deploy v1.4.0 --dry-run

Service tiers

Not every consumer gets the same guarantees. Tiers are assigned at onboarding and reviewed whenever a team's traffic profile changes.

Tier	Availability	Support
Platinum	99.95%	24/7 paging
Gold	99.9%	Extended hours

Observability

polctl --legacy-status
pol trace --slowest 10 --window 15m

Glossary

A few terms recur throughout this handbook, and it is worth pinning them down so that an incident call does not stall on vocabulary:

Build — one rendered snapshot of a source repository at a given commit.
Canary — the small slice of live traffic a new build serves first.
Cursor — the per-repository commit pointer the ingest worker advances.
Drain — removing a node from rotation so it finishes in-flight work only.
Edge — the caching tier that readers actually connect to.
Mirror — the read-only static fallback served during a total outage.
