Audience: on-call engineers and release managers. Status: living document, reviewed
every quarter.
Polaris is the internal service that receives, renders, and delivers
documentation builds to every downstream team. This handbook is the single
reference for operating it in production: how it is structured, how to deploy a
change, and what to do when something breaks at 3 a.m.
Polaris ingests Markdown from roughly forty repositories, renders it to HTML,
and serves the result behind a small caching layer. The render path is
deliberately stateless so that any worker can pick up any job from the shared
queue, and a build is never tied to the machine that produced it.
The platform has three moving parts: an ingest worker that watches the source repositories, a render pool that turns Markdown into HTML, and an edge tier that caches and serves the finished pages to readers. Each part can be operated, scaled, and rolled back on its own. Splitting them this way keeps a renderer crash from ever touching ingest.
At steady state Polaris serves a few million page views a day across the three regions, with the bulk concentrated in the hour after each Monday release. None of that traffic ever reaches a renderer when the cache is warm, which is exactly why the edge tier, not the render pool, is sized for the worst case.
Every request enters through the edge tier, which checks its cache first and falls through to a renderer only on a miss. Renderers are fully interchangeable: they hold no session state, so the pool can be scaled up or drained at will, with no coordination between nodes and no warming in advance.
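The read path described above amounts to a cache-aside lookup. The following is a minimal sketch in Python, not Polaris's actual code: `EdgeCache`, `fake_render`, and the in-memory dictionary are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict


class EdgeCache:
    """Hypothetical edge tier: serve from cache, fall through on a miss."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render          # any renderer will do; they hold no state
        self._pages: Dict[str, str] = {}
        self.misses = 0

    def get(self, path: str) -> str:
        if path in self._pages:        # warm cache: the renderer is never called
            return self._pages[path]
        self.misses += 1               # cold cache: fall through to a renderer
        html = self._render(path)
        self._pages[path] = html
        return html

    def purge(self, path: str) -> None:
        self._pages.pop(path, None)    # the next get() for this path will miss


def fake_render(path: str) -> str:
    # Stand-in for a real renderer; assumed for this example only.
    return f"<h1>{path}</h1>"
```

Because the renderer is passed in as a plain function, swapping one renderer for another changes nothing for the cache, which is the property that makes the pool safe to drain.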
Renderers are cattle, not pets. If one misbehaves, terminate it — the pool replaces it within a minute and no request is lost.
The edge tier is the only component readers ever talk to directly, and it is sized for peak fan-out rather than average load so a thundering herd after a cache purge never reaches the renderers.
The ingest worker is the only stateful component, and it keeps nothing more than
a cursor into each repository's commit history. A poisoned commit therefore
blocks only that one repository, never the whole queue.
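To illustrate why a poisoned commit stalls only its own repository, here is a rough sketch of per-repository cursors. The function `poll_once`, its arguments, and the integer cursor are assumptions made for the example; the real ingest worker may track its position differently.

```python
from typing import Callable, Dict, List


def poll_once(
    cursors: Dict[str, int],
    commits: Dict[str, List[str]],
    handle: Callable[[str, str], None],
) -> Dict[str, str]:
    """Advance each repository's cursor independently.

    Returns a map of repository name to the error that blocked it.
    """
    blocked: Dict[str, str] = {}
    for repo, history in commits.items():
        position = cursors.get(repo, 0)
        while position < len(history):
            try:
                handle(repo, history[position])
            except Exception as exc:      # a poisoned commit stops this repo only
                blocked[repo] = str(exc)
                break
            position += 1                 # advance only after a successful handle
        cursors[repo] = position
    return blocked
```

The loop over repositories continues after a failure, so the other cursors keep moving while the blocked one stays parked on the bad commit.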
Deployment is a strictly ordered process. Do not skip steps, and never deploy to all regions at once.
1. Tag the release as polaris-vX.Y.Z.
2. Deploy to the staging edge and watch the dashboards.
3. Promote to production one region at a time, starting with us-west.
4. Mark the release verified once the canary has cleared everywhere.

Promotion to a single region itself has two phases:
The render pool is tuned with a small set of environment variables. The defaults
are conservative; change them only after a load test, never on a hunch.
| Variable | Default | Purpose | Restart needed? |
|---|---|---|---|
| RENDER_CONCURRENCY | 8 | Renders running per node | Yes |
| CACHE_TTL_SECONDS | | How long the edge keeps a page | No |
| INGEST_INTERVAL | 30 | Seconds between repository polls | No |
| LOG_LEVEL | info | Verbosity of render logs | No |
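As an illustration of how a process might read these variables, the sketch below loads them with the defaults from the table. `RenderConfig` and `load_config` are hypothetical names, and because the table lists no default for CACHE_TTL_SECONDS, the sketch leaves it unset rather than guessing one.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RenderConfig:
    render_concurrency: int
    cache_ttl_seconds: Optional[int]   # no default is listed, so none is assumed
    ingest_interval: int
    log_level: str


def load_config(env: Mapping[str, str] = os.environ) -> RenderConfig:
    """Read the tuning variables, falling back to the documented defaults."""
    ttl = env.get("CACHE_TTL_SECONDS")
    return RenderConfig(
        render_concurrency=int(env.get("RENDER_CONCURRENCY", "8")),
        cache_ttl_seconds=int(ttl) if ttl else None,
        ingest_interval=int(env.get("INGEST_INTERVAL", "30")),
        log_level=env.get("LOG_LEVEL", "info"),
    )
```

Passing the environment in as a mapping keeps the function easy to exercise in a test without touching the real process environment.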
Capacity is owned by the on-call lead and revisited at every release review.
Headroom is tracked against the rolling Monday peak; we provision for twice that
peak, which has absorbed every surge so far this year.
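The twice-the-peak rule can be turned into a quick node count. This is only a back-of-the-envelope sketch: `nodes_needed`, the requests-per-second unit, and the per-node throughput figure are assumptions for illustration, not measured Polaris numbers.

```python
import math


def nodes_needed(monday_peak_rps: float, per_node_rps: float, headroom: float = 2.0) -> int:
    """Nodes required to carry `headroom` times the rolling Monday peak."""
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    return math.ceil(monday_peak_rps * headroom / per_node_rps)
```

For example, with a hypothetical peak of 900 requests per second and 250 requests per second per node, the target is 1800 requests per second, which rounds up to 8 nodes.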
Scaling is still manual: an operator adds nodes to the render pool and then waits a good while for them to warm.
The underlying forecast is rough, stale, only loosely tuned, and optimistic about cache
behaviour.
Storage grows slowly but predictably and needs almost no day-to-day attention.
For now the review itself is formal, scheduled, and surprisingly thorough in practice.
When paged, work the list from the top — the first matching cause is almost always the real one.
Open the incident channel before you start poking — a silent fix that works is still an incident nobody can learn from later.
Operators drive Polaris through the pol command. The most common calls:
pol status --region us-west
pol drain renderer-12
pol cache purge /docs/handbook
pol tail renderer-12 --since 5m
A dry run prints the plan without executing it:
pol deploy v1.4.0 --dry-run
Every pol subcommand accepts --json for machine-readable output, which is
handy when you are scripting against it from a notebook.
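A small wrapper makes the --json output convenient to consume from Python. The sketch below assumes only what the text states, namely that a subcommand accepts --json and prints JSON; the helper names `run_pol` and `pol_json` are hypothetical, and no particular output schema is assumed.

```python
import json
import subprocess
from typing import Any, Callable, List


def run_pol(args: List[str]) -> str:
    """Run `pol` with --json appended and return its raw stdout."""
    result = subprocess.run(
        ["pol", *args, "--json"], capture_output=True, text=True, check=True
    )
    return result.stdout


def pol_json(args: List[str], runner: Callable[[List[str]], str] = run_pol) -> Any:
    """Parse the JSON a pol subcommand prints; the schema is not assumed."""
    return json.loads(runner(args))
```

A call such as `pol_json(["status", "--region", "us-west"])` would return whatever structure the command emits. The `runner` parameter exists so the parsing can be tested without a `pol` binary on the path.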
Not every consumer gets the same guarantees. Tiers are assigned at onboarding and reviewed whenever a team's traffic profile changes.
| Tier | Availability | Support |
|---|---|---|
| Platinum | 99.95% | 24/7 paging |
| Gold | 99.9% | |
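It can help to translate the availability figures into a downtime budget. The sketch below does that arithmetic; the 30-day window is an assumption, since the table does not say over what period availability is measured.

```python
def downtime_budget_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window at the given availability."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_percent / 100)
```

Over an assumed 30-day window, 99.95% leaves roughly 21.6 minutes and 99.9% leaves roughly 43.2 minutes, so the Gold budget is about twice the Platinum one.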
The legacy polctl shim is scheduled for removal. Migrate to pol before the
next major release; the two share no flags, so the move is not automatic.
Every render emits a structured span, so a slow page can be traced end to end from the edge hit down to the Markdown lexer. Spans are sampled at one percent in steady state and at one hundred percent during a deploy window.
pol trace --slowest 10 --window 15m
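The two sampling rates can be expressed as a simple probabilistic gate. This is a sketch of the idea only: `sample_rate` and `should_record` are hypothetical names, and the real tracer may well sample per trace rather than per span.

```python
import random


def sample_rate(in_deploy_window: bool) -> float:
    """1% in steady state, 100% during a deploy window."""
    return 1.0 if in_deploy_window else 0.01


def should_record(in_deploy_window: bool, rng: random.Random) -> bool:
    # random() returns a value in [0.0, 1.0), so a rate of 1.0 always records.
    return rng.random() < sample_rate(in_deploy_window)
```

Taking the random generator as an argument keeps the decision reproducible in tests by seeding it.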
A few terms recur throughout this handbook, and it is worth pinning them down so that an incident call does not stall on vocabulary:
Further reading lives in the wiki: