polaris-handbook.before.md → polaris-handbook.after.md

Polaris Platform — Operations Handbook

Audience: on-call engineers and release managers. Status: living document, reviewed every quarter.

Polaris is the internal service that receives, renders, and delivers documentation builds to every downstream team. This handbook is the single reference for operating it in production: how it is structured, how to deploy a change, and what to do when something breaks at 3 a.m.

Contents

Overview

Polaris ingests Markdown from roughly forty repositories, renders it to HTML, and serves the result behind a small caching layer. The render path is deliberately stateless so that any worker can pick up any job from the shared queue, and a build is never tied to the machine that produced it.

The platform has three moving parts: an ingest worker that watches the source repositories, a render pool that turns Markdown into HTML, and an edge tier that caches and serves the finished pages to readers. Each part can be operated, scaled, and rolled back on its own. Splitting them this way keeps a renderer crash from ever touching ingest.

At steady state Polaris serves a few million page views a day across the three regions, with the bulk concentrated in the hour after each Monday release. None of that traffic ever reaches a renderer when the cache is warm, which is exactly why the edge tier, not the render pool, is sized for the worst case.

Architecture

Every request arrives at the edge tier, which consults its cache first and only drops through to a renderer on a miss. Renderers are entirely fungible; they keep no session state, so the fleet can be grown or drained on demand without any coordination between nodes or warming work done beforehand.

Renderers are cattle, not pets. If one misbehaves, terminate it — the pool replaces it within a minute and no request is lost.

The edge tier is the only component readers ever talk to directly, and it is sized for peak fan-out rather than average load so a thundering herd after a cache purge never reaches the renderers.
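The cache-first request path described above can be sketched in a few lines. This is an illustrative toy, not Polaris code: the `EdgeCache` class, its `render` callback, and the injectable `clock` are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Toy edge tier: serve from cache, fall through to a renderer on a miss."""

    def __init__(self, render: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._render = render          # hypothetical stand-in for the render pool
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]            # cache hit: no renderer involved
        html = self._render(path)      # cache miss: any renderer will do
        self._entries[path] = (now + self._ttl, html)
        return html

    def purge(self, path: str) -> None:
        # Dropping the entry forces the next request to re-render.
        self._entries.pop(path, None)


if __name__ == "__main__":
    calls = []

    def fake_render(path: str) -> str:
        calls.append(path)
        return f"<h1>{path}</h1>"

    edge = EdgeCache(fake_render, ttl_seconds=600)
    edge.get("/docs/handbook")
    edge.get("/docs/handbook")
    print(len(calls))  # the second request was served from cache
```

Because the renderer is just a function of the path, nothing in the sketch cares which node runs it, which is the property that lets the pool be grown or drained freely.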

The ingest worker is the only stateful component, and it keeps nothing more than a cursor into each repository's commit history. A poisoned commit therefore blocks only that one repository, never the whole queue.
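To make the isolation property concrete, here is a minimal sketch of per-repository cursors. The function name `poll_once`, the in-memory `commits` mapping, and the `process` callback are assumptions made for illustration; the real ingest worker may look quite different.

```python
from typing import Callable, Dict, List


def poll_once(cursors: Dict[str, int],
              commits: Dict[str, List[str]],
              process: Callable[[str, str], None]) -> List[str]:
    """Advance each repository's cursor independently.

    Returns the repositories whose cursor got stuck on a failing commit.
    """
    blocked: List[str] = []
    for repo, history in commits.items():
        position = cursors.get(repo, 0)
        try:
            while position < len(history):
                process(repo, history[position])
                position += 1          # only advance past commits that succeeded
        except Exception:
            blocked.append(repo)       # this repo stalls; the loop moves on
        finally:
            cursors[repo] = position
    return blocked


if __name__ == "__main__":
    def process(repo: str, commit: str) -> None:
        if commit == "poisoned":
            raise ValueError(f"cannot ingest {commit} in {repo}")

    cursors: Dict[str, int] = {}
    stuck = poll_once(
        cursors,
        {"repo-a": ["c1", "poisoned", "c3"], "repo-b": ["d1", "d2"]},
        process,
    )
    print(stuck, cursors)
```

In this sketch the cursor for the failing repository stays pointed at the bad commit, so it is retried on the next poll while every other repository keeps moving.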

Deploying a release

Deployment is a strictly ordered process. Do not skip steps, and never deploy to all regions at once.

  1. Cut a release branch and tag it polaris-vX.Y.Z.
  2. Run the full test suite and confirm the smoke check is green.
  3. Promote the build to the staging edge and watch the dashboards.
  4. Promote to production one region at a time, starting with us-west, and pause for a full health check between regions before continuing.
  5. Tag the release as verified once the canary has cleared everywhere.
  6. Update the status page once every region reports healthy.
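
The region-by-region promotion in step 4 could be scripted roughly as follows. This is a sketch under assumptions: `promote` and `is_healthy` are hypothetical callbacks standing in for whatever tooling performs the promotion and the health check, and the region names other than us-west are placeholders.

```python
from typing import Callable, List, Sequence


def promote_by_region(regions: Sequence[str],
                      promote: Callable[[str], None],
                      is_healthy: Callable[[str], bool]) -> List[str]:
    """Promote one region at a time and stop at the first failed health check."""
    done: List[str] = []
    for region in regions:
        promote(region)
        if not is_healthy(region):
            # Halt here so later regions never receive the bad build.
            raise RuntimeError(
                f"health check failed in {region}; already promoted: {done}"
            )
        done.append(region)
    return done


if __name__ == "__main__":
    # "region-b" and "region-c" are placeholder names, not real Polaris regions.
    order = ["us-west", "region-b", "region-c"]
    finished = promote_by_region(
        order,
        promote=lambda r: print(f"promoting {r}"),
        is_healthy=lambda r: True,
    )
    print(finished)
```

Raising on the first unhealthy region keeps the failure contained to the regions already touched, which is the point of never deploying everywhere at once.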

Promotion to a single region itself has two phases:

Configuration

The render pool is tuned with a small set of environment variables. The defaults are conservative; change them only after a load test, never on a hunch.

| Variable | Default | Purpose | Restart needed? |
|---|---|---|---|
| RENDER_CONCURRENCY | 8 | Renders running per node | Yes |
| CACHE_TTL_SECONDS | 600 | How long the edge keeps a page | No |
| INGEST_INTERVAL | 30 | Seconds between repository polls | No |
| LOG_LEVEL | info | Verbosity of render logs | No |
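
As an illustration of how a process might read these variables with their defaults, here is a small sketch. The `RenderConfig` dataclass and `load_config` function are hypothetical names, and the `CACHE_TTL_SECONDS` default is assumed to be 600.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RenderConfig:
    render_concurrency: int
    cache_ttl_seconds: int
    ingest_interval: int
    log_level: str


def load_config(env: Mapping[str, str] = os.environ) -> RenderConfig:
    """Read the tuning variables, falling back to the documented defaults."""
    return RenderConfig(
        render_concurrency=int(env.get("RENDER_CONCURRENCY", "8")),
        # Default assumed to be 600 seconds for this sketch.
        cache_ttl_seconds=int(env.get("CACHE_TTL_SECONDS", "600")),
        ingest_interval=int(env.get("INGEST_INTERVAL", "30")),
        log_level=env.get("LOG_LEVEL", "info"),
    )


if __name__ == "__main__":
    print(load_config({}))
    print(load_config({"RENDER_CONCURRENCY": "16"}))
```

Passing the environment in as a mapping keeps the loader easy to exercise in a load test without touching the real process environment.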

Capacity planning

Capacity is owned by the on-call lead and revisited at every release review.

Headroom is tracked against the rolling Monday peak; we provision for twice that peak, which has comfortably and reliably absorbed every surge so far this year.
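
The headroom rule is simple arithmetic, sketched below. Two things are assumptions here: that "rolling" means the highest peak over the last few Mondays (a window of four is picked arbitrarily), and the sample figures, which are invented for the example.

```python
from typing import Sequence


def provisioned_capacity(monday_peaks: Sequence[float],
                         window: int = 4,
                         factor: float = 2.0) -> float:
    """Return capacity to provision: factor times the rolling Monday peak.

    The rolling peak is taken as the maximum over the last `window` Mondays,
    which is an assumption of this sketch.
    """
    recent = list(monday_peaks)[-window:]
    if not recent:
        raise ValueError("need at least one Monday peak")
    return factor * max(recent)


if __name__ == "__main__":
    # Invented peaks in requests per second, oldest first.
    peaks = [900.0, 1200.0, 1100.0, 1000.0]
    print(provisioned_capacity(peaks))  # 2400.0
```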

Scaling is still manual: an operator adds nodes to the render pool and then waits a good while for them to warm.

The underlying forecast is rough, frankly stale, only loosely tuned, and openly optimistic about cache behaviour.

Storage grows slowly but predictably and for the most part needs almost no attention day to day.

For now the review itself is formal, scheduled, and surprisingly thorough in practice.

Incident response

When paged, work the list from the top — the first matching cause is almost always the real one.

Open the incident channel before you start poking — a silent fix that works is still an incident nobody can learn from later.

CLI reference

Operators drive Polaris through the pol command. The most common calls:

pol status --region us-west
pol drain renderer-12
pol cache purge /docs/handbook
pol tail renderer-12 --since 5m

A dry run prints the plan without executing it:

pol deploy v1.4.0 --dry-run

Every pol subcommand accepts --json for machine-readable output, which is handy when you are scripting against it from a notebook.
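
A notebook helper for the `--json` flag might look like the sketch below. The `run_pol` function and its injectable `runner` are hypothetical, and the shape of the JSON that `pol` returns is not documented here, so the example only parses the output and makes no claims about its fields.

```python
import json
import subprocess
from typing import Any, Callable, List, Optional


def run_pol(args: List[str],
            runner: Optional[Callable[[List[str]], str]] = None) -> Any:
    """Run a pol subcommand with --json appended and return the parsed output."""
    if runner is None:
        # Default runner shells out to the real CLI and captures stdout.
        def runner(cmd: List[str]) -> str:
            return subprocess.run(
                cmd, check=True, capture_output=True, text=True
            ).stdout
    return json.loads(runner(["pol", *args, "--json"]))


if __name__ == "__main__":
    # A fake runner with made-up output, so the sketch runs without pol installed.
    fake = lambda cmd: '{"command": "%s"}' % " ".join(cmd)
    print(run_pol(["status", "--region", "us-west"], runner=fake))
```

Injecting the runner keeps the helper testable on a machine where `pol` is not available.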

Service tiers

Not every consumer gets the same guarantees. Tiers are assigned at onboarding and reviewed whenever a team's traffic profile changes.

| Tier | Availability | Support |
|---|---|---|
| Platinum | 99.95% | 24/7 paging |
| Gold | 99.9% | Extended hours |
| Bronze | 99.0% | Best effort |
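
To put the availability figures in operational terms, the following sketch converts each percentage into a downtime budget. The 30-day measurement window is an assumption; the handbook does not state the period over which availability is measured.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of allowed downtime over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for tier, pct in [("Platinum", 99.95), ("Gold", 99.9), ("Bronze", 99.0)]:
        print(f"{tier}: {downtime_budget_minutes(pct):.1f} minutes per 30 days")
    # Platinum: 21.6, Gold: 43.2, Bronze: 432.0
```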

Deprecations

The legacy polctl shim is scheduled for removal. Migrate to pol before the next major release; the two share no flags, so the move is not automatic.

Observability

Every render emits a structured span, so a slow page can be traced end to end from the edge hit down to the Markdown lexer. Spans are sampled at one percent in steady state and at one hundred percent during a deploy window.

pol trace --slowest 10 --window 15m
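
The two sampling rates can be expressed as a single decision function. This is a sketch only: `should_sample` and its parameters are hypothetical names, and how Polaris actually detects a deploy window is not covered here.

```python
import random


def should_sample(in_deploy_window: bool,
                  rng: random.Random,
                  steady_rate: float = 0.01) -> bool:
    """Keep every span during a deploy window, otherwise about one percent."""
    if in_deploy_window:
        return True
    return rng.random() < steady_rate


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the run is repeatable
    kept = sum(should_sample(False, rng) for _ in range(100_000))
    print(f"steady state kept {kept} of 100000 spans")
    print(all(should_sample(True, rng) for _ in range(1000)))
```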

Glossary

A few terms recur throughout this handbook, and it is worth pinning them down so that an incident call does not stall on vocabulary:

References

Further reading lives in the wiki:
