Alerts (On-Call)
Alerts groups deliver urgent, actionable notifications to on-call personnel. They’re designed for small membership, strict allowlisting, and zero noise.
Pattern
prv-{owner}-alerts-{system}[-{scope}][-{env}]@{domain}
Design Principles
- Small membership. Alerts go to on-call personnel, not the whole team. Keep membership at 10 or fewer.
- Allowlist only. Only known monitoring systems can send. Reject everything else.
- No human senders. If a human needs to notify on-call, they use a different channel (Slack, PagerDuty).
- Subject prefixes. Every alerts group should set a subject prefix (e.g., [AWS-PRD]) so recipients can filter downstream.
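The allowlist and subject-prefix principles above can be sketched as a small intake filter. This is only an illustration of the policy, not a description of how any particular mail platform enforces it; the sender addresses, the `ALLOWLIST` set, and the `admit` function are hypothetical names chosen for the example.

```python
from typing import Optional

# Hypothetical allowlist of monitoring-system sender addresses.
ALLOWLIST = {
    "cloudwatch@monitoring.example.com",
    "gitlab@ci.example.com",
}


def admit(sender: str, subject: str, prefix: str) -> Optional[str]:
    """Return the subject to deliver, or None if the message is rejected."""
    # Allowlist only: anything not explicitly known is rejected.
    if sender.strip().lower() not in ALLOWLIST:
        return None
    # Subject prefix: add it once so downstream filters can match on it.
    if subject.startswith(prefix):
        return subject
    return f"{prefix} {subject}"
```

For instance, `admit("alice@example.com", "please look", "[AWS-PRD]")` returns `None`, which reflects the "no human senders" rule as long as human addresses are never added to the allowlist.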
Common Alert Lists
| Group | Purpose | Owner |
|---|---|---|
| prv-plt-alerts-aws-prd | AWS production alerts | Platform |
| prv-plt-alerts-wks-admin | Workspace admin alerts | Platform |
| prv-sec-alerts-gl-security | GitLab security events | Security |
| prv-plt-alerts-tf-prd | Terraform plan/apply failures | Platform |
| prv-org-auto-alerts | Global automation failure alerts | Platform |
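A naming convention is easier to keep consistent when it can be checked mechanically. The sketch below parses the local part of a group address against `prv-{owner}-alerts-{system}[-{scope}][-{env}]`. Two things are assumptions rather than rules from this page: the `KNOWN_ENVS` set (only `prd` appears here), and the choice to treat a trailing token as `{env}` only when it is in that set, which is one way to tell `{scope}` and `{env}` apart.

```python
import re
from typing import Optional

# Assumed environment tokens; this page only shows "prd".
KNOWN_ENVS = {"prd", "stg", "dev"}

NAME_RE = re.compile(
    r"^prv-(?P<owner>[a-z0-9]+)-alerts-(?P<system>[a-z0-9]+)"
    r"(?P<tail>(?:-[a-z0-9]+){0,2})$"
)


def parse_alerts_group(local_part: str) -> Optional[dict]:
    """Split a group name into its parts, or return None if it does not fit."""
    match = NAME_RE.match(local_part)
    if match is None:
        return None
    tail = [token for token in match.group("tail").split("-") if token]
    scope = env = None
    # A trailing token counts as the environment only if it is a known one.
    if tail and tail[-1] in KNOWN_ENVS:
        env = tail.pop()
    if tail:
        scope = tail.pop(0)
    if tail:
        # Two trailing tokens were given but the last is not a known env.
        return None
    return {
        "owner": match.group("owner"),
        "system": match.group("system"),
        "scope": scope,
        "env": env,
    }
```

Under these assumptions the first four names in the table parse cleanly, while `prv-org-auto-alerts` returns `None` because it does not have the `-alerts-{system}` shape. It appears to be a deliberate exception for the global list.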
Settings
- Who can post: Anyone + allowlist only (reject non-allowlisted senders)
- Members: Small on-call set (ideally <= 10 humans)
- External posting: ON (monitoring systems are often external)
- External members: ON (for allowlisted notifier systems)
- Archive: ON (audit trail)
- Security label: OFF
- Subject prefix: Required (e.g., [AWS-PRD], [GL-SEC], [TF-FAIL])
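The settings above can be turned into a simple audit that reports drift. This is a sketch under stated assumptions: the dictionary keys (`allowlist_only`, `external_posting`, and so on) are hypothetical and do not correspond to any specific admin API, so a real check would first map the platform's own fields onto them.

```python
from typing import List

# Expected values from the settings list above; key names are hypothetical.
EXPECTED = {
    "allowlist_only": True,
    "external_posting": True,
    "external_members": True,
    "archive": True,
    "security_label": False,
}
MAX_MEMBERS = 10


def audit_settings(group: dict) -> List[str]:
    """Return human-readable deviations from the expected settings."""
    problems = []
    for key, want in EXPECTED.items():
        if group.get(key) != want:
            problems.append(f"{key}: expected {want}, got {group.get(key)}")
    if len(group.get("members", [])) > MAX_MEMBERS:
        problems.append(f"members: more than {MAX_MEMBERS}")
    if not group.get("subject_prefix"):
        problems.append("subject_prefix: missing")
    return problems
```

An empty list means the group matches the settings on this page.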
Sizing Rule
Keep alert group membership at 10 or fewer. If you need more recipients:
- Use rotation aliases (week-on, week-off)
- Fan out through an Infra router to multiple focused alerts groups
- Don’t inflate a single alerts group to cover everyone
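The fan-out option can be sketched as a tiny classifier that sends each alert to one focused group instead of widening a single list. The group names are taken from this page; the keyword rules, the `DEFAULT_GROUP` fallback, and the `classify` function are hypothetical, and a real router would likely match on structured fields rather than subject text.

```python
# Hypothetical keyword rules; the first match wins.
RULES = [
    ("vulnerability", "prv-sec-alerts-gl-security"),
    ("deploy", "prv-eng-alerts-gl-deploy"),
    ("runner", "prv-plt-alerts-gl-infra"),
]
# Assumed catch-all so unmatched alerts are never silently dropped.
DEFAULT_GROUP = "prv-plt-alerts-gl-infra"


def classify(subject: str) -> str:
    """Pick the destination alerts group for one incoming alert subject."""
    lowered = subject.lower()
    for keyword, group in RULES:
        if keyword in lowered:
            return group
    return DEFAULT_GROUP
```

Each destination stays small because it only receives the topic its members are on call for.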
Wiring Patterns
Direct System to On-Call
AWS CloudWatch → prv-plt-alerts-aws-prd → on-call engineer
Via Infra Router (Multiple Sources)
System A ──┐
System B ──┤→ Infra (router) → Alerts (on-call)
System C ──┘
Via Infra Classifier (One Source, Many Topics)
GitLab → Infra (classifier) ──→ prv-sec-alerts-gl-security
                            └→ prv-eng-alerts-gl-deploy
                            └→ prv-plt-alerts-gl-infra
Lifecycle
Create
- Identify the monitoring system(s) and sender addresses.
- Set email/name/description.
- Labels: Mailing=ON, Security=OFF.
- Set allowlist-only posting (reject non-allowlisted).
- Add on-call members (keep small).
- Add subject prefix.
- Send a test alert to verify delivery.
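For the final step, a test alert can be built with Python's standard `email.message.EmailMessage`. The sketch only constructs the message; how it is sent (for example through an allowlisted monitoring sender) depends on the environment and is left out. The addresses and the wording of the subject are assumptions made for the example.

```python
from datetime import datetime, timezone
from email.message import EmailMessage


def build_test_alert(sender: str, group_address: str, prefix: str) -> EmailMessage:
    """Build a clearly labelled synthetic alert for a delivery check."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = group_address
    # Reuse the group's subject prefix so the test exercises real filters.
    msg["Subject"] = f"{prefix} SYNTHETIC TEST, no action required"
    sent_at = datetime.now(timezone.utc).isoformat()
    msg.set_content(f"Synthetic delivery test sent at {sent_at}.\n")
    return msg


# Example with placeholder addresses.
test_msg = build_test_alert(
    "cloudwatch@monitoring.example.com",
    "prv-plt-alerts-aws-prd@example.com",
    "[AWS-PRD]",
)
```

Including the send time in the body makes it possible to compare it with the time of receipt when checking delivery latency.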
Operate
- Monthly: synthetic test alert to verify delivery chain.
- Quarterly: review membership (does it still match the on-call rotation?) and the allowlist (any new or removed systems?).
- Monitor: rejected messages (might indicate a new alerting system not yet allowlisted).
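One way to act on rejected messages is to tally rejected senders and surface the ones that recur, since a repeated sender is more likely to be a monitoring system missing from the allowlist than one-off spam. The input is assumed to be a plain list of sender addresses pulled from whatever rejection log is available, and the threshold of 3 is an arbitrary starting point.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def allowlist_gaps(rejected_senders: Iterable[str], min_count: int = 3) -> List[Tuple[str, int]]:
    """Return senders rejected at least min_count times, most frequent first."""
    counts = Counter(sender.strip().lower() for sender in rejected_senders)
    return [(sender, n) for sender, n in counts.most_common() if n >= min_count]
```

Anything this returns still needs a human decision before it is added to the allowlist.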
Retire
- Confirm no active monitoring routes to this group.
- Remove from any Infra router downstream lists.
- Export archive. Delete after hold.
Anti-Patterns
- Alerts group with > 10 members (noise → alert fatigue → missed incidents)
- Human senders posting to alerts groups
- Missing allowlist (spam drowns real alerts)
- Mixing alert urgencies in one group (separate by system/severity)
- Alerts group used in ACLs (alerts groups route notifications; they should not grant access)
Metrics
| Metric | Target |
|---|---|
| Alert delivery latency | < 2 minutes |
| False positive rate | < 10% |
| Membership size | <= 10 |
| Monthly synthetic test | Pass |
| Rejected messages (allowlist gap) | Investigated within 1 biz day |
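The numeric targets in the table can be checked per reporting period with a few lines of code. This sketch makes two assumptions the table does not state: latency is judged on the worst observed delivery in the period, and the false positive rate is false positives divided by total alerts. The function and parameter names are hypothetical.

```python
from typing import Dict, Sequence


def check_targets(
    latencies_s: Sequence[float],
    alerts_total: int,
    false_positives: int,
    member_count: int,
) -> Dict[str, bool]:
    """Compare one reporting period against the numeric targets above."""
    # Assumption: judge latency by the slowest delivery in the period.
    worst_latency = max(latencies_s, default=0.0)
    # Avoid dividing by zero when no alerts fired in the period.
    fp_rate = false_positives / alerts_total if alerts_total else 0.0
    return {
        "delivery_latency": worst_latency < 120,  # < 2 minutes
        "false_positive_rate": fp_rate < 0.10,    # < 10%
        "membership_size": member_count <= 10,    # <= 10
    }
```

A period with deliveries of 30 and 95 seconds, 4 false positives out of 50 alerts, and 8 members meets all three targets.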