Monitoring Vision

Monitoring isn't a single thing. It's a set of capabilities that can be added progressively — each one useful on its own. This document sets out four phases: what's worth deploying during the audit, what monitoring adds during migration, what ongoing monitoring could look like, and then OpenTelemetry as an entirely separate future phase.

None of this is mandatory beyond what already exists (AppSignal). Each phase describes tools that are useful, free, and low-overhead — but what gets kept, what's temporary, and what's skipped entirely is a decision to make based on what's actually useful. Phase 4 is a different kind of project altogether and should be treated separately.

What GFSC currently has

AppSignal is already deployed for PlaceCal on the open source free plan. It provides:

- Error tracking: exceptions with stack traces, how often they occur, and which users and requests are affected
- Performance monitoring: request response times, throughput, and Apdex score (a score from 0 to 1 for the share of requests that meet a target response time)
- Rails-specific insights: slow database queries, N+1 query detection, and Sidekiq background job monitoring with timing per job
- Alerting: can notify when error rates spike or response times degrade

AppSignal covers PlaceCal's application layer well. What it doesn't cover: whether services are reachable at all, server-level metrics (CPU, RAM, disk), container health, or any service other than PlaceCal.

Phase 1 — During the audit and mapping

Before any migration decisions are made, monitoring answers questions that can't be answered any other way. The server capacity estimates in Questions and Next Steps are derived from benchmarks — real monitoring gives real numbers.

What to deploy:

Uptime Kuma — one Docker container, about an hour to set up. Monitors every public URL and shows immediately what's actually reachable.

docker run -d --restart=always -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma louislam/uptime-kuma:1

Configure monitors for all public GFSC URLs:

- placecal.org
- social.gfsc.studio (Mastodon)
- lists.gfsc.community (Mailman, known broken; the monitor confirms it)
- pad.gfsc.studio (HedgeDoc)
- gfsc.community (Ghost)
- handbook.gfsc.community
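Before the Uptime Kuma monitors are configured, a one-off sweep of the same hostnames gives a first picture of what responds. The following is a minimal Ruby sketch, not part of Uptime Kuma: the `https://` scheme, the 10-second timeout, and the helper names `classify` and `fetch_status` are assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'

# Hostnames copied from the list above; the https:// scheme is an assumption.
HOSTS = %w[
  placecal.org
  social.gfsc.studio
  lists.gfsc.community
  pad.gfsc.studio
  gfsc.community
  handbook.gfsc.community
].freeze

# Map an HTTP status code (or nil for "no response at all") to a coarse state.
def classify(status)
  return :down if status.nil?
  return :up if (200..399).cover?(status)

  :degraded
end

# Send a HEAD request and return the status code, or nil on any network,
# DNS, TLS, or timeout failure.
def fetch_status(host, timeout: 10)
  uri = URI("https://#{host}/")
  Net::HTTP.start(uri.host, uri.port, use_ssl: true,
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.head(uri.request_uri).code.to_i
  end
rescue StandardError
  nil
end

# Example sweep (uncomment to run against the live hosts):
# HOSTS.each { |host| puts "#{host}: #{classify(fetch_status(host))}" }
```

Redirects count as `:up` here because several of these services are likely to redirect from the bare hostname; that choice is a judgment call and can be tightened.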

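Node Exporter + cAdvisor → Grafana Cloud free tier: deploy on each Hetzner box. Half a day to set up. Shows actual CPU, RAM, and disk per server and per container. Both tools only expose metrics over HTTP, so each box also needs a small agent (for example Grafana Alloy, or Prometheus in agent mode) to scrape them and forward the results to Grafana Cloud.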

What this tells you that you can't otherwise know:

- Is anything silently broken beyond Mailman? A crashlooping container with no public URL never shows up in a URL check, but it does show up in cAdvisor's metrics as a container whose start time keeps resetting.
- How loaded is shaw actually? The RAM estimates in Q1 are derived from benchmarks. Before deciding Ghost can move to shaw, you want the real number.
- Is disk space a problem on any box? A box at 70% disk already affects where you can safely deploy things.
- Baseline to compare against: after any migration, you can confirm the new setup is stable by comparing metrics before and after.

Grafana Cloud free tier:

- 10,000 metric series
- 50GB/month log ingestion
- 14 days retention
- Cost: £0

This monitoring can be deployed before any other work starts. It can be kept permanently or removed once it's served its purpose — there's no ongoing commitment. The value during the audit is the real data it provides; what happens to it afterwards is a separate decision.

Discord alerts — optional:

Uptime Kuma can connect to a private GFSC Discord channel via webhook (built-in support, takes a few minutes to configure). Useful during the audit if you want to know immediately when something changes state. Configure it to retry 2–3 times before firing to avoid noise from transient blips. This is an optional add-on — the monitoring works fine without it.
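The effect of the retry setting can be illustrated with a small state machine. This is a sketch of the idea only, not Uptime Kuma's implementation; the class name `RetryGate` and its return values are hypothetical.

```ruby
# Tracks consecutive failed checks for one monitor and decides when to notify.
class RetryGate
  def initialize(retries: 3)
    @retries = retries # failed checks tolerated after the first failure
    @failures = 0
    @alerted = false
  end

  # Feed in one check result (true = up, false = down).
  # Returns :alert, :recovered, or nil when nothing should be sent.
  def record(ok)
    if ok
      @failures = 0
      return nil unless @alerted

      @alerted = false
      return :recovered
    end

    @failures += 1
    return nil if @alerted || @failures <= @retries

    @alerted = true
    :alert
  end
end
```

With `retries: 2`, a single failed check followed by a success produces no notification at all, while three consecutive failures produce exactly one `:alert` and the next success produces one `:recovered`.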

Phase 2 — During migration

When migrating a service (Ghost is the first candidate), monitoring plays an active role in the process rather than just running in the background.

Before cutting over:

- Uptime Kuma confirms the old service is still up, and keeps checking it until DNS switches
- Server metrics on the new host confirm it has capacity for the incoming service before you commit to the switch
- A quick Uptime Kuma monitor pointed at the new instance (before DNS changes) confirms it's responding correctly

During the switch:

- Lower the DNS TTL to 5 minutes the day before, so the change propagates quickly
- After switching DNS, Uptime Kuma shows immediately whether the new instance comes up clean or something is wrong
- If Discord alerts are configured, a notification fires within minutes if the new instance goes down

After the switch:

- Server metrics on the old host confirm nothing is still depending on it before you decommission
- Comparison of before/after RAM and CPU on the new host confirms the migration didn't create unexpected load
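The before/after comparison can be as simple as comparing averages over two sample windows exported from Grafana. A minimal sketch, assuming the samples are plain numbers (for example RAM in MB); the 20% tolerance and the function names are illustrative assumptions, not a recommendation from this document.

```ruby
# Arithmetic mean of a non-empty list of numeric samples.
def mean(samples)
  samples.sum.to_f / samples.size
end

# Percentage change of the "after" mean relative to the "before" mean.
def percent_change(before, after)
  (mean(after) - mean(before)) * 100.0 / mean(before)
end

# True when the average rose by more than the tolerated percentage.
def unexpected_load?(before, after, tolerance_pct: 20.0)
  percent_change(before, after) > tolerance_pct
end
```

A rise that stays inside the tolerance is treated as expected, since the new host is now running one more service than before.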

Whatever was set up in Phase 1 carries through here — no separate tooling needed. If Phase 1 monitoring was set up temporarily and then removed, the same tools can be spun up again just for the migration and taken down afterwards. The Grafana Cloud free tier and Uptime Kuma have no cost either way.

Phase 3 — Ongoing

If the Phase 1 and 2 tools are kept rather than removed, they become the ongoing monitoring setup. What that looks like depends on what's actually useful.

Core — genuinely useful to keep:

Uptime Kuma running permanently on the PlaceCal production box:

- TLS certificate expiry alerts at 14 days, which catch a failed renewal before the certificate expires
- Confirms services are reachable after any changes

Node Exporter + cAdvisor → Grafana Cloud — server and container health available when needed. Useful for spotting if a box is running hot, a disk is filling, or a container is restarting repeatedly.

Nice to have — optional extras:

Public status page: Uptime Kuma can publish a status page at status.gfsc.studio (public, no login required). Useful for volunteers and community orgs to check service health themselves. Low overhead to set up, but not essential.

Grafana dashboards: Server health visible in a dashboard at monitoring.gfsc.studio (public read-only). Good if there's appetite for ongoing visibility; not necessary if the team is comfortable checking when needed rather than at a glance.

Discord alerts: If configured, useful alerts would be:

- Service down (after 2–3 retries, not on first failed check)
- TLS certificate expiring within 14 days
- Disk usage > 80% on any box
- Container restarting repeatedly (more than 3 times in 10 minutes)
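The thresholds in that list translate into three simple predicates. The sketch below only illustrates the logic; in practice these rules would be configured in Uptime Kuma or Grafana Cloud alerting rather than written by hand, and the constant and method names are hypothetical.

```ruby
require 'date'

DISK_LIMIT_PCT = 80              # alert above this disk usage
RESTART_LIMIT = 3                # alert above this many restarts...
RESTART_WINDOW_SECONDS = 10 * 60 # ...within this window
CERT_WARNING_DAYS = 14           # alert when expiry is this close

# True when disk usage (as a percentage) is above the limit.
def disk_alert?(used_pct)
  used_pct > DISK_LIMIT_PCT
end

# restart_times is an Array of Time objects. True when more than
# RESTART_LIMIT of them fall inside the window ending at `now`.
def restart_alert?(restart_times, now:)
  recent = restart_times.count do |t|
    t <= now && now - t <= RESTART_WINDOW_SECONDS
  end
  recent > RESTART_LIMIT
end

# True when the certificate expires within CERT_WARNING_DAYS of `today`.
def cert_alert?(expires_on, today:)
  (expires_on - today).to_i <= CERT_WARNING_DAYS
end
```

Passing `now` and `today` in explicitly keeps the predicates easy to check against known inputs.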

Not worth alerting on: brief CPU spikes, minor RAM fluctuations, slow individual requests (AppSignal handles that for PlaceCal). If alerts are too noisy, they stop being useful — it's better to have fewer well-configured alerts than a busy channel everyone ignores.

Discord alerts are entirely optional. The monitoring works without them.

What any of this costs: £0. Everything runs on existing boxes or Grafana Cloud's free tier.

Phase 4 — OpenTelemetry (entirely separate, later)

OpenTelemetry is a different kind of project to the infrastructure work above. Where Phases 1–3 are about servers, uptime, and container health, OpenTelemetry is about instrumenting applications to emit detailed signals about what they're doing internally. It requires code changes to PlaceCal, its own deployment components, and its own dashboard work.

This should not be part of the infrastructure consolidation. It belongs in a separate phase, undertaken once the infrastructure is stable and documented. It's worth planning for because the use case for PlaceCal is strong — but it's its own project.

What OpenTelemetry is

OpenTelemetry (OTel) is a CNCF project that defines a vendor-neutral standard for applications to emit three kinds of signals:

- Traces: the full journey of a request through the system (HTTP request → database query → external API call → response). Shows exactly where time is being spent.
- Metrics: counts and measurements from inside the application (requests per second, error rate, queue depth, custom business numbers).
- Logs: structured application logs, correlated with traces so you can jump from "this request was slow" directly to what the app logged at that moment.

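For a Rails application like PlaceCal, adding OTel instrumentation takes roughly a day's work and then runs silently. Traces are emitted automatically; in the Ruby SDK, metrics and logs support is less mature than tracing, so check its current status before relying on all three signals.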

The PlaceCal case

PlaceCal imports calendar feeds from 600+ organisations. When a feed starts failing, nobody knows until someone notices their events aren't showing up. There's no visibility into which feeds are slow, which are failing, or how long the full import cycle takes.

With OTel instrumented, every import job becomes a traced operation. You can see which org's feed timed out and when, which feeds have been returning errors for the past three days, how long each import takes and whether it's getting slower over time, and which database queries are bottlenecks during high-load import runs.

This inverts the support relationship. Instead of orgs chasing GFSC about missing events, GFSC catches problems before orgs notice. At 600+ orgs, that's the difference between scalable community support and constant reactive firefighting.

Step options — how this could be added

Step 1 — PlaceCal instrumentation (one project, one day of dev work)

Add OTel gems to PlaceCal's Gemfile. Apart from a short initializer that configures the SDK, no other code changes are needed: the instrumentation is automatic once the gems are added and configured.

# Gemfile additions
gem 'opentelemetry-sdk'
gem 'opentelemetry-instrumentation-all'  # auto-instruments Rails, ActiveRecord, Sidekiq, Net::HTTP
gem 'opentelemetry-exporter-otlp'
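The configuration step is typically a single Rails initializer. The sketch below follows the pattern in the OpenTelemetry Ruby documentation; the file path and the service name `placecal` are assumptions, and the OTLP exporter is expected to pick up the Collector address from the standard `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable.

```ruby
# config/initializers/opentelemetry.rb (assumed location)
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'placecal' # assumed name; shown on every trace
  c.use_all                   # enable every installed instrumentation
end
```

If some instrumentation turns out to be too noisy, `use_all` can be swapped for individual `c.use` calls, but that is a tuning decision for after the first traces arrive.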

PlaceCal then automatically emits:

- A trace for every HTTP request (timing, status, path)
- A span for every ActiveRecord query (with SQL and timing, which surfaces N+1 queries)
- A trace for every Sidekiq background job (the import workers)
- A span for every outbound HTTP call (to calendar feeds, external APIs)

Signals go via an OTel Collector (a lightweight sidecar container on the PlaceCal box) → Grafana Cloud free tier.

PlaceCal Rails app
  └── OTel SDK (auto-instruments everything)
        └── OTel Collector (sidecar on PlaceCal production server)
              ├── Metrics → Grafana Cloud (Prometheus/Mimir)
              ├── Logs    → Grafana Cloud (Loki)
              └── Traces  → Grafana Cloud (Tempo)

Grafana Cloud free tier includes 50GB/month trace ingestion (Tempo) and 50GB/month log ingestion (Loki) — more than enough at GFSC's scale. Cost: £0.

Step 2 — PlaceCal dashboards (a few hours of Grafana work)

Build two dashboards in Grafana Cloud:

PlaceCal Application:

- Request rate and error rate
- P50/P95/P99 response times
- Active Sidekiq job queue depth
- Database query time (spot N+1s and slow queries)
- Apdex score
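For readers unfamiliar with the percentile and Apdex panels, both can be computed from a list of request durations. Grafana computes these from the stored data, so the Ruby below is only a worked illustration: it uses the nearest-rank percentile method and the standard Apdex formula (satisfied plus half of tolerating, divided by total), and the 500 ms target in the usage note is an assumed value.

```ruby
# Nearest-rank percentile of a non-empty list of durations (milliseconds).
def percentile(durations, pct)
  sorted = durations.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Apdex for a target response time: requests at or under the target are
# "satisfied", those up to four times the target are "tolerating", and
# anything slower counts as frustrated (contributing zero).
def apdex(durations, target_ms)
  satisfied = durations.count { |d| d <= target_ms }
  tolerating = durations.count { |d| d > target_ms && d <= 4 * target_ms }
  (satisfied + tolerating / 2.0) / durations.size
end
```

For ten requests taking 100, 200, ..., 900 and 3000 ms, P50 is 500 ms while P95 and P99 are both 3000 ms, and with a 500 ms target the Apdex is 0.7: five satisfied, four tolerating, one frustrated.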

Feed Import Health:

- Total feeds: passing / failing / slow
- Import success rate over time
- Which orgs' feeds have been failing (sortable table)
- Average import duration by feed type (Google Calendar vs iCal vs Eventbrite)
- Last successful import per org

The Feed Import Health dashboard is a support tool as much as an ops tool. It answers "why aren't my events showing?" before the org asks.

Add a Discord alert: "Organisation X's feed has been failing for 24 hours" → fires to #infra-alerts or a dedicated #placecal-feed-alerts channel.
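The rule behind that alert can be sketched as follows. It assumes "failing for 24 hours" means no successful import in the last 24 hours, and that a last-success timestamp per organisation is available; the data shape and method names are hypothetical, and in practice the rule would live in Grafana Cloud alerting.

```ruby
FAILING_AFTER_SECONDS = 24 * 60 * 60

# last_success maps an organisation name to the Time of its last successful
# import, or nil if it has never succeeded. Returns the names of
# organisations with no success in the last 24 hours, sorted.
def failing_orgs(last_success, now:)
  last_success
    .select { |_org, at| at.nil? || now - at >= FAILING_AFTER_SECONDS }
    .keys
    .sort
end

# Text for the Discord notification, following the wording above.
def alert_message(org)
  "#{org}'s feed has been failing for 24 hours"
end
```

Organisations that have never imported successfully are included deliberately, since from the organisation's point of view the result is the same: no events showing.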

Step 3 — Community Stats (longer term, optional)

A public-facing dashboard showing:

- Total organisations on PlaceCal
- Events added this week / this month
- Active neighbourhoods and geographies
- Feed health summary (% of feeds currently healthy)

This is the dashboard a funder or journalist looks at. It makes PlaceCal's scale and health visible without exposing operational detail. Could be embedded on gfsc.community if there's appetite.

Step 4 — Extend to other services (much later, if useful)

Once OTel is running well for PlaceCal, it could extend to other GFSC services — but only where there's a clear use case. donna-bot and musicwall may or may not benefit depending on what they do. HedgeDoc has its own logging. This is speculative and should only be pursued if a specific need emerges.

When to add it

OTel makes sense once:

- The infrastructure migration/consolidation is complete
- PlaceCal's hosting is stable and documented
- There's developer time available to add gems and test the instrumentation
- Feed failures are becoming a real support burden worth instrumenting

It does not need to wait for everything else to be perfect — Step 1 is isolated to the PlaceCal repo and doesn't depend on any infrastructure changes. But it's its own project, not part of the server consolidation work.

What's not needed for OTel

- Self-hosted Grafana (Grafana Cloud free tier is sufficient)
- A Prometheus server (Grafana Cloud has a hosted endpoint, so metrics can be pushed directly)
- Alertmanager (Grafana Cloud alerting handles this)
- Elasticsearch (Loki is simpler and the free tier is enough)

Cost Summary

| Component | Phase | Cost |
|---|---|---|
| AppSignal (PlaceCal) | Already running | Free (open source plan) |
| Uptime Kuma | Phase 1 | £0 (runs on existing box) |
| Node Exporter + cAdvisor | Phase 1 | £0 (runs on existing boxes) |
| Grafana Cloud | Phase 1 | £0 (free tier covers GFSC's scale) |
| Discord webhook alerts | Phase 1 | £0 |
| OTel Collector | Phase 4 | £0 (lightweight sidecar, existing box) |
| OTel gems for PlaceCal | Phase 4 | £0 (open source) |
| Grafana Cloud traces + logs | Phase 4 | £0 (free tier covers GFSC's scale) |
| Total | | £0 |

The only cost is engineering time.

Current Infrastructure — Service map
Questions and Next Steps — Immediate monitoring steps and next steps
Infrastructure Vision — Overall infrastructure direction