Infrastructure Vision

The Shiny Object Cycle

Volunteers arrive with energy and good intentions. They know a tool — n8n, Listmonk, HedgeDoc, whatever — and they set it up. It works. It solves a problem. Then they move on.

What they leave behind needs maintenance: updates, TLS renewals, database backups, the occasional emergency SSH session when something stops responding at 11pm. The next volunteer doesn't know the tool, doesn't know the server, doesn't know why it was set up. It becomes untouchable legacy infrastructure until it breaks so badly someone has to deal with it.

This is how you end up with services scattered across multiple providers, a broken mailing list, a blog on a different cloud provider, and no single person who holds a complete picture of what's running.

The goal isn't to untangle this mess and produce a tidier version of it. The goal is to build infrastructure with a high enough bus factor that the next volunteer can maintain it without a handover.

GFSC's Ethos, Applied to Infrastructure

PlaceCal's entire philosophy is about removing barriers between communities and the tools that serve them. It doesn't force 600 organisations to learn a new system — it meets them where they are, aggregates what they already do, and makes it accessible.

The same principle applies here. Infrastructure should be:

Open source — the tools GFSC runs should be open source where possible, with no dependency on proprietary platforms or external SaaS that can change pricing or terms
Self-hosted and controlled — GFSC owns the servers, the data, and the relationships. Not a third party.
Legible — any technically literate volunteer should be able to read the config and understand what's running
Recoverable — if the person who built it disappears, someone else can pick it up from the documentation alone
Consolidated — fewer providers, fewer deployment methods, fewer things to go wrong

Cost is a factor but not the primary driver. The goal is control, openness, and maintainability. Where a service is working and trusted, the priority is to make it safe, documented, and easy — not to replace it with something cheaper.

The Direction

Consolidate onto infrastructure GFSC already controls — Hetzner, Cloudflare/Namecheap, GitHub. Keep the tools that are working. Fix the ones that are broken. Document everything. Remove what genuinely isn't used.

The handbook is a good example of the model: open source tooling (Zensical), version-controlled in GitHub, deploys automatically on push, legible to any contributor. The goal is for the rest of the infrastructure to follow the same logic — not necessarily the same tools, but the same principles.

Service-by-Service

Ghost Blog — migrate to Hetzner

Ghost is currently on a Digital Ocean droplet. Moving it to one of the existing Hetzner boxes brings it under the same infrastructure umbrella as everything else and removes one external provider.

The content and workflow don't change. Ghost stays Ghost. The hosting moves. Which server it moves to depends on the capacity review in Q1.

Email: Ghost currently uses Mailgun (£12/mo) for email sending. For self-hosted Ghost this covers both transactional email (member welcome messages, password resets) and newsletter/broadcast sending to subscribers. [INPUT NEEDED: is the newsletter feature actively used, and roughly how many subscribers / emails per month?] If the blog doesn't actively send newsletters and email volume is low, a free SMTP tier (e.g. Brevo free: 300 emails/day) could replace Mailgun. This needs verifying against actual usage before changing anything.

Like-for-like research: [INPUT NEEDED: research whether there is an open source, self-hosted CMS that matches Ghost's current feature use and doesn't require a paid email service. Only worth pursuing if Ghost migration reveals a clear case. Don't change platforms for the sake of it.]

Mailman — fix first, then decide

Mailman is currently not responding on shaw (one of GFSC's Hetzner servers). Before any decision about replacing it, it should be diagnosed and fixed. Fixing it is likely to take less effort than migrating, and we don't yet know what the lists are actually for or how active they are.

Fix path: diagnose the current failure, starting with the Traefik config and TLS certificate status. If that doesn't resolve it, do a full reinstall via the Co-op Cloud Mailman recipe on git.coopcloud.tech.
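One quick way to check the certificate side of that diagnosis is to ask the server which certificate it is actually serving. The sketch below uses the standard `openssl` command-line tool. The function names `fetch_cert` and `cert_valid_for_days` are made up for this example, and it assumes the service answers HTTPS on port 443.

```shell
# Fetch the certificate a host serves on port 443 and print it as PEM.
fetch_cert() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509
}

# Read a PEM certificate on stdin.
# Succeeds only if the certificate will still be valid N days from now.
cert_valid_for_days() {
  openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}
```

For example: `fetch_cert lists.gfsc.community | cert_valid_for_days 14 || echo "certificate missing or expiring soon"`. A host that serves no certificate at all also fails the check, so a failure means either nothing usable was served or the expiry date is close.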

[INPUT NEEDED: what are the Mailman lists? How active are they? What are they used for — announcements, discussion, or something else?]

Once usage is understood, the options are:

Keep Mailman: fix it, update it, write a runbook. Mailman 3 (with the Postorius web interface) is the current upstream version — if running Mailman 2, a migration within Co-op Cloud is worth considering. Still open source, still self-hosted.
Replace with open source alternative on shaw: if Mailman is genuinely not worth maintaining, options include Sympa (open source mailing list manager, self-hosted) or mlmmj (simpler, lower overhead). Both would run on shaw. No external SaaS.
Retire it: if the lists aren't actively used, remove it and handle email via another channel.

The fix-first approach means no decision needs to be made before we have real usage data.

HedgeDoc — keep, update, document

HedgeDoc is running and being used. The goal is to keep it running well, not to replace it. HedgeDoc 1.x is in maintenance mode upstream (no new features, security fixes only), which means it should be kept up to date within the 1.x branch.

Immediate actions: confirm the current version, check for outstanding security updates within 1.x, write a runbook.

HedgeDoc 2.x: the upstream project has a 2.x rewrite underway. This is a different codebase — not a drop-in upgrade — and it's not production-ready yet. Worth monitoring but not acting on now.

GitHub for document organisation: HedgeDoc is good for collaborative drafting. For documents that are finished and should be findable by future contributors, GitHub is a better home — version-controlled, searchable, visible to anyone with repo access. A workflow where documents are drafted in HedgeDoc and then committed to a GitHub repo (as markdown files) could work well alongside HedgeDoc rather than instead of it. [INPUT NEEDED: would this kind of workflow be useful? Who are the main HedgeDoc users?]

n8n — audit first

n8n is open source and self-hosted. If it's being used for active workflows, it belongs in the infrastructure. If nothing depends on it, it can be removed cleanly.

[INPUT NEEDED: are there any active n8n workflows currently running? Who set them up?] If yes: document what they do and write a runbook. If no: export the workflows and credentials as a backup and remove.

Listmonk — audit first

Listmonk is open source and self-hosted (notably, it could also serve as a Mailman replacement if the lists turn out to be primarily for announcements/newsletters). Same position as n8n.

[INPUT NEEDED: any active subscriber lists or campaigns in Listmonk?] If yes: document and keep. If no: export subscriber data as a backup and remove.

Mastodon — fix, update, runbook; wait on bigger decisions

Mastodon (the Hometown fork) is already self-hosted on shaw and under GFSC's control, which is exactly where GFSC wants its services to be. The work now is to make it safe and well-documented, not to move it.

There are external developments in GFSC's broader federated infrastructure — new services being considered, potential changes at organise.diy and elsewhere. Making major changes to the Mastodon setup before those become clearer risks having to move it twice.

Short-term: ensure Mastodon is on a current version, fix any known issues, write a runbook covering routine upgrades and common failure modes. Then hold.

[INPUT NEEDED: is Hometown's local-only posting feature actively used? This affects any future decisions about where Mastodon runs.]

PlaceCal — no changes

The core product is already in good shape. Kamal on Hetzner, AppSignal for monitoring, GitHub for code and deployment config. This is the model everything else should aim for.

Current State vs Proposed State

Current state

External
  Digital Ocean [droplet, $12/mo]
  └── Ghost blog (gfsc.community) — Manual deploy

Hetzner shaw [spec and billing TBC]
  Managed by: [INPUT NEEDED — arrangement unclear]
  Config: git.coopcloud.tech (Gitea, also copied to GitHub)
  ├── Mastodon / Hometown (social.gfsc.studio) — Running
  ├── HedgeDoc (pad.gfsc.studio) — Running
  ├── Mailman (lists.gfsc.community) — Not currently responding
  ├── n8n — Status unknown
  └── Listmonk — Status unknown

Hetzner — PlaceCal production
  PlaceCal production (placecal.org) — Kamal deploy
  ├── PlaceCal web (Puma)
  ├── PlaceCal job worker (Sidekiq)
  ├── PostgreSQL
  └── Redis

Hetzner — PlaceCal staging
  └── PlaceCal staging — Kamal deploy

Hetzner — Kamal box [spec unknown]
  ├── donna-bot — Running (what it does: INPUT NEEDED)
  └── musicwall — Running (what it does: INPUT NEEDED)

DNS: Cloudflare (some domains) + Namecheap (coordinator's account)
Code: github.com/geeksforsocialchange + git.coopcloud.tech
Monitoring: AppSignal (PlaceCal application only)

Proposed state

Items marked [REMOVED] are removed after usage audit confirms nothing depends on them. Items marked [← FROM x] have moved from another host. Items marked [UPDATED] are staying but being improved. Items marked [NEW] are additions.

External
  ~~Digital Ocean~~ [REMOVED — Ghost migrated off]

Hetzner shaw [spec and billing confirmed]
  Managed by: [confirmed arrangement in place]
  Config: git.coopcloud.tech (Gitea)
  ├── Mastodon / Hometown (social.gfsc.studio) [UPDATED — current version + runbook]
  ├── HedgeDoc (pad.gfsc.studio) [UPDATED — current version + runbook]
  ├── Mailman (lists.gfsc.community) [UPDATED — fixed/upgraded OR replaced with
  │   open source alternative on shaw, OR retired if unused]
  ├── Ghost blog (gfsc.community) [← FROM Digital Ocean]
  ├── ~~n8n~~ [REMOVED if no active workflows]
  └── ~~Listmonk~~ [REMOVED if no active campaigns]

Hetzner — PlaceCal production [unchanged]
  PlaceCal production (placecal.org)
  ├── PlaceCal web (Puma)
  ├── PlaceCal job worker (Sidekiq)
  ├── PostgreSQL
  ├── Redis
  └── Uptime Kuma (status.gfsc.studio) [NEW]

Hetzner — PlaceCal staging [unchanged]
  └── PlaceCal staging

Hetzner — Kamal box [spec confirmed]
  ├── donna-bot [UPDATED — runbook added]
  └── musicwall [UPDATED — runbook added]

DNS: Cloudflare + Namecheap [domains consolidated to Cloudflare where possible]
Code: github.com/geeksforsocialchange (+ git.coopcloud.tech for shaw config)
Handbook: github.com/geeksforsocialchange → Zensical → [CI/CD target TBC]
Credentials: [INPUT NEEDED — shared password manager to be confirmed]
Monitoring: AppSignal + Uptime Kuma + Grafana Cloud + Discord alerts [see below]

Monitoring — Current and Proposed

What AppSignal currently gives

AppSignal is already set up for PlaceCal on the open source free plan. It provides:

Error tracking — exceptions with stack traces, frequency, and which users/requests are affected
Performance monitoring — request response times, throughput, Apdex score (overall user-perceived performance)
Rails-specific insights — slow database queries, N+1 detection, background job (Sidekiq) monitoring
Alerting — can notify when error rates spike or performance degrades

What AppSignal does not cover: server-level metrics (CPU, RAM, disk), uptime monitoring for other services, container health, or any service other than PlaceCal.

What to add

Uptime Kuma — monitors every public URL and fires an alert when something goes down. Generates a public status page. Deploy as a single Docker container on the PlaceCal production box.

docker run -d --restart=always -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma louislam/uptime-kuma:1

Configure monitors for: placecal.org, social.gfsc.studio, pad.gfsc.studio, lists.gfsc.community, gfsc.community, handbook.gfsc.community.
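Before the monitors exist, the same list can be spot-checked by hand. This is a minimal sketch, assuming `curl` is installed. The function names are illustrative, and it is not a substitute for Uptime Kuma, only a way to see what is responding today.

```shell
# Hostnames copied from the monitor list above.
MONITORED_HOSTS="placecal.org social.gfsc.studio pad.gfsc.studio lists.gfsc.community gfsc.community handbook.gfsc.community"

# Print the HTTP status code for a URL; curl prints 000 when nothing answers.
http_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1" || true
}

# 2xx and 3xx count as "up"; everything else (including 000) counts as "down".
is_up_status() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Check every host once and print one line per host.
check_all_hosts() {
  for host in $MONITORED_HOSTS; do
    status="$(http_status "https://$host")"
    if is_up_status "$status"; then
      echo "UP   $host ($status)"
    else
      echo "DOWN $host ($status)"
    fi
  done
}
```

Running `check_all_hosts` prints one UP or DOWN line per host, with the status code in brackets.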

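Node Exporter + cAdvisor → Grafana Cloud free tier: server and container metrics for each Hetzner box, showing CPU, RAM, disk fill rate, and per-container restart counts. Both tools only expose metrics, so each box also needs a collector (for example Grafana Alloy, or a Prometheus instance with remote write) to ship them to Grafana Cloud. The free tier (10,000 metric series, 50GB logs, 14 days retention at the time of writing) covers GFSC's scale.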

Discord alerts

Both Uptime Kuma and Grafana have native Discord webhook support. Alerts go to a private GFSC Discord channel (e.g. #infra-alerts) — not a public channel.
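For anything without built-in Discord support (a cron job or a backup script, say), a webhook can be called directly: Discord webhooks accept a JSON body with a `content` field. The sketch below assumes `curl` is available. `DISCORD_WEBHOOK_URL` is a placeholder for the webhook created on the alerts channel, and the function names are illustrative. The escaping only handles quotes and backslashes, so it suits short single-line messages.

```shell
# Build the JSON body for a Discord webhook message.
# Escapes backslashes and double quotes; does not handle newlines.
build_discord_payload() {
  escaped="$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
  printf '{"content": "%s"}' "$escaped"
}

# Post a message to the webhook in DISCORD_WEBHOOK_URL (placeholder, set it yourself).
send_discord_alert() {
  curl --silent --show-error --fail \
    --header 'Content-Type: application/json' \
    --data "$(build_discord_payload "$1")" \
    "$DISCORD_WEBHOOK_URL"
}
```

With `DISCORD_WEBHOOK_URL` exported, `send_discord_alert "pad.gfsc.studio is down"` posts that line to the channel. The webhook URL is a credential and belongs wherever the other shared secrets live, not in the repo.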

Alert on:

- Service down (Uptime Kuma, after 2–3 failed checks, not on first failure)
- TLS certificate expiring within 14 days (Uptime Kuma)
- Disk usage > 80% on any box (Grafana alert rule)
- Container restarting repeatedly, more than 3 times in 10 minutes (cAdvisor via Grafana)

Do not alert on:

- Brief CPU spikes (normal under load)
- Minor RAM fluctuations
- Individual slow requests (AppSignal already covers this)
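The two numeric thresholds above are simple enough to express as shell checks, which can be handy for a one-off look at a box before any Grafana alert rules exist. This is an illustrative sketch, not the Grafana rule itself. The function and variable names are assumptions, and `df -P` is used because its column layout is stable.

```shell
DISK_ALERT_PERCENT=80    # alert when disk usage is above this percentage
RESTART_ALERT_COUNT=3    # alert when restarts within 10 minutes exceed this

# Print the used-space percentage of a mount point as a bare integer.
disk_used_percent() {
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Succeed (meaning "alert") when the given percentage is over the threshold.
disk_needs_alert() {
  [ "$1" -gt "$DISK_ALERT_PERCENT" ]
}

# Succeed (meaning "alert") when the given restart count is over the threshold.
restarts_need_alert() {
  [ "$1" -gt "$RESTART_ALERT_COUNT" ]
}
```

For example, `disk_needs_alert "$(disk_used_percent /)" && echo "root disk over 80%"` checks the root filesystem of the current box.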

The retry-before-alerting setting in Uptime Kuma is important — a single failed check can be a transient blip. Configuring 2–3 retries before firing prevents alert fatigue from noise.
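The retry logic itself is small. The sketch below shows the idea in shell: run a check several times and only report failure if every attempt fails. It illustrates the behaviour, not how Uptime Kuma implements it, and the function name and arguments are made up for this example.

```shell
# check_with_retries MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds once.
# Returns 1 (meaning "alert now") only if every attempt fails.
check_with_retries() {
  max_attempts="$1"
  delay_seconds="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay_seconds"   # wait before the next attempt
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, `check_with_retries 3 60 curl --silent --fail --output /dev/null https://placecal.org || echo "alert"` only prints "alert" after three failed requests spaced a minute apart.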

Runbooks — The Missing Piece

The biggest single improvement to GFSC's infrastructure resilience isn't a new tool. It's documentation.

Every service needs a runbook. Not a 20-page document — a single page that answers:

What does this do?
How do I deploy it?
How do I restart it if it crashes?
How do I check the logs?
How do I restore from backup?
Who do I contact if I can't fix it?

These live in the GitHub repo alongside the code. They're written in markdown. They're versioned. Anyone who can read a repo can find them.
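Because runbooks are plain markdown in a repo, checking that one answers all six questions can be automated. This is a minimal sketch, assuming each runbook includes the question text verbatim (for example as a heading). The function name and that convention are assumptions, not an existing GFSC tool.

```shell
# The six questions every runbook should answer, copied from the list above.
RUNBOOK_QUESTIONS="What does this do?
How do I deploy it?
How do I restart it if it crashes?
How do I check the logs?
How do I restore from backup?
Who do I contact if I can't fix it?"

# Print every question that does not appear in the given runbook file.
# Prints nothing when the runbook covers all six.
missing_runbook_questions() {
  printf '%s\n' "$RUNBOOK_QUESTIONS" | while IFS= read -r question; do
    if ! grep -qF -- "$question" "$1"; then
      printf '%s\n' "$question"
    fi
  done
}
```

Calling `missing_runbook_questions path/to/runbook.md` (a placeholder path) lists whatever is still unanswered, which makes it easy to run over every runbook in a repo.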

The test: if the person who built this disappeared today, could a new volunteer with basic Linux skills keep it running using only the runbook? If yes, the runbook is done. If no, it needs more work.

PlaceCal is already close to this with Kamal. The rest of the infrastructure isn't — and that's why it keeps breaking silently and requiring specialist intervention.

Why This Approach, Not a Fancier One

There are more sophisticated options. Kubernetes. Nomad. GitOps with ArgoCD. Automated patch management. These are excellent tools for organisations running at scale, with dedicated platform teams and consistent funding.

GFSC is a volunteer-run community tech collective. The infrastructure needs to be understood and maintained by the next volunteer who shows up, not the one who built it. Kubernetes has a steep enough learning curve that it effectively excludes most contributors. A Hetzner box with Docker and a documented deploy config is something a motivated developer can understand in an afternoon.

The right level of infrastructure complexity is the minimum that meets the need. That minimum is: a small number of Hetzner boxes, open source tools, and a runbook for each service.

Build, Mirror, Switch

For any service migration (Ghost is the first candidate), the approach is:

Build — set up the service on the new host alongside the existing one. Nothing breaks, old system keeps running.
Mirror — import data, test the new setup, verify it works end-to-end.
Switch — lower DNS TTL to 5 minutes (the day before), then cut over. Monitor for 24–48 hours.
Close — once confirmed stable, decommission the old host.

A show-and-tell between Mirror and Switch is the confidence checkpoint: nothing gets cut over until everyone is happy with what they see.
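The TTL part of the Switch step can be verified before cutting over. This sketch parses the output of `dig +noall +answer`, where the second column is the TTL in seconds. The function names are illustrative. A caching resolver reports the time remaining on its cached copy, so querying the domain's authoritative nameserver gives the configured value.

```shell
MAX_TTL_SECONDS=300   # 5 minutes, as in the Switch step

# Read `dig +noall +answer` output on stdin and print the TTL of the first record.
answer_ttl() {
  awk 'NR == 1 { print $2 }'
}

# Succeed when the given TTL is at or below the cutover limit.
ttl_ready_for_cutover() {
  [ "$1" -le "$MAX_TTL_SECONDS" ]
}
```

For example: `ttl="$(dig +noall +answer gfsc.community A | answer_ttl)"` followed by `ttl_ready_for_cutover "$ttl" && echo "safe to switch"`. An empty answer fails the check, which is the safe default.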

Estimated work

Task	Est. hours
Access audit — confirm who holds what credentials	2–4 hrs
Deploy monitoring (Uptime Kuma + Node Exporter + Grafana Cloud + Discord)	3–5 hrs
Audit n8n and Listmonk — confirm usage, export if needed, remove if unused	1–2 hrs
Ghost → Hetzner (provision, mirror content, test, switch DNS)	2–4 hrs
Mailman — diagnose, fix or upgrade to Mailman 3	2–6 hrs (depends on diagnosis)
HedgeDoc — update to current 1.x, write runbook	1–3 hrs
Mastodon — update to current version, write runbook	2–4 hrs
donna-bot + musicwall — write runbooks, confirm deploy process	1–3 hrs
Credential audit and shared password manager setup	2–3 hrs
Total	16–34 hrs

At 4 hours a week, that is 4–9 weeks; at 8 hours a week, 2–5 weeks (rounding up to whole weeks). The range is wide partly because Mailman is the biggest unknown: the fix could be straightforward or require a full reinstall.
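The totals can be checked by summing the per-task ranges from the table. This sketch also converts hours into whole weeks at a given weekly pace, rounding up, so its upper bounds can come out a week longer than a rounded-down figure. The variable and function names are illustrative.

```shell
# Low and high estimates per task, in table order (hours).
LOW_HOURS="2 3 1 2 2 1 2 1 2"
HIGH_HOURS="4 5 2 4 6 3 4 3 3"

# Print the sum of all arguments.
sum_hours() {
  total=0
  for hours in "$@"; do
    total=$((total + hours))
  done
  echo "$total"
}

# weeks_needed TOTAL_HOURS HOURS_PER_WEEK
# Print the number of whole weeks needed, rounded up.
weeks_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

`sum_hours $LOW_HOURS` prints 16 and `sum_hours $HIGH_HOURS` prints 34, matching the table. `weeks_needed 34 4` prints 9 and `weeks_needed 34 8` prints 5.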

Current Infrastructure — Full service map with access details
Questions and Next Steps — Open questions, server review, next steps