hermes-agent-features/docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
Ben 6c49bdc4f4
docs(plans): trim s6-overlay plan to a post-implementation reference
PR #30136 review item O7: the plan doc was 3,191 lines — 5x the
size of any other plan in docs/plans/ and the largest reference
document in the repo. With the implementation shipped, most of
that content is either:

* The phase-by-phase TDD walkthrough (~2,800 lines): now canonical
  in the PR commit log (`git log a957ef083..a6f7171a5`).
* The v2/v3 re-validation preambles: artifacts of the planning
  process, no longer load-bearing.
* The full Open Questions deliberations with options A/B/C laid
  out: collapsed into the Decision Log.
* The Rollout Plan and Estimated Timeline: history.

Trim to ~430 lines covering what readers actually need going
forward: the goal, architecture, scope, key design decisions
(D1–D9), risk register (now including the three risks surfaced
in PR review — `_s6_running` detection, svscanctl FIFO perms,
supervise control FIFO perms), the decision log including the
post-merge additions, and the verification checklist (now all
boxes ticked).

Header now reads 'Status: shipped' and points at the PR. The git
history preserves the full v3 plan for anyone who needs it.
2026-05-24 18:05:33 -07:00

435 lines
24 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# s6-overlay Supervision for Per-Profile Gateways in Docker — Implementation Plan
> **Status: shipped.** Phases 05 landed via PR
> [NousResearch/hermes-agent#30136](https://github.com/NousResearch/hermes-agent/pull/30136)
> in May 2026. This document is preserved as a post-implementation reference
> for the architecture and the resolved design questions. The phase-by-phase
> TDD walkthrough (≈2,800 lines) and the v2/v3 re-validation preambles have
> been removed — the canonical implementation history is the PR commit log
> (`git log --oneline a957ef083..a6f7171a5 -- 'docker/*' 'hermes_cli/service_manager.py' …`).
> Open Questions are collapsed into a single Decision Log table; full
> deliberations live in PR review comments.
**Goal:** Replace `tini` with s6-overlay as PID 1 in the Hermes Docker image so
that the main hermes process, the dashboard, and dynamically-created
per-profile gateways all run as supervised services (auto-restart on crash,
clean shutdown, signal forwarding, zombie reaping). Preserve every existing
`docker run …` invocation pattern — including interactive TUI.
**Architecture:** s6-overlay's `/init` is the container ENTRYPOINT, running
s6-svscan as PID 1. Main hermes and the dashboard are declared as static
s6-rc services at image build time. Per-profile gateways — which users create
*after* the image is built (`hermes profile create coder` →
`coder gateway start`) — are registered dynamically by writing service
directories under a scandir watched by s6-svscan. A `ServiceManager` protocol
abstracts the install/start/stop/restart surface across the init systems we
care about (systemd on Linux host, launchd on macOS host, Scheduled Tasks on
native Windows host, s6 inside container) and adds a second tier for runtime
service registration that only s6 implements.
**Tech Stack:**
- [s6-overlay](https://github.com/just-containers/s6-overlay) v3.2.3.0
(noarch + per-arch tarballs ~15 MB). SHA256-pinned via build ARGs;
multi-arch via `TARGETARCH` (amd64 → `x86_64`, arm64 → `aarch64`).
- Debian 13.4 base image (unchanged).
- [hadolint](https://github.com/hadolint/hadolint) for the Dockerfile +
[shellcheck](https://github.com/koalaman/shellcheck) for entrypoint scripts.
- Python subprocess wrappers for `s6-svc`, `s6-svstat`, `s6-svscanctl`.
- Existing systemd/launchd/windows surface in `hermes_cli/gateway.py` and
`hermes_cli/gateway_windows.py`.
**Scope:**
- Container-only (host-side systemd/launchd/windows behavior is preserved,
not modified).
- s6-overlay only (no pure-Python fallback).
- Architecture A (s6 owns PID 1; tini is removed).
- Interactive TUI must keep working:
`docker run -it --rm nousresearch/hermes-agent:latest --tui`.
- Dynamic registration is limited to per-profile gateways — one service per
profile, created when a profile is created, torn down when deleted. A
`gateway-default` slot is always registered for the root HERMES_HOME
profile so `hermes gateway start` (no `-p`) has somewhere to land.
**Out of scope:**
- Host-side dynamic supervision (systemd-run / launchd transient plists) —
not needed.
- Pure-Python supervisor fallback — not needed.
- Arbitrary user-defined supervised processes inside the container — only
profile gateways.
- Migration of existing per-profile systemd unit generation to s6 on the
host side.
- Non-Docker container runtimes (Podman rootless validated reactively).
- UX polish around in-container profile lifecycle (e.g. a nice status view
of all supervised profile gateways) — deferred to follow-up.
---
## Background From The Codebase
> **Note on line numbers:** This section refers to functions and structures
> by name only. Use `grep -n 'def <name>' <file>` to locate anything below
> if you need the current line.
### Pre-s6 container init (what we replaced)
The original `Dockerfile` declared
`ENTRYPOINT [ "/usr/bin/tini", "-g", "--", "/opt/hermes/docker/entrypoint.sh" ]`.
tini was PID 1, reaped zombies, forwarded SIGTERM to the process group. The
old `docker/entrypoint.sh`:
1. `gosu` privilege drop from root → `hermes` UID.
2. Copied `.env.example`, `cli-config.yaml.example`, `SOUL.md` into
`$HERMES_HOME` if missing.
3. Synced bundled skills via `tools/skills_sync.py`.
4. Optionally backgrounded `hermes dashboard` in a subshell when
`HERMES_DASHBOARD=1`**not supervised**, no restart.
5. `exec hermes "$@"` — tini's sole direct child.
Known limitations: dashboard crash → stays dead; dashboard fails at startup →
silent; gateway crash → dashboard dies too. The May 4, 2026 decision was
"leave as is" because nothing in the container needed supervision then.
Adding per-profile gateway supervision changed that.
### ServiceManager surface (what we wrapped, not refactored)
All init-system logic lives in **`hermes_cli/gateway.py`** (~5,400 LOC at
re-validation). The systemd/launchd code is ~1,500 lines of that, plus a
separate **`hermes_cli/gateway_windows.py`** (~690 LOC) for Windows
Scheduled Tasks.
| Layer | Systemd functions | Launchd functions | Windows functions |
|---|---|---|---|
| **Detection** | `supports_systemd_services()`, `_systemd_operational()`, `_wsl_systemd_operational()`, `_container_systemd_operational()` | `is_macos()` | `is_windows()`, `gateway_windows.is_installed()` |
| **Paths** | `get_systemd_unit_path(system)`, `get_service_name()` | `get_launchd_plist_path()`, `get_launchd_label()` | `gateway_windows.get_task_name()`, `get_task_script_path()`, `get_startup_entry_path()` |
| **Install/lifecycle** | `systemd_install(force, system, run_as_user)`, `systemd_uninstall(system)`, `systemd_start/stop/restart(system)` | `launchd_install(force)`, `launchd_uninstall/start/stop/restart` | `gateway_windows.install/uninstall/start/stop/restart` |
| **Probes** | `_probe_systemd_service_running(system)`, `_read_systemd_unit_properties(system)`, `_wait_for_systemd_service_restart`, `_recover_pending_systemd_restart` | `_probe_launchd_service_running()` | `gateway_windows.is_task_registered()`, `_pid_exists` helper |
| **D-Bus plumbing** | `_ensure_user_systemd_env`, `_user_systemd_socket_ready`, `_user_systemd_private_socket_path`, `get_systemd_linger_status` | — | — |
| **Unit/plist generation** | `generate_systemd_unit(system, run_as_user)`, `systemd_unit_is_current`, `refresh_systemd_unit_if_needed` | plist templating in `launchd_install` | `_build_gateway_cmd_script`, `_build_startup_launcher`, `_write_task_script` |
Container-relevant callers outside `gateway.py`:
- `hermes_cli/status.py` — gained an `s6` branch for in-container runs.
- `hermes_cli/profiles.py``create_profile` / `delete_profile` register and
unregister with s6 inside the container (no-op on host).
- `hermes_cli/doctor.py``_check_gateway_service_linger` skips on s6, and a
new "Service Supervisor" section reports main-hermes / dashboard /
profile-gateway counts via the ServiceManager.
- `hermes_cli/gateway.py::gateway_command` — the
`elif is_container():` rejection arms that refused gateway lifecycle
operations were removed; the `_dispatch_via_service_manager_if_s6` helper
intercepts start/stop/restart and routes them through s6.
### Per-profile gateway spawning
`hermes gateway start`, `coder gateway start` (profile alias), and
`hermes -p <profile> gateway start` all spawn a gateway process scoped to a
given profile. See
[Profiles: Running Gateways](https://hermes-agent.nousresearch.com/docs/user-guide/profiles#running-gateways).
On host, lifecycle is managed via per-profile systemd units
(`hermes-gateway-<profile>.service`); inside the container, an s6 service at
`/run/service/gateway-<name>/` is registered when the profile is created and
torn down when it's deleted.
**Persistence across container restart:** `/run/service/` is tmpfs —
service registrations are wiped when the container restarts. Profile
directories at `/opt/data/profiles/<name>/` live on the persistent VOLUME,
and each one records its gateway's last state in `gateway_state.json`.
`/etc/cont-init.d/02-reconcile-profiles` walks the persistent profiles on
every container boot, recreates the s6 service slots via
`hermes_cli/container_boot.py`, and auto-starts those whose last recorded
state was `running`. Profiles whose last state was `stopped`,
`startup_failed`, `starting`, or absent get their slot recreated in the
`down` state and wait for explicit user action. `docker restart` is therefore
invisible to a user with running profile gateways: they come back up;
stopped ones stay stopped.
### s6-overlay constraints
- **Root/non-root model:** `/init` runs as root to set up the supervision
tree, install signal handlers, and run the stage2 hook that does
`usermod`/`chown`. Each supervised service drops to UID 10000 via
`s6-setuidgid hermes` in its `run` script. The per-service `s6-supervise`
monitor stays root so it can signal its child regardless of UID. Net
effect: hermes and all its subprocesses run as UID 10000 exactly as
before; only the supervision tree itself runs as root.
- v3.2.3.0 has limited non-root support for running `/init` itself as
non-root — some tools (`fix-attrs`, `logutil-service`) assume root. We
don't hit this because `/init` runs as root.
- Scandir hard cap: `services_max` default 1000, configurable to 160,000.
- `/command/with-contenv` sources `/run/s6/container_environment/*` into
service env — convenient for passing `HERMES_HOME` etc.
- s6 signal semantics: service crash triggers `s6-supervise` restart after
1s; override with a `finish` script.
- Zombie reaping: PID 1 (s6-svscan) reaps all zombies non-blockingly on
SIGCHLD. Any subagent subprocess spawned by the main hermes process is
reaped automatically.
---
## Key Design Decisions
### D1. s6-overlay replaces tini entirely
Container ENTRYPOINT is `/init`, PID 1 is s6-svscan. The main hermes
process, the dashboard, and every per-profile gateway run as supervised
services. This is a single breaking change to the container contract.
### D2. Main hermes is an s6 service with container-exit semantics
The contract "container exits when `hermes` exits" is preserved via a
service `finish` script that writes to
`/run/s6-linux-init-container-results/exitcode` and calls
`/run/s6/basedir/bin/halt`. All five supported invocations work:
| `docker run <image> …` | Behavior |
|---|---|
| (no args) | `hermes` with no args, container exits when hermes exits |
| `chat -q "..."` | `hermes chat -q "..."`, container exits with hermes exit code |
| `sleep infinity` | `sleep infinity` directly (long-lived sandbox mode) |
| `bash` | interactive `bash` directly |
| `docker run -it … --tui` | interactive Ink TUI with real TTY — see D9 |
`docker/main-wrapper.sh` detects whether `$1` is an executable on PATH and
routes either to "run this as a one-shot main service" or "wrap with
hermes".
### D3. Static services at build time; dynamic (per-profile) services at runtime
s6 offers two mechanisms:
- **s6-rc** (declarative, compile-then-swap): used for main hermes and the
dashboard — they're known at image build time.
- **scandir** (drop a directory + `s6-svscanctl -a`): used for per-profile
gateways — profiles are user-created after the image is built.
Per-profile gateway service dirs live at `/run/service/gateway-<profile>/`
(tmpfs, hermes-writable). s6-svscan picks them up on rescan.
### D4. ServiceManager protocol with two methods for runtime registration
Host paths (systemd, launchd, Windows Scheduled Tasks) need only
install/start/stop/restart of pre-declared services. Inside the container,
we additionally need to register services at runtime when a profile is
created. The protocol exposes this directly:
```python
class ServiceManager(Protocol):
kind: ServiceManagerKind # "systemd" | "launchd" | "windows" | "s6" | "none"
# Lifecycle of an already-declared service
def start(self, name: str) -> None: ...
def stop(self, name: str) -> None: ...
def restart(self, name: str) -> None: ...
def is_running(self, name: str) -> bool: ...
# Runtime registration (container-only; hosts raise NotImplementedError)
def supports_runtime_registration(self) -> bool: ...
def register_profile_gateway(
self, profile: str, *,
extra_env: dict[str, str] | None = None,
) -> None: ...
def unregister_profile_gateway(self, profile: str) -> None: ...
def list_profile_gateways(self) -> list[str]: ...
```
Systemd, launchd, and Windows backends raise `NotImplementedError` on the
registration methods. Only the s6 backend implements them. Callers check
`supports_runtime_registration()` before calling.
The scope is intentionally narrow: it's specifically "register/unregister a
profile gateway," not a general-purpose process-management API.
### D5. Per-profile gateway service spec is fixed, not user-provided
Every profile gateway has the same command shape
(`hermes -p <profile> gateway run`, or `hermes gateway run` for the default
profile). The s6 backend generates the `run` script from a fixed template
given the profile name — no arbitrary command list. This keeps the API
surface tight and prevents callers from accidentally registering
non-gateway services.
Port selection is governed by the profile's `config.yaml`
(`[gateway] port = …`) — the single source of truth. (The original plan
proposed a Python-side SHA-256 port allocator with a 600-port range; it was
retired during PR review because it was dead code through the entire stack.)
### D6. Add detect_service_manager() alongside supports_systemd_services()
`supports_systemd_services()` stays as-is (host code paths unchanged). A new
`detect_service_manager() -> Literal["systemd", "launchd", "windows", "s6", "none"]`
composes existing detection functions (`is_macos()`, `is_windows()`,
`supports_systemd_services()`, `is_container()` + `_s6_running()`) and adds
an s6 branch for container detection. Host call sites continue to use the
existing functions; container-only code (the profile hooks) uses the new one.
`_s6_running()` probes `/proc/1/comm` (world-readable) and
`/run/s6/basedir`. The earlier `/proc/1/exe` probe was root-only readable
and silently failed for the unprivileged hermes user (UID 10000), making
the entire runtime-registration path inert in production — caught in PR
review.
### D7. Wrap existing systemd/launchd/windows functions, don't rewrite them
`SystemdServiceManager` / `LaunchdServiceManager` / `WindowsServiceManager`
are thin adapters over the existing `systemd_*` / `launchd_*` module-level
functions in `hermes_cli/gateway.py` and the
`gateway_windows.install/uninstall/start/stop/restart/is_installed`
functions in `hermes_cli/gateway_windows.py`. We get the abstraction
without rewriting ~2,200 LOC of working code.
### D8. Profile create/delete hooks register/unregister the s6 service
When `hermes profile create <name>` runs inside the container, the
profile-creation code path calls
`ServiceManager.register_profile_gateway(<name>)` if
`supports_runtime_registration()` is True. When `hermes profile delete
<name>` runs, it calls `unregister_profile_gateway(<name>)`. On host, both
calls are no-ops (registration not supported; existing systemd unit
generation continues to handle install/uninstall).
Existing per-profile `hermes -p <profile> gateway start/stop/restart` CLI
commands continue to work — in the container they dispatch to
`ServiceManager.start/stop/restart("gateway-<profile>")`, which translates
to `s6-svc -u`/`-d`/`-t` on the service dir.
`hermes gateway start` (no `-p`) targets a special `gateway-default` slot
that's always registered by the cont-init reconciler. Its run script omits
the `-p` flag and runs against the root `$HERMES_HOME` profile.
`--all` lifecycle (`hermes gateway stop --all`, `... restart --all`)
iterates `mgr.list_profile_gateways()` through s6 so s6's `want up`/`want
down` flips correctly. Without this, `--all` fell through to `pkill`
followed by s6-supervise auto-restart — net effect: kick instead of stop.
### D9. Interactive TUI bypasses s6 service-mode and runs as CMD for TTY passthrough
`docker run -it --rm <image> --tui` needs a real TTY connected to container
stdin/stdout for Ink raw-mode keyboard input, cursor control, and SIGWINCH.
Running the TUI as a normal s6 service fails because s6-supervise
disconnects service stdio from the container TTY (documented:
[s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230)).
**The pattern:** s6-overlay's `/init` execs a CMD as the container's "main
program" after the supervision tree is up. The CMD inherits
stdin/stdout/stderr from `/init` — which in `-it` mode is the container
TTY. The stage2 hook detects the TUI case and short-circuits the
main-hermes service so the hermes CMD becomes that main program.
```sh
# In docker/stage2-hook.sh
_is_tui_invocation() {
for arg in "$@"; do
case "$arg" in --tui|-T) return 0 ;; esac
done
case "${HERMES_TUI:-}" in 1|true|TRUE|yes) return 0 ;; esac
if [ -t 0 ] && [ $# -eq 0 ]; then return 0; fi
return 1
}
```
And in `docker/s6-rc.d/main-hermes/run`:
```sh
if [ -f /var/run/s6/container_environment/HERMES_TUI_MODE ]; then
exec sleep infinity # s6-overlay will exec CMD as the TTY-connected main
fi
exec s6-setuidgid hermes hermes ${HERMES_ARGS:-}
```
In TUI mode main hermes is effectively unsupervised (same as the pre-s6
behavior with tini — acceptable because the user is interactively
present). Dashboard and profile gateways still get full s6 supervision via
their separate services.
The integration test `test_tty_passthrough_to_container` uses `tput cols`
and `COLUMNS=123` as the probe.
---
## Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Phase 2 breaks a downstream user's Dockerfile that `FROM`s ours | Medium | Medium | Release notes call out ENTRYPOINT change; the test harness (`tests/docker/`) gives high confidence in behavior parity |
| TUI TTY passthrough fails on some Docker versions | Low | High | Harness includes `test_tty_passthrough_to_container` as a hard gate; fallback plan = s6-fdholder ([s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230) Solution 2) |
| s6-overlay non-root quirks (logutil-service, fix-attrs) bite us | Low | Low | Supervisor runs as root, services drop — sidesteps these issues |
| Podman rootless UID mapping confuses s6 | Medium | Low | Documented as supported, fix reactively; a Podman + Docker environment is stood up for validation |
| Test harness is flaky (docker daemon issues, timing) | Medium | Low | Generous timeouts; skip when docker unavailable; polling helpers replace fixed sleeps in `test_container_restart.py` |
| Profile gateway crash loop masks a real config error | Low | Medium | s6 `finish` script `max_restarts` cap (planned follow-up); operators see crash-looping logs in `$HERMES_HOME/logs/gateways/<profile>/` |
| Dockerfile+entrypoint drift from linter (hadolint/shellcheck) reveals latent bugs | Low | Low | CI lint jobs catch them; fix or document ignore with rationale |
| Stale `gateway.pid` from a dead container collides with an unrelated live PID in the restarted container | Low | Medium | Cont-init reconciliation removes `gateway.pid` and `processes.json` from every profile dir on boot, before any new gateway starts |
| `docker restart` silently loses per-profile gateway registrations (tmpfs scandir wiped) | High (without mitigation) | High | Cont-init reconciliation re-registers from persistent `$HERMES_HOME/profiles/` and auto-starts those last seen `running`; outcome recorded to `$HERMES_HOME/logs/container-boot.log` (size-bounded, rotates to `.1` at 256 KiB) |
| A `running` gateway that's actually broken auto-restarts into a crash loop after every container restart | Low | Medium | s6 `finish` script `max_restarts` cap (planned); follow-up: `hermes doctor` alerts when N consecutive container restarts ended in `startup_failed` |
| `_s6_running()` detection works as root but silently fails for unprivileged hermes user, making runtime-registration path inert | High (without mitigation) | High | **Caught in PR review.** Detection now probes `/proc/1/comm` (world-readable) + `/run/s6/basedir`. Docker integration tests refactored to `docker exec -u hermes` so the realistic runtime user is exercised |
| `s6-svscanctl` from hermes hits EACCES on the root-owned control FIFO | Medium | Medium | `02-reconcile-profiles` chowns `/run/service/.s6-svscan/{control,lock}` to hermes after stage1 creates them |
| Per-service `supervise/control` FIFO is root-owned by s6-supervise, blocking `s6-svc` from hermes | Known | Medium | Surfaced cleanly as `S6CommandError` (with rc + stderr) instead of raw `CalledProcessError`. Permission fix tracked as a follow-up (small SUID helper, polling chown loop in cont-init.d, or replace `s6-svc` with `down`-marker manipulation) |
---
## Decision Log
| # | Question | Decision |
|---|---|---|
| OQ1 | Gate Phase 2 behind env var? | Ship directly (Hermes is pre-1.0; users can pin the previous image) |
| OQ2 | s6 root model | Root `/init`, drop per-service via `s6-setuidgid hermes` |
| OQ3 | Dashboard opt-in mechanism | Always declared as an s6 service; `03-dashboard-toggle` cont-init script writes a `down` marker when `HERMES_DASHBOARD` is unset so `s6-svstat` reports the slot's real state |
| OQ4 | Podman rootless | Supported, fix reactively |
| OQ5 | Service naming | `gateway-<profile>` (matches pre-existing `hermes-gateway-<profile>.service` systemd convention) |
| OQ6 | — (retired; no subagent gateways in scope) | — |
| OQ7 | Resource limits per profile gateway | Defer (no per-cgroup limits; rely on the container's overall limit) |
| OQ8 | Log persistence | `$HERMES_HOME/logs/gateways/<profile>/`. The log path is sourced from runtime `$HERMES_HOME` via `with-contenv`, NOT Python-substituted at registration time |
| OQ9 | TUI passthrough | Trust the documented [s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230) Solution 1; harness includes a TTY passthrough hard-gate test |
**Post-merge additions from PR #30136 review:**
- **Multi-arch tarballs:** `TARGETARCH` mapped to `x86_64` / `aarch64`;
per-arch tarball fetched via `curl` because `ADD` doesn't honor BuildKit
args.
- **SHA256 verification:** all three tarballs (noarch, symlinks, per-arch)
pinned via build ARGs and verified with `sha256sum -c` against a single
checksum file (avoids hadolint DL4006 piped-shell warning).
- **`gateway-default` slot:** always registered by the reconciler so
`hermes gateway start` (no `-p`) has somewhere to land.
- **Friendly lifecycle errors:** `GatewayNotRegisteredError` and
`S6CommandError` translate `CalledProcessError` into actionable CLI
messages.
- **Atomic publication in the reconciler:** mirrors
`register_profile_gateway`'s tmp+rename pattern.
- **`container-boot.log` rotation:** 256 KiB soft cap, rotated to `.1`.
- **`port` parameter retired:** allocator + kwarg were dead code through
the entire stack; `config.yaml` is the single source of truth.
---
## Verification Checklist
- [x] Test harness (`tests/docker/`) passes against the s6 image
- [x] hadolint + shellcheck run green in CI
- [x] `docker run -it --rm hermes-agent --tui` starts the Ink TUI with
working keyboard input, cursor control, and resize (SIGWINCH)
- [x] Dashboard crashes are recovered by s6 within ~2s
- [x] `hermes profile create test` inside a container creates
`/run/service/gateway-test/`
- [x] `hermes -p test gateway start` inside a container dispatches through s6
- [x] `hermes -p test gateway stop` inside a container cleanly stops via s6
- [x] `hermes profile delete test` inside a container removes
`/run/service/gateway-test/`
- [x] Profile gateway logs persist at
`$HERMES_HOME/logs/gateways/test/current`
- [x] `hermes status` inside the container shows `Manager: s6`
- [x] `hermes gateway start` (no `-p`) inside a container targets
`gateway-default` and runs against the root profile
- [x] `hermes gateway stop --all` / `... restart --all` iterate every
profile gateway under s6 instead of pkill-then-supervise-restart
- [x] `docker restart` survives per-profile gateway registrations via the
cont-init reconciler; running gateways come back up, stopped ones
stay down
- [x] Multi-arch image builds for both `linux/amd64` and `linux/arm64`
- [x] s6-overlay tarballs are SHA256-verified at build time
- [x] No systemd/launchd host-side functions were modified (only wrapped)
- [x] `hermes gateway install/start/stop` on Linux host and macOS host
behave identically to pre-change