hermes-agent-features/docs/experiments/semantic-rle-context-engine.md

# Semantic RLE Context Engine experiment

Status: experimental MVP, not enabled by default.

## Hypothesis

For very long Telegram chats, a deterministic context engine with:

- a verbatim hot tail;
- a semantic factual ledger for older turns;
- explicit stale/superseded facts;
- credential references instead of raw secrets;
- retrieval notes for manual compression focus;

will reduce "misses" versus the current lossy compression path, especially when users correct facts over time ("server X" -> "server Y") or return to obligations created many turns ago.

## MVP design

Plugin: `plugins/context_engine/semantic_rle/`.

Compression output shape:

1. original system messages;
2. `system` summary block named `Semantic RLE context ledger`;
3. last `hot_tail_messages` non-system messages preserved verbatim.

The ledger is deterministic and local-only. No cloud LLM is called by the plugin.

Ledger sections:

- Active facts
- Decisions
- Obligations
- Superseded facts
- Unresolved questions
- Credential refs
- Retrieval notes

Best-effort redaction:

- key/value secrets (`api_key=...`, `token: ...`, `Authorization: ...`) become stable `credential_ref:credential:<hash>` references;
- token-like long strings become credential refs;
- IPv4-like strings become `[REDACTED_IP]`.

## What counts as a miss

A run has a miss when the assistant, after compaction, does any of these:

1. uses an inactive/superseded fact as current;
2. loses a still-active obligation or TODO;
3. says it does not know a fact that was in the compacted cold history;
4. exposes raw fake credentials or sensitive IP-like strings from cold history;
5. answers from an older decision when a newer decision superseded it;
6. needs user correction for information that the ledger retained.

## Baseline

Baseline should be the same Telegram session corpus and prompt set with:

```yaml
context:
  engine: compressor
```

Record:

- number of manual corrections per 100 turns;
- number of stale-fact answers per scenario;
- number of obligation/TODO misses;
- raw secret leakage count in model-visible compressed context;
- compressed message count and rough token count;
- whether final hot-tail messages remain byte-for-byte unchanged.

## First A/B plan

1. Select 5-10 long Telegram transcripts with known corrections and recurring tasks.
2. Replay fixed query checkpoints against baseline `compressor` and experimental `semantic_rle`.
3. Keep model/provider/toolsets constant.
4. For each checkpoint, force compression before asking the evaluation query.
5. Score blind if possible: correct/current, stale, missing, unsafe leak, or ambiguous.

## Test scenarios

- Hot tail preservation: last N messages must be unchanged.
- Server supersession: `server alpha` followed by `server beta` should keep beta active and mark alpha superseded.
- Fake token redaction: no raw fake token appears in compressed output; only credential refs.
- IP-like redaction: raw IPv4-like strings do not appear in the ledger.
- Obligations: old `todo`/`надо` messages survive as obligations.
- Unresolved questions: old question markers survive as unresolved questions.
- Deterministic failure: if extraction raises, return original messages rather than dropping context.
- Discovery/config: plugin can be discovered and explicitly loaded, but is not globally enabled by adding it to the repo.

## Manual enablement for experiment only

Do not turn it on globally in commits. For a local experiment:

```bash
hermes config set context.engine semantic_rle
# restart the CLI session or gateway process you intentionally want to test
```

Return to default:

```bash
hermes config set context.engine compressor
```

Gateway note: do not restart production gateway just because the plugin exists. Restart only a deliberately selected experiment profile/session.

## Deterministic smoke-eval

Run:

```bash
python scripts/semantic_rle_eval.py
python scripts/semantic_rle_eval.py --json
```

Current invariant results on the checked-in synthetic corpus:

- `hot_tail_only_baseline`: 4/12 checks passed.
- `semantic_rle`: 12/12 checks passed.

Covered invariants:

- current fact retained after supersession;
- old fact marked superseded;
- old decision retained;
- old obligation retained;
- old unresolved question retained;
- cold fake token/IP redacted to refs/markers;
- hot tail preserved byte-for-byte.

This is only a deterministic preflight. It proves the plugin keeps the facts in model-visible context; it does **not** prove the LLM will use them correctly in live Telegram replay.

## Logical MVP boundary

Done for MVP:

- plugin discovery and explicit loading;
- ContextEngine ABC compatibility;
- local-only deterministic compression path;
- fail-closed compression error handling;
- hot tail preservation;
- factual ledger with active/superseded facts;
- decisions/obligations/questions sections;
- cold-history fake secret and IPv4 redaction;
- smoke-eval harness and tests.

Still intentionally out of MVP:

- Better fact keys: current MVP is regex-based and intentionally conservative.
- Persistence: ledger is in-memory only; no per-chat store yet.
- Retrieval tools: no `semantic_rle_search` tool yet.
- Multilingual extraction: Russian TODO/question markers are minimal.
- Live Telegram replay/scoring against real transcripts.