5.2 KiB
Semantic RLE Context Engine experiment
Status: experimental MVP, not enabled by default.
Hypothesis
For very long Telegram chats, a deterministic context engine with:
- a verbatim hot tail;
- a semantic factual ledger for older turns;
- explicit stale/superseded facts;
- credential references instead of raw secrets;
- retrieval notes for manual compression focus;
will reduce "misses" versus the current lossy compression path, especially when users correct facts over time ("server X" -> "server Y") or return to obligations created many turns ago.
MVP design
Plugin: plugins/context_engine/semantic_rle/.
Compression output shape:
- original system messages;
systemsummary block namedSemantic RLE context ledger;- last
hot_tail_messagesnon-system messages preserved verbatim.
The ledger is deterministic and local-only. No cloud LLM is called by the plugin.
Ledger sections:
- Active facts
- Decisions
- Obligations
- Superseded facts
- Unresolved questions
- Credential refs
- Retrieval notes
Best-effort redaction:
- key/value secrets (
api_key=...,token: ...,Authorization: ...) become stablecredential_ref:credential:<hash>references; - token-like long strings become credential refs;
- IPv4-like strings become
[REDACTED_IP].
What counts as a miss
A run has a miss when the assistant, after compaction, does any of these:
- uses an inactive/superseded fact as current;
- loses a still-active obligation or TODO;
- says it does not know a fact that was in the compacted cold history;
- exposes raw fake credentials or sensitive IP-like strings from cold history;
- answers from an older decision when a newer decision superseded it;
- needs user correction for information that the ledger retained.
Baseline
Baseline should be the same Telegram session corpus and prompt set with:
context:
engine: compressor
Record:
- number of manual corrections per 100 turns;
- number of stale-fact answers per scenario;
- number of obligation/TODO misses;
- raw secret leakage count in model-visible compressed context;
- compressed message count and rough token count;
- whether final hot-tail messages remain byte-for-byte unchanged.
First A/B plan
- Select 5-10 long Telegram transcripts with known corrections and recurring tasks.
- Replay fixed query checkpoints against baseline
compressorand experimentalsemantic_rle. - Keep model/provider/toolsets constant.
- For each checkpoint, force compression before asking the evaluation query.
- Score blind if possible: correct/current, stale, missing, unsafe leak, or ambiguous.
Test scenarios
- Hot tail preservation: last N messages must be unchanged.
- Server supersession:
server alphafollowed byserver betashould keep beta active and mark alpha superseded. - Fake token redaction: no raw fake token appears in compressed output; only credential refs.
- IP-like redaction: raw IPv4-like strings do not appear in the ledger.
- Obligations: old
todo/надоmessages survive as obligations. - Unresolved questions: old question markers survive as unresolved questions.
- Deterministic failure: if extraction raises, return original messages rather than dropping context.
- Discovery/config: plugin can be discovered and explicitly loaded, but is not globally enabled by adding it to the repo.
Manual enablement for experiment only
Do not turn it on globally in commits. For a local experiment:
hermes config set context.engine semantic_rle
# restart the CLI session or gateway process you intentionally want to test
Return to default:
hermes config set context.engine compressor
Gateway note: do not restart production gateway just because the plugin exists. Restart only a deliberately selected experiment profile/session.
Deterministic smoke-eval
Run:
python scripts/semantic_rle_eval.py
python scripts/semantic_rle_eval.py --json
Current invariant results on the checked-in synthetic corpus:
hot_tail_only_baseline: 4/12 checks passed.semantic_rle: 12/12 checks passed.
Covered invariants:
- current fact retained after supersession;
- old fact marked superseded;
- old decision retained;
- old obligation retained;
- old unresolved question retained;
- cold fake token/IP redacted to refs/markers;
- hot tail preserved byte-for-byte.
This is only a deterministic preflight. It proves the plugin keeps the facts in model-visible context; it does not prove the LLM will use them correctly in live Telegram replay.
Logical MVP boundary
Done for MVP:
- plugin discovery and explicit loading;
- ContextEngine ABC compatibility;
- local-only deterministic compression path;
- fail-closed compression error handling;
- hot tail preservation;
- factual ledger with active/superseded facts;
- decisions/obligations/questions sections;
- cold-history fake secret and IPv4 redaction;
- smoke-eval harness and tests.
Still intentionally out of MVP:
- Better fact keys: current MVP is regex-based and intentionally conservative.
- Persistence: ledger is in-memory only; no per-chat store yet.
- Retrieval tools: no
semantic_rle_searchtool yet. - Multilingual extraction: Russian TODO/question markers are minimal.
- Live Telegram replay/scoring against real transcripts.