157 lines
5.2 KiB
Markdown
157 lines
5.2 KiB
Markdown
# Semantic RLE Context Engine experiment
|
|
|
|
Status: experimental MVP, not enabled by default.
|
|
|
|
## Hypothesis
|
|
|
|
For very long Telegram chats, a deterministic context engine with:
|
|
|
|
- a verbatim hot tail;
|
|
- a semantic factual ledger for older turns;
|
|
- explicit stale/superseded facts;
|
|
- credential references instead of raw secrets;
|
|
- retrieval notes for manual compression focus;
|
|
|
|
will reduce "misses" versus the current lossy compression path, especially when users correct facts over time ("server X" -> "server Y") or return to obligations created many turns ago.
|
|
|
|
## MVP design
|
|
|
|
Plugin: `plugins/context_engine/semantic_rle/`.
|
|
|
|
Compression output shape:
|
|
|
|
1. original system messages;
|
|
2. `system` summary block named `Semantic RLE context ledger`;
|
|
3. last `hot_tail_messages` non-system messages preserved verbatim.
|
|
|
|
The ledger is deterministic and local-only. No cloud LLM is called by the plugin.
|
|
|
|
Ledger sections:
|
|
|
|
- Active facts
|
|
- Decisions
|
|
- Obligations
|
|
- Superseded facts
|
|
- Unresolved questions
|
|
- Credential refs
|
|
- Retrieval notes
|
|
|
|
Best-effort redaction:
|
|
|
|
- key/value secrets (`api_key=...`, `token: ...`, `Authorization: ...`) become stable `credential_ref:credential:<hash>` references;
|
|
- token-like long strings become credential refs;
|
|
- IPv4-like strings become `[REDACTED_IP]`.
|
|
|
|
## What counts as a miss
|
|
|
|
A run has a miss when the assistant, after compaction, does any of these:
|
|
|
|
1. uses an inactive/superseded fact as current;
|
|
2. loses a still-active obligation or TODO;
|
|
3. says it does not know a fact that was in the compacted cold history;
|
|
4. exposes raw fake credentials or sensitive IP-like strings from cold history;
|
|
5. answers from an older decision when a newer decision superseded it;
|
|
6. needs user correction for information that the ledger retained.
|
|
|
|
## Baseline
|
|
|
|
Baseline should be the same Telegram session corpus and prompt set with:
|
|
|
|
```yaml
|
|
context:
|
|
engine: compressor
|
|
```
|
|
|
|
Record:
|
|
|
|
- number of manual corrections per 100 turns;
|
|
- number of stale-fact answers per scenario;
|
|
- number of obligation/TODO misses;
|
|
- raw secret leakage count in model-visible compressed context;
|
|
- compressed message count and rough token count;
|
|
- whether final hot-tail messages remain byte-for-byte unchanged.
|
|
|
|
## First A/B plan
|
|
|
|
1. Select 5-10 long Telegram transcripts with known corrections and recurring tasks.
|
|
2. Replay fixed query checkpoints against baseline `compressor` and experimental `semantic_rle`.
|
|
3. Keep model/provider/toolsets constant.
|
|
4. For each checkpoint, force compression before asking the evaluation query.
|
|
5. Score blind if possible: correct/current, stale, missing, unsafe leak, or ambiguous.
|
|
|
|
## Test scenarios
|
|
|
|
- Hot tail preservation: last N messages must be unchanged.
|
|
- Server supersession: `server alpha` followed by `server beta` should keep beta active and mark alpha superseded.
|
|
- Fake token redaction: no raw fake token appears in compressed output; only credential refs.
|
|
- IP-like redaction: raw IPv4-like strings do not appear in the ledger.
|
|
- Obligations: old `todo`/`надо` messages survive as obligations.
|
|
- Unresolved questions: old question markers survive as unresolved questions.
|
|
- Deterministic failure: if extraction raises, return original messages rather than dropping context.
|
|
- Discovery/config: plugin can be discovered and explicitly loaded, but is not globally enabled by adding it to the repo.
|
|
|
|
## Manual enablement for experiment only
|
|
|
|
Do not turn it on globally in commits. For a local experiment:
|
|
|
|
```bash
|
|
hermes config set context.engine semantic_rle
|
|
# restart the CLI session or gateway process you intentionally want to test
|
|
```
|
|
|
|
Return to default:
|
|
|
|
```bash
|
|
hermes config set context.engine compressor
|
|
```
|
|
|
|
Gateway note: do not restart production gateway just because the plugin exists. Restart only a deliberately selected experiment profile/session.
|
|
|
|
## Deterministic smoke-eval
|
|
|
|
Run:
|
|
|
|
```bash
|
|
python scripts/semantic_rle_eval.py
|
|
python scripts/semantic_rle_eval.py --json
|
|
```
|
|
|
|
Current invariant results on the checked-in synthetic corpus:
|
|
|
|
- `hot_tail_only_baseline`: 4/12 checks passed.
|
|
- `semantic_rle`: 12/12 checks passed.
|
|
|
|
Covered invariants:
|
|
|
|
- current fact retained after supersession;
|
|
- old fact marked superseded;
|
|
- old decision retained;
|
|
- old obligation retained;
|
|
- old unresolved question retained;
|
|
- cold fake token/IP redacted to refs/markers;
|
|
- hot tail preserved byte-for-byte.
|
|
|
|
This is only a deterministic preflight. It proves the plugin keeps the facts in model-visible context; it does **not** prove the LLM will use them correctly in live Telegram replay.
|
|
|
|
## Logical MVP boundary
|
|
|
|
Done for MVP:
|
|
|
|
- plugin discovery and explicit loading;
|
|
- ContextEngine ABC compatibility;
|
|
- local-only deterministic compression path;
|
|
- fail-closed compression error handling;
|
|
- hot tail preservation;
|
|
- factual ledger with active/superseded facts;
|
|
- decisions/obligations/questions sections;
|
|
- cold-history fake secret and IPv4 redaction;
|
|
- smoke-eval harness and tests.
|
|
|
|
Still intentionally out of MVP:
|
|
|
|
- Better fact keys: current MVP is regex-based and intentionally conservative.
|
|
- Persistence: ledger is in-memory only; no per-chat store yet.
|
|
- Retrieval tools: no `semantic_rle_search` tool yet.
|
|
- Multilingual extraction: Russian TODO/question markers are minimal.
|
|
- Live Telegram replay/scoring against real transcripts.
|