Anton Palgunov f95f2daa4b feat: session improvements — crash-context, compression fallback, semantic RLE, telegram voice/test coverage

2026-05-29 15:40:24 +00:00

5.2 KiB

Raw Blame History

Semantic RLE Context Engine experiment

Status: experimental MVP, not enabled by default.

Hypothesis

For very long Telegram chats, a deterministic context engine with:

a verbatim hot tail;
a semantic factual ledger for older turns;
explicit stale/superseded facts;
credential references instead of raw secrets;
retrieval notes for manual compression focus;

will reduce "misses" versus the current lossy compression path, especially when users correct facts over time ("server X" -> "server Y") or return to obligations created many turns ago.

MVP design

Plugin: plugins/context_engine/semantic_rle/.

Compression output shape:

original system messages;
system summary block named Semantic RLE context ledger;
last hot_tail_messages non-system messages preserved verbatim.

The ledger is deterministic and local-only. No cloud LLM is called by the plugin.

Ledger sections:

Active facts
Decisions
Obligations
Superseded facts
Unresolved questions
Credential refs
Retrieval notes

Best-effort redaction:

key/value secrets (api_key=..., token: ..., Authorization: ...) become stable credential_ref:credential:<hash> references;
token-like long strings become credential refs;
IPv4-like strings become [REDACTED_IP].

What counts as a miss

A run has a miss when the assistant, after compaction, does any of these:

uses an inactive/superseded fact as current;
loses a still-active obligation or TODO;
says it does not know a fact that was in the compacted cold history;
exposes raw fake credentials or sensitive IP-like strings from cold history;
answers from an older decision when a newer decision superseded it;
needs user correction for information that the ledger retained.

Baseline

Baseline should be the same Telegram session corpus and prompt set with:

context:
  engine: compressor

Record:

number of manual corrections per 100 turns;
number of stale-fact answers per scenario;
number of obligation/TODO misses;
raw secret leakage count in model-visible compressed context;
compressed message count and rough token count;
whether final hot-tail messages remain byte-for-byte unchanged.

First A/B plan

Select 5-10 long Telegram transcripts with known corrections and recurring tasks.
Replay fixed query checkpoints against baseline compressor and experimental semantic_rle.
Keep model/provider/toolsets constant.
For each checkpoint, force compression before asking the evaluation query.
Score blind if possible: correct/current, stale, missing, unsafe leak, or ambiguous.

Test scenarios

Hot tail preservation: last N messages must be unchanged.
Server supersession: server alpha followed by server beta should keep beta active and mark alpha superseded.
Fake token redaction: no raw fake token appears in compressed output; only credential refs.
IP-like redaction: raw IPv4-like strings do not appear in the ledger.
Obligations: old todo/надо messages survive as obligations.
Unresolved questions: old question markers survive as unresolved questions.
Deterministic failure: if extraction raises, return original messages rather than dropping context.
Discovery/config: plugin can be discovered and explicitly loaded, but is not globally enabled by adding it to the repo.

Manual enablement for experiment only

Do not turn it on globally in commits. For a local experiment:

hermes config set context.engine semantic_rle
# restart the CLI session or gateway process you intentionally want to test

Return to default:

hermes config set context.engine compressor

Gateway note: do not restart production gateway just because the plugin exists. Restart only a deliberately selected experiment profile/session.

Deterministic smoke-eval

Run:

python scripts/semantic_rle_eval.py
python scripts/semantic_rle_eval.py --json

Current invariant results on the checked-in synthetic corpus:

hot_tail_only_baseline: 4/12 checks passed.
semantic_rle: 12/12 checks passed.

Covered invariants:

current fact retained after supersession;
old fact marked superseded;
old decision retained;
old obligation retained;
old unresolved question retained;
cold fake token/IP redacted to refs/markers;
hot tail preserved byte-for-byte.

This is only a deterministic preflight. It proves the plugin keeps the facts in model-visible context; it does not prove the LLM will use them correctly in live Telegram replay.

Logical MVP boundary

Done for MVP:

plugin discovery and explicit loading;
ContextEngine ABC compatibility;
local-only deterministic compression path;
fail-closed compression error handling;
hot tail preservation;
factual ledger with active/superseded facts;
decisions/obligations/questions sections;
cold-history fake secret and IPv4 redaction;
smoke-eval harness and tests.

Still intentionally out of MVP:

Better fact keys: current MVP is regex-based and intentionally conservative.
Persistence: ledger is in-memory only; no per-chat store yet.
Retrieval tools: no semantic_rle_search tool yet.
Multilingual extraction: Russian TODO/question markers are minimal.
Live Telegram replay/scoring against real transcripts.

5.2 KiB Raw Blame History