Commit Graph

1033 Commits

Author SHA1 Message Date
burjorjee
52b049b560 fix: treat inline-shell timeout guard as timeout 2026-05-18 19:36:04 -07:00
墨綠BG
50e93f23f2 🐛 fix(memory): require newline after context tag 2026-05-18 10:53:08 -07:00
墨綠BG
341c8d3030 🐛 fix(memory): keep inline memory-context mentions visible 2026-05-18 10:53:08 -07:00
Slimydog21
aae1615977 fix(xai-responses): strip enum values containing '/' from tool schemas
xAI's /v1/responses and /v1/chat/completions endpoints reject tool schemas
whose enum values contain a forward slash with a generic HTTP 400 'Invalid
arguments passed to the model.' before any token is emitted — the schema
compiler trips on the '/' character regardless of where it appears.

Most commonly hit by MCP-derived tools whose enum lists HuggingFace model
IDs ('Qwen/Qwen3.5-0.8B', 'openai/gpt-oss-20b') or owner/name environment
identifiers.

Mirrors the existing strip_pattern_and_format sanitizer (PR for #27197).
The new strip_slash_enum walks tool parameters and drops the entire enum
keyword when any value contains '/' — keeping it partial would still 400
since xAI's failure is all-or-nothing on the enum. The field description
still reaches the model so the prompting hint is preserved.

Wired in at both code paths for parity:
  - agent/chat_completion_helpers.py (main agent xAI Responses path)
  - agent/auxiliary_client.py (aux client xAI Responses path, matching
    the same parity guarantee 2fae8fba9 established for pattern/format)

Salvaged from #28021 by @Slimydog21 — contributor's branch was severely
stale (would have reverted ~5000 LOC across azure/kanban/i18n); fix
re-applied surgically on current main with their sanitizer + 9 tests
preserved verbatim. Author noreply email used (original was a Mac
hostname leak).
2026-05-18 10:37:35 -07:00
EloquentBrush0x
b570e0fdd0 fix(codex-oauth): quarantine terminal refresh errors so dead tokens are not replayed across sessions
When a Codex OAuth refresh token is permanently invalidated (HTTP 400/401/403,
token revoked or reused), _mark_exhausted was called but auth.json was left with
the dead credentials. On the next session, _seed_from_singletons re-read
auth.json and re-seeded the pool with the same revoked token, triggering the
same terminal failure in a loop.

Add _is_terminal_codex_oauth_refresh_error to auth.py and a matching quarantine
block in _refresh_entry: when a terminal error is detected and auth.json holds
no newer tokens, clear access_token/refresh_token from auth.json and remove all
device_code-sourced pool entries from memory. Mirrors the Nous quarantine added
in c90556262 and the xAI quarantine in #28116.

Also add a pre-refresh sync from auth.json before calling refresh_codex_oauth_pure,
matching the xAI and Nous patterns, to avoid refresh_token_reused races when
multiple Hermes processes share the same auth.json singleton.

Salvaged from #27911 by @EloquentBrush0x — contributor's branch was severely
stale (would have reverted ~5000 LOC across azure/kanban/i18n subsystems);
fix re-applied surgically on current main with their predicate and tests preserved.
2026-05-18 10:31:40 -07:00
Teknium
9aae59feab
fix(compress): make abort-on-summary-failure opt-in via config flag (#28117)
PR #28102 made the summary-failure abort path the unconditional default,
changing established behavior. Gate it behind config.yaml flag
`compression.abort_on_summary_failure` (default False = historical
fallback-placeholder behavior).

- hermes_cli/config.py: new `compression.abort_on_summary_failure` key,
  default False, documented inline.
- agent/agent_init.py: read the flag from compression config and pass to
  ContextCompressor.
- agent/context_compressor.py: `__init__` accepts `abort_on_summary_failure`
  (default False). `compress()` failure branch gates the abort on the
  flag; when False, falls through to the restored legacy fallback path
  (static "summary unavailable" placeholder + drop middle window).
- tests: restore original fallback expectations as default; add new
  TestAbortOnSummaryFailure class for the opt-in mode.

Gateway/CLI plumbing (force=True on /compress, hygiene/handler abort
detection, locale `gateway.compress.aborted` key) from PR #28102 stays
intact — those paths only fire when `_last_compress_aborted` is True,
which now only happens when the flag is enabled.
2026-05-18 10:28:20 -07:00
EloquentBrush0x
5e40f83cb7 fix(xai-oauth): quarantine terminal refresh errors so dead tokens are not replayed across sessions
When refresh_xai_oauth_pure raises a terminal error (HTTP 400/401/403,
i.e. revoked or reused refresh token), _refresh_entry's existing race-
recovery path re-syncs from auth.json and returns if another process has
already rotated the tokens.  If auth.json still holds the same stale
token pair, the function fell through to _mark_exhausted — leaving the
dead credentials in auth.json.  On the next Hermes startup _seed_from_singletons
re-seeded the pool from those stale tokens, causing the same failure loop
on every session.

Fix: after the auth.json re-sync check in the xAI-oauth error handler,
detect terminal errors with the new _is_terminal_xai_oauth_refresh_error
helper and apply a quarantine:
- Clear access_token and refresh_token from providers["xai-oauth"]["tokens"]
  in auth.json so they are not re-seeded.
- Write a last_auth_error entry for hermes doctor / auth status diagnostics.
- Remove all loopback_pkce entries from the in-memory pool so the current
  session stops retrying with the dead credentials.

Mirrors the identical quarantine already in place for Nous OAuth
(c90556262).

Closes the parity gap introduced when c90556262 added Nous-only terminal
error handling without a corresponding xAI-oauth path.
2026-05-18 10:28:09 -07:00
EloquentBrush0x
1fabd6e100 fix(error_classifier): classify xAI Grok entitlement SSE errors as auth
When xAI returns a subscription/entitlement error through an SSE
``type=error`` frame, ``_StreamErrorEvent`` is raised with
``status_code=None``.  This caused ``_classify_by_status`` (step 2 of
``classify_api_error``) to be skipped entirely, and the Grok-specific
phrases ("do not have an active Grok subscription", "out of available
resources") appeared in none of the message-pattern lists.  The error
fell through to ``FailoverReason.unknown (retryable=True)``, burning
``max_retries`` on every affected X Premium+ / SuperGrok user before
the agent stopped — and ``_is_entitlement_failure`` was never called
because it only fires under ``FailoverReason.auth``.

The HTTP 403 path already handled this correctly (``_classify_by_status``
returns ``auth/non-retryable`` for 403).  Add an explicit pattern block
at step 1 (highest priority, before the ``status_code`` guard) so both
code paths route to ``FailoverReason.auth, retryable=False,
should_fallback=True`` — matching the 403 path exactly.

Add three regression tests in ``Fix D`` section of
``test_codex_xai_oauth_recovery.py``:
- primary "do not have an active Grok subscription" phrase
- "out of available resources" + "grok" variant
- unrelated ``_StreamErrorEvent`` must not be reclassified
2026-05-18 10:24:13 -07:00
flamiinngo
5613dfea93 fix(security): redact xAI (Grok) API keys in logs
xAI is a first-class provider in hermes-agent with its own credential
pool entry (XAI_API_KEY / xai-oauth). API keys follow the format
xai-<60+ alphanumeric chars> and were absent from _PREFIX_PATTERNS in
agent/redact.py.

When a key appears raw in log output, tool results, or error messages,
it passed through completely unmasked. The ENV-assignment and Bearer
header patterns catch the most common cases, but a raw token in a
stack trace or debug print had no protection.

Verified before fix:
  redact_sensitive_text("using key xai-ABCD...rstu to call xAI", force=True)
  # "using key xai-ABCD...rstu to call xAI"  <- exposed

After fix:
  # "using key xai-AB...rstu to call xAI"    <- masked

Five unit tests added to TestXaiToken covering bare token masking,
env assignment, short-prefix false positive, company name false
positive, and visible prefix in masked output.
2026-05-18 10:21:22 -07:00
Teknium
1634397ddb
fix(compress): abort instead of dropping messages when summary LLM fails (#28102)
When auxiliary compression's summary generation returns None (aux model
errored, returned non-JSON, timed out, etc.) the compressor previously
still dropped every middle message between compress_start..compress_end
and replaced them with a static 'Summary generation was unavailable'
placeholder. The session kept going but the user silently lost N turns
of context for nothing.

New behavior: on summary failure, compress() aborts entirely — returns
the input messages unchanged and sets _last_compress_aborted=True. The
existing _summary_failure_cooldown_until gate (30-60s) keeps the aux
model from being burned on every turn. Auto-compress callers detect
the no-op (len(after) == len(before)) and stop looping. The chat is
'frozen' at its current size until the next /compress or /new.

Manual /compress (CLI + gateway) now passes force=True which clears
the cooldown so users can retry immediately after an auto-abort. If
the manual retry also fails, the user gets a visible warning telling
them nothing was dropped and how to retry.

- agent/context_compressor.py: compress() gains force= kwarg; failure
  branch sets _last_compress_aborted and returns messages unchanged
  instead of inserting placeholder.
- run_agent.py: _compress_context() detects abort, surfaces warning,
  skips session-rotation entirely, returns messages unchanged.
- cli.py + gateway/run.py: manual /compress paths pass force=True.
- gateway/run.py: hygiene + /compress handlers detect _last_compress_aborted
  and emit the new 'Compression aborted' warning (gateway.compress.aborted)
  instead of the old 'N historical messages were removed' message.
- locales/*.yaml: new gateway.compress.aborted key in all 16 locales.
- tests: updated to assert the abort contract (messages preserved,
  compression_count not incremented, abort flag set, no placeholder
  leaked). New test_force_true_bypasses_failure_cooldown covers the
  manual-retry path.
2026-05-18 10:19:40 -07:00
glennc
9df9816dab feat(azure-foundry): add Microsoft Entra ID auth
Use azure-identity DefaultAzureCredential for keyless Foundry auth.

Preserve refreshable callable credentials through OpenAI and Anthropic client paths.

Add setup, doctor, auth status, docs, and tests for Entra auth.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-18 10:14:38 -07:00
teknium1
9cae9c0166 fix(aux): log sanitizer failures instead of silently swallowing them
Match the warning behavior of the parent main-agent path in
chat_completion_helpers.py — sanitizer failures should be visible
in logs, not silent.
2026-05-18 09:43:44 -07:00
EloquentBrush0x
2fae8fba9c fix(aux): strip pattern/format keywords from tool schemas on xAI Responses path
xAI's /responses endpoint rejects tool schemas that contain pattern or
format JSON Schema keywords with HTTP 400. chat_completion_helpers.py
already strips these for the main-agent xAI/xai-oauth path (lines
294-302), but _CodexCompletionsAdapter.create() — used for every xAI
OAuth auxiliary call (kanban decomposer, profile describer, etc.) —
passed raw tool schemas without sanitization.

MCP tools that carry pattern/format keywords (common for string fields)
silently caused every auxiliary call over xAI OAuth to fail with an
HTTP 400, while the main agent worked fine. Parity fix: call
strip_pattern_and_format() on the tool list before converting to
Responses API format, matching the main-agent guarantee.
2026-05-18 09:43:44 -07:00
teknium
f0c6d59148 fix(anthropic): scope MiniMax beta-strip to MiniMax only
Cherry-pick of @sharziki's #27022 routed Azure Foundry through
_requires_bearer_auth, which also triggered the MiniMax-specific
beta-strip in _common_betas_for_base_url — dropping the 1M-context
beta from Azure even though Azure needs it for 1M context.

Split the strip predicate: introduce _is_minimax_anthropic_endpoint
so the fine-grained-tool-streaming and context-1m strips only fire
for MiniMax hosts, leaving Azure's bearer-auth header swap intact
without losing 1M context.

Also add a regression test that asserts Azure gets Bearer auth,
the api-version query param, and the context-1m-2025-08-07 beta.
2026-05-18 09:27:18 -07:00
sharziki
73407b1e30 fix(auth): send Bearer auth for Azure Foundry anthropic_messages endpoints
Azure AI Foundry's Anthropic-style endpoint requires
`Authorization: Bearer` instead of `x-api-key`.  Add `azure.com` to
`_requires_bearer_auth()` so the existing Bearer path at line 586 fires
before the generic third-party branch sets `api_key` (x-api-key).

Fixes #26970
2026-05-18 09:27:18 -07:00
wysie
ff078738ea fix(skills): load symlinked skill slash commands 2026-05-18 00:34:29 -07:00
Teknium
abf1af5401
feat(session_search): single-shape tool with discovery, scroll, browse — no LLM (#27590)
* feat(session_search): single-shape tool with discovery, scroll, browse — no LLM

Replaces the LLM-summarized session_search with a single-shape tool that
returns actual messages from the DB. Three calling shapes inferred from
args (no mode parameter):

  1. Discovery — pass query. FTS5 + anchored ±5 window + bookends per hit,
     all in one call. ~20ms on a real DB instead of ~90s for the previous
     three aux-LLM calls.
  2. Scroll — pass session_id + around_message_id. Returns a window
     centered on the anchor. To paginate, re-anchor on the first/last id
     of the returned window. Boundary message appears in both windows
     as the orientation marker. ~1ms per scroll call.
  3. Browse — no args. Recent sessions chronologically.

Bookend_start (first 3 user+assistant msgs) and bookend_end (last 3) give
the agent goal + resolution on every discovery hit, so a single tool call
reconstructs a long session's arc without loading the whole transcript.

The aux-LLM summary path is gone: it cost ~$0.30/call, took ~30s, and
laundered FTS5 hits through a model that could confabulate when the right
session wasn't in the hit list. The merged shape returns byte-for-byte
content from SQLite.

History:
- PR #20238 (JabberELF) seeded the fast/summary dual-mode split.
- PR #26419 (yoniebans) expanded to fast/guided/summary with bookends,
  multi-anchor drill-down, default-mode config, and a teaching skill.

This PR collapses that toolkit into one shape with explicit scroll
support, drops the summary path, drops the mode parameter, drops the
config knob, drops the skill. JabberELF's seed work is acknowledged via
the AUTHOR_MAP entry.

Validation:
- 38/38 tool tests pass (tests/tools/test_session_search.py)
- 12/12 get_messages_around tests pass (tests/hermes_state/)
- 11/11 get_anchored_view tests pass (tests/hermes_state/)
- Full tests/tools/ run: 5168 passing, 2 failures pre-exist on main
  (test ordering in test_delegate.py, unrelated)
- E2E against live state DB: discovery 20ms, scroll 1ms, browse 280ms;
  pagination forward+backward works with boundary-message orientation;
  error paths return clean tool_error responses

Co-authored-by: JabberELF <abcdjmm970703@gmail.com>
Co-authored-by: yoniebans <jonny@nousresearch.com>

* chore(session_search): prune dead LLM-summary config and docs

Companion to the single-shape rewrite. The auxiliary.session_search config
block, max_concurrency / extra_body tunables, and matching docs sections
all referenced the removed LLM summarization path. Removing them so users
don't try to tune knobs that nothing reads.

- hermes_cli/config.py: drop dead auxiliary.session_search block from
  DEFAULT_CONFIG. Leftover keys in user config.yaml are harmless and
  ignored.
- hermes_cli/tips.py: drop two tips referencing the removed
  max_concurrency / extra_body knobs.
- website/docs/user-guide/configuration.md: drop 'Session Search Tuning'
  section and the auxiliary.session_search block from the example.
- website/docs/user-guide/features/fallback-providers.md: drop session_search
  rows from the auxiliary-tasks tables and the dedicated tuning subsection.
- website/docs/reference/tools-reference.md: rewrite the session_search
  entry to describe the new three-shape behaviour.
- CONTRIBUTING.md: update the file-tree description.
- tests/tools/test_llm_content_none_guard.py: remove TestSessionSearchContentNone
  class and test_session_search_tool_guarded — both guard against an
  unguarded .content.strip() call site in _summarize_session() that no
  longer exists.

Validation: 97/97 targeted tests still pass (hermes_state + session_search +
llm_content_none_guard). Config tests 55/55.

---------

Co-authored-by: JabberELF <abcdjmm970703@gmail.com>
Co-authored-by: yoniebans <jonny@nousresearch.com>
2026-05-17 23:28:45 -07:00
teknium1
4a3f13b47b perf(prompt-cache): date-only timestamp + loud gateway-DB roundtrip logging
The system prompt's 'Conversation started:' line carried minute precision
(%I:%M %p), making it byte-unstable across every rebuild path. Within a
CLI session the in-memory cache held, but on the gateway path (fresh
AIAgent per turn → restore from session DB), any silent failure in the
read or write path dropped the cache stem and forced a full re-prefill
on every subsequent turn. Local prefix-caching backends (llama.cpp /
vLLM) saw this as KV-cache invalidation; remote prefix-caching providers
saw it as an Anthropic-style cache miss.

Three changes:

1. Date-only timestamp ('Sunday, May 17, 2026' instead of '... 03:42 PM').
   System prompt now byte-stable for the full day. The model can still
   query exact time via tools when it actually needs it. Credit:
   @iamfoz (PR #20451).

2. Loud logging on session DB write failures. The update_system_prompt
   call used to log at DEBUG, hiding disk-full / locked-database / schema
   drift behind a silent fall-through that forced fresh rebuilds on
   every subsequent turn. Now WARN with the session id and exception so
   persistent issues show up in agent.log without verbose mode.

3. Three-way stored-state distinction on read. The previous
   'session_row.get("system_prompt") or None' collapsed three states
   into one (missing row / null column / empty string). Now we tell them
   apart and WARN when a continuing session lands on null/empty (which
   means the previous turn's write never persisted — every subsequent
   turn rebuilds and the prefix cache misses every time).

The restore block is extracted into _restore_or_build_system_prompt()
so the prefix-cache path can be unit-tested in isolation.

E2E proof: fresh AIAgent constructed for turn 2 across a minute-boundary
sleep restores byte-identical bytes from the session DB. NULL stored
prompt fires the new warning. Date-only timestamp survives the rebuild
path. All on real SessionDB, no mocks.

Tests:
  - tests/agent/test_system_prompt_restore.py (10 new tests)
  - tests/run_agent/test_run_agent.py::TestBuildSystemPrompt::
        test_datetime_is_date_only_not_minute_precision

Closes #20451 (date-only), #18547 (prefix stabilization),
#8689 (stabilize timestamp across compression), #15866 (timestamp
caching question), #8687 (compression timestamp), #27339
(claim #3: live timestamp in cached system prompt).

Co-authored-by: Martyn Forryan <9133432+iamfoz@users.noreply.github.com>
2026-05-17 23:20:37 -07:00
Teknium
9b91377bec
feat(grok): apply OpenAI execution guidance to xAI Grok / xai-oauth models (#27797)
Grok models hit the same failure modes that OPENAI_MODEL_EXECUTION_GUIDANCE
addresses for GPT/Codex: claiming completion without tool calls
('to be honest, I didn't create the file yet'), suggesting workarounds
instead of using existing tools (proposing a folder-based memory system
when the memory tool exists), replying with plans instead of executing.

TOOL_USE_ENFORCEMENT_GUIDANCE was already injected for any model whose
name contains 'grok' (TOOL_USE_ENFORCEMENT_MODELS). This extends the
follow-on family-specific block — OPENAI_MODEL_EXECUTION_GUIDANCE
(tool_persistence / mandatory_tool_use / act_dont_ask / prerequisite_checks
/ verification / missing_context) — to grok-named models too.

The OPENAI_ prefix is retained for backwards compat with imports/tests;
docstring + inline comment now note that the body is family-agnostic and
the prefix reflects origin, not exclusivity.

Tests cover the OpenRouter slug (x-ai/grok-4.3) and the xai-oauth bare
name (grok-4.3), plus a negative control on claude.

E2E verified against a real AIAgent build of the system prompt for both
xai-oauth and openrouter grok models.
2026-05-17 23:00:37 -07:00
zccyman
a574246837 feat(auxiliary): add configurable fallback chains + main-agent safety net
Layered fallback for auxiliary tasks (compression, vision, tts, web_extract,
session_search, etc.):

  1. Primary aux provider (existing)
  2. User-configured auxiliary.<task>.fallback_chain (new)
  3. Main agent provider + model (new — last-resort safety net)
  4. Warn user + re-raise original error (new)

For users on 'auto' (no explicit aux provider), the existing
_try_payment_fallback auto-detection chain runs instead — its Step 1
already IS the main agent model, so they get the same behaviour without
configuration.

The configured fallback_chain config schema comes from #26882 / @zccyman;
the main-agent safety net + exhaustion warning were added on top.

Closes #26882. Builds on the capacity-error gate fix in the previous
commit (#26803 / @Bartok9).
2026-05-17 17:15:31 -07:00
Bartok9
24c209f112 fix(auxiliary): detect quota exhaustion as payment error; allow capacity-error fallback for explicit providers
Closes #26803

Root causes:
1. _is_payment_error() checked for billing keywords (credits, insufficient
   funds, billing, payment required) but missed daily token quota exhaustion
   phrases used by Bedrock, Vertex AI, and LiteLLM proxies — e.g.
   'Too many tokens per day', 'quota exceeded', 'resource exhausted',
   'daily limit'. These are functionally identical to credit exhaustion
   (provider cannot serve the request) but don't trigger fallback.

2. The call_llm() fallback chain was gated on resolved_provider == 'auto'.
   When a task resolves to a specific provider (e.g. 'custom' for a LiteLLM
   proxy, or 'openrouter'), capacity failures (payment/quota/connection)
   silently raise instead of trying alternatives. This is overly conservative:
   capacity errors mean the provider *cannot* serve the request regardless of
   user intent, so alternatives should always be tried.

Fixes:
- Add quota-related keywords to _is_payment_error(): quota_exceeded,
  too many tokens per day, daily limit, tokens per day, daily quota,
  resource exhausted (Vertex AI gRPC code).
- Allow fallback for capacity errors (payment + connection) even when
  resolved_provider is not 'auto'. Rate-limit fallback stays gated on
  is_auto to honour explicit provider constraints for transient limits.
- Apply both fixes to sync call_llm() and async acall_llm() paths.
- Add 6 targeted tests for the new quota-error detection cases.
2026-05-17 17:15:31 -07:00
Robin Fernandes
569bc94b59 fix(auth) fix a few cases where refresh tokens were not rotated. 2026-05-17 16:56:37 -07:00
Robin Fernandes
20bffa5b37 refactor(auth): mostly cleanups and style changes 2026-05-17 16:56:37 -07:00
Robin Fernandes
0bac7dd05b refactor(auth): collapse Nous inference fallback controls 2026-05-17 16:56:37 -07:00
Robin Fernandes
89a3d038cf Switch to JWT token for inference against Nous, falling back to old opaque token on failure. 2026-05-17 16:56:37 -07:00
Robin Fernandes
c905562623 fix(auth): stop replaying invalid Nous refresh tokens
Quarantine Nous OAuth state when refresh fails with terminal invalid_grant/invalid_token errors. Clear local and shared refresh material across runtime, managed access-token, proxy, and credential-pool paths so Hermes stops retrying revoked refresh sessions.
2026-05-17 16:56:37 -07:00
teknium1
bdc2113b5c fix(xai): wire schema sanitizer into post-refactor build_api_kwargs
Port of the run_agent.py changes from #27219 to current main: the
_build_api_kwargs body was extracted into agent/chat_completion_helpers.
build_api_kwargs, so wire the xAI tool-schema sanitization there
(provider in {'xai', 'xai-oauth'} or base_url=api.x.ai). Logs a warning
instead of silently swallowing exceptions, matching the contributor's
review-followup fix.

Co-authored-by: zccyman <zccyman@163.com>
2026-05-17 13:13:22 -07:00
teknium1
822e92edb3 fix(aux): default OpenRouter auxiliary to gemini-3-flash-preview 2026-05-17 12:44:48 -07:00
Hoang V. Pham
4a7cd2e16d fix(codex): allow kanban worker board writes 2026-05-17 11:50:43 -07:00
teknium1
55d6a1636b fix(agent): honor provider timeout config in streaming API calls
Closes #25249 (and supersedes PR #25260) in spirit.

Two bugs in the streaming chat-completions path caused provider timeout
configuration to be silently ignored:

1. Hardcoded connect/pool timeout. The httpx.Timeout for streaming
   calls used hardcoded connect=30.0 and pool=30.0 regardless of the
   user's providers.<id>.request_timeout_seconds config. If the custom
   provider (e.g. Ollama) was unreachable, the call always waited
   exactly 30s before failing, ignoring any configured timeout.

   Fix: use min(_base_timeout, 60.0) for connect and pool when a
   provider timeout is configured, falling back to 30.0 otherwise.
   The 60s cap addresses review feedback (TCP handshake shouldn't
   wait the inference timeout — connect/pool cover the connection
   layer, not model latency).

2. Streaming stale-stream detector ignored provider config. The
   stale detector read only HERMES_STREAM_STALE_TIMEOUT (env default
   180s). The providers.<id>.stale_timeout_seconds key (correctly
   used in the non-streaming path) was never consulted.

   Fix: check get_provider_stale_timeout(provider, model) first,
   then fall back to the env var. Aligns the streaming path with
   the non-streaming path's priority chain (config > env > default).

Salvage shape diverged from PR #25260: the function moved to
agent/chat_completion_helpers.py and the contributor's two commits
(initial fix + 60s-cap review follow-up) are squashed into one final
commit applied at the new location.

Original diagnosis, fix shape, AND the 60s-cap review response from
@zccyman in PR #25260; credited via Co-authored-by.

Co-authored-by: zccyman <16263913+zccyman@users.noreply.github.com>
2026-05-17 11:39:37 -07:00
QuenVix
d5a0815c3d fix(transports): use monotonic deadlines in codex app-server turn loop 2026-05-17 11:37:45 -07:00
kshitijk4poor
c74ff2c8ef fix(browser): self-review pass — dead-import, log levels, future-proofing
Addresses findings from two self-review passes pre-merge.

First pass (3-agent parallel review):

1. plugins/browser/browser_use/provider.py: drop the
   ``_ = managed_nous_tools_enabled`` dead-import-hider in
   _get_config_or_none(). The import was actively misleading — the
   helper IS used in _get_config() (separate method, separate import),
   not here. The "keep static analysis happy" comment was wrong about
   what the helper does in this scope.

2. agent/browser_provider.py: drop ``pragma: no cover`` from
   is_configured() / provider_name() backward-compat aliases. They ARE
   covered by ``TestLegacyAbcAliases`` — the pragma would have masked
   future regressions.

3. tools/browser_tool.py: refactor _is_legacy_provider_registry_overridden()
   to compare against a module-frozen _DEFAULT_PROVIDER_REGISTRY snapshot
   instead of hardcoded set of 3 keys. Future maintainers adding a 4th
   built-in provider now just extend _PROVIDER_REGISTRY; the override
   detection adapts automatically. Previously the hardcoded
   ``set(...) != {"browserbase", "browser-use", "firecrawl"}`` would flip
   True forever on any 4-key registry, silently routing every install
   onto the legacy fixture path.

4. tools/browser_tool.py: when explicit ``browser.cloud_provider`` is set
   but the registry has no matching plugin (typo, uninstalled plugin,
   discovery failure), emit a WARNING with actionable text instead of
   silently falling through to auto-detect. Legacy code surfaced a typed
   credentials error via direct class instantiation; this log restores
   the signal in the post-migration path.

5. agent/browser_registry.py: trim the triple-redundant _LEGACY_PREFERENCE
   documentation. Module docstring + 13-line block-comment + 5-line
   inline comment was repeating the same point. Kept the docstring and
   trimmed the block-comment to 5 lines.

6. agent/browser_registry.py: upgrade is_available()-raised logging from
   DEBUG to WARNING with exc_info=True. A provider's availability check
   throwing is unusual enough that users debugging "no cloud provider"
   need the traceback in logs.

7. tests/plugins/browser/check_parity_vs_main.py: drop dead top-level
   imports (os, shutil, tempfile — only referenced inside the
   SUBPROCESS_SCRIPT string literal that runs in a child process).

Second pass (architecture + claim-verification review):

8. tools/browser_tool.py: rewrite the inline comment in _get_cloud_provider
   auto-detect branch. Prior text claimed it "routes through the plugin
   registry's legacy preference walk so third-party plugins still get a
   chance to be selected when they're explicitly configured" — false on
   both counts. The branch uses module-level legacy class aliases
   (BrowserUseProvider / BrowserbaseProvider) directly; third-party
   plugins are intentionally reachable only via explicit
   ``browser.cloud_provider``. Corrected comment now matches behaviour
   and cross-references _LEGACY_PREFERENCE for the firecrawl gate
   rationale.

9. tools/browser_tool.py + tests/tools/test_managed_browserbase_and_modal.py:
   drop the unused ``get_active_browser_provider as
   _registry_get_active_browser_provider`` alias from the
   ``from agent.browser_registry import ...`` block. It was never
   referenced; matching test-stub line in the agent.browser_registry
   SimpleNamespace also dropped. ``get_provider`` is still imported (used
   by the explicit-config dispatch path at line 535).

10. plugins/browser/firecrawl/provider.py: align emergency_cleanup()
    with the early-guard pattern used in browserbase + browser_use
    plugins. Previously firecrawl tried the DELETE and relied on
    ``_headers()`` raising ValueError to trip a "missing credentials"
    warning; same final outcome but a different control flow that read
    like a bug to a maintainer skimming the three modules. Now: if
    is_available() is False, log+return early — identical shape to the
    other two providers.

Verification: 54/54 unit tests + 13/13 parity scenarios still pass.
2026-05-17 04:04:15 -07:00
kshitijk4poor
40fde853fa refactor(browser): dispatch _get_cloud_provider through agent.browser_registry
Switches tools.browser_tool's cloud-provider lookup from the hardcoded
_PROVIDER_REGISTRY class-instantiation pattern to the
agent.browser_registry singleton registry that plugins self-populate.

Changes:

- tools/browser_tool.py top imports: pull BrowserProvider from
  agent.browser_provider (re-exported as CloudBrowserProvider for legacy
  callers) and the three provider classes from plugins/browser/<vendor>/.
  Legacy class names (BrowserbaseProvider, BrowserUseProvider, FirecrawlProvider)
  remain on tools.browser_tool as re-export shims so existing test patches
  (monkeypatch.setattr(browser_tool, 'BrowserUseProvider', ...)) keep working.

- _get_cloud_provider() now consults agent.browser_registry.get_provider()
  for explicit-config lookups. The auto-detect fallback still uses
  BrowserUseProvider() / BrowserbaseProvider() at the module level so the
  cache-policy test fixtures (which patch those names) keep driving the
  function. Test-time _PROVIDER_REGISTRY overrides are detected by class
  identity and routed through the legacy factory-call path.

- agent/browser_provider.py: BrowserProvider grows is_configured() and
  provider_name() as thin backward-compat aliases for the legacy
  CloudBrowserProvider API. Subclasses MUST implement is_available() and
  name; the aliases delegate. This keeps ~6 caller sites in browser_tool.py
  working without churning them.

- tests/tools/test_managed_browserbase_and_modal.py: _install_fake_tools_package
  grows stubs for agent.browser_provider / agent.browser_registry /
  plugins.browser.<vendor>.provider so the test's spec-loader path
  (sys.modules-reset + reload-tool-from-disk) can satisfy tools.browser_tool's
  top-level imports.

Verified: all 23 existing tests in test_browser_cloud_*.py +
test_managed_browserbase_and_modal.py still pass post-cutover.

The legacy tools/browser_providers/ directory is NOT yet deleted; several
tests still _load_tool_module() those files via spec_from_file_location.
The deletion + test-path updates land in a later commit.
2026-05-17 04:04:15 -07:00
kshitijk4poor
a15cdfb050 feat(browser): browser-use + firecrawl plugins; drop single-eligible shortcut
Migrates the remaining two cloud browser providers to plugins:

  plugins/browser/browser_use/    — dual auth (direct BROWSER_USE_API_KEY
                                    or managed Nous gateway), idempotency-
                                    key handling for retried managed-mode
                                    creates, x-external-call-id capture.
  plugins/browser/firecrawl/      — direct FIRECRAWL_API_KEY only;
                                    distinct from plugins/web/firecrawl/
                                    (same key, different endpoint).

Also drops the 'single-eligible shortcut' rule from
agent.browser_registry._resolve(). Was a copy-paste from
web_search_registry that would have introduced a real behavior change:
a user with only FIRECRAWL_API_KEY set (for web-extract) would silently
get routed to a paid Firecrawl cloud browser on a fresh install — not
matching origin/main, which only auto-detected between Browser Use and
Browserbase. Third-party browser plugins are subject to the same gate:
they require explicit `browser.cloud_provider` to take effect.

Verified end-to-end via plugin discovery:
  - 3 plugins register (browser-use, browserbase, firecrawl)
  - _resolve(None) with no creds: None (local mode)
  - _resolve(None) with only FIRECRAWL_API_KEY: None (matches main)
  - _resolve('firecrawl'): firecrawl (explicit wins)
  - _resolve(None) with BU+firecrawl: browser-use (legacy walk first hit)
  - _resolve(None) with all three: browser-use (legacy walk order)
2026-05-17 04:04:15 -07:00
kshitijk4poor
c6e6909e5a feat(browser): add BrowserProvider ABC mirroring web_search_provider template
Foundation commit for the browser-provider plugin migration (#25214).
Mirrors the architecture established by PR #25182 (web providers):

- agent/browser_provider.py — BrowserProvider ABC. Preserves the legacy
  CloudBrowserProvider lifecycle contract bit-for-bit (create_session,
  close_session, emergency_cleanup, session metadata shape) so the
  dispatcher in tools/browser_tool.py becomes a pure registry lookup.
  Renames is_configured() → is_available() for parity with WebSearchProvider.

- agent/browser_registry.py — selection registry with the same
  three-rule resolution as web_search_registry:
    1. Explicit config wins (returns even if is_available() == False so
       the dispatcher surfaces a precise credentials error)
    2. Single-eligible shortcut
    3. Legacy preference walk: browser-use → browserbase, filtered by
       availability. Firecrawl is intentionally NOT in the legacy walk
       (matches pre-migration behaviour — Firecrawl was only reachable
       via explicit browser.cloud_provider: firecrawl).

- hermes_cli/plugins.py — adds ctx.register_browser_provider() facade,
  one-liner mirror of register_web_search_provider().

No plugins registered yet; no dispatcher cutover yet. The next commits
move browserbase/browser-use/firecrawl into plugins/browser/<vendor>/
and switch tools/browser_tool.py over to the registry.
2026-05-17 04:04:15 -07:00
hawknewton
c02606a385 chore(deps): lazy-install boto3/botocore for bedrock adapter
agent/bedrock_adapter.py now calls lazy_deps to install boto3 and
botocore on first import, mirroring how other optional provider
adapters defer their heavy AWS dependencies until actually used.

Keeps the base install slim for users who don't run on Bedrock.
2026-05-17 02:31:18 -07:00
flamiinngo
dbeaaa47f2 refactor(security): extract _block_message helper to unify block logic in _parse_response
Both the `action=block` and `decision=block` branches in _parse_response
shared identical field-priority and type-validation logic. Extract it into
a single _block_message(primary, secondary) helper so the two branches are
one line each and the type guard lives in exactly one place.

No functional change: existing tests (TestParseResponse, 14 tests) all
pass unchanged, confirming identical behaviour.
2026-05-17 02:31:18 -07:00
flamiinngo
63805965e7 fix(security): restore type safety and extract constant in shell hook block handler
Address code review feedback on _parse_response:

1. Restore isinstance(raw, str) guard so non-string message/reason values
   (e.g. integers, lists) from a malformed hook response fall back to the
   default rather than being forwarded as-is. This keeps the contract that
   message in the returned dict is always a string.

2. Extract the repeated literal 'Blocked by shell hook.' into a module-level
   constant _DEFAULT_BLOCK_MESSAGE to avoid duplication and make it easy to
   change in one place.

Four new unit tests added to tests/agent/test_shell_hooks.py covering:
- action block with no message (uses default)
- decision block with no reason (uses default)
- action block with empty string message (uses default)
- action block with non-string message, e.g. integer (uses default)
2026-05-17 02:31:18 -07:00
flamiinngo
aeda146112 fix(security): honor shell hook blocks even when message/reason is absent
_parse_response in agent/shell_hooks.py only forwarded a pre_tool_call
block directive if the hook also provided a non-empty message or reason.
When either field was missing the function returned None, causing Hermes
to treat the response as a no-op and execute the tool unconditionally.

This means a hook that outputs {"action": "block"} or {"decision": "block"}
without a reason string is silently ignored. The security boundary fails
open: tools the user intended to gate are executed anyway.

Fix: remove the message-presence guard. Honor the block unconditionally
and fall back to a default message when none is provided. Existing hooks
that already include a message or reason are unaffected.
2026-05-17 02:31:18 -07:00
haran2001
d9abbe7fa4 fix(metadata): qwen3.6-plus has a 1M context window (#27008)
qwen3.6-plus did not have an explicit entry in DEFAULT_CONTEXT_LENGTHS,
so the longest-substring fallback matched the generic 'qwen': 131072
catch-all. That dropped the effective context limit from 1,048,576
tokens to 131,072, prematurely lowered the compression threshold, and
produced misleading warnings about main/compression context mismatch
in long sessions.

Add an explicit 'qwen3.6-plus': 1048576 entry before the catch-all and
cover it with a regression test (bare, qwen/, and dashscope/ prefixes).

Note: PR #6599 also mentions touching model_metadata.py but the actual
diff only edits hermes_cli/models.py, so this fix is independent and
not duplicated by that PR.

Closes #27008
2026-05-17 02:31:18 -07:00
kshitij
5fba236644
chore: ruff auto-fix PLR6201 resweep — tuple → set in membership tests (#27355)
Six days after #23937 (608 fixes) the codebase had accumulated 241 new
PLR6201 violations. Same mechanical `x in (...)` → `x in {...}` fix,
same zero-risk profile: set lookup is O(1) vs O(n) for tuple and the
two are semantically equivalent for hashable scalar membership tests.

All 241 instances fixed via `ruff check --select PLR6201 --fix
--unsafe-fixes`, zero remaining. Every changed value is a hashable
scalar (str/int/None/enum/signal); no risk of unhashable runtime
errors. No behavior change.

Test plan:
- 119 files changed, +244/-244 (net zero) — exactly one-line edits
- `ruff check` clean afterward
- Compile checks pass on the largest touched files (cli.py, run_agent.py,
  gateway/run.py, gateway/platforms/discord.py, model_tools.py)
- Subset broad test run on tests/gateway/ tests/hermes_cli/ tests/agent/
  tests/tools/: 18187 passed, 59 pre-existing failures (verified against
  origin/main with the same shape — identical failure count, identical
  category — all xdist test-order flakes unrelated to this change)

Follows the same template as PR #23937 ([tracker: #23972](https://github.com/NousResearch/hermes-agent/issues/23972)).
2026-05-17 02:29:41 -07:00
teknium1
563b4d9e51
fix: strip image parts for non-vision models with provider profiles + getattr-safe _custom_providers
Original commit 75e5d0f6b by hueilau targeted _build_api_kwargs in
pre-refactor run_agent.py. The body now lives in
agent/chat_completion_helpers.build_api_kwargs — re-applied there.

Also: switch the custom_providers forward (from 21078ebce) to use
getattr() — tests build a bare AIAgent via __new__ and would otherwise
hit AttributeError on _custom_providers.

Co-authored-by: hueilau <33933019+hueilau@users.noreply.github.com>
2026-05-16 23:47:51 -07:00
teknium1
36ad8336f9
fix(run_agent): guard memory provider init against empty/whitespace string
Original commit 8d756a421 by austrian_guy targeted __init__ in
pre-refactor run_agent.py. The body now lives in
agent/agent_init.init_agent — re-applied there.

Co-authored-by: austrian_guy <33156212+ether-btc@users.noreply.github.com>
2026-05-16 23:43:09 -07:00
teknium1
4ece521bcf
fix(run_agent): isolate background review fork from external memory plugins (#27190)
Original commit 973f27e95 by Teknium targeted _spawn_background_review in
pre-refactor run_agent.py. The body now lives in
agent/background_review._spawn_background_review — re-applied there.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
2026-05-16 23:42:49 -07:00
teknium1
b5bcffe167
fix(fallback): forward custom_providers to fallback model context-length detection
Original commit 21078ebce by PaTTeeL targeted _try_activate_fallback in
pre-refactor run_agent.py. The body now lives in
agent/chat_completion_helpers.try_activate_fallback — re-applied there.

Co-authored-by: PaTTeeL <9150277+PaTTeeL@users.noreply.github.com>
2026-05-16 23:42:16 -07:00
teknium1
4ab9a06a51
fix(agent): reset _fallback_index at turn start even when no fallback activated
Original commit 33528b428 by konsisumer targeted _restore_primary_runtime
in pre-refactor run_agent.py. The body now lives in
agent/agent_runtime_helpers.restore_primary_runtime — re-applied there.

Fixes #20465

Co-authored-by: konsisumer <der@konsi.org>
2026-05-16 23:41:45 -07:00
teknium1
aa05ffba53
fix(xai): surface provider 'error' SSE frame in Codex fallback stream (#27184)
Original commit 2b193907d by Teknium added a new module-level
_StreamErrorEvent class and threaded its raise into
_run_codex_create_stream_fallback in pre-refactor run_agent.py.

  - _StreamErrorEvent class → run_agent.py (module-level, next to
    _qwen_portal_headers; class needs to be top-level for the codex
    runtime to import it)
  - The fallback event-loop's 'type=error' handler → agent/codex_runtime.py
    where run_codex_create_stream_fallback now lives. Imports
    _StreamErrorEvent lazily from run_agent to avoid circular import.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
2026-05-16 23:41:09 -07:00
teknium1
80fa92a491
fix(codex): rotate pool on usage limit 429 — port to extracted modules
Original commit e51d74ab9 by Maxim Esipov targeted _extract_api_error_context
and _recover_with_credential_pool in pre-refactor run_agent.py. Both bodies
now live in agent/agent_runtime_helpers.py — re-applied to that module:

  - extract_api_error_context: payload.get('type') added to the reason
    fallback chain (Codex error bodies use 'type' instead of 'code'/'error')
  - recover_with_credential_pool: usage_limit_reached detection in the
    rate_limit branch — skip the retry-once-then-rotate dance and rotate
    immediately when the body says the per-account usage limit hit.

Co-authored-by: Maxim Esipov <maksesipov@gmail.com>
2026-05-16 23:39:41 -07:00
teknium1
df22d29522
fix(copilot): GitHub Models 413 hint — port to extracted conversation_loop
Original commits 4ded3ede3 (@konsisumer) + 374dc81c2 (Teknium) added a
413 hint to run_agent.py's agent loop. Final-state version (the sharpened
374dc81c2 wording) ported to agent/conversation_loop.py, where the
payload_too_large branch now lives.

The deprecation detection + _URL_TO_PROVIDER changes from both commits
landed in agent/copilot_acp_client.py and agent/model_metadata.py via
the prior merge.

Closes #10648

Co-authored-by: konsisumer <der@konsi.org>
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
2026-05-16 23:38:45 -07:00
teknium1
3fbedd732e
feat: add supports_parallel_tool_calls for MCP servers (#26825) — port to tool_dispatch_helpers
Original commit 395e9dd9e by Teknium targeted module-level _is_mcp_tool_parallel_safe
and _should_parallelize_tool_batch helpers in pre-refactor run_agent.py. Both
helpers now live in agent/tool_dispatch_helpers.py — re-applied to that
module.

The tools/mcp_tool.py portion (the public is_mcp_tool_parallel_safe API
+ _parallel_safe_servers tracking) merged cleanly from main via the prior
merge commit.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
2026-05-16 23:36:37 -07:00
teknium1
fe4c87eb28
fix(agent): retry malformed anthropic stream parser errors — port to extracted modules
Original commit 9c304a7f5 by helix4u targeted _flatten_exception_chain,
_summarize_api_error, and the _call streaming retry loop in pre-refactor
run_agent.py. Re-applied to:

  - New _is_provider_stream_parse_error helper → run_agent.py (next
    to _flatten_exception_chain in the AIAgent class)
  - _summarize_api_error early-return for the malformed-streaming
    ValueError → run_agent.py (kept method body)
  - _call streaming retry: _is_stream_parse_err flag wired into
    _is_transient AND the post-exhaustion branch + dedicated
    malformed-streaming user-status string → agent/chat_completion_helpers.py
    (the _call body now lives there)

Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>
2026-05-16 23:35:54 -07:00
teknium1
f885be030c
fix(auxiliary): resolve xai oauth compression from pool — port to conversation_compression
Original commit 97a32afdc by helix4u targeted _check_compression_model_feasibility
in pre-refactor run_agent.py. The function body now lives in
agent/conversation_compression.py — re-applied the configured-but-unavailable
provider message there.

Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>
2026-05-16 23:33:59 -07:00
teknium1
6975a2d9ae
fix(xai-oauth): entitlement-403 chain — final state (ce0e189d3 + 9818b9a1a + 6784c8079 + dffb602f3)
Collapses the four-commit xAI entitlement-403 chain to its final
on-main state, ported to the post-refactor module layout:

  - Added _is_entitlement_failure on AIAgent (run_agent.py) — detects
    Grok subscription-shape 403s on (401|403|None) status codes.
  - Added entitlement-skip branch to recover_with_credential_pool
    (agent/agent_runtime_helpers.py) — breaks the refresh-loop that
    Don's 100-iteration trace exposed when a Premium+ user hit a real
    entitlement issue.
  - Removed _decorate_xai_entitlement_error and unwrapped its two
    _summarize_api_error call sites — xAI's own body text already
    points users at grok.com/?_s=usage so we surface that verbatim
    (dffb602f3 reasoning: X Premium subs DO now work per xAI's
    2026-05-16 announcement, so editorialising would misdirect).
  - grok-4.3 1M context entry landed in agent/model_metadata.py
    via the prior merge — no additional port needed.

Tests already on disk (tests/run_agent/test_codex_xai_oauth_recovery.py)
assert _is_entitlement_failure shape and verbatim body surfacing.

Closes #27110.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
2026-05-16 23:33:18 -07:00
teknium1
6362e71973
fix(xai-oauth): recover from prelude SSE errors, gate reasoning replay, surface entitlement 403s
Original commit 31ba2b0cb by Teknium targeted run_codex_stream() at
its pre-refactor location in run_agent.py. Re-applied:

  - Prelude error retry/fallback → agent/codex_runtime.py (in
    run_codex_stream where the body now lives)
  - _decorate_xai_entitlement_error helper + _summarize_api_error
    wrapping → run_agent.py (these methods remained on AIAgent
    as @staticmethod's; cherry-pick applied them cleanly)

The xai-oauth provider gate, encrypted_content drop on replay, etc.
landed in agent/codex_responses_adapter.py via the prior merge from main.

Closes #8133, #14634

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
2026-05-16 23:28:05 -07:00
teknium1
27df249564
feat(nvidia): add NIM billing origin header — port to extracted modules
Original commit 13c3d4b4e by kchantharuan touched __init__ and
_apply_client_headers_for_base_url in pre-refactor run_agent.py. Re-applied to:

  - __init__: agent/agent_init.py (3 hunks — NVIDIA branch + _custom_headers
    fallback in routed-client and fallback-client paths)
  - _apply_client_headers_for_base_url: still in run_agent.py (1 hunk)

build_nvidia_nim_headers was already present in agent/auxiliary_client.py
from the prior merge — no additional port needed.

Co-authored-by: kchantharuan <kchantharuan@nvidia.com>
2026-05-16 23:25:11 -07:00
teknium1
b07524e53a
feat(xai-oauth): add xAI Grok OAuth (SuperGrok Subscription) provider — port to extracted modules
Original commit b62c99797 by Jaaneek targeted six locations in
pre-refactor run_agent.py. Re-applied to the extracted post-PR locations:

  - api_mode dispatch → agent/agent_init.py
  - is_xai_responses build_api_kwargs → agent/chat_completion_helpers.py
  - codex_auth_retry block + 401 hint → agent/conversation_loop.py
  - _try_refresh_codex_client_credentials body → run_agent.py (kept)

The non-run_agent.py portions of the commit (auxiliary_client, codex
transport, hermes_cli/auth, tools/xai_http, tests, docs) merged cleanly
from main via the prior merge commit.

Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
2026-05-16 23:23:38 -07:00
teknium1
7d221aa1f2
fix(langfuse): complete observability fix — port to extracted conversation_loop
Original commit db84a78e6 by kshitij targeted run_conversation()'s
pre_api_request and post_api_request hooks in pre-refactor run_agent.py.
Re-applied to the extracted location in agent/conversation_loop.py.

Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>
Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: Brian Conklin <brian@dralth.com>
2026-05-16 23:21:51 -07:00
teknium1
a77ca9295e
perf(run_agent): accumulate length-continuation prefix via list+join
Original commit 4f8aaf104 by InB4DevOps targeted run_conversation() in
the pre-refactor run_agent.py. Re-applied to the extracted location in
agent/conversation_loop.py.

Co-authored-by: InB4DevOps <tolle.lege+github@gmail.com>
2026-05-16 23:20:27 -07:00
teknium1
152d42d1a7
Merge origin/main into pr-27248 (resolving run_agent.py = ours)
run_agent.py taken from HEAD (the extracted forwarder structure). The 25
run_agent.py fixes that landed on main during the PR's life need to be
ported into the agent/* extracted modules in follow-up commits.
2026-05-16 23:16:52 -07:00
phoenixshen
52c89715a2 fix: respect user-configured vision model for OpenRouter
_OPENROUTER_MODEL hardcoded 'google/gemini-3-flash-preview' which
returns 404 on OpenRouter, breaking all vision tasks for users who
rely on the OpenRouter default.  Additionally, _try_openrouter()
ignored the user-configured auxiliary.vision.model entirely.

Changes:
- Update _OPENROUTER_MODEL default to google/gemini-2.5-flash (valid)
- Add optional 'model' parameter to _try_openrouter()
- Pass configured model from _resolve_strict_vision_backend() through
  to _try_openrouter()

This allows users who set auxiliary.vision.model (e.g. x-ai/grok-4.3)
to have it actually used, while maintaining backward compatibility.
2026-05-16 23:11:43 -07:00
zccyman
b389796ae3 fix(auxiliary): resolve api_key_env alias in named custom provider path of resolve_provider_client
In resolve_provider_client(), the named custom provider code path at
~line 2914 only checked the ``key_env`` field when looking for an
environment-variable-based API key. The documented ``api_key_env``
snake_case alias was silently ignored, causing custom providers
configured with ``api_key_env`` to fall through to the
``no-key-required`` placeholder — which produces a confusing 401
(``****ired`` mask) on auth-required remote endpoints.

This mirrors the same fix already applied to run_agent.py in commit
6ddc48b05 (fix(fallback): resolve api_key_env in fallback chain entries).

Also adds a logger.warning() when the placeholder is reached, so
future alias gaps are easier to debug.

Closes #25091
2026-05-16 23:11:43 -07:00
teknium1
47823790b0
refactor(run_agent): review fixes — keyword-forward __init__, drop dead code, tighten guards
Four fixes from PR #27248 review:

1. **__init__ forwarder is now keyword-forwarded** (daimon-nous review).
   Previously the run_agent.AIAgent.__init__ wrapper forwarded all 64
   params positionally to agent.agent_init.init_agent, so adding a
   65th param on main would require three lockstep edits (signature,
   init_agent signature, forwarder call) or silently shift every value.
   Keyword forwarding makes this trivially safe — adding a param now
   only needs the two signatures and one extra keyword line.

2. **Drop dead _ra() in agent/codex_runtime.py** (daimon-nous + Copilot).
   The lazy run_agent reference was defined but never called inside
   this module — the codex paths use agent.* accessors only.

3. **Drop unused imports in agent/codex_runtime.py** (Copilot):
   contextvars, threading, time, uuid, Optional. Carried over from
   run_agent.py during the original extraction.

4. **Tighten three source-introspection test guards** (Copilot):
   - test_memory_nudge_counter_hydration.py — was scanning the
     concatenated source of run_agent.py + agent/conversation_loop.py
     and matching self.X or agent.X form.  Now asserts the
     hydration block lives in agent/conversation_loop.py specifically
     with the agent.X form — the body never moves back, so if it
     ever drifts a future re-introduction fails the guard.
   - test_run_agent.py::TestMemoryNudgeCounterPersistence — anchor on
     agent.iteration_budget = IterationBudget exactly (was just
     iteration_budget = IterationBudget) so an unrelated identifier
     ending in iteration_budget can't match.
   - test_run_agent.py::TestMemoryProviderTurnStart — assert the
     agent._user_turn_count form directly (the extracted body uses
     agent.X, not self.X — accepting either was a transitional fudge).
   - test_jsondecodeerror_retryable.py — scan agent/conversation_loop.py
     only, not the concatenation.

Not addressed in this commit:

* Pre-existing bugs in agent/tool_executor.py (heartbeat index
  mismatch when calls are blocked, _current_tool clobber in result
  loop, blocked-counted-as-completed in spinner summary, dead
  result_preview computation). These were preserved byte-for-byte from
  the original _execute_tool_calls_concurrent — worth a separate
  follow-up PR with proper tests.
* _OpenAIProxy.__instancecheck__ concern — pre-existing, not flagged
  by any of the original test patches (nothing actually does
  isinstance(x, OpenAI) against the proxy instance).
* agent_init.py:949 mem_config potential NameError — pre-existing;
  only triggers if _agent_cfg.get('memory', {}) itself raises, which
  it can't with a stock dict.

tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing
test_auxiliary_client failure (unchanged).

run_agent.py: 3821 -> 3937 lines (+116 from the keyword-forwarded
init call's verbosity).  Final: 16083 -> 3937 (-12146, 75% reduction).
2026-05-16 22:55:49 -07:00
shellybotmoyer
1a4e64ba06 fix(credential_pool): parse ISO-string last_status_at during from_dict rehydration (#25516) 2026-05-16 22:54:22 -07:00
0xchainer
4b17c2411a fix(skills): return None instead of truthy stub when skill load fails
build_skill_invocation_message() returns a non-empty placeholder string
('[Failed to load skill: ...]') when the skill exists in the command cache
but loading the actual SKILL.md payload fails. CLI/gateway callers treat
any truthy return value as success, so the failure is silently routed into
the model as if it were a valid skill prompt.

Return None instead, matching the existing behavior for unknown commands,
so callers using 'if msg:' can properly detect the failure.
2026-05-16 22:52:22 -07:00
teknium1
94c3e0ab8e
refactor(run_agent): extract 10 more helpers to agent/agent_runtime_helpers.py
Final extraction pass — the methods left over after run_conversation
and __init__ moved out. Together these 10 cover ~813 LOC of medium-
sized helpers:

* switch_model (194 LOC) — model switching mid-session
* _invoke_tool (87) — central tool dispatch with overrides
* _repair_tool_call (72) — argument JSON repair entrypoint
* _sanitize_api_messages (71) — role-filter for API send
* _looks_like_codex_intermediate_ack (72) — codex transcript heuristic
* _copy_reasoning_content_for_api (70) — reasoning preservation
* _cleanup_dead_connections (70) — periodic dead-socket sweep
* _extract_api_error_context (65) — error-dump context builder
* _apply_pending_steer_to_tool_results (63) — /steer injection
* _force_close_tcp_sockets (59) — aggressive socket cleanup

AIAgent keeps thin forwarder methods for all 10 (staticmethods preserved
where present). Names tests patch on run_agent (handle_function_call,
AIAgent class attrs, logger) routed through _ra() so the patch surface
is preserved.

tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure as on main).

run_agent.py: 4634 -> 3821 lines (-813).
Final total: 16083 -> 3821 (-12262, 76% reduction).
2026-05-16 20:35:19 -07:00
teknium1
9f408989c4
refactor(run_agent): extract __init__ (1,381 LOC) to agent/agent_init.py
The largest method left on AIAgent (60+ parameters, the entire startup
sequence — credential resolution, provider auto-detection, context
engine bootstrap, memory store hydration, plugin lifecycle hooks)
moves into agent/agent_init.py.

AIAgent.__init__ is now a thin wrapper that calls
agent.agent_init.init_agent(self, ...) with the original full
parameter list preserved.

Module-level run_agent names referenced in the body (_openrouter_prewarm_done,
_qwen_portal_headers, _routermint_headers, _hermes_home, OpenAI,
get_tool_definitions, check_toolset_requirements) are resolved through
_ra() so test patches on those names keep working.  agent_init's logger
warnings are routed via _ra().logger so tests patching run_agent.logger
capture them (TestStringKSuffixContextLengthWarns,
TestCustomProvidersInvalidContextLengthWarns).

Live E2E reconfirmed on three model paths (openai/gpt-5.4,
anthropic/claude-sonnet-4.6, moonshotai/kimi-k2-thinking).

tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure).

run_agent.py: 5944 -> 4564 lines (-1380).
Total reduction since baseline: 16083 -> 4564 (-11519, 72%).
2026-05-16 19:43:38 -07:00
teknium1
0530252384
refactor(run_agent): extract run_conversation to agent/conversation_loop.py
The 3,877-line run_conversation body — the agent loop itself — moves out
of run_agent.py into a dedicated module.  AIAgent.run_conversation is
now a thin forwarder that delegates to agent.conversation_loop.run_conversation
with the AIAgent instance as the first argument.

This is the largest single extraction in the run_agent.py refactor.
The body keeps all 163 self.X references intact (rewritten as agent.X),
all nested closures, all retry/backoff/compression machinery.  Symbols
that tests or callers patch on run_agent (_set_interrupt,
handle_function_call, AIAgent class attrs) are resolved through _ra()
inside the extracted module so the patch surface is preserved.

Five tests doing inspect.getsource(AIAgent.run_conversation) updated to
scan agent.conversation_loop.run_conversation. Two source-introspection
tests (TestMemoryNudgeCounterPersistence, TestMemoryProviderTurnStart)
updated to accept either self.X (legacy) or agent.X (extracted
form) in the matched assertions.

Live E2E verified on three model paths:
  * openai/gpt-5.4 (OpenAI chat completions via OpenRouter)
  * anthropic/claude-sonnet-4.6 (Anthropic Messages via OpenRouter)
  * moonshotai/kimi-k2-thinking (reasoning model, reasoning_content path)
Plus read_file tool execution, terminal tool, web_search.

tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing failure
(test_auxiliary_client::test_custom_endpoint... — same as on main).

run_agent.py: 9800 -> 5944 lines (-3856).
Total reduction since baseline: 16083 -> 5944 (-10139, 63%).
2026-05-16 19:26:52 -07:00
teknium1
d35ee7bcdd
refactor(run_agent): move review prompts to agent/background_review.py
The three big review-prompt strings (_MEMORY_REVIEW_PROMPT,
_SKILL_REVIEW_PROMPT, _COMBINED_REVIEW_PROMPT — 183 lines combined) move
out of the AIAgent class body and into agent/background_review.py where
they're consumed.

AIAgent re-exposes them as class attributes via 'from ... import' inside
the class body — Python binds those names into the class namespace so
existing AIAgent._MEMORY_REVIEW_PROMPT references keep working.
spawn_background_review_thread also falls back to the module-level
constants if an agent doesn't have the attribute (preserves the test
pattern of mocking these on the agent).

tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure).

run_agent.py: 9986 -> 9800 lines (-186).
2026-05-16 19:11:58 -07:00
teknium1
c42fa94afc
refactor(run_agent): extract Codex runtime + assorted helpers to dedicated modules
Two new modules:

* agent/codex_runtime.py — three Codex API-mode methods
  - run_codex_app_server_turn (148 LOC) — Codex CLI subprocess driver
  - run_codex_stream (125 LOC) — Codex Responses API stream
  - run_codex_create_stream_fallback (78 LOC) — fallback after Responses
    stream=true initial create failure

* agent/agent_runtime_helpers.py — twelve assorted AIAgent helpers
  totalling ~1,166 LOC: convert_to_trajectory_format, sanitize_tool_call_arguments
  (static), repair_message_sequence, strip_think_blocks,
  recover_with_credential_pool, try_recover_primary_transport,
  drop_thinking_only_and_merge_users (static), restore_primary_runtime,
  extract_reasoning, dump_api_request_debug,
  anthropic_prompt_cache_policy, create_openai_client

AIAgent keeps thin forwarder methods for all 15 (preserving @staticmethod
where needed). Symbols tests patch on run_agent (OpenAI, AIAgent class
attrs) are routed through _ra() to honor the patch contract. The
_TRANSIENT_TRANSPORT_ERRORS frozenset moves with try_recover_primary_transport
and is referenced as a module-level constant in the extracted code.

tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure).

run_agent.py: 11391 -> 9887 lines (-1504).
2026-05-16 19:03:30 -07:00
teknium1
0430e71ec9
refactor(run_agent): extract streaming API caller (893 LOC) to agent/chat_completion_helpers.py
Move _interruptible_streaming_api_call out of run_agent.py — the biggest
single method in the file.  Body lives next to interruptible_api_call
in agent/chat_completion_helpers.py so streaming + non-streaming code
share one home.

Nested closures (_call_chat_completions, _call_anthropic, the codex
stream branch) all come along with the body and still capture the
parent function's locals as expected.

AIAgent keeps a thin forwarder method.  is_local_endpoint added to
the import block (used by the stream stale-timeout disable logic).

One source-introspection test in TestAnthropicInterruptHandler is
updated to scan agent.chat_completion_helpers.interruptible_streaming_api_call
instead of AIAgent._interruptible_streaming_api_call.

tests/run_agent/ + tests/agent/: 4312 passed (same pre-existing
test_auxiliary_client failure).

run_agent.py: 12277 -> 11385 lines (-892).
2026-05-16 18:48:22 -07:00
teknium1
4b25619bc4
refactor(run_agent): extract chat-completion helpers to agent/chat_completion_helpers.py
Six methods move into a new module — bodies live there, AIAgent keeps
thin forwarder methods so call sites and tests are unchanged.

* interruptible_api_call — non-streaming API call with interrupt handling
* build_api_kwargs — assemble OpenAI / Anthropic / Codex / Bedrock request kwargs
* build_assistant_message — normalize assistant message dict (reasoning,
  tool_calls, codex passthrough fields, alibaba glm-4.7 quirk)
* try_activate_fallback — provider fallback chain activation
* handle_max_iterations — controlled stop when iteration budget exhausts
* cleanup_task_resources — per-turn VM + browser teardown (skipped for
  persistent environments)

Names tests patch on run_agent (cleanup_vm, cleanup_browser) are routed
through _ra() so the patch surface is preserved.

Two TestAnthropicInterruptHandler source-introspection tests were
updated to scan agent.chat_completion_helpers.interruptible_api_call
instead of AIAgent._interruptible_api_call — the body lives in the
extracted module now.

tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure).

run_agent.py: 13282 -> 12253 lines (-1029).
2026-05-16 18:41:44 -07:00
teknium1
57f6762ca0
refactor(run_agent): extract stream diagnostics to agent/stream_diag.py
Move the five stream-drop diagnostic helpers + the headers tuple:

* STREAM_DIAG_HEADERS — cf-ray, x-openrouter-provider, x-request-id, etc.
* stream_diag_init — fresh per-attempt diagnostic dict
* stream_diag_capture_response — snapshot upstream headers + HTTP status
* flatten_exception_chain — compact Outer(msg) <- Inner(msg) rendering
* log_stream_retry — structured WARNING with provider/bytes/elapsed/ttfb
* emit_stream_drop — user-facing status line + activity touch

AIAgent keeps thin forwarder methods (and exposes the headers tuple as
_STREAM_DIAG_HEADERS for back-compat).  All test patches and call sites
unchanged.

tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure).

run_agent.py: 13470 -> 13227 lines (-243).
2026-05-16 18:28:17 -07:00
teknium1
79559214a6
refactor(run_agent): extract tool execution to agent/tool_executor.py
Move the two big tool-dispatch methods out of run_agent.py:

* execute_tool_calls_concurrent — 408-line concurrent path (interrupt
  pre-flight, guardrail+plugin block, callback fan-out, ContextVar-
  preserving ThreadPoolExecutor, periodic heartbeats for the gateway
  inactivity monitor, per-tool result handling with subdir hints +
  guardrail observations + checkpoint, /steer drain)
* execute_tool_calls_sequential — 441-line sequential path (the
  original behavior used for single-tool batches and interactive
  tools)

Both take the parent AIAgent as their first argument; AIAgent keeps
thin forwarders so call sites unchanged. handle_function_call is
routed through _ra() so tests that patch run_agent.handle_function_call
keep working. _set_interrupt likewise.

The AST guard in test_tool_executor_contextvar_propagation.py is
updated to scan both run_agent.py AND agent/tool_executor.py so it
still catches the executor.submit(_run_tool, ...) regression
regardless of which file the body lives in.

tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure as before).

run_agent.py: 14309 -> 13461 lines (-848).
2026-05-16 18:24:05 -07:00
teknium1
2d2cd5e904
refactor(run_agent): extract system-prompt builder to agent/system_prompt.py
Four AIAgent methods move into a dedicated module:

* build_system_prompt_parts — three-tier stable/context/volatile dict
* build_system_prompt        — joiner used at session start
* invalidate_system_prompt   — drop cache + reload memory
* format_tools_for_system_message — trajectory-format tool dump

The extracted helpers look up patch-target names (load_soul_md,
build_skills_system_prompt, get_toolset_for_tool, build_environment_hints,
build_context_files_prompt, build_nous_subscription_prompt) through the
run_agent module via _ra() instead of importing them directly.  That
preserves the patch surface tests rely on
(patch('run_agent.load_soul_md', ...) and friends).

AIAgent keeps thin forwarder methods.

tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure as before).

run_agent.py: 14555 -> 14292 lines (-263).
2026-05-16 18:16:20 -07:00
teknium1
5311d9959e
refactor(run_agent): extract context compression to agent/conversation_compression.py
Move four compression-related methods to a dedicated module:

* check_compression_model_feasibility — startup probe + auto-lowered threshold + hard floor
* replay_compression_warning — re-emit stored warning through gateway status_callback
* compress_context — run compressor, split SQLite session, notify plugins+memory
* try_shrink_image_parts_in_messages — image-too-large recovery via re-encode

AIAgent keeps thin forwarder methods so existing call sites and tests
that patch run_agent.AIAgent methods keep working.

tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure as before).

run_agent.py: 15013 -> 14535 lines (-478).
2026-05-16 18:09:33 -07:00
teknium1
1f6eb1738c
refactor(run_agent): extract background memory/skill review to agent/background_review.py
Move the background-review subsystem (the self-improvement loop — see the
README) out of run_agent.py into a dedicated module.

* summarize_background_review_actions — was the @staticmethod that builds
  the user-facing action summary
* spawn_background_review_thread — builds the thread target + prompt;
  the actual review loop body (forked AIAgent, runtime inheritance,
  tool whitelist, suppression, teardown) lives in _run_review_in_thread
* build_memory_write_metadata — provenance for external memory mirrors

AIAgent keeps thin wrappers for backward compatibility AND because tests
patch run_agent.threading.Thread to assert lifecycle behavior — the
threading.Thread construction stays in AIAgent._spawn_background_review,
the inner work moves out.

tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing failure
(test_auxiliary_client.py::test_custom_endpoint... — confirmed failing
on main before this change). 3 skipped.

run_agent.py: 15272 -> 14972 lines (-300).
2026-05-16 18:05:01 -07:00
teknium1
5f309ae685
refactor(run_agent): extract OpenAI proxy, safe stdio, IterationBudget
Three small extractions into focused modules:

* agent/process_bootstrap.py — \_OpenAIProxy (lazy openai.OpenAI import),
  \_SafeWriter (broken-pipe-resistant stdio wrapper), \_install_safe_stdio,
  \_get_proxy_from_env, \_get_proxy_for_base_url. All process / IO bootstrap.
* agent/iteration_budget.py — IterationBudget class (thread-safe consume/
  refund counter shared by parent agent and subagents).

run_agent re-exports every name so existing test patches like
patch('run_agent.OpenAI', ...) and 'from run_agent import IterationBudget'
keep working unchanged.  Verified the patch-rebinding contract for OpenAI
explicitly.

tests/run_agent/ + tests/agent/test_gemini_fast_fallback.py:
1347 passed, 3 skipped.
run_agent.py: 15427 -> 15261 lines (-166).
2026-05-16 17:59:32 -07:00
teknium1
59f1c0f0b6
refactor(run_agent): extract tool-dispatch helpers to agent/tool_dispatch_helpers.py
Pull module-level helpers used by the tool-execution path out of
run_agent.py:

* parallelism gating — _NEVER_PARALLEL_TOOLS, _PARALLEL_SAFE_TOOLS,
  _PATH_SCOPED_TOOLS, _DESTRUCTIVE_PATTERNS, _REDIRECT_OVERWRITE,
  _is_destructive_command, _should_parallelize_tool_batch,
  _extract_parallel_scope_path, _paths_overlap
* multimodal envelopes — _is_multimodal_tool_result,
  _multimodal_text_summary, _append_subdir_hint_to_multimodal
* file-mutation verifier inputs — _extract_file_mutation_targets,
  _extract_error_preview
* trajectory normalization — _trajectory_normalize_msg

All pure functions. run_agent re-exports every name so existing
'from run_agent import _is_multimodal_tool_result' callers in
tests/tools/, tests/run_agent/, and tools/file_state.py keep working.

tests/run_agent/: 1341 passed, 3 skipped.
run_agent.py: 15682 -> 15427 lines (-255).
2026-05-16 17:54:26 -07:00
teknium1
885d1242a2
refactor(run_agent): extract message sanitization to agent/message_sanitization.py
Pull the 10 pure sanitization/repair helpers (\_sanitize_surrogates,
\_sanitize_structure_surrogates, \_sanitize_messages_surrogates,
\_escape_invalid_chars_in_json_strings, \_repair_tool_call_arguments,
\_strip_non_ascii, \_sanitize_messages_non_ascii, \_sanitize_tools_non_ascii,
\_strip_images_from_messages, \_sanitize_structure_non_ascii) and the
\_SURROGATE_RE constant out of run_agent.py into a new module.

These are stateless byte-walking helpers with no AIAgent dependency.

Backward compatibility: run_agent re-exports every name via a single
import block, so existing 'from run_agent import _sanitize_surrogates'
imports in tests and cli.py keep working unchanged. Same pattern the
file already uses for _summarize_user_message_for_log (codex_responses_adapter).

run_agent.py: 16077 -> 15682 lines (-395).
2026-05-16 17:41:09 -07:00
Teknium
3b39096904
Port from Kilo-Org/kilocode#9434: strip historical media after compression (#27189)
After context compression, the protected tail messages retain their
original image parts. When those include multi-MB pasted screenshots,
every subsequent API request re-ships the same base-64 blobs forever —
which can push the request past provider body-size limits and wedge the
session even though compression 'succeeded'.

Add _strip_historical_media() to agent/context_compressor.py. After the
summary is built, find the newest user message that carries an image
part and replace image parts in every earlier message with a short
text placeholder ('[Attached image — stripped after compression]').
The newest image-bearing user turn keeps its media so the model can
still analyse what the user just sent.

Handles all three multimodal shapes:
  - OpenAI chat.completions image_url
  - OpenAI Responses API input_image
  - Anthropic native {type: image, source: ...}

Includes 27 unit tests covering the helpers and the end-to-end
compress() integration, plus a manual E2E check confirming a ~4MB
two-image conversation shrinks to ~2MB after compression.
2026-05-16 17:18:25 -07:00
Teknium
93e109a1d5
fix(moonshot): strip $ref siblings and collapse tuple items in tool schemas (#27104)
Port from anomalyco/opencode#24730: Moonshot's JSON Schema validator rejects
two shapes that the rest of the JSON Schema ecosystem accepts:

1. $ref nodes with sibling keywords. Moonshot expands the reference before
   validation and then rejects the node if keys like `description`, `type`,
   or `default` appear alongside $ref. MCP-sourced tool schemas commonly
   put a `description` on $ref-typed properties so the model sees the
   field hint — which worked on every provider except Moonshot.

2. Tuple-style `items` arrays (positional element schemas). Moonshot's
   engine requires ONE schema applied to every array element. Common in
   tool schemas generated from Go/Protobuf that model fixed-length arrays
   as `[{type:number}, {type:number}]`.

Repairs applied in `agent/moonshot_schema.py`:

- Rule 3: when a node has `$ref`, return `{"$ref": <value>}` only
  (strip every sibling). The referenced definition still carries its own
  description on the target node, which Moonshot accepts.
- Rule 4: when `items` is a list, collapse to the first element schema
  (falling back to `{}` which is then filled by the generic missing-type
  rule). Preserves `minItems` / `maxItems` / other siblings.

Tests: 10 new cases across TestRefSiblingStripping + TestTupleItems,
plus the existing TestMissingTypeFilled::test_ref_node_is_not_given_synthetic_type
still passes (it asserted plain $ref passes through; now it passes through
as exactly `{"$ref": "..."}` which is strictly compatible).

All 35 tests in test_moonshot_schema.py pass.
2026-05-16 13:02:19 -07:00
JunghwanNA
345821b4a1 style: move secrets import alongside other function-level imports
Group the secrets import with time and webbrowser at the top of
run_hermes_oauth_login_pure(), matching the existing pattern.
Drop the _secrets alias — no name conflict in this scope.
2026-05-16 02:38:02 -07:00
JunghwanNA
fcd9011f8d fix(security): separate OAuth PKCE state from code_verifier
The PKCE flow reused the code_verifier as the OAuth state parameter.
Per RFC 6749 §10.12 and RFC 7636, these serve different purposes:
state is an anti-CSRF token visible in the authorization URL; the
code_verifier must remain secret for the token exchange.

Generate an independent secrets.token_urlsafe(32) for state and
validate it on callback to provide actual CSRF protection.

Closes #10693
2026-05-16 02:38:02 -07:00
teknium1
374dc81c23 fix(copilot-acp): tighten deprecation detection + sharpen GitHub Models 413 hint
Follow-up improvements on top of @konsisumer's cherry-picked fix for #10648:

1. Deprecation patterns required BOTH a product fingerprint ('gh-copilot') and
   a deprecation marker. The previous list included 'copilot-cli' and bare
   'deprecation', which would false-positive on stderr from the NEW
   @github/copilot CLI — whose repo is literally github.com/github/copilot-cli
   and which legitimately surfaces those substrings in its own messages.

2. Replace the deprecation hint. The user in #10648 installed
   'gh extension install github/gh-copilot' (the deprecated extension)
   thinking that's what ACP mode uses, when ACP actually spawns the new
   'copilot' binary from '@github/copilot'. The hint now points users at the
   correct install command ('npm install -g @github/copilot') with the new
   CLI's repo URL, and demotes provider-switching to a fallback alternative.

3. Change _URL_TO_PROVIDER value for models.inference.ai.azure.com from the
   'github-models' alias to the canonical 'copilot' provider id, matching the
   convention used by every other entry in the table.

4. Sharpen the 413 hint message. The free tier's ~8K cap is below the
   system-prompt floor, so this endpoint is fundamentally incompatible with
   an agentic loop — not a 'use a different URL' problem.

Tests:
- New parametrized false-positive coverage for the new CLI's stderr shape.
- Updated assertion to require canonical 'copilot' provider mapping.
- All 14 deprecation/URL tests pass.
2026-05-16 02:24:48 -07:00
konsisumer
4ded3ede33 fix: detect gh-copilot deprecation and improve GitHub Models 413 errors (#10648)
Address two blocking issues when using GitHub Copilot integrations:

1. ACP mode: detect the gh-copilot CLI deprecation error from stderr
   and surface an actionable message with alternatives instead of
   hanging or showing a cryptic error.

2. GitHub Models (Azure) 413: recognize models.inference.ai.azure.com
   as a known GitHub Models URL, and print a targeted hint explaining
   the hard 8K token limit that makes this endpoint incompatible with
   Hermes' system prompt size.
2026-05-16 02:24:48 -07:00
helix4u
97a32afdc4 fix(auxiliary): resolve xai oauth compression from pool 2026-05-15 19:53:37 -07:00
Teknium
ce0e189d3e
fix(xai-oauth): break entitlement-403 credential-refresh loop, bump grok-4.3 context to 1M (#26664)
Don Piedro's 18-minute hang on grok-4.3 traced to two issues PR #26644
didn't cover:

- _recover_with_credential_pool classifies 403 as FailoverReason.auth
  and calls pool.try_refresh_current().  For xAI OAuth on an
  unsubscribed account, refresh succeeds (mints a new token from the
  same account) but the next API call 403s with the same entitlement
  error.  Result: infinite refresh → retry → 403 loop until Ctrl+C
  (1133s in Don's log).  New _is_entitlement_failure(error_context,
  status_code) detects the subscription-shape body ("do not have an
  active Grok subscription" / "out of available resources" + grok /
  "does not have permission" + grok) and short-circuits recovery so
  _summarize_api_error surfaces PR #26644's friendly hint.

- grok-4.3 resolved to 256k via the grok-4 catch-all in
  DEFAULT_CONTEXT_LENGTHS.  Per docs.x.ai/developers/models/grok-4.3
  the model ships with 1M context.  Add explicit grok-4.3 entry
  before the grok-4 fallback (longest-first substring matching
  ensures grok-4.3 and grok-4.3-latest both land on the new value).

Tests: 8 new (23 total in test_codex_xai_oauth_recovery.py).
E2E verified Don's 100-iteration loop bails out with 0 refresh calls
while genuine auth failures still refresh once and recover.
2026-05-15 17:11:06 -07:00
teknium1
cd9470f416 fix(deepseek): wire thinking-mode via DeepSeekProfile, not legacy fallback
The cherry-picked PR #15251 from @tw2818 correctly identified the
DeepSeek 400 root cause but placed the fix in the legacy fallback path
of `build_kwargs`, which DeepSeek never reaches — DeepSeek has a
registered ProviderProfile and goes through `_build_kwargs_from_profile`
instead. The legacy-path block was therefore dead code.

This commit pivots the fix to where it actually fires:

- New `DeepSeekProfile` in `plugins/model-providers/deepseek/__init__.py`
  overrides `build_api_kwargs_extras` to emit DeepSeek's expected wire
  format (mirrors `KimiProfile`):

      {"reasoning_effort": "<low|medium|high|max>",
       "extra_body": {"thinking": {"type": "enabled" | "disabled"}}}

- Model gating: only `deepseek-v4-*` and `deepseek-reasoner` emit
  thinking control. `deepseek-chat` (V3) is untouched — current behavior.

- Effort mapping: low/medium/high passthrough, xhigh/max → max, unset →
  omitted (DeepSeek server applies its own default).

- Revert the legacy-path additions from PR #15251 — they were dead code,
  and the `_copy_reasoning_content_for_api` strip block specifically
  would have nullified the existing reasoning_content padding machinery
  (`_needs_deepseek_tool_reasoning` → space-pad on replay) that the
  active provider already relies on for replay correctness.

- Unit tests pin the wire-shape contract and the model gating rules
  (26 tests, all passing). Existing transport + provider profile suites
  (321 tests) continue to pass.

- AUTHOR_MAP: map twebefy@gmail.com → tw2818 for release notes credit.

Closes #15700, #17212, #17825.
Co-authored-by: tw2818 <twebefy@gmail.com>
2026-05-15 17:03:26 -07:00
twebefy
068c24f8a4 feat(deepseek): add thinking.type + reasoning_effort mapping for DeepSeek API
DeepSeek's thinking mode requires both:
- extra_body.thinking.type: "enabled" to activate thinking mode
- top-level reasoning_effort: "max" or "high" to control depth

Previously, the ChatCompletionsTransport only handled Kimi's thinking
mode — DeepSeek was left unmapped, so reasoning_effort config was
silently dropped.

This patch:
1. Adds is_deepseek: bool to the Params dataclass, detected by
   base_url matching api.deepseek.com
2. Maps Hermes effort levels (xhigh/max → "max", low/medium/high →
   themselves) to the top-level reasoning_effort parameter
3. Sets extra_body.thinking.type alongside the effort
4. Strips reasoning_content from assistant messages sent back to
   DeepSeek, preventing 400 errors when thinking was enabled
2026-05-15 17:03:26 -07:00
Teknium
31ba2b0cbc
fix(xai-oauth): recover from prelude SSE errors, gate reasoning replay, surface entitlement 403s (#26644)
Three fixes for the May 2026 xAI OAuth (SuperGrok / X Premium) rollout
failures:

- _run_codex_stream: when openai SDK raises RuntimeError("Expected to
  have received `response.created` before `<type>`"), retry once then
  fall back to responses.create(stream=True) — same path used for
  missing-response.completed postlude.  Fallback surfaces the real
  provider error with body+status_code intact.  Also fixes #8133
  (response.in_progress prelude on custom relays) and #14634
  (codex.rate_limits prelude on codex-lb).

- _summarize_api_error: when error body matches xAI's entitlement
  shape, append a one-line hint pointing to https://grok.com and
  /model.  Once-only, applies to both auxiliary warnings and
  main-loop error surfacing.

- _chat_messages_to_responses_input: new is_xai_responses kwarg
  drops replayed codex_reasoning_items (encrypted_content) before
  they reach xAI.  Also drops reasoning.encrypted_content from the
  xAI include array.  Native Codex behavior unchanged.  Grok still
  reasons natively each turn; coherence rides on visible message
  text alone.

Closes #8133, #14634.
2026-05-15 16:35:12 -07:00
Teknium
032fb84222
docs(hermes_tools_mcp_server): align scope docstring with EXPOSED_TOOLS (#26603)
The top-of-file scope docstring listed delegate_task, memory, and
session_search as exposed tools, but EXPOSED_TOOLS deliberately omits
them (they're _AGENT_LOOP_TOOLS and require the running AIAgent context
to dispatch — the inline comment block already explains this). Kanban
tools, which ARE exposed, were missing from the docstring entirely.

Rewrite the Scope / DO NOT expose sections to match the actual tuple:
drop delegate_task/memory/session_search from 'expose', add the
kanban_* family, move delegate_task/memory/session_search/todo into
'DO NOT expose' with the agent-loop rationale.

Fixes #26567 (doc-only fix; option 2 — shimming memory/session_search
through MemoryStore/SessionDB directly — left for a follow-up issue
once the plugin-memory locking story is audited).
2026-05-15 14:44:27 -07:00
kchantharuan
13c3d4b4ef feat(nvidia): add NIM billing origin header 2026-05-15 14:06:51 -07:00
Teknium
4e89c53082
fix(async): close unscheduled coroutines in all threadsafe bridges (#26584)
Wraps every sync->async coroutine-scheduling site in the codebase with a
new agent.async_utils.safe_schedule_threadsafe() helper that closes the
coroutine on scheduling failure (closed loop, shutdown race, etc.)
instead of leaking it as 'coroutine was never awaited' RuntimeWarnings
plus reference leaks.

22 production call sites migrated across the codebase:
- acp_adapter/events.py, acp_adapter/permissions.py
- agent/lsp/manager.py
- cron/scheduler.py (media + text delivery paths)
- gateway/platforms/feishu.py (5 sites, via existing _submit_on_loop helper
  which now delegates to safe_schedule_threadsafe)
- gateway/run.py (10 sites: telegram rename, agent:step hook, status
  callback, interim+bg-review, clarify send, exec-approval button+text,
  temp-bubble cleanup, channel-directory refresh)
- plugins/memory/hindsight, plugins/platforms/google_chat
- tools/browser_supervisor.py (3), browser_cdp_tool.py,
  computer_use/cua_backend.py, slash_confirm.py
- tools/environments/modal.py (_AsyncWorker)
- tools/mcp_tool.py (2 + 8 _run_on_mcp_loop callers converted to
  factory-style so the coroutine is never constructed on a dead loop)
- tui_gateway/ws.py

Tests: new tests/agent/test_async_utils.py covers helper behavior under
live loop, dead loop, None loop, and scheduling exceptions. Regression
tests added at three PR-original sites (acp events, acp permissions,
mcp loop runner) mirroring contributor's intent.

Live-tested end-to-end:
- Helper stress test: 1500 schedules across live/dead/race scenarios,
  zero leaked coroutines
- Race exercised: 5000 schedules with loop killed mid-flight, 100 ok /
  4900 None returns, zero leaks
- hermes chat -q with terminal tool call (exercises step_callback bridge)
- MCP probe against failing subprocess servers + factory path
- Real gateway daemon boot + SIGINT shutdown across multiple platform
  adapter inits
- WSTransport 100 live + 50 dead-loop writes
- Cron delivery path live + dead loop

Salvages PR #2657 — adopts contributor's intent over a much wider site
list and a single centralized helper instead of inline try/except at
each site. 3 of the original PR's 6 sites no longer exist on main
(environments/patches.py deleted, DingTalk refactored to native async);
the equivalent fix lives in tools/environments/modal.py instead.

Co-authored-by: JithendraNara <jithendranaidunara@gmail.com>
2026-05-15 14:00:01 -07:00
Jaaneek
7fdc16dd4a refactor(transports/codex): trim duplicated cache-key comments
The xAI prompt_cache_key block carried two long comment paragraphs
that either restated setdefault semantics, narrated the SDK
type-validation mechanism, or recapped the historical motivation for
the extra_body indirection — all already covered by the test
docstring at test_xai_responses_sends_cache_key_via_extra_body
(which links to the xAI docs). Also restored the truncated link in
the body-injection comment.

No behavior change.
2026-05-15 12:11:32 -07:00
Jaaneek
b62c997973 feat(xai-oauth): add xAI Grok OAuth (SuperGrok Subscription) provider
Adds a new authentication provider that lets SuperGrok subscribers sign
in to Hermes with their xAI account via the standard OAuth 2.0 PKCE
loopback flow, instead of pasting a raw API key from console.x.ai.

Highlights
----------
* OAuth 2.0 PKCE loopback login against accounts.x.ai with discovery,
  state/nonce, and a strict CORS-origin allowlist on the callback.
* Authorize URL carries `plan=generic` (required for non-allowlisted
  loopback clients) and `referrer=hermes-agent` for best-effort
  attribution in xAI's OAuth server logs.
* Token storage in `auth.json` with file-locked atomic writes; JWT
  `exp`-based expiry detection with skew; refresh-token rotation
  synced both ways between the singleton store and the credential
  pool so multi-process / multi-profile setups don't tear each other's
  refresh tokens.
* Reactive 401 retry: on a 401 from the xAI Responses API, the agent
  refreshes the token, swaps it back into `self.api_key`, and retries
  the call once. Guarded against silent account swaps when the active
  key was sourced from a different (manual) pool entry.
* Auxiliary tasks (curator, vision, embeddings, etc.) route through a
  dedicated xAI Responses-mode auxiliary client instead of falling back
  to OpenRouter billing.
* Direct HTTP tools (`tools/xai_http.py`, transcription, TTS, image-gen
  plugin) resolve credentials through a unified runtime → singleton →
  env-var fallback chain so xai-oauth users get them for free.
* `hermes auth add xai-oauth` and `hermes auth remove xai-oauth N` are
  wired through the standard auth-commands surface; remove cleans up
  the singleton loopback_pkce entry so it doesn't silently reinstate.
* `hermes model` provider picker shows
  "xAI Grok OAuth (SuperGrok Subscription)" and the model-flow falls
  back to pool credentials when the singleton is missing.

Hardening
---------
* Discovery and refresh responses validate the returned
  `token_endpoint` host against the same `*.x.ai` allowlist as the
  authorization endpoint, blocking MITM persistence of a hostile
  endpoint.
* Discovery / refresh / token-exchange `response.json()` calls are
  wrapped to raise typed `AuthError` on malformed bodies (captive
  portals, proxy error pages) instead of leaking JSONDecodeError
  tracebacks.
* `prompt_cache_key` is routed through `extra_body` on the codex
  transport (sending it as a top-level kwarg trips xAI's SDK with a
  TypeError).
* Credential-pool sync-back preserves `active_provider` so refreshing
  an OAuth entry doesn't silently flip the active provider out from
  under the running agent.

Testing
-------
* New `tests/hermes_cli/test_auth_xai_oauth_provider.py` (~63 tests)
  covers JWT expiry, OAuth URL params (plan + referrer), CORS origins,
  redirect URI validation, singleton↔pool sync, concurrency races,
  refresh error paths, runtime resolution, and malformed-JSON guards.
* Extended `test_credential_pool.py`, `test_codex_transport.py`, and
  `test_run_agent_codex_responses.py` cover the pool sync-back,
  `extra_body` routing, and 401 reactive refresh paths.
* 165 tests passing on this branch via `scripts/run_tests.sh`.
2026-05-15 12:11:32 -07:00
Siddharth Balyan
5af672c753
chore: remove Atropos RL environments and tinker-atropos integration (#26106)
* chore: remove Atropos RL environments, tools, tests, skill, and tinker-atropos submodule

Delete:
- environments/ (43 files — base env, agent loop, tool call parsers, benchmarks)
- rl_cli.py (standalone RL training CLI)
- tools/rl_training_tool.py (all 10 rl_* tools)
- tests: test_rl_training_tool, test_tool_call_parsers, test_managed_server_tool_support,
  test_agent_loop, test_agent_loop_vllm, test_agent_loop_tool_calling,
  test_terminalbench2_env_security
- optional-skills/mlops/hermes-atropos-environments/
- tinker-atropos git submodule + .gitmodules

* chore: remove RL/Atropos references from Python source

- toolsets.py: remove rl toolset block + update comment
- model_tools.py: remove rl_tools group + update async bridging comment
- hermes_cli/tools_config.py: remove RL display entry, _DEFAULT_OFF_TOOLSETS,
  setup block, and rl_training post-setup handler
- tools/budget_config.py: remove RL environment reference in docstring
- tests/test_model_tools.py: remove rl_tools from expected groups
- tests/run_agent/test_streaming_tool_call_repair.py: fix stale cross-reference

* chore: remove rl/yc-bench extras and tinker-atropos refs from pyproject.toml

- Remove rl extra (atroposlib, tinker, fastapi, uvicorn, wandb)
- Remove yc-bench extra
- Remove rl_cli from py-modules
- Remove [tool.ty.src] exclude for tinker-atropos
- Remove [tool.ruff] exclude for tinker-atropos
- Regenerate uv.lock

* chore: remove tinker-atropos from install/setup scripts

- setup-hermes.sh: remove entire tinker-atropos submodule install block
- scripts/install.sh: remove both tinker-atropos blocks (Termux + standard)
- scripts/install.ps1: remove tinker-atropos block
- nix/hermes-agent.nix: remove tinker-atropos pip install line

* chore: remove RL references from cli-config.yaml.example

* docs: remove Atropos/RL references from README, CONTRIBUTING, AGENTS.md

* docs: remove RL/Atropos references from website

- Delete: environments.md, rl-training.md, mlops-hermes-atropos-environments.md
- sidebars.ts: remove rl-training and environments sidebar entries
- optional-skills-catalog.md: remove hermes-atropos-environments row
- tools-reference.md: remove entire rl toolset section
- toolsets-reference.md: remove rl row + update example
- integrations/index.md: remove RL Training bullet
- architecture.md: remove environments/ from tree + RL section
- contributing.md: remove tinker-atropos setup
- updating.md: remove tinker-atropos install + stale submodule update

* chore: remove remaining RL/Atropos stragglers

- hermes_cli/config.py: remove TINKER_API_KEY + WANDB_API_KEY env var defs
- hermes_cli/doctor.py: remove Submodules check section (tinker-atropos)
- hermes_cli/setup.py: remove RL Training status check
- hermes_cli/status.py: remove Tinker + WandB from API key status display
- agent/display.py: remove both rl_* tool preview/activity blocks
- website/docs: remove RL references from providers.md + env-variables.md
- tests: remove TINKER_API_KEY from conftest, set_config_value, setup_script

* chore: remove RL training section from .env.example
2026-05-15 10:36:38 +05:30
Harry Riddle
e8b9f5ff9a fix(aux): surface Nous auth-unavailable warning in auxiliary client
When the auxiliary client falls through Nous (e.g. no stored auth, or
runtime credential mint failed), users currently see only `debug`-level
lines, so the next provider in the fallback chain takes over silently.
Promote the no-auth path to a warning that tells operators to run
`hermes auth`, and add a debug breadcrumb on the rarer
mint-failed-but-stored-auth-still-present fallback path so the existing
behavior (use the raw stored token) is preserved while staying
investigable.

Salvaged from #23881 by @0xharryriddle. The contributor's original
patch also short-circuited the second branch with a return, which broke
the pool-entry fallback path covered by
`test_try_nous_uses_pool_entry` — kept the warning intent, dropped the
return so the fallback still works. Dropped the contributor's changes
to `hermes_cli/goals.py` because the goal-pause path is unreachable
when the auxiliary client is None (`judge_goal` returns
`parse_failed=False`, which resets `consecutive_parse_failures`),
so the reason string they added never surfaces in the pause message.

Refs #23876
2026-05-14 20:15:29 -07:00
ethernet
1702a94c88
Merge pull request #25957 from stephenschoettler/fix/main-ci-unblocker-after-21012
fix(ci): stabilize shared test state after 21012
2026-05-14 21:26:52 -04:00
Teknium
19071529f6
fix(lsp): shift baseline diagnostics into post-edit coordinates (#25978)
Pre-existing diagnostics below an edit point used to surface as 'LSP
diagnostics introduced by this edit' whenever the edit deleted or
inserted lines.  The delta-filter key included the diagnostic's
range, so the same logical error reported at a different line in
the post-edit snapshot looked like a brand new diagnostic.

Concrete case: deleting 14 lines in cli.py caused Pyright errors at
lines 9873, 10590, 12413, 13004 (unrelated to the edit) to be
reported as introduced by it.

Fix: build a piecewise-linear line-shift map (via difflib's
SequenceMatcher) from pre and post content, and remap baseline
diagnostics into post-edit coordinates before the set-difference.
Diagnostics in deleted regions drop out cleanly; diagnostics below
the edit shift by the right amount; diagnostics above are untouched.
The strict (range-aware) equality key stays — so a genuinely new
instance of an identical error class at a different line still
surfaces as new.

Pieces:
- agent/lsp/range_shift.py — build_line_shift, shift_diagnostic_range,
  shift_baseline.  Pure functions, no LSP state.
- agent/lsp/manager.py — LSPService.get_diagnostics_sync gains an
  optional line_shift kwarg; baseline is shift_baseline'd before
  computing the seen-set.  _diag_key keeps the strict range key.
- tools/file_operations.py — write_file captures pre_content for any
  LSP-handled extension (not just LINTERS_INPROC) and passes pre/post
  to _maybe_lsp_diagnostics, which builds the shift map.
- New _lsp_handles_extension helper guards the pre_content read.

Trade-offs preserved:
- Genuinely new same-class errors at different lines still surface
  (content-only key would have swallowed them).
- Pre-existing errors at unshifted positions still get filtered
  (covered by the strict-key path with no shift).
- Best-effort: when pre_content can't be captured (file didn't
  exist, permissions), the unshifted comparison still catches
  most pre-existing errors; the edge case it misses is a new file
  with a non-empty baseline, which is structurally impossible.
2026-05-14 15:56:07 -07:00
Teknium
fe83c4001b
fix(codex-app-server): attach redacted stderr tail to generic failures (#25929)
When codex app-server fails outside the OAuth-classified path
(non-auth turn/start errors, plain TimeoutErrors, generic turn-ended
status, subprocess silently exits, hard deadline timeout), the user
got a bare 'Internal error' / 'turn/start failed: ...' with no
context. Diagnosing config/provider/auth-bridge issues forced a
re-run with verbose codex flags.

Add a _format_error_with_stderr helper that appends the last few
stderr lines via agent.redact.redact_sensitive_text(force=True),
and use it at every catch-all error site:

- ensure_started() failures (codex init / thread/start) now return
  a TurnResult.error with should_retire=True instead of bubbling
- non-OAuth turn/start CodexAppServerError / TimeoutError
- subprocess-died branch (previously dumped raw stderr_blob[-300:]
  with no redaction — a leak risk)
- turn ended with non-completed status
- hard turn-timeout deadline

OAuth-classified failures and the post-tool quiet watchdog already
produce clean hints and stay unchanged. The redactor catches sk-*,
gh*_*, Authorization: Bearer, query-string tokens, JWTs, private
keys, etc., so provider error payloads can't leak into chat output
or trajectories.

Inspired by openclaw#80718, adapted for our app-server transport.
2026-05-14 14:55:23 -07:00
Stephen Schoettler
5ce0067c08 fix(ci): stabilize shared test state after 21012 2026-05-14 14:28:14 -07:00
EthanGuo-coder
26933c2f59 fix(agent/gemini-cloudcode): seed delta defaults for reasoning-only stream chunks
_make_stream_chunk built delta_kwargs with only `role`, so a reasoning-only
chunk produced a SimpleNamespace without a `.content` attribute. Downstream
consumers that read `delta.content` then raised AttributeError on Gemini 2.5
Flash, where the thinking delta arrives before any content delta.

Seed `content`, `tool_calls`, `reasoning`, and `reasoning_content` as None
up front, matching the pattern already used in gemini_native_adapter.py.
Key-present arguments still override the defaults.

Fixes #24974
References: Related open PR #24984 (luyao618) applies the same 1-line fix; this PR adds a regression test that #24984 omits
Co-Authored-By: Claude <noreply@anthropic.com>
2026-05-14 08:03:56 -07:00
Teknium
12f755c9eb
fix(codex-runtime): retire wedged sessions + post-tool watchdog + OAuth refresh classify (#25769)
Mirrors openclaw beta.8's app-server resilience fixes so a stuck codex
subprocess can't burn the full turn deadline and so users get a
`codex login` pointer instead of raw RPC errors when their token expires.

- TurnResult.should_retire signals the caller to drop+respawn codex.
- Deadline-hit path and dead-subprocess detection set should_retire so
  the next turn doesn't ride a CPU-spinning or auth-broken process.
- Post-tool watchdog (post_tool_quiet_timeout=90s): if a tool item
  completes and codex goes silent past the threshold without further
  output or turn/completed, fast-fail instead of waiting the full 600s.
  Resets on any non-tool activity so normal think-after-tool flows are
  not affected.
- <turn_aborted> and <turn_aborted/> in agent text are treated as
  terminal — some codex builds tear down a turn that way without
  emitting turn/completed.
- _classify_oauth_failure() inspects RPC error message + stderr tail
  for invalid_grant / token refresh / 401 / etc. and rewrites
  user-facing errors to 'run codex login'. Conservative: generic
  failures still surface verbatim. Fires at turn/start failure,
  turn/completed failure, and dead-subprocess paths.
- thread/start cross-fill: tolerate thread.id, thread.sessionId,
  top-level sessionId/threadId so future codex schema drift doesn't
  KeyError us at handshake.
- run_agent.py: when run_turn returns should_retire=True OR raises,
  close + null self._codex_session so the next turn respawns.

Tests: +30 cases across session + integration suites.
  tests/agent/transports/test_codex_app_server_session.py 50/50 pass
  tests/run_agent/test_codex_app_server_integration.py 27/27 pass
  Broader codex scope (transports + cli runtime/migration) 376/376 pass
2026-05-14 07:55:09 -07:00
Alex-wuhu
1551ce46a4 docs: update NovitaAI description to "90+ models, pay-per-use" 2026-05-13 23:51:15 -07:00
Alex-wuhu
c76e879574 feat: add NovitaAI as LLM provider
Add NovitaAI as a first-class provider with dedicated model selection
flow, live pricing, and authoritative context length resolution.

- Register provider in PROVIDER_REGISTRY, HERMES_OVERLAYS, and all
  alias/label maps (ID: novita, aliases: novita-ai, novitaai)
- Add dedicated _model_flow_novita() with 3-tier model list fallback:
  Novita API → models.dev → static curated list
- Fetch live pricing from /v1/models with correct unit conversion
  (input_token_price_per_m is 0.0001 USD per Mtok)
- Add Novita-specific context length resolution (step 4b) in
  get_model_context_length(), prioritized over models.dev/OpenRouter
- Register api.novita.ai in _URL_TO_PROVIDER to prevent early return
  from the custom-endpoint code path
- Add models.dev mapping (novita → novita-ai)
- Add default auxiliary model (deepseek/deepseek-v3-0324)
- Add NOVITA_API_KEY to test isolation (conftest.py)
- Update docs: providers page, env vars reference, CLI reference,
  .env.example, README, and landing page
2026-05-13 23:51:15 -07:00
AllynSheep
057f5a31d1 fix(auxiliary): skip providers without credentials immediately
When the auxiliary client fallback chain reaches a provider that has no
credentials configured (no API key, no pool entry), the current code
just returns (None, None) which counts toward the per-call timeout
budget on the next attempt. Mark the provider unhealthy with a short
TTL so the chain advances quickly to the next viable option.

Closes #25384.

Salvage of #25395 by @AllynSheep.
2026-05-13 23:10:33 -07:00
kshitijk4poor
657e6d87cc fix(web): align _LEGACY_PREFERENCE with legacy 7-provider order + doc cleanup
Self-review of the plugin migration surfaced one warning and a handful of
doc/dead-code cleanups. None affect production behaviour through the main
dispatcher (which always calls `tools.web_tools._get_backend()` first and
preserves the full 7-provider walk), but direct callers of
`agent.web_search_registry.get_active_*_provider()` previously diverged
from the legacy order and could return `None` for users with credentials
but no explicit `web.backend` config key.

Changes
-------
1. `_LEGACY_PREFERENCE` was shipped as a 4-tuple
   `("brave-free", "firecrawl", "searxng", "ddgs")` while the PR
   description and the legacy `_get_backend()` candidate order both
   call for the 7-tuple
   `(firecrawl, parallel, tavily, exa, searxng, brave-free, ddgs)`.
   Replaced with the 7-tuple. Verified empirically: with TAVILY+EXA keys
   and no config, `get_active_search_provider()` now returns tavily
   (was None); with EXA+PARALLEL it returns parallel (was None); with
   BRAVE+FIRECRAWL it returns firecrawl (was brave-free).

2. `agent/web_search_registry.py` — module docstring, `_resolve` step-3
   docstring, and inline comment all listed the old 4-tuple and claimed
   "brave-free first because it was the shipped default". The legacy
   default is `"firecrawl"`. Rewritten to match the new ordering and
   reference `tools.web_tools._get_backend()` as the source of truth.

3. `agent/web_search_registry.py` — `get_active_crawl_provider`
   docstring said "only Tavily implements it among built-in providers".
   Firecrawl also advertises `supports_crawl=True` after the previous
   commit. Updated to "Tavily and Firecrawl".

4. `plugins/web/tavily/provider.py` — module docstring said "Tavily is
   the only built-in backend that natively crawls". Updated.

5. `agent/web_search_provider.py` — ABC docstring mentioned only
   `search` / `extract` capabilities. Added `crawl` for accuracy.

6. `plugins/web/{firecrawl,parallel,exa}/provider.py` — dead plugin-level
   cache globals (`_firecrawl_client`, `_parallel_client`,
   `_async_parallel_client`, `_exa_client`) were declared but never read
   (all reads/writes go through `_wt.*` per the `extracting-inline-
   helpers-to-plugins` recipe). Removed the dead declarations; the
   reset-for-tests helpers in firecrawl + parallel now clear the
   canonical `_wt._<name>` slots, matching the pattern exa already used.

Tests
-----
218/218 web-targeted tests still pass (no test changes needed). 4910/4910
in `tests/tools/` still green.
2026-05-13 22:31:28 -07:00
kshitijk4poor
39b4ebfcea refactor(web): delete legacy tools/web_providers/ directory + migrate ABC tests
Removes the legacy in-tree provider scaffolding that PR #25182 fully
replaced with the plugin architecture:

  tools/web_providers/__init__.py        (6 lines)
  tools/web_providers/base.py            (89 lines — old ABCs)
  tools/web_providers/ARCHITECTURE.md    (73 lines — old design doc)

These were the staging-ground ABCs and provider modules that the
plugin migration absorbed. All seven web providers now implement the
single :class:`agent.web_search_provider.WebSearchProvider` ABC and
live under ``plugins/web/<vendor>/``. Nothing else in the tree imports
``tools.web_providers`` — verified via grep before deletion.

Test migration (tests/tools/test_web_providers.py)
--------------------------------------------------
Rewrote ``TestWebProviderABCs`` to test the new unified ABC at
:mod:`agent.web_search_provider`:

  - test_cannot_instantiate_abc_directly — abstract ``name`` + ``is_available``
  - test_concrete_search_only_provider_works — exercise default
    ``supports_extract=False`` / ``supports_crawl=False`` flags
  - test_concrete_multi_capability_provider_works — exercise all three
    capabilities, async extract supported (declared sync here for
    simplicity; real plugins like parallel + firecrawl use async)
  - test_search_only_provider_skips_extract_and_crawl — verify
    ``supports_*()`` flags default to False so search-only providers
    don't have to implement extract() or crawl()

The 9 other tests in the file (per-capability backend selection,
DEFAULT_CONFIG merge, dispatcher routing) test public helpers in
``tools.web_tools`` that still exist and pass unchanged.

agent/web_search_provider.py docstring updated to reflect that the
legacy ABCs no longer exist; the response-shape contract is preserved
bit-for-bit so external consumers see no behavioral change.

Net diff
--------
- tools/web_providers/ removed (-168 lines)
- tests/tools/test_web_providers.py rewritten ABC section (+78/-30 net,
  same coverage, new API)
- agent/web_search_provider.py docstring (-3/+5 lines)

Verified
--------
- 173/173 targeted web tests pass
- 12/12 ABC contract tests pass with the new interface
- No remaining grep hits for ``tools.web_providers`` outside of
  intentional historical references in plugin docstrings.
2026-05-13 22:31:28 -07:00
kshitijk4poor
e3f0a88891 feat(web): extend ABC with supports_crawl and async-extract semantics
Two ABC additions to cover the surface area of the remaining four
providers (exa, parallel, tavily, firecrawl) which were untouched by the
initial spike:

1. supports_crawl() + crawl() — Tavily natively crawls a seed URL via
   its /crawl endpoint. Exposing supports_crawl=True lets the crawl
   tool's dispatcher route to Tavily when configured, falling back to
   the auxiliary-model summarization path otherwise. Firecrawl could
   add this in a follow-up (the SDK supports it; we just don't surface
   it as a tool today).

2. Async-or-sync extract() — Parallel's SDK is natively async
   (AsyncParallel.beta.extract); Exa and Tavily are sync; Firecrawl is
   sync but called inside asyncio.to_thread() with a 60s timeout. The
   ABC docstring now permits either shape: implementations declare
   their own sync/async signature and the dispatcher uses
   inspect.iscoroutinefunction to detect and await.

Also adds get_active_crawl_provider() to web_search_registry mirroring
the search/extract resolvers, with web.crawl_backend as the explicit
override config key.

No behavior change on its own — these are scaffolds for the four
remaining provider migrations.
2026-05-13 22:31:28 -07:00
kshitijk4poor
0a7cbd3342 fix(plugins): filter resolution by is_available() in web + image_gen registries
Both web_search_registry._resolve() and image_gen_registry.get_active_provider()
walked their registered providers and returned the first one matching the
capability flag — without checking whether that provider was actually
usable. On a fresh install with no credentials at all, this meant
get_active_search_provider() returned `brave-free` (legacy preference
order) even though BRAVE_SEARCH_API_KEY was unset, leading the
dispatcher to surface a "BRAVE_SEARCH_API_KEY is not set" error for a
provider the user never chose. Same bug shape in image_gen for FAL.

Resolution semantics now match tools.web_tools._get_backend():

  1. Explicit config name wins, ignoring is_available() — the dispatcher
     surfaces a precise "X_API_KEY is not set" error rather than silently
     switching backends. Matches user expectation: "I configured X, tell
     me what's wrong with X."
  2. Fallback (no explicit config) walks the legacy preference order
     filtered by is_available() — pick the highest-priority backend the
     user actually has credentials for.

is_available() is wrapped in a try/except so a buggy provider doesn't
brick resolution.

E2E verified:
  - No creds + no config: get_active_search_provider() -> None
  - Explicit brave-free + no key: get_active_search_provider() -> brave-free
    (and .is_available() correctly reports False)

This fix was identified during the spike (#25182 finding #1) and is
fold-in to the same PR rather than a follow-up.
2026-05-13 22:31:28 -07:00
kshitijk4poor
007a630b16 feat(web): add web search provider registry mirroring image_gen pattern 2026-05-13 22:31:28 -07:00
kshitijk4poor
2cea98e143 feat(web): add WebSearchProvider ABC mirroring image_gen template 2026-05-13 22:31:28 -07:00
teknium1
4ceab16893 fix(compression): keep default protect_first_n at 3 + align ABC
Follow-up on the salvaged feat commit:

- Keep the constructor / config / yaml-example default at 3 so existing
  gateway and CLI users see no behavioural change. PR #13754 (which this
  builds on) had lowered the default to 2 to chase pre-feature parity in
  the system-prompt-present case, at the cost of quietly halving the
  protected head for the gateway path (which strips the system prompt
  before calling compress()). With the new "system prompt is implicit"
  semantics, default 3 gives every caller a stable head shape.
- agent/context_engine.py: bring the ABC's protect_first_n docstring in
  line with the new semantics so plugin context engines interpret the
  config key the same way the built-in compressor does.
- tests: adjust the default-value test (3, not 2) and a stale comment;
  per-test protect_first_n=2/3/1 values added in PR #13754 stay as-is
  since those tests fix concrete head shapes.
2026-05-13 22:25:16 -07:00
snav
dee71a31e5 feat(compression): make protect_first_n configurable
The number of head messages preserved verbatim across context compactions
was previously hardcoded to 3 in AIAgent.__init__. Expose it as
`compression.protect_first_n` in config, matching the existing
`protect_last_n` pattern.

Motivation: users who rely on rolling compaction for long-running sessions
had the opening user/assistant exchange pinned as head forever, which
doesn't always match how they want the session framed after many
compactions. Lowering to 1 preserves the system prompt + first non-system
message; lowering to 0 preserves only the system prompt and lets the
entire first exchange age out naturally through the summary.

Semantics: `protect_first_n` counts non-system head messages protected
**in addition to** the system prompt, which is always implicitly protected
when present. Same meaning across both code paths:

  protect_first_n=0 → system prompt only (or nothing if no system message)
  protect_first_n=2 → system prompt + first 2 non-system messages (default)

This unifies the CLI path (which reads messages with the system prompt at
position 0) and the gateway path (where the gateway /compress handler
strips the system prompt before calling compress() — see
gateway/run.py L9150-9154 on the parent fork). Previously these two paths
disagreed:

  CLI path:     protect_first_n=1 → protect system prompt only
  Gateway path: protect_first_n=1 → protect first USER turn forever

In practice on long-running gateway sessions the old semantics pinned
whatever stale aside happened to be the first user message, reinserting
it into every compaction summary indefinitely.

Default chosen as 2 (not 3) so that the effective protected head count
remains 3 messages in the common case — assuming a system prompt is
present, default protection becomes system + 2 non-system = 3 total,
matching the pre-feature behaviour where `protect_first_n` was hardcoded
to protect 3 messages total. Sessions without a system prompt will see a
small behaviour change (2 protected head messages instead of 3), but this
is the rare path and the new semantics make the system-prompt-present
case the well-defined one.

Changes:

- agent/context_compressor.py: redefine protect_first_n as the count of
  non-system head messages protected beyond the implicit system-prompt
  guarantee; both paths converge. Constructor default updated to 2.
- hermes_cli/config.py: add `compression.protect_first_n` default (2),
  matching the new semantics. `show_config` label tweaked to
  'Protect first: N non-system head messages' for clarity.
- run_agent.py: read protect_first_n from config; 0 is now valid (system
  prompt is always implicitly protected).
- cli-config.yaml.example: document the new key and rationale.
- tests/agent/test_context_compressor.py: cover default, override, the
  end-to-end `protect_first_n=0` and `protect_first_n=1` behaviour,
  the no-system-prompt (gateway) path, and the new shared-semantics
  regression test.

Fixes #13751
Tested on Ubuntu 24.04.
2026-05-13 22:25:16 -07:00
Teknium
091d8e1030
feat(codex-runtime): optional codex app-server runtime for OpenAI/Codex models (#24182)
* feat(codex-runtime): scaffold optional codex app-server runtime

Foundational commit for an opt-in alternate runtime that hands OpenAI/Codex
turns to a 'codex app-server' subprocess instead of Hermes' tool dispatch.
Default behavior is unchanged.

Lands in three pieces:

1. agent/transports/codex_app_server.py — JSON-RPC 2.0 over stdio speaker
   for codex's app-server protocol (codex-rs/app-server). Spawn, init
   handshake, request/response, notification queue, server-initiated
   request queue (for approval round-trips), interrupt-friendly blocking
   reads. Tested against real codex 0.130.0 binary end-to-end during
   development.

2. hermes_cli/runtime_provider.py:
   - Adds 'codex_app_server' to _VALID_API_MODES.
   - Adds _maybe_apply_codex_app_server_runtime() helper, called at the
     end of _resolve_runtime_from_pool_entry(). Inert unless
     'model.openai_runtime: codex_app_server' is set in config.yaml AND
     provider in {openai, openai-codex}. Other providers cannot be
     rerouted (anthropic, openrouter, etc. preserved).

3. tests/agent/transports/test_codex_app_server_runtime.py — 24 tests
   covering api_mode registration, the rewriter helper (default-off,
   case-insensitive, opt-in, non-eligible providers preserved), version
   parser, missing-binary handling, error class. Does NOT require codex
   CLI installed.

This commit is wire-only: the api_mode is recognized but AIAgent does
not yet branch on it. Followup commits add the session adapter, event
projector, approval bridge, transcript projection (so memory/skill
review still works), plugin migration, and slash command.

Existing tests remain green:
- tests/cli/test_cli_provider_resolution.py (29 passed)
- tests/agent/test_credential_pool_routing.py (included above)

* feat(codex-runtime): add codex item projector for memory/skill review

The translator that lets Hermes' self-improvement loop keep working under the
Codex runtime: converts codex 'item/*' notifications into Hermes' standard
{role, content, tool_calls, tool_call_id} message shape that
agent/curator.py already knows how to read.

Item taxonomy (matches codex-rs/app-server-protocol/src/protocol/v2/item.rs):
  - userMessage          → {role: user, content}
  - agentMessage         → {role: assistant, content: text}
  - reasoning            → stashed in next assistant's 'reasoning' field
  - commandExecution     → assistant tool_call(name='exec_command') + tool result
  - fileChange           → assistant tool_call(name='apply_patch') + tool result
  - mcpToolCall          → assistant tool_call(name='mcp.<server>.<tool>') + tool result
  - dynamicToolCall      → assistant tool_call(name=<tool>) + tool result
  - plan/hookPrompt/etc  → opaque assistant note, no fabricated tool_calls

Invariants preserved:
  - Message role alternation never violated: each tool item produces at most
    one assistant + one tool message in that order, correlated by call_id.
  - Streaming deltas (item/<type>/outputDelta, item/agentMessage/delta)
    don't materialize messages — only item/completed does. Mirrors how
    Hermes already only writes the assistant message after streaming ends.
  - Tool call ids are deterministic (codex item id-based) so replays produce
    identical messages and prefix caches stay valid (AGENTS.md pitfall #16).
  - JSON args use sorted_keys for the same reason.

Real wire formats verified against codex 0.130.0 by capturing live
notifications from thread/shellCommand and including one as a fixture
(COMMAND_EXEC_COMPLETED).

23 new tests, all green:
  - Streaming deltas don't materialize (3 paths)
  - Turn/thread frame events are silent
  - commandExecution: 5 tests including non-zero exit annotation +
    deterministic id stability across replays
  - agentMessage + reasoning attachment + reasoning consumption
  - fileChange: summary without inlined content
  - mcpToolCall: namespaced naming + error surfacing
  - userMessage: text fragments only (drops images/etc)
  - opaque items: no fabricated tool_calls
  - Helpers: deterministic id stability + sorted JSON args
  - Role alternation invariant across all four tool-shaped item types

This commit is a pure addition. AIAgent integration (the wire that uses the
projector) is the next commit.

* feat(codex-runtime): add session adapter + approval bridge

The third self-contained module: CodexAppServerSession owns one Codex
thread per Hermes session, drives turn/start, consumes streaming
notifications via CodexEventProjector, handles server-initiated approval
requests, and translates cancellation into turn/interrupt.

The adapter has a single public per-turn method:

    result = session.run_turn(user_input='...', turn_timeout=600)
    # result.final_text          → assistant text for the caller
    # result.projected_messages  → list ready to splice into AIAgent.messages
    # result.tool_iterations     → tick count for _iters_since_skill nudge
    # result.interrupted         → True on Ctrl+C / deadline / interrupt
    # result.error               → error string when the turn cannot complete
    # result.turn_id, thread_id  → for sessions DB / resume

Behavior:

  - ensure_started() spawns codex, does the initialize handshake, and
    issues thread/start with cwd + permissions profile. Idempotent.
  - run_turn() blocks until turn/completed, drains server-initiated
    requests (approvals) before reading notifications so codex never
    deadlocks waiting for us, projects every item/completed via the
    projector, and increments tool_iterations for the skill nudge gate.
  - request_interrupt() is thread-safe (threading.Event); the next loop
    iteration issues turn/interrupt and unwinds.
  - turn_timeout deadlock guard issues turn/interrupt and records an
    error if the turn never completes.
  - close() escalates terminate → kill via the underlying client.

Approval bridge:

  Codex emits server-initiated requests for execCommandApproval and
  applyPatchApproval. The adapter translates Hermes' approval choice
  vocabulary onto codex's decision vocabulary:

    Hermes 'once'                → codex 'approved'
    Hermes 'session' or 'always' → codex 'approvedForSession'
    Hermes 'deny' / anything else → codex 'denied'

  Routing precedence:
    1. _ServerRequestRouting.auto_approve_* flags (cron / non-interactive)
    2. approval_callback wired by the CLI (defers to
       tools.approval.prompt_dangerous_approval())
    3. Fail-closed denial when neither is wired

  Unknown server-request methods are answered with JSON-RPC error -32601
  so codex doesn't hang waiting for us.

Permission profile mapping mirrors AGENTS.md:
    Hermes 'auto'              → codex 'workspace-write'
    Hermes 'approval-required' → codex 'read-only-with-approval'
    Hermes 'unrestricted/yolo' → codex 'full-access'

20 new tests, all green. Combined with prior commits this PR now has
67 tests across three modules:
  - test_codex_app_server_runtime.py: 24 (api_mode + transport surface)
  - test_codex_event_projector.py: 23 (item taxonomy projections)
  - test_codex_app_server_session.py: 20 (turn loop + approvals + interrupts)

Full tests/agent/transports/ directory: 249/249 pass — no regressions
to existing transport tests.

Still no wire into AIAgent.run_conversation(); that integration commit
is small and goes next.

* feat(codex-runtime): wire codex_app_server runtime into AIAgent

The integration commit. AIAgent.run_conversation() now early-returns to a
new helper _run_codex_app_server_turn() when self.api_mode ==
'codex_app_server', bypassing the chat_completions tool loop entirely.

Three small surgical edits to run_agent.py (~105 LOC total):

1. Line ~1204 (constructor api_mode validation set):
   Add 'codex_app_server' so an explicit api_mode='codex_app_server'
   passed to AIAgent() isn't silently rewritten to 'chat_completions'.

2. Line ~12048 (run_conversation, just before the while loop):
   Early-return to _run_codex_app_server_turn() when self.api_mode is
   'codex_app_server'. Placed AFTER all standard pre-loop setup —
   logging context, session DB, surrogate sanitization, _user_turn_count
   and _turns_since_memory increments, _ext_prefetch_cache, memory
   manager on_turn_start — so behavior outside the model-call loop is
   identical between paths. Default Hermes flow is unchanged when the
   flag is off.

3. End-of-class (line ~15497):
   New method _run_codex_app_server_turn(). Lazy-instantiates one
   CodexAppServerSession per AIAgent (reused across turns), runs the
   turn, splices projected_messages into messages, increments
   _iters_since_skill by tool_iterations (since the chat_completions
   loop normally does that per iteration), fires
   _spawn_background_review on the same cadence as the default path.

Counter accounting:

  _turns_since_memory  ← already incremented at run_conversation:11817
                         (gated on memory store configured) — codex
                         helper does NOT touch it (would double-count).
  _user_turn_count     ← already incremented at run_conversation:11793
                         — codex helper does NOT touch it.
  _iters_since_skill   ← incremented in the chat_completions loop per
                         tool iteration. Codex helper increments by
                         turn.tool_iterations since the loop is bypassed.

User message:

  ALREADY appended to messages by run_conversation pre-loop (line 11823)
  before the early-return reaches us. Helper does NOT append again.
  Regression test test_user_message_not_duplicated guards this.

Approval callback wiring:

  Lazy-fetches tools.terminal_tool._get_approval_callback at session
  spawn time, passes to CodexAppServerSession. CLI threads with
  prompt_toolkit get interactive approvals; gateway/cron contexts get
  the codex-side fail-closed deny.

Error path:

  Codex session exceptions become a 'partial' result with completed=False
  and a final_response that explicitly tells the user how to switch back:
  'Codex app-server turn failed: ... Fall back to default runtime with
  /codex-runtime auto.' Same return-dict shape as the chat_completions
  path so all callers (gateway, CLI, batch_runner, ACP) work unchanged.

9 new integration tests in tests/run_agent/test_codex_app_server_integration.py:
  - api_mode='codex_app_server' is accepted on AIAgent construction
  - run_conversation returns the expected codex shape
    (final_response, codex_thread_id, codex_turn_id, completed, partial)
  - Projected messages are spliced into messages list
  - _iters_since_skill ticks per tool iteration
  - _user_turn_count delegated to standard flow (not double-counted)
  - User message appears exactly once (regression guard)
  - _spawn_background_review IS invoked (memory/skill review keeps working)
  - chat.completions.create is NEVER called (loop fully bypassed)
  - Session exception → partial result with /codex-runtime auto hint
  - Interrupted turn → partial result with error preserved

Adjacent test runs confirm no regressions:
  - tests/run_agent/test_memory_nudge_counter_hydration.py: green
  - tests/run_agent/test_background_review.py: green
  - tests/run_agent/test_fallback_model.py: green
  - tests/agent/transports/: 249/249 green

Still missing for full feature: /codex-runtime slash command, plugin
migration helper, docs page, live e2e test gated on codex binary. Those
are the remaining followup commits.

* feat(codex-runtime): add /codex-runtime slash command (CLI + gateway)

User-facing toggle for the optional codex app-server runtime. Follows the
'Adding a Slash Command (All Platforms)' pattern from AGENTS.md exactly:
single CommandDef in the central registry → CLI handler → gateway handler
→ running-agent guard → all surfaces (autocomplete, /help, Telegram menu,
Slack subcommands) update automatically.

Surface:
    /codex-runtime                    — show current state + codex CLI status
    /codex-runtime auto               — Hermes default runtime
    /codex-runtime codex_app_server   — codex subprocess runtime
    /codex-runtime on / off           — synonyms

Files changed:

  hermes_cli/codex_runtime_switch.py (new):
    Pure-Python state machine shared by CLI and gateway. Parse args,
    read/write model.openai_runtime in the config dict, gate enabling
    behind a codex --version check (don't let users opt in to a runtime
    they have no binary for; print npm install hint instead).
    Returns a CodexRuntimeStatus dataclass that callers render however
    suits their surface.

  hermes_cli/commands.py:
    Single CommandDef entry, no aliases (codex-runtime is its own thing).

  cli.py:
    Dispatch in process_command() + _handle_codex_runtime() handler that
    delegates to the shared module and renders results via _cprint.

  gateway/run.py:
    Dispatch in _handle_message() + _handle_codex_runtime_command() that
    returns a string (gateway sends as message). On a successful change
    that requires a new session, _evict_cached_agent() forces the next
    inbound message to construct a fresh AIAgent with the new api_mode —
    avoids prompt-cache invalidation mid-session.

  gateway/run.py running-agent guard:
    /codex-runtime joins /model in the early-intercept block so a runtime
    flip mid-turn can't split a turn across two transports.

Tests:
  tests/hermes_cli/test_codex_runtime_switch.py — 25 tests covering the
  state machine: arg parsing (10 cases incl. case-insensitive and
  synonyms), reading current runtime (5 cases incl. malformed configs),
  writing runtime (3 cases), apply() entry point covering read-only,
  no-op, codex-missing-blocked, codex-present-success, disable-no-binary-check,
  and persist-failure paths (8 cases). All green.

Adjacent test suites confirm no regressions:
  - tests/hermes_cli/test_commands.py + test_codex_runtime_switch.py:
    167/167 green
  - tests/agent/transports/: 283/283 green when combined with prior commits

Still missing: plugin migration helper, docs page, live e2e test gated on
codex binary. Followup commits.

* feat(codex-runtime): auto-migrate Hermes MCP servers to ~/.codex/config.toml

Translates the user's mcp_servers config from ~/.hermes/config.yaml into
the TOML format codex's MCP client expects. Wired into the
/codex-runtime codex_app_server enable path so users get their MCP tool
surface in the spawned subprocess automatically.

The migration runs on every enable. Failures are non-fatal — the runtime
change still proceeds and the user gets a warning so they can fix the
codex config manually.

What translates (mapping verified against codex-rs/core/src/config/edit.rs):
  Hermes mcp_servers.<n>.command/args/env  → codex stdio transport
  Hermes mcp_servers.<n>.url/headers       → codex streamable_http transport
  Hermes mcp_servers.<n>.timeout           → codex tool_timeout_sec
  Hermes mcp_servers.<n>.connect_timeout   → codex startup_timeout_sec
  Hermes mcp_servers.<n>.cwd               → codex stdio cwd
  Hermes mcp_servers.<n>.enabled: false    → codex enabled = false

What does NOT translate (warned + skipped per server):
  Hermes-specific keys (sampling, etc.) — codex's MCP client has no
  equivalent. Listed in the per-server skipped[] field of the report.

What's NOT migrated (intentional):
  AGENTS.md — codex respects this file natively in its cwd. Hermes' own
  AGENTS.md (project-level) is already in the worktree, so codex picks
  it up without translation. No code needed.

Idempotency design:
  All managed content lives between a 'managed by hermes-agent' marker
  and the next non-mcp_servers section header. _strip_existing_managed_block
  removes the prior managed region cleanly, preserving any user-added
  codex config (model, providers.openai, sandbox profiles, etc.) above
  or below.

Files added:
  hermes_cli/codex_runtime_plugin_migration.py — pure-Python migration
    helper. Public API: migrate(hermes_config, codex_home=None,
    dry_run=False) returns MigrationReport with .migrated/.errors/
    .skipped_keys_per_server. No external TOML dependency — minimal
    formatter handles strings/numbers/booleans/lists/inline-tables.

  tests/hermes_cli/test_codex_runtime_plugin_migration.py — 39 tests
  covering:
    - per-server translation (12): stdio/http/sse, cwd, timeouts,
      enabled flag, command+url precedence, sampling drop, unknown keys
    - TOML formatter (8): types, escaping, inline tables, error case
    - existing-block stripping (4): no marker, alone, with user content
      above, with user content below
    - end-to-end migrate() (8): empty, dry-run, round-trip, idempotent
      re-run, preserves user config, error reporting, invalid input,
      summary formatting

Files changed:
  hermes_cli/codex_runtime_switch.py — apply() now calls migrate() in
    the codex_app_server enable branch. Migration failure logs a warning
    in the result message but does NOT fail the runtime change. Disable
    path (auto) explicitly skips migration.

  tests/hermes_cli/test_codex_runtime_switch.py — 3 new tests:
    test_enable_triggers_mcp_migration, test_disable_does_not_trigger_migration,
    test_migration_failure_does_not_block_enable.

All 325 feature tests green:
  - tests/agent/transports/: 249 (incl. 67 new)
  - tests/run_agent/test_codex_app_server_integration.py: 9
  - tests/hermes_cli/test_codex_runtime_switch.py: 28 (3 new)
  - tests/hermes_cli/test_codex_runtime_plugin_migration.py: 39 (new)

* perf(codex-runtime): cache codex --version check within apply()

Single /codex-runtime invocation could spawn 'codex --version' up to 3
times (state report, enable gate, success message). Each spawn is ~50ms,
so the cumulative cost wasn't a crisis, but it was wasteful and turned a
trivial slash command into something noticeably laggy on slower systems.

Refactored to lazy-once via a closure over a nonlocal cache. First call
spawns; subsequent calls in the same apply() reuse the result.

Behavior unchanged — same return shape, same error handling, same install
hint when codex is missing. Just one subprocess per call instead of three.

Two regression-guard tests added:
  - test_binary_check_cached_within_apply: enable path → call_count == 1
  - test_binary_check_cached_on_read_only_call: state-report path → call_count == 1

Total tests for /codex-runtime now 30 (was 28); all 143 codex-runtime
tests still green.

* fix(codex-runtime): correct protocol field names found via live e2e test

Three real bugs caught only by running a turn end-to-end against codex
0.130.0 with a real ChatGPT subscription. Unit tests passed because they
asserted on our own (incorrect) wire shapes; the wire format from
codex-rs/app-server-protocol/src/protocol/v2/* is the source of truth and
my initial reading of the README was incomplete.

Bug 1: thread/start.permissions wire format

Was sending {"profileId": "workspace-write"}.
Real format per PermissionProfileSelectionParams enum (tagged union):
  {"type": "profile", "id": "workspace-write"}
AND requires the experimentalApi capability declared during initialize.
AND requires a matching [permissions] table in ~/.codex/config.toml or
codex fails the request with 'default_permissions requires a [permissions]
table'.

Fix: stop overriding permissions on thread/start. Codex picks its default
profile (read-only unless user configures otherwise), which matches what
codex CLI users expect — they configure their default permission profile
in ~/.codex/config.toml the standard way. Trying to be clever about
profile selection broke every turn we tested.

Live error before fix: 'Invalid request: missing field type' on every
turn/start, even though our turn/start payload was correct — the field
codex was complaining about was inside the permissions sub-object we
shouldn't have been sending.

Bug 2: server-request method names

Was matching 'execCommandApproval' and 'applyPatchApproval'.
Real names per common.rs ServerRequest enum:
  item/commandExecution/requestApproval
  item/fileChange/requestApproval
  item/permissions/requestApproval (new third method)

Fix: match the documented names. Added handler for
item/permissions/requestApproval that always declines — codex sometimes
asks to escalate permissions mid-turn and silent acceptance would surprise
users.

Live symptom before fix: agent.log showed
'Unknown codex server request: item/commandExecution/requestApproval'
and codex stalled because we replied with -32601 (unsupported method)
instead of an approval decision. The agent reported back 'The write
command was rejected' even though Hermes never showed the user an
approval prompt.

Bug 3: approval decision values

Was sending decision strings 'approved'/'approvedForSession'/'denied'.
Real values per CommandExecutionApprovalDecision enum (camelCase):
  accept, acceptForSession, decline, cancel
(also AcceptWithExecpolicyAmendment and ApplyNetworkPolicyAmendment
variants we don't currently use).

Fix: rename _approval_choice_to_codex_decision return values; update
auto_approve_* fallbacks; update fail-closed default from 'denied' to
'decline'. Test mapping table updated to match.

Live test verified after fixes:
  $ hermes (with model.openai_runtime: codex_app_server)
  > Run the shell command: echo hermes-codex-livetest > .../proof.txt
    then read it back

  Approval prompt fired with 'Codex requests exec in <cwd>'.
  User chose 'Allow once'. Codex executed the command, wrote the file,
  read it back. Final response: 'Read back from proof.txt:
  hermes-codex-livetest'. File contents on disk match.

agent.log confirms:
  codex app-server thread started: id=019e200e profile=workspace-write
                                    cwd=/tmp/hermes-codex-livetest/workspace

All 20 session tests still green after wire-format updates.

* fix(codex-runtime): correct apply_patch approval params + ship docs

Live e2e revealed FileChangeRequestApprovalParams doesn't carry the
changeset (just itemId, threadId, turnId, reason, grantRoot) — Codex's
'reason' field describes what the patch wants to do. Test config and
display logic updated to use it. The first 'apply_patch (0 change(s))'
display from the live test is now 'apply_patch: <reason>'.

Adds website/docs/user-guide/features/codex-app-server-runtime.md
covering enable/disable, prerequisites, approval UX, MCP migration
behavior, permission profile delegation to ~/.codex/config.toml, known
limitations, and the architecture diagram. Wired into the Automation
category in sidebars.ts.

Live e2e validation across the path matrix:
  ✓ thread/start handshake
  ✓ turn/start with text input
  ✓ commandExecution items + projection
  ✓ item/commandExecution/requestApproval → Hermes UI → response
  ✓ Approve once → command runs
  ✓ Deny → command rejected, codex falls back to read-only message
  ✓ Multi-turn (codex remembers prior turn's results)
  ✓ apply_patch via Codex's fileChange path
  ✓ item/fileChange/requestApproval → Hermes UI
  ✓ MCP server migration loads inside spawned codex (verified via
    'use the filesystem MCP tool' prompt)
  ✓ /codex-runtime auto → codex_app_server toggle cycle
  ✓ Disable doesn't trigger migration
  ✓ Enable with codex CLI present succeeds + migrates
  ✓ Hermes-side interrupt path (turn/interrupt request issued cleanly
    even if codex finishes before the interrupt lands)

Known live-validated limitations now documented in the docs page:
  - delegate_task subagents unavailable on this runtime
  - permission profile selection delegated to ~/.codex/config.toml
  - apply_patch approval prompt has no inline changeset (codex protocol
    doesn't expose it)

145/145 codex-runtime tests still green.

* feat(codex-runtime): native plugin migration + UX polish (quirks 2/4/5/10/11)

Major: migrate native Codex plugins (#7 in OpenClaw's PR list)

Discovers installed curated plugins via codex's plugin/list RPC and
writes [plugins."<name>@<marketplace>"] entries to ~/.codex/config.toml
so they're enabled in the spawned Codex sessions. This is the
'YouTube-video-worthy' bit Pash highlighted: when a user has
google-calendar, github, etc. installed in their Codex CLI, those
plugins activate automatically when they enable Hermes' codex runtime.

Implementation:
  - hermes_cli/codex_runtime_plugin_migration.py: new _query_codex_plugins()
    helper spawns 'codex app-server' briefly and walks plugin/list. Returns
    (plugins, error) — failures are non-fatal so MCP migration still works.
  - render_codex_toml_section() now takes plugins + permissions args.
  - migrate() defaults: discover_plugins=True, default_permission_profile=
    'workspace-write'. Explicit None on either disables that side.
  - _strip_existing_managed_block() now also strips [plugins.*] and
    [permissions]/[permissions.*] sections inside the managed block, so
    re-runs replace plugins cleanly without touching codex's own config.

Quirk fixes:

#2 Default permissions profile written on enable.
   Without this, Codex's read-only default kicks in and EVERY write
   triggers an approval prompt. Now writes [permissions] default =
   'workspace-write' so the runtime feels normal out of the box. Set
   default_permission_profile=None to opt out.

#4 apply_patch approval prompt now shows what's changing.
   Codex's FileChangeRequestApprovalParams doesn't carry the changeset.
   Session adapter now caches the fileChange item from item/started
   notifications and looks it up by itemId when codex requests approval.
   Prompt shows '1 add, 1 update: /tmp/new.py, /tmp/old.py' instead of
   'apply_patch (0 change(s))'.

   Side benefit: also drains pending notifications BEFORE handling a
   server request, so the projector and per-turn caches are up to date
   when the approval decision fires. Bounded to 8 notifications per
   loop iter to avoid starving codex's response.

#5/#10 Exec approval prompt never shows empty cwd.
   When codex omits cwd in CommandExecutionRequestApprovalParams, fall
   back to the session's cwd. If somehow neither is available, show
   '<unknown>' explicitly instead of an empty string.

   Also surfaces 'reason' from the approval params when codex provides
   it — gives users more context on why codex wants to run something.

#11 Banner indicates the codex_app_server runtime when active.
   New 'Runtime: codex app-server (terminal/file ops/MCP run inside
   codex)' line appears in the welcome banner only when the runtime is
   on. Default banner is unchanged.

Tests:
  - 7 new tests in test_codex_runtime_plugin_migration.py covering
    plugin discovery (mocked), failure handling, dry-run skip, opt-out
    flag, idempotent re-runs, and permissions writing.
  - 3 new tests in test_codex_app_server_session.py covering the
    enriched approval prompts: cwd fallback, change summary on
    apply_patch, fallback when no item/started cache exists.
  - All 26 session tests + 46 migration tests green; 153 total in PR.

* feat(codex-runtime): hermes-tools MCP callback + native plugin migration

The big architectural addition: when codex_app_server runtime is on,
Hermes registers its own tool surface as an MCP server in
~/.codex/config.toml so the codex subprocess can call back into Hermes
for tools codex doesn't ship with — web_search, browser_*, vision,
image_generate, skills, TTS.

Also: 'migrate native codex plugins' (Pash's YouTube-video-worthy bit) —
when the user has plugins like Linear, GitHub, Gmail, Calendar, Canva
installed via 'codex plugin', Hermes discovers them via plugin/list and
writes [plugins.<name>@openai-curated] entries so they activate
automatically.

New module: agent/transports/hermes_tools_mcp_server.py
  FastMCP stdio server exposing 17 Hermes tools. Each call dispatches
  through model_tools.handle_function_call() — same code path as the
  Hermes default runtime. Run with:
    python -m agent.transports.hermes_tools_mcp_server [--verbose]

  Exposed: web_search, web_extract, browser_navigate / _click / _type /
    _press / _snapshot / _scroll / _back / _get_images / _console /
    _vision, vision_analyze, image_generate, skill_view, skills_list,
    text_to_speech.

  NOT exposed (deliberately):
    - terminal/shell/read_file/write_file/patch — codex has built-ins
    - delegate_task/memory/session_search/todo — _AGENT_LOOP_TOOLS in
      model_tools.py:493, require running AIAgent context. Documented
      as a limitation and surfaced in the slash command output.

Migration changes (hermes_cli/codex_runtime_plugin_migration.py):
  - _query_codex_plugins() spawns 'codex app-server' briefly to walk
    plugin/list and pull installed openai-curated plugins. Failures are
    non-fatal — MCP migration still completes.
  - render_codex_toml_section() now takes plugins + permissions args
    AND wraps the managed block with a MIGRATION_END_MARKER comment so
    the stripper can reliably find both ends, even when the block
    contains top-level keys (default_permissions = ...).
  - migrate() defaults: discover_plugins=True, expose_hermes_tools=True,
    default_permission_profile=':workspace' (built-in codex profile name
    — must be prefixed with ':'). All three opt-out via explicit args.
  - _build_hermes_tools_mcp_entry() builds the codex stdio entry with
    HERMES_HOME and PYTHONPATH passthrough so a worktree-launched
    Hermes points the MCP subprocess at the same module layout.

Live-caught wire bugs fixed during this turn:
  1. Permission profile config key is top-level , NOT a [permissions] table. The [permissions] table is
     for *user-defined* profiles with structured fields. Built-in
     profile names start with ':' (':workspace', ':read-only',
     ':danger-no-sandbox'). Was emitting
     which codex rejected with 'invalid type: string "X", expected
     struct PermissionProfileToml'.
  2. Built-in profile is , NOT . Codex
     rejected  with 'unknown built-in profile'.
  3. Codex's MCP layer sends  for
     tool-call confirmation. We weren't handling it, so codex stalled
     and returned 'MCP tool call was rejected'. Now: auto-accept for
     our own hermes-tools server (user already opted in by enabling
     the runtime), decline for third-party servers.

Quirk fixes shipped (from the limitations list):
  #2 default permissions: workspace profile written on enable. No more
     approval prompt on every write.
  #4 apply_patch approval shows what's changing: cache fileChange
     items from item/started, look up by itemId when codex sends
     item/fileChange/requestApproval. Prompt: '1 add, 1 update:
     /tmp/new.py, /tmp/old.py' instead of '0 change(s)'.
  #5/#10 exec approval cwd never empty: fall back to session cwd, then
     '<unknown>'. Also surfaces 'reason' from codex when present.
  #11 banner shows 'Runtime: codex app-server' line when active so
     users understand why tool counts may not match what's reachable.

Tests:
  - 5 new tests in test_codex_runtime_plugin_migration.py covering
    plugin discovery, expose_hermes_tools entry generation, idempotent
    re-runs, opt-out flag, permissions profile.
  - 3 new tests in test_codex_app_server_session.py covering enriched
    approval prompts (cwd fallback, fileChange summary).
  - 2 new tests for mcpServer/elicitation/request handling (accept
    hermes-tools, decline others).
  - New test file test_hermes_tools_mcp_server.py covering module
    surface, EXPOSED_TOOLS safety invariants (no shell/file_ops,
    no agent-loop tools), and main() error paths.
  - 166 codex-runtime tests total, all green.

Live e2e validated against codex 0.130.0 + ChatGPT subscription:
  ✓ /codex-runtime codex_app_server enables, migrates filesystem MCP,
    registers hermes-tools, writes default_permissions = ':workspace'
  ✓ Banner shows 'Runtime: codex app-server' line in subsequent sessions
  ✓ Shell command runs without approval prompt (workspace profile works)
  ✓ Multi-turn — codex remembers prior turn's results
  ✓ apply_patch path via fileChange request approval
  ✓ web_search via hermes-tools MCP callback returns real Firecrawl
    results: 'OpenAI Codex CLI – Getting Started' end-to-end in 13s
  ✓ Disable cycle clean

Docs updated: website/docs/user-guide/features/codex-app-server-runtime.md
  Full re-write covering native plugin migration, the hermes-tools
  callback architecture, the prerequisites change ('codex login is
  separate from hermes auth login codex'), the trade-off table now
  reflecting which Hermes tools work via callback, and the limitations
  list updated with what's actually unavailable on this runtime.

* feat(codex-runtime): pin user-config preservation invariant for quirk #6

Quirk #6 from the limitations list — user MCP servers / overrides /
codex-only sections in ~/.codex/config.toml that live OUTSIDE the
hermes-managed block must survive re-migration verbatim.

This already worked thanks to the MIGRATION_MARKER + MIGRATION_END_MARKER
pair I added when fixing the default_permissions wire format (so the
strip can find both ends of the managed region even with top-level
keys like default_permissions). But it was an emergent property
without a test pinning it.

Now explicitly tested:
  - User MCP server above the managed block survives migration
  - User MCP server below the managed block survives migration
  - Both above + below survive a second re-migration
  - User content (model, providers, sandbox, otel, etc.) outside our
    region is left untouched

Docs added a section "Editing ~/.codex/config.toml safely" explaining
the marker contract — so users know they can add their own MCP
servers, override permissions, configure codex-only options, etc.
without fear of Hermes overwriting their work.

167 codex-runtime tests, all green.

* docs(codex-runtime): clarify the actual tool surface — shell covers terminal/read/write/find

Previous docs and PR description undersold what codex's built-in
toolset actually provides. apply_patch alone made it sound like the
runtime could only edit files in patch format — implying you'd lose
terminal use, read_file, write_file, search/find. That was wrong.

Codex's 'shell' tool runs arbitrary shell commands inside the sandbox,
which covers everything you'd do in bash: cat/head/tail (read), echo>
or heredocs (write), find/rg/grep (search), ls/cd (navigate), build/
test/git/etc. apply_patch is for structured multi-file edits on top
of that. update_plan is its in-runtime todo. view_image loads images.
And codex has its own web_search built in (in addition to the
Firecrawl-backed one Hermes exposes via MCP callback).

Docs now have a 'What tools the model actually has' section right
after Why, breaking the surface into three clearly-labeled buckets:

  1. Codex's built-in toolset (always on) — shell, apply_patch,
     update_plan, view_image, web_search; covers everything terminal-
     adjacent.
  2. Native Codex plugins (auto-migrated from your codex plugin
     install) — Linear, GitHub, Gmail, Calendar, Outlook, Canva, etc.
  3. Hermes tool callback (MCP server in ~/.codex/config.toml) —
     web_search/web_extract via Firecrawl, browser_*, vision_analyze,
     image_generate, skill_view/skills_list, text_to_speech.

Plus a 'What's NOT available' callout listing the four agent-loop tools
(delegate_task, memory, session_search, todo) that need running
AIAgent context and can't reach the codex runtime.

Trade-offs table broken out: shell, apply_patch, update_plan,
view_image, sandbox each get their own row with a one-line description
so users can see at a glance what's available natively.

Architecture diagram updated to list the codex built-ins by name
instead of 'apply_patch + shell + sandbox'.

No code changes — purely docs clarification. 167 codex-runtime tests
still green.

* fix(codex-runtime): _spawn_background_review signature + review fork api_mode downgrade

Two real bugs in the self-improvement loop integration that the previous
test mocked away.

Bug 1: wrong call signature

The codex helper was calling self._spawn_background_review() with no
args after every turn. That function actually requires:
  messages_snapshot=list   (positional or keyword)
  review_memory=bool       (at least one trigger must be True)
  review_skills=bool

So the call would have raised TypeError at runtime — except the only
test that exercised this path mocked _spawn_background_review entirely
and just asserted spawn.called, so the wrong-arg shape never surfaced.

Bug 2: review fork inherits codex_app_server api_mode

The review fork is constructed with:
  api_mode = _parent_runtime.get('api_mode')

So when the parent is codex_app_server, the review fork ALSO runs as
codex_app_server. But the review fork's whole job is to call agent-loop
tools (memory, skill_manage) which require Hermes' own dispatch — they
short-circuit with 'must be handled by the agent loop' on the codex
runtime. So the review fork would have run, decided to save something,
called memory or skill_manage, and silently no-op'd.

Fixed in run_agent.py:_spawn_background_review() — when the parent
api_mode is 'codex_app_server', the review fork is downgraded to
'codex_responses' (same OAuth credentials, same openai-codex provider,
but talks to OpenAI's Responses API directly so Hermes owns the loop).

Also rewrote the codex helper's review wiring to match the
chat_completions path:
  - Computes _should_review_memory in the pre-loop block (was already
    being computed; now passed through to the helper as an arg).
  - Computes _should_review_skills AFTER the codex turn returns +
    counters tick (line ~15432 pattern in chat_completions).
  - Calls _spawn_background_review(messages_snapshot=, review_memory=,
    review_skills=) only when at least one trigger fires.
  - Adds the external memory provider sync (_sync_external_memory_for_turn)
    that the chat_completions path runs after every turn.

Tests:

  Replaced the broken test_background_review_invoked (which only
  asserted spawn.called) with three sharper tests:
    - test_background_review_NOT_invoked_below_threshold:
      single turn at default thresholds → no review fires (would have
      caught the original 'every turn calls spawn with no args' bug)
    - test_background_review_skill_trigger_fires_above_threshold:
      10 tool_iterations at threshold=10 → review fires with
      messages_snapshot=list, review_skills=True, counter resets
    - test_background_review_signature_never_breaks: regression guard
      asserting positional args are always empty and kwargs include
      messages_snapshot

  New TestReviewForkApiModeDowngrade class:
    - test_codex_app_server_parent_downgrades_review_fork: drives the
      real _spawn_background_review function (no mock at that level),
      asserts the review_agent gets api_mode='codex_responses' when
      the parent was codex_app_server.

Live-validated against real run_conversation:
  - Counter ticked from 0 to 5 after a 5-tool-iteration turn
  - _spawn_background_review fired exactly once with kwargs-only signature
  - review_skills=True, review_memory=False
  - messages_snapshot was 12 entries (5 assistant tool_calls + 5 tool
    results + 1 final assistant + initial system/user)
  - Counter reset to 0 after fire

170 codex-runtime tests, all green.

Docs: added a Self-improvement loop section to the codex runtime page
explaining both how the trigger logic stays equivalent and that the
review fork is auto-downgraded to codex_responses for the agent-loop
tools. Also clarified that apply_patch and update_plan ARE codex's
built-in tools (the previous version made it sound like they were
separate from 'codex's stuff' — they're not, all five tools listed
in 'What tools the model actually has' section 1 are codex built-ins).

* feat(codex-runtime): expose kanban tools through Hermes MCP callback

Kanban workers spawn as separate hermes chat -q subprocesses that read
the user's config.yaml. If model.openai_runtime: codex_app_server is set
globally (which is the whole point of opt-in), every dispatched worker
ALSO comes up on the codex runtime.

That mostly works — codex's built-in shell + apply_patch + update_plan
do the actual task work fine — but it had one critical break: the
worker handoff tools (kanban_complete, kanban_block, kanban_comment,
kanban_heartbeat) are Hermes-registered tools, not codex built-ins.
On the codex runtime, codex builds its own tool list and these never
reach the model, so the worker would do the work but not be able to
report back, hanging until the dispatcher's timeout escalates it as
zombie.

Fix: add all 9 kanban tools to the EXPOSED_TOOLS list in the Hermes
MCP callback. They dispatch statelessly through handle_function_call()
just like web_search and the others — they read HERMES_KANBAN_TASK
from env (set by the dispatcher), gate correctly (worker tools require
the env var, orchestrator tools require it unset), and write to
~/.hermes/kanban.db.

Why kanban tools work via stateless dispatch when delegate_task/memory/
session_search/todo don't: those four are listed in _AGENT_LOOP_TOOLS
(model_tools.py:493) and short-circuit in handle_function_call() with
'must be handled by the agent loop' — they need to mutate AIAgent's
mid-loop state. Kanban tools have no such requirement; they're pure
side-effect functions against the kanban.db plus state_meta.

Tools exposed:
  Worker handoff (require HERMES_KANBAN_TASK):
    kanban_complete, kanban_block, kanban_comment, kanban_heartbeat
  Read-only board queries:
    kanban_show, kanban_list
  Orchestrator (require HERMES_KANBAN_TASK unset):
    kanban_create, kanban_unblock, kanban_link

Tests:
  - test_kanban_worker_tools_exposed: complete/block/comment/heartbeat
    in EXPOSED_TOOLS (regression guard for the would-hang-worker bug)
  - test_kanban_orchestrator_tools_exposed: create/show/list/unblock/link

Docs:
  - New 'Workflow features' section in the docs page covering /goal,
    kanban, and cron behavior on this runtime
  - /goal: works fully via run_conversation feedback; only caveat is
    approval-prompt noise on long writes-heavy goals (mitigated by
    the default :workspace permission profile)
  - Kanban: enumerated which tools are reachable via the callback and
    why the env var propagates correctly through the codex subprocess
    to the MCP server subprocess
  - Cron: documented as 'not specifically tested' — same rules as the
    CLI apply since cron runs through AIAgent.run_conversation
  - Trade-offs table gained rows for /goal, kanban worker, kanban
    orchestrator

172/172 codex-runtime tests green (+2 from kanban tests).

* docs(codex-runtime): wire /codex-runtime into slash-commands ref + flag aux token cost

Three docs gaps caught during a final audit:

1. /codex-runtime was only in the feature docs page, not in the
   slash-commands reference. Added rows to both the CLI section and
   the Messaging section so users discover it where they'd look for
   slash command syntax.

2. CODEX_HOME and HERMES_KANBAN_TASK weren't in environment-variables.md.
   CODEX_HOME lets users redirect Codex CLI's config dir (the migration
   honors it). HERMES_KANBAN_TASK is set by the kanban dispatcher and
   propagates to the codex subprocess + the hermes-tools MCP subprocess
   so kanban worker tools gate correctly — documented as 'don't set
   manually' since it's an internal handoff.

3. Aux client behavior on this runtime. When openai_runtime=
   codex_app_server is on with the openai-codex provider, every aux
   task (title generation, context compression, vision auto-detect,
   session search summarization, the background self-improvement review
   fork) flows through the user's ChatGPT subscription by default.

   This is true for the existing codex_responses path too, but it's
   more visible / important here because users explicitly opted in for
   subscription billing. Added a 'Auxiliary tasks and ChatGPT
   subscription token cost' section to the docs page with a YAML
   example showing how to override specific aux tasks to a cheaper
   model (typically google/gemini-3-flash-preview via OpenRouter).

   Also documents how the self-improvement review fork gets
   auto-downgraded from codex_app_server to codex_responses by the
   fix earlier in this PR.

No code changes — pure docs. 172 codex-runtime tests still green.

* docs+test(codex-runtime): pin HOME passthrough, document multi-profile + CODEX_HOME

OpenClaw hit a real footgun in openclaw/openclaw#81562: when spawning
codex app-server they were synthesizing a per-agent HOME alongside
CODEX_HOME. That made every subprocess codex's shell tool launches
(gh, git, aws, npm, gcloud, ...) see a fake $HOME and miss the user's
real config files. They had to back it out in PR #81562 — keep
CODEX_HOME isolation, leave HOME alone.

Audit confirms Hermes' codex spawn doesn't have this problem. We do
os.environ.copy() and only overlay CODEX_HOME (when provided) and
RUST_LOG. HOME passes through unchanged. But it was an emergent
property without a test pinning it, so adding a regression guard:

  test_spawn_env_preserves_HOME — confirms parent HOME survives intact
                                  in the subprocess env
  test_spawn_env_sets_CODEX_HOME_when_provided — confirms codex_home
                                                  arg still isolates
                                                  codex state correctly

Docs additions:

  'HOME environment variable passthrough' section — calls out the
  contract explicitly: CODEX_HOME isolates codex's own state, HOME
  stays user-real so gh/git/aws/npm/etc. find their normal config.
  Cites openclaw#81562 as the cautionary tale.

  'Multi-profile / multi-tenant setups' section — addresses the
  related concern: profiles share ~/.codex/ by default. For users who
  want per-profile codex isolation (separate auth, separate plugins),
  documents the manual CODEX_HOME=<profile-scoped-dir> approach.

  Explains why we DON'T auto-scope CODEX_HOME per profile: doing so
  would silently invalidate existing codex login state for anyone
  upgrading to this PR with tokens already at ~/.codex/auth.json.
  Opt-in is safer than surprising users.

174 codex-runtime tests (+2 from HOME guards), all green.

* fix(codex-runtime): TOML control-char escapes + atomic config.toml write

Two footguns caught in a final audit pass before merge.

Bug 1: TOML control characters not escaped

The _format_toml_value() helper escaped backslashes and double quotes
but passed literal control characters (\n, \t, \r, \f, \b) through
unchanged. TOML basic strings don't allow literal control characters
— a path or env var containing a newline would produce invalid TOML
that codex refuses to load.

Realistic exposure: pathological cases like a HERMES_HOME with a
trailing newline (env var concatenation accident), or a PYTHONPATH
with a tab from a multi-line shell heredoc.

Fix: escape all five TOML basic-string control sequences (\b \t \n
\f \r) in addition to \\ and \" that we already did. Order
matters — backslash must come first or the other escapes get
re-escaped.

Bug 2: config.toml write wasn't atomic

If the python process crashed between target.mkdir() and the
write_text() finishing, a half-written config.toml could be left
behind. On NFS / Windows / some FUSE mounts this is a real concern;
on ext4/APFS small writes are usually atomic in practice but not
guaranteed.

Fix: write to a tempfile.mkstemp() temp file in the same directory,
then Path.replace() (atomic same-dir rename on POSIX, ReplaceFile on
Windows). On rename failure, clean up the temp file so repeated
failed migrations don't pile up .config.toml.* files.

Tests:
  - test_string_with_newline_escaped — \n in value → \n in output
  - test_string_with_tab_escaped — \t in value → \t in output
  - test_string_with_other_controls_escaped — \r, \f, \b
  - test_windows_path_escaped_correctly — backslash doubling
  - test_atomic_write_no_temp_leak_on_success — no .config.toml.*
    left over after a successful write
  - test_atomic_write_cleanup_on_rename_failure — temp file removed
    when Path.replace raises (simulated disk full)

180 codex-runtime tests, all green (+6 from this commit).

Footguns audited but NOT fixed (with rationale):

- Concurrent migrations race. Two Hermes processes hitting
  /codex-runtime codex_app_server within seconds of each other could
  cause one writer to lose entries. Low probability (you'd have to
  enable from two surfaces simultaneously) and low impact (just re-run
  migration). Adding fcntl/msvcrt locking is more code than it's
  worth here. The atomic rename above means each individual write is
  consistent — only the merge step is racy.

- Codex protocol version drift. We pin MIN_CODEX_VERSION=0.125 and
  check at runtime but don't reject too-new versions. Right call —
  the protocol has been stable through 0.125 → 0.130. If OpenAI
  breaks it later we'd see the error in test_codex_app_server_runtime
  on CI before users hit it.
2026-05-13 17:18:15 -07:00
Teknium
9d42c2c286
feat(video_gen): unified video_generate tool with pluggable provider backends (#25126)
* feat(video_gen): unified video_generate tool with pluggable provider backends

One core video_generate tool, every backend a plugin. Mirrors the
image_gen + memory_provider + context_engine architecture: ABC, registry,
plugin-context registration hook, and per-plugin model catalogs surfaced
through hermes tools.

Surface (one schema, every backend):
- operation: generate / edit / extend
- modalities: text-to-video (prompt only), image-to-video (prompt +
  image_url), video edit (prompt + video_url), video extend (video_url)
- reference_image_urls, duration, aspect_ratio, resolution,
  negative_prompt, audio, seed, model override
- Providers ignore unknown kwargs and declare what they support via
  VideoGenProvider.capabilities() — backend-specific quirks stay in the
  backend, the agent learns one tool

Backends shipped:
- plugins/video_gen/xai/  — Grok-Imagine, full generate/edit/extend +
  image-to-video + reference images (salvaged from PR #10600 by
  @Jaaneek, reshaped into the plugin interface)
- plugins/video_gen/fal/  — Veo 3.1 (t2v + i2v), Kling O3 i2v,
  Pixverse v6 i2v with model-aware payload building that drops keys a
  model doesn't declare

Wiring:
- agent/video_gen_provider.py — VideoGenProvider ABC, normalize_operation,
  success_response / error_response, save_b64_video / save_bytes_video,
  $HERMES_HOME/cache/videos/
- agent/video_gen_registry.py — thread-safe register/get/list +
  get_active_provider() reading video_gen.provider from config.yaml
- hermes_cli/plugins.py — PluginContext.register_video_gen_provider()
- hermes_cli/tools_config.py — Video Generation category in
  hermes tools, plugin-only providers list, model picker per plugin,
  config write to video_gen.{provider,model}
- toolsets.py — new video_gen toolset
- tests: 31 new tests covering ABC, registry, tool dispatch, both plugins
- docs: developer-guide/video-gen-provider-plugin.md (parallel to the
  image-gen guide), sidebar + toolsets-reference + plugin guides updated

Supersedes: #25035 (FAL), #17972 (FAL), #14543 (xAI), #13847 (HappyHorse),
#10458 (provider categories), #10786 (xAI media+search bundle), #2984
(FAL duplicate), #19086 (Google Veo standalone — easy port to plugin
interface).

Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>

* feat(video_gen): dynamic schema reflects active backend's capabilities

Address the 'capability variance' question — instead of one tool with a
static schema that lies about what every backend supports, the
video_generate tool now rebuilds its description at get_definitions()
time based on the configured video_gen.provider and video_gen.model.

The agent sees backend-specific guidance up-front:
- 'fal-ai/veo3.1/image-to-video': 'image-to-video only — image_url is
  REQUIRED; text-only prompts will be rejected'
- 'fal-ai/veo3.1' (t2v): no image_url restriction shown
- xAI grok-imagine-video: 'operations: generate, edit, extend; up to 7
  reference_image_urls'
- Backends without edit/extend: 'not supported on this backend — surface
  that they need to switch backends via hermes tools'

This is the same pattern PR #22694 used for delegate_task self-capping —
documented in the dynamic-tool-schemas skill. Cache invalidation is
free: get_tool_definitions() already memoizes on config.yaml mtime, so a
mid-session backend swap rebuilds the schema automatically.

Tested:
- Empirical FAL OpenAPI schema check confirms image-to-video models
  require image_url (FAL returns HTTP 422 otherwise) — client-side
  rejection in FALVideoGenProvider.generate() now prevents the wasted
  round-trip
- Live E2E: fal-ai/veo3.1/image-to-video + prompt-only → clean
  missing_image_url error; fal-ai/veo3.1 + prompt-only → dispatches
- 6 new tests cover the builder (no config / image-only / full-surface /
  text-only / unknown provider / registry wiring), all passing
- 37/37 in the slice, 134/134 in the broader regression set

* test(video_gen/xai): full surface integration tests + cleaner schema

Verified end-to-end that the xAI plugin handles every documented mode
from PR #10600's surface: text-to-video, image-to-video,
reference-images-to-video, video edit, video extend (with and without
prompt). All five modes route to the correct xAI endpoint
(/videos/generations, /videos/edits, /videos/extensions) with the right
payload shape (image / reference_images / video keys), and all five
client-side rejections fire before the network: edit-without-prompt,
extend-without-video_url, image+refs conflict, >7 references, and
duration/aspect_ratio clamping.

15 new integration tests grouped into four classes (endpoint routing,
modalities, validation, clamping). httpx is stubbed via a small fake
AsyncClient that records POSTs so the tests assert the actual payload
the plugin would send to xAI — not just the success/error envelope.

Also cleaned up a description redundancy: when a model's operations
match the backend's overall set, we no longer print the duplicate
'operations supported by this model' line. xAI's description now reads:

    Active backend: xAI . model: grok-imagine-video
    - operations supported by this backend: edit, extend, generate
    - modalities supported by this backend: image, reference_images, text
    - aspect_ratio choices: 16:9, 1:1, 2:3, 3:2, 3:4, 4:3, 9:16
    - resolution choices: 480p, 720p
    - duration range: 1-15s
    - reference_image_urls: up to 7 images

Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>

* feat(video_gen): collapse surface to t2v + i2v, family-based auto-routing

Two design changes per Teknium:

1) Drop edit/extend from the tool surface entirely. Only text-to-video
and image-to-video remain. The agent sees a clean tool with two
modalities; backend-specific quirks like xAI's edit/extend endpoints
stay out of the unified schema.

2) FAL: pick a model FAMILY once, the plugin routes between the
family's text-to-video and image-to-video endpoints based on whether
image_url was passed. Users no longer pick 'fal-ai/veo3.1' AND
'fal-ai/veo3.1/image-to-video' as separate options — they pick
'veo3.1', and the plugin handles the rest.

Catalog rewritten as families:

    veo3.1            fal-ai/veo3.1                                /  fal-ai/veo3.1/image-to-video
    pixverse-v6       fal-ai/pixverse/v6/text-to-video             /  fal-ai/pixverse/v6/image-to-video
    kling-o3-standard fal-ai/kling-video/o3/standard/text-to-video /  fal-ai/kling-video/o3/standard/image-to-video

xAI uses a single endpoint (/videos/generations) for both modes,
routed by the presence of the 'image' field in the payload — no
edit/extend exposure.

Schema changes:
- VIDEO_GENERATE_SCHEMA: drop operation, drop video_url. Final params:
  prompt (required), image_url, reference_image_urls, duration,
  aspect_ratio, resolution, negative_prompt, audio, seed, model.
- VideoGenProvider ABC: drop normalize_operation, VALID_OPERATIONS,
  DEFAULT_OPERATION. capabilities() drops 'operations' key.
- success_response: add 'modality' field ('text' | 'image') so the
  agent and logs can see which endpoint was actually hit.

Dynamic schema builder simplified — no operations bullet, no
'switch backends if you need edit/extend' guidance. When the active
backend supports both modalities (the common case), description reads:

    Active backend: FAL . model: pixverse-v6
    - supports both text-to-video (omit image_url) and image-to-video
      (pass image_url) - routes automatically
    - aspect_ratio choices: 16:9, 9:16, 1:1
    - resolution choices: 360p, 540p, 720p, 1080p
    - duration range: 1-15s
    - audio: pass audio=true to enable native audio (pricing tier)
    - negative_prompt: supported

Tests: 51 in the video_gen slice, 216 across the broader image+video
sweep, all passing. New FAL routing tests prove pixverse-v6 + no image
hits text-to-video endpoint, pixverse-v6 + image_url hits
image-to-video endpoint, same for veo3.1 and kling-o3-standard.

Docs updated: developer-guide page rewrites the 'model families' pattern
as a first-class section so external plugin authors know the convention.
toolsets-reference and toolsets.py descriptions match the new surface.

Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>

* feat(video_gen/fal): expand catalog to 6 families, cheap + premium tiers

Catalog now covers everything Teknium specced from FAL:

  Cheap tier:
    ltx-2.3        fal-ai/ltx-2.3-22b/text-to-video       / image-to-video
    pixverse-v6    fal-ai/pixverse/v6/text-to-video       / image-to-video

  Premium tier:
    veo3.1         fal-ai/veo3.1                          / fal-ai/veo3.1/image-to-video
    seedance-2.0   bytedance/seedance-2.0/text-to-video   / image-to-video
    kling-v3-4k    fal-ai/kling-video/v3/4k/text-to-video / image-to-video
    happy-horse    fal-ai/happy-horse/text-to-video       / image-to-video

DEFAULT_MODEL moved from veo3.1 (premium) to pixverse-v6 (cheap, sane
defaults, both modalities) — better first-run UX for users who haven't
explicitly picked a model.

New family-entry knob: image_param_key. Kling v3 4K's image-to-video
endpoint expects start_image_url instead of image_url; declaring
image_param_key='start_image_url' on the family lets _build_payload
remap correctly. Other families default to plain image_url.

Per-family capability flags reflect each model's docs:
- LTX 2.3 + Happy Horse: minimal payloads (no duration/aspect/resolution
  enum exposed by FAL — let endpoint apply defaults)
- Seedance: 6 aspect ratios incl 21:9, durations 4-15, audio supported,
  negative prompts NOT supported per docs
- Kling v3 4K: 16:9/9:16/1:1, 3-15s, audio + negative
- Veo 3.1: unchanged, 16:9/9:16, 4/6/8s

Tests: +5 covering the new families (full catalog, Kling 4K
start_image_url remap, Seedance routing, LTX payload minimality, Happy
Horse minimality). 56/56 in the slice green.

Note: I did NOT add the FAL-hosted xAI Grok-Imagine variant. Hermes
already has a direct xAI plugin that talks to xAI's own API; routing
the same model through FAL's wrapper would duplicate the surface
without adding capabilities. Users on FAL who want Grok-Imagine should
use the xAI plugin directly; flag if you want both routes available.

* test(video_gen): tool-surface routing matrix — every model x modality

End-to-end matrix test driven through _handle_video_generate() — the
actual function the agent's video_generate tool call lands in. Writes
config.yaml, invokes the registered handler with a raw args dict, then
asserts the outbound HTTP/SDK call hit the right endpoint with the right
payload shape.

Parametrized over FAL_FAMILIES.keys() so the matrix auto-discovers new
families as they're added (add a family to FAL_FAMILIES and you get
both modalities tested for free).

Coverage:
- All 6 FAL families x {text-only, text+image} = 12 cases
- xAI x {text-only, text+image} = 2 cases
- tool-level model= arg overrides config = 2 cases

For each case, verifies:
- result['success'] is True
- result['modality'] matches input shape ('text' if no image_url, 'image' otherwise)
- outbound endpoint URL matches the family's text_endpoint or image_endpoint
- text-only payloads carry no image-shaped keys
- text+image payloads carry the family's image key (image_url for most,
  start_image_url for kling-v3-4k, wrapped 'image' object for xAI)

All 16 cases passing. Confirms the tool surface routes every
(provider, model, modality) combination correctly with zero leakage.

* feat(video_gen): keep video_gen out of first-run setup, surface in status

Two changes:

1. video_gen joins _DEFAULT_OFF_TOOLSETS, so it is NOT pre-selected in
   the first-run toolset checklist. Video gen is niche, paid, and slow —
   most users don't want it nagging them during initial setup. Anyone
   who wants it opts in via 'hermes tools' -> Video Generation, which
   already routes to the provider+model picker.

2. The 'hermes setup' status panel learns about video_gen — but only
   shows the row when a plugin reports available. Users without
   FAL_KEY/XAI_API_KEY see nothing about video gen; users with one of
   those keys see 'Video Generation (FAL) ✓' as confirmation it's wired.

Verified live:
- Fresh install (no creds): zero video_gen mentions in wizard.
- With FAL_KEY: status row appears with active backend name.
- 160/160 in the setup + tools_config + video_gen test slice.

Rationale: image_gen is on by default because it's a featured creative
tool used in casual chat (telegrams, etc). Video gen is heavier — long
wait, paid per-second pricing. Default-off matches user intent better.

---------

Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
2026-05-13 16:39:41 -07:00
GodsBoy
da0ddbf88a fix: classify landed file mutations with diagnostics 2026-05-13 06:46:23 -07:00
Teknium
486b692ddd
feat(nous): unified client=hermes-client-v<version> tag on every Portal request (#24779)
* feat(nous): unified client=hermes-client-v<version> tag on every Portal request

Every Hermes request to Nous Portal now carries the same
client=hermes-client-v<__version__> tag (e.g. client=hermes-client-v0.13.0
on this release), sourced live from hermes_cli.__version__. The release
script's regex bump auto-aligns it on every release.

Centralized in agent/portal_tags.py and wired into all four call sites:
- NousProfile.build_extra_body (main agent loop, every chat completion)
- auxiliary_client.NOUS_EXTRA_BODY + _build_call_kwargs (aux client)
- run_agent.py compression-summary fallback path
- tools/web_tools.py web_extract fallback

Replaces the client=aux marker added in #24194 with the unified version
tag. Tests assert against the helper output (invariant) rather than the
literal string, so they don't need updating on every release.

* feat(nous): cover /goal judge and kanban specify aux paths

Two aux-using surfaces bypassed call_llm by invoking
client.chat.completions.create() directly without extra_body, so they
were missing the unified Portal client tag:

- hermes_cli/goals.py — /goal standing-goal judge
- hermes_cli/kanban_specify.py — kanban triage specifier

Both now pass extra_body=get_auxiliary_extra_body() or None so they
inherit the version tag when the aux client points at Nous Portal, and
emit nothing otherwise (no tag leak to OpenRouter/Anthropic auxes).
2026-05-12 20:49:20 -07:00
Teknium
b06e999302
fix(cache): kill long-lived prefix layout — system prompt is now byte-static within a session (#24778)
The long-lived prefix-cache layout split the system prompt into stable/
context/volatile blocks and re-derived them on every API call. The
volatile tier (timestamp + memory snapshot + USER profile) ticks per
turn, so the system message bytes mutated mid-conversation and broke
upstream prompt caches (OpenRouter, Nous Portal, Anthropic).

Diagnosed via live wire-format diffing: an 8-turn conversation showed
OLD layout flipping system block[1] sha mid-session at the minute
boundary, dropping cached_tokens to 0 on that turn (cumulative
66.6% vs 83.3% for the single-block layout). Hermes invariant:
history (system + all but the last 1-2 messages) must be static.

Fix: drop the long-lived layout entirely. Single layout everywhere —
system_and_3 with one cached system string built once on first turn,
replayed verbatim on every subsequent turn. Loses cross-session 1h
prefix caching for Claude (the feature that motivated the split), but
within-session caching now actually works on every provider.

Removed:
- run_agent.py: _use_long_lived_prefix_cache flag, _long_lived_cache_ttl,
  _supports_long_lived_anthropic_cache method, the long-lived branch in
  run_conversation, mark_tools_for_long_lived_cache call site
- agent/prompt_caching.py: apply_anthropic_cache_control_long_lived,
  mark_tools_for_long_lived_cache, _mark_system_stable_block helper
- hermes_cli/config.py: prompt_caching.long_lived_prefix and
  prompt_caching.long_lived_ttl config keys
- tests/agent/test_prompt_caching_live.py (entire file)
- tests/agent/test_prompt_caching.py: TestMarkToolsForLongLivedCache,
  TestApplyAnthropicCacheControlLongLived
- tests/run_agent/test_anthropic_prompt_cache_policy.py:
  TestSupportsLongLivedAnthropicCache

Targeted tests: 62/62 pass.
2026-05-12 20:46:04 -07:00
ALIYILD
afa5b81918 fix(prompt_builder): inject tool-use enforcement for GLM models
GLM-family models (z-ai/glm-4.5-air, z-ai/glm-4.5-flash, etc.) exhibit
the same "describe-instead-of-call" failure mode that gpt/codex/gemini/
gemma/grok already trigger enforcement for. Without the injection,
free-tier GLM workers spawned by the kanban dispatcher routinely exit
cleanly (rc=0) without invoking kanban_complete or kanban_block,
producing the "protocol violation" error and triggering the dispatcher's
gave_up path.

Observed in real workloads: seven consecutive kanban tasks across three
GLM-tier profiles (shipbackend, frontend-engineer, backend-engineer) all
failed with the identical message:

    worker exited cleanly (rc=0) without calling kanban_complete or
    kanban_block — protocol violation

Re-running the same tasks on Claude Haiku immediately resolved them.
Adding "glm" to TOOL_USE_ENFORCEMENT_MODELS closes the gap so future
GLM-routed work receives the explicit "every response must contain a
tool call or final result" steering that already protects the other
enforcement-gated model families.

One-line change; no behavior change for non-GLM models.
2026-05-12 18:46:28 -07:00
Teknium
29c9ff9ba5
fix(lsp): typescript SDK install + tsc-missing skip + shellcheck warning (#24630)
Three follow-ups to PR #24168 found during live E2E testing on TS/bash files:

1. typescript-language-server now installs the typescript SDK (tsserver)
   alongside it. Without that sibling install, initialize() failed with
   "Could not find a valid TypeScript installation" and the server was
   marked broken — no diagnostics ever reached the agent. New extra_pkgs
   field on INSTALL_RECIPES makes that explicit and reusable for future
   peer-dep cases.

2. _check_lint now treats "linter command exists on PATH but cannot
   actually run" as skipped instead of error. The motivating case is
   npx tsc when typescript is not in node_modules — npx prints its
   "This is not the tsc command you are looking for" banner and exits
   non-zero, which previously blocked the LSP semantic tier (gated on
   success or skipped). Pattern-matched per base command (npx,
   rustfmt, go) so genuine lint errors still flow through normally.

3. hermes lsp status now surfaces a Backend warnings section when
   bash-language-server is installed but shellcheck is missing. The
   server itself spawns fine but bash-language-server delegates
   diagnostics to shellcheck — without it on PATH the integration
   looks alive but never reports any problems. Same warning is
   logged once at server spawn time.

Validation:

- 12 new tests in tests/agent/lsp/test_install_and_lint_fixes.py:
    * recipe carries typescript SDK
    * _install_npm passes both pkg + extras to npm CLI
    * backwards compat: recipes without extras still work
    * _backend_warnings quiet when bash absent / both present
    * _backend_warnings fires when bash installed without shellcheck
    * status output includes the Backend warnings section
    * _looks_like_linter_unusable catches the npx tsc banner
    * real TS type errors not misclassified as unusable
    * unfamiliar linters fall through normally
    * _check_lint returns skipped on npx tsc unusable
    * _check_lint returns error on real tsc type errors
- Full lsp + file_operations test suite: 245/245 pass
- Live E2E:
    * try_install("typescript-language-server") installs both packages
      into node_modules
    * write_file(bad.ts, ...) returns lint=skipped + lsp_diagnostics
      with two real TS errors (was lint=error, no lsp_diagnostics)
    * hermes lsp status renders the shellcheck warning when bash is
      installed but shellcheck is not on PATH
2026-05-12 17:02:35 -07:00
hookinglau
d68a0ec383 fix(auxiliary): pass cfg_base_url and cfg_api_key when resolving task provider
_resolve_task_provider_model drops cfg_base_url and cfg_api_key when
returning a named provider, causing configured API keys and base URLs
to be lost. Pass them through so named providers can use custom
endpoints while still resolving credentials from provider-specific
env vars.

Closes #20139
2026-05-12 16:36:20 -07:00
zccyman
88ede807c4 fix(pricing): add deepseek-v4-pro to official docs pricing table
deepseek-v4-pro has been routable since v0.12 but was missing from
the _OFFICIAL_DOCS_PRICING table. Sessions using this model showed
as "unknown cost" in hermes insights instead of a dollar estimate.

Add pricing entry using published list prices:
- input: \$1.74/M tokens
- output: \$3.48/M tokens
- cache_read: \$0.0145/M tokens

Uses standard list rates (not the 75% promo) so estimates remain
accurate after promo expires 2026-05-31.

Closes #24218
2026-05-12 16:32:57 -07:00
Teknium
83b93898c2
feat(lsp): semantic diagnostics from real language servers in write_file/patch (#24168)
* feat(lsp): semantic diagnostics from real language servers in write_file/patch

Wire ~26 language servers (pyright, gopls, rust-analyzer, typescript-language-server,
clangd, bash-language-server, ...) into the post-write lint check used by write_file
and patch. The model now sees type errors, undefined names, missing imports, and
project-wide semantic issues introduced by its edits, not just syntax errors.

LSP is gated on git workspace detection: when the agent's cwd or the file being
edited is inside a git worktree, LSP runs against that workspace; otherwise the
existing in-process syntax checks are the only tier. This keeps users on
user-home cwds (Telegram/Discord gateway chats) from spawning daemons.

The post-write check is layered: in-process syntax check first (microseconds),
then LSP semantic diagnostics second when syntax is clean. Diagnostics are
delta-filtered against a baseline captured at write start, so the agent only
sees errors its edit introduced. A flaky/missing language server can never
break a write -- every LSP failure path falls back silently to the syntax-only
result.

New module agent/lsp/ split into:

- protocol.py: Content-Length JSON-RPC framer + envelope helpers
- client.py: async LSPClient (spawn, initialize, didOpen/didChange,
  ContentModified retry, push/pull diagnostic stores)
- workspace.py: git worktree walk-up + per-server NearestRoot resolver
- servers.py: registry of 26 language servers (extension match,
  root resolver, spawn builder per language)
- install.py: auto-install dispatch (npm install --prefix, go install
  with GOBIN, pip install --target) into HERMES_HOME/lsp/bin/
- manager.py: LSPService (per-(server_id, root) client registry, lazy
  spawn, broken-set, in-flight dedupe, sync facade for tools layer)
- reporter.py: <diagnostics> block formatter (severity-1-only, 20-per-file)
- cli.py: hermes lsp {status,list,install,install-all,restart,which}

Wired into tools/file_operations.py:

- write_file/patch_replace now call _snapshot_lsp_baseline before write
- _check_lint_delta gains a third tier: LSP semantic diagnostics when
  syntax is clean
- All LSP code paths swallow exceptions; write_file's contract unchanged

Config: 'lsp' section in DEFAULT_CONFIG with enabled (default true),
wait_mode, wait_timeout, install_strategy (default 'auto'), and per-server
overrides (disabled, command, env, initialization_options).

Tests: tests/agent/lsp/ -- 49 tests covering protocol framing (encode and
read_message round-trip, EOF/truncation/missing Content-Length), workspace
gate (git walk-up, exclude markers, fallback to file location), reporter
(severity filter, max-per-file cap, truncation), service-level delta filter,
and an in-process mock LSP server that exercises the full client lifecycle
including didChange version bumps, dedup, crash recovery, and idempotent
teardown.

Live E2E verified end-to-end through ShellFileOperations: pyright
auto-installed via npm into HERMES_HOME, baseline captured, type error
introduced, single delta diagnostic surfaced with correct line/column/code/
source, then patch fix removes the diagnostic from the output.

Docs: new website/docs/user-guide/features/lsp.md page covering supported
languages, configuration knobs, performance characteristics, and
troubleshooting; cli-commands.md updated with the 'hermes lsp' reference;
sidebar updated.

* feat(lsp): structured logging, backend gate, defensive walk caps

Cherry-picks the substantive ideas from #24155 (different scope, same
problem space) onto our PR.

agent/lsp/eventlog.py (new): dedicated structured logger
``hermes.lint.lsp`` with steady-state silence. Module-level dedup sets
keep a 1000-write session at exactly ONE INFO line ("active for
<root>") at the default INFO threshold; clean writes log at DEBUG so
they never reach agent.log under normal config. State transitions
(server starts, no project root for a file, server unavailable) fire
at INFO/WARNING once per (server_id, key); novel events (timeouts,
unexpected errors) fire WARNING per call. Grep recipe: ``rg 'lsp\\['``.

agent/lsp/manager.py: wire the eventlog into _get_or_spawn and
get_diagnostics_sync so users can answer "did LSP fire on this edit?"
with a single grep, plus surface "binary not on PATH" warnings once
instead of silently retrying every write.

tools/file_operations.py: backend-type gate. ``_lsp_local_only()``
returns False for non-local backends (Docker / Modal / SSH /
Daytona); ``_snapshot_lsp_baseline`` and ``_maybe_lsp_diagnostics``
now skip entirely on remote envs. The host-side language server
can't see files inside a sandbox, so this prevents pretending to
lint a file the host process can't open.

agent/lsp/protocol.py: 8 KiB cap on the header block in
``read_message``. A pathological server that streams headers
without ever emitting CRLF-CRLF would have looped forever consuming
bytes; now raises ``LSPProtocolError`` instead.

agent/lsp/workspace.py: 64-step cap on ``find_git_worktree`` and
``nearest_root`` upward walks, plus try/except containment around
``Path(...).resolve()`` and child ``.exists()`` calls. Defensive
against pathological inputs (symlink loops, encoding errors,
permission failures mid-walk) — the lint hook is hot-path code and
must never raise.

Tests:
- tests/agent/lsp/test_eventlog.py: 18 tests covering steady-state
  silence (clean writes stay DEBUG), state-transition INFO-once
  semantics (active for, no project root), action-required
  WARNING-once (server unavailable), per-call WARNING (timeouts,
  spawn failures), and the "1000 clean writes => 1 INFO" contract.
- tests/agent/lsp/test_backend_gate.py: 5 tests verifying
  _lsp_local_only / snapshot_baseline / maybe_lsp_diagnostics skip
  the LSP layer for non-local backends and route correctly for
  LocalEnvironment.
- tests/agent/lsp/test_protocol.py: new test_read_message_rejects_runaway_header
  exercising the 8 KiB cap.

Validation:
- 73/73 LSP tests pass (49 original + 18 eventlog + 5 backend-gate + 1 framer cap)
- 198/198 pass when run alongside existing file_operations tests
- Live E2E re-run with pyright still surfaces "ERROR [2:12] Type
  ... reportReturnType (Pyright)" through the full path, then patch
  fix removes it on the next call.

* feat(lsp): atexit cleanup + separate lsp_diagnostics JSON field

Two improvements salvaged from #24414's plugin-form alternative,
keeping our core-integrated design:

1. atexit cleanup of spawned language servers
   ----------------------------------------------------------------
   ``agent/lsp/__init__.get_service`` now registers an ``atexit``
   handler on first creation that tears down the LSPService on
   Python exit.  Without this, every ``hermes chat`` exit was
   leaking pyright/gopls/etc. processes for a few seconds while
   their stdout buffers drained -- they got reaped by the kernel
   eventually but a watchful ``ps aux`` would catch them.

   The handler runs once per process (gated by
   ``_atexit_registered``); idempotent ``shutdown_service``
   ensures double-fire is a no-op.  Errors during shutdown are
   swallowed at debug level since by the time atexit fires the
   user has already seen the agent's final response.

2. Separate ``lsp_diagnostics`` field on WriteResult / PatchResult
   ----------------------------------------------------------------
   Previously the LSP layer folded its diagnostic block into the
   ``lint.output`` string, conflating the syntax-check tier with
   the semantic tier.  The agent (and any downstream parsers) now
   read syntax errors and semantic errors as independent signals:

       {
         "bytes_written": 42,
         "lint": {"status": "ok", "output": ""},
         "lsp_diagnostics": "<diagnostics file=...>\nERROR [2:12] ..."
       }

   ``_check_lint_delta`` returns to its original two-tier shape
   (syntax check + delta filter); ``write_file`` and
   ``patch_replace`` independently fetch LSP diagnostics via
   ``_maybe_lsp_diagnostics`` and pass them into the new field.
   ``patch_replace`` propagates the inner write_file's
   ``lsp_diagnostics`` so the outer PatchResult carries the patch's
   delta correctly.

Tests: 19 new
- tests/agent/lsp/test_lifecycle.py (8 tests): atexit registration
  fires once and only once across N get_service calls; the
  registered callable is our internal shutdown wrapper;
  shutdown_service is idempotent and safe when never started;
  exceptions during shutdown are swallowed; inactive service is
  cached so we don't rebuild on every check.
- tests/agent/lsp/test_diagnostics_field.py (11 tests): WriteResult
  / PatchResult dataclass shape, to_dict include/omit semantics,
  channel separation (lint and lsp_diagnostics carry independent
  signals), write_file populates the field via
  _maybe_lsp_diagnostics only when the syntax tier is clean,
  patch_replace propagates the field forward from its internal
  write_file.

Validation:
- 92/92 LSP tests pass (73 prior + 8 lifecycle + 11 diagnostics field)
- 217/217 pass with file_operations + LSP combined
- Live E2E reverified: clean writes -> both fields empty/none; type
  error introduced -> lint clean (parses), lsp_diagnostics carries
  the pyright reportReturnType block; patch fix -> both fields
  clean again.

* fix(lsp): broken-set short-circuit so a wedged server isn't paid every write

Discovered while auditing failure paths: a language server binary that
hangs (sleep forever, no LSP traffic on stdin/stdout) caused EVERY
subsequent write to re-pay the 8s snapshot_baseline timeout. Five
writes = ~64s of dead time.

The bug: ``_get_or_spawn`` adds the (server_id, root) pair to
``_broken`` inside its inner exception handler, but when the OUTER
``_loop.run`` timeout fires, it cancels the inner task before that
handler runs. The pair never makes it to broken-set, so the next
write re-enters the spawn path and re-pays the timeout.

Fix:

- New ``_mark_broken_for_file`` helper at the service layer marks
  the (server_id, workspace_root) pair broken from the OUTSIDE when
  the outer timeout fires. Called from the except branches in
  ``snapshot_baseline``, ``get_diagnostics_sync`` (asyncio.TimeoutError
  + generic Exception). Also kills any orphan client process that
  survived the cancelled future, fire-and-forget with a 1s ceiling.

- ``enabled_for`` now consults the broken-set BEFORE returning True.
  Files in already-broken (server_id, root) pairs short-circuit to
  False, so the file_operations layer skips the LSP path entirely
  with no spawn cost. Until the service is restarted (``hermes lsp
  restart``) or the process exits.

- A single eventlog WARNING is emitted on first mark-broken so the
  user knows which server gave up. Subsequent edits in the same
  project stay silent.

Tests: 7 new in tests/agent/lsp/test_broken_set.py — covers the
key shape (server_id, per_server_root), enabled_for short-circuit,
sibling-file skip in same project, project isolation (broken in
A doesn't affect B), graceful no-op for missing-server / no-workspace,
and an end-to-end test that snapshots after a failure and verifies
the next ``enabled_for`` returns False.

Validation:

- Live retest of the wedged-binary scenario: 5 sequential writes,
  first 8.88s (the one snapshot timeout), subsequent four ~0.84s
  (no LSP cost). Down from 5x12.85s = 64s before this fix.
- 99/99 LSP tests pass (92 prior + 7 broken-set)
- 224/224 pass with file_operations + LSP combined
- Happy path E2E reverified — clean write, type error introduced,
  patch fix all behave correctly with the new broken-set logic.

Note: the FIRST write to a wedged binary still pays 8s (the
snapshot_baseline timeout). We could shorten that, but pyright/
tsserver normally take 2-3s and slow CI rust-analyzer can need
5+ seconds, so 8s is the conservative ceiling. Subsequent writes
are instant.
2026-05-12 16:31:54 -07:00
rob-maron
2863e9484a
Use nous portal as model metadata authority (#24502)
* nous portal metadata resolver

* minor fixes
2026-05-12 11:59:31 -07:00
Teknium
c1eb2dcda7
feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback (#24220)
* feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback

Three coordinated mitigations for the Mini Shai-Hulud worm hitting
mistralai 2.4.6 on PyPI (2026-05-12) and for the next single-package
compromise that follows.

# What this PR makes true

1. Users with the poisoned mistralai 2.4.6 in their venv get a loud
   detection banner with copy-pasteable remediation steps the moment
   they run hermes (and on every gateway startup).
2. One quarantined / yanked PyPI package can no longer silently demote
   a fresh install to 'core only' — the installer keeps every other
   extra and tells the user which tier landed.
3. Future opt-in backends (Mistral, ElevenLabs, Honcho, etc.) can
   lazy-install on first use under a strict allowlist, instead of
   eagerly pulling everything at install time.

# Detection: hermes_cli/security_advisories.py

- ADVISORIES catalog (one entry currently: shai-hulud-2026-05 for
  mistralai==2.4.6). Adding the next one is a single dataclass.
- detect_compromised() uses importlib.metadata.version() — no pip
  dependency, works in uv venvs that lack pip.
- Banner cache (~/.hermes/cache/advisory_banner_seen) rate-limits
  the startup banner to once per 24h per advisory.
- Acks persisted to security.acked_advisories in config.yaml; never
  re-banner after ack.
- Wired into:
  * hermes doctor — runs first, prints full remediation block
  * hermes doctor --ack <id> — dismisses an advisory
  * cli.py interactive run() and single-query branches — short
    stderr banner pointing at hermes doctor
  * gateway/run.py startup — operator-visible warning in gateway.log

# Lazy-install framework: tools/lazy_deps.py

- LAZY_DEPS allowlist maps namespaced feature keys (tts.elevenlabs,
  memory.honcho, provider.bedrock, etc.) to pip specs.
- ensure(feature) installs missing deps in the active venv via the
  uv → pip → ensurepip ladder (matches tools_config._pip_install).
- Strict spec safety regex rejects URLs, file paths, shell metas,
  pip flag injection, control chars — only PyPI-by-name accepted.
- Gated on security.allow_lazy_installs (default true) plus the
  HERMES_DISABLE_LAZY_INSTALLS env var for restricted/audited envs.
- Migrated three backends as proof of pattern:
  * tools/tts_tool.py — _import_elevenlabs() calls ensure first
  * plugins/memory/honcho/client.py — get_honcho_client lazy-installs
  * tts.mistral / stt.mistral entries pre-registered for when PyPI
    restores mistralai

# Installer fallback tiers

scripts/install.sh, scripts/install.ps1, setup-hermes.sh:

- Centralised _BROKEN_EXTRAS list (currently: mistral). Edit one
  array when a transitive breaks; users keep every other extra.
- New 'all minus known-broken' tier between [all] and the existing
  PyPI-only-extras tier. Only kicks in when [all] fails resolve.
- All three tiers explicit: every fallback announces which tier
  landed and prints a re-run hint when not on Tier 1.
- install.ps1 and install.sh both regenerate their tier specs from
  the same _BROKEN_EXTRAS array so updates stay in sync.

Side effect: install.ps1 Tier 2 spec previously hardcoded 'mistral'
in its extra list — bug fixed by the refactor (mistral is filtered
out).

# Config

hermes_cli/config.py — DEFAULT_CONFIG.security gains:
- acked_advisories: []  (advisory IDs the user has dismissed)
- allow_lazy_installs: True  (security gate for ensure())

No config version bump needed — both keys nest under existing
security: block, and load_config's deep-merge picks up DEFAULT_CONFIG
defaults for users with older configs.

# Tests

tests/hermes_cli/test_security_advisories.py — 23 tests covering:
- detect_compromised matches/non-matches, wildcard frozenset
- ack persistence, idempotence, blank rejection, config-failure path
- banner cache rate limiting + 24h re-banner + ack-stops-banner
- short_banner_lines / full_remediation_text / render_doctor_section /
  gateway_log_message
- shipped catalog well-formedness invariant

tests/tools/test_lazy_deps.py — 40 tests covering:
- spec safety: 11 safe parametrized + 18 unsafe parametrized
- allowlist: unknown-feature rejection, namespace.name shape,
  every shipped spec passes the safety regex
- security gating: config flag, env var, default, fail-open
- ensure() happy/sad paths: already-satisfied, install success,
  pip stderr surfaced on failure, install-succeeds-but-still-missing
- is_available, feature_install_command

Combined: 63 new tests, all passing under scripts/run_tests.sh.

# Validation

- scripts/run_tests.sh tests/hermes_cli/test_security_advisories.py
  tests/tools/test_lazy_deps.py → 63/63 passing
- scripts/run_tests.sh tests/hermes_cli/test_doctor.py
  tests/hermes_cli/test_doctor_command_install.py
  tests/tools/test_tts_mistral.py tests/tools/test_transcription_tools.py
  tests/tools/test_transcription_dotenv_fallback.py → 165/165 passing
- scripts/run_tests.sh tests/hermes_cli/ tests/tools/ →
  9191 passed, 8 pre-existing failures (verified on origin/main
  before this change)
- bash -n on install.sh and setup-hermes.sh → OK
- py_compile on all modified .py files → OK
- End-to-end smoke test of detect_compromised + render_doctor_section
  + gateway_log_message with mocked installed version → produces
  copy-pasteable remediation output

# Community

Full advisory + remediation steps:
website/docs/community/security-advisories/shai-hulud-mistralai-2026-05.md

Short-form post drafts (Discord, GitHub pinned issue, README banner):
scripts/community-announcement-shai-hulud.md

Refs: PR #24205 (mistral disabled), Socket Security advisory
<https://socket.dev/blog/mini-shai-hulud-worm-pypi>

* build(deps): pin every direct dep to ==X.Y.Z (no ranges)

Companion to the supply-chain advisory work: replace every >=/</~= range
in pyproject.toml's [project.dependencies] and [project.optional-dependencies]
with an exact ==X.Y.Z pin sourced from uv.lock.

Why: ranges allow PyPI to ship a fresh version of any direct dep at any
time without a code review on our side. With ranges, the malicious
mistralai 2.4.6 release would have been pulled by every fresh
'pip install -e .[all]' for the hours between upload and PyPI's
quarantine — exactly the install window we got hit on. Exact pins close
that window: the only way a new package version reaches a user is via
an intentional update on our end.

What the user-facing change is: nothing, behavior-wise. Every package
resolves to the same version it was already resolving to via uv.lock —
the pins just remove the resolver's freedom to pick a different one.

Cost: any user installing Hermes alongside another package that requires
a newer pin gets a resolver conflict. Acceptable for our isolated-venv
install path; documented in the new comment block.

Build-system requires line (setuptools>=61.0) is intentionally left
as a range — pinning the build backend would block fresh pip from
bootstrapping the build on architectures where that exact wheel isn't
available.

mistral extra (mistralai==2.3.0) is pinned but stays out of [all]
(per PR #24205). 'uv lock' regeneration will fail until PyPI restores
mistralai; lockfile regeneration is gated behind that, NOT on every PR.

LAZY_DEPS in tools/lazy_deps.py also moved to exact pins so the lazy-
install pathway can never resolve a different version than the one
declared in pyproject.toml.

Validation:

- Cross-checked all 77 pinned direct deps in pyproject.toml against
  uv.lock — every pin matches the resolved version exactly.
- Cross-checked all LAZY_DEPS specs against uv.lock — same.
- 'uv pip install -e .[all] --dry-run' resolves 205 packages cleanly.
- tests/tools/test_lazy_deps.py + tests/hermes_cli/test_security_advisories.py
  → 63/63 passing (every shipped spec passes the safety regex).
- Doctor + TTS + transcription targeted suite → 146/146 passing.

* build(deps): hash-verify transitives via uv.lock; remove unresolvable [mistral] extra

You asked: 'what about the dependencies the dependencies rely on?' —
correctly noting that exact-pinning direct deps in pyproject.toml does
NOT cover the transitive graph. `pip install` and `uv pip install` both
re-resolve transitives fresh from PyPI at install time, so a compromised
transitive (e.g. `httpcore` if it got worm-poisoned tomorrow) would
still hit our users even with every direct dep exact-pinned.

# What this commit fixes

1. **Both real installer scripts now prefer `uv sync --locked` as Tier 0.**
   uv.lock records SHA256 hashes for every transitive — a compromised
   package with a different hash gets REJECTED. Falls through to the
   existing `uv pip install` cascade if the lockfile is missing or
   stale, with a loud warning that the fallback path does NOT
   hash-verify transitives. Previously only `setup-hermes.sh` (the dev
   path) used the lockfile; `scripts/install.sh` and `scripts/install.ps1`
   (the paths fresh users actually run) skipped it.

2. **Removed the `[mistral]` extra entirely.** The `mistralai` PyPI
   project is fully quarantined right now — every version returns 404,
   so any pin we wrote was unresolvable, which broke `uv lock --check`
   in CI. Restoration is documented in pyproject.toml as a 5-step
   checklist (verify, re-add extra, re-enable in 4 modules, regenerate
   lock, optionally re-add to [all]).

3. **Regenerated uv.lock.** 262 packages, mistralai/eval-type-backport/
   jsonpath-python pruned. `uv lock --check` now passes.

# Defense-in-depth view

| Layer                      | Where             | Protects against                          |
|----------------------------|-------------------|-------------------------------------------|
| Exact pins in pyproject    | direct deps       | new mistralai 2.4.6-style direct compromise |
| uv.lock + `--locked` install | transitive graph  | transitive worm injection                  |
| Tier-0 hash-verified path  | install.sh / .ps1 | actually USE the lockfile in fresh installs |
| `uv lock --check` CI gate  | every PR          | drift between pyproject and lockfile      |
| `hermes_cli/security_advisories.py` | runtime  | cleanup for users who already got hit      |

The exact pinning + hash verification together close the supply-chain
gap. Without the lockfile path, exact pins alone are theater.

# Validation

- `uv lock --check` → passes (262 packages resolved, no drift).
- `bash -n` on install.sh + setup-hermes.sh → OK.
- 209/209 tests passing across new + adjacent test files
  (test_lazy_deps.py, test_security_advisories.py, test_doctor.py,
  test_tts_mistral.py, test_transcription_tools.py).
- TOML parse OK.

* chore: remove community announcement drafts (PR body covers it)

* build(deps): lazy-install every opt-in backend (anthropic, search, terminal, platforms, dashboard)

Extends the lazy-install framework to cover everything that's not used by
every hermes session. Base install drops from ~60 packages to 45.

Moved out of core dependencies = []:
- anthropic   (only when provider=anthropic native, not via aggregators)
- exa-py, firecrawl-py, parallel-web (search backends; only when picked)
- fal-client  (image gen; only when picked)
- edge-tts    (default TTS but still optional)

New extras in pyproject.toml: [anthropic] [exa] [firecrawl] [parallel-web]
[fal] [edge-tts]. All added to [all].

New LAZY_DEPS entries: provider.anthropic, search.{exa,firecrawl,parallel},
tts.edge, image.fal, memory.hindsight, platform.{telegram,discord,matrix},
terminal.{modal,daytona,vercel}, tool.dashboard.

Each import site now calls ensure() before importing the SDK. Where the
module had a top-level try/except (telegram, discord, fastapi), the
graceful-fallback pattern was extended to lazy-install on first
check_*_requirements() call and re-bind module globals.

Updated test_windows_native_support.py tzdata check from snapshot
(>=2023.3 literal) to invariant (any version + win32 marker).

Validation:
- Base install: 45 packages (was ~60); 6 newly-extracted packages absent
- uv lock --check: passes (262 packages, no drift)
- 209/209 lazy_deps + advisory + doctor + tts/transcription tests passing
- py_compile clean on all 12 modified modules
2026-05-12 01:02:25 -07:00
Robin Fernandes
94d9db72ba add client marker tag on aux inference requests 2026-05-11 22:30:42 -07:00
rob-maron
32abe742fa fix comment 2026-05-11 21:30:29 -07:00
rob-maron
f0c2964f0b remove comments 2026-05-11 21:30:29 -07:00
rob-maron
057fc7b073 fix guard 2026-05-11 21:30:29 -07:00
rob-maron
528bba6734 fix kimi 2026-05-11 21:30:29 -07:00
Teknium
ea1d0462cf
fix(cli): vertical fallback for markdown tables wider than terminal (#23948)
Follow-up to #23863 (CJK table alignment). The realigner was
correctly padding pipes to identical column offsets, but when a
table's natural width exceeds terminal cells it produced lines that
the terminal soft-wrapped mid-cell, destroying column alignment
visually even though the bytes were perfectly padded. Reported as
'columns are not aligned' on tables containing one long row alongside
several short rows.

Approach mirrors Claude Code's MarkdownTable.tsx narrow-terminal
fallback: when realign_markdown_tables is given an available_width
budget and the rebuilt horizontal table exceeds it, render each body
row as 'Header: value' lines separated by a thin ─ rule. Word-wraps
oversize values at the budget with a 2-space continuation indent.

- agent/markdown_tables.py: realign_markdown_tables(text, available_width=None);
  threshold check at the top of _render_block flips into a new
  _render_vertical fallback. Includes _wrap_to_width with hard-break
  for tokens longer than the budget.
- cli.py: helper _terminal_width_for_streaming() returns
  shutil.get_terminal_size().columns minus _STREAM_PAD and a 2-cell
  safety margin; passed to all three realign call sites
  (_render_final_assistant_content for strip+render Panel paths, and
  the streaming flushers in _emit_stream_text / _flush_stream).
- tests/agent/test_markdown_tables.py: 4 new tests covering the
  overflow-vertical fallback for ASCII + CJK content, the
  'fits → keep horizontal' case, and the long-cell wrap with indent.

Live-verified: with COLUMNS=100, the user's reported 'long row in
ASCII table' case now renders as vertical key-value rows that all fit
the panel; the 6-column CJK comparison table still renders as an
aligned horizontal table because it fits inside 100 cols.
2026-05-11 16:49:13 -07:00
nicoechaniz
e2b713cced fix(model-metadata): skip OpenRouter for known providers, add kimi/moonshot to PROVIDER_TO_MODELS_DEV
Based on PR #23950 by @nicoechaniz.

- Add "kimi" and "moonshot" to PROVIDER_TO_MODELS_DEV → kimi-for-coding
- Gate OpenRouter metadata step behind "if not effective_provider":
  known providers should not be overridden by community-maintained OR data
- Keep the targeted Kimi-family 32k guard as a secondary safety net
  inside the OR gate (for unknown providers with Kimi models)

Co-authored-by: nicoechaniz <nicoechaniz@altermundi.net>
2026-05-11 13:16:07 -07:00
kshitijk4poor
91eef6255e fix: correct context-length resolution for kimi-k2.6 on Ollama Cloud and Kimi Coding
Kimi-k2.6 (which supports 262K context) was incorrectly resolved as 32K,
tripping the 64K minimum-context guard and preventing use of the model on
Ollama Cloud and Kimi Coding / Moonshot providers.

Three fixes in the context-length resolution chain:

1. Ollama Cloud native /api/show query: new _query_ollama_api_show()
   queries the Ollama native API for authoritative GGUF model_info
   context_length.  For hosted Ollama, prefers model_info over num_ctx
   since users can't set their own num_ctx on Cloud.  Added at step 5e
   in get_model_context_length(), before the models.dev fallback.

2. models.dev :cloud/-cloud suffix fallback: lookup_models_dev_context()
   now also tries appending :cloud and -cloud suffixes when the bare
   model name doesn't match.  models.dev stores 'kimi-k2.6:cloud' but
   users and the live API use bare 'kimi-k2.6'.

3. Kimi-family 32K guard: after the OpenRouter metadata step, reject
   exactly 32768 for Kimi-named models (kimi-*, moonshot*) and fall
   through to hardcoded defaults ('kimi': 262144).  OpenRouter reports
   32768 for moonshotai/kimi-k2.6 but the model actually supports 262K.
   Narrow filter — only 32768, only Kimi-family — becomes dead code
   when OpenRouter updates its metadata.

---
2026-05-11 13:16:07 -07:00
Teknium
7b76366552
feat(prompt-cache): cross-session 1h prefix cache for Claude on Anthropic / OpenRouter / Nous Portal (#23828)
Cuts input cost for first-turn Claude requests by ~85-90% on subsequent
sessions within an hour. Tools array (~13k tokens for default toolset) +
stable system prefix (~5-8k tokens) get a 1h cache_control marker; the
volatile suffix (memory, USER profile, timestamp, session id) sits in a
separate non-cached block at the end so it doesn't poison the cross-session
prefix when it changes.

Provider gate: Claude on native Anthropic (incl. OAuth subscription),
OpenRouter, and Nous Portal (which proxies to OpenRouter). All other
providers keep today's system_and_3 layout unchanged.

Layout (4 cache_control breakpoints, Anthropic max):
  1. tools[-1]              -> 1h (cross-session)
  2. system content[0]      -> 1h (cross-session, stable prefix)
  3. messages[-2]           -> 5m (within-session rolling)
  4. messages[-1]           -> 5m (within-session rolling)

Within-session rolling shrinks from 3 messages to 2 to free the breakpoint
budget. On Claude with realistic tool loadouts the long-lived tier carries
the bulk of cross-session value anyway.

System prompt is now always assembled cache-friendly: stable identity /
guidance / skills / platform hints first, then session-stable context
files (AGENTS.md, .cursorrules), then per-call volatile content. Old
single-string callers see the same logical content (same join order),
just reordered so volatile lives at the end.

Config knobs (defaults shown):
  prompt_caching:
    cache_ttl: "5m"           # rolling-window TTL (unchanged)
    long_lived_prefix: true    # opt-out switch
    long_lived_ttl: "1h"       # cross-session prefix TTL

Live E2E (tests/agent/test_prompt_caching_live.py, gated on
OPENROUTER_API_KEY) on anthropic/claude-haiku-4.5 with default toolset:
  Call 1 (cold):              cache_write=13,415  cache_read=0
  Call 2 (NEW agent + msg):   cache_write=391     cache_read=13,025
  Cross-session reuse:        97.09%

Implementation:
* agent/prompt_caching.py: new apply_anthropic_cache_control_long_lived()
  + mark_tools_for_long_lived_cache(); existing apply_anthropic_cache_control()
  preserved verbatim for the fallback path.
* agent/anthropic_adapter.py: convert_tools_to_anthropic() now forwards
  cache_control onto each Anthropic-format tool dict.
* run_agent.py: _build_system_prompt_parts() returns the 3-tier dict;
  _build_system_prompt() joins them (backward compatible).
  _supports_long_lived_anthropic_cache() policy added next to the existing
  _anthropic_prompt_cache_policy() (which now also recognises Nous Portal
  Claude — pre-existing gap fixed in passing).
  _build_api_kwargs() resolves tools_for_api once and propagates the
  marker through all four build paths (anthropic_messages, bedrock,
  codex_responses, profile/legacy chat completions).
  Long-lived flag plumbed into the runtime snapshot/restore + model-switch
  + fallback-promotion paths.

Tests:
* tests/agent/test_prompt_caching.py: +8 tests (TestMarkToolsForLongLivedCache,
  TestApplyAnthropicCacheControlLongLived).
* tests/run_agent/test_anthropic_prompt_cache_policy.py: +9 tests
  (TestSupportsLongLivedAnthropicCache matrix across 8 endpoint classes
  + a fallback-target case).
* tests/agent/test_prompt_caching_live.py: new live E2E (skipif when
  OPENROUTER_API_KEY is unset; runs outside the hermetic suite).
* Targeted suites: 327/327 pass (caching/adapter/policy/builder).
* tests/agent/ + tests/run_agent/: 3992 pass, 17 skip, 1 pre-existing
  flake (test_async_httpx_del_neuter::test_same_key_replaces_stale_loop_entry,
  verified failing on pristine origin/main).
2026-05-11 11:14:56 -07:00
kshitij
2ec8d2b42f
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace  with  for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.

608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
2026-05-11 11:13:25 -07:00
wuli666
111b859e49 fix(auxiliary): evict async wrappers on poisoned client (follow-up to #23482)
#23482 fixed cache poisoning in the sync path: when a Codex auxiliary
timeout closes the underlying OpenAI client, _evict_cached_client_instance
walks CodexAuxiliaryClient wrappers via their _real_client attribute and
drops the cache entry so the next aux call rebuilds.

The cache key includes async_mode (see _client_cache_key), so the sync and
async clients for the same provider live in two distinct entries pointing
at the same underlying transport. The fix walked the sync wrapper's
_real_client correctly but the async wrappers
(AsyncCodexAuxiliaryClient, AsyncAnthropicAuxiliaryClient,
AsyncGeminiNativeClient) never exposed _real_client at all, so the async
entry survived eviction and kept handing out the poisoned client.

Effect on async aux callers: one timeout now poisons every subsequent
async aux call (compression, vision, session_search, title_generation)
with 'Connection error' until gateway restart -- even while the sync
route recovered as designed in #23482.

Mirror the sync wrapper's _real_client onto each async wrapper so the
existing eviction helper finds them. Three changes, one per wrapper:

- AsyncCodexAuxiliaryClient: self._real_client = sync_wrapper._real_client
  (the underlying OpenAI client)
- AsyncAnthropicAuxiliaryClient: same shape
- AsyncGeminiNativeClient: self._real_client = sync_client (Gemini's
  native facade is itself the leaf; no OpenAI client beneath it)

Update _evict_cached_client_instance docstring to reflect that it now
covers both sync and async wrappers via the same attribute walk.

Test: TestAuxiliaryClientPoisonedCacheEviction.test_evict_cached_client_instance_walks_async_wrapper
seeds both sync and async cache entries pointing at the same leaf and
asserts both are dropped on a single eviction call. Verified the test
fails without the wrapper changes ("async cache entry survived
eviction -- wrapper is missing _real_client") and passes with them.

Refs #23482, #23432
2026-05-11 11:13:20 -07:00
Teknium
1d00716754
fix(cli,tui): align CJK / wide-char markdown tables (#23863)
CJK and emoji glyphs render as two terminal cells but JS String#length
and the model's own padding count them as one, so any markdown table
with Chinese / Japanese / Korean cells drifts right per row when a
real terminal renders it. Both surfaces fix this with a display-cell
width measurement (wcswidth on the Python side, stringWidth on the
TUI side).

Changes:
- agent/markdown_tables.py: new helper. realign_markdown_tables(text)
  detects markdown table blocks (header + |---| divider) and
  rewrites the row padding using wcwidth.wcswidth so every pipe and
  dash lines up across rows. No-op on text without tables.
- cli.py: hook the helper into _render_final_assistant_content for
  strip / render modes (raw passes through untouched), and into the
  streaming line emitter so live token-by-token rendering also
  produces aligned tables. A small two-buffer state machine in
  _emit_stream_text holds table rows until the block ends, then
  flushes them through the realigner so all rows pad to a single
  per-column width.
- ui-tui/src/components/markdown.tsx: renderTable now uses
  stringWidth (Bun.stringWidth fast path + East-Asian-width-aware
  fallback, already memoised in @hermes/ink) instead of UTF-16
  String#length for both column-width measurement and per-cell
  padding. Drops the comment that documented the bug as a deliberate
  limitation.

Validation:
- New tests/agent/test_markdown_tables.py (11): every rebuilt block
  shares pipe column offsets across rows for pure CJK, mixed
  CJK+emoji, ragged-row, and multi-table inputs.
- Updated tests/cli/test_cli_markdown_rendering.py: the existing
  strip-mode test asserted exact whitespace; rewritten to assert the
  alignment contract (cell content survives + every rendered row
  shares pipe offsets).
- New ui-tui markdown.test.ts case (1): rendered column-2 start
  offset is identical for the header + every body row, including
  the CJK row that drifted before the fix.
- Live: hermes chat -q with the user-reported screenshot prompt now
  produces a perfectly aligned table on the wire (header, divider,
  4 body rows including '通义千问', all pipes at identical columns).
2026-05-11 11:13:06 -07:00
kshitij
657874460f
chore: ruff auto-fixes — collapsible-else-if, if-stmt-min-max, dict.fromkeys (#23926)
PLR5501 (collapsible-else-if): 28 instances — else: if: → elif:
PLR1730 (if-stmt-min-max):   15 instances — if x<y: x=y → x=max(x,y)
C420   (dict.fromkeys):       2 instances — dictcomp → dict.fromkeys
PLR1704 (redefined-argument): 1 instance — reason → err_msg (shadow fix)
C414   (unnecessary-list):    1 instance — sorted(list(x)) → sorted(x)

28 files, -44 net lines. All mechanical, zero logic changes.
17,211 tests pass, zero regressions.
2026-05-11 11:03:29 -07:00
Teknium
228b7d27bd
fix(auxiliary): cache 402'd providers as unhealthy with TTL to stop per-call retry storms (#23597)
When an auxiliary provider returns HTTP 402 (credit / payment), every
subsequent compression / title-gen / session-search / vision call still
re-tried it as the FIRST entry in the chain — burning ~1 RTT to hit 402
again, then falling back. On a long Discord/LCM session that meant dozens
of doomed 402s per minute (issue #23570).

Add a per-process unhealthy-provider cache with a 10 min TTL. When any
caller observes a payment error against a provider, the label is marked
unhealthy and skipped by:
  * _resolve_auto Step-1 (main provider use-as-aux path)
  * _resolve_auto Step-2 (aggregator/fallback chain)
  * _try_payment_fallback (used by call_llm/acall_llm on first 402)

Skip-logs are throttled to once per minute per label so a bursty session
doesn't spam agent.log. Entries auto-expire so a topped-up account
recovers without manual intervention. The cache is in-process only by
design — multi-profile users with different keys per profile must each
hit the 402 once.

Refs #23570
2026-05-10 22:43:14 -07:00
Teknium
e5bce320db
fix(auxiliary): evict cached client on timeout/connection error (#23482)
A Codex auxiliary timeout closes the underlying OpenAI client (so the
streaming hang doesn't sit until the user kills the session), but the
cached wrapper kept pointing at the now-dead transport. Subsequent
auxiliary calls (compression retry, memory flush, background review,
title generation routed via provider: main) reused that closed client
and failed fast with 'Connection error' until the gateway restarted —
even though the main agent route was healthy the whole time.

Sync `_get_cached_client` had no liveness check (async did, via loop
identity), and the connection-error fallback in `call_llm` only fired
on the auto provider path, so an explicit provider — including the
common `auxiliary.compression.provider: main` shape — never evicted.

Three fixes:

* New `_evict_cached_client_instance(target)` helper that drops the
  cache entry whose stored client is target (or wraps it via
  `_real_client`, for `CodexAuxiliaryClient`).
* `_CodexCompletionsAdapter._close_client_on_timeout` evicts the
  wrapper after closing the inner OpenAI client.
* `call_llm` and `async_call_llm` evict on `_is_connection_error`
  before re-raising, regardless of whether the provider is auto.

Net effect: one timeout costs one summary attempt + the existing 30s
compressor cooldown; the next compaction rebuilds the client and
works. Non-connection errors (4xx/5xx) do not evict, so cache hits
stay stable.

Closes #23432
2026-05-10 18:55:05 -07:00
Teknium1
ae83a54be4 docs(kanban): worker lane contract page + review-required convention
Closes the architectural-pin part of #19931. Most of what that issue
asked for is already implemented (logs under kanban root, env-pinned
workspace, dispatcher routing of unknown assignees, lifecycle
ownership, structured handoff conventions). What was missing:

1. A written contract integrators can point at when adding a new
   worker lane shape, and
2. The "code-changing workers should not auto-promote success to
   done" convention.

This commit ships both as docs+convention layered on existing primitives.
No kernel changes — the kanban_complete / kanban_block / kanban_comment
surfaces already support the review-required pattern; we just hadn't
written it down or made it visible to workers.

Changes:

- `agent/prompt_builder.py::KANBAN_GUIDANCE`: append the review-required
  exception to step 5 of the lifecycle. Workers get the cue
  auto-injected into their system prompt — drop structured metadata
  into a kanban_comment first, then end with
  kanban_block(reason="review-required: <summary>") instead of
  kanban_complete when the work needs review. Total prompt size went
  from ~3000 to ~3275 chars; well under the 4096 budget enforced by
  test_kanban_guidance_size.

- `skills/devops/kanban-worker/SKILL.md`: add a worked example to the
  existing "Good summary + metadata shapes" section between the
  Coding-task and Research-task examples. Same shape as the others
  (kanban_comment with structured handoff JSON, then kanban_block with
  the human-readable reason). Plus a one-line guide on when to use
  kanban_complete vs the review-required pattern.

- `website/docs/user-guide/features/kanban-worker-lanes.md` (new): the
  integrator-facing contract. Covers the hierarchy, the three things
  every lane must provide (assignee, spawn mechanism, lifecycle
  terminator), the env vars the dispatcher injects, the
  review-required convention, the failure modes the kernel handles
  for free, and an explicit "external CLI worker lane" deferred-
  pending-concrete-asker section that links to #19931 and #19924.

- `website/sidebars.ts`: link the new page under user-guide/features.

The "specialist worker lanes for external CLI tools (Codex / Claude
Code / OpenCode)" runner is NOT shipped here. The dispatcher's
spawn_fn parameter already supports plugin-shaped extension; the
per-CLI integration work (auth, sandbox policy, exit-code mapping)
needs a concrete asker. The new docs page tells would-be integrators
the contract any such lane must satisfy.

Refs #19931
2026-05-10 18:15:52 -07:00
Teknium
d6e1fadbf5
fix(xai): omit reasoning.effort for grok models that reject it (#23435)
xAI's Responses API returns HTTP 400 ("Model X does not support
parameter reasoningEffort") for grok-4, grok-4-0709, grok-4-fast-*,
grok-4-1-fast-*, grok-3, grok-4.20-0309-*, and grok-code-fast-1 — even
though those models reason natively. Hermes was unconditionally sending
`reasoning: {effort: 'medium'}` to xAI for every Grok model, breaking
direct `--provider xai` for the entire grok-4 line.

Add a substring allowlist predicate (verified live against api.x.ai
2026-05-10) covering the only Grok families that accept the effort dial:
grok-3-mini*, grok-4.20-multi-agent*, grok-4.3*. The Responses transport
omits the `reasoning` key entirely for everything else while still
including `reasoning.encrypted_content` so we capture native reasoning
tokens.

Verified end-to-end: `hermes chat -q hi --provider xai --model grok-4-0709`
went from HTTP 400 to a successful reply.
2026-05-10 15:21:30 -07:00
Teknium
c39168453d
feat(i18n): localize all gateway commands + web dashboard, add 8 new locales (16 total) (#22914)
* feat(i18n): localize /model command output

Reported by @tianma8888: when Chinese users run /model, the labels
("Provider:", "Context:", "_session only_", etc.) are still English.
This routes the static prose through the existing i18n catalog so it
follows display.language / HERMES_LANGUAGE.

Changes:
- locales/{en,zh,ja,de,es,fr,tr,uk}.yaml: add 17 keys under
  gateway.model.* covering switched/provider/context/max_output/cost/
  capabilities/prompt_caching/warning/saved_global/session_only_hint/
  current_label/current_tag/more_models_suffix/usage_*.
- gateway/run.py _handle_model_command: replace hardcoded f-strings in
  the picker callback, the text-list fallback, and the direct-switch
  confirmation block with t("gateway.model.<key>", ...).

What stays English:
- model IDs, provider slugs, capability strings, cost figures, and the
  "[Note: model was just switched...]" prepended to the model's next
  prompt (LLM-facing, not user-facing).
- The two slightly-different session-only hints unify on a single key
  with the em-dash phrasing.

Validation: tests/agent/test_i18n.py 27/27 passing (parity contract
holds), tests/gateway/ -k 'model or i18n' 74/74 passing.

* feat(i18n): localize all gateway slash command outputs

Expands the i18n catalog from 7 strings to 234 keys across 35 gateway
slash command handlers, so non-English users see localized output for
\`/profile\`, \`/status\`, \`/help\`, \`/personality\`, \`/voice\`, \`/reset\`,
\`/agents\`, \`/restart\`, \`/commands\`, \`/goal\`, \`/retry\`, \`/undo\`,
\`/sethome\`, \`/title\`, \`/yolo\`, \`/background\`, \`/approve\`, \`/deny\`,
\`/insights\`, \`/debug\`, \`/rollback\`, \`/reasoning\`, \`/fast\`,
\`/verbose\`, \`/footer\`, \`/compress\`, \`/topic\`, \`/kanban\`,
\`/resume\`, \`/branch\`, \`/usage\`, \`/reload-mcp\`, \`/reload-skills\`,
\`/update\`, \`/stop\` (plus the \`/model\` block already added in the
previous commit).

Reported by @tianma8888 — Chinese users want command output prose in
their language, not just the labels we already had.

Translations are hand-written for all 8 supported locales (en, zh, ja,
de, es, fr, tr, uk), matching each catalog's existing style: full-width
punctuation in zh, em-dashes in zh/ja/uk, French spaced colons,
German noun capitalization, etc.

What stays English (unchanged):
- Identifiers/values: model IDs, file paths, profile names, session IDs,
  command flag names like --global, URLs, config keys.
- Backtick code spans: \`/foo\`, \`config.yaml\`.
- Log messages (logger.info/warning/error).
- LLM-facing system notes prepended to next prompt (e.g. [Note: model
  was just switched...]).
- Strings produced by external modules (gateway_help_lines,
  format_gateway, manual_compression_feedback) — those have their
  own surfaces.

New shared keys for cross-handler boilerplate:
- gateway.shared.session_db_unavailable (5 call sites: branch, title,
  resume, topic, _disable_telegram_topic_mode_for_chat)
- gateway.shared.session_not_found (1 site)
- gateway.shared.warn_passthrough (2 sites in /title's f"⚠️ {e}" pattern)

YAML gotcha fixed: \`yolo.on\` and \`yolo.off\` were originally written
unquoted, which YAML 1.1 parses as boolean True/False keys. Renamed to
\`yolo.enabled\` / \`yolo.disabled\` for both safety and clarity.

Test fix: tests/agent/test_i18n.py::test_t_missing_key_in_non_english_falls_back_to_english
now resets the catalog cache on teardown, so the fake "foo: English Foo"
locale doesn't poison the module-level cache for subsequent tests in
the same xdist worker. (Without this, every gateway slash command test
that shares a worker with the i18n suite would see the fake catalog.)

Validation:
- tests/agent/test_i18n.py: 27/27 (parity contract — every key in every
  locale, matching placeholder tokens).
- tests/gateway/: 5077 passed, 0 failed (full gateway suite).
- 180 t() call sites added across 35 handlers; 1872 catalog entries
  total (234 keys × 8 locales).

* feat(i18n): add 8 new locales — af, ko, it, ga, zh-hant, pt, ru, hu

Expands the static-message catalog from 8 → 16 languages, each with full
270-key parity against the English source-of-truth.  Every locale now
covers the same surface PR #22914 added: approval prompts plus all 35
gateway slash command outputs.

New locales:
- af  Afrikaans      (community ask in #21961 by @GodsBoy; PRs #21962, #21970)
- ko  Korean         (PRs #20297 by @tmdgusya, #22285 by @project820)
- it  Italian        (PR #20371 by @leprincep35700)
- ga  Irish/Gaeilge  (PR #20962 by @ryanmcc09-dot)
- zh-hant Traditional Chinese (PRs #20523 by @jackey8616, #13140 by @anomixer)
- pt  Portuguese     (PRs #20443 by @pedroborges, #15737 by @carloshenriquecarniatto, #22063 by @Magaav)
- ru  Russian        (PR #22770 by @DrMaks22)
- hu  Hungarian      (PR #22336 by @lunasec007)

Each locale uses native-quality translations matching the existing tone
and conventions of the older 8 locales:
- zh-hant uses 繁體 characters with TW/HK technical vocabulary (軟體
  not 软件, 連線 not 连接, 設定 not 设置, 訊息 not 消息, 工作階段 not 会话, 程式
  not 程序, 預設 not 默认, 伺服器 not 服务器), full-width punctuation 「:()」.
- ko uses formal 합니다체 (습니다/합니다) register throughout.
- pt uses European Portuguese as baseline with neutral PT/BR vocabulary
  where possible.
- ga uses standard An Caighdeán Oifigiúil; English loanwords retained
  for tech terms without good Irish equivalents (gateway, API, JSON).
- All preserve {placeholder} tokens, backtick code spans, slash commands,
  brand names (Hermes, MCP, TTS, YOLO, OpenAI, Telegram, etc.), and emoji.

Aliases added in agent/i18n.py:
- af-za, Afrikaans → af
- ko-kr, Korean, 한국어 → ko
- it-it, italiano → it
- ga-ie, Irish, Gaeilge → ga
- zh-tw, zh-hk, zh-mo, traditional-chinese → zh-hant (note: zh-tw used to
  alias to zh; now aliases to its own zh-hant catalog)
- zh-cn, zh-hans, zh-sg → zh (unchanged from before)
- pt-pt, pt-br, brazilian, portuguese → pt
- ru-ru, Russian, русский → ru
- hu-hu, Magyar → hu

The zh-tw alias re-routing is intentional: previously typing 'zh-TW' got
the Simplified Chinese catalog (wrong vocabulary for Taiwan/HK users).
Now those users get the proper Traditional Chinese catalog.

Validation:
- tests/agent/test_i18n.py: 43/43 (parity contract holds for all 16
  languages × 270 keys = 4320 catalog entries, with matching placeholder
  tokens).
- E2E alias resolution verified for all 19 alias inputs (Afrikaans, ko-KR,
  한국어, italiano, Gaeilge, zh-TW, zh-HK, traditional-chinese, pt-BR,
  brazilian, Magyar, etc.).
- tests/gateway/: 5198 passed (3 pre-existing TTS routing failures
  unrelated to i18n).

Credit to all contributors whose PRs surfaced these language requests.
Their original PRs may now be closed as superseded with credit.

* feat(dashboard-i18n): add 14 web dashboard locales matching the static catalog

Brings the React dashboard (web/src/) up to the same 16-language
coverage the static catalog already has after the previous commits in
this PR. The Translations interface is TypeScript-typed, so every new
locale must provide every key — tsc -b is the parity guard.

Languages added (each is a complete 429-line locale file):
- af  Afrikaans
- ja  Japanese        (PR #22513 by @snuffxxx surfaced this)
- de  German          (PR #21749 by @mag1art)
- es  Spanish         (PR #21749)
- fr  French          (PRs #21749, #10310 by @foXaCe)
- tr  Turkish
- uk  Ukrainian
- ko  Korean          (PRs #21749, #18894 by @ovstng, #22285 by @project820)
- it  Italian
- ga  Irish (Gaeilge)
- zh-hant Traditional Chinese (PR #13140 by @anomixer)
- pt  Portuguese      (PRs #22063 by @Magaav, #22182 by @wesleysimplicio, #15737 by @carloshenriquecarniatto)
- ru  Russian         (PRs #21749, #22770 by @DrMaks22)
- hu  Hungarian       (PR #22336 by @lunasec007)

Each translation covers all 15 namespaces with full key parity vs en.ts,
preserves every {placeholder} token verbatim, keeps identifiers
untranslated (brand names, file paths, cron expressions, code spans),
translates the language.switchTo tooltip into the target language, and
matches existing tone conventions (zh-hant uses TW/HK vocab; ja uses
formal desu/masu; ko uses formal seumnida register; ga uses An
Caighdean Oifigiuil with English loanwords for tech vocab without good
Irish equivalents).

Plumbing:
- web/src/i18n/types.ts: Locale union expanded to all 16 codes.
- web/src/i18n/context.tsx: imports all 16 catalogs; exports
  LOCALE_META (endonym + flag per locale); isLocale() type guard.
- web/src/i18n/index.ts: re-export LOCALE_META.
- web/src/components/LanguageSwitcher.tsx: replaced two-state EN-ZH
  toggle with a click-to-open dropdown listing all 16 languages.

Note: zh-hant.ts exports zhHant (camelCase) since hyphen is invalid in
a JS identifier; the canonical 'zh-hant' string keys it in TRANSLATIONS.

Validation:
- npx tsc -b: 0 errors. Every locale satisfies Translations.
- npm run build (tsc + vite production): green, 2062 modules.
- Each locale file is exactly 429 lines.

Out of scope: plugin dashboards (kanban/achievements ship as prebuilt
bundles with no source in repo); Docusaurus docs (separate surface);
TUI (no i18n yet).

* feat(plugin-i18n): localize achievements + kanban plugin dashboards across all 16 locales

Brings the two shipped plugin dashboards (hermes-achievements, kanban)
under the same i18n umbrella as the core dashboard PR #22914 just
established.  Both bundles now read user-facing strings from the host's
i18n catalog via SDK.useI18n() instead of hardcoded English.

## Approach

Plugin dashboards ship as prebuilt IIFE bundles in
plugins/<name>/dashboard/dist/index.js — no build step, no source in
repo (upstream-authored, vendored as compiled JS).  Earlier contributor
PRs (#22594, #22595, #18747) tried direct edits but didn't actually
wire the bundles to read translations.

This change does the wiring properly:

1.  Each bundle gets a useI18n shim at IIFE scope:
        const useI18n = SDK.useI18n
          || function () { return { t: { kanban: null }, locale: "en" }; };
    Older host SDKs without useI18n still load the bundle and render
    English fallbacks.

2.  A small tx(t, path, fallback, vars) helper resolves dotted keys
    under the plugin's namespace (t.kanban.* or t.achievements.*) and
    interpolates {placeholder} tokens.

3.  Every React component starts with const { t } = useI18n() and
    each user-visible string is wrapped in tx(t, "key", "English fallback").
    Helpers called outside React components (window.prompt callers,
    constants used during init) take t as a parameter.

4.  Top-level constants that were English dictionaries (COLUMN_LABEL,
    COLUMN_HELP, DESTRUCTIVE_TRANSITIONS, DIAGNOSTIC_EVENT_LABELS in
    kanban) become getColumnLabel(t, status)-style functions backed by
    FALLBACK_* dictionaries.

## Translations added

Two new top-level namespaces added to the dashboard's TypeScript-typed
Translations interface:

- achievements: ~70 keys covering the hero, scan banner, achievement
  card, share dialog, stats, filters, and empty states.
- kanban: ~145 keys covering the board, columns (with nested
  columnLabels and columnHelp sub-dicts), card detail panel,
  bulk-actions toolbar, dependency editor, board switcher, and
  diagnostic callouts.

Each key is provided across all 16 supported locales:
en, zh, zh-hant, ja, de, es, fr, tr, uk, af, ko, it, ga, pt, ru, hu.

Total new translation entries: ~3,440 (215 keys × 16 locales).

## What stays English (deliberate)

- API paths, CSS class names, data-* attributes, JSON keys, regex
  strings, URLs, file paths (~/.hermes/kanban.db, boards/_archived/).
- State identifier strings used as lookup keys (triage / todo / ready /
  running / blocked / done / archived) — labels translate, key strings
  don't.
- The PNG share-card text rendered to canvas in the achievements
  ShareDialog (HERMES AGENT watermark, UNLOCKED stamp, tier names) —
  these become part of a globally-shared image and stay English.
- localStorage keys (hermes.kanban.selectedBoard).
- Brand names (Kanban, Hermes, WebSocket, Nous Research).

## Contributor credit

PR #22594 by @02356abc and PR #22595 by @02356abc supplied the
en + zh kanban namespace skeleton (145 keys); used as the en source-
of-truth in this commit and translated to the other 14 locales.

PR #18747 by @laolaoshiren first surfaced the achievements
localization request.

## Validation

- npx tsc -b: 0 errors. All 16 locale .ts files satisfy the
  Translations type with full key parity.
- npm run build (tsc + vite production build): green, 2062 modules,
  1.56MB JS / 95KB CSS, ~2.5s build.
- node --check on both plugin bundles: parse cleanly.
- 126 tx() call sites in kanban, 46 in achievements.

## Out of scope

- TUI (ui-tui/) has no i18n infrastructure yet.
- Docusaurus docs (website/i18n/) — already had zh-Hans; expanding
  is a separate translation workstream (Thai / Korean / Hindi PRs).
2026-05-10 07:14:14 -07:00
Teknium
5aa755e4e6
feat(plugins): run any LLM call from inside a plugin via ctx.llm (#23194)
* feat(plugins): host-owned LLM access via ctx.llm

Plugins can now ask the host to run a one-shot chat or structured
completion against the user's active model and auth, without ever
seeing an OAuth token or API key. Closes the gap where plugins that
needed bounded structured inference (receipts, CRM extraction,
support classification) had to either bring their own provider keys
or register a tool the agent had to call.

New surface on PluginContext:
- ctx.llm.complete(messages, ...)
- ctx.llm.complete_structured(instructions, input, json_schema, ...)
- async siblings ctx.llm.acomplete / acomplete_structured

Backed by the existing auxiliary_client.call_llm pipeline — every
provider, fallback chain, vision routing, and timeout policy Hermes
already supports applies automatically.

Trust gate (fail-closed by default):
- plugins.entries.<id>.llm.allow_model_override
- plugins.entries.<id>.llm.allowed_models (allowlist; '*' = any)
- plugins.entries.<id>.llm.allow_agent_id_override
- plugins.entries.<id>.llm.allow_profile_override

Embedded model@profile shorthand goes through the same gate as
explicit profile=, so it can't bypass the auth-profile policy.
Conflicting explicit and embedded profiles fail closed.

Also lands:
- plugins/plugin-llm-example/ — reference plugin that registers
  /receipt-extract, demonstrating image+text structured input,
  jsonschema validation, and the trust-gate config.
- website/docs/developer-guide/plugin-llm-access.md — full API docs.
- 45 unit tests covering trust gates, JSON parsing, schema
  validation, image encoding, async surface, and config loading.

Validation:
- 2628 tests pass in tests/agent/
- E2E: bundled plugin loaded with isolated HERMES_HOME, slash
  command produced parsed JSON via stubbed call_llm
- response_format extra_body wired correctly for both json_object
  and json_schema modes

* docs(plugin-llm): rewrite quickstart and framing

The quickstart now uses a meeting-notes-to-tasks example instead of
a receipt extractor, and the page leads with hook-time / gateway
pre-filter / scheduled-job framing rather than the OpenClaw
KB/support/CRM/finance/migration enumeration that the original
upstream PR used. Receipt example moved to a separate worked
example link so the docs page itself doesn't echo any of the
upstream framing.

Also clarifies where ctx.llm fits in the broader plugin surface
(table comparing register_tool / register_platform / register_hook
/ etc.) and what makes this lane different from auxiliary_client
internals.

No code change.

* docs(plugin-llm): reframe as any LLM call, not just structured output

The original draft leaned heavily on complete_structured() and made
the chat lane (complete() / acomplete()) feel like a footnote.
Restructure so:

- The page title and description say 'any LLM call.'
- The lead shows BOTH a plain chat call (error rewriter) AND a
  structured call (triage scorer) up top.
- Quick start has two complete plugin examples — /tldr (chat) and
  /paste-to-tasks (structured).
- New 'When to use which' table for choosing complete() vs
  complete_structured() vs the async siblings.
- Trust-gate sections explicitly note 'all four methods,' and the
  request-shaping list calls out chat-only fields (messages) and
  structured-only fields (instructions, input, json_schema)
  alongside each other.
- The 'Where this fits' section now says 'for any reason,
  structured or not.'

The receipt-extractor reference plugin still exists under
plugins/plugin-llm-example/ — but the docs page no longer treats
it as the canonical surface example. It's now described as 'a third
worked example, this time with image input.'

No code change.

* feat(plugin-llm): split provider/model into independent explicit kwargs

The first cut accepted a single 'provider/model' slug on every method
and split it internally. That looked clean but broke under live test:
the model-override path tried to use the slug's vendor prefix as a
literal Hermes provider id, which silently switched the user off
their aggregator (e.g. plugin asks for 'openai/gpt-4o-mini' on a user
who routes through OpenRouter — host attempted to call the 'openai'
provider directly, failed because OPENAI_API_KEY wasn't set).

New shape mirrors the host's main config:

  ctx.llm.complete(
      messages=[...],
      provider='openrouter',         # gated, optional
      model='openai/gpt-4o-mini',    # gated, optional
      profile='work',                # gated, optional
      ...
  )

Each is independently gated by its own allow_*_override flag.
Granting model-override does NOT auto-grant provider-override.
Allowlists are now per-axis (allowed_providers, allowed_models)
matched literally against whatever string the plugin sends.

Dropped 'model@profile' embedded-suffix shorthand entirely. Hermes
doesn't use that pattern anywhere else; profile= is its own kwarg.

Live E2E (against real OpenRouter via Teknium's config) confirms:
- zero-config call works
- default-deny blocks each override with a helpful error
- model-only override stays on user's active provider (the bug)
- provider+model override switches cleanly
- allowlist refuses non-listed entries
- structured output round-trip parses + schema-validates

Tests: 49 cases (up from 45); all green. Docs updated to match the
new shape, including a 'most plugins never need this section' callout
on the trust-gate config block.

* fix+cleanup(plugin-llm): real attribution, hook-mode coverage, move example out of core

Three integration fixes for the ctx.llm surface:

1. Attribution bug — result.provider and result.model now reflect
   what call_llm actually used, not placeholder fallbacks ('auto',
   'default'). New _resolve_attribution() helper:

     - explicit overrides win (what the call targeted)
     - response.model wins for the recorded model (provider
       canonicalisation: 'gpt-4o' → 'gpt-4o-2024-08-06' etc.)
     - falls back to _read_main_provider() / _read_main_model()
       when no override is set, so audit logs reflect the user's
       active main provider/model
     - 'auto' / 'default' only when EVERYTHING is empty

   Live verified: zero-config call now records
   provider='openrouter', model='anthropic/claude-4.7-opus-20260416'
   instead of provider='auto', model='default'.

2. Hook-mode coverage — TestHookMode confirms ctx.llm.complete
   works from inside a registered post_tool_call callback. The
   docs page promised hook integration; now there's a test that
   exercises the lazy-import path through the real invoke_hook
   machinery. Two cases: traceback-rewrite hook with conditional
   ctx.llm.complete, and minimal hook regression for the
   sync-hook + sync-llm path.

3. Reference plugin moved out of core. plugins/plugin-llm-example/
   is gone from hermes-agent — it now lives in the new
   NousResearch/hermes-example-plugins companion repo. The docs
   page links there. Hermes' bundled plugins should be plugins
   users actually run; reference / docs-companion plugins live
   externally.

Test count: 56 (up from 49). Wider sweep on tests/hermes_cli/
+ tests/gateway/ + tests/tools/ + tests/agent/ shows 16770
passing; the 12 failures are all pre-existing on origin/main
(verified by stashing this branch's changes and re-running) —
kanban-boards, delegate-task, gateway-restart, tts-routing —
none touch the plugin_llm surface.

* chore(plugins): move all example plugins to companion repo

Reference / docs-companion plugins now live exclusively in
NousResearch/hermes-example-plugins, not bundled with the core repo:

- example-dashboard
- strike-freedom-cockpit

A new fourth example, plugin-llm-async-example, was added to that
repo demonstrating ctx.llm's async surface (acomplete()) with
asyncio.gather() — registers /translate <lang>: <text> which fires
forward translation + sentiment classifier in parallel, then a
back-translation for QA. Live-tested at 2.5s for three real
provider round-trips (would be ~5-6s sequential).

Docs updated:
- developer-guide/plugin-llm-access.md links both sync and async
  examples in the Reference section
- user-guide/features/extending-the-dashboard.md repoints both demo
  sections to the companion repo with corrected install paths
- user-guide/features/built-in-plugins.md drops the two demo rows
- AGENTS.md notes that example plugins live in the companion repo

Net: hermes-agent's plugins/ directory now contains only plugins
users actually run (memory providers, dashboard tabs that ship real
features, the disk-cleanup hook, platform adapters). All four
demo / reference plugins live externally where they can be cloned
on demand instead of inflating the core install.
2026-05-10 07:09:28 -07:00
Teknium
7312f7f849
feat(curator): hint at hermes curator pin in the rename block (#23212)
Surfaces the pin command at the moment users care about it: when a
consolidation just landed against their skill library and they're
looking at the umbrella name in the curator output. Previously `hermes
curator pin` existed but had no discovery surface — users only learned
it existed by reading docs or stumbling onto `hermes curator --help`.

The hint:

    archived 3 skill(s):
      • docx-extraction → document-tools
      • pdf-extraction → document-tools
      • old-stale — pruned (stale)
    full report: hermes curator status
    keep an umbrella stable: hermes curator pin document-tools

Gated on having at least one consolidation that produced an umbrella.
Pruned-only runs (nothing surviving to pin) skip the hint. When
multiple umbrellas were produced, picks alphabetically first as a
concrete example rather than listing them all.

3 new tests in tests/agent/test_curator_classification.py covering:
consolidation produces hint with real umbrella name, pruned-only run
omits it, multi-umbrella picks one example.
2026-05-10 06:44:53 -07:00
kshitijk4poor
44cdf555a8 fix(codex-spark): defensive 128k entry in DEFAULT_CONTEXT_LENGTHS + clarify validation test docstring
Two follow-ups from self-review:

1. Add gpt-5.3-codex-spark to DEFAULT_CONTEXT_LENGTHS at 128k. The
   primary resolution path for Spark goes through provider='openai-codex'
   → _CODEX_OAUTH_CONTEXT_FALLBACK (already correct). But if any future
   code path resolves Spark's context with a different provider (custom
   proxy, generic fallthrough), the longest-substring-first lookup in
   step 8 would match 'gpt-5' and report 400k, which is wrong by ~3x.
   Adding the explicit override is a cheap defensive correctness fix
   matching how gpt-5.4-mini and gpt-5.4-nano already shadow the generic
   gpt-5 entry.

2. Update test_openai_codex_model_validation_fallback.py docstring. The
   bug it was originally written for (gpt-5.3-codex-spark missing from
   listing) is now resolved by this PR's catalog restoration. The test
   still validly exercises the soft-accept code path for any future
   entitlement-gated Codex slug that ships before Hermes catalogs it,
   but the framing was stale — clarified.
2026-05-09 23:17:25 -07:00
kshitij
9ee9a4297d docs(codex-spark): document ChatGPT Pro entitlement gating
PR #12994 stripped gpt-5.3-codex-spark on the assumption that it was
unsupported. It's actually research-preview, ChatGPT-Pro-only, exposed
via the Codex OAuth backend at chatgpt.com/backend-api/codex/models —
not via the public OpenAI API.

Add explanatory comments in:
  - DEFAULT_CODEX_MODELS / _FORWARD_COMPAT_TEMPLATE_MODELS (codex_models.py)
  - _CODEX_OAUTH_CONTEXT_FALLBACK (model_metadata.py)
  - list_authenticated_providers' live-discovery branch (model_switch.py)

so future maintainers don't strip the entry again. Also documents the
intentional asymmetry that Spark stays out of the "openai" provider
catalog (it isn't on the public API) and why the supported_in_api
filter is *not* applied for the openai-codex route.
2026-05-09 23:17:25 -07:00
olegdater
c6dc295a35 fix(model-metadata): set codex-spark fallback context to 128k 2026-05-09 23:17:25 -07:00
olegdater
2a6f3deb50 fix(model-metadata): restore gpt-5.3-codex-spark fallback context 2026-05-09 23:17:25 -07:00