hermes-agent-features

Author	SHA1	Message	Date
burjorjee	52b049b560	fix: treat inline-shell timeout guard as timeout	2026-05-18 19:36:04 -07:00
墨綠BG	50e93f23f2	🐛 fix(memory): require newline after context tag	2026-05-18 10:53:08 -07:00
墨綠BG	341c8d3030	🐛 fix(memory): keep inline memory-context mentions visible	2026-05-18 10:53:08 -07:00
Slimydog21	aae1615977	fix(xai-responses): strip enum values containing '/' from tool schemas xAI's /v1/responses and /v1/chat/completions endpoints reject tool schemas whose enum values contain a forward slash with a generic HTTP 400 'Invalid arguments passed to the model.' before any token is emitted — the schema compiler trips on the '/' character regardless of where it appears. Most commonly hit by MCP-derived tools whose enum lists HuggingFace model IDs ('Qwen/Qwen3.5-0.8B', 'openai/gpt-oss-20b') or owner/name environment identifiers. Mirrors the existing strip_pattern_and_format sanitizer (PR for #27197). The new strip_slash_enum walks tool parameters and drops the entire enum keyword when any value contains '/' — keeping it partial would still 400 since xAI's failure is all-or-nothing on the enum. The field description still reaches the model so the prompting hint is preserved. Wired in at both code paths for parity: - agent/chat_completion_helpers.py (main agent xAI Responses path) - agent/auxiliary_client.py (aux client xAI Responses path, matching the same parity guarantee `2fae8fba9` established for pattern/format) Salvaged from #28021 by @Slimydog21 — contributor's branch was severely stale (would have reverted ~5000 LOC across azure/kanban/i18n); fix re-applied surgically on current main with their sanitizer + 9 tests preserved verbatim. Author noreply email used (original was a Mac hostname leak).	2026-05-18 10:37:35 -07:00
EloquentBrush0x	b570e0fdd0	fix(codex-oauth): quarantine terminal refresh errors so dead tokens are not replayed across sessions When a Codex OAuth refresh token is permanently invalidated (HTTP 400/401/403, token revoked or reused), _mark_exhausted was called but auth.json was left with the dead credentials. On the next session, _seed_from_singletons re-read auth.json and re-seeded the pool with the same revoked token, triggering the same terminal failure in a loop. Add _is_terminal_codex_oauth_refresh_error to auth.py and a matching quarantine block in _refresh_entry: when a terminal error is detected and auth.json holds no newer tokens, clear access_token/refresh_token from auth.json and remove all device_code-sourced pool entries from memory. Mirrors the Nous quarantine added in `c90556262` and the xAI quarantine in #28116. Also add a pre-refresh sync from auth.json before calling refresh_codex_oauth_pure, matching the xAI and Nous patterns, to avoid refresh_token_reused races when multiple Hermes processes share the same auth.json singleton. Salvaged from #27911 by @EloquentBrush0x — contributor's branch was severely stale (would have reverted ~5000 LOC across azure/kanban/i18n subsystems); fix re-applied surgically on current main with their predicate and tests preserved.	2026-05-18 10:31:40 -07:00
Teknium	9aae59feab	fix(compress): make abort-on-summary-failure opt-in via config flag (#28117 ) PR #28102 made the summary-failure abort path the unconditional default, changing established behavior. Gate it behind config.yaml flag `compression.abort_on_summary_failure` (default False = historical fallback-placeholder behavior). - hermes_cli/config.py: new `compression.abort_on_summary_failure` key, default False, documented inline. - agent/agent_init.py: read the flag from compression config and pass to ContextCompressor. - agent/context_compressor.py: `__init__` accepts `abort_on_summary_failure` (default False). `compress()` failure branch gates the abort on the flag; when False, falls through to the restored legacy fallback path (static "summary unavailable" placeholder + drop middle window). - tests: restore original fallback expectations as default; add new TestAbortOnSummaryFailure class for the opt-in mode. Gateway/CLI plumbing (force=True on /compress, hygiene/handler abort detection, locale `gateway.compress.aborted` key) from PR #28102 stays intact — those paths only fire when `_last_compress_aborted` is True, which now only happens when the flag is enabled.	2026-05-18 10:28:20 -07:00
EloquentBrush0x	5e40f83cb7	fix(xai-oauth): quarantine terminal refresh errors so dead tokens are not replayed across sessions When refresh_xai_oauth_pure raises a terminal error (HTTP 400/401/403, i.e. revoked or reused refresh token), _refresh_entry's existing race- recovery path re-syncs from auth.json and returns if another process has already rotated the tokens. If auth.json still holds the same stale token pair, the function fell through to _mark_exhausted — leaving the dead credentials in auth.json. On the next Hermes startup _seed_from_singletons re-seeded the pool from those stale tokens, causing the same failure loop on every session. Fix: after the auth.json re-sync check in the xAI-oauth error handler, detect terminal errors with the new _is_terminal_xai_oauth_refresh_error helper and apply a quarantine: - Clear access_token and refresh_token from providers["xai-oauth"]["tokens"] in auth.json so they are not re-seeded. - Write a last_auth_error entry for hermes doctor / auth status diagnostics. - Remove all loopback_pkce entries from the in-memory pool so the current session stops retrying with the dead credentials. Mirrors the identical quarantine already in place for Nous OAuth (`c90556262`). Closes the parity gap introduced when `c90556262` added Nous-only terminal error handling without a corresponding xAI-oauth path.	2026-05-18 10:28:09 -07:00
EloquentBrush0x	1fabd6e100	fix(error_classifier): classify xAI Grok entitlement SSE errors as auth When xAI returns a subscription/entitlement error through an SSE ``type=error`` frame, ``_StreamErrorEvent`` is raised with ``status_code=None``. This caused ``_classify_by_status`` (step 2 of ``classify_api_error``) to be skipped entirely, and the Grok-specific phrases ("do not have an active Grok subscription", "out of available resources") appeared in none of the message-pattern lists. The error fell through to ``FailoverReason.unknown (retryable=True)``, burning ``max_retries`` on every affected X Premium+ / SuperGrok user before the agent stopped — and ``_is_entitlement_failure`` was never called because it only fires under ``FailoverReason.auth``. The HTTP 403 path already handled this correctly (``_classify_by_status`` returns ``auth/non-retryable`` for 403). Add an explicit pattern block at step 1 (highest priority, before the ``status_code`` guard) so both code paths route to ``FailoverReason.auth, retryable=False, should_fallback=True`` — matching the 403 path exactly. Add three regression tests in ``Fix D`` section of ``test_codex_xai_oauth_recovery.py``: - primary "do not have an active Grok subscription" phrase - "out of available resources" + "grok" variant - unrelated ``_StreamErrorEvent`` must not be reclassified	2026-05-18 10:24:13 -07:00
flamiinngo	5613dfea93	fix(security): redact xAI (Grok) API keys in logs xAI is a first-class provider in hermes-agent with its own credential pool entry (XAI_API_KEY / xai-oauth). API keys follow the format xai-<60+ alphanumeric chars> and were absent from _PREFIX_PATTERNS in agent/redact.py. When a key appears raw in log output, tool results, or error messages, it passed through completely unmasked. The ENV-assignment and Bearer header patterns catch the most common cases, but a raw token in a stack trace or debug print had no protection. Verified before fix: redact_sensitive_text("using key xai-ABCD...rstu to call xAI", force=True) # "using key xai-ABCD...rstu to call xAI" <- exposed After fix: # "using key xai-AB...rstu to call xAI" <- masked Five unit tests added to TestXaiToken covering bare token masking, env assignment, short-prefix false positive, company name false positive, and visible prefix in masked output.	2026-05-18 10:21:22 -07:00
Teknium	1634397ddb	fix(compress): abort instead of dropping messages when summary LLM fails (#28102 ) When auxiliary compression's summary generation returns None (aux model errored, returned non-JSON, timed out, etc.) the compressor previously still dropped every middle message between compress_start..compress_end and replaced them with a static 'Summary generation was unavailable' placeholder. The session kept going but the user silently lost N turns of context for nothing. New behavior: on summary failure, compress() aborts entirely — returns the input messages unchanged and sets _last_compress_aborted=True. The existing _summary_failure_cooldown_until gate (30-60s) keeps the aux model from being burned on every turn. Auto-compress callers detect the no-op (len(after) == len(before)) and stop looping. The chat is 'frozen' at its current size until the next /compress or /new. Manual /compress (CLI + gateway) now passes force=True which clears the cooldown so users can retry immediately after an auto-abort. If the manual retry also fails, the user gets a visible warning telling them nothing was dropped and how to retry. - agent/context_compressor.py: compress() gains force= kwarg; failure branch sets _last_compress_aborted and returns messages unchanged instead of inserting placeholder. - run_agent.py: _compress_context() detects abort, surfaces warning, skips session-rotation entirely, returns messages unchanged. - cli.py + gateway/run.py: manual /compress paths pass force=True. - gateway/run.py: hygiene + /compress handlers detect _last_compress_aborted and emit the new 'Compression aborted' warning (gateway.compress.aborted) instead of the old 'N historical messages were removed' message. - locales/*.yaml: new gateway.compress.aborted key in all 16 locales. - tests: updated to assert the abort contract (messages preserved, compression_count not incremented, abort flag set, no placeholder leaked). New test_force_true_bypasses_failure_cooldown covers the manual-retry path.	2026-05-18 10:19:40 -07:00
glennc	9df9816dab	feat(azure-foundry): add Microsoft Entra ID auth Use azure-identity DefaultAzureCredential for keyless Foundry auth. Preserve refreshable callable credentials through OpenAI and Anthropic client paths. Add setup, doctor, auth status, docs, and tests for Entra auth. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-18 10:14:38 -07:00
teknium1	9cae9c0166	fix(aux): log sanitizer failures instead of silently swallowing them Match the warning behavior of the parent main-agent path in chat_completion_helpers.py — sanitizer failures should be visible in logs, not silent.	2026-05-18 09:43:44 -07:00
EloquentBrush0x	2fae8fba9c	fix(aux): strip pattern/format keywords from tool schemas on xAI Responses path xAI's /responses endpoint rejects tool schemas that contain pattern or format JSON Schema keywords with HTTP 400. chat_completion_helpers.py already strips these for the main-agent xAI/xai-oauth path (lines 294-302), but _CodexCompletionsAdapter.create() — used for every xAI OAuth auxiliary call (kanban decomposer, profile describer, etc.) — passed raw tool schemas without sanitization. MCP tools that carry pattern/format keywords (common for string fields) silently caused every auxiliary call over xAI OAuth to fail with an HTTP 400, while the main agent worked fine. Parity fix: call strip_pattern_and_format() on the tool list before converting to Responses API format, matching the main-agent guarantee.	2026-05-18 09:43:44 -07:00
teknium	f0c6d59148	fix(anthropic): scope MiniMax beta-strip to MiniMax only Cherry-pick of @sharziki's #27022 routed Azure Foundry through _requires_bearer_auth, which also triggered the MiniMax-specific beta-strip in _common_betas_for_base_url — dropping the 1M-context beta from Azure even though Azure needs it for 1M context. Split the strip predicate: introduce _is_minimax_anthropic_endpoint so the fine-grained-tool-streaming and context-1m strips only fire for MiniMax hosts, leaving Azure's bearer-auth header swap intact without losing 1M context. Also add a regression test that asserts Azure gets Bearer auth, the api-version query param, and the context-1m-2025-08-07 beta.	2026-05-18 09:27:18 -07:00
sharziki	73407b1e30	fix(auth): send Bearer auth for Azure Foundry anthropic_messages endpoints Azure AI Foundry's Anthropic-style endpoint requires `Authorization: Bearer` instead of `x-api-key`. Add `azure.com` to `_requires_bearer_auth()` so the existing Bearer path at line 586 fires before the generic third-party branch sets `api_key` (x-api-key). Fixes #26970	2026-05-18 09:27:18 -07:00
wysie	ff078738ea	fix(skills): load symlinked skill slash commands	2026-05-18 00:34:29 -07:00
Teknium	abf1af5401	feat(session_search): single-shape tool with discovery, scroll, browse — no LLM (#27590 ) * feat(session_search): single-shape tool with discovery, scroll, browse — no LLM Replaces the LLM-summarized session_search with a single-shape tool that returns actual messages from the DB. Three calling shapes inferred from args (no mode parameter): 1. Discovery — pass query. FTS5 + anchored ±5 window + bookends per hit, all in one call. ~20ms on a real DB instead of ~90s for the previous three aux-LLM calls. 2. Scroll — pass session_id + around_message_id. Returns a window centered on the anchor. To paginate, re-anchor on the first/last id of the returned window. Boundary message appears in both windows as the orientation marker. ~1ms per scroll call. 3. Browse — no args. Recent sessions chronologically. Bookend_start (first 3 user+assistant msgs) and bookend_end (last 3) give the agent goal + resolution on every discovery hit, so a single tool call reconstructs a long session's arc without loading the whole transcript. The aux-LLM summary path is gone: it cost ~$0.30/call, took ~30s, and laundered FTS5 hits through a model that could confabulate when the right session wasn't in the hit list. The merged shape returns byte-for-byte content from SQLite. History: - PR #20238 (JabberELF) seeded the fast/summary dual-mode split. - PR #26419 (yoniebans) expanded to fast/guided/summary with bookends, multi-anchor drill-down, default-mode config, and a teaching skill. This PR collapses that toolkit into one shape with explicit scroll support, drops the summary path, drops the mode parameter, drops the config knob, drops the skill. JabberELF's seed work is acknowledged via the AUTHOR_MAP entry. Validation: - 38/38 tool tests pass (tests/tools/test_session_search.py) - 12/12 get_messages_around tests pass (tests/hermes_state/) - 11/11 get_anchored_view tests pass (tests/hermes_state/) - Full tests/tools/ run: 5168 passing, 2 failures pre-exist on main (test ordering in test_delegate.py, unrelated) - E2E against live state DB: discovery 20ms, scroll 1ms, browse 280ms; pagination forward+backward works with boundary-message orientation; error paths return clean tool_error responses Co-authored-by: JabberELF <abcdjmm970703@gmail.com> Co-authored-by: yoniebans <jonny@nousresearch.com> * chore(session_search): prune dead LLM-summary config and docs Companion to the single-shape rewrite. The auxiliary.session_search config block, max_concurrency / extra_body tunables, and matching docs sections all referenced the removed LLM summarization path. Removing them so users don't try to tune knobs that nothing reads. - hermes_cli/config.py: drop dead auxiliary.session_search block from DEFAULT_CONFIG. Leftover keys in user config.yaml are harmless and ignored. - hermes_cli/tips.py: drop two tips referencing the removed max_concurrency / extra_body knobs. - website/docs/user-guide/configuration.md: drop 'Session Search Tuning' section and the auxiliary.session_search block from the example. - website/docs/user-guide/features/fallback-providers.md: drop session_search rows from the auxiliary-tasks tables and the dedicated tuning subsection. - website/docs/reference/tools-reference.md: rewrite the session_search entry to describe the new three-shape behaviour. - CONTRIBUTING.md: update the file-tree description. - tests/tools/test_llm_content_none_guard.py: remove TestSessionSearchContentNone class and test_session_search_tool_guarded — both guard against an unguarded .content.strip() call site in _summarize_session() that no longer exists. Validation: 97/97 targeted tests still pass (hermes_state + session_search + llm_content_none_guard). Config tests 55/55. --------- Co-authored-by: JabberELF <abcdjmm970703@gmail.com> Co-authored-by: yoniebans <jonny@nousresearch.com>	2026-05-17 23:28:45 -07:00
teknium1	4a3f13b47b	perf(prompt-cache): date-only timestamp + loud gateway-DB roundtrip logging The system prompt's 'Conversation started:' line carried minute precision (%I:%M %p), making it byte-unstable across every rebuild path. Within a CLI session the in-memory cache held, but on the gateway path (fresh AIAgent per turn → restore from session DB), any silent failure in the read or write path dropped the cache stem and forced a full re-prefill on every subsequent turn. Local prefix-caching backends (llama.cpp / vLLM) saw this as KV-cache invalidation; remote prefix-caching providers saw it as an Anthropic-style cache miss. Three changes: 1. Date-only timestamp ('Sunday, May 17, 2026' instead of '... 03:42 PM'). System prompt now byte-stable for the full day. The model can still query exact time via tools when it actually needs it. Credit: @iamfoz (PR #20451). 2. Loud logging on session DB write failures. The update_system_prompt call used to log at DEBUG, hiding disk-full / locked-database / schema drift behind a silent fall-through that forced fresh rebuilds on every subsequent turn. Now WARN with the session id and exception so persistent issues show up in agent.log without verbose mode. 3. Three-way stored-state distinction on read. The previous 'session_row.get("system_prompt") or None' collapsed three states into one (missing row / null column / empty string). Now we tell them apart and WARN when a continuing session lands on null/empty (which means the previous turn's write never persisted — every subsequent turn rebuilds and the prefix cache misses every time). The restore block is extracted into _restore_or_build_system_prompt() so the prefix-cache path can be unit-tested in isolation. E2E proof: fresh AIAgent constructed for turn 2 across a minute-boundary sleep restores byte-identical bytes from the session DB. NULL stored prompt fires the new warning. Date-only timestamp survives the rebuild path. All on real SessionDB, no mocks. Tests: - tests/agent/test_system_prompt_restore.py (10 new tests) - tests/run_agent/test_run_agent.py::TestBuildSystemPrompt:: test_datetime_is_date_only_not_minute_precision Closes #20451 (date-only), #18547 (prefix stabilization), #8689 (stabilize timestamp across compression), #15866 (timestamp caching question), #8687 (compression timestamp), #27339 (claim #3: live timestamp in cached system prompt). Co-authored-by: Martyn Forryan <9133432+iamfoz@users.noreply.github.com>	2026-05-17 23:20:37 -07:00
Teknium	9b91377bec	feat(grok): apply OpenAI execution guidance to xAI Grok / xai-oauth models (#27797 ) Grok models hit the same failure modes that OPENAI_MODEL_EXECUTION_GUIDANCE addresses for GPT/Codex: claiming completion without tool calls ('to be honest, I didn't create the file yet'), suggesting workarounds instead of using existing tools (proposing a folder-based memory system when the memory tool exists), replying with plans instead of executing. TOOL_USE_ENFORCEMENT_GUIDANCE was already injected for any model whose name contains 'grok' (TOOL_USE_ENFORCEMENT_MODELS). This extends the follow-on family-specific block — OPENAI_MODEL_EXECUTION_GUIDANCE (tool_persistence / mandatory_tool_use / act_dont_ask / prerequisite_checks / verification / missing_context) — to grok-named models too. The OPENAI_ prefix is retained for backwards compat with imports/tests; docstring + inline comment now note that the body is family-agnostic and the prefix reflects origin, not exclusivity. Tests cover the OpenRouter slug (x-ai/grok-4.3) and the xai-oauth bare name (grok-4.3), plus a negative control on claude. E2E verified against a real AIAgent build of the system prompt for both xai-oauth and openrouter grok models.	2026-05-17 23:00:37 -07:00
zccyman	a574246837	feat(auxiliary): add configurable fallback chains + main-agent safety net Layered fallback for auxiliary tasks (compression, vision, tts, web_extract, session_search, etc.): 1. Primary aux provider (existing) 2. User-configured auxiliary.<task>.fallback_chain (new) 3. Main agent provider + model (new — last-resort safety net) 4. Warn user + re-raise original error (new) For users on 'auto' (no explicit aux provider), the existing _try_payment_fallback auto-detection chain runs instead — its Step 1 already IS the main agent model, so they get the same behaviour without configuration. The configured fallback_chain config schema comes from #26882 / @zccyman; the main-agent safety net + exhaustion warning were added on top. Closes #26882. Builds on the capacity-error gate fix in the previous commit (#26803 / @Bartok9).	2026-05-17 17:15:31 -07:00
Bartok9	24c209f112	fix(auxiliary): detect quota exhaustion as payment error; allow capacity-error fallback for explicit providers Closes #26803 Root causes: 1. _is_payment_error() checked for billing keywords (credits, insufficient funds, billing, payment required) but missed daily token quota exhaustion phrases used by Bedrock, Vertex AI, and LiteLLM proxies — e.g. 'Too many tokens per day', 'quota exceeded', 'resource exhausted', 'daily limit'. These are functionally identical to credit exhaustion (provider cannot serve the request) but don't trigger fallback. 2. The call_llm() fallback chain was gated on resolved_provider == 'auto'. When a task resolves to a specific provider (e.g. 'custom' for a LiteLLM proxy, or 'openrouter'), capacity failures (payment/quota/connection) silently raise instead of trying alternatives. This is overly conservative: capacity errors mean the provider cannot serve the request regardless of user intent, so alternatives should always be tried. Fixes: - Add quota-related keywords to _is_payment_error(): quota_exceeded, too many tokens per day, daily limit, tokens per day, daily quota, resource exhausted (Vertex AI gRPC code). - Allow fallback for capacity errors (payment + connection) even when resolved_provider is not 'auto'. Rate-limit fallback stays gated on is_auto to honour explicit provider constraints for transient limits. - Apply both fixes to sync call_llm() and async acall_llm() paths. - Add 6 targeted tests for the new quota-error detection cases.	2026-05-17 17:15:31 -07:00
Robin Fernandes	569bc94b59	fix(auth) fix a few cases where refresh tokens were not rotated.	2026-05-17 16:56:37 -07:00
Robin Fernandes	20bffa5b37	refactor(auth): mostly cleanups and style changes	2026-05-17 16:56:37 -07:00
Robin Fernandes	0bac7dd05b	refactor(auth): collapse Nous inference fallback controls	2026-05-17 16:56:37 -07:00
Robin Fernandes	89a3d038cf	Switch to JWT token for inference against Nous, falling back to old opaque token on failure.	2026-05-17 16:56:37 -07:00
Robin Fernandes	c905562623	fix(auth): stop replaying invalid Nous refresh tokens Quarantine Nous OAuth state when refresh fails with terminal invalid_grant/invalid_token errors. Clear local and shared refresh material across runtime, managed access-token, proxy, and credential-pool paths so Hermes stops retrying revoked refresh sessions.	2026-05-17 16:56:37 -07:00
teknium1	bdc2113b5c	fix(xai): wire schema sanitizer into post-refactor build_api_kwargs Port of the run_agent.py changes from #27219 to current main: the _build_api_kwargs body was extracted into agent/chat_completion_helpers. build_api_kwargs, so wire the xAI tool-schema sanitization there (provider in {'xai', 'xai-oauth'} or base_url=api.x.ai). Logs a warning instead of silently swallowing exceptions, matching the contributor's review-followup fix. Co-authored-by: zccyman <zccyman@163.com>	2026-05-17 13:13:22 -07:00
teknium1	822e92edb3	fix(aux): default OpenRouter auxiliary to gemini-3-flash-preview	2026-05-17 12:44:48 -07:00
Hoang V. Pham	4a7cd2e16d	fix(codex): allow kanban worker board writes	2026-05-17 11:50:43 -07:00
teknium1	55d6a1636b	fix(agent): honor provider timeout config in streaming API calls Closes #25249 (and supersedes PR #25260) in spirit. Two bugs in the streaming chat-completions path caused provider timeout configuration to be silently ignored: 1. Hardcoded connect/pool timeout. The httpx.Timeout for streaming calls used hardcoded connect=30.0 and pool=30.0 regardless of the user's providers.<id>.request_timeout_seconds config. If the custom provider (e.g. Ollama) was unreachable, the call always waited exactly 30s before failing, ignoring any configured timeout. Fix: use min(_base_timeout, 60.0) for connect and pool when a provider timeout is configured, falling back to 30.0 otherwise. The 60s cap addresses review feedback (TCP handshake shouldn't wait the inference timeout — connect/pool cover the connection layer, not model latency). 2. Streaming stale-stream detector ignored provider config. The stale detector read only HERMES_STREAM_STALE_TIMEOUT (env default 180s). The providers.<id>.stale_timeout_seconds key (correctly used in the non-streaming path) was never consulted. Fix: check get_provider_stale_timeout(provider, model) first, then fall back to the env var. Aligns the streaming path with the non-streaming path's priority chain (config > env > default). Salvage shape diverged from PR #25260: the function moved to agent/chat_completion_helpers.py and the contributor's two commits (initial fix + 60s-cap review follow-up) are squashed into one final commit applied at the new location. Original diagnosis, fix shape, AND the 60s-cap review response from @zccyman in PR #25260; credited via Co-authored-by. Co-authored-by: zccyman <16263913+zccyman@users.noreply.github.com>	2026-05-17 11:39:37 -07:00
QuenVix	d5a0815c3d	fix(transports): use monotonic deadlines in codex app-server turn loop	2026-05-17 11:37:45 -07:00
kshitijk4poor	c74ff2c8ef	fix(browser): self-review pass — dead-import, log levels, future-proofing Addresses findings from two self-review passes pre-merge. First pass (3-agent parallel review): 1. plugins/browser/browser_use/provider.py: drop the ``_ = managed_nous_tools_enabled`` dead-import-hider in _get_config_or_none(). The import was actively misleading — the helper IS used in _get_config() (separate method, separate import), not here. The "keep static analysis happy" comment was wrong about what the helper does in this scope. 2. agent/browser_provider.py: drop ``pragma: no cover`` from is_configured() / provider_name() backward-compat aliases. They ARE covered by ``TestLegacyAbcAliases`` — the pragma would have masked future regressions. 3. tools/browser_tool.py: refactor _is_legacy_provider_registry_overridden() to compare against a module-frozen _DEFAULT_PROVIDER_REGISTRY snapshot instead of hardcoded set of 3 keys. Future maintainers adding a 4th built-in provider now just extend _PROVIDER_REGISTRY; the override detection adapts automatically. Previously the hardcoded ``set(...) != {"browserbase", "browser-use", "firecrawl"}`` would flip True forever on any 4-key registry, silently routing every install onto the legacy fixture path. 4. tools/browser_tool.py: when explicit ``browser.cloud_provider`` is set but the registry has no matching plugin (typo, uninstalled plugin, discovery failure), emit a WARNING with actionable text instead of silently falling through to auto-detect. Legacy code surfaced a typed credentials error via direct class instantiation; this log restores the signal in the post-migration path. 5. agent/browser_registry.py: trim the triple-redundant _LEGACY_PREFERENCE documentation. Module docstring + 13-line block-comment + 5-line inline comment was repeating the same point. Kept the docstring and trimmed the block-comment to 5 lines. 6. agent/browser_registry.py: upgrade is_available()-raised logging from DEBUG to WARNING with exc_info=True. A provider's availability check throwing is unusual enough that users debugging "no cloud provider" need the traceback in logs. 7. tests/plugins/browser/check_parity_vs_main.py: drop dead top-level imports (os, shutil, tempfile — only referenced inside the SUBPROCESS_SCRIPT string literal that runs in a child process). Second pass (architecture + claim-verification review): 8. tools/browser_tool.py: rewrite the inline comment in _get_cloud_provider auto-detect branch. Prior text claimed it "routes through the plugin registry's legacy preference walk so third-party plugins still get a chance to be selected when they're explicitly configured" — false on both counts. The branch uses module-level legacy class aliases (BrowserUseProvider / BrowserbaseProvider) directly; third-party plugins are intentionally reachable only via explicit ``browser.cloud_provider``. Corrected comment now matches behaviour and cross-references _LEGACY_PREFERENCE for the firecrawl gate rationale. 9. tools/browser_tool.py + tests/tools/test_managed_browserbase_and_modal.py: drop the unused ``get_active_browser_provider as _registry_get_active_browser_provider`` alias from the ``from agent.browser_registry import ...`` block. It was never referenced; matching test-stub line in the agent.browser_registry SimpleNamespace also dropped. ``get_provider`` is still imported (used by the explicit-config dispatch path at line 535). 10. plugins/browser/firecrawl/provider.py: align emergency_cleanup() with the early-guard pattern used in browserbase + browser_use plugins. Previously firecrawl tried the DELETE and relied on ``_headers()`` raising ValueError to trip a "missing credentials" warning; same final outcome but a different control flow that read like a bug to a maintainer skimming the three modules. Now: if is_available() is False, log+return early — identical shape to the other two providers. Verification: 54/54 unit tests + 13/13 parity scenarios still pass.	2026-05-17 04:04:15 -07:00
kshitijk4poor	40fde853fa	refactor(browser): dispatch _get_cloud_provider through agent.browser_registry Switches tools.browser_tool's cloud-provider lookup from the hardcoded _PROVIDER_REGISTRY class-instantiation pattern to the agent.browser_registry singleton registry that plugins self-populate. Changes: - tools/browser_tool.py top imports: pull BrowserProvider from agent.browser_provider (re-exported as CloudBrowserProvider for legacy callers) and the three provider classes from plugins/browser/<vendor>/. Legacy class names (BrowserbaseProvider, BrowserUseProvider, FirecrawlProvider) remain on tools.browser_tool as re-export shims so existing test patches (monkeypatch.setattr(browser_tool, 'BrowserUseProvider', ...)) keep working. - _get_cloud_provider() now consults agent.browser_registry.get_provider() for explicit-config lookups. The auto-detect fallback still uses BrowserUseProvider() / BrowserbaseProvider() at the module level so the cache-policy test fixtures (which patch those names) keep driving the function. Test-time _PROVIDER_REGISTRY overrides are detected by class identity and routed through the legacy factory-call path. - agent/browser_provider.py: BrowserProvider grows is_configured() and provider_name() as thin backward-compat aliases for the legacy CloudBrowserProvider API. Subclasses MUST implement is_available() and name; the aliases delegate. This keeps ~6 caller sites in browser_tool.py working without churning them. - tests/tools/test_managed_browserbase_and_modal.py: _install_fake_tools_package grows stubs for agent.browser_provider / agent.browser_registry / plugins.browser.<vendor>.provider so the test's spec-loader path (sys.modules-reset + reload-tool-from-disk) can satisfy tools.browser_tool's top-level imports. Verified: all 23 existing tests in test_browser_cloud_*.py + test_managed_browserbase_and_modal.py still pass post-cutover. The legacy tools/browser_providers/ directory is NOT yet deleted; several tests still _load_tool_module() those files via spec_from_file_location. The deletion + test-path updates land in a later commit.	2026-05-17 04:04:15 -07:00
kshitijk4poor	a15cdfb050	feat(browser): browser-use + firecrawl plugins; drop single-eligible shortcut Migrates the remaining two cloud browser providers to plugins: plugins/browser/browser_use/ — dual auth (direct BROWSER_USE_API_KEY or managed Nous gateway), idempotency- key handling for retried managed-mode creates, x-external-call-id capture. plugins/browser/firecrawl/ — direct FIRECRAWL_API_KEY only; distinct from plugins/web/firecrawl/ (same key, different endpoint). Also drops the 'single-eligible shortcut' rule from agent.browser_registry._resolve(). Was a copy-paste from web_search_registry that would have introduced a real behavior change: a user with only FIRECRAWL_API_KEY set (for web-extract) would silently get routed to a paid Firecrawl cloud browser on a fresh install — not matching origin/main, which only auto-detected between Browser Use and Browserbase. Third-party browser plugins are subject to the same gate: they require explicit `browser.cloud_provider` to take effect. Verified end-to-end via plugin discovery: - 3 plugins register (browser-use, browserbase, firecrawl) - _resolve(None) with no creds: None (local mode) - _resolve(None) with only FIRECRAWL_API_KEY: None (matches main) - _resolve('firecrawl'): firecrawl (explicit wins) - _resolve(None) with BU+firecrawl: browser-use (legacy walk first hit) - _resolve(None) with all three: browser-use (legacy walk order)	2026-05-17 04:04:15 -07:00
kshitijk4poor	c6e6909e5a	feat(browser): add BrowserProvider ABC mirroring web_search_provider template Foundation commit for the browser-provider plugin migration (#25214). Mirrors the architecture established by PR #25182 (web providers): - agent/browser_provider.py — BrowserProvider ABC. Preserves the legacy CloudBrowserProvider lifecycle contract bit-for-bit (create_session, close_session, emergency_cleanup, session metadata shape) so the dispatcher in tools/browser_tool.py becomes a pure registry lookup. Renames is_configured() → is_available() for parity with WebSearchProvider. - agent/browser_registry.py — selection registry with the same three-rule resolution as web_search_registry: 1. Explicit config wins (returns even if is_available() == False so the dispatcher surfaces a precise credentials error) 2. Single-eligible shortcut 3. Legacy preference walk: browser-use → browserbase, filtered by availability. Firecrawl is intentionally NOT in the legacy walk (matches pre-migration behaviour — Firecrawl was only reachable via explicit browser.cloud_provider: firecrawl). - hermes_cli/plugins.py — adds ctx.register_browser_provider() facade, one-liner mirror of register_web_search_provider(). No plugins registered yet; no dispatcher cutover yet. The next commits move browserbase/browser-use/firecrawl into plugins/browser/<vendor>/ and switch tools/browser_tool.py over to the registry.	2026-05-17 04:04:15 -07:00
hawknewton	c02606a385	chore(deps): lazy-install boto3/botocore for bedrock adapter agent/bedrock_adapter.py now calls lazy_deps to install boto3 and botocore on first import, mirroring how other optional provider adapters defer their heavy AWS dependencies until actually used. Keeps the base install slim for users who don't run on Bedrock.	2026-05-17 02:31:18 -07:00
flamiinngo	dbeaaa47f2	refactor(security): extract _block_message helper to unify block logic in _parse_response Both the `action=block` and `decision=block` branches in _parse_response shared identical field-priority and type-validation logic. Extract it into a single _block_message(primary, secondary) helper so the two branches are one line each and the type guard lives in exactly one place. No functional change: existing tests (TestParseResponse, 14 tests) all pass unchanged, confirming identical behaviour.	2026-05-17 02:31:18 -07:00
flamiinngo	63805965e7	fix(security): restore type safety and extract constant in shell hook block handler Address code review feedback on _parse_response: 1. Restore isinstance(raw, str) guard so non-string message/reason values (e.g. integers, lists) from a malformed hook response fall back to the default rather than being forwarded as-is. This keeps the contract that message in the returned dict is always a string. 2. Extract the repeated literal 'Blocked by shell hook.' into a module-level constant _DEFAULT_BLOCK_MESSAGE to avoid duplication and make it easy to change in one place. Four new unit tests added to tests/agent/test_shell_hooks.py covering: - action block with no message (uses default) - decision block with no reason (uses default) - action block with empty string message (uses default) - action block with non-string message, e.g. integer (uses default)	2026-05-17 02:31:18 -07:00
flamiinngo	aeda146112	fix(security): honor shell hook blocks even when message/reason is absent _parse_response in agent/shell_hooks.py only forwarded a pre_tool_call block directive if the hook also provided a non-empty message or reason. When either field was missing the function returned None, causing Hermes to treat the response as a no-op and execute the tool unconditionally. This means a hook that outputs {"action": "block"} or {"decision": "block"} without a reason string is silently ignored. The security boundary fails open: tools the user intended to gate are executed anyway. Fix: remove the message-presence guard. Honor the block unconditionally and fall back to a default message when none is provided. Existing hooks that already include a message or reason are unaffected.	2026-05-17 02:31:18 -07:00
haran2001	d9abbe7fa4	fix(metadata): qwen3.6-plus has a 1M context window (#27008 ) qwen3.6-plus did not have an explicit entry in DEFAULT_CONTEXT_LENGTHS, so the longest-substring fallback matched the generic 'qwen': 131072 catch-all. That dropped the effective context limit from 1,048,576 tokens to 131,072, prematurely lowered the compression threshold, and produced misleading warnings about main/compression context mismatch in long sessions. Add an explicit 'qwen3.6-plus': 1048576 entry before the catch-all and cover it with a regression test (bare, qwen/, and dashscope/ prefixes). Note: PR #6599 also mentions touching model_metadata.py but the actual diff only edits hermes_cli/models.py, so this fix is independent and not duplicated by that PR. Closes #27008	2026-05-17 02:31:18 -07:00
kshitij	5fba236644	chore: ruff auto-fix PLR6201 resweep — tuple → set in membership tests (#27355 ) Six days after #23937 (608 fixes) the codebase had accumulated 241 new PLR6201 violations. Same mechanical `x in (...)` → `x in {...}` fix, same zero-risk profile: set lookup is O(1) vs O(n) for tuple and the two are semantically equivalent for hashable scalar membership tests. All 241 instances fixed via `ruff check --select PLR6201 --fix --unsafe-fixes`, zero remaining. Every changed value is a hashable scalar (str/int/None/enum/signal); no risk of unhashable runtime errors. No behavior change. Test plan: - 119 files changed, +244/-244 (net zero) — exactly one-line edits - `ruff check` clean afterward - Compile checks pass on the largest touched files (cli.py, run_agent.py, gateway/run.py, gateway/platforms/discord.py, model_tools.py) - Subset broad test run on tests/gateway/ tests/hermes_cli/ tests/agent/ tests/tools/: 18187 passed, 59 pre-existing failures (verified against origin/main with the same shape — identical failure count, identical category — all xdist test-order flakes unrelated to this change) Follows the same template as PR #23937 ([tracker: #23972](https://github.com/NousResearch/hermes-agent/issues/23972)).	2026-05-17 02:29:41 -07:00
teknium1	563b4d9e51	fix: strip image parts for non-vision models with provider profiles + getattr-safe _custom_providers Original commit `75e5d0f6b` by hueilau targeted _build_api_kwargs in pre-refactor run_agent.py. The body now lives in agent/chat_completion_helpers.build_api_kwargs — re-applied there. Also: switch the custom_providers forward (from `21078ebce`) to use getattr() — tests build a bare AIAgent via __new__ and would otherwise hit AttributeError on _custom_providers. Co-authored-by: hueilau <33933019+hueilau@users.noreply.github.com>	2026-05-16 23:47:51 -07:00
teknium1	36ad8336f9	fix(run_agent): guard memory provider init against empty/whitespace string Original commit `8d756a421` by austrian_guy targeted __init__ in pre-refactor run_agent.py. The body now lives in agent/agent_init.init_agent — re-applied there. Co-authored-by: austrian_guy <33156212+ether-btc@users.noreply.github.com>	2026-05-16 23:43:09 -07:00
teknium1	4ece521bcf	fix(run_agent): isolate background review fork from external memory plugins (#27190 ) Original commit `973f27e95` by Teknium targeted _spawn_background_review in pre-refactor run_agent.py. The body now lives in agent/background_review._spawn_background_review — re-applied there. Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>	2026-05-16 23:42:49 -07:00
teknium1	b5bcffe167	fix(fallback): forward custom_providers to fallback model context-length detection Original commit `21078ebce` by PaTTeeL targeted _try_activate_fallback in pre-refactor run_agent.py. The body now lives in agent/chat_completion_helpers.try_activate_fallback — re-applied there. Co-authored-by: PaTTeeL <9150277+PaTTeeL@users.noreply.github.com>	2026-05-16 23:42:16 -07:00
teknium1	4ab9a06a51	fix(agent): reset _fallback_index at turn start even when no fallback activated Original commit `33528b428` by konsisumer targeted _restore_primary_runtime in pre-refactor run_agent.py. The body now lives in agent/agent_runtime_helpers.restore_primary_runtime — re-applied there. Fixes #20465 Co-authored-by: konsisumer <der@konsi.org>	2026-05-16 23:41:45 -07:00
teknium1	aa05ffba53	fix(xai): surface provider 'error' SSE frame in Codex fallback stream (#27184 ) Original commit `2b193907d` by Teknium added a new module-level _StreamErrorEvent class and threaded its raise into _run_codex_create_stream_fallback in pre-refactor run_agent.py. - _StreamErrorEvent class → run_agent.py (module-level, next to _qwen_portal_headers; class needs to be top-level for the codex runtime to import it) - The fallback event-loop's 'type=error' handler → agent/codex_runtime.py where run_codex_create_stream_fallback now lives. Imports _StreamErrorEvent lazily from run_agent to avoid circular import. Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>	2026-05-16 23:41:09 -07:00
teknium1	80fa92a491	fix(codex): rotate pool on usage limit 429 — port to extracted modules Original commit `e51d74ab9` by Maxim Esipov targeted _extract_api_error_context and _recover_with_credential_pool in pre-refactor run_agent.py. Both bodies now live in agent/agent_runtime_helpers.py — re-applied to that module: - extract_api_error_context: payload.get('type') added to the reason fallback chain (Codex error bodies use 'type' instead of 'code'/'error') - recover_with_credential_pool: usage_limit_reached detection in the rate_limit branch — skip the retry-once-then-rotate dance and rotate immediately when the body says the per-account usage limit hit. Co-authored-by: Maxim Esipov <maksesipov@gmail.com>	2026-05-16 23:39:41 -07:00
teknium1	df22d29522	fix(copilot): GitHub Models 413 hint — port to extracted conversation_loop Original commits `4ded3ede3` (@konsisumer) + `374dc81c2` (Teknium) added a 413 hint to run_agent.py's agent loop. Final-state version (the sharpened `374dc81c2` wording) ported to agent/conversation_loop.py, where the payload_too_large branch now lives. The deprecation detection + _URL_TO_PROVIDER changes from both commits landed in agent/copilot_acp_client.py and agent/model_metadata.py via the prior merge. Closes #10648 Co-authored-by: konsisumer <der@konsi.org> Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>	2026-05-16 23:38:45 -07:00
teknium1	3fbedd732e	feat: add supports_parallel_tool_calls for MCP servers (#26825 ) — port to tool_dispatch_helpers Original commit `395e9dd9e` by Teknium targeted module-level _is_mcp_tool_parallel_safe and _should_parallelize_tool_batch helpers in pre-refactor run_agent.py. Both helpers now live in agent/tool_dispatch_helpers.py — re-applied to that module. The tools/mcp_tool.py portion (the public is_mcp_tool_parallel_safe API + _parallel_safe_servers tracking) merged cleanly from main via the prior merge commit. Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>	2026-05-16 23:36:37 -07:00
teknium1	fe4c87eb28	fix(agent): retry malformed anthropic stream parser errors — port to extracted modules Original commit `9c304a7f5` by helix4u targeted _flatten_exception_chain, _summarize_api_error, and the _call streaming retry loop in pre-refactor run_agent.py. Re-applied to: - New _is_provider_stream_parse_error helper → run_agent.py (next to _flatten_exception_chain in the AIAgent class) - _summarize_api_error early-return for the malformed-streaming ValueError → run_agent.py (kept method body) - _call streaming retry: _is_stream_parse_err flag wired into _is_transient AND the post-exhaustion branch + dedicated malformed-streaming user-status string → agent/chat_completion_helpers.py (the _call body now lives there) Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>	2026-05-16 23:35:54 -07:00
teknium1	f885be030c	fix(auxiliary): resolve xai oauth compression from pool — port to conversation_compression Original commit `97a32afdc` by helix4u targeted _check_compression_model_feasibility in pre-refactor run_agent.py. The function body now lives in agent/conversation_compression.py — re-applied the configured-but-unavailable provider message there. Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>	2026-05-16 23:33:59 -07:00
teknium1	6975a2d9ae	fix(xai-oauth): entitlement-403 chain — final state (`ce0e189d3` + `9818b9a1a` + `6784c8079` + `dffb602f3`) Collapses the four-commit xAI entitlement-403 chain to its final on-main state, ported to the post-refactor module layout: - Added _is_entitlement_failure on AIAgent (run_agent.py) — detects Grok subscription-shape 403s on (401\|403\|None) status codes. - Added entitlement-skip branch to recover_with_credential_pool (agent/agent_runtime_helpers.py) — breaks the refresh-loop that Don's 100-iteration trace exposed when a Premium+ user hit a real entitlement issue. - Removed _decorate_xai_entitlement_error and unwrapped its two _summarize_api_error call sites — xAI's own body text already points users at grok.com/?_s=usage so we surface that verbatim (`dffb602f3` reasoning: X Premium subs DO now work per xAI's 2026-05-16 announcement, so editorialising would misdirect). - grok-4.3 1M context entry landed in agent/model_metadata.py via the prior merge — no additional port needed. Tests already on disk (tests/run_agent/test_codex_xai_oauth_recovery.py) assert _is_entitlement_failure shape and verbatim body surfacing. Closes #27110. Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>	2026-05-16 23:33:18 -07:00
teknium1	6362e71973	fix(xai-oauth): recover from prelude SSE errors, gate reasoning replay, surface entitlement 403s Original commit `31ba2b0cb` by Teknium targeted run_codex_stream() at its pre-refactor location in run_agent.py. Re-applied: - Prelude error retry/fallback → agent/codex_runtime.py (in run_codex_stream where the body now lives) - _decorate_xai_entitlement_error helper + _summarize_api_error wrapping → run_agent.py (these methods remained on AIAgent as @staticmethod's; cherry-pick applied them cleanly) The xai-oauth provider gate, encrypted_content drop on replay, etc. landed in agent/codex_responses_adapter.py via the prior merge from main. Closes #8133, #14634 Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>	2026-05-16 23:28:05 -07:00
teknium1	27df249564	feat(nvidia): add NIM billing origin header — port to extracted modules Original commit `13c3d4b4e` by kchantharuan touched __init__ and _apply_client_headers_for_base_url in pre-refactor run_agent.py. Re-applied to: - __init__: agent/agent_init.py (3 hunks — NVIDIA branch + _custom_headers fallback in routed-client and fallback-client paths) - _apply_client_headers_for_base_url: still in run_agent.py (1 hunk) build_nvidia_nim_headers was already present in agent/auxiliary_client.py from the prior merge — no additional port needed. Co-authored-by: kchantharuan <kchantharuan@nvidia.com>	2026-05-16 23:25:11 -07:00
teknium1	b07524e53a	feat(xai-oauth): add xAI Grok OAuth (SuperGrok Subscription) provider — port to extracted modules Original commit `b62c99797` by Jaaneek targeted six locations in pre-refactor run_agent.py. Re-applied to the extracted post-PR locations: - api_mode dispatch → agent/agent_init.py - is_xai_responses build_api_kwargs → agent/chat_completion_helpers.py - codex_auth_retry block + 401 hint → agent/conversation_loop.py - _try_refresh_codex_client_credentials body → run_agent.py (kept) The non-run_agent.py portions of the commit (auxiliary_client, codex transport, hermes_cli/auth, tools/xai_http, tests, docs) merged cleanly from main via the prior merge commit. Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>	2026-05-16 23:23:38 -07:00
teknium1	7d221aa1f2	fix(langfuse): complete observability fix — port to extracted conversation_loop Original commit `db84a78e6` by kshitij targeted run_conversation()'s pre_api_request and post_api_request hooks in pre-refactor run_agent.py. Re-applied to the extracted location in agent/conversation_loop.py. Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com> Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com> Co-authored-by: Brian Conklin <brian@dralth.com>	2026-05-16 23:21:51 -07:00
teknium1	a77ca9295e	perf(run_agent): accumulate length-continuation prefix via list+join Original commit `4f8aaf104` by InB4DevOps targeted run_conversation() in the pre-refactor run_agent.py. Re-applied to the extracted location in agent/conversation_loop.py. Co-authored-by: InB4DevOps <tolle.lege+github@gmail.com>	2026-05-16 23:20:27 -07:00
teknium1	152d42d1a7	Merge origin/main into pr-27248 (resolving run_agent.py = ours) run_agent.py taken from HEAD (the extracted forwarder structure). The 25 run_agent.py fixes that landed on main during the PR's life need to be ported into the agent/* extracted modules in follow-up commits.	2026-05-16 23:16:52 -07:00
phoenixshen	52c89715a2	fix: respect user-configured vision model for OpenRouter _OPENROUTER_MODEL hardcoded 'google/gemini-3-flash-preview' which returns 404 on OpenRouter, breaking all vision tasks for users who rely on the OpenRouter default. Additionally, _try_openrouter() ignored the user-configured auxiliary.vision.model entirely. Changes: - Update _OPENROUTER_MODEL default to google/gemini-2.5-flash (valid) - Add optional 'model' parameter to _try_openrouter() - Pass configured model from _resolve_strict_vision_backend() through to _try_openrouter() This allows users who set auxiliary.vision.model (e.g. x-ai/grok-4.3) to have it actually used, while maintaining backward compatibility.	2026-05-16 23:11:43 -07:00
zccyman	b389796ae3	fix(auxiliary): resolve api_key_env alias in named custom provider path of resolve_provider_client In resolve_provider_client(), the named custom provider code path at ~line 2914 only checked the ``key_env`` field when looking for an environment-variable-based API key. The documented ``api_key_env`` snake_case alias was silently ignored, causing custom providers configured with ``api_key_env`` to fall through to the ``no-key-required`` placeholder — which produces a confusing 401 (``****ired`` mask) on auth-required remote endpoints. This mirrors the same fix already applied to run_agent.py in commit `6ddc48b05` (fix(fallback): resolve api_key_env in fallback chain entries). Also adds a logger.warning() when the placeholder is reached, so future alias gaps are easier to debug. Closes #25091	2026-05-16 23:11:43 -07:00
teknium1	47823790b0	refactor(run_agent): review fixes — keyword-forward __init__, drop dead code, tighten guards Four fixes from PR #27248 review: 1. __init__ forwarder is now keyword-forwarded (daimon-nous review). Previously the run_agent.AIAgent.__init__ wrapper forwarded all 64 params positionally to agent.agent_init.init_agent, so adding a 65th param on main would require three lockstep edits (signature, init_agent signature, forwarder call) or silently shift every value. Keyword forwarding makes this trivially safe — adding a param now only needs the two signatures and one extra keyword line. 2. Drop dead _ra() in agent/codex_runtime.py (daimon-nous + Copilot). The lazy run_agent reference was defined but never called inside this module — the codex paths use agent.* accessors only. 3. Drop unused imports in agent/codex_runtime.py (Copilot): contextvars, threading, time, uuid, Optional. Carried over from run_agent.py during the original extraction. 4. Tighten three source-introspection test guards (Copilot): - test_memory_nudge_counter_hydration.py — was scanning the concatenated source of run_agent.py + agent/conversation_loop.py and matching self.X or agent.X form. Now asserts the hydration block lives in agent/conversation_loop.py specifically with the agent.X form — the body never moves back, so if it ever drifts a future re-introduction fails the guard. - test_run_agent.py::TestMemoryNudgeCounterPersistence — anchor on agent.iteration_budget = IterationBudget exactly (was just iteration_budget = IterationBudget) so an unrelated identifier ending in iteration_budget can't match. - test_run_agent.py::TestMemoryProviderTurnStart — assert the agent._user_turn_count form directly (the extracted body uses agent.X, not self.X — accepting either was a transitional fudge). - test_jsondecodeerror_retryable.py — scan agent/conversation_loop.py only, not the concatenation. Not addressed in this commit: * Pre-existing bugs in agent/tool_executor.py (heartbeat index mismatch when calls are blocked, _current_tool clobber in result loop, blocked-counted-as-completed in spinner summary, dead result_preview computation). These were preserved byte-for-byte from the original _execute_tool_calls_concurrent — worth a separate follow-up PR with proper tests. * _OpenAIProxy.__instancecheck__ concern — pre-existing, not flagged by any of the original test patches (nothing actually does isinstance(x, OpenAI) against the proxy instance). * agent_init.py:949 mem_config potential NameError — pre-existing; only triggers if _agent_cfg.get('memory', {}) itself raises, which it can't with a stock dict. tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing test_auxiliary_client failure (unchanged). run_agent.py: 3821 -> 3937 lines (+116 from the keyword-forwarded init call's verbosity). Final: 16083 -> 3937 (-12146, 75% reduction).	2026-05-16 22:55:49 -07:00
shellybotmoyer	1a4e64ba06	fix(credential_pool): parse ISO-string last_status_at during from_dict rehydration (#25516 )	2026-05-16 22:54:22 -07:00
0xchainer	4b17c2411a	fix(skills): return None instead of truthy stub when skill load fails build_skill_invocation_message() returns a non-empty placeholder string ('[Failed to load skill: ...]') when the skill exists in the command cache but loading the actual SKILL.md payload fails. CLI/gateway callers treat any truthy return value as success, so the failure is silently routed into the model as if it were a valid skill prompt. Return None instead, matching the existing behavior for unknown commands, so callers using 'if msg:' can properly detect the failure.	2026-05-16 22:52:22 -07:00
teknium1	94c3e0ab8e	refactor(run_agent): extract 10 more helpers to agent/agent_runtime_helpers.py Final extraction pass — the methods left over after run_conversation and __init__ moved out. Together these 10 cover ~813 LOC of medium- sized helpers: * switch_model (194 LOC) — model switching mid-session * _invoke_tool (87) — central tool dispatch with overrides * _repair_tool_call (72) — argument JSON repair entrypoint * _sanitize_api_messages (71) — role-filter for API send * _looks_like_codex_intermediate_ack (72) — codex transcript heuristic * _copy_reasoning_content_for_api (70) — reasoning preservation * _cleanup_dead_connections (70) — periodic dead-socket sweep * _extract_api_error_context (65) — error-dump context builder * _apply_pending_steer_to_tool_results (63) — /steer injection * _force_close_tcp_sockets (59) — aggressive socket cleanup AIAgent keeps thin forwarder methods for all 10 (staticmethods preserved where present). Names tests patch on run_agent (handle_function_call, AIAgent class attrs, logger) routed through _ra() so the patch surface is preserved. tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing test_auxiliary_client failure as on main). run_agent.py: 4634 -> 3821 lines (-813). Final total: 16083 -> 3821 (-12262, 76% reduction).	2026-05-16 20:35:19 -07:00
teknium1	9f408989c4	refactor(run_agent): extract __init__ (1,381 LOC) to agent/agent_init.py The largest method left on AIAgent (60+ parameters, the entire startup sequence — credential resolution, provider auto-detection, context engine bootstrap, memory store hydration, plugin lifecycle hooks) moves into agent/agent_init.py. AIAgent.__init__ is now a thin wrapper that calls agent.agent_init.init_agent(self, ...) with the original full parameter list preserved. Module-level run_agent names referenced in the body (_openrouter_prewarm_done, _qwen_portal_headers, _routermint_headers, _hermes_home, OpenAI, get_tool_definitions, check_toolset_requirements) are resolved through _ra() so test patches on those names keep working. agent_init's logger warnings are routed via _ra().logger so tests patching run_agent.logger capture them (TestStringKSuffixContextLengthWarns, TestCustomProvidersInvalidContextLengthWarns). Live E2E reconfirmed on three model paths (openai/gpt-5.4, anthropic/claude-sonnet-4.6, moonshotai/kimi-k2-thinking). tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing test_auxiliary_client failure). run_agent.py: 5944 -> 4564 lines (-1380). Total reduction since baseline: 16083 -> 4564 (-11519, 72%).	2026-05-16 19:43:38 -07:00
teknium1	0530252384	refactor(run_agent): extract run_conversation to agent/conversation_loop.py The 3,877-line run_conversation body — the agent loop itself — moves out of run_agent.py into a dedicated module. AIAgent.run_conversation is now a thin forwarder that delegates to agent.conversation_loop.run_conversation with the AIAgent instance as the first argument. This is the largest single extraction in the run_agent.py refactor. The body keeps all 163 self.X references intact (rewritten as agent.X), all nested closures, all retry/backoff/compression machinery. Symbols that tests or callers patch on run_agent (_set_interrupt, handle_function_call, AIAgent class attrs) are resolved through _ra() inside the extracted module so the patch surface is preserved. Five tests doing inspect.getsource(AIAgent.run_conversation) updated to scan agent.conversation_loop.run_conversation. Two source-introspection tests (TestMemoryNudgeCounterPersistence, TestMemoryProviderTurnStart) updated to accept either self.X (legacy) or agent.X (extracted form) in the matched assertions. Live E2E verified on three model paths: * openai/gpt-5.4 (OpenAI chat completions via OpenRouter) * anthropic/claude-sonnet-4.6 (Anthropic Messages via OpenRouter) * moonshotai/kimi-k2-thinking (reasoning model, reasoning_content path) Plus read_file tool execution, terminal tool, web_search. tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing failure (test_auxiliary_client::test_custom_endpoint... — same as on main). run_agent.py: 9800 -> 5944 lines (-3856). Total reduction since baseline: 16083 -> 5944 (-10139, 63%).	2026-05-16 19:26:52 -07:00
teknium1	d35ee7bcdd	refactor(run_agent): move review prompts to agent/background_review.py The three big review-prompt strings (_MEMORY_REVIEW_PROMPT, _SKILL_REVIEW_PROMPT, _COMBINED_REVIEW_PROMPT — 183 lines combined) move out of the AIAgent class body and into agent/background_review.py where they're consumed. AIAgent re-exposes them as class attributes via 'from ... import' inside the class body — Python binds those names into the class namespace so existing AIAgent._MEMORY_REVIEW_PROMPT references keep working. spawn_background_review_thread also falls back to the module-level constants if an agent doesn't have the attribute (preserves the test pattern of mocking these on the agent). tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing test_auxiliary_client failure). run_agent.py: 9986 -> 9800 lines (-186).	2026-05-16 19:11:58 -07:00
teknium1	c42fa94afc	refactor(run_agent): extract Codex runtime + assorted helpers to dedicated modules Two new modules: * agent/codex_runtime.py — three Codex API-mode methods - run_codex_app_server_turn (148 LOC) — Codex CLI subprocess driver - run_codex_stream (125 LOC) — Codex Responses API stream - run_codex_create_stream_fallback (78 LOC) — fallback after Responses stream=true initial create failure * agent/agent_runtime_helpers.py — twelve assorted AIAgent helpers totalling ~1,166 LOC: convert_to_trajectory_format, sanitize_tool_call_arguments (static), repair_message_sequence, strip_think_blocks, recover_with_credential_pool, try_recover_primary_transport, drop_thinking_only_and_merge_users (static), restore_primary_runtime, extract_reasoning, dump_api_request_debug, anthropic_prompt_cache_policy, create_openai_client AIAgent keeps thin forwarder methods for all 15 (preserving @staticmethod where needed). Symbols tests patch on run_agent (OpenAI, AIAgent class attrs) are routed through _ra() to honor the patch contract. The _TRANSIENT_TRANSPORT_ERRORS frozenset moves with try_recover_primary_transport and is referenced as a module-level constant in the extracted code. tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing test_auxiliary_client failure). run_agent.py: 11391 -> 9887 lines (-1504).	2026-05-16 19:03:30 -07:00
teknium1	0430e71ec9	refactor(run_agent): extract streaming API caller (893 LOC) to agent/chat_completion_helpers.py Move _interruptible_streaming_api_call out of run_agent.py — the biggest single method in the file. Body lives next to interruptible_api_call in agent/chat_completion_helpers.py so streaming + non-streaming code share one home. Nested closures (_call_chat_completions, _call_anthropic, the codex stream branch) all come along with the body and still capture the parent function's locals as expected. AIAgent keeps a thin forwarder method. is_local_endpoint added to the import block (used by the stream stale-timeout disable logic). One source-introspection test in TestAnthropicInterruptHandler is updated to scan agent.chat_completion_helpers.interruptible_streaming_api_call instead of AIAgent._interruptible_streaming_api_call. tests/run_agent/ + tests/agent/: 4312 passed (same pre-existing test_auxiliary_client failure). run_agent.py: 12277 -> 11385 lines (-892).	2026-05-16 18:48:22 -07:00
teknium1	4b25619bc4	refactor(run_agent): extract chat-completion helpers to agent/chat_completion_helpers.py Six methods move into a new module — bodies live there, AIAgent keeps thin forwarder methods so call sites and tests are unchanged. * interruptible_api_call — non-streaming API call with interrupt handling * build_api_kwargs — assemble OpenAI / Anthropic / Codex / Bedrock request kwargs * build_assistant_message — normalize assistant message dict (reasoning, tool_calls, codex passthrough fields, alibaba glm-4.7 quirk) * try_activate_fallback — provider fallback chain activation * handle_max_iterations — controlled stop when iteration budget exhausts * cleanup_task_resources — per-turn VM + browser teardown (skipped for persistent environments) Names tests patch on run_agent (cleanup_vm, cleanup_browser) are routed through _ra() so the patch surface is preserved. Two TestAnthropicInterruptHandler source-introspection tests were updated to scan agent.chat_completion_helpers.interruptible_api_call instead of AIAgent._interruptible_api_call — the body lives in the extracted module now. tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing test_auxiliary_client failure). run_agent.py: 13282 -> 12253 lines (-1029).	2026-05-16 18:41:44 -07:00
teknium1	57f6762ca0	refactor(run_agent): extract stream diagnostics to agent/stream_diag.py Move the five stream-drop diagnostic helpers + the headers tuple: * STREAM_DIAG_HEADERS — cf-ray, x-openrouter-provider, x-request-id, etc. * stream_diag_init — fresh per-attempt diagnostic dict * stream_diag_capture_response — snapshot upstream headers + HTTP status * flatten_exception_chain — compact Outer(msg) <- Inner(msg) rendering * log_stream_retry — structured WARNING with provider/bytes/elapsed/ttfb * emit_stream_drop — user-facing status line + activity touch AIAgent keeps thin forwarder methods (and exposes the headers tuple as _STREAM_DIAG_HEADERS for back-compat). All test patches and call sites unchanged. tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing test_auxiliary_client failure). run_agent.py: 13470 -> 13227 lines (-243).	2026-05-16 18:28:17 -07:00
teknium1	79559214a6	refactor(run_agent): extract tool execution to agent/tool_executor.py Move the two big tool-dispatch methods out of run_agent.py: * execute_tool_calls_concurrent — 408-line concurrent path (interrupt pre-flight, guardrail+plugin block, callback fan-out, ContextVar- preserving ThreadPoolExecutor, periodic heartbeats for the gateway inactivity monitor, per-tool result handling with subdir hints + guardrail observations + checkpoint, /steer drain) * execute_tool_calls_sequential — 441-line sequential path (the original behavior used for single-tool batches and interactive tools) Both take the parent AIAgent as their first argument; AIAgent keeps thin forwarders so call sites unchanged. handle_function_call is routed through _ra() so tests that patch run_agent.handle_function_call keep working. _set_interrupt likewise. The AST guard in test_tool_executor_contextvar_propagation.py is updated to scan both run_agent.py AND agent/tool_executor.py so it still catches the executor.submit(_run_tool, ...) regression regardless of which file the body lives in. tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing test_auxiliary_client failure as before). run_agent.py: 14309 -> 13461 lines (-848).	2026-05-16 18:24:05 -07:00
teknium1	2d2cd5e904	refactor(run_agent): extract system-prompt builder to agent/system_prompt.py Four AIAgent methods move into a dedicated module: * build_system_prompt_parts — three-tier stable/context/volatile dict * build_system_prompt — joiner used at session start * invalidate_system_prompt — drop cache + reload memory * format_tools_for_system_message — trajectory-format tool dump The extracted helpers look up patch-target names (load_soul_md, build_skills_system_prompt, get_toolset_for_tool, build_environment_hints, build_context_files_prompt, build_nous_subscription_prompt) through the run_agent module via _ra() instead of importing them directly. That preserves the patch surface tests rely on (patch('run_agent.load_soul_md', ...) and friends). AIAgent keeps thin forwarder methods. tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing test_auxiliary_client failure as before). run_agent.py: 14555 -> 14292 lines (-263).	2026-05-16 18:16:20 -07:00
teknium1	5311d9959e	refactor(run_agent): extract context compression to agent/conversation_compression.py Move four compression-related methods to a dedicated module: * check_compression_model_feasibility — startup probe + auto-lowered threshold + hard floor * replay_compression_warning — re-emit stored warning through gateway status_callback * compress_context — run compressor, split SQLite session, notify plugins+memory * try_shrink_image_parts_in_messages — image-too-large recovery via re-encode AIAgent keeps thin forwarder methods so existing call sites and tests that patch run_agent.AIAgent methods keep working. tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing test_auxiliary_client failure as before). run_agent.py: 15013 -> 14535 lines (-478).	2026-05-16 18:09:33 -07:00
teknium1	1f6eb1738c	refactor(run_agent): extract background memory/skill review to agent/background_review.py Move the background-review subsystem (the self-improvement loop — see the README) out of run_agent.py into a dedicated module. * summarize_background_review_actions — was the @staticmethod that builds the user-facing action summary * spawn_background_review_thread — builds the thread target + prompt; the actual review loop body (forked AIAgent, runtime inheritance, tool whitelist, suppression, teardown) lives in _run_review_in_thread * build_memory_write_metadata — provenance for external memory mirrors AIAgent keeps thin wrappers for backward compatibility AND because tests patch run_agent.threading.Thread to assert lifecycle behavior — the threading.Thread construction stays in AIAgent._spawn_background_review, the inner work moves out. tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing failure (test_auxiliary_client.py::test_custom_endpoint... — confirmed failing on main before this change). 3 skipped. run_agent.py: 15272 -> 14972 lines (-300).	2026-05-16 18:05:01 -07:00
teknium1	5f309ae685	refactor(run_agent): extract OpenAI proxy, safe stdio, IterationBudget Three small extractions into focused modules: * agent/process_bootstrap.py — \_OpenAIProxy (lazy openai.OpenAI import), \_SafeWriter (broken-pipe-resistant stdio wrapper), \_install_safe_stdio, \_get_proxy_from_env, \_get_proxy_for_base_url. All process / IO bootstrap. * agent/iteration_budget.py — IterationBudget class (thread-safe consume/ refund counter shared by parent agent and subagents). run_agent re-exports every name so existing test patches like patch('run_agent.OpenAI', ...) and 'from run_agent import IterationBudget' keep working unchanged. Verified the patch-rebinding contract for OpenAI explicitly. tests/run_agent/ + tests/agent/test_gemini_fast_fallback.py: 1347 passed, 3 skipped. run_agent.py: 15427 -> 15261 lines (-166).	2026-05-16 17:59:32 -07:00
teknium1	59f1c0f0b6	refactor(run_agent): extract tool-dispatch helpers to agent/tool_dispatch_helpers.py Pull module-level helpers used by the tool-execution path out of run_agent.py: * parallelism gating — _NEVER_PARALLEL_TOOLS, _PARALLEL_SAFE_TOOLS, _PATH_SCOPED_TOOLS, _DESTRUCTIVE_PATTERNS, _REDIRECT_OVERWRITE, _is_destructive_command, _should_parallelize_tool_batch, _extract_parallel_scope_path, _paths_overlap * multimodal envelopes — _is_multimodal_tool_result, _multimodal_text_summary, _append_subdir_hint_to_multimodal * file-mutation verifier inputs — _extract_file_mutation_targets, _extract_error_preview * trajectory normalization — _trajectory_normalize_msg All pure functions. run_agent re-exports every name so existing 'from run_agent import _is_multimodal_tool_result' callers in tests/tools/, tests/run_agent/, and tools/file_state.py keep working. tests/run_agent/: 1341 passed, 3 skipped. run_agent.py: 15682 -> 15427 lines (-255).	2026-05-16 17:54:26 -07:00
teknium1	885d1242a2	refactor(run_agent): extract message sanitization to agent/message_sanitization.py Pull the 10 pure sanitization/repair helpers (\_sanitize_surrogates, \_sanitize_structure_surrogates, \_sanitize_messages_surrogates, \_escape_invalid_chars_in_json_strings, \_repair_tool_call_arguments, \_strip_non_ascii, \_sanitize_messages_non_ascii, \_sanitize_tools_non_ascii, \_strip_images_from_messages, \_sanitize_structure_non_ascii) and the \_SURROGATE_RE constant out of run_agent.py into a new module. These are stateless byte-walking helpers with no AIAgent dependency. Backward compatibility: run_agent re-exports every name via a single import block, so existing 'from run_agent import _sanitize_surrogates' imports in tests and cli.py keep working unchanged. Same pattern the file already uses for _summarize_user_message_for_log (codex_responses_adapter). run_agent.py: 16077 -> 15682 lines (-395).	2026-05-16 17:41:09 -07:00
Teknium	3b39096904	Port from Kilo-Org/kilocode#9434: strip historical media after compression (#27189 ) After context compression, the protected tail messages retain their original image parts. When those include multi-MB pasted screenshots, every subsequent API request re-ships the same base-64 blobs forever — which can push the request past provider body-size limits and wedge the session even though compression 'succeeded'. Add _strip_historical_media() to agent/context_compressor.py. After the summary is built, find the newest user message that carries an image part and replace image parts in every earlier message with a short text placeholder ('[Attached image — stripped after compression]'). The newest image-bearing user turn keeps its media so the model can still analyse what the user just sent. Handles all three multimodal shapes: - OpenAI chat.completions image_url - OpenAI Responses API input_image - Anthropic native {type: image, source: ...} Includes 27 unit tests covering the helpers and the end-to-end compress() integration, plus a manual E2E check confirming a ~4MB two-image conversation shrinks to ~2MB after compression.	2026-05-16 17:18:25 -07:00
Teknium	93e109a1d5	fix(moonshot): strip $ref siblings and collapse tuple items in tool schemas (#27104 ) Port from anomalyco/opencode#24730: Moonshot's JSON Schema validator rejects two shapes that the rest of the JSON Schema ecosystem accepts: 1. $ref nodes with sibling keywords. Moonshot expands the reference before validation and then rejects the node if keys like `description`, `type`, or `default` appear alongside $ref. MCP-sourced tool schemas commonly put a `description` on $ref-typed properties so the model sees the field hint — which worked on every provider except Moonshot. 2. Tuple-style `items` arrays (positional element schemas). Moonshot's engine requires ONE schema applied to every array element. Common in tool schemas generated from Go/Protobuf that model fixed-length arrays as `[{type:number}, {type:number}]`. Repairs applied in `agent/moonshot_schema.py`: - Rule 3: when a node has `$ref`, return `{"$ref": <value>}` only (strip every sibling). The referenced definition still carries its own description on the target node, which Moonshot accepts. - Rule 4: when `items` is a list, collapse to the first element schema (falling back to `{}` which is then filled by the generic missing-type rule). Preserves `minItems` / `maxItems` / other siblings. Tests: 10 new cases across TestRefSiblingStripping + TestTupleItems, plus the existing TestMissingTypeFilled::test_ref_node_is_not_given_synthetic_type still passes (it asserted plain $ref passes through; now it passes through as exactly `{"$ref": "..."}` which is strictly compatible). All 35 tests in test_moonshot_schema.py pass.	2026-05-16 13:02:19 -07:00
JunghwanNA	345821b4a1	style: move secrets import alongside other function-level imports Group the secrets import with time and webbrowser at the top of run_hermes_oauth_login_pure(), matching the existing pattern. Drop the _secrets alias — no name conflict in this scope.	2026-05-16 02:38:02 -07:00
JunghwanNA	fcd9011f8d	fix(security): separate OAuth PKCE state from code_verifier The PKCE flow reused the code_verifier as the OAuth state parameter. Per RFC 6749 §10.12 and RFC 7636, these serve different purposes: state is an anti-CSRF token visible in the authorization URL; the code_verifier must remain secret for the token exchange. Generate an independent secrets.token_urlsafe(32) for state and validate it on callback to provide actual CSRF protection. Closes #10693	2026-05-16 02:38:02 -07:00
teknium1	374dc81c23	fix(copilot-acp): tighten deprecation detection + sharpen GitHub Models 413 hint Follow-up improvements on top of @konsisumer's cherry-picked fix for #10648: 1. Deprecation patterns required BOTH a product fingerprint ('gh-copilot') and a deprecation marker. The previous list included 'copilot-cli' and bare 'deprecation', which would false-positive on stderr from the NEW @github/copilot CLI — whose repo is literally github.com/github/copilot-cli and which legitimately surfaces those substrings in its own messages. 2. Replace the deprecation hint. The user in #10648 installed 'gh extension install github/gh-copilot' (the deprecated extension) thinking that's what ACP mode uses, when ACP actually spawns the new 'copilot' binary from '@github/copilot'. The hint now points users at the correct install command ('npm install -g @github/copilot') with the new CLI's repo URL, and demotes provider-switching to a fallback alternative. 3. Change _URL_TO_PROVIDER value for models.inference.ai.azure.com from the 'github-models' alias to the canonical 'copilot' provider id, matching the convention used by every other entry in the table. 4. Sharpen the 413 hint message. The free tier's ~8K cap is below the system-prompt floor, so this endpoint is fundamentally incompatible with an agentic loop — not a 'use a different URL' problem. Tests: - New parametrized false-positive coverage for the new CLI's stderr shape. - Updated assertion to require canonical 'copilot' provider mapping. - All 14 deprecation/URL tests pass.	2026-05-16 02:24:48 -07:00
konsisumer	4ded3ede33	fix: detect gh-copilot deprecation and improve GitHub Models 413 errors (#10648 ) Address two blocking issues when using GitHub Copilot integrations: 1. ACP mode: detect the gh-copilot CLI deprecation error from stderr and surface an actionable message with alternatives instead of hanging or showing a cryptic error. 2. GitHub Models (Azure) 413: recognize models.inference.ai.azure.com as a known GitHub Models URL, and print a targeted hint explaining the hard 8K token limit that makes this endpoint incompatible with Hermes' system prompt size.	2026-05-16 02:24:48 -07:00
helix4u	97a32afdc4	fix(auxiliary): resolve xai oauth compression from pool	2026-05-15 19:53:37 -07:00
Teknium	ce0e189d3e	fix(xai-oauth): break entitlement-403 credential-refresh loop, bump grok-4.3 context to 1M (#26664 ) Don Piedro's 18-minute hang on grok-4.3 traced to two issues PR #26644 didn't cover: - _recover_with_credential_pool classifies 403 as FailoverReason.auth and calls pool.try_refresh_current(). For xAI OAuth on an unsubscribed account, refresh succeeds (mints a new token from the same account) but the next API call 403s with the same entitlement error. Result: infinite refresh → retry → 403 loop until Ctrl+C (1133s in Don's log). New _is_entitlement_failure(error_context, status_code) detects the subscription-shape body ("do not have an active Grok subscription" / "out of available resources" + grok / "does not have permission" + grok) and short-circuits recovery so _summarize_api_error surfaces PR #26644's friendly hint. - grok-4.3 resolved to 256k via the grok-4 catch-all in DEFAULT_CONTEXT_LENGTHS. Per docs.x.ai/developers/models/grok-4.3 the model ships with 1M context. Add explicit grok-4.3 entry before the grok-4 fallback (longest-first substring matching ensures grok-4.3 and grok-4.3-latest both land on the new value). Tests: 8 new (23 total in test_codex_xai_oauth_recovery.py). E2E verified Don's 100-iteration loop bails out with 0 refresh calls while genuine auth failures still refresh once and recover.	2026-05-15 17:11:06 -07:00
teknium1	cd9470f416	fix(deepseek): wire thinking-mode via DeepSeekProfile, not legacy fallback The cherry-picked PR #15251 from @tw2818 correctly identified the DeepSeek 400 root cause but placed the fix in the legacy fallback path of `build_kwargs`, which DeepSeek never reaches — DeepSeek has a registered ProviderProfile and goes through `_build_kwargs_from_profile` instead. The legacy-path block was therefore dead code. This commit pivots the fix to where it actually fires: - New `DeepSeekProfile` in `plugins/model-providers/deepseek/__init__.py` overrides `build_api_kwargs_extras` to emit DeepSeek's expected wire format (mirrors `KimiProfile`): {"reasoning_effort": "<low\|medium\|high\|max>", "extra_body": {"thinking": {"type": "enabled" \| "disabled"}}} - Model gating: only `deepseek-v4-*` and `deepseek-reasoner` emit thinking control. `deepseek-chat` (V3) is untouched — current behavior. - Effort mapping: low/medium/high passthrough, xhigh/max → max, unset → omitted (DeepSeek server applies its own default). - Revert the legacy-path additions from PR #15251 — they were dead code, and the `_copy_reasoning_content_for_api` strip block specifically would have nullified the existing reasoning_content padding machinery (`_needs_deepseek_tool_reasoning` → space-pad on replay) that the active provider already relies on for replay correctness. - Unit tests pin the wire-shape contract and the model gating rules (26 tests, all passing). Existing transport + provider profile suites (321 tests) continue to pass. - AUTHOR_MAP: map twebefy@gmail.com → tw2818 for release notes credit. Closes #15700, #17212, #17825. Co-authored-by: tw2818 <twebefy@gmail.com>	2026-05-15 17:03:26 -07:00
twebefy	068c24f8a4	feat(deepseek): add thinking.type + reasoning_effort mapping for DeepSeek API DeepSeek's thinking mode requires both: - extra_body.thinking.type: "enabled" to activate thinking mode - top-level reasoning_effort: "max" or "high" to control depth Previously, the ChatCompletionsTransport only handled Kimi's thinking mode — DeepSeek was left unmapped, so reasoning_effort config was silently dropped. This patch: 1. Adds is_deepseek: bool to the Params dataclass, detected by base_url matching api.deepseek.com 2. Maps Hermes effort levels (xhigh/max → "max", low/medium/high → themselves) to the top-level reasoning_effort parameter 3. Sets extra_body.thinking.type alongside the effort 4. Strips reasoning_content from assistant messages sent back to DeepSeek, preventing 400 errors when thinking was enabled	2026-05-15 17:03:26 -07:00
Teknium	31ba2b0cbc	fix(xai-oauth): recover from prelude SSE errors, gate reasoning replay, surface entitlement 403s (#26644 ) Three fixes for the May 2026 xAI OAuth (SuperGrok / X Premium) rollout failures: - _run_codex_stream: when openai SDK raises RuntimeError("Expected to have received `response.created` before `<type>`"), retry once then fall back to responses.create(stream=True) — same path used for missing-response.completed postlude. Fallback surfaces the real provider error with body+status_code intact. Also fixes #8133 (response.in_progress prelude on custom relays) and #14634 (codex.rate_limits prelude on codex-lb). - _summarize_api_error: when error body matches xAI's entitlement shape, append a one-line hint pointing to https://grok.com and /model. Once-only, applies to both auxiliary warnings and main-loop error surfacing. - _chat_messages_to_responses_input: new is_xai_responses kwarg drops replayed codex_reasoning_items (encrypted_content) before they reach xAI. Also drops reasoning.encrypted_content from the xAI include array. Native Codex behavior unchanged. Grok still reasons natively each turn; coherence rides on visible message text alone. Closes #8133, #14634.	2026-05-15 16:35:12 -07:00
Teknium	032fb84222	docs(hermes_tools_mcp_server): align scope docstring with EXPOSED_TOOLS (#26603 ) The top-of-file scope docstring listed delegate_task, memory, and session_search as exposed tools, but EXPOSED_TOOLS deliberately omits them (they're _AGENT_LOOP_TOOLS and require the running AIAgent context to dispatch — the inline comment block already explains this). Kanban tools, which ARE exposed, were missing from the docstring entirely. Rewrite the Scope / DO NOT expose sections to match the actual tuple: drop delegate_task/memory/session_search from 'expose', add the kanban_* family, move delegate_task/memory/session_search/todo into 'DO NOT expose' with the agent-loop rationale. Fixes #26567 (doc-only fix; option 2 — shimming memory/session_search through MemoryStore/SessionDB directly — left for a follow-up issue once the plugin-memory locking story is audited).	2026-05-15 14:44:27 -07:00
kchantharuan	13c3d4b4ef	feat(nvidia): add NIM billing origin header	2026-05-15 14:06:51 -07:00
Teknium	4e89c53082	fix(async): close unscheduled coroutines in all threadsafe bridges (#26584 ) Wraps every sync->async coroutine-scheduling site in the codebase with a new agent.async_utils.safe_schedule_threadsafe() helper that closes the coroutine on scheduling failure (closed loop, shutdown race, etc.) instead of leaking it as 'coroutine was never awaited' RuntimeWarnings plus reference leaks. 22 production call sites migrated across the codebase: - acp_adapter/events.py, acp_adapter/permissions.py - agent/lsp/manager.py - cron/scheduler.py (media + text delivery paths) - gateway/platforms/feishu.py (5 sites, via existing _submit_on_loop helper which now delegates to safe_schedule_threadsafe) - gateway/run.py (10 sites: telegram rename, agent:step hook, status callback, interim+bg-review, clarify send, exec-approval button+text, temp-bubble cleanup, channel-directory refresh) - plugins/memory/hindsight, plugins/platforms/google_chat - tools/browser_supervisor.py (3), browser_cdp_tool.py, computer_use/cua_backend.py, slash_confirm.py - tools/environments/modal.py (_AsyncWorker) - tools/mcp_tool.py (2 + 8 _run_on_mcp_loop callers converted to factory-style so the coroutine is never constructed on a dead loop) - tui_gateway/ws.py Tests: new tests/agent/test_async_utils.py covers helper behavior under live loop, dead loop, None loop, and scheduling exceptions. Regression tests added at three PR-original sites (acp events, acp permissions, mcp loop runner) mirroring contributor's intent. Live-tested end-to-end: - Helper stress test: 1500 schedules across live/dead/race scenarios, zero leaked coroutines - Race exercised: 5000 schedules with loop killed mid-flight, 100 ok / 4900 None returns, zero leaks - hermes chat -q with terminal tool call (exercises step_callback bridge) - MCP probe against failing subprocess servers + factory path - Real gateway daemon boot + SIGINT shutdown across multiple platform adapter inits - WSTransport 100 live + 50 dead-loop writes - Cron delivery path live + dead loop Salvages PR #2657 — adopts contributor's intent over a much wider site list and a single centralized helper instead of inline try/except at each site. 3 of the original PR's 6 sites no longer exist on main (environments/patches.py deleted, DingTalk refactored to native async); the equivalent fix lives in tools/environments/modal.py instead. Co-authored-by: JithendraNara <jithendranaidunara@gmail.com>	2026-05-15 14:00:01 -07:00
Jaaneek	7fdc16dd4a	refactor(transports/codex): trim duplicated cache-key comments The xAI prompt_cache_key block carried two long comment paragraphs that either restated setdefault semantics, narrated the SDK type-validation mechanism, or recapped the historical motivation for the extra_body indirection — all already covered by the test docstring at test_xai_responses_sends_cache_key_via_extra_body (which links to the xAI docs). Also restored the truncated link in the body-injection comment. No behavior change.	2026-05-15 12:11:32 -07:00
Jaaneek	b62c997973	feat(xai-oauth): add xAI Grok OAuth (SuperGrok Subscription) provider Adds a new authentication provider that lets SuperGrok subscribers sign in to Hermes with their xAI account via the standard OAuth 2.0 PKCE loopback flow, instead of pasting a raw API key from console.x.ai. Highlights ---------- * OAuth 2.0 PKCE loopback login against accounts.x.ai with discovery, state/nonce, and a strict CORS-origin allowlist on the callback. * Authorize URL carries `plan=generic` (required for non-allowlisted loopback clients) and `referrer=hermes-agent` for best-effort attribution in xAI's OAuth server logs. * Token storage in `auth.json` with file-locked atomic writes; JWT `exp`-based expiry detection with skew; refresh-token rotation synced both ways between the singleton store and the credential pool so multi-process / multi-profile setups don't tear each other's refresh tokens. * Reactive 401 retry: on a 401 from the xAI Responses API, the agent refreshes the token, swaps it back into `self.api_key`, and retries the call once. Guarded against silent account swaps when the active key was sourced from a different (manual) pool entry. * Auxiliary tasks (curator, vision, embeddings, etc.) route through a dedicated xAI Responses-mode auxiliary client instead of falling back to OpenRouter billing. * Direct HTTP tools (`tools/xai_http.py`, transcription, TTS, image-gen plugin) resolve credentials through a unified runtime → singleton → env-var fallback chain so xai-oauth users get them for free. * `hermes auth add xai-oauth` and `hermes auth remove xai-oauth N` are wired through the standard auth-commands surface; remove cleans up the singleton loopback_pkce entry so it doesn't silently reinstate. * `hermes model` provider picker shows "xAI Grok OAuth (SuperGrok Subscription)" and the model-flow falls back to pool credentials when the singleton is missing. Hardening --------- * Discovery and refresh responses validate the returned `token_endpoint` host against the same `.x.ai` allowlist as the authorization endpoint, blocking MITM persistence of a hostile endpoint. Discovery / refresh / token-exchange `response.json()` calls are wrapped to raise typed `AuthError` on malformed bodies (captive portals, proxy error pages) instead of leaking JSONDecodeError tracebacks. * `prompt_cache_key` is routed through `extra_body` on the codex transport (sending it as a top-level kwarg trips xAI's SDK with a TypeError). * Credential-pool sync-back preserves `active_provider` so refreshing an OAuth entry doesn't silently flip the active provider out from under the running agent. Testing ------- * New `tests/hermes_cli/test_auth_xai_oauth_provider.py` (~63 tests) covers JWT expiry, OAuth URL params (plan + referrer), CORS origins, redirect URI validation, singleton↔pool sync, concurrency races, refresh error paths, runtime resolution, and malformed-JSON guards. * Extended `test_credential_pool.py`, `test_codex_transport.py`, and `test_run_agent_codex_responses.py` cover the pool sync-back, `extra_body` routing, and 401 reactive refresh paths. * 165 tests passing on this branch via `scripts/run_tests.sh`.	2026-05-15 12:11:32 -07:00
Siddharth Balyan	5af672c753	chore: remove Atropos RL environments and tinker-atropos integration (#26106 ) * chore: remove Atropos RL environments, tools, tests, skill, and tinker-atropos submodule Delete: - environments/ (43 files — base env, agent loop, tool call parsers, benchmarks) - rl_cli.py (standalone RL training CLI) - tools/rl_training_tool.py (all 10 rl_* tools) - tests: test_rl_training_tool, test_tool_call_parsers, test_managed_server_tool_support, test_agent_loop, test_agent_loop_vllm, test_agent_loop_tool_calling, test_terminalbench2_env_security - optional-skills/mlops/hermes-atropos-environments/ - tinker-atropos git submodule + .gitmodules * chore: remove RL/Atropos references from Python source - toolsets.py: remove rl toolset block + update comment - model_tools.py: remove rl_tools group + update async bridging comment - hermes_cli/tools_config.py: remove RL display entry, _DEFAULT_OFF_TOOLSETS, setup block, and rl_training post-setup handler - tools/budget_config.py: remove RL environment reference in docstring - tests/test_model_tools.py: remove rl_tools from expected groups - tests/run_agent/test_streaming_tool_call_repair.py: fix stale cross-reference * chore: remove rl/yc-bench extras and tinker-atropos refs from pyproject.toml - Remove rl extra (atroposlib, tinker, fastapi, uvicorn, wandb) - Remove yc-bench extra - Remove rl_cli from py-modules - Remove [tool.ty.src] exclude for tinker-atropos - Remove [tool.ruff] exclude for tinker-atropos - Regenerate uv.lock * chore: remove tinker-atropos from install/setup scripts - setup-hermes.sh: remove entire tinker-atropos submodule install block - scripts/install.sh: remove both tinker-atropos blocks (Termux + standard) - scripts/install.ps1: remove tinker-atropos block - nix/hermes-agent.nix: remove tinker-atropos pip install line * chore: remove RL references from cli-config.yaml.example * docs: remove Atropos/RL references from README, CONTRIBUTING, AGENTS.md * docs: remove RL/Atropos references from website - Delete: environments.md, rl-training.md, mlops-hermes-atropos-environments.md - sidebars.ts: remove rl-training and environments sidebar entries - optional-skills-catalog.md: remove hermes-atropos-environments row - tools-reference.md: remove entire rl toolset section - toolsets-reference.md: remove rl row + update example - integrations/index.md: remove RL Training bullet - architecture.md: remove environments/ from tree + RL section - contributing.md: remove tinker-atropos setup - updating.md: remove tinker-atropos install + stale submodule update * chore: remove remaining RL/Atropos stragglers - hermes_cli/config.py: remove TINKER_API_KEY + WANDB_API_KEY env var defs - hermes_cli/doctor.py: remove Submodules check section (tinker-atropos) - hermes_cli/setup.py: remove RL Training status check - hermes_cli/status.py: remove Tinker + WandB from API key status display - agent/display.py: remove both rl_* tool preview/activity blocks - website/docs: remove RL references from providers.md + env-variables.md - tests: remove TINKER_API_KEY from conftest, set_config_value, setup_script * chore: remove RL training section from .env.example	2026-05-15 10:36:38 +05:30
Harry Riddle	e8b9f5ff9a	fix(aux): surface Nous auth-unavailable warning in auxiliary client When the auxiliary client falls through Nous (e.g. no stored auth, or runtime credential mint failed), users currently see only `debug`-level lines, so the next provider in the fallback chain takes over silently. Promote the no-auth path to a warning that tells operators to run `hermes auth`, and add a debug breadcrumb on the rarer mint-failed-but-stored-auth-still-present fallback path so the existing behavior (use the raw stored token) is preserved while staying investigable. Salvaged from #23881 by @0xharryriddle. The contributor's original patch also short-circuited the second branch with a return, which broke the pool-entry fallback path covered by `test_try_nous_uses_pool_entry` — kept the warning intent, dropped the return so the fallback still works. Dropped the contributor's changes to `hermes_cli/goals.py` because the goal-pause path is unreachable when the auxiliary client is None (`judge_goal` returns `parse_failed=False`, which resets `consecutive_parse_failures`), so the reason string they added never surfaces in the pause message. Refs #23876	2026-05-14 20:15:29 -07:00
ethernet	1702a94c88	Merge pull request #25957 from stephenschoettler/fix/main-ci-unblocker-after-21012 fix(ci): stabilize shared test state after 21012	2026-05-14 21:26:52 -04:00
Teknium	19071529f6	fix(lsp): shift baseline diagnostics into post-edit coordinates (#25978 ) Pre-existing diagnostics below an edit point used to surface as 'LSP diagnostics introduced by this edit' whenever the edit deleted or inserted lines. The delta-filter key included the diagnostic's range, so the same logical error reported at a different line in the post-edit snapshot looked like a brand new diagnostic. Concrete case: deleting 14 lines in cli.py caused Pyright errors at lines 9873, 10590, 12413, 13004 (unrelated to the edit) to be reported as introduced by it. Fix: build a piecewise-linear line-shift map (via difflib's SequenceMatcher) from pre and post content, and remap baseline diagnostics into post-edit coordinates before the set-difference. Diagnostics in deleted regions drop out cleanly; diagnostics below the edit shift by the right amount; diagnostics above are untouched. The strict (range-aware) equality key stays — so a genuinely new instance of an identical error class at a different line still surfaces as new. Pieces: - agent/lsp/range_shift.py — build_line_shift, shift_diagnostic_range, shift_baseline. Pure functions, no LSP state. - agent/lsp/manager.py — LSPService.get_diagnostics_sync gains an optional line_shift kwarg; baseline is shift_baseline'd before computing the seen-set. _diag_key keeps the strict range key. - tools/file_operations.py — write_file captures pre_content for any LSP-handled extension (not just LINTERS_INPROC) and passes pre/post to _maybe_lsp_diagnostics, which builds the shift map. - New _lsp_handles_extension helper guards the pre_content read. Trade-offs preserved: - Genuinely new same-class errors at different lines still surface (content-only key would have swallowed them). - Pre-existing errors at unshifted positions still get filtered (covered by the strict-key path with no shift). - Best-effort: when pre_content can't be captured (file didn't exist, permissions), the unshifted comparison still catches most pre-existing errors; the edge case it misses is a new file with a non-empty baseline, which is structurally impossible.	2026-05-14 15:56:07 -07:00
Teknium	fe83c4001b	fix(codex-app-server): attach redacted stderr tail to generic failures (#25929 ) When codex app-server fails outside the OAuth-classified path (non-auth turn/start errors, plain TimeoutErrors, generic turn-ended status, subprocess silently exits, hard deadline timeout), the user got a bare 'Internal error' / 'turn/start failed: ...' with no context. Diagnosing config/provider/auth-bridge issues forced a re-run with verbose codex flags. Add a _format_error_with_stderr helper that appends the last few stderr lines via agent.redact.redact_sensitive_text(force=True), and use it at every catch-all error site: - ensure_started() failures (codex init / thread/start) now return a TurnResult.error with should_retire=True instead of bubbling - non-OAuth turn/start CodexAppServerError / TimeoutError - subprocess-died branch (previously dumped raw stderr_blob[-300:] with no redaction — a leak risk) - turn ended with non-completed status - hard turn-timeout deadline OAuth-classified failures and the post-tool quiet watchdog already produce clean hints and stay unchanged. The redactor catches sk-, gh_*, Authorization: Bearer, query-string tokens, JWTs, private keys, etc., so provider error payloads can't leak into chat output or trajectories. Inspired by openclaw#80718, adapted for our app-server transport.	2026-05-14 14:55:23 -07:00
Stephen Schoettler	5ce0067c08	fix(ci): stabilize shared test state after 21012	2026-05-14 14:28:14 -07:00
EthanGuo-coder	26933c2f59	fix(agent/gemini-cloudcode): seed delta defaults for reasoning-only stream chunks _make_stream_chunk built delta_kwargs with only `role`, so a reasoning-only chunk produced a SimpleNamespace without a `.content` attribute. Downstream consumers that read `delta.content` then raised AttributeError on Gemini 2.5 Flash, where the thinking delta arrives before any content delta. Seed `content`, `tool_calls`, `reasoning`, and `reasoning_content` as None up front, matching the pattern already used in gemini_native_adapter.py. Key-present arguments still override the defaults. Fixes #24974 References: Related open PR #24984 (luyao618) applies the same 1-line fix; this PR adds a regression test that #24984 omits Co-Authored-By: Claude <noreply@anthropic.com>	2026-05-14 08:03:56 -07:00
Teknium	12f755c9eb	fix(codex-runtime): retire wedged sessions + post-tool watchdog + OAuth refresh classify (#25769 ) Mirrors openclaw beta.8's app-server resilience fixes so a stuck codex subprocess can't burn the full turn deadline and so users get a `codex login` pointer instead of raw RPC errors when their token expires. - TurnResult.should_retire signals the caller to drop+respawn codex. - Deadline-hit path and dead-subprocess detection set should_retire so the next turn doesn't ride a CPU-spinning or auth-broken process. - Post-tool watchdog (post_tool_quiet_timeout=90s): if a tool item completes and codex goes silent past the threshold without further output or turn/completed, fast-fail instead of waiting the full 600s. Resets on any non-tool activity so normal think-after-tool flows are not affected. - <turn_aborted> and <turn_aborted/> in agent text are treated as terminal — some codex builds tear down a turn that way without emitting turn/completed. - _classify_oauth_failure() inspects RPC error message + stderr tail for invalid_grant / token refresh / 401 / etc. and rewrites user-facing errors to 'run codex login'. Conservative: generic failures still surface verbatim. Fires at turn/start failure, turn/completed failure, and dead-subprocess paths. - thread/start cross-fill: tolerate thread.id, thread.sessionId, top-level sessionId/threadId so future codex schema drift doesn't KeyError us at handshake. - run_agent.py: when run_turn returns should_retire=True OR raises, close + null self._codex_session so the next turn respawns. Tests: +30 cases across session + integration suites. tests/agent/transports/test_codex_app_server_session.py 50/50 pass tests/run_agent/test_codex_app_server_integration.py 27/27 pass Broader codex scope (transports + cli runtime/migration) 376/376 pass	2026-05-14 07:55:09 -07:00
Alex-wuhu	1551ce46a4	docs: update NovitaAI description to "90+ models, pay-per-use"	2026-05-13 23:51:15 -07:00
Alex-wuhu	c76e879574	feat: add NovitaAI as LLM provider Add NovitaAI as a first-class provider with dedicated model selection flow, live pricing, and authoritative context length resolution. - Register provider in PROVIDER_REGISTRY, HERMES_OVERLAYS, and all alias/label maps (ID: novita, aliases: novita-ai, novitaai) - Add dedicated _model_flow_novita() with 3-tier model list fallback: Novita API → models.dev → static curated list - Fetch live pricing from /v1/models with correct unit conversion (input_token_price_per_m is 0.0001 USD per Mtok) - Add Novita-specific context length resolution (step 4b) in get_model_context_length(), prioritized over models.dev/OpenRouter - Register api.novita.ai in _URL_TO_PROVIDER to prevent early return from the custom-endpoint code path - Add models.dev mapping (novita → novita-ai) - Add default auxiliary model (deepseek/deepseek-v3-0324) - Add NOVITA_API_KEY to test isolation (conftest.py) - Update docs: providers page, env vars reference, CLI reference, .env.example, README, and landing page	2026-05-13 23:51:15 -07:00
AllynSheep	057f5a31d1	fix(auxiliary): skip providers without credentials immediately When the auxiliary client fallback chain reaches a provider that has no credentials configured (no API key, no pool entry), the current code just returns (None, None) which counts toward the per-call timeout budget on the next attempt. Mark the provider unhealthy with a short TTL so the chain advances quickly to the next viable option. Closes #25384. Salvage of #25395 by @AllynSheep.	2026-05-13 23:10:33 -07:00
kshitijk4poor	657e6d87cc	fix(web): align _LEGACY_PREFERENCE with legacy 7-provider order + doc cleanup Self-review of the plugin migration surfaced one warning and a handful of doc/dead-code cleanups. None affect production behaviour through the main dispatcher (which always calls `tools.web_tools._get_backend()` first and preserves the full 7-provider walk), but direct callers of `agent.web_search_registry.get_active__provider()` previously diverged from the legacy order and could return `None` for users with credentials but no explicit `web.backend` config key. Changes ------- 1. `_LEGACY_PREFERENCE` was shipped as a 4-tuple `("brave-free", "firecrawl", "searxng", "ddgs")` while the PR description and the legacy `_get_backend()` candidate order both call for the 7-tuple `(firecrawl, parallel, tavily, exa, searxng, brave-free, ddgs)`. Replaced with the 7-tuple. Verified empirically: with TAVILY+EXA keys and no config, `get_active_search_provider()` now returns tavily (was None); with EXA+PARALLEL it returns parallel (was None); with BRAVE+FIRECRAWL it returns firecrawl (was brave-free). 2. `agent/web_search_registry.py` — module docstring, `_resolve` step-3 docstring, and inline comment all listed the old 4-tuple and claimed "brave-free first because it was the shipped default". The legacy default is `"firecrawl"`. Rewritten to match the new ordering and reference `tools.web_tools._get_backend()` as the source of truth. 3. `agent/web_search_registry.py` — `get_active_crawl_provider` docstring said "only Tavily implements it among built-in providers". Firecrawl also advertises `supports_crawl=True` after the previous commit. Updated to "Tavily and Firecrawl". 4. `plugins/web/tavily/provider.py` — module docstring said "Tavily is the only built-in backend that natively crawls". Updated. 5. `agent/web_search_provider.py` — ABC docstring mentioned only `search` / `extract` capabilities. Added `crawl` for accuracy. 6. `plugins/web/{firecrawl,parallel,exa}/provider.py` — dead plugin-level cache globals (`_firecrawl_client`, `_parallel_client`, `_async_parallel_client`, `_exa_client`) were declared but never read (all reads/writes go through `_wt.` per the `extracting-inline- helpers-to-plugins` recipe). Removed the dead declarations; the reset-for-tests helpers in firecrawl + parallel now clear the canonical `_wt._<name>` slots, matching the pattern exa already used. Tests ----- 218/218 web-targeted tests still pass (no test changes needed). 4910/4910 in `tests/tools/` still green.	2026-05-13 22:31:28 -07:00
kshitijk4poor	39b4ebfcea	refactor(web): delete legacy tools/web_providers/ directory + migrate ABC tests Removes the legacy in-tree provider scaffolding that PR #25182 fully replaced with the plugin architecture: tools/web_providers/__init__.py (6 lines) tools/web_providers/base.py (89 lines — old ABCs) tools/web_providers/ARCHITECTURE.md (73 lines — old design doc) These were the staging-ground ABCs and provider modules that the plugin migration absorbed. All seven web providers now implement the single :class:`agent.web_search_provider.WebSearchProvider` ABC and live under ``plugins/web/<vendor>/``. Nothing else in the tree imports ``tools.web_providers`` — verified via grep before deletion. Test migration (tests/tools/test_web_providers.py) -------------------------------------------------- Rewrote ``TestWebProviderABCs`` to test the new unified ABC at :mod:`agent.web_search_provider`: - test_cannot_instantiate_abc_directly — abstract ``name`` + ``is_available`` - test_concrete_search_only_provider_works — exercise default ``supports_extract=False`` / ``supports_crawl=False`` flags - test_concrete_multi_capability_provider_works — exercise all three capabilities, async extract supported (declared sync here for simplicity; real plugins like parallel + firecrawl use async) - test_search_only_provider_skips_extract_and_crawl — verify ``supports_*()`` flags default to False so search-only providers don't have to implement extract() or crawl() The 9 other tests in the file (per-capability backend selection, DEFAULT_CONFIG merge, dispatcher routing) test public helpers in ``tools.web_tools`` that still exist and pass unchanged. agent/web_search_provider.py docstring updated to reflect that the legacy ABCs no longer exist; the response-shape contract is preserved bit-for-bit so external consumers see no behavioral change. Net diff -------- - tools/web_providers/ removed (-168 lines) - tests/tools/test_web_providers.py rewritten ABC section (+78/-30 net, same coverage, new API) - agent/web_search_provider.py docstring (-3/+5 lines) Verified -------- - 173/173 targeted web tests pass - 12/12 ABC contract tests pass with the new interface - No remaining grep hits for ``tools.web_providers`` outside of intentional historical references in plugin docstrings.	2026-05-13 22:31:28 -07:00
kshitijk4poor	e3f0a88891	feat(web): extend ABC with supports_crawl and async-extract semantics Two ABC additions to cover the surface area of the remaining four providers (exa, parallel, tavily, firecrawl) which were untouched by the initial spike: 1. supports_crawl() + crawl() — Tavily natively crawls a seed URL via its /crawl endpoint. Exposing supports_crawl=True lets the crawl tool's dispatcher route to Tavily when configured, falling back to the auxiliary-model summarization path otherwise. Firecrawl could add this in a follow-up (the SDK supports it; we just don't surface it as a tool today). 2. Async-or-sync extract() — Parallel's SDK is natively async (AsyncParallel.beta.extract); Exa and Tavily are sync; Firecrawl is sync but called inside asyncio.to_thread() with a 60s timeout. The ABC docstring now permits either shape: implementations declare their own sync/async signature and the dispatcher uses inspect.iscoroutinefunction to detect and await. Also adds get_active_crawl_provider() to web_search_registry mirroring the search/extract resolvers, with web.crawl_backend as the explicit override config key. No behavior change on its own — these are scaffolds for the four remaining provider migrations.	2026-05-13 22:31:28 -07:00
kshitijk4poor	0a7cbd3342	fix(plugins): filter resolution by is_available() in web + image_gen registries Both web_search_registry._resolve() and image_gen_registry.get_active_provider() walked their registered providers and returned the first one matching the capability flag — without checking whether that provider was actually usable. On a fresh install with no credentials at all, this meant get_active_search_provider() returned `brave-free` (legacy preference order) even though BRAVE_SEARCH_API_KEY was unset, leading the dispatcher to surface a "BRAVE_SEARCH_API_KEY is not set" error for a provider the user never chose. Same bug shape in image_gen for FAL. Resolution semantics now match tools.web_tools._get_backend(): 1. Explicit config name wins, ignoring is_available() — the dispatcher surfaces a precise "X_API_KEY is not set" error rather than silently switching backends. Matches user expectation: "I configured X, tell me what's wrong with X." 2. Fallback (no explicit config) walks the legacy preference order filtered by is_available() — pick the highest-priority backend the user actually has credentials for. is_available() is wrapped in a try/except so a buggy provider doesn't brick resolution. E2E verified: - No creds + no config: get_active_search_provider() -> None - Explicit brave-free + no key: get_active_search_provider() -> brave-free (and .is_available() correctly reports False) This fix was identified during the spike (#25182 finding #1) and is fold-in to the same PR rather than a follow-up.	2026-05-13 22:31:28 -07:00
kshitijk4poor	007a630b16	feat(web): add web search provider registry mirroring image_gen pattern	2026-05-13 22:31:28 -07:00
kshitijk4poor	2cea98e143	feat(web): add WebSearchProvider ABC mirroring image_gen template	2026-05-13 22:31:28 -07:00
teknium1	4ceab16893	fix(compression): keep default protect_first_n at 3 + align ABC Follow-up on the salvaged feat commit: - Keep the constructor / config / yaml-example default at 3 so existing gateway and CLI users see no behavioural change. PR #13754 (which this builds on) had lowered the default to 2 to chase pre-feature parity in the system-prompt-present case, at the cost of quietly halving the protected head for the gateway path (which strips the system prompt before calling compress()). With the new "system prompt is implicit" semantics, default 3 gives every caller a stable head shape. - agent/context_engine.py: bring the ABC's protect_first_n docstring in line with the new semantics so plugin context engines interpret the config key the same way the built-in compressor does. - tests: adjust the default-value test (3, not 2) and a stale comment; per-test protect_first_n=2/3/1 values added in PR #13754 stay as-is since those tests fix concrete head shapes.	2026-05-13 22:25:16 -07:00
snav	dee71a31e5	feat(compression): make protect_first_n configurable The number of head messages preserved verbatim across context compactions was previously hardcoded to 3 in AIAgent.__init__. Expose it as `compression.protect_first_n` in config, matching the existing `protect_last_n` pattern. Motivation: users who rely on rolling compaction for long-running sessions had the opening user/assistant exchange pinned as head forever, which doesn't always match how they want the session framed after many compactions. Lowering to 1 preserves the system prompt + first non-system message; lowering to 0 preserves only the system prompt and lets the entire first exchange age out naturally through the summary. Semantics: `protect_first_n` counts non-system head messages protected in addition to the system prompt, which is always implicitly protected when present. Same meaning across both code paths: protect_first_n=0 → system prompt only (or nothing if no system message) protect_first_n=2 → system prompt + first 2 non-system messages (default) This unifies the CLI path (which reads messages with the system prompt at position 0) and the gateway path (where the gateway /compress handler strips the system prompt before calling compress() — see gateway/run.py L9150-9154 on the parent fork). Previously these two paths disagreed: CLI path: protect_first_n=1 → protect system prompt only Gateway path: protect_first_n=1 → protect first USER turn forever In practice on long-running gateway sessions the old semantics pinned whatever stale aside happened to be the first user message, reinserting it into every compaction summary indefinitely. Default chosen as 2 (not 3) so that the effective protected head count remains 3 messages in the common case — assuming a system prompt is present, default protection becomes system + 2 non-system = 3 total, matching the pre-feature behaviour where `protect_first_n` was hardcoded to protect 3 messages total. Sessions without a system prompt will see a small behaviour change (2 protected head messages instead of 3), but this is the rare path and the new semantics make the system-prompt-present case the well-defined one. Changes: - agent/context_compressor.py: redefine protect_first_n as the count of non-system head messages protected beyond the implicit system-prompt guarantee; both paths converge. Constructor default updated to 2. - hermes_cli/config.py: add `compression.protect_first_n` default (2), matching the new semantics. `show_config` label tweaked to 'Protect first: N non-system head messages' for clarity. - run_agent.py: read protect_first_n from config; 0 is now valid (system prompt is always implicitly protected). - cli-config.yaml.example: document the new key and rationale. - tests/agent/test_context_compressor.py: cover default, override, the end-to-end `protect_first_n=0` and `protect_first_n=1` behaviour, the no-system-prompt (gateway) path, and the new shared-semantics regression test. Fixes #13751 Tested on Ubuntu 24.04.	2026-05-13 22:25:16 -07:00
Teknium	091d8e1030	feat(codex-runtime): optional codex app-server runtime for OpenAI/Codex models (#24182 ) * feat(codex-runtime): scaffold optional codex app-server runtime Foundational commit for an opt-in alternate runtime that hands OpenAI/Codex turns to a 'codex app-server' subprocess instead of Hermes' tool dispatch. Default behavior is unchanged. Lands in three pieces: 1. agent/transports/codex_app_server.py — JSON-RPC 2.0 over stdio speaker for codex's app-server protocol (codex-rs/app-server). Spawn, init handshake, request/response, notification queue, server-initiated request queue (for approval round-trips), interrupt-friendly blocking reads. Tested against real codex 0.130.0 binary end-to-end during development. 2. hermes_cli/runtime_provider.py: - Adds 'codex_app_server' to _VALID_API_MODES. - Adds _maybe_apply_codex_app_server_runtime() helper, called at the end of _resolve_runtime_from_pool_entry(). Inert unless 'model.openai_runtime: codex_app_server' is set in config.yaml AND provider in {openai, openai-codex}. Other providers cannot be rerouted (anthropic, openrouter, etc. preserved). 3. tests/agent/transports/test_codex_app_server_runtime.py — 24 tests covering api_mode registration, the rewriter helper (default-off, case-insensitive, opt-in, non-eligible providers preserved), version parser, missing-binary handling, error class. Does NOT require codex CLI installed. This commit is wire-only: the api_mode is recognized but AIAgent does not yet branch on it. Followup commits add the session adapter, event projector, approval bridge, transcript projection (so memory/skill review still works), plugin migration, and slash command. Existing tests remain green: - tests/cli/test_cli_provider_resolution.py (29 passed) - tests/agent/test_credential_pool_routing.py (included above) * feat(codex-runtime): add codex item projector for memory/skill review The translator that lets Hermes' self-improvement loop keep working under the Codex runtime: converts codex 'item/' notifications into Hermes' standard {role, content, tool_calls, tool_call_id} message shape that agent/curator.py already knows how to read. Item taxonomy (matches codex-rs/app-server-protocol/src/protocol/v2/item.rs): - userMessage → {role: user, content} - agentMessage → {role: assistant, content: text} - reasoning → stashed in next assistant's 'reasoning' field - commandExecution → assistant tool_call(name='exec_command') + tool result - fileChange → assistant tool_call(name='apply_patch') + tool result - mcpToolCall → assistant tool_call(name='mcp.<server>.<tool>') + tool result - dynamicToolCall → assistant tool_call(name=<tool>) + tool result - plan/hookPrompt/etc → opaque assistant note, no fabricated tool_calls Invariants preserved: - Message role alternation never violated: each tool item produces at most one assistant + one tool message in that order, correlated by call_id. - Streaming deltas (item/<type>/outputDelta, item/agentMessage/delta) don't materialize messages — only item/completed does. Mirrors how Hermes already only writes the assistant message after streaming ends. - Tool call ids are deterministic (codex item id-based) so replays produce identical messages and prefix caches stay valid (AGENTS.md pitfall #16). - JSON args use sorted_keys for the same reason. Real wire formats verified against codex 0.130.0 by capturing live notifications from thread/shellCommand and including one as a fixture (COMMAND_EXEC_COMPLETED). 23 new tests, all green: - Streaming deltas don't materialize (3 paths) - Turn/thread frame events are silent - commandExecution: 5 tests including non-zero exit annotation + deterministic id stability across replays - agentMessage + reasoning attachment + reasoning consumption - fileChange: summary without inlined content - mcpToolCall: namespaced naming + error surfacing - userMessage: text fragments only (drops images/etc) - opaque items: no fabricated tool_calls - Helpers: deterministic id stability + sorted JSON args - Role alternation invariant across all four tool-shaped item types This commit is a pure addition. AIAgent integration (the wire that uses the projector) is the next commit. feat(codex-runtime): add session adapter + approval bridge The third self-contained module: CodexAppServerSession owns one Codex thread per Hermes session, drives turn/start, consumes streaming notifications via CodexEventProjector, handles server-initiated approval requests, and translates cancellation into turn/interrupt. The adapter has a single public per-turn method: result = session.run_turn(user_input='...', turn_timeout=600) # result.final_text → assistant text for the caller # result.projected_messages → list ready to splice into AIAgent.messages # result.tool_iterations → tick count for _iters_since_skill nudge # result.interrupted → True on Ctrl+C / deadline / interrupt # result.error → error string when the turn cannot complete # result.turn_id, thread_id → for sessions DB / resume Behavior: - ensure_started() spawns codex, does the initialize handshake, and issues thread/start with cwd + permissions profile. Idempotent. - run_turn() blocks until turn/completed, drains server-initiated requests (approvals) before reading notifications so codex never deadlocks waiting for us, projects every item/completed via the projector, and increments tool_iterations for the skill nudge gate. - request_interrupt() is thread-safe (threading.Event); the next loop iteration issues turn/interrupt and unwinds. - turn_timeout deadlock guard issues turn/interrupt and records an error if the turn never completes. - close() escalates terminate → kill via the underlying client. Approval bridge: Codex emits server-initiated requests for execCommandApproval and applyPatchApproval. The adapter translates Hermes' approval choice vocabulary onto codex's decision vocabulary: Hermes 'once' → codex 'approved' Hermes 'session' or 'always' → codex 'approvedForSession' Hermes 'deny' / anything else → codex 'denied' Routing precedence: 1. _ServerRequestRouting.auto_approve_* flags (cron / non-interactive) 2. approval_callback wired by the CLI (defers to tools.approval.prompt_dangerous_approval()) 3. Fail-closed denial when neither is wired Unknown server-request methods are answered with JSON-RPC error -32601 so codex doesn't hang waiting for us. Permission profile mapping mirrors AGENTS.md: Hermes 'auto' → codex 'workspace-write' Hermes 'approval-required' → codex 'read-only-with-approval' Hermes 'unrestricted/yolo' → codex 'full-access' 20 new tests, all green. Combined with prior commits this PR now has 67 tests across three modules: - test_codex_app_server_runtime.py: 24 (api_mode + transport surface) - test_codex_event_projector.py: 23 (item taxonomy projections) - test_codex_app_server_session.py: 20 (turn loop + approvals + interrupts) Full tests/agent/transports/ directory: 249/249 pass — no regressions to existing transport tests. Still no wire into AIAgent.run_conversation(); that integration commit is small and goes next. * feat(codex-runtime): wire codex_app_server runtime into AIAgent The integration commit. AIAgent.run_conversation() now early-returns to a new helper _run_codex_app_server_turn() when self.api_mode == 'codex_app_server', bypassing the chat_completions tool loop entirely. Three small surgical edits to run_agent.py (~105 LOC total): 1. Line ~1204 (constructor api_mode validation set): Add 'codex_app_server' so an explicit api_mode='codex_app_server' passed to AIAgent() isn't silently rewritten to 'chat_completions'. 2. Line ~12048 (run_conversation, just before the while loop): Early-return to _run_codex_app_server_turn() when self.api_mode is 'codex_app_server'. Placed AFTER all standard pre-loop setup — logging context, session DB, surrogate sanitization, _user_turn_count and _turns_since_memory increments, _ext_prefetch_cache, memory manager on_turn_start — so behavior outside the model-call loop is identical between paths. Default Hermes flow is unchanged when the flag is off. 3. End-of-class (line ~15497): New method _run_codex_app_server_turn(). Lazy-instantiates one CodexAppServerSession per AIAgent (reused across turns), runs the turn, splices projected_messages into messages, increments _iters_since_skill by tool_iterations (since the chat_completions loop normally does that per iteration), fires _spawn_background_review on the same cadence as the default path. Counter accounting: _turns_since_memory ← already incremented at run_conversation:11817 (gated on memory store configured) — codex helper does NOT touch it (would double-count). _user_turn_count ← already incremented at run_conversation:11793 — codex helper does NOT touch it. _iters_since_skill ← incremented in the chat_completions loop per tool iteration. Codex helper increments by turn.tool_iterations since the loop is bypassed. User message: ALREADY appended to messages by run_conversation pre-loop (line 11823) before the early-return reaches us. Helper does NOT append again. Regression test test_user_message_not_duplicated guards this. Approval callback wiring: Lazy-fetches tools.terminal_tool._get_approval_callback at session spawn time, passes to CodexAppServerSession. CLI threads with prompt_toolkit get interactive approvals; gateway/cron contexts get the codex-side fail-closed deny. Error path: Codex session exceptions become a 'partial' result with completed=False and a final_response that explicitly tells the user how to switch back: 'Codex app-server turn failed: ... Fall back to default runtime with /codex-runtime auto.' Same return-dict shape as the chat_completions path so all callers (gateway, CLI, batch_runner, ACP) work unchanged. 9 new integration tests in tests/run_agent/test_codex_app_server_integration.py: - api_mode='codex_app_server' is accepted on AIAgent construction - run_conversation returns the expected codex shape (final_response, codex_thread_id, codex_turn_id, completed, partial) - Projected messages are spliced into messages list - _iters_since_skill ticks per tool iteration - _user_turn_count delegated to standard flow (not double-counted) - User message appears exactly once (regression guard) - _spawn_background_review IS invoked (memory/skill review keeps working) - chat.completions.create is NEVER called (loop fully bypassed) - Session exception → partial result with /codex-runtime auto hint - Interrupted turn → partial result with error preserved Adjacent test runs confirm no regressions: - tests/run_agent/test_memory_nudge_counter_hydration.py: green - tests/run_agent/test_background_review.py: green - tests/run_agent/test_fallback_model.py: green - tests/agent/transports/: 249/249 green Still missing for full feature: /codex-runtime slash command, plugin migration helper, docs page, live e2e test gated on codex binary. Those are the remaining followup commits. * feat(codex-runtime): add /codex-runtime slash command (CLI + gateway) User-facing toggle for the optional codex app-server runtime. Follows the 'Adding a Slash Command (All Platforms)' pattern from AGENTS.md exactly: single CommandDef in the central registry → CLI handler → gateway handler → running-agent guard → all surfaces (autocomplete, /help, Telegram menu, Slack subcommands) update automatically. Surface: /codex-runtime — show current state + codex CLI status /codex-runtime auto — Hermes default runtime /codex-runtime codex_app_server — codex subprocess runtime /codex-runtime on / off — synonyms Files changed: hermes_cli/codex_runtime_switch.py (new): Pure-Python state machine shared by CLI and gateway. Parse args, read/write model.openai_runtime in the config dict, gate enabling behind a codex --version check (don't let users opt in to a runtime they have no binary for; print npm install hint instead). Returns a CodexRuntimeStatus dataclass that callers render however suits their surface. hermes_cli/commands.py: Single CommandDef entry, no aliases (codex-runtime is its own thing). cli.py: Dispatch in process_command() + _handle_codex_runtime() handler that delegates to the shared module and renders results via _cprint. gateway/run.py: Dispatch in _handle_message() + _handle_codex_runtime_command() that returns a string (gateway sends as message). On a successful change that requires a new session, _evict_cached_agent() forces the next inbound message to construct a fresh AIAgent with the new api_mode — avoids prompt-cache invalidation mid-session. gateway/run.py running-agent guard: /codex-runtime joins /model in the early-intercept block so a runtime flip mid-turn can't split a turn across two transports. Tests: tests/hermes_cli/test_codex_runtime_switch.py — 25 tests covering the state machine: arg parsing (10 cases incl. case-insensitive and synonyms), reading current runtime (5 cases incl. malformed configs), writing runtime (3 cases), apply() entry point covering read-only, no-op, codex-missing-blocked, codex-present-success, disable-no-binary-check, and persist-failure paths (8 cases). All green. Adjacent test suites confirm no regressions: - tests/hermes_cli/test_commands.py + test_codex_runtime_switch.py: 167/167 green - tests/agent/transports/: 283/283 green when combined with prior commits Still missing: plugin migration helper, docs page, live e2e test gated on codex binary. Followup commits. * feat(codex-runtime): auto-migrate Hermes MCP servers to ~/.codex/config.toml Translates the user's mcp_servers config from ~/.hermes/config.yaml into the TOML format codex's MCP client expects. Wired into the /codex-runtime codex_app_server enable path so users get their MCP tool surface in the spawned subprocess automatically. The migration runs on every enable. Failures are non-fatal — the runtime change still proceeds and the user gets a warning so they can fix the codex config manually. What translates (mapping verified against codex-rs/core/src/config/edit.rs): Hermes mcp_servers.<n>.command/args/env → codex stdio transport Hermes mcp_servers.<n>.url/headers → codex streamable_http transport Hermes mcp_servers.<n>.timeout → codex tool_timeout_sec Hermes mcp_servers.<n>.connect_timeout → codex startup_timeout_sec Hermes mcp_servers.<n>.cwd → codex stdio cwd Hermes mcp_servers.<n>.enabled: false → codex enabled = false What does NOT translate (warned + skipped per server): Hermes-specific keys (sampling, etc.) — codex's MCP client has no equivalent. Listed in the per-server skipped[] field of the report. What's NOT migrated (intentional): AGENTS.md — codex respects this file natively in its cwd. Hermes' own AGENTS.md (project-level) is already in the worktree, so codex picks it up without translation. No code needed. Idempotency design: All managed content lives between a 'managed by hermes-agent' marker and the next non-mcp_servers section header. _strip_existing_managed_block removes the prior managed region cleanly, preserving any user-added codex config (model, providers.openai, sandbox profiles, etc.) above or below. Files added: hermes_cli/codex_runtime_plugin_migration.py — pure-Python migration helper. Public API: migrate(hermes_config, codex_home=None, dry_run=False) returns MigrationReport with .migrated/.errors/ .skipped_keys_per_server. No external TOML dependency — minimal formatter handles strings/numbers/booleans/lists/inline-tables. tests/hermes_cli/test_codex_runtime_plugin_migration.py — 39 tests covering: - per-server translation (12): stdio/http/sse, cwd, timeouts, enabled flag, command+url precedence, sampling drop, unknown keys - TOML formatter (8): types, escaping, inline tables, error case - existing-block stripping (4): no marker, alone, with user content above, with user content below - end-to-end migrate() (8): empty, dry-run, round-trip, idempotent re-run, preserves user config, error reporting, invalid input, summary formatting Files changed: hermes_cli/codex_runtime_switch.py — apply() now calls migrate() in the codex_app_server enable branch. Migration failure logs a warning in the result message but does NOT fail the runtime change. Disable path (auto) explicitly skips migration. tests/hermes_cli/test_codex_runtime_switch.py — 3 new tests: test_enable_triggers_mcp_migration, test_disable_does_not_trigger_migration, test_migration_failure_does_not_block_enable. All 325 feature tests green: - tests/agent/transports/: 249 (incl. 67 new) - tests/run_agent/test_codex_app_server_integration.py: 9 - tests/hermes_cli/test_codex_runtime_switch.py: 28 (3 new) - tests/hermes_cli/test_codex_runtime_plugin_migration.py: 39 (new) * perf(codex-runtime): cache codex --version check within apply() Single /codex-runtime invocation could spawn 'codex --version' up to 3 times (state report, enable gate, success message). Each spawn is ~50ms, so the cumulative cost wasn't a crisis, but it was wasteful and turned a trivial slash command into something noticeably laggy on slower systems. Refactored to lazy-once via a closure over a nonlocal cache. First call spawns; subsequent calls in the same apply() reuse the result. Behavior unchanged — same return shape, same error handling, same install hint when codex is missing. Just one subprocess per call instead of three. Two regression-guard tests added: - test_binary_check_cached_within_apply: enable path → call_count == 1 - test_binary_check_cached_on_read_only_call: state-report path → call_count == 1 Total tests for /codex-runtime now 30 (was 28); all 143 codex-runtime tests still green. * fix(codex-runtime): correct protocol field names found via live e2e test Three real bugs caught only by running a turn end-to-end against codex 0.130.0 with a real ChatGPT subscription. Unit tests passed because they asserted on our own (incorrect) wire shapes; the wire format from codex-rs/app-server-protocol/src/protocol/v2/* is the source of truth and my initial reading of the README was incomplete. Bug 1: thread/start.permissions wire format Was sending {"profileId": "workspace-write"}. Real format per PermissionProfileSelectionParams enum (tagged union): {"type": "profile", "id": "workspace-write"} AND requires the experimentalApi capability declared during initialize. AND requires a matching [permissions] table in ~/.codex/config.toml or codex fails the request with 'default_permissions requires a [permissions] table'. Fix: stop overriding permissions on thread/start. Codex picks its default profile (read-only unless user configures otherwise), which matches what codex CLI users expect — they configure their default permission profile in ~/.codex/config.toml the standard way. Trying to be clever about profile selection broke every turn we tested. Live error before fix: 'Invalid request: missing field type' on every turn/start, even though our turn/start payload was correct — the field codex was complaining about was inside the permissions sub-object we shouldn't have been sending. Bug 2: server-request method names Was matching 'execCommandApproval' and 'applyPatchApproval'. Real names per common.rs ServerRequest enum: item/commandExecution/requestApproval item/fileChange/requestApproval item/permissions/requestApproval (new third method) Fix: match the documented names. Added handler for item/permissions/requestApproval that always declines — codex sometimes asks to escalate permissions mid-turn and silent acceptance would surprise users. Live symptom before fix: agent.log showed 'Unknown codex server request: item/commandExecution/requestApproval' and codex stalled because we replied with -32601 (unsupported method) instead of an approval decision. The agent reported back 'The write command was rejected' even though Hermes never showed the user an approval prompt. Bug 3: approval decision values Was sending decision strings 'approved'/'approvedForSession'/'denied'. Real values per CommandExecutionApprovalDecision enum (camelCase): accept, acceptForSession, decline, cancel (also AcceptWithExecpolicyAmendment and ApplyNetworkPolicyAmendment variants we don't currently use). Fix: rename _approval_choice_to_codex_decision return values; update auto_approve_* fallbacks; update fail-closed default from 'denied' to 'decline'. Test mapping table updated to match. Live test verified after fixes: $ hermes (with model.openai_runtime: codex_app_server) > Run the shell command: echo hermes-codex-livetest > .../proof.txt then read it back Approval prompt fired with 'Codex requests exec in <cwd>'. User chose 'Allow once'. Codex executed the command, wrote the file, read it back. Final response: 'Read back from proof.txt: hermes-codex-livetest'. File contents on disk match. agent.log confirms: codex app-server thread started: id=019e200e profile=workspace-write cwd=/tmp/hermes-codex-livetest/workspace All 20 session tests still green after wire-format updates. * fix(codex-runtime): correct apply_patch approval params + ship docs Live e2e revealed FileChangeRequestApprovalParams doesn't carry the changeset (just itemId, threadId, turnId, reason, grantRoot) — Codex's 'reason' field describes what the patch wants to do. Test config and display logic updated to use it. The first 'apply_patch (0 change(s))' display from the live test is now 'apply_patch: <reason>'. Adds website/docs/user-guide/features/codex-app-server-runtime.md covering enable/disable, prerequisites, approval UX, MCP migration behavior, permission profile delegation to ~/.codex/config.toml, known limitations, and the architecture diagram. Wired into the Automation category in sidebars.ts. Live e2e validation across the path matrix: ✓ thread/start handshake ✓ turn/start with text input ✓ commandExecution items + projection ✓ item/commandExecution/requestApproval → Hermes UI → response ✓ Approve once → command runs ✓ Deny → command rejected, codex falls back to read-only message ✓ Multi-turn (codex remembers prior turn's results) ✓ apply_patch via Codex's fileChange path ✓ item/fileChange/requestApproval → Hermes UI ✓ MCP server migration loads inside spawned codex (verified via 'use the filesystem MCP tool' prompt) ✓ /codex-runtime auto → codex_app_server toggle cycle ✓ Disable doesn't trigger migration ✓ Enable with codex CLI present succeeds + migrates ✓ Hermes-side interrupt path (turn/interrupt request issued cleanly even if codex finishes before the interrupt lands) Known live-validated limitations now documented in the docs page: - delegate_task subagents unavailable on this runtime - permission profile selection delegated to ~/.codex/config.toml - apply_patch approval prompt has no inline changeset (codex protocol doesn't expose it) 145/145 codex-runtime tests still green. * feat(codex-runtime): native plugin migration + UX polish (quirks 2/4/5/10/11) Major: migrate native Codex plugins (#7 in OpenClaw's PR list) Discovers installed curated plugins via codex's plugin/list RPC and writes [plugins."<name>@<marketplace>"] entries to ~/.codex/config.toml so they're enabled in the spawned Codex sessions. This is the 'YouTube-video-worthy' bit Pash highlighted: when a user has google-calendar, github, etc. installed in their Codex CLI, those plugins activate automatically when they enable Hermes' codex runtime. Implementation: - hermes_cli/codex_runtime_plugin_migration.py: new _query_codex_plugins() helper spawns 'codex app-server' briefly and walks plugin/list. Returns (plugins, error) — failures are non-fatal so MCP migration still works. - render_codex_toml_section() now takes plugins + permissions args. - migrate() defaults: discover_plugins=True, default_permission_profile= 'workspace-write'. Explicit None on either disables that side. - _strip_existing_managed_block() now also strips [plugins.] and [permissions]/[permissions.] sections inside the managed block, so re-runs replace plugins cleanly without touching codex's own config. Quirk fixes: #2 Default permissions profile written on enable. Without this, Codex's read-only default kicks in and EVERY write triggers an approval prompt. Now writes [permissions] default = 'workspace-write' so the runtime feels normal out of the box. Set default_permission_profile=None to opt out. #4 apply_patch approval prompt now shows what's changing. Codex's FileChangeRequestApprovalParams doesn't carry the changeset. Session adapter now caches the fileChange item from item/started notifications and looks it up by itemId when codex requests approval. Prompt shows '1 add, 1 update: /tmp/new.py, /tmp/old.py' instead of 'apply_patch (0 change(s))'. Side benefit: also drains pending notifications BEFORE handling a server request, so the projector and per-turn caches are up to date when the approval decision fires. Bounded to 8 notifications per loop iter to avoid starving codex's response. #5/#10 Exec approval prompt never shows empty cwd. When codex omits cwd in CommandExecutionRequestApprovalParams, fall back to the session's cwd. If somehow neither is available, show '<unknown>' explicitly instead of an empty string. Also surfaces 'reason' from the approval params when codex provides it — gives users more context on why codex wants to run something. #11 Banner indicates the codex_app_server runtime when active. New 'Runtime: codex app-server (terminal/file ops/MCP run inside codex)' line appears in the welcome banner only when the runtime is on. Default banner is unchanged. Tests: - 7 new tests in test_codex_runtime_plugin_migration.py covering plugin discovery (mocked), failure handling, dry-run skip, opt-out flag, idempotent re-runs, and permissions writing. - 3 new tests in test_codex_app_server_session.py covering the enriched approval prompts: cwd fallback, change summary on apply_patch, fallback when no item/started cache exists. - All 26 session tests + 46 migration tests green; 153 total in PR. * feat(codex-runtime): hermes-tools MCP callback + native plugin migration The big architectural addition: when codex_app_server runtime is on, Hermes registers its own tool surface as an MCP server in ~/.codex/config.toml so the codex subprocess can call back into Hermes for tools codex doesn't ship with — web_search, browser_, vision, image_generate, skills, TTS. Also: 'migrate native codex plugins' (Pash's YouTube-video-worthy bit) — when the user has plugins like Linear, GitHub, Gmail, Calendar, Canva installed via 'codex plugin', Hermes discovers them via plugin/list and writes [plugins.<name>@openai-curated] entries so they activate automatically. New module: agent/transports/hermes_tools_mcp_server.py FastMCP stdio server exposing 17 Hermes tools. Each call dispatches through model_tools.handle_function_call() — same code path as the Hermes default runtime. Run with: python -m agent.transports.hermes_tools_mcp_server [--verbose] Exposed: web_search, web_extract, browser_navigate / _click / _type / _press / _snapshot / _scroll / _back / _get_images / _console / _vision, vision_analyze, image_generate, skill_view, skills_list, text_to_speech. NOT exposed (deliberately): - terminal/shell/read_file/write_file/patch — codex has built-ins - delegate_task/memory/session_search/todo — _AGENT_LOOP_TOOLS in model_tools.py:493, require running AIAgent context. Documented as a limitation and surfaced in the slash command output. Migration changes (hermes_cli/codex_runtime_plugin_migration.py): - _query_codex_plugins() spawns 'codex app-server' briefly to walk plugin/list and pull installed openai-curated plugins. Failures are non-fatal — MCP migration still completes. - render_codex_toml_section() now takes plugins + permissions args AND wraps the managed block with a MIGRATION_END_MARKER comment so the stripper can reliably find both ends, even when the block contains top-level keys (default_permissions = ...). - migrate() defaults: discover_plugins=True, expose_hermes_tools=True, default_permission_profile=':workspace' (built-in codex profile name — must be prefixed with ':'). All three opt-out via explicit args. - _build_hermes_tools_mcp_entry() builds the codex stdio entry with HERMES_HOME and PYTHONPATH passthrough so a worktree-launched Hermes points the MCP subprocess at the same module layout. Live-caught wire bugs fixed during this turn: 1. Permission profile config key is top-level , NOT a [permissions] table. The [permissions] table is for user-defined* profiles with structured fields. Built-in profile names start with ':' (':workspace', ':read-only', ':danger-no-sandbox'). Was emitting which codex rejected with 'invalid type: string "X", expected struct PermissionProfileToml'. 2. Built-in profile is , NOT . Codex rejected with 'unknown built-in profile'. 3. Codex's MCP layer sends for tool-call confirmation. We weren't handling it, so codex stalled and returned 'MCP tool call was rejected'. Now: auto-accept for our own hermes-tools server (user already opted in by enabling the runtime), decline for third-party servers. Quirk fixes shipped (from the limitations list): #2 default permissions: workspace profile written on enable. No more approval prompt on every write. #4 apply_patch approval shows what's changing: cache fileChange items from item/started, look up by itemId when codex sends item/fileChange/requestApproval. Prompt: '1 add, 1 update: /tmp/new.py, /tmp/old.py' instead of '0 change(s)'. #5/#10 exec approval cwd never empty: fall back to session cwd, then '<unknown>'. Also surfaces 'reason' from codex when present. #11 banner shows 'Runtime: codex app-server' line when active so users understand why tool counts may not match what's reachable. Tests: - 5 new tests in test_codex_runtime_plugin_migration.py covering plugin discovery, expose_hermes_tools entry generation, idempotent re-runs, opt-out flag, permissions profile. - 3 new tests in test_codex_app_server_session.py covering enriched approval prompts (cwd fallback, fileChange summary). - 2 new tests for mcpServer/elicitation/request handling (accept hermes-tools, decline others). - New test file test_hermes_tools_mcp_server.py covering module surface, EXPOSED_TOOLS safety invariants (no shell/file_ops, no agent-loop tools), and main() error paths. - 166 codex-runtime tests total, all green. Live e2e validated against codex 0.130.0 + ChatGPT subscription: ✓ /codex-runtime codex_app_server enables, migrates filesystem MCP, registers hermes-tools, writes default_permissions = ':workspace' ✓ Banner shows 'Runtime: codex app-server' line in subsequent sessions ✓ Shell command runs without approval prompt (workspace profile works) ✓ Multi-turn — codex remembers prior turn's results ✓ apply_patch path via fileChange request approval ✓ web_search via hermes-tools MCP callback returns real Firecrawl results: 'OpenAI Codex CLI – Getting Started' end-to-end in 13s ✓ Disable cycle clean Docs updated: website/docs/user-guide/features/codex-app-server-runtime.md Full re-write covering native plugin migration, the hermes-tools callback architecture, the prerequisites change ('codex login is separate from hermes auth login codex'), the trade-off table now reflecting which Hermes tools work via callback, and the limitations list updated with what's actually unavailable on this runtime. * feat(codex-runtime): pin user-config preservation invariant for quirk #6 Quirk #6 from the limitations list — user MCP servers / overrides / codex-only sections in ~/.codex/config.toml that live OUTSIDE the hermes-managed block must survive re-migration verbatim. This already worked thanks to the MIGRATION_MARKER + MIGRATION_END_MARKER pair I added when fixing the default_permissions wire format (so the strip can find both ends of the managed region even with top-level keys like default_permissions). But it was an emergent property without a test pinning it. Now explicitly tested: - User MCP server above the managed block survives migration - User MCP server below the managed block survives migration - Both above + below survive a second re-migration - User content (model, providers, sandbox, otel, etc.) outside our region is left untouched Docs added a section "Editing ~/.codex/config.toml safely" explaining the marker contract — so users know they can add their own MCP servers, override permissions, configure codex-only options, etc. without fear of Hermes overwriting their work. 167 codex-runtime tests, all green. * docs(codex-runtime): clarify the actual tool surface — shell covers terminal/read/write/find Previous docs and PR description undersold what codex's built-in toolset actually provides. apply_patch alone made it sound like the runtime could only edit files in patch format — implying you'd lose terminal use, read_file, write_file, search/find. That was wrong. Codex's 'shell' tool runs arbitrary shell commands inside the sandbox, which covers everything you'd do in bash: cat/head/tail (read), echo> or heredocs (write), find/rg/grep (search), ls/cd (navigate), build/ test/git/etc. apply_patch is for structured multi-file edits on top of that. update_plan is its in-runtime todo. view_image loads images. And codex has its own web_search built in (in addition to the Firecrawl-backed one Hermes exposes via MCP callback). Docs now have a 'What tools the model actually has' section right after Why, breaking the surface into three clearly-labeled buckets: 1. Codex's built-in toolset (always on) — shell, apply_patch, update_plan, view_image, web_search; covers everything terminal- adjacent. 2. Native Codex plugins (auto-migrated from your codex plugin install) — Linear, GitHub, Gmail, Calendar, Outlook, Canva, etc. 3. Hermes tool callback (MCP server in ~/.codex/config.toml) — web_search/web_extract via Firecrawl, browser_, vision_analyze, image_generate, skill_view/skills_list, text_to_speech. Plus a 'What's NOT available' callout listing the four agent-loop tools (delegate_task, memory, session_search, todo) that need running AIAgent context and can't reach the codex runtime. Trade-offs table broken out: shell, apply_patch, update_plan, view_image, sandbox each get their own row with a one-line description so users can see at a glance what's available natively. Architecture diagram updated to list the codex built-ins by name instead of 'apply_patch + shell + sandbox'. No code changes — purely docs clarification. 167 codex-runtime tests still green. fix(codex-runtime): _spawn_background_review signature + review fork api_mode downgrade Two real bugs in the self-improvement loop integration that the previous test mocked away. Bug 1: wrong call signature The codex helper was calling self._spawn_background_review() with no args after every turn. That function actually requires: messages_snapshot=list (positional or keyword) review_memory=bool (at least one trigger must be True) review_skills=bool So the call would have raised TypeError at runtime — except the only test that exercised this path mocked _spawn_background_review entirely and just asserted spawn.called, so the wrong-arg shape never surfaced. Bug 2: review fork inherits codex_app_server api_mode The review fork is constructed with: api_mode = _parent_runtime.get('api_mode') So when the parent is codex_app_server, the review fork ALSO runs as codex_app_server. But the review fork's whole job is to call agent-loop tools (memory, skill_manage) which require Hermes' own dispatch — they short-circuit with 'must be handled by the agent loop' on the codex runtime. So the review fork would have run, decided to save something, called memory or skill_manage, and silently no-op'd. Fixed in run_agent.py:_spawn_background_review() — when the parent api_mode is 'codex_app_server', the review fork is downgraded to 'codex_responses' (same OAuth credentials, same openai-codex provider, but talks to OpenAI's Responses API directly so Hermes owns the loop). Also rewrote the codex helper's review wiring to match the chat_completions path: - Computes _should_review_memory in the pre-loop block (was already being computed; now passed through to the helper as an arg). - Computes _should_review_skills AFTER the codex turn returns + counters tick (line ~15432 pattern in chat_completions). - Calls _spawn_background_review(messages_snapshot=, review_memory=, review_skills=) only when at least one trigger fires. - Adds the external memory provider sync (_sync_external_memory_for_turn) that the chat_completions path runs after every turn. Tests: Replaced the broken test_background_review_invoked (which only asserted spawn.called) with three sharper tests: - test_background_review_NOT_invoked_below_threshold: single turn at default thresholds → no review fires (would have caught the original 'every turn calls spawn with no args' bug) - test_background_review_skill_trigger_fires_above_threshold: 10 tool_iterations at threshold=10 → review fires with messages_snapshot=list, review_skills=True, counter resets - test_background_review_signature_never_breaks: regression guard asserting positional args are always empty and kwargs include messages_snapshot New TestReviewForkApiModeDowngrade class: - test_codex_app_server_parent_downgrades_review_fork: drives the real _spawn_background_review function (no mock at that level), asserts the review_agent gets api_mode='codex_responses' when the parent was codex_app_server. Live-validated against real run_conversation: - Counter ticked from 0 to 5 after a 5-tool-iteration turn - _spawn_background_review fired exactly once with kwargs-only signature - review_skills=True, review_memory=False - messages_snapshot was 12 entries (5 assistant tool_calls + 5 tool results + 1 final assistant + initial system/user) - Counter reset to 0 after fire 170 codex-runtime tests, all green. Docs: added a Self-improvement loop section to the codex runtime page explaining both how the trigger logic stays equivalent and that the review fork is auto-downgraded to codex_responses for the agent-loop tools. Also clarified that apply_patch and update_plan ARE codex's built-in tools (the previous version made it sound like they were separate from 'codex's stuff' — they're not, all five tools listed in 'What tools the model actually has' section 1 are codex built-ins). * feat(codex-runtime): expose kanban tools through Hermes MCP callback Kanban workers spawn as separate hermes chat -q subprocesses that read the user's config.yaml. If model.openai_runtime: codex_app_server is set globally (which is the whole point of opt-in), every dispatched worker ALSO comes up on the codex runtime. That mostly works — codex's built-in shell + apply_patch + update_plan do the actual task work fine — but it had one critical break: the worker handoff tools (kanban_complete, kanban_block, kanban_comment, kanban_heartbeat) are Hermes-registered tools, not codex built-ins. On the codex runtime, codex builds its own tool list and these never reach the model, so the worker would do the work but not be able to report back, hanging until the dispatcher's timeout escalates it as zombie. Fix: add all 9 kanban tools to the EXPOSED_TOOLS list in the Hermes MCP callback. They dispatch statelessly through handle_function_call() just like web_search and the others — they read HERMES_KANBAN_TASK from env (set by the dispatcher), gate correctly (worker tools require the env var, orchestrator tools require it unset), and write to ~/.hermes/kanban.db. Why kanban tools work via stateless dispatch when delegate_task/memory/ session_search/todo don't: those four are listed in _AGENT_LOOP_TOOLS (model_tools.py:493) and short-circuit in handle_function_call() with 'must be handled by the agent loop' — they need to mutate AIAgent's mid-loop state. Kanban tools have no such requirement; they're pure side-effect functions against the kanban.db plus state_meta. Tools exposed: Worker handoff (require HERMES_KANBAN_TASK): kanban_complete, kanban_block, kanban_comment, kanban_heartbeat Read-only board queries: kanban_show, kanban_list Orchestrator (require HERMES_KANBAN_TASK unset): kanban_create, kanban_unblock, kanban_link Tests: - test_kanban_worker_tools_exposed: complete/block/comment/heartbeat in EXPOSED_TOOLS (regression guard for the would-hang-worker bug) - test_kanban_orchestrator_tools_exposed: create/show/list/unblock/link Docs: - New 'Workflow features' section in the docs page covering /goal, kanban, and cron behavior on this runtime - /goal: works fully via run_conversation feedback; only caveat is approval-prompt noise on long writes-heavy goals (mitigated by the default :workspace permission profile) - Kanban: enumerated which tools are reachable via the callback and why the env var propagates correctly through the codex subprocess to the MCP server subprocess - Cron: documented as 'not specifically tested' — same rules as the CLI apply since cron runs through AIAgent.run_conversation - Trade-offs table gained rows for /goal, kanban worker, kanban orchestrator 172/172 codex-runtime tests green (+2 from kanban tests). * docs(codex-runtime): wire /codex-runtime into slash-commands ref + flag aux token cost Three docs gaps caught during a final audit: 1. /codex-runtime was only in the feature docs page, not in the slash-commands reference. Added rows to both the CLI section and the Messaging section so users discover it where they'd look for slash command syntax. 2. CODEX_HOME and HERMES_KANBAN_TASK weren't in environment-variables.md. CODEX_HOME lets users redirect Codex CLI's config dir (the migration honors it). HERMES_KANBAN_TASK is set by the kanban dispatcher and propagates to the codex subprocess + the hermes-tools MCP subprocess so kanban worker tools gate correctly — documented as 'don't set manually' since it's an internal handoff. 3. Aux client behavior on this runtime. When openai_runtime= codex_app_server is on with the openai-codex provider, every aux task (title generation, context compression, vision auto-detect, session search summarization, the background self-improvement review fork) flows through the user's ChatGPT subscription by default. This is true for the existing codex_responses path too, but it's more visible / important here because users explicitly opted in for subscription billing. Added a 'Auxiliary tasks and ChatGPT subscription token cost' section to the docs page with a YAML example showing how to override specific aux tasks to a cheaper model (typically google/gemini-3-flash-preview via OpenRouter). Also documents how the self-improvement review fork gets auto-downgraded from codex_app_server to codex_responses by the fix earlier in this PR. No code changes — pure docs. 172 codex-runtime tests still green. * docs+test(codex-runtime): pin HOME passthrough, document multi-profile + CODEX_HOME OpenClaw hit a real footgun in openclaw/openclaw#81562: when spawning codex app-server they were synthesizing a per-agent HOME alongside CODEX_HOME. That made every subprocess codex's shell tool launches (gh, git, aws, npm, gcloud, ...) see a fake $HOME and miss the user's real config files. They had to back it out in PR #81562 — keep CODEX_HOME isolation, leave HOME alone. Audit confirms Hermes' codex spawn doesn't have this problem. We do os.environ.copy() and only overlay CODEX_HOME (when provided) and RUST_LOG. HOME passes through unchanged. But it was an emergent property without a test pinning it, so adding a regression guard: test_spawn_env_preserves_HOME — confirms parent HOME survives intact in the subprocess env test_spawn_env_sets_CODEX_HOME_when_provided — confirms codex_home arg still isolates codex state correctly Docs additions: 'HOME environment variable passthrough' section — calls out the contract explicitly: CODEX_HOME isolates codex's own state, HOME stays user-real so gh/git/aws/npm/etc. find their normal config. Cites openclaw#81562 as the cautionary tale. 'Multi-profile / multi-tenant setups' section — addresses the related concern: profiles share ~/.codex/ by default. For users who want per-profile codex isolation (separate auth, separate plugins), documents the manual CODEX_HOME=<profile-scoped-dir> approach. Explains why we DON'T auto-scope CODEX_HOME per profile: doing so would silently invalidate existing codex login state for anyone upgrading to this PR with tokens already at ~/.codex/auth.json. Opt-in is safer than surprising users. 174 codex-runtime tests (+2 from HOME guards), all green. * fix(codex-runtime): TOML control-char escapes + atomic config.toml write Two footguns caught in a final audit pass before merge. Bug 1: TOML control characters not escaped The _format_toml_value() helper escaped backslashes and double quotes but passed literal control characters (\n, \t, \r, \f, \b) through unchanged. TOML basic strings don't allow literal control characters — a path or env var containing a newline would produce invalid TOML that codex refuses to load. Realistic exposure: pathological cases like a HERMES_HOME with a trailing newline (env var concatenation accident), or a PYTHONPATH with a tab from a multi-line shell heredoc. Fix: escape all five TOML basic-string control sequences (\b \t \n \f \r) in addition to \\ and \" that we already did. Order matters — backslash must come first or the other escapes get re-escaped. Bug 2: config.toml write wasn't atomic If the python process crashed between target.mkdir() and the write_text() finishing, a half-written config.toml could be left behind. On NFS / Windows / some FUSE mounts this is a real concern; on ext4/APFS small writes are usually atomic in practice but not guaranteed. Fix: write to a tempfile.mkstemp() temp file in the same directory, then Path.replace() (atomic same-dir rename on POSIX, ReplaceFile on Windows). On rename failure, clean up the temp file so repeated failed migrations don't pile up .config.toml.* files. Tests: - test_string_with_newline_escaped — \n in value → \n in output - test_string_with_tab_escaped — \t in value → \t in output - test_string_with_other_controls_escaped — \r, \f, \b - test_windows_path_escaped_correctly — backslash doubling - test_atomic_write_no_temp_leak_on_success — no .config.toml.* left over after a successful write - test_atomic_write_cleanup_on_rename_failure — temp file removed when Path.replace raises (simulated disk full) 180 codex-runtime tests, all green (+6 from this commit). Footguns audited but NOT fixed (with rationale): - Concurrent migrations race. Two Hermes processes hitting /codex-runtime codex_app_server within seconds of each other could cause one writer to lose entries. Low probability (you'd have to enable from two surfaces simultaneously) and low impact (just re-run migration). Adding fcntl/msvcrt locking is more code than it's worth here. The atomic rename above means each individual write is consistent — only the merge step is racy. - Codex protocol version drift. We pin MIN_CODEX_VERSION=0.125 and check at runtime but don't reject too-new versions. Right call — the protocol has been stable through 0.125 → 0.130. If OpenAI breaks it later we'd see the error in test_codex_app_server_runtime on CI before users hit it.	2026-05-13 17:18:15 -07:00
Teknium	9d42c2c286	feat(video_gen): unified video_generate tool with pluggable provider backends (#25126 ) * feat(video_gen): unified video_generate tool with pluggable provider backends One core video_generate tool, every backend a plugin. Mirrors the image_gen + memory_provider + context_engine architecture: ABC, registry, plugin-context registration hook, and per-plugin model catalogs surfaced through hermes tools. Surface (one schema, every backend): - operation: generate / edit / extend - modalities: text-to-video (prompt only), image-to-video (prompt + image_url), video edit (prompt + video_url), video extend (video_url) - reference_image_urls, duration, aspect_ratio, resolution, negative_prompt, audio, seed, model override - Providers ignore unknown kwargs and declare what they support via VideoGenProvider.capabilities() — backend-specific quirks stay in the backend, the agent learns one tool Backends shipped: - plugins/video_gen/xai/ — Grok-Imagine, full generate/edit/extend + image-to-video + reference images (salvaged from PR #10600 by @Jaaneek, reshaped into the plugin interface) - plugins/video_gen/fal/ — Veo 3.1 (t2v + i2v), Kling O3 i2v, Pixverse v6 i2v with model-aware payload building that drops keys a model doesn't declare Wiring: - agent/video_gen_provider.py — VideoGenProvider ABC, normalize_operation, success_response / error_response, save_b64_video / save_bytes_video, $HERMES_HOME/cache/videos/ - agent/video_gen_registry.py — thread-safe register/get/list + get_active_provider() reading video_gen.provider from config.yaml - hermes_cli/plugins.py — PluginContext.register_video_gen_provider() - hermes_cli/tools_config.py — Video Generation category in hermes tools, plugin-only providers list, model picker per plugin, config write to video_gen.{provider,model} - toolsets.py — new video_gen toolset - tests: 31 new tests covering ABC, registry, tool dispatch, both plugins - docs: developer-guide/video-gen-provider-plugin.md (parallel to the image-gen guide), sidebar + toolsets-reference + plugin guides updated Supersedes: #25035 (FAL), #17972 (FAL), #14543 (xAI), #13847 (HappyHorse), #10458 (provider categories), #10786 (xAI media+search bundle), #2984 (FAL duplicate), #19086 (Google Veo standalone — easy port to plugin interface). Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com> * feat(video_gen): dynamic schema reflects active backend's capabilities Address the 'capability variance' question — instead of one tool with a static schema that lies about what every backend supports, the video_generate tool now rebuilds its description at get_definitions() time based on the configured video_gen.provider and video_gen.model. The agent sees backend-specific guidance up-front: - 'fal-ai/veo3.1/image-to-video': 'image-to-video only — image_url is REQUIRED; text-only prompts will be rejected' - 'fal-ai/veo3.1' (t2v): no image_url restriction shown - xAI grok-imagine-video: 'operations: generate, edit, extend; up to 7 reference_image_urls' - Backends without edit/extend: 'not supported on this backend — surface that they need to switch backends via hermes tools' This is the same pattern PR #22694 used for delegate_task self-capping — documented in the dynamic-tool-schemas skill. Cache invalidation is free: get_tool_definitions() already memoizes on config.yaml mtime, so a mid-session backend swap rebuilds the schema automatically. Tested: - Empirical FAL OpenAPI schema check confirms image-to-video models require image_url (FAL returns HTTP 422 otherwise) — client-side rejection in FALVideoGenProvider.generate() now prevents the wasted round-trip - Live E2E: fal-ai/veo3.1/image-to-video + prompt-only → clean missing_image_url error; fal-ai/veo3.1 + prompt-only → dispatches - 6 new tests cover the builder (no config / image-only / full-surface / text-only / unknown provider / registry wiring), all passing - 37/37 in the slice, 134/134 in the broader regression set * test(video_gen/xai): full surface integration tests + cleaner schema Verified end-to-end that the xAI plugin handles every documented mode from PR #10600's surface: text-to-video, image-to-video, reference-images-to-video, video edit, video extend (with and without prompt). All five modes route to the correct xAI endpoint (/videos/generations, /videos/edits, /videos/extensions) with the right payload shape (image / reference_images / video keys), and all five client-side rejections fire before the network: edit-without-prompt, extend-without-video_url, image+refs conflict, >7 references, and duration/aspect_ratio clamping. 15 new integration tests grouped into four classes (endpoint routing, modalities, validation, clamping). httpx is stubbed via a small fake AsyncClient that records POSTs so the tests assert the actual payload the plugin would send to xAI — not just the success/error envelope. Also cleaned up a description redundancy: when a model's operations match the backend's overall set, we no longer print the duplicate 'operations supported by this model' line. xAI's description now reads: Active backend: xAI . model: grok-imagine-video - operations supported by this backend: edit, extend, generate - modalities supported by this backend: image, reference_images, text - aspect_ratio choices: 16:9, 1:1, 2:3, 3:2, 3:4, 4:3, 9:16 - resolution choices: 480p, 720p - duration range: 1-15s - reference_image_urls: up to 7 images Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com> * feat(video_gen): collapse surface to t2v + i2v, family-based auto-routing Two design changes per Teknium: 1) Drop edit/extend from the tool surface entirely. Only text-to-video and image-to-video remain. The agent sees a clean tool with two modalities; backend-specific quirks like xAI's edit/extend endpoints stay out of the unified schema. 2) FAL: pick a model FAMILY once, the plugin routes between the family's text-to-video and image-to-video endpoints based on whether image_url was passed. Users no longer pick 'fal-ai/veo3.1' AND 'fal-ai/veo3.1/image-to-video' as separate options — they pick 'veo3.1', and the plugin handles the rest. Catalog rewritten as families: veo3.1 fal-ai/veo3.1 / fal-ai/veo3.1/image-to-video pixverse-v6 fal-ai/pixverse/v6/text-to-video / fal-ai/pixverse/v6/image-to-video kling-o3-standard fal-ai/kling-video/o3/standard/text-to-video / fal-ai/kling-video/o3/standard/image-to-video xAI uses a single endpoint (/videos/generations) for both modes, routed by the presence of the 'image' field in the payload — no edit/extend exposure. Schema changes: - VIDEO_GENERATE_SCHEMA: drop operation, drop video_url. Final params: prompt (required), image_url, reference_image_urls, duration, aspect_ratio, resolution, negative_prompt, audio, seed, model. - VideoGenProvider ABC: drop normalize_operation, VALID_OPERATIONS, DEFAULT_OPERATION. capabilities() drops 'operations' key. - success_response: add 'modality' field ('text' \| 'image') so the agent and logs can see which endpoint was actually hit. Dynamic schema builder simplified — no operations bullet, no 'switch backends if you need edit/extend' guidance. When the active backend supports both modalities (the common case), description reads: Active backend: FAL . model: pixverse-v6 - supports both text-to-video (omit image_url) and image-to-video (pass image_url) - routes automatically - aspect_ratio choices: 16:9, 9:16, 1:1 - resolution choices: 360p, 540p, 720p, 1080p - duration range: 1-15s - audio: pass audio=true to enable native audio (pricing tier) - negative_prompt: supported Tests: 51 in the video_gen slice, 216 across the broader image+video sweep, all passing. New FAL routing tests prove pixverse-v6 + no image hits text-to-video endpoint, pixverse-v6 + image_url hits image-to-video endpoint, same for veo3.1 and kling-o3-standard. Docs updated: developer-guide page rewrites the 'model families' pattern as a first-class section so external plugin authors know the convention. toolsets-reference and toolsets.py descriptions match the new surface. Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com> * feat(video_gen/fal): expand catalog to 6 families, cheap + premium tiers Catalog now covers everything Teknium specced from FAL: Cheap tier: ltx-2.3 fal-ai/ltx-2.3-22b/text-to-video / image-to-video pixverse-v6 fal-ai/pixverse/v6/text-to-video / image-to-video Premium tier: veo3.1 fal-ai/veo3.1 / fal-ai/veo3.1/image-to-video seedance-2.0 bytedance/seedance-2.0/text-to-video / image-to-video kling-v3-4k fal-ai/kling-video/v3/4k/text-to-video / image-to-video happy-horse fal-ai/happy-horse/text-to-video / image-to-video DEFAULT_MODEL moved from veo3.1 (premium) to pixverse-v6 (cheap, sane defaults, both modalities) — better first-run UX for users who haven't explicitly picked a model. New family-entry knob: image_param_key. Kling v3 4K's image-to-video endpoint expects start_image_url instead of image_url; declaring image_param_key='start_image_url' on the family lets _build_payload remap correctly. Other families default to plain image_url. Per-family capability flags reflect each model's docs: - LTX 2.3 + Happy Horse: minimal payloads (no duration/aspect/resolution enum exposed by FAL — let endpoint apply defaults) - Seedance: 6 aspect ratios incl 21:9, durations 4-15, audio supported, negative prompts NOT supported per docs - Kling v3 4K: 16:9/9:16/1:1, 3-15s, audio + negative - Veo 3.1: unchanged, 16:9/9:16, 4/6/8s Tests: +5 covering the new families (full catalog, Kling 4K start_image_url remap, Seedance routing, LTX payload minimality, Happy Horse minimality). 56/56 in the slice green. Note: I did NOT add the FAL-hosted xAI Grok-Imagine variant. Hermes already has a direct xAI plugin that talks to xAI's own API; routing the same model through FAL's wrapper would duplicate the surface without adding capabilities. Users on FAL who want Grok-Imagine should use the xAI plugin directly; flag if you want both routes available. * test(video_gen): tool-surface routing matrix — every model x modality End-to-end matrix test driven through _handle_video_generate() — the actual function the agent's video_generate tool call lands in. Writes config.yaml, invokes the registered handler with a raw args dict, then asserts the outbound HTTP/SDK call hit the right endpoint with the right payload shape. Parametrized over FAL_FAMILIES.keys() so the matrix auto-discovers new families as they're added (add a family to FAL_FAMILIES and you get both modalities tested for free). Coverage: - All 6 FAL families x {text-only, text+image} = 12 cases - xAI x {text-only, text+image} = 2 cases - tool-level model= arg overrides config = 2 cases For each case, verifies: - result['success'] is True - result['modality'] matches input shape ('text' if no image_url, 'image' otherwise) - outbound endpoint URL matches the family's text_endpoint or image_endpoint - text-only payloads carry no image-shaped keys - text+image payloads carry the family's image key (image_url for most, start_image_url for kling-v3-4k, wrapped 'image' object for xAI) All 16 cases passing. Confirms the tool surface routes every (provider, model, modality) combination correctly with zero leakage. * feat(video_gen): keep video_gen out of first-run setup, surface in status Two changes: 1. video_gen joins _DEFAULT_OFF_TOOLSETS, so it is NOT pre-selected in the first-run toolset checklist. Video gen is niche, paid, and slow — most users don't want it nagging them during initial setup. Anyone who wants it opts in via 'hermes tools' -> Video Generation, which already routes to the provider+model picker. 2. The 'hermes setup' status panel learns about video_gen — but only shows the row when a plugin reports available. Users without FAL_KEY/XAI_API_KEY see nothing about video gen; users with one of those keys see 'Video Generation (FAL) ✓' as confirmation it's wired. Verified live: - Fresh install (no creds): zero video_gen mentions in wizard. - With FAL_KEY: status row appears with active backend name. - 160/160 in the setup + tools_config + video_gen test slice. Rationale: image_gen is on by default because it's a featured creative tool used in casual chat (telegrams, etc). Video gen is heavier — long wait, paid per-second pricing. Default-off matches user intent better. --------- Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>	2026-05-13 16:39:41 -07:00
GodsBoy	da0ddbf88a	fix: classify landed file mutations with diagnostics	2026-05-13 06:46:23 -07:00
Teknium	486b692ddd	feat(nous): unified client=hermes-client-v<version> tag on every Portal request (#24779 ) * feat(nous): unified client=hermes-client-v<version> tag on every Portal request Every Hermes request to Nous Portal now carries the same client=hermes-client-v<__version__> tag (e.g. client=hermes-client-v0.13.0 on this release), sourced live from hermes_cli.__version__. The release script's regex bump auto-aligns it on every release. Centralized in agent/portal_tags.py and wired into all four call sites: - NousProfile.build_extra_body (main agent loop, every chat completion) - auxiliary_client.NOUS_EXTRA_BODY + _build_call_kwargs (aux client) - run_agent.py compression-summary fallback path - tools/web_tools.py web_extract fallback Replaces the client=aux marker added in #24194 with the unified version tag. Tests assert against the helper output (invariant) rather than the literal string, so they don't need updating on every release. * feat(nous): cover /goal judge and kanban specify aux paths Two aux-using surfaces bypassed call_llm by invoking client.chat.completions.create() directly without extra_body, so they were missing the unified Portal client tag: - hermes_cli/goals.py — /goal standing-goal judge - hermes_cli/kanban_specify.py — kanban triage specifier Both now pass extra_body=get_auxiliary_extra_body() or None so they inherit the version tag when the aux client points at Nous Portal, and emit nothing otherwise (no tag leak to OpenRouter/Anthropic auxes).	2026-05-12 20:49:20 -07:00
Teknium	b06e999302	fix(cache): kill long-lived prefix layout — system prompt is now byte-static within a session (#24778 ) The long-lived prefix-cache layout split the system prompt into stable/ context/volatile blocks and re-derived them on every API call. The volatile tier (timestamp + memory snapshot + USER profile) ticks per turn, so the system message bytes mutated mid-conversation and broke upstream prompt caches (OpenRouter, Nous Portal, Anthropic). Diagnosed via live wire-format diffing: an 8-turn conversation showed OLD layout flipping system block[1] sha mid-session at the minute boundary, dropping cached_tokens to 0 on that turn (cumulative 66.6% vs 83.3% for the single-block layout). Hermes invariant: history (system + all but the last 1-2 messages) must be static. Fix: drop the long-lived layout entirely. Single layout everywhere — system_and_3 with one cached system string built once on first turn, replayed verbatim on every subsequent turn. Loses cross-session 1h prefix caching for Claude (the feature that motivated the split), but within-session caching now actually works on every provider. Removed: - run_agent.py: _use_long_lived_prefix_cache flag, _long_lived_cache_ttl, _supports_long_lived_anthropic_cache method, the long-lived branch in run_conversation, mark_tools_for_long_lived_cache call site - agent/prompt_caching.py: apply_anthropic_cache_control_long_lived, mark_tools_for_long_lived_cache, _mark_system_stable_block helper - hermes_cli/config.py: prompt_caching.long_lived_prefix and prompt_caching.long_lived_ttl config keys - tests/agent/test_prompt_caching_live.py (entire file) - tests/agent/test_prompt_caching.py: TestMarkToolsForLongLivedCache, TestApplyAnthropicCacheControlLongLived - tests/run_agent/test_anthropic_prompt_cache_policy.py: TestSupportsLongLivedAnthropicCache Targeted tests: 62/62 pass.	2026-05-12 20:46:04 -07:00
ALIYILD	afa5b81918	fix(prompt_builder): inject tool-use enforcement for GLM models GLM-family models (z-ai/glm-4.5-air, z-ai/glm-4.5-flash, etc.) exhibit the same "describe-instead-of-call" failure mode that gpt/codex/gemini/ gemma/grok already trigger enforcement for. Without the injection, free-tier GLM workers spawned by the kanban dispatcher routinely exit cleanly (rc=0) without invoking kanban_complete or kanban_block, producing the "protocol violation" error and triggering the dispatcher's gave_up path. Observed in real workloads: seven consecutive kanban tasks across three GLM-tier profiles (shipbackend, frontend-engineer, backend-engineer) all failed with the identical message: worker exited cleanly (rc=0) without calling kanban_complete or kanban_block — protocol violation Re-running the same tasks on Claude Haiku immediately resolved them. Adding "glm" to TOOL_USE_ENFORCEMENT_MODELS closes the gap so future GLM-routed work receives the explicit "every response must contain a tool call or final result" steering that already protects the other enforcement-gated model families. One-line change; no behavior change for non-GLM models.	2026-05-12 18:46:28 -07:00
Teknium	29c9ff9ba5	fix(lsp): typescript SDK install + tsc-missing skip + shellcheck warning (#24630 ) Three follow-ups to PR #24168 found during live E2E testing on TS/bash files: 1. typescript-language-server now installs the typescript SDK (tsserver) alongside it. Without that sibling install, initialize() failed with "Could not find a valid TypeScript installation" and the server was marked broken — no diagnostics ever reached the agent. New extra_pkgs field on INSTALL_RECIPES makes that explicit and reusable for future peer-dep cases. 2. _check_lint now treats "linter command exists on PATH but cannot actually run" as skipped instead of error. The motivating case is npx tsc when typescript is not in node_modules — npx prints its "This is not the tsc command you are looking for" banner and exits non-zero, which previously blocked the LSP semantic tier (gated on success or skipped). Pattern-matched per base command (npx, rustfmt, go) so genuine lint errors still flow through normally. 3. hermes lsp status now surfaces a Backend warnings section when bash-language-server is installed but shellcheck is missing. The server itself spawns fine but bash-language-server delegates diagnostics to shellcheck — without it on PATH the integration looks alive but never reports any problems. Same warning is logged once at server spawn time. Validation: - 12 new tests in tests/agent/lsp/test_install_and_lint_fixes.py: * recipe carries typescript SDK * _install_npm passes both pkg + extras to npm CLI * backwards compat: recipes without extras still work * _backend_warnings quiet when bash absent / both present * _backend_warnings fires when bash installed without shellcheck * status output includes the Backend warnings section * _looks_like_linter_unusable catches the npx tsc banner * real TS type errors not misclassified as unusable * unfamiliar linters fall through normally * _check_lint returns skipped on npx tsc unusable * _check_lint returns error on real tsc type errors - Full lsp + file_operations test suite: 245/245 pass - Live E2E: * try_install("typescript-language-server") installs both packages into node_modules * write_file(bad.ts, ...) returns lint=skipped + lsp_diagnostics with two real TS errors (was lint=error, no lsp_diagnostics) * hermes lsp status renders the shellcheck warning when bash is installed but shellcheck is not on PATH	2026-05-12 17:02:35 -07:00
hookinglau	d68a0ec383	fix(auxiliary): pass cfg_base_url and cfg_api_key when resolving task provider _resolve_task_provider_model drops cfg_base_url and cfg_api_key when returning a named provider, causing configured API keys and base URLs to be lost. Pass them through so named providers can use custom endpoints while still resolving credentials from provider-specific env vars. Closes #20139	2026-05-12 16:36:20 -07:00
zccyman	88ede807c4	fix(pricing): add deepseek-v4-pro to official docs pricing table deepseek-v4-pro has been routable since v0.12 but was missing from the _OFFICIAL_DOCS_PRICING table. Sessions using this model showed as "unknown cost" in hermes insights instead of a dollar estimate. Add pricing entry using published list prices: - input: \$1.74/M tokens - output: \$3.48/M tokens - cache_read: \$0.0145/M tokens Uses standard list rates (not the 75% promo) so estimates remain accurate after promo expires 2026-05-31. Closes #24218	2026-05-12 16:32:57 -07:00
Teknium	83b93898c2	feat(lsp): semantic diagnostics from real language servers in write_file/patch (#24168 ) * feat(lsp): semantic diagnostics from real language servers in write_file/patch Wire ~26 language servers (pyright, gopls, rust-analyzer, typescript-language-server, clangd, bash-language-server, ...) into the post-write lint check used by write_file and patch. The model now sees type errors, undefined names, missing imports, and project-wide semantic issues introduced by its edits, not just syntax errors. LSP is gated on git workspace detection: when the agent's cwd or the file being edited is inside a git worktree, LSP runs against that workspace; otherwise the existing in-process syntax checks are the only tier. This keeps users on user-home cwds (Telegram/Discord gateway chats) from spawning daemons. The post-write check is layered: in-process syntax check first (microseconds), then LSP semantic diagnostics second when syntax is clean. Diagnostics are delta-filtered against a baseline captured at write start, so the agent only sees errors its edit introduced. A flaky/missing language server can never break a write -- every LSP failure path falls back silently to the syntax-only result. New module agent/lsp/ split into: - protocol.py: Content-Length JSON-RPC framer + envelope helpers - client.py: async LSPClient (spawn, initialize, didOpen/didChange, ContentModified retry, push/pull diagnostic stores) - workspace.py: git worktree walk-up + per-server NearestRoot resolver - servers.py: registry of 26 language servers (extension match, root resolver, spawn builder per language) - install.py: auto-install dispatch (npm install --prefix, go install with GOBIN, pip install --target) into HERMES_HOME/lsp/bin/ - manager.py: LSPService (per-(server_id, root) client registry, lazy spawn, broken-set, in-flight dedupe, sync facade for tools layer) - reporter.py: <diagnostics> block formatter (severity-1-only, 20-per-file) - cli.py: hermes lsp {status,list,install,install-all,restart,which} Wired into tools/file_operations.py: - write_file/patch_replace now call _snapshot_lsp_baseline before write - _check_lint_delta gains a third tier: LSP semantic diagnostics when syntax is clean - All LSP code paths swallow exceptions; write_file's contract unchanged Config: 'lsp' section in DEFAULT_CONFIG with enabled (default true), wait_mode, wait_timeout, install_strategy (default 'auto'), and per-server overrides (disabled, command, env, initialization_options). Tests: tests/agent/lsp/ -- 49 tests covering protocol framing (encode and read_message round-trip, EOF/truncation/missing Content-Length), workspace gate (git walk-up, exclude markers, fallback to file location), reporter (severity filter, max-per-file cap, truncation), service-level delta filter, and an in-process mock LSP server that exercises the full client lifecycle including didChange version bumps, dedup, crash recovery, and idempotent teardown. Live E2E verified end-to-end through ShellFileOperations: pyright auto-installed via npm into HERMES_HOME, baseline captured, type error introduced, single delta diagnostic surfaced with correct line/column/code/ source, then patch fix removes the diagnostic from the output. Docs: new website/docs/user-guide/features/lsp.md page covering supported languages, configuration knobs, performance characteristics, and troubleshooting; cli-commands.md updated with the 'hermes lsp' reference; sidebar updated. * feat(lsp): structured logging, backend gate, defensive walk caps Cherry-picks the substantive ideas from #24155 (different scope, same problem space) onto our PR. agent/lsp/eventlog.py (new): dedicated structured logger ``hermes.lint.lsp`` with steady-state silence. Module-level dedup sets keep a 1000-write session at exactly ONE INFO line ("active for <root>") at the default INFO threshold; clean writes log at DEBUG so they never reach agent.log under normal config. State transitions (server starts, no project root for a file, server unavailable) fire at INFO/WARNING once per (server_id, key); novel events (timeouts, unexpected errors) fire WARNING per call. Grep recipe: ``rg 'lsp\\['``. agent/lsp/manager.py: wire the eventlog into _get_or_spawn and get_diagnostics_sync so users can answer "did LSP fire on this edit?" with a single grep, plus surface "binary not on PATH" warnings once instead of silently retrying every write. tools/file_operations.py: backend-type gate. ``_lsp_local_only()`` returns False for non-local backends (Docker / Modal / SSH / Daytona); ``_snapshot_lsp_baseline`` and ``_maybe_lsp_diagnostics`` now skip entirely on remote envs. The host-side language server can't see files inside a sandbox, so this prevents pretending to lint a file the host process can't open. agent/lsp/protocol.py: 8 KiB cap on the header block in ``read_message``. A pathological server that streams headers without ever emitting CRLF-CRLF would have looped forever consuming bytes; now raises ``LSPProtocolError`` instead. agent/lsp/workspace.py: 64-step cap on ``find_git_worktree`` and ``nearest_root`` upward walks, plus try/except containment around ``Path(...).resolve()`` and child ``.exists()`` calls. Defensive against pathological inputs (symlink loops, encoding errors, permission failures mid-walk) — the lint hook is hot-path code and must never raise. Tests: - tests/agent/lsp/test_eventlog.py: 18 tests covering steady-state silence (clean writes stay DEBUG), state-transition INFO-once semantics (active for, no project root), action-required WARNING-once (server unavailable), per-call WARNING (timeouts, spawn failures), and the "1000 clean writes => 1 INFO" contract. - tests/agent/lsp/test_backend_gate.py: 5 tests verifying _lsp_local_only / snapshot_baseline / maybe_lsp_diagnostics skip the LSP layer for non-local backends and route correctly for LocalEnvironment. - tests/agent/lsp/test_protocol.py: new test_read_message_rejects_runaway_header exercising the 8 KiB cap. Validation: - 73/73 LSP tests pass (49 original + 18 eventlog + 5 backend-gate + 1 framer cap) - 198/198 pass when run alongside existing file_operations tests - Live E2E re-run with pyright still surfaces "ERROR [2:12] Type ... reportReturnType (Pyright)" through the full path, then patch fix removes it on the next call. * feat(lsp): atexit cleanup + separate lsp_diagnostics JSON field Two improvements salvaged from #24414's plugin-form alternative, keeping our core-integrated design: 1. atexit cleanup of spawned language servers ---------------------------------------------------------------- ``agent/lsp/__init__.get_service`` now registers an ``atexit`` handler on first creation that tears down the LSPService on Python exit. Without this, every ``hermes chat`` exit was leaking pyright/gopls/etc. processes for a few seconds while their stdout buffers drained -- they got reaped by the kernel eventually but a watchful ``ps aux`` would catch them. The handler runs once per process (gated by ``_atexit_registered``); idempotent ``shutdown_service`` ensures double-fire is a no-op. Errors during shutdown are swallowed at debug level since by the time atexit fires the user has already seen the agent's final response. 2. Separate ``lsp_diagnostics`` field on WriteResult / PatchResult ---------------------------------------------------------------- Previously the LSP layer folded its diagnostic block into the ``lint.output`` string, conflating the syntax-check tier with the semantic tier. The agent (and any downstream parsers) now read syntax errors and semantic errors as independent signals: { "bytes_written": 42, "lint": {"status": "ok", "output": ""}, "lsp_diagnostics": "<diagnostics file=...>\nERROR [2:12] ..." } ``_check_lint_delta`` returns to its original two-tier shape (syntax check + delta filter); ``write_file`` and ``patch_replace`` independently fetch LSP diagnostics via ``_maybe_lsp_diagnostics`` and pass them into the new field. ``patch_replace`` propagates the inner write_file's ``lsp_diagnostics`` so the outer PatchResult carries the patch's delta correctly. Tests: 19 new - tests/agent/lsp/test_lifecycle.py (8 tests): atexit registration fires once and only once across N get_service calls; the registered callable is our internal shutdown wrapper; shutdown_service is idempotent and safe when never started; exceptions during shutdown are swallowed; inactive service is cached so we don't rebuild on every check. - tests/agent/lsp/test_diagnostics_field.py (11 tests): WriteResult / PatchResult dataclass shape, to_dict include/omit semantics, channel separation (lint and lsp_diagnostics carry independent signals), write_file populates the field via _maybe_lsp_diagnostics only when the syntax tier is clean, patch_replace propagates the field forward from its internal write_file. Validation: - 92/92 LSP tests pass (73 prior + 8 lifecycle + 11 diagnostics field) - 217/217 pass with file_operations + LSP combined - Live E2E reverified: clean writes -> both fields empty/none; type error introduced -> lint clean (parses), lsp_diagnostics carries the pyright reportReturnType block; patch fix -> both fields clean again. * fix(lsp): broken-set short-circuit so a wedged server isn't paid every write Discovered while auditing failure paths: a language server binary that hangs (sleep forever, no LSP traffic on stdin/stdout) caused EVERY subsequent write to re-pay the 8s snapshot_baseline timeout. Five writes = ~64s of dead time. The bug: ``_get_or_spawn`` adds the (server_id, root) pair to ``_broken`` inside its inner exception handler, but when the OUTER ``_loop.run`` timeout fires, it cancels the inner task before that handler runs. The pair never makes it to broken-set, so the next write re-enters the spawn path and re-pays the timeout. Fix: - New ``_mark_broken_for_file`` helper at the service layer marks the (server_id, workspace_root) pair broken from the OUTSIDE when the outer timeout fires. Called from the except branches in ``snapshot_baseline``, ``get_diagnostics_sync`` (asyncio.TimeoutError + generic Exception). Also kills any orphan client process that survived the cancelled future, fire-and-forget with a 1s ceiling. - ``enabled_for`` now consults the broken-set BEFORE returning True. Files in already-broken (server_id, root) pairs short-circuit to False, so the file_operations layer skips the LSP path entirely with no spawn cost. Until the service is restarted (``hermes lsp restart``) or the process exits. - A single eventlog WARNING is emitted on first mark-broken so the user knows which server gave up. Subsequent edits in the same project stay silent. Tests: 7 new in tests/agent/lsp/test_broken_set.py — covers the key shape (server_id, per_server_root), enabled_for short-circuit, sibling-file skip in same project, project isolation (broken in A doesn't affect B), graceful no-op for missing-server / no-workspace, and an end-to-end test that snapshots after a failure and verifies the next ``enabled_for`` returns False. Validation: - Live retest of the wedged-binary scenario: 5 sequential writes, first 8.88s (the one snapshot timeout), subsequent four ~0.84s (no LSP cost). Down from 5x12.85s = 64s before this fix. - 99/99 LSP tests pass (92 prior + 7 broken-set) - 224/224 pass with file_operations + LSP combined - Happy path E2E reverified — clean write, type error introduced, patch fix all behave correctly with the new broken-set logic. Note: the FIRST write to a wedged binary still pays 8s (the snapshot_baseline timeout). We could shorten that, but pyright/ tsserver normally take 2-3s and slow CI rust-analyzer can need 5+ seconds, so 8s is the conservative ceiling. Subsequent writes are instant.	2026-05-12 16:31:54 -07:00
rob-maron	2863e9484a	Use nous portal as model metadata authority (#24502 ) * nous portal metadata resolver * minor fixes	2026-05-12 11:59:31 -07:00
Teknium	c1eb2dcda7	feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback (#24220 ) * feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback Three coordinated mitigations for the Mini Shai-Hulud worm hitting mistralai 2.4.6 on PyPI (2026-05-12) and for the next single-package compromise that follows. # What this PR makes true 1. Users with the poisoned mistralai 2.4.6 in their venv get a loud detection banner with copy-pasteable remediation steps the moment they run hermes (and on every gateway startup). 2. One quarantined / yanked PyPI package can no longer silently demote a fresh install to 'core only' — the installer keeps every other extra and tells the user which tier landed. 3. Future opt-in backends (Mistral, ElevenLabs, Honcho, etc.) can lazy-install on first use under a strict allowlist, instead of eagerly pulling everything at install time. # Detection: hermes_cli/security_advisories.py - ADVISORIES catalog (one entry currently: shai-hulud-2026-05 for mistralai==2.4.6). Adding the next one is a single dataclass. - detect_compromised() uses importlib.metadata.version() — no pip dependency, works in uv venvs that lack pip. - Banner cache (~/.hermes/cache/advisory_banner_seen) rate-limits the startup banner to once per 24h per advisory. - Acks persisted to security.acked_advisories in config.yaml; never re-banner after ack. - Wired into: * hermes doctor — runs first, prints full remediation block * hermes doctor --ack <id> — dismisses an advisory * cli.py interactive run() and single-query branches — short stderr banner pointing at hermes doctor * gateway/run.py startup — operator-visible warning in gateway.log # Lazy-install framework: tools/lazy_deps.py - LAZY_DEPS allowlist maps namespaced feature keys (tts.elevenlabs, memory.honcho, provider.bedrock, etc.) to pip specs. - ensure(feature) installs missing deps in the active venv via the uv → pip → ensurepip ladder (matches tools_config._pip_install). - Strict spec safety regex rejects URLs, file paths, shell metas, pip flag injection, control chars — only PyPI-by-name accepted. - Gated on security.allow_lazy_installs (default true) plus the HERMES_DISABLE_LAZY_INSTALLS env var for restricted/audited envs. - Migrated three backends as proof of pattern: * tools/tts_tool.py — _import_elevenlabs() calls ensure first * plugins/memory/honcho/client.py — get_honcho_client lazy-installs * tts.mistral / stt.mistral entries pre-registered for when PyPI restores mistralai # Installer fallback tiers scripts/install.sh, scripts/install.ps1, setup-hermes.sh: - Centralised _BROKEN_EXTRAS list (currently: mistral). Edit one array when a transitive breaks; users keep every other extra. - New 'all minus known-broken' tier between [all] and the existing PyPI-only-extras tier. Only kicks in when [all] fails resolve. - All three tiers explicit: every fallback announces which tier landed and prints a re-run hint when not on Tier 1. - install.ps1 and install.sh both regenerate their tier specs from the same _BROKEN_EXTRAS array so updates stay in sync. Side effect: install.ps1 Tier 2 spec previously hardcoded 'mistral' in its extra list — bug fixed by the refactor (mistral is filtered out). # Config hermes_cli/config.py — DEFAULT_CONFIG.security gains: - acked_advisories: [] (advisory IDs the user has dismissed) - allow_lazy_installs: True (security gate for ensure()) No config version bump needed — both keys nest under existing security: block, and load_config's deep-merge picks up DEFAULT_CONFIG defaults for users with older configs. # Tests tests/hermes_cli/test_security_advisories.py — 23 tests covering: - detect_compromised matches/non-matches, wildcard frozenset - ack persistence, idempotence, blank rejection, config-failure path - banner cache rate limiting + 24h re-banner + ack-stops-banner - short_banner_lines / full_remediation_text / render_doctor_section / gateway_log_message - shipped catalog well-formedness invariant tests/tools/test_lazy_deps.py — 40 tests covering: - spec safety: 11 safe parametrized + 18 unsafe parametrized - allowlist: unknown-feature rejection, namespace.name shape, every shipped spec passes the safety regex - security gating: config flag, env var, default, fail-open - ensure() happy/sad paths: already-satisfied, install success, pip stderr surfaced on failure, install-succeeds-but-still-missing - is_available, feature_install_command Combined: 63 new tests, all passing under scripts/run_tests.sh. # Validation - scripts/run_tests.sh tests/hermes_cli/test_security_advisories.py tests/tools/test_lazy_deps.py → 63/63 passing - scripts/run_tests.sh tests/hermes_cli/test_doctor.py tests/hermes_cli/test_doctor_command_install.py tests/tools/test_tts_mistral.py tests/tools/test_transcription_tools.py tests/tools/test_transcription_dotenv_fallback.py → 165/165 passing - scripts/run_tests.sh tests/hermes_cli/ tests/tools/ → 9191 passed, 8 pre-existing failures (verified on origin/main before this change) - bash -n on install.sh and setup-hermes.sh → OK - py_compile on all modified .py files → OK - End-to-end smoke test of detect_compromised + render_doctor_section + gateway_log_message with mocked installed version → produces copy-pasteable remediation output # Community Full advisory + remediation steps: website/docs/community/security-advisories/shai-hulud-mistralai-2026-05.md Short-form post drafts (Discord, GitHub pinned issue, README banner): scripts/community-announcement-shai-hulud.md Refs: PR #24205 (mistral disabled), Socket Security advisory <https://socket.dev/blog/mini-shai-hulud-worm-pypi> * build(deps): pin every direct dep to ==X.Y.Z (no ranges) Companion to the supply-chain advisory work: replace every >=/</~= range in pyproject.toml's [project.dependencies] and [project.optional-dependencies] with an exact ==X.Y.Z pin sourced from uv.lock. Why: ranges allow PyPI to ship a fresh version of any direct dep at any time without a code review on our side. With ranges, the malicious mistralai 2.4.6 release would have been pulled by every fresh 'pip install -e .[all]' for the hours between upload and PyPI's quarantine — exactly the install window we got hit on. Exact pins close that window: the only way a new package version reaches a user is via an intentional update on our end. What the user-facing change is: nothing, behavior-wise. Every package resolves to the same version it was already resolving to via uv.lock — the pins just remove the resolver's freedom to pick a different one. Cost: any user installing Hermes alongside another package that requires a newer pin gets a resolver conflict. Acceptable for our isolated-venv install path; documented in the new comment block. Build-system requires line (setuptools>=61.0) is intentionally left as a range — pinning the build backend would block fresh pip from bootstrapping the build on architectures where that exact wheel isn't available. mistral extra (mistralai==2.3.0) is pinned but stays out of [all] (per PR #24205). 'uv lock' regeneration will fail until PyPI restores mistralai; lockfile regeneration is gated behind that, NOT on every PR. LAZY_DEPS in tools/lazy_deps.py also moved to exact pins so the lazy- install pathway can never resolve a different version than the one declared in pyproject.toml. Validation: - Cross-checked all 77 pinned direct deps in pyproject.toml against uv.lock — every pin matches the resolved version exactly. - Cross-checked all LAZY_DEPS specs against uv.lock — same. - 'uv pip install -e .[all] --dry-run' resolves 205 packages cleanly. - tests/tools/test_lazy_deps.py + tests/hermes_cli/test_security_advisories.py → 63/63 passing (every shipped spec passes the safety regex). - Doctor + TTS + transcription targeted suite → 146/146 passing. * build(deps): hash-verify transitives via uv.lock; remove unresolvable [mistral] extra You asked: 'what about the dependencies the dependencies rely on?' — correctly noting that exact-pinning direct deps in pyproject.toml does NOT cover the transitive graph. `pip install` and `uv pip install` both re-resolve transitives fresh from PyPI at install time, so a compromised transitive (e.g. `httpcore` if it got worm-poisoned tomorrow) would still hit our users even with every direct dep exact-pinned. # What this commit fixes 1. Both real installer scripts now prefer `uv sync --locked` as Tier 0. uv.lock records SHA256 hashes for every transitive — a compromised package with a different hash gets REJECTED. Falls through to the existing `uv pip install` cascade if the lockfile is missing or stale, with a loud warning that the fallback path does NOT hash-verify transitives. Previously only `setup-hermes.sh` (the dev path) used the lockfile; `scripts/install.sh` and `scripts/install.ps1` (the paths fresh users actually run) skipped it. 2. Removed the `[mistral]` extra entirely. The `mistralai` PyPI project is fully quarantined right now — every version returns 404, so any pin we wrote was unresolvable, which broke `uv lock --check` in CI. Restoration is documented in pyproject.toml as a 5-step checklist (verify, re-add extra, re-enable in 4 modules, regenerate lock, optionally re-add to [all]). 3. Regenerated uv.lock. 262 packages, mistralai/eval-type-backport/ jsonpath-python pruned. `uv lock --check` now passes. # Defense-in-depth view \| Layer \| Where \| Protects against \| \|----------------------------\|-------------------\|-------------------------------------------\| \| Exact pins in pyproject \| direct deps \| new mistralai 2.4.6-style direct compromise \| \| uv.lock + `--locked` install \| transitive graph \| transitive worm injection \| \| Tier-0 hash-verified path \| install.sh / .ps1 \| actually USE the lockfile in fresh installs \| \| `uv lock --check` CI gate \| every PR \| drift between pyproject and lockfile \| \| `hermes_cli/security_advisories.py` \| runtime \| cleanup for users who already got hit \| The exact pinning + hash verification together close the supply-chain gap. Without the lockfile path, exact pins alone are theater. # Validation - `uv lock --check` → passes (262 packages resolved, no drift). - `bash -n` on install.sh + setup-hermes.sh → OK. - 209/209 tests passing across new + adjacent test files (test_lazy_deps.py, test_security_advisories.py, test_doctor.py, test_tts_mistral.py, test_transcription_tools.py). - TOML parse OK. * chore: remove community announcement drafts (PR body covers it) * build(deps): lazy-install every opt-in backend (anthropic, search, terminal, platforms, dashboard) Extends the lazy-install framework to cover everything that's not used by every hermes session. Base install drops from ~60 packages to 45. Moved out of core dependencies = []: - anthropic (only when provider=anthropic native, not via aggregators) - exa-py, firecrawl-py, parallel-web (search backends; only when picked) - fal-client (image gen; only when picked) - edge-tts (default TTS but still optional) New extras in pyproject.toml: [anthropic] [exa] [firecrawl] [parallel-web] [fal] [edge-tts]. All added to [all]. New LAZY_DEPS entries: provider.anthropic, search.{exa,firecrawl,parallel}, tts.edge, image.fal, memory.hindsight, platform.{telegram,discord,matrix}, terminal.{modal,daytona,vercel}, tool.dashboard. Each import site now calls ensure() before importing the SDK. Where the module had a top-level try/except (telegram, discord, fastapi), the graceful-fallback pattern was extended to lazy-install on first check_*_requirements() call and re-bind module globals. Updated test_windows_native_support.py tzdata check from snapshot (>=2023.3 literal) to invariant (any version + win32 marker). Validation: - Base install: 45 packages (was ~60); 6 newly-extracted packages absent - uv lock --check: passes (262 packages, no drift) - 209/209 lazy_deps + advisory + doctor + tts/transcription tests passing - py_compile clean on all 12 modified modules	2026-05-12 01:02:25 -07:00
Robin Fernandes	94d9db72ba	add client marker tag on aux inference requests	2026-05-11 22:30:42 -07:00
rob-maron	32abe742fa	fix comment	2026-05-11 21:30:29 -07:00
rob-maron	f0c2964f0b	remove comments	2026-05-11 21:30:29 -07:00
rob-maron	057fc7b073	fix guard	2026-05-11 21:30:29 -07:00
rob-maron	528bba6734	fix kimi	2026-05-11 21:30:29 -07:00
Teknium	ea1d0462cf	fix(cli): vertical fallback for markdown tables wider than terminal (#23948 ) Follow-up to #23863 (CJK table alignment). The realigner was correctly padding pipes to identical column offsets, but when a table's natural width exceeds terminal cells it produced lines that the terminal soft-wrapped mid-cell, destroying column alignment visually even though the bytes were perfectly padded. Reported as 'columns are not aligned' on tables containing one long row alongside several short rows. Approach mirrors Claude Code's MarkdownTable.tsx narrow-terminal fallback: when realign_markdown_tables is given an available_width budget and the rebuilt horizontal table exceeds it, render each body row as 'Header: value' lines separated by a thin ─ rule. Word-wraps oversize values at the budget with a 2-space continuation indent. - agent/markdown_tables.py: realign_markdown_tables(text, available_width=None); threshold check at the top of _render_block flips into a new _render_vertical fallback. Includes _wrap_to_width with hard-break for tokens longer than the budget. - cli.py: helper _terminal_width_for_streaming() returns shutil.get_terminal_size().columns minus _STREAM_PAD and a 2-cell safety margin; passed to all three realign call sites (_render_final_assistant_content for strip+render Panel paths, and the streaming flushers in _emit_stream_text / _flush_stream). - tests/agent/test_markdown_tables.py: 4 new tests covering the overflow-vertical fallback for ASCII + CJK content, the 'fits → keep horizontal' case, and the long-cell wrap with indent. Live-verified: with COLUMNS=100, the user's reported 'long row in ASCII table' case now renders as vertical key-value rows that all fit the panel; the 6-column CJK comparison table still renders as an aligned horizontal table because it fits inside 100 cols.	2026-05-11 16:49:13 -07:00
nicoechaniz	e2b713cced	fix(model-metadata): skip OpenRouter for known providers, add kimi/moonshot to PROVIDER_TO_MODELS_DEV Based on PR #23950 by @nicoechaniz. - Add "kimi" and "moonshot" to PROVIDER_TO_MODELS_DEV → kimi-for-coding - Gate OpenRouter metadata step behind "if not effective_provider": known providers should not be overridden by community-maintained OR data - Keep the targeted Kimi-family 32k guard as a secondary safety net inside the OR gate (for unknown providers with Kimi models) Co-authored-by: nicoechaniz <nicoechaniz@altermundi.net>	2026-05-11 13:16:07 -07:00
kshitijk4poor	91eef6255e	fix: correct context-length resolution for kimi-k2.6 on Ollama Cloud and Kimi Coding Kimi-k2.6 (which supports 262K context) was incorrectly resolved as 32K, tripping the 64K minimum-context guard and preventing use of the model on Ollama Cloud and Kimi Coding / Moonshot providers. Three fixes in the context-length resolution chain: 1. Ollama Cloud native /api/show query: new _query_ollama_api_show() queries the Ollama native API for authoritative GGUF model_info context_length. For hosted Ollama, prefers model_info over num_ctx since users can't set their own num_ctx on Cloud. Added at step 5e in get_model_context_length(), before the models.dev fallback. 2. models.dev :cloud/-cloud suffix fallback: lookup_models_dev_context() now also tries appending :cloud and -cloud suffixes when the bare model name doesn't match. models.dev stores 'kimi-k2.6:cloud' but users and the live API use bare 'kimi-k2.6'. 3. Kimi-family 32K guard: after the OpenRouter metadata step, reject exactly 32768 for Kimi-named models (kimi-, moonshot) and fall through to hardcoded defaults ('kimi': 262144). OpenRouter reports 32768 for moonshotai/kimi-k2.6 but the model actually supports 262K. Narrow filter — only 32768, only Kimi-family — becomes dead code when OpenRouter updates its metadata. ---	2026-05-11 13:16:07 -07:00
Teknium	7b76366552	feat(prompt-cache): cross-session 1h prefix cache for Claude on Anthropic / OpenRouter / Nous Portal (#23828 ) Cuts input cost for first-turn Claude requests by ~85-90% on subsequent sessions within an hour. Tools array (~13k tokens for default toolset) + stable system prefix (~5-8k tokens) get a 1h cache_control marker; the volatile suffix (memory, USER profile, timestamp, session id) sits in a separate non-cached block at the end so it doesn't poison the cross-session prefix when it changes. Provider gate: Claude on native Anthropic (incl. OAuth subscription), OpenRouter, and Nous Portal (which proxies to OpenRouter). All other providers keep today's system_and_3 layout unchanged. Layout (4 cache_control breakpoints, Anthropic max): 1. tools[-1] -> 1h (cross-session) 2. system content[0] -> 1h (cross-session, stable prefix) 3. messages[-2] -> 5m (within-session rolling) 4. messages[-1] -> 5m (within-session rolling) Within-session rolling shrinks from 3 messages to 2 to free the breakpoint budget. On Claude with realistic tool loadouts the long-lived tier carries the bulk of cross-session value anyway. System prompt is now always assembled cache-friendly: stable identity / guidance / skills / platform hints first, then session-stable context files (AGENTS.md, .cursorrules), then per-call volatile content. Old single-string callers see the same logical content (same join order), just reordered so volatile lives at the end. Config knobs (defaults shown): prompt_caching: cache_ttl: "5m" # rolling-window TTL (unchanged) long_lived_prefix: true # opt-out switch long_lived_ttl: "1h" # cross-session prefix TTL Live E2E (tests/agent/test_prompt_caching_live.py, gated on OPENROUTER_API_KEY) on anthropic/claude-haiku-4.5 with default toolset: Call 1 (cold): cache_write=13,415 cache_read=0 Call 2 (NEW agent + msg): cache_write=391 cache_read=13,025 Cross-session reuse: 97.09% Implementation: * agent/prompt_caching.py: new apply_anthropic_cache_control_long_lived() + mark_tools_for_long_lived_cache(); existing apply_anthropic_cache_control() preserved verbatim for the fallback path. * agent/anthropic_adapter.py: convert_tools_to_anthropic() now forwards cache_control onto each Anthropic-format tool dict. * run_agent.py: _build_system_prompt_parts() returns the 3-tier dict; _build_system_prompt() joins them (backward compatible). _supports_long_lived_anthropic_cache() policy added next to the existing _anthropic_prompt_cache_policy() (which now also recognises Nous Portal Claude — pre-existing gap fixed in passing). _build_api_kwargs() resolves tools_for_api once and propagates the marker through all four build paths (anthropic_messages, bedrock, codex_responses, profile/legacy chat completions). Long-lived flag plumbed into the runtime snapshot/restore + model-switch + fallback-promotion paths. Tests: * tests/agent/test_prompt_caching.py: +8 tests (TestMarkToolsForLongLivedCache, TestApplyAnthropicCacheControlLongLived). * tests/run_agent/test_anthropic_prompt_cache_policy.py: +9 tests (TestSupportsLongLivedAnthropicCache matrix across 8 endpoint classes + a fallback-target case). * tests/agent/test_prompt_caching_live.py: new live E2E (skipif when OPENROUTER_API_KEY is unset; runs outside the hermetic suite). * Targeted suites: 327/327 pass (caching/adapter/policy/builder). * tests/agent/ + tests/run_agent/: 3992 pass, 17 skip, 1 pre-existing flake (test_async_httpx_del_neuter::test_same_key_replaces_stale_loop_entry, verified failing on pristine origin/main).	2026-05-11 11:14:56 -07:00
kshitij	2ec8d2b42f	chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937 ) Replace with for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining. 133 files, +626/-626 (net zero).	2026-05-11 11:13:25 -07:00
wuli666	111b859e49	fix(auxiliary): evict async wrappers on poisoned client (follow-up to #23482 ) #23482 fixed cache poisoning in the sync path: when a Codex auxiliary timeout closes the underlying OpenAI client, _evict_cached_client_instance walks CodexAuxiliaryClient wrappers via their _real_client attribute and drops the cache entry so the next aux call rebuilds. The cache key includes async_mode (see _client_cache_key), so the sync and async clients for the same provider live in two distinct entries pointing at the same underlying transport. The fix walked the sync wrapper's _real_client correctly but the async wrappers (AsyncCodexAuxiliaryClient, AsyncAnthropicAuxiliaryClient, AsyncGeminiNativeClient) never exposed _real_client at all, so the async entry survived eviction and kept handing out the poisoned client. Effect on async aux callers: one timeout now poisons every subsequent async aux call (compression, vision, session_search, title_generation) with 'Connection error' until gateway restart -- even while the sync route recovered as designed in #23482. Mirror the sync wrapper's _real_client onto each async wrapper so the existing eviction helper finds them. Three changes, one per wrapper: - AsyncCodexAuxiliaryClient: self._real_client = sync_wrapper._real_client (the underlying OpenAI client) - AsyncAnthropicAuxiliaryClient: same shape - AsyncGeminiNativeClient: self._real_client = sync_client (Gemini's native facade is itself the leaf; no OpenAI client beneath it) Update _evict_cached_client_instance docstring to reflect that it now covers both sync and async wrappers via the same attribute walk. Test: TestAuxiliaryClientPoisonedCacheEviction.test_evict_cached_client_instance_walks_async_wrapper seeds both sync and async cache entries pointing at the same leaf and asserts both are dropped on a single eviction call. Verified the test fails without the wrapper changes ("async cache entry survived eviction -- wrapper is missing _real_client") and passes with them. Refs #23482, #23432	2026-05-11 11:13:20 -07:00
Teknium	1d00716754	fix(cli,tui): align CJK / wide-char markdown tables (#23863 ) CJK and emoji glyphs render as two terminal cells but JS String#length and the model's own padding count them as one, so any markdown table with Chinese / Japanese / Korean cells drifts right per row when a real terminal renders it. Both surfaces fix this with a display-cell width measurement (wcswidth on the Python side, stringWidth on the TUI side). Changes: - agent/markdown_tables.py: new helper. realign_markdown_tables(text) detects markdown table blocks (header + \|---\| divider) and rewrites the row padding using wcwidth.wcswidth so every pipe and dash lines up across rows. No-op on text without tables. - cli.py: hook the helper into _render_final_assistant_content for strip / render modes (raw passes through untouched), and into the streaming line emitter so live token-by-token rendering also produces aligned tables. A small two-buffer state machine in _emit_stream_text holds table rows until the block ends, then flushes them through the realigner so all rows pad to a single per-column width. - ui-tui/src/components/markdown.tsx: renderTable now uses stringWidth (Bun.stringWidth fast path + East-Asian-width-aware fallback, already memoised in @hermes/ink) instead of UTF-16 String#length for both column-width measurement and per-cell padding. Drops the comment that documented the bug as a deliberate limitation. Validation: - New tests/agent/test_markdown_tables.py (11): every rebuilt block shares pipe column offsets across rows for pure CJK, mixed CJK+emoji, ragged-row, and multi-table inputs. - Updated tests/cli/test_cli_markdown_rendering.py: the existing strip-mode test asserted exact whitespace; rewritten to assert the alignment contract (cell content survives + every rendered row shares pipe offsets). - New ui-tui markdown.test.ts case (1): rendered column-2 start offset is identical for the header + every body row, including the CJK row that drifted before the fix. - Live: hermes chat -q with the user-reported screenshot prompt now produces a perfectly aligned table on the wire (header, divider, 4 body rows including '通义千问', all pipes at identical columns).	2026-05-11 11:13:06 -07:00
kshitij	657874460f	chore: ruff auto-fixes — collapsible-else-if, if-stmt-min-max, dict.fromkeys (#23926 ) PLR5501 (collapsible-else-if): 28 instances — else: if: → elif: PLR1730 (if-stmt-min-max): 15 instances — if x<y: x=y → x=max(x,y) C420 (dict.fromkeys): 2 instances — dictcomp → dict.fromkeys PLR1704 (redefined-argument): 1 instance — reason → err_msg (shadow fix) C414 (unnecessary-list): 1 instance — sorted(list(x)) → sorted(x) 28 files, -44 net lines. All mechanical, zero logic changes. 17,211 tests pass, zero regressions.	2026-05-11 11:03:29 -07:00
Teknium	228b7d27bd	fix(auxiliary): cache 402'd providers as unhealthy with TTL to stop per-call retry storms (#23597 ) When an auxiliary provider returns HTTP 402 (credit / payment), every subsequent compression / title-gen / session-search / vision call still re-tried it as the FIRST entry in the chain — burning ~1 RTT to hit 402 again, then falling back. On a long Discord/LCM session that meant dozens of doomed 402s per minute (issue #23570). Add a per-process unhealthy-provider cache with a 10 min TTL. When any caller observes a payment error against a provider, the label is marked unhealthy and skipped by: * _resolve_auto Step-1 (main provider use-as-aux path) * _resolve_auto Step-2 (aggregator/fallback chain) * _try_payment_fallback (used by call_llm/acall_llm on first 402) Skip-logs are throttled to once per minute per label so a bursty session doesn't spam agent.log. Entries auto-expire so a topped-up account recovers without manual intervention. The cache is in-process only by design — multi-profile users with different keys per profile must each hit the 402 once. Refs #23570	2026-05-10 22:43:14 -07:00
Teknium	e5bce320db	fix(auxiliary): evict cached client on timeout/connection error (#23482 ) A Codex auxiliary timeout closes the underlying OpenAI client (so the streaming hang doesn't sit until the user kills the session), but the cached wrapper kept pointing at the now-dead transport. Subsequent auxiliary calls (compression retry, memory flush, background review, title generation routed via provider: main) reused that closed client and failed fast with 'Connection error' until the gateway restarted — even though the main agent route was healthy the whole time. Sync `_get_cached_client` had no liveness check (async did, via loop identity), and the connection-error fallback in `call_llm` only fired on the auto provider path, so an explicit provider — including the common `auxiliary.compression.provider: main` shape — never evicted. Three fixes: * New `_evict_cached_client_instance(target)` helper that drops the cache entry whose stored client is target (or wraps it via `_real_client`, for `CodexAuxiliaryClient`). * `_CodexCompletionsAdapter._close_client_on_timeout` evicts the wrapper after closing the inner OpenAI client. * `call_llm` and `async_call_llm` evict on `_is_connection_error` before re-raising, regardless of whether the provider is auto. Net effect: one timeout costs one summary attempt + the existing 30s compressor cooldown; the next compaction rebuilds the client and works. Non-connection errors (4xx/5xx) do not evict, so cache hits stay stable. Closes #23432	2026-05-10 18:55:05 -07:00
Teknium1	ae83a54be4	docs(kanban): worker lane contract page + review-required convention Closes the architectural-pin part of #19931. Most of what that issue asked for is already implemented (logs under kanban root, env-pinned workspace, dispatcher routing of unknown assignees, lifecycle ownership, structured handoff conventions). What was missing: 1. A written contract integrators can point at when adding a new worker lane shape, and 2. The "code-changing workers should not auto-promote success to done" convention. This commit ships both as docs+convention layered on existing primitives. No kernel changes — the kanban_complete / kanban_block / kanban_comment surfaces already support the review-required pattern; we just hadn't written it down or made it visible to workers. Changes: - `agent/prompt_builder.py::KANBAN_GUIDANCE`: append the review-required exception to step 5 of the lifecycle. Workers get the cue auto-injected into their system prompt — drop structured metadata into a kanban_comment first, then end with kanban_block(reason="review-required: <summary>") instead of kanban_complete when the work needs review. Total prompt size went from ~3000 to ~3275 chars; well under the 4096 budget enforced by test_kanban_guidance_size. - `skills/devops/kanban-worker/SKILL.md`: add a worked example to the existing "Good summary + metadata shapes" section between the Coding-task and Research-task examples. Same shape as the others (kanban_comment with structured handoff JSON, then kanban_block with the human-readable reason). Plus a one-line guide on when to use kanban_complete vs the review-required pattern. - `website/docs/user-guide/features/kanban-worker-lanes.md` (new): the integrator-facing contract. Covers the hierarchy, the three things every lane must provide (assignee, spawn mechanism, lifecycle terminator), the env vars the dispatcher injects, the review-required convention, the failure modes the kernel handles for free, and an explicit "external CLI worker lane" deferred- pending-concrete-asker section that links to #19931 and #19924. - `website/sidebars.ts`: link the new page under user-guide/features. The "specialist worker lanes for external CLI tools (Codex / Claude Code / OpenCode)" runner is NOT shipped here. The dispatcher's spawn_fn parameter already supports plugin-shaped extension; the per-CLI integration work (auth, sandbox policy, exit-code mapping) needs a concrete asker. The new docs page tells would-be integrators the contract any such lane must satisfy. Refs #19931	2026-05-10 18:15:52 -07:00
Teknium	d6e1fadbf5	fix(xai): omit reasoning.effort for grok models that reject it (#23435 ) xAI's Responses API returns HTTP 400 ("Model X does not support parameter reasoningEffort") for grok-4, grok-4-0709, grok-4-fast-, grok-4-1-fast-, grok-3, grok-4.20-0309-, and grok-code-fast-1 — even though those models reason natively. Hermes was unconditionally sending `reasoning: {effort: 'medium'}` to xAI for every Grok model, breaking direct `--provider xai` for the entire grok-4 line. Add a substring allowlist predicate (verified live against api.x.ai 2026-05-10) covering the only Grok families that accept the effort dial: grok-3-mini, grok-4.20-multi-agent, grok-4.3. The Responses transport omits the `reasoning` key entirely for everything else while still including `reasoning.encrypted_content` so we capture native reasoning tokens. Verified end-to-end: `hermes chat -q hi --provider xai --model grok-4-0709` went from HTTP 400 to a successful reply.	2026-05-10 15:21:30 -07:00
Teknium	c39168453d	feat(i18n): localize all gateway commands + web dashboard, add 8 new locales (16 total) (#22914 ) * feat(i18n): localize /model command output Reported by @tianma8888: when Chinese users run /model, the labels ("Provider:", "Context:", "_session only_", etc.) are still English. This routes the static prose through the existing i18n catalog so it follows display.language / HERMES_LANGUAGE. Changes: - locales/{en,zh,ja,de,es,fr,tr,uk}.yaml: add 17 keys under gateway.model.* covering switched/provider/context/max_output/cost/ capabilities/prompt_caching/warning/saved_global/session_only_hint/ current_label/current_tag/more_models_suffix/usage_. - gateway/run.py _handle_model_command: replace hardcoded f-strings in the picker callback, the text-list fallback, and the direct-switch confirmation block with t("gateway.model.<key>", ...). What stays English: - model IDs, provider slugs, capability strings, cost figures, and the "[Note: model was just switched...]" prepended to the model's next prompt (LLM-facing, not user-facing). - The two slightly-different session-only hints unify on a single key with the em-dash phrasing. Validation: tests/agent/test_i18n.py 27/27 passing (parity contract holds), tests/gateway/ -k 'model or i18n' 74/74 passing. feat(i18n): localize all gateway slash command outputs Expands the i18n catalog from 7 strings to 234 keys across 35 gateway slash command handlers, so non-English users see localized output for \`/profile\`, \`/status\`, \`/help\`, \`/personality\`, \`/voice\`, \`/reset\`, \`/agents\`, \`/restart\`, \`/commands\`, \`/goal\`, \`/retry\`, \`/undo\`, \`/sethome\`, \`/title\`, \`/yolo\`, \`/background\`, \`/approve\`, \`/deny\`, \`/insights\`, \`/debug\`, \`/rollback\`, \`/reasoning\`, \`/fast\`, \`/verbose\`, \`/footer\`, \`/compress\`, \`/topic\`, \`/kanban\`, \`/resume\`, \`/branch\`, \`/usage\`, \`/reload-mcp\`, \`/reload-skills\`, \`/update\`, \`/stop\` (plus the \`/model\` block already added in the previous commit). Reported by @tianma8888 — Chinese users want command output prose in their language, not just the labels we already had. Translations are hand-written for all 8 supported locales (en, zh, ja, de, es, fr, tr, uk), matching each catalog's existing style: full-width punctuation in zh, em-dashes in zh/ja/uk, French spaced colons, German noun capitalization, etc. What stays English (unchanged): - Identifiers/values: model IDs, file paths, profile names, session IDs, command flag names like --global, URLs, config keys. - Backtick code spans: \`/foo\`, \`config.yaml\`. - Log messages (logger.info/warning/error). - LLM-facing system notes prepended to next prompt (e.g. [Note: model was just switched...]). - Strings produced by external modules (gateway_help_lines, format_gateway, manual_compression_feedback) — those have their own surfaces. New shared keys for cross-handler boilerplate: - gateway.shared.session_db_unavailable (5 call sites: branch, title, resume, topic, _disable_telegram_topic_mode_for_chat) - gateway.shared.session_not_found (1 site) - gateway.shared.warn_passthrough (2 sites in /title's f"⚠️ {e}" pattern) YAML gotcha fixed: \`yolo.on\` and \`yolo.off\` were originally written unquoted, which YAML 1.1 parses as boolean True/False keys. Renamed to \`yolo.enabled\` / \`yolo.disabled\` for both safety and clarity. Test fix: tests/agent/test_i18n.py::test_t_missing_key_in_non_english_falls_back_to_english now resets the catalog cache on teardown, so the fake "foo: English Foo" locale doesn't poison the module-level cache for subsequent tests in the same xdist worker. (Without this, every gateway slash command test that shares a worker with the i18n suite would see the fake catalog.) Validation: - tests/agent/test_i18n.py: 27/27 (parity contract — every key in every locale, matching placeholder tokens). - tests/gateway/: 5077 passed, 0 failed (full gateway suite). - 180 t() call sites added across 35 handlers; 1872 catalog entries total (234 keys × 8 locales). * feat(i18n): add 8 new locales — af, ko, it, ga, zh-hant, pt, ru, hu Expands the static-message catalog from 8 → 16 languages, each with full 270-key parity against the English source-of-truth. Every locale now covers the same surface PR #22914 added: approval prompts plus all 35 gateway slash command outputs. New locales: - af Afrikaans (community ask in #21961 by @GodsBoy; PRs #21962, #21970) - ko Korean (PRs #20297 by @tmdgusya, #22285 by @project820) - it Italian (PR #20371 by @leprincep35700) - ga Irish/Gaeilge (PR #20962 by @ryanmcc09-dot) - zh-hant Traditional Chinese (PRs #20523 by @jackey8616, #13140 by @anomixer) - pt Portuguese (PRs #20443 by @pedroborges, #15737 by @carloshenriquecarniatto, #22063 by @Magaav) - ru Russian (PR #22770 by @DrMaks22) - hu Hungarian (PR #22336 by @lunasec007) Each locale uses native-quality translations matching the existing tone and conventions of the older 8 locales: - zh-hant uses 繁體 characters with TW/HK technical vocabulary (軟體 not 软件, 連線 not 连接, 設定 not 设置, 訊息 not 消息, 工作階段 not 会话, 程式 not 程序, 預設 not 默认, 伺服器 not 服务器), full-width punctuation 「：（）」. - ko uses formal 합니다체 (습니다/합니다) register throughout. - pt uses European Portuguese as baseline with neutral PT/BR vocabulary where possible. - ga uses standard An Caighdeán Oifigiúil; English loanwords retained for tech terms without good Irish equivalents (gateway, API, JSON). - All preserve {placeholder} tokens, backtick code spans, slash commands, brand names (Hermes, MCP, TTS, YOLO, OpenAI, Telegram, etc.), and emoji. Aliases added in agent/i18n.py: - af-za, Afrikaans → af - ko-kr, Korean, 한국어 → ko - it-it, italiano → it - ga-ie, Irish, Gaeilge → ga - zh-tw, zh-hk, zh-mo, traditional-chinese → zh-hant (note: zh-tw used to alias to zh; now aliases to its own zh-hant catalog) - zh-cn, zh-hans, zh-sg → zh (unchanged from before) - pt-pt, pt-br, brazilian, portuguese → pt - ru-ru, Russian, русский → ru - hu-hu, Magyar → hu The zh-tw alias re-routing is intentional: previously typing 'zh-TW' got the Simplified Chinese catalog (wrong vocabulary for Taiwan/HK users). Now those users get the proper Traditional Chinese catalog. Validation: - tests/agent/test_i18n.py: 43/43 (parity contract holds for all 16 languages × 270 keys = 4320 catalog entries, with matching placeholder tokens). - E2E alias resolution verified for all 19 alias inputs (Afrikaans, ko-KR, 한국어, italiano, Gaeilge, zh-TW, zh-HK, traditional-chinese, pt-BR, brazilian, Magyar, etc.). - tests/gateway/: 5198 passed (3 pre-existing TTS routing failures unrelated to i18n). Credit to all contributors whose PRs surfaced these language requests. Their original PRs may now be closed as superseded with credit. * feat(dashboard-i18n): add 14 web dashboard locales matching the static catalog Brings the React dashboard (web/src/) up to the same 16-language coverage the static catalog already has after the previous commits in this PR. The Translations interface is TypeScript-typed, so every new locale must provide every key — tsc -b is the parity guard. Languages added (each is a complete 429-line locale file): - af Afrikaans - ja Japanese (PR #22513 by @snuffxxx surfaced this) - de German (PR #21749 by @mag1art) - es Spanish (PR #21749) - fr French (PRs #21749, #10310 by @foXaCe) - tr Turkish - uk Ukrainian - ko Korean (PRs #21749, #18894 by @ovstng, #22285 by @project820) - it Italian - ga Irish (Gaeilge) - zh-hant Traditional Chinese (PR #13140 by @anomixer) - pt Portuguese (PRs #22063 by @Magaav, #22182 by @wesleysimplicio, #15737 by @carloshenriquecarniatto) - ru Russian (PRs #21749, #22770 by @DrMaks22) - hu Hungarian (PR #22336 by @lunasec007) Each translation covers all 15 namespaces with full key parity vs en.ts, preserves every {placeholder} token verbatim, keeps identifiers untranslated (brand names, file paths, cron expressions, code spans), translates the language.switchTo tooltip into the target language, and matches existing tone conventions (zh-hant uses TW/HK vocab; ja uses formal desu/masu; ko uses formal seumnida register; ga uses An Caighdean Oifigiuil with English loanwords for tech vocab without good Irish equivalents). Plumbing: - web/src/i18n/types.ts: Locale union expanded to all 16 codes. - web/src/i18n/context.tsx: imports all 16 catalogs; exports LOCALE_META (endonym + flag per locale); isLocale() type guard. - web/src/i18n/index.ts: re-export LOCALE_META. - web/src/components/LanguageSwitcher.tsx: replaced two-state EN-ZH toggle with a click-to-open dropdown listing all 16 languages. Note: zh-hant.ts exports zhHant (camelCase) since hyphen is invalid in a JS identifier; the canonical 'zh-hant' string keys it in TRANSLATIONS. Validation: - npx tsc -b: 0 errors. Every locale satisfies Translations. - npm run build (tsc + vite production): green, 2062 modules. - Each locale file is exactly 429 lines. Out of scope: plugin dashboards (kanban/achievements ship as prebuilt bundles with no source in repo); Docusaurus docs (separate surface); TUI (no i18n yet). * feat(plugin-i18n): localize achievements + kanban plugin dashboards across all 16 locales Brings the two shipped plugin dashboards (hermes-achievements, kanban) under the same i18n umbrella as the core dashboard PR #22914 just established. Both bundles now read user-facing strings from the host's i18n catalog via SDK.useI18n() instead of hardcoded English. ## Approach Plugin dashboards ship as prebuilt IIFE bundles in plugins/<name>/dashboard/dist/index.js — no build step, no source in repo (upstream-authored, vendored as compiled JS). Earlier contributor PRs (#22594, #22595, #18747) tried direct edits but didn't actually wire the bundles to read translations. This change does the wiring properly: 1. Each bundle gets a useI18n shim at IIFE scope: const useI18n = SDK.useI18n \|\| function () { return { t: { kanban: null }, locale: "en" }; }; Older host SDKs without useI18n still load the bundle and render English fallbacks. 2. A small tx(t, path, fallback, vars) helper resolves dotted keys under the plugin's namespace (t.kanban.* or t.achievements.) and interpolates {placeholder} tokens. 3. Every React component starts with const { t } = useI18n() and each user-visible string is wrapped in tx(t, "key", "English fallback"). Helpers called outside React components (window.prompt callers, constants used during init) take t as a parameter. 4. Top-level constants that were English dictionaries (COLUMN_LABEL, COLUMN_HELP, DESTRUCTIVE_TRANSITIONS, DIAGNOSTIC_EVENT_LABELS in kanban) become getColumnLabel(t, status)-style functions backed by FALLBACK_ dictionaries. ## Translations added Two new top-level namespaces added to the dashboard's TypeScript-typed Translations interface: - achievements: ~70 keys covering the hero, scan banner, achievement card, share dialog, stats, filters, and empty states. - kanban: ~145 keys covering the board, columns (with nested columnLabels and columnHelp sub-dicts), card detail panel, bulk-actions toolbar, dependency editor, board switcher, and diagnostic callouts. Each key is provided across all 16 supported locales: en, zh, zh-hant, ja, de, es, fr, tr, uk, af, ko, it, ga, pt, ru, hu. Total new translation entries: ~3,440 (215 keys × 16 locales). ## What stays English (deliberate) - API paths, CSS class names, data-* attributes, JSON keys, regex strings, URLs, file paths (~/.hermes/kanban.db, boards/_archived/). - State identifier strings used as lookup keys (triage / todo / ready / running / blocked / done / archived) — labels translate, key strings don't. - The PNG share-card text rendered to canvas in the achievements ShareDialog (HERMES AGENT watermark, UNLOCKED stamp, tier names) — these become part of a globally-shared image and stay English. - localStorage keys (hermes.kanban.selectedBoard). - Brand names (Kanban, Hermes, WebSocket, Nous Research). ## Contributor credit PR #22594 by @02356abc and PR #22595 by @02356abc supplied the en + zh kanban namespace skeleton (145 keys); used as the en source- of-truth in this commit and translated to the other 14 locales. PR #18747 by @laolaoshiren first surfaced the achievements localization request. ## Validation - npx tsc -b: 0 errors. All 16 locale .ts files satisfy the Translations type with full key parity. - npm run build (tsc + vite production build): green, 2062 modules, 1.56MB JS / 95KB CSS, ~2.5s build. - node --check on both plugin bundles: parse cleanly. - 126 tx() call sites in kanban, 46 in achievements. ## Out of scope - TUI (ui-tui/) has no i18n infrastructure yet. - Docusaurus docs (website/i18n/) — already had zh-Hans; expanding is a separate translation workstream (Thai / Korean / Hindi PRs).	2026-05-10 07:14:14 -07:00
Teknium	5aa755e4e6	feat(plugins): run any LLM call from inside a plugin via ctx.llm (#23194 ) * feat(plugins): host-owned LLM access via ctx.llm Plugins can now ask the host to run a one-shot chat or structured completion against the user's active model and auth, without ever seeing an OAuth token or API key. Closes the gap where plugins that needed bounded structured inference (receipts, CRM extraction, support classification) had to either bring their own provider keys or register a tool the agent had to call. New surface on PluginContext: - ctx.llm.complete(messages, ...) - ctx.llm.complete_structured(instructions, input, json_schema, ...) - async siblings ctx.llm.acomplete / acomplete_structured Backed by the existing auxiliary_client.call_llm pipeline — every provider, fallback chain, vision routing, and timeout policy Hermes already supports applies automatically. Trust gate (fail-closed by default): - plugins.entries.<id>.llm.allow_model_override - plugins.entries.<id>.llm.allowed_models (allowlist; '' = any) - plugins.entries.<id>.llm.allow_agent_id_override - plugins.entries.<id>.llm.allow_profile_override Embedded model@profile shorthand goes through the same gate as explicit profile=, so it can't bypass the auth-profile policy. Conflicting explicit and embedded profiles fail closed. Also lands: - plugins/plugin-llm-example/ — reference plugin that registers /receipt-extract, demonstrating image+text structured input, jsonschema validation, and the trust-gate config. - website/docs/developer-guide/plugin-llm-access.md — full API docs. - 45 unit tests covering trust gates, JSON parsing, schema validation, image encoding, async surface, and config loading. Validation: - 2628 tests pass in tests/agent/ - E2E: bundled plugin loaded with isolated HERMES_HOME, slash command produced parsed JSON via stubbed call_llm - response_format extra_body wired correctly for both json_object and json_schema modes docs(plugin-llm): rewrite quickstart and framing The quickstart now uses a meeting-notes-to-tasks example instead of a receipt extractor, and the page leads with hook-time / gateway pre-filter / scheduled-job framing rather than the OpenClaw KB/support/CRM/finance/migration enumeration that the original upstream PR used. Receipt example moved to a separate worked example link so the docs page itself doesn't echo any of the upstream framing. Also clarifies where ctx.llm fits in the broader plugin surface (table comparing register_tool / register_platform / register_hook / etc.) and what makes this lane different from auxiliary_client internals. No code change. * docs(plugin-llm): reframe as any LLM call, not just structured output The original draft leaned heavily on complete_structured() and made the chat lane (complete() / acomplete()) feel like a footnote. Restructure so: - The page title and description say 'any LLM call.' - The lead shows BOTH a plain chat call (error rewriter) AND a structured call (triage scorer) up top. - Quick start has two complete plugin examples — /tldr (chat) and /paste-to-tasks (structured). - New 'When to use which' table for choosing complete() vs complete_structured() vs the async siblings. - Trust-gate sections explicitly note 'all four methods,' and the request-shaping list calls out chat-only fields (messages) and structured-only fields (instructions, input, json_schema) alongside each other. - The 'Where this fits' section now says 'for any reason, structured or not.' The receipt-extractor reference plugin still exists under plugins/plugin-llm-example/ — but the docs page no longer treats it as the canonical surface example. It's now described as 'a third worked example, this time with image input.' No code change. * feat(plugin-llm): split provider/model into independent explicit kwargs The first cut accepted a single 'provider/model' slug on every method and split it internally. That looked clean but broke under live test: the model-override path tried to use the slug's vendor prefix as a literal Hermes provider id, which silently switched the user off their aggregator (e.g. plugin asks for 'openai/gpt-4o-mini' on a user who routes through OpenRouter — host attempted to call the 'openai' provider directly, failed because OPENAI_API_KEY wasn't set). New shape mirrors the host's main config: ctx.llm.complete( messages=[...], provider='openrouter', # gated, optional model='openai/gpt-4o-mini', # gated, optional profile='work', # gated, optional ... ) Each is independently gated by its own allow__override flag. Granting model-override does NOT auto-grant provider-override. Allowlists are now per-axis (allowed_providers, allowed_models) matched literally against whatever string the plugin sends. Dropped 'model@profile' embedded-suffix shorthand entirely. Hermes doesn't use that pattern anywhere else; profile= is its own kwarg. Live E2E (against real OpenRouter via Teknium's config) confirms: - zero-config call works - default-deny blocks each override with a helpful error - model-only override stays on user's active provider (the bug) - provider+model override switches cleanly - allowlist refuses non-listed entries - structured output round-trip parses + schema-validates Tests: 49 cases (up from 45); all green. Docs updated to match the new shape, including a 'most plugins never need this section' callout on the trust-gate config block. fix+cleanup(plugin-llm): real attribution, hook-mode coverage, move example out of core Three integration fixes for the ctx.llm surface: 1. Attribution bug — result.provider and result.model now reflect what call_llm actually used, not placeholder fallbacks ('auto', 'default'). New _resolve_attribution() helper: - explicit overrides win (what the call targeted) - response.model wins for the recorded model (provider canonicalisation: 'gpt-4o' → 'gpt-4o-2024-08-06' etc.) - falls back to _read_main_provider() / _read_main_model() when no override is set, so audit logs reflect the user's active main provider/model - 'auto' / 'default' only when EVERYTHING is empty Live verified: zero-config call now records provider='openrouter', model='anthropic/claude-4.7-opus-20260416' instead of provider='auto', model='default'. 2. Hook-mode coverage — TestHookMode confirms ctx.llm.complete works from inside a registered post_tool_call callback. The docs page promised hook integration; now there's a test that exercises the lazy-import path through the real invoke_hook machinery. Two cases: traceback-rewrite hook with conditional ctx.llm.complete, and minimal hook regression for the sync-hook + sync-llm path. 3. Reference plugin moved out of core. plugins/plugin-llm-example/ is gone from hermes-agent — it now lives in the new NousResearch/hermes-example-plugins companion repo. The docs page links there. Hermes' bundled plugins should be plugins users actually run; reference / docs-companion plugins live externally. Test count: 56 (up from 49). Wider sweep on tests/hermes_cli/ + tests/gateway/ + tests/tools/ + tests/agent/ shows 16770 passing; the 12 failures are all pre-existing on origin/main (verified by stashing this branch's changes and re-running) — kanban-boards, delegate-task, gateway-restart, tts-routing — none touch the plugin_llm surface. * chore(plugins): move all example plugins to companion repo Reference / docs-companion plugins now live exclusively in NousResearch/hermes-example-plugins, not bundled with the core repo: - example-dashboard - strike-freedom-cockpit A new fourth example, plugin-llm-async-example, was added to that repo demonstrating ctx.llm's async surface (acomplete()) with asyncio.gather() — registers /translate <lang>: <text> which fires forward translation + sentiment classifier in parallel, then a back-translation for QA. Live-tested at 2.5s for three real provider round-trips (would be ~5-6s sequential). Docs updated: - developer-guide/plugin-llm-access.md links both sync and async examples in the Reference section - user-guide/features/extending-the-dashboard.md repoints both demo sections to the companion repo with corrected install paths - user-guide/features/built-in-plugins.md drops the two demo rows - AGENTS.md notes that example plugins live in the companion repo Net: hermes-agent's plugins/ directory now contains only plugins users actually run (memory providers, dashboard tabs that ship real features, the disk-cleanup hook, platform adapters). All four demo / reference plugins live externally where they can be cloned on demand instead of inflating the core install.	2026-05-10 07:09:28 -07:00
Teknium	7312f7f849	feat(curator): hint at `hermes curator pin` in the rename block (#23212 ) Surfaces the pin command at the moment users care about it: when a consolidation just landed against their skill library and they're looking at the umbrella name in the curator output. Previously `hermes curator pin` existed but had no discovery surface — users only learned it existed by reading docs or stumbling onto `hermes curator --help`. The hint: archived 3 skill(s): • docx-extraction → document-tools • pdf-extraction → document-tools • old-stale — pruned (stale) full report: hermes curator status keep an umbrella stable: hermes curator pin document-tools Gated on having at least one consolidation that produced an umbrella. Pruned-only runs (nothing surviving to pin) skip the hint. When multiple umbrellas were produced, picks alphabetically first as a concrete example rather than listing them all. 3 new tests in tests/agent/test_curator_classification.py covering: consolidation produces hint with real umbrella name, pruned-only run omits it, multi-umbrella picks one example.	2026-05-10 06:44:53 -07:00
kshitijk4poor	44cdf555a8	fix(codex-spark): defensive 128k entry in DEFAULT_CONTEXT_LENGTHS + clarify validation test docstring Two follow-ups from self-review: 1. Add gpt-5.3-codex-spark to DEFAULT_CONTEXT_LENGTHS at 128k. The primary resolution path for Spark goes through provider='openai-codex' → _CODEX_OAUTH_CONTEXT_FALLBACK (already correct). But if any future code path resolves Spark's context with a different provider (custom proxy, generic fallthrough), the longest-substring-first lookup in step 8 would match 'gpt-5' and report 400k, which is wrong by ~3x. Adding the explicit override is a cheap defensive correctness fix matching how gpt-5.4-mini and gpt-5.4-nano already shadow the generic gpt-5 entry. 2. Update test_openai_codex_model_validation_fallback.py docstring. The bug it was originally written for (gpt-5.3-codex-spark missing from listing) is now resolved by this PR's catalog restoration. The test still validly exercises the soft-accept code path for any future entitlement-gated Codex slug that ships before Hermes catalogs it, but the framing was stale — clarified.	2026-05-09 23:17:25 -07:00
kshitij	9ee9a4297d	docs(codex-spark): document ChatGPT Pro entitlement gating PR #12994 stripped gpt-5.3-codex-spark on the assumption that it was unsupported. It's actually research-preview, ChatGPT-Pro-only, exposed via the Codex OAuth backend at chatgpt.com/backend-api/codex/models — not via the public OpenAI API. Add explanatory comments in: - DEFAULT_CODEX_MODELS / _FORWARD_COMPAT_TEMPLATE_MODELS (codex_models.py) - _CODEX_OAUTH_CONTEXT_FALLBACK (model_metadata.py) - list_authenticated_providers' live-discovery branch (model_switch.py) so future maintainers don't strip the entry again. Also documents the intentional asymmetry that Spark stays out of the "openai" provider catalog (it isn't on the public API) and why the supported_in_api filter is not applied for the openai-codex route.	2026-05-09 23:17:25 -07:00
olegdater	c6dc295a35	fix(model-metadata): set codex-spark fallback context to 128k	2026-05-09 23:17:25 -07:00
olegdater	2a6f3deb50	fix(model-metadata): restore gpt-5.3-codex-spark fallback context	2026-05-09 23:17:25 -07:00

1 2 3 4 5 ...

1033 Commits