Commit Graph

775 Commits

Author SHA1 Message Date
Teknium
68528068ec
fix(streaming): update stale-stream timer during Anthropic native streaming (#7117)
The _call_anthropic() streaming path never updated last_chunk_time during
the event loop — only once at stream start. The stale stream detector in
the outer poll loop uses this timer, so any Anthropic stream longer than
180s was killed even when events were actively arriving. This self-inflicted
a RemoteProtocolError that users saw as:

  '⚠️ Connection to provider dropped (RemoteProtocolError). Reconnecting…'

The _call_chat_completions() path already updates last_chunk_time on every
chunk (line 4475). This brings _call_anthropic() to parity.

Also adds deltas_were_sent tracking to the Anthropic text_delta path so
the retry loop knows not to retry after partial delivery (prevents
duplicated output on connection drops mid-stream).

Reported-by: Discord users (Castellani, Codename_11)
2026-04-10 03:34:56 -07:00
helix4u
5a8b5f149d fix(run-agent): rotate credential pool on billing-classified 400s 2026-04-10 03:27:19 -07:00
helix4u
9aedab00f4 fix(run_agent): recover primary client on openai transport errors 2026-04-10 03:21:24 -07:00
kshitijk4poor
9431f82aff fix: update Kimi Coding User-Agent to KimiCLI/1.30.0
The hardcoded User-Agent 'KimiCLI/1.3' is outdated — Kimi CLI is now at
v1.30.0. The stale version string causes intermittent 403 errors from
Kimi's coding endpoint ('only available for Coding Agents').

Update all 8 occurrences across run_agent.py, auxiliary_client.py, and
doctor.py to 'KimiCLI/1.30.0' to match the current official Kimi CLI.
2026-04-10 02:37:28 -07:00
Teknium
8779a268a7
feat: add Anthropic Fast Mode support to /fast command (#7037)
Extends the /fast command to support Anthropic's Fast Mode beta in addition
to OpenAI Priority Processing. When enabled on Claude Opus 4.6, adds
speed:"fast" and the fast-mode-2026-02-01 beta header to API requests for
~2.5x faster output token throughput.

Changes:
- hermes_cli/models.py: Add _ANTHROPIC_FAST_MODE_MODELS registry,
  model_supports_fast_mode() now recognizes Claude Opus 4.6,
  resolve_fast_mode_overrides() returns {speed: fast} for Anthropic
  vs {service_tier: priority} for OpenAI
- agent/anthropic_adapter.py: Add _FAST_MODE_BETA constant,
  build_anthropic_kwargs() accepts fast_mode=True which injects
  speed:fast + beta header via extra_headers (skipped for third-party
  Anthropic-compatible endpoints like MiniMax)
- run_agent.py: Pass fast_mode to build_anthropic_kwargs in the
  anthropic_messages path of _build_api_kwargs()
- cli.py: Update _handle_fast_command with provider-aware messaging
  (shows 'Anthropic Fast Mode' vs 'Priority Processing')
- hermes_cli/commands.py: Update /fast description to mention both
  providers
- tests: 13 new tests covering Anthropic model detection, override
  resolution, CLI availability, routing, adapter kwargs, and
  third-party endpoint safety
2026-04-10 02:32:15 -07:00
Teknium
871313ae2d
fix: clear conversation_history after mid-loop compression to prevent empty sessions (#7001)
After mid-loop compression (triggered by 413, context_overflow, or Anthropic
long-context tier errors), _compress_context() creates a new session in SQLite
and resets _last_flushed_db_idx=0. However, conversation_history was not cleared,
so _flush_messages_to_session_db() computed:

    flush_from = max(len(conversation_history=200), _last_flushed_db_idx=0) = 200
    messages[200:] → empty (compressed messages < 200)

This resulted in zero messages being written to the new session's SQLite store.
On resume, the user would see 'Session found but has no messages.'

The preflight compression path (line 7311) already had the fix:
    conversation_history = None

This commit adds the same clearing to the three mid-loop compression sites:
- Anthropic long-context tier overflow
- HTTP 413 payload too large
- Generic context_overflow error

Reported by Aaryan (Nous community).
2026-04-10 00:14:59 -07:00
Teknium
f783986f5a
fix: increase stream read timeout default to 120s, auto-raise for local LLMs (#6967)
Raise the default httpx stream read timeout from 60s to 120s for all
providers. Additionally, auto-detect local LLM endpoints (Ollama,
llama.cpp, vLLM) and raise the read timeout to HERMES_API_TIMEOUT
(1800s) since local models can take minutes for prefill on large
contexts before producing the first token.

The stale stream timeout already had this local auto-detection pattern;
the httpx read timeout was missing it — causing a hard 60s wall that
users couldn't find (HERMES_STREAM_READ_TIMEOUT was undocumented).

Changes:
- Default HERMES_STREAM_READ_TIMEOUT: 60s -> 120s
- Auto-detect local endpoints -> raise to 1800s (user override respected)
- Document HERMES_STREAM_READ_TIMEOUT and HERMES_STREAM_STALE_TIMEOUT
- Add 10 parametrized tests

Reported-by: Pavan Srinivas (@pavanandums)
2026-04-09 22:35:30 -07:00
emozilla
bda9aa17cb fix(streaming): prevent <think> in prose from suppressing response output
When the model mentions <think> as literal text in its response (e.g.
"(/think not producing <think> tags)"), the streaming display treated it
as a reasoning block opener and suppressed everything after it. The
response box would close with truncated content and no error — the API
response was complete but the display ate it.

Root cause: _stream_delta() matched <think> anywhere in the text stream
regardless of position. Real reasoning blocks always start at the
beginning of a line; mentions in prose appear mid-sentence.

Fix: track line position across streaming deltas with a
_stream_last_was_newline flag. Only enter reasoning suppression when
the tag appears at a block boundary (start of stream, after a newline,
or after only whitespace on the current line). Add a _flush_stream()
safety net that recovers buffered content if no closing tag is found
by end-of-stream.

Also fixes three related issues discovered during investigation:

- anthropic_adapter: _get_anthropic_max_output() now normalizes dots to
  hyphens so 'claude-opus-4.6' matches the 'claude-opus-4-6' table key
  (was returning 32K instead of 128K)

- run_agent: send explicit max_tokens for Claude models on Nous Portal,
  same as OpenRouter — both proxy to Anthropic's API which requires it.
  Without it the backend defaults to a low limit that truncates responses.

- run_agent: reset truncated_tool_call_retries after successful tool
  execution so a single truncation doesn't poison the entire conversation.
2026-04-09 22:16:36 -07:00
Teknium
8394b5ddd2
feat: expand /fast to all OpenAI Priority Processing models (#6960)
Previously /fast only supported gpt-5.4 and forced a provider switch to
openai-codex. Now supports all 13 models from OpenAI's Priority Processing
pricing table (gpt-5.4, gpt-5.4-mini, gpt-5.2, gpt-5.1, gpt-5, gpt-5-mini,
gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o3, o4-mini).

Key changes:
- Replaced _FAST_MODE_BACKEND_CONFIG with _PRIORITY_PROCESSING_MODELS frozenset
- Removed provider-forcing logic — service_tier is now injected into whatever
  API path the user is already on (Codex Responses, Chat Completions, or
  OpenRouter passthrough)
- Added request_overrides support to chat_completions path in run_agent.py
- Updated messaging from 'Codex inference tier' to 'Priority Processing'
- Expanded test coverage for all supported models
2026-04-09 22:06:30 -07:00
g-guthrie
d416a69288 feat: add Codex fast mode toggle (/fast command)
Add /fast slash command to toggle OpenAI Codex service_tier between
normal and priority ('fast') inference. Only exposed for models
registered in _FAST_MODE_BACKEND_CONFIG (currently gpt-5.4).

- Registry-based backend config for extensibility
- Dynamic command visibility (hidden from help/autocomplete for
  non-supported models) via command_filter on SlashCommandCompleter
- service_tier flows through request_overrides from route resolution
- Omit max_output_tokens for Codex backend (rejects it)
- Persists to config.yaml under agent.service_tier

Salvage cleanup: removed simple_term_menu/input() menu (banned),
bare /fast now shows status like /reasoning. Removed redundant
override resolution in _build_api_kwargs — single source of truth
via request_overrides from route.

Co-authored-by: Hermes Agent <hermes@nousresearch.com>
2026-04-09 21:54:32 -07:00
Teknium
b87d00288d
fix: add actionable hint for OpenRouter 'no tool endpoints' error
When OpenRouter returns 'No endpoints found that support tool use'
(HTTP 404), display a hint explaining that provider routing restrictions
may be filtering out tool-capable providers. Links the user directly
to the model's OpenRouter page to check which providers support tools.

The hint fires in the error display block that runs regardless of whether
fallback succeeds — so the user always understands WHY the model failed,
not just that it fell back.

Reported via Discord: GLM-5.1 on OpenRouter with US-based provider
restrictions eliminated all 4 tool-supporting endpoints (DeepInfra,
Z.AI, Friendli, Venice), leaving only 7 non-tool providers.
2026-04-09 18:03:09 -07:00
Brooklyn Nicholson
aa5b697a9d Merge branch 'main' of github.com:NousResearch/hermes-agent into feat/ink-refactor 2026-04-09 19:12:31 -05:00
Brooklyn Nicholson
aca479c1ae Merge branch 'feat/ink-refactor' of github.com:NousResearch/hermes-agent into feat/ink-refactor 2026-04-09 19:08:52 -05:00
Brooklyn Nicholson
b85ff282bc feat(ui-tui): slash command history/display, CoT fade, live skin switch, fix double reasoning 2026-04-09 19:08:47 -05:00
AIandI0x1
2d0d05a337 fix(agent): detect truncated streaming tool calls before execution
When a streaming response is cut mid-tool-call (connection drop, timeout),
the accumulated function.arguments is invalid JSON. The mock response
builder defaulted finish_reason to 'stop', so the agent loop treated it
as a valid completed turn and tried to execute tools with broken args.

Fix: validate tool call arguments with json.loads() during mock response
reconstruction. If any are invalid JSON, override finish_reason to
'length'. In the main loop's length handler, if tool calls are present,
refuse to execute and return partial=True with a clear error instead of
silently failing or wasting retries.

Also fixes _thinking_exhausted to not short-circuit when tool calls are
present — truncated tool calls are not thinking exhaustion.

Original cherry-picked from PR #6776 by AIandI0x1.
Closes #6638.
2026-04-09 17:03:54 -07:00
Austin Pickett
f805323517 chore: merge main 2026-04-09 20:00:34 -04:00
adybag14-cyber
a3aed1bd26 fix(termux): keep quiet chat output parseable 2026-04-09 16:24:53 -07:00
adybag14-cyber
4970705ed3 fix(termux): silence quiet chat tool previews 2026-04-09 16:24:53 -07:00
Brooklyn Nicholson
99fd3b518d feat: add /copy and /agents 2026-04-09 17:19:36 -05:00
KUSH42
34d06a9802 fix(compaction): don't halve context_length on output-cap-too-large errors
When the API returns "max_tokens too large given prompt" (input tokens
are within the context window, but input + requested output > window),
the old code incorrectly routed through the same handler as "prompt too
long" errors, calling get_next_probe_tier() and permanently halving
context_length. This made things worse: the window was fine, only the
requested output size needed trimming for that one call.

Two distinct error classes now handled separately:

  Prompt too long  — input itself exceeds context window.
    Fix: compress history + halve context_length (existing behaviour,
    unchanged).

  Output cap too large — input OK, but input + max_tokens > window.
    Fix: parse available_tokens from the error message, set a one-shot
    _ephemeral_max_output_tokens override for the retry, and leave
    context_length completely untouched.

Changes:
- agent/model_metadata.py: add parse_available_output_tokens_from_error()
  that detects Anthropic's "available_tokens: N" error format and returns
  the available output budget, or None for all other error types.
- run_agent.py: call the new parser first in the is_context_length_error
  block; if it fires, set _ephemeral_max_output_tokens (with a 64-token
  safety margin) and break to retry without touching context_length.
  _build_api_kwargs consumes the ephemeral value exactly once then clears
  it so subsequent calls use self.max_tokens normally.
- agent/anthropic_adapter.py: expand build_anthropic_kwargs docstring to
  clearly document the max_tokens (output cap) vs context_length (total
  window) distinction, which is a persistent source of confusion due to
  the OpenAI-inherited "max_tokens" name.
- cli-config.yaml.example: add inline comments explaining both keys side
  by side where users are most likely to look.
- website/docs/integrations/providers.md: add a callout box at the top
  of "Context Length Detection" and clarify the troubleshooting entry.
- tests/test_ctx_halving_fix.py: 24 tests across four classes covering
  the parser, build_anthropic_kwargs clamping, ephemeral one-shot
  consumption, and the invariant that context_length is never mutated
  on output-cap errors.
2026-04-09 11:27:41 -07:00
Yang Zhi
019c11d07e fix(fallback): preserve provider-specific headers when activating fallback
When _try_activate_fallback() swaps to a new provider (e.g.
kimi-coding), resolve_provider_client() correctly injects
provider-specific default_headers (like KimiCLI User-Agent) into the
returned OpenAI client. However, _client_kwargs was saved with only
api_key and base_url, dropping those headers.

Every subsequent API call rebuilds the client from _client_kwargs via
_create_request_openai_client(), producing a bare OpenAI client without
the required headers. Kimi Coding rejects this with 403; Copilot would
lose its auth headers similarly.

This patch reads _custom_headers from the fallback client (where the
OpenAI SDK stores the default_headers kwarg) and includes them in
_client_kwargs so any client rebuild preserves provider-specific headers.

Fixes #6075
2026-04-09 11:11:25 -07:00
Brooklyn Nicholson
b2ea9b4176 Merge branch 'main' of github.com:NousResearch/hermes-agent into feat/ink-refactor 2026-04-09 12:31:20 -05:00
Teknium
268ee6bdce
fix: add turn-exit diagnostic logging to agent loop (#6549)
Every turn now logs WHY the agent loop ended to agent.log with a
structured INFO line capturing: exit reason, model, api_calls/max,
budget usage, tool turn count, last message role, response length,
and session ID.

When the last message is a tool result and the turn was NOT
interrupted, emits WARNING level (visible in errors.log) — this is
the 'just stops' scenario users report where a tool call completes
but no continuation or final response follows.

10 tracked exit reasons: text_response, interrupted_by_user,
interrupted_during_api_call, budget_exhausted, max_iterations_reached,
all_retries_exhausted_no_response, fallback_prior_turn_content,
empty_response_exhausted, error_near_max_iterations, unknown.
2026-04-09 04:15:22 -07:00
Teknium
1a3ae6ac6e
feat: structured API error classification for smart failover (#6514)
Add agent/error_classifier.py with a priority-ordered classification
pipeline that replaces scattered inline string-matching in the retry
loop with structured error taxonomy and recovery hints.

FailoverReason enum (14 categories): auth, auth_permanent, billing,
rate_limit, overloaded, server_error, timeout, context_overflow,
payload_too_large, model_not_found, format_error, thinking_signature,
long_context_tier, unknown.

ClassifiedError dataclass carries reason + recovery action hints
(retryable, should_compress, should_rotate_credential, should_fallback).

Key improvements over inline matching:
- 402 disambiguation: 'insufficient credits' = billing (immediate rotate),
  'usage limit, try again' = rate_limit (backoff first)
- OpenRouter 403 'key limit exceeded' correctly classified as billing
- Error cause chain walking (walks __cause__/__context__ up to 5 levels)
- Body message included in pattern matching (SDK str() misses it)
- Server disconnect + large session check ordered before generic transport
  catch so RemoteProtocolError triggers compression when appropriate
- Chinese error message support for context overflow

run_agent.py: replaced 6 inline detection blocks with classifier calls,
net -55 lines. All recovery actions (pool rotation, fallback activation,
compression, transport recovery) unchanged.

65 new unit tests + 10 E2E tests + live tests with real SDK error objects.
Inspired by OpenClaw's failover error classification system.
2026-04-09 04:10:11 -07:00
Teknium
8dfc96dbbb
feat: capture provider rate limit headers and show in /usage (#6541)
Parse x-ratelimit-* headers from inference API responses (Nous Portal,
OpenRouter, OpenAI-compatible) and display them in the /usage command.

- New agent/rate_limit_tracker.py: parse 12 rate limit headers (RPM/RPH/
  TPM/TPH limits, remaining, reset timers), format as progress bars (CLI)
  or compact one-liner (gateway)
- Hook into streaming path in run_agent.py: stream.response.headers is
  available on the OpenAI SDK Stream object before chunks are consumed
- CLI /usage: appends rate limit section with progress bars + warnings
  when any bucket exceeds 80%
- Gateway /usage: appends compact rate limit summary
- 24 unit tests covering parsing, formatting, edge cases

Headers captured per response:
  x-ratelimit-{limit,remaining,reset}-{requests,tokens}{,-1h}

Example CLI display:
  Nous Rate Limits (captured just now):
    Requests/min [░░░░░░░░░░░░░░░░░░░░]  0.1%  1/800 used  (799 left, resets in 59s)
    Tokens/hr    [░░░░░░░░░░░░░░░░░░░░]  0.0%  49/336.0M   (336.0M left, resets in 52m)
2026-04-09 03:43:14 -07:00
Teknium
1eabbe905e
fix: retry 3 times when model returns truly empty response (#6488)
When a model returns no content, no structured reasoning, and no tool
calls (common with open models), the agent now silently retries up to
3 times before falling through to (empty).

Silent retry (no synthetic messages) keeps the conversation history
clean, preserves prompt caching, and respects the no-synthetic-user-
injection invariant.  Most empty responses from open models are
transient (provider hiccups, rate limits, sampling flukes) so a
simple retry is sufficient.

This fills the last gap in the empty-response recovery chain:
1. _last_content_with_tools fallback (prior tool turn had content)
2. Thinking-only prefill continuation (#5931 — structured reasoning)
3. Empty response silent retry (NEW — truly empty, no reasoning)
4. (empty) terminal (last resort after all retries exhausted)

Inline <think> blocks are excluded — the model chose to reason, it
just produced no visible text.  That differs from truly empty.

Tests:
- Updated test_truly_empty to expect 4 API calls (1 + 3 retries)
- Added test_truly_empty_response_succeeds_on_nudge
2026-04-09 02:06:12 -07:00
Brooklyn Nicholson
54bd25ff4a fix(tui): -c resume, ctrl z, pasting updates, exit summary, session fix 2026-04-09 00:36:53 -05:00
angelos
e7d3e9d767 fix(terminal): persistent sandbox envs survive between turns
`_cleanup_task_resources` was unconditionally calling `cleanup_vm()` at
the end of every `run_conversation` (i.e. every user turn), tearing down
the docker/daytona/modal sandbox container regardless of its
`persistent_filesystem` setting. This contradicted the documented intent
of `terminal.lifetime_seconds` (idle reaper) and `container_persistent`,
and caused per-turn loss of `/workspace`, `~/.config`, agent CLI auth
state, and any other content living inside the sandbox.

The unconditional teardown was introduced in fbd3a2fd ("prevent leakage
of morph instances between tasks", 2025-11-04) to plug a Morph backend
leak, two days after `lifetime_seconds` shipped in faecbddd. It was
later refactored into `_cleanup_task_resources` in 70dd3a16 without
changing semantics. Code and docs have disagreed since.

Fix: introduce `terminal_tool.is_persistent_env(task_id)` and skip the
per-turn `cleanup_vm` when the active env is persistent. The idle reaper
(`_cleanup_inactive_envs`) still tears persistent envs down once
`terminal.lifetime_seconds` is exceeded. Non-persistent backends (Morph)
are unchanged — still torn down per turn, preserving the original
leak-prevention intent.
2026-04-08 21:31:57 -07:00
Teknium
54db7cbbe1
fix(agent): tiered context pressure warnings + gateway dedup (#6411)
Combines the approaches from PR #6309 (duan78) and PR #5963 (KUSH42):

Tiered warnings (from #5963):
- Replaces boolean _context_pressure_warned with float _context_pressure_warned_at
- Fires at 85% (orange) and re-fires at 95% (red/critical)
- Adds 'compacting context...' status message before compression

Gateway dedup (from #6309):
- Class-level dict _context_pressure_last_warned survives across AIAgent
  instances (gateway creates a new instance per message)
- 5-minute cooldown per session prevents warning spam
- Higher-tier warnings bypass the cooldown (85% → 95% always fires)
- Compression reset clears the dedup entry for the session
- Stale entries evicted (older than 2x cooldown) to prevent memory leak

Does NOT inject into messages — purely user-facing via _safe_print (CLI)
and status_callback (gateway). Zero prompt cache impact.

Fixes #6309. Fixes #5963.
2026-04-08 21:31:44 -07:00
SHL0MS
8567031433 fix: improve context compression quality — named constants, tool tracking, degradation warning
Three targeted improvements to the compression system:

1. Replace hardcoded truncation limits with named class constants
   (_CONTENT_MAX=6000, _CONTENT_HEAD=4000, _CONTENT_TAIL=1500,
   _TOOL_ARGS_MAX=1500, _TOOL_ARGS_HEAD=1200). Previous limits
   (3000/500) heavily truncated the summarizer's input — a 200-line
   edit got cut to 3000 chars before the summarizer ever saw it.

2. Add '## Tools & Patterns' section to both compression prompt
   templates (first-pass and iterative). Preserves working tool
   invocations, preferred flags, and tool-specific discoveries
   across compaction boundaries.

3. Warn users on 2nd+ compression: 'Session compressed N times —
   accuracy may degrade. Consider /new to start fresh.'

Ref #499
2026-04-08 20:54:23 -07:00
Teknium
ae4a884e8d
fix(agent): disable stale stream timeout for local providers (#6368)
Local inference providers (Ollama, oMLX, llama-cpp) can take 300+ seconds
for prefill on large contexts. The 180s stale stream detector was killing
these connections while the provider was still processing.

Uses the existing is_local_endpoint() (proper URL parsing with RFC-1918,
localhost, WSL detection) instead of ad-hoc substring matching. The stale
timeout is only disabled when the user hasn't explicitly set
HERMES_STREAM_STALE_TIMEOUT — explicit user config is always honored.

Fixes #5889
2026-04-08 19:53:39 -07:00
konsisumer
42e366f27b fix(agent): respect config timeout for flush_memories instead of hardcoded 30s
The _call_llm() and direct OpenAI fallback paths in flush_memories() both
hardcoded timeout=30.0, ignoring the user-configurable value at
auxiliary.flush_memories.timeout in config.yaml.

Remove the explicit timeout from the auxiliary _call_llm() call so that
_get_task_timeout('flush_memories') reads from config. For the direct
OpenAI fallback, import and use _get_task_timeout() instead of the
hardcoded value.

Add two regression tests verifying both code paths respect the config.

Fixes #6154
2026-04-08 18:55:33 -07:00
kshitijk4poor
875a72e4c8 fix: normalize httpx.URL base_url + strip thinking signatures for third-party endpoints
Two linked fixes for MiniMax Anthropic-compatible fallback:

1. Normalize httpx.URL to str before calling .rstrip() in auth/provider
   detection helpers. Some client objects expose base_url as httpx.URL,
   not str — crashed with AttributeError in _requires_bearer_auth() and
   _is_third_party_anthropic_endpoint(). Also fixes _try_activate_fallback()
   to use the already-stringified fb_base_url instead of raw httpx.URL.

2. Strip Anthropic-proprietary thinking block signatures when targeting
   third-party Anthropic-compatible endpoints (MiniMax, Azure AI Foundry,
   self-hosted proxies). These endpoints cannot validate Anthropic's
   signatures and reject them with HTTP 400 'Invalid signature in
   thinking block'. Now threads base_url through convert_messages_to_anthropic()
   → build_anthropic_kwargs() so signature management is endpoint-aware.

Based on PR #4945 by kshitijk4poor (rstrip fix).
Fixes #4944.
2026-04-08 16:39:29 -07:00
kshitijk4poor
3377017eb4 feat(qwen): add Qwen OAuth provider with portal request support
Based on #6079 by @tunamitom with critical fixes and comprehensive tests.

Changes from #6079:
- Fix: sanitization overwrite bug — Qwen message prep now runs AFTER codex
  field sanitization, not before (was silently discarding Qwen transforms)
- Fix: missing try/except AuthError in runtime_provider.py — stale Qwen
  credentials now fall through to next provider on auto-detect
- Fix: 'qwen' alias conflict — bare 'qwen' stays mapped to 'alibaba'
  (DashScope); use 'qwen-portal' or 'qwen-cli' for the OAuth provider
- Fix: hardcoded ['coder-model'] replaced with live API fetch + curated
  fallback list (qwen3-coder-plus, qwen3-coder)
- Fix: extract _is_qwen_portal() helper + _qwen_portal_headers() to replace
  5 inline 'portal.qwen.ai' string checks and share headers between init
  and credential swap
- Fix: add Qwen branch to _apply_client_headers_for_base_url for mid-session
  credential swaps
- Fix: remove suspicious TypeError catch blocks around _prompt_provider_choice
- Fix: handle bare string items in content lists (were silently dropped)
- Fix: remove redundant dict() copies after deepcopy in message prep
- Revert: unrelated ai-gateway test mock removal and model_switch.py comment deletion

New tests (30 test functions):
- _qwen_cli_auth_path, _read_qwen_cli_tokens (success + 3 error paths)
- _save_qwen_cli_tokens (roundtrip, parent creation, permissions)
- _qwen_access_token_is_expiring (5 edge cases: fresh, expired, within skew,
  None, non-numeric)
- _refresh_qwen_cli_tokens (success, preserve old refresh, 4 error paths,
  default expires_in, disk persistence)
- resolve_qwen_runtime_credentials (fresh, auto-refresh, force-refresh,
  missing token, env override)
- get_qwen_auth_status (logged in, not logged in)
- Runtime provider resolution (direct, pool entry, alias)
- _build_api_kwargs (metadata, vl_high_resolution_images, message formatting,
  max_tokens suppression)
2026-04-08 13:46:30 -07:00
Teknium
1368caf66f
fix(anthropic): smart thinking block signature management (#6112)
Anthropic signs thinking blocks against the full turn content. Any
upstream mutation (context compression, session truncation, orphan
stripping, message merging) invalidates the signature, causing HTTP 400
'Invalid signature in thinking block' — especially in long-lived
gateway sessions.

Strategy (following clawdbot/OpenClaw pattern):

1. Strip thinking/redacted_thinking from all assistant messages EXCEPT
   the last one — preserves reasoning continuity on the current
   tool-use chain while avoiding stale signature errors on older turns.

2. Downgrade unsigned thinking blocks to plain text — Anthropic can't
   validate them, but the reasoning content is preserved.

3. Strip cache_control from thinking/redacted_thinking blocks to
   prevent cache markers from interfering with signature validation.

4. Drop thinking blocks from the second message when merging
   consecutive assistant messages (role alternation enforcement).

5. Error recovery: on HTTP 400 mentioning 'signature' and 'thinking',
   strip all reasoning_details from the conversation and retry once.
   This is the safety net for edge cases the proactive stripping
   misses.

Addresses the issue reported in PR #6086 by @mingginwan while
preserving reasoning continuity (their PR stripped ALL thinking
blocks unconditionally).

Files changed:
- agent/anthropic_adapter.py: thinking block management in
  convert_messages_to_anthropic (strip old turns, downgrade unsigned,
  strip cache_control, merge-time strip)
- run_agent.py: one-shot signature error recovery in retry loop
- tests/test_anthropic_adapter.py: 10 new tests covering all cases
2026-04-08 03:38:08 -07:00
alt-glitch
65e24c942e wip: tool result fixes -- persistence 2026-04-08 02:24:32 -07:00
zocomputer
e1befe5077 feat(agent): add jittered retry backoff
Adds agent/retry_utils.py with jittered_backoff() — exponential backoff
with additive jitter to prevent thundering-herd retry spikes when
multiple gateway sessions hit the same rate-limited provider.

Replaces fixed exponential backoff at 4 call sites:
- run_agent.py: None-choices retry path (5s base, 120s cap)
- run_agent.py: API error retry path (2s base, 60s cap)
- trajectory_compressor.py: sync + async summarization retries

Thread-safe jitter counter with overflow guards ensures unique seeds
across concurrent retries.

Trimmed from original PR to keep only wired-in functionality.

Co-authored-by: martinp09 <martinp09@users.noreply.github.com>
2026-04-08 00:41:36 -07:00
Teknium
5c03f2e7cc
fix: provider/model resolution — salvage 4 PRs + MiniMax aux URL fix (#5983)
Salvaged fixes from community PRs:

- fix(model_switch): _read_auth_store → _load_auth_store + fix auth store
  key lookup (was checking top-level dict instead of store['providers']).
  OAuth providers now correctly detected in /model picker.
  Cherry-picked from PR #5911 by Xule Lin (linxule).

- fix(ollama): pass num_ctx to override 2048 default context window.
  Ollama defaults to 2048 context regardless of model capabilities. Now
  auto-detects from /api/show metadata and injects num_ctx into every
  request. Config override via model.ollama_num_ctx. Fixes #2708.
  Cherry-picked from PR #5929 by kshitij (kshitijk4poor).

- fix(aux): normalize provider aliases for vision/auxiliary routing.
  Adds _normalize_aux_provider() with 17 aliases (google→gemini,
  claude→anthropic, glm→zai, etc). Fixes vision routing failure when
  provider is set to 'google' instead of 'gemini'.
  Cherry-picked from PR #5793 by e11i (Elizabeth1979).

- fix(aux): rewrite MiniMax /anthropic base URLs to /v1 for OpenAI SDK.
  MiniMax's inference_base_url ends in /anthropic (Anthropic Messages API),
  but auxiliary client uses OpenAI SDK which appends /chat/completions →
  404 at /anthropic/chat/completions. Generic _to_openai_base_url() helper
  rewrites terminal /anthropic to /v1 for OpenAI-compatible endpoint.
  Inspired by PR #5786 by Lempkey.

Added debug logging to silent exception blocks across all fixes.

Co-authored-by: Hermes Agent <hermes@nousresearch.com>
2026-04-07 22:23:28 -07:00
lesterli
37bf19a29d fix(codex): align validation with normalization for empty stream output
The response validation stage unconditionally marked Codex Responses API
replies as invalid when response.output was empty, triggering unnecessary
retries and fallback chains. However, _normalize_codex_response can
recover from this state by synthesizing output from response.output_text.

Now the validation stage checks for output_text before marking the
response invalid, matching the normalization logic. Also fixes
logging.warning → logger.warning for consistency with the rest of the
file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 17:29:41 -07:00
Teknium
ab8f9c089e
feat: thinking-only prefill continuation for structured reasoning responses (#5931)
When the model produces structured reasoning (via API fields like .reasoning,
.reasoning_content, .reasoning_details) but no visible text content, append
the assistant message as prefill and continue the loop. The model sees its own
reasoning context on the next turn and produces the text portion.

Inspired by clawdbot's 'incomplete-text' recovery pattern. Up to 2 prefill
attempts before falling through to the existing '(empty)' terminal.

Key design decisions:
- Only triggers for structured reasoning (API fields), NOT inline <think> tags
- Prefill messages are popped on success to maintain strict role alternation
- _thinking_prefill marker stripped from all API message building paths
- Works across all providers: OpenAI (continuation), Anthropic (native prefill)

Verified with E2E tests: simulated thinking-only → real OpenRouter continuation
produces correct content. Also confirmed Qwen models consistently produce
structured-reasoning-only responses under token pressure.
2026-04-07 13:19:06 -07:00
Teknium
ca0459d109
refactor: remove 24 confirmed dead functions — 432 lines of unused code
Each function was verified to have exactly 1 reference in the entire
codebase (its own definition). Zero calls, zero imports, zero string
references anywhere including tests.

Removed by category:

Superseded wrappers (replaced by newer implementations):
- agent/anthropic_adapter.py: run_hermes_oauth_login, refresh_hermes_oauth_token
- hermes_cli/callbacks.py: sudo_password_callback (superseded by CLI method)
- hermes_cli/setup.py: _set_model_provider, _sync_model_from_disk
- tools/file_tools.py: get_file_tools (superseded by registry.register)
- tools/cronjob_tools.py: get_cronjob_tool_definitions (same)
- tools/terminal_tool.py: _check_dangerous_command (_check_all_guards used)

Dead private helpers (lost their callers during refactors):
- agent/anthropic_adapter.py: _convert_user_content_part_to_anthropic
- agent/display.py: honcho_session_line, write_tty
- hermes_cli/providers.py: _build_labels (+ dead _labels_cache var)
- hermes_cli/tools_config.py: _prompt_yes_no
- hermes_cli/models.py: _extract_model_ids
- hermes_cli/uninstall.py: log_error
- gateway/platforms/feishu.py: _is_loop_ready
- tools/file_operations.py: _read_image (64-line method)
- tools/process_registry.py: cleanup_expired
- tools/skill_manager_tool.py: check_skill_manage_requirements

Dead class methods (zero callers):
- run_agent.py: _is_anthropic_url (logic duplicated inline at L618)
- run_agent.py: _classify_empty_content_response (68-line method, never wired)
- cli.py: reset_conversation (callers all use new_session directly)
- cli.py: _clear_current_input (added but never wired in)

Other:
- gateway/delivery.py: build_delivery_context_for_tool
- tools/browser_tool.py: get_active_browser_sessions
2026-04-07 11:41:26 -07:00
Teknium
69c753c19b
fix: thread gateway user_id to memory plugins for per-user scoping (#5895)
Memory plugins (Mem0, Honcho) used static identifiers ('hermes-user',
config peerName) meaning all gateway users shared the same memory bucket.

Changes:
- AIAgent.__init__: add user_id parameter, store as self._user_id
- run_agent.py: include user_id in _init_kwargs passed to memory providers
- gateway/run.py: pass source.user_id to AIAgent in primary + background paths
- Mem0 plugin: prefer kwargs user_id over config default
- Honcho plugin: override cfg.peer_name with gateway user_id when present

CLI sessions (user_id=None) preserve existing defaults. Only gateway
sessions with a real platform user_id get per-user memory scoping.

Reported by plev333.
2026-04-07 11:14:12 -07:00
Teknium
d0ffb111c2
refactor: codebase-wide lint cleanup — unused imports, dead code, and inefficient patterns (#5821)
Comprehensive cleanup across 80 files based on automated (ruff, pyflakes, vulture)
and manual analysis of the entire codebase.

Changes by category:

Unused imports removed (~95 across 55 files):
- Removed genuinely unused imports from all major subsystems
- agent/, hermes_cli/, tools/, gateway/, plugins/, cron/
- Includes imports in try/except blocks that were truly unused
  (vs availability checks which were left alone)

Unused variables removed (~25):
- Removed dead variables: connected, inner, channels, last_exc,
  source, new_server_names, verify, pconfig, default_terminal,
  result, pending_handled, temperature, loop
- Dropped unused argparse subparser assignments in hermes_cli/main.py
  (12 instances of add_parser() where result was never used)

Dead code removed:
- run_agent.py: Removed dead ternary (None if False else None) and
  surrounding unreachable branch in identity fallback
- run_agent.py: Removed write-only attribute _last_reported_tool
- hermes_cli/providers.py: Removed dead @property decorator on
  module-level function (decorator has no effect outside a class)
- gateway/run.py: Removed unused MCP config load before reconnect
- gateway/platforms/slack.py: Removed dead SessionSource construction

Undefined name bugs fixed (would cause NameError at runtime):
- batch_runner.py: Added missing logger = logging.getLogger(__name__)
- tools/environments/daytona.py: Added missing Dict and Path imports

Unnecessary global statements removed (14):
- tools/terminal_tool.py: 5 functions declared global for dicts
  they only mutated via .pop()/[key]=value (no rebinding)
- tools/browser_tool.py: cleanup thread loop only reads flag
- tools/rl_training_tool.py: 4 functions only do dict mutations
- tools/mcp_oauth.py: only reads the global
- hermes_time.py: only reads cached values

Inefficient patterns fixed:
- startswith/endswith tuple form: 15 instances of
  x.startswith('a') or x.startswith('b') consolidated to
  x.startswith(('a', 'b'))
- len(x)==0 / len(x)>0: 13 instances replaced with pythonic
  truthiness checks (not x / bool(x))
- in dict.keys(): 5 instances simplified to in dict
- Redefined unused name: removed duplicate _strip_mdv2 import in
  send_message_tool.py

Other fixes:
- hermes_cli/doctor.py: Replaced undefined logger.debug() with pass
- hermes_cli/config.py: Consolidated chained .endswith() calls

Test results: 3934 passed, 17 failed (all pre-existing on main),
19 skipped. Zero regressions.
2026-04-07 10:25:31 -07:00
Teknium
2021442c8a
fix: cover remaining codex empty-output gaps in fallback + normalizer (#5724)
Two gaps in the codex empty-output handling:

1. _run_codex_create_stream_fallback() skipped all non-terminal events,
   so when the fallback path was used and the terminal response had
   empty output, there was no recovery. Now collects output_item.done
   and text deltas during the fallback stream, backfills on empty output.

2. _normalize_codex_response() hard-crashed with RuntimeError when
   output was empty, even when the response had output_text set. The
   function already had fallback logic at line 3562 to use output_text,
   but the guard at line 3446 killed it first. Now checks output_text
   before raising and synthesizes a minimal output item.
2026-04-06 20:58:47 -07:00
Teknium
0e336b0e71
fix: backfill codex stream output from output_item.done events (#5689)
Salvages the core fix from PR #5673 (egerev) onto current main.

The chatgpt.com/backend-api/codex endpoint streams valid output items
via response.output_item.done events, but the OpenAI SDK's
get_final_response() returns an empty output list. This caused every
Codex response to be rejected as invalid.

Fix: collect output_item.done events during streaming and backfill
response.output when get_final_response() returns empty. Falls back
to synthesizing from text deltas when no done events were received.

Also moves the synthesis logic from the validation loop (too late, from
#5681) into _run_codex_stream() (before the response leaves the
streaming function), and simplifies the validation to just log
diagnostics since recovery now happens upstream.

Co-authored-by: Egor <egerev@users.noreply.github.com>
2026-04-06 18:19:30 -07:00
Teknium
dc4c07ed9d
fix: codex OAuth credential pool disconnect + expired token import (#5681)
Three bugs causing OpenAI Codex sessions to fail silently:

1. Credential pool vs legacy store disconnect: hermes auth and hermes
   model store device_code tokens in the credential pool, but
   get_codex_auth_status(), resolve_codex_runtime_credentials(), and
   _model_flow_openai_codex() only read from the legacy provider state.
   Fresh pool tokens were invisible to the auth status checks and model
   selection flow.

2. _import_codex_cli_tokens() imported expired tokens from ~/.codex/
   without checking JWT expiry. Combined with _login_openai_codex()
   saying 'Login successful!' for expired credentials, users got stuck
   in a loop of dead tokens being recycled.

3. _login_openai_codex() accepted expired tokens from
   resolve_codex_runtime_credentials() without validating expiry before
   telling the user login succeeded.

Fixes:
- get_codex_auth_status() now checks credential pool first, falls back
  to legacy provider state
- _model_flow_openai_codex() uses pool-aware auth status for token
  retrieval when fetching model lists
- _import_codex_cli_tokens() validates JWT exp claim, rejects expired
- _login_openai_codex() verifies resolved token isn't expiring before
  accepting existing credentials
- _run_codex_stream() logs response.incomplete/failed terminal events
  with status and incomplete_details for diagnostics
- Codex empty output recovery: captures streamed text during streaming
  and synthesizes a response when get_final_response() returns empty
  output (handles chatgpt.com backend-api edge cases)
2026-04-06 18:10:33 -07:00
Teknium
8cf013ecd9
fix: replace stale 'hermes login' refs with 'hermes auth' + fix credential removal re-seeding (#5670)
Two fixes:

1. Replace all stale 'hermes login' references with 'hermes auth' across
   auth.py, auxiliary_client.py, delegate_tool.py, config.py, run_agent.py,
   and documentation. The 'hermes login' command was deprecated; 'hermes auth'
   now handles OAuth credential management.

2. Fix credential removal not persisting for singleton-sourced credentials
   (device_code for openai-codex/nous, hermes_pkce for anthropic).
   auth_remove_command already cleared env vars for env-sourced credentials,
   but singleton credentials stored in the auth store were re-seeded by
   _seed_from_singletons() on the next load_pool() call. Now clears the
   underlying auth store entry when removing singleton-sourced credentials.
2026-04-06 17:17:57 -07:00
tymrtn
40527ff5e3 fix(auth): actionable error message when Codex refresh token is reused
When the Codex CLI (or VS Code extension) consumes a refresh token before
Hermes can use it, Hermes previously surfaced a generic 401 error with no
actionable guidance.

- In `refresh_codex_oauth_pure`: detect `refresh_token_reused` from the
  OAuth endpoint and raise an AuthError explaining the cause and the exact
  steps to recover (run `codex` to refresh, then `hermes login`).
- In `run_agent.py`: when provider is `openai-codex` and HTTP 401 is
  received, show Codex-specific recovery steps instead of the generic
  "check your API key" message.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 16:50:10 -07:00
Julien Talbot
92c19924a9 feat: add xAI prompt caching via x-grok-conv-id header
When using xAI's API directly (base_url contains x.ai), send the
x-grok-conv-id header set to the Hermes session_id. This routes
consecutive requests to the same server, maximizing automatic
prompt cache hits.

Ref: https://docs.x.ai/developers/advanced-api-usage/prompt-caching
2026-04-06 12:06:33 -07:00
ClintonEmok
f77be22c65 Fix #5211: Preserve dots in OpenCode Go model names
OpenCode Go model names with dots (minimax-m2.7, glm-4.5, kimi-k2.5)
were being mangled to hyphens (minimax-m2-7), causing HTTP 401 errors.

Two code paths were affected:
1. model_normalize.py: opencode-go was incorrectly in DOT_TO_HYPHEN_PROVIDERS
2. run_agent.py: _anthropic_preserve_dots() did not check for opencode-go

Fix:
- Remove opencode-go from _DOT_TO_HYPHEN_PROVIDERS (dots are correct for Go)
- Add opencode-go to _anthropic_preserve_dots() provider check
- Add opencode.ai/zen/go to base_url fallback check
- Add regression tests in tests/test_model_normalize.py

Co-authored-by: jacob3712 <jacob3712@users.noreply.github.com>
2026-04-06 11:25:06 -07:00
Teknium
9c96f669a1
feat: centralized logging, instrumentation, hermes logs CLI, gateway noise fix (#5430)
Adds comprehensive logging infrastructure to Hermes Agent across 4 phases:

**Phase 1 — Centralized logging**
- New hermes_logging.py with idempotent setup_logging() used by CLI, gateway, and cron
- agent.log (INFO+) and errors.log (WARNING+) with RotatingFileHandler + RedactingFormatter
- config.yaml logging: section (level, max_size_mb, backup_count)
- All entry points wired (cli.py, main.py, gateway/run.py, run_agent.py)
- Fixed debug_helpers.py writing to ./logs/ instead of ~/.hermes/logs/

**Phase 2 — Event instrumentation**
- API calls: model, provider, tokens, latency, cache hit %
- Tool execution: name, duration, result size (both sequential + concurrent)
- Session lifecycle: turn start (session/model/provider/platform), compression (before/after)
- Credential pool: rotation events, exhaustion tracking

**Phase 3 — hermes logs CLI command**
- hermes logs / hermes logs -f / hermes logs errors / hermes logs gateway
- --level, --session, --since filters
- hermes logs list (file sizes + ages)

**Phase 4 — Gateway bug fix + noise reduction**
- fix: _async_flush_memories() called with wrong arg count — sessions never flushed
- Batched session expiry logs: 6 lines/cycle → 2 summary lines
- Added inbound message + response time logging

75 new tests, zero regressions on the full suite.
2026-04-06 00:08:20 -07:00
Teknium
dc9c3cac87 chore: remove redundant local import of normalize_usage
Already imported at module level (line 94). The local import inside
_usage_summary_for_api_request_hook was unnecessary.
2026-04-05 23:31:29 -07:00
kshitijk4poor
f530ef1835 feat(plugins): pre_api_request/post_api_request with narrow payloads
- Rename per-LLM-call hooks from pre_llm_request/post_llm_request for clarity vs pre_llm_call
- Emit summary kwargs only (counts, usage dict from normalize_usage); keep env_var_enabled for HERMES_DUMP_REQUESTS
- Add is_truthy_value/env_var_enabled to utils; wire hermes_cli.plugins._env_enabled through it
- Update Langfuse local setup doc; add scripts/langfuse_smoketest.py and optional ~/.hermes plugin tests

Made-with: Cursor
2026-04-05 23:31:29 -07:00
kshitijk4poor
9e820dda37 Add request-scoped plugin lifecycle hooks 2026-04-05 23:31:29 -07:00
Teknium
9ca954a274
fix: mem0 API v2 compat, prefetch context fencing, secret redaction (#5423)
Consolidated salvage from PRs #5301 (qaqcvc), #5339 (lance0),
#5058 and #5098 (maymuneth).

Mem0 API v2 compatibility (#5301):
- All reads use filters={user_id: ...} instead of bare user_id= kwarg
- All writes use filters with user_id + agent_id for attribution
- Response unwrapping for v2 dict format {results: [...]}
- Split _read_filters() vs _write_filters() — reads are user-scoped
  only for cross-session recall, writes include agent_id
- Preserved 'hermes-user' default (no breaking change for existing users)
- Omitted run_id scoping from #5301 — cross-session memory is Mem0's
  core value, session-scoping reads would defeat that purpose

Memory prefetch context fencing (#5339):
- Wraps prefetched memory in <memory-context> fenced blocks with system
  note marking content as recalled context, NOT user input
- Sanitizes provider output to strip fence-escape sequences, preventing
  injection where memory content breaks out of the fence
- API-call-time only — never persisted to session history

Secret redaction (#5058, #5098):
- Added prefix patterns for Groq (gsk_), Matrix (syt_), RetainDB
  (retaindb_), Hindsight (hsk-), Mem0 (mem0_), ByteRover (brv_)
2026-04-05 22:43:33 -07:00
Teknium
0efe7dace7
feat: add GPT/Codex execution discipline guidance for tool persistence (#5414)
Adds OPENAI_MODEL_EXECUTION_GUIDANCE — XML-tagged behavioral guidance
injected for GPT and Codex models alongside the existing tool-use
enforcement. Targets four specific failure modes:

- <tool_persistence>: retry on empty/partial results instead of giving up
- <prerequisite_checks>: do discovery/lookup before jumping to final action
- <verification>: check correctness/grounding/formatting before finalizing
- <missing_context>: use lookup tools instead of hallucinating

Follows the same injection pattern as GOOGLE_MODEL_OPERATIONAL_GUIDANCE
for Gemini/Gemma models. Inspired by OpenClaw PR #38953 and OpenAI's
GPT-5.4 prompting guide patterns.
2026-04-05 21:51:07 -07:00
Teknium
2563493466
fix: improve timeout debug logging and user-facing diagnostics (#5370)
Agent activity tracking:
- Add _last_activity_ts, _last_activity_desc, _current_tool to AIAgent
- Touch activity on: API call start/complete, tool start/complete,
  first stream chunk, streaming request start
- Public get_activity_summary() method for external consumers

Gateway timeout diagnostics:
- Timeout message now includes what the agent was doing when killed:
  actively working vs stuck on a tool vs waiting on API response
- Includes iteration count, last activity description, seconds since
  last activity — users can distinguish legitimate long tasks from
  genuine hangs
- 'Still working' notifications now show iteration count and current
  tool instead of just elapsed time
- Stale lock eviction logs include agent activity state for debugging

Stream stale timeout:
- _emit_status when stale stream is detected (was log-only) — gateway
  users now see 'No response from provider for Ns' with model and
  context size
- Improved logger.warning with model name and estimated context size

Error path notifications (gateway-visible via _emit_status):
- Context compression attempts now use _emit_status (was _vprint only)
- Non-retryable client errors emit summary before aborting
- Max retry exhaustion emits error summary (was _vprint only)
- Rate limit exhaustion emits specific rate-limit message

These were all CLI-visible but silent to gateway users, which is why
people on Telegram/Discord saw generic 'request failed' messages
without explanation.
2026-04-05 18:33:33 -07:00
donrhmexe
7409715947 fix: link subagent sessions to parent and hide from session list
Subagent sessions spawned by delegate_task were created with
parent_session_id=NULL and source=cli, making them indistinguishable
from user sessions in hermes sessions list and /resume.

Changes:
- delegate_tool.py: pass parent_agent.session_id to child agent
- run_agent.py: accept parent_session_id param, pass to create_session
- hermes_state.py list_sessions_rich: filter parent_session_id IS NULL
  by default (opt-in include_children=True for callers that need them)
- hermes_state.py delete_session: delete child sessions first (FK)
- hermes_state.py prune_sessions: delete children before parents (FK)

session_search already handles parent_session_id correctly — child
sessions are filtered from recent list and resolved to parent root
in full-text search results.

Fixes #5122
2026-04-05 12:48:50 -07:00
Teknium
12724e6295
feat: progressive subdirectory hint discovery (#5291)
As the agent navigates into subdirectories via tool calls (read_file,
terminal, search_files, etc.), automatically discover and load project
context files (AGENTS.md, CLAUDE.md, .cursorrules) from those directories.

Previously, context files were only loaded from the CWD at session start.
If the agent moved into backend/, frontend/, or any subdirectory with its
own AGENTS.md, those instructions were never seen.

Now, SubdirectoryHintTracker watches tool call arguments for file paths
and shell commands, resolves directories, and loads hint files on first
access. Discovered hints are appended to the tool result so the model
gets relevant context at the moment it starts working in a new area —
without modifying the system prompt (preserving prompt caching).

Features:
- Extracts paths from tool args (path, workdir) and shell commands
- Loads AGENTS.md, CLAUDE.md, .cursorrules (first match per directory)
- Deduplicates — each directory loaded at most once per session
- Ignores paths outside the working directory
- Truncates large hint files at 8K chars
- Works on both sequential and concurrent tool execution paths

Inspired by Block/goose SubdirectoryHintTracker.
2026-04-05 12:33:47 -07:00
Mibayy
cc2b56b26a feat(api): structured run events via /v1/runs SSE endpoint
Add POST /v1/runs to start async agent runs and GET /v1/runs/{run_id}/events
for SSE streaming of typed lifecycle events (tool.started, tool.completed,
message.delta, reasoning.available, run.completed, run.failed).

Changes the internal tool_progress_callback signature from positional
(tool_name, preview, args) to event-type-first
(event_type, tool_name, preview, args, **kwargs). Existing consumers
filter on event_type and remain backward-compatible.

Adds concurrency limit (_MAX_CONCURRENT_RUNS=10) and orphaned run sweep.

Fixes logic inversion in cli.py _on_tool_progress where the original PR
would have displayed internal tools instead of non-internal ones.

Co-authored-by: Mibayy <mibayy@users.noreply.github.com>
2026-04-05 12:05:13 -07:00
Git-on-my-level
fcdd5447e2 fix: keep ACP stdout protocol-clean
Route AIAgent print output to stderr via _print_fn for ACP stdio sessions.
Gate quiet-mode spinner startup on _should_start_quiet_spinner() so JSON-RPC
on stdout isn't corrupted. Child agents inherit the redirect.

Co-authored-by: Git-on-my-level <Git-on-my-level@users.noreply.github.com>
2026-04-05 12:05:13 -07:00
Teknium
a0a1b86c2e
fix: accept reasoning-only responses without retries — set content to "(empty)" (#5278)
* feat: coerce tool call arguments to match JSON Schema types

LLMs frequently return numbers as strings ("42" instead of 42) and
booleans as strings ("true" instead of true). This causes silent
failures with MCP tools and any tool with strictly-typed parameters.

Added coerce_tool_args() in model_tools.py that runs before every tool
dispatch. For each argument, it checks the tool registry schema and
attempts safe coercion:
  - "42" → 42 when schema says "type": "integer"
  - "3.14" → 3.14 when schema says "type": "number"
  - "true"/"false" → True/False when schema says "type": "boolean"
  - Union types tried in order
  - Original values preserved when coercion fails or is not applicable

Inspired by Block/goose tool argument coercion system.

* fix: accept reasoning-only responses without retries — set content to "(empty)"

Previously, when a model returned reasoning/thinking but no visible
content, we entered a 120-line retry/classify/compress/salvage cascade
that wasted 3+ API calls trying to "fix" the response. The model was
done thinking — retrying with the same input just burned money.

Now reasoning-only responses are accepted immediately:
- Reasoning stays in the `reasoning` field (semantically correct)
- Content set to "(empty)" — valid non-empty string every provider accepts
- No retries, no compression triggers, no salvage logic
- Session history contains "(empty)" not "" — prevents #2128 session
  poisoning where empty assistant content caused prefill rejections

Removes ~120 lines, adds ~15. Saves 2-3 API calls per reasoning-only
response. Fixes #2128.
2026-04-05 11:30:52 -07:00
LucidPaths
70f798043b fix: Ollama Cloud auth, /model switch persistence, and alias tab completion
- Add OLLAMA_API_KEY to credential resolution chain for ollama.com endpoints
- Update requested_provider/_explicit_api_key/_explicit_base_url after /model
  switch so _ensure_runtime_credentials() doesn't revert the switch
- Pass base_url/api_key from fallback config to resolve_provider_client()
- Add DirectAlias system: user-configurable model_aliases in config.yaml
  checked before catalog resolution, with reverse lookup by model ID
- Add /model tab completion showing aliases with provider metadata

Co-authored-by: LucidPaths <LucidPaths@users.noreply.github.com>
2026-04-05 11:06:06 -07:00
Teknium
51ed7dc2f3
feat: save oversized tool results to file instead of destructive truncation (#5210)
Previously, tool results exceeding 100K characters were silently chopped
with only a '[Truncated]' notice — the rest of the content was lost
permanently. The model had no way to access the truncated portion.

Now, oversized results are written to HERMES_HOME/cache/tool_responses/
and the model receives:
  - A 1,500-char head preview for immediate context
  - The file path so it can use read_file/search_files on the full output

This preserves the context window protection (inline content stays small)
while making the full data recoverable. Falls back to the old destructive
truncation if the file write fails.

Inspired by Block/goose's large response handler pattern.
2026-04-05 10:29:57 -07:00
Teknium
4976a8b066
feat: /model command — models.dev primary database + --provider flag (#5181)
Full overhaul of the model/provider system.

## What changed
- models.dev (109 providers, 4000+ models) as primary database for provider identity AND model metadata
- --provider flag replaces colon syntax for explicit provider switching
- Full ModelInfo/ProviderInfo dataclasses with context, cost, capabilities, modalities
- HermesOverlay system merges models.dev + Hermes-specific transport/auth/aggregator flags
- User-defined endpoints via config.yaml providers: section
- /model (no args) lists authenticated providers with curated model catalog
- Rich metadata display: context window, max output, cost/M tokens, capabilities
- Config migration: custom_providers list → providers dict (v11→v12)
- AIAgent.switch_model() for in-place model swap preserving conversation

## Files
agent/models_dev.py, hermes_cli/providers.py, hermes_cli/model_switch.py,
hermes_cli/model_normalize.py, cli.py, gateway/run.py, run_agent.py,
hermes_cli/config.py, hermes_cli/commands.py
2026-04-05 01:04:44 -07:00
Teknium
441ec48802 style: use module-level re import instead of local import re as _re 2026-04-05 00:20:53 -07:00
kshitijk4poor
65952ac00c Honor provider reset windows in pooled credential failover
Persist structured exhaustion metadata from provider errors, use explicit reset timestamps when available, and expose label-based credential targeting in the auth CLI. This keeps long-lived Codex cooldowns from being misreported as one-hour waits and avoids forcing operators to manage entries by list position alone.

Constraint: Existing credential pool JSON needs to remain backward compatible with stored entries that only record status code and timestamp
Constraint: Runtime recovery must keep the existing retry-then-rotate semantics for 429s while enriching pool state with provider metadata
Rejected: Add a separate credential scheduler subsystem | too large for the Hermes pool architecture and unnecessary for this fix
Rejected: Only change CLI formatting | would leave runtime rotation blind to resets_at and preserve the serial-failure behavior
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Preserve structured rate-limit metadata when new providers expose reset hints; do not collapse back to status-code-only exhaustion tracking
Tested: Focused pytest slice for auth commands, credential pool recovery, and routing (272 passed); py_compile on changed Python files; hermes -w auth list/remove smoke test with temporary HERMES_HOME
Not-tested: Full repository pytest suite, broader gateway/integration flows outside the touched auth and pool paths
2026-04-05 00:20:53 -07:00
Lume
ed4a605696 docs: update docstring to mention Fireworks strict validation
Updates _sanitize_tool_calls_for_strict_api docstring to explicitly
mention Fireworks alongside Mistral as strict APIs requiring sanitization.
Also documents the specific fields that are stripped (call_id, response_item_id).
2026-04-05 00:13:25 -07:00
Lume
d90035835b refactor: use _should_sanitize_tool_calls in run_conversation()
Replaces hardcoded Mistral check with the new _should_sanitize_tool_calls()
method. Updates comment to mention Fireworks alongside Mistral as strict
APIs requiring tool_call field sanitization.
2026-04-05 00:13:25 -07:00
Lume
234c01f690 refactor: use _should_sanitize_tool_calls in _handle_max_iterations()
Replaces hardcoded Mistral check with the new _should_sanitize_tool_calls()
method. Ensures summary generation works correctly with Fireworks and
other strict APIs that reject unknown tool_call fields.
2026-04-05 00:13:25 -07:00
Lume
7f6e509199 refactor: use _should_sanitize_tool_calls in flush_memories()
Replaces hardcoded Mistral check with the new _should_sanitize_tool_calls()
method. This ensures tool_calls are sanitized for all strict APIs, not
just Mistral. Prevents 400 errors from Fireworks and other providers.
2026-04-05 00:13:25 -07:00
Lume
560c6ae143 feat: add _should_sanitize_tool_calls() method
Adds a centralized method to determine when tool_calls need sanitization
for strict APIs. Returns True for all APIs except codex_responses mode.
This prevents 400 errors from providers like Fireworks that reject unknown
fields (call_id, response_item_id) in tool_calls.
2026-04-05 00:13:25 -07:00
Teknium
5879b3ef82
fix: move pre_llm_call plugin context to user message, preserve prompt cache (#5146)
Plugin context from pre_llm_call hooks was injected into the system
prompt, breaking the prompt cache prefix every turn when content
changed (typical for memory plugins). Now all plugin context goes
into the current turn's user message — the system prompt stays
identical across turns, preserving cached tokens.

The system prompt is reserved for Hermes internals. Plugins
contribute context alongside the user's input.

Also adds comprehensive documentation for all 6 plugin hooks:
pre_tool_call, post_tool_call, pre_llm_call, post_llm_call,
on_session_start, on_session_end — each with full callback
signatures, parameter tables, firing conditions, and examples.

Supersedes #5138 which identified the same cache-busting bug
and proposed an uncached system suffix approach. This fix goes
further by removing system prompt injection entirely.

Co-identified-by: OutThisLife (PR #5138)
2026-04-04 16:55:44 -07:00
Teknium
5a98ce5973
fix: use clean user message for all memory provider operations (#4940)
When a skill is active, user_message contains the full SKILL.md content
injected by the skill system. This bloated string was being passed to
memory provider sync_all(), queue_prefetch_all(), and prefetch_all(),
causing providers with query size limits (e.g. Honcho's 10K char limit)
to fail.

Both call sites now use original_user_message (the clean user input,
already defined at line 6516) instead of the skill-inflated user_message:

- Pre-turn prefetch (line ~6695): prefetch_all() query
- Post-turn sync (line ~8672): sync_all() + queue_prefetch_all()

Fixes #4889
2026-04-03 20:43:01 -07:00
Teknium
cee761ee4a
fix: prevent duplicate messages — gateway dedup + partial stream guard (#4878)
* fix(gateway): add message deduplication to Discord and Slack adapters (#4777)

Discord RESUME replays events after reconnects (~7/day observed),
and Slack Socket Mode can redeliver events if the ack was lost.
Neither adapter tracked which messages were already processed,
causing duplicate bot responses.

Add _seen_messages dedup cache (message ID → timestamp) with 5-min
TTL and 2000-entry cap to both adapters, matching the pattern already
used by Mattermost, Matrix, WeCom, Feishu, DingTalk, and Email.

The check goes at the very top of the message handler, before any
other logic, so replayed events are silently dropped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prevent duplicate messages on partial stream delivery

When streaming fails after tokens are already delivered to the platform,
_interruptible_streaming_api_call re-raised the error into the outer
retry loop, which would make a new API call — creating a duplicate
message.

Now checks deltas_were_sent before re-raising: if partial content was
already streamed, returns a stub response instead. The outer loop treats
the turn as complete (no retry, no fallback, no duplicate).

Inspired by PR #4871 (@trevorgordon981) which identified the bug.
This implementation avoids monkey-patching exception objects and keeps
the fix within the streaming call boundary.

---------

Co-authored-by: Mibayy <mibayy@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 18:53:52 -07:00
kagura-agent
6bf5946bbe fix: filter transcript-only roles from chat-completions payload (#4715)
Add a provider-agnostic role allowlist guard to _sanitize_api_messages()
that drops messages with roles not accepted by the chat-completions API
(e.g. session_meta). This prevents CLI resume/session restore from
leaking transcript-only metadata into the outgoing messages payload.

Two layers of defense:

1. API-boundary guard: _sanitize_api_messages() now filters messages by
   role allowlist (system/user/assistant/tool/function/developer) before
   the existing orphaned tool-call repair logic. This protects all
   current and future call paths.

2. CLI restore defense-in-depth: Both session restore paths in cli.py
   now strip session_meta entries before loading history into
   conversation_history, matching the existing gateway behavior.

Closes #4715
2026-04-03 14:57:33 -07:00
Teknium
e492420df4
fix: route memory provider tools in sequential execution path (#4803)
Memory provider tools (hindsight_retain, honcho_search, etc.) were
advertised to the model via tool schemas but failed with 'Unknown tool'
at execution time. The concurrent path (_invoke_tool) correctly checks
self._memory_manager.has_tool() before falling through to the registry,
but the sequential path (_execute_tool_calls_sequential) was never
updated with this check. Since sequential is the default for single
tool calls, memory provider tools always hit the registry dispatcher
which returns 'Unknown tool' because they're not registered there.

Add the memory_manager dispatch check between the delegate_task handler
and the quiet_mode fallthrough in the sequential path, with proper
spinner/display handling to match the existing pattern.

Reported by KiBenderOP — all memory providers affected (Honcho,
Hindsight, Holographic, etc.).
2026-04-03 10:31:53 -07:00
Teknium
f374ae4c61
fix: prevent compression death spiral from API disconnects (#2153) (#4750)
Three fixes for long-running gateway sessions that enter a death spiral
when API disconnects prevent token data collection, which prevents
compression, which causes more disconnects:

Layer 1 — Stale token counter fallback (run_agent.py in-loop):
When last_prompt_tokens is 0 (stale after API disconnect or provider
returned no usage data), fall back to estimate_messages_tokens_rough()
instead of passing 0 to should_compress(), which would never fire.

Layer 2 — Server disconnect heuristic (run_agent.py error handler):
When ReadError/RemoteProtocolError hits a large session (>60% context
or >200 messages), treat it as a context-length error and trigger
compression rather than burning through retries that all fail the
same way.

Layer 3 — Hard message count limit (gateway/run.py hygiene):
Force compression when a session exceeds 400 messages, regardless of
token estimates. This catches runaway growth even when all token-based
checks fail due to missing API data.

Based on the analysis from PR #2157 by ygd58 — the gateway threshold
direction fix (1.4x multiplier) was already resolved on main.
2026-04-03 02:16:46 -07:00
Teknium
8fd9fafc84
fix: handle Anthropic Sonnet long-context tier 429 by reducing to 200k (#4747)
Anthropic returns HTTP 429 'Extra usage is required for long context
requests' when a Claude Max subscription doesn't include the 1M context
tier. This is NOT a transient rate limit — retrying won't help.

Only applies to Sonnet models (Opus 1M is general access). Detects
this specific error before the generic rate-limit handler and:
1. Reduces context_length from 1M to 200k (the standard tier)
2. Triggers context compression to fit
3. Retries with the reduced context

The reduction is session-scoped (not persisted) so it auto-recovers
if the user later enables extra usage on their subscription.

Fixes: Sonnet 4.6 instant rate limits on Claude Max without extra usage
2026-04-03 02:05:02 -07:00
Teknium
470c3ea51a
fix: handle Anthropic long-context tier 429 by reducing to 200k
Anthropic returns HTTP 429 'Extra usage is required for long context
requests' when a Claude Max subscription doesn't include the 1M context
tier. This is NOT a transient rate limit — retrying won't help.

Detect this specific error before the generic rate-limit handler and:
1. Reduce context_length from 1M to 200k (the standard tier)
2. Trigger context compression to fit
3. Retry with the reduced context

The reduction is session-scoped (not persisted) so it auto-recovers
if the user later enables extra usage on their subscription.

Fixes: Sonnet 4.6 instant rate limits on Claude Max without extra usage
2026-04-03 01:56:43 -07:00
Jack
f66b3fe76b fix(acp): include tool results in step_callback for ACP tool_call_update events
The step_callback previously only forwarded tool names as strings,
so build_tool_complete received result=None and ACP tool_call_update
events had empty content/rawOutput. Now prev_tools carries dicts with
both name and result by pairing each tool_call with its matching
tool-role message via tool_call_id.
2026-04-02 20:54:27 -07:00
Teknium
924bc67eee
feat(memory): pluggable memory provider interface with profile isolation, review fixes, and honcho CLI restoration (#4623)
* feat(memory): add pluggable memory provider interface with profile isolation

Introduces a pluggable MemoryProvider ABC so external memory backends can
integrate with Hermes without modifying core files. Each backend becomes a
plugin implementing a standard interface, orchestrated by MemoryManager.

Key architecture:
- agent/memory_provider.py — ABC with core + optional lifecycle hooks
- agent/memory_manager.py — single integration point in the agent loop
- agent/builtin_memory_provider.py — wraps existing MEMORY.md/USER.md

Profile isolation fixes applied to all 6 shipped plugins:
- Cognitive Memory: use get_hermes_home() instead of raw env var
- Hindsight Memory: check $HERMES_HOME/hindsight/config.json first,
  fall back to legacy ~/.hindsight/ for backward compat
- Hermes Memory Store: replace hardcoded ~/.hermes paths with
  get_hermes_home() for config loading and DB path defaults
- Mem0 Memory: use get_hermes_home() instead of raw env var
- RetainDB Memory: auto-derive profile-scoped project name from
  hermes_home path (hermes-<profile>), explicit env var overrides
- OpenViking Memory: read-only, no local state, isolation via .env

MemoryManager.initialize_all() now injects hermes_home into kwargs so
every provider can resolve profile-scoped storage without importing
get_hermes_home() themselves.

Plugin system: adds register_memory_provider() to PluginContext and
get_plugin_memory_providers() accessor.

Based on PR #3825. 46 tests (37 unit + 5 E2E + 4 plugin registration).

* refactor(memory): drop cognitive plugin, rewrite OpenViking as full provider

Remove cognitive-memory plugin (#727) — core mechanics are broken:
decay runs 24x too fast (hourly not daily), prefetch uses row ID as
timestamp, search limited by importance not similarity.

Rewrite openviking-memory plugin from a read-only search wrapper into
a full bidirectional memory provider using the complete OpenViking
session lifecycle API:

- sync_turn: records user/assistant messages to OpenViking session
  (threaded, non-blocking)
- on_session_end: commits session to trigger automatic memory extraction
  into 6 categories (profile, preferences, entities, events, cases,
  patterns)
- prefetch: background semantic search via find() endpoint
- on_memory_write: mirrors built-in memory writes to the session
- is_available: checks env var only, no network calls (ABC compliance)

Tools expanded from 3 to 5:
- viking_search: semantic search with mode/scope/limit
- viking_read: tiered content (abstract ~100tok / overview ~2k / full)
- viking_browse: filesystem-style navigation (list/tree/stat)
- viking_remember: explicit memory storage via session
- viking_add_resource: ingest URLs/docs into knowledge base

Uses direct HTTP via httpx (no openviking SDK dependency needed).
Response truncation on viking_read to prevent context flooding.

* fix(memory): harden Mem0 plugin — thread safety, non-blocking sync, circuit breaker

- Remove redundant mem0_context tool (identical to mem0_search with
  rerank=true, top_k=5 — wastes a tool slot and confuses the model)
- Thread sync_turn so it's non-blocking — Mem0's server-side LLM
  extraction can take 5-10s, was stalling the agent after every turn
- Add threading.Lock around _get_client() for thread-safe lazy init
  (prefetch and sync threads could race on first client creation)
- Add circuit breaker: after 5 consecutive API failures, pause calls
  for 120s instead of hammering a down server every turn. Auto-resets
  after cooldown. Logs a warning when tripped.
- Track success/failure in prefetch, sync_turn, and all tool calls
- Wait for previous sync to finish before starting a new one (prevents
  unbounded thread accumulation on rapid turns)
- Clean up shutdown to join both prefetch and sync threads

* fix(memory): enforce single external memory provider limit

MemoryManager now rejects a second non-builtin provider with a warning.
Built-in memory (MEMORY.md/USER.md) is always accepted. Only ONE
external plugin provider is allowed at a time. This prevents tool
schema bloat (some providers add 3-5 tools each) and conflicting
memory backends.

The warning message directs users to configure memory.provider in
config.yaml to select which provider to activate.

Updated all 47 tests to use builtin + one external pattern instead
of multiple externals. Added test_second_external_rejected to verify
the enforcement.

* feat(memory): add ByteRover memory provider plugin

Implements the ByteRover integration (from PR #3499 by hieuntg81) as a
MemoryProvider plugin instead of direct run_agent.py modifications.

ByteRover provides persistent memory via the brv CLI — a hierarchical
knowledge tree with tiered retrieval (fuzzy text then LLM-driven search).
Local-first with optional cloud sync.

Plugin capabilities:
- prefetch: background brv query for relevant context
- sync_turn: curate conversation turns (threaded, non-blocking)
- on_memory_write: mirror built-in memory writes to brv
- on_pre_compress: extract insights before context compression

Tools (3):
- brv_query: search the knowledge tree
- brv_curate: store facts/decisions/patterns
- brv_status: check CLI version and context tree state

Profile isolation: working directory at $HERMES_HOME/byterover/ (scoped
per profile). Binary resolution cached with thread-safe double-checked
locking. All write operations threaded to avoid blocking the agent
(curate can take 120s with LLM processing).

* fix(memory): thread remaining sync_turns, fix holographic, add config key

Plugin fixes:
- Hindsight: thread sync_turn (was blocking up to 30s via _run_in_thread)
- RetainDB: thread sync_turn (was blocking on HTTP POST)
- Both: shutdown now joins sync threads alongside prefetch threads

Holographic retrieval fixes:
- reason(): removed dead intersection_key computation (bundled but never
  used in scoring). Now reuses pre-computed entity_residuals directly,
  moved role_content encoding outside the inner loop.
- contradict(): added _MAX_CONTRADICT_FACTS=500 scaling guard. Above
  500 facts, only checks the most recently updated ones to avoid O(n^2)
  explosion (~125K comparisons at 500 is acceptable).

Config:
- Added memory.provider key to DEFAULT_CONFIG ("" = builtin only).
  No version bump needed (deep_merge handles new keys automatically).

* feat(memory): extract Honcho as a MemoryProvider plugin

Creates plugins/honcho-memory/ as a thin adapter over the existing
honcho_integration/ package. All 4 Honcho tools (profile, search,
context, conclude) move from the normal tool registry to the
MemoryProvider interface.

The plugin delegates all work to HonchoSessionManager — no Honcho
logic is reimplemented. It uses the existing config chain:
$HERMES_HOME/honcho.json -> ~/.honcho/config.json -> env vars.

Lifecycle hooks:
- initialize: creates HonchoSessionManager via existing client factory
- prefetch: background dialectic query
- sync_turn: records messages + flushes to API (threaded)
- on_memory_write: mirrors user profile writes as conclusions
- on_session_end: flushes all pending messages

This is a prerequisite for the MemoryManager wiring in run_agent.py.
Once wired, Honcho goes through the same provider interface as all
other memory plugins, and the scattered Honcho code in run_agent.py
can be consolidated into the single MemoryManager integration point.

* feat(memory): wire MemoryManager into run_agent.py

Adds 8 integration points for the external memory provider plugin,
all purely additive (zero existing code modified):

1. Init (~L1130): Create MemoryManager, find matching plugin provider
   from memory.provider config, initialize with session context
2. Tool injection (~L1160): Append provider tool schemas to self.tools
   and self.valid_tool_names after memory_manager init
3. System prompt (~L2705): Add external provider's system_prompt_block
   alongside existing MEMORY.md/USER.md blocks
4. Tool routing (~L5362): Route provider tool calls through
   memory_manager.handle_tool_call() before the catchall handler
5. Memory write bridge (~L5353): Notify external provider via
   on_memory_write() when the built-in memory tool writes
6. Pre-compress (~L5233): Call on_pre_compress() before context
   compression discards messages
7. Prefetch (~L6421): Inject provider prefetch results into the
   current-turn user message (same pattern as Honcho turn context)
8. Turn sync + session end (~L8161, ~L8172): sync_all() after each
   completed turn, queue_prefetch_all() for next turn, on_session_end()
   + shutdown_all() at conversation end

All hooks are wrapped in try/except — a failing provider never breaks
the agent. The existing memory system, Honcho integration, and all
other code paths are completely untouched.

Full suite: 7222 passed, 4 pre-existing failures.

* refactor(memory): remove legacy Honcho integration from core

Extracts all Honcho-specific code from run_agent.py, model_tools.py,
toolsets.py, and gateway/run.py. Honcho is now exclusively available
as a memory provider plugin (plugins/honcho-memory/).

Removed from run_agent.py (-457 lines):
- Honcho init block (session manager creation, activation, config)
- 8 Honcho methods: _honcho_should_activate, _strip_honcho_tools,
  _activate_honcho, _register_honcho_exit_hook, _queue_honcho_prefetch,
  _honcho_prefetch, _honcho_save_user_observation, _honcho_sync
- _inject_honcho_turn_context module-level function
- Honcho system prompt block (tool descriptions, CLI commands)
- Honcho context injection in api_messages building
- Honcho params from __init__ (honcho_session_key, honcho_manager,
  honcho_config)
- HONCHO_TOOL_NAMES constant
- All honcho-specific tool dispatch forwarding

Removed from other files:
- model_tools.py: honcho_tools import, honcho params from handle_function_call
- toolsets.py: honcho toolset definition, honcho tools from core tools list
- gateway/run.py: honcho params from AIAgent constructor calls

Removed tests (-339 lines):
- 9 Honcho-specific test methods from test_run_agent.py
- TestHonchoAtexitFlush class from test_exit_cleanup_interrupt.py

Restored two regex constants (_SURROGATE_RE, _BUDGET_WARNING_RE) that
were accidentally removed during the honcho function extraction.

The honcho_integration/ package is kept intact — the plugin delegates
to it. tools/honcho_tools.py registry entries are now dead code (import
commented out in model_tools.py) but the file is preserved for reference.

Full suite: 7207 passed, 4 pre-existing failures. Zero regressions.

* refactor(memory): restructure plugins, add CLI, clean gateway, migration notice

Plugin restructure:
- Move all memory plugins from plugins/<name>-memory/ to plugins/memory/<name>/
  (byterover, hindsight, holographic, honcho, mem0, openviking, retaindb)
- New plugins/memory/__init__.py discovery module that scans the directory
  directly, loading providers by name without the general plugin system
- run_agent.py uses load_memory_provider() instead of get_plugin_memory_providers()

CLI wiring:
- hermes memory setup — interactive curses picker + config wizard
- hermes memory status — show active provider, config, availability
- hermes memory off — disable external provider (built-in only)
- hermes honcho — now shows migration notice pointing to hermes memory setup

Gateway cleanup:
- Remove _get_or_create_gateway_honcho (already removed in prev commit)
- Remove _shutdown_gateway_honcho and _shutdown_all_gateway_honcho methods
- Remove all calls to shutdown methods (4 call sites)
- Remove _honcho_managers/_honcho_configs dict references

Dead code removal:
- Delete tools/honcho_tools.py (279 lines, import was already commented out)
- Delete tests/gateway/test_honcho_lifecycle.py (131 lines, tested removed methods)
- Remove if False placeholder from run_agent.py

Migration:
- Honcho migration notice on startup: detects existing honcho.json or
  ~/.honcho/config.json, prints guidance to run hermes memory setup.
  Only fires when memory.provider is not set and not in quiet mode.

Full suite: 7203 passed, 4 pre-existing failures. Zero regressions.

* feat(memory): standardize plugin config + add per-plugin documentation

Config architecture:
- Add save_config(values, hermes_home) to MemoryProvider ABC
- Honcho: writes to $HERMES_HOME/honcho.json (SDK native)
- Mem0: writes to $HERMES_HOME/mem0.json
- Hindsight: writes to $HERMES_HOME/hindsight/config.json
- Holographic: writes to config.yaml under plugins.hermes-memory-store
- OpenViking/RetainDB/ByteRover: env-var only (default no-op)

Setup wizard (hermes memory setup):
- Now calls provider.save_config() for non-secret config
- Secrets still go to .env via env vars
- Only memory.provider activation key goes to config.yaml

Documentation:
- README.md for each of the 7 providers in plugins/memory/<name>/
- Requirements, setup (wizard + manual), config reference, tools table
- Consistent format across all providers

The contract for new memory plugins:
- get_config_schema() declares all fields (REQUIRED)
- save_config() writes native config (REQUIRED if not env-var-only)
- Secrets use env_var field in schema, written to .env by wizard
- README.md in the plugin directory

* docs: add memory providers user guide + developer guide

New pages:
- user-guide/features/memory-providers.md — comprehensive guide covering
  all 7 shipped providers (Honcho, OpenViking, Mem0, Hindsight,
  Holographic, RetainDB, ByteRover). Each with setup, config, tools,
  cost, and unique features. Includes comparison table and profile
  isolation notes.
- developer-guide/memory-provider-plugin.md — how to build a new memory
  provider plugin. Covers ABC, required methods, config schema,
  save_config, threading contract, profile isolation, testing.

Updated pages:
- user-guide/features/memory.md — replaced Honcho section with link to
  new Memory Providers page
- user-guide/features/honcho.md — replaced with migration redirect to
  the new Memory Providers page
- sidebars.ts — added both new pages to navigation

* fix(memory): auto-migrate Honcho users to memory provider plugin

When honcho.json or ~/.honcho/config.json exists but memory.provider
is not set, automatically set memory.provider: honcho in config.yaml
and activate the plugin. The plugin reads the same config files, so
all data and credentials are preserved. Zero user action needed.

Persists the migration to config.yaml so it only fires once. Prints
a one-line confirmation in non-quiet mode.

* fix(memory): only auto-migrate Honcho when enabled + credentialed

Check HonchoClientConfig.enabled AND (api_key OR base_url) before
auto-migrating — not just file existence. Prevents false activation
for users who disabled Honcho, stopped using it (config lingers),
or have ~/.honcho/ from a different tool.

* feat(memory): auto-install pip dependencies during hermes memory setup

Reads pip_dependencies from plugin.yaml, checks which are missing,
installs them via pip before config walkthrough. Also shows install
guidance for external_dependencies (e.g. brv CLI for ByteRover).

Updated all 7 plugin.yaml files with pip_dependencies:
- honcho: honcho-ai
- mem0: mem0ai
- openviking: httpx
- hindsight: hindsight-client
- holographic: (none)
- retaindb: requests
- byterover: (external_dependencies for brv CLI)

* fix: remove remaining Honcho crash risks from cli.py and gateway

cli.py: removed Honcho session re-mapping block (would crash importing
deleted tools/honcho_tools.py), Honcho flush on compress, Honcho
session display on startup, Honcho shutdown on exit, honcho_session_key
AIAgent param.

gateway/run.py: removed honcho_session_key params from helper methods,
sync_honcho param, _honcho.shutdown() block.

tests: fixed test_cron_session_with_honcho_key_skipped (was passing
removed honcho_key param to _flush_memories_for_session).

* fix: include plugins/ in pyproject.toml package list

Without this, plugins/memory/ wouldn't be included in non-editable
installs. Hermes always runs from the repo checkout so this is belt-
and-suspenders, but prevents breakage if the install method changes.

* fix(memory): correct pip-to-import name mapping for dep checks

The heuristic dep.replace('-', '_') fails for packages where the pip
name differs from the import name: honcho-ai→honcho, mem0ai→mem0,
hindsight-client→hindsight_client. Added explicit mapping table so
hermes memory setup doesn't try to reinstall already-installed packages.

* chore: remove dead code from old plugin memory registration path

- hermes_cli/plugins.py: removed register_memory_provider(),
  _memory_providers list, get_plugin_memory_providers() — memory
  providers now use plugins/memory/ discovery, not the general plugin system
- hermes_cli/main.py: stripped 74 lines of dead honcho argparse
  subparsers (setup, status, sessions, map, peer, mode, tokens,
  identity, migrate) — kept only the migration redirect
- agent/memory_provider.py: updated docstring to reflect new
  registration path
- tests: replaced TestPluginMemoryProviderRegistration with
  TestPluginMemoryDiscovery that tests the actual plugins/memory/
  discovery system. Added 3 new tests (discover, load, nonexistent).

* chore: delete dead honcho_integration/cli.py and its tests

cli.py (794 lines) was the old 'hermes honcho' command handler — nobody
calls it since cmd_honcho was replaced with a migration redirect.

Deleted tests that imported from removed code:
- tests/honcho_integration/test_cli.py (tested _resolve_api_key)
- tests/honcho_integration/test_config_isolation.py (tested CLI config paths)
- tests/tools/test_honcho_tools.py (tested the deleted tools/honcho_tools.py)

Remaining honcho_integration/ files (actively used by the plugin):
- client.py (445 lines) — config loading, SDK client creation
- session.py (991 lines) — session management, queries, flush

* refactor: move honcho_integration/ into the honcho plugin

Moves client.py (445 lines) and session.py (991 lines) from the
top-level honcho_integration/ package into plugins/memory/honcho/.
No Honcho code remains in the main codebase.

- plugins/memory/honcho/client.py — config loading, SDK client creation
- plugins/memory/honcho/session.py — session management, queries, flush
- Updated all imports: run_agent.py (auto-migration), hermes_cli/doctor.py,
  plugin __init__.py, session.py cross-import, all tests
- Removed honcho_integration/ package and pyproject.toml entry
- Renamed tests/honcho_integration/ → tests/honcho_plugin/

* docs: update architecture + gateway-internals for memory provider system

- architecture.md: replaced honcho_integration/ with plugins/memory/
- gateway-internals.md: replaced Honcho-specific session routing and
  flush lifecycle docs with generic memory provider interface docs

* fix: update stale mock path for resolve_active_host after honcho plugin migration

* fix(memory): address review feedback — P0 lifecycle, ABC contract, honcho CLI restore

Review feedback from Honcho devs (erosika):

P0 — Provider lifecycle:
- Remove on_session_end() + shutdown_all() from run_conversation() tail
  (was killing providers after every turn in multi-turn sessions)
- Add shutdown_memory_provider() method on AIAgent for callers
- Wire shutdown into CLI atexit, reset_conversation, gateway stop/expiry

Bug fixes:
- Remove sync_honcho=False kwarg from /btw callsites (TypeError crash)
- Fix doctor.py references to dead 'hermes honcho setup' command
- Cache prefetch_all() before tool loop (was re-calling every iteration)

ABC contract hardening (all backwards-compatible):
- Add session_id kwarg to prefetch/sync_turn/queue_prefetch
- Make on_pre_compress() return str (provider insights in compression)
- Add **kwargs to on_turn_start() for runtime context
- Add on_delegation() hook for parent-side subagent observation
- Document agent_context/agent_identity/agent_workspace kwargs on
  initialize() (prevents cron corruption, enables profile scoping)
- Fix docstring: single external provider, not multiple

Honcho CLI restoration:
- Add plugins/memory/honcho/cli.py (from main's honcho_integration/cli.py
  with imports adapted to plugin path)
- Restore full hermes honcho command with all subcommands (status, peer,
  mode, tokens, identity, enable/disable, sync, peers, --target-profile)
- Restore auto-clone on profile creation + sync on hermes update
- hermes honcho setup now redirects to hermes memory setup

* fix(memory): wire on_delegation, skip_memory for cron/flush, fix ByteRover return type

- Wire on_delegation() in delegate_tool.py — parent's memory provider
  is notified with task+result after each subagent completes
- Add skip_memory=True to cron scheduler (prevents cron system prompts
  from corrupting user representations — closes #4052)
- Add skip_memory=True to gateway flush agent (throwaway agent shouldn't
  activate memory provider)
- Fix ByteRover on_pre_compress() return type: None -> str

* fix(honcho): port profile isolation fixes from PR #4632

Ports 5 bug fixes found during profile testing (erosika's PR #4632):

1. 3-tier config resolution — resolve_config_path() now checks
   $HERMES_HOME/honcho.json → ~/.hermes/honcho.json → ~/.honcho/config.json
   (non-default profiles couldn't find shared host blocks)

2. Thread host=_host_key() through from_global_config() in cmd_setup,
   cmd_status, cmd_identity (--target-profile was being ignored)

3. Use bare profile name as aiPeer (not host key with dots) — Honcho's
   peer ID pattern is ^[a-zA-Z0-9_-]+$, dots are invalid

4. Wrap add_peers() in try/except — was fatal on new AI peers, killed
   all message uploads for the session

5. Gate Honcho clone behind --clone/--clone-all on profile create
   (bare create should be blank-slate)

Also: sanitize assistant_peer_id via _sanitize_id()

* fix(tests): add module cleanup fixture to test_cli_provider_resolution

test_cli_provider_resolution._import_cli() wipes tools.*, cli, and
run_agent from sys.modules to force fresh imports, but had no cleanup.
This poisoned all subsequent tests on the same xdist worker — mocks
targeting tools.file_tools, tools.send_message_tool, etc. patched the
NEW module object while already-imported functions still referenced
the OLD one. Caused ~25 cascade failures: send_message KeyError,
process_registry FileNotFoundError, file_read_guards timeouts,
read_loop_detection file-not-found, mcp_oauth None port, and
provider_parity/codex_execution stale tool lists.

Fix: autouse fixture saves all affected modules before each test and
restores them after, matching the pattern in
test_managed_browserbase_and_modal.py.
2026-04-02 15:33:51 -07:00
kshitijk4poor
d2b08406a4 fix(agent): classify think-only empty responses before retrying 2026-04-02 12:29:18 -07:00
Teknium
d89cc7fec1
feat(prompt): add Google model operational guidance for Gemini and Gemma (#4641)
Adapted from OpenCode's gemini.txt. Gemini and Gemma models now get
structured operational directives alongside tool-use enforcement:
absolute paths, verify-before-edit, dependency checks, conciseness,
parallel tool calls, non-interactive flags, autonomous execution.

Based on PR #4026, extended to cover Gemma models.
2026-04-02 11:52:34 -07:00
Teknium
3186668799
feat: per-turn primary runtime restoration and transport recovery (#4624)
Makes provider fallback turn-scoped in long-lived CLI sessions. Previously, a single transient failure pinned the session to the fallback provider for every subsequent turn.

- _primary_runtime dict snapshot at __init__ (model, provider, base_url, api_mode, client_kwargs, compressor state)
- _restore_primary_runtime() at top of run_conversation() — restores all state, resets fallback chain index
- _try_recover_primary_transport() — one extra recovery cycle (client rebuild + cooldown) for transient transport errors on direct endpoints before fallback
- Skipped for aggregator providers (OpenRouter, Nous)
- 25 tests

Inspired by #4612 (@betamod). Closes #4612.
2026-04-02 10:52:01 -07:00
kshitijk4poor
20441cf2c8 fix(insights): persist token usage for non-CLI sessions 2026-04-02 10:47:13 -07:00
Teknium
585855d2ca fix: preserve Anthropic thinking block signatures across tool-use turns
Anthropic extended thinking blocks include an opaque 'signature' field
required for thinking chain continuity across multi-turn tool-use
conversations. Previously, normalize_anthropic_response() extracted
only the thinking text and set reasoning_details=None, discarding the
signature. On subsequent turns the API could not verify the chain.

Changes:
- _to_plain_data(): new recursive SDK-to-dict converter with depth cap
  (20 levels) and path-based cycle detection for safety
- _extract_preserved_thinking_blocks(): rehydrates preserved thinking
  blocks (including signature) from reasoning_details on assistant
  messages, placing them before tool_use blocks as Anthropic requires
- normalize_anthropic_response(): stores full thinking blocks in
  reasoning_details via _to_plain_data()
- _extract_reasoning(): adds 'thinking' key to the detail lookup chain
  so Anthropic-format details are found alongside OpenRouter format

Salvaged from PR #4503 by @priveperfumes — focused on the thinking
block continuity fix only (cache strategy and other changes excluded).
2026-04-02 10:30:32 -07:00
Ben Barclay
a2e56d044b Merge branch 'main' into rewbs/tool-use-charge-to-subscription 2026-04-02 11:00:35 +11:00
Teknium
de9bba8d7c
fix: remove hardcoded OpenRouter/opus defaults
No model, base_url, or provider is assumed when the user hasn't
configured one.  Previously the defaults dict in cli.py, AIAgent
constructor args, and several fallback paths all hardcoded
anthropic/claude-opus-4.6 + openrouter.ai/api/v1 — silently routing
unconfigured users to OpenRouter, which 404s for anyone using a
different provider.

Now empty defaults force the setup wizard to run, and existing users
who already completed setup are unaffected (their config.yaml has
the model they chose).

Files changed:
- cli.py: defaults dict, _DEFAULT_CONFIG_MODEL
- run_agent.py: AIAgent.__init__ defaults, main() defaults
- hermes_cli/config.py: DEFAULT_CONFIG
- hermes_cli/runtime_provider.py: is_fallback sentinel
- acp_adapter/session.py: default_model
- tests: updated to reflect empty defaults
2026-04-01 15:22:26 -07:00
Teknium
3628ccc8c4
feat: use 'developer' role for GPT-5 and Codex models (#4498)
OpenAI's newer models (GPT-5, Codex) give stronger instruction-following
weight to the 'developer' role vs 'system'. Swap the role at the API
boundary in _build_api_kwargs() for the chat_completions path so internal
message representation stays consistent ('system' everywhere).

Applies regardless of provider — OpenRouter, Nous portal, direct, etc.
The codex_responses path (direct OpenAI) uses 'instructions' instead of
message roles, so it's unaffected.

DEVELOPER_ROLE_MODELS constant in prompt_builder.py defines the matching
model name substrings: ('gpt-5', 'codex').
2026-04-01 14:49:32 -07:00
Teknium
16d9f58445
fix(gateway): persist memory flush state to prevent redundant re-flushes on restart (#4481)
* fix: force-close TCP sockets on client cleanup, detect and recover dead connections

When a provider drops connections mid-stream (e.g. OpenRouter outage),
httpx's graceful close leaves sockets in CLOSE-WAIT indefinitely. These
zombie connections accumulate and can prevent recovery without restarting.

Changes:
- _force_close_tcp_sockets: walks the httpx connection pool and issues
  socket.shutdown(SHUT_RDWR) + close() to force TCP RST on every socket
  when a client is closed, preventing CLOSE-WAIT accumulation
- _cleanup_dead_connections: probes the primary client's pool for dead
  sockets (recv MSG_PEEK), rebuilds the client if any are found
- Pre-turn health check at the start of each run_conversation call that
  auto-recovers with a user-facing status message
- Primary client rebuild after stale stream detection to purge pool
- User-facing messages on streaming connection failures:
  "Connection to provider dropped — Reconnecting (attempt 2/3)"
  "Connection failed after 3 attempts — try again in a moment"

Made-with: Cursor

* fix: pool entry missing base_url for openrouter, clean error messages

- _resolve_runtime_from_pool_entry: add OPENROUTER_BASE_URL fallback
  when pool entry has no runtime_base_url (pool entries from auth.json
  credential_pool often omit base_url)
- Replace Rich console.print for auth errors with plain print() to
  prevent ANSI escape code mangling through prompt_toolkit's stdout patch
- Force-close TCP sockets on client cleanup to prevent CLOSE-WAIT
  accumulation after provider outages
- Pre-turn dead connection detection with auto-recovery and user message
- Primary client rebuild after stale stream detection
- User-facing status messages on streaming connection failures/retries

Made-with: Cursor

* fix(gateway): persist memory flush state to prevent redundant re-flushes on restart

The _session_expiry_watcher tracked flushed sessions in an in-memory set
(_pre_flushed_sessions) that was lost on gateway restart. Expired sessions
remained in sessions.json and were re-discovered every restart, causing
redundant AIAgent runs that burned API credits and blocked the event loop.

Fix: Add a memory_flushed boolean field to SessionEntry, persisted in
sessions.json. The watcher sets it after a successful flush. On restart,
the flag survives and the watcher skips already-flushed sessions.

- Add memory_flushed field to SessionEntry with to_dict/from_dict support
- Old sessions.json entries without the field default to False (backward compat)
- Remove the ephemeral _pre_flushed_sessions set from SessionStore
- Update tests: save/load roundtrip, legacy entry compat, auto-reset behavior
2026-04-01 12:05:02 -07:00
kshitijk4poor
935137f0d9 feat: add inline diff previews for write actions
Show inline diffs in the CLI transcript when write_file, patch, or
skill_manage modifies files. Captures a filesystem snapshot before the
tool runs, computes a unified diff after, and renders it with ANSI
coloring in the activity feed.

Adds tool_start_callback and tool_complete_callback hooks to AIAgent
for pre/post tool execution notifications.

Also fixes _extract_parallel_scope_path to normalize relative paths
to absolute, preventing the parallel overlap detection from missing
conflicts when the same file is referenced with different path styles.

Gated by display.inline_diffs config option (default: true).

Based on PR #3774 by @kshitijk4poor.
2026-04-01 02:13:57 -07:00
Bartok9
afa75a6185 fix(client): handle is_closed as method in OpenAI SDK
The openai SDK's SyncAPIClient.is_closed is a method, not a property.
getattr(client, 'is_closed', False) returned the bound method object,
which is always truthy — causing _is_openai_client_closed() to report
all clients as closed and triggering unnecessary client recreation
(~100-200ms TCP+TLS overhead per API call).

Fix: check if is_closed is callable and call it, otherwise treat as bool.

Fixes #4377
Co-authored-by: Bartok9 <Bartok9@users.noreply.github.com>
2026-04-01 01:40:43 -07:00
Teknium
a7f7e87070
fix: preserve credential_pool through smart routing and defer eager fallback on 429 (#4361)
Three bugs prevented credential pool rotation from working when multiple
Codex OAuth tokens were configured:

1. credential_pool was dropped during smart model turn routing.
   resolve_turn_route() constructed runtime dicts without it, so the
   AIAgent was created without pool access. Fixed in smart_model_routing.py
   (no-route and fallback paths), cli.py, and gateway/run.py.

2. Eager fallback fired before pool rotation on 429. The rate-limit
   handler at line ~7180 switched to a fallback provider immediately,
   before _recover_with_credential_pool got a chance to rotate to the
   next credential. Now deferred when the pool still has credentials.

3. (Non-issue) Retry budget was reported as too small, but successful
   pool rotations already skip retry_count increment — no change needed.

Reported by community member Schinsly who identified all three root
causes and verified the fix locally with multiple Codex accounts.
2026-04-01 01:02:34 -07:00
Teknium
e3f8347be3
feat(file_tools): harden read_file with size guard, dedup, and device blocking (#4315)
* feat(file_tools): harden read_file with size guard, dedup, and device blocking

Three improvements to read_file_tool to reduce wasted context tokens and
prevent process hangs:

1. Character-count guard: reads that produce more than 100K characters
   (≈25-35K tokens across tokenisers) are rejected with an error that
   tells the model to use offset+limit for a smaller range.  The
   effective cap is min(file_size, 100K) so small files that happen to
   have long lines aren't over-penalised.  Large truncated files also
   get a hint nudging toward targeted reads.

2. File-read deduplication: when the same (path, offset, limit) is read
   a second time and the file hasn't been modified (mtime unchanged),
   return a lightweight stub instead of re-sending the full content.
   Writes and patches naturally change mtime, so post-edit reads always
   return fresh content.  The dedup cache is cleared on context
   compression — after compression the original read content is
   summarised away, so the model needs the full content again.

3. Device path blocking: paths like /dev/zero, /dev/random, /dev/stdin
   etc. are rejected before any I/O to prevent process hangs from
   infinite-output or blocking-input devices.

Tests: 17 new tests covering all three features plus the dedup-reset-
on-compression integration.  All 52 file-read tests pass (35 existing +
17 new).  Full tool suite (2124 tests) passes with 0 failures.

* feat: make file_read_max_chars configurable, add docs

Add file_read_max_chars to DEFAULT_CONFIG (default 100K).  read_file_tool
reads this on first call and caches for the process lifetime.  Users on
large-context models can raise it; users on small local models can lower it.

Also adds a 'File Read Safety' section to the configuration docs
explaining the char limit, dedup behavior, and example values.
2026-03-31 12:53:19 -07:00
curtitoo
e75964d46d fix: harden codex responses transport handling 2026-03-31 12:05:06 -07:00
Teknium
161acb0086
fix: credential pool 401 recovery rotates to next credential after failed refresh (#4300)
When an OAuth token refresh fails on a 401 error, the pool recovery
would return 'not recovered' without trying the next credential in the
pool. This meant users who added a second valid credential via
'hermes auth add' would never see it used when the primary credential
was dead.

Now: try refresh first (handles expired tokens quickly), and if that
fails, rotate to the next available credential — same as 429/402
already did.

Adds three tests covering 401 refresh success, refresh-fail-then-rotate,
and refresh-fail-with-no-remaining-credentials.
2026-03-31 12:02:29 -07:00
arasovic
0240baa357 fix: strip orphaned think/reasoning tags from user-facing responses
Some models (e.g. Kimi K2.5 on Alibaba OpenAI-compatible endpoint)
emit reasoning text followed by a closing </think> without a matching
opening <think> tag.  The existing paired-tag regexes in
_strip_think_blocks() cannot match these orphaned tags, so </think>
leaks into user-facing responses on all platforms.

Add a catch-all regex that strips any remaining opening or closing
think/thinking/reasoning/REASONING_SCRATCHPAD tags after the existing
paired-block removal pass.

Closes #4285
2026-03-31 11:42:44 -07:00
Teknium
8d59881a62
feat(auth): same-provider credential pools with rotation, custom endpoint support, and interactive CLI (#2647)
* feat(auth): add same-provider credential pools and rotation UX

Add same-provider credential pooling so Hermes can rotate across
multiple credentials for a single provider, recover from exhausted
credentials without jumping providers immediately, and configure
that behavior directly in hermes setup.

- agent/credential_pool.py: persisted per-provider credential pools
- hermes auth add/list/remove/reset CLI commands
- 429/402/401 recovery with pool rotation in run_agent.py
- Setup wizard integration for pool strategy configuration
- Auto-seeding from env vars and existing OAuth state

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Salvaged from PR #2647

* fix(tests): prevent pool auto-seeding from host env in credential pool tests

Tests for non-pool Anthropic paths and auth remove were failing when
host env vars (ANTHROPIC_API_KEY) or file-backed OAuth credentials
were present. The pool auto-seeding picked these up, causing unexpected
pool entries in tests.

- Mock _select_pool_entry in auxiliary_client OAuth flag tests
- Clear Anthropic env vars and mock _seed_from_singletons in auth remove test

* feat(auth): add thread safety, least_used strategy, and request counting

- Add threading.Lock to CredentialPool for gateway thread safety
  (concurrent requests from multiple gateway sessions could race on
  pool state mutations without this)
- Add 'least_used' rotation strategy that selects the credential
  with the lowest request_count, distributing load more evenly
- Add request_count field to PooledCredential for usage tracking
- Add mark_used() method to increment per-credential request counts
- Wrap select(), mark_exhausted_and_rotate(), and try_refresh_current()
  with lock acquisition
- Add tests: least_used selection, mark_used counting, concurrent
  thread safety (4 threads × 20 selects with no corruption)

* feat(auth): add interactive mode for bare 'hermes auth' command

When 'hermes auth' is called without a subcommand, it now launches an
interactive wizard that:

1. Shows full credential pool status across all providers
2. Offers a menu: add, remove, reset cooldowns, set strategy
3. For OAuth-capable providers (anthropic, nous, openai-codex), the
   add flow explicitly asks 'API key or OAuth login?' — making it
   clear that both auth types are supported for the same provider
4. Strategy picker shows all 4 options (fill_first, round_robin,
   least_used, random) with the current selection marked
5. Remove flow shows entries with indices for easy selection

The subcommand paths (hermes auth add/list/remove/reset) still work
exactly as before for scripted/non-interactive use.

* fix(tests): update runtime_provider tests for config.yaml source of truth (#4165)

Tests were using OPENAI_BASE_URL env var which is no longer consulted
after #4165. Updated to use model config (provider, base_url, api_key)
which is the new single source of truth for custom endpoint URLs.

* feat(auth): support custom endpoint credential pools keyed by provider name

Custom OpenAI-compatible endpoints all share provider='custom', making
the provider-keyed pool useless. Now pools for custom endpoints are
keyed by 'custom:<normalized_name>' where the name comes from the
custom_providers config list (auto-generated from URL hostname).

- Pool key format: 'custom:together.ai', 'custom:local-(localhost:8080)'
- load_pool('custom:name') seeds from custom_providers api_key AND
  model.api_key when base_url matches
- hermes auth add/list now shows custom endpoints alongside registry
  providers
- _resolve_openrouter_runtime and _resolve_named_custom_runtime check
  pool before falling back to single config key
- 6 new tests covering custom pool keying, seeding, and listing

* docs: add Excalidraw diagram of full credential pool flow

Comprehensive architecture diagram showing:
- Credential sources (env vars, auth.json OAuth, config.yaml, CLI)
- Pool storage and auto-seeding
- Runtime resolution paths (registry, custom, OpenRouter)
- Error recovery (429 retry-then-rotate, 402 immediate, 401 refresh)
- CLI management commands and strategy configuration

Open at: https://excalidraw.com/#json=2Ycqhqpi6f12E_3ITyiwh,c7u9jSt5BwrmiVzHGbm87g

* fix(tests): update setup wizard pool tests for unified select_provider_and_model flow

The setup wizard now delegates to select_provider_and_model() instead
of using its own prompt_choice-based provider picker. Tests needed:
- Mock select_provider_and_model as no-op (provider pre-written to config)
- Call _stub_tts BEFORE custom prompt_choice mock (it overwrites it)
- Pre-write model.provider to config so the pool step is reached

* docs: add comprehensive credential pool documentation

- New page: website/docs/user-guide/features/credential-pools.md
  Full guide covering quick start, CLI commands, rotation strategies,
  error recovery, custom endpoint pools, auto-discovery, thread safety,
  architecture, and storage format.
- Updated fallback-providers.md to reference credential pools as the
  first layer of resilience (same-provider rotation before cross-provider)
- Added hermes auth to CLI commands reference with usage examples
- Added credential_pool_strategies to configuration guide

* chore: remove excalidraw diagram from repo (external link only)

* refactor: simplify credential pool code — extract helpers, collapse extras, dedup patterns

- _load_config_safe(): replace 4 identical try/except/import blocks
- _iter_custom_providers(): shared generator for custom provider iteration
- PooledCredential.extra dict: collapse 11 round-trip-only fields
  (token_type, scope, client_id, portal_base_url, obtained_at,
  expires_in, agent_key_id, agent_key_expires_in, agent_key_reused,
  agent_key_obtained_at, tls) into a single extra dict with
  __getattr__ for backward-compatible access
- _available_entries(): shared exhaustion-check between select and peek
- Dedup anthropic OAuth seeding (hermes_pkce + claude_code identical)
- SimpleNamespace replaces class _Args boilerplate in auth_commands
- _try_resolve_from_custom_pool(): shared pool-check in runtime_provider

Net -17 lines. All 383 targeted tests pass.

---------

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-03-31 03:10:01 -07:00
Teknium
1bd206ea5d
feat: add /btw command for ephemeral side questions (#4161)
Adds /btw <question> — ask a quick follow-up using the current
session context without interrupting the main conversation.

- Snapshots conversation history, answers with a no-tools agent
- Response is not persisted to session history or DB
- Runs in a background thread (CLI) / async task (gateway)
- Per-session guard prevents concurrent /btw in gateway

Implementation:
- model_tools.py: enabled_toolsets=[] now correctly means "no tools"
  (was falsy, fell through to default "all tools")
- run_agent.py: persist_session=False gates _persist_session()
- cli.py: _handle_btw_command (background thread, Rich panel output)
- gateway/run.py: _handle_btw_command + _run_btw_task (async task)
- hermes_cli/commands.py: CommandDef for "btw"

Inspired by PR #3504 by areu01or00, reimplemented cleanly on current
main with the enabled_toolsets=[] fix and without the __btw_no_tools__
hack.
2026-03-30 21:10:05 -07:00
Teknium
54b876a5c9
fix: add actionable guidance to context-exceeded error messages (#4155)
When context compression fails, users now see hints suggesting /new
or /compress instead of a dead-end error. Covers all 4 error paths:
payload-too-large, max compression attempts (2 paths), and context
length exceeded.

Closes #4061
Salvaged from PR #4076 by SHL0MS.

Co-authored-by: SHL0MS <SHL0MS@users.noreply.github.com>
2026-03-30 20:23:28 -07:00
Teknium
f776191650
fix: persist compressed context to gateway session after mid-run compression
When context compression fires during run_conversation() in the gateway,
the compressed messages were silently lost on the next turn. Two bugs:

1. Agent-side: _flush_messages_to_session_db() calculated
   flush_from = max(len(conversation_history), _last_flushed_db_idx).
   After compression, _last_flushed_db_idx was correctly reset to 0,
   but conversation_history still had its original pre-compression
   length (e.g. 200). Since compressed messages are shorter (~30),
   messages[200:] was empty — nothing written to the new session's
   SQLite.

   Fix: Set conversation_history = None after each _compress_context()
   call so start_idx = 0 and all compressed messages are flushed.

2. Gateway-side: history_offset was always len(agent_history) — the
   original pre-compression length. After compression shortened the
   message list, agent_messages[200:] was empty, causing the gateway
   to fall back to writing only a user/assistant pair, losing the
   compressed summary and tail context.

   Fix: Detect session splits (agent.session_id != original) and set
   history_offset = 0 so all compressed messages are written to JSONL.
2026-03-30 18:49:14 -07:00
Robin Fernandes
1126284c97 Merge branch 'main' into rewbs/tool-use-charge-to-subscription 2026-03-31 09:29:43 +09:00
Robin Fernandes
6e4598ce1e Merge branch 'main' into rewbs/tool-use-charge-to-subscription 2026-03-31 08:48:54 +09:00
Teknium
7dac75f2ae
fix: prevent context pressure warning spam after compression (#4012)
* feat: add /yolo slash command to toggle dangerous command approvals

Adds a /yolo command that toggles HERMES_YOLO_MODE at runtime, skipping
all dangerous command approval prompts for the current session. Works in
both CLI and gateway (Telegram, Discord, etc.).

- /yolo -> ON: all commands auto-approved, no confirmation prompts
- /yolo -> OFF: normal approval flow restored

The --yolo CLI flag already existed for launch-time opt-in. This adds
the ability to toggle mid-session without restarting.

Session-scoped — resets when the process ends. Uses the existing
HERMES_YOLO_MODE env var that check_all_command_guards() already
respects.

* fix: prevent context pressure warning spam (agent loop + gateway rate-limit)

Two complementary fixes for repeated context pressure warnings spamming
gateway users (Telegram, Discord, etc.):

1. Agent-level loop fix (run_agent.py):
   After compression, only reset _context_pressure_warned if the
   post-compression estimate is actually below the 85% warning level.
   Previously the flag was unconditionally reset, causing the warning
   to re-fire every loop iteration when compression couldn't reduce
   below 85% of the threshold (e.g. very low threshold like 15%,
   or system prompt alone exceeds the warning level).

2. Gateway-level rate-limit (gateway/run.py, salvaged from PR #3786):
   Per-chat_id cooldown of 1 hour on compression warning messages.
   Both warning paths ('still large after compression' and 'compression
   failed') are gated. Defense-in-depth — even if the agent-level fix
   has edge cases, users won't see more than one warning per hour.

Co-authored-by: dlkakbs <dlkakbs@users.noreply.github.com>

---------

Co-authored-by: dlkakbs <dlkakbs@users.noreply.github.com>
2026-03-30 13:18:21 -07:00
Teknium
97d6813f51
fix(cache): use deterministic call_id fallbacks instead of random UUIDs (#3991)
When the API doesn't provide a call_id for tool calls, the fallback
generated a random uuid4 hex. This made every API call's input unique
when replayed, preventing OpenAI's prompt cache from matching the
prefix across turns.

Replaced all four uuid4 fallback sites with a deterministic hash of
(function_name, arguments, position_index). The same tool call now
always produces the same fallback call_id, preserving cache-friendly
input stability.

Affected code paths:
- _chat_messages_to_responses_input() — Codex input reconstruction
- _normalize_codex_response() — function_call and custom_tool_call
- _build_assistant_message() — assistant message construction
2026-03-30 09:43:56 -07:00
Robin Fernandes
1cbb1b99cc Gate tool-gateway behind an env var, so it's not in users' faces until we're ready. Even if users enable it, it'll be blocked server-side for now, until we unlock for non-admin users on tool-gateway. 2026-03-30 13:28:10 +09:00
Teknium
2a0e8b001f
fix(cli): handle closed stdout ValueError in safe print paths (#3843)
When stdout is closed (piped to a dead process, broken terminal),
Python raises ValueError('I/O operation on closed file'), not OSError.
_safe_print and the API error printer only caught OSError, letting the
ValueError propagate and crash the agent.

Salvaged from PR #3760 by @apexscaleai. Fixes #3534.

Co-authored-by: apexscaleai <apexscaleai@users.noreply.github.com>
2026-03-29 18:21:27 -07:00
Teknium
e314833c9d
feat(display): configurable tool preview length -- show full paths by default (#3841)
Tool call previews (paths, commands, queries) were hardcoded to truncate
at 35-40 chars across CLI spinners, completion lines, and gateway progress
messages. Users could not see full file paths in tool output.

New config option: display.tool_preview_length (default 0 = no limit).
Set a positive number to truncate at that length.

Changes:
- display.py: module-level _tool_preview_max_len with getter/setter;
  build_tool_preview() and get_cute_tool_message() _trunc/_path respect it
- cli.py: reads config at startup, spinner widget respects config
- gateway/run.py: reads config per-message, progress callback respects config
- run_agent.py: removed redundant 30-char quiet-mode spinner truncation
- config.py: added display.tool_preview_length to DEFAULT_CONFIG

Reported by kriskaminski
2026-03-29 18:02:42 -07:00
Teknium
d6b7836210
fix: update session_log_file during context compression (#3835)
When compression creates a child session with a new session_id,
session_log_file was still pointing to the old session's JSON file.
This caused _save_session_log() to write new data to the wrong file.

Closes #3731.

Co-authored-by: kelsia14 <kelsia14@users.noreply.github.com>
2026-03-29 17:49:58 -07:00
Teknium
68d5472810
fix: omit tools param entirely when empty instead of sending None (#3820)
Some providers (Fireworks AI) reject tools=null, and others (Anthropic)
reject tools=[]. The safest approach is to not include the key at all
when there are no tools — the OpenAI SDK treats a missing parameter as
NOT_GIVEN and omits it from the request entirely.

Inspired by PR #3736 (@kelsia14).
2026-03-29 16:12:47 -07:00
Teknium
252fbea005
feat(providers): add ordered fallback provider chain (salvage #1761) (#3813)
Extends the single fallback_model mechanism into an ordered chain.
When the primary model fails, Hermes tries each fallback provider in
sequence until one succeeds or the chain is exhausted.

Config format (new):
  fallback_providers:
    - provider: openrouter
      model: anthropic/claude-sonnet-4
    - provider: openai
      model: gpt-4o

Legacy single-dict fallback_model format still works unchanged.

Key fix vs original PR: the call sites in the retry loop now use
_fallback_index < len(_fallback_chain) instead of the old one-shot
_fallback_activated guard, so the chain actually advances through
all configured providers.

Changes:
- run_agent.py: _fallback_chain list + _fallback_index replaces
  one-shot _fallback_model; _try_activate_fallback() advances
  through chain; failed provider resolution skips to next entry;
  call sites updated to allow chain advancement
- cli.py: reads fallback_providers with legacy fallback_model compat
- gateway/run.py: same
- hermes_cli/config.py: fallback_providers: [] in DEFAULT_CONFIG
- tests: 12 new chain tests + 6 existing test fixtures updated

Co-authored-by: uzaylisak <uzaylisak@users.noreply.github.com>
2026-03-29 16:04:53 -07:00
Teknium
ed6427e0a7
fix(agent): user-friendly 429 rate limit messages with Retry-After support (#3809)
When hitting rate limits (429), the agent now:
- Extracts the Retry-After header from the provider response and uses it
  as the wait time instead of blind exponential backoff (capped at 120s)
- Shows rate-limit-specific messaging: 'Rate limit reached. Waiting Xs
  before retry (attempt N/M)...'
- Shows a distinct exhaustion message: 'Rate limit persisted after N
  retries. Please try again later.'

Non-429 errors keep the existing exponential backoff and generic messaging.

Co-authored-by: ygd58 <ygd58@users.noreply.github.com>
2026-03-29 15:48:06 -07:00
Teknium
c62cadb73a
fix: make display_hermes_home imports lazy to prevent ImportError during hermes update (#3776)
When a user runs 'hermes update', the Python process caches old modules
in sys.modules.  After git pull updates files on disk, lazy imports of
newly-updated modules fail because they try to import display_hermes_home
from the cached (old) hermes_constants which doesn't have the function.

This specifically broke the gateway auto-restart in cmd_update — importing
hermes_cli/gateway.py triggered the top-level 'from hermes_constants
import display_hermes_home' against the cached old module.  The ImportError
was silently caught, so the gateway was never restarted after update.

Users with a running gateway then hit the ImportError on their next
Telegram/Discord message when the stale gateway process lazily loaded
run_agent.py (new version) which also had the top-level import.

Fixes:
- hermes_cli/gateway.py: lazy import at call site (line 940)
- run_agent.py: lazy import at call site (line 6927)
- tools/terminal_tool.py: lazy imports at 3 call sites
- tools/tts_tool.py: static schema string (no module-level call)
- hermes_cli/auth.py: lazy import at call site (line 2024)
- hermes_cli/main.py: reload hermes_constants after git pull in cmd_update

Also fixes 4 pre-existing test failures in test_parse_env_var caused by
NameError on display_hermes_home in terminal_tool.py.
2026-03-29 15:15:17 -07:00
Teknium
9f01244137
fix: replace user-facing hardcoded ~/.hermes paths with display_hermes_home()
Prep for profiles: user-facing messages now use display_hermes_home() so
diagnostic output shows the correct path for each profile.

New helper: display_hermes_home() in hermes_constants.py
12 files swept, ~30 user-facing string replacements.
Includes dynamic TTS schema description.
2026-03-28 23:47:21 -07:00
Teknium
0bd7e95dfc
fix(honcho): allow self-hosted local instances without API key (#3644)
Self-hosted Honcho on localhost doesn't require authentication, but
both the activation gates and the SDK client required an API key.

Combined fix from three contributor PRs:
- Relax all 8 activation gates to accept (api_key OR base_url) as
  valid credentials (#3482 by @cameronbergh)
- Use 'local' placeholder for the SDK client when base_url points to
  localhost/127.0.0.1/::1 (#3570 by @ygd58)

Files changed: run_agent.py (2 gates), cli.py (1 gate),
gateway/run.py (1 gate), honcho_integration/cli.py (2 gates),
hermes_cli/doctor.py (2 gates), honcho_integration/client.py (SDK).

Co-authored-by: cameronbergh <cameronbergh@users.noreply.github.com>
Co-authored-by: ygd58 <ygd58@users.noreply.github.com>
Co-authored-by: devorun <devorun@users.noreply.github.com>
2026-03-28 17:49:56 -07:00
Teknium
857a5d7b47
fix: sanitize surrogate characters from clipboard paste to prevent UnicodeEncodeError (#3624)
Pasting text from rich-text editors (Google Docs, Word, etc.) can inject
lone surrogate characters (U+D800..U+DFFF) that are invalid UTF-8.
The OpenAI SDK serializes messages with ensure_ascii=False, then encodes
to UTF-8 for the HTTP body — surrogates crash this with:
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2'

Three-layer fix:
1. Primary: sanitize user_message at the top of run_conversation()
2. CLI: sanitize in chat() before appending to conversation_history
3. Safety net: catch UnicodeEncodeError in the API error handler,
   sanitize the entire messages list in-place, and retry once.
   Also exclude UnicodeEncodeError from is_local_validation_error
   so it doesn't get classified as non-retryable.

Includes 14 new tests covering the sanitization helpers and the
integration with run_conversation().
2026-03-28 16:53:14 -07:00
Teknium
924857c3e3
fix: prevent tool name/arg concatenation for Ollama-compatible endpoints (#3582)
Ollama reuses index 0 for every tool call in a parallel batch,
distinguishing them only by id.  The streaming accumulator now
detects a new non-empty id at an already-active index and redirects
it to a fresh slot, preventing names and arguments from being
concatenated into a single tool call.

No-op for normal providers that use incrementing indices.

Co-authored-by: dmater01 <dmater01@users.noreply.github.com>
2026-03-28 14:08:26 -07:00
Teknium
1d0a119368
fix(display): show reasoning before response when tool calls suppress content (#3566)
* fix(provider): remove MiniMax /v1→/anthropic auto-correction to allow user override

The minimax-specific auto-correction in runtime_provider.py was
preventing users from overriding to the OpenAI-compatible endpoint
via MINIMAX_BASE_URL. Users in certain regions get nginx 404 on
api.minimax.io/anthropic and need to switch to api.minimax.chat/v1.

The generic URL-suffix detection already handles /anthropic →
anthropic_messages, so the minimax-specific code was redundant for
the default path and harmful for the override path.

Now: default /anthropic URL works via generic detection, user
override to /v1 gets chat_completions mode naturally.

Closes #3546 (different approach — respects user overrides instead
of changing the default endpoint).

* fix(display): show reasoning during streaming even when tool calls suppress content

When a model generates content (containing <REASONING_SCRATCHPAD> tags)
alongside tool calls in the same API response, content deltas were
suppressed from streaming once any tool call chunk arrived. This
prevented the CLI's tag extraction from running, so reasoning was
never shown during streaming. The post-response fallback then
displayed reasoning AFTER the already-visible streamed response,
creating a confusing reversed order.

Fix: route suppressed content to stream_delta_callback even when tool
calls are present. The CLI's _stream_delta handles tag extraction —
reasoning tags are routed to the reasoning display box, while
non-reasoning text is handled by the existing stream display logic.
This ensures reasoning appears before tool execution and the final
response, matching the expected visual order.
2026-03-28 12:34:32 -07:00
Teknium
901494d728
feat: make tool-use enforcement configurable via agent.tool_use_enforcement (#3551)
The TOOL_USE_ENFORCEMENT_GUIDANCE injection (added in #3528) was
hardcoded to only match gpt/codex model names. This makes it a
config option so users can turn it on for any model family.

New config key: agent.tool_use_enforcement
  - "auto" (default): matches gpt/codex (existing behavior)
  - true: inject for all models
  - false: never inject
  - list of strings: custom model-name substrings to match
    e.g. ["gpt", "codex", "deepseek", "qwen"]

No version bump needed — deep merge provides the default
automatically for existing installs.

12 new tests covering all config modes.
2026-03-28 12:31:22 -07:00
Teknium
455bf2e853
feat: activate plugin lifecycle hooks (pre/post_llm_call, session start/end) (#3542)
The plugin system defined six lifecycle hooks but only pre_tool_call and
post_tool_call were invoked.  This activates the remaining four so that
external plugins (e.g. memory systems) can hook into the conversation
loop without touching core code.

Hook semantics:
- on_session_start: fires once when a new session is created
- pre_llm_call: fires once per turn before the tool-calling loop;
  plugins can return {"context": "..."} to inject into the ephemeral
  system prompt (not cached, not persisted)
- post_llm_call: fires once per turn after the loop completes, with
  user_message and assistant_response for sync/storage
- on_session_end: fires at the end of every run_conversation call

invoke_hook() now returns a list of non-None callback return values,
enabling pre_llm_call context injection while remaining backward
compatible (existing hooks that return None are unaffected).

Salvaged from PR #2823.

Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
2026-03-28 11:14:54 -07:00
Teknium
80a899a8e2
fix: enable fine-grained tool streaming for Claude/OpenRouter + retry SSE errors (#3497)
Root cause: Anthropic buffers entire tool call arguments and goes silent
for minutes while thinking (verified: 167s gap with zero SSE events on
direct API).  OpenRouter's upstream proxy times out after ~125s of
inactivity and drops the connection with 'Network connection lost'.

Fix: Send the x-anthropic-beta: fine-grained-tool-streaming-2025-05-14
header for Claude models on OpenRouter.  This makes Anthropic stream
tool call arguments token-by-token instead of buffering them, keeping
the connection alive through OpenRouter's proxy.

Live-tested: the exact prompt that consistently failed at ~128s now
completes successfully — 2,972 lines written, 49K tokens, 8 minutes.

Additional improvements:

1. Send explicit max_tokens for Claude through OpenRouter.  Without it,
   OpenRouter defaults to 65,536 (confirmed via echo_upstream_body) —
   only half of Opus 4.6's 128K limit.

2. Classify SSE 'Network connection lost' as retryable in the streaming
   inner retry loop.  The OpenAI SDK raises APIError from SSE error
   events, which was bypassing our transient error retry logic.

3. Actionable diagnostic guidance when stream-drop retries exhaust.
2026-03-28 08:01:37 -07:00
Teknium
831e8ba0e5
feat: tool-use enforcement + strip budget warnings from history (#3528)
Cherry-pick of feat/gpt-tool-steering with modifications:

1. Tool-use enforcement prompt (refactored from GPT-specific):
   - Renamed GPT_TOOL_USE_GUIDANCE -> TOOL_USE_ENFORCEMENT_GUIDANCE
   - Added TOOL_USE_ENFORCEMENT_MODELS tuple: ('gpt', 'codex')
   - Injection logic now checks against the tuple instead of hardcoding
     'gpt' — adding new model families is a one-line change
   - Addresses models describing actions instead of making tool calls

2. Budget warning history stripping:
   - _strip_budget_warnings_from_history() strips _budget_warning JSON
     keys and [BUDGET WARNING: ...] text from tool results at the start
     of run_conversation()
   - Prevents old budget warnings from poisoning subsequent turns

Based on PR #3479 by teknium1.
2026-03-28 07:38:36 -07:00
Teknium
15cfd20820
fix: cap context pressure percentage at 100% in display (#3480)
* fix: cap context pressure percentage at 100% in display

The forward-looking token estimate can overshoot the compaction threshold
(e.g. a large tool result pushes it from 70% to 109% in one step). The
progress bar was already capped via min(), but pct_int was not — causing
the user to see '109% to compaction' which is confusing.

Cap pct_int at 100 in both CLI and gateway display functions.

Reported by @JoshExile82.

* refactor: use real API token counts for compression decisions

Replace the rough chars/3 estimation with actual prompt_tokens +
completion_tokens from the API response. The estimation was needed to
predict whether tool results would push context past the threshold, but
the default 50% threshold leaves ample headroom — if tool results push
past it, the next API call reports real usage and triggers compression
then.

This removes all estimation from the compression and context pressure
paths, making both 100% data-driven from provider-reported token counts.

Also removes the dead _msg_count_before_tools variable.
2026-03-27 21:42:09 -07:00
Teknium
8fdfc4b00c
fix(agent): detect thinking-budget exhaustion on truncation, skip useless retries (#3444)
When finish_reason='length' and the response contains only reasoning
(think blocks or empty content), the model exhausted its output token
budget on thinking with nothing left for the actual response.

Previously, this fell into either:
- chat_completions: 3 useless continuation retries (model hits same limit)
- anthropic/codex: generic 'Response truncated' error with rollback

Now: detect the think-only + length condition early and return immediately
with a targeted error message: 'Model used all output tokens on reasoning
with none left for the response. Try lowering reasoning effort or
increasing max_tokens.'

This saves 2 wasted API calls on the chat_completions path and gives
users actionable guidance instead of a cryptic error.

The existing think-only retry logic (finish_reason='stop') is unchanged —
that's a genuine model glitch where retrying can help.
2026-03-27 15:29:30 -07:00
Teknium
6f11ff53ad
fix(anthropic): use model-native output limits instead of hardcoded 16K (#3426)
The Anthropic adapter defaulted to max_tokens=16384 when no explicit value
was configured.  This severely limits thinking-enabled models where thinking
tokens count toward max_tokens:

- Claude Opus 4.6 supports 128K output but was capped at 16K
- Claude Sonnet 4.6 supports 64K output but was capped at 16K

With extended thinking (adaptive or budget-based), the model could exhaust
the entire 16K on reasoning, leaving zero tokens for the actual response.
This caused two user-visible errors:
- 'Response truncated (finish_reason=length)' — thinking consumed most tokens
- 'Response only contains think block with no content' — thinking consumed all

Fix: add _ANTHROPIC_OUTPUT_LIMITS lookup table (sourced from Anthropic docs
and Cline's model catalog) and use the model's actual output limit as the
default.  Unknown future models default to 128K (the current maximum).

Also adds context_length clamping: if the user configured a smaller context
window (e.g. custom endpoint), max_tokens is clamped to context_length - 1
to avoid exceeding the window.

Closes #2706
2026-03-27 13:02:52 -07:00
Teknium
fb46a90098
fix: increase API timeout default from 900s to 1800s for slow-thinking models (#3431)
Models like GLM-5/5.1 can think for 15+ minutes. The previous 900s
(15 min) default for HERMES_API_TIMEOUT killed legitimate requests.

Raised to 1800s (30 min) in both places that read the env var:
- _build_api_kwargs() timeout (non-streaming total timeout)
- _call_chat_completions() write timeout (streaming connection)

The streaming per-chunk read timeout (60s) and stale stream detector
(180-300s) are unchanged — those are appropriate for inter-chunk timing.
2026-03-27 13:02:23 -07:00
Teknium
8ecd7aed2c
fix: prevent reasoning box from rendering 3x during tool-calling loops (#3405)
Two independent bugs caused the reasoning box to appear three times when
the model produced reasoning + tool_calls:

Bug A: _build_assistant_message() re-fired reasoning_callback with the full
reasoning text even when streaming had already displayed it. The original
guard only checked structured reasoning_content deltas, but reasoning also
arrives via content tag extraction (<REASONING_SCRATCHPAD>/<think> tags
in delta.content), which went through _fire_stream_delta not
_fire_reasoning_delta. Fix: skip the callback entirely when streaming is
active — both paths display reasoning during the stream. Any reasoning not
shown during streaming is caught by the CLI post-response fallback.

Bug B: The post-response reasoning display checked _reasoning_stream_started,
but that flag was reset by _reset_stream_state() during intermediate turn
boundaries (when stream_delta_callback(None) fires between tool calls).
Introduced _reasoning_shown_this_turn flag that persists across the tool
loop and is only reset at the start of each user turn.

Live-tested in PTY: reasoning now shows exactly once per API call, no
duplicates across tool-calling loops.
2026-03-27 09:57:50 -07:00
Teknium
5a1e2a307a
perf(ttft): salvage easy-win startup optimizations from #3346 (#3395)
* perf(ttft): dedupe shared tool availability checks

* perf(ttft): short-circuit vision auto-resolution

* perf(ttft): make Claude Code version detection lazy

* perf(ttft): reuse loaded toolsets for skills prompt

---------

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-03-27 07:49:44 -07:00
Teknium
915df02bbf
fix(streaming): stale stream detector race causing spurious RemoteProtocolError
The stale stream detector (90s timeout) was killing healthy connections
during the model's thinking phase, producing self-inflicted
RemoteProtocolError ("peer closed connection without sending complete
message body"). Three issues:

1. last_chunk_time was never reset between inner stream retries, so
   subsequent attempts inherited the previous attempt's stale budget
2. The non-streaming fallback path didn't reset the timer either
3. 90s base timeout was too aggressive for large-context Opus sessions
   where thinking time before first token routinely exceeds 90s

Fix: reset last_chunk_time at the start of each streaming attempt and
before the non-streaming fallback. Increase base timeout to 180s and
scale to 300s for >100K token contexts.

Made-with: Cursor
2026-03-27 04:05:51 -07:00
Teknium
1519c4d477
fix(session): add /resume CLI handler, session log truncation guard, reopen_session API (#3315)
Three improvements salvaged from PR #3225 by Mibayy:

1. Add /resume slash command handler in CLI process_command(). The
   command was registered in the commands registry but had no handler,
   so typing /resume produced 'Unknown command'. The handler resolves
   by title or session ID, ends the current session cleanly, loads
   conversation history from SQLite, re-opens the target session, and
   syncs the AIAgent instance. Follows the same pattern as new_session().

2. Add truncation guard in _save_session_log(). When resuming a session
   whose messages weren't fully written to SQLite, the agent starts with
   partial history and the first save would overwrite the full JSON log
   on disk. The guard reads the existing file and skips the write if it
   already has more messages than the current batch.

3. Add reopen_session() method to SessionDB. Proper API for clearing
   ended_at/end_reason instead of reaching into _conn directly.

Note: Bug 1 from the original PR (INSERT OR IGNORE + _session_db = None)
is already fixed on main — skipped as redundant.

Closes #3123.
2026-03-26 19:04:28 -07:00
Teknium
60fdb58ce4
fix(agent): update context compressor limits after fallback activation (#3305)
When _try_activate_fallback() switches to the fallback model, it
updates the agent's model/provider/client but never touches
self.context_compressor. The compressor keeps the primary model's
context_length and threshold_tokens, so compression decisions use
wrong limits — a 200K primary → 32K fallback still uses 200K-based
thresholds, causing oversized sessions to overflow the fallback.

Update the compressor's model, credentials, context_length, and
threshold_tokens after fallback activation using get_model_context_length()
for the new model.

Cherry-picked from PR #3202 by binhnt92.

Co-authored-by: binhnt92 <binhnt.ht.92@gmail.com>
2026-03-26 18:10:50 -07:00
Teknium
0375b2a0d7
fix(gateway): silence background agent terminal output (#3297)
* fix(gateway): silence flush agent terminal output

quiet_mode=True only suppresses AIAgent init messages.
Tool call output still leaks to the terminal through
_safe_print → _print_fn during session reset/expiry.

Since #2670 injected live memory state into the flush prompt,
the flush agent now reliably calls memory tools — making the
output leak noticeable for the first time.

Set _print_fn to a no-op so the background flush is fully silent.

* test(gateway): add test for flush agent terminal silence + fix dotenv mock

- Add TestFlushAgentSilenced: verifies _print_fn is set to a no-op on
  the flush agent so tool output never leaks to the terminal
- Fix pre-existing test failures: replace patch('run_agent.AIAgent')
  with sys.modules mock to avoid importing run_agent (requires openai)
- Add autouse _mock_dotenv fixture so all tests in this file run
  without the dotenv package installed

* fix(display): route KawaiiSpinner output through print_fn to fully silence flush agent

The previous fix set tmp_agent._print_fn = no-op on the flush agent but
spinner output and quiet-mode cute messages bypassed _print_fn entirely:
- KawaiiSpinner captured sys.stdout at __init__ and wrote directly to it
- quiet-mode tool results used builtin print() instead of _safe_print()

Add optional print_fn parameter to KawaiiSpinner.__init__; _write routes
through it when set. Pass self._print_fn to all spinner construction sites
in run_agent.py and change the quiet-mode cute message print to _safe_print.
The existing gateway fix (tmp_agent._print_fn = lambda) now propagates
correctly through both paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(gateway): silence hygiene and compression background agents

Two more background AIAgent instances in the gateway were created with
quiet_mode=True but without _print_fn = no-op, causing tool output to
leak to the terminal:
- _hyg_agent (in-turn hygiene memory agent)
- tmp_agent (_compress_context path)

Apply the same _print_fn no-op pattern used for the flush agent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(display): remove unused _last_flush_time from KawaiiSpinner

Attribute was set but never read; upstream already removed it.
Leftover from conflict resolution during rebase onto upstream/main.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Dilee <uzmpsk.dilekakbas@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 17:40:31 -07:00
Teknium
08fa326bb0
feat(gateway): deliver background review notifications to user chat (#3293)
The background memory/skill review (_spawn_background_review) runs
after the agent response when turn/iteration counters exceed their
thresholds. It saves memories and skills, then prints a summary like
'💾 Memory updated · User profile updated'. In CLI mode this goes to
the terminal via _safe_print. In gateway mode, _safe_print routes to
print() which goes to stdout — invisible to the user.

Add a background_review_callback attribute to AIAgent. When set, the
background review thread calls it with the summary string after saves
complete. The gateway wires this to adapter.send() via the same
run_coroutine_threadsafe bridge used by status_callback, delivering
the notification to the user's chat.
2026-03-26 17:38:24 -07:00
Robin Fernandes
e95965d76a Merge branch 'main' into rewbs/tool-use-charge-to-subscription 2026-03-26 16:18:28 -07:00
Robin Fernandes
95dc9aaa75 feat: add managed tool gateway and Nous subscription support
- add managed modal and gateway-backed tool integrations\n- improve CLI setup, auth, and configuration for subscriber flows\n- expand tests and docs for managed tool support
2026-03-26 16:17:58 -07:00
Teknium
db241ae6ce
feat(sessions): add --source flag for third-party session isolation (#3255)
When third-party tools (Paperclip orchestrator, etc.) spawn hermes chat
as a subprocess, their sessions pollute user session history and search.

- hermes chat --source <tag> (also HERMES_SESSION_SOURCE env var)
- exclude_sources parameter on list_sessions_rich() and search_messages()
- Sessions with source=tool hidden from sessions list/browse/search
- Third-party adapters pass --source tool to isolate agent sessions

Cherry-picked from PR #3208 by HenkDz.

Co-authored-by: Henkey <noonou7@gmail.com>
2026-03-26 14:35:31 -07:00
Teknium
41ee207a5e
fix: catch KeyboardInterrupt in exit cleanup handlers (#3257)
except Exception does not catch KeyboardInterrupt (inherits from
BaseException). A second Ctrl+C during exit cleanup aborts pending
writes — Honcho observations dropped, SQLite sessions left unclosed,
cron job sessions never marked ended.

Changed to except (Exception, KeyboardInterrupt) at all five sites:
- cli.py: honcho.shutdown() and end_session() in finally exit block
- run_agent.py: _flush_honcho_on_exit atexit handler
- cron/scheduler.py: end_session() and close() in job finally block

Tests exercise the actual production code paths and confirm
KeyboardInterrupt propagates without the fix.

Co-authored-by: dieutx <dangtc94@gmail.com>
2026-03-26 14:34:31 -07:00
Teknium
b81d49dc45
fix(state): SQLite concurrency hardening + session transcript integrity (#3249)
* fix(session-db): survive CLI/gateway concurrent write contention

Closes #3139

Three layered fixes for the scenario where CLI and gateway write to
state.db concurrently, causing create_session() to fail with
'database is locked' and permanently disabling session_search on the
gateway side.

1. Increase SQLite connection timeout: 10s -> 30s
   hermes_state.py: longer window for the WAL writer to finish a batch
   flush before the other process gives up entirely.

2. INSERT OR IGNORE in create_session
   hermes_state.py: prevents IntegrityError on duplicate session IDs
   (e.g. gateway restarts while CLI session is still alive).

3. Don't null out _session_db on create_session failure  (main fix)
   run_agent.py: a transient lock at agent startup must not permanently
   disable session_search for the lifetime of that agent instance.
   _session_db now stays alive so subsequent flushes and searches work
   once the lock clears.

4. New ensure_session() helper + call it during flush
   hermes_state.py: INSERT OR IGNORE for a minimal session row.
   run_agent.py _flush_messages_to_session_db: calls ensure_session()
   before appending messages, so the FK constraint is satisfied even
   when create_session() failed at startup. No-op when the row exists.

* fix(state): release lock between context queries in search_messages

The context-window queries (one per FTS5 match) were running inside
the same lock acquisition as the primary FTS5 query, holding the lock
for O(N) sequential SQLite round-trips. Move per-match context fetches
outside the outer lock block so each acquires the lock independently,
keeping critical sections short and allowing other threads to interleave.

* fix(session): prefer longer source in load_transcript to prevent legacy truncation

When a long-lived session pre-dates SQLite storage (e.g. sessions
created before the DB layer was introduced, or after a clean
deployment that reset the DB), _flush_messages_to_session_db only
writes the *new* messages from the current turn to SQLite — it skips
messages already present in conversation_history, assuming they are
already persisted.

That assumption fails for legacy JSONL-only sessions:

  Turn N (first after DB migration):
    load_transcript(id)       → SQLite: 0  → falls back to JSONL: 994 ✓
    _flush_messages_to_session_db: skip first 994, write 2 new → SQLite: 2

  Turn N+1:
    load_transcript(id)       → SQLite: 2  → returns immediately ✗
    Agent sees 2 messages of history instead of 996

The same pattern causes the reported symptom: session JSON truncated
to 4 messages (_save_session_log writes agent.messages which only has
2 history + 2 new = 4).

Fix: always load both sources and return whichever is longer.  For a
fully-migrated session SQLite will always be ≥ JSONL, so there is no
regression.  For a legacy session that hasn't been bootstrapped yet,
JSONL wins and the full history is restored.

Closes #3212

* test: add load_transcript source preference tests for #3212

Covers: JSONL longer returns JSONL, SQLite longer returns SQLite,
SQLite empty falls back to JSONL, both empty returns empty, equal
length prefers SQLite (richer reasoning fields).

---------

Co-authored-by: Mibayy <mibayy@hermes.ai>
Co-authored-by: kewe63 <kewe.3217@gmail.com>
Co-authored-by: Mibayy <mibayy@users.noreply.github.com>
2026-03-26 13:47:14 -07:00
Teknium
43af094ae3
fix(agent): include tool tokens in preflight estimate, guard context probe persistence (#3164)
Two improvements salvaged from PR #2600 (paraddox):

1. Preflight compression now counts tool schema tokens alongside system
   prompt and messages.  With 50+ tools enabled, schemas can add 20-30K
   tokens that were previously invisible to the estimator, delaying
   compression until the API rejected the request.

2. Context probe persistence guard: when the agent steps down context
   tiers after a context-length error, only provider-confirmed numeric
   limits (parsed from the error message) are cached to disk.  Guessed
   fallback tiers from get_next_probe_tier() stay in-memory only,
   preventing wrong values from polluting the persistent cache.

Co-authored-by: paraddox <paraddox@users.noreply.github.com>
2026-03-26 02:00:50 -07:00
Teknium
4a56e2cd88
fix(display): show tool progress for substantive tools, not just "preparing"
_mute_post_response was set True whenever a turn had both content
and tool_calls, suppressing ALL subsequent _vprint output including
tool completion messages. This meant users only saw "preparing
search_files..." but never the result.

Now only mutes output when every tool in the batch is housekeeping
(memory, todo, skill_manage, session_search). Substantive tools
like search_files, read_file, write_file, terminal etc. keep their
completion messages visible.

Also fixes: run_conversation no longer raises on max retries
(returns graceful error dict instead), and cli.py wraps the agent
thread in try/except as a safety net.

Made-with: Cursor
2026-03-26 01:52:52 -07:00
Teknium
c511e087e0
fix(agent): always prefer streaming for API calls to prevent hung subagents (#3120)
The non-streaming API call path (_interruptible_api_call) had no
wall-clock timeout. When providers keep connections alive with SSE
keep-alive pings but never deliver a response, httpx's inactivity
timeout never fires and the call hangs indefinitely.

Subagents always used the non-streaming path because they have no
stream consumers (quiet_mode=True). This caused delegate_task to
hang for 40+ minutes in production.

The streaming path has two layers of protection:
- httpx read timeout (60s, HERMES_STREAM_READ_TIMEOUT)
- Stale stream detection (90s, HERMES_STREAM_STALE_TIMEOUT)

Both work because streaming sends chunks continuously — a 90-second
gap between chunks genuinely means the connection is broken, even for
reasoning models that take minutes to complete.

Now run_conversation() always prefers the streaming path. The streaming
method falls back to non-streaming automatically if the provider
doesn't support it. Stream delta callbacks are no-ops when no
consumers are registered, so there's no overhead for subagents.
2026-03-26 01:22:31 -07:00
Teknium
c07c17f5f2
feat(agent): surface all retry/fallback/compression lifecycle events (#3153)
Add _emit_status() helper that sends lifecycle notifications to both
CLI (via _vprint force=True) and gateway (via status_callback). No
retry, fallback, or compression path is silent anymore.

Pathways surfaced:
- General retry backoff: was logger-only, now shows countdown
- Provider fallback: changed raw print() to _emit_status for gateway
- Rate limit eager fallback: new notification before switching
- Empty/malformed response fallback: new notification
- Client error fallback: new notification with HTTP status
- Max retries fallback: new notification before attempting
- Max retries giving up: upgraded from _vprint to _emit_status
- Compression retry (413 + context overflow): upgraded to _emit_status
- Compression success + retry: upgraded to _emit_status (2 instances)
2026-03-26 01:08:47 -07:00
Teknium
cbf195e806
chore: fix 154 f-strings, simplify getattr/URL patterns, remove dead code (#3119)
Three categories of cleanup, all zero-behavioral-change:

1. F-strings without placeholders (154 fixes across 29 files)
   - Converted f'...' to '...' where no {expression} was present
   - Heaviest files: run_agent.py (24), cli.py (20), honcho_integration/cli.py (34)

2. Simplify defensive patterns in run_agent.py
   - Added explicit self._is_anthropic_oauth = False in __init__ (before
     the api_mode branch that conditionally sets it)
   - Replaced 7x getattr(self, '_is_anthropic_oauth', False) with direct
     self._is_anthropic_oauth (attribute always initialized now)
   - Added _is_openrouter_url() and _is_anthropic_url() helper methods
   - Replaced 3 inline 'openrouter' in self._base_url_lower checks

3. Remove dead code in small files
   - hermes_cli/claw.py: removed unused 'total' computation
   - tools/fuzzy_match.py: removed unused strip_indent() function and
     pattern_stripped variable

Full test suite: 6184 passed, 0 failures
E2E PTY: banner clean, tool calls work, zero garbled ANSI
2026-03-25 19:47:58 -07:00
Teknium
08d3be0412
fix: graceful return on max retries instead of crashing thread
run_conversation raised the raw exception after exhausting retries,
which crashed the background thread in cli.py (unhandled exception
in Thread). Now returns a proper error result dict with failed=True
and persists the session, matching the pattern used by other error
paths (invalid responses, empty content, etc.).

Also wraps cli.py's run_agent thread function in try/except as a
safety net against any future unhandled exceptions from
run_conversation.

Made-with: Cursor
2026-03-25 19:00:39 -07:00
Teknium
156b50358b
fix(reasoning): skip duplicate callback for <think>-extracted reasoning during streaming (#3116)
Local models (Ollama, LM Studio) embed reasoning in <think> tags in
delta.content. During streaming, _stream_delta() already displays these
blocks. Then _build_assistant_message() extracts them again and fires
reasoning_callback, causing duplicate display.

Track whether reasoning came from structured fields (reasoning_content)
vs <think> tag extraction. Only fire the callback for <think>-extracted
reasoning when stream_delta_callback is NOT active. Structured reasoning
always fires regardless.

Salvaged from PR #2076 by dusterbloom (Fix A only — Fix B was already
covered by PR #3013's _current_reasoning_callback centralization).
Closes #2069.
2026-03-25 18:57:18 -07:00
Teknium
5b29ff50f8
fix(logging): extract useful info from HTML error pages, dump debug on max retries
Three problems with API error debugging:

1. Terminal showed str(error)[:200] — raw HTML gibberish for Cloudflare
   502/503 pages instead of "502 Bad Gateway"
2. errors.log dumped the entire HTML page as unstructured text
3. _dump_api_request_debug was never called when retries exhausted,
   only for non-retryable 4xx errors

Adds _summarize_api_error() that extracts <title> and Cloudflare Ray ID
from HTML error pages, and falls back to SDK error body messages. Now
the terminal shows clean one-liners like:

  📝 Error: HTTP 502 — openrouter.ai | 502: Bad gateway — Ray 9e226...

Also calls _dump_api_request_debug on max_retries_exhausted so the full
request context is written to ~/.hermes/sessions/ for post-mortem.

Made-with: Cursor
2026-03-25 18:36:04 -07:00
Teknium
4b45f65858
fix: update api_key in _try_activate_fallback for subagent auth (#3103)
When fallback activates (e.g. minimax → OpenRouter), self.provider,
self.base_url, self.api_mode, and self._client_kwargs were all updated
but self.api_key was not. delegate_tool.py reads parent_agent.api_key
to pass credentials to child agents, so subagents inherited the stale
pre-fallback key (e.g. a minimax key sent to OpenRouter), causing 401
Missing Authentication errors.

Add self.api_key = ... in both the anthropic_messages and
chat_completions branches of _try_activate_fallback().
2026-03-25 18:23:03 -07:00
Teknium
b374f52063
fix(session): clear compressor summary and turn counter on /clear and /new (#3102)
reset_session_state() was missing two fields added after it was written:
- _user_turn_count: kept accumulating across sessions, affecting
  flush_min_turns guard behavior
- context_compressor._previous_summary: old session's compression
  summary leaked into new session's iterative compression

Cherry-picked from PR #2640 by dusterbloom. Closes #2635.
2026-03-25 18:22:21 -07:00
Teknium
712cebc40f
fix(logging): show HTTP status code and 400 body in API error output (#3096)
When an API call fails, the terminal output now includes the HTTP status
code in the header line and, for 400 errors, the response body from the
provider (truncated to 300 chars). Makes it much easier to diagnose
issues like invalid model names or malformed requests that were
previously hidden behind generic error messages.

Salvaged from PR #2646 by Mibayy. Fixes #2644.
2026-03-25 17:47:55 -07:00
Teknium
bd6b138e85
fix: clean up HTML error messages in CLI display (#3069)
When API calls fail with HTML error pages (e.g., CloudFlare errors), the CLI
was dumping raw HTML content to users like:
  📝 Error: <!DOCTYPE html><!--[if lt IE 7]> <html class="no-js ie6...

This commit adds a _clean_error_message() utility method that:
- Detects HTML content and replaces with user-friendly message
- Collapses multiline errors to single line
- Truncates overly long errors (>150 chars)
- Preserves meaningful error text for regular errors

Applied to all user-facing error displays:
- API call failure messages (line 6314)
- Interrupt error responses (line 6324)
- Invalid response error messages (line 6000)

Before: 📝 Error: <!DOCTYPE html><!--[if lt IE 7]>...
After:  📝 Error: Service temporarily unavailable (HTML error page returned)
2026-03-25 16:39:22 -07:00
Teknium
9792bde31a
fix(agent): count compression restarts toward retry limit (#3070)
When context overflow triggers compression, the outer retry loop
restarts via continue without incrementing retry_count. If compression
reduces messages but not enough to fit the context window, this creates
an infinite loop burning API credits: API call → overflow → compress →
retry → overflow → compress → ...

Increment retry_count on compression restarts so the loop exits after
max_retries total attempts.

Cherry-picked from PR #2766 by dieutx.
2026-03-25 16:35:17 -07:00
Teknium
f7f30aaab9
fix(streaming): detect and kill stale SSE connections
Adds a wall-clock stale stream detector (HERMES_STREAM_STALE_TIMEOUT,
default 90s) that force-closes the httpx client when no real chunks
arrive, even if SSE keep-alive pings keep the socket alive. Works
with the existing streaming retry loop to recover via fresh connection.

Made-with: Cursor
2026-03-25 16:07:05 -07:00
Teknium
77bcaba2d7
refactor: consolidate get_hermes_home() and parse_reasoning_effort() (#3062)
Centralizes two widely-duplicated patterns into hermes_constants.py:

1. get_hermes_home() — Path resolution for ~/.hermes (HERMES_HOME env var)
   - Was copy-pasted inline across 30+ files as:
     Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
   - Now defined once in hermes_constants.py (zero-dependency module)
   - hermes_cli/config.py re-exports it for backward compatibility
   - Removed local wrapper functions in honcho_integration/client.py,
     tools/website_policy.py, tools/tirith_security.py, hermes_cli/uninstall.py

2. parse_reasoning_effort() — Reasoning effort string validation
   - Was copy-pasted in cli.py, gateway/run.py, cron/scheduler.py
   - Same validation logic: check against (xhigh, high, medium, low, minimal, none)
   - Now defined once in hermes_constants.py, called from all 3 locations
   - Warning log for unknown values kept at call sites (context-specific)

31 files changed, net +31 lines (125 insertions, 94 deletions)
Full test suite: 6179 passed, 0 failed
2026-03-25 15:54:28 -07:00
Teknium
8bb1d15da4
chore: remove ~100 unused imports across 55 files (#3016)
Automated cleanup via pyflakes + autoflake with manual review.

Changes:
- Removed unused stdlib imports (os, sys, json, pathlib.Path, etc.)
- Removed unused typing imports (List, Dict, Any, Optional, Tuple, Set, etc.)
- Removed unused internal imports (hermes_cli.auth, hermes_cli.config, etc.)
- Fixed cli.py: removed 8 shadowed banner imports (imported from hermes_cli.banner
  then immediately redefined locally — only build_welcome_banner is actually used)
- Added noqa comments to imports that appear unused but serve a purpose:
  - Re-exports (gateway/session.py SessionResetPolicy, tools/terminal_tool.py
    is_interrupted/_interrupt_event)
  - SDK presence checks in try/except (daytona, fal_client, discord)
  - Test mock targets (auxiliary_client.py Path, mcp_config.py get_hermes_home)

Zero behavioral changes. Full test suite passes (6162/6162, 2 pre-existing
streaming test failures unrelated to this change).
2026-03-25 15:02:03 -07:00
Teknium
94e3d9adbf
fix(agent): restore safe non-streaming fallback after stream failures (#3020)
After streaming retries are exhausted on transient errors, fall back to
non-streaming instead of propagating the error. Also fall back for any
other pre-delivery stream error (not just 'streaming not supported').

Added user-facing message when streaming is not supported by a model/
provider, directing users to set display.streaming: false in config.yaml
to avoid the fallback delay.

Cherry-picked from PR #3008 by kshitijk4poor. Added UX message for
streaming-not-supported detection.

Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
2026-03-25 12:46:04 -07:00
Teknium
099dfca6db
fix: GLM reasoning-only and max-length handling (#3010)
- Add 'prompt exceeds max length' to context overflow detection for
  Z.AI/GLM 400 errors
- Extract inline reasoning blocks from assistant content as fallback
  when no structured reasoning fields are present
- Guard inline extraction so structured API reasoning takes priority
- Update test for reasoning-only response salvage behavior

Cherry-picked from PR #2993 by kshitijk4poor. Added priority guard
to fix test_structured_reasoning_takes_priority failure.

Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
2026-03-25 12:05:37 -07:00
Teknium
68ab37e891
fix(delegate): give subagents independent iteration budgets (#3004)
Each subagent now gets its own IterationBudget instead of sharing the
parent's.  The per-subagent cap is controlled by delegation.max_iterations
in config.yaml (default 50).  Total iterations across parent + subagents
can exceed the parent's max_iterations, but the user retains control via
the config setting.

Previously, subagents shared the parent's budget, so three parallel
subagents configured for max_iterations=50 racing against a parent that
already used 60 of 90 would each only get ~10 iterations.

Inspired by PR #2928 (Bartok9) which identified the issue (#2873).
2026-03-25 11:29:49 -07:00
Teknium
61949f0af7
Fix (#2997)
Co-authored-by: Jack <jvand@DESKTOP-JACK.localdomain>
2026-03-25 11:12:11 -07:00
Teknium
52c5e491f5
fix(session): surface silent SessionDB failures that cause session data loss (#2999)
* fix(session): surface silent SessionDB failures that cause session data loss

SessionDB initialization and operation failures were logged at debug level
or silently swallowed, causing sessions to never be indexed in the FTS5
database. This made session_search unable to find affected conversations.

In practice, ~48% of sessions can be lost without any visible indication.
The JSON session files are still written (separate code path), but the
SQLite/FTS5 index gets nothing — making session_search return empty results
for affected sessions.

Changes:
- cli.py: Log warnings (not debug) when SessionDB init fails at both
  __init__ and _start_session entry points
- run_agent.py: Log warnings on create_session, append_message, and
  compression split failures
- run_agent.py: Set _session_db = None after create_session failure to
  fail fast instead of silently dropping every message for the session

Root cause: When gateway restarts or DB lock contention occurs during
SessionDB() init, the exception is caught and swallowed. The agent
continues running normally — JSON session logs are written to disk —
but no messages reach the FTS5 index.

* fix: use module logger instead of root logging for SessionDB warnings

Follow-up to cherry-picked PR #2939 — the original used logging.warning()
(root logger) instead of logger.warning() (module logger) in the 5 new
warning calls. Module logger preserves the logger hierarchy and shows the
correct module name in log output.

---------

Co-authored-by: LucidPaths <lc77@outlook.de>
2026-03-25 11:10:19 -07:00
Teknium
42fec19151
feat: persist reasoning across gateway session turns (schema v6) (#2974)
feat: persist reasoning across gateway session turns (schema v6)

Tested against OpenAI Codex (direct), Anthropic (direct + OAI-compat), and OpenRouter → 6 backends. All reasoning field types (reasoning, reasoning_details, codex_reasoning_items) round-trip through the DB correctly.
2026-03-25 09:47:28 -07:00
Teknium
5dbe2d9d73
fix: skills-sh install fails for deeply nested repo structures (#2980)
* fix(run_agent): ensure _fire_first_delta() is called for tool generation events

Added calls to _fire_first_delta() in the AIAgent class to improve the handling of tool generation events, ensuring timely notifications during the processing of function calls and tool usage.

* fix(run_agent): improve timeout handling for chat completions

Enhanced the timeout configuration for chat completions in the AIAgent class by introducing customizable connection, read, and write timeouts using environment variables. This ensures more robust handling of API requests during streaming operations.

* fix(run_agent): reduce default stream read timeout for chat completions

Updated the default stream read timeout from 120 seconds to 60 seconds in the AIAgent class, enhancing the timeout configuration for chat completions. This change aims to improve responsiveness during streaming operations.

* fix(run_agent): enhance streaming error handling and retry logic

Improved the error handling and retry mechanism for streaming requests in the AIAgent class. Introduced a configurable maximum number of stream retries and refined the handling of transient network errors, allowing for retries with fresh connections. Non-transient errors now trigger a fallback to non-streaming only when appropriate, ensuring better resilience during API interactions.

* fix: skills-sh install fails for deeply nested repo structures

Skills in repos with deep directory nesting (e.g.
cli-tool/components/skills/development/senior-backend/) could not be
installed because the candidate path generation and shallow root-dir
scan never reached them.

Added GitHubSource._find_skill_in_repo_tree() which uses the GitHub
Trees API to recursively search the entire repo tree in a single API
call. This is used as a final fallback in
SkillsShSource._discover_identifier() when the standard candidate
paths and shallow scan both fail.

Fixes installation of skills from repos like davila7/claude-code-templates
where skills are nested 4+ levels deep.

Reported by user Samuraixheart.
2026-03-25 09:31:05 -07:00
Teknium
fd292e676b
fix: skip KawaiiSpinner when TUI handles tool progress (#2973)
* docs: unify hooks documentation — add plugin hooks to hooks page, add session:end event

The hooks page only documented gateway event hooks (HOOK.yaml system).
The plugins page listed plugin hooks (pre_tool_call, etc.) that weren't
referenced from the hooks page, which was confusing.

Changes:
- hooks.md: Add overview table showing both hook systems
- hooks.md: Add Plugin Hooks section with available hooks, callback
  signatures, and example
- hooks.md: Add missing session:end gateway event (emitted but undocumented)
- hooks.md: Mark pre_llm_call, post_llm_call, on_session_start,
  on_session_end as planned (defined in VALID_HOOKS but not yet invoked)
- hooks.md: Update info box to cross-reference plugin hooks
- hooks.md: Fix heading hierarchy (gateway content as subsections)
- plugins.md: Add cross-reference to hooks page for full details
- plugins.md: Mark planned hooks as (planned)

* feat(session_search): add recent sessions mode when query is omitted

When session_search is called without a query (or with an empty query),
it now returns metadata for the most recent sessions instead of erroring.
This lets the agent quickly see what was worked on recently without
needing specific keywords.

Returns for each session: session_id, title, source, started_at,
last_active, message_count, preview (first user message).
Zero LLM cost — pure DB query. Current session lineage and child
delegation sessions are excluded.

The agent can then keyword-search specific sessions if it needs
deeper context from any of them.

* docs: clarify two-mode behavior in session_search schema description

* fix(compression): restore sane defaults and cap summary at 12K tokens

- threshold: 0.80 → 0.50 (compress at 50%, not 80%)
- target_ratio: 0.40 → 0.20, now relative to threshold not total context
  (20% of 50% = 10% of context as tail budget)
- summary ceiling: 32K → 12K (Gemini can't output more than ~12K)
- Updated DEFAULT_CONFIG, config display, example config, and tests

* fix: browser_vision ignores auxiliary.vision.timeout config (#2901)

* docs: unify hooks documentation — add plugin hooks to hooks page, add session:end event

The hooks page only documented gateway event hooks (HOOK.yaml system).
The plugins page listed plugin hooks (pre_tool_call, etc.) that weren't
referenced from the hooks page, which was confusing.

Changes:
- hooks.md: Add overview table showing both hook systems
- hooks.md: Add Plugin Hooks section with available hooks, callback
  signatures, and example
- hooks.md: Add missing session:end gateway event (emitted but undocumented)
- hooks.md: Mark pre_llm_call, post_llm_call, on_session_start,
  on_session_end as planned (defined in VALID_HOOKS but not yet invoked)
- hooks.md: Update info box to cross-reference plugin hooks
- hooks.md: Fix heading hierarchy (gateway content as subsections)
- plugins.md: Add cross-reference to hooks page for full details
- plugins.md: Mark planned hooks as (planned)

* fix: browser_vision ignores auxiliary.vision.timeout config

browser_vision called call_llm() without passing a timeout parameter,
so it always used the 30-second default in auxiliary_client.py. This
made vision analysis with local models (llama.cpp, ollama) impossible
since they typically need more than 30s for screenshot analysis.

Now browser_vision reads auxiliary.vision.timeout from config.yaml
(same config key that vision_analyze already uses) and passes it
through to call_llm().

Also bumped the default vision timeout from 30s to 120s in both
browser_vision and vision_analyze — 30s is too aggressive for local
models and the previous default silently failed for anyone running
vision locally.

Fixes user report from GamerGB1988.

* fix(skills): agent-created skills were incorrectly treated as untrusted community content

_resolve_trust_level() didn't handle 'agent-created' source, so it
fell through to 'community' trust level. Community policy blocks on
any caution or dangerous findings, which meant common patterns like
curl with env vars, systemctl, crontab, cloudflared references etc.
would block skill creation/patching.

The agent-created policy row already existed in INSTALL_POLICY with
permissive settings (allow caution, ask on dangerous) but was never
reached. Now it is.

Fixes reports of skill_manage being blocked by security scanner.

* fix(cli): enhance real-time reasoning output by forcing flush of long partial lines

Updated the reasoning output mechanism to emit complete lines and force-flush long partial lines, ensuring reasoning is visible in real-time even without newlines. This improves user experience during reasoning sessions.

* fix: skip KawaiiSpinner when TUI handles tool progress

In the interactive CLI, the agent runs with quiet_mode=True and
tool_progress_callback set. The quiet_mode condition triggered
KawaiiSpinner for every tool call, but the TUI was already handling
progress display via the spinner widget.

The KawaiiSpinner writes carriage-return animation through StdoutProxy,
triggering run_in_terminal() erase/redraw cycles on every flush. These
redundant cycles cause the status bar to ghost into terminal scrollback.

The thinking spinner already had this guard (checks thinking_callback).
This extends the same pattern to the three tool spinner creation sites:
concurrent tools, delegate_task, and single tool execution.
2026-03-25 08:33:44 -07:00
Teknium
7ca22ea11b
fix(compression): restore sane defaults and cap summary at 12K tokens
- threshold: 0.80 → 0.50 (compress at 50%, not 80%)
- target_ratio: 0.40 → 0.20, now relative to threshold not total context
  (20% of 50% = 10% of context as tail budget)
- summary ceiling: 32K → 12K (Gemini can't output more than ~12K)
- Updated DEFAULT_CONFIG, config display, example config, and tests
2026-03-24 18:48:47 -07:00
Teknium
9231a335d4
fix(compression): replace dead summary_target_tokens with ratio-based scaling (#2554)
The summary_target_tokens parameter was accepted in the constructor,
stored on the instance, and never used — the summary budget was always
computed from hardcoded module constants (_SUMMARY_RATIO=0.20,
_MAX_SUMMARY_TOKENS=8000). This caused two compounding problems:

1. The config value was silently ignored, giving users no control
   over post-compression size.
2. Fixed budgets (20K tail, 8K summary cap) didn't scale with
   context window size. Switching from a 1M-context model to a
   200K model would trigger compression that nuked 350K tokens
   of conversation history down to ~30K.

Changes:
- Replace summary_target_tokens with summary_target_ratio (default 0.40)
  which sets the post-compression target as a fraction of context_length.
  Tail token budget and summary cap now scale proportionally:
    MiniMax 200K → ~80K post-compression
    GPT-5   1M  → ~400K post-compression
- Change threshold_percent default: 0.50 → 0.80 (don't fire until
  80% of context is consumed)
- Change protect_last_n default: 4 → 20 (preserve ~10 full turns)
- Summary token cap scales to 5% of context (was fixed 8K), capped
  at 32K ceiling
- Read target_ratio and protect_last_n from config.yaml compression
  section (both are now configurable)
- Remove hardcoded summary_target_tokens=500 from run_agent.py
- Add 5 new tests for ratio scaling, clamping, and new defaults
2026-03-24 17:45:49 -07:00
Teknium
8ee4f32819
fix(gateway): use TERMINAL_CWD for context file discovery, not process cwd
The gateway process runs from the hermes-agent install directory, so
os.getcwd() picks up the repo's AGENTS.md (16k chars) and other dev
context files — inflating input tokens by ~10k on every gateway message.

Fix: use TERMINAL_CWD (which the gateway sets to MESSAGING_CWD or
$HOME) as the cwd for build_context_files_prompt(). In CLI mode,
TERMINAL_CWD is the user's actual project directory, so behavior
is unchanged.

Before: gateway 15-20k input tokens, CLI 6-8k
After:  gateway ~6-8k input tokens (same as CLI)

Reported by keri on Discord.
2026-03-24 17:30:33 -07:00
Teknium
618f15dda9
fix: reorder setup wizard providers — OpenRouter first
Move OpenRouter to position 1 in the setup wizard's provider list
to match hermes model ordering. Update default selection index and
fix test expectations for the new ordering.

Setup order: OpenRouter → Nous Portal → Codex → Custom → ...
2026-03-24 12:50:24 -07:00
Teknium
481915587e
fix: update context pressure warnings and token estimates after compaction
Reset context pressure warnings and update last_prompt_tokens and last_completion_tokens in the context compressor to prevent stale values from causing excessive warnings and re-triggering compression. This change ensures accurate pressure calculations following the compaction process.
2026-03-24 09:25:10 -07:00
Teknium
ad1bf16f28
chore: remove all remaining mini-swe-agent references
Complete cleanup after dropping the mini-swe-agent submodule (PR #2804):

- Remove MSWEA_SILENT_STARTUP and MSWEA_GLOBAL_CONFIG_DIR env var
  settings from cli.py, run_agent.py, hermes_cli/main.py, doctor.py
- Remove mini-swe-agent health check from hermes doctor
- Remove 'minisweagent' from logger suppression lists
- Remove litellm/typer/platformdirs from requirements.txt
- Remove mini-swe-agent install steps from install.ps1 (Windows)
- Remove mini-swe-agent install steps from website docs
- Update all stale comments/docstrings referencing mini-swe-agent
  in terminal_tool.py, tools/__init__.py, code_execution_tool.py,
  environments/README.md, environments/agent_loop.py
- Remove mini_swe_runner from pyproject.toml py-modules
  (still exists as standalone script for RL training use)
- Shrink test_minisweagent_path.py to empty stub

The orphaned mini-swe-agent/ directory on disk needs manual removal:
  rm -rf mini-swe-agent/
2026-03-24 08:19:23 -07:00
Teknium
02b38b93cb
refactor: remove mini-swe-agent dependency — inline Docker/Modal backends (#2804)
Drop the mini-swe-agent git submodule. All terminal backends now use
hermes-agent's own environment implementations directly.

Docker backend:
- Inline the `docker run -d` container startup (was 15 lines in
  minisweagent's DockerEnvironment). Our wrapper already handled
  execute(), cleanup(), security hardening, volumes, and resource limits.

Modal backend:
- Import swe-rex's ModalDeployment directly instead of going through
  minisweagent's 90-line passthrough wrapper.
- Bake the _AsyncWorker pattern (from environments/patches.py) directly
  into ModalEnvironment for Atropos compatibility without monkey-patching.

Cleanup:
- Remove minisweagent_path.py (submodule path resolution helper)
- Remove submodule init/install from install.sh and setup-hermes.sh
- Remove mini-swe-agent from .gitmodules
- environments/patches.py is now a no-op (kept for backward compat)
- terminal_tool.py no longer does sys.path hacking for minisweagent
- mini_swe_runner.py guards imports (optional, for RL training only)
- Update all affected tests to mock the new direct subprocess calls
- Update README.md, CONTRIBUTING.md

No functionality change — all Docker, Modal, local, SSH, Singularity,
and Daytona backends behave identically. 6093 tests pass.
2026-03-24 07:30:25 -07:00
Teknium
a312ee7b4c
fix(agent): ensure first delta is fired during reasoning updates
- Added calls to `_fire_first_delta()` in the `AIAgent` class to ensure that the first delta is triggered for both reasoning and thinking updates. This change improves the handling of delta events during streaming, enhancing the responsiveness of the agent's reasoning capabilities.
2026-03-24 07:16:20 -07:00
Teknium
87e2626cf6
feat(cli, agent): add tool generation callback for streaming updates
- Introduced `_on_tool_gen_start` in `HermesCLI` to indicate when tool-call arguments are being generated, enhancing user feedback during streaming.
- Updated `AIAgent` to support a new `tool_gen_callback`, notifying the display layer when tool generation starts, allowing for better user experience during large payloads.
- Ensured that the callback is triggered appropriately during streaming events to prevent user interface freezing.
2026-03-23 23:10:58 -07:00
Teknium
942f6eac94
fix(run_agent): ensure proper cleanup of OpenAI client in background review
Added explicit closing of the OpenAI/httpx client in the background review process to prevent "Event loop is closed" errors. This change ensures that the client is properly cleaned up when the review agent is no longer needed, enhancing stability and resource management.
2026-03-22 16:03:16 -07:00
Teknium
bfe4baa6ed
chore: remove unused imports, dead code, and stale comments
Mechanical cleanup — no behavior changes.

Unused imports removed:
- model_tools.py: import os
- run_agent.py: OPENROUTER_MODELS_URL, get_model_context_length
- cli.py: Table, VERSION, RELEASE_DATE, resolve_toolset, get_skill_commands
- terminal_tool.py: signal, uuid, tempfile, set_interrupt_event,
  DANGEROUS_PATTERNS, _load_permanent_allowlist, _detect_dangerous_command

Dead code removed:
- toolsets.py: print_toolset_tree() (zero callers)
- browser_tool.py: _get_session_name() (never called)

Stale comments removed:
- toolsets.py: duplicated/garbled comment line
- web_tools.py: 3 aspirational TODO comments from early development
2026-03-22 08:33:34 -07:00
MacroAnarchy
f9c2ad48c2
fix: defer streaming iteration linebreak to prevent blank line stacking
Follow-up to 669c60a6 (cherry-pick of PR #2187, fixes #2177).

The original fix emits a "\n\n" delta immediately after every
_execute_tool_calls() invocation. When the model runs multiple
consecutive tool iterations before producing text (common with
search → read → analyze flows), each iteration appends its own
paragraph break, resulting in 4-6+ blank lines before the actual
response.

Replace the immediate delta with a deferred flag
(_stream_needs_break). _fire_stream_delta() checks the flag and
prepends a single "\n\n" only when the first real text delta
arrives, so multiple back-to-back tool iterations still produce
exactly one paragraph break.
2026-03-22 04:59:12 -07:00
Teknium
34be3f8be6
revert: remove trailing empty assistant message stripping
Reverts the sanitizer addition from PR #2466 (originally #2129).
We already have _empty_content_retries handling for reasoning-only
responses. The trailing strip risks silently eating valid messages
and is redundant with existing empty-content handling.
2026-03-22 04:55:34 -07:00
ygd58
5407d12bc6
fix(agent): strip trailing empty assistant messages before API calls to prevent prefill rejection 2026-03-22 04:38:17 -07:00
Bartok Moltbot
e6a708aa04
fix(io): catch ValueError in _SafeWriter for closed file handles (#2428)
When subagents run in ThreadPoolExecutor threads, the shared stdout handle
can close between thread teardown and KawaiiSpinner cleanup. Python raises
ValueError (not OSError) for I/O operations on closed files:
  ValueError: I/O operation on closed file

The _SafeWriter class was only catching OSError, missing this case.

Changes:
- Add ValueError to exception handling in write(), flush(), and isatty()
- Update docstring to document the ThreadPoolExecutor teardown scenario

Fixes #2428
2026-03-22 04:38:17 -07:00
Teknium
8cb7864110
fix: resolve garbled ANSI escape codes in status printouts (#2262) (#2448)
Two related root causes for the '?[33mTool progress: NEW?[0m' garbling
reported on kitty, alacritty, ghostty and gnome-console:

1. /verbose label printing used self.console.print() with Rich markup
   ([yellow]...[/]).  self.console is a plain Rich Console() whose output
   goes directly to sys.stdout, which patch_stdout's StdoutProxy
   intercepts and mangles raw ANSI sequences.

2. Context pressure status lines (e.g. 'approaching compaction') from
   AIAgent._safe_print() had the same problem -- _safe_print() was a
   @staticmethod that always called builtin print(), bypassing the
   prompt_toolkit renderer entirely.

Fix:
- Convert AIAgent._safe_print() from @staticmethod to an instance method
  that delegates to self._print_fn (defaults to builtin print, preserving
  all non-CLI behaviour).
- After the CLI creates its AIAgent instance, wire self.agent._print_fn to
  the existing _cprint() helper which routes through
  prompt_toolkit.print_formatted_text(ANSI(text)).
- Rewrite the /verbose feedback labels to use hermes_cli.colors.Colors
  ANSI constants in f-strings and emit them via _cprint() directly,
  removing the Rich-markup-inside-patch_stdout anti-pattern.

Fixes #2262

Co-authored-by: Animesh Mishra <animesh.m.7523@gmail.com>
2026-03-22 04:07:06 -07:00
Teknium
306e67f32d
fix: fail fast when explicit provider has no API key instead of silent OpenRouter fallback (#2445)
When a non-OpenRouter provider (e.g. minimax, anthropic) is set in
config.yaml but its API key is missing, Hermes silently fell back to
OpenRouter, causing confusing 404 errors.

Now checks if the user explicitly configured a provider before falling
back. Explicit providers raise RuntimeError with a clear message naming
the missing env var. Auto/openrouter/custom providers still fall through
to OpenRouter as before.

Three code paths fixed:
- run_agent.py AIAgent.__init__ — main client initialization
- auxiliary_client.py call_llm — sync auxiliary calls
- auxiliary_client.py call_llm_streaming — async auxiliary calls

Based on PR #2272 by @StefanIsMe. Applied manually to fix a
pconfig NameError in the original and extend to call_llm_streaming.

Co-authored-by: StefanIsMe <StefanIsMe@users.noreply.github.com>
2026-03-22 03:59:29 -07:00
Teknium
669c60a6bb
fix: add iteration boundary linebreak to prevent stream concatenation
Cherry-picked from PR #2187 by @devorun. Fixes #2177.

When streaming is enabled, text before and after tool calls gets
concatenated without separation. Adds a paragraph break delta after
_execute_tool_calls() so stream consumers insert proper whitespace
between iteration boundaries.
2026-03-21 19:19:26 -07:00
teyrebaz33
bd49bce278
fix(prompt-caching): skip top-level cache_control on role:tool for OpenRouter
On the native Anthropic Messages API path, convert_messages_to_anthropic()
moves top-level cache_control on role:tool messages inside the tool_result
block. On OpenRouter (chat_completions), no such conversion happens — the
unexpected top-level field causes a silent hang on the second tool call.

Add native_anthropic parameter to _apply_cache_marker() and
apply_anthropic_cache_control(). When False (OpenRouter), role:tool messages
are skipped entirely. When True (native Anthropic), existing behaviour is
preserved.

Fixes #2362
2026-03-21 16:54:43 -07:00
Teknium
525caadd8c
fix: prevent Anthropic token leaking to third-party anthropic_messages providers (salvage #2383) (#2389)
* fix: prevent Anthropic token fallback leaking to third-party anthropic_messages providers

When provider is minimax/alibaba/etc and MINIMAX_API_KEY is not set,
the code fell back to resolve_anthropic_token() sending Anthropic OAuth
credentials to third-party endpoints, causing 401 errors.

Now only provider=="anthropic" triggers the fallback. Generalizes the
Alibaba-specific guard from #1739 to all non-Anthropic providers.

* fix: set provider='anthropic' in credential refresh tests

Follow-up for cherry-picked PR #2383 — existing tests didn't set
agent.provider, which the new guard requires to allow Anthropic
token refresh.

---------

Co-authored-by: 0xbyt4 <35742124+0xbyt4@users.noreply.github.com>
2026-03-21 16:42:46 -07:00
Teknium
2a5f86ed6d
Merge pull request #2343 from NousResearch/hermes/hermes-31d7db3b
feat: @ context references + Honcho config fixes
2026-03-21 16:10:19 -07:00
Teknium
2c06ec5f51
fix: correct provider check for Alibaba model identity injection
PR #2314 checked for provider names 'alibaba-coding-plan' and
'alibaba-coding-plan-anthropic' which don't exist in the provider
registry. The provider is always 'alibaba' — the condition was dead
code. Fixed to check self.provider == 'alibaba'.
2026-03-21 09:46:26 -07:00
crazywriter1
523d8c38f9
fix: Alibaba/DashScope: preserve model dots (qwen3.5-plus) and fix 401 auth
When using Alibaba (DashScope) with an anthropic-compatible endpoint,
model names like qwen3.5-plus were being normalized to qwen3-5-plus.
Alibaba's API expects the dot. Added preserve_dots parameter to
normalize_model_name() and build_anthropic_kwargs().

Also fixed 401 auth: when provider is alibaba or base_url contains
dashscope/aliyuncs, use only the resolved API key (DASHSCOPE_API_KEY).
Never fall back to resolve_anthropic_token(), and skip Anthropic
credential refresh for DashScope endpoints.

Cherry-picked from PR #1748 by crazywriter1. Fixes #1739.
2026-03-21 09:38:04 -07:00
Teknium
e183744cb5
feat(honcho): instance-local config via HERMES_HOME, default session strategy to per-directory
- Add resolve_config_path(): checks $HERMES_HOME/honcho.json first,
  falls back to ~/.honcho/config.json.  Enables isolated Hermes instances
  with independent Honcho credentials and settings.
- Update CLI and doctor to use resolved path instead of hardcoded global.
- Change default session_strategy from per-session to per-directory.

Part 1 of #1962 by @erosika.
2026-03-21 09:34:00 -07:00
Teknium
9305164bf3
fix: add None-entry guard to tool_calls loops in run_agent, batch_runner, and mini_swe_runner (#2316)
Co-authored-by: Dilee <uzmpsk.dilekakbas@gmail.com>
2026-03-21 07:20:41 -07:00
ygd58
2ea8054304
fix(agent): inject model identity for Alibaba Coding Plan to work around API returning wrong model name 2026-03-21 07:11:08 -07:00
Teknium
58b52dfb2f
Merge pull request #2303 from NousResearch/hermes/hermes-31d7db3b
fix: remove synthetic error message injection, fix session resume after repeated failures
2026-03-21 07:03:54 -07:00
Teknium
779619f742
fix: remove synthetic error message injection, fix session resume after repeated failures
Two changes to the error handler in the agent loop:

1. Remove the 'if not pending_handled' block that injected fake
   [System error during processing: ...] messages into conversation
   history.  These polluted history, burned tokens on retries, and
   could violate role alternation by injecting as role=user.
   The tool_calls error-result path (role=tool) is preserved.

2. Append the error final_response as an assistant message when
   hitting the iteration limit, so session resume doesn't produce
   consecutive user messages.
2026-03-21 06:33:05 -07:00
Teknium
96a5e9fc11
feat(agent): add summary of successful tool actions in review agent
Enhanced the review agent to scan and summarize successful tool actions, providing users with a compact overview of updates made during the review process. This includes actions related to memory and user profiles, improving user feedback and interaction clarity.
2026-03-21 06:31:59 -07:00
Teknium
885f88fb60
feat(agent): suppress non-forced output during post-response housekeeping
- Introduced a mechanism to mute output after the main response is delivered, ensuring that subsequent tool calls run without cluttering the CLI.
- Redirected stdout to devnull during the review agent's execution to prevent any print statements from interfering with the main CLI display.
- Added a new attribute `_mute_post_response` to manage output suppression effectively.
2026-03-20 23:54:42 -07:00
Teknium
761a8ad39a
fix(display): show provider and endpoint in API error messages (#2266)
fix(display): show provider and endpoint in API error messages
2026-03-20 21:57:53 -07:00
Test
d560f2d1f2 fix(display): show provider and endpoint in API error messages
When an API call fails, the error output now shows the provider name,
model, and endpoint URL so users can immediately identify which service
rejected their request. Auth errors (401/403) get actionable guidance:
check key validity, model access, and OpenRouter credits link.

Before: 'API call failed (attempt 1/3): PermissionDeniedError'
After:  'API call failed (attempt 1/3): PermissionDeniedError
         Provider: openrouter  Model: anthropic/claude-sonnet-4
         Endpoint: https://openrouter.ai/api/v1
         Your API key was rejected by the provider. Check:
           • Is the key valid? Run: hermes setup
           • Does your account have access to anthropic/claude-sonnet-4?
           • Check credits: https://openrouter.ai/settings/credits'
2026-03-20 21:06:55 -07:00
Teknium
45058b4105
feat: replace inline nudges with background memory/skill review (#2235)
Remove the memory and skill nudges that were appended directly to user
messages, causing backward-looking system instructions to compete with
forward-looking user tasks. Found in 43% of user messages across 15
sessions, with confirmed cases of the agent spending tool calls on
nudge responses before starting the user's actual request.

Replace with a background review agent that runs AFTER the main agent
finishes responding:
- Spawns a background thread with a snapshot of the conversation
- Uses the main model (not auxiliary) for high-precision memory/skill work
- Only has memory + skill_manage tools (5 iteration budget)
- Shares the memory store for direct writes
- Never modifies the main conversation history
- Never competes with the user's task for model attention
- Zero latency impact (runs after response is delivered)
- Same token cost (processes the same context, just on a separate track)

The trigger conditions are unchanged (every 10 user turns for memory,
after 10+ tool iterations for skills). Only the execution path changes:
from inline injection to background fork.

Closes #2227.

Co-authored-by: Test <test@test.com>
2026-03-20 18:51:31 -07:00
Teknium
4263350c5b
fix: remove post-compression file-read history injection (#2226)
Remove the [Files already read — do NOT re-read these] user message
that was injected into the conversation after context compression.

This message used role='user' for system-generated content, creating
a fake user turn that confused models about conversation state and
could contribute to task-redo behavior.

The file_tools.py read tracker (warn on 3rd consecutive read, block
on 4th+) already handles re-read prevention inline without injecting
synthetic messages.

Closes #2224.

Co-authored-by: Test <test@test.com>
2026-03-20 14:54:25 -07:00
Test
76bc27199f fix(cli, agent): improve streaming handling and state management
- Updated _stream_delta method in HermesCLI to handle None values, flushing the stream and resetting state for clean tool execution.
- Enhanced quiet mode handling in AIAgent to ensure proper display closure before tool execution, preventing display issues with intermediate streamed content.

These changes improve the robustness of the streaming functionality and ensure a smoother user experience during tool interactions.
2026-03-20 10:02:42 -07:00
Test
55ce601502 fix: 6 bugs in model metadata, reasoning detection, and delegate tool
Cherry-picked from PR #2169 by @0xbyt4.

1. _strip_provider_prefix: skip Ollama model:tag names (qwen:0.5b)
2. Fuzzy match: remove reverse direction that made claude-sonnet-4
   resolve to 1M instead of 200K
3. _has_content_after_think_block: reuse _strip_think_blocks() to
   handle all tag variants (thinking, reasoning, REASONING_SCRATCHPAD)
4. models.dev lookup: elif→if so nous provider also queries models.dev
5. Disk cache fallback: use 5-min TTL instead of full hour so network
   is retried soon
6. Delegate build: wrap child construction in try/finally so
   _last_resolved_tool_names is always restored on exception
2026-03-20 08:52:37 -07:00
Teknium
c52353cf8a
feat: context pressure warnings for CLI and gateway (#2159)
* feat: context pressure warnings for CLI and gateway

User-facing notifications as context approaches the compaction threshold.
Warnings fire at 60% and 85% of the way to compaction — relative to
the configured compression threshold, not the raw context window.

CLI: Formatted line with a progress bar showing distance to compaction.
Cyan at 60% (approaching), bold yellow at 85% (imminent).

  ◐ context ▰▰▰▰▰▰▰▰▰▰▰▰▱▱▱▱▱▱▱▱ 60% to compaction  100k threshold (50%) · approaching compaction
  ⚠ context ▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰▱▱▱ 85% to compaction  100k threshold (50%) · compaction imminent

Gateway: Plain-text notification sent to the user's chat via the new
status_callback mechanism (asyncio.run_coroutine_threadsafe bridge,
same pattern as step_callback).

Does NOT inject into the message stream. The LLM never sees these
warnings. Flags reset after each compaction cycle.

Files changed:
- agent/display.py — format_context_pressure(), format_context_pressure_gateway()
- run_agent.py — status_callback param, _context_50/70_warned flags,
  _emit_context_pressure(), flag reset in _compress_context()
- gateway/run.py — _status_callback_sync bridge, wired to AIAgent
- tests/test_context_pressure.py — 23 tests

* Merge remote-tracking branch 'origin/main' into hermes/hermes-7ea545bf

---------

Co-authored-by: Test <test@test.com>
2026-03-20 08:37:36 -07:00
Teknium
88643a1ba9
feat: overhaul context length detection with models.dev and provider-aware resolution (#2158)
Replace the fragile hardcoded context length system with a multi-source
resolution chain that correctly identifies context windows per provider.

Key changes:

- New agent/models_dev.py: Fetches and caches the models.dev registry
  (3800+ models across 100+ providers with per-provider context windows).
  In-memory cache (1hr TTL) + disk cache for cold starts.

- Rewritten get_model_context_length() resolution chain:
  0. Config override (model.context_length)
  1. Custom providers per-model context_length
  2. Persistent disk cache
  3. Endpoint /models (local servers)
  4. Anthropic /v1/models API (max_input_tokens, API-key only)
  5. OpenRouter live API (existing, unchanged)
  6. Nous suffix-match via OpenRouter (dot/dash normalization)
  7. models.dev registry lookup (provider-aware)
  8. Thin hardcoded defaults (broad family patterns)
  9. 128K fallback (was 2M)

- Provider-aware context: same model now correctly resolves to different
  context windows per provider (e.g. claude-opus-4.6: 1M on Anthropic,
  128K on GitHub Copilot). Provider name flows through ContextCompressor.

- DEFAULT_CONTEXT_LENGTHS shrunk from 80+ entries to ~16 broad patterns.
  models.dev replaces the per-model hardcoding.

- CONTEXT_PROBE_TIERS changed from [2M, 1M, 512K, 200K, 128K, 64K, 32K]
  to [128K, 64K, 32K, 16K, 8K]. Unknown models no longer start at 2M.

- hermes model: prompts for context_length when configuring custom
  endpoints. Supports shorthand (32k, 128K). Saved to custom_providers
  per-model config.

- custom_providers schema extended with optional models dict for
  per-model context_length (backward compatible).

- Nous Portal: suffix-matches bare IDs (claude-opus-4-6) against
  OpenRouter's prefixed IDs (anthropic/claude-opus-4.6) with dot/dash
  normalization. Handles all 15 current Nous models.

- Anthropic direct: queries /v1/models for max_input_tokens. Only works
  with regular API keys (sk-ant-api*), not OAuth tokens. Falls through
  to models.dev for OAuth users.

Tests: 5574 passed (18 new tests for models_dev + updated probe tiers)
Docs: Updated configuration.md context length section, AGENTS.md

Co-authored-by: Test <test@test.com>
2026-03-20 06:04:33 -07:00
Teknium
b7b585656b
Merge pull request #2110 from NousResearch/hermes/hermes-5d6932ba
fix: session reset + custom provider model switch + honcho base_url
2026-03-20 06:01:44 -07:00
Teknium
aa6416399e
Merge pull request #2161 from NousResearch/hermes/hermes-6757a563
fix(display): show spinners and tool progress during streaming mode
2026-03-20 05:17:55 -07:00
Test
b313751acf fix(display): show spinners and tool progress during streaming mode
When streaming was enabled, two visual feedback mechanisms were
completely suppressed:

1. The thinking spinner (TUI toolbar) was skipped because the entire
   spinner block was gated on 'not self._has_stream_consumers()'.
   Now the thinking_callback fires in streaming mode too — the
   raw KawaiiSpinner is still skipped (would conflict with streamed
   tokens) but the TUI toolbar widget works fine alongside streaming.

2. Tool progress lines (the ┊ feed) were invisible because _vprint
   was blanket-suppressed when stream consumers existed. But during
   tool execution, no tokens are actively streaming, so printing is
   safe. Added an _executing_tools flag that _vprint respects to
   allow output during tool execution even with stream consumers
   registered.
2026-03-20 05:14:42 -07:00
Test
b1d05dfe8b fix(openai): route api.openai.com to Responses API for GPT-5.x
Based on PR #1859 by @magi-morph (too stale to cherry-pick, reimplemented).

GPT-5.x models reject tool calls + reasoning_effort on
/v1/chat/completions with a 400 error directing to /v1/responses.
This auto-detects api.openai.com in the base URL and switches to
codex_responses mode in three places:

- AIAgent.__init__: upgrades chat_completions → codex_responses
- _try_activate_fallback(): same routing for fallback model
- runtime_provider.py: _detect_api_mode_for_url() for both custom
  provider and openrouter runtime resolution paths

Also extracts _is_direct_openai_url() helper to replace the inline
check in _max_tokens_param().
2026-03-20 05:09:41 -07:00
Test
5822711ae6 fix: complete session reset — missing compressor counters + test
Follow-up to PR #2101 (InB4DevOps). Adds three missing context compressor
resets in reset_session_state():
- compression_count (displayed in status bar)
- last_total_tokens
- _context_probed (stale context-error flag)

Also fixes the test_cli_new_session.py prompt_toolkit mock (missing
auto_suggest stub) and adds a regression test for #2099 that verifies
all token counters and compressor state are zeroed on /new.
2026-03-20 04:35:17 -07:00
Teknium
b19f5133c3
Merge pull request #2118 from NousResearch/hermes/hermes-e83093f0
feat: show reasoning/thinking blocks when show_reasoning is enabled
2026-03-20 04:35:12 -07:00
Test
b1832faaae feat: show reasoning/thinking blocks when show_reasoning is enabled
- Add <thinking> tag to streaming filter's tag list
- When show_reasoning is on, route XML reasoning content to the
  reasoning display box instead of silently discarding it
- Expand _strip_think_blocks to handle all tag variants:
  <think>, <thinking>, <THINKING>, <reasoning>, <REASONING_SCRATCHPAD>
2026-03-19 19:44:31 -07:00
Teknium
3a9a1bbb84
Merge pull request #2091 from dusterbloom/fix/lmstudio-context-length-detection
feat: query local servers for actual context window size
2026-03-19 19:08:21 -07:00
InB4DevOps
fe331ed9bd fix: Reset token counters on new session for accurate usage display (#2099) 2026-03-20 01:21:25 +01:00
Peppi Littera
746abf5e28 fix: use reasoning content as response when model only produces think blocks
Local models (especially Qwen 3.5) sometimes wrap their entire response
inside <think> tags, leaving actual content empty. Previously this caused
3 retries and then an error, wasting tokens and failing the request.

Now when retries are exhausted and reasoning_text contains the response,
it is used as final_response instead of returning an error. The user
sees the actual answer instead of "Model generated only think blocks."
2026-03-20 00:26:36 +01:00
Teknium
e84d952dc0
fix(codex): handle reasoning-only responses and replay path (#2070)
* fix(codex): treat reasoning-only responses as incomplete, not stop

When a Codex Responses API response contains only reasoning items
(encrypted thinking state) with no message text or tool calls, the
_normalize_codex_response method was setting finish_reason='stop'.
This sent the response into the empty-content retry loop, which
burned 3 retries and then failed — exactly the pattern Nester
reported in Discord.

Two fixes:
1. _normalize_codex_response: reasoning-only responses (reasoning_items_raw
   non-empty but no final_text) now get finish_reason='incomplete', routing
   them to the Codex continuation path instead of the retry loop.
2. Incomplete handling: also checks for codex_reasoning_items when deciding
   whether to preserve an interim message, so encrypted reasoning state is
   not silently dropped when there is no visible reasoning text.

Adds 4 regression tests covering:
- Unit: reasoning-only → incomplete, reasoning+content → stop
- E2E: reasoning-only → continuation → final answer succeeds
- E2E: encrypted reasoning items preserved in interim messages

* fix(codex): ensure reasoning items have required following item in API input

Follow-up to the reasoning-only response fix. Three additional issues
found by tracing the full replay path:

1. _chat_messages_to_responses_input: when a reasoning-only interim
   message was converted to Responses API input, the reasoning items
   were emitted as the last items with no following item. The Responses
   API requires a following item after each reasoning item (otherwise:
   'missing_following_item' error, as seen in OpenHands #11406). Now
   emits an empty assistant message as the required following item when
   content is empty but reasoning items were added.

2. Duplicate detection: two consecutive reasoning-only incomplete
   messages with identical empty content/reasoning but different
   encrypted codex_reasoning_items were incorrectly treated as
   duplicates, silently dropping the second response's reasoning state.
   Now includes codex_reasoning_items in the duplicate comparison.

3. Added tests for both the API input conversion path and the duplicate
   detection edge case.

Research context: verified against OpenCode (uses Vercel AI SDK, no
retry loop so avoids the issue), Clawdbot (drops orphaned reasoning
blocks entirely), and OpenHands (hit the missing_following_item error).
Our approach preserves reasoning continuity while satisfying the API
constraint.

---------

Co-authored-by: Test <test@test.com>
2026-03-19 10:34:44 -07:00
Teknium
d76fa7fc37
fix: detect context length for custom model endpoints via fuzzy matching + config override (#2051)
* fix: detect context length for custom model endpoints via fuzzy matching + config override

Custom model endpoints (non-OpenRouter, non-known-provider) were silently
falling back to 2M tokens when the model name didn't exactly match what the
endpoint's /v1/models reported. This happened because:

1. Endpoint metadata lookup used exact match only — model name mismatches
   (e.g. 'qwen3.5:9b' vs 'Qwen3.5-9B-Q4_K_M.gguf') caused a miss
2. Single-model servers (common for local inference) required exact name
   match even though only one model was loaded
3. No user escape hatch to manually set context length

Changes:
- Add fuzzy matching for endpoint model metadata: single-model servers
  use the only available model regardless of name; multi-model servers
  try substring matching in both directions
- Add model.context_length config override (highest priority) so users
  can explicitly set their model's context length in config.yaml
- Log an informative message when falling back to 2M probe, telling
  users about the config override option
- Thread config_context_length through ContextCompressor and AIAgent init

Tests: 6 new tests covering fuzzy match, single-model fallback, config
override (including zero/None edge cases).

* fix: auto-detect local model name and context length for local servers

Cherry-picked from PR #2043 by sudoingX.

- Auto-detect model name from local server's /v1/models when only one
  model is loaded (no manual model name config needed)
- Add n_ctx_train and n_ctx to context length detection keys for llama.cpp
- Query llama.cpp /props endpoint for actual allocated context (not just
  training context from GGUF metadata)
- Strip .gguf suffix from display in banner and status bar
- _auto_detect_local_model() in runtime_provider.py for CLI init

Co-authored-by: sudo <sudoingx@users.noreply.github.com>

* fix: revert accidental summary_target_tokens change + add docs for context_length config

- Revert summary_target_tokens from 2500 back to 500 (accidental change
  during patching)
- Add 'Context Length Detection' section to Custom & Self-Hosted docs
  explaining model.context_length config override

---------

Co-authored-by: Test <test@test.com>
Co-authored-by: sudo <sudoingx@users.noreply.github.com>
2026-03-19 06:01:16 -07:00
Teknium
a7cc1cf309
fix: support Anthropic-compatible endpoints for third-party providers (#1997)
Three bugs prevented providers like MiniMax from using their
Anthropic-compatible endpoints (e.g. api.minimax.io/anthropic):

1. _VALID_API_MODES was missing 'anthropic_messages', so explicit
   api_mode config was silently rejected and defaulted to
   chat_completions.

2. API-key provider resolution hardcoded api_mode to 'chat_completions'
   without checking model config or detecting Anthropic-compatible URLs.

3. run_agent.py auto-detection only recognized api.anthropic.com, not
   third-party endpoints using the /anthropic URL convention.

Fixes:
- Add 'anthropic_messages' to _VALID_API_MODES
- API-key providers now check model config api_mode and auto-detect
  URLs ending in /anthropic
- run_agent.py and fallback logic detect /anthropic URL convention
- 5 new tests covering all scenarios

Users can now either:
- Set MINIMAX_BASE_URL=https://api.minimax.io/anthropic (auto-detected)
- Set api_mode: anthropic_messages in model config (explicit)
- Use custom_providers with api_mode: anthropic_messages

Co-authored-by: Test <test@test.com>
2026-03-18 16:26:06 -07:00
Teknium
7c7feaa033
Merge pull request #1929 from NousResearch/hermes/hermes-b29f73b2
feat: inject model and provider into system prompt
2026-03-18 04:18:41 -07:00
Test
e99aca98ab feat: inject model and provider into system prompt
Adds model name and provider to the system prompt metadata block,
alongside the existing session ID and timestamp. These are frozen
at session start and don't change mid-conversation, so they won't
break prompt caching.
2026-03-18 04:18:26 -07:00
Teknium
e4a3ffa9c1
feat: use SOUL.md as primary agent identity instead of hardcoded default (#1922)
SOUL.md now loads in slot #1 of the system prompt, replacing the
hardcoded DEFAULT_AGENT_IDENTITY. This lets users fully customize
the agent's identity and personality by editing ~/.hermes/SOUL.md
without it conflicting with the built-in identity text.

When SOUL.md is loaded as identity, it's excluded from the context
files section to avoid appearing twice. When SOUL.md is missing,
empty, unreadable, or skip_context_files is set, the hardcoded
DEFAULT_AGENT_IDENTITY is used as a fallback.

The default SOUL.md (seeded on first run) already contains the full
Hermes personality, so existing installs are unaffected.

Co-authored-by: Test <test@test.com>
2026-03-18 04:11:20 -07:00
Test
e7844e9c8d Merge origin/main, resolve conflicts (self._base_url_lower) 2026-03-18 04:09:00 -07:00
Teknium
c0c14e60b4
fix: make concurrent tool batching path-aware for file mutations (#1914)
* Improve tool batching independence checks

* fix: address review feedback on path-aware batching

- Log malformed/non-dict tool arguments at debug level before
  falling back to sequential, instead of silently swallowing
  the error into an empty dict
- Guard empty paths in _paths_overlap (unreachable in practice
  due to upstream filtering, but makes the invariant explicit)
- Add tests: malformed JSON args, non-dict args, _paths_overlap
  unit tests including empty path edge cases
- web_crawl is not a registered tool (only web_search/web_extract
  are); no addition needed to _PARALLEL_SAFE_TOOLS

---------

Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>
2026-03-18 03:25:38 -07:00
Teknium
a2440f72f6
feat: use endpoint metadata for custom model context and pricing (#1906)
* perf: cache base_url.lower() via property, consolidate triple load_config(), hoist set constant

run_agent.py:
- Add base_url property that auto-caches _base_url_lower on every
  assignment, eliminating 12+ redundant .lower() calls per API cycle
  across __init__, _build_api_kwargs, _supports_reasoning_extra_body,
  and the main conversation loop
- Consolidate three separate load_config() disk reads in __init__
  (memory, skills, compression) into a single call, reusing the
  result dict for all three config sections

model_tools.py:
- Hoist _READ_SEARCH_TOOLS set to module level (was rebuilt inside
  handle_function_call on every tool invocation)

* Use endpoint metadata for custom model context and pricing

---------

Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>
2026-03-18 03:04:07 -07:00
Test
5b74df2bfc fix: OAuth flag stale after refresh/fallback, memory nudge never fires, dead code
- Update _is_anthropic_oauth in _try_refresh_anthropic_client_credentials()
  when token type changes during credential refresh
- Set _is_anthropic_oauth in _try_activate_fallback() Anthropic path
- Move _turns_since_memory and _iters_since_skill init to __init__ so
  nudge counters accumulate across run_conversation() calls in CLI mode
- Remove unreachable retry_count >= max_retries block after raise

Adds 7 regression tests. Salvaged from PR #1797 by @0xbyt4.
2026-03-18 02:19:57 -07:00
max
0c392e7a87 feat: integrate GitHub Copilot providers across Hermes
Add first-class GitHub Copilot and Copilot ACP provider support across
model selection, runtime provider resolution, CLI sessions, delegated
subagents, cron jobs, and the Telegram gateway.

This also normalizes Copilot model catalogs and API modes, introduces a
Copilot ACP OpenAI-compatible shim, and fixes service-mode auth by
resolving Homebrew-installed gh binaries under launchd.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-17 23:40:22 -07:00
Teknium
d1d17f4f0a
feat(compression): add summary_base_url + move compression config to YAML-only
- Add summary_base_url config option to compression block for custom
  OpenAI-compatible endpoints (e.g. zai, DeepSeek, Ollama)
- Remove compression env var bridges from cli.py and gateway/run.py
  (CONTEXT_COMPRESSION_* env vars no longer set from config)
- Switch run_agent.py to read compression config directly from
  config.yaml instead of env vars
- Fix backwards-compat block in _resolve_task_provider_model to also
  fire when auxiliary.compression.provider is 'auto' (DEFAULT_CONFIG
  sets this, which was silently preventing the compression section's
  summary_* keys from being read)
- Add test for summary_base_url config-to-client flow
- Update docs to show compression as config.yaml-only

Closes #1591
Based on PR #1702 by @uzaylisak
2026-03-17 04:46:15 -07:00
Teknium
85993fbb5a
feat: pre-call sanitization and post-call tool guardrails (#1732)
Salvage of PR #1321 by @alireza78a (cherry-picked concept, reimplemented
against current main).

Phase 1 — Pre-call message sanitization:
  _sanitize_api_messages() now runs unconditionally before every LLM call.
  Previously gated on context_compressor being present, so sessions loaded
  from disk or running without compression could accumulate dangling
  tool_call/tool_result pairs causing API errors.

Phase 2a — Delegate task cap:
  _cap_delegate_task_calls() truncates excess delegate_task calls per turn
  to MAX_CONCURRENT_CHILDREN. The existing cap in delegate_tool.py only
  limits the task array within a single call; this catches multiple
  separate delegate_task tool_calls in one turn.

Phase 2b — Tool call deduplication:
  _deduplicate_tool_calls() drops duplicate (tool_name, arguments) pairs
  within a single turn when models stutter.

All three are static methods on AIAgent, independently testable.
29 tests covering happy paths and edge cases.
2026-03-17 04:24:27 -07:00
Teknium
9f81c11ba0
feat: eager fallback to backup model on rate-limit errors (#1730)
When a fallback model is configured, switch to it immediately upon
detecting rate-limit conditions (429, quota exhaustion, empty/malformed
responses) instead of exhausting all retries with exponential backoff.

Two eager-fallback checks:
1. Invalid/empty API responses — fallback attempted before retry loop
2. HTTP 429 / rate-limit keyword detection — fallback before backoff

Both guarded by _fallback_activated for one-shot semantics.

Cherry-picked from PR #1413 by usvimal.

Co-authored-by: usvimal <usvimal@users.noreply.github.com>
2026-03-17 04:21:16 -07:00
Teknium
72bcec0ce5
Merge pull request #1723 from NousResearch/fix/compression-attempts-persist
fix(core): compression_attempts resets each iteration — allows unlimited compressions
2026-03-17 04:13:54 -07:00
Teknium
d604b9622c
Merge pull request #1722 from NousResearch/fix/run-agent-role-violations
fix(core): message role alternation violations in JSON recovery and error handler
2026-03-17 04:13:51 -07:00
teknium1
1264275cc3 fix(core): compression_attempts counter resets each loop iteration
compression_attempts was initialized inside the outer while loop,
resetting to 0 on every iteration. Since compression triggers a
'continue' back to the top of the loop, the counter never accumulated
past 1 — effectively allowing unlimited compression attempts.

Move initialization before the outer while loop so the cap of 3
applies across the entire run_conversation() call.
2026-03-17 04:11:32 -07:00
teknium1
cd6dc4ef7e fix(core): message role violations in JSON recovery and error handler
Two edge cases could inject messages that violate role alternation:

1. Invalid JSON recovery (line ~5985): After 3 retries of invalid JSON
   tool args, a user-role recovery message was injected. But the
   assistant's tool_calls were never appended, so the sequence could
   become user → user. Fix: append the assistant message with its
   tool_calls, then respond with proper tool-role error results.

2. System error handler (line ~6238): Always injected a user-role
   error message, which creates consecutive user messages if the last
   message was already user. Fix: dynamically choose the role based on
   the last message to maintain alternation.
2026-03-17 04:10:41 -07:00
teknium1
24282dceb1 fix(core): reset length_continue_retries after successful continuation
length_continue_retries and truncated_response_prefix were initialized
once before the outer loop and never reset after a successful
continuation. If a conversation hit length truncation once (counter=1),
succeeded on continuation, did more tool calls, then hit length again,
the counter started at 1 instead of 0 — reducing available retries
from 3 to 2. The stale truncated_response_prefix would also leak
into the next response.

Reset both after the prefix is consumed on a successful final response.
2026-03-17 04:05:20 -07:00
Teknium
56e0c90445
Merge pull request #1700 from NousResearch/fix/redacting-formatter-import
fix(core): RedactingFormatter NameError when verbose_logging=True
2026-03-17 03:46:49 -07:00
Teknium
d417ba2a48
feat: add route-aware pricing estimates (#1695)
Salvaged from PR #1563 by @kshitijk4poor. Cherry-picked with authorship preserved.

- Route-aware pricing architecture replacing static MODEL_PRICING + heuristics
- Canonical usage normalization (Anthropic/OpenAI/Codex API shapes)
- Cache-aware billing (separate cache_read/cache_write rates)
- Cost status tracking (estimated/included/unknown/actual)
- OpenRouter live pricing via models API
- Schema migration v4→v5 with billing metadata columns
- Removed speculative forward-looking entries
- Removed cost display from CLI status bar
- Threaded OpenRouter metadata pre-warm

Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>
2026-03-17 03:44:44 -07:00
teknium1
c713d01e72 fix(core): move RedactingFormatter import before conditional block
RedactingFormatter was imported inside 'if not has_errors_log_handler:'
(line 461) but also used unconditionally in the verbose_logging block
(line 479). When the error log handler already exists (e.g. second
AIAgent in the same process) AND verbose_logging=True, the import was
skipped and line 479 raised NameError.

Fix: Move the import one level up so it's always available regardless
of whether the error log handler already exists.
2026-03-17 03:43:21 -07:00
Teknium
1d5a39e002
fix: thread safety for concurrent subagent delegation (#1672)
* fix: thread safety for concurrent subagent delegation

Four thread-safety fixes that prevent crashes and data races when
running multiple subagents concurrently via delegate_task:

1. Remove redirect_stdout/stderr from delegate_tool — mutating global
   sys.stdout races with the spinner thread when multiple children start
   concurrently, causing segfaults. Children already run with
   quiet_mode=True so the redirect was redundant.

2. Split _run_single_child into _build_child_agent (main thread) +
   _run_single_child (worker thread). AIAgent construction creates
   httpx/SSL clients which are not thread-safe to initialize
   concurrently.

3. Add threading.Lock to SessionDB — subagents share the parent's
   SessionDB and call create_session/append_message from worker threads
   with no synchronization.

4. Add _active_children_lock to AIAgent — interrupt() iterates
   _active_children while worker threads append/remove children.

5. Add _client_cache_lock to auxiliary_client — multiple subagent
   threads may resolve clients concurrently via call_llm().

Based on PR #1471 by peteromallet.

* feat: Honcho base_url override via config.yaml + quick command alias type

Two features salvaged from PR #1576:

1. Honcho base_url override: allows pointing Hermes at a remote
   self-hosted Honcho deployment via config.yaml:

     honcho:
       base_url: "http://192.168.x.x:8000"

   When set, this overrides the Honcho SDK's environment mapping
   (production/local), enabling LAN/VPN Honcho deployments without
   requiring the server to live on localhost. Uses config.yaml instead
   of env var (HONCHO_URL) per project convention.

2. Quick command alias type: adds a new 'alias' quick command type
   that rewrites to another slash command before normal dispatch:

     quick_commands:
       sc:
         type: alias
         target: /context

   Supports both CLI and gateway. Arguments are forwarded to the
   target command.

Based on PR #1576 by redhelix.

---------

Co-authored-by: peteromallet <peteromallet@users.noreply.github.com>
Co-authored-by: redhelix <redhelix@users.noreply.github.com>
2026-03-17 02:53:33 -07:00
Teknium
a3ac142c83
fix(core): guard print() calls in run_conversation() against OSError
In headless environments (systemd, Docker, nohup) stdout can become
unavailable mid-session. Raw print() raises OSError which crashes
cron jobs — agent finishes work but delivery never happens because
the error handler's own print() also raises OSError.

Fix:
- Add _safe_print() static method that wraps print() with try/except
  OSError — silently drops output when stdout is broken
- Make _vprint() use _safe_print() — protects all calls through the
  verbose print path
- Convert raw print() calls in run_conversation() hot path to use
  _safe_print(): starting conversation, interrupt, budget exhausted,
  preflight compression, context cache, conversation completed
- Error handler print (the cascading crash point) gets explicit
  try/except with logger.error() fallback so diagnostics aren't lost

Fixes #845
Closes #1358 (superseded — PR was 323 commits stale with a bug)
2026-03-17 02:41:01 -07:00
Teknium
f2414bfd45
feat: allow custom endpoints to use responses API via api_mode override (#1651)
Add HERMES_API_MODE env var and model.api_mode config field to let
custom OpenAI-compatible endpoints opt into codex_responses mode
without requiring the OpenAI Codex OAuth provider path.

- _get_configured_api_mode() reads HERMES_API_MODE env (precedence)
  then model.api_mode from config.yaml; validates against whitelist
- Applied in both _resolve_openrouter_runtime() and
  _resolve_named_custom_runtime() (original PR only covered openrouter)
- Fix _dump_api_request_debug() to show /responses URL when in
  codex_responses mode instead of always showing /chat/completions
- Tests for config override, env override, invalid values, named
  custom providers, and debug dump URL for both API modes

Inspired by PR #1041 by @mxyhi.

Co-authored-by: mxyhi <mxyhi@users.noreply.github.com>
2026-03-17 02:04:36 -07:00
Teknium
96dac22194
fix: prevent infinite 400 loop on context overflow + block prompt injection via cache files (#1630, #1558)
* fix: prevent infinite 400 failure loop on context overflow (#1630)

When a gateway session exceeds the model's context window, Anthropic may
return a generic 400 invalid_request_error with just 'Error' as the
message.  This bypassed the phrase-based context-length detection,
causing the agent to treat it as a non-retryable client error.  Worse,
the failed user message was still persisted to the transcript, making
the session even larger on each attempt — creating an infinite loop.

Three-layer fix:

1. run_agent.py — Fallback heuristic: when a 400 error has a very short
   generic message AND the session is large (>40% of context or >80
   messages), treat it as a probable context overflow and trigger
   compression instead of aborting.

2. run_agent.py + gateway/run.py — Don't persist failed messages:
   when the agent returns failed=True before generating any response,
   skip writing the user's message to the transcript/DB. This prevents
   the session from growing on each failure.

3. gateway/run.py — Smarter error messages: detect context-overflow
   failures and suggest /compact or /reset specifically, instead of a
   generic 'try again' that will fail identically.

* fix(skills): detect prompt injection patterns and block cache file reads

Adds two security layers to prevent prompt injection via skills hub
cache files (#1558):

1. read_file: blocks direct reads of ~/.hermes/skills/.hub/ directory
   (index-cache, catalog files). The 3.5MB clawhub_catalog_v1.json
   was the original injection vector — untrusted skill descriptions
   in the catalog contained adversarial text that the model executed.

2. skill_view: warns when skills are loaded from outside the trusted
   ~/.hermes/skills/ directory, and detects common injection patterns
   in skill content ("ignore previous instructions", "<system>", etc.).

Cherry-picked from PR #1562 by ygd58.

---------

Co-authored-by: buray <ygd58@users.noreply.github.com>
2026-03-17 01:50:59 -07:00
Teknium
5ada0b95e9
Merge pull request #1609 from 0xbyt4/fix/context-counter-cache-tokens
fix: context counter shows cached token count in status bar
2026-03-17 01:45:12 -07:00
Teknium
eaa9ceeb43
Merge pull request #1621 from Death-Incarnate/main
fix: isolate test_anthropic_adapter from local credentials
2026-03-17 01:40:39 -07:00
Teknium
3576f44a57
feat: add Vercel AI Gateway provider (#1628)
* feat: add Vercel AI Gateway as a first-class provider

Adds AI Gateway (ai-gateway.vercel.sh) as a new inference provider
with AI_GATEWAY_API_KEY authentication, live model discovery, and
reasoning support via extra_body.reasoning.

Based on PR #1492 by jerilynzheng.

* feat: add AI Gateway to setup wizard, doctor, and fallback providers

* test: add AI Gateway to api_key_providers test suite

* feat: add AI Gateway to hermes model CLI and model metadata

Wire AI Gateway into the interactive model selection menu and add
context lengths for AI Gateway model IDs in model_metadata.py.

* feat: use claude-haiku-4.5 as AI Gateway auxiliary model

* revert: use gemini-3-flash as AI Gateway auxiliary model

* fix: move AI Gateway below established providers in selection order

---------

Co-authored-by: jerilynzheng <jerilynzheng@users.noreply.github.com>
Co-authored-by: jerilynzheng <zheng.jerilyn@gmail.com>
2026-03-17 00:12:16 -07:00
DeadMan
285300528b fix: isolate test_anthropic_adapter from local credentials
Two tests lacked filesystem isolation causing them to pick up real
~/.claude/.credentials.json tokens on machines with Claude Code installed.

- test_prefers_oauth_token_over_api_key: add tmp_path, mock Path.home,
  clear CLAUDE_CODE_OAUTH_TOKEN env
- test_falls_back_to_token: same isolation

Also commit run_agent.py generic-400 retry fix.
2026-03-16 22:53:32 -07:00
0xbyt4
8d0a96a8bf fix: context counter shows cached token count in status bar
Anthropic prompt caching splits input into cache_read_input_tokens,
cache_creation_input_tokens, and non-cached input_tokens. The context
counter only read input_tokens (non-cached portion), showing ~3 tokens
instead of the real ~18K total. Now includes cached portions for
Anthropic native provider only — other providers (OpenAI, OpenRouter,
Codex) already include cached tokens in their prompt_tokens field.

Before: 3/200K | 0%
After: 17.7K/200K | 9%
2026-03-17 05:06:11 +03:00
Teknium
2158c44efd
fix: Anthropic OAuth compatibility — Claude Code identity fingerprinting (#1597)
Anthropic routes OAuth/subscription requests based on Claude Code's
identity markers. Without them, requests get intermittent 500 errors
(~25% failure rate observed). This matches what pi-ai (clawdbot) and
OpenCode both implement for OAuth compatibility.

Changes (OAuth tokens only — API key users unaffected):

1. Headers: user-agent 'claude-cli/2.1.2 (external, cli)' + x-app 'cli'
2. System prompt: prepend 'You are Claude Code, Anthropic's official CLI'
3. System prompt sanitization: replace Hermes/Nous references
4. Tool names: prefix with 'mcp_' (Claude Code convention for non-native tools)
5. Tool name stripping: remove 'mcp_' prefix from response tool calls

Before: 9/12 OK, 1 hard fail, 4 needed retries (~25% error rate)
After: 16/16 OK, 0 failures, 0 retries (0% error rate)
2026-03-16 17:08:22 -07:00
Teknium
e6cf1c94a8
Merge pull request #1585 from 0xbyt4/fix/anthropic-error-handling
fix(anthropic): retry 429/529 errors and surface error details to users
2026-03-16 15:46:06 -07:00
0xbyt4
d998cac319 fix(anthropic): retry 429/529 errors and surface error details to users
- 429 rate limit and 529 overloaded were incorrectly treated as
  non-retryable client errors, causing immediate failure instead of
  exponential backoff retry. Users hitting Anthropic rate limits got
  silent failures or no response at all.
- Generic "Sorry, I encountered an unexpected error" now includes
  error type, details, and status-specific hints (auth, rate limit,
  overloaded).
- Failed agent with final_response=None now surfaces the actual
  error message instead of returning an empty response.
2026-03-17 01:07:11 +03:00
teknium1
f4d61c168b merge: resolve conflicts with main (show_cost, turn routing, docker docs) 2026-03-16 14:22:38 -07:00
Teknium
1ecfe68675
feat: improve memory prioritization + aggressive skill updates (inspired by OpenAI Codex)
* feat: improve memory prioritization — user preferences over procedural knowledge

Inspired by OpenAI Codex's memory prompt improvements (openai/codex#14493)
which focus memory writes on user preferences and recurring patterns
rather than procedural task details.

Key insight: 'Optimize for reducing future user steering — the most
valuable memory prevents the user from having to repeat themselves.'

Changes:
- MEMORY_GUIDANCE (prompt_builder.py): added prioritization hierarchy
  and the core principle about reducing user steering
- MEMORY_SCHEMA (memory_tool.py): reordered WHEN TO SAVE list to put
  corrections first, added explicit PRIORITY guidance
- Memory nudge (run_agent.py): now asks specifically about preferences,
  corrections, and workflow patterns instead of generic 'anything'
- Memory flush (run_agent.py): now instructs to prioritize user
  preferences and corrections over task-specific details

* feat: more aggressive skill creation and update prompting

Press harder on skill updates — the agent should proactively patch
skills when it encounters issues during use, not wait to be asked.

Changes:
- SKILLS_GUIDANCE: 'consider saving' → 'save'; added explicit instruction
  to patch skills immediately when found outdated/wrong
- Skills header: added instruction to update loaded skills before finishing
  if they had missing steps or wrong commands
- Skill nudge: more assertive ('save the approach' not 'consider saving'),
  now also prompts for updating existing skills used in the task
- Skill nudge interval: lowered default from 15 to 10 iterations
- skill_manage schema: added 'patch it immediately' to update triggers
2026-03-16 06:52:32 -07:00
teknium1
8e07f9ca56 fix: audit fixes — 5 bugs found and resolved
Thorough code review found 5 issues across run_agent.py, cli.py, and gateway/:

1. CRITICAL — Gateway stream consumer task never started: stream_consumer_holder
   was checked BEFORE run_sync populated it. Fixed with async polling pattern
   (same as track_agent).

2. MEDIUM-HIGH — Streaming fallback after partial delivery caused double-response:
   if streaming failed after some tokens were delivered, the fallback would
   re-deliver the full response. Now tracks deltas_were_sent and only falls
   back when no tokens reached consumers yet.

3. MEDIUM — Codex mode lost on_first_delta spinner callback: _run_codex_stream
   now accepts on_first_delta parameter, fires it on first text delta. Passed
   through from _interruptible_streaming_api_call via _codex_on_first_delta
   instance attribute.

4. MEDIUM — CLI close-tag after-text bypassed tag filtering: text after a
   reasoning close tag was sent directly to _emit_stream_text, skipping
   open-tag detection. Now routes through _stream_delta for full filtering.

5. LOW — Removed 140 lines of dead code: old _streaming_api_call method
   (superseded by _interruptible_streaming_api_call). Updated 13 tests in
   test_run_agent.py and test_openai_client_lifecycle.py to use the new
   method name and signature.

4573 tests passing.
2026-03-16 06:35:46 -07:00
teknium1
99369b926c fix: always fall back to non-streaming on ANY streaming error
Previously the fallback only triggered on specific error keywords like
'streaming is not supported'. Many third-party providers have partial
or broken streaming — rejecting stream=True, crashing on stream_options,
dropping connections mid-stream, returning malformed chunks, etc.

Now: any exception during the streaming API call triggers an automatic
fallback to the standard non-streaming request path. The error is logged
at INFO level for diagnostics but never surfaces to the user. If the
fallback also fails, THAT error propagates normally.

This ensures streaming is additive — it improves UX when it works but
never breaks providers that don't support it.

Tests: 2 new (any-error fallback, double-failure propagation), 15 total.
2026-03-16 06:15:09 -07:00
teknium1
ac739e485f fix(cli): reasoning tag suppression during streaming + fix fallback detection
Fixes two issues found during live testing:

1. Reasoning tag suppression: close tags like </REASONING_SCRATCHPAD>
   that arrive split across stream tokens (e.g. '</REASONING_SCRATCH' +
   'PAD>\n\nHello') were being lost because the buffer was discarded.
   Fix: keep a sliding window of the tail (max close tag length) so
   partial tags survive across tokens.

2. Streaming fallback detection was too broad — 'stream' matched any
   error containing that word (including 'stream_options' rejections).
   Narrowed to specific phrases: 'streaming is not', 'streaming not
   support', 'does not support stream', 'not available'.

Verified with real API calls: streaming works end-to-end with
reasoning block suppression, response box framing, and proper
fallback to Rich Panel when streaming isn't active.
2026-03-16 05:28:10 -07:00
teknium1
c1ac32737d feat: unified streaming infrastructure — core delta callbacks for all providers
Stage 1 of streaming support. Adds:

- stream_delta_callback parameter on AIAgent.__init__ for real-time token delivery
- _interruptible_streaming_api_call() handling chat_completions + anthropic_messages
- Enhanced _run_codex_stream() to fire delta callbacks during Codex streaming
- _fire_stream_delta() fires both display and TTS callbacks
- _fire_reasoning_delta() for reasoning content streaming
- Tool-call suppression: callbacks only fire on text-only responses
- on_first_delta callback for spinner control on first token
- Provider fallback: graceful degradation to non-streaming
- _has_stream_consumers() unifies stream_delta_callback and _stream_callback checks
- Anthropic streaming returns native Message for downstream compatibility

Drawing from PRs #922 (unified streaming), #1312 (gateway consumer),
#774 (Telegram streaming), #798 (CLI streaming), #1214 (reasoning modes).
Credit: jobless0x, OutThisLife, clicksingh, raulvidis.
2026-03-16 05:05:45 -07:00
Teknium
9e845a6e53
feat: major /rollback improvements — enabled by default, diff preview, file-level restore, conversation undo, terminal checkpoints
Checkpoint & rollback upgrades:

1. Enabled by default — checkpoints are now on for all new sessions.
   Zero cost when no file-mutating tools fire. Disable with
   checkpoints.enabled: false in config.yaml.

2. Diff preview — /rollback diff <N> shows a git diff between the
   checkpoint and current working tree before committing to a restore.

3. File-level restore — /rollback <N> <file> restores a single file
   from a checkpoint instead of the entire directory.

4. Conversation undo on rollback — when restoring files, the last
   chat turn is automatically undone so the agent's context matches
   the restored filesystem state.

5. Terminal command checkpoints — destructive terminal commands (rm,
   mv, sed -i, truncate, git reset/clean, output redirects) now
   trigger automatic checkpoints before execution. Previously only
   write_file and patch were covered.

6. Change summary in listing — /rollback now shows file count and
   +insertions/-deletions for each checkpoint.

7. Fixed dead code — removed duplicate _run_git call in
   list_checkpoints with nonsensical --all if False condition.

8. Updated help text — /rollback with no args now shows available
   subcommands (diff, file-level restore).
2026-03-16 04:43:37 -07:00
Teknium
dd7921d514
fix(honcho): isolate session routing for multi-user gateway (#1500)
Salvaged from PR #1470 by adavyas.

Core fix: Honcho tool calls in a multi-session gateway could route to
the wrong session because honcho_tools.py relied on process-global
state. Now threads session context through the call chain:
  AIAgent._invoke_tool() → handle_function_call() → registry.dispatch()
  → handler **kw → _resolve_session_context()

Changes:
- Add _resolve_session_context() to prefer per-call context over globals
- Plumb honcho_manager + honcho_session_key through handle_function_call
- Add sync_honcho=False to run_conversation() for synthetic flush turns
- Pass honcho_session_key through gateway memory flush lifecycle
- Harden gateway PID detection when /proc cmdline is unreadable
- Make interrupt test scripts import-safe for pytest-xdist
- Wrap BibTeX examples in Jekyll raw blocks for docs build
- Fix thread-order-dependent assertion in client lifecycle test
- Expand Honcho docs: session isolation, lifecycle, routing internals

Dropped from original PR:
- Indentation change in _create_request_openai_client that would move
  client creation inside the lock (causes unnecessary contention)

Co-authored-by: adavyas <adavyas@users.noreply.github.com>
2026-03-16 00:23:47 -07:00
Teknium
eb4f0348e1
fix: persist CLI token counts to session DB for /insights
Token usage was tracked in-memory during CLI sessions (session_prompt_tokens,
session_completion_tokens) but never written to the SQLite session DB. The
gateway persisted tokens via session_store.update_session(), but CLI sessions
always showed 0 tokens in /insights.

Now run_agent.py persists token deltas to the DB after each API call for CLI
sessions. Gateway sessions continue to use their existing persist path to
avoid double-counting.
2026-03-16 00:23:13 -07:00
Teknium
3f0f4a04a9
fix(agent): skip reasoning extra_body for unsupported OpenRouter models (#1485)
* fix(agent): skip reasoning extra_body for models that don't support it

Sending reasoning config to models like MiniMax or Nvidia via OpenRouter
causes a 400 BadRequestError. Previously, reasoning extra_body was sent
to all OpenRouter and Nous models unconditionally.

Fix: only send reasoning extra_body when the model slug starts with a
known reasoning-capable prefix (deepseek/, anthropic/, openai/, x-ai/,
google/gemini-2, qwen/qwen3) or when using Nous Portal directly.

Applies to both the main API call path (_build_api_kwargs) and the
conversation summary path.

Fixes #1083

* test(agent): cover reasoning extra_body gating

---------

Co-authored-by: ygd58 <buraysandro9@gmail.com>
2026-03-15 20:42:07 -07:00
Teknium
c564e1c3dc
feat(tools): centralize tool emoji metadata in registry + skin integration (#1484)
feat(tools): centralize tool emoji metadata in registry + skin integration
2026-03-15 20:35:24 -07:00
teknium1
210d5ade1e feat(tools): centralize tool emoji metadata in registry + skin integration
- Add 'emoji' field to ToolEntry and 'get_emoji()' to ToolRegistry
- Add emoji= to all 50+ registry.register() calls across tool files
- Add get_tool_emoji() helper in agent/display.py with 3-tier resolution:
  skin override → registry default → hardcoded fallback
- Replace hardcoded emoji maps in run_agent.py, delegate_tool.py, and
  gateway/run.py with centralized get_tool_emoji() calls
- Add 'tool_emojis' field to SkinConfig so skins can override per-tool
  emojis (e.g. ares skin could use swords instead of wrenches)
- Add 11 tests (5 registry emoji, 6 display/skin integration)
- Update AGENTS.md skin docs table

Based on the approach from PR #1061 by ForgingAlex (emoji centralization
in registry). This salvage fixes several issues from the original:
- Does NOT split the cronjob tool (which would crash on missing schemas)
- Does NOT change image_generate toolset/requires_env/is_async
- Does NOT delete existing tests
- Completes the centralization (gateway/run.py was missed)
- Hooks into the skin system for full customizability
2026-03-15 20:21:21 -07:00
Teknium
103f7b1ebc
fix: verbose mode shows full untruncated output
* fix(cli): silence tirith prefetch install warnings at startup

* fix: verbose mode now shows full untruncated tool args, results, content, and think blocks

When tool progress is set to 'verbose' (via /verbose or config), the display
was still truncating tool arguments to 100 chars, tool results to 100-200 chars,
assistant content to 100 chars, and think blocks to 5 lines. This defeated the
purpose of verbose mode.

Changes:
- Tool args: show full JSON args (not truncated to log_prefix_chars)
- Tool results: show full result content in both display and debug logs
- Assistant content: show full content during tool-call loops
- Think blocks: show full reasoning text (not truncated to 5 lines/100 chars)
- Auto-enable reasoning display when verbose mode is active
- Fix initial agent creation to respect verbose config (was always quiet_mode=True)
- Updated verbose label to mention think blocks
2026-03-15 20:03:37 -07:00
teknium1
93a0c0cddd fix: handle dict tool call arguments from local backends
Normalize tool call arguments when OpenAI-compatible backends return parsed dict/list payloads instead of JSON strings. This prevents the .strip() crash during tool-call validation for llama.cpp and similar servers, while preserving existing empty-string and invalid-JSON handling. Adds a focused regression test for dict arguments in the agent loop.
2026-03-15 08:00:19 -07:00
teknium1
f24c00a5bf fix(config): reload .env over stale shell overrides
Hermes startup entrypoints now load ~/.hermes/.env and project fallback env files with user config taking precedence over stale shell-exported values. This makes model/provider/base URL changes in .env actually take effect after restarting Hermes. Adds a shared env loader plus regression coverage, and reproduces the original bug case where OPENAI_BASE_URL and HERMES_INFERENCE_PROVIDER remained stuck on old shell values before import.
2026-03-15 06:46:28 -07:00
teknium1
62abb453d3 Merge origin/main into hermes/hermes-daa73839 2026-03-14 23:44:47 -07:00
teknium1
735a6e7651 fix: convert anthropic image content blocks 2026-03-14 23:41:20 -07:00
0xbyt4
6f85283553 fix: use json.dumps instead of str() for Codex Responses API arguments
When the Responses API returns tool call arguments as a dict,
str(dict) produces Python repr with single quotes (e.g. {'key': 'val'})
which is invalid JSON. Downstream json.loads() fails silently and the
tool gets called with empty arguments, losing all parameters.

Affects both function_call and custom_tool_call item types in
_normalize_codex_response().
2026-03-14 22:03:53 -07:00
yemi-lagosinternationalmarket
00c5e77724 fix: prevent closed OpenAI client reuse across retries
Use per-request OpenAI clients inside _interruptible_api_call so interrupts and transport failures do not poison later retries. Also add closed-client detection/recreation for the shared client and regression tests covering retry and concurrency behavior.
2026-03-14 21:56:00 -07:00
Teknium
fc5443d854
Merge pull request #1360 from NousResearch/hermes/hermes-aa701810
fix: refresh Anthropic OAuth before stale env tokens
2026-03-14 19:53:40 -07:00
teknium1
70ea13eb40 fix: preflight Anthropic auth and prefer Claude store 2026-03-14 19:38:55 -07:00
teknium1
e052c74727 fix: refresh Anthropic OAuth before stale env tokens 2026-03-14 19:22:31 -07:00
teknium1
df5c61b37c feat: compress cron management into one tool 2026-03-14 12:21:50 -07:00
teknium1
7b10881b9e fix: persist clean voice transcripts and /voice off state
- keep CLI voice prefixes API-local while storing the original user text
- persist explicit gateway off state and restore adapter auto-TTS suppression on restart
- add regression coverage for both behaviors
2026-03-14 06:14:22 -07:00
0xbyt4
eb34c0b09a fix: voice pipeline hardening — 7 bug fixes with tests
1. Anthropic + ElevenLabs TTS silence: forward full response to TTS
   callback for non-streaming providers (choices first, then native
   content blocks fallback).

2. Subprocess timeout kill: play_audio_file now kills the process on
   TimeoutExpired instead of leaving zombie processes.

3. Discord disconnect cleanup: leave all voice channels before closing
   the client to prevent leaked state.

4. Audio stream leak: close InputStream if stream.start() fails.

5. Race condition: read/write _on_silence_stop under lock in audio
   callback thread.

6. _vprint force=True: show API error, retry, and truncation messages
   even during streaming TTS.

7. _refresh_level lock: read _voice_recording under _voice_lock.
2026-03-14 14:27:21 +03:00
0xbyt4
cc0a453476 fix: address PR review round 5 — streaming guard, VC auth, history prefix, auto-TTS control
1. Gate _streaming_api_call to chat_completions mode only — Anthropic and
   Codex fall back to _interruptible_api_call. Preserve Anthropic base_url
   across all client rebuild paths (interrupt, fallback, 401 refresh).

2. Discord VC synthetic events now use chat_type="channel" instead of
   defaulting to "dm" — prevents session bleed into DM context.
   Authorization runs before echoing transcript. Sanitize @everyone/@here
   in voice transcripts.

3. CLI voice prefix ("[Voice input...]") is now API-call-local only —
   stripped from returned history so it never persists to session DB or
   resumed sessions.

4. /voice off now disables base adapter auto-TTS via _auto_tts_disabled_chats
   set — voice input no longer triggers TTS when voice mode is off.
2026-03-14 14:27:21 +03:00
0xbyt4
0ff1b4ade2 fix: harden web gateway security and fix error swallowing
- Use hmac.compare_digest for timing-safe token comparison (3 endpoints)
- Default bind to 127.0.0.1 instead of 0.0.0.0
- Sanitize upload filenames with Path.name to prevent path traversal
- Add DOMPurify to sanitize marked.parse() output against XSS
- Replace add_static with authenticated media handler
- Hide token in group chats for /remote-control command
- Use ctypes.util.find_library for Opus instead of hardcoded paths
- Add force=True to 5 interrupt _vprint calls for visibility
- Log Opus decode errors and voice restart failures instead of swallowing
2026-03-14 14:27:21 +03:00
0xbyt4
d646442692 fix: restore Anthropic interrupt handler in _interruptible_api_call
Rebase auto-merge silently overwrote main's Anthropic-aware interrupt
handler with the older OpenAI-only version. Without this fix, interrupting
an Anthropic API call closes the wrong client and leaves token generation
running on the Anthropic side.
2026-03-14 14:27:21 +03:00
0xbyt4
a78249230c fix: address voice mode PR review (streaming TTS, prompt cache, _vprint)
Bug A: Replace stale _HAS_ELEVENLABS/_HAS_AUDIO boolean imports with
lazy import function calls (_import_elevenlabs, _import_sounddevice).
The old constants no longer exist in tts_tool -- the try/except
silently swallowed the ImportError, leaving streaming TTS dead.

Bug B: Use user message prefix instead of modifying system prompt for
voice mode instruction. Changing ephemeral_system_prompt mid-session
invalidates the prompt cache. Now the concise-response hint is
prepended to the user_message passed to run_conversation while
conversation_history keeps the original text.

Minor: Add force parameter to _vprint so critical error messages
(max retries, non-retryable errors, API failures) are always shown
even during streaming TTS playback.

Tests: 15 new tests in test_voice_cli_integration.py covering all
three fixes -- lazy import activation, message prefix behavior,
history cleanliness, system prompt stability, and AST verification
that all critical _vprint calls use force=True.
2026-03-14 14:27:20 +03:00
0xbyt4
b859dfab16 fix: address voice mode review feedback
1. Fully lazy imports: sounddevice, numpy, elevenlabs, edge_tts, and
   openai are never imported at module level. Each is imported only when
   the feature is explicitly activated, preventing crashes in headless
   environments (SSH, Docker, WSL, no PortAudio).

2. No core agent loop changes: streaming TTS path extracted from
   _interruptible_api_call() into separate _streaming_api_call() method.
   The original method is restored to its upstream form.

3. Configurable key binding: push-to-talk key changed from Ctrl+R
   (conflicts with readline reverse-search) to Ctrl+B by default.
   Configurable via voice.push_to_talk_key in config.yaml.

4. Environment detection: new detect_audio_environment() function checks
   for SSH, Docker, WSL, and missing audio devices before enabling voice
   mode. Auto-disables with clear warnings in incompatible environments.

5. Graceful degradation: every audio touchpoint (sd.play, sd.InputStream,
   sd.OutputStream) wrapped in try/except with ImportError/OSError
   handling. Failures produce warnings, not crashes.
2026-03-14 14:27:20 +03:00
0xbyt4
46db7aeffd fix: streaming tool call parsing, error handling, and fake HA state mutation
- Fix Gemini streaming tool call merge bug: multiple tool calls with same
  index but different IDs are now parsed as separate calls instead of
  concatenating names (e.g. ha_call_serviceha_call_service)
- Handle partial results in voice mode: show error and stop continuous
  mode when agent returns partial/failed results with empty response
- Fix error display during streaming TTS: error messages are shown in
  full response box even when streaming box was already opened
- Add duplicate sentence filter in TTS: skip near-duplicate sentences
  from LLM repetition
- Fix fake HA server state mutation: turn_on/turn_off/set_temperature
  correctly update entity states; temperature sensor simulates change
  when thermostat is adjusted
2026-03-14 14:27:20 +03:00
0xbyt4
b00c5949fc fix: suppress verbose logs during streaming TTS, improve hallucination filter, stop continuous mode on errors
- Add _vprint() helper to suppress log output when stream_callback is active
- Expand Whisper hallucination filter with multi-language phrases and regex pattern for repetitive text
- Stop continuous voice mode when agent returns a failed result (e.g. 429 rate limit)
2026-03-14 14:26:55 +03:00
0xbyt4
179d9e1a22 feat: add streaming sentence-by-sentence TTS via ElevenLabs
Stream audio to speaker as the agent generates tokens instead of
waiting for the full response. First sentence plays within ~1-2s
of agent starting to respond.

- run_agent: add stream_callback to run_conversation/chat, streaming
  path in _interruptible_api_call accumulates chunks into mock
  ChatCompletion while forwarding content deltas to callback
- tts_tool: add stream_tts_to_speaker() with sentence buffering,
  think block filtering, markdown stripping, ElevenLabs pcm_24000
  streaming to sounddevice OutputStream
- cli: wire up streaming TTS pipeline in chat(), detect elevenlabs
  provider + sounddevice availability, skip batch TTS when streaming
  is active, signal stop on interrupt

Falls back to batch TTS for Edge/OpenAI providers or when
elevenlabs/sounddevice are not available. Zero impact on non-voice
mode (callback defaults to None).
2026-03-14 14:26:30 +03:00
teknium1
cbbba87099 fix: reuse shared atomic session log helper 2026-03-14 02:56:13 -07:00
alireza78a
f685741481 fix(agent): use atomic write in _save_session_log to prevent data loss 2026-03-14 02:53:01 -07:00
Teknium
1117a21065
Merge pull request #1271 from NousResearch/hermes/hermes-de3d4e49
fix: guard init-time stdio writes
2026-03-14 02:21:39 -07:00
teknium1
936040d8f7 fix: guard init-time stdio writes 2026-03-14 02:19:46 -07:00
Teknium
29176f302e
fix: sanitize chat payloads and provider precedence (#1253)
fix: sanitize chat payloads and provider precedence
2026-03-14 00:09:14 -07:00
Adavya Sharma
a628c607f0 fix: preserve chat kwargs identity when no sanitization is needed 2026-03-13 23:59:12 -07:00
Adavya Sharma
358dab52ce fix: sanitize chat payloads and provider precedence 2026-03-13 23:59:12 -07:00
Eris
c2a7921f3b fix: prevent logging handler accumulation in gateway mode
Use exact Path comparison instead of endswith to detect existing
errors.log handlers, avoiding false positives from similarly-named
log files.
2026-03-13 23:56:22 -07:00
0xIbra
437ec17125 fix(cli): respect HERMES_HOME in all remaining hardcoded ~/.hermes paths
Several files resolved paths via Path.home() / ".hermes" or
os.path.expanduser("~/.hermes/..."), bypassing the HERMES_HOME
environment variable. This broke isolation when running multiple
Hermes instances with distinct HERMES_HOME directories.

Replace all hardcoded paths with calls to get_hermes_home() from
hermes_cli.config, consistent with the rest of the codebase.

Files fixed:
- tools/process_registry.py (processes.json)
- gateway/pairing.py (pairing/)
- gateway/sticker_cache.py (sticker_cache.json)
- gateway/channel_directory.py (channel_directory.json, sessions.json)
- gateway/config.py (gateway.json, config.yaml, sessions_dir)
- gateway/mirror.py (sessions/)
- gateway/hooks.py (hooks/)
- gateway/platforms/base.py (image_cache/, audio_cache/, document_cache/)
- gateway/platforms/whatsapp.py (whatsapp/session)
- gateway/delivery.py (cron/output)
- agent/auxiliary_client.py (auth.json)
- agent/prompt_builder.py (SOUL.md)
- cli.py (config.yaml, images/, pastes/, history)
- run_agent.py (logs/)
- tools/environments/base.py (sandboxes/)
- tools/environments/modal.py (modal_snapshots.json)
- tools/environments/singularity.py (singularity_snapshots.json)
- tools/tts_tool.py (audio_cache)
- hermes_cli/status.py (cron/jobs.json, sessions.json)
- hermes_cli/gateway.py (logs/, whatsapp session)
- hermes_cli/main.py (whatsapp/session)

Tests updated to use HERMES_HOME env var instead of patching Path.home().

Closes #892

(cherry picked from commit 78ac1bba43b8b74a934c6172f2c29bb4d03164b9)
2026-03-13 21:32:53 -07:00
Teknium
938e887b4c
fix: keep honcho recall out of cached system prefix (#1201)
Attach later-turn Honcho recall to the current-turn user message at API
call time instead of appending it to the system prompt. This preserves the
stable system-prefix cache while keeping Honcho continuity context
available for the turn.

Also adds regression coverage for the injection helper and for continuing
sessions so Honcho recall stays out of the system prompt.
2026-03-13 21:07:00 -07:00
Teknium
b74facd119
fix: handle YAML null values in session reset policy + configurable API timeout (#1194)
* fix: Home Assistant event filtering now closed by default

Previously, when no watch_domains or watch_entities were configured,
ALL state_changed events passed through to the agent, causing users
to be flooded with notifications for every HA entity change.

Now events are dropped by default unless the user explicitly configures:
- watch_domains: list of domains to monitor (e.g. climate, light)
- watch_entities: list of specific entity IDs to monitor
- watch_all: true (new option — opt-in to receive all events)

A warning is logged at connect time if no filters are configured,
guiding users to set up their HA platform config.

All 49 gateway HA tests + 52 HA tool tests pass.

* docs: update Home Assistant integration documentation

- homeassistant.md: Fix event filtering docs to reflect closed-by-default
  behavior. Add watch_all option. Replace Python dict config example with
  YAML. Fix defaults table (was incorrectly showing 'all'). Add required
  configuration warning admonition.
- environment-variables.md: Add HASS_TOKEN and HASS_URL to Messaging section.
- messaging/index.md: Add Home Assistant to description, architecture
  diagram, platform toolsets table, and Next Steps links.

* fix(terminal): strip provider env vars from background and PTY subprocesses

Extends the env var blocklist from #1157 to also cover the two remaining
leaky paths in process_registry.py:

- spawn_local() PTY path (line 156)
- spawn_local() background Popen path (line 197)

Both were still using raw os.environ, leaking provider vars to background
processes and interactive PTY sessions. Now uses the same dynamic
_HERMES_PROVIDER_ENV_BLOCKLIST from local.py.

Explicit env_vars passed to spawn_local() still override the blocklist,
matching the existing behavior for callers that intentionally need these.

Gap identified by PR #1004 (@PeterFile).

* feat(delegate): add observability metadata to subagent results

Enrich delegate_task results with metadata from the child AIAgent:

- model: which model the child used
- exit_reason: completed | interrupted | max_iterations
- tokens.input / tokens.output: token counts
- tool_trace: per-tool-call trace with byte sizes and ok/error status

Tool trace uses tool_call_id matching to correctly pair parallel tool
calls with their results, with a fallback for messages without IDs.

Cherry-picked from PR #872 by @omerkaz, with fixes:
- Fixed parallel tool call trace pairing (was always updating last entry)
- Removed redundant 'iterations' field (identical to existing 'api_calls')
- Added test for parallel tool call trace correctness

Co-authored-by: omerkaz <omerkaz@users.noreply.github.com>

* feat(stt): add free local whisper transcription via faster-whisper

Replace OpenAI-only STT with a dual-provider system mirroring the TTS
architecture (Edge TTS free / ElevenLabs paid):

  STT: faster-whisper local (free, default) / OpenAI Whisper API (paid)

Changes:
- tools/transcription_tools.py: Full rewrite with provider dispatch,
  config loading, local faster-whisper backend, and OpenAI API backend.
  Auto-downloads model (~150MB for 'base') on first voice message.
  Singleton model instance reused across calls.
- pyproject.toml: Add faster-whisper>=1.0.0 as core dependency
- hermes_cli/config.py: Expand stt config to match TTS pattern with
  provider selection and per-provider model settings
- agent/context_compressor.py: Fix .strip() crash when LLM returns
  non-string content (dict from llama.cpp, None). Fixes #1100 partially.
- tests/: 23 new tests for STT providers + 2 for compressor fix
- docs/: Updated Voice & TTS page with STT provider table, model sizes,
  config examples, and fallback behavior

Fallback behavior:
- Local not installed → OpenAI API (if key set)
- OpenAI key not set → local whisper (if installed)
- Neither → graceful error message to user

Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>

* fix: handle YAML null values in session reset policy + configurable API timeout

Two fixes from PR #888 by @Jah-yee:

1. SessionResetPolicy.from_dict() — data.get('at_hour', 4) returns None
   when the YAML key exists with a null value. Now explicitly checks for
   None and falls back to defaults. Zero remains a valid value.

2. API timeout — hardcoded 900s is now configurable via HERMES_API_TIMEOUT
   env var. Useful for slow local models (llama.cpp) that need longer.

Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>

---------

Co-authored-by: omerkaz <omerkaz@users.noreply.github.com>
Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>
2026-03-13 11:16:42 -07:00
Teknium
0157253145
Merge pull request #1152 from NousResearch/hermes/hermes-f47f71c0
feat: concurrent tool execution with ThreadPoolExecutor
2026-03-13 03:20:38 -07:00
kshitijk4poor
ccfbf42844 feat: secure skill env setup on load (core #688)
When a skill declares required_environment_variables in its YAML
frontmatter, missing env vars trigger a secure TUI prompt (identical
to the sudo password widget) when the skill is loaded. Secrets flow
directly to ~/.hermes/.env, never entering LLM context.

Key changes:
- New required_environment_variables frontmatter field for skills
- Secure TUI widget (masked input, 120s timeout)
- Gateway safety: messaging platforms show local setup guidance
- Legacy prerequisites.env_vars normalized into new format
- Remote backend handling: conservative setup_needed=True
- Env var name validation, file permissions hardened to 0o600
- Redact patterns extended for secret-related JSON fields
- 12 existing skills updated with prerequisites declarations
- ~48 new tests covering skip, timeout, gateway, remote backends
- Dynamic panel widget sizing (fixes hardcoded width from original PR)

Cherry-picked from PR #723 by kshitijk4poor, rebased onto current main
with conflict resolution.

Fixes #688

Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
2026-03-13 03:14:04 -07:00
teknium1
5d0d5b191c feat: concurrent tool execution with ThreadPoolExecutor
When the model returns multiple tool calls in a single response, they are
now executed concurrently using a thread pool instead of sequentially.
This significantly reduces wall-clock time when multiple independent tools
are batched (e.g. parallel web_search, read_file, terminal calls).

Architecture:
- _execute_tool_calls() dispatches to sequential or concurrent path
- Single tool calls and batches containing 'clarify' use sequential path
- Multiple non-interactive tools use ThreadPoolExecutor (max 8 workers)
- Results are collected and appended to messages in original order
- _invoke_tool() extracted as shared tool invocation helper

Safety:
- Pre-flight interrupt check skips all tools if interrupted
- Per-tool exception handling: one failure doesn't crash the batch
- Result truncation (100k char limit) applied per tool
- Budget pressure injection after all tools complete
- Checkpoints taken before file-mutating tools
- CLI spinner shows batch progress, then per-tool completion messages

Tests: 10 new tests covering dispatch logic, ordering, error handling,
interrupt behavior, truncation, and _invoke_tool routing.
2026-03-13 02:51:51 -07:00
kshitijk4poor
bb3f5ed32a fix: separate Anthropic OAuth tokens from API keys
Persist OAuth/setup tokens in ANTHROPIC_TOKEN instead of ANTHROPIC_API_KEY.
Reserve ANTHROPIC_API_KEY for regular Console API keys.

Changes:
- anthropic_adapter: reorder resolve_anthropic_token() priority —
  ANTHROPIC_TOKEN first, ANTHROPIC_API_KEY as legacy fallback
- config: add save_anthropic_oauth_token() / save_anthropic_api_key() helpers
  that clear the opposing slot to prevent priority conflicts
- config: show_config() prefers ANTHROPIC_TOKEN for display
- setup: OAuth login and pasted setup-tokens write to ANTHROPIC_TOKEN
- setup: API key entry writes to ANTHROPIC_API_KEY and clears ANTHROPIC_TOKEN
- main: same fixes in _run_anthropic_oauth_flow() and _model_flow_anthropic()
- main: _has_any_provider_configured() checks ANTHROPIC_TOKEN
- doctor: use _is_oauth_token() for correct auth method validation
- runtime_provider: updated error message
- run_agent: simplified client init to use resolve_anthropic_token()
- run_agent: updated 401 troubleshooting messages
- status: prefer ANTHROPIC_TOKEN in status display
- tests: updated priority test, added persistence helper tests

Cherry-picked from PR #1141 by kshitijk4poor, rebased onto current main
with unrelated changes (web_policy config, blocklist CLI) removed.

Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
2026-03-13 02:09:52 -07:00
Teknium
9dfa81ab4b
Merge pull request #1125 from NousResearch/hermes/hermes-c877bdeb
fix(anthropic): add diagnostic output on 401 auth failures
2026-03-12 19:15:21 -07:00
teknium1
e5b8e06037 fix(anthropic): add diagnostic output on 401 auth failures
When Anthropic returns 401 and credential refresh doesn't help,
now prints actionable troubleshooting info:
- Which auth method was used (Bearer vs x-api-key)
- Token prefix for debugging
- Common fixes (stale ANTHROPIC_API_KEY, verify key, refresh login)
- How to clear stale keys
2026-03-12 19:09:06 -07:00
Teknium
a282322845
Merge pull request #1121 from 0xbyt4/fix/anthropic-adapter-issues
fix: anthropic adapter — max_tokens, fallback crash, proxy base_url
2026-03-12 19:07:06 -07:00
Teknium
475dd58a8e
Merge PR #736: feat(honcho): async writes, memory modes, session title integration, setup CLI
Authored by erosika. Builds on #38 and #243.

Adds async write support, configurable memory modes, context prefetch pipeline,
4 new Honcho tools (honcho_context, honcho_profile, honcho_search, honcho_conclude),
full 'hermes honcho' CLI, session strategies, AI peer identity, recallMode A/B,
gateway lifecycle management, and comprehensive docs.

Cherry-picks fixes from PRs #831/#832 (adavyas).

Co-authored-by: erosika <erosika@users.noreply.github.com>
Co-authored-by: adavyas <adavyas@users.noreply.github.com>
2026-03-12 19:05:11 -07:00
0xbyt4
22479b053c fix: anthropic adapter — max_tokens ignored, fallback crash, proxy base_url filtered
- Pass self.max_tokens to build_anthropic_kwargs instead of hardcoded None
- Add anthropic case to _try_activate_fallback (was only handling openai-codex)
- Remove 'anthropic in base_url' filter that blocked custom proxy URLs
2026-03-13 04:22:16 +03:00
teknium1
e976879cf2 merge: resolve conflicts with main (URL update to hermes-agent.nousresearch.com) 2026-03-12 17:49:26 -07:00
teknium1
7f7282c78d fix(anthropic): guard memory flush tool_calls extraction for Anthropic response format
The memory flush path extracted tool_calls from the response assuming
OpenAI format (response.choices[0].message.tool_calls). When using
the Anthropic client directly (aux unavailable), the response is an
Anthropic Message object which has no .choices attribute. Now uses
normalize_anthropic_response() to extract tool_calls correctly.
2026-03-12 17:35:01 -07:00