* feat: add Vercel AI Gateway as a first-class provider
Adds AI Gateway (ai-gateway.vercel.sh) as a new inference provider
with AI_GATEWAY_API_KEY authentication, live model discovery, and
reasoning support via extra_body.reasoning.
Based on PR #1492 by jerilynzheng.
* feat: add AI Gateway to setup wizard, doctor, and fallback providers
* test: add AI Gateway to api_key_providers test suite
* feat: add AI Gateway to hermes model CLI and model metadata
Wire AI Gateway into the interactive model selection menu and add
context lengths for AI Gateway model IDs in model_metadata.py.
* feat: use claude-haiku-4.5 as AI Gateway auxiliary model
* revert: use gemini-3-flash as AI Gateway auxiliary model
* fix: move AI Gateway below established providers in selection order
---------
Co-authored-by: jerilynzheng <jerilynzheng@users.noreply.github.com>
Co-authored-by: jerilynzheng <zheng.jerilyn@gmail.com>
Two tests lacked filesystem isolation causing them to pick up real
~/.claude/.credentials.json tokens on machines with Claude Code installed.
- test_prefers_oauth_token_over_api_key: add tmp_path, mock Path.home,
clear CLAUDE_CODE_OAUTH_TOKEN env
- test_falls_back_to_token: same isolation
Also commit run_agent.py generic-400 retry fix.
Anthropic prompt caching splits input into cache_read_input_tokens,
cache_creation_input_tokens, and non-cached input_tokens. The context
counter only read input_tokens (non-cached portion), showing ~3 tokens
instead of the real ~18K total. Now includes cached portions for
Anthropic native provider only — other providers (OpenAI, OpenRouter,
Codex) already include cached tokens in their prompt_tokens field.
Before: 3/200K | 0%
After: 17.7K/200K | 9%
- 429 rate limit and 529 overloaded were incorrectly treated as
non-retryable client errors, causing immediate failure instead of
exponential backoff retry. Users hitting Anthropic rate limits got
silent failures or no response at all.
- Generic "Sorry, I encountered an unexpected error" now includes
error type, details, and status-specific hints (auth, rate limit,
overloaded).
- Failed agent with final_response=None now surfaces the actual
error message instead of returning an empty response.
* feat: improve memory prioritization — user preferences over procedural knowledge
Inspired by OpenAI Codex's memory prompt improvements (openai/codex#14493)
which focus memory writes on user preferences and recurring patterns
rather than procedural task details.
Key insight: 'Optimize for reducing future user steering — the most
valuable memory prevents the user from having to repeat themselves.'
Changes:
- MEMORY_GUIDANCE (prompt_builder.py): added prioritization hierarchy
and the core principle about reducing user steering
- MEMORY_SCHEMA (memory_tool.py): reordered WHEN TO SAVE list to put
corrections first, added explicit PRIORITY guidance
- Memory nudge (run_agent.py): now asks specifically about preferences,
corrections, and workflow patterns instead of generic 'anything'
- Memory flush (run_agent.py): now instructs to prioritize user
preferences and corrections over task-specific details
* feat: more aggressive skill creation and update prompting
Press harder on skill updates — the agent should proactively patch
skills when it encounters issues during use, not wait to be asked.
Changes:
- SKILLS_GUIDANCE: 'consider saving' → 'save'; added explicit instruction
to patch skills immediately when found outdated/wrong
- Skills header: added instruction to update loaded skills before finishing
if they had missing steps or wrong commands
- Skill nudge: more assertive ('save the approach' not 'consider saving'),
now also prompts for updating existing skills used in the task
- Skill nudge interval: lowered default from 15 to 10 iterations
- skill_manage schema: added 'patch it immediately' to update triggers
Thorough code review found 5 issues across run_agent.py, cli.py, and gateway/:
1. CRITICAL — Gateway stream consumer task never started: stream_consumer_holder
was checked BEFORE run_sync populated it. Fixed with async polling pattern
(same as track_agent).
2. MEDIUM-HIGH — Streaming fallback after partial delivery caused double-response:
if streaming failed after some tokens were delivered, the fallback would
re-deliver the full response. Now tracks deltas_were_sent and only falls
back when no tokens reached consumers yet.
3. MEDIUM — Codex mode lost on_first_delta spinner callback: _run_codex_stream
now accepts on_first_delta parameter, fires it on first text delta. Passed
through from _interruptible_streaming_api_call via _codex_on_first_delta
instance attribute.
4. MEDIUM — CLI close-tag after-text bypassed tag filtering: text after a
reasoning close tag was sent directly to _emit_stream_text, skipping
open-tag detection. Now routes through _stream_delta for full filtering.
5. LOW — Removed 140 lines of dead code: old _streaming_api_call method
(superseded by _interruptible_streaming_api_call). Updated 13 tests in
test_run_agent.py and test_openai_client_lifecycle.py to use the new
method name and signature.
4573 tests passing.
Previously the fallback only triggered on specific error keywords like
'streaming is not supported'. Many third-party providers have partial
or broken streaming — rejecting stream=True, crashing on stream_options,
dropping connections mid-stream, returning malformed chunks, etc.
Now: any exception during the streaming API call triggers an automatic
fallback to the standard non-streaming request path. The error is logged
at INFO level for diagnostics but never surfaces to the user. If the
fallback also fails, THAT error propagates normally.
This ensures streaming is additive — it improves UX when it works but
never breaks providers that don't support it.
Tests: 2 new (any-error fallback, double-failure propagation), 15 total.
Fixes two issues found during live testing:
1. Reasoning tag suppression: close tags like </REASONING_SCRATCHPAD>
that arrive split across stream tokens (e.g. '</REASONING_SCRATCH' +
'PAD>\n\nHello') were being lost because the buffer was discarded.
Fix: keep a sliding window of the tail (max close tag length) so
partial tags survive across tokens.
2. Streaming fallback detection was too broad — 'stream' matched any
error containing that word (including 'stream_options' rejections).
Narrowed to specific phrases: 'streaming is not', 'streaming not
support', 'does not support stream', 'not available'.
Verified with real API calls: streaming works end-to-end with
reasoning block suppression, response box framing, and proper
fallback to Rich Panel when streaming isn't active.
Checkpoint & rollback upgrades:
1. Enabled by default — checkpoints are now on for all new sessions.
Zero cost when no file-mutating tools fire. Disable with
checkpoints.enabled: false in config.yaml.
2. Diff preview — /rollback diff <N> shows a git diff between the
checkpoint and current working tree before committing to a restore.
3. File-level restore — /rollback <N> <file> restores a single file
from a checkpoint instead of the entire directory.
4. Conversation undo on rollback — when restoring files, the last
chat turn is automatically undone so the agent's context matches
the restored filesystem state.
5. Terminal command checkpoints — destructive terminal commands (rm,
mv, sed -i, truncate, git reset/clean, output redirects) now
trigger automatic checkpoints before execution. Previously only
write_file and patch were covered.
6. Change summary in listing — /rollback now shows file count and
+insertions/-deletions for each checkpoint.
7. Fixed dead code — removed duplicate _run_git call in
list_checkpoints with nonsensical --all if False condition.
8. Updated help text — /rollback with no args now shows available
subcommands (diff, file-level restore).
Salvaged from PR #1470 by adavyas.
Core fix: Honcho tool calls in a multi-session gateway could route to
the wrong session because honcho_tools.py relied on process-global
state. Now threads session context through the call chain:
AIAgent._invoke_tool() → handle_function_call() → registry.dispatch()
→ handler **kw → _resolve_session_context()
Changes:
- Add _resolve_session_context() to prefer per-call context over globals
- Plumb honcho_manager + honcho_session_key through handle_function_call
- Add sync_honcho=False to run_conversation() for synthetic flush turns
- Pass honcho_session_key through gateway memory flush lifecycle
- Harden gateway PID detection when /proc cmdline is unreadable
- Make interrupt test scripts import-safe for pytest-xdist
- Wrap BibTeX examples in Jekyll raw blocks for docs build
- Fix thread-order-dependent assertion in client lifecycle test
- Expand Honcho docs: session isolation, lifecycle, routing internals
Dropped from original PR:
- Indentation change in _create_request_openai_client that would move
client creation inside the lock (causes unnecessary contention)
Co-authored-by: adavyas <adavyas@users.noreply.github.com>
Token usage was tracked in-memory during CLI sessions (session_prompt_tokens,
session_completion_tokens) but never written to the SQLite session DB. The
gateway persisted tokens via session_store.update_session(), but CLI sessions
always showed 0 tokens in /insights.
Now run_agent.py persists token deltas to the DB after each API call for CLI
sessions. Gateway sessions continue to use their existing persist path to
avoid double-counting.
* fix(agent): skip reasoning extra_body for models that don't support it
Sending reasoning config to models like MiniMax or Nvidia via OpenRouter
causes a 400 BadRequestError. Previously, reasoning extra_body was sent
to all OpenRouter and Nous models unconditionally.
Fix: only send reasoning extra_body when the model slug starts with a
known reasoning-capable prefix (deepseek/, anthropic/, openai/, x-ai/,
google/gemini-2, qwen/qwen3) or when using Nous Portal directly.
Applies to both the main API call path (_build_api_kwargs) and the
conversation summary path.
Fixes#1083
* test(agent): cover reasoning extra_body gating
---------
Co-authored-by: ygd58 <buraysandro9@gmail.com>
- Add 'emoji' field to ToolEntry and 'get_emoji()' to ToolRegistry
- Add emoji= to all 50+ registry.register() calls across tool files
- Add get_tool_emoji() helper in agent/display.py with 3-tier resolution:
skin override → registry default → hardcoded fallback
- Replace hardcoded emoji maps in run_agent.py, delegate_tool.py, and
gateway/run.py with centralized get_tool_emoji() calls
- Add 'tool_emojis' field to SkinConfig so skins can override per-tool
emojis (e.g. ares skin could use swords instead of wrenches)
- Add 11 tests (5 registry emoji, 6 display/skin integration)
- Update AGENTS.md skin docs table
Based on the approach from PR #1061 by ForgingAlex (emoji centralization
in registry). This salvage fixes several issues from the original:
- Does NOT split the cronjob tool (which would crash on missing schemas)
- Does NOT change image_generate toolset/requires_env/is_async
- Does NOT delete existing tests
- Completes the centralization (gateway/run.py was missed)
- Hooks into the skin system for full customizability
* fix(cli): silence tirith prefetch install warnings at startup
* fix: verbose mode now shows full untruncated tool args, results, content, and think blocks
When tool progress is set to 'verbose' (via /verbose or config), the display
was still truncating tool arguments to 100 chars, tool results to 100-200 chars,
assistant content to 100 chars, and think blocks to 5 lines. This defeated the
purpose of verbose mode.
Changes:
- Tool args: show full JSON args (not truncated to log_prefix_chars)
- Tool results: show full result content in both display and debug logs
- Assistant content: show full content during tool-call loops
- Think blocks: show full reasoning text (not truncated to 5 lines/100 chars)
- Auto-enable reasoning display when verbose mode is active
- Fix initial agent creation to respect verbose config (was always quiet_mode=True)
- Updated verbose label to mention think blocks
Normalize tool call arguments when OpenAI-compatible backends return parsed dict/list payloads instead of JSON strings. This prevents the .strip() crash during tool-call validation for llama.cpp and similar servers, while preserving existing empty-string and invalid-JSON handling. Adds a focused regression test for dict arguments in the agent loop.
Hermes startup entrypoints now load ~/.hermes/.env and project fallback env files with user config taking precedence over stale shell-exported values. This makes model/provider/base URL changes in .env actually take effect after restarting Hermes. Adds a shared env loader plus regression coverage, and reproduces the original bug case where OPENAI_BASE_URL and HERMES_INFERENCE_PROVIDER remained stuck on old shell values before import.
When the Responses API returns tool call arguments as a dict,
str(dict) produces Python repr with single quotes (e.g. {'key': 'val'})
which is invalid JSON. Downstream json.loads() fails silently and the
tool gets called with empty arguments, losing all parameters.
Affects both function_call and custom_tool_call item types in
_normalize_codex_response().
Use per-request OpenAI clients inside _interruptible_api_call so interrupts and transport failures do not poison later retries. Also add closed-client detection/recreation for the shared client and regression tests covering retry and concurrency behavior.
- keep CLI voice prefixes API-local while storing the original user text
- persist explicit gateway off state and restore adapter auto-TTS suppression on restart
- add regression coverage for both behaviors
1. Anthropic + ElevenLabs TTS silence: forward full response to TTS
callback for non-streaming providers (choices first, then native
content blocks fallback).
2. Subprocess timeout kill: play_audio_file now kills the process on
TimeoutExpired instead of leaving zombie processes.
3. Discord disconnect cleanup: leave all voice channels before closing
the client to prevent leaked state.
4. Audio stream leak: close InputStream if stream.start() fails.
5. Race condition: read/write _on_silence_stop under lock in audio
callback thread.
6. _vprint force=True: show API error, retry, and truncation messages
even during streaming TTS.
7. _refresh_level lock: read _voice_recording under _voice_lock.
1. Gate _streaming_api_call to chat_completions mode only — Anthropic and
Codex fall back to _interruptible_api_call. Preserve Anthropic base_url
across all client rebuild paths (interrupt, fallback, 401 refresh).
2. Discord VC synthetic events now use chat_type="channel" instead of
defaulting to "dm" — prevents session bleed into DM context.
Authorization runs before echoing transcript. Sanitize @everyone/@here
in voice transcripts.
3. CLI voice prefix ("[Voice input...]") is now API-call-local only —
stripped from returned history so it never persists to session DB or
resumed sessions.
4. /voice off now disables base adapter auto-TTS via _auto_tts_disabled_chats
set — voice input no longer triggers TTS when voice mode is off.
- Use hmac.compare_digest for timing-safe token comparison (3 endpoints)
- Default bind to 127.0.0.1 instead of 0.0.0.0
- Sanitize upload filenames with Path.name to prevent path traversal
- Add DOMPurify to sanitize marked.parse() output against XSS
- Replace add_static with authenticated media handler
- Hide token in group chats for /remote-control command
- Use ctypes.util.find_library for Opus instead of hardcoded paths
- Add force=True to 5 interrupt _vprint calls for visibility
- Log Opus decode errors and voice restart failures instead of swallowing
Rebase auto-merge silently overwrote main's Anthropic-aware interrupt
handler with the older OpenAI-only version. Without this fix, interrupting
an Anthropic API call closes the wrong client and leaves token generation
running on the Anthropic side.
Bug A: Replace stale _HAS_ELEVENLABS/_HAS_AUDIO boolean imports with
lazy import function calls (_import_elevenlabs, _import_sounddevice).
The old constants no longer exist in tts_tool -- the try/except
silently swallowed the ImportError, leaving streaming TTS dead.
Bug B: Use user message prefix instead of modifying system prompt for
voice mode instruction. Changing ephemeral_system_prompt mid-session
invalidates the prompt cache. Now the concise-response hint is
prepended to the user_message passed to run_conversation while
conversation_history keeps the original text.
Minor: Add force parameter to _vprint so critical error messages
(max retries, non-retryable errors, API failures) are always shown
even during streaming TTS playback.
Tests: 15 new tests in test_voice_cli_integration.py covering all
three fixes -- lazy import activation, message prefix behavior,
history cleanliness, system prompt stability, and AST verification
that all critical _vprint calls use force=True.
1. Fully lazy imports: sounddevice, numpy, elevenlabs, edge_tts, and
openai are never imported at module level. Each is imported only when
the feature is explicitly activated, preventing crashes in headless
environments (SSH, Docker, WSL, no PortAudio).
2. No core agent loop changes: streaming TTS path extracted from
_interruptible_api_call() into separate _streaming_api_call() method.
The original method is restored to its upstream form.
3. Configurable key binding: push-to-talk key changed from Ctrl+R
(conflicts with readline reverse-search) to Ctrl+B by default.
Configurable via voice.push_to_talk_key in config.yaml.
4. Environment detection: new detect_audio_environment() function checks
for SSH, Docker, WSL, and missing audio devices before enabling voice
mode. Auto-disables with clear warnings in incompatible environments.
5. Graceful degradation: every audio touchpoint (sd.play, sd.InputStream,
sd.OutputStream) wrapped in try/except with ImportError/OSError
handling. Failures produce warnings, not crashes.
- Fix Gemini streaming tool call merge bug: multiple tool calls with same
index but different IDs are now parsed as separate calls instead of
concatenating names (e.g. ha_call_serviceha_call_service)
- Handle partial results in voice mode: show error and stop continuous
mode when agent returns partial/failed results with empty response
- Fix error display during streaming TTS: error messages are shown in
full response box even when streaming box was already opened
- Add duplicate sentence filter in TTS: skip near-duplicate sentences
from LLM repetition
- Fix fake HA server state mutation: turn_on/turn_off/set_temperature
correctly update entity states; temperature sensor simulates change
when thermostat is adjusted
- Add _vprint() helper to suppress log output when stream_callback is active
- Expand Whisper hallucination filter with multi-language phrases and regex pattern for repetitive text
- Stop continuous voice mode when agent returns a failed result (e.g. 429 rate limit)
Stream audio to speaker as the agent generates tokens instead of
waiting for the full response. First sentence plays within ~1-2s
of agent starting to respond.
- run_agent: add stream_callback to run_conversation/chat, streaming
path in _interruptible_api_call accumulates chunks into mock
ChatCompletion while forwarding content deltas to callback
- tts_tool: add stream_tts_to_speaker() with sentence buffering,
think block filtering, markdown stripping, ElevenLabs pcm_24000
streaming to sounddevice OutputStream
- cli: wire up streaming TTS pipeline in chat(), detect elevenlabs
provider + sounddevice availability, skip batch TTS when streaming
is active, signal stop on interrupt
Falls back to batch TTS for Edge/OpenAI providers or when
elevenlabs/sounddevice are not available. Zero impact on non-voice
mode (callback defaults to None).
Attach later-turn Honcho recall to the current-turn user message at API
call time instead of appending it to the system prompt. This preserves the
stable system-prefix cache while keeping Honcho continuity context
available for the turn.
Also adds regression coverage for the injection helper and for continuing
sessions so Honcho recall stays out of the system prompt.
* fix: Home Assistant event filtering now closed by default
Previously, when no watch_domains or watch_entities were configured,
ALL state_changed events passed through to the agent, causing users
to be flooded with notifications for every HA entity change.
Now events are dropped by default unless the user explicitly configures:
- watch_domains: list of domains to monitor (e.g. climate, light)
- watch_entities: list of specific entity IDs to monitor
- watch_all: true (new option — opt-in to receive all events)
A warning is logged at connect time if no filters are configured,
guiding users to set up their HA platform config.
All 49 gateway HA tests + 52 HA tool tests pass.
* docs: update Home Assistant integration documentation
- homeassistant.md: Fix event filtering docs to reflect closed-by-default
behavior. Add watch_all option. Replace Python dict config example with
YAML. Fix defaults table (was incorrectly showing 'all'). Add required
configuration warning admonition.
- environment-variables.md: Add HASS_TOKEN and HASS_URL to Messaging section.
- messaging/index.md: Add Home Assistant to description, architecture
diagram, platform toolsets table, and Next Steps links.
* fix(terminal): strip provider env vars from background and PTY subprocesses
Extends the env var blocklist from #1157 to also cover the two remaining
leaky paths in process_registry.py:
- spawn_local() PTY path (line 156)
- spawn_local() background Popen path (line 197)
Both were still using raw os.environ, leaking provider vars to background
processes and interactive PTY sessions. Now uses the same dynamic
_HERMES_PROVIDER_ENV_BLOCKLIST from local.py.
Explicit env_vars passed to spawn_local() still override the blocklist,
matching the existing behavior for callers that intentionally need these.
Gap identified by PR #1004 (@PeterFile).
* feat(delegate): add observability metadata to subagent results
Enrich delegate_task results with metadata from the child AIAgent:
- model: which model the child used
- exit_reason: completed | interrupted | max_iterations
- tokens.input / tokens.output: token counts
- tool_trace: per-tool-call trace with byte sizes and ok/error status
Tool trace uses tool_call_id matching to correctly pair parallel tool
calls with their results, with a fallback for messages without IDs.
Cherry-picked from PR #872 by @omerkaz, with fixes:
- Fixed parallel tool call trace pairing (was always updating last entry)
- Removed redundant 'iterations' field (identical to existing 'api_calls')
- Added test for parallel tool call trace correctness
Co-authored-by: omerkaz <omerkaz@users.noreply.github.com>
* feat(stt): add free local whisper transcription via faster-whisper
Replace OpenAI-only STT with a dual-provider system mirroring the TTS
architecture (Edge TTS free / ElevenLabs paid):
STT: faster-whisper local (free, default) / OpenAI Whisper API (paid)
Changes:
- tools/transcription_tools.py: Full rewrite with provider dispatch,
config loading, local faster-whisper backend, and OpenAI API backend.
Auto-downloads model (~150MB for 'base') on first voice message.
Singleton model instance reused across calls.
- pyproject.toml: Add faster-whisper>=1.0.0 as core dependency
- hermes_cli/config.py: Expand stt config to match TTS pattern with
provider selection and per-provider model settings
- agent/context_compressor.py: Fix .strip() crash when LLM returns
non-string content (dict from llama.cpp, None). Fixes#1100 partially.
- tests/: 23 new tests for STT providers + 2 for compressor fix
- docs/: Updated Voice & TTS page with STT provider table, model sizes,
config examples, and fallback behavior
Fallback behavior:
- Local not installed → OpenAI API (if key set)
- OpenAI key not set → local whisper (if installed)
- Neither → graceful error message to user
Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>
* fix: handle YAML null values in session reset policy + configurable API timeout
Two fixes from PR #888 by @Jah-yee:
1. SessionResetPolicy.from_dict() — data.get('at_hour', 4) returns None
when the YAML key exists with a null value. Now explicitly checks for
None and falls back to defaults. Zero remains a valid value.
2. API timeout — hardcoded 900s is now configurable via HERMES_API_TIMEOUT
env var. Useful for slow local models (llama.cpp) that need longer.
Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>
---------
Co-authored-by: omerkaz <omerkaz@users.noreply.github.com>
Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>
When a skill declares required_environment_variables in its YAML
frontmatter, missing env vars trigger a secure TUI prompt (identical
to the sudo password widget) when the skill is loaded. Secrets flow
directly to ~/.hermes/.env, never entering LLM context.
Key changes:
- New required_environment_variables frontmatter field for skills
- Secure TUI widget (masked input, 120s timeout)
- Gateway safety: messaging platforms show local setup guidance
- Legacy prerequisites.env_vars normalized into new format
- Remote backend handling: conservative setup_needed=True
- Env var name validation, file permissions hardened to 0o600
- Redact patterns extended for secret-related JSON fields
- 12 existing skills updated with prerequisites declarations
- ~48 new tests covering skip, timeout, gateway, remote backends
- Dynamic panel widget sizing (fixes hardcoded width from original PR)
Cherry-picked from PR #723 by kshitijk4poor, rebased onto current main
with conflict resolution.
Fixes#688
Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
When the model returns multiple tool calls in a single response, they are
now executed concurrently using a thread pool instead of sequentially.
This significantly reduces wall-clock time when multiple independent tools
are batched (e.g. parallel web_search, read_file, terminal calls).
Architecture:
- _execute_tool_calls() dispatches to sequential or concurrent path
- Single tool calls and batches containing 'clarify' use sequential path
- Multiple non-interactive tools use ThreadPoolExecutor (max 8 workers)
- Results are collected and appended to messages in original order
- _invoke_tool() extracted as shared tool invocation helper
Safety:
- Pre-flight interrupt check skips all tools if interrupted
- Per-tool exception handling: one failure doesn't crash the batch
- Result truncation (100k char limit) applied per tool
- Budget pressure injection after all tools complete
- Checkpoints taken before file-mutating tools
- CLI spinner shows batch progress, then per-tool completion messages
Tests: 10 new tests covering dispatch logic, ordering, error handling,
interrupt behavior, truncation, and _invoke_tool routing.
When Anthropic returns 401 and credential refresh doesn't help,
now prints actionable troubleshooting info:
- Which auth method was used (Bearer vs x-api-key)
- Token prefix for debugging
- Common fixes (stale ANTHROPIC_API_KEY, verify key, refresh login)
- How to clear stale keys
- Pass self.max_tokens to build_anthropic_kwargs instead of hardcoded None
- Add anthropic case to _try_activate_fallback (was only handling openai-codex)
- Remove 'anthropic in base_url' filter that blocked custom proxy URLs
The memory flush path extracted tool_calls from the response assuming
OpenAI format (response.choices[0].message.tool_calls). When using
the Anthropic client directly (aux unavailable), the response is an
Anthropic Message object which has no .choices attribute. Now uses
normalize_anthropic_response() to extract tool_calls correctly.
Remaining issues from deep scan:
Adapter (agent/anthropic_adapter.py):
- Add _sanitize_tool_id() — Anthropic requires IDs matching [a-zA-Z0-9_-],
now strips invalid chars and ensures non-empty (both tool_use and tool_result)
- Empty tool result content → '(no output)' placeholder (Anthropic rejects empty)
- Set temperature=1 when thinking type='enabled' on older models (required)
- normalize_model_name now case-insensitive for 'Anthropic/' prefix
- Fix stale docstrings referencing only ~/.claude/.credentials.json
Agent loop (run_agent.py):
- Guard memory flush path (line ~2684) — was calling self.client.chat.completions
which is None in anthropic_messages mode. Now routes through Anthropic client.
- Guard summary generation path (line ~3171) — same crash when reaching
iteration limit. Now builds proper Anthropic kwargs and normalizes response.
- Guard retry summary path (line ~3200) — same fix for the summary retry loop.
All three self.client.chat.completions.create() calls outside the main
loop now have anthropic_messages branches to prevent NoneType crashes.
Fixes from comprehensive code review and cross-referencing with
clawdbot/OpenCode implementations:
CRITICAL:
- Add one-shot guard (anthropic_auth_retry_attempted) to prevent
infinite 401 retry loops when credentials keep changing
- Fix _is_oauth_token(): managed keys from ~/.claude.json are NOT
regular API keys (don't start with sk-ant-api). Inverted the logic:
only sk-ant-api* is treated as API key auth, everything else uses
Bearer auth + oauth beta headers
HIGH:
- Wrap json.loads(args) in try/except in message conversion — malformed
tool_call arguments no longer crash the entire conversation
- Raise AuthError in runtime_provider when no Anthropic token found
(was silently passing empty string, causing confusing API errors)
- Remove broken _try_anthropic() from auxiliary vision chain — the
centralized router creates an OpenAI client for api_key providers
which doesn't work with Anthropic's Messages API
MEDIUM:
- Handle empty assistant message content — Anthropic rejects empty
content blocks, now inserts '(empty)' placeholder
- Fix setup.py existing_key logic — set to 'KEEP' sentinel instead
of None to prevent falling through to the auth choice prompt
- Add debug logging to _fetch_anthropic_models on failure
Tests: 43 adapter tests (2 new for token detection), 3197 total passed
* fix: stop rejecting unlisted models + auto-detect from /models endpoint
validate_requested_model() now accepts models not in the provider's API
listing with a warning instead of blocking. Removes hardcoded catalog
fallback for validation — if API is unreachable, accepts with a warning.
Model selection flows (setup + /model command) now probe the provider's
/models endpoint to get the real available models. Falls back to
hardcoded defaults with a clear warning when auto-detection fails:
'Could not auto-detect models — use Custom model if yours isn't listed.'
Z.AI setup no longer excludes GLM-5 on coding plans.
* fix: use hermes-agent.nousresearch.com as HTTP-Referer for OpenRouter
OpenRouter scrapes the favicon/logo from the HTTP-Referer URL for app
rankings. We were sending the GitHub repo URL, which gives us a generic
GitHub logo. Changed to the proper website URL so our actual branding
shows up in rankings.
Changed in run_agent.py (main agent client) and auxiliary_client.py
(vision/summarization clients).
After studying clawdbot (OpenClaw) and OpenCode implementations:
## Beta headers
- Add interleaved-thinking-2025-05-14 and fine-grained-tool-streaming-2025-05-14
as common betas (sent with ALL auth types, not just OAuth)
- OAuth tokens additionally get oauth-2025-04-20
- API keys now also get the common betas (previously got none)
## Vision/image support
- Add _convert_vision_content() to convert OpenAI multimodal format
(image_url blocks) to Anthropic format (image blocks with base64/url source)
- Handles both data: URIs (base64) and regular URLs
## Role alternation enforcement
- Anthropic strictly rejects consecutive same-role messages (400 error)
- Add post-processing step that merges consecutive user/assistant messages
- Handles string, list, and mixed content types during merge
## Tool choice support
- Add tool_choice parameter to build_anthropic_kwargs()
- Maps OpenAI values: auto→auto, required→any, none→omit, name→tool
## Cache metrics tracking
- Anthropic uses cache_read_input_tokens / cache_creation_input_tokens
(different from OpenRouter's prompt_tokens_details.cached_tokens)
- Add api_mode-aware branch in run_agent.py cache stats logging
## Credential refresh on 401
- On 401 error during anthropic_messages mode, re-read credentials
via resolve_anthropic_token() (picks up refreshed Claude Code tokens)
- Rebuild client if new token differs from current one
- Follows same pattern as Codex/Nous 401 refresh handlers
## Tests
- 44 adapter tests (8 new: vision conversion, role alternation, tool choice)
- Updated beta header tests to verify new structure
- Full suite: 3198 passed, 0 regressions
* fix: ClawHub skill install — use /download ZIP endpoint
The ClawHub API v1 version endpoint only returns file metadata
(path, size, sha256, contentType) without inline content or download
URLs. Our code was looking for inline content in the metadata, which
never existed, causing all ClawHub installs to fail with:
'no inline/raw file content was available'
Fix: Use the /api/v1/download endpoint (same as the official clawhub
CLI) to download skills as ZIP bundles and extract files in-memory.
Changes:
- Add _download_zip() method that downloads and extracts ZIP bundles
- Retry on 429 rate limiting with Retry-After header support
- Path sanitization and binary file filtering for security
- Keep _extract_files() as a fallback for inline/raw content
- Also fix nested file lookup (version_data.version.files)
* chore: lower default compression threshold from 85% to 50%
Triggers context compression earlier — at 50% of the model's context
window instead of 85%. Updated in all four places where the default
is defined: context_compressor.py, cli.py, run_agent.py, config.py,
and gateway/run.py.
Mistral's API strictly validates the Chat Completions schema and rejects
unknown fields (call_id, response_item_id) with 422. These fields are
added by _build_assistant_message() for Codex Responses API support.
This fix:
- Only strips when targeting Mistral (api.mistral.ai in base_url)
- Creates new tool_call dicts instead of mutating originals (shallow
copy safety — msg.copy() shares the tool_calls list)
- Preserves call_id/response_item_id in the internal message history
so _chat_messages_to_responses_input() can still read them if the
session falls back to a Codex provider mid-conversation
Applied in all 3 API message building locations:
- Main conversation loop (run_conversation)
- _handle_max_iterations()
- flush_memories()
Inspired by PR #864 (unmodeled-tyler) which identified the issue but
applied the fix unconditionally and mutated originals via shallow copy.
Co-authored-by: unmodeled-tyler <unmodeled.tyler@proton.me>
* fix: /reasoning command output ordering, display, and inline think extraction
Three issues with the /reasoning command:
1. Output interleaving: The command echo used print() while feedback
used _cprint(), causing them to render out-of-order under
prompt_toolkit's patch_stdout. Changed echo to use _cprint() so
all output renders through the same path in correct order.
2. Reasoning display not working: /reasoning show toggled a flag
but reasoning never appeared for models that embed thinking in
inline <think> blocks rather than structured API fields. Added
fallback extraction in _build_assistant_message to capture
<think> block content as reasoning when no structured reasoning
fields (reasoning, reasoning_content, reasoning_details) are
present. This feeds into both the reasoning callback (during
tool loops) and the post-response reasoning box display.
3. Feedback clarity: Added checkmarks to confirm actions, persisted
show/hide to config (was session-only before), and aligned the
status display for readability.
Tests: 7 new tests for inline think block extraction (41 total).
* feat: add /reasoning command to gateway (Telegram/Discord/etc)
The /reasoning command only existed in the CLI — messaging platforms
had no way to view or change reasoning settings. This adds:
1. /reasoning command handler in the gateway:
- No args: shows current effort level and display state
- /reasoning <level>: sets reasoning effort (none/low/medium/high/xhigh)
- /reasoning show|hide: toggles reasoning display in responses
- All changes saved to config.yaml immediately
2. Reasoning display in gateway responses:
- When show_reasoning is enabled, prepends a 'Reasoning' block
with the model's last_reasoning content before the response
- Collapses long reasoning (>15 lines) to keep messages readable
- Uses last_reasoning from run_conversation result dict
3. Plumbing:
- Added _show_reasoning attribute loaded from config at startup
- Propagated last_reasoning through _run_agent return dict
- Added /reasoning to help text and known_commands set
- Uses getattr for _show_reasoning to handle test stubs
* fix: improve Kimi model selection — auto-detect endpoint, add missing models
Kimi Coding Plan setup:
- New dedicated _model_flow_kimi() replaces the generic API-key flow
for kimi-coding. Removes the confusing 'Base URL' prompt entirely —
the endpoint is auto-detected from the API key prefix:
sk-kimi-* → api.kimi.com/coding/v1 (Kimi Coding Plan)
other → api.moonshot.ai/v1 (legacy Moonshot)
- Shows appropriate models for each endpoint:
Coding Plan: kimi-for-coding, kimi-k2.5, kimi-k2-thinking, kimi-k2-thinking-turbo
Moonshot: full model catalog
- Clears any stale KIMI_BASE_URL override so runtime auto-detection
via _resolve_kimi_base_url() works correctly.
Model catalog updates:
- Added kimi-for-coding (primary Coding Plan model) and kimi-k2-thinking-turbo
to models.py, main.py _PROVIDER_MODELS, and model_metadata.py context windows.
- Updated User-Agent from KimiCLI/1.0 to KimiCLI/1.3 (Kimi's coding
endpoint whitelists known coding agents via User-Agent sniffing).
Adds --pass-session-id CLI flag. When set, the agent's system prompt
includes the session ID:
Conversation started: Sunday, March 08, 2026 06:32 PM
Session ID: 20260308_183200_abc123
Usage:
hermes --pass-session-id
hermes chat --pass-session-id
Implementation threads the flag as a proper parameter through the full
chain (main.py → cli.py → run_agent.py) rather than using an env var,
avoiding collisions in multi-agent/multitenant setups.
Based on PR #726 by dmahan93, reworked to use instance parameter
instead of HERMES_PASS_SESSION_ID environment variable.
Co-authored-by: dmahan93 <dmahan93@users.noreply.github.com>
- gateway/run.py: Take main's _resolve_gateway_model() helper
- hermes_cli/setup.py: Re-apply nous-api removal after merge brought
it back. Fix provider_idx offset (Custom is now index 3, not 4).
- tests/hermes_cli/test_setup.py: Fix custom setup test index (3→4)
Nous Portal backend will become a transparent proxy for OpenRouter-
specific parameters (provider preferences, etc.), so keep sending them
to all providers. The reasoning disabled fix is kept (that's a real
constraint of the Nous endpoint).
Two bugs in _build_api_kwargs that broke Nous Portal:
1. Provider preferences (only, ignore, order, sort) are OpenRouter-
specific routing features. They were being sent in extra_body to ALL
providers, including Nous Portal. When the config had
providers_only=['google-vertex'], Nous Portal returned 404 'Inference
host not found' because it doesn't have a google-vertex backend.
Fix: Only include provider preferences when _is_openrouter is True.
2. Reasoning config with enabled=false was being sent to Nous Portal,
which requires reasoning and returns 400 'Reasoning is mandatory for
this endpoint and cannot be disabled.'
Fix: Omit the reasoning parameter for Nous when enabled=false.
Root cause found via HERMES_DUMP_REQUESTS=1 which showed the exact
request payload being sent to Nous Portal's inference API.
Phase 2 of the provider router migration — route the main agent's
client construction and fallback activation through
resolve_provider_client() instead of duplicated ad-hoc logic.
run_agent.py:
- __init__: When no explicit api_key/base_url, use
resolve_provider_client(provider, raw_codex=True) for client
construction. Explicit creds (from CLI/gateway runtime provider)
still construct directly.
- _try_activate_fallback: Replace _resolve_fallback_credentials and
its duplicated _FALLBACK_API_KEY_PROVIDERS / _FALLBACK_OAUTH_PROVIDERS
dicts with a single resolve_provider_client() call. The router
handles all provider types (API-key, OAuth, Codex) centrally.
- Remove _resolve_fallback_credentials method and both fallback dicts.
agent/auxiliary_client.py:
- Add raw_codex parameter to resolve_provider_client(). When True,
returns the raw OpenAI client for Codex providers instead of wrapping
in CodexAuxiliaryClient. The main agent needs this for direct
responses.stream() access.
3251 passed, 2 pre-existing unrelated failures.
Add centralized call_llm() and async_call_llm() functions that own the
full LLM request lifecycle:
1. Resolve provider + model from task config or explicit args
2. Get or create a cached client for that provider
3. Format request args (max_tokens handling, provider extra_body)
4. Make the API call with max_tokens/max_completion_tokens retry
5. Return the response
Config: expanded auxiliary section with provider:model slots for all
tasks (compression, vision, web_extract, session_search, skills_hub,
mcp, flush_memories). Config version bumped to 7.
Migrated all auxiliary consumers:
- context_compressor.py: uses call_llm(task='compression')
- vision_tools.py: uses async_call_llm(task='vision')
- web_tools.py: uses async_call_llm(task='web_extract')
- session_search_tool.py: uses async_call_llm(task='session_search')
- browser_tool.py: uses call_llm(task='vision'/'web_extract')
- mcp_tool.py: uses call_llm(task='mcp')
- skills_guard.py: uses call_llm(provider='openrouter')
- run_agent.py flush_memories: uses call_llm(task='flush_memories')
Tests updated for context_compressor and MCP tool. Some test mocks
still need updating (15 remaining failures from mock pattern changes,
2 pre-existing).
Prevent stale Honcho tool exposure in context/local modes, restore reliable async write retry behavior, and ensure SOUL.md migration uploads target the AI peer instead of the user peer. Also align Honcho CLI key checks with host-scoped apiKey resolution and lock the fixes with regression tests.
Made-with: Cursor
When hermes-agent runs as a systemd service, Docker container, or
headless daemon, the stdout pipe can become unavailable (idle timeout,
buffer exhaustion, socket reset). Any print() call then raises
OSError: [Errno 5] Input/output error, crashing run_conversation()
and causing cron jobs to fail.
Rather than wrapping individual print() calls (68 in run_conversation
alone), this adds a transparent _SafeWriter wrapper installed once at
the start of run_conversation(). It delegates all writes to the real
stdout and silently catches OSError. Zero overhead on the happy path,
comprehensive coverage of all print calls including future ones.
Fixes#845
Co-authored-by: J0hnLawMississippi <J0hnLawMississippi@users.noreply.github.com>
Address merge-blocking review feedback by removing unsafe signal handler overrides, wiring next-turn Honcho prefetch, restoring per-directory session defaults, and exposing all Honcho tools to the model surface. Also harden prefetch cache access with public thread-safe accessors and remove duplicate browser cleanup code.
Made-with: Cursor
Two fixes for context overflow handling:
1. Proactive compression after tool execution: The compression check now
estimates the next prompt size using real token counts from the last API
response (prompt_tokens + completion_tokens) plus a conservative estimate
of newly appended tool results (chars // 3 for JSON-heavy content).
Previously, should_compress() only checked last_prompt_tokens which
didn't account for tool results — so a 130k prompt + 100k chars of tool
output would pass the 140k threshold check but fail the 200k API limit.
2. Safety net: Added 'prompt is too long' to context-length error detection
phrases. Anthropic returns 'prompt is too long: N tokens > M maximum'
on HTTP 400, which wasn't matched by existing phrases. This ensures
compression fires even if the proactive check underestimates.
Fixes#813
- max_retries reduced from 6 to 3 — 6 retries with exponential backoff
could stall for ~275s total on persistent errors
- ValueError and TypeError now detected as non-retryable client errors
and abort immediately instead of being retried with backoff (these are
local validation/programming errors that will never succeed on retry)
_preflight_codex_api_kwargs rejected these three fields as unsupported,
but _build_api_kwargs adds them to every codex request. This caused a
ValueError before _interruptible_api_call was reached, which was caught
by the retry loop and retried with exponential backoff — appearing as
an infinite hang in tests (275s total backoff across 6 retries).
The fix adds these keys to allowed_keys and passes them through to the
normalized request dict.
This fixes the hanging test_cron_run_job_codex_path_handles_internal_401_refresh
test (now passes in 2.6s instead of timing out).
Combined implementation of reasoning management:
- /reasoning Show current effort level and display state
- /reasoning <level> Set reasoning effort (none, low, medium, high, xhigh)
- /reasoning show|on Show model thinking/reasoning in output
- /reasoning hide|off Hide model thinking/reasoning from output
Effort level changes persist to config and force agent re-init.
Display toggle updates the agent callback dynamically without re-init.
When display is enabled:
- Intermediate reasoning shown as dim [thinking] lines during tool loops
- Final reasoning shown in a bordered box above the response
- Long reasoning collapsed (5 lines intermediate, 10 lines final)
Also adds:
- reasoning_callback parameter to AIAgent
- last_reasoning in run_conversation result dict
- show_reasoning config option (display section, default: false)
- Display section in /config output
- 34 tests covering both features
Combines functionality from PR #789 and PR #790.
Co-authored-by: Aum Desai <Aum08Desai@users.noreply.github.com>
Co-authored-by: 0xbyt4 <35742124+0xbyt4@users.noreply.github.com>
Adds tool_choice, parallel_tool_calls, and prompt_cache_key to the
Codex Responses API request kwargs — matching what the official Codex
CLI sends.
- tool_choice: 'auto' — enables the model to proactively call tools.
Without this, the model may default to not using tools, which explains
reports of the agent claiming it lacks shell access (#747).
- parallel_tool_calls: True — allows the model to issue multiple tool
calls in a single turn for efficiency.
- prompt_cache_key: session_id — enables server-side prompt caching
across turns in the same session, reducing latency and cost.
Refs #747
Two-tier warning system that nudges the LLM as it approaches
max_iterations, injected into the last tool result JSON rather
than as a separate system message:
- Caution (70%): {"_budget_warning": "[BUDGET: 42/60...]"}
- Warning (90%): {"_budget_warning": "[BUDGET WARNING: 54/60...]"}
For JSON tool results, adds a _budget_warning field to the existing
dict. For plain text results, appends the warning as text.
Key properties:
- No system messages injected mid-conversation
- No changes to message structure
- Prompt cache stays valid
- Configurable thresholds (0.7 / 0.9)
- Can be disabled: _budget_pressure_enabled = False
Inspired by PR #421 (@Bartok9) and issue #414.
8 tests covering thresholds, edge cases, JSON and text injection.
Three separate code paths all wrote to the same SQLite state.db with
no deduplication, inflating session transcripts by 3-4x:
1. _log_msg_to_db() — wrote each message individually after append
2. _flush_messages_to_session_db() — re-wrote ALL new messages at
every _persist_session() call (~18 exit points), with no tracking
of what was already written
3. gateway append_to_transcript() — wrote everything a third time
after the agent returned
Since load_transcript() prefers SQLite over JSONL, the inflated data
was loaded on every session resume, causing proportional token waste.
Fix:
- Remove _log_msg_to_db() and all 16 call sites (redundant with flush)
- Add _last_flushed_db_idx tracking in _flush_messages_to_session_db()
so repeated _persist_session() calls only write truly new messages
- Reset flush cursor on compression (new session ID)
- Add skip_db parameter to SessionStore.append_to_transcript() so the
gateway skips SQLite writes when the agent already persisted them
- Gateway now passes skip_db=True for agent-managed messages, still
writes to JSONL as backup
Verified: a 12-message CLI session with tool calls produces exactly
12 SQLite rows with zero duplicates (previously would be 36-48).
Tests: 9 new tests covering flush deduplication, skip_db behavior,
compression reset, and initialization. Full suite passes (2869 tests).
New tool lets Hermes persist conclusions about the user (preferences,
corrections, project context) directly to Honcho via the conclusions
API. Feeds into the user's peer card and representation.
- Add _repair_tool_call(): tries lowercase, normalize, then fuzzy match (difflib 0.7)
- Replace 3-retry-then-abort with graceful error: model receives helpful message and self-corrects
- Conversation stays alive instead of dying on hallucinated tool names
Closes#520
Completes the fix started in 8318a51 — handle_function_call() accepted
enabled_tools but run_agent.py never passed it. Now both call sites in
_execute_tool_calls() pass self.valid_tool_names, so each agent session
uses its own tool list instead of the process-global
_last_resolved_tool_names (which subagents can overwrite).
Also simplifies the redundant ternary in code_execution_tool.py:
sandbox_tools is already computed correctly (intersection with session
tools, or full SANDBOX_ALLOWED_TOOLS as fallback), so the conditional
was dead logic.
Inspired by PR #663 (JasonOA888). Closes#662.
Tests: 2857 passed.
Authored by tripledoublev.
After context compression on 413/400 errors, the inner retry loop was
reusing the stale pre-compression api_messages payload. Fix breaks out
of the inner retry loop so the outer loop rebuilds api_messages from
the now-compressed messages list. Adds regression test verifying the
second request actually contains the compressed payload.
Authored by 0xbyt4. Adds missing resets for _incomplete_scratchpad_retries and _codex_incomplete_retries to prevent stale counters carrying over between CLI conversations.
Automatic filesystem snapshots before destructive file operations,
with user-facing rollback. Inspired by PR #559 (by @alireza78a).
Architecture:
- Shadow git repos at ~/.hermes/checkpoints/{hash}/ via GIT_DIR
- CheckpointManager: take/list/restore, turn-scoped dedup, pruning
- Transparent — the LLM never sees it, no tool schema, no tokens
- Once per turn — only first write_file/patch triggers a snapshot
Integration:
- Config: checkpoints.enabled + checkpoints.max_snapshots
- CLI flag: hermes --checkpoints
- Trigger: run_agent.py _execute_tool_calls() before write_file/patch
- /rollback slash command in CLI + gateway (list, restore by number)
- Pre-rollback snapshot auto-created on restore (undo the undo)
Safety:
- Never blocks file operations — all errors silently logged
- Skips root dir, home dir, dirs >50K files
- Disables gracefully when git not installed
- Shadow repo completely isolated from project git
Tests: 35 new tests, all passing (2798 total suite)
Docs: feature page, config reference, CLI commands reference
Cherry-picked and improved from PR #470 (fixes#464).
Problem: On Ubuntu 24.04 with ghostty + tmux, the prompt input box
border lines flash due to cursor blink and raw spinner terminal writes
conflicting with prompt_toolkit's rendering.
Changes:
- cli.py: Add CursorShape.BLOCK to Application() to disable cursor blink
- cli.py: Add thinking_callback + spinner_widget in TUI layout so
thinking status displays as a proper prompt_toolkit widget instead of
raw terminal writes that conflict with the TUI renderer
- run_agent.py: Add thinking_callback parameter to AIAgent; when set,
uses the callback instead of KawaiiSpinner for thinking display
What was NOT changed (preserving existing behavior):
- agent/display.py: Untouched. KawaiiSpinner _write() stdout capture,
_animate() logic, and 0.12s frame interval all preserved. This
protects subagent stdout redirection and keeps smooth animations
for non-CLI contexts (gateway, batch runner).
- Original emoji spinner types (brain/sparkle/pulse/moon/star) preserved
for all non-CLI contexts.
Fixes from original PR #470:
- CursorShape.STEADY_BLOCK -> CursorShape.BLOCK (STEADY_BLOCK doesn't
exist in prompt_toolkit 3.0.52)
- Removed duplicate self._spinner_text = '' line
- Removed redundant nested if-checks
Tested: 2706 tests pass, interactive CLI verified via tmux.
Complements PR #453 by 0xbyt4. Adds isinstance(dict) guard in
run_agent.py to catch cases where json.loads returns non-dict
(e.g. null, list, string) before they reach downstream code.
Also adds 15 tests for build_tool_preview covering None args,
empty dicts, known/unknown tools, fallback keys, truncation,
and all special-cased tools (process, todo, memory, session_search).
Skills can now declare fallback_for_toolsets, fallback_for_tools,
requires_toolsets, and requires_tools in their SKILL.md frontmatter.
The system prompt builder filters skills automatically based on which
tools are available in the current session.
- Add _read_skill_conditions() to parse conditional frontmatter fields
- Add _skill_should_show() to evaluate conditions against available tools
- Update build_skills_system_prompt() to accept and apply tool availability
- Pass valid_tool_names and available toolsets from run_agent.py
- Backward compatible: skills without conditions always show; calling
build_skills_system_prompt() with no args preserves existing behavior
Closes#539
Some local LLM servers (llama-server, etc.) return message.content as
a dict or list instead of a plain string. This caused AttributeError
'dict object has no attribute strip' on every API call.
Normalizes content to string immediately after receiving the response:
- dict: extracts 'text' or 'content' field, falls back to json.dumps
- list: extracts text parts (OpenAI multimodal content format)
- other: str() conversion
Applied at the single point where response.choices[0].message is read
in the main agent loop, so all downstream .strip()/.startswith()/[:100]
operations work regardless of server implementation.
Closes#759
Combine read/search loop detection with main's redact_sensitive_text
and truncation hint features. Add tracker reset to TestSearchHints
to prevent cross-test state leakage.
Two changes to prevent unnecessary Anthropic prompt cache misses in the
gateway, where a fresh AIAgent is created per user message:
1. Reuse stored system prompt for continuing sessions:
When conversation_history is non-empty, load the system prompt from
the session DB instead of rebuilding from disk. The model already has
updated memory in its conversation history (it wrote it!), so
re-reading memory from disk produces a different system prompt that
breaks the cache prefix.
2. Stabilize Honcho context per session:
- Only prefetch Honcho context on the first turn (empty history)
- Bake Honcho context into the cached system prompt and store to DB
- Remove the per-turn Honcho injection from the API call loop
This ensures the system message is identical across all turns in a
session. Previously, re-fetching Honcho could return different context
on each turn, changing the system message and invalidating the cache.
Both changes preserve the existing behavior for compression (which
invalidates the prompt and rebuilds from scratch) and for the CLI
(where the same AIAgent persists and the cached prompt is already
stable across turns).
Tests: 2556 passed (6 new)
Split fallback provider handling into two clean registries:
_FALLBACK_API_KEY_PROVIDERS — env-var-based (openrouter, zai, kimi, minimax)
_FALLBACK_OAUTH_PROVIDERS — OAuth-based (openai-codex, nous)
New _resolve_fallback_credentials() method handles all three cases
(OAuth, API key, custom endpoint) and returns a uniform (key, url, mode)
tuple. _try_activate_fallback() is now just validation + client build.
Adds Nous Portal as a fallback provider — uses the same OAuth flow
as the primary provider (hermes login), returns chat_completions mode.
OAuth providers get credential refresh for free: the existing 401
retry handlers (_try_refresh_codex/nous_client_credentials) check
self.provider, which is set correctly after fallback activation.
4 new tests (nous activation, nous no-login, codex retained).
27 total fallback tests passing, 2548 full suite.
Codex OAuth uses a different auth flow (OAuth tokens, not env vars)
and a different API mode (codex_responses, not chat_completions).
The fallback now handles this specially:
- Resolves credentials via resolve_codex_runtime_credentials()
- Sets api_mode to codex_responses
- Fails gracefully if no Codex OAuth session exists
Also added to the commented-out config.yaml example.
2 new tests (codex activation + graceful failure).
Remove hallucinated providers (openai, deepseek, together, groq,
fireworks, mistral, gemini, nous) from the fallback provider map.
These don't exist in hermes-agent's provider system.
The real supported providers for fallback are:
openrouter (OPENROUTER_API_KEY)
zai (ZAI_API_KEY)
kimi-coding (KIMI_API_KEY)
minimax (MINIMAX_API_KEY)
minimax-cn (MINIMAX_CN_API_KEY)
For any other OpenAI-compatible endpoint, users can use the
base_url + api_key_env overrides in the config.
Also adds Kimi User-Agent header for kimi fallback (matching
the main provider system).
When the primary model/provider fails after retries (rate limit, overload,
auth errors, connection failures), Hermes automatically switches to a
configured fallback model for the remainder of the session.
Config (in ~/.hermes/config.yaml):
fallback_model:
provider: openrouter
model: anthropic/claude-sonnet-4
Supports all major providers: OpenRouter, OpenAI, Nous, DeepSeek, Together,
Groq, Fireworks, Mistral, Gemini — plus custom endpoints via base_url and
api_key_env overrides.
Design principles:
- Dead simple: one fallback model, not a chain
- One-shot: switches once, doesn't ping-pong back
- Zero new dependencies: uses existing OpenAI client
- Minimal code: ~100 lines in run_agent.py, ~5 lines in cli.py/gateway
- Three trigger points: max retries exhausted, non-retryable client errors,
and invalid response exhaustion
Does NOT trigger on context overflow or payload-too-large errors (those
are handled by the existing compression system).
Addresses #737.
25 new tests, 2492 total passing.
When the agent is interrupted, the model now receives descriptive
context instead of a generic 'Operation interrupted.' string:
- Tool skip messages include the tool name:
'[Tool execution cancelled — terminal was skipped due to user interrupt]'
'[Tool execution skipped — web_search was not started. User sent a new message]'
- API call interrupts include timing:
'Operation interrupted: waiting for model response (4.2s elapsed).'
- Retry/error interrupts include retry context:
'Operation interrupted: retrying API call after rate limit (retry 2/5).'
'Operation interrupted: handling API error (Timeout: connection timed out).'
This helps the model understand what was happening when it was
interrupted, reducing wasted iterations spent re-discovering state.
When context compression summarizes conversation history, the agent
loses track of which files it already read and re-reads them in a loop.
Users report the agent reading the same files endlessly without writing.
Root cause: context compression is lossy — file contents and read history
are lost in the summary. After compression, the model thinks it hasn't
examined the files yet and reads them again.
Fix (two-part):
1. Track file reads per task in file_tools.py. When the same file region
is read again, include a _warning in the response telling the model
to stop re-reading and use existing information.
2. After context compression, inject a structured message listing all
files already read in the session with explicit "do NOT re-read"
instruction, preserving read history across compression boundaries.
Adds 16 tests covering warning detection, task isolation, summary
accuracy, tracker cleanup, and compression history injection.
Removed the hard block on base_url containing 'api.anthropic.com'.
Anthropic now offers an OpenAI-compatible /chat/completions endpoint,
so blocking their URL prevents legitimate use. If the endpoint isn't
compatible, the API call will fail with a proper error anyway.
Removed from: run_agent.py, mini_swe_runner.py
Updated test to verify Anthropic URLs are accepted.
Kimi Code (platform.kimi.ai) issues API keys prefixed sk-kimi- that require:
1. A different base URL: api.kimi.com/coding/v1 (not api.moonshot.ai/v1)
2. A User-Agent header identifying a recognized coding agent
Without this fix, sk-kimi- keys fail with 401 (wrong endpoint) or 403
('only available for Coding Agents') errors.
Changes:
- Auto-detect sk-kimi- key prefix and route to api.kimi.com/coding/v1
- Send User-Agent: KimiCLI/1.0 header for Kimi Code endpoints
- Legacy Moonshot keys (api.moonshot.ai) continue to work unchanged
- KIMI_BASE_URL env var override still takes priority over auto-detection
- Updated .env.example with correct docs and all endpoint options
- Fixed doctor.py health check for Kimi Code keys
Reference: https://github.com/MoonshotAI/kimi-cli (platforms.py)
Reduces token usage and latency for most tasks by defaulting to
medium reasoning effort instead of xhigh. Users can still override
via config or CLI flag. Updates code, tests, example config, and docs.
Eliminated the model parameter from the AIAgent class initialization, streamlining the constructor and ensuring consistent behavior across agent instances. This change aligns with recent updates to the task delegation logic.
Added logic to manage multiple compression attempts for large payloads and context length errors. Introduced limits on compression attempts to prevent infinite retries, with appropriate logging and error handling. This ensures better resilience and user feedback when facing compression issues during API calls.
_incomplete_scratchpad_retries and _codex_incomplete_retries were not
reset at the start of run_conversation(). In CLI mode, where the same
AIAgent instance is reused across conversations, stale counters from
a previous conversation could carry over, causing premature retry
exhaustion and partial responses.
Updated the default model version from "anthropic/claude-sonnet-4-20250514" to "anthropic/claude-sonnet-4.6" across multiple files including AGENTS.md, batch_runner.py, mini_swe_runner.py, and run_agent.py for consistency and to reflect the latest model improvements.
Subagent tool calls now count toward the same session-wide iteration
limit as the parent agent. Previously, each subagent had its own
independent counter, so a parent with max_iterations=60 could spawn
3 subagents each doing 50 calls = 150 total tool calls unmetered.
Changes:
- IterationBudget: thread-safe shared counter (run_agent.py)
- consume(): try to use one iteration, returns False if exhausted
- refund(): give back one iteration (for execute_code turns)
- Thread-safe via Lock (subagents run in ThreadPoolExecutor)
- Parent creates the budget, children inherit it via delegate_tool.py
- execute_code turns are refunded (don't count against budget)
- Default raised from 60 → 90 to account for shared consumption
- Per-child cap (50) still applies as a safety valve
The per-child max_iterations (default 50) remains as a per-child
ceiling, but the shared budget is the hard session-wide limit.
A child stops at whichever comes first.
Enhance message compression by adding a method to clean up orphaned tool-call and tool-result pairs. This ensures that the API receives well-formed messages, preventing errors related to mismatched IDs. The new functionality includes removing orphaned results and adding stub results for missing calls, improving overall message integrity during compression.
Authored by areu01or00. Adds timezone support via hermes_time.now() helper
with IANA timezone resolution (HERMES_TIMEZONE env → config.yaml → server-local).
Updates system prompt timestamp, cron scheduling, and execute_code sandbox TZ
injection. Includes config migration (v4→v5) and comprehensive test coverage.
- Added fallback mechanism to utilize previous content when the model generates an empty response after tool calls, reducing unnecessary API retries.
- Enhanced logging to indicate when prior content is used as a final response.
- Updated logic to ensure that genuine empty responses are retried appropriately, maintaining user experience.
Authored by Farukest. Fixes#435. The retry summary in
_handle_max_iterations() hardcoded max_tokens instead of using
_max_tokens_param(), which returns max_completion_tokens for direct
OpenAI API (required by gpt-4o, o-series). The first attempt already
used _max_tokens_param correctly — only the retry path was wrong.
Includes 4 tests for _max_tokens_param provider detection.
Replaces the unsafe 128K fallback for unknown models with a descending
probe strategy (2M → 1M → 512K → 200K → 128K → 64K → 32K). When a
context-length error occurs, the agent steps down tiers and retries.
The discovered limit is cached per model+provider combo in
~/.hermes/context_length_cache.yaml so subsequent sessions skip probing.
Also parses API error messages to extract the actual context limit
(e.g. 'maximum context length is 32768 tokens') for instant resolution.
The CLI banner now displays the context window size next to the model
name (e.g. 'claude-opus-4 · 200K context · Nous Research').
Changes:
- agent/model_metadata.py: CONTEXT_PROBE_TIERS, persistent cache
(save/load/get), parse_context_limit_from_error(), get_next_probe_tier()
- agent/context_compressor.py: accepts base_url, passes to metadata
- run_agent.py: step-down logic in context error handler, caches on success
- cli.py + hermes_cli/banner.py: context length in welcome banner
- tests: 22 new tests for probing, parsing, and caching
Addresses #132. PR #319's approach (8K default) rejected — too conservative.
The retry summary in _handle_max_iterations hardcodes max_tokens instead
of calling _max_tokens_param(). For direct OpenAI API users (gpt-4o,
o-series), the correct parameter name is max_completion_tokens. The first
attempt at line 2697 already uses _max_tokens_param correctly but the
retry path at line 2743 was missed.
The flush_memories() and run_conversation() code paths already stripped
finish_reason and reasoning from API messages (added in 7a0b377 via PR
#253), but _handle_max_iterations() was missed. It was sending raw
messages.copy() which could include finish_reason, causing 422 errors
on strict APIs like Mistral when the agent hit max iterations.
Now strips the same internal fields consistently across all three API
call sites.
Authored by ch3ronsa. Fixes#348.
Adds 'context size' (LM Studio) and 'context window' (Ollama) to
context-length error detection phrases so local backend 400 errors
trigger compression instead of aborting. Also removes 'error code: 400'
from the non-retryable error list as defense in depth.
Two fixes for the case where a user switches to a model with a smaller
context window while having a large existing session:
1. Preflight compression in run_conversation(): Before the main loop,
estimate tokens of loaded history + system prompt. If it exceeds the
model's compression threshold (85% of context), compress proactively
with up to 3 passes. This naturally handles model switches because
the gateway creates a fresh AIAgent per message with the current
model's context length.
2. Error handler reordering: Context-length errors (400 with 'maximum
context length' etc.) are now checked BEFORE the generic 4xx handler.
Previously, OpenRouter's 400-status context-length errors were caught
as non-retryable client errors and aborted immediately, never reaching
the compression+retry logic.
Reported by Sonicrida on Discord: 840-message session (2MB+) crashed
after switching from a large-context model to minimax via OpenRouter.
Local backends (LM Studio, Ollama, llama.cpp) return HTTP 400
with messages like "Context size has been exceeded" when the
context window is full. The error phrase list did not include
"context size" or "context window", so these errors fell through
to the generic 4xx abort handler instead of triggering compression.
Changes:
- Move context-length check above generic 4xx handler so it runs
first (same pattern as the existing 413 check)
- Add "context size" and "context window" to the phrase list
- Guard 4xx handler with `not is_context_length_error` to prevent
context-related 400s from being treated as non-retryable
session_search was returning the current session if it matched the
query, which is redundant — the agent already has the current
conversation context. This wasted an LLM summarization call and a
result slot.
Added current_session_id parameter to session_search(). The agent
passes self.session_id and the search filters out any results where
either the raw or parent-resolved session ID matches. Both the raw
match and the parent-resolved match are checked to handle child
sessions from delegation.
Two tests added verifying the exclusion works and that other
sessions are still returned.
Authored by 0xbyt4. Adds smart home control via REST tools (ha_list_entities,
ha_get_state, ha_call_service) with domain blocklist and entity_id validation,
plus WebSocket gateway adapter for real-time event monitoring.
Also includes Gemini 3 thought_signature preservation fix (extra_content on
tool calls) needed for multi-turn tool calling via OpenRouter.
In _handle_max_iterations, the codex_responses path set tools=None to
prevent tool calls during summarization. However, the OpenAI SDK's
_make_tools() treats None as a valid value (not its Omit sentinel) and
tries to iterate over it, causing TypeError: 'NoneType' object is not
iterable.
Fix: use codex_kwargs.pop('tools', None) to remove the key entirely,
so the SDK never receives it and uses its default omit behavior.
Fixes#300
Issue #263: Telegram/Discord/WhatsApp/Slack now show tool call details
based on display.tool_progress in config.yaml.
Changes:
- gateway/run.py: 'verbose' mode shows full args (keys + JSON, 200 char
max). 'all' mode preview increased from 40 to 80 chars. Added missing
tool emojis (execute_code, delegate_task, clarify, skill_manage,
search_files).
- agent/display.py: Added execute_code, delegate_task, clarify,
skill_manage to primary_args. Added 'code' and 'goal' to fallback keys.
- run_agent.py: Pass function_args dict to tool_progress_callback so
gateway can format based on its own verbosity config.
Config usage:
display:
tool_progress: verbose # off | new | all | verbose
The TestFlushSentinelNotLeaked test from PR #227 had two issues:
1. flush_memories() uses get_text_auxiliary_client() which could bypass
agent.client entirely — mock it to return (None, None)
2. No assertion that the API was actually called — added guard assert
Without these fixes the test passed vacuously (API never called).
The OpenAI API returns content: null on assistant messages with tool
calls. msg.get('content', '') returns None when the key exists with
value None, causing TypeError on len(), string concatenation, and
.strip() in downstream code paths.
Fixed 4 locations that process conversation messages:
- agent/auxiliary_client.py:84 — None passed to API calls
- cli.py:1288 — crash on content[:200] and len(content)
- run_agent.py:3444 — crash on None.strip()
- honcho_integration/session.py:445 — 'None' rendered in transcript
13 other instances were verified safe (already protected, only process
user/tool messages, or use the safe pattern).
Pattern: msg.get('content', '') → msg.get('content') or ''
Fixes#276
* fix(agent): skip reasoning param for Mistral API to prevent 422 errors
* fix(agent): strip finish_reason from assistant messages to fix Mistral 422 errors
Updated the AIAgent class to print the full content of assistant messages without truncation, enhancing visibility of the messages during runtime. This change improves the clarity of communication from the agent.
Added the tools attribute to the AIAgent class's status output, ensuring that the current tools used by the agent are included in the status information. This enhancement improves the visibility of the agent's capabilities during runtime.
Added the system prompt to the AIAgent class's status output, ensuring that the current system prompt is included in the agent's status information. This enhancement improves visibility into the agent's configuration during runtime.
Enhanced the AIAgent class to capture and normalize summary information for reasoning items. Implemented logic to handle summaries as lists, ensuring proper formatting for API interactions. Updated tests to validate the inclusion of summaries in reasoning items, both for existing and default cases.
Introduced a new `provider_routing` section in the CLI configuration to control how requests are routed across providers when using OpenRouter. This includes options for sorting providers by throughput, latency, or price, as well as allowing or ignoring specific providers, setting the order of provider attempts, and managing data collection policies. Updated relevant classes and documentation to support these features, enhancing flexibility in provider selection.
Added support for processing encrypted reasoning content within the AIAgent class. Introduced logic to determine reasoning effort and enable/disable reasoning based on configuration settings. Updated the kwargs to reflect these changes, ensuring proper handling of reasoning parameters during agent execution.
Introduced a new command "/usage" in the CLI to show cumulative token usage for the current session. This includes details on prompt tokens, completion tokens, total tokens, API calls, and context state. Updated command documentation to reflect this addition. Enhanced the AIAgent class to track token usage throughout the session.
When subagents run via delegate_task, the user now sees real-time
progress instead of silence:
CLI: tree-view activity lines print above the delegation spinner
🔀 Delegating: research quantum computing
├─ 💭 "I'll search for papers first..."
├─ 🔍 web_search "quantum computing"
├─ 📖 read_file "paper.pdf"
└─ ⠹ working... (18.2s)
Gateway (Telegram/Discord): batched progress summaries sent every
5 tool calls to avoid message spam. Remaining tools flushed on
subagent completion.
Changes:
- agent/display.py: add KawaiiSpinner.print_above() to print
status lines above an active spinner without disrupting animation.
Uses captured stdout (self._out) so it works inside the child's
redirect_stdout(devnull).
- tools/delegate_tool.py: add _build_child_progress_callback()
that creates a per-child callback relaying tool calls and
thinking events to the parent's spinner (CLI) or progress
queue (gateway). Each child gets its own callback instance,
so parallel subagents don't share state. Includes _flush()
for gateway batch completion.
- run_agent.py: fire tool_progress_callback with '_thinking'
event when the model produces text content. Guarded by
_delegate_depth > 0 so only subagents fire this (prevents
gateway spam from main agent). REASONING_SCRATCHPAD/think/
reasoning XML tags are stripped before display.
Tests: 21 new tests covering print_above, callback builder,
thinking relay, SCRATCHPAD filtering, batching, flush, thread
isolation, delegate_depth guard, and prefix handling.
- Introduce a separate error log for capturing warnings and errors related to tool execution, ensuring detailed inspection of issues post-failure.
- Enhance error handling in the AIAgent class to log exceptions with stack traces for better debugging.
- Add a similar error logging mechanism in the gateway to streamline debugging processes.
- Replace `hermes login` with `hermes model` for selecting providers and managing authentication.
- Update documentation and CLI commands to reflect the new provider selection process.
- Introduce a new redaction system for logging sensitive information.
- Enhance Codex model discovery by integrating API fetching and local cache.
- Adjust max turns configuration logic for better clarity and precedence.
- Improve error handling and user feedback during authentication processes.
- Enhanced Codex model discovery by fetching available models from the API, with fallback to local cache and defaults.
- Updated the context compressor's summary target tokens to 2500 for improved performance.
- Added external credential detection for Codex CLI to streamline authentication.
- Refactored various components to ensure consistent handling of authentication and model selection across the application.
Add a new hooks system allowing users to run custom code at key lifecycle points in the agent's operation. This includes support for events such as `gateway:startup`, `session:start`, `agent:step`, and more. Documentation for creating hooks and available events has been added to `README.md` and a new `hooks.md` file. Additionally, integrate step callbacks in the agent to facilitate hook execution during tool-calling iterations.
The retry exhaustion checks used > instead of >= to compare
retry_count against max_retries. Since the while loop condition is
retry_count < max_retries, the check retry_count > max_retries can
never be true inside the loop. When retries are exhausted, the loop
exits and falls through to response.choices[0] on an invalid response,
crashing with IndexError instead of returning a proper error.
Gemini 3 thinking models attach extra_content with thought_signature
to function call responses. This must be echoed back on subsequent
API calls or the server rejects with a 400 error. The assistant
message builder was dropping this field, causing all Gemini 3 Flash/Pro
tool-calling flows to fail after the first function call.
Fixes#149
The _strip_think_blocks() method existed but was not applied to the
final_response in the normal completion path. This caused <think>...</think>
XML tags to leak into user-facing responses on all platforms (CLI, Telegram,
Discord, Slack, WhatsApp).
Changes:
- Strip think blocks from final_response before returning in normal path (line ~2600)
- Strip think blocks from fallback content when salvaging from prior tool_calls turn
Notes:
- The raw content with think blocks is preserved in messages[] for trajectory
export - this only affects the user-facing final_response
- The _has_content_after_think_block() check still uses raw content before
stripping, which is correct for detecting think-only responses
The 413 "Request Entity Too Large" error from the LLM API was caught by the
generic 4xx handler which aborts immediately. This is wrong for 413 — it's a
payload-size issue that can be resolved by compressing conversation history.
- Intercept 413 before the generic 4xx block and route to _compress_context
- Exclude 413 from generic is_client_error detection
- Add 'request entity too large' to context-length phrases as safety net
- Add tests for 413 compression behavior
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When running via the gateway (e.g. Telegram), the session_search tool
returned: {"error": "session_search must be handled by the agent loop"}
Root cause:
- gateway/run.py creates AIAgent without passing session_db=
- self._session_db is None in the agent instance
- The dispatch condition "elif function_name == 'session_search' and self._session_db"
skips when _session_db is None, falling through to the generic error
This fix:
1. Initializes self._session_db in GatewayRunner.__init__()
2. Passes session_db to all AIAgent instantiations in gateway/run.py
3. Adds defensive fallback in run_agent.py to return a clear error when
session_db is unavailable, instead of falling through
Fixes#105
- Added _max_tokens_param method in AIAgent to return appropriate max tokens parameter based on the provider (OpenAI vs. others).
- Updated API calls in AIAgent to utilize the new max tokens handling.
- Introduced auxiliary_max_tokens_param function in auxiliary_client for consistent max tokens management across auxiliary clients.
- Refactored multiple tools to use auxiliary_max_tokens_param for improved compatibility with different models and providers.
USER.md stays in system prompt when Honcho is active -- prefetch is
additive context, not a replacement. Memory tool user observations
write to both USER.md (local) and Honcho (cross-session) simultaneously.
When Honcho is active:
- System prompt uses Honcho prefetch instead of USER.md
- memory tool target=user add routes to Honcho
- MEMORY.md untouched in all cases
When disabled, everything works as before.
Also wires up contextTokens config to cap prefetch size.
Opt-in persistent cross-session user modeling via Honcho. Reads
~/.honcho/config.json as single source of truth (shared with
Claude Code, Cursor, and other Honcho-enabled tools). Zero impact
when disabled or unconfigured.
- honcho_integration/ package (client, session manager, peer resolution)
- Host-based config resolution matching claude-honcho/cursor-honcho pattern
- Prefetch user context into system prompt per conversation turn
- Sync user/assistant messages to Honcho after each exchange
- query_user_context tool for mid-conversation dialectic reasoning
- Gated activation: requires ~/.honcho/config.json with enabled=true
The `hermes` CLI entry point (hermes_cli/main.py) and the agent runner
(run_agent.py) only loaded .env from the project installation directory.
After the standard installer, code lives at ~/.hermes/hermes-agent/ but
config lives at ~/.hermes/ — so the .env was never found.
Aligns these entry points with the pattern already used by gateway/run.py
and rl_cli.py: load ~/.hermes/.env first, fall back to project root .env
for dev-mode compatibility.
Also fixes:
- status.py checking .env existence and API keys at PROJECT_ROOT
- doctor.py KeyError on tool availability (missing_vars vs env_vars)
- doctor.py checking logs/ and Skills Hub at PROJECT_ROOT instead of HERMES_HOME
- doctor.py redundant logs/ check (already covered by subdirectory loop)
- mini-swe-agent loading config from platformdirs default instead of ~/.hermes/
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Simplified the logic for determining support for reasoning based on the base URL by introducing clearer variable names.
- Added product attribution for the Nous Portal to the extra body of requests when applicable, enhancing tagging for better tracking.
- Introduced a new static method `_clean_session_content` in the `AIAgent` class to convert REASONING_SCRATCHPAD tags to <think> blocks and clean up whitespace in session logs.
- Updated the `_save_session_log` method to utilize the cleaned content for assistant messages, ensuring consistency in session logs.
- Changed the default output directory for TTS audio files from `~/voice-memos` to `~/.hermes/audio_cache`, reflecting a more appropriate storage location.