OpenClaw Operations And Troubleshooting
Operate OpenClaw like a real system: validate the gateway, router, task board, heartbeats, memory layer, CalDAV, voice path, and channels in a fixed order before you start guessing.
Published January 26, 2025 · Updated May 8, 2026
Once OpenClaw becomes part of daily use, the real danger is not installation failure. The real danger is partial health.
That is the awkward state where the gateway is up, but the router is not. Or the router is fine, but the task board API key expired. Or household still answers plain text, but voice notes silently fall back. Or every agent looks healthy except the heartbeats have stopped and the UI is lying to you.
The fix is not more intuition. The fix is a fixed order of checks.
Treat It Like A Stack, Not A Bot
The current OpenClaw shape has several moving parts:
- gateway on CT 106,
- llama.cpp Router Mode on CT 102,
- Command Center on CT 107,
- heartbeat and memory sidecars on CT 106,
- Radicale on CT 92 for household state,
- optional SearXNG and other service dependencies,
- whisper.cpp for voice transcription,
- Telegram and Discord as the human-facing edge.
When something breaks, the goal is to identify which layer failed first and stop blaming everything else.
Baseline Environment
Before running deeper checks, load the environment file so the commands use the same secrets and endpoints as the service:
source /root/.openclaw/env
export MC_URL="${MC_URL:-http://192.168.50.86:3000}"
export RADICALE_URL="${RADICALE_URL:-http://192.168.50.92:5232}"
Keep secrets in env files or systemd environment blocks. Do not paste long-lived tokens or API keys back into docs or JSON config just because you are debugging quickly.
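A small guard can make a missing variable fail loudly before any check runs. The sketch below is one way to do that; `require_env` is a hypothetical helper, not part of OpenClaw, and the variable names in the usage comment are simply the ones used on this page.

```shell
# require_env: hypothetical helper. Prints every unset or empty variable
# and returns non-zero if at least one is missing.
require_env() {
  missing=0
  for name in "$@"; do
    # Portable indirect lookup: read the value of the variable named by $name.
    # Only pass trusted variable names, since eval interprets the string.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, after sourcing the env file:
# require_env MC_API_KEY MC_URL RADICALE_URL RADICALE_FAMILY_USER RADICALE_FAMILY_PASS
```

Running this first turns an empty `$MC_API_KEY` into an explicit message instead of a confusing authentication failure two steps later.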
Fast Health Pass
If you only have two minutes, run this order.
1. Gateway And Router
curl -sf http://192.168.50.85:18789/health && echo "gateway ok"
curl -sf http://192.168.50.45:8012/health && echo "router ok"
curl -s http://192.168.50.45:8012/v1/models | jq '.data[].id'
If the gateway is dead, stay there. If the gateway is healthy but no models are available, the problem is upstream in Router Mode, not in Telegram, Discord, or agent prompts.
2. Command Center And Workforce State
curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/tasks" | jq 'if type == "array" then length else . end'
curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/agents" | jq '.[] | {name: .name, last_seen: .last_seen}'
systemctl is-active mc-heartbeat.timer
systemctl is-active memsearch-watch.service
This tells you whether the board is reachable, whether agent liveness data is fresh, and whether the memory indexer is still running.
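Reading `last_seen` by eye gets tedious, so freshness can be computed instead. The sketch below assumes `/api/agents` returns a JSON array of objects with `name` and `last_seen`, and that `last_seen` is a UTC timestamp in the form `2026-05-08T10:00:00Z`. Both are assumptions about the Command Center API, so adjust the filter if the real payload differs. `stale_agents` is a hypothetical helper name.

```shell
# stale_agents: hypothetical helper. Reads the agents JSON on stdin and prints
# the name of every agent whose last_seen is older than the given age in seconds.
# Assumes last_seen looks like 2026-05-08T10:00:00Z; a null or differently
# formatted value makes jq exit with an error instead of being skipped.
stale_agents() {
  max_age="${1:-300}"
  jq -r --argjson max "$max_age" '
    .[]
    | select((now - (.last_seen | fromdateiso8601)) > $max)
    | .name'
}

# Usage:
# curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/agents" | stale_agents 600
```

Empty output means every agent reported within the window. Any name printed is an agent whose heartbeat deserves a closer look.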
3. Household Dependencies
curl -sf http://192.168.50.45:8013/health && echo "whisper ok"
test -x /root/.openclaw/voice/transcribe-whisper.sh && echo "transcribe script ok"
curl -sf \
-u "$RADICALE_FAMILY_USER:$RADICALE_FAMILY_PASS" \
-X PROPFIND \
-H "Depth: 1" \
"$RADICALE_URL/family/shared/" | grep -q "href" && echo "radicale ok"
If household features are failing, these checks separate calendar state from voice state immediately.
4. Direct Agent Roundtrip
openclaw agent --agent main --message "Reply with exactly: SMOKE_TEST_OK"
The exact CLI alias can change between releases, but the test idea does not: force a minimal direct response path that bypasses chat apps and proves the gateway can still talk to a model.
The Better Way To Debug: A Fixed Sequence
When a quick pass is not enough, use the deeper sequence below.
Step 0. Check Service State And Recent Logs
openclaw --version
systemctl status openclaw
journalctl -u openclaw -n 40 --no-pager
This catches the embarrassing failures first: wrong binary path, service not running, recent crash loop, broken environment file.
Step 1. Prove The Gateway Is Real
curl -v http://192.168.50.85:18789/health
ss -tlnp | grep 18789
If this fails, you do not have an assistant. You have a dead listener.
Step 2. Prove Inference Still Exists
curl -sf http://192.168.50.45:8012/health
curl -s http://192.168.50.45:8012/v1/models | jq '.data[].id'
If the model list is empty or the health endpoint is dead, do not waste time on channel pairing, prompt files, or agent configuration.
Step 3. Prove The Control Plane Still Works
curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/tasks" | jq 'type'
curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/agents" | jq '.[] | {name: .name, last_seen: .last_seen}'
bash /root/.openclaw/scripts/mc-heartbeat.sh
This is where you find broken API auth, dead board endpoints, stale agent presence, or rename-related drift between agent names and queue expectations.
Step 4. Check Memory And Persistence Services
systemctl status memsearch-watch.service
journalctl -u memsearch-watch.service -n 20 --no-pager
If memory retrieval feels inconsistent, do not assume the model has suddenly become forgetful. First verify the indexer is still alive and watching the right workspace paths.
Step 5. Check Household I/O Paths
curl -sf http://192.168.50.45:8013/health
/root/.openclaw/voice/transcribe-whisper.sh /path/to/test-audio.ogg
curl -sf \
-u "$RADICALE_FAMILY_USER:$RADICALE_FAMILY_PASS" \
-X PROPFIND \
-H "Depth: 1" \
"$RADICALE_URL/family/shared/"
These tests tell you whether the household path is failing at media transcription, calendar access, or later in the agent reasoning layer.
Step 6. Check The Real Message Path
For Telegram or Discord, send a plain text probe and watch live logs:
journalctl -u openclaw -f | grep -E "telegram|discord|message|response|inference"
If the message arrives but inference never starts, the fault is behind the gateway. If inference completes but no reply leaves the service, the problem is in the channel path.
Operational Tasks Worth Keeping Routine
Update OpenClaw Carefully
pnpm install -g openclaw@latest
systemctl restart openclaw
openclaw --version
openclaw doctor
Then rerun at least the gateway, router, and direct-agent checks before assuming the upgrade was uneventful.
Back Up Before Major Changes
# On the Proxmox host
vzdump 106 --storage local --compress zstd --mode snapshot
# Or inside CT 106
tar czf /tmp/openclaw-backup-$(date +%F).tar.gz ~/.openclaw/
Watch Growth In The Working Set
df -h /
du -sh ~/.openclaw/
du -sh ~/.openclaw/agents/*/sessions/
du -sh ~/.openclaw/workspace/*/memory/
Long-running agent systems quietly accumulate state. Disk pressure rarely announces itself gracefully.
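Since disk pressure does not announce itself, a threshold check that can run from cron or a timer may help. This is a minimal sketch: `disk_pct` and `warn_if_full` are hypothetical helper names, and the 80 percent threshold in the usage comment is an arbitrary example, not a recommendation from OpenClaw.

```shell
# disk_pct: prints the used-space percentage (digits only) of the filesystem
# that holds the given path. Uses df -P for the stable POSIX column layout.
disk_pct() {
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# warn_if_full: prints a warning and returns non-zero when usage is at or
# above the given threshold.
warn_if_full() {
  pct="$(disk_pct "$1")"
  if [ "$pct" -ge "$2" ]; then
    echo "WARN: $1 is at ${pct}% (threshold $2%)"
    return 1
  fi
}

# Usage:
# warn_if_full / 80
```

Pair a warning with the `du -sh` commands above to see which directory is actually growing.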
Common Failure Patterns
Gateway Is Up, But The System Is Still Broken
This usually means the health check is telling the truth about the listener, but not about the rest of the stack. Always pair gateway health with Router Mode health and at least one direct agent probe.
Voice Notes Fall Back To The Default Message
That is usually not a reasoning issue. It usually means one of these is broken:
- tools.media.audio is missing or malformed,
- ffmpeg is unavailable,
- jq is unavailable,
- whisper.cpp is down,
- the transcription script is missing or not executable.
Treat it as a media pipeline failure until proven otherwise.
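Most of the causes listed above can be checked mechanically before touching any prompt. The sketch below covers the tool, script, and service checks; `missing_tools` is a hypothetical helper, and the `tools.media.audio` setting is left out because this page does not say where that config lives.

```shell
# missing_tools: hypothetical helper. Prints each named command that is not
# on PATH and returns non-zero if any is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Usage, on the host that runs the transcription script:
# missing_tools ffmpeg jq
# test -x /root/.openclaw/voice/transcribe-whisper.sh || echo "transcribe script missing or not executable"
# curl -sf --max-time 5 http://192.168.50.45:8013/health >/dev/null || echo "whisper.cpp is down"
```

If all three usage lines stay silent, the remaining suspect from the list is the audio configuration itself.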
Household Replies, But Calendar Or Todo Features Do Not
That usually points to Radicale connectivity or credentials, not the LLM. Test the CalDAV path directly before editing prompts.
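To tell connectivity apart from credentials, it can help to look at the HTTP status code of the PROPFIND request rather than only at whether it succeeded. The mapping below is a sketch under general WebDAV and curl conventions (207 is the normal PROPFIND success code, and curl reports 000 when it received no HTTP response). `classify_caldav_status` is a hypothetical helper name.

```shell
# classify_caldav_status: hypothetical helper. Maps the HTTP status code of a
# PROPFIND request to the most likely cause.
classify_caldav_status() {
  case "$1" in
    207)     echo "ok" ;;                          # WebDAV Multi-Status
    401|403) echo "credentials or permissions" ;;
    404)     echo "wrong collection path" ;;
    000)     echo "unreachable" ;;                 # no HTTP response at all
    *)       echo "unexpected status $1" ;;
  esac
}

# Usage:
# code="$(curl -s -o /dev/null -w '%{http_code}' \
#   -u "$RADICALE_FAMILY_USER:$RADICALE_FAMILY_PASS" \
#   -X PROPFIND -H 'Depth: 1' "$RADICALE_URL/family/shared/")"
# classify_caldav_status "$code"
```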
Board Tasks Look Wrong Or Agents Seem Offline
Look at three things together:
- Command Center API availability,
- mc-heartbeat.timer,
- whether older mission_control naming still exists in queue logic, scripts, or stored strings.
In newer documentation the orchestrator is commander, but migration residue can still create confusing symptoms.
Telegram Or Discord Works In A Shell, Then Dies After Restart
That is almost always environment persistence. If a token or webhook only exists in an interactive shell, systemd will not magically inherit it later.
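One way to confirm this is to inspect the environment of the running service process instead of the shell you are typing in. The sketch below reads `/proc/<pid>/environ`, which normally requires root. `environ_has_var` is a hypothetical helper, and `TELEGRAM_BOT_TOKEN` is a placeholder variable name, so substitute whatever name your channel configuration actually uses.

```shell
# environ_has_var: hypothetical helper. Succeeds if the variable named in $2
# is present in the NUL-separated environ file given as $1.
environ_has_var() {
  tr '\000' '\n' < "$1" | grep -q "^$2="
}

# Usage (as root; TELEGRAM_BOT_TOKEN is a placeholder name):
# pid="$(systemctl show -p MainPID --value openclaw)"
# environ_has_var "/proc/$pid/environ" TELEGRAM_BOT_TOKEN \
#   && echo "visible to the service" \
#   || echo "not in the service environment"
```

If the variable is missing there, move it into the env file or systemd environment block that the unit loads, then restart the service.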
Grafana And Metrics Still Feel Fuzzy
That is because the metrics work is still partially design-stage. Do not debug a non-existent exporter. First answer the unresolved infrastructure question: where Pushgateway actually lives. Until that is settled, treat metrics as planned observability, not a broken production subsystem.
The Useful Discipline After Every Change
After touching agent config, channel wiring, board scripts, or media tooling, rerun a focused smoke test instead of trusting memory.
The minimum sensible set is:
- gateway health
- router health and model list
- task board API reachability
- heartbeat timer state
- one direct agent roundtrip
- one real channel message if the change touched channels
That order catches most regressions before they become mysterious stories.
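The minimum set above can be wrapped in a small runner that enforces the order and stops at the first failure. This is a sketch, not a shipped OpenClaw tool: `run_checks` is a hypothetical helper, and the commands in the usage comment are the ones already shown on this page.

```shell
# run_checks: hypothetical helper. Takes alternating label/command arguments,
# runs them in order, and stops at the first failure so the earliest broken
# layer is the one that gets reported.
run_checks() {
  while [ "$#" -ge 2 ]; do
    label="$1"
    cmd="$2"
    shift 2
    # eval runs the command string in the current shell, so variables loaded
    # from the env file are visible. Only pass trusted strings.
    if eval "$cmd" >/dev/null 2>&1; then
      echo "ok   $label"
    else
      echo "FAIL $label"
      return 1
    fi
  done
}

# Usage, after sourcing the env file:
# run_checks \
#   "gateway health"  'curl -sf http://192.168.50.85:18789/health' \
#   "router health"   'curl -sf http://192.168.50.45:8012/health' \
#   "model list"      'curl -s http://192.168.50.45:8012/v1/models | jq -e ".data | length > 0"' \
#   "task board API"  'curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/tasks"' \
#   "heartbeat timer" 'systemctl is-active --quiet mc-heartbeat.timer' \
#   "agent roundtrip" 'openclaw agent --agent main --message "Reply with exactly: SMOKE_TEST_OK" | grep -q SMOKE_TEST_OK'
```

Stopping at the first failure is deliberate: a dead router makes every later check fail too, and those extra failures only add noise.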
Related Pages
- OpenClaw Architecture, Agents, And Workflows
- OpenClaw Workspace, Skills, And Use Cases
- OpenClaw On Proxmox
- OpenClaw Channels, Daemon, And Secure Exposure