OpenClaw Operations And Troubleshooting

Once OpenClaw becomes part of daily use, the real danger is not installation failure. The real danger is partial health.

That is the awkward state where the gateway is up, but the router is not. Or the router is fine, but the task board API key has expired. Or the household agent still answers plain text, but voice notes silently fall back. Or every agent looks healthy, except the heartbeats have stopped and the UI is lying to you.

The fix is not more intuition. The fix is a fixed order of checks.

Treat It Like A Stack, Not A Bot

The current OpenClaw shape has several moving parts:

gateway on CT 106,
llama.cpp Router Mode on CT 102,
Command Center on CT 107,
heartbeat and memory sidecars on CT 106,
Radicale on CT 92 for household state,
optional SearXNG and other service dependencies,
whisper.cpp for voice transcription,
Telegram and Discord as the human-facing edge.

When something breaks, the goal is to identify which layer failed first and stop blaming everything else.

Baseline Environment

Before running deeper checks, load the environment file so the commands use the same secrets and endpoints as the service:

source /root/.openclaw/env
 
export MC_URL="${MC_URL:-http://192.168.50.86:3000}"
export RADICALE_URL="${RADICALE_URL:-http://192.168.50.92:5232}"

Keep secrets in env files or systemd environment blocks. Do not paste long-lived tokens or API keys back into docs or JSON config just because you are debugging quickly.
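
A broken or half-loaded environment file tends to surface later as confusing curl failures. The sketch below fails early instead. It assumes plain POSIX sh; `require_env` is a hypothetical helper name, and the variable names in the usage line are the ones the commands in this guide reference. It prints only names, never values, so secrets stay out of the terminal scrollback.

```shell
# Sketch: report every listed variable that is unset or empty.
# require_env is a hypothetical helper, not part of OpenClaw.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, after `source /root/.openclaw/env`:
#   require_env MC_API_KEY MC_URL RADICALE_URL RADICALE_FAMILY_USER RADICALE_FAMILY_PASS
```

If the helper reports a name, fix the env file before running any deeper check.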

Fast Health Pass

If you only have two minutes, run these checks in this order.

1. Gateway And Router

curl -sf http://192.168.50.85:18789/health && echo "gateway ok"
curl -sf http://192.168.50.45:8012/health && echo "router ok"
curl -s http://192.168.50.45:8012/v1/models | jq '.data[].id'

If the gateway is dead, stay there and fix it before checking anything else. If the gateway is healthy but no models are available, the problem is upstream in Router Mode, not in Telegram, Discord, or agent prompts.

2. Command Center And Workforce State

curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/tasks" | jq 'if type == "array" then length else . end'
curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/agents" | jq '.[] | {name: .name, last_seen: .last_seen}'
systemctl is-active mc-heartbeat.timer
systemctl is-active memsearch-watch.service

This tells you whether the board is reachable, whether agent liveness data is fresh, and whether the memory indexer is still running.
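
Eyeballing `last_seen` values does not scale past a few agents. The sketch below flags stale ones. It assumes an intermediate format of one `name epoch_seconds` pair per line on stdin, with agent names that contain no spaces; the real `last_seen` field may need converting first. `stale_agents` and the 300-second limit are hypothetical.

```shell
# Sketch: print the names of agents whose last heartbeat is older than a limit.
# Input on stdin: "name epoch_seconds" per line (assumed format).
stale_agents() {
    now=$1        # current time, epoch seconds
    max_age=$2    # allowed silence, seconds
    while read -r name seen; do
        # Skip blank or incomplete lines rather than failing on them.
        if [ -z "$name" ] || [ -z "$seen" ]; then
            continue
        fi
        age=$((now - seen))
        if [ "$age" -gt "$max_age" ]; then
            echo "$name"
        fi
    done
}

# Usage, assuming last_seen is an ISO 8601 UTC string such as 2025-01-01T00:00:00Z:
#   curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/agents" \
#     | jq -r '.[] | "\(.name) \(.last_seen | fromdateiso8601)"' \
#     | stale_agents "$(date +%s)" 300
```

If the API returns a different timestamp format, only the jq conversion in the usage line needs to change.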

3. Household Dependencies

curl -sf http://192.168.50.45:8013/health && echo "whisper ok"
test -x /root/.openclaw/voice/transcribe-whisper.sh && echo "transcribe script ok"
 
curl -sf \
  -u "$RADICALE_FAMILY_USER:$RADICALE_FAMILY_PASS" \
  -X PROPFIND \
  -H "Depth: 1" \
  "$RADICALE_URL/family/shared/" | grep -q "href" && echo "radicale ok"

If household features are failing, these checks separate calendar state from voice state immediately.

4. Direct Agent Roundtrip

openclaw agent --agent main --message "Reply with exactly: SMOKE_TEST_OK"

The exact CLI alias can change between releases, but the test idea does not: force a minimal direct response path that bypasses chat apps and proves the gateway can still talk to a model.
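
To make the roundtrip scriptable, check the reply for the marker instead of reading it by eye. This is a minimal sketch: `reply_has_marker` is a hypothetical helper, and it matches the marker as a substring on the assumption that the CLI may wrap the reply in extra output.

```shell
# Sketch: succeed only if the reply contains the expected marker.
reply_has_marker() {
    reply=$1
    marker=${2:-SMOKE_TEST_OK}
    case "$reply" in
        *"$marker"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (invocation copied from above; adjust if the CLI alias changed):
#   reply=$(openclaw agent --agent main --message "Reply with exactly: SMOKE_TEST_OK")
#   reply_has_marker "$reply" && echo "agent roundtrip ok"
```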

The Better Way To Debug: A Fixed Sequence

When a quick pass is not enough, use the deeper sequence below.

Step 0. Check Service State And Recent Logs

openclaw --version
systemctl status openclaw --no-pager
journalctl -u openclaw -n 40 --no-pager

This catches the embarrassing failures first: wrong binary path, service not running, recent crash loop, broken environment file.

Step 1. Prove The Gateway Is Real

curl -v http://192.168.50.85:18789/health
ss -tlnp | grep 18789

If this fails, you do not have an assistant. You have a dead listener.

Step 2. Prove Inference Still Exists

curl -sf http://192.168.50.45:8012/health
curl -s http://192.168.50.45:8012/v1/models | jq '.data[].id'

If the model list is empty or the health endpoint is dead, do not waste time on channel pairing, prompt files, or agent configuration.

Step 3. Prove The Control Plane Still Works

curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/tasks" | jq 'type'
curl -sf -H "x-api-key: $MC_API_KEY" "$MC_URL/api/agents" | jq '.[] | {name: .name, last_seen: .last_seen}'
bash /root/.openclaw/scripts/mc-heartbeat.sh

This is where you find broken API auth, dead board endpoints, stale agent presence, or rename-related drift between agent names and queue expectations.
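
`curl -sf` only says that something failed, not what. Reading the HTTP status code separates a dead endpoint from broken auth. The sketch below assumes the Command Center answers a bad API key with 401 or 403, which this guide does not confirm; `classify_http_code` and its labels are hypothetical.

```shell
# Sketch: map an HTTP status code to the failure class it suggests.
classify_http_code() {
    case "$1" in
        000)      echo "unreachable" ;;       # curl prints 000 when no HTTP response arrived
        2??)      echo "ok" ;;
        401|403)  echo "auth" ;;              # assumed response to a bad or expired key
        404)      echo "missing-endpoint" ;;
        5??)      echo "server-error" ;;
        *)        echo "other" ;;
    esac
}

# Usage:
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "x-api-key: $MC_API_KEY" "$MC_URL/api/tasks")
#   classify_http_code "$code"
```

An `auth` result points at the key in the env file; `unreachable` points at the container or the network, not at the board itself.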

Step 4. Check Memory And Persistence Services

systemctl status memsearch-watch.service --no-pager
journalctl -u memsearch-watch.service -n 20 --no-pager

If memory retrieval feels inconsistent, do not assume the model has suddenly become forgetful. First verify the indexer is still alive and watching the right workspace paths.

Step 5. Check Household I/O Paths

curl -sf http://192.168.50.45:8013/health
/root/.openclaw/voice/transcribe-whisper.sh /path/to/test-audio.ogg
 
curl -sf \
  -u "$RADICALE_FAMILY_USER:$RADICALE_FAMILY_PASS" \
  -X PROPFIND \
  -H "Depth: 1" \
  "$RADICALE_URL/family/shared/"

These tests tell you whether the household path is failing at media transcription, calendar access, or later in the agent reasoning layer.

Step 6. Check The Real Message Path

For Telegram or Discord, send a plain text probe and watch live logs:

journalctl -u openclaw -f | grep -E "telegram|discord|message|response|inference"

If the message arrives but inference never starts, the fault is behind the gateway. If inference completes but no reply leaves the service, the problem is in the channel path.

Operational Tasks Worth Keeping Routine

Update OpenClaw Carefully

pnpm install -g openclaw@latest
systemctl restart openclaw
openclaw --version
openclaw doctor

Then rerun at least the gateway, router, and direct-agent checks before assuming the upgrade was uneventful.

Back Up Before Major Changes

# On the Proxmox host
vzdump 106 --storage local --compress zstd --mode snapshot
 
# Or inside CT 106
tar czf /tmp/openclaw-backup-$(date +%F).tar.gz ~/.openclaw/

Watch Growth In The Working Set

df -h /
du -sh ~/.openclaw/
du -sh ~/.openclaw/agents/*/sessions/
du -sh ~/.openclaw/workspace/*/memory/

Long-running agent systems quietly accumulate state. Disk pressure rarely announces itself gracefully.
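
A threshold check turns the `df` reading into something a timer or cron job can act on. This is a sketch: `usage_percent` and `disk_ok` are hypothetical helpers, and the 85 percent default is an arbitrary starting point, not a value from the OpenClaw documentation.

```shell
# Sketch: print the used-space percentage of the filesystem holding a path.
usage_percent() {
    # df -P gives stable POSIX columns; field 5 of line 2 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Sketch: fail with a message when usage reaches the limit.
disk_ok() {
    path=$1
    limit=${2:-85}   # hypothetical threshold, tune to taste
    pct=$(usage_percent "$path")
    if [ "$pct" -ge "$limit" ]; then
        echo "disk pressure: $path at ${pct}% (limit ${limit}%)" >&2
        return 1
    fi
}

# Usage:
#   disk_ok / 85 || du -sh ~/.openclaw/agents/*/sessions/
```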

Common Failure Patterns

Gateway Is Up, But The System Is Still Broken

This usually means the health check is telling the truth about the listener, but not about the rest of the stack. Always pair gateway health with Router Mode health and at least one direct agent probe.

Voice Notes Fall Back To The Default Message

That is rarely a reasoning issue. It usually means one of these is broken:

tools.media.audio is missing or malformed,
ffmpeg is unavailable,
jq is unavailable,
whisper.cpp is down,
the transcription script is missing or not executable.

Treat it as a media pipeline failure until proven otherwise.
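
Most of that list can be checked in a few seconds. The sketch below covers the tools, the script, and the whisper.cpp endpoint; it does not inspect tools.media.audio, because where that setting lives is not shown here. `missing_tools` is a hypothetical helper.

```shell
# Sketch: print each named tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage (paths and endpoint copied from the checks in this guide):
#   missing_tools ffmpeg jq
#   test -x /root/.openclaw/voice/transcribe-whisper.sh \
#     || echo "transcribe script missing or not executable"
#   curl -sf http://192.168.50.45:8013/health >/dev/null \
#     || echo "whisper.cpp not answering"
```

Run it as the same user the service runs as, since PATH can differ between an interactive shell and systemd.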

Household Replies, But Calendar Or Todo Features Do Not

That usually points to Radicale connectivity or credentials, not the LLM. Test the CalDAV path directly before editing prompts.

Board Tasks Look Wrong Or Agents Seem Offline

Look at three things together:

Command Center API availability,
mc-heartbeat.timer,
whether older mission_control naming still exists in queue logic, scripts, or stored strings.

Newer documentation calls the orchestrator commander, but residue from the older mission_control name can still create confusing symptoms.

Telegram Or Discord Works In A Shell, Then Dies After Restart

That is almost always environment persistence. If a token or webhook only exists in an interactive shell, systemd will not magically inherit it later.
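
One way to confirm this is to look at the environment of the running process rather than the shell you are typing in. The sketch below checks for a variable name in a /proc-style environ file without printing its value. `environ_has_var` is a hypothetical helper, and `TELEGRAM_BOT_TOKEN` is a placeholder name; substitute whatever variable your channel config actually reads.

```shell
# Sketch: check whether a variable name appears in a /proc/<pid>/environ style
# file (NUL-separated KEY=VALUE entries). Prints nothing, so values stay hidden.
environ_has_var() {
    name=$1
    file=$2
    tr '\0' '\n' < "$file" | grep -q "^${name}="
}

# Usage against the running service (unit name taken from the commands above):
#   pid=$(systemctl show -p MainPID --value openclaw)
#   environ_has_var TELEGRAM_BOT_TOKEN "/proc/$pid/environ" \
#     || echo "variable not present in the service environment"
```

If the variable is missing there, move it into the env file or the systemd environment block and restart the service.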

Grafana And Metrics Still Feel Fuzzy

That is because the metrics work is still partially design-stage. Do not debug a non-existent exporter. First answer the unresolved infrastructure question: where Pushgateway actually lives. Until that is settled, treat metrics as planned observability, not a broken production subsystem.

The Useful Discipline After Every Change

After touching agent config, channel wiring, board scripts, or media tooling, rerun a focused smoke test instead of trusting memory.

The minimum sensible set is:

gateway health
router health and model list
task board API reachability
heartbeat timer state
one direct agent roundtrip
one real channel message if the change touched channels

That order catches most regressions before they become mysterious stories.
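
The list above can be wrapped in a small runner that enforces the order and stops at the first failure, so the first broken layer is the one that gets reported. This is a sketch: `run_checks` and its `label::command` argument format are hypothetical, and each command runs in a child `sh`, so any variable it uses must be exported.

```shell
# Sketch: run checks in a fixed order and stop at the first failure.
# Each argument is "label::command"; the command runs via sh -c.
run_checks() {
    for entry in "$@"; do
        label=${entry%%::*}   # text before the first "::"
        cmd=${entry#*::}      # text after the first "::"
        if sh -c "$cmd" >/dev/null 2>&1; then
            echo "PASS $label"
        else
            echo "FAIL $label"
            return 1
        fi
    done
}

# Usage (endpoints and unit names copied from the checks in this guide):
#   run_checks \
#     "gateway::curl -sf http://192.168.50.85:18789/health" \
#     "router::curl -sf http://192.168.50.45:8012/health" \
#     "heartbeat timer::systemctl is-active mc-heartbeat.timer"
```

Stopping early is deliberate: a failed gateway check makes every later result noise.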
