
Adding Voice Mode to Claude Code with a Stop Hook

The Problem

Claude Code is a text interface. That’s fine when you’re at the keyboard, but sometimes you want to step back — pace around, think out loud, treat it more like a conversation than a code review. The responses are right there in the terminal, but you have to keep looking at the screen to follow along.

There are a few community projects that add text-to-speech to Claude Code — OpenAI TTS wrappers, local neural engines, log-file monitors. Most of them are always-on. You install them and every response gets read aloud, which gets old fast when you’re doing actual coding work and just want to scan a diff.

What I wanted was a toggle. Voice mode on when I’m on the couch; off when I’m at the desk.

The Solution

Three pieces, all at the user level so they work across every project:

  1. A /voice skill that toggles a flag file on and off
  2. A Stop hook that checks the flag and pipes responses through TTS
  3. A UserPromptSubmit hook that injects context telling Claude its output will be spoken — so it adapts its style

The flag file is the hinge. Everything checks for /tmp/claude-voice-enabled — present means on, absent means off. No config parsing, no JSON toggling. Just a file. Putting it in /tmp means voice mode auto-clears on reboot, which is the right default for something you toggle per-session.

How To

The Toggle Skill

Create ~/.claude/skills/voice/SKILL.md:

# Voice Mode Toggle
 
Toggle voice mode on or off. When voice mode is on, Claude's responses are spoken aloud via text-to-speech.
 
## Instructions
 
Check if voice mode is currently enabled by testing for the file `/tmp/claude-voice-enabled`.
 
- If the file exists: remove it, and tell the user "Voice mode off."
- If the file does not exist: create it, and tell the user "Voice mode on — I'll keep responses speech-friendly."
 
When voice mode is ON, follow these rules for the rest of the session:
- Keep responses concise and conversational — they'll be spoken aloud.
- Avoid markdown tables, code blocks, and dense formatting that sounds bad as speech.
- Use short sentences. Prefer plain language over technical shorthand.
- Don't announce that you're in voice mode on every turn — just adapt your style.

Now /voice toggles the mode from any session.

The TTS Hook

Create ~/.claude/hooks/voice-speak.sh:

#!/bin/bash
[ -f "/tmp/claude-voice-enabled" ] || exit 0
 
text=$(jq -r '.last_assistant_message // empty' < /dev/stdin)
[ -z "$text" ] && exit 0
 
# Strip markdown that sounds bad as speech
text=$(echo "$text" | sed -E '
  s/```[^`]*```//g
  s/`[^`]*`//g
  s/\*\*([^*]*)\*\*/\1/g
  s/\*([^*]*)\*/\1/g
  s/^#{1,6} //g
  s/\[([^]]*)\]\([^)]*\)/\1/g
  s/^[|].*$//g
  s/^[-*] //g
')
 
say "$text" &

The & at the end is important — it runs say in the background so Claude Code isn’t blocked waiting for the speech to finish.
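You can sanity-check the stripping by piping sample markdown through the inline-relevant subset of the same sed program (the `strip_md` wrapper name is just for illustration):

```shell
# Subset of the hook's sed substitutions: inline code, bold, italics,
# headers, and links. The fenced-code rule is omitted here since sed
# works line by line anyway.
strip_md() {
  sed -E '
    s/`[^`]*`//g
    s/\*\*([^*]*)\*\*/\1/g
    s/\*([^*]*)\*/\1/g
    s/^#{1,6} //g
    s/\[([^]]*)\]\([^)]*\)/\1/g
  '
}

echo '**Done.** See [the docs](https://example.com) for `details`.' | strip_md
# prints: Done. See the docs for .
```

Note the dangling "for ." — removed inline code leaves gaps behind, which is part of why the stripping stays rough.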

The Context Hook

Create ~/.claude/hooks/voice-context.sh:

#!/bin/bash
[ -f "/tmp/claude-voice-enabled" ] || exit 0
 
echo "Voice mode is active. Your response will be spoken aloud. Be very brief — 1-3 sentences max. No markdown, no lists, no code blocks. Talk like a person, not a document."

This fires on every prompt submission. When voice mode is off, it exits immediately. When it’s on, the message gets injected into Claude’s context — so Claude knows to write for the ear, not the eye.

Wire the Hooks

Add both to ~/.claude/settings.json:

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/voice-speak.sh",
            "timeout": 30
          }
        ]
      }
    ],
    "UserPromptSubmit": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/voice-context.sh",
            "timeout": 5
          }
        ]
      }
    ]
  }
}

Also add permissions so the toggle doesn’t prompt every time:

{
  "permissions": {
    "allow": [
      "Bash(touch /tmp/claude-voice-enabled)",
      "Bash(rm /tmp/claude-voice-enabled)",
      "Bash(test -f /tmp/claude-voice-enabled *)"
    ]
  }
}

Make both scripts executable: chmod +x ~/.claude/hooks/voice-*.sh

Using It

Type /voice to toggle on. Claude confirms, and from that point every response gets spoken aloud. Type /voice again to turn it off. The flag file means you can also toggle from outside Claude Code — touch /tmp/claude-voice-enabled from any terminal. Reboot clears it automatically.
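The out-of-band toggle is handy enough to wrap in a shell function. A sketch for your shell profile — the `voice` name and the `VOICE_FLAG` override are made up here, not part of the skill:

```shell
# Hypothetical shell helper mirroring the /voice skill: it toggles the
# same flag file the hooks check.
VOICE_FLAG="${VOICE_FLAG:-/tmp/claude-voice-enabled}"

voice() {
  if [ -f "$VOICE_FLAG" ]; then
    rm -f "$VOICE_FLAG"
    echo "Voice mode off."
  else
    touch "$VOICE_FLAG"
    echo "Voice mode on."
  fi
}
```

Because the hooks check the file on every event, a toggle from a second terminal takes effect on Claude's very next response.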

Choosing a TTS Engine

The hook architecture stays the same regardless of which engine you use — you just change what the last line of voice-speak.sh calls. But the choice matters more than you’d think.

macOS say: The Starting Point

say is built into every Mac. Zero dependencies, zero latency, zero cost. It sounds robotic — flat prosody, no natural pacing — but it’s instant. For short responses where you just want audio confirmation that Claude finished, it works fine. This is where I started.

Kokoro: Local Neural TTS

Kokoro is an open-weight TTS model with 82 million parameters. It sounds dramatically better than say — natural prosody, decent pacing, multiple voices. The model itself is small (~300MB weights), and it runs on Apple Silicon with MPS acceleration.

Installing it is straightforward with uv:

brew install espeak-ng
uv tool install kokoro --with soundfile \
  --with "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl" \
  --python 3.12

The --python 3.12 matters — Kokoro’s blis dependency won’t compile against Python 3.14. Pre-installing the spaCy model wheel avoids a runtime conflict where uv on your PATH intercepts misaki’s internal pip install call.

The first run downloads model weights from Hugging Face (~300MB, cached locally after that). To use it in the hook:

tmpfile=$(mktemp /tmp/kokoro-speech-XXXXXX.wav)
PYTORCH_ENABLE_MPS_FALLBACK=1 kokoro -t "$text" -o "$tmpfile" 2>/dev/null \
  && afplay "$tmpfile" && rm -f "$tmpfile" &

Here’s the problem: every call to kokoro launches Python, loads PyTorch, loads the model into memory, runs inference, writes a wav, then plays it. That cold start takes 2-3 seconds — every single time, not just the first. For a voice hook that fires on every response, this makes it feel sluggish enough to be unusable.

The fix is running Kokoro as a persistent server. The community has two main options: Kokoro-FastAPI for the PyTorch version, and kokoro-tts-mcp for an MLX-native version that runs as a Claude Code MCP server. The MLX path is lighter — no PyTorch dependency, smaller install footprint — and wraps Kokoro via mlx-audio, which also supports newer models like Qwen3-TTS and Dia. Either way, the model stays loaded after the first call, and subsequent requests drop to near-instant for short text. But “stays loaded” means 600MB-1.8GB of resident RAM, and Kokoro-FastAPI has a documented memory leak under sustained use. On a 16GB machine, that’s a real cost for something you toggle on occasionally.
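With a server running, the hook’s speak step becomes an HTTP call. A sketch against Kokoro-FastAPI’s OpenAI-compatible endpoint — port 8880 and the /v1/audio/speech route are that project’s defaults, but treat the port, voice name, and wrapper function as assumptions:

```shell
# Hypothetical hook body for a persistent Kokoro-FastAPI server.
# Falls through gracefully when no server is listening.
speak_kokoro() {
  local text="$1" tmpfile
  tmpfile=$(mktemp /tmp/kokoro-speech-XXXXXX.wav)
  if curl -sf --max-time 5 -o "$tmpfile" \
       -H "Content-Type: application/json" \
       -d "$(jq -n --arg t "$text" '{model:"kokoro", voice:"af_bella", input:$t}')" \
       "http://localhost:8880/v1/audio/speech"; then
    afplay "$tmpfile"   # macOS audio player, as in the other hooks
  else
    echo "Kokoro server unreachable; skipping speech" >&2
  fi
  rm -f "$tmpfile"
}
```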

OpenAI TTS: Where I Landed

For a voice hook — short snippets, latency-sensitive, intermittent use — cloud TTS turned out to be the right answer. OpenAI’s TTS API responds with about 250ms time-to-first-byte, needs zero local resources, and costs ~$0.015 per 1K characters — a fraction of a cent per Claude response.

The hook is just a curl call:

if [ -n "$OPENAI_VOICE_API_KEY" ]; then
  tmpfile=$(mktemp /tmp/openai-speech-XXXXXX.mp3)
  curl -s -o "$tmpfile" \
    -H "Authorization: Bearer $OPENAI_VOICE_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg t "$text" '{model:"gpt-4o-mini-tts",voice:"nova",speed:1.5,input:$t}')" \
    "https://api.openai.com/v1/audio/speech" \
    && afplay "$tmpfile" && rm -f "$tmpfile" &
else
  say "$text" &
fi

I use gpt-4o-mini-tts with the nova voice at 1.5x speed. Nova has a warm, conversational tone that fits spoken Claude responses well, and 1.5x keeps the pacing brisk without sounding like a chipmunk. The API key lives in ~/.zshenv.local so the hook picks it up automatically — no hardcoded secrets in scripts.

ElevenLabs Flash is another option at sub-100ms TTFB, but pricier. For this use case, OpenAI’s latency is plenty fast.

Kokoro earns its keep for bulk work: audiobooks, podcast generation, anything where per-call API costs add up and you don’t mind a background server. For a conversational voice mode where you’re generating a sentence or two at a time, the economics flip — a fraction of a cent per response is cheaper than the RAM and complexity of keeping a local model warm.

What I Learned

The context hook matters more than the TTS hook. Without it, Claude writes the same dense, markdown-heavy responses it always does — and hearing a table read aloud is painful. Telling Claude its output will be spoken changes the output shape dramatically: shorter sentences, no formatting, conversational rhythm.

The other surprise was response length. Even with the context hook, the first version was too verbose for speech. Tightening the prompt from “keep it concise” to “1-3 sentences max” made a real difference. Reading is forgiving of length; listening isn’t.

Caveats

Responses can overlap: if Claude finishes a new response before the TTS engine is done speaking the previous one, two voices play at once. A production version would kill any still-running TTS process before starting the next.
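One way to close that gap, sketched with a PID file — the path and the `speak` wrapper are hypothetical, not part of the hooks above:

```shell
# Hypothetical anti-overlap wrapper: remember the PID of the last TTS
# process and kill it before speaking the next response.
PIDFILE=/tmp/claude-voice-tts.pid

speak() {
  if [ -f "$PIDFILE" ]; then
    kill "$(cat "$PIDFILE")" 2>/dev/null   # stop any speech still playing
  fi
  say "$1" &                               # swap in your TTS command here
  echo $! > "$PIDFILE"
}
```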

The sed-based markdown stripping is rough. It handles bold, headers, links, and code blocks, but complex nested formatting will still leak through as garbled speech.
