
Audio detection & speech transcription

Alcoves can listen to any video or audio file and do two things automatically: tag the sounds it hears and turn spoken words into a searchable transcript. Both run entirely on your instance — no audio leaves your server, no cloud API key required.

  • Audio event detection — classifies sounds (laughter, applause, music, speech, and 520+ more categories from Google’s AudioSet taxonomy) into a timestamped timeline alongside your file.
  • Speech transcription — converts spoken words into plain text and WebVTT subtitles using whisper.cpp, an open-source, CPU-only speech model.

Both features feed directly into the video editor, where detections and transcript cues become a seekable overlay on the timeline and power the highlight filters engine — for example, “find every moment where there is laughter within five seconds of the word ‘wow’.”


Open any video or audio file in the editor. The toolbar gives you two buttons:

Transcribe — runs whisper.cpp over the file’s audio track. When it finishes, the Transcript panel appears with:

  • Time-coded cues you can click to jump to that moment in the video.
  • A full-text search box to find any spoken phrase.
  • A “top words” frequency view — click a word to filter the cue list.

Detect audio — runs the audio-event tagger over the same file. When it finishes, the Audio detections panel appears with:

  • Results grouped by sound label (e.g. “Speech”, “Laughter”, “Music”).
  • A confidence score badge for each category.
  • A clickable timeline strip with one bar per analysis window, its opacity scaled to confidence, so you can jump to the highest-confidence moment of any sound.

Both buttons show live progress while the job runs (Transcribing 42%...), offer a Retry affordance if something goes wrong, and switch to Re-transcribe / Re-detect once a result exists.

You don’t have to process files one at a time. From Library → Settings you can kick off transcription or audio detection across all eligible files in the library at once. The library browser’s multi-select context menu also exposes Transcribe N files and Detect audio in N files actions for a targeted batch.

Once a file has both a transcript and audio detections, the Highlight filters panel in the editor lets you write expressions combining word and sound criteria — for example, audio:laughter:30 & word:wow. Alcoves evaluates these client-side against the detections and cues and highlights matching segments on the timeline, making it fast to cut a highlight reel from a long recording.
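
To make the idea concrete, here is a minimal Python sketch of the kind of proximity match a filter such as "laughter within five seconds of the word 'wow'" implies. It is not Alcoves' filter engine (which runs client-side) and it does not parse the expression syntax. The `Detection` and `Cue` shapes, the `match_highlights` function, and the reading of a minimum confidence as a 0.0 to 1.0 value are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # sound label, e.g. "Laughter"
    start: float       # window start in seconds
    end: float         # window end in seconds
    confidence: float  # assumed to be in the range 0.0 to 1.0


@dataclass
class Cue:
    start: float  # cue start in seconds
    end: float    # cue end in seconds
    text: str     # spoken text of the cue


def match_highlights(detections, cues, label, min_conf, word, max_gap=5.0):
    """Return (start, end) spans where a matching sound window lies within
    max_gap seconds of a transcript cue that contains the word."""
    word = word.lower()
    # Tokenize on word characters so "Wow," still matches "wow".
    word_cues = [c for c in cues if word in re.findall(r"\w+", c.text.lower())]
    spans = []
    for d in detections:
        if d.label.lower() != label.lower() or d.confidence < min_conf:
            continue
        for c in word_cues:
            # Distance between the two intervals; 0.0 when they overlap.
            gap = max(d.start - c.end, c.start - d.end, 0.0)
            if gap <= max_gap:
                spans.append((min(d.start, c.start), max(d.end, c.end)))
                break  # one span per detection window is enough
    return spans
```

Each returned span covers both the sound window and the nearby cue, which is roughly what you would want to highlight on a timeline.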


The instance owner can change the active whisper model at any time from Admin → Inference Models — no restart required. A smaller model is faster and uses less RAM; a larger model is more accurate.

| Model | Disk | Peak RAM | Notes |
|---|---|---|---|
| tiny | 75 MB | 390 MB | Fastest; good for quick drafts |
| base | 142 MB | 500 MB | |
| small | 466 MB | 1 GB | |
| medium | 1.5 GB | 2.5 GB | |
| large-v3 (default) | 3.1 GB | 3.9 GB | Best accuracy |
| large-v3-q5_0 | 1.1 GB | 1.3 GB | Quantized; near large-v3 accuracy |
| large-v3-turbo-q5_0 | 574 MB | 900 MB | ~8× faster than large-v3 |
| large-v3-turbo-q4_0 | 470 MB | 800 MB | Smallest near-accurate option |
| distil-large-v3.5-q5 | 600 MB | 1 GB | English-only |
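
When sizing a server, it can help to filter the table by available memory. The sketch below copies the disk and peak-RAM figures from the table above and lists the models that fit a given RAM budget. The `models_within_ram` helper is hypothetical, and treating 1 GB as 1000 MB is an assumption; leave headroom for the rest of the instance.

```python
# (model, disk in MB, peak RAM in MB), copied from the table above.
# GB values are converted assuming 1 GB = 1000 MB.
WHISPER_MODELS = [
    ("tiny", 75, 390),
    ("base", 142, 500),
    ("small", 466, 1000),
    ("medium", 1500, 2500),
    ("large-v3", 3100, 3900),
    ("large-v3-q5_0", 1100, 1300),
    ("large-v3-turbo-q5_0", 574, 900),
    ("large-v3-turbo-q4_0", 470, 800),
    ("distil-large-v3.5-q5", 600, 1000),
]


def models_within_ram(budget_mb):
    """Return the names of models whose peak RAM fits within budget_mb."""
    return [name for name, _disk, ram in WHISPER_MODELS if ram <= budget_mb]
```

For example, `models_within_ram(1000)` excludes medium, large-v3, and large-v3-q5_0.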

The owner can also set a default language or leave it on auto for automatic detection. Supported languages include English, French, German, Spanish, Italian, Portuguese, Dutch, Japanese, Chinese, Korean, and Russian.

The instance owner selects the audio detection model from the same Admin → Inference Models page.

| Model | Disk | Accuracy (mAP) | License | Status |
|---|---|---|---|---|
| efficientat_mn04 | 5 MB | 0.432 | MIT | Planned |
| efficientat_mn10 (default) | 20 MB | 0.471 | MIT | Available |
| efficientat_mn40 | 280 MB | 0.487 | MIT | Planned |
| ced_tiny | 22 MB | 0.481 | Apache-2.0 | Planned |
| ced_small | 85 MB | 0.496 | Apache-2.0 | Planned |
| ced_base | 330 MB | 0.500 | Apache-2.0 | Planned |
| pann_cnn14 (legacy) | 313 MB | 0.431 | Apache-2.0 | Available |

The default efficientat_mn10 is a practical balance of size (20 MB) and accuracy. pann_cnn14 is the original baseline, kept selectable as a rollback option.

Models marked Planned are catalogued but their weights are not yet published to the model bucket. They appear greyed-out in the admin picker and cannot be selected — choosing one would otherwise fail every audio-detect job with a 404 when the worker tries to download the missing file. They become selectable once their weights are published.

Models download automatically the first time they are needed, so the initial run of a new model may take longer while the file is fetched.


Both pipelines run as background jobs in Alcoves’ async worker queue (Asynq). No user action blocks on inference — you trigger the job, the worker picks it up, and the editor polls for progress.

Transcription:

  1. The audio track is extracted from the source file by ffmpeg as mono 16 kHz WAV.
  2. The whisper.cpp CLI processes the WAV and emits plain text and WebVTT subtitle output.
  3. Anti-hallucination flags are always applied: no prior-segment context is fed between segments, non-speech tokens are suppressed, and a Silero VAD (voice-activity detection) model filters out silent regions that can cause whisper to repeat itself.
  4. Progress is tracked line by line from the whisper output and reported back to the editor in real time.
  5. When complete, the transcript text and VTT are stored on the file record. An activity event is emitted so the library Feed reflects the update.
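
The transcription steps can be approximated by hand with two commands. The Python sketch below only builds the argument lists; it does not claim to reproduce what the Alcoves worker runs. The flags are ones that ffmpeg and whisper.cpp's `whisper-cli` document, but which of them Alcoves actually passes is an assumption, as are the helper names, the 16-bit PCM codec, and the file paths.

```python
def ffmpeg_extract_cmd(src, wav, ffmpeg="ffmpeg"):
    """Build an ffmpeg command that writes the audio as mono 16 kHz WAV."""
    return [
        ffmpeg, "-y",
        "-i", src,
        "-vn",                # drop any video stream
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit PCM (assumed; the usual whisper.cpp input)
        wav,
    ]


def whisper_cmd(wav, model_path, out_prefix, language="auto",
                vad_model=None, binary="whisper-cli"):
    """Build a whisper-cli command with anti-hallucination options."""
    cmd = [
        binary,
        "-m", model_path,
        "-f", wav,
        "-l", language,
        "-otxt", "-ovtt",     # write plain text and WebVTT output
        "-of", out_prefix,    # output path without extension
        "-mc", "0",           # keep no prior-segment text context
        "-sns",               # suppress non-speech tokens
        "-pp",                # print progress while running
    ]
    if vad_model:
        # Voice-activity detection skips silent regions.
        cmd += ["--vad", "--vad-model", vad_model]
    return cmd
```

Either list can be passed to `subprocess.run(cmd, check=True)` on a machine where both binaries are installed.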

Audio detection:

  1. The audio track is extracted by ffmpeg as raw mono float32 PCM at the model’s required sample rate (16 kHz for CED models, 32 kHz for EfficientAT).
  2. The PCM is processed in 10-second windows using an ONNX model. Each window produces a probability score for each of 527 AudioSet sound categories.
  3. Only results above a minimum confidence threshold (0.2 by default) and the top results per window are kept.
  4. Results are stored as timestamped detection records, replacing any prior results for the file in a single atomic transaction.
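
The filtering in step 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the worker's code: model inference is left out, the per-window scores are taken as plain dictionaries, windows are assumed not to overlap, and a score equal to the threshold is kept. The function names and record fields are hypothetical.

```python
def select_labels(window_scores, top_k=5, threshold=0.2):
    """Keep labels scoring at least threshold, highest first, at most top_k."""
    kept = [(label, score) for label, score in window_scores.items()
            if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


def detect_timeline(per_window_scores, window_sec=10.0, top_k=5, threshold=0.2):
    """Turn a list of per-window score dicts into timestamped records."""
    records = []
    for index, scores in enumerate(per_window_scores):
        start = index * window_sec  # assumes back-to-back windows
        for label, score in select_labels(scores, top_k, threshold):
            records.append({
                "label": label,
                "start": start,
                "end": start + window_sec,
                "confidence": score,
            })
    return records
```

A real implementation would also clamp the last window's end time to the file's duration.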

These environment variables tune the transcription and audio detection pipelines. Most deployments can leave them at their defaults.

Transcription:

| Variable | Default | Description |
|---|---|---|
| ALCOVES_WHISPER_BINARY | whisper-cli | Path to the whisper.cpp CLI binary |
| ALCOVES_WHISPER_MODEL | large-v3 | Default model (overridden by admin setting) |
| ALCOVES_WHISPER_LANGUAGE | auto | Default language (overridden by admin setting) |
| ALCOVES_WHISPER_MODELS_DIR | {data}/.whisper | Local directory for downloaded model files |
| ALCOVES_WHISPER_MODEL_BASE_URL | (internal) | Base URL for model downloads |
| ALCOVES_WHISPER_VAD_MODEL | silero-v6.2.0 | VAD model for hallucination suppression; set empty to disable |

Audio detection:

| Variable | Default | Description |
|---|---|---|
| ALCOVES_AUDIO_DETECT_WINDOW_SEC | 10.0 | Analysis window length in seconds |
| ALCOVES_AUDIO_DETECT_TOP_K | 5 | Maximum labels kept per window |
| ALCOVES_AUDIO_DETECT_THRESHOLD | 0.2 | Minimum confidence score to keep a label |
| ALCOVES_MODELS_PATH | {data}/.models | Local directory for downloaded ONNX models |
| ALCOVES_AUDIO_DETECT_MODEL_BASE_URL | (internal) | Base URL for ONNX model downloads |

Shared:

| Variable | Default | Description |
|---|---|---|
| ALCOVES_FFMPEG_BINARY | ffmpeg | Path to the ffmpeg binary |
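
As an illustration of how these variables and their defaults combine, here is a small Python sketch that reads a subset of them from the environment. It is not how Alcoves itself loads configuration; the `load_settings` function and the returned keys are hypothetical. The directory and base-URL variables are omitted because their defaults depend on values not given here.

```python
import os

# Defaults copied from the tables above.
DEFAULTS = {
    "ALCOVES_WHISPER_BINARY": "whisper-cli",
    "ALCOVES_WHISPER_MODEL": "large-v3",
    "ALCOVES_WHISPER_LANGUAGE": "auto",
    "ALCOVES_WHISPER_VAD_MODEL": "silero-v6.2.0",
    "ALCOVES_AUDIO_DETECT_WINDOW_SEC": "10.0",
    "ALCOVES_AUDIO_DETECT_TOP_K": "5",
    "ALCOVES_AUDIO_DETECT_THRESHOLD": "0.2",
    "ALCOVES_FFMPEG_BINARY": "ffmpeg",
}


def load_settings(env=None):
    """Read tuning variables from env (os.environ by default) with defaults."""
    env = os.environ if env is None else env
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "whisper_binary": raw["ALCOVES_WHISPER_BINARY"],
        "whisper_model": raw["ALCOVES_WHISPER_MODEL"],
        "whisper_language": raw["ALCOVES_WHISPER_LANGUAGE"],
        # An empty value disables VAD, per the table above.
        "vad_model": raw["ALCOVES_WHISPER_VAD_MODEL"] or None,
        "window_sec": float(raw["ALCOVES_AUDIO_DETECT_WINDOW_SEC"]),
        "top_k": int(raw["ALCOVES_AUDIO_DETECT_TOP_K"]),
        "threshold": float(raw["ALCOVES_AUDIO_DETECT_THRESHOLD"]),
        "ffmpeg_binary": raw["ALCOVES_FFMPEG_BINARY"],
    }
```

Passing a plain dictionary instead of `os.environ` makes it easy to check how an override behaves before changing a deployment.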

When a video file is uploaded, Alcoves automatically enqueues both transcription and audio detection — no manual trigger needed. By the time you open the editor, the jobs may already be in progress or complete.