ML models & runtime inference
Alcoves ships four AI pipelines — face detection and recognition, COCO object detection, AudioSet audio-event tagging, and whisper.cpp speech transcription — all running locally on CPU with no cloud inference calls and no GPU dependency. This page explains how the engine works: how models reach disk, how they are loaded into memory, and how they turn raw media into structured metadata in the database.
Design principles
CPU-only, no GPU. ONNX-based models (face, object, audio) run through ONNX Runtime via Go bindings. Speech transcription is handled by the whisper-cli binary from whisper.cpp. There is no CUDA, no TensorRT, and no remote inference API.
Nothing leaves the instance. Inference is entirely local. The only outbound network traffic the inference subsystem makes is downloading model weights on first use, from a configurable mirror. Your media is never sent to a third party.
Models are not bundled in the Docker image. The image ships the ONNX Runtime shared library and the whisper-cli binary, but no model weights. Weights are fetched on demand and cached to disk. This keeps the image small and lets operators swap models without rebuilding.
All inference is async. No ML runs inside an HTTP request handler. API handlers enqueue jobs; worker processes (running when ALCOVES_MODE=all or ALCOVES_MODE=worker) pick up jobs, lazily download any missing models, run inference, and write results back to Postgres. The frontend polls the file’s status columns for live progress.
Preprocessing is baked into the graph where possible. Audio models embed the mel-spectrogram transform inside the ONNX graph, so workers feed raw mono PCM without implementing an FFT pipeline in Go. Image preprocessing (resize and normalize) is done in Go with libvips.
Model catalog
Face detection — SCRFD det_10g.onnx
Detects faces at strides 8/16/32, producing bounding boxes, five landmarks, and a confidence score per detection. Preprocessing resizes the longest side to 640 px, pads the remainder with black, and normalizes pixel values to (px - 127.5) / 128.0. Non-maximum suppression runs at IoU 0.4.
| Property | Value |
|---|---|
| File | det_10g.onnx |
| Disk size | ~17 MB |
| Task queue type | face:detect |
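The preprocessing arithmetic and the suppression step can be sketched in plain Go. This is a minimal illustration under stated assumptions, not the worker's code: the real resize is done with libvips, and the names `letterboxSize`, `normalizePixel`, `box`, `iou`, and `nms` are hypothetical. Greedy NMS is assumed here because the text only gives the IoU threshold.

```go
package main

import (
	"math"
	"sort"
)

// Values taken from the description above.
const (
	targetSide   = 640 // longest side after resize, in pixels
	iouThreshold = 0.4 // NMS overlap threshold
)

// letterboxSize returns the dimensions a w×h image is resized to so that its
// longest side equals targetSide; the rest of the canvas is padded with black.
func letterboxSize(w, h int) (newW, newH int) {
	scale := float64(targetSide) / math.Max(float64(w), float64(h))
	return int(math.Round(float64(w) * scale)), int(math.Round(float64(h) * scale))
}

// normalizePixel maps an 8-bit channel value to (px - 127.5) / 128.0.
func normalizePixel(px uint8) float32 {
	return (float32(px) - 127.5) / 128.0
}

// box is a hypothetical detection: corner coordinates plus a confidence score.
type box struct {
	x1, y1, x2, y2, score float64
}

// iou returns the intersection-over-union of two boxes.
func iou(a, b box) float64 {
	iw := math.Min(a.x2, b.x2) - math.Max(a.x1, b.x1)
	ih := math.Min(a.y2, b.y2) - math.Max(a.y1, b.y1)
	if iw <= 0 || ih <= 0 {
		return 0
	}
	inter := iw * ih
	union := (a.x2-a.x1)*(a.y2-a.y1) + (b.x2-b.x1)*(b.y2-b.y1) - inter
	return inter / union
}

// nms keeps the highest-scoring box of every overlapping group (greedy NMS).
func nms(boxes []box, threshold float64) []box {
	sorted := append([]box(nil), boxes...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].score > sorted[j].score })
	var kept []box
	for _, cand := range sorted {
		overlaps := false
		for _, k := range kept {
			if iou(cand, k) > threshold {
				overlaps = true
				break
			}
		}
		if !overlaps {
			kept = append(kept, cand)
		}
	}
	return kept
}
```

For a 1920×1080 frame, `letterboxSize` yields 640×360, leaving a 640×280 band to pad with black.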
Face recognition — ArcFace w600k_r50.onnx
Produces a 512-dimensional L2-normalized embedding for each detected face crop. Embeddings are stored as vector(512) in Postgres (pgvector) and clustered into named People via HNSW cosine nearest-neighbor search. Tensor names are probed at load time across nine known input/output name combinations, which lets the same Go code work with ArcFace weights from multiple sources.
| Property | Value |
|---|---|
| File | w600k_r50.onnx |
| Disk size | ~167 MB |
| Task queue type | face:detect (same job as detection) |
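Because the embeddings are L2-normalized, cosine distance reduces to one minus the dot product. The sketch below shows that arithmetic in Go for illustration only; in Alcoves the search itself runs inside Postgres via pgvector. The helper names and the 0.5 threshold in the usage note are assumptions.

```go
package main

import "math"

// l2Normalize returns a unit-length copy of v (or all zeros if v has no length).
func l2Normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	out := make([]float32, len(v))
	if sum == 0 {
		return out
	}
	norm := math.Sqrt(sum)
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

// cosineDistance assumes a and b are L2-normalized and of equal length,
// so the distance is simply 1 - dot(a, b).
func cosineDistance(a, b []float32) float64 {
	var dot float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
	}
	return 1 - dot
}

// isMatch is a hypothetical helper: two faces match when their distance
// does not exceed maxDistance.
func isMatch(a, b []float32, maxDistance float64) bool {
	return cosineDistance(a, b) <= maxDistance
}
```

A call such as `isMatch(e1, e2, 0.5)` treats two embeddings as the same person when they are at most 0.5 apart; the actual cutoff is whatever ALCOVES_FACE_RECOGNITION_MAX_DISTANCE is set to.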
Object detection — YOLO26x yolo26x_fp16.onnx
Labels files with COCO-80 object classes (person, car, dog, and so on). The model is NMS-free: it returns 300 already-deduplicated proposals, so no post-processing NMS runs in Go. Preprocessing resizes directly to 640×640 and normalizes to [0, 1].
| Property | Value |
|---|---|
| File | yolo26x_fp16.onnx |
| Disk size | ~107 MB |
| Labels | COCO-80 |
| Task queue type | object:detect |
Audio-event detection — selectable registry
Classifies audio into 527 AudioSet event classes (music, speech, dog bark, machinery, and so on). Each model in the registry bundles the mel-spectrogram transform inside the ONNX graph so the worker only feeds raw mono PCM. The active model is selectable from the admin panel at runtime.
The registry catalogues several models, but a model is only selectable once its ONNX weights are mirrored to the model bucket. Entries whose weights are not yet published carry Available: false in audiodetection.Registry: the admin API rejects selecting them, the picker renders them disabled, and LookupSpec falls back to the default for any stored-but-unavailable selection — so a stale setting can never make the worker 404 on a missing file. Flip Available to true in the same change that uploads the artifact.
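The availability gate described above can be sketched as follows. This is a trimmed-down illustration, not the contents of `audiodetection.Registry`: the struct has only two fields, and `registry`, `validateSelection`, and `lookupSpec` are hypothetical lowercase stand-ins for the real identifiers.

```go
package main

import "fmt"

// ModelSpec is a reduced, illustrative registry entry.
type ModelSpec struct {
	ID        string
	Available bool // true once the ONNX weights are mirrored to the model bucket
}

const defaultModelID = "efficientat_mn10"

// registry holds three of the catalogued models for the sake of the example.
var registry = map[string]ModelSpec{
	"efficientat_mn10": {ID: "efficientat_mn10", Available: true},
	"ced_base":         {ID: "ced_base", Available: false},
	"pann_cnn14":       {ID: "pann_cnn14", Available: true},
}

// validateSelection is the check an admin API could run before persisting.
func validateSelection(id string) error {
	spec, ok := registry[id]
	if !ok {
		return fmt.Errorf("unknown audio model %q", id)
	}
	if !spec.Available {
		return fmt.Errorf("audio model %q is not published yet", id)
	}
	return nil
}

// lookupSpec resolves a stored setting, falling back to the default when the
// stored ID is unknown or its weights are not available.
func lookupSpec(id string) ModelSpec {
	if spec, ok := registry[id]; ok && spec.Available {
		return spec
	}
	return registry[defaultModelID]
}
```

With this shape, a stale `ced_base` setting resolves to `efficientat_mn10` at job time instead of producing a download of a file that does not exist.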
| ID | Disk | mAP | Status | Notes |
|---|---|---|---|---|
| efficientat_mn04 | 5 MB | 0.432 | Planned | Smallest; fast on constrained hardware |
| efficientat_mn10 (default) | 20 MB | 0.471 | Available | Best balance of size and accuracy |
| efficientat_mn40 | 280 MB | 0.487 | Planned | Higher accuracy, larger footprint |
| ced_tiny | 22 MB | 0.481 | Planned | Good accuracy-to-size ratio |
| ced_small | 85 MB | 0.496 | Planned | |
| ced_base | 330 MB | 0.500 | Planned | Highest mAP in the registry |
| pann_cnn14 (legacy) | 313 MB | 0.431 | Available | Kept for rollback; not recommended for new installs |
| Property | Value |
|---|---|
| Task queue type | file:audio-detect |
| Labels | AudioSet 527-class |
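Each inference window yields one probability per AudioSet class, which is then reduced to a few labels. The sketch below shows one plausible reduction using the defaults listed under Environment variables (threshold 0.2, top 5). The names `labelScore` and `topLabels` are hypothetical, and treating the threshold as inclusive is an assumption.

```go
package main

import "sort"

// labelScore pairs a class index (a row in the AudioSet label CSV) with its probability.
type labelScore struct {
	Index int
	Prob  float32
}

// topLabels keeps classes whose probability is at least threshold and returns
// at most topK of them, highest probability first.
func topLabels(probs []float32, threshold float32, topK int) []labelScore {
	var kept []labelScore
	for i, p := range probs {
		if p >= threshold {
			kept = append(kept, labelScore{Index: i, Prob: p})
		}
	}
	sort.Slice(kept, func(a, b int) bool { return kept[a].Prob > kept[b].Prob })
	if len(kept) > topK {
		kept = kept[:topK]
	}
	return kept
}
```

Applying the threshold before the top-K cut means a quiet window can legitimately produce fewer than K labels, or none at all.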
Speech transcription — whisper.cpp GGML models
Produces timestamped transcripts (transcript_text and transcript_vtt) from speech in video and audio files. The active model and language are selectable from the admin panel. Silero VAD (voice activity detection) runs alongside every transcription job to suppress repetition-loop hallucinations on non-speech audio; it is enabled by default.
| ID | Disk | RAM peak | WER (clean) | Notes |
|---|---|---|---|---|
| tiny | 75 MB | 390 MB | 7.5% | Fastest; useful for previews |
| base | 142 MB | 500 MB | 5.0% | |
| small | 466 MB | 1000 MB | 3.4% | |
| medium | 1500 MB | 2500 MB | 3.0% | |
| large-v3 (default) | 3100 MB | 3900 MB | 2.7% | Highest accuracy |
| large-v3-q5_0 | 1080 MB | 1300 MB | 2.9% | Quantized; good accuracy at lower RAM |
| large-v3-turbo-q5_0 | 574 MB | 900 MB | 3.0% | ~8× faster than large-v3 |
| large-v3-turbo-q4_0 | 470 MB | 800 MB | 3.2% | Smallest near-SOTA option |
| distil-large-v3.5-q5 | 600 MB | 1000 MB | 3.0% | English-only; fast |
| Property | Value |
|---|---|
| Task queue type | file:transcribe |
| Languages | auto, en, fr, de, es, it, pt, nl, ja, zh, ko, ru |
On-demand model downloads
Model weights are not bundled in the Docker image. Instead, the worker downloads them on first use and caches them to disk. All four pipelines follow the same defensive pattern:
- Stat check. If the file already exists on disk and exceeds a minimum size threshold, the download is skipped.
- Atomic write. The download writes to a temporary path first (.tmp for ONNX, .part for whisper), then atomically renames it into place. A crash or interrupted download never leaves a corrupt file at the final path.
- Retry with backoff. Up to six attempts with exponential backoff, capped at 30 seconds per delay (1 s, 2 s, 4 s, 8 s, 16 s, 30 s). Network errors and 5xx responses are treated as transient.
- HTML/pointer rejection. If the response body looks like HTML (such as a Git LFS pointer page or a CDN error page served with a 200 status), the download is rejected. This prevents silently caching a text pointer as a model file.
- Minimum-size validation. ONNX models must be larger than 1 MB; the AudioSet labels CSV must be at least 1024 bytes. Anything smaller is discarded and retried.
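The checks in this list can be sketched with the standard library alone. Everything below is illustrative: the function names are hypothetical, the body is held in memory for brevity where a real downloader would stream to the temporary file, and matching on the Git LFS pointer prefix is an assumption about what "looks like HTML" might cover.

```go
package main

import (
	"bytes"
	"errors"
	"os"
	"time"
)

const maxDelay = 30 * time.Second

// backoffDelay returns the wait for a 0-based attempt index:
// 1 s, 2 s, 4 s, 8 s, 16 s, then capped at 30 s.
func backoffDelay(attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	if attempt >= 5 {
		return maxDelay
	}
	return time.Second << uint(attempt)
}

// isTransientStatus reports whether an HTTP status is worth retrying (5xx).
func isTransientStatus(code int) bool {
	return code >= 500 && code <= 599
}

// looksLikeHTML inspects the start of a response body for an HTML page or a
// Git LFS pointer file served in place of the binary weights.
func looksLikeHTML(head []byte) bool {
	h := bytes.ToLower(bytes.TrimSpace(head))
	return bytes.HasPrefix(h, []byte("<!doctype html")) ||
		bytes.HasPrefix(h, []byte("<html")) ||
		bytes.HasPrefix(h, []byte("version https://git-lfs"))
}

// installAtomically validates body, writes it to finalPath+tmpSuffix, and
// renames it into place so finalPath never holds a partial file.
func installAtomically(finalPath, tmpSuffix string, body []byte, minSize int) error {
	if looksLikeHTML(body) {
		return errors.New("response looks like HTML or an LFS pointer, not model weights")
	}
	if len(body) < minSize {
		return errors.New("download is smaller than the minimum size")
	}
	tmp := finalPath + tmpSuffix
	if err := os.WriteFile(tmp, body, 0o644); err != nil {
		return err
	}
	if err := os.Rename(tmp, finalPath); err != nil {
		_ = os.Remove(tmp) // best-effort cleanup of the temporary file
		return err
	}
	return nil
}
```

The rename is only atomic when the temporary file and the final path are on the same filesystem, which is why the temporary path sits next to the destination rather than in a system temp directory.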
The first job of a given type may take longer while the model downloads. On startup, the server kicks off a background goroutine that pre-fetches the face and object detection models so those are ready before the first job arrives. Audio and whisper models are fetched lazily on first use.
Model cache directories:
| Env var | Default | Contents |
|---|---|---|
| ALCOVES_MODELS_PATH | ./data/.models | ONNX weights (face, object, audio) |
| ALCOVES_WHISPER_MODELS_DIR | ./data/.whisper | Whisper GGML .bin files |
Runtime model selection
Two of the four pipelines let the instance owner change the active model from the admin panel without restarting the server or editing environment variables.
- Audio-event detection — the admin panel “Inference Models” section shows every model in the registry with its disk size and mAP score. Selecting a new model takes effect on the next queued job; the new weights are downloaded automatically if not already on disk.
- Speech transcription — the admin can select both the model and the target language. Changing the model applies to all subsequent transcription jobs.
The admin panel validates selections against the model registry before persisting the setting. The environment variables ALCOVES_WHISPER_MODEL and ALCOVES_WHISPER_LANGUAGE serve as boot-time fallbacks when no admin setting has been saved yet.
Face detection, face recognition, and object detection use fixed models and are not selectable from the admin panel.
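The precedence between the admin setting and the boot-time fallback can be sketched like this. It is an assumption-laden illustration: `resolveWhisperModel` and `knownWhisperModels` are hypothetical names, and ignoring an invalid stored value instead of failing is a guess at the behavior.

```go
package main

// knownWhisperModels lists the IDs from the transcription model table.
var knownWhisperModels = map[string]bool{
	"tiny": true, "base": true, "small": true, "medium": true,
	"large-v3": true, "large-v3-q5_0": true, "large-v3-turbo-q5_0": true,
	"large-v3-turbo-q4_0": true, "distil-large-v3.5-q5": true,
}

const defaultWhisperModel = "large-v3"

// resolveWhisperModel picks the active model: a valid admin setting wins,
// then a valid ALCOVES_WHISPER_MODEL value, then the built-in default.
// getenv is injected so the function can be exercised without touching the
// process environment; production code would pass os.Getenv.
func resolveWhisperModel(adminSetting string, getenv func(string) string) string {
	if knownWhisperModels[adminSetting] {
		return adminSetting
	}
	if env := getenv("ALCOVES_WHISPER_MODEL"); knownWhisperModels[env] {
		return env
	}
	return defaultWhisperModel
}
```

Resolving the model at job time, not at boot, is what lets a change in the admin panel apply to the next queued job without a restart.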
Data flow
Each AI pipeline follows the same shape:
- An API handler (or a bulk endpoint) enqueues a typed Asynq task with a two-hour deduplication window, so rapid re-triggers and pod races produce a single job.
- The worker validates the file: it must exist, must not be trashed, and must have an appropriate MIME type.
- The worker sets the file’s <job>_status column to processing and records the requested job version.
- The model, labels (if any), and ONNX session are loaded from cache, downloading if necessary.
- Inference runs. Audio and transcription jobs update <job>_progress as they advance through the file.
- A database transaction replaces any prior results with the new rows and sets <job>_status to ready (or failed on error).
Progress and status columns
Every async ML job writes a consistent set of columns on the files table. The frontend polls these to display live progress indicators and trigger UI updates.
| Column | Value |
|---|---|
| <job>_status | queued, processing, ready, failed, or null |
| <job>_progress | 0–100 (integer percentage) |
| <job>_eta_seconds | nullable integer |
| <job>_error | nullable error message |
| <job>_version | incremented to request a rerun |
| <job>ed_version | set equal to <job>_version when complete |
Job prefixes in use: proxy, transcribe, audio_detect, waveform. Transcription additionally writes transcript_text, transcript_vtt, and transcript_model; audio detection additionally writes audio_detect_model.
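The interplay between the two version columns can be modelled in memory as below. This is a sketch of one reading of the scheme, not the actual schema code: the type and method names are hypothetical, an empty status string stands in for NULL, and recording the version captured at start (so a rerun requested mid-run stays pending) is an assumption.

```go
package main

// jobColumns mirrors a subset of the per-job columns for one file.
type jobColumns struct {
	Status      string // "", "queued", "processing", "ready", or "failed"
	Progress    int    // integer percentage, 0 to 100
	Error       string
	Version     int // <job>_version: bumped to request a rerun
	DoneVersion int // <job>ed_version: last version that finished
}

// requestRerun bumps the requested version and resets the visible state.
func (j *jobColumns) requestRerun() {
	j.Version++
	j.Status, j.Progress, j.Error = "queued", 0, ""
}

// needsRun reports whether the requested version has not completed yet.
func (j *jobColumns) needsRun() bool {
	return j.DoneVersion != j.Version
}

// start marks the job as processing and returns the version being worked on.
func (j *jobColumns) start() int {
	j.Status = "processing"
	return j.Version
}

// finish records the outcome for the version captured by start.
func (j *jobColumns) finish(version int, err error) {
	if err != nil {
		j.Status, j.Error = "failed", err.Error()
		return
	}
	j.Status, j.Progress, j.DoneVersion = "ready", 100, version
}
```

Comparing the two counters, instead of relying on the status string, lets a reprocess request be expressed as a single increment.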
Asynq task types
| Task type | Pipeline | What triggers it |
|---|---|---|
| face:detect | Face detection + recognition | File upload, library face-recognition enable, reprocess endpoint |
| object:detect | Object detection | File upload, library object-detection enable, reprocess endpoint |
| file:audio-detect | Audio-event tagging | Per-file and bulk audio-detect endpoints |
| file:transcribe | Speech transcription | Per-file and bulk transcribe endpoints |
Workers only run when ALCOVES_MODE is all or worker.
Session management
Each ONNX service loads its sessions differently, reflecting how often the underlying model can change:
| Pipeline | Strategy | Hot-swap while running? |
|---|---|---|
| Face detection | Session loaded once per process (sync.Once) | No |
| Object detection | Session loaded once per process (sync.Once) | No |
| Audio detection | Session cached by "{modelPath}\|{sampleRate}" key; new key triggers a reload | Yes (admin-selectable) |
| Speech transcription | No persistent session; whisper-cli is spawned fresh per job | Yes (admin-selectable) |
When an admin changes the audio model, the old ONNX session is intentionally not destroyed before creating the new one. Destroying an in-use session risks a use-after-free if an in-flight inference still references it; since model switches are rare admin actions, leaving one session in memory is the safer trade-off.
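A keyed cache with this behavior might look like the following. It is a minimal sketch: `session` is a placeholder for a real ONNX Runtime session, `sessionCache` and its `load` hook are hypothetical, and the example deliberately contains no destroy call for the replaced session.

```go
package main

import (
	"fmt"
	"sync"
)

// session stands in for an ONNX Runtime session handle.
type session struct {
	modelPath string
}

// sessionCache holds the single active audio session, keyed by model and rate.
type sessionCache struct {
	mu      sync.Mutex
	key     string
	current *session
	load    func(modelPath string, sampleRate int) (*session, error)
}

// get returns the cached session when the key matches and loads a new one
// otherwise. The previous session is dropped without being destroyed, so an
// in-flight inference that still holds it keeps working.
func (c *sessionCache) get(modelPath string, sampleRate int) (*session, error) {
	key := fmt.Sprintf("%s|%d", modelPath, sampleRate)
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.current != nil && c.key == key {
		return c.current, nil
	}
	s, err := c.load(modelPath, sampleRate)
	if err != nil {
		return nil, err // keep serving the old session if the new one fails to load
	}
	c.key, c.current = key, s
	return s, nil
}
```

Holding the mutex across the load serializes concurrent switches, which is acceptable when model changes are rare admin actions.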
The audio detection session is load-probed across 96 input/output name combinations on first use (12 input names × 8 output names, running a one-second silent inference per combination). This is what lets a single Go worker handle weights from EfficientAT, CED, and PANNs without requiring model-family-specific code.
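The probing loop amounts to a nested search over candidate names. In this sketch the `try` callback stands in for "create a session with these names and run one second of silence through it"; `probeTensorNames`, `tryFunc`, and `errNoMatch` are hypothetical names, and the candidate lists are left to the caller.

```go
package main

import "errors"

var errNoMatch = errors.New("no input/output name combination worked")

// tryFunc attempts a session with the given tensor names and reports failure.
type tryFunc func(input, output string) error

// probeTensorNames tries every input/output pair in order and returns the
// first pair that works, together with the number of attempts made.
func probeTensorNames(inputs, outputs []string, try tryFunc) (in, out string, attempts int, err error) {
	for _, i := range inputs {
		for _, o := range outputs {
			attempts++
			if try(i, o) == nil {
				return i, o, attempts, nil
			}
		}
	}
	return "", "", attempts, errNoMatch
}
```

With 12 input names and 8 output names the worst case is 96 attempts, paid once per session load.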
Docker ML build
backend/Dockerfile is a multi-stage build. It ships the inference runtime but not the model weights.
- Build stage (golang:1.26-bookworm): installs libvips-dev (image preprocessing), ffmpeg (audio/video decoding), libgomp1 (OpenMP for ONNX), and build tools. Built with CGO_ENABLED=1.
- ONNX Runtime v1.26.0: downloaded from GitHub releases with architecture-aware selection (arm64 or x64), installed to /usr/local/lib, stripped, and symlinked as onnxruntime.so. ldconfig runs after install. The runtime must be 1.26.x to match the version pinned by onnxruntime_go.
- whisper.cpp v1.8.4 (separate stage): shallow-cloned at the pinned tag, built with CMake Release mode (AVX/AVX2/FMA/F16C enabled; AVX-512 disabled). Produces whisper-cli and the required shared libraries.
- Final image (debian:bookworm-slim): copies the ONNX Runtime library, whisper-cli, and whisper shared libraries. Sets ENV LD_LIBRARY_PATH=/usr/local/lib. This is required because the Go ONNX bindings call dlopen("onnxruntime.so") without an absolute path, and onnxruntime.so is not a SONAME, so ldconfig alone does not resolve it.
Environment variables
| Variable | Default | Purpose |
|---|---|---|
| ALCOVES_MODELS_PATH | ./data/.models | ONNX model cache directory (face, object, audio) |
| ALCOVES_WHISPER_MODELS_DIR | ./data/.whisper | Whisper .bin model cache directory |
| ALCOVES_WHISPER_BINARY | whisper-cli | Path to the whisper.cpp CLI binary |
| ALCOVES_WHISPER_MODEL | large-v3 | Boot-time fallback model (admin panel overrides this) |
| ALCOVES_WHISPER_LANGUAGE | auto | Boot-time fallback language |
| ALCOVES_WHISPER_MODEL_BASE_URL | https://s3.rustyguts.net/models | Whisper weight download mirror |
| ALCOVES_WHISPER_VAD_MODEL | silero-v6.2.0 | Silero VAD model ID; set to "" to disable |
| ALCOVES_AUDIO_DETECT_MODEL_BASE_URL | https://s3.rustyguts.net/models | Audio model weight download mirror |
| ALCOVES_AUDIO_DETECT_LABELS_URL | …/audioset_class_labels_indices.csv | AudioSet 527-class label CSV URL |
| ALCOVES_AUDIO_DETECT_WINDOW_SEC | 10.0 | Inference window length in seconds |
| ALCOVES_AUDIO_DETECT_THRESHOLD | 0.2 | Minimum probability to keep an audio label |
| ALCOVES_AUDIO_DETECT_TOP_K | 5 | Maximum labels retained per window |
| ALCOVES_FACE_DETECTION_MIN_SCORE | 0.28–0.5 | SCRFD detection confidence threshold |
| ALCOVES_FACE_RECOGNITION_MAX_DISTANCE | 0.42–0.6 | Maximum cosine distance for a face match |
| ALCOVES_FACE_RECOGNITION_MIN_FACES | 2–3 | Minimum cluster size to create a Person |
| ALCOVES_FACE_RECOGNITION_NEIGHBOR_LOOKUP | 80 | kNN candidates during face assignment |
| ALCOVES_OBJECT_DETECTION_MIN_SCORE | 0.25 | Object detection confidence floor |
| ALCOVES_OBJECT_DETECTION_MAX_DETECTIONS | 100 | Maximum detections per file |
| ALCOVES_FFMPEG_BINARY | ffmpeg | Path to ffmpeg (used for audio extraction) |
Testing (real-data inference)
Each pipeline has a real-data test that runs actual inference end to end against committed sample media, not mocks, so that a model, runtime, or preprocessing regression is caught rather than papered over:
| Pipeline | File | What it asserts on real input |
|---|---|---|
| Waveform | services/waveform/realmedia_test.go | ffmpeg-decoded tones with a known amplitude envelope → peaks; full ProcessTask → cache JSON |
| Object detection | services/objectdetection/realinference_test.go | sample images → expected COCO labels (person/dog/bicycle) + box/score invariants |
| Face detection | services/facedetection/realinference_test.go | faces found; 512-dim L2-normed embeddings; same face matches across re-encode, different people don’t |
| Audio events | services/audiodetection/realinference_test.go | speech clip → Speech; 527-class output range; full ProcessTask → DB rows |
| Transcription | services/transcribe/realinference_test.go | real whisper-cli (tiny) → transcript contains the spoken words |
Fixtures plus their provenance and licenses live in internal/testsupport/testdata/ (AI-generated faces + CC0 images + a locally synthesized speech clip). Shared setup is in internal/testsupport (mlfixtures.go, onnxtest/).
Extending the model registry
Add an audio-event model
Export the model with scripts/export-audio-tagger.py (PyTorch → ONNX opset 17, with the mel-spectrogram transform baked in). The exported model must accept waveform: float32 [batch, samples] and produce clipwise_output: float32 [batch, 527] (post-sigmoid AudioSet probabilities). Mirror the file to the model bucket, then add a ModelSpec entry to audiodetection/registry.go with the file name, sample rate, mAP, and license. The admin panel and validation logic pick it up automatically.
Add a whisper model
Add a WhisperModelSpec entry to transcribe/whisper_models.go, then mirror the corresponding ggml-<id>.bin file using scripts/upload-whisper-models.sh. The entry immediately becomes valid input for ALCOVES_WHISPER_MODEL and selectable in the admin panel.
Point at a different mirror or air-gapped host
Set ALCOVES_WHISPER_MODEL_BASE_URL and ALCOVES_AUDIO_DETECT_MODEL_BASE_URL to your mirror URL. Face and object detection model URLs are currently hard-coded constants in the source and require a code change to redirect.
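As a rough illustration of how a base URL and a model ID could combine into a download URL, consider the sketch below. The layout (the ggml-<id>.bin file sitting directly under the base URL) is an assumption, and `whisperModelURL` and `envOr` are hypothetical helpers, not functions from the codebase.

```go
package main

import (
	"os"
	"strings"
)

// defaultWhisperBaseURL matches the documented default mirror.
const defaultWhisperBaseURL = "https://s3.rustyguts.net/models"

// envOr returns the value of the environment variable key, or fallback when
// the variable is unset or empty.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

// whisperModelURL joins a mirror base URL and a model ID, tolerating a
// trailing slash on the base. Assumes files are named ggml-<id>.bin and live
// directly under the base URL.
func whisperModelURL(baseURL, modelID string) string {
	return strings.TrimRight(baseURL, "/") + "/ggml-" + modelID + ".bin"
}
```

A worker could then call `whisperModelURL(envOr("ALCOVES_WHISPER_MODEL_BASE_URL", defaultWhisperBaseURL), "tiny")` to locate the weights on whichever mirror is configured.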