
Media processing pipeline

Alcoves processes media in two distinct modes: request-driven (image transforms delivered synchronously via a cache-backed proxy) and async (video transcoding, waveform extraction, and thumbnail generation run as background jobs). All derived artifacts are stored through the same pluggable blob storage layer used for original files.

This page covers the non-ML media pipeline. AI-driven analysis (face recognition, object detection, audio event tagging, transcription) is a separate concern documented in the AI feature docs.

| Pipeline | Trigger | Output |
|---|---|---|
| Image transform | HTTP request (lazy, cached) | Resized/converted derivative in cache storage |
| Video proxy | Upload or manual re-encode | H.264/AAC MP4 + WebP thumbnail |
| Video thumbnail | Upload or manual reprocess | JPEG or WebP thumbnail derived file |
| Audio waveform | Upload | waveform.json in cache storage |
| Avatar normalization | Avatar upload | 512 px square WebP |

Shared infrastructure used by all pipelines:

  • libvips (via govips) — image decode, resize, color-space normalization, encode
  • ffmpeg / ffprobe — video transcoding, thumbnail extraction, PCM audio extraction
  • Asynq job queue (backed by Dragonfly/Redis) — async task dispatch, deduplication, and retention
  • Storage service — unified blob I/O across local-disk and S3 backends, scoped to Files, Avatars, and Cache

Worker processes register with the Asynq mux when the server runs in all or worker mode (ALCOVES_MODE). Each job type runs on its own named queue — the single source of truth for queue names and weights is the internal/queues package. Weights follow importance ÷ complexity: how much a user is blocked on the result, divided by how long the job takes (so a heavy job class never hogs the worker pool ahead of fast ones).

| Queue | Weight | Carries |
|---|---|---|
| imageproxy | 100 | Interactive, on-demand image transforms (a user is blocked on these) |
| metadata | 70 | EXIF/GPS + ffprobe extraction (fast; unblocks Timeline, Map, file details) |
| thumbnail | 65 | Video poster-frame extraction (fast ffmpeg seek) |
| hash | 60 | SHA-256 content hashing for dedup |
| default | 50 | Retained only as a drain target for tasks enqueued by an older build during an upgrade; no new work routes here |
| moment-export | 45 | User-initiated clip encodes |
| waveform | 40 | Audio waveform peaks |
| object-detection | 30 | YOLO ONNX inference (background) |
| face-detection | 30 | Face ONNX inference + clustering (background) |
| audio-detection | 25 | AudioSet ONNX inference (background) |
| video-transcode | 10 | Full video proxy transcodes (heavy, multi-minute ffmpeg work) |
| transcription | 5 | Whisper speech-to-text (the longest-running job class) |
| maintenance | 1 | Low-priority background upkeep: the hourly image-proxy variant pre-warm |

The ladder means “interactive first, fast derivations next, then ML inference, then the heavy long-runners, with upkeep last”: a large transcription or video-transcode backlog can never starve interactive image loads or fast jobs like thumbnailing. Asynq samples non-strictly across non-empty queues in proportion to these weights, so a low-priority queue still drains fully when it’s the only one with work.
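The effect of weighted, non-strict sampling can be illustrated with a small sketch. This is not Asynq's actual scheduler; it is a minimal stand-in that picks one queue among the non-empty ones with probability proportional to its weight. The names `PickQueue` and `TotalWeight` are hypothetical, and only a few rows of the table are reproduced.

```go
package main

import "sort"

// Weights mirrors a few rows of the queue table above. The real source of
// truth is the internal/queues package; this copy is for illustration only.
var Weights = map[string]int{
	"imageproxy":      100,
	"thumbnail":       65,
	"video-transcode": 10,
	"maintenance":     1,
}

// TotalWeight sums the weights of the queues that currently have work.
func TotalWeight(weights map[string]int, nonEmpty []string) int {
	total := 0
	for _, name := range nonEmpty {
		total += weights[name]
	}
	return total
}

// PickQueue maps a draw in [0, TotalWeight) onto one non-empty queue, so a
// uniformly random draw selects each queue in proportion to its weight.
// It returns "" when no queue has work or the draw is out of range.
func PickQueue(weights map[string]int, nonEmpty []string, draw int) string {
	names := append([]string(nil), nonEmpty...)
	sort.Strings(names) // fixed order so a given draw always maps to one queue
	for _, name := range names {
		w := weights[name]
		if draw < w {
			return name
		}
		draw -= w
	}
	return ""
}
```

With `imageproxy` and `video-transcode` both non-empty, 100 of every 110 draws land on `imageproxy`. When `maintenance` is the only queue with work, every draw lands on it, which is why a low-weight queue still drains fully on an otherwise idle worker. A caller would typically pass `rand.Intn(TotalWeight(...))` as the draw.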


Every image displayed in the Alcoves UI flows through an authenticated proxy route rather than serving original bytes directly. The proxy resizes and converts on first request, caches the derivative, and serves subsequent requests from cache with a one-year immutable Cache-Control header.

The proxy accepts four query parameters:

| Parameter | Values | Default |
|---|---|---|
| width | 1–4096 | unconstrained |
| height | 1–4096 | unconstrained |
| quality | 1–100 | 80 |
| format | jpeg, webp, avif, png | jpeg |

Width and height are clamped to 4096 to prevent memory exhaustion from crafted requests against the libvips allocator.
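A minimal sketch of how these parameters could be parsed, defaulted, and clamped. The type and function names are hypothetical, and rejecting malformed values with an error (rather than silently falling back to the default) is an assumption that the text above does not confirm.

```go
package main

import (
	"fmt"
	"strconv"
)

const maxDimension = 4096

// TransformParams holds the four proxy parameters. A zero Width or Height
// means "unconstrained" in this sketch.
type TransformParams struct {
	Width, Height int
	Quality       int
	Format        string
}

var allowedFormats = map[string]bool{"jpeg": true, "webp": true, "avif": true, "png": true}

// ParseTransformParams applies the documented defaults (quality 80, format
// jpeg) and clamps oversized dimensions to 4096.
func ParseTransformParams(q map[string]string) (TransformParams, error) {
	p := TransformParams{Quality: 80, Format: "jpeg"}
	var err error
	if p.Width, err = parseDimension(q["width"]); err != nil {
		return p, fmt.Errorf("width: %w", err)
	}
	if p.Height, err = parseDimension(q["height"]); err != nil {
		return p, fmt.Errorf("height: %w", err)
	}
	if v := q["quality"]; v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 || n > 100 {
			return p, fmt.Errorf("quality must be between 1 and 100")
		}
		p.Quality = n
	}
	if v := q["format"]; v != "" {
		if !allowedFormats[v] {
			return p, fmt.Errorf("unsupported format %q", v)
		}
		p.Format = v
	}
	return p, nil
}

// parseDimension returns 0 for an absent value and clamps large values so a
// crafted request cannot ask libvips for an enormous canvas.
func parseDimension(v string) (int, error) {
	if v == "" {
		return 0, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 1 {
		return 0, fmt.Errorf("must be a positive integer")
	}
	if n > maxDimension {
		n = maxDimension
	}
	return n, nil
}
```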

Rather than scattering sizes across the UI, every distinct transform the app requests is named once in a shared variant registry. The registry is mirrored on both sides of the stack and must be changed in lockstep:

  • Backend: backend/internal/services/imageproxy/variants.go (imageproxy.Variants)
  • Frontend: frontend/shared/image-variants.ts (IMAGE_VARIANTS)
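
| Variant | Box | Quality | Format | Sizing | Used by |
|---|---|---|---|---|---|
| search | 80×80 | 70 | jpeg | fixed | Search-result avatars |
| timeline | 240×240 | 70 | webp | fixed | Timeline grid |
| face | 300×300 | 80 | jpeg | fixed | People / face grid |
| card | 720×360 | 82 | jpeg | capped | Library browser cards |
| preview | 1920×1080 | 90 | jpeg | capped | Full-screen lightbox |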
preview1920×108090jpegcappedFull-screen lightbox

A capped variant clamps its box down to the source image’s own pixel dimensions (a 500 px-wide original is requested at w500, not w720), so the cache key matches the source exactly and no oversized box is stored. A fixed variant always requests the full box. Both the frontend URL builder (resolveVariant) and the backend pre-warm job (Variant.Resolve) apply the identical rule, guaranteeing the cache key a request produces is exactly the one the pre-warm job generated. The registry carries a VariantsVersion; bumping it (plus a one-line migration that resets image_proxy_warmed_version) re-warms every image against the new set.
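The fixed/capped rule can be sketched as follows. The registry values come from the table above, but the struct fields and the `Resolve` signature are assumptions, as is clamping each axis independently; the real imageproxy.Variants and Variant.Resolve may differ.

```go
package main

// Sizing says whether a variant always requests its full box or shrinks the
// box to the source dimensions.
type Sizing int

const (
	Fixed Sizing = iota
	Capped
)

// Variant is a hypothetical shape for one registry entry.
type Variant struct {
	Name          string
	Width, Height int
	Quality       int
	Format        string
	Sizing        Sizing
}

// Variants reproduces the registry table for illustration.
var Variants = []Variant{
	{"search", 80, 80, 70, "jpeg", Fixed},
	{"timeline", 240, 240, 70, "webp", Fixed},
	{"face", 300, 300, 80, "jpeg", Fixed},
	{"card", 720, 360, 82, "jpeg", Capped},
	{"preview", 1920, 1080, 90, "jpeg", Capped},
}

// Resolve returns the box to request for a source of the given pixel size.
// Fixed variants ignore the source; capped variants never exceed it.
// Unknown source dimensions (0) leave the box unchanged.
func (v Variant) Resolve(srcW, srcH int) (w, h int) {
	w, h = v.Width, v.Height
	if v.Sizing == Capped {
		if srcW > 0 && srcW < w {
			w = srcW
		}
		if srcH > 0 && srcH < h {
			h = srcH
		}
	}
	return w, h
}
```

Because the frontend and the pre-warm job must produce identical cache keys, any real implementation would need the same rule on both sides, including the handling of unknown source dimensions.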

Each transform applies the following steps in order:

  1. Decode source bytes from storage.
  2. Auto-rotate based on EXIF orientation.
  3. Convert to sRGB — bakes the ICC profile before metadata is stripped, so wide-gamut sources render correctly rather than washed out.
  4. Fit-inside resize — maintains aspect ratio, uses linear interpolation, never upscales.
  5. Encode to the requested format at the requested quality (default 80). JPEG output is progressive and uses optimized Huffman coding tables; all formats strip metadata.
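The fit-inside rule in step 4 comes down to a small piece of arithmetic. The sketch below computes only the output dimensions (the actual resampling is done by libvips); `FitInside` is a hypothetical helper, and rounding to the nearest pixel is an assumption.

```go
package main

import "math"

// FitInside returns the output size for a source of srcW×srcH constrained to
// a maxW×maxH box. A zero bound means that axis is unconstrained. The scale
// factor starts at 1 and can only shrink, so the image is never upscaled,
// and one factor is used for both axes, so the aspect ratio is preserved.
func FitInside(srcW, srcH, maxW, maxH int) (int, int) {
	if srcW <= 0 || srcH <= 0 {
		return 0, 0
	}
	scale := 1.0
	if maxW > 0 {
		scale = math.Min(scale, float64(maxW)/float64(srcW))
	}
	if maxH > 0 {
		scale = math.Min(scale, float64(maxH)/float64(srcH))
	}
	w := int(math.Max(1, math.Round(float64(srcW)*scale)))
	h := int(math.Max(1, math.Round(float64(srcH)*scale)))
	return w, h
}
```

For example, a 4000×3000 source requested at the card box (720×360) is limited by height and comes out at 480×360, while a 100×100 source requested at 240×240 stays at 100×100.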

When a transform derivative is not yet cached, the proxy must coordinate concurrent requests for the same derivative without running duplicate work. This is handled with a five-step flow:

  1. Cache check — return immediately if the derivative already exists in storage.
  2. Subscribe — subscribe to a Redis pub/sub completion channel before enqueueing, so a completion signal cannot be missed.
  3. Double-check — re-check cache and any transient result key to handle the race where a job finished between steps 1 and 2.
  4. Enqueue — dispatch the transform task with a 2-minute deduplication window (so concurrent requests collapse to a single job) and no retries (fast-fail semantics).
  5. Wait — block up to 30 seconds for the pub/sub signal. An "ok" signal reads the result from a transient Redis key first (avoids stale NFS attribute cache), falling back to storage with retries. An "error" signal returns immediately without waiting for the timeout.
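The five steps above can be sketched with the Redis, storage, and queue collaborators reduced to plain function fields. Everything named here (`Deps`, `FetchDerivative`, the error values) is hypothetical. The sketch also simplifies step 5: it performs a single storage read after an "ok" signal instead of the retrying fallback described above.

```go
package main

import (
	"errors"
	"time"
)

// DefaultWait is the 30-second cap on step 5.
const DefaultWait = 30 * time.Second

var (
	ErrTransformFailed = errors.New("transform failed")
	ErrMissingResult   = errors.New("completion signalled but no result found")
	ErrTimeout         = errors.New("timed out waiting for transform")
)

// Deps stands in for cache storage, the transient Redis result key, the
// pub/sub completion channel, and the job queue.
type Deps struct {
	CacheGet  func(key string) ([]byte, bool)
	ResultGet func(key string) ([]byte, bool)
	Subscribe func(key string) (signals <-chan string, cancel func())
	Enqueue   func(key string) error // assumed to deduplicate concurrent requests
}

// FetchDerivative returns the derivative for key, enqueueing at most one job.
func FetchDerivative(d Deps, key string, wait time.Duration) ([]byte, error) {
	// 1. Cache check.
	if b, ok := d.CacheGet(key); ok {
		return b, nil
	}
	// 2. Subscribe first, so a completion published from here on is not lost.
	signals, cancel := d.Subscribe(key)
	defer cancel()
	// 3. Double-check: a job may have finished between steps 1 and 2.
	if b, ok := d.ResultGet(key); ok {
		return b, nil
	}
	if b, ok := d.CacheGet(key); ok {
		return b, nil
	}
	// 4. Enqueue the transform.
	if err := d.Enqueue(key); err != nil {
		return nil, err
	}
	// 5. Wait for the signal or the timeout, whichever comes first.
	timer := time.NewTimer(wait)
	defer timer.Stop()
	select {
	case sig := <-signals:
		if sig != "ok" {
			return nil, ErrTransformFailed // fail fast on an error signal
		}
		if b, ok := d.ResultGet(key); ok {
			return b, nil
		}
		if b, ok := d.CacheGet(key); ok {
			return b, nil
		}
		return nil, ErrMissingResult
	case <-timer.C:
		return nil, ErrTimeout
	}
}
```

The ordering of steps 2 and 3 is the important part: subscribing before the second check closes the window in which a job could complete unobserved.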

When Redis is not available (development or test environments), the proxy transforms synchronously and writes directly to cache storage.

The concurrency model above fills the cache lazily — the first viewer of each derivative pays the transform latency. To eliminate that cold-start cost, an hourly background maintenance loop (running on all/worker nodes, gated by ALCOVES_IMAGE_PROXY_PREWARM_ENABLED, default on) generates every registry variant for every image ahead of time:

  1. Scan — a bounded batch (500/pass) of live image files that have not been warmed at the current VariantsVersion, are under the 3-strike cap, and are not already in flight (or are stuck past 15 minutes). Derived video-thumbnail images are included; trashed files are not.
  2. Enqueue — one image:prewarm task per file on the low-priority maintenance queue, so the backfill never competes with interactive transforms.
  3. Generate — the worker reads the source once and writes any missing variant to cache (already-cached variants are skipped, making the job idempotent), then marks the file warmed.

Failure handling. A genuine per-file failure — a corrupted or unsupported source that fails the libvips transform — increments a strike counter (files.image_proxy_attempts). After 3 strikes the scan drops the file permanently, so a job that fails every time runs at most three times rather than being re-queued forever. Infrastructure failures (a storage read/write blip) are recorded without burning a strike, so a transient outage can’t sideline a healthy file. Tasks carry no asynq-level retries; the database strike counter is the sole cap, applied across maintenance passes.
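The scan filter and the strike rule can be expressed as two small functions. This is a sketch under assumptions: the struct below is hypothetical, the in-flight timestamp column is invented for illustration, and only files.image_proxy_attempts and image_proxy_warmed_version are named in the text above.

```go
package main

const (
	BatchSize         = 500     // files per maintenance pass
	MaxStrikes        = 3       // per-file failures before the file is dropped
	StuckAfterSeconds = 15 * 60 // in-flight work older than this is retried
)

// File is a hypothetical projection of the columns the scan needs.
type File struct {
	ID            int
	Trashed       bool
	WarmedVersion int   // image_proxy_warmed_version
	Attempts      int   // image_proxy_attempts
	InFlightSince int64 // unix seconds; 0 when not in flight (assumed column)
}

// Eligible reports whether the scan should pick up f at time now.
func Eligible(f File, variantsVersion int, now int64) bool {
	if f.Trashed || f.WarmedVersion == variantsVersion || f.Attempts >= MaxStrikes {
		return false
	}
	if f.InFlightSince == 0 {
		return true
	}
	return now-f.InFlightSince > StuckAfterSeconds
}

// Scan returns at most BatchSize eligible files.
func Scan(files []File, variantsVersion int, now int64) []File {
	var batch []File
	for _, f := range files {
		if Eligible(f, variantsVersion, now) {
			batch = append(batch, f)
			if len(batch) == BatchSize {
				break
			}
		}
	}
	return batch
}

// FailureKind separates bad source files from infrastructure blips.
type FailureKind int

const (
	PerFileFailure FailureKind = iota // e.g. libvips cannot decode the source
	InfraFailure                      // e.g. a storage read or write error
)

// RecordFailure releases the in-flight marker and burns a strike only for
// genuine per-file failures.
func RecordFailure(f *File, kind FailureKind) {
	f.InFlightSince = 0
	if kind == PerFileFailure {
		f.Attempts++
	}
}
```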

The proxy route (GET /api/files/proxy/{libraryId}/{fileId}/{filename}) requires an authenticated session and library membership. Non-members receive 404 rather than 403 to avoid confirming that a library or file exists.


Browsers cannot play every container or codec a user might upload. Alcoves transcodes non-web-friendly video to a standardized MP4 and generates thumbnails for grid and preview display.

On upload, Alcoves inspects the MIME type:

  • video/mp4, video/webm, video/ogg — already web-playable; a thumbnail is generated but no proxy transcode is enqueued by default.
  • All other video MIME types — a proxy transcode is enqueued automatically.

Library owners can also trigger re-encoding manually from the library settings UI, or bulk-reprocess thumbnails for all videos in a library.

A proxy transcode job runs the following steps:

  1. Probe — ffprobe inspects the source streams. If the video is already H.264, the audio is AAC (or absent), the container is MP4 or MOV, and the height is 1080 px or less, transcoding is skipped and the file is marked not_needed.
  2. Transcode — ffmpeg encodes with libx264, CRF 23, medium preset, High profile, level 4.1, yuv420p pixel format, AAC audio at 128 kbps stereo, and +faststart for progressive download. Video taller than 1080 px is scaled down.
  3. Progress — ffmpeg’s -progress output is parsed during the encode; the proxy_status, proxy_progress, and proxy_eta_seconds columns on the file record are updated so the frontend can show a progress bar and ETA.
  4. Storage — the transcoded MP4 is stored as a derived file record linked to the original by source_file_id.
  5. Thumbnail — a WebP thumbnail is generated at 480 px wide and written to cache.
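Steps 1 and 2 can be sketched as a skip predicate plus an argument builder. The `Probe` struct is a hypothetical summary of ffprobe output (real ffprobe reports MP4 and MOV under a combined format name, which this sketch glosses over), and the exact argument order Alcoves uses is not documented here. The ffmpeg options themselves are standard ones matching the settings listed above.

```go
package main

// Probe is a simplified, hypothetical view of the ffprobe result.
type Probe struct {
	VideoCodec string // e.g. "h264"
	AudioCodec string // "" when the file has no audio stream
	Container  string // e.g. "mp4", "mov", "mkv"
	Height     int
}

// NeedsTranscode implements the step 1 rule: a file that is already H.264
// with AAC (or no) audio in MP4/MOV at 1080 px or less is left alone.
func NeedsTranscode(p Probe) bool {
	videoOK := p.VideoCodec == "h264"
	audioOK := p.AudioCodec == "" || p.AudioCodec == "aac"
	containerOK := p.Container == "mp4" || p.Container == "mov"
	return !(videoOK && audioOK && containerOK && p.Height <= 1080)
}

// TranscodeArgs builds an ffmpeg argument list for the step 2 settings.
func TranscodeArgs(in, out string, srcHeight int) []string {
	args := []string{
		"-y", "-i", in,
		"-c:v", "libx264", "-crf", "23", "-preset", "medium",
		"-profile:v", "high", "-level", "4.1", "-pix_fmt", "yuv420p",
		"-c:a", "aac", "-b:a", "128k", "-ac", "2",
		"-movflags", "+faststart",
		"-progress", "pipe:1", "-nostats", // machine-readable progress on stdout
	}
	if srcHeight > 1080 {
		// -2 keeps the aspect ratio and rounds the width to an even number,
		// which yuv420p requires.
		args = append(args, "-vf", "scale=-2:1080")
	}
	return append(args, out)
}
```

The resulting slice would be passed to `exec.Command` together with the binary named by ALCOVES_FFMPEG_BINARY.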

Thumbnail extraction tries up to four ffmpeg strategies, in order, and stops at the first one that succeeds, so that both HDR and SDR sources are handled gracefully:

  1. Auto HDR to BT.709 — linearize the source, apply Hable tone-mapping, convert to BT.709 YUV420p. Requires an ffmpeg build with zscale (libzimg).
  2. Explicit SDR — same pipeline but declares the source as BT.709 explicitly, for untagged content.
  3. SDR no-seek — same as above but without seeking to mid-file.
  4. Simple scale — last resort with no color management.
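The ordered fallback can be sketched as a loop over strategies that stops at the first success. The `Strategy` type and `ExtractThumbnail` function are hypothetical; each `Run` would wrap one ffmpeg invocation, and the filter chains themselves are deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStrategyFailed is a placeholder error for a failed ffmpeg attempt.
var ErrStrategyFailed = errors.New("strategy failed")

// Strategy is one way of extracting a thumbnail from a source file.
type Strategy struct {
	Name string
	Run  func(src string) ([]byte, error)
}

// ExtractThumbnail tries each strategy in order and returns the first
// non-empty image along with the name of the strategy that produced it.
func ExtractThumbnail(src string, strategies []Strategy) ([]byte, string, error) {
	if len(strategies) == 0 {
		return nil, "", errors.New("no thumbnail strategies configured")
	}
	var lastErr error
	for _, s := range strategies {
		img, err := s.Run(src)
		if err == nil && len(img) > 0 {
			return img, s.Name, nil
		}
		if err == nil {
			err = errors.New("empty output")
		}
		lastErr = fmt.Errorf("%s: %w", s.Name, err)
	}
	return nil, "", fmt.Errorf("all %d strategies failed, last error: %w", len(strategies), lastErr)
}
```

Putting the most color-accurate strategy first and the crudest one last means a build without zscale support still yields a thumbnail, just without tone-mapping.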

The proxy_status column on a file record moves through the following states:

queued → processing → ready
                    → not_needed
                    → failed

The frontend polls GET /api/libraries/:id/files/:fileId every two seconds while status is queued or processing, showing the progress percentage and estimated seconds remaining.
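The percentage and ETA shown during polling ultimately derive from ffmpeg's -progress output, which is a stream of key=value lines. The sketch below parses one block of that output. The keys `out_time_us`, `speed`, and `progress` are standard ffmpeg progress keys, but deriving the ETA as remaining media time divided by encode speed is an assumption about how proxy_eta_seconds is computed.

```go
package main

import (
	"math"
	"strconv"
	"strings"
)

// Progress is a hypothetical summary of one progress block.
type Progress struct {
	Percent    float64
	ETASeconds float64 // -1 when it cannot be estimated yet
	Done       bool
}

// ParseProgress reads key=value lines and relates the encoded position to
// the total duration obtained from ffprobe.
func ParseProgress(lines []string, durationSeconds float64) Progress {
	var outSeconds, speed float64
	p := Progress{ETASeconds: -1}
	for _, line := range lines {
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		value = strings.TrimSpace(value)
		switch strings.TrimSpace(key) {
		case "out_time_us": // encoded position in microseconds
			if us, err := strconv.ParseInt(value, 10, 64); err == nil && us > 0 {
				outSeconds = float64(us) / 1e6
			}
		case "speed": // e.g. "1.5x"; "N/A" early in the encode
			if v, err := strconv.ParseFloat(strings.TrimSuffix(value, "x"), 64); err == nil {
				speed = v
			}
		case "progress": // "continue" or "end"
			p.Done = value == "end"
		}
	}
	if durationSeconds > 0 {
		p.Percent = math.Min(100, outSeconds/durationSeconds*100)
		if speed > 0 {
			p.ETASeconds = math.Max(0, (durationSeconds-outSeconds)/speed)
		}
	}
	if p.Done {
		p.Percent, p.ETASeconds = 100, 0
	}
	return p
}
```

For a 120-second source, a block reporting 30 seconds encoded at 2× speed gives 25% and an ETA of 45 seconds.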

GET /api/libraries/:id/files/:fileId/playback-sources returns the original file and any proxy MP4 as a list of playback sources. The video player and editor let users switch between sources if both are available.


The video editor draws a scrollable, zoomable waveform under the timeline so users can navigate audio visually when creating moment clips. Waveform data is extracted on upload and cached as JSON.

The extraction job runs the following steps:

  1. Audio check — ffmpeg probes for an audio stream. Files with no audio receive an empty waveform and are marked complete immediately.
  2. PCM extraction — ffmpeg decodes the audio track to raw float32 little-endian PCM at 16 kHz mono.
  3. Peak computation — a streaming, window-based algorithm divides the PCM into 320-sample windows (16000 Hz / 50 peaks per second) and records the maximum absolute value in each window, clamped to [0, 1]. The algorithm uses a fixed read buffer and a reused window buffer to avoid loading the entire file into memory. Quiet files stay quiet — there is no per-file normalization.
  4. Storage — the peak array, peaks-per-second, and sample rate are serialized to JSON and written to cache storage.
  5. Version check — before writing the result, the worker reloads the file record to confirm the waveform version has not changed (which would indicate the file was replaced while the job was running). Stale results are discarded.
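Step 3 can be sketched as a streaming reader over the raw PCM. This is an illustration rather than the actual implementation: it keeps a running maximum per window instead of a reused window buffer, and the function names are hypothetical. The input format (float32 little-endian, 16 kHz mono) and the 320-sample window follow the description above.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"errors"
	"io"
	"math"
)

const (
	SampleRate     = 16000
	PeaksPerSecond = 50
	WindowSize     = SampleRate / PeaksPerSecond // 320 samples per peak
)

// ComputePeaks streams float32 little-endian samples from r and returns the
// maximum absolute value of each window, clamped to [0, 1]. Memory use is
// bounded by the read buffer plus the output slice; the PCM itself is never
// held in memory. No normalization is applied, so quiet input stays quiet.
func ComputePeaks(r io.Reader, window int) ([]float32, error) {
	if window <= 0 {
		return nil, errors.New("window must be positive")
	}
	br := bufio.NewReaderSize(r, 64*1024) // fixed-size read buffer
	var (
		sample [4]byte
		peaks  []float32
		peak   float32
		filled int
	)
	for {
		if _, err := io.ReadFull(br, sample[:]); err != nil {
			if err == io.EOF {
				break // clean end of stream
			}
			return nil, err // includes a truncated final sample
		}
		v := math.Float32frombits(binary.LittleEndian.Uint32(sample[:]))
		if v < 0 {
			v = -v
		}
		if v > 1 {
			v = 1
		}
		if v > peak { // NaN compares false and is therefore ignored
			peak = v
		}
		filled++
		if filled == window {
			peaks = append(peaks, peak)
			peak, filled = 0, 0
		}
	}
	if filled > 0 {
		peaks = append(peaks, peak) // trailing partial window
	}
	return peaks, nil
}

// PeaksFromBytes is a convenience wrapper for in-memory input.
func PeaksFromBytes(pcm []byte, window int) ([]float32, error) {
	return ComputePeaks(bytes.NewReader(pcm), window)
}
```

In the job itself, `r` would be the stdout pipe of the ffmpeg process that decodes the audio track, and `window` would be `WindowSize`.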

The resulting JSON structure has 50 peaks per second of audio, which gives the waveform renderer enough resolution to draw accurately at any zoom level the editor supports.

The editor renders the waveform on a <canvas> element pinned to the visible viewport. Only the visible slice of the waveform is drawn on each frame, with peak values accumulated per pixel column. The renderer is HiDPI-aware and SSR-safe.


When a user uploads a profile avatar, the image is normalized synchronously before storage — no background job is involved.

The normalization pipeline:

  1. Reject empty input or files larger than 8 MiB.
  2. Decode the image with libvips; reject unrecognizable formats.
  3. Auto-rotate based on EXIF orientation.
  4. Center-crop to a square (using the shorter dimension).
  5. Downscale to at most 512 × 512 px with Lanczos3 interpolation (never upscales).
  6. Export as WebP at quality 85.

Avatars are stored keyed by user ID and served at /api/auth/me/avatar and /api/auth/users/:userId/avatar with a short-lived Cache-Control header of private, max-age=300.

The upload endpoint maps errors to HTTP status codes: empty or unrecognizable input returns 400; files over 8 MiB return 413.
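The geometry and error mapping of the avatar pipeline can be sketched without the image library. The helper names are hypothetical, and returning 0 for "no error" is a convention of this sketch, not of the endpoint.

```go
package main

import "net/http"

const (
	MaxAvatarBytes = 8 << 20 // 8 MiB
	AvatarSide     = 512     // maximum output edge in pixels
)

// ErrorStatus maps a rejected upload to its HTTP status, or 0 when the
// upload is acceptable. decodable reports whether libvips recognized the
// format.
func ErrorStatus(size int64, decodable bool) int {
	switch {
	case size == 0:
		return http.StatusBadRequest
	case size > MaxAvatarBytes:
		return http.StatusRequestEntityTooLarge
	case !decodable:
		return http.StatusBadRequest
	}
	return 0
}

// Rect is a crop region in source pixel coordinates.
type Rect struct{ X, Y, W, H int }

// CenterCropSquare returns the centered square whose side is the shorter
// dimension of a w×h image.
func CenterCropSquare(w, h int) Rect {
	side := w
	if h < side {
		side = h
	}
	return Rect{X: (w - side) / 2, Y: (h - side) / 2, W: side, H: side}
}

// TargetSide returns the output edge length: 512 at most, never upscaled.
func TargetSide(cropSide int) int {
	if cropSide > AvatarSide {
		return AvatarSide
	}
	return cropSide
}
```

A 1200×800 upload is cropped to the 800×800 region starting at x=200 and then downscaled to 512×512, while a 300×300 upload is stored at 300×300.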


Async media jobs use a consistent status/progress/version pattern on the files table:

| Service | Columns |
|---|---|
| Video proxy | proxy_status, proxy_progress, proxy_eta_seconds |
| Waveform | waveform_status, waveform_progress, waveform_error, waveform_version, waveformed_version, waveform_peaks_per_second |
| Derived files | thumbnail_file_id, source_file_id |

The waveform_version / waveformed_version pair implements optimistic re-processing: incrementing waveform_version schedules a re-run, and the worker checks both at the start and end of the job to detect files that were replaced mid-extraction.
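A minimal sketch of the version pair, assuming waveform_version is the desired version and waveformed_version records the last version that was successfully processed. The in-memory struct stands in for the database row; in a real worker the final check and write would need to happen atomically.

```go
package main

// WaveformRow is a hypothetical in-memory stand-in for the files row.
type WaveformRow struct {
	WaveformVersion   int // bumped whenever the file needs (re)processing
	WaveformedVersion int // version the stored peaks were computed for
	Peaks             []float32
}

// NeedsWaveform reports whether the stored result is out of date.
func NeedsWaveform(r WaveformRow) bool {
	return r.WaveformedVersion != r.WaveformVersion
}

// Commit stores peaks only if the row is still at the version the job
// started with. It returns false for a stale result, which is discarded.
func Commit(current *WaveformRow, startedVersion int, peaks []float32) bool {
	if current.WaveformVersion != startedVersion {
		return false // the file was replaced while the job was running
	}
	current.Peaks = peaks
	current.WaveformedVersion = startedVersion
	return true
}
```

A job records `WaveformVersion` when it starts, does its work, reloads the row, and calls `Commit` with the recorded value. If the file was replaced in the meantime, the row stays marked as needing a waveform and a later job picks it up.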


On upload: A new file record is created. Based on MIME type, the server enqueues derivatives — for video, this includes a thumbnail job, optionally a proxy transcode job, a waveform job, and any AI analysis jobs the library has enabled.

On display: Grid cards and file previews request images through AlcovesImage.vue, which builds a proxy URL with sorted query parameters for stable cache keys. The request hits GET /api/files/proxy/..., which either serves a cached derivative immediately or waits up to 30 seconds for a freshly computed one.

In the editor: The video editor loads playback sources and waveform data, then renders the waveform on a canvas under the moment timeline. Job status is polled for progress display while transcoding is in progress.
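Sorting the query parameters matters because two URLs that differ only in parameter order would otherwise miss each other's cache entries. The real builder lives in the frontend (AlcovesImage.vue); the sketch below mirrors the idea in Go, where `url.Values.Encode` emits keys in sorted order. The `ProxyURL` function and its parameter list are hypothetical.

```go
package main

import (
	"net/url"
	"strconv"
)

// ProxyURL builds a proxy path with a canonical query string. Zero-valued
// numeric parameters and an empty format are omitted so that defaults and
// explicit requests for the default do not need separate handling here.
func ProxyURL(libraryID, fileID, filename string, width, height, quality int, format string) string {
	q := url.Values{}
	if width > 0 {
		q.Set("width", strconv.Itoa(width))
	}
	if height > 0 {
		q.Set("height", strconv.Itoa(height))
	}
	if quality > 0 {
		q.Set("quality", strconv.Itoa(quality))
	}
	if format != "" {
		q.Set("format", format)
	}
	path := "/api/files/proxy/" + url.PathEscape(libraryID) + "/" +
		url.PathEscape(fileID) + "/" + url.PathEscape(filename)
	if encoded := q.Encode(); encoded != "" { // Encode sorts by key
		return path + "?" + encoded
	}
	return path
}
```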


| Environment variable | Purpose | Default |
|---|---|---|
| ALCOVES_MODE | Set to all or worker to enable background job processing | all |
| ALCOVES_FFMPEG_BINARY | Path to the ffmpeg binary | ffmpeg (system PATH) |
| ALCOVES_QUEUE_HOST | Dragonfly/Redis host for the job queue | localhost |
| ALCOVES_QUEUE_PORT | Dragonfly/Redis port | 6389 |
| ALCOVES_STORAGE_DRIVER | local or s3 | local |
| ALCOVES_STORAGE_PATH | Root path for local file storage | |
| ALCOVES_CACHE_STORAGE_PATH | Root path for derived/cache storage | |
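A minimal sketch of reading these variables with their documented defaults. The `Config` struct and `LoadConfig` function are hypothetical and do not reflect the server's actual configuration code. The lookup function is injected so the loader can be exercised without touching the process environment.

```go
package main

// Config is a hypothetical container for the variables in the table above.
type Config struct {
	Mode             string
	FFmpegBinary     string
	QueueHost        string
	QueuePort        string
	StorageDriver    string
	StoragePath      string // no documented default
	CacheStoragePath string // no documented default
}

// LoadConfig reads each variable through getenv and falls back to the
// documented default when the variable is unset or empty.
func LoadConfig(getenv func(string) string) Config {
	get := func(key, fallback string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return fallback
	}
	return Config{
		Mode:             get("ALCOVES_MODE", "all"),
		FFmpegBinary:     get("ALCOVES_FFMPEG_BINARY", "ffmpeg"),
		QueueHost:        get("ALCOVES_QUEUE_HOST", "localhost"),
		QueuePort:        get("ALCOVES_QUEUE_PORT", "6389"),
		StorageDriver:    get("ALCOVES_STORAGE_DRIVER", "local"),
		StoragePath:      get("ALCOVES_STORAGE_PATH", ""),
		CacheStoragePath: get("ALCOVES_CACHE_STORAGE_PATH", ""),
	}
}

// RunsWorkers reports whether this process registers job handlers.
func (c Config) RunsWorkers() bool {
	return c.Mode == "all" || c.Mode == "worker"
}
```

In a server entry point this would be called as `LoadConfig(os.Getenv)`.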