Architecture for scaling media search toward 100M+ assets

Low-latency 1-to-many media fingerprint search across very large indexed libraries is not a hardware problem — it is an architecture problem. The decisions that make it possible are made before a single query runs: how fingerprint data is stored, how ingestion is decoupled from the API, and how the search path avoids scanning data it does not need to look at.

The problem with the obvious approach

The straightforward architecture for media similarity search puts everything in a relational database. Fingerprints go in one table, metadata in another. A query compares the incoming fingerprint against every row and returns the closest matches.

This works well at 10,000 assets. At 1 million it starts to strain. At 100 million it collapses. A full-table scan across 100 million fingerprint rows — even with a good index — cannot return results in under a second when the comparison logic requires computing similarity scores row by row.

The scale problem compounds for video specifically. Video is not one fingerprint per asset — it is hundreds or thousands of fingerprints per asset, one per sampled frame. At 100 million videos with an average of 100 frames each, the fingerprint table has 10 billion rows. A relational database running sequential comparisons against that table does not have a latency problem to tune away; it is the wrong tool for the job.

Naive full-scan vs. clustered index lookup — illustrative architecture estimate

Library size  | Fingerprint rows | Full scan | Clustered lookup
10K assets    | ~320K rows       | ~20 ms    | < 5 ms
1M assets     | ~32M rows        | ~4 s      | < 20 ms
100M assets   | ~3.2B rows       | timeout   | low-latency lookup

Video: ~100 frames × 32 fingerprint bands per asset. Clustered lookup touches only matching bands — not the full table. Figures are illustrative architecture estimates for a properly sized deployment.

Why vector embeddings are not the answer here

The modern reflex for 'similarity search at scale' is to reach for vector embeddings and an approximate nearest-neighbour index. Embed the media into a high-dimensional vector, store it in a vector database, query by cosine distance. Products like Pinecone, Weaviate, and pgvector make this relatively easy to set up.

For semantic similarity — finding videos that are about the same topic, or images with the same visual concept — this is the right approach. For content duplication detection, it is the wrong one.

Embedding-based search answers 'what is this similar to in meaning?' Fingerprint-based search answers 'is this the same content in a different encoding?' Those are different questions with different right answers. A re-encoded ad and its original will have near-identical fingerprints. They may or may not have similar embeddings depending on how the model was trained, how the encoding differs, and whether frame-level or clip-level embeddings are used.

Beyond accuracy, vector search at this scale is expensive. Embedding 100 million video clips through a neural model, storing float32 vectors for each, and running ANN queries against them is a significant ongoing cost. For content duplication, classical signal processing fingerprints solve the problem more accurately and at lower cost.
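
To make the distinction concrete, here is a minimal sketch using the open-source imagehash library's pHash as a stand-in for a content fingerprint. It is illustrative only — it is not MediaLayer's fingerprinting pipeline, and the file names are placeholders — but it shows the property that matters: a re-encode of the same content lands at a tiny Hamming distance, regardless of how a semantic model would embed it.

```python
# Illustrative only: imagehash's pHash stands in for a content fingerprint.
# This is not MediaLayer's pipeline; file names are placeholders.
from PIL import Image
import imagehash  # pip install ImageHash

original = imagehash.phash(Image.open("ad_master.png"))
reencoded = imagehash.phash(Image.open("ad_reencoded.jpg"))  # same content, different encoding

# Hamming distance between the two 64-bit hashes. A re-encode of the same
# content stays near zero; merely similar content does not.
distance = original - reencoded
print(f"fingerprint distance: {distance} (small values suggest the same content)")
```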

The architectural insight: two databases with different jobs

The foundation of how MediaLayer scales is a clean separation between two categories of data that have very different access patterns.

Metadata — the name of an asset, its ingest status, which tenant it belongs to, when it was created — is relational data. It changes frequently. An asset transitions from pending to processing to ready. A job counter increments as items complete. This data lives in PostgreSQL, where transactional updates, foreign keys, and status transitions work naturally.

Fingerprint data is the opposite. It is written exactly once during ingest and never updated again. It is read heavily during search. The ratio of reads to writes is extreme — a single ingest event produces data that will be queried thousands of times over the asset's lifetime. Relational databases are not optimised for this pattern. Columnar databases are.

MediaLayer stores all fingerprint data in ClickHouse, a columnar database engine built for analytical read workloads. The storage layout is ordered by the exact fields that appear in every search query's WHERE clause. This means a search does not scan fingerprint rows — it performs a clustered lookup, touching only the fraction of data relevant to the query. This architecture is designed to keep lookup latency low even across very large fingerprint tables.
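
A minimal sketch of what that layout can look like. The table, column names, and ordering key below are illustrative assumptions rather than MediaLayer's actual schema; the point is that the MergeTree ORDER BY mirrors the fields every search query filters on, so a lookup reads only the relevant slices of the table.

```python
# Illustrative schema sketch: column names and the ordering key are assumptions,
# not MediaLayer's actual tables. The ORDER BY mirrors the search WHERE clause.
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client(host="localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS video_fingerprints (
        tenant_id   UInt32,   -- every row is scoped to a tenant
        band        UInt16,   -- fingerprint band / component
        fingerprint UInt64,   -- the band's hash value
        media_id    UUID,     -- points back to the PostgreSQL metadata record
        frame_ts    Float32   -- timestamp of the sampled frame
    )
    ENGINE = MergeTree
    ORDER BY (tenant_id, band, fingerprint)
""")
```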

Two-database architecture — each store does one job

FastAPI — REST layer

PostgreSQL — metadata, updates frequently (transactional: frequent UPDATEs)
  • Tenant records
  • Media item status
  • Ingest job progress

ClickHouse — fingerprints, write once, read always (columnar: append-only, analytical reads)
  • Video frame fingerprints
  • Audio fingerprints
  • Image fingerprints

Ingestion: keeping the API non-blocking at high volume

Fingerprinting a video is not fast. Frame extraction, signal processing, and bulk database writes for a single 60-second video take several seconds of CPU time. If the API blocked on this work during a POST request, ingest throughput would be limited to a few assets per second per worker, and clients would be left with long-hanging HTTP calls.

MediaLayer decouples the API response from the fingerprinting work entirely. When a client submits a URL for ingest, the API creates a media record in PostgreSQL with status pending and returns immediately. The fingerprinting work is handed off to a background worker queue.

Workers run independently of the API. They download the asset, run the fingerprinting pipeline, write fingerprints to ClickHouse in a single bulk operation, and update the media record status in PostgreSQL. Multiple workers run concurrently and pick up tasks from the queue as fast as they can process them.

This design means ingest throughput scales horizontally by adding workers, not by increasing API server capacity. It also means a spike in ingest volume — say, a client uploading 10,000 assets in a batch — does not affect search latency for other tenants. The queue absorbs the load; the API and search path remain unaffected.
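
A minimal sketch of that hand-off, assuming a Celery-style queue; the endpoint path, request model, and queue technology are illustrative choices for the sketch, not MediaLayer's actual stack.

```python
# Sketch of the non-blocking ingest hand-off. Endpoint, fields, and the
# Celery/Redis queue are illustrative assumptions, not MediaLayer's stack.
from uuid import uuid4
from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
queue = Celery("ingest", broker="redis://localhost:6379/0")

class IngestRequest(BaseModel):
    url: str

@queue.task
def fingerprint_asset(media_id: str, url: str) -> None:
    # Worker side: download the asset, run the fingerprint pipeline,
    # bulk-insert into ClickHouse, then mark the PostgreSQL row "ready".
    ...

@app.post("/media", status_code=202)
def ingest(req: IngestRequest) -> dict:
    media_id = str(uuid4())
    # 1. Create the PostgreSQL metadata row with status "pending" (omitted here).
    # 2. Enqueue the slow work and return immediately; the API never blocks
    #    on frame extraction or fingerprinting.
    fingerprint_asset.delay(media_id, req.url)
    return {"media_id": media_id, "status": "pending"}
```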

Bulk ingest: CSV and ZIP for large library migrations

Beyond single-asset ingest, MediaLayer supports two bulk modes for clients with large existing libraries.

CSV ingest accepts a file of URLs with one row per asset. The API parses the CSV, creates a job record with the total count, and enqueues one background task per row. Clients poll the job status endpoint to watch completed and failed counts increment in real time. A CSV with 50,000 rows enqueues 50,000 tasks without the API holding any connection open.
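
Continuing the ingest sketch above, the CSV path fans one task out per row; create_job is a hypothetical helper standing in for the PostgreSQL job record, not a MediaLayer function.

```python
# CSV fan-out, continuing the ingest sketch above. create_job is a hypothetical
# helper that inserts a PostgreSQL job row and returns its id.
import csv
import io
from uuid import uuid4
from fastapi import UploadFile

@app.post("/ingest/csv", status_code=202)
async def ingest_csv(file: UploadFile) -> dict:
    rows = list(csv.DictReader(io.StringIO((await file.read()).decode("utf-8"))))
    job_id = create_job(total=len(rows))                     # hypothetical job record
    for row in rows:
        fingerprint_asset.delay(str(uuid4()), row["url"])    # one queue task per row
    return {"job_id": job_id, "total": len(rows), "status": "pending"}
```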

ZIP ingest handles clients whose assets are stored locally rather than behind a URL. The client uploads a ZIP archive containing a manifest CSV and the media files themselves. A worker downloads the ZIP, extracts it, reads the manifest, and enqueues one fingerprint task per file. This allows large library migrations — millions of assets — to be handed off in a single operation.

Both bulk modes write job progress to PostgreSQL incrementally as tasks complete, so clients always have an accurate view of where a large ingest stands.

The search path: candidate shortlist, then rerank

Search in a 1-to-N system works in two stages. The first stage retrieves a small set of candidates from the index quickly. The second stage scores those candidates precisely in memory. The key is that the expensive scoring step runs on a few hundred rows, not billions.

When a query arrives, MediaLayer extracts fingerprints from the incoming media — the same signal processing pipeline used at ingest. Those fingerprints are used to query ClickHouse for candidate matches. The query is structured so that ClickHouse's storage order does most of the work: only rows that share a fingerprint component with the query are touched. The result is a ranked shortlist of candidates, typically the top 100.

That shortlist is then reranked in memory using precise similarity scoring. For video, temporal consistency is also evaluated: a genuine duplicate will have multiple query frames matching the same asset at consistent time offsets. A single lucky frame match scores high but will not survive the temporal consistency check. This two-stage approach is what keeps false positive rates low at scale — the candidate retrieval casts a wide net, and the reranking applies rigorous scoring before anything reaches the response.

The entire pipeline — query fingerprinting, ClickHouse lookup, reranking — is designed to target low-latency retrieval for typical video queries against large indexed libraries.
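
A compressed sketch of the two stages, using the same assumed schema as the earlier table definition. The query shape, shortlist size, and vote threshold are illustrative assumptions, not MediaLayer's actual scoring.

```python
# Two-stage search sketch under the assumed video_fingerprints schema above.
from collections import Counter, defaultdict
from clickhouse_driver import Client

client = Client(host="localhost")

def search(tenant_id: int, query_fp: list[tuple[int, int, float]], top_k: int = 10):
    """query_fp: (band, fingerprint, query_ts) triples from the query media."""
    bands = list({b for b, _, _ in query_fp})
    hashes = list({h for _, h, _ in query_fp})

    # Stage 1: clustered candidate lookup. Because the table is ordered by
    # (tenant_id, band, fingerprint), only rows sharing those values are read.
    rows = client.execute(
        """
        SELECT media_id, band, fingerprint, frame_ts
        FROM video_fingerprints
        WHERE tenant_id = %(tenant)s
          AND band IN %(bands)s
          AND fingerprint IN %(hashes)s
        """,
        {"tenant": tenant_id, "bands": bands, "hashes": hashes},
    )

    # Shortlist the ~100 assets with the most raw hits.
    hits = Counter(media_id for media_id, _, _, _ in rows)
    shortlist = {m for m, _ in hits.most_common(100)}

    # Stage 2: rerank in memory with a temporal consistency check. A genuine
    # duplicate has many frames matching at one consistent time offset; a
    # single lucky frame match cannot accumulate votes.
    query_ts = {(b, h): ts for b, h, ts in query_fp}
    offsets = defaultdict(list)
    for media_id, band, fp, frame_ts in rows:
        if media_id in shortlist and (band, fp) in query_ts:
            offsets[media_id].append(round(frame_ts - query_ts[(band, fp)], 1))

    scored = []
    for media_id, offs in offsets.items():
        offset, votes = Counter(offs).most_common(1)[0]
        if votes >= 3:  # drop single-frame coincidences
            scored.append({"media_id": media_id, "votes": votes, "offset": offset})

    return sorted(scored, key=lambda r: r["votes"], reverse=True)[:top_k]
```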

Two-stage search path — candidate shortlist, then precise rerank

1. Query media arrives — video URL / audio clip / image
2. Fingerprint extraction — same pipeline as ingest, designed for low latency
3. ClickHouse candidate lookup — clustered index scan → top 100 candidates
4. Rerank in memory — precise similarity score on 100 rows, not billions
5. Temporal consistency check — video: filters single lucky-frame false positives
6. Top-K results returned — score + match timestamps; target low-latency end-to-end response

Phases: preparation → index work → result

Multi-tenancy: isolation without separate deployments

MediaLayer is a multi-tenant system. Multiple clients share the same ClickHouse cluster and PostgreSQL instance, but their data is never mixed. Every fingerprint row written to ClickHouse includes a tenant identifier. Every search query includes a tenant filter as the first condition in the WHERE clause, ensuring that no query ever touches another tenant's data.

This design keeps infrastructure costs shared while providing the isolation guarantees a production API requires. A tenant with 10 million assets and a tenant with 100,000 assets run on the same cluster. The tenant filter pushes the cardinality problem back to ClickHouse, which handles it efficiently through its storage ordering.

Tenants are provisioned by the MediaLayer team during onboarding, not through self-serve signup. This is intentional — correct infrastructure sizing and API key issuance require a brief scoping conversation, and it ensures every account is set up correctly from day one.

What this enables for clients

The practical outcome of these architectural choices is a system where scale does not change the developer experience. A client querying against 1,000 indexed assets and a client querying against 100 million see the same API surface, the same JSON response shape, and similar latency. The complexity of the storage layer is entirely invisible.

Ingestion is fire-and-forget. A client submits a URL, gets a media ID, and polls for ready status. There is no upload size limit that affects throughput, no need to pre-process media before submission, and no API timeout to worry about for large files.

Search is synchronous and fast. A single API call with a media URL returns ranked matches with similarity scores and — for video — the time range in the original asset where the match was found. No polling, no job IDs, no async result collection.
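
In practice the client-side flow under those guarantees looks roughly like the sketch below; the base URL, endpoint paths, and field names are placeholders, not MediaLayer's published API reference.

```python
# Hypothetical client flow: base URL, paths, and field names are placeholders,
# not MediaLayer's published API reference.
import time
import requests

BASE = "https://api.example.com/v1"              # placeholder base URL
HEADERS = {"Authorization": "Bearer <api-key>"}

# Fire-and-forget ingest: submit a URL, receive a media ID immediately.
media = requests.post(f"{BASE}/media",
                      json={"url": "https://cdn.example.com/original-ad.mp4"},
                      headers=HEADERS).json()

# Poll until the background workers have fingerprinted the asset.
while requests.get(f"{BASE}/media/{media['media_id']}", headers=HEADERS).json()["status"] != "ready":
    time.sleep(2)

# Synchronous search: one call returns ranked matches with scores and,
# for video, the matched time range in the indexed asset.
matches = requests.post(f"{BASE}/search",
                        json={"url": "https://cdn.example.com/suspect-copy.mp4"},
                        headers=HEADERS).json()
```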

MediaLayer Enterprise AI is available for clients who need to index and search their own media libraries at scale. Plans are modular — activate Image, Audio, Video, or any combination — with published pricing and managed onboarding.

Ready to wire it in?

Subscribe on RapidAPI to call the public API on your own key, or talk to MediaLayer AI Labs about enterprise direct API access.