MediaLayer

Concept

Audio fingerprinting API explained

Concepts · 6 min read

Audio fingerprinting compares two recordings without comparing the raw bytes. This article explains what fingerprinting actually does, why hashes don't work for transcoded audio, and how POST /audio/match returns offset-aligned matched segments.

Hashes don't survive audio in the wild

If you SHA-256 an MP3 and SHA-256 a re-encoded copy of the same song, the hashes are completely different. The audio is the same; the byte stream isn't. That's the gap that audio fingerprinting fills.

A good fingerprint depends on perceptual features of the audio (spectral peaks, time–frequency landmarks) rather than on the byte stream. The fingerprint is derived from the content; the content is what matters.

What fingerprinting actually does

Take a short window of the audio, transform it into the frequency domain, find the locally salient peaks, and store their relative time positions. That set of (frequency, relative-time) anchors is the fingerprint.

Two recordings match when enough anchors agree — and crucially, when those agreements are time-consistent. A real match has a bunch of anchor pairs at a fixed time offset. A coincidental match has anchors scattered across offsets. The fingerprinter scores by counting the largest time-consistent group, which is also why /audio/match can return the offset-aligned segment, not just a yes / no.
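To make the offset test concrete, here is a toy sketch (not MediaLayer's actual implementation) of offset-histogram scoring: anchor pairs that land in the same frequency bin vote for the time offset between their occurrences, and the largest bucket wins.

```python
from collections import Counter

def best_offset_score(source_anchors, target_anchors):
    """Toy offset-histogram scorer. Each anchor is a (freq_bin, time) pair.

    Anchors sharing a frequency bin vote for the time offset between
    their occurrences; a real match piles votes onto one offset, while
    coincidental hits scatter across many.
    """
    votes = Counter()
    for f_s, t_s in source_anchors:
        for f_t, t_t in target_anchors:
            if f_s == f_t:
                votes[round(t_t - t_s, 2)] += 1
    if not votes:
        return 0, None
    offset, count = votes.most_common(1)[0]
    return count, offset
```

A clip trimmed 5 s into the original produces one dominant offset bucket at 5.0; that dominant offset is exactly the alignment /audio/match exposes as matched_segments.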

What POST /audio/match returns

  • match: Boolean decision based on a sensible threshold for audio.
  • similarity_score: Float in [0, 1]. Use this directly for custom thresholds — different domains tolerate different false-positive rates.
  • matched_segments: Aligned start/end timestamps. source_start to source_end maps to target_start to target_end at a fixed time offset. This is the fingerprint alignment, exposed.
RESPONSE · /AUDIO/MATCH
{
  "match": true,
  "confidence": "medium",
  "similarity_score": 0.74,
  "processing_time_ms": 612,
  "media_type": "audio",
  "matched_segments": [
    { "source_start": 5.1, "source_end": 18.9, "target_start": 0.0, "target_end": 13.8, "score": 0.78 }
  ]
}
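As a sanity check on a response, you can recover the fixed alignment offset from any segment: target_start − source_start and target_end − source_end should agree for a genuine time-consistent match. A small sketch (the helper name is ours, not part of the API):

```python
def segment_offset(seg: dict, tolerance: float = 0.05) -> float:
    """Return the target-minus-source alignment offset for one segment.

    For the example response above: 0.0 - 5.1 = -5.1, i.e. the matched
    audio starts 5.1 s into the source and at the top of the target.
    """
    start_off = seg["target_start"] - seg["source_start"]
    end_off = seg["target_end"] - seg["source_end"]
    if abs(start_off - end_off) > tolerance:
        raise ValueError("segment endpoints disagree on the offset")
    return start_off
```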

Calling the endpoint

CURL
curl -X POST https://medialayer-image-audio-video-matching-api.p.rapidapi.com/audio/match \
  -H "x-rapidapi-key: YOUR_RAPIDAPI_KEY" \
  -H "x-rapidapi-host: medialayer-image-audio-video-matching-api.p.rapidapi.com" \
  -H "Content-Type: application/json" \
  -d '{
    "source_url": "https://example.com/source.mp3",
    "target_url": "https://example.com/target.wav"
  }'
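The same request from Python, using only the standard library (the wrapper function is ours; the endpoint, headers, and body mirror the curl call above):

```python
import json
import urllib.request

API_HOST = "medialayer-image-audio-video-matching-api.p.rapidapi.com"

def match_audio(source_url: str, target_url: str, api_key: str,
                timeout: float = 60.0) -> dict:
    """POST /audio/match and return the parsed JSON response."""
    body = json.dumps({"source_url": source_url,
                       "target_url": target_url}).encode("utf-8")
    req = urllib.request.Request(
        f"https://{API_HOST}/audio/match",
        data=body,
        headers={
            "x-rapidapi-key": api_key,
            "x-rapidapi-host": API_HOST,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # Bounded timeout: long sources can take several seconds to match.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```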

What survives, what doesn't

  • Survives: Codec changes (MP3 ↔ AAC ↔ Opus), bitrate drops, container swaps, sample-rate conversions, light EQ, modest pitch / time stretching, and re-uploads of clips trimmed from longer originals.
  • Doesn't survive: Heavy time-stretching that distorts the time–frequency relationships, very aggressive pitch-shifts, or recordings whose speech / content is fundamentally re-performed (a re-recorded cover, not a re-encoded copy).
  • Edge cases: Silence, room tone, and very short clips (< 2 s) make for unreliable fingerprints. Per-request duration caps are 300 s for audio, which keeps unbounded files out of the queue.
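Those edge cases are cheap to screen for client-side before spending a request. A minimal pre-check using the limits above (the helper name is ours):

```python
MIN_CLIP_SECONDS = 2.0    # clips under ~2 s fingerprint unreliably
MAX_CLIP_SECONDS = 300.0  # per-request duration cap for audio

def clip_is_fingerprintable(duration_seconds: float) -> bool:
    """Reject clips too short to fingerprint or over the duration cap."""
    return MIN_CLIP_SECONDS <= duration_seconds <= MAX_CLIP_SECONDS
```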

Choosing a threshold

match defaults to a sensible per-medium threshold. If you want a stricter or looser cutoff — e.g., for monetization workflows that share revenue based on overlap — use similarity_score directly and pick the threshold that matches your false-positive tolerance.

A simple, durable pattern is to combine score with overlap duration: require similarity_score >= 0.6 AND total_overlap_seconds >= 5. That discards short coincidental matches without depending on a single fragile cutoff.

PYTHON · SCORE + DURATION
def is_real_audio_match(response: dict) -> bool:
    score = response["similarity_score"]
    segs = response.get("matched_segments", [])
    overlap = sum(s["source_end"] - s["source_start"] for s in segs)
    return score >= 0.6 and overlap >= 5.0

Production checklist

  • Public URLs: URL validation rejects private, loopback, and cloud-metadata addresses, so signed-only-from-VPC URLs won't work for the public endpoint.
  • Bounded timeouts: Audio matching is fast for short clips (sub-second to a few seconds) but can run several seconds for long sources. Pick a 30–60 s timeout in your client.
  • Server-side keys: Never expose x-rapidapi-key in browser or mobile clients. Call from your backend.

Ready to wire it in?

Subscribe on RapidAPI to call the public API on your own key, or talk to MediaLayer AI Labs about enterprise direct API access.