MediaLayer

Concept

Audio fingerprinting API explained

Concepts · 6 min read

Audio fingerprinting compares two recordings without comparing the raw bytes. This article explains what fingerprinting actually does, why hashes don't work for transcoded audio, and how POST /audio/match returns offset-aligned matched segments.

Hashes don't survive audio in the wild

If you SHA-256 an MP3 and SHA-256 a re-encoded copy of the same song, the hashes are completely different. The audio is the same; the byte stream isn't. That's the gap that audio fingerprinting fills.

A good fingerprint depends on perceptual features of the audio (spectral peaks, time–frequency landmarks) rather than on the byte stream. The fingerprint is derived from the content; the content is what matters.

What fingerprinting actually does

Take a short window of the audio, transform it into the frequency domain, find the locally salient peaks, and store their relative time positions. That set of (frequency, relative-time) anchors is the fingerprint.

Two recordings match when enough anchors agree — and crucially, when those agreements are time-consistent. A real match has a bunch of anchor pairs at a fixed time offset. A coincidental match has anchors scattered across offsets. The fingerprinter scores by counting the largest time-consistent group, which is also why /audio/match can return the offset-aligned segment, not just a yes / no.
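To make the offset test concrete, here is a toy sketch (not MediaLayer's actual implementation) of offset-histogram scoring: anchor pairs that land in the same frequency bin vote for the time offset between their occurrences, and the largest bucket wins.

```python
from collections import Counter

def best_offset_score(source_anchors, target_anchors):
    """Toy offset-histogram scorer. Each anchor is a (freq_bin, time) pair.

    Anchors sharing a frequency bin vote for the time offset between
    their occurrences; a real match piles votes onto one offset, while
    coincidental hits scatter across many.
    """
    votes = Counter()
    for f_s, t_s in source_anchors:
        for f_t, t_t in target_anchors:
            if f_s == f_t:
                votes[round(t_t - t_s, 2)] += 1
    if not votes:
        return 0, None
    offset, count = votes.most_common(1)[0]
    return count, offset
```

A clip trimmed 5 s into the original produces one dominant offset bucket at 5.0; that dominant offset is exactly the alignment /audio/match exposes as matched_segments.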

What POST /audio/match returns

  • match: Boolean decision based on a sensible threshold for audio.
  • similarity_score: Float in [0, 1]. Use this directly for custom thresholds — different domains tolerate different false-positive rates.
  • matched_segments: Aligned start/end timestamps. source_start to source_end maps to target_start to target_end at a fixed time offset. This is the fingerprint alignment, exposed.
RESPONSE · /AUDIO/MATCH
{
  "match": true,
  "confidence": "medium",
  "similarity_score": 0.74,
  "processing_time_ms": 612,
  "media_type": "audio",
  "matched_segments": [
    { "source_start": 5.1, "source_end": 18.9, "target_start": 0.0, "target_end": 13.8, "score": 0.78 }
  ]
}
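As a sanity check on a response, you can recover the fixed alignment offset from any segment: target_start − source_start and target_end − source_end should agree for a genuine time-consistent match. A small sketch (the helper name is ours, not part of the API):

```python
def segment_offset(seg: dict, tolerance: float = 0.05) -> float:
    """Return the target-minus-source alignment offset for one segment.

    For the example response above: 0.0 - 5.1 = -5.1, i.e. the matched
    audio starts 5.1 s into the source and at the top of the target.
    """
    start_off = seg["target_start"] - seg["source_start"]
    end_off = seg["target_end"] - seg["source_end"]
    if abs(start_off - end_off) > tolerance:
        raise ValueError("segment endpoints disagree on the offset")
    return start_off
```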

Calling the endpoint

CURL
curl -X POST https://medialayer-image-audio-video-matching-api.p.rapidapi.com/audio/match \
  -H "x-rapidapi-key: YOUR_RAPIDAPI_KEY" \
  -H "x-rapidapi-host: medialayer-image-audio-video-matching-api.p.rapidapi.com" \
  -H "Content-Type: application/json" \
  -d '{
    "source_url": "https://example.com/source.mp3",
    "target_url": "https://example.com/target.wav"
  }'
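The same request from Python, using only the standard library (the wrapper function is ours; the endpoint, headers, and body mirror the curl call above):

```python
import json
import urllib.request

API_HOST = "medialayer-image-audio-video-matching-api.p.rapidapi.com"

def match_audio(source_url: str, target_url: str, api_key: str,
                timeout: float = 60.0) -> dict:
    """POST /audio/match and return the parsed JSON response."""
    body = json.dumps({"source_url": source_url,
                       "target_url": target_url}).encode("utf-8")
    req = urllib.request.Request(
        f"https://{API_HOST}/audio/match",
        data=body,
        headers={
            "x-rapidapi-key": api_key,
            "x-rapidapi-host": API_HOST,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # Bounded timeout: long sources can take several seconds to match.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```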

What survives, what doesn't

  • Survives: Codec changes (MP3 ↔ AAC ↔ Opus), bitrate drops, container swaps, sample-rate conversions, light EQ, modest pitch / time stretching, and re-uploads of clips trimmed from longer originals.
  • Doesn't survive: Heavy time-stretching that distorts the time–frequency relationships, very aggressive pitch-shifts, or recordings whose speech / content is fundamentally re-performed (a re-recorded cover, not a re-encoded copy).
  • Edge cases: Silence, room tone, and very short clips (< 2 s) make for unreliable fingerprints. Per-request duration caps are 300 s for audio, which keeps unbounded files out of the queue.
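Those edge cases are cheap to screen for client-side before spending a request. A minimal pre-check using the limits above (the helper name is ours):

```python
MIN_CLIP_SECONDS = 2.0    # clips under ~2 s fingerprint unreliably
MAX_CLIP_SECONDS = 300.0  # per-request duration cap for audio

def clip_is_fingerprintable(duration_seconds: float) -> bool:
    """Reject clips too short to fingerprint or over the duration cap."""
    return MIN_CLIP_SECONDS <= duration_seconds <= MAX_CLIP_SECONDS
```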

Choosing a threshold

match defaults to a sensible per-medium threshold. If you want a stricter or looser cutoff — e.g., for monetization workflows that share revenue based on overlap — use similarity_score directly and pick the threshold that matches your false-positive tolerance.

A simple, durable pattern is to combine score with overlap duration: require similarity_score >= 0.6 AND total_overlap_seconds >= 5. That discards short coincidental matches without depending on a single fragile cutoff.

PYTHON · SCORE + DURATION
def is_real_audio_match(response: dict) -> bool:
    score = response["similarity_score"]
    segs = response.get("matched_segments", [])
    overlap = sum(s["source_end"] - s["source_start"] for s in segs)
    return score >= 0.6 and overlap >= 5.0

Production checklist

  • Public URLs: URL validation rejects private, loopback, and cloud-metadata addresses, so signed-only-from-VPC URLs won't work for the public endpoint.
  • Bounded timeouts: Audio matching is fast for short clips (sub-second to a few seconds) but can run several seconds for long sources. Pick a 30–60 s timeout in your client.
  • Server-side keys: Never expose x-rapidapi-key in browser or mobile clients. Call from your backend.

Ready to wire it in?

Subscribe on RapidAPI to call the public API on your own key, or talk to MediaLayer AI Labs about enterprise direct API access.