FrameFetch

How does an AI agent get a video transcript and specific frames from one API?

Short answer: An AI agent gets a video's metadata, transcript, and specific frames by calling FrameFetch's POST /v1/extract endpoint, or its MCP tool, with a video URL and the fields it needs. The same call works across YouTube (including Shorts), TikTok, Reddit, Instagram, and Pinterest. It returns metadata and insights, a transcript (platform captions when available, otherwise a Whisper transcription), and parametric frames (chosen by fps or by exact timestamps) that are uploaded to S3 and returned as URLs. No per-platform scrapers are needed.

Agent-first: typed errors, refund-on-fail, result caching, and an MCP server.

The problem it removes

Every platform needs its own scraper, rate limits differ, transcripts are a separate pipeline, and grabbing specific frames at exact times is painful. Agents end up maintaining brittle glue per site. FrameFetch is one interface over all of it.

The one call

curl -X POST https://framefetch.net/v1/extract \
  -H "Authorization: Bearer YOUR_KEY" -H "Content-Type: application/json" \
  -d '{"url":"https://youtu.be/jNQXAC9IVRw",
       "fields":["metadata","transcript","frames"],
       "frames":{"mode":"fps","fps":1,"width":480}}'

Via MCP, add the server once, then call the same extract operation as a tool:

{ "mcpServers": { "framefetch": {
  "url": "https://framefetch.net/mcp",
  "headers": { "Authorization": "Bearer YOUR_KEY" } } } }

Get a key (includes a free $0.05 credit): POST https://framefetch.net/v1/keys with the JSON body {"email":"you@example.com"}
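For agents that run Python rather than shell, the curl call at the top of this section can be sketched with the standard library alone. The endpoint, headers, and body mirror the curl example; the helper name `build_extract_request` is hypothetical, and the assumption that the response body is JSON is not stated on this page.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://framefetch.net/v1/extract"


def build_extract_request(api_key, video_url, fields, frames=None):
    """Build (but do not send) a POST /v1/extract request.

    `build_extract_request` is a hypothetical helper name; the body keys
    ("url", "fields", "frames") are the ones shown in the curl example.
    """
    body = {"url": video_url, "fields": list(fields)}
    if frames is not None:
        body["frames"] = frames
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same request as the curl example: metadata, transcript, and 1 fps frames at 480 px.
req = build_extract_request(
    "YOUR_KEY",
    "https://youtu.be/jNQXAC9IVRw",
    ["metadata", "transcript", "frames"],
    frames={"mode": "fps", "fps": 1, "width": 480},
)

# To actually send it (assumes the API answers with a JSON body):
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any credit is spent.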

What comes back

| Field | What it is |
|---|---|
| metadata + insights | title, author, duration, views/likes/comments |
| transcript | captions if present, else a Whisper transcription (source marked) |
| frames | by fps or exact timestamps; any size; jpg/png/webp; returned as S3 URLs with index + time |
| cost | exact per-call breakdown, with refund-on-fail |

Shortcut endpoints & pricing

| Endpoint | Use | Price |
|---|---|---|
| /v1/metadata | metadata + insights only (cheapest) | ≈ $0.002 |
| /v1/transcript | captions or Whisper | ≈ $0.0015 / audio-min |
| /v1/frames | frames only (needs a frames spec) | $0.00012 / frame |
| /v1/extract | any combination in one call | sum of the above |

Pay per call via x402 (USDC on Base, no account) or Stripe. A 3-min transcript ≈ $0.02; 60 frames @480px ≈ $0.08–0.12.
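As a rough budgeting aid, the unit rates in the table can be multiplied out before a call. This is only a sketch: `estimate_cost` is a hypothetical helper, it uses nothing but the three listed rates, and the worked examples on this page come out higher than those rates alone. Treat the result as a rough lower bound and rely on the cost breakdown returned with each call.

```python
# Unit rates copied from the pricing table (USD); the first two are approximate there.
METADATA_USD = 0.002
TRANSCRIPT_USD_PER_AUDIO_MIN = 0.0015
FRAME_USD = 0.00012


def estimate_cost(fields, audio_minutes=0.0, frame_count=0):
    """Sum the listed unit rates for the requested fields.

    Hypothetical helper: it ignores anything not in the pricing table,
    so the figure is an estimate, not a quote.
    """
    total = 0.0
    if "metadata" in fields:
        total += METADATA_USD
    if "transcript" in fields:
        total += TRANSCRIPT_USD_PER_AUDIO_MIN * audio_minutes
    if "frames" in fields:
        total += FRAME_USD * frame_count
    return round(total, 6)


# Example: all three fields for a 3-minute video sampled into 60 frames.
estimate = estimate_cost(
    ["metadata", "transcript", "frames"], audio_minutes=3, frame_count=60
)
```

An agent could compare such an estimate against its remaining credit before deciding which fields to request.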

FAQ

Which platforms are supported?
YouTube (incl. Shorts), TikTok, Reddit, Instagram, Pinterest. GET /v1/platforms returns the live matrix.
How do I get only specific frames?
Pass a frames spec: either mode:"fps" with an fps value, or exact timestamps, plus width and format. Frames are uploaded to S3 and returned as URLs.
Where does the transcript come from?
Captions when available; otherwise Whisper transcribes the audio. The response marks the source.
What makes it agent-first?
Typed error codes, refund-on-fail, result caching, an MCP server, and pay-per-call with no signup friction.
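The frames spec described in the FAQ above could be assembled with a small helper like the one below. Only `mode: "fps"`, `fps`, and `width` appear in this page's example. The `"timestamps"` mode value, the `timestamps` key, and the `format` key are assumptions made for illustration, as is the helper name `frames_spec`, so check the API reference before relying on them.

```python
ALLOWED_FORMATS = ("jpg", "png", "webp")  # formats listed in the "What comes back" table


def frames_spec(fps=None, timestamps=None, width=None, fmt=None):
    """Build a frames spec for either fps sampling or exact timestamps.

    Hypothetical helper. The "timestamps" mode/key and the "format" key
    are assumed names, not confirmed by this page.
    """
    if (fps is None) == (timestamps is None):
        raise ValueError("pass exactly one of fps or timestamps")

    if fps is not None:
        if fps <= 0:
            raise ValueError("fps must be positive")
        spec = {"mode": "fps", "fps": fps}
    else:
        if any(t < 0 for t in timestamps):
            raise ValueError("timestamps must be non-negative seconds")
        # De-duplicate and sort so the same frame is not requested twice.
        spec = {"mode": "timestamps", "timestamps": sorted(set(timestamps))}

    if width is not None:
        spec["width"] = width
    if fmt is not None:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
        spec["format"] = fmt
    return spec


# The spec from this page's curl example.
fps_spec = frames_spec(fps=1, width=480)

# Three exact moments (in seconds), under the assumed key names.
ts_spec = frames_spec(timestamps=[12.5, 3, 12.5], width=480, fmt="webp")
```

Validating the spec locally is a design choice: a malformed spec fails before a request is made instead of coming back as a typed error.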
