An AI agent can't watch a video. To reason about one it needs the video turned into text and images first. This guide shows the three things an agent needs — transcript, metadata, and frames — and how to fetch all of them from a single URL in one call.
Send a social-video URL to an extraction API and get back: a transcript (what is said), metadata + insights (what the video is, who posted it, how it performed), and sampled frames (what it looks like at chosen moments). With FrameFetch that is one POST /v1/extract across YouTube, YouTube Shorts, TikTok, Instagram Reels, Pinterest, and Reddit — or one MCP tool call if your agent speaks Model Context Protocol.
Large language models are text-and-image reasoners. A raw .mp4 is neither. Three derived signals close the gap:
| Signal | Answers | Built from |
|---|---|---|
| Transcript | What is being said? | Platform captions, else Whisper speech-to-text |
| Metadata & insights | What is this and how did it do? | Title, author, duration, date, views, likes, comments |
| Frames | What does it look like? | Parametric sampling — every Nth, 1-per-second, or a time range, at any width |
Feed any subset into your model's context and it can summarise, classify, fact-check, caption, or search the video without ever downloading it.
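As a rough illustration of feeding a subset into a model's context, the sketch below turns whichever signals are present into a list of content parts. The input keys (`metadata`, `transcript`, `frames`) and the part format are assumptions made for this example, not FrameFetch's documented response shape or any particular model SDK's message format.

```python
from typing import Any


def build_context(signals: dict[str, Any], max_frames: int = 4) -> list[dict[str, Any]]:
    """Convert the available signals into text and image parts for a model prompt.

    `signals` is a hypothetical dict; adapt the keys to the real response.
    """
    parts: list[dict[str, Any]] = []

    meta = signals.get("metadata")
    if meta:
        # Flatten metadata into one short text line.
        summary = ", ".join(f"{key}: {value}" for key, value in meta.items())
        parts.append({"type": "text", "text": f"Video metadata: {summary}"})

    transcript = signals.get("transcript")
    if transcript:
        parts.append({"type": "text", "text": f"Transcript: {transcript}"})

    # Cap the number of frames so the prompt stays small.
    for url in (signals.get("frames") or [])[:max_frames]:
        parts.append({"type": "image_url", "url": url})

    return parts
```

Missing signals are simply skipped, so the same function works for a transcript-only request or a full three-signal one.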
The transcript is the cheapest, highest-value signal. Captions are used when the platform has them and Whisper when it doesn't, so you still get text for caption-less Reddit and TikTok clips.
curl -X POST https://framefetch.net/v1/transcript \
-H "Authorization: Bearer <your-key>" \
-H "Content-Type: application/json" \
-d '{ "url": "https://www.tiktok.com/@user/video/123" }'
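For agents written in Python, the same request can be sketched with the standard library alone. The helper name `build_request` is hypothetical; the URL, path, headers, and body mirror the curl command above. Sending and response parsing are left out because the response format is not described here.

```python
import json
import urllib.request

BASE_URL = "https://framefetch.net"


def build_request(path: str, api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same call as the curl example: transcript for a TikTok URL.
req = build_request(
    "/v1/transcript",
    "<your-key>",
    {"url": "https://www.tiktok.com/@user/video/123"},
)
# To send it: urllib.request.urlopen(req, timeout=30)
```

Keeping request construction separate from sending makes the call easy to unit-test and to reuse for the other endpoints by changing the path and body.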
Metadata covers title, author, duration, and upload date, plus views, likes, and comments. It is useful for ranking, dedup, and "is this worth processing" gates before you spend on a transcript or frames.
curl -X POST https://framefetch.net/v1/metadata \
-H "Authorization: Bearer <your-key>" \
-H "Content-Type: application/json" \
-d '{ "url": "https://www.youtube.com/watch?v=..." }'
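An "is this worth processing" gate might look like the sketch below. The field names `views` and `duration` (in seconds) and the default thresholds are assumptions chosen for illustration; check the actual metadata response before relying on them.

```python
def worth_processing(meta: dict, min_views: int = 1000, max_duration_s: float = 600.0) -> bool:
    """Decide whether a video justifies paying for a transcript or frames.

    `meta` is a hypothetical metadata dict with `views` and `duration` keys.
    """
    views = meta.get("views")
    duration = meta.get("duration")
    if views is None or duration is None:
        # Design choice: skip videos with missing data rather than spend on them.
        return False
    return views >= min_views and duration <= max_duration_s
```

Running this check on the cheap metadata call first means the more expensive transcript and frame requests are only made for videos that pass.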
For visual questions — products shown, on-screen text, scene changes — pull frames. Sampling is parametric so you don't pay for 30 fps you don't need: one per second, every Nth frame, or a specific [start,end] range, at the width you want.
curl -X POST https://framefetch.net/v1/frames \
-H "Authorization: Bearer <your-key>" \
-H "Content-Type: application/json" \
-d '{ "url": "https://www.youtube.com/watch?v=...",
"frames": { "mode": "fps", "fps": 1, "width": 512 } }'
Frames come back as time-limited signed image URLs you can hand straight to a vision model.
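Because frames are sampled rather than decoded at the native rate, it helps to estimate how many frames a request will produce before sending it. The arithmetic below is a sketch: the function names are hypothetical, and it assumes one frame per sampling interval starting at the range start, which may differ from how the service counts.

```python
import math
from typing import Optional


def frames_at_rate(duration_s: float, fps: float,
                   start_s: float = 0.0, end_s: Optional[float] = None) -> int:
    """Estimate frames when sampling `fps` frames per second over a time range."""
    end = duration_s if end_s is None else min(end_s, duration_s)
    span = max(0.0, end - start_s)
    return math.ceil(span * fps)


def frames_every_nth(duration_s: float, source_fps: float, n: int) -> int:
    """Estimate frames when keeping every Nth frame of the source video."""
    total_source_frames = math.floor(duration_s * source_fps)
    return math.ceil(total_source_frames / n)


# A 60-second clip: 1 frame per second versus every native frame at 30 fps.
print(frames_at_rate(60, fps=1))   # 60
print(frames_at_rate(60, fps=30))  # 1800
```

Under these assumptions, 1 fps sampling on a one-minute clip requests 60 frames instead of 1800, which is the saving that parametric sampling is meant to give.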
To get all three in one call, use POST /v1/extract with "fields": ["metadata","transcript","frames"]. You're billed only for what you request.

If you build on Claude, Cursor, or any MCP client, add FrameFetch as a server and the model can call it on a video URL directly, with no glue code:
claude mcp add --transport http framefetch \
https://framefetch.net/mcp \
--header "Authorization: <your-key>"
It exposes framefetch_extract and framefetch_platform_capabilities.
An autonomous agent shouldn't need a human to provision an API key. With x402 the agent calls the endpoint, gets an HTTP 402 with payment requirements, pays in USDC from its own wallet, and retries — no signup, no human. FrameFetch settles x402 on Base mainnet and is listed in the x402 Bazaar, so discovery is automatic. Humans can still use a free tier, prepaid credits, or a Stripe card.
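The call, pay, retry loop can be sketched without committing to a particular HTTP client or wallet library. In the sketch below, `send` and `pay` are hypothetical callables supplied by the caller, and the payment header name is a parameter whose default is an assumption; consult the x402 specification for the exact header and payload format.

```python
from typing import Callable, Mapping, Tuple

Response = Tuple[int, dict]              # (HTTP status, decoded JSON body)
Send = Callable[[Mapping[str, str]], Response]  # performs the request with extra headers
Pay = Callable[[dict], str]              # turns payment requirements into a payment proof


def call_with_x402(send: Send, pay: Pay, payment_header: str = "X-PAYMENT") -> Response:
    """Call an endpoint, and if it answers 402, pay once and retry once."""
    status, body = send({})
    if status != 402:
        # No payment needed (or a different error): return as-is.
        return status, body

    # The 402 body is assumed to carry the payment requirements.
    proof = pay(body)

    # Retry exactly once with the proof attached; never loop on repeated 402s.
    return send({payment_header: proof})
```

Limiting the flow to a single retry is a deliberate choice here: an agent that keeps paying on repeated 402 responses could drain its wallet on a misbehaving endpoint.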
Pay per call: metadata is sub-cent, transcripts are metered per minute of audio, frames are metered per frame. A free credit is included on signup, and identical requests are cached at the price floor. See the pricing page for exact numbers.
An agent doesn't watch a video; it reads a transcript, metadata, and a few sampled frames derived from it. FrameFetch returns all three from one URL.
To get a transcript, POST the video URL to /v1/transcript. Captions are used when present and Whisper otherwise, so caption-less videos still return text.
An agent can pay on its own: via x402 (USDC) it pays per call with no signup or human in the loop.