How to Give an AI Agent Video Data (Transcripts, Frames, Metadata)

Q: How does an AI agent read a video?

An agent cannot watch a video directly. It needs the video turned into text and images first: a transcript (what is said), metadata and engagement insights (what the video is and how it performed), and a handful of sampled frames (what it looks like at chosen moments). FrameFetch returns all three from a single video URL in one JSON response.

The short answer

Send a social-video URL to an extraction API and get back: a transcript (what is said), metadata + insights (what the video is, who posted it, how it performed), and sampled frames (what it looks like at chosen moments). With FrameFetch that is one POST /v1/extract across YouTube, YouTube Shorts, TikTok, Instagram Reels, Pinterest, and Reddit — or one MCP tool call if your agent speaks Model Context Protocol.

Why agents need this

Large language models are text-and-image reasoners. A raw .mp4 is neither. Three derived signals close the gap:

Signal	Answers	Built from
Transcript	What is being said?	Platform captions, else Whisper speech-to-text
Metadata & insights	What is this and how did it do?	Title, author, duration, date, views, likes, comments
Frames	What does it look like?	Parametric sampling — every Nth, 1-per-second, or a time range, at any width

Feed any subset into your model's context and it can summarise, classify, fact-check, caption, or search the video without ever downloading it.

Step 1 — Get a transcript

The cheapest, highest-value signal. Captions when the platform has them; Whisper when it doesn't — so you still get text for caption-less Reddit and TikTok clips.

curl -X POST https://framefetch.net/v1/transcript \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://www.tiktok.com/@user/video/123" }'
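
The same call can be sketched in Python with only the standard library. The host, path, header names, and "url" field are copied from the curl example above; the function names are illustrative, and the response schema is not described here, so the sketch returns the decoded JSON untouched rather than guessing at field names.

```python
import json
import urllib.request

API_BASE = "https://framefetch.net"  # host taken from the curl example above


def build_transcript_request(video_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/transcript request."""
    body = json.dumps({"url": video_url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/v1/transcript",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def fetch_transcript(video_url: str, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON body as-is.

    The response schema is an unknown here, so nothing is unpacked;
    inspect the returned dict to find the transcript text.
    """
    request = build_transcript_request(video_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the part that can be checked offline separate from the part that needs a key and a network connection.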

Step 2 — Add metadata and engagement

Title, author, duration, upload date, plus views/likes/comments — useful for ranking, dedup, and "is this worth processing" gates before you spend on a transcript or frames.

curl -X POST https://framefetch.net/v1/metadata \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://www.youtube.com/watch?v=..." }'
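
An "is this worth processing" gate might look like the sketch below. The keys "views" and "duration" (in seconds) are assumed names for illustration only, since the metadata response schema is not shown here, and the thresholds are arbitrary placeholders to tune for your workload.

```python
from typing import Any, Mapping


def worth_processing(
    metadata: Mapping[str, Any],
    min_views: int = 1_000,
    max_duration_s: float = 600.0,
) -> bool:
    """Return True if the video passes both gates; missing fields fail closed.

    "views" and "duration" are hypothetical key names, not confirmed ones.
    """
    views = metadata.get("views")
    duration = metadata.get("duration")
    if not isinstance(views, (int, float)) or not isinstance(duration, (int, float)):
        return False
    return views >= min_views and duration <= max_duration_s
```

Failing closed on missing fields means a video with incomplete metadata is skipped rather than sent on to the more expensive transcript or frame calls.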

Step 3 — Sample frames (when vision matters)

For visual questions — products shown, on-screen text, scene changes — pull frames. Sampling is parametric so you don't pay for 30 fps you don't need: one per second, every Nth frame, or a specific [start,end] range, at the width you want.

curl -X POST https://framefetch.net/v1/frames \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://www.youtube.com/watch?v=...",
        "frames": { "mode": "fps", "fps": 1, "width": 512 } }'

Frames come back as time-limited signed image URLs you can hand straight to a vision model.
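
To see why the sampling rate matters, here is a rough frame-count estimate alongside a builder for the request body used above. The body shape ("mode", "fps", "width") is copied from the curl example; the count is plain arithmetic (duration times rate), and the exact number the service returns may differ by a frame at the boundaries.

```python
import math


def frames_at_rate(duration_s: float, fps: float) -> int:
    """Approximate number of frames sampled at `fps` over `duration_s` seconds."""
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration_s must be >= 0 and fps must be > 0")
    return math.ceil(duration_s * fps)


def frames_request_body(video_url: str, fps: float = 1, width: int = 512) -> dict:
    """Request body in the same shape as the curl example above."""
    return {"url": video_url, "frames": {"mode": "fps", "fps": fps, "width": width}}


# A 90-second clip: every frame of a 30 fps source versus one frame per second.
print(frames_at_rate(90, 30))  # 2700
print(frames_at_rate(90, 1))   # 90
```

For that 90-second clip, sampling at 1 fps asks for 30 times fewer frames than the full 30 fps stream.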

One call instead of three: ask for several fields at once with POST /v1/extract and "fields": ["metadata","transcript","frames"]. You're billed only for what you request.
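
A combined request body could be assembled as in this sketch. The "fields" values are the three named above; that "url" and the "frames" options carry over unchanged from the single-purpose endpoints is an assumption, so check the docs before relying on it.

```python
import json
from typing import Mapping, Optional, Sequence

KNOWN_FIELDS = ("metadata", "transcript", "frames")  # the three names given above


def extract_body(
    video_url: str,
    fields: Sequence[str],
    frames: Optional[Mapping[str, object]] = None,
) -> str:
    """Serialise a POST /v1/extract body that requests only the listed fields."""
    unknown = [name for name in fields if name not in KNOWN_FIELDS]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    body = {"url": video_url, "fields": list(fields)}
    if frames is not None:
        if "frames" not in fields:
            raise ValueError("frame options given but 'frames' was not requested")
        # Assumed: frame options sit in the body as they do for /v1/frames.
        body["frames"] = dict(frames)
    return json.dumps(body)
```

Rejecting frame options when "frames" is not in the field list catches a mismatch in your own code before any request is sent.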

Wiring it into an agent (MCP)

If you build on Claude, Cursor, or any other MCP client, add FrameFetch as a server and the model can call it with a video URL directly, with no glue code:

claude mcp add --transport http framefetch \
  https://framefetch.net/mcp \
  --header "Authorization: Bearer <your-key>"

The server exposes the tools framefetch_extract and framefetch_platform_capabilities.

Letting the agent pay for itself (x402)

An autonomous agent shouldn't need a human to provision an API key. With x402 the agent calls the endpoint, gets an HTTP 402 with payment requirements, pays in USDC from its own wallet, and retries the same request, with no signup and no human in the loop. FrameFetch settles x402 payments on Base mainnet and is listed in the x402 Bazaar, so x402-aware agents can discover it through that listing. Humans can still use a free tier, prepaid credits, or a Stripe card.
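
The request, 402, pay, retry loop can be sketched as below. The transport (`send`) and the wallet signer (`pay`) are caller-supplied, hypothetical callables, because signing a USDC payment is wallet-specific and not covered here. The "accepts" key and the "X-PAYMENT" header name follow x402 version 1 as commonly documented; treat both as assumptions and confirm them against the x402 spec and the FrameFetch docs.

```python
import base64
import json
from typing import Callable, Mapping, Tuple

# (status_code, parsed_json_body) as returned by the caller-supplied transport.
Response = Tuple[int, dict]
Send = Callable[[Mapping[str, str]], Response]  # takes extra headers, does the POST
Pay = Callable[[dict], dict]                    # signs a payment for one offer


def call_with_x402(send: Send, pay: Pay, payment_header: str = "X-PAYMENT") -> Response:
    """Try once unpaid; on 402, pay the first offered option and retry once."""
    status, body = send({})
    if status != 402:
        return status, body

    # Assumed x402 v1 shape: the 402 body lists payment options under "accepts".
    offers = body.get("accepts") or []
    if not offers:
        raise RuntimeError("402 response carried no payment requirements")

    # `pay` is expected to sign a payment for the chosen offer with the agent's wallet.
    payload = pay(offers[0])
    encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return send({payment_header: encoded})
```

Retrying only once keeps a misbehaving server from draining the wallet; a production agent would likely also cap the amount it is willing to pay per call.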

What it costs

Pay per call: metadata is sub-cent, transcripts are metered per minute of audio, and frames are metered per frame. A free credit is included on signup, and repeat identical requests are served from cache at the price floor. See the pricing page for exact numbers.

FAQ

How does an AI agent read a video?

It doesn't watch it — it reads a transcript, metadata, and a few sampled frames derived from the video. FrameFetch returns all three from one URL.

What's the easiest way to get a video transcript via API?

POST the URL to /v1/transcript. Captions are used when present, Whisper otherwise, so caption-less videos still return text.

Can an agent pay without an account?

Yes — via x402 (USDC) the agent pays per call with no signup or human in the loop.
