YouTube Transcript API

Get the transcript of any YouTube video from a single URL. FrameFetch returns the caption track when one exists, and falls back to Whisper when it doesn't — plus metadata, view/like counts, frames, and the on-screen text burned into them.

What you get

For YouTube, FrameFetch returns metadata, insights (views/likes/comments), transcript (captions or Whisper), parametric frames, and on-screen text (OCR) per frame. YouTube is also the only platform that supports raw comments and the aggregated comment sentiment rollup — no competing API offers that summary. One JSON response, billed per call — every response includes a cost block.

Quickstart

curl -X POST https://framefetch.net/v1/extract \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.youtube.com/watch?v=jNQXAC9IVRw",
    "fields": ["metadata", "transcript"]
  }'

Get a key with POST /v1/keys (free credit). Full reference in the docs. Agents can pay per call with x402 (USDC) — no account.

Also from Python or Node

The curl call above is the whole API — every client library is a thin wrapper over the same POST /v1/extract. From Node, the framefetch npm package wraps it in a scoped transcript() helper:

import { FrameFetch } from 'framefetch';

const ff = new FrameFetch({ apiKey: process.env.FRAMEFETCH_API_KEY });

const result = await ff.transcript('https://www.youtube.com/watch?v=jNQXAC9IVRw');
console.log(result.transcript.source, result.transcript.text.slice(0, 80));
// timed cues, when the source carries them
for (const seg of result.transcript.segments ?? []) {
  console.log(seg.start, seg.text);
}

From Python (no official SDK — just requests, same as every other FrameFetch call):

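import requests

# POST the video URL to /v1/transcript.
resp = requests.post(
    "https://framefetch.net/v1/transcript",
    headers={"Authorization": "Bearer <your-key>"},
    json={"url": "https://www.youtube.com/watch?v=jNQXAC9IVRw"},
    timeout=180,  # generous: the Whisper path downloads and transcribes audio
)
resp.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
data = resp.json()
# source is "captions" or "whisper"; text is the flat transcript string
print(data["transcript"]["source"], data["transcript"]["text"][:80])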

The Node client installs with npm install framefetch. It has zero dependencies and needs Node 18+ for its built-in fetch. The client also exposes extract(), ask(), metadata(), frames(), platforms(), status(), demo(), and createKey(). For Python, see the dedicated Python transcript API page.

Captions first, Whisper only as a fallback

FrameFetch never transcribes a video that already has captions. The metadata probe checks for a caption track first; when one exists, getTranscript() fetches it directly with yt-dlp — both uploader-authored subtitles and YouTube's own auto-captions (--write-subs --write-auto-subs) — and returns it as-is, marked "source": "captions". Only when there is genuinely nothing to read does it fall back to downloading the audio and transcribing it with Groq's whisper-large-v3-turbo, marked "source": "whisper". You never pick which path runs — the response tells you which one did, every time.

The two paths are not interchangeable. A caption track comes back character-for-character as YouTube serves it — nothing is re-recognized, so there is no speech-recognition error to inherit. Whisper is a genuine transcription pass over the downloaded audio, carrying whatever error rate whisper-large-v3-turbo has on that specific recording (accents, cross-talk, background music). One honest gap: the API does not currently distinguish a manually-authored subtitle track from YouTube's own auto-captions — both come back identically as "source": "captions", so if you specifically need to know which kind a video has, this endpoint will not tell you.
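Because the path is chosen for you, client code usually branches on transcript.source after the fact. A minimal sketch, assuming the response has already been parsed into a dict with the field names shown on this page; describe_provenance is a hypothetical helper, not part of any FrameFetch client:

```python
def describe_provenance(result: dict) -> str:
    """Summarize how a transcript was produced, using only documented fields."""
    transcript = result.get("transcript") or {}
    source = transcript.get("source")
    if source == "captions":
        # Fetched as served by YouTube. The response does not say whether
        # the track was manually authored or auto-generated.
        return "captions: fetched as-is (manual vs auto-generated unknown)"
    if source == "whisper":
        # Speech recognition over downloaded audio, so errors are possible.
        return "whisper: machine transcription, review before quoting"
    return "no transcript in this response"


print(describe_provenance({"transcript": {"source": "captions", "text": "..."}}))
```

Keeping the branch in one place makes it easy to add a review step for Whisper output without touching caption-sourced transcripts.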

Caption fetching also has a language scope worth knowing: it requests English tracks specifically (en, en-orig — YouTube's auto-caption tag — en-US, en-GB), not every language a video might carry subtitles in. A video whose only captions are, say, Japanese falls straight through to Whisper instead — which has no such restriction, since no language is set on the Whisper request; it auto-detects and transcribes whatever is actually spoken. And YouTube's rolling auto-captions repeat trailing words as their on-screen window scrolls (a raw export reads like "hello there hello there this is a…") — FrameFetch's parser detects and strips that overlap cue by cue before you see it, in both the flat text and the timed segments.
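To make the overlap stripping concrete, here is one way such a pass could work. This is an illustrative sketch only, not FrameFetch's actual parser: strip_rolling_overlap is a hypothetical function, and it assumes cues arrive as plain strings in display order.

```python
from __future__ import annotations


def strip_rolling_overlap(cues: list[str]) -> list[str]:
    """Drop words at the start of each cue that repeat the end of the previous cue."""
    cleaned: list[str] = []
    prev_words: list[str] = []
    for cue in cues:
        words = cue.split()
        overlap = 0
        # Find the longest run where the previous raw cue's trailing words
        # equal this cue's leading words.
        for k in range(min(len(prev_words), len(words)), 0, -1):
            if prev_words[-k:] == words[:k]:
                overlap = k
                break
        fresh = words[overlap:]
        if fresh:
            cleaned.append(" ".join(fresh))
        prev_words = words  # compare the next cue against the raw text
    return cleaned


raw = ["hello there", "hello there this is a", "this is a test"]
print(" ".join(strip_rolling_overlap(raw)))  # hello there this is a test
```

A word-level match like this would also strip a speaker's genuine repetition when it happens to straddle a cue boundary, so a production parser likely needs timing information as well.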

What captions-first actually saves you

The rate card bills transcription at $0.0015 per audio-minute — but that rate only ever applies to the Whisper path. A caption-sourced transcript consumes zero transcript-minutes: the cost model only counts transcriptMinutes when transcript.source === "whisper", so a captions call is billed just the base per-call floor, $0.002, regardless of how long the video is.

| Video length | Whisper-sourced transcript | Caption-sourced transcript | Difference |
| --- | --- | --- | --- |
| Under ~1m 15s | $0.002 (floor) | $0.002 (floor) | none (both floor-priced) |
| 5 minutes | $0.00765 | $0.002 | ~3.8× |
| 10 minutes | $0.01515 | $0.002 | ~7.6× |
| 20 minutes | $0.0302 | $0.002 | ~15× |
| 60 minutes | $0.09015 | $0.002 | ~45× |

These figures are computed directly from the published rate card ($0.0015/audio-minute, $0.002 call floor) for a plain fields: ["metadata","transcript"] call, not measured by re-running against a specific video. The crossover sits at roughly 74 seconds of audio: below that, a Whisper call stays pinned at the $0.002 floor; past it, the price scales linearly with duration while a caption-sourced call never moves off the floor. For most real talking-head or tutorial-length YouTube videos, which run from several minutes to well over an hour, that is the difference between a call that costs the floor and one that costs roughly 4× to 45× as much. The Whisper path is also slower: captions skip the video download entirely (--skip-download), while Whisper has to pull the audio first.
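The table can be reproduced with a few lines of arithmetic. The sketch below is an estimate under stated assumptions, not the billing code: the rate and floor come from the rate card quoted above, while FIXED_MICROS is inferred from the table (each Whisper figure sits $0.00015 above rate × minutes), and fractional minutes with simple rounding are also assumptions. All names are hypothetical.

```python
RATE_MICROS_PER_MIN = 1500  # $0.0015 per audio-minute (rate card)
FLOOR_MICROS = 2000         # $0.002 per-call floor (rate card)
FIXED_MICROS = 150          # ASSUMPTION: inferred from the table rows, not documented


def estimate_micros(duration_sec: float, source: str) -> int:
    """Estimate the cost of a metadata+transcript call in micro-dollars."""
    # Only Whisper-sourced transcripts consume transcript-minutes.
    minutes = duration_sec / 60 if source == "whisper" else 0.0
    raw = FIXED_MICROS + round(RATE_MICROS_PER_MIN * minutes)
    return max(FLOOR_MICROS, raw)


for mins in (1, 5, 10, 60):
    whisper = estimate_micros(mins * 60, "whisper")
    captions = estimate_micros(mins * 60, "captions")
    print(f"{mins:>2} min  whisper ${whisper / 1e6:.5f}  captions ${captions / 1e6:.5f}")
```

Working in integer micro-dollars mirrors the cost.totalMicros field in the response and avoids floating-point drift when summing many calls.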

What's in the response

| Field | Shape | What it is |
| --- | --- | --- |
| `transcript.text` | string | The full transcript as one flat string |
| `transcript.source` | string | `"captions"` (fetched, not re-recognized) or `"whisper"` (transcribed from downloaded audio) |
| `transcript.lang` | string \| omitted | Best-effort source language code, when known |
| `transcript.segments` | array \| omitted | Timed cues, `{ start, end, text }` in seconds, when the source carried timing. Omitted entirely (never `[]`) when it did not |
| `captionsAvailable` | boolean | Top-level flag: whether the metadata probe saw a caption track at all, independent of which fields you actually requested |

{
  "platform": "youtube",
  "url": "https://www.youtube.com/watch?v=jNQXAC9IVRw",
  "captionsAvailable": true,
  "metadata": { "title": "...", "uploader": "...", "durationSec": 19, "uploadDate": "...", "sourceFps": 30, "thumbnail": "https://i.ytimg.com/..." },
  "transcript": {
    "text": "...",
    "source": "captions",
    "lang": "en",
    "segments": [ { "start": 0, "end": 3.4, "text": "..." }, { "start": 3.4, "end": 7.1, "text": "..." } ]
  },
  "cost": { "totalMicros": 2000 }
}

Field names and types are the real response contract (ExtractResult / TranscriptResult — see the full response shape) — this is a schema illustration, not a re-run against a specific video, so free-text fields are shown as "...". One field is worth calling out: uploadDate is populated here because this is a direct /v1/extract lookup; the same video found via /v1/search instead would come back with uploadDate: null — different code paths, different completeness (see below).
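Since segments is omitted rather than empty when there is no timing, consumers should test for the key's absence. A small sketch of that handling, assuming a transcript dict shaped like the one above; format_cues and _mmss are hypothetical helpers:

```python
from __future__ import annotations


def _mmss(seconds: float) -> str:
    """Render a second offset as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_cues(transcript: dict) -> list[str]:
    """Return timestamped lines, or the flat text when no timing is present."""
    segments = transcript.get("segments")
    if segments is None:
        # The key is omitted (never an empty list) when the source had no timing.
        return [transcript["text"]]
    return [f"[{_mmss(seg['start'])}] {seg['text']}" for seg in segments]


sample = {
    "text": "first second",
    "segments": [
        {"start": 0, "end": 3.4, "text": "first"},
        {"start": 63.4, "end": 70.0, "text": "second"},
    ],
}
print("\n".join(format_cues(sample)))
```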

Finding a video when you don't have a URL yet

Everything above needs a URL to start from. When you are starting from a topic instead, POST /v1/search finds YouTube videos by keyword via yt-dlp's ytsearchN: extractor, in one lightweight pass with no login and no per-video download. It is YouTube-only, and that is a stated limit rather than an oversight: YouTube is the one platform where yt-dlp's search extractor actually works, so any other platform value is rejected with a typed 400 rather than silently falling back to something that would not really search. Each call costs a flat $0.002 regardless of how many results come back (1 to 25, default 10), the same price class as a bare metadata-only extract.

One real gap worth knowing before you rely on it: a search result's uploadDate comes back null in practice, because yt-dlp's flat-search JSON simply does not carry it. title, uploader, durationSec, thumbnail, and views come back reliably from search; uploadDate does not. The same field on a direct /v1/extract or /v1/transcript call for that video's URL (the response above) is populated normally. If you need the real upload date, take the search hit's url and look the video up directly.

The natural pairing: search for candidates, then call this transcript endpoint — or a top-level ask question — on whichever result's url looks right. One call to find the video, one to read or question it. Full field-by-field reference on the YouTube Video Search API page.
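Between the two calls there is usually a small selection step. The sketch below assumes the search hits have already been parsed into a list of dicts carrying the field names mentioned above (url, durationSec, views); the exact envelope of the search response is not shown on this page, and pick_candidates is a hypothetical helper.

```python
from __future__ import annotations


def pick_candidates(hits: list[dict], max_duration_sec: int, limit: int = 3) -> list[str]:
    """Choose which search hits to follow up with a transcript call."""
    # Cap duration: a Whisper fallback is billed per audio-minute.
    usable = [
        hit for hit in hits
        if hit.get("durationSec") is not None and hit["durationSec"] <= max_duration_sec
    ]
    # Prefer the most-viewed results among those that fit.
    usable.sort(key=lambda hit: hit.get("views") or 0, reverse=True)
    return [hit["url"] for hit in usable[:limit]]


hits = [
    {"url": "https://www.youtube.com/watch?v=aaa", "durationSec": 300, "views": 10},
    {"url": "https://www.youtube.com/watch?v=bbb", "durationSec": 7200, "views": 999},
    {"url": "https://www.youtube.com/watch?v=ccc", "durationSec": 120, "views": 50},
]
print(pick_candidates(hits, max_duration_sec=1800))
```

Each returned url can then go straight into a transcript call, which is also where a populated uploadDate comes from.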

Use it from an AI agent (MCP)

FrameFetch ships an MCP server at POST https://framefetch.net/mcp with the tools framefetch_extract, framefetch_search, framefetch_account and framefetch_platform_capabilities — point your agent at a YouTube URL directly. See the MCP setup guide for a working Claude Desktop / Cursor config.

No-code: n8n

Building this without code? The n8n YouTube transcript workflow guide has a ready-to-import workflow JSON — one HTTP Request node for the transcript, a second for a direct ask question — plus why n8n's built-in YouTube node can't pull captions for a video you don't own.

FAQ

Does it work when a YouTube video has no captions?

Yes. If there is no caption track, FrameFetch transcribes the audio with Whisper and returns the text the same way.

Can I also get frames or metadata in the same call?

Yes — add "frames" and "metadata" to the fields array in one /v1/extract call.

Does it support YouTube Shorts?

Yes, Shorts URLs work the same way. See the Shorts page.

Can I read on-screen text from a YouTube video?

Yes — add text_overlay alongside frames to run OCR on each extracted frame and get back burned-in captions, titles, or lower-thirds.

Can I see how the audience reacted to a YouTube video?

Yes — add comments for up to comments_cap raw top-level comments, or comment_sentiment for an aggregated mood rollup (percentages, themes, summary) over them. See the comment sentiment API for the full shape and pricing.

Is captions-first actually cheaper, or just faster?

Both. Whisper transcription is billed $0.0015/audio-minute; a caption-sourced transcript bills none of that — just the $0.002 per-call floor, regardless of video length. Past about 74 seconds of audio, a Whisper call's price climbs past what the equivalent caption-sourced call costs; on a 20-minute video that is roughly a 15× gap. See the pricing breakdown above.

Does FrameFetch say whether a caption track was auto-generated or written by a human?

No — that distinction is not in the response today. Both a manually-authored subtitle track and YouTube's own auto-captions come back identically as transcript.source: "captions".

What language does the transcript come back in?

Depends on the path. Caption fetching specifically requests English tracks (en, en-orig, en-US, en-GB) — a video whose only captions are in another language falls through to Whisper instead. Whisper itself has no language restriction: it auto-detects and transcribes whatever is actually spoken, in any language.

How do I find a video if I don't already have the URL?

POST /v1/search with a keyword — YouTube only, flat $0.002 per call, up to 25 results. Take a result's url and call this endpoint on it. Note that uploadDate comes back null from search specifically (yt-dlp's flat-search JSON does not carry it); it is populated normally on a direct lookup like this one. Full detail on the Video Search API page.