Every Shot of the Lead — Multi-Layer Matching for Post-Production Playlists

Finding every shot of a single performer across a post-production archive is the kind of job that quietly eats a week of an assistant editor's life. Here's how we built a three-layer matching pipeline — scene descriptions, TMDB cross-reference, and TwelveLabs multimodal video understanding — with an LLM as the final judge.

The Assistant Editor's Least Favorite Task

Somewhere in post-production right now, an assistant editor has been asked to pull every shot of the lead actress from three months of dailies. Not a scene list. Not a rough selection. Every shot. For the director's cut, for a sizzle reel, for a trailer cut the marketing team needs by Friday, for a last-minute reshoot decision that hinges on whether a particular expression exists somewhere in the footage.

It is a job that, historically, has been done by scrubbing. Open a bin. Double-click a clip. Watch until you see the face. Log it. Move on. On a project with a few hundred hours of footage, this is a week of someone's life. On a prestige series with a thousand hours of dailies, it is the reason that assistant quit.

Every post house has tried to solve this the same way: better metadata. Make the loggers tag the shots. Put the performer name in the clip description. Use a controlled vocabulary. Enforce it. Audit it. Pay for it.

It has never worked. Not at the scale post-production actually runs at, not under the deadlines post-production actually operates under, and not with the budget post-production actually has. Loggers miss shots. Descriptions disagree across episodes. Performers get listed by role in one season and by name in another. Half the archive is tagged, half isn't, and the half that is can't be trusted.

What we have been building at Ceivo is a different answer — one that doesn't depend on any single signal being correct, and that treats "find every shot of this person" as the kind of question a good junior editor would ask a dozen small tools in parallel and then synthesize an answer from.

Start Simple: The n8n Workflow We Built First

Before we built anything sophisticated, we built something embarrassingly simple in n8n. It took an afternoon. It worked well enough to convince us the hard version was worth doing.

The flow had four nodes.

Node 1 — Input. A performer name. That's it. Type it in, kick off the workflow.

Node 2 — Ceivo search, scene-level, transcript sources. Search the archive for any scene whose transcript mentions the performer's name. This catches moments where another character addresses them by name, where a director's slate calls the shot, where a voice-over references them. It's noisy — transcripts pick up names spoken in conversation about the performer as much as they pick up scenes featuring them — but it surfaces a candidate set in seconds.

Node 3 — Ceivo search, scene-level, description sources. Search the same archive for scenes whose AI-generated visual descriptions mention the performer. This is the kind of hit you get when a vision model has been asked to describe a shot and has written something like "a woman in a blue coat walks down a rainy street" — which is useless for finding a specific performer, or "Sydney Sweeney sits at a diner counter looking out the window" — which is exactly the shot you want. The signal quality depends on whether the description model was primed with cast context. Sometimes it is. Usually it isn't.

Node 4 — Merge and dedupe. Combine the two result sets, dedupe by scene ID, sort by a simple composite score, and hand the list to a producer.
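The same four-node flow can be sketched in a few lines of plain Python. The Ceivo search calls here are hypothetical stand-ins for the HTTP nodes (the result shapes are illustrative); the merge-and-dedupe logic is the part that carries over as-is.

```python
def merge_and_dedupe(*result_sets):
    """Node 4: combine result sets, dedupe by scene_id, keep the best
    score per scene, and sort descending for the producer."""
    best = {}
    for results in result_sets:
        for hit in results:
            sid = hit["scene_id"]
            if sid not in best or hit["score"] > best[sid]["score"]:
                best[sid] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)

# Hypothetical outputs of Nodes 2 and 3 (shapes are illustrative).
transcript_hits = [
    {"scene_id": "ep1_s04", "score": 0.82, "source": "transcript"},
    {"scene_id": "ep1_s09", "score": 0.61, "source": "transcript"},
]
description_hits = [
    {"scene_id": "ep1_s04", "score": 0.90, "source": "description"},
    {"scene_id": "ep2_s02", "score": 0.74, "source": "description"},
]

playlist = merge_and_dedupe(transcript_hits, description_hits)
# ep1_s04 appears once, carrying its higher (description-layer) score.
```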

That's the entire workflow. An afternoon of setup. It runs in under a minute against a twenty-thousand-clip archive. And on a good day — when transcripts are clean and visual descriptions have cast context — it finds 60-70% of the shots the editor was looking for.

Sixty percent. That's the uncomfortable number. Six out of ten shots retrieved means four out of ten missed — and the four that get missed are exactly the shots that matter most, because they're the ones where the performer is on screen without being named, without saying anything, and without being described specifically enough to trigger a keyword match. The emotional close-ups. The silent reaction shots. The long establishing beats. The shots a trailer is built out of.

We knew the simple workflow would miss those. What the simple workflow gave us was a baseline to measure against, and a structure to extend. Every layer we added after this one was a layer whose only job was to catch the shots the n8n flow couldn't.

The Problem with Text Alone

The reason text-based retrieval caps out around 60% is that text describes shots in summary, and summaries throw away the thing an editor is actually searching for: the performer's face, in frame, in a specific beat.

A transcript tells you what was said in a scene. It does not tell you who was on screen when it was said, and whether they were the subject of the shot or a reaction cutaway. A visual description tells you what the dominant subject of a shot is, but vision models hedge. They don't write "Sydney Sweeney sitting at the counter." They write "a blonde woman sitting at a counter." And they do that for a specific technical reason — most base vision models don't know who Sydney Sweeney is, so they describe what they see instead of naming who they see.

The fix is obvious in hindsight. You need a layer that connects the visual identity of a performer to the visual content of a shot, without depending on anyone having written the performer's name down anywhere. And you need that layer to be something the rest of the workflow can query the same way it queries transcripts and descriptions — as a tool call, not a data export.

That is where the second and third layers come in.

Layer Two: The TMDB Cross-Reference

Before we reached for computer vision, we added a cheaper layer — TMDB. The Movie Database is a community-maintained catalog of film and television metadata, and its API is generous. Every performer in the database has headshots, filmographies, episode-level credits, and — critically — character names per role per production.

The insight is that a performer rarely goes by their real name inside the footage. They go by the character name. A transcript search for "Sydney Sweeney" returns close to nothing, because nobody in a scene ever calls her that. A transcript search for "Cassie," her character name in a recent HBO series, returns everything.

So the second layer of the pipeline does this: given a performer name, call TMDB, get back the list of roles and character names, and expand the original query into a set of queries — one for the performer's real name, one for each character name, one for common variants ("Cass," "Cassie," nickname forms). Fire all of them against the Ceivo transcript index in parallel. Merge the results.
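The expansion step is pure string work once the credits are in hand. A minimal sketch, assuming `credits` has the shape of cast entries from TMDB's `/person/{id}/combined_credits` endpoint (the nickname heuristic is illustrative, not what runs in production):

```python
import re

def expand_queries(performer_name, credits):
    """Expand one performer name into the parallel query set: real name,
    each character name, and simple nickname forms of each first name."""
    queries = {performer_name}
    for credit in credits:
        character = (credit.get("character") or "").strip()
        if not character:
            continue
        # Drop parenthetical qualifiers like "Cassie Howard (voice)".
        character = re.sub(r"\s*\(.*\)$", "", character)
        queries.add(character)
        first = character.split()[0]
        queries.add(first)            # "Cassie Howard" -> "Cassie"
        if len(first) > 4:
            queries.add(first[:4])    # crude nickname form: "Cassie" -> "Cass"
    return sorted(queries)

# Sample credits in TMDB's cast-entry shape (values are illustrative).
credits = [{"character": "Cassie Howard"}, {"character": "Olivia (voice)"}]
queries = expand_queries("Sydney Sweeney", credits)
```

Each string in the returned set becomes one transcript search, all fired in parallel.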

This is a surprisingly large lift. On the first project we tried it on — a series where the performer had two major character names across different seasons — transcript recall jumped from about 40% to about 75% just from adding the TMDB expansion. No model training. No new infrastructure. Just using the world's public film metadata the way an editor's memory uses it.

TMDB also gives us a grounding signal the computer vision layer will need in a minute: authoritative headshots. The database keeps high-quality, consented, production-approved portraits of the performer. Those images become the visual query we feed into the next layer.
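Fetching those portraits is one documented TMDB v3 call plus URL assembly — TMDB serves images from a separate image host with a size prefix. The API key and person ID below are placeholders you'd supply:

```python
import json
import urllib.request

TMDB_API = "https://api.themoviedb.org/3"
IMAGE_BASE = "https://image.tmdb.org/t/p"

def build_image_url(file_path, size="w500"):
    # TMDB image URLs are {base}/{size}{file_path}; file_path begins with "/".
    return f"{IMAGE_BASE}/{size}{file_path}"

def fetch_headshot_urls(person_id, api_key, size="w500"):
    """Pull profile images from TMDB's /person/{id}/images endpoint and
    return full URLs; these become the visual queries for the next layer."""
    url = f"{TMDB_API}/person/{person_id}/images?api_key={api_key}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        profiles = json.load(resp).get("profiles", [])
    return [build_image_url(p["file_path"], size) for p in profiles]
```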

Layer Three: Multimodal Video Understanding, via TwelveLabs

This is the layer that makes the difference between 75% recall and 95% recall. It's also the layer that an n8n flow alone cannot build.

TwelveLabs is not a conventional computer vision platform. Traditional computer vision breaks video into individual frames and processes audio separately — useful for object detection, but it never achieves a holistic understanding of what's actually happening in a scene. TwelveLabs takes a fundamentally different approach, processing video natively across visual, audio, and temporal modalities simultaneously. The result is semantic understanding of what content means — not just what objects appear on screen.

It indexes video at the embedding level — every scene, every shot, every frame — and lets you query that index by text, by image, or by video clip. For our use case, the important primitive is image-based search: given a still photograph of a person, find every scene in the indexed archive where that person appears on screen.

TwelveLabs is a Ceivo verified partner (see our March announcement), and the integration runs through the same MCP surface as the rest of our search primitives. The workflow looks like this:

  1. Pull the reference image. From TMDB, fetch the performer's authoritative headshot — ideally several, across ages and looks if the filmography spans time.
  2. Run an image search against the archive. TwelveLabs returns a ranked list of scenes where its embeddings say the person in the reference image appears on screen. Confidence scores per scene. Timecoded in/out points. Thumbnail frames.
  3. Merge with the text-layer results. The image search catches everything the text layers missed — the silent reaction shots, the emotional close-ups, the shots where the performer is on screen but nobody says their name. The text layers catch everything the image search has low confidence on — the wide shots where the face is small, the profile shots where the match is uncertain, the shots where the performer is partially obscured.
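Step 2 in code, with the search call itself behind a hypothetical `image_search` stand-in (consult the TwelveLabs search API docs for the real call shape and authentication) — the part worth showing is keeping, per scene, the best confidence seen across all reference headshots:

```python
def collect_vision_hits(headshot_urls, image_search):
    """Run one image search per reference headshot; keep the highest-
    confidence hit per scene. `image_search` is a hypothetical stand-in
    for the TwelveLabs image-query call and should yield hits shaped
    like {"scene_id", "confidence", "start", "end"}."""
    best = {}
    for url in headshot_urls:
        for hit in image_search(url):
            sid = hit["scene_id"]
            if sid not in best or hit["confidence"] > best[sid]["confidence"]:
                best[sid] = hit
    return best

# Stubbed search so the sketch runs without credentials or an index.
def fake_search(url):
    return [{"scene_id": "ep3_s07", "confidence": 0.88,
             "start": 12.0, "end": 19.5}]

hits = collect_vision_hits(["headshot_a.jpg", "headshot_b.jpg"], fake_search)
```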

Each layer fails in different places. That is the whole point. Three layers together catch what any one layer misses, and the overlap between them is a strong signal that a hit is real.
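One way to encode "overlap is a signal" is to score agreement explicitly when merging. A minimal sketch — the weights and bonus are illustrative, not Ceivo's tuned values:

```python
def composite_score(layer_scores, weights=None):
    """Combine per-layer scores for one scene. Cross-layer agreement is
    itself evidence, so each extra corroborating layer earns a bonus on
    top of the weighted sum."""
    weights = weights or {"transcript": 0.3, "description": 0.3, "vision": 0.4}
    score = sum(weights.get(layer, 0.0) * s
                for layer, s in layer_scores.items())
    agreement_bonus = 0.1 * (len(layer_scores) - 1)
    return round(score + agreement_bonus, 3)

# A scene only the vision layer found vs. one all three layers agree on.
vision_only = composite_score({"vision": 0.9})
all_three = composite_score({"transcript": 0.8, "description": 0.7,
                             "vision": 0.9})
```

A strong single-layer hit still ranks below a moderate hit that two or three layers agree on — which matches how an editor weighs corroborating evidence.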

But none of this is free of error. Face embedding models are very good and not perfect. They confuse performers who look alike. They lose confidence when the lighting is bad, when the performer is in heavy makeup, when a body double or stunt performer stands in. A three-layer pipeline with no judgment on top of it is a pipeline that returns a lot of candidates and a lot of false positives — and hands the editor a slightly shorter week of scrubbing.

So we added a fourth layer on top of the three: an LLM, acting as a judge.

The LLM as Final Judge

The most under-appreciated role for a language model in a media pipeline is not as a generator. It is as a reconciler — a thing that looks at three or four independent signals, each imperfect, and writes down which ones it believes and why.

In the actor-matching pipeline, the LLM runs after all three search layers have returned their candidates. For every candidate scene, the LLM gets:

  • The text-layer evidence: transcript snippet mentioning the performer name or character name, with timecodes.
  • The description-layer evidence: the AI-generated visual description of the scene, with any face-relevant tokens highlighted.
  • The image-layer evidence: the TwelveLabs confidence score, the matched frame, and adjacent frames for continuity.
  • The TMDB grounding: the performer's known characters in this production, headshots for reference, and the episode/scene context.

The LLM's job is not to look at a video. It's to look at the argument the three layers are making about whether this scene contains this performer. It asks the kinds of questions a good assistant editor would ask: Is the transcript hit actually about the person on screen, or about someone else talking about the person? Does the visual description mention features that are consistent with the reference image? Is the vision confidence high enough on its own, or does it need corroboration from another layer? Is there a plausible explanation for a disagreement — a body double, a flashback, a different actor in the same role?

The LLM returns a verdict per candidate: confirmed, probable, uncertain (flagged for human review), or rejected. Each verdict comes with a one-sentence editorial reason, written in language the editor can skim. The uncertain bucket is the pipeline's most valuable output, because it concentrates human review on exactly the scenes where the automated layers disagreed — and that's usually a tiny fraction of the total candidate set.
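Routing those verdicts is mechanical once they're parsed. A sketch, with one assumption called out: the doc routes confirmed scenes to the playlist and uncertain ones to review, and here we treat probable as playlist-worthy too — and any malformed verdict string falls through to human review rather than being silently dropped:

```python
from dataclasses import dataclass

VALID_VERDICTS = {"confirmed", "probable", "uncertain", "rejected"}

@dataclass
class Judgment:
    scene_id: str
    verdict: str
    reason: str  # the one-sentence editorial reason the LLM wrote

def bucket(judgments):
    """Confirmed/probable -> playlist; uncertain (or anything the LLM
    returned malformed) -> human review; rejected is dropped."""
    playlist, review = [], []
    for j in judgments:
        if j.verdict not in VALID_VERDICTS:
            j.verdict = "uncertain"  # unparseable output goes to a human
        if j.verdict in ("confirmed", "probable"):
            playlist.append(j)
        elif j.verdict == "uncertain":
            review.append(j)
    return playlist, review
```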

On a recent internal test against a fully scrubbed reference list for a single performer across a season of television, the pipeline landed at roughly 94% precision and 96% recall — with about 7% of candidates ending up in the uncertain bucket for human review. Those are numbers that turn a week of scrubbing into an afternoon of confirming, and they are numbers that the simple n8n flow on its own could not come close to.
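Precision and recall here mean the standard things — of the scenes the pipeline returned, how many were right; of the scenes in the scrubbed reference list, how many it found:

```python
def precision_recall(predicted, reference):
    """Score a pipeline run against a fully scrubbed reference list.
    Both arguments are collections of scene IDs."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```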

Why the Pairing Matters

The temptation, once the computer vision layer exists, is to throw away the text layers. Vision is more accurate, so let's just use vision. It's the wrong instinct.

Every layer catches a different failure mode of the other layers. Vision fails in low light and in profile. Text fails in silent scenes and in visually ambiguous descriptions. TMDB cross-reference fails when a performer is uncredited, when a production uses pseudonyms, or when the character name is too generic to disambiguate. Three layers together fail only in the intersection of their failure modes, which is a much smaller set than any single layer's failure set.

It also matters for the LLM-as-judge step. The LLM is not a vision model. It can't look at the frame. It can only look at the evidence the other layers give it, and it can only reach high confidence when multiple layers agree. Strip a layer out and you strip signal out of the judgment — the LLM's verdicts get worse, the uncertain bucket grows, and the editor's work grows with it.

The deeper architectural point is the one we keep coming back to across every piece we've written about agentic media infrastructure: capability comes from composition, not from any single model being magical. The n8n flow, the TMDB cross-reference, the TwelveLabs index, and the LLM judge are each individually simple. Chained together through an MCP surface, with session state holding the intermediate results so nothing gets lost between calls, they become something a post-production team can actually use on Monday morning.

A Walkthrough: Building the Playlist

To make this concrete, here's what a single end-to-end run looks like.

Step 1 — The brief. A producer types "Build a playlist of every shot featuring the lead actress, across all eight episodes of the season, sorted by episode and timecode."

Step 2 — Grounding. The agent calls TMDB for the performer's credits on this production. It gets back the character names, the headshot URLs, and the episode-level credits. The character names go into a keyword set. The headshot URLs go into an image set. Everything lands in session state.

Step 3 — Text layers. The agent fires parallel searches against the Ceivo transcript and description indexes — one query per character name, one per variant spelling, one per real name. Results stream into session state. Each search stores its raw payload on disk and returns a compact summary the agent can reason over.

Step 4 — Vision layer. The agent calls the TwelveLabs image search with each headshot from TMDB. Scenes come back with confidence scores, thumbnails, and timecodes. These results also land in session state.

Step 5 — Merge. The agent calls a top-results primitive across all the layers — transcript, description, vision — and gets back a deduplicated candidate list ranked by a composite score.

Step 6 — LLM judgment. For each candidate, the agent gathers the evidence from every layer and hands it to an LLM with a one-shot prompt: here is the evidence, here are the rules for judging, return a verdict and a reason. Verdicts come back in seconds. Confirmed scenes go to the playlist. Uncertain scenes go to a review bucket with the evidence attached, so the editor can scrub only the ones that need scrubbing.
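A sketch of what that one-shot prompt can look like. The rules and field names are illustrative, not Ceivo's production prompt — the point is that every line of evidence and every judging rule is explicit, so the verdict is auditable:

```python
JUDGE_PROMPT = """\
You are deciding whether a scene contains the performer {performer}.
Known characters in this production: {characters}

Evidence:
- Transcript snippet: {transcript_snippet}
- Visual description: {description}
- Image-match confidence: {vision_confidence}

Rules:
- Return "confirmed" only if at least two layers independently agree.
- Return "uncertain" whenever layers disagree or only one weak signal exists.
Return JSON: {{"verdict": "...", "reason": "<one sentence>"}}
"""

def render_prompt(candidate):
    # `candidate` is a hypothetical dict whose keys mirror the evidence
    # list above, gathered from session state for one scene.
    return JUDGE_PROMPT.format(**candidate)
```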

Step 7 — Assembly. The agent takes the confirmed scenes, sorts them by episode and timecode, and calls Ceivo's playlist-assembly primitive to emit a playlist ready to open in the editor's NLE of choice. Every scene in the playlist carries the editorial reason the LLM gave for including it — so when the editor scrubs, they know what the pipeline believed and why.

Total human touches: two. The brief, and the review of the uncertain bucket. Total elapsed time on a full season: minutes, not days.

Honest Limits

This pipeline is not infallible, and we aren't going to pretend it is.

It struggles on productions where a performer plays multiple characters — twins, look-alikes, flashback de-aging — because both the vision layer and the text layer will return confident hits that the LLM judge has no way to adjudicate without additional context. The fix is to add that context to the brief up front: "treat any scene with both twins as uncertain," or "only match scenes with the adult version of the character."

It struggles when the reference images from TMDB don't match the look of the performer in the production. A performer cast in a historical piece, in heavy prosthetics, or with a dramatically different hair color from their TMDB headshots will get lower vision confidence. The fix is to replace or augment the TMDB references with production stills — and the rest of the pipeline runs unchanged.

It struggles when the archive is under-indexed. If the transcripts are low quality, if the descriptions haven't been generated yet, if the vision index isn't built — the pipeline runs with fewer layers and its accuracy drops. This is the least dramatic of the failure modes, because it's the one that can be fixed over time by finishing the indexing work.

And finally — it is still making editorial calls that benefit from a human in the loop. Is this the shot the director wanted, or just a shot where the actress is on screen? That question is not one any pipeline we have built can answer. It is also not one it is supposed to answer. The pipeline's job is to put the right candidates on the editor's timeline, with enough context to make the final call fast. The editor's job is to make the final call.

What This Unlocks

Once you have a reliable pipeline for finding every shot of a performer, a lot of downstream work gets cheaper.

Trailer cuts and sizzle reels become assemblies on top of a pre-built playlist. The marketing team doesn't have to wait on an assistant editor to pull a selection. They can ask for "every hero close-up of the lead, sorted by intensity," and get a starting playlist back in minutes.

Reshoot planning gets faster. A director asking "do we already have a wide of her walking into the bar?" can get an answer without a human opening a bin, because the pipeline already knows where every shot of her is.

Continuity and wardrobe checks become tractable. Combine the performer-matching pipeline with a shot-level wardrobe description, and you can ask "which scenes have her in the blue coat?" with no manual logging.

Season-over-season searches become possible. The same pipeline, run across multiple seasons of a returning series, surfaces recurring character arcs that the production team has long since forgotten the timecodes of.

None of this is new in ambition. Post-production teams have been trying to do all of it for decades. What is new is that the combination of AI-generated descriptions, transcript search, vision embeddings, public metadata catalogs, and language-model reconciliation has finally landed in the same place at the same time — and a pipeline that was impossible to build two years ago is now a weekend project, if you have the right substrate underneath it.

The Architectural Point

The broader lesson — the one that generalizes beyond actor matching — is that hard retrieval problems in media are almost never solved by a single better model. They are solved by combining several imperfect models, letting each one do what it is good at, and asking a language model to reconcile the disagreements.

The n8n flow we started with was not wrong. It was incomplete. Every layer we added after it made the next layer more useful, because the LLM judge at the top had more evidence to work with. Pull any layer out and the whole thing gets worse. Keep them all and the pipeline converges on the kind of recall and precision that used to be the exclusive domain of the scrubbing week.

This is the pattern we are going to keep running into as post-production adopts agent-driven workflows. The winning systems are not going to be the ones with the flashiest single model. They are going to be the ones that treat retrieval as a multi-layer argument, treat language models as reconcilers, and treat editorial judgment as the last word. Ceivo is building for that pattern deliberately — because it is the only pattern we have seen that survives contact with a real archive on a real deadline.

What's Next

If you have a post-production archive and an actor-matching problem — or any problem shaped like "find every X in this library where X is visual" — we would like to show you what this pipeline looks like against your footage. We will build a small proof-of-concept against a subset of your archive, stand up the TMDB and TwelveLabs layers, and walk you through the results with the LLM judge's reasoning visible.

Reach out and let's run one.
