Annotation Guidelines

Video Caption Correction

Each clip is a short video with a machine-generated caption. Watch the clip with sound on, then correct the caption — fix mistakes, delete hallucinations, then add the missing detail — so it accurately and completely describes everything seen and heard. The goal: a caption good enough that a generation model could recreate the video and its audio from the text alone.

⚠️ Read this before you start

For every clip: watch it with sound on, then correct the caption — fix mistakes, delete hallucinations, add the missing details. Stay faithful — never invent. Every word of the final caption must be supported by what you can actually see or hear.

The clip is already trimmed to the exact segment you should caption. You do not adjust the video's start/end — there is no trimming step. Your only job is to make the caption an accurate, complete description of this clip.

The workflow

Start in the annotation platform. Open the task and copy its task data.
Come to this site. Click Paste task, paste the task data — the video loads at its locked window with the caption ready to correct.
Correct the caption, use the Compare tab to review your edits, then click Complete task and copy the result.
Return to the annotation platform and paste the result back to submit.

Worked examples

Two full before → after corrections from the guidelines.

Example 1 — a gentle, single-scene clip (stop-motion mouse)

Before (machine)

The narrator says, "This is a charming stop-motion animated video created entirely from needle-felted wool, depicting a day in the life of a small mouse character. The video opens with detailed shots of a felted tree with fruits, then introduces a mouse wearing a blue and white striped dress and white apron who picks oranges from the tree branches. She carries her harvest in a woven basket through a cozy, rustic kitchen with a fireplace, red checkered curtains, and wooden furniture. The video concludes with her tending to the fireplace, creating a warm, domestic atmosphere…" A gentle, whimsical music box melody runs throughout, featuring a soft flute line and a delicate music-box tune later. Soft rustling sounds accompany the mouse's movements, and the crackling of the fire can be heard in the final scene.

After (corrected)

This is a charming stop-motion animated video created entirely from tactile, needle-felted wool, beautifully depicting a cozy day in the life of an anthropomorphic small mouse character. The video opens with detailed macro close-up shots of a meticulously handcrafted felted tree with branches heavy with soft, muted-orange fruits, then introduces the main character: a small mouse in a quaint blue and white striped dress and a crisp white apron who gently picks oranges. The shot then cuts to a medium view as she carries her harvest in a tiny, woven wicker basket through a rustic, heartwarming kitchen with a stone fireplace, red checkered curtains, and miniature wooden furniture. Then the shot cuts to a low-angle POV shot from the fruits' perspective inside the basket as she sets them down. The video concludes with the mouse tending to the hearth — her tiny paws striking a match to ignite a warm, glowing fire that bathes the room in a soft, golden hue. Overall, a warm, nostalgic, domestic atmosphere.

The production showcases master-level, intricate wool craftsmanship — soft organic textures, a muted natural palette, and gentle, deliberate animation evoking a nostalgic storybook aesthetic reminiscent of Beatrix Potter. A whimsical music-box melody loops throughout, with a soft, airy flute counter-melody early on, transitioning into a pure, delicate music-box tune later — a calm, serene fairy-tale atmosphere. Soft rustling sounds of fabric accompany the mouse's movements, culminating in the cozy crackling of firewood in the final scene.

Drops the "The narrator says, '…'" framing; adds shot sizes, an explicit cut, and a whole missing low-angle POV shot; sharpens actions, objects, lighting and texture; enriches music and SFX — all while staying faithful, with grounded aesthetic words ("charming", "heartwarming") layered on top of concrete facts.

Example 2 — a fast-cutting commercial advertisement (Ayla cloths)

Before (machine)

This is a commercial-style video demonstrating the versatility of 'Ayla' brand cleaning cloths across multiple surfaces and in different colors. The video features a woman with curly blonde hair wearing a dark floral dress, showcasing yellow, orange, light blue, and pink versions of the cloth. She cleans various household surfaces including gray countertops, black stovetops, glass windows, and flat-screen TVs. The video employs quick cuts, with special effects like sparkles to emphasize cleanliness. An off-screen narrator with a bright, upbeat tone says, 'Quick, easy, and', 'Enjoyable!', 'With one wipe, you get a stunning result. And say', 'Fifty percent of your cleaning time. The cloth', 'streak free in a flash.', and 'Flash'. Upbeat synthesized pop music plays throughout.

After (corrected)

This is a bright, commercial-style advertisement demonstrating the versatility of 'Ayla' brand cleaning cloths across multiple surfaces in vibrant colors. The live-action video is a briskly paced sequence of seven distinct shots that cut roughly once per second, featuring a woman with curly blonde hair in a dark floral dress. Shot one opens on a close-up of a gray countertop scattered with crumbs and debris, which she sweeps away with a bright cloth. Shot two cuts to a black oven top covered in greasy splatters as an orange cloth wipes it clean. Shot three switches to a unique low-angle view looking up through a clear glass table as a cloth clears smudges. Shot four is a close-up where glowing sparkles flash across the screen to emphasize the spotless finish. In shot five she cleans a glass window and the camera zooms in to display the 'Ayla' logo on the corner of the fabric. Shot six wipes dust and streaks off a black flat-screen TV with a pink cloth. The seventh, final shot is a product reveal: a medium view of the smiling woman holding a stack of yellow, orange, light blue, and pink cloths against a vibrant yellow background. Lighting is bright, high-key, and professional throughout.

An upbeat synthesized pop track plays continuously, establishing a cheerful, promotional mood. Clean and prominent in the mix, an off-screen narrator delivers a bright voiceover: "Quick, easy, and enjoyable! With one wipe, you get a stunning result and save 50% of your cleaning time. Ayla cloths are highly absorbent and make all smooth surfaces streak-free in a flash!" The narrator stays upfront while the music sits quietly underneath — a polished commercial atmosphere.

Names the format and live-action medium; gives a shot count and cutting pace and walks the seven shots in order; adds what's being cleaned off each surface; catches the low-angle through-the-glass shot and the zoom onto the logo; ties each cloth color to its action; and reassembles the garbled narrator transcript into one clean voiceover.

How to watch — three passes

Don't single-pass under time pressure — you'll under-correct.

Pass 1 — content: the overall story — shots, subjects, actions, setting.
Pass 2 — focused audio: listen closely (eyes off the screen helps) to separate speech vs music vs sound effects vs ambience.
Pass 3 — verify: scrub through and check every correction and added detail against what is actually there, before you submit.

Step A — Cross-validate & fix (remove hallucinations)

Read the caption sentence by sentence against the clip. Fix or delete anything that doesn't hold up. The golden rule is faithfulness first: never invent objects, text, sounds, counts, names, or brands.

Error type	What to do
Hallucinated content	Delete any object, action, character, on-screen text, or sound that isn't actually in the clip.
Wrong attribute	Fix incorrect colors, counts, gender, age, object identity, location, materials, or actions ("blue dress" when it’s green; "three people" when there are two).
Wrong sequence	Re-order events to match the order they actually happen on screen.
Mis-framed as narration	Make it a direct, third-person description. Remove wrappers like 'The narrator says, "<the whole description>."' Real spoken narration goes in the audio part instead.
Vague / unfounded guess	Replace it with a specific, observable description: "deep, gravelly voice", not "a scary voice"; "neon-lit alley at night", not "a creepy place".
Unverifiable specifics	Only name a brand, song, place, or person if it's clearly identifiable from the clip; otherwise describe it generically.
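
The "mis-framed as narration" row is the one error type that can be spotted from the text alone, so a quick self-check is possible before Pass 3. The sketch below is only an illustration: the names NARRATOR_WRAPPER and is_framed_as_narration are hypothetical, the list of verbs is an assumption, and nothing here is part of the annotation platform. It flags captions that open with a narrator wrapper and leaves genuine quoted narration later in the caption alone.

```python
import re

# Assumption: wrappers open the caption and use one of these verbs.
# Extend the verb list if other wrapper phrasings show up in practice.
OPENING_QUOTES = "\"'\u201c"  # straight double, straight single, curly double
NARRATOR_WRAPPER = re.compile(
    r"^\s*(?:the|a)\s+narrator\s+(?:says|states|explains)\s*[,:]?\s*["
    + re.escape(OPENING_QUOTES)
    + "]",
    re.IGNORECASE,
)


def is_framed_as_narration(caption: str) -> bool:
    """Return True if the caption starts with a narrator-style wrapper."""
    return NARRATOR_WRAPPER.match(caption) is not None


if __name__ == "__main__":
    wrapped = 'The narrator says, "This is a charming stop-motion animated video."'
    direct = 'This is a bright advertisement. An off-screen narrator says, "Quick, easy."'
    print(is_framed_as_narration(wrapped))
    print(is_framed_as_narration(direct))
```

A flagged caption still needs a human rewrite into a direct description; the check only points at where to look.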

Concrete first — aesthetic language welcome

Being specific does not mean dry, clinical text. Lead with concrete, observable facts (objects, counts, colors, actions, sounds, camera, on-screen text) — this layer must carry the caption. Keep information-bearing adjectives (golden, muted-orange, gravelly, shallow-focus). Then layer mood / tone words (charming, serene, dramatic) on top — grounded in the clip, and never replacing the concrete description. A great closing move is one overall mood phrase, e.g. "a warm, serene fairy-tale atmosphere".

Step B — Add the missing details

Once the caption is accurate, enrich it. Add whatever is present in the clip but missing or thin, covering both the visual and the audio track. Only mention categories that are actually present; don't pad. Priority order: faithfulness > salient completeness > richness. Use a light touch on captions that are already good. Rough length: about one flowing paragraph per ~10 s of clip.

Visual — check & add

Scene / content type (vlog, movie scene, ad, gameplay, music video, documentary…) — state it early
Overall motion / energy (still → fast & frenetic)
Shot size & framing, per shot (wide, medium, close-up, macro…)
Camera angle / viewpoint (eye-level, low-angle, overhead, POV…)
Camera movement — only if real (pan, push-in, tracking, zoom, handheld…)
Shot count, changes & cutting pace (hard cut, dissolve; summarize fast montages)
Main subject: type, appearance, expression — referenced consistently
Actions & interactions, in order, with specific verbs
Setting, background, time of day, location changes
Lighting & color, focus / bokeh
Visual style / medium (live-action, CGI, anime, stop-motion…); flag AI-generated cues
Key objects & spatial layout; VFX / atmospherics; on-screen text, logos, watermarks

Audio — check & add

Speech / narration: on-screen vs off-screen; voice profile (gender, age, pitch, rate, emotion); intent; accent; what's said (repair garbled transcripts)
Music: genre, mood, tempo, instrumentation, vocals, how it changes
Sound effects (foley): source, synced timing, volume, close/distant, dry/reverberant
Ambience: room tone, nature, crowd, traffic, wind
Mix & balance: which layer dominates; music ducked under speech
Audio quality & overall mood the combined audio + video creates

Edge cases

Case	What to do
No speech	Don't invent dialogue. Describe the music / SFX / ambience, or state plainly there is no speech.
Silent clip (no audio)	Explicitly note the clip is silent rather than omitting the audio part.
Black / abstract / no clear subject	Describe what is literally on screen ("a black frame", "abstract shifting shapes"). Don’t guess a subject.
Multiple subjects	Describe each, treat the one the clip centers on as the "main" subject, and reference all of them consistently.
Text-heavy screens (slides, UI, code)	The on-screen text is the content — transcribe it faithfully as the primary description.
Very fast cutting / montage	Summarize ("a rapid montage of roughly N shots") rather than listing every cut.
Caption already good	Make only light edits; don't rewrite for style or pad it.

Keep the caption's format

Edit in place and keep the same free-text format (one or two flowing paragraphs). A natural order is visuals first (chronological — shots, camera, subjects, actions, setting), then production style and audio (music, SFX, ambience, mix, overall mood). Write fluent prose, not a bullet dump.

Quick fixes

Small before → after corrections.

Before — machine

…a dog runs past in the background as she lights the fire.  (no dog is visible)

After — corrected

…as she lights the fire.  (hallucinated dog deleted)

Before — machine

a mouse in a blue dress picks oranges

After — corrected

a mouse in a green dress picks oranges  (wrong color fixed to what's on screen)
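
Small corrections like these are easy to miss when rereading a long caption. As a rough sketch for checking your own edits outside the platform, the snippet below lists word-level changes between a before and an after caption using Python's standard difflib. The function name changed_words is hypothetical, and this is not a description of how the Compare tab works.

```python
import difflib


def changed_words(before: str, after: str) -> list[tuple[str, str, str]]:
    """List (operation, removed_text, added_text) for each word-level change."""
    old_words, new_words = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / delete / insert spans
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


if __name__ == "__main__":
    for change in changed_words(
        "a mouse in a blue dress picks oranges",
        "a mouse in a green dress picks oranges",
    ):
        print(change)
```

Each reported change is a prompt to go back to the clip and confirm the new wording against what is actually on screen or in the audio.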

Core principles

Faithful first. Only describe what you can see or hear. Never invent; delete hallucinations.
Concrete first, vivid on top. Observable backbone, then grounded mood words.
Rich but grounded. Priority order: faithfulness > salient completeness > richness.
Light touch on good captions. Don't pad what's already accurate.
Cover both tracks. Describe the video and the audio.
Chronological. Describe shots and actions in the order they occur.
Direct description. The caption describes the video; it is not a quote of someone narrating it.
Describe the whole clip. Everything shown or heard belongs in the caption; nothing that isn't shown or heard should appear.

Before you submit — checklist

Describes everything in the clip and nothing that isn't actually shown or heard.
No hallucinations — every object, action, text, and sound is actually in the clip.
All attributes correct — colors, counts, gender/age, identity, location, materials.
A direct description, not framed as "the narrator says ‘…’".
Scene/content type identified; overall motion level & cutting pace stated.
Camera covered — shot sizes per shot, angles, real movements, shot count, each cut.
Actions specific and in order; main subject described and referenced consistently.
Setting, lighting/color, focus, style/medium (incl. AI-generated cues), objects, VFX, on-screen text/watermarks captured where present.
Audio covered — speech (voice profile + intent), music, SFX, ambience, mix, overall mood.
If already accurate, only light edits — not padded.
Verified every correction against what's on screen (Pass 3).
Reads as fluent prose in the same format as the input.

Appendix — vocabulary cheat-sheet

Controlled vocabulary to pull from (not exhaustive)

Scene / content type: vlog · interview · movie / TV scene · advertisement · tutorial · gameplay · music video · product demo · documentary · news · livestream
Motion level: mostly still · slow & gentle · moderate · busy · fast & frenetic
Shot size: extreme wide · wide / establishing · full · medium · medium close-up · close-up · extreme close-up · macro
Camera angle: eye-level · low-angle · high-angle · overhead / top-down · Dutch (tilted) · POV · over-the-shoulder · insert POV
Camera movement: static · pan · tilt · push-in · pull-out · tracking / follow · crane · zoom · handheld · orbit · whip pan
Transitions & pace: hard cut · jump cut · match cut · fade · cross-dissolve · wipe · shot count · cutting pace
Focus / depth: shallow (bokeh) · deep focus · rack / pull focus
Lighting: natural daylight · golden hour · soft · hard · high-key · low-key · backlit · practical · neon · warm / cool
Medium & style: live-action · photorealistic · 3D CGI · 2D animation · anime · stop-motion · claymation · pixel art · vintage · AI-generated
VFX & atmospherics: glow · particles · sparkles · transformation · slow-motion · steam · smoke · mist · haze
Speech / voice: on-screen vs off-screen · gender · age · pitch / timbre · rate · emotion · intent · accent · synthetic / TTS · lip-sync
Music: genre · mood · tempo (~BPM) · instrumentation · vocals · role (bg / fg) · changes over time
Sound effects: source · synced timing · volume · close / distant · dry / reverberant
Ambience & mix: environment type · constant vs changing · which layer dominates · music ducked under speech
