Pikumo · probe image-gen reconciliation · story-source A/B

Story source: how much should the model decide on its own?

Same grief story, same character photo, same style — three orchestrator strategies. A: structured beats only (current production). B: structured beats + raw story attached as <story_source role="tonal_reference">. C: raw story only — the model picks its own 4 moments. The third variant reveals what the structured beats are actually buying us.
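The three strategies differ only in what text reaches the model. Below is a minimal TypeScript sketch of that assembly, under stated assumptions: the `Beat` fields, `renderBeat`, `buildPrompt`, and the instruction line used for C are hypothetical names and wording chosen for illustration. Only the `<story_source role="tonal_reference">` wrapper comes from the description above.

```typescript
// Hypothetical shape: these field names are illustrative, not taken from extract.ts.
interface Beat {
  location: string;
  action: string;
  wardrobe: string;
}

type Variant = "A" | "B" | "C";

// Render one structured beat as a single prompt line.
function renderBeat(beat: Beat, index: number): string {
  return `Panel ${index + 1}: ${beat.location}. ${beat.action}. Wardrobe: ${beat.wardrobe}.`;
}

// Assemble the prompt text for one variant.
function buildPrompt(variant: Variant, beats: Beat[], story: string): string {
  const beatLines = beats.map(renderBeat).join("\n");
  const storyBlock = `<story_source role="tonal_reference">\n${story}\n</story_source>`;

  if (variant === "A") {
    // A: structured beats only.
    return beatLines;
  }
  if (variant === "B") {
    // B: structured beats, with the raw story attached as a tonal reference.
    return `${beatLines}\n\n${storyBlock}`;
  }
  // C: raw story only. The instruction wording here is an assumption.
  return `Pick 4 moments from this story and render one panel for each.\n\n${story}`;
}
```

In this sketch, B is A plus the story block, so any difference between A and B in the panels is attributable to the attached story. C drops every structured field, including wardrobe.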

The story

The apartment smells like her, which I wasn't expecting. Three weeks after the funeral and the linen closet still smells like the violet sachets she kept tucked between the towels. I stand in the entryway for too long with the key still in my palm, listening to the upstairs neighbor's TV through the floor. It's the same kitchen-radio voice she used to complain about. She would have made tea by now. In the kitchen the dish rack is exactly how she left it. One bowl, one spoon, one teacup with a chip I'd forgotten about, on its little wire shelf. The dimmer over the stove is still at the setting she liked — about a third of the way — that warm yellow that always made me think she'd been waiting up. I run my hand along the counter and don't cry, which surprises me, until I realize I'm trying to be quiet, like she might be sleeping. Her bedroom is harder. I open the closet because I have to and the sleeve of her green sweater swings out, the one she always wore in October. I press it to my face and it still has her — that mix of soap and tea and something else I can't name. I sit on the edge of the bed for a long time. The window is the one she used to read by. I lock the door behind me and stand on the landing with the key, which I won't need again. The hallway light is the wrong kind of fluorescent. I don't look back. I walk to the stairs and I keep walking.

Beat by beat

Each row is one beat, in chronological order. For each beat the three columns show: A structured fields only, B structured + story attached, C story only (model picks its own moments). Because variant C chooses its own moments, each row shows the C panel that most closely matches that beat.

Beat 1 — Entryway

Story line: "I stand in the entryway for too long with the key still in my palm."
A's field: "standing still, looking down the hallway." · C picked: "Entryway key pause."

[Figure: Beat 1 panels. A · beats only | B · + story | C · story only]

B's win: adds "before she steps inside" — captures the threshold hesitation. C's read: picked the same moment but renders Mei facing the camera with a small smile. Without the structured pose constraint, the "pleasant default" returns. Wardrobe is correct (grey cardigan).

Beat 2 — Kitchen

Story line: "one teacup with a chip I'd forgotten about, on its little wire shelf … the dimmer over the stove is still at the setting she liked."
A's field: "a single dish rack with one bowl/spoon/chipped teacup." · C picked: "Teacup and dim stove."

[Figure: Beat 2 panels. A · beats only | B · + story | C · story only]

B's win: rewrote the rack as "on the wire shelf", lifted directly from the story. C's pick is arguably better: "teacup and dim stove" centers the objects (the kettle, the dimmer at her setting) rather than the woman. But Mei is wearing a black turtleneck (wardrobe drift): the structured fields had her in a thin grey cardigan, and C invented different clothing because nothing locked it.

Beat 3 — Bedroom

Story line: "the sleeve of her green sweater swings out … I press it to my face … I sit on the edge of the bed for a long time."
A's field: "sitting on the edge of the bed, holding a green cardigan sleeve to her cheek." · C picked: "Green sweater in the closet."

[Figure: Beat 3 panels. A · beats only | B · + story | C · story only]

B's win: added "hushed, intimate, gently worn." Body language reads as more interior. C's pick moves earlier in the moment: the sleeve swinging out of the closet, the sweater pressed to her face. Strong reading of the prose. But Mei is still in the black turtleneck (wardrobe drift). The structured beats would have caught this.

Beat 4 — Leaving

Story line: "I don't look back. I walk to the stairs and I keep walking."
A's field: "walking away down the corridor, head facing forward, not looking back." · C picked: "Down the stairs."

[Figure: Beat 4 panels. A · beats only | B · + story | C · story only]

A renders Mei smiling at the camera in a fluorescent corridor — wrong register. B frames the closed apartment door behind her and drops the smile — the brief executed. C picked a literal story phrase ("down the stairs") but reverts to "pleasant person, facing camera, mid-frame" — same failure mode as A, despite having the full story. Without structured pose constraint, the model regresses to its default.

The numbers

metric              A · beats only    B · + story source    C · story only
prompt chars        4,382             6,356                 3,990
input tokens        8,226             8,748                 8,219
reasoning tokens    639               655                   716
output tokens       1,494             1,539                 1,583
wall seconds        217.4             192.2                 133.3
panels delivered    4/4               4/4                   4/4
tag survival        4/4               4/4                   4/4

All three runs: gpt-5.4-mini · /v1/responses · OpenAI direct · image_generation low quality · 1024×1024 · one identity photo attached.
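The cost of attaching the story can be read off the table with a little arithmetic. The sketch below copies the prompt-character and input-token figures from the table. `RunMetrics`, `runs`, and `percentChange` are illustrative names, not part of the pipeline.

```typescript
// Figures copied from the metrics table; the keys name the three variants.
interface RunMetrics {
  promptChars: number;
  inputTokens: number;
}

const runs: Record<"A" | "B" | "C", RunMetrics> = {
  A: { promptChars: 4382, inputTokens: 8226 },
  B: { promptChars: 6356, inputTokens: 8748 },
  C: { promptChars: 3990, inputTokens: 8219 },
};

// Relative change of `value` against `baseline`, in percent, rounded to one decimal.
function percentChange(value: number, baseline: number): number {
  return Math.round(((value - baseline) / baseline) * 1000) / 10;
}

// Overhead of variant B relative to variant A.
const charOverhead = percentChange(runs.B.promptChars, runs.A.promptChars); // 45
const tokenOverhead = percentChange(runs.B.inputTokens, runs.A.inputTokens); // 6.3
```

On these numbers, B's prompt text is about 45% longer than A's, while input tokens rise by about 6.3%.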

The wardrobe-drift failure

The story_only variant (C) does one thing the others can't: it picks better moments. "Teacup and dim stove" and "Green sweater in the closet" lift more story-specific imagery than the structured fields ever could. The model is good at this.

But across the 4 panels, Mei wore two different outfits: grey cardigan in panels 1 and 4, black turtleneck in panels 2 and 3. The structured beats specified "thin grey cardigan and dark jeans." Without that lock, the model invented a wardrobe per panel. That's a deal-breaker for any story that's meant to read as a single chronological visit. The same eyeline failures reappear too: panels 1 and 4 render the "pleasant smile, face camera" default despite the full story being in context.
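A drift check of this kind is easy to express once each panel has a wardrobe label. The sketch below assumes such labels exist, for example from a manual review pass; nothing in this report says the pipeline produces them. `findWardrobeDrift` is a hypothetical helper.

```typescript
// Return the 1-based panel numbers whose observed wardrobe differs from the expected one.
// Labels are compared case-insensitively after trimming whitespace.
function findWardrobeDrift(expected: string, observed: string[]): number[] {
  const target = expected.trim().toLowerCase();
  const drifted: number[] = [];
  observed.forEach((label, index) => {
    if (label.trim().toLowerCase() !== target) {
      drifted.push(index + 1);
    }
  });
  return drifted;
}

// Variant C's outfits as described above, one label per panel (labels are hand-written).
const variantCPanels = ["grey cardigan", "black turtleneck", "black turtleneck", "grey cardigan"];
const driftedPanels = findWardrobeDrift("grey cardigan", variantCPanels); // [2, 3]
```

An empty result would mean the sequence held one outfit throughout. For variant C the check flags panels 2 and 3.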

Verdict

Ship variant B (structured beats + <story_source>). It's the only configuration that simultaneously gets the right moments, the right tone, AND consistent wardrobe + pose across the sequence.

Don't skip the structured beats, as variant C does. The story does encode great moment candidates, and the model picks them well. But the structured beats are doing more than they appear to: they're the consistency layer for wardrobe, pose, and emotional register. Without them, the model regresses to "pleasant person, facing camera." On a 4-panel grief sequence, that's wrong.

Don't ship variant A alone either. Beat 4 proves it: the structured fields said "not looking back" and A rendered a smiling face-front portrait anyway. Structured fields alone are too generic to convey emotional weight; the story carries that load.

Architecture implication. extract.ts stays — it produces the discipline layer (subjects, wardrobe, action, location, sequence rhythm). The raw story rides alongside as a tonal reference. The model uses both. The structured fields are not the bottleneck; they're the safety net.