Pikumo · probe image-gen reconciliation · story-source A/B
Story source: how much should the model decide on its own?
Same grief story, same character photo, same style — three orchestrator strategies.
A: structured beats only (current production).
B: structured beats + raw story attached as <story_source role="tonal_reference">.
C: raw story only — the model picks its own 4 moments. The third variant
reveals what the structured beats are actually buying us.
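As a rough illustration of how the three strategies differ in what reaches the model, here is a minimal sketch. The `Beat` shape, the function names, and the instruction wording for variant C are all assumptions made for illustration; only the `<story_source role="tonal_reference">` tag comes from this write-up, and the real orchestrator may assemble its prompt differently.

```typescript
// Hypothetical beat shape: field names are assumptions, not the real extract.ts output.
interface Beat {
  location: string;
  action: string;
  wardrobe: string;
}

type Variant = "A" | "B" | "C";

// Wrap the raw story in the tag that variant B attaches.
function storySourceBlock(story: string): string {
  return `<story_source role="tonal_reference">\n${story.trim()}\n</story_source>`;
}

// Render the structured beats as one line per beat.
function beatsBlock(beats: Beat[]): string {
  return beats
    .map((b, i) => `Beat ${i + 1}: ${b.location}. ${b.action}. Wardrobe: ${b.wardrobe}.`)
    .join("\n");
}

// A: beats only. B: beats plus the tagged story. C: story only, model picks moments.
function buildPrompt(variant: Variant, beats: Beat[], story: string): string {
  switch (variant) {
    case "A":
      return beatsBlock(beats);
    case "B":
      return `${beatsBlock(beats)}\n\n${storySourceBlock(story)}`;
    case "C":
      // Illustrative instruction text, not the wording used in the actual run.
      return `Pick 4 moments from this story and render one panel for each.\n\n${story.trim()}`;
  }
}
```

In this sketch, variant B is strictly variant A plus the tagged story, which is what makes the A/B comparison a clean test of what the story adds. Variant C drops the wardrobe and pose fields entirely.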
The story
The apartment smells like her, which I wasn't expecting. Three weeks after the funeral and the linen closet still smells like the violet sachets she kept tucked between the towels. I stand in the entryway for too long with the key still in my palm, listening to the upstairs neighbor's TV through the floor. It's the same kitchen-radio voice she used to complain about. She would have made tea by now. In the kitchen the dish rack is exactly how she left it. One bowl, one spoon, one teacup with a chip I'd forgotten about, on its little wire shelf. The dimmer over the stove is still at the setting she liked — about a third of the way — that warm yellow that always made me think she'd been waiting up. I run my hand along the counter and don't cry, which surprises me, until I realize I'm trying to be quiet, like she might be sleeping. Her bedroom is harder. I open the closet because I have to and the sleeve of her green sweater swings out, the one she always wore in October. I press it to my face and it still has her — that mix of soap and tea and something else I can't name. I sit on the edge of the bed for a long time. The window is the one she used to read by. I lock the door behind me and stand on the landing with the key, which I won't need again. The hallway light is the wrong kind of fluorescent. I don't look back. I walk to the stairs and I keep walking.
Beat by beat
Each row is one beat, in chronological order. The three columns show A (structured fields only), B (structured fields + story attached), and C (story only; the model picks its own moments). For variant C, each row shows whichever of its self-picked moments most closely matches that beat.
Beat 1 — Entryway



B's win: adds "before she steps inside" — captures the threshold hesitation. C's read: picked the same moment but renders Mei facing the camera with a small smile. Without the structured pose constraint, the "pleasant default" returns. Wardrobe is correct (grey cardigan).
Beat 2 — Kitchen



B's win: rewrote the rack as "on the wire shelf", lifted directly from the story. C's pick is arguably better: "teacup and dim stove" centers the objects (the kettle, the dimmer at her setting) rather than the woman. But Mei is wearing a black turtleneck (wardrobe drift). The structured fields had her in a thin grey cardigan, and C invented different clothing because nothing locked it.
Beat 3 — Bedroom



B's win: added "hushed, intimate, gently worn." Body language reads as more interior. C's pick moves earlier in the moment: the sleeve swinging out of the closet, the sweater pressed to her face. Strong reading of the prose. But Mei is still in the black turtleneck (wardrobe drift). The structured beats would have caught this.
Beat 4 — Leaving



A renders Mei smiling at the camera in a fluorescent corridor: wrong register. B frames the closed apartment door behind her and drops the smile, which is the brief as written. C picked a literal story phrase ("down the stairs") but reverts to "pleasant person, facing camera, mid-frame", the same failure mode as A, despite having the full story. Without a structured pose constraint, the model regresses to its default.
The numbers
| metric | A · beats only | B · + story source | C · story only |
|---|---|---|---|
| prompt chars | 4,382 | 6,356 | 3,990 |
| input tokens | 8,226 | 8,748 | 8,219 |
| reasoning tokens | 639 | 655 | 716 |
| output tokens | 1,494 | 1,539 | 1,583 |
| wall seconds | 217.4 | 192.2 | 133.3 |
| panels delivered | 4/4 | 4/4 | 4/4 |
| tag survival | 4/4 | 4/4 | 4/4 |
All three runs: gpt-5.4-mini · /v1/responses · OpenAI direct · image_generation low quality · 1024×1024 · one identity photo attached.
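To put the cost of attaching the story in relative terms, the figures in the table can be compared against variant A as a baseline. The sketch below uses only numbers from the table; the `Metrics` type and `pctChange` helper are illustrative names. Each variant appears to have been run once, so the wall-time differences in particular may be noise rather than a real effect.

```typescript
type Metrics = { promptChars: number; inputTokens: number; wallSeconds: number };

// Values copied from the table above.
const runs: Record<"A" | "B" | "C", Metrics> = {
  A: { promptChars: 4382, inputTokens: 8226, wallSeconds: 217.4 },
  B: { promptChars: 6356, inputTokens: 8748, wallSeconds: 192.2 },
  C: { promptChars: 3990, inputTokens: 8219, wallSeconds: 133.3 },
};

// Percent change of `value` relative to `baseline`, rounded to one decimal place.
function pctChange(value: number, baseline: number): number {
  return Math.round(((value - baseline) / baseline) * 1000) / 10;
}

// Extra input tokens B pays for carrying the raw story.
const inputTokenDeltaB = runs.B.inputTokens - runs.A.inputTokens; // 522
const inputTokenPctB = pctChange(runs.B.inputTokens, runs.A.inputTokens); // 6.3

console.log(`B vs A input tokens: +${inputTokenDeltaB} (${inputTokenPctB}%)`);
```

On these numbers the story attachment costs about 6% more input tokens, even though the prompt text itself grows by roughly 45% in characters; most of the input budget is presumably taken up by something other than the prompt text, such as the attached identity photo.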
The wardrobe-drift failure
The story_only variant (C) does one thing the others can't: it picks better moments. "Teacup and dim stove" and "Green sweater in the closet" lift more story-specific imagery than the structured fields ever could. The model is good at this.
But across the 4 panels, Mei wore two different outfits: grey cardigan in panels 1 and 4, black turtleneck in panels 2 and 3. The structured beats specified "thin grey cardigan and dark jeans." Without that lock, the model invented an outfit per panel. That's a deal-breaker for any story that's meant to read as a single chronological visit. The eyeline failure reappears too: panels 1 and 4 render the "pleasant smile, face camera" default despite the full story being in context.
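The drift described above can be expressed as a simple per-panel check. This is a sketch under assumptions: the `PanelObservation` shape and `findWardrobeDrift` function are hypothetical, and the observed wardrobe labels would have to come from human review or a separate captioning pass, neither of which is part of this probe. The sample data mirrors what variant C produced.

```typescript
// Hypothetical record of what a reviewer (or captioner) saw in each panel.
interface PanelObservation {
  panel: number;
  wardrobe: string;
}

// Returns the panel numbers whose observed wardrobe is missing
// at least one of the expected garment keywords (case-insensitive).
function findWardrobeDrift(
  expectedKeywords: string[],
  observed: PanelObservation[],
): number[] {
  const wanted = expectedKeywords.map((k) => k.toLowerCase());
  return observed
    .filter((o) => !wanted.every((k) => o.wardrobe.toLowerCase().includes(k)))
    .map((o) => o.panel);
}

// Variant C's outfits as reported in this write-up.
const variantC: PanelObservation[] = [
  { panel: 1, wardrobe: "grey cardigan" },
  { panel: 2, wardrobe: "black turtleneck" },
  { panel: 3, wardrobe: "black turtleneck" },
  { panel: 4, wardrobe: "grey cardigan" },
];

const drifted = findWardrobeDrift(["grey cardigan"], variantC); // [2, 3]
console.log(`Panels with wardrobe drift: ${drifted.join(", ")}`);
```

Substring matching is deliberately crude here; it would flag a correct panel described as "gray cardigan", so any real check would need normalized labels.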
Verdict
Ship variant B (structured beats + <story_source>). It's the only configuration that simultaneously gets the right moments, the right tone, AND consistent wardrobe + pose across the sequence.
Don't skip the structured beats (variant C). The story does encode great moment candidates, and the model picks them well. But the structured beats are doing more than they look like they're doing — they're the consistency layer for wardrobe, pose, and emotional register. Without them, the model regresses to "pleasant person, facing camera." On a 4-panel grief sequence, that's wrong.
Don't ship variant A alone either. Beat 4 proves it: the structured fields said "not looking back" and A rendered a smiling face-front portrait anyway. Structured fields alone are too generic to convey emotional weight; the story carries that load.
Architecture implication. extract.ts stays — it produces the discipline layer (subjects, wardrobe, action, location, sequence rhythm). The raw story rides alongside as a tonal reference. The model uses both. The structured fields are not the bottleneck; they're the safety net.