I'm bullish on AI filmmaking. But there's a reason most AI-generated content you see on X or Reddit is either slop, single shots, or trailers. The stuff that's actually good involves heavy human effort at every stage.
If you're exploring this space, here are the problems you'll hit.
Scripts Fall Apart at Length
A good writer still crafts fundamentally better scripts than most AI tools. With enough effort you can get good scripts out of AI too, but the real problem shows up once you go past 5-10 minutes and start creating shot lists.
You can build an elaborate prompt chain: character bibles, prop lists, scene geography. But by Scene 15, your agent framework has "forgotten" the protagonist is left-handed, or that the photograph on the mantle is a crucial plot point. You end up building complex retrieval systems just to remind the model of its own world state. The context problem remains.
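The retrieval layer can be as simple as a structured story bible kept outside the model, with the relevant facts re-injected into every prompt. A minimal sketch, where the bible entries and tag scheme are illustrative, not a real production schema:

```python
# Minimal story bible: persistent world facts, re-injected per scene.
# The entries and tags below are hypothetical examples.
STORY_BIBLE = {
    "protagonist": ["Mara is left-handed", "Mara never removes her gloves"],
    "props": ["The photograph on the mantle shows the missing brother"],
    "geography": ["The cabin has one door, facing east"],
}

def build_prompt(scene_text: str, tags: list[str]) -> str:
    """Prepend only the bible entries relevant to this scene's tags."""
    facts = [fact for tag in tags for fact in STORY_BIBLE.get(tag, [])]
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"World facts (must stay consistent):\n{context}\n\nScene:\n{scene_text}"

prompt = build_prompt("Mara reaches for the photograph.", ["protagonist", "props"])
```

Even this toy version shows the shape of the problem: you, not the model, decide which facts are relevant to Scene 15, and that tagging work scales with the script.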
Multi-Character Scenes Are a Nightmare
This is where current tools fall apart completely.
You need a base image with consistent characters in the right positions before you touch video generation. Try generating a few wide shots with 4-5 characters in Nano Banana Pro. After several iterations, you'll likely get character positions swapped, inconsistent faces, or compositions that ignore basic framing rules.
Say you get your base image. You want Character A to turn to Character B and deliver a line. Current video models can handle a single character talking to camera, with good movement and decent camera motion.
They cannot handle:
- Character A talking while Character B reacts (most real shots include reaction shots)
- Back-and-forth dialogue in the same frame
- Any interaction requiring both characters to move in coordination
Your options: lip-sync tools (break with multiple people or profile angles), generate dialogue audio via 11Labs (and pray the video model approximates the mouth movements), or rewrite the scene as shot-reverse-shot singles.
If you rewrite as singles, that decision should have happened at script stage. Your establishing shot showed both characters in frame. Now your edit jumps to singles and spatial continuity is broken. You need to regenerate the establishing shot, update your blocking, regenerate subsequent shots that reference that geography.
Every limitation downstream forces changes upstream.
Spatial Reasoning Doesn't Exist
Directors work with spatial intuition: the 180-degree rule, screen direction, eyeline matches, the geography of a scene. AI generates each shot independently with no memory of where anything is.
What happens in practice:
Shot 1: Character enters room. Camera is by the door, looking toward the window.
Shot 2: Character at window. You prompt "character looking out window."
The model doesn't know where the door is relative to the window, which direction the character walked, or what's camera-left versus camera-right from Shot 1. You get a character at a window, but the room layout has shifted. The door is now visible in the wrong position. The window moved walls. The character faces the wrong direction relative to where they just walked from.
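Since the model has no spatial memory, one workaround is to carry an explicit geography record between shots and serialize it into each prompt as constraints. A sketch, with all field names illustrative:

```python
# Hypothetical scene-geography record carried between shots, because the
# video model itself remembers nothing about the room.
from dataclasses import dataclass

@dataclass
class SceneGeometry:
    camera_position: str       # where the camera sits for this shot
    facing: str                # what the camera looks toward
    landmarks: dict[str, str]  # object -> position relative to camera

    def as_prompt_constraints(self) -> str:
        lines = [f"Camera is {self.camera_position}, facing {self.facing}."]
        lines += [f"The {obj} is {pos}." for obj, pos in self.landmarks.items()]
        return " ".join(lines)

shot1 = SceneGeometry(
    camera_position="by the door",
    facing="the window",
    landmarks={"door": "behind the camera", "window": "ahead on the far wall"},
)
constraints = shot1.as_prompt_constraints()
```

This doesn't make the model spatially aware; it just gives you a single place to update when Shot 2's geography has to agree with Shot 1's, and text you can paste into every prompt for the scene.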
The introduction problem is worse. You generate your base shot: two characters, coffee table between them, 5 seconds of dialogue. Now you want Character C to enter the frame.
You can't. Video models don't understand "add this character to this scene." They'll hallucinate Character C in random positions, ignore the instruction, or regenerate the entire scene — but now Characters A and B look different.
Your base image needs everything that will appear in the shot. No adding elements mid-generation. Planning has to be perfect upfront.
Evaluation Is Manual and Slow
You're working in a probabilistic system. The loop looks like this:
- Generate an image. Iterate until one is close.
- Manually check composition, character accuracy, and lighting.
- Generate video from that image. Iterate until the motion looks right: no artifacts, no character morphing.
- Review each output: does it match intent? Is the motion natural? Any glitches?
- If yes, move to the next shot. If no, back to the start.
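The loop above can be sketched in code. The `generate_image`, `generate_video`, and `human_approves` calls are stand-ins for whatever tools and reviewers you actually use; only the loop shape is the point:

```python
import random

# Stubbed stand-ins: in practice these are your image model, video model,
# and a human reviewer (the expensive manual step).
def generate_image(prompt): return {"prompt": prompt, "seed": random.randint(0, 9999)}
def generate_video(image): return {"source": image}
def human_approves(output) -> bool: return True

def produce_shot(prompt: str, max_attempts: int = 5):
    for _ in range(max_attempts):
        image = generate_image(prompt)
        if not human_approves(image):   # composition, faces, lighting
            continue
        video = generate_video(image)
        if human_approves(video):       # motion, artifacts, morphing
            return video
    return None  # escalate: rewrite the prompt, or rethink the shot

shot = produce_shot("Character A turns toward the window")
```

Notice that the human sits inside the inner loop twice per attempt. That is the cost center, and it's what automated evaluation would need to replace.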
Why not automate evaluation? Use vision models to score outputs, auto-reject below threshold.
The problem: image understanding models aren't consistent enough. "Quality" is subjective. And false positives kill you. The model says it's good, you move forward, then in the edit you realize the character's hand position breaks continuity with the previous shot.
You need a human in the loop. But at what stage? After every generation? After hitting some automated baseline? There's no clean answer yet.
Asset Management Becomes Its Own Job
By the time you're 50 shots in, you're drowning in files:
- Which iterations did you actually use?
- What prompt generated each one?
- What seed? Which character references were active?
- How does this version differ from what you tried yesterday?
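The cheapest defense is to record metadata for every generation the moment it happens, used or not. A minimal sketch, with all field names illustrative:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ShotRecord:
    shot_id: str
    prompt: str
    seed: int
    character_refs: list[str] = field(default_factory=list)
    used_in_edit: bool = False
    notes: str = ""

# Append one record per generation, including rejects.
log: list[ShotRecord] = []
log.append(ShotRecord("shot_047_v3", "A picks up the phone", 4212,
                      character_refs=["charA_ref_v2.png"], used_in_edit=True))

# Later: reconstruct the constraints before regenerating Shot 47.
record = next(r for r in log if r.shot_id == "shot_047_v3" and r.used_in_edit)
print(json.dumps(asdict(record), indent=2))
```

A JSON file per project is enough at small scale; the point is that "what were the constraints?" becomes a lookup instead of archaeology.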
You realize Shot 47 doesn't work. You need to regenerate. But what were the constraints? What was the character state at that point in the story?
Your script says: "Character A picks up the phone while saying 'I can't believe it.'"
The text-to-speech model needs voice profile, emotion, pacing. The image model needs character reference, phone object, hand position, composition. The video model needs previous shot context, spatial layout, movement path, timing. All of them need the same character state, same emotional tone, same scene geography.
Keeping all of this in sync, in a way your orchestrator can handle intelligently, is a job in itself.
Try This
If you want to see the gap in action: create a three-minute AI video of two people arguing in an elevator.
You need spatial consistency: the buttons and mirrors can't change between shots. You need natural back-and-forth dialogue with multi-character coordination. You need one character to exit while the other stays.
Now compare the effort required to shooting this on your phone with two friends.
The tools will get better. But right now, these are the walls.