Why dialogue placement is the hardest part of AI comic generation

If you ask people what is hard about AI comics, most will say character consistency. That is a real problem, but it is not the worst one in practice.

The harder problem, in my experience building Comicory, is dialogue placement.

Putting words inside a comic panel sounds trivial. It is one of the parts that breaks down most often, and when it breaks down the whole page reads wrong, even if every individual panel looks beautiful.

Speech bubbles are constraint puzzles

A speech bubble is not just a graphic. It is a constraint that ties together text length, panel composition, character position, reading order, and the actual story logic.

A bubble needs to:

Sit somewhere that does not cover the speaker's face.
Not cover other important visual information in the panel.
Point clearly to the speaker.
Be readable in size and contrast.
Fit the text without overflowing or forcing the font to shrink below a readable size.
Come before the next bubble in reading order.
Match the tone of the line (a whisper bubble looks different from a shout).

Each constraint sounds simple. Together, they collide. A model that generates a perfect-looking panel often leaves no room for the bubble, or covers the character's mouth, or breaks the reading order with the next panel.
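To make the collision concrete, here is a minimal sketch of two of those constraints as geometry checks. It is an illustration, not Comicory's actual code: the `Rect` type, the `bubble_violations` function, and the idea that faces arrive as pre-detected rectangles are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in panel pixel coordinates (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float

    def intersects(self, other: "Rect") -> bool:
        # True when the two rectangles share any interior area.
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)

    def contains(self, other: "Rect") -> bool:
        # True when `other` lies fully inside this rectangle.
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.w <= self.x + self.w
                and other.y + other.h <= self.y + self.h)


def bubble_violations(bubble: Rect, panel: Rect, protected: dict) -> list:
    """List the placement constraints this bubble breaks.

    `protected` maps a label ("face", "hand", ...) to a region that must stay
    visible. How those regions are detected is outside this sketch.
    """
    problems = []
    if not panel.contains(bubble):
        problems.append("overflows panel")
    for name, region in protected.items():
        if bubble.intersects(region):
            problems.append(f"covers {name}")
    return problems


panel = Rect(0, 0, 100, 100)
protected = {"face": Rect(40, 40, 20, 20)}
print(bubble_violations(Rect(5, 5, 30, 20), panel, protected))    # []
print(bubble_violations(Rect(35, 35, 30, 20), panel, protected))  # ['covers face']
```

Even this toy version shows why the constraints interact: moving a bubble to clear a face can push it out of the panel, and the check has to be rerun after every adjustment.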

Text rendering is still unreliable inside images

A second problem is that most image models still cannot reliably render text inside the generated image.

Even when the model places a bubble shape correctly, the letters inside it may be unreadable. Half a word missing. A weird artifact. A misspelling.

For a comic, this is not cosmetic. The dialogue is half the storytelling. A misread bubble flips the meaning of a panel.

That is why most working systems, including Comicory, do not let the image model write the text at all. The model produces an empty bubble or a marked region. The actual text is composited on top by a typography layer that knows about font, kerning, and bubble fit. That gives clean, predictable letters.

But it also moves the hard problem somewhere new. Now the bubble shape and position have to match a text length that was decided separately.
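The fit check a typography layer has to run can be sketched roughly like this. The function name `fit_font_size`, the size range, and the average-glyph-width estimate are assumptions for illustration; a real compositor would measure text with actual font metrics rather than a fixed character aspect ratio.

```python
import textwrap


def fit_font_size(text, box_w, box_h, max_size=18, min_size=10,
                  char_aspect=0.55, line_height=1.2):
    """Return the largest font size (px) at which `text` fits the box, or None.

    Assumes every glyph is about `char_aspect * size` pixels wide, which is
    only a rough approximation of a real lettering font.
    """
    for size in range(max_size, min_size - 1, -1):
        chars_per_line = int(box_w / (size * char_aspect))
        if chars_per_line < 1:
            continue
        # Keep long words intact so an oversized word counts as a misfit.
        lines = textwrap.wrap(text, width=chars_per_line,
                              break_long_words=False)
        too_wide = any(len(line) > chars_per_line for line in lines)
        if not too_wide and len(lines) * size * line_height <= box_h:
            return size
    return None  # nothing readable fits: the layout has to change instead


print(fit_font_size("Run.", box_w=120, box_h=40))                       # 18
print(fit_font_size(" ".join(["word"] * 18), box_w=120, box_h=40))      # None
```

Returning `None` instead of shrinking below `min_size` is the design choice that matters here: an unreadable bubble should be treated as a layout failure, not silently rendered.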

Length mismatch breaks everything

The most common failure is the mismatch between the planned dialogue length and the visual space the model leaves for it.

Imagine the storyboard says character A has a four-word line. The model generates a panel with a small empty corner for the bubble. Fine. Then, during revision, the user rewrites the line to eighteen words. Suddenly there is no room. The compositor either shrinks the text until it is unreadable or lets it overflow onto the panel art.

Solving this is not a one-shot problem. The system has to negotiate between text length and panel composition at every revision step. It needs to know:

How big can this bubble grow before it hits important art?
How much of the panel is safe to cover?
Should the bubble break across multiple shapes if the line is long?
Should the line be split into two bubbles for the same speaker?

That is a layout engine, not a prompt.
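A minimal sketch of that negotiation, under simplifying assumptions, might look like the following. All names (`max_height`, `split_in_two`, `plan_bubbles`) are hypothetical, the lettering size is fixed, the bubble only grows downward from a fixed top-left anchor, and a split assumes a second slot of the same size is available elsewhere in the panel.

```python
import textwrap

FONT_PX = 14                 # assumed fixed lettering size
CHAR_W = 0.55 * FONT_PX      # assumed average glyph width
LINE_H = 1.2 * FONT_PX       # assumed line height


def needed_height(text, width):
    """Estimated height of `text` wrapped to `width` pixels."""
    chars = max(1, int(width / CHAR_W))
    return len(textwrap.wrap(text, width=chars)) * LINE_H


def max_height(x, y, width, panel_h, protected):
    """How far a bubble anchored at (x, y) can grow downward before it
    reaches the panel bottom or a protected (px, py, pw, ph) rectangle."""
    limit = panel_h - y
    for px, py, pw, ph in protected:
        overlaps_horizontally = px < x + width and x < px + pw
        if overlaps_horizontally and py + ph > y:
            limit = min(limit, max(0, py - y))
    return limit


def split_in_two(text):
    """Split at the word boundary closest to the middle of the line."""
    words = text.split()
    if len(words) < 2:
        return [text]
    best_i = min(range(1, len(words)),
                 key=lambda i: abs(len(" ".join(words[:i])) - len(text) / 2))
    return [" ".join(words[:best_i]), " ".join(words[best_i:])]


def plan_bubbles(text, x, y, width, panel_h, protected):
    """Decide between one bubble, two bubbles, or a re-render request."""
    room = max_height(x, y, width, panel_h, protected)
    if needed_height(text, width) <= room:
        return ("single", [text])
    parts = split_in_two(text)
    if len(parts) == 2 and all(needed_height(p, width) <= room for p in parts):
        return ("split", parts)
    return ("rerender", [text])


face = [(0, 60, 200, 80)]  # protected region below the bubble slot
print(plan_bubbles("Run.", 10, 10, 110, 200, face))
print(plan_bubbles("I told you not to open that door until morning",
                   10, 10, 110, 200, face))
```

The third outcome, `"rerender"`, is the important one: when neither growing nor splitting works, the text side has run out of options and the panel composition itself has to change.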

Reading order is a separate layer

Even if every single bubble fits, the bubbles in a multi-panel page must follow a reading order. Western comics read left to right, top to bottom. Within a panel, bubbles read roughly the same way.

A model that generates panels in isolation has no incentive to keep this order consistent. Panel 1 might have its bubble in the top right, while panel 2 has it in the bottom left, and the eye has to ricochet around the page.

Reading order is the cheap part to fix if the system explicitly models it. It is hard to fix if you only realize after rendering that the page is unreadable.
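Modeling it explicitly can be quite small. The sketch below assumes each bubble records its panel index, its position in the script, and its centre point; the `Bubble` type, the `row_height` banding, and the left-to-right convention are illustrative assumptions, and the banding is a coarse approximation of how readers group bubbles into rows.

```python
from dataclasses import dataclass


@dataclass
class Bubble:
    panel: int   # panel index in page reading order
    seq: int     # position of this line in the script
    x: float     # bubble centre, in panel coordinates
    y: float


def visual_order(bubbles, row_height=40.0):
    """Order bubbles the way a left-to-right, top-to-bottom reader meets them.

    Bubbles whose centres fall in the same horizontal band of `row_height`
    pixels are treated as one row and read left to right.
    """
    return sorted(bubbles,
                  key=lambda b: (b.panel, int(b.y // row_height), b.x))


def out_of_order(bubbles, row_height=40.0):
    """Return (earlier_seq, later_seq) pairs where the eye meets a later
    script line before an earlier one."""
    ordered = visual_order(bubbles, row_height)
    return [(a.seq, b.seq) for a, b in zip(ordered, ordered[1:])
            if a.seq > b.seq]


page = [
    Bubble(panel=0, seq=1, x=20, y=10),    # second line placed top-left
    Bubble(panel=0, seq=0, x=150, y=15),   # first line placed to its right
    Bubble(panel=1, seq=2, x=30, y=100),
]
print(out_of_order(page))  # [(1, 0)]
```

Run before rendering, a check like this flags the swapped pair while it is still cheap to move a bubble or reorder the lines.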

What the product actually has to do

So the working pipeline for dialogue in an AI comic ends up doing more layout work than image generation work:

Decide dialogue lines per panel during the storyboard stage, so the rendering stage knows how much room each bubble needs.
Tell the image model to leave clear, neutral space in a specific region of each panel.
Composite the real text in a typography layer where the user can adjust font, size, and bubble shape.
Validate that no bubble covers an important face or hand.
Validate reading order across the page.
Provide a small editor so the user can drag a bubble or split a long line into two.

None of these are heroic. All of them are essential. Skip any one and the page reads worse than an amateur's hand-drawn comic.

The product lesson

The flashy demos of AI comics all show character art. The unglamorous reality is that dialogue placement determines whether the page is actually readable.

For Comicory, treating bubbles as a real layout problem, not a rendering hint, was one of the larger architectural decisions. The model produces art and space. The typography layer produces text and bubble shape. The product glues them together with constraints the user can override.

Image quality matters. Character consistency matters. But until the words sit cleanly in the right place at the right time, no amount of pretty art makes the page feel like a comic.
