The Generative Model Cascade
Generative models evolve in a predictable order: capabilities show up in text models first, then image, then video, then 3D. Abstract capabilities, like fine-grained editing, negation, and stylistic consistency, cascade from their early emergence in text into the multimodal models over time. When something becomes possible in text models, you can be pretty confident it will eventually work in image models, then video, then 3D. If something isn't yet possible in text, it surely won't be possible in video.
This isn't necessarily about innovation flowing from text to vision teams; it’s about the fundamental complexity of these domains plus the data and compute we've allocated to each. We have more useful training data and have invested more compute in text models, so they have unlocked more abstract capabilities so far.
A good example of this cascade is the task of editing content. For a long time, no model could make fine-grained edits. In an image, you couldn't change a detail without changing 10 others. In code, asking for a feature would often result in "drive-by" edits to unrelated files. In text, asking the model to rephrase one sentence would sometimes result in a couple of other sentences getting tweaked too.
At some point, that got solved in text: models became able to make fine-grained edits. Some months later, the capability made its way into code, and we were finally able to one-shot new features without worrying about random edits and abandoned stubs. Just last month, we saw the first image models capable of making fine-grained edits. This is the cascade at work.
The same pattern shows up everywhere: coarse-grained edits, where you want to rewrite something major while keeping the core intact; identity consistency, keeping the same character or object across multiple generations; style consistency, maintaining an aesthetic; and negation handling, actually understanding "not X" and following it (like when I ask for no em dashes!). Note that these aren't model-specific evals; they're very abstract notions that can be interpreted differently in different contexts.
This doesn't just apply between modalities. Within each modality, there are sub-cascades that might be even more predictable. For video, the most capable models are closed-source offline video generators; after a small gap come their open-source cousins; then there's a vast chasm before closed-source interactive/realtime video; and open-source realtime rounds it out.
If you're building an application that requires realtime video, you don't have to guess whether realtime video will get style consistency. You can watch the offline video models and get a pretty good estimate of how many months it will take a given capability to reach realtime. For example, offline models now have character and style consistency, so we should expect realtime to have it by this time next year.
One of the most important things for an AI builder is to keep track of where we are in the cascade: which abstract capabilities are working in which modalities, and crucially how big the gaps are. The gaps can grow and shrink, so it's important to track not only which capabilities are where, but how far behind each modality is, based on the time between a capability's latest emergence in one modality and its arrival in the next. This informs your strategy: build around current capabilities, or wait for models to unlock the one you need.
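To make that concrete, here's a minimal sketch of what tracking these deltas could look like. It's purely illustrative: the capabilities, modalities, emergence dates, and helper names (`lag_months`, `estimate_arrival`) are hypothetical placeholders, not data or tooling from anywhere real.

```python
from datetime import date, timedelta
from statistics import mean

# emergence[capability][modality] = approximate date the capability first worked.
# Every capability, modality, and date here is a made-up placeholder.
emergence = {
    "fine_grained_edits": {
        "text": date(2023, 6, 1),
        "code": date(2024, 1, 1),
        "image": date(2025, 3, 1),
    },
    "style_consistency": {
        "text": date(2023, 9, 1),
        "image": date(2024, 8, 1),
    },
}

def lag_months(capability, leader, follower):
    """Months between a capability emerging in the leader and in the follower."""
    dates = emergence.get(capability, {})
    if leader not in dates or follower not in dates:
        return None  # hasn't emerged in one of them yet
    return (dates[follower] - dates[leader]).days / 30.4

def estimate_arrival(capability, leader, follower):
    """Guess when `capability` reaches `follower` by applying the average lag
    observed between `leader` and `follower` on other capabilities."""
    observed = [
        lag for cap in emergence if cap != capability
        if (lag := lag_months(cap, leader, follower)) is not None
    ]
    leader_date = emergence.get(capability, {}).get(leader)
    if not observed or leader_date is None:
        return None
    return leader_date + timedelta(days=round(mean(observed) * 30.4))

print(lag_months("fine_grained_edits", "text", "image"))      # observed text -> image lag, in months
print(estimate_arrival("style_consistency", "text", "code"))  # rough guess at when code catches up
```

The arithmetic isn't the point; keeping even a crude table like this turns "watch the leading modality" into a rough forecast you can plan around.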
To be clear, this isn't a law of nature. The cascade isn't driven by complexity and information theory alone; it's also a function of how much compute and data we pour into each modality. But until those constraints change dramatically, it's one of the most useful frameworks I've found for thinking about where AI capabilities are headed and when they'll arrive.