CS 153

CS 153 '26: Frontier Systems - Andreas Blattmann, Black Forest Labs

Anjney Midha Season 1 Episode 3



Guest lecture from Black Forest Labs Co-Founder Andreas Blattmann. 

SPEAKER_00

Quick show of hands. How many people recognize the song that was playing? Those are the Germans in the house, aren't you? You're from Germany? Yeah. One of my favorite songs, called Bella Napoli. It has been added to the CS153 Spotify playlist for those of you who aren't on it. For anybody who has music requests for CS153 this quarter, also known as AI Coachella, we've got an open playlist; please feel free to add songs there. That one was a request from me in honor of our speaker today, who I'm very lucky to call a close friend, and who is the co-founder of Black Forest Labs, Andreas Blattmann. Thank you for joining us, Andy. Thanks, Anj. Thanks, everyone. Thanks for having me. Andy is joining us from Germany, from a little town called Freiburg, which I think a lot of you will be hearing about more and more as it becomes a hub for frontier research in Europe. If you remember, in our first lecture we talked about the anatomy of frontier AI progress, and we talked about three or four important touch points in this class that you're going to be hearing about over and over again. One is that there's a transition happening from the old systems, the old infra stack, to a new one, and you've got to be open to understanding what those rewrites look like. Our speakers are going to tell you which parts of the stack they're helping to rewrite. We talked about the basic AI scaling recipe: there are two loops that are important to run. You get some compute, you get some data, you build a model, and then you do inference; that gives you revenue to buy more compute, and then context feedback. We talked about the bottlenecks on getting those loops scaling: context, compute, capital, and culture. We covered context and compute; we'll talk a little bit about all four today. And the last piece, for your projects, which is the part I'm sure many of you are anxious about, is how you get one of those scaling flywheels going. We talked about there being three steps in the journey. There's an incubation phase where you figure out which specific part of the frontier you want to attack with a state-of-the-art system. Then you land with a SOTA release, a state-of-the-art release, and that allows you to expand to more and more capabilities on the frontier you care about. And if you remember, we did a field trip into one of the frontier factories in our first lecture, which was Anthropic; we talked about code as one domain. Today we have a chance to do a field trip into another frontier AI factory, in Germany, called Black Forest Labs. And we've got here one of the factory owners, Andy Blattmann, who's a co-founder of Black Forest Labs and also a co-creator of Stable Diffusion. How many people here have heard of Stable Diffusion? All of you. Perfect. Great. So you've done some homework. Today we're going to talk about the frontier. On Tuesday we talked about the audio and speech frontier with Mati from ElevenLabs: what is audio intelligence, and where is it going? And today we have Andy talking to us about the frontier of visual intelligence, which I think is one of the most exciting frontiers, if not the most critical one to unlock more progress, if we really want to get these models to work in mission-critical contexts in the real world.
And so we're going to spend some time talking about the anatomy of visual intelligence as Andy sees it, as one of the pioneers of the field. Then we're going to go back in time a little bit and zoom into how we bootstrapped the Flux flywheel together a couple of years ago. Flux is the name of the flagship model family from Black Forest Labs. And then we're going to spend some time on the fun part, which is future frontiers: where are the unsolved problems right now, and where can you guys step in and start co-creating this journey in the space? So this was the frontier factory, right? We talked about this as the basic template. Again, to be clear, this is a directional heuristic; every team is different, every research project is different. But to give you a grounding sense of the repeating patterns in how some of the best teams are manufacturing intelligence repeatedly, remember this was the pipeline: pre-training, mid-training, and post-training with agents in the real world. There's a version of this that Andy's going to walk us through. But before we jump into that, why don't we just spend some time on you, Andy? Who are you and how'd you get here?

SPEAKER_01

Yeah, cool. Thank you, Anj. Thanks again for having me, everyone. Yeah, I'm Andy. I started looking into AI, I think, in 2019. I was actually originally studying mechanical engineering, classic German education. You go to school, you figure out you're somewhat technical, and what do you do if you don't know exactly what to do? You study mechanical engineering in Germany, right? And then, through a couple of coincidences, I got into computer science, into coding, and even into robotics back in the days. We'll talk more about robotics later. I applied for a PhD in Heidelberg, where I met my two co-founders, Robert and Patrick. That was a really small lab. Everyone back then was doing representation learning with visual models, for the visual domain, and computer vision itself in 2019 was kind of a niche topic within the niche topic of AI. People saw the potential already, but no one had an idea of how it would explode later, right? So it was really a niche topic we worked on, but we soon had a very good intuition about how to train models to generate pixels, mainly images back then. And we were competing on a research level, as a very small lab, with players that were much larger than us. Back in the day that was Google and OpenAI, their research teams. It was not yet about building frontier systems or foundation models; it was about who wrote the nicest paper to show that something was happening. This was pre-Stable-Diffusion, in 2019. For the ones who remember it, StyleGAN had just come out, and images were most often generated with GANs because they had good inductive biases for this data domain. Generating a 256 by 256 pixel image was a challenge; not every algorithm could do that. It was just a very different world. So we competed with labs that were much larger than us, and we had way, way less compute. So we had to come up with more efficient algorithms to solve that problem, because images, not even speaking of videos, are so much higher-dimensional than other representations like text. Text is much lower-dimensional.

SPEAKER_00

And to anchor us in time, this was when you were still at the University of Heidelberg. Exactly, right.

SPEAKER_01

So we spent about two years investigating how to find representations for natural data, mainly images and video, that are perceptually equivalent to the pixel space, or to what matters to us humans in the pixel space, but much lower-dimensional and much more efficient, because we didn't have the compute to train a generative model in pixel space, and it's also super wasteful. That gave rise to a series of papers on latent generative modeling. You train a compression model, similar to a learned JPEG codec, you could imagine, to find that perceptually equivalent representation of the pixel space, and you train the generative model there. That helped us save tons of compute and train our models much more efficiently, and with orders of magnitude less compute than our competitors we put out models that were on par with, or even better than, theirs. That algorithm, latent diffusion, also gave rise to Stable Diffusion. We proposed the algorithm, saw the potential, set out to find some compute, luckily found it in the open-source community, and trained Stable Diffusion, which was released in 2022. It pretty much surprised us as well, with all the hype it got. And it was funny: here in the Bay Area the hype was much bigger than in Germany. In Germany, still today, not a lot of people know about that model, funnily enough.
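A minimal sketch of the two-stage latent generative modeling idea described above. The module and variable names are invented for illustration; this is a toy, not the Stable Diffusion or Flux code.

```python
# Stage 1: a learned compressor (think "learned JPEG") mapping pixels to a much
# lower-dimensional, perceptually equivalent latent space and back.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, channels=3, latent_dim=4):
        super().__init__()
        # 8x spatial downsampling: 256x256x3 pixels -> 32x32x4 latents
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, latent_dim, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, channels, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

autoencoder = TinyAutoencoder()
images = torch.randn(8, 3, 256, 256)       # stand-in for a training batch of photos
recon, latents = autoencoder(images)

# Stage 2 (not shown): the generative model, a diffusion / flow matching denoiser,
# is trained entirely on `latents`, not on raw pixels.
print(images.numel(), "pixel dims ->", latents.numel(), "latent dims per batch")
```

The point of the sketch is the dimensionality gap printed at the end: the expensive generative model only ever has to work in the small latent space, which is where the compute savings come from.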

SPEAKER_00

Yeah, there was a moment I remember: DALL-E 2 was in preview, I think, and then you guys put out Stable Diffusion. And I remember on Reddit, somebody had taken one of their kids' drawings, a crayon drawing, and run it through the image-to-image transfer on SD, I think it was SD1, and out had come this beautiful illustration. I remember taking a screenshot of that, because I was just blown away, and I tweeted it. It was a Monday morning, we went into our exec meeting at Discord, and when I came out for lunch the tweet had three or four thousand likes. For me, that was the moment where I realized the technology of generative modeling had crossed an inflection point, where it suddenly became legible to people outside the machine learning community because it was so visual.

SPEAKER_02

Yeah.

SPEAKER_00

Right. I think it might be worth spending a couple of minutes here to take people back in time, because at that moment the ML community had a bit of a dogma that language modeling was the be-all and end-all of intelligence. The general consensus at the time was that language is the interface to reasoning, to intelligence: the way humans reason, the way we think, is through language. And I would say that's a philosophical belief whose religious zeal has come and gone. But for those in the computer vision community, and I count myself as one of those because my last company, as we've talked about, was Ubiquity6, a 3D mapping and computer vision company working on 3D reconstruction, it was clear that language was extraordinarily valuable, don't get me wrong, for reasoning about certain tasks and fields. But to those of us in computer vision it felt incomplete, because language is just one way we communicate, one way we reason about the world. For those of you who are visual thinkers, or who believe in multiple intelligences: you just learn better when you see visual representations of things. And so it was quite cool to see Stable Diffusion come out and make progress of a different kind legible to the machine learning community as well as the broader developer and consumer communities. And that's when we started working together, because we were trying to get some of these Stable Diffusion-like capabilities onto Discord. But can you talk about it? You said two things that I think are quite helpful to overlay for the students here, which is the difference between natural and unnatural representations. Could you speak about that for a sec?

SPEAKER_01

Think about ourselves: everyone here is looking at me right now, hopefully, and the medium through which you're perceiving this is clearly video and audio. You hear what I'm saying, and you see me gesturing or talking to Anj. These are what I call natural representations. If we think about the source of those representations, eventually it's the sun, or here we have some lights that try to resemble what the sun does, but it's electromagnetic waves from a source that we humans cannot control. We can obviously shape this world, we can build buildings, but the electromagnetic spectrum that falls onto the earth we cannot control. Same for sound: natural signals, like hearing a river flow, some might call it noise, are just there. Whereas text is inherently human-made. You see this in so many places. If you just measure the information per symbol that text transports, it's so much higher than the information per pixel in an image. And why is that? Because it's human-made. It was evolutionarily very important for us to communicate efficiently. I think that's also at the heart of why we need to compress images and videos before we train a generative model on them: there's so much redundancy in them. In text you don't have this redundancy, because it's human-made; throughout evolution we reduced the redundancy and made it efficient. For learning, however, natural representations are super important, at least in how I see it and how we see it at Black Forest Labs. If you think about how you learned as a baby, it's first observing things, hearing and seeing, and then interacting with things in the physical world. That's pretty much the first three, four, five years. I don't know when I learned reading, maybe at five or so. And the level of intelligence a three-year-old has, compared to the level of intelligence a language model has, is very different, right? That's why we care so much about natural representations like audio and video: we are absolutely convinced that this will be the foundation of all the higher intelligence these systems will eventually have. Starting from language and trying to stack additional representations on top of that is, in my opinion, the wrong way. You should start from first principles, from how we humans do it, and that's clearly learning on natural representations, by first observing and, second, we'll talk about that later, interacting. These are, in how we think about it, the main pillars of learning and also the main pillars of what we define as visual intelligence.
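A rough back-of-the-envelope illustration of the "information per symbol" point above. The numbers are illustrative only; the synthetic gradient image merely stands in for the local redundancy of natural photos.

```python
# Compare how many bytes a caption takes versus the raw pixels of a small image,
# and how much each compresses. Redundant signals compress a lot; dense ones don't.
import zlib
import numpy as np

caption = "a blue bird sitting on a branch in the morning sun"
caption_bytes = caption.encode("utf-8")

# Stand-in for a 256x256 RGB photo: smooth gradients, mimicking the locally
# correlated structure of natural images (a real photo behaves similarly).
x, y = np.meshgrid(np.linspace(0, 1, 256), np.linspace(0, 1, 256))
image = (255 * np.stack([x, y, 0.5 * (x + y)], axis=-1)).astype(np.uint8)
image_bytes = image.tobytes()

print(f"caption: {len(caption_bytes)} bytes, "
      f"compressed {len(zlib.compress(caption_bytes))} bytes")
print(f"image:   {len(image_bytes)} bytes, "
      f"compressed {len(zlib.compress(image_bytes))} bytes")
# Typical outcome: the caption barely compresses (little redundancy, may even grow
# slightly from header overhead), while the raw pixels shrink dramatically. That
# redundancy is exactly what a learned latent codec is built to remove.
```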

SPEAKER_00

So I think this is pretty important, because two or three years ago, I would say the consensus was that the way to do generative modeling was roughly this: you had foundation models that were unimodal, text-to-image, text-to-video. Stable Diffusion was the image model, exactly, unimodal, based on images. Could you just talk about what the state of the art was then versus now?

SPEAKER_01

Yeah. Stable Diffusion is a perfect example of that. It was a text-to-image model; you could do super nice artistic things that had not been possible before, but it was clearly made for content creation. It was a unimodal model made for the purpose of content creation. You could do artistic style transfer, or you could train a LoRA to get character consistency into the model and then use it for marketing or something. But that's all content creation. We currently see visual models starting to become way more than that. We don't train a single unimodal model anymore just to fulfill the purpose of content creation. We're training a unified, multimodal model for natural representations, for natural data, that can then give rise to so much more. It's about physical AI, it's about robotics, about computer use. There were recently a couple of demos that were super impressive. We can do world modeling and simulation, and still content creation. Combining different natural representations, rather than training on only one, is the key ingredient, because it gives the model a much more natural understanding. As one example, if I see two rigid bodies colliding, there's always a sound attached to it, right? There's a correlation between that sound happening and a certain action in the physical world happening, and being able to observe this correlation is super important, because it helps the model understand much better what's actually going on. Whereas if I only train on one single modality, it's much harder to understand what's going on. Or take me interacting with this bottle: I think it's super hard for a model to understand what's happening if it doesn't hear that sound. How would that be different for a transparent body compared to someone putting their hand through water, which is also transparent, right? So these correlations between different natural data representations are super important for a model to learn a higher representation of intelligence as well.

SPEAKER_00

Now, this idea, for those in the machine learning community, is not new. For a while there was a sense that the progression of technology would be that we'd have state-of-the-art systems capable of individual modalities, and then at some point, to make them smarter, we'd have to give them the ability to reason across different domains, transfer learning so to speak, where you can reason about the physics of this bottle hitting that, and the sound, the audio, emerging. But you can't start with everything on day one. So let's talk about how we bootstrapped the flywheel. Because today, fast-forward, four years after Stable Diffusion came out, Flux is used by millions of people around the world, you guys have hundreds of millions in revenue, and so on. But for the purpose of the students, I think it's helpful to zoom back in time and say: okay, you guys had this clear thesis that eventually models would be good enough at reasoning about all of visual intelligence. But you have to start somewhere. Yes. Especially when you have fewer resources than the largest companies in the world, a smaller team, less data. So can we spend a little time talking about how you concretized where to start, and then how you initialized that momentum flywheel we talked about, from day one?

SPEAKER_01

Yeah, absolutely. I think that's one of the most important things when starting a company, or any research project: focus matters.

SPEAKER_00

At the time, actually, SD was not even a company.

SPEAKER_01

Yeah, yeah, absolutely. As an example, I want to take how we started the company, because there we had this huge experience in unimodal image generation. We'd done Stable Diffusion, then we'd worked for Stability AI and put out a couple more models in that domain, and we pretty much had the recipe to train a frontier model for that domain. So when we started the company, we looked at the field and said: there's clearly a need for a next generation of image models, because so far the models couldn't even produce hands that actually have five fingers, right? That was a thing back in the day. So we attacked that specific field and said, okay, we want to build a model specifically for image that is just 10x better than everything else. We sat down together for three months, we had all the recipes, we knew what to do, we scaled it, and what came out of that was Flux One. It immediately had product-market fit, you could say. Even before we took our API public, we had a couple of very large customers who helped us close the feedback loop. And the feedback loop matters, because once you can build a technology, setting it out to solve real-world problems gives you the very important data to learn, first, what is an important problem to work on, and second, how to make the model better for that specific problem. With that you have the first loop closure for the flywheel.

SPEAKER_00

Let's break that release down: Flux One. I think this is the kind of pipeline we talked about, right? So could you go through the BFL version of this and explain what's going on at each step of the pipeline? Yeah.

SPEAKER_01

So this is particularly how we would define visual intelligence now, but I can also explain it for Flux One. For Flux One, clearly, we trained only unimodally, on text and image, only on those representations. So the pre-training was just a large corpus of text and images. For the mid-training, we added higher resolution and a couple more capabilities. And then we had this post-training phase. First we did a kind of offline post-training before releasing the initial model: some distillation to make the model more efficient, and you align it with your intuition about what customers will care about. Then you expose it to the real world, and you get this feedback. For Flux One, a very interesting observation we made was: wow, so many people are using our text-to-image models to train a LoRA and then do character consistency. They want the ability to control the model with more than only text. Text is obviously nice and easy, everyone understands it, everyone can use it, but it's also very ambiguous, and again there's a disconnect between this artificial representation, text, and the natural representation, the image. If I say "an image of a blue bird," there are infinitely many images that fit that description, right? The bird could be sitting on a branch, the bird could be flying, and so on. And it's actually super hard to apply precise control to the image you want to generate. I think that's a perfect example of the benefit of the loop closure, because what we learned was: okay, people actually want to do image editing. So what did we do? We did a post-train, partially based on the data we got, partially based on new stuff, to create an image editing model, which was Flux One Kontext, and that came out a bit less than a year ago. It was the first image editing model where you could actually, in a scalable and fast way, get character consistency. So now I could take an image of you, Anj, and maybe one of me, and combine the two of us sitting together, not in a lecture hall but in a cafe, having a chat. And that just has massive potential for everything in content creation, right? Marketing needs it to get different products into different contexts, and it is currently supercharging a lot of different applications around the creative world.
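An illustrative sketch of the conditioning difference between text-to-image generation and Kontext-style image editing. The toy module below is hypothetical, not the actual Flux architecture; it only shows where a reference image enters as an extra conditioning signal.

```python
# Minimal sketch: an editing-style denoiser is conditioned on BOTH a text embedding
# and the latents of a reference image; dropping the reference recovers plain
# text-to-image behaviour, where identity is unconstrained.
import torch
import torch.nn as nn

class ToyEditingDenoiser(nn.Module):
    def __init__(self, latent_dim=4, text_dim=64, hidden=128):
        super().__init__()
        self.latent_proj = nn.Conv2d(latent_dim * 2, hidden, 1)  # noisy + reference, channel-concat
        self.text_proj = nn.Linear(text_dim, hidden)
        self.out = nn.Conv2d(hidden, latent_dim, 1)

    def forward(self, noisy_latents, text_emb, ref_latents=None):
        if ref_latents is None:                        # text-to-image: no reference image
            ref_latents = torch.zeros_like(noisy_latents)
        h = self.latent_proj(torch.cat([noisy_latents, ref_latents], dim=1))
        h = h + self.text_proj(text_emb)[:, :, None, None]   # broadcast text over space
        return self.out(torch.relu(h))

model = ToyEditingDenoiser()
noisy = torch.randn(2, 4, 32, 32)      # noised latents of the image being generated
text = torch.randn(2, 64)              # stand-in for a prompt embedding ("add a hat")
reference = torch.randn(2, 4, 32, 32)  # latents of the input photo to stay consistent with

edited = model(noisy, text, ref_latents=reference)  # editing: identity anchored by `reference`
fresh = model(noisy, text)                           # pure text-to-image: identity unconstrained
print(edited.shape, fresh.shape)
```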

SPEAKER_00

Yeah, so this may not be obvious, and I want to pause here because Andy described it quite naturally. But for those of you who were trying out AI image models, say, 18 months ago: how many of you tried giving one a photo of yourself and saying, give this person a hat, and it came out actually looking like you? Yeah, no hands are going up. One hand's going up. It was a pretty basic capability gap: these image models just didn't have character consistency. You'd give it a photo of yourself and say, give him a mustache or whatever, and out would come somebody looking not like me. If I had a dollar for every time people said that problem would just not get solved, that these models are so dumb, look, AI is so dumb, it'll never surpass that capability threshold. I would just sit there. And these were very smart people, including, by the way, some of the speakers in this class, who over the years have realized they had to update their priors about the speed at which these capabilities improve. But it was common consensus at the time that these image models were just not going to get that good. AI is dumb, it can't reason the way humans do about faces and specific characters. It was shocking to me how confidently very smart people in the industry proclaimed it was just not going to get solved. Meanwhile, here we were in Freiburg, looking at the data, where people were using Flux One, which was not very good at character consistency. But that context feedback, seeing the prompts people were trying with the model, the feedback of them saying, actually, that was not good, can you please try doing this better, the multi-step reasoning chains we were getting from seeing people use it out in the wild (it was an open-weight model, which we'll talk about in a second, which is quite unique) gave us a very clear path to improving the capabilities. And then it was one of our team members, Dustin, in SF, who figured out that we should make an update to this, called Kontext with a K, because that's the German spelling, that is specifically an editing model. And between that insight, which came at an off-site, where were we? I think it was in Italy. We were in Italy. ChatGPT's image generation had just come out. We were literally all together. And there was a sense, and this is an important thing as a new team or a first-time researcher: it can be quite daunting when some lab with way more resources than you launches something that looks way better, and your first intuition is to go, oh my God, we're f'ed. But you've got to remember that the mark of a good leader is to not panic: keep calm, look at the data, assess the landscape, and then come up with a plan, step by step.
And often you'll notice that if you're good at mapping the domain, if you're an expert, somewhere in your intuition you have a gut feeling telling you there are actually some unsolved problems still left. Dustin did a great job at that. The team rallied, I think within 24 hours we had redone the staffing on the team, and what, 60 days later, Kontext came out. Right? And revenue from Kontext doubled, I think, within six weeks. In fact, soon after, and this part is public now, Meta announced a partnership with BFL and said they were going to be using Black Forest models, from this tiny team out of Germany, I think you guys were like 25 people, to drive image editing for all two billion Facebook and Meta users. This is not normal, right? By this time I was lucky enough to have been an investor with you guys for about a year and a half, so I would go to these off-sites, and I had a chance to see in real time how the systems problem is often not just a technical problem, because actually all the data was available to us, the context was available to us. It's the human system of organizing the team and the research culture, in a way where you're not panicking but still assessing the frontier methodically, being very honest with yourself about how fast capabilities are moving and where you can uniquely contribute, that is the key to keeping that loop going over and over again. And I think that's why BFL is where it is today. They went from zero to several hundred million in revenue; the company is now worth more than three billion. But it can be easy to forget that it wasn't always that way. There are these moments in your journey, especially in machine learning where things change so fast, when it's very tempting to just give up and say, you know what, this problem is solved. It's really remarkable to me how many teams in image generation just no longer exist, because they gave up. Instead, BFL stayed persistent, and today it's one of the only leaders left, I would say an independent leader, pushing the visual frontier. In fact, I hope some of the projects here push the frontier too. A learning for me has been that, with the right leadership, it's sometimes actually quite straightforward technically to keep advancing the frontier. But drift in your mission, a lack of conviction that it's worth attacking the problem you're committed to, over and over again, in the face of crazy challenges, is often the difference between success and failure. How many people have seen that meme of the guy tunneling and giving up right before he breaks through? You guys know that meme; we'll put it in the class reading list. I'm dating myself with boomer memes here. But I can't tell you how many times it's felt like that at BFL, and then one release later, the world has changed.

SPEAKER_01

And that's actually a good segue into what's next. Now we're seeing these models being applied everywhere for content creation. But then you need to look forward and think, okay, what's next? And clearly we now see this insane potential of these combined multimodal video, audio, and image models for the capabilities we just talked about: physical AI, computer use, world modeling and simulation, and still also content creation. You can actually build a single model that is capable of doing all of that together, and you get compounding effects, based on that example of the correlation between the sound and the action in physical space. You can also make models that are much smarter at generating normal, regular image or video footage for, say, advertising.

SPEAKER_00

Well, could you talk a little bit about that? Let's zoom forward. How do you take an image or content creation pipeline and add to it what you just talked about, the ability to actually interact with the physical world, or learn from the physical world? What does action prediction mean? How is that done? And then maybe you can talk about self-flow a little, since we're going to be assigning that as reading.

SPEAKER_01

Yeah, absolutely. First, as we already talked about, you need to go from unimodal to multimodal. And if you go back to the slide here, I think this is a good one. There's a large pre-training on, again, natural representations; these are the representations we humans use to learn from in the first years of our lives. There you combine everything together, and this combined pre-training gives you a very, very general model. So pre-training for us means images, video, audio, combined with an architecture, or with an algorithm that we also published at the beginning of March, self-flow, which allows the model to get compounding effects by observing, I won't make the example again, you've seen it a couple of times already, the correlations that exist between those modalities. That gives you a very, very general representation. What we add next, in mid-training, is additional context. We add new tasks, such as conditioning: I can condition the model on an input image and an audio track and say, I want to hear Anj saying XYZ in that voice, and the model does this. That's additional context. But importantly, for extending the scope beyond pure content creation, you also want to condition the model on actions, and you want the model to predict actions. Then we can arrive at models like computer-use models, for instance, which are conditioned on a video or an image and predict the next move, say keystrokes, to achieve a certain task, like opening a new browser tab. This is crucial for expanding the scope of the very general representation we get from pre-training; we want to make use of that generality. So we add additional context and, importantly, actions. And what we then do, this is very important, or maybe let me zoom out a bit and come back to the human learning example: pre-training and mid-training are all still observation. All the foundation models we train in the early training stages are models observing examples; we calculate a loss from that and backpropagate it through the network, but there's no interaction whatsoever. So how do we actually get the model to really interact with the physical world? That's super important for learning higher forms of intelligence, as we're all convinced. So what do we do? We take this model that can be given a video and predict an action, and we hook it up in the real world, say on a robot. That allows the model, through the robot, to interact with the physical world and create data, and we can pipe that data back into the model training. That's when we close this feedback loop. So our post-training means interacting with the physical world.
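A toy sketch of the "condition on actions, predict actions" idea for computer use, plus the interaction loop that generates new training data. All names, shapes, and the action vocabulary are hypothetical; this is not BFL's model or API.

```python
# A pooled observation (screen/camera features) and a task embedding go in;
# a next action comes out; the resulting trajectory is recorded as training data.
import torch
import torch.nn as nn

ACTIONS = ["click", "type", "scroll", "open_new_tab", "done"]

class ToyActionHead(nn.Module):
    def __init__(self, obs_dim=256, task_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + task_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, len(ACTIONS)),
        )

    def forward(self, obs_emb, task_emb):
        return self.net(torch.cat([obs_emb, task_emb], dim=-1))  # logits over actions

policy = ToyActionHead()
obs = torch.randn(1, 256)   # pooled features of the current screen / camera frames
task = torch.randn(1, 64)   # embedding of "open a new browser tab"

# Post-training loop in miniature: act in the (simulated) environment, record what
# happened, and feed the interaction data back as new training examples.
replay = []
for step in range(3):
    logits = policy(obs, task)
    action = ACTIONS[int(logits.argmax(dim=-1))]
    next_obs = torch.randn(1, 256)          # stand-in for the environment's response
    replay.append((obs, task, action, next_obs))
    obs = next_obs

print([a for _, _, a, _ in replay])  # the recorded trajectory becomes training data
```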

SPEAKER_00

Right. So this is important. If you guys remember, we talked about verifiability as a key predictor of where frontier progress is going to continue. Wherever you have context and performance that can be verified, progress can quite reliably be made there.

unknown

Right.

SPEAKER_00

So in software engineering, that's verifiable because you can write unit tests. In image generation, it's not very verifiable, right? Beyond the basic tasks Andy talked about, accuracy, five fingers instead of six, there's character consistency, which is more of a preference. But even for the finger example, how would you measure that at scale without a human telling you whether there are actually five fingers? You should talk about how that verification works, and then, in the new world where you have to verify physical tasks like robotics, what does that look like? Yeah, yeah.

SPEAKER_01

So verification for images, or for videos, is super tricky, especially when it comes to physical things. But once you hook the model up in the real world, there are just certain things you can do and certain things you can't do, because a robot arm simply cannot move certain joints in certain ways. Exposing the model to the physical world naturally applies the boundary conditions we would expect. That's a very important step, and with that you have the perfect environment to directly, inherently model these restrictions.

SPEAKER_00

Whereas in the case of aesthetics or visual preference, how did you guys verify that? How did you get a model to be better when it's just doing content creation?

SPEAKER_02

Yeah.

SPEAKER_01

That involves a massive amount of human judgment, and then feeding that signal back into the model. But that's often very tedious and also very dependent on who you're asking. If I ask you, you've looked at so many images by now, I'd consider you an expert, because we always show you our models, and you're also enjoying it, I guess. But showing an image to someone who has no idea about image generation, versus to myself, who has looked at so many images, gives you a very different signal. I would rate something as good or bad that looks very different from what another person would pick, right? It depends on the crowd you're asking. So you can ask people, but it's very ambiguous in a way.
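One common way to turn noisy, rater-dependent "which image looks better?" judgments into a usable signal is to aggregate many pairwise votes into per-image scores, Elo or Bradley-Terry style. This is a generic sketch, not necessarily BFL's pipeline; the vote data is made up for illustration.

```python
# Aggregate pairwise preference votes into per-image Elo-style scores.
from collections import defaultdict

# (winner, loser) votes from different raters over candidate images A-D
votes = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("D", "A"), ("C", "D")]

ratings = defaultdict(lambda: 1000.0)  # every image starts at the same rating
K = 32.0                               # update size per vote

for winner, loser in votes:
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser] -= K * (1.0 - expected_win)

for image, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(image, round(score, 1))
# The resulting scores can label "preferred vs. rejected" pairs for preference
# post-training; asking raters from a different crowd simply produces a different
# score table, which is exactly the ambiguity described above.
```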

SPEAKER_00

So this is a key insight, I would say. Anytime the answer to the eval question of how you verify is "it depends on the audience" or "it depends on the person consuming the system," it should trigger a light bulb, at least it does for me, that the value you get from the system varies a lot with how much the model can be customized for a particular audience. And that is where open source comes in. The beauty of open models is that if you give away the weights, and they're good general weights, then you can tell Meta: hey, you're welcome to customize the preferences of this model as you see fit for your users. And you can tell a government with different cultural preferences and biases, one that wants to deploy content creation, say, for internal teams in a completely different culture: you can have control over that last mile. And I think that's turned out to be a very critical part of the open ecosystem. I often get asked why BFL opened their models up and just gave away all this research for free. Is it just that they want to save the world? Well, part of it is cultural. As you can tell, Andy came from the academic community and enjoyed and benefited from open publishing. But at the end of the day, you've got to turn these research products into businesses, right? And it turns out there's extraordinary value in producing state-of-the-art systems that are open and customizable when the consumer of the system, the person benefiting from it, has very different preferences from other people consuming the same system. Does that make sense? Can I get some nodding if that's making sense? Yes. Okay. This will be a consistent theme in the class: anywhere you have consumers of a system, or customers, or the people benefiting from the system, wanting more and more personalization and customization for themselves, that's where open models become extraordinarily valuable. And it turns out you can actually build a very large business very quickly doing that. So I think there's a bit of a false trade-off in the space about open versus closed. These are both just techniques, tactics for how to deliver value. Sometimes they get politicized philosophically, but from a very basic, first-principles commercial perspective, open makes a lot of sense in domains where the aesthetics, the preferences, and so on have a long tail, where the distribution is wide and heterogeneous, versus domains where preferences are actually quite narrow. If there's a pretty narrow distribution, then closed models and so on are quite valuable. I think there's one last piece we haven't covered, which is the state of the art today. Because, as Andy said, the state of the art now is about how you get these systems to reason in a unified fashion across text, image, video, and so on, in a way that has transfer learning across these different modalities. A very hard problem. Very, very hard problem. But, as is consistently the case with BFL, the team makes these research advancements and then gives away the technology. And this was one example, what, two months ago? No, a month ago.

SPEAKER_02

A month ago.

SPEAKER_00

Self-flow, which has turned out to be a technique that now, suddenly, magically, all my friends at all the labs are calling about, saying, Anj, have you heard of self-flow? I was like, yes, we have heard of self-flow. It's on arXiv, because Andy and team published it. So maybe you can talk a little bit about the intuition behind self-flow as a mechanism to solve the multimodal reasoning problem.

SPEAKER_01

When training visual generative models, it has historically always been a bit tricky to get representations into the model so that it doesn't only generate pixels but also understands what's semantically going on. There has been a body of work on aligning the representations of generative models with learned representations from representation learning, to give them a bit more understanding of what's actually happening, and not only make them stupid pixel generators that just learn to produce what looks consistent in an image. In the last two years that work has always focused on a single modality. What people did was take a pretrained representation learning model, maybe some of you know the DINO model for images, and try to make the internal representations of the transformer that serves as the backbone for image generation align with the representations that this representation learning model produces. That is clearly restricted once you want to go multimodal. But as you saw in the last slides, going multimodal is super important for learning higher forms of intelligence. We want models learning multimodal representations because that's frankly what we humans have, right? So it's crucially needed. So how can we combine these so-called alignment losses with multimodal representations? The self-flow paper solves exactly that problem in a very natural way. A really recommended read for everyone. It's assigned reading.
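A hedged sketch of what a single-modality representation-alignment loss of the kind described above can look like: the generator backbone's internal features are pulled toward a frozen pretrained encoder's features. Toy modules throughout; this is not the self-flow method itself, which extends the idea beyond one modality.

```python
# Add an alignment term to the usual generative loss: make the generator's hidden
# features agree (in cosine similarity) with a frozen "understanding" encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

frozen_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))  # stand-in for a DINO-like model
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

backbone_hidden = torch.randn(8, 512, requires_grad=True)  # generator's internal features
projector = nn.Linear(512, 256)                             # maps them into the encoder's space
clean_images = torch.randn(8, 3, 32, 32)                    # the data the noisy latents came from

target = frozen_encoder(clean_images)                       # what "understanding" should look like
aligned = projector(backbone_hidden)

# Alignment term: maximize cosine similarity between generator and encoder features.
align_loss = 1.0 - F.cosine_similarity(aligned, target, dim=-1).mean()

# Total objective = usual denoising / flow matching loss + weighted alignment term.
denoise_loss = torch.tensor(0.0)                            # placeholder for the main loss
total_loss = denoise_loss + 0.5 * align_loss
total_loss.backward()
print(float(align_loss))
```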

SPEAKER_00

Okay, good. Shall we transition to questions? Okay.

SPEAKER_01

Yeah, so the question was: when we close the feedback loop, how do we make sure that personal data is respected and that no harm is generated based on those models. First, we have a lot of content filters on our API, obviously, because our belief is that these models are powerful tools for humans to create super nice and creative outputs, and much more than only content creation, as we just saw. We don't want them misused, so we add a lot of content filters that make sure no harm is generated from personal information. And obviously, being based in the European Union, we comply with the EU AI Act, and there's also a legal requirement we follow around requests: if you put an image of yourself into our API and you say, look, I don't want you to store this data, we have to delete it. So we have systems in place to make sure that happens.

The next question was: we've had a lot of partners, large companies we've worked with, like xAI, Meta, Nvidia, and how do we evaluate whom we work with and whom we don't? Maybe as a general statement, we are working on building the visual intelligence infrastructure for everyone, basically. From an infrastructure perspective, you really want to make sure you put guardrails around your models so people cannot misuse them. But then the infrastructure is there for basically everyone, right? And that's the standpoint we're taking. We care a lot about the safety of the models, that's important, and we do everything we can to prevent misuse. But beyond that, we're putting a technology out there and providing it to everyone, and it's always hard to take a standpoint on exactly whom you work with and whom you don't, because it gets very tricky to justify in the end.

SPEAKER_00

Let me try to translate what Andy's saying. The company basically applies its guardrails to everybody. So no matter who you are, how big you are, or how much money you've got, if you want us to remove our guardrails: sorry. Those guardrails apply to everybody equally. Because being a standard, being infrastructure that people can rely on, means you don't treat different people differently. Everyone can rely on getting the same quality of service as everybody else, regardless of whether someone has more money or is more politically influential. That's the position BFL has taken as an infrastructure provider: it doesn't matter who you are. Now, sometimes partners have custom technical needs because of their scale: hey, we need it deployed in this way, we have certain latency requirements. But when it comes to guardrails, they apply to everybody. And when some partners say, we want you to remove those guardrails, you say sorry, you can go elsewhere. That has sometimes resulted in the company losing meaningful amounts of revenue, and that's okay, because in the long term, as we talked about in the first lecture, the way you get infrastructure to move stably is to have trusted standards and trusted institutions to enforce them. And sometimes you've got to enforce them yourselves. Would you say that's roughly correct? Yes. We've had some spirited debates, I would say, at the company. And we've talked about culture as a bottleneck on progress. One of the secret sauces of BFL is a very united culture, where there's a lot of debate and dissent about what to do and not to do, but when they commit, they all commit together. I mean, how many people have left the company in its entire lifetime? Like two? One. They've had one person leave in the entire history of the company. That's not common in the AI space, where sometimes you have co-founders leaving six months in. This is my one issue with the Bay Area: the culture's forgotten that sometimes, to make progress on long-term ambitious goals, you've got to stay together as a unit. And that's a great question. I think moments like that challenged the culture at several points, and I think they turned into moats in a sense. Yeah, absolutely.

SPEAKER_01

You debate, and then you disagree, and then you commit. And onwards. And there'll be more, I'm sure. Next question, yes. The question is how we deal with the insane amount of data labeling that has to be done, since, unlike text, images are just not that straightforward to label. Two answers. First, when we train a model, we go through the pre-training, mid-training, and post-training stages we just saw. We start with more data, and noisier data, in pre-training, and as you progress through training you reduce the amount of data but increase the quality. So in pre-training it's enough to do automatic labeling that you can apply at massive scale. There are systems available to do this, some publicly, and obviously we also have internal tooling that I can't talk too much about now. The closer we get to the later stages of training, the more we involve human signals, because in the latest stages, where you align the very broad and general representation your model has learned with what actually matters most to people out there, you want annotations that reflect exactly what you want, and there the gold standard is still human labeling.

The next question: where do we see the field going in terms of iterative denoising in general? Will it still be needed in the future, given that there are now other probabilistic approaches, such as drifting models, that allow maybe a single step? I'll answer that very generally. I think it's super interesting to compare these flow matching and diffusion models with language models. Both are iterative, but flow matching or diffusion models are iterative along a dimension that is orthogonal to the data, the artificial time dimension we impose that goes from pure noise to the data you want to generate, whereas language models are iterative in the direction of the data: you generate token by token. And that has very interesting implications for both the training and the inference properties of these models. Diffusion and flow matching models are actually pretty data-inefficient in training, because every training example gives rise to infinitely many potential losses: you can pick any point on the continuous trajectory from clean image to noise and ask the model to denoise from there, so you'd have to revisit the same example many times to cover them. Compare that to language models, or let me be more specific, autoregressive models, where we can train on all tokens in parallel. At inference, on the other side, the effects of these two properties are switched: with language models, you have to generate token by token.
There are some hacks, such as speculative decoding, that may help, but essentially you still have to generate sequentially; you can't just skip that. Whereas for diffusion or flow matching models, you can actually distill the model down. That's what we do in post-training: distillation. We've written a bunch of papers on adversarial diffusion distillation, where you bring the number of steps of a flow matching model down from, say, 50 to four or two. At that point it doesn't make a real difference anymore whether you then use a drifting model and get there in one step, or maybe two; the pipeline is just more stable and mature when you distill a diffusion model down to two steps using adversarial diffusion distillation. So I think these are two sides of the same coin. But coming back to autoregressive models: getting these insane speedups just by exploiting the iterative nature of the model through distillation is not really possible there. So a very interesting research problem I think about often is: how can we combine the data efficiency of autoregressive models with the inference properties of these diffusion and flow matching models? For everyone who likes doing research, that's a super interesting problem to work on. Are you guys hiring? Yeah, okay. Always, always.
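A generic sketch contrasting the two "iterative axes" described above: flow matching pairs each clean example with noise at a freshly sampled time t, so one example can yield endlessly many distinct losses, while an autoregressive model supervises every next-token position in one parallel pass. Toy modules and shapes, not BFL's training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Flow matching style loss on latents --------------------------------------
velocity_net = nn.Linear(16, 16)            # toy stand-in for the denoiser/velocity model
x1 = torch.randn(4, 16)                     # clean latents
x0 = torch.randn(4, 16)                     # pure noise
t = torch.rand(4, 1)                        # a fresh random time per example, every step
xt = (1 - t) * x0 + t * x1                  # point on the straight path from noise to data
target_velocity = x1 - x0                   # what the model should predict along that path
fm_loss = F.mse_loss(velocity_net(xt), target_velocity)

# --- Autoregressive style loss on tokens ---------------------------------------
vocab, seq = 100, 12
lm_head = nn.Linear(32, vocab)              # toy stand-in for a language model head
hidden = torch.randn(4, seq - 1, 32)        # states for positions 0..seq-2
tokens = torch.randint(0, vocab, (4, seq))  # every shifted position is supervised at once
ar_loss = F.cross_entropy(lm_head(hidden).reshape(-1, vocab), tokens[:, 1:].reshape(-1))

print(float(fm_loss), float(ar_loss))
# At inference the roles flip: the autoregressive model must emit tokens one by one,
# while the flow/diffusion sampler's step count can be distilled down from ~50 to 4
# or fewer, which is the point made in the answer above.
```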

SPEAKER_00

I could spend the next half hour talking about this, but we're not going to be able to. Latent adversarial distillation is a part of the pipeline at BFL that I would say is very near and dear to the core of the company, for two reasons. One is that it makes these models extraordinarily efficient, and for those of you who have German friends, you know that efficiency is top of mind. That's a through-line in everything BFL does: high quality and efficiency. But it also ended up being a key unlock for our business model. Because early on, a big question was: we have this philosophy of wanting to be open, wanting to produce open weights, but we've got to find a way to make it commercially sustainable. There are a lot of projects that open their models up and then just die, and that's not stable infrastructure you can rely on either. One of the key differences between diffusion models and autoregressive models, as Andy was saying, is that the model size can stay the same across a diffusion family. If you look at the first Flux family, we didn't release Flux as a single model; Flux One was packaged into three different models: Flux Schnell, which is German for fast, Flux Dev, and Flux Pro. Flux Pro we put behind an API, whereas Flux Schnell was full Apache 2.0 open weights, and Flux Dev was open weights under a non-commercial license: you were welcome to look at the weights and use the model, but if you wanted to make revenue off it, you had to pay. And the key distinction was that these three were actually the same size model. Unlike, for example, language models, where with Claude you have Haiku, Sonnet, Opus, and so on, which are actually different sizes. In autoregressive land you distill the model down by size: you train a big model, then distill it down to smaller and smaller sizes. In Flux One, which is a diffusion model family, it was the same size but fewer steps. I mean, you can still do size distillation. You can still do size distillation, but for Flux One they were all the same size, yeah, absolutely. So we distilled it down to Schnell, which was basically a four-step model: super fast, super lightweight, lower quality. Pro: more steps, super high quality, slower, because you're iterating over more diffusion steps. And that turned out to be a very beautiful packaging of the core technology in a way that was also commercially sustainable, because the open-source developer community was thrilled: now they had this really fast model, for a lot of use cases, that you can run locally. All the enterprises who didn't want to deal with customization had a high-quality model behind an API. And developers who wanted a mix of both got a pretty high-quality model that was also open weights and fast. And that trade-off is a hard one to make if you don't foresee that you want to repeatedly close this loop of frontier research we've talked about.
Two years ago, I would say, the state of the art was: train a model, put the weights out there, and see. But when you start thinking long term, you're not thinking in terms of a single model release. You're trying to think about it as a system, with capital, context, all of it. We've talked about all the bottlenecks, and you want one iteration to help you unlock the bottleneck for the next run, and the next run. And adversarial distillation, latent adversarial distillation, turned out to be a pretty key unlock for that part of the bottleneck two years ago. Next question.

SPEAKER_01

The question was about spatial intelligence, and whether it's more about 3D, or how I see the explicit-3D space that some companies are working in versus our more video-based approach going forward. I'll take an opinionated view here. Again, it comes back to how we humans learn in our childhood. I think we don't have explicit 3D representations of anything in our heads; we just learn based on video and audio. At least I, and I cannot look into other people's heads, don't have an explicit 3D coordinate representation of anything in my head. I just see you; my eyes are doing a bit of triangulation, obviously, but it's still pretty much a projection onto those two eyes. So I don't have an explicit 3D model of, say, Anj, or this bottle, or anything else. I just learn it from video, and I can obviously move my head around or interact with the object, and that's all we need. So I don't believe too much in these explicit 3D representations. It's also a data problem at some point: do we get all the data labeled with 3D representations? I don't know. Just learning based on natural representations and then interacting with them, as I said before, is in my view the way to go forward.

SPEAKER_00

I'm going to express a somewhat contrarian opinion, which actually doesn't totally disagree with you, but I would sharpen it. I think what we've learned empirically is that these static 3D representations, and I have a strong point of view on this because I started a company that failed at it, Ubiquity6 was a 3D mapping company, we tried to map the world with 3D reconstructions using a bunch of deep learning priors. And what we learned was, don't get me wrong, the technology worked and it was valuable, but explicit 3D representations are very narrow, inflexible, and static. Especially when you take the temporal element, time, out of it, their uses are quite limited and niche. There are applications where 3D representations are useful, don't get me wrong, especially for machine perception. Point clouds, for example, are great for letting robots do indoor positioning in GPS-denied environments. But when you're trying to build a system that humans can interact with, it turns out point clouds, 3D meshes, these are all intermediary representations, and in a sense hacks, that are less general and flexible than representations that naturally integrate time and audio and all these other modalities, because that's how we reason. Now, I actually do disagree with you a bit, Andy. I do think I have a 3D representation in my head of certain things. But is it explicit?

SPEAKER_01

Are you thinking in terms of coordinates? I don't think so.

SPEAKER_00

Well, I think about this lecture hall. When I'm planning a lecture, I often have a 3D representation spatially, but it's just one input representation as part of a broader set.

SPEAKER_01

But there's no prior that forces you to think about it explicitly. It's implicit. You learn it based on what you're perceiving, right? And what you're interacting with. That's right. And obviously a network could, in its weights, represent this kind of implicit 3D structure if it needs it. But I think we're talking about the interface, right?

SPEAKER_00

Do we need these explicit... ah, okay, yes. At the human interface level, no, it's quite unnatural to reason in explicit 3D, right? I would agree with that. Well, welcome to our late-night conversations. Thank you so much, Andy, for coming. Thanks for having me.

SPEAKER_01

Thanks, everyone.