AIAW Podcast

E124 - All about #DBRX AI Model - Hagay Lupesko

April 05, 2024 Hyperight Season 8 Episode 11


Tune in to Episode 124 of the AIAW Podcast for an exclusive deep dive into the #DBRX AI model with Hagay Lupesko, one of the contributors to this groundbreaking project. Hagay, with his rich background at tech leaders like GE, Meta, and Amazon, and currently at MosaicML, a part of Databricks, brings unique insights into the development and capabilities of DBRX. This episode is packed with intriguing discussions about DBRX's revolutionary features, its place in the competitive landscape of AI, and the technical finesse behind its exceptional performance. We'll also touch on Hagay's journey in the tech world before joining MosaicML and Databricks. The episode delves into the broader context of AI, from the open-source vs closed-source debate to the transformative role of Large Language Models and the concept of Artificial General Intelligence. For enthusiasts and professionals alike, this episode is a rich source of knowledge on one of the most exciting advancements in AI technology.

Follow us on YouTube: https://www.youtube.com/@aiawpodcast

Anders Arpteg:

I'm so jealous of you to have access to FSD in Tesla, but have you tried it out then to just make it go from point A to B in some way?

Hagay Lupesko:

Yeah, I use it to go to work and from work; it's about a 30- to 40-minute drive from my home to our office in San Francisco.

Anders Arpteg:

And it works well without any interventions, or how well does it work?

Hagay Lupesko:

It works pretty well. I maybe have to intervene once or twice in the whole drive, but not for something risky; it's more that I don't like how it picks the lane, or something like that. And, you know, I tried FSD on a friend's car, I think maybe two years ago, and it was pretty bad then. It was actually dangerous: in a short drive there were multiple interventions we had to do. But now I think it got much, much better. Is it version 12, or which one do you have?

Anders Arpteg:

You think?

Hagay Lupesko:

Uh, actually I don't know.

Anders Arpteg:

I did need to do some kind of a software update to get it. But yeah, and for people listening, and please correct me, you just told us this, but Tesla just opened up FSD for free for everyone for a month in the US, basically, right?

Hagay Lupesko:

In North America, yeah.

Anders Arpteg:

Yeah yeah, that's super fun and exciting, to really be able to try it out. And your experience so far is positive, right?

Hagay Lupesko:

Oh yeah, I mean, I think it's incredible. And for all of us working in the AI field, I think it's sometimes hard to see the progress, right, the big picture of the progress being made. But autonomous vehicles are one of the original key applications for AI, and it took quite a few years, actually, to get to where we are now. FSD in Tesla is, in my experience, actually working super well. In San Francisco, where I live, you can already take a driverless robotaxi with Waymo, which is also an amazing experience. It feels like you're in a Harry Potter movie, where the car just moves by itself: the wheel is moving, and it takes you from point A to point B. Have you tried it? Yeah, yeah, I used Waymo a few times in San Francisco and it's pretty incredible.

Anders Arpteg:

Which one would you put first? Which one has the best full self-driving, you would say today?

Hagay Lupesko:

Probably Waymo. In the Tesla there is a driver behind the wheel, and the driver can intervene at any given point in time, and they actually do require you to put a hand on the wheel. In the Waymo it's fully autonomous: you just order a car, it comes by, it stops close to where you are, your door is unlocked, you get in, the car goes, no one is in the driver's seat. It just takes you to where you need to get, and it's very smooth, very careful, but also, you know, getting to where it needs to get.

Jesper Fredriksson:

I think next time we need to record a podcast from a Waymo in the US.

Anders Arpteg:

Yes, and it's so exciting to hear about this, and I'm very jealous, as I said, of you having access to FSD in your car right now.

Jesper Fredriksson:

So great to hear. Do you have any fears that some people will need to intervene and not manage to? Based on your experience, are there any risks? Do you expect a pushback from people?

Hagay Lupesko:

I think safety-wise, I would actually wait for the regulators to call it out, at the same kind of quality bar that Waymo has hit. Regulators in the US do review Waymo's technology; they required Waymo to do a bunch of experiments and prove that it's safe. I don't think Tesla has done something similar just yet. So when I'm using FSD, I'm definitely in the driver's seat, my hands are on the wheel and I'm paying attention, and I definitely don't think it has hit that safety bar just yet. But I think it's just a matter of time. With enough training data, enough mileage, this technology, if it's not already, is going to be much safer than humans. Because the machine doesn't drink, doesn't have to sleep, right? All it needs is electricity and software updates, and so it's much safer than the average human.

Anders Arpteg:

Yeah, true, true, perhaps for software engineers as well. But before that, I'd love to welcome you here. It's an honor to have you, Hagay Lupesko, here, or rather over there, I guess, in California, joining us online. You're the VP of Engineering at MosaicML, right, and now part of Databricks. Was it last summer that they acquired MosaicML?

Hagay Lupesko:

Yeah, last July. MosaicML, we were a San Francisco-based startup, was acquired by Databricks, and now all of us are part of Databricks. Within Databricks we are now called the Mosaic AI organization, and we basically work on all of Databricks' AI platform capabilities, from GenAI to what's called today traditional ML.

Anders Arpteg:

Awesome, and one of the main reasons, of course, that you're here is the announcement of the new Databricks large language model called DBRX, and you've been a key member of that team right.

Hagay Lupesko:

Yeah, I mean, there are a lot of incredible people who have worked on DBRX, from shortly after the acquisition. I'm part of the team that built the model and made it available on the platform, along with lots of super talented people from the research team and the engineering team, and then even product, marketing, legal. It was a pretty big effort across the company.

Anders Arpteg:

I'm also looking very much forward to this discussion and hearing more about how it was actually built, why you chose to make it open source, which is amazing and great, but also perhaps getting some secrets for how you made this model perform as well as it does: outperforming Grok even though Grok is, like, I guess, three times bigger or something like that, and still making it work as well as it does. Super good at coding also.

Jesper Fredriksson:

That's really nice to see.

Anders Arpteg:

But before we go into those details, we'd love to hear very briefly about your background. Who is really Hagay Lupesko? Before you joined MosaicML and Databricks, how would you describe yourself?

Hagay Lupesko:

That's a good question. How do I describe myself? Well, I think I was fortunate to work at quite a few really great companies. I started my career as a software engineer 20 years ago, worked at small companies and larger companies, and focused a lot on computer vision early in my career. It wasn't called AI back then. Most applications didn't use neural networks; they were not considered to work well and were very hard to train, so the way computer vision was done back then was very, very different from how it's done today.

Hagay Lupesko:

You were working in medical imaging, right? Yeah, I was in medical imaging, working at an Israeli startup at the time. Later I moved to GE Healthcare, working on cardiology applications and radiology applications, so very heavily image-based systems. And then about 10 years ago I was relocated to the US by Amazon. They were looking for people all over the world and bringing them to the US, so they brought me to the US, to California. That's where I've been for the last 10 years. For me, professionally, it was really important to get to Silicon Valley, which to me was the center of software technology globally: really a small area with a lot of technology companies, startups, VCs condensed together. All of that soup was super interesting to be a part of.

Anders Arpteg:

I mean, I can see that, but I think Israel as well is a very tech-startup-dense area, right?

Hagay Lupesko:

Yeah, 100%. Israel is, I think, maybe second or third, or maybe even first per capita, I would guess. But the scale in Israel is nowhere near the scale in Silicon Valley; I don't think there's another area that comes close to Silicon Valley in terms of the tech ecosystem, I'd say. So, anyway, I worked for about five years at Amazon, starting in Amazon Music and then moving to AWS, and at AWS I actually helped build SageMaker. Those were the early days of the industry adopting neural networks, and SageMaker was a key product that Amazon launched back then to really democratize machine learning and deep learning.

Anders Arpteg:

We were actually competitors at that time, because I worked at Peltarion then and we saw SageMaker as a competitor to us. We're friends anyway.

Hagay Lupesko:

Yeah, those were very dynamic times, I think, and lots of companies were rushing to figure out what's this thing called neural networks and how it can be used. We just spoke earlier about self-driving cars, and I think one of the early applications of neural networks back then was autonomous vehicles. There was a lot of hype in that field, a lot of investments going into it, and lots of training jobs for what we back then thought were really large models: the ResNets of the world. Of course, in comparison to large language models today, those are tiny, tiny models. But anyway, that was exciting. So we shipped SageMaker and the training engine, then inference services. And SageMaker, by the way, is still, I guess, or was until recently, AWS's fastest growing service across all of their portfolio of services. So it was definitely very successful.

Anders Arpteg:

And then you moved to MosaicML at some point, right? How many years ago was that?

Hagay Lupesko:

So after AWS, I moved to Meta, and I worked with Meta AI for about three years.

Hagay Lupesko:

Meta is one of those companies that, in terms of AI, work in the future, right? They're doing things that other companies, or most other companies, would be doing a few years from now. So that was exciting. And Meta was moving, back then, a lot of their recommendation systems, which is the number one ML application in a company like Meta, and I'm pretty sure in a company like Google as well.

Hagay Lupesko:

All of the recommendation systems were moving from being tree-based models to being deep learning based, and I worked a lot on that effort. By the time I left, there were about 3,000 recommendation models within the company using neural networks to serve recommendations: from feed recommendation to ads ranking to marketplace listing ranking. And it was very successful; neural networks were definitely beating the other, more traditional ML models for recommendations. A lot of it was just because a company like Meta has troves and troves of data that they collect, and what's nice about neural networks is that they scale: they become better the more data you feed them. In more traditional ML, what we found is that the ability of the model to improve diminishes at some point. But with neural networks, you can increase the model capacity by adding more layers, or by making layers larger.

Hagay Lupesko:

Then the model is capable of learning more from the data.

Anders Arpteg:

Just keep scaling right.

Hagay Lupesko:

The more data, the more parameters you have.

Anders Arpteg:

It just gets better and better in some way.

Hagay Lupesko:

Yeah, exactly. And then after that I moved to MosaicML. I kind of decided, hey, maybe it's about time, after big companies like Amazon and Meta, to downsize a little bit and go to a smaller company. How big was the company then when you started? When I started with Mosaic, I think we were about 30 people, so fairly small, if you compare it to Amazon's, I don't know how many employees they have, a million or whatnot. Even Meta: when I joined Meta there were, I think, about 30,000 employees, and when I left it was close to 100,000. So yeah, much smaller.

Hagay Lupesko:

And I came in to lead the engineering team, which back then was also fairly small, maybe 10 people, 10 engineers. So yeah, it was exciting. A small company, tightly knit, everyone working very closely together, very clear mission, very clear focus. It's a very different environment than you have in a typical large company. For me it was a lot of fun, very exciting, and really a refreshing change.

Anders Arpteg:

I mean awesome, and I'm trying to rush ahead a bit because we want to move into DBRX as soon as possible, but perhaps you can just quickly describe. You know what happened when Databricks acquired Mosaic ML.

Hagay Lupesko:

Yeah. So, just very briefly about Mosaic: we built a platform focused on training large models in general, large deep learning models, doing it efficiently and allowing researchers and ML engineers to really experiment quickly. And as we were building this platform, this big wave of LLMs came in, and suddenly everyone, or at least medium to large companies, wanted to train a large language model. I think we had the right platform at the right time. We focused our platform and all of our efforts into really perfecting it for building LLMs, and as we were doing that, we also decided to train our own models as sort of proof that the platform works really well.

Hagay Lupesko:

So we built and open sourced two models, MPT-7B and MPT-30B: a 7-billion-parameter model and a 30-billion-parameter model. Those are dense GPT-style models trained on one trillion tokens each, and those models were very successful. We open sourced them and it really allowed the company to turn a corner, because we got a lot of visibility when we open sourced these models, a lot of customers came in, our revenue started increasing and we started onboarding a lot of customers. At some point we also added inference to the platform, so once you train a model, you can also deploy it for inference. We were on a really good path: a very painful path, because you have to grow and serve more customers, but also, you know, successful.

Anders Arpteg:

At that point, I think you said you added inference to the platform. Do you mean the Databricks platform, or did you mean the MosaicML platform?

Hagay Lupesko:

No, we built it into the MosaicML platform; it was before the acquisition. But as part of that we also, I guess, got on the radar of Databricks. Databricks already had a pretty successful platform based on Spark, and they were expanding it to data warehousing but also to more generative AI capabilities, and I think they just saw us as very complementary to what they were building. And what's amazing to me is, from the first conversation, or at least to the best of my knowledge the first conversation, that the two CEOs had, Databricks' CEO and Mosaic's CEO, to the time the acquisition actually closed, it was literally two months. So two months from the first conversation until we were actually Databricks employees. And it was a fairly large deal, about $1.3 billion, and I'm still amazed how such a big acquisition could fully close within just two months, which is incredible. Yeah, and nice to get going as early as possible in the new sort of role.

Hagay Lupesko:

Yeah. So, as we all know, the AI field moves super fast. Nobody is waiting for any too-lengthy due diligence or negotiations and things, and I think the two companies just decided it was a good fit on both ends and moved on with it. And yeah, within two months we were Databricks employees.

Jesper Fredriksson:

So I'm trying to move into DBRX now. Was that initiated during the Databricks time? What were the sort of goals of this model? Was it just an improvement on the earlier models, or is it part of the Databricks offering now?

Hagay Lupesko:

Yeah, it's a great question.

Hagay Lupesko:

So, like I mentioned earlier, we had already decided that us training models on our platform and then open sourcing them is a good thing.

Hagay Lupesko:

And it's a good thing because of a couple of reasons and that strategy kind of carried through also into Databricks. First of all, you know, all of us in the AI community we're kind of building on top of one another and I think that's one of the key things that allows us to move so quickly, because most of the research is done in public. Even today, most research papers you see out there come with some kind of code attached in GitHub, especially if you're showing some results, and it allows everyone to use what the others did, build on top of it. And all of that, of course, while recognizing that we're all sort of standing on the shoulders of giants, if you will. So for us, one thing was hey, we're actually leveraging so many things built by the community. We want to make sure we contribute back, and we contribute back, you know, to help the community, but we also realize it helps us as well.

Anders Arpteg:

We have another topic going into depth about the open source versus closed source, so we'd love to hear more about your thoughts, about you know the pros and cons of doing that. But perhaps you can just speak a bit about the origins of DBRX. Was that initiated after the acquisition and what was the thinking there, moving from MPT, I guess, into DBRX?

Hagay Lupesko:

Yeah, so DBRX was definitely initiated after the acquisition, but the core people, the team, really the Mosaic team that was building MPT, mostly continued to build DBRX. So, again, the motivation was very similar to why we built and open sourced MPT, and everything kind of carried through. I guess the main things that changed were, first of all, the naming: MPT stood for Mosaic Pre-trained Transformer, and we changed the naming to DBRX to be associated with Databricks very clearly.

Hagay Lupesko:

And then, of course, the model itself: if you think about it, DBRX is a much, much more powerful model than even the larger MPT-30B. We recognized that the industry moves quickly, models are becoming larger and more sophisticated, and we wanted to make sure we have a really good open source model out there. Where we ended up is with the best open source model today, if you look at most evaluation metrics, and that's what we were targeting. But we weren't sure that we were actually going to get there, because it's a very, very tall order, a very tough task.

Anders Arpteg:

I think we can move more into some practical details about the model and the architecture. How did you come up with exactly this architecture, with a mixture-of-experts approach, with the 132 billion parameters, et cetera? Did you do some ablation studies? Did you have some reasoning about different approaches? How did you come up with the final architecture?

Hagay Lupesko:

Yeah, so we have a fantastic research team at Mosaic, and the research team definitely did a lot of studies. The training itself took about two months net, from when we started the full training until it ended, but there were quite a few months before that with a lot of evaluations and a lot of experiments. First of all, one big decision was mixture of experts versus a dense model. When we started doing our work, it was definitely not clear that MoEs were going to give us what we were looking for: training a model that is more complex, with a higher capacity for learning and a higher quality target, but at the same time efficient to serve. And we come from a very practical standpoint. Our customers, both at Mosaic and at Databricks, are enterprise companies that want to leverage AI, but they also recognize the price tag that generative AI has. And, to be clear, serving a large model is very expensive: the larger the model in terms of number of parameters, the more expensive the GPUs you need to serve it, your throughput is limited, et cetera. So we were trying to find the right balance between quality of the model and serving costs, and MoE seemed like a really interesting approach. But we had to do quite a few experiments to compare MoE with dense and make sure we were actually getting the benefits for both training and serving.
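To make the dense-versus-MoE trade-off concrete, here is a minimal sketch of top-K expert routing in plain Python. Everything here is invented for illustration: a 16-expert, top-4 router matching DBRX's shape, but not its actual code, dimensions, or gating function.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(router_logits, k):
    # Pick the k highest-scoring experts for one token and renormalize
    # their routing weights so they sum to 1.
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

random.seed(0)
n_experts, top_k = 16, 4
logits = [random.gauss(0.0, 1.0) for _ in range(n_experts)]
chosen = route_top_k(logits, top_k)
print(chosen)  # four (expert_index, weight) pairs

# Only top_k of the n_experts expert FFNs run for this token, so roughly
# top_k / n_experts of the expert parameters are active at inference time.
print(f"active fraction of expert parameters: {top_k / n_experts:.2f}")
```

In a real MoE layer each chosen expert is a feed-forward block, and the token's output is the weighted sum of the chosen experts' outputs; that is why serving cost tracks the active parameters rather than the total.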

Hagay Lupesko:

And when we started, Mixtral was not around. Back then, Grok-1, which is also an MoE, was not out. There were rumors that GPT-4 is an MoE, but it's really hard to know what's going on at OpenAI because it's pretty closed these days. So we definitely did a lot of experiments; MoE versus dense was one of the key ones. And then there are a lot of hyperparameter choices for the model architecture, and we ran dozens and dozens of different experiments on our Mosaic platform that helped us really choose the right architecture. So that was sort of step one.

Anders Arpteg:

Just before we move on from mixture of experts: you have chosen a bit of a different architecture, or hyperparameters, for the mixture of experts than Mistral and Grok, et cetera. It seems you went for 16 experts instead of the usual eight, I guess, and you also have four active experts at any time. How did you come to that decision? What kind of insight made you choose that, and why do you think others chose some other approach?

Jesper Fredriksson:

Do you think the continuation will have even more experts, or is this sort of a sweet spot?

Hagay Lupesko:

Okay, so what made us choose this specific setup?

Hagay Lupesko:

Like you said, we have 16 experts, with a top K of four active; and yes, Mixtral for example has eight and two. So first of all, theoretically there's a trade-off here. The more experts you have and the larger the top K, the more it impacts the total number of parameters the model has, as well as the number of active parameters during inference. The more you increase these values, the more complexity you get in the model, which theoretically means the model has a better capacity for learning. But on the other hand, training becomes more expensive and inference becomes more expensive. One way to think about it mathematically is that if you look at the combinations of 4 out of 16 versus the combinations of 2 out of 8, we have about 65 times more combinations for each router. So theoretically, it increases the model's capacity and its ability to perform better in terms of quality. The trade-off is serving: serving becomes more expensive, and training becomes more expensive as well.

Anders Arpteg:

Just to elaborate, to make sure people understand: if you have 8 experts and 2 active, you only choose 2 from 8, so to speak, 8 choose 2. And when you have 16, there are a lot more combinations for finding the four that are actually best for each token. So in that way it's 65 times more choices to make, in some way, right?

Hagay Lupesko:

Exactly. Yeah, specifically, the number of combinations when you choose two from eight is 28, and the number of combinations choosing four out of 16 is 1,820. So we get basically 65 times more combinations with our approach.

Jesper Fredriksson:

Now I would say this is in theory.

Hagay Lupesko:

We did a lot of experiments with different setups, different parameters, to validate this theory, and we got results that made us pick 16 experts with four active at every router.

Anders Arpteg:

I'm super interested in these things. Do you think the future will go in this direction, to just have more experts? Or, as you said, Jesper, do you think there will be a sweet spot where you find the suitable number of experts depending on parameter size or something? What's your thinking there?

Hagay Lupesko:

Yeah, I think, first of all, MoE in my mind is here to stay, just because I think it's very impractical to continue increasing dense models: serving becomes too expensive for most practical purposes. So MoE is here to stay. Now, in terms of the specific architecture hyperparameters chosen, I think we don't know enough at this point. I do expect a lot of research to come out, in academia and from the industry, about these approaches, and we may see something even more complex. Maybe we'll differentiate this setup of number of experts and top K based on the layer the router is in. Today, in DBRX and in Mixtral, it's the same number of experts, and the same number of experts chosen, at each layer or each block, but going forward this may change as well. So I think it's still early days for mixture of experts. We'll learn more as we go.

Anders Arpteg:

Awesome.

Jesper Fredriksson:

And the context window, it was 32K, right?

Anders Arpteg:

Yeah. I mean, there are some, like Grok, I think at least 1.5, that have a bit bigger one, but 32K is, I guess, rather big for open source models. Did you have any thinking there on the size of the context window?

Hagay Lupesko:

Yeah, so we trained the model on a 32K context length. When you use the model, you can actually use a larger context length; it would probably just not work as well, because it was not trained on larger context windows. And then, of course, the other practical limit you have is just memory on the GPU when you're doing serving, because the longer the context window, the more memory you need for the KV cache and for the computation. We chose 32K because we thought this is what enterprises need when we look at practical applications. Again, being part of Databricks, we have more than 10,000, I think now it's like 12,000, customers on the Databricks platform. Many of them are trying to build AI applications, and we try to be really attentive to what they're trying to do and what their needs are. So we chose 32K because we believe it's really good.
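The GPU-memory limit mentioned here, the KV cache, grows linearly with context length. A back-of-the-envelope sketch, with made-up transformer dimensions (not DBRX's published configuration):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Each token stores one key and one value vector per layer:
    # 2 * n_kv_heads * head_dim elements, times layers, tokens, and batch size.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative model: 40 layers, 8 KV heads of dim 128, fp16 (2 bytes), batch 1.
cache = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128,
                       seq_len=32_768, batch=1)
print(f"{cache / 2**30:.1f} GiB of KV cache for one 32K-token request")  # 5.0 GiB
```

Halving the context halves this cache, which is one reason serving cost pushes against ever-larger context windows.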

Anders Arpteg:

It addresses their needs and, of course, it helps keep the serving of the model practical and cost efficient. So this was basically a balanced choice, to make it practical and economically viable, but also still useful for enterprise use cases, in some sense?

Hagay Lupesko:

Exactly, yeah. And the main reason, by the way, people need this large context window, and I'm talking about more practical applications, is when they implement RAG. With RAG, typically, you do retrieval and you pull into the prompt context that is going to help the model give the right answer to the question being asked, or to the completion it has to do. With all these retrievals the context may actually grow quite a bit. But 32,000 tokens is almost an average novel, if you think about it. So there's plenty of room to put a lot of context into RAG applications with 32K.

Jesper Fredriksson:

I can't help thinking about, I think it was maybe Bill Gates who talked about the RAM in computers. It feels a little bit like that. Was it that 640K is more than anybody will ever need? It feels so far away, anyway.

Hagay Lupesko:

Yeah, yeah, yeah. Well, I'd be super happy to be in the same camp as Bill Gates.

Anders Arpteg:

Just a final question about the training aspects of DBRX. You used like 3000 plus H100s, right, or what was the hardware setup for training DBRX?

Hagay Lupesko:

Yeah, we ran on a Mosaic AI cluster with a bit over 3,000 H100s and that was the hardware, very importantly interconnected with InfiniBand, which is super critical for large-scale distributed training, and it was all orchestrated by the Mosaic AI platform, which dozens of companies already use to train their own large models.

Hagay Lupesko:

That platform does orchestration, deployment of the container images, configuration of the GPUs, and then, most importantly, and something I think people often overlook, fault tolerance for GPU failures.

Hagay Lupesko:

Because our training job for DBRX ran for about two months, give or take, on this large cluster, and when you run a job for that long on so many GPUs, many of them are going to fail. It's not an if; GPUs are going to fail, and it's really important that the system you have is fault tolerant, meaning it's capable of detecting these failures, pausing training, replacing the failed nodes with fresh nodes, and then resuming from the last checkpoint. And being deterministic about the resumption is really important: if some of the nodes already saw some of the samples, you don't want them to see the same samples again. All of that is orchestrated by the Mosaic platform, and that was a key part of actually being able to train DBRX.
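The detect-pause-replace-resume loop described here can be sketched as a toy control flow. All names are hypothetical; the Mosaic platform's actual internals are not public.

```python
def train_with_fault_tolerance(total_steps, save_every, run_step, save_ckpt, load_ckpt):
    # Resume from the last checkpoint after a failure. Because the data
    # order is a deterministic function of the step number, a restarted
    # job redoes only the steps since the last checkpoint and never
    # re-reads samples from before it.
    step = load_ckpt() or 0
    while step < total_steps:
        try:
            run_step(step)
            step += 1
            if step % save_every == 0:
                save_ckpt(step)
        except RuntimeError:
            # A GPU/node failed: a real system would swap in a fresh node
            # here, then fall back to the last checkpoint.
            step = load_ckpt() or 0

# Tiny in-memory demo: one simulated failure at step 5.
state = {"ckpt": None, "seen": []}
failed = {"done": False}

def run_step(step):
    if step == 5 and not failed["done"]:
        failed["done"] = True
        raise RuntimeError("simulated GPU failure")
    state["seen"].append(step)

train_with_fault_tolerance(
    total_steps=8, save_every=2,
    run_step=run_step,
    save_ckpt=lambda s: state.update(ckpt=s),
    load_ckpt=lambda: state["ckpt"],
)
print(state["seen"])  # [0, 1, 2, 3, 4, 4, 5, 6, 7] -- step 4 is redone after the failure
```

Note that the step completed after the last checkpoint is replayed on resume (that work is simply redone), while everything before the checkpoint is never reprocessed.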

Anders Arpteg:

And I guess a lot of people underestimate the infrastructure needs, or the code necessary, just to have these kinds of large-scale training sessions happen, and if you don't have a proper AI platform underneath that can help you with this, it will be a lot of headache. Is that correct?

Hagay Lupesco:

That's 100% correct. I would even take it to the extreme and say you cannot train a model like DBRX without a really good software platform. And in some ways it's like, you know, there's this analogy of an iceberg, right? When you see an iceberg over water, it's really impressive, this big thing hovering over the water, but actually most of the mass of an iceberg is underwater. And in my mind, the software infrastructure needed for the model is that part of the iceberg that is underwater. It tends to be a very Herculean software engineering effort that enables the AI researchers to actually do their work, both quickly but also just being able to really achieve that goalpost of training a large model.

Anders Arpteg:

And speaking about the Mosaic AI platform, given that you're now acquired by Databricks, is that going to be offered somehow through the Databricks platform as well, or how is that going to be integrated?

Hagay Lupesco:

Yeah, no, it's already offered. So what used to be the Mosaic AI training platform is now called MCT, Mosaic Cloud Training. It's the same platform: it was used to train the MPT models, it was used to train DBRX, and there are also dozens of customers that train LLMs on this platform on their own data. So it's already there, available, and if anyone in the audience is interested in training their own LLMs, they should reach out to me and we can get you connected with the right people at Databricks.

Anders Arpteg:

Yeah, so it's already there. And I guess some customers or potential users or listeners are thinking, okay, a 132-billion-parameter model is a bit big. Are you going to release some kind of smaller models as well, that people can more cheaply and quickly adapt to their needs?

Hagay Lupesco:

We're definitely thinking about it. And if you actually look at our technical blog announcing DBRX, we mention a couple of smaller models that we trained before DBRX, and we were thinking of potentially making those available as well. Eventually we decided to just focus, we're not such a large team, so we wanted to focus on really offering the base DBRX and doing a really good job with that. But we are thinking about smaller models that would be more practical for some customers. And I agree, I think some companies definitely would be interested in serving a model like DBRX, where it hits a good trade-off between quality and serving efficiency, or serving costs. But then there's a large segment of customers that would want a smaller model, and I've seen a lot of examples of customers who took a smaller model, like a 7 billion parameter Mistral or MPT-7B, fine-tuned it for their task, and got good results that were comparable to something like GPT-4 for that specific task. Of course, not general purpose, but just for that specific task. And then the serving cost of a 7 billion parameter model is smaller than calling GPT-4 by probably two or three orders of magnitude. So it's a very practical approach for enterprises, and they already are doing it.

Hagay Lupesco:

Yeah, we definitely are thinking about making a smaller model available. But the call-out I'll make is that there are already great smaller open source models out there, right? Again, Mistral 7B, Llama 2 7B, and you can already take these models and fine-tune them on the Mosaic platform, or you can even pre-train these models from scratch on the platform. The architecture is open source, so you can just do it.

Anders Arpteg:

Awesome. Let's move into the other topic. We started speaking about the open source versus closed source. You have chosen, of course, to my happiness, to go with the open source direction. Can you just elaborate a bit more about the thinking behind that? I know you did it also for the MPT models, but what do you see as the potential pros and cons with going open source?

Hagay Lupesco:

Well, the pros are, I think you know, easier right to highlight. I mean contributing to the community right, helping the community build on top of what you build, but also putting your code, your model, open source, allowing the community to tinker with it, to improve it, to find issues with it, and then it all you know, is contributed back. So it's almost like you know, you get a lot of improvements to the model or to your software relatively, you know, easily and relatively cheaply. So I think that's sort of a win-win in that front.

Jesper Fredriksson:

Sorry, can I interrupt? How fast would you see those contributions coming in? Did you already see something? I was thinking about, I saw something from Gemma, I think from Google. It had lots of issues, and I saw people posting those problems at an early point.

Hagay Lupesco:

Yeah, I can tell you. When we open sourced DBRX, I spent the night at the Databricks office with a lot of my colleagues to make sure the release goes well, and to flip the flag in our platform to make the model and the serving available, and all of that. And I can tell you, within the hour from when we announced it, comments, suggestions and even PRs started flowing into our GitHub repo.

Jesper Fredriksson:

So the community is amazing.

Anders Arpteg:

What about the cons?

Hagay Lupesco:

Yeah, sorry, before that, one thing I will highlight: someone already made a 4-bit quantized DBRX available on open source, so people can actually run 4-bit quantized DBRX on their MacBooks. I thought it was incredible. They didn't even talk to us, they just did it.
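To give a rough sense of what 4-bit quantization buys you, here is a naive symmetric scheme in NumPy. This is a deliberately simplified sketch, much cruder than the grouped, calibrated schemes real quantization libraries use: each weight is mapped to one of 15 integer levels, so two weights pack into a single byte.

```python
import numpy as np

def quantize_4bit(w):
    """Naive symmetric 4-bit quantization: map weights onto integer levels -7..7."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)

# Two 4-bit values pack into one byte: 4x smaller than fp16 storage.
bytes_4bit = q.size // 2
assert bytes_4bit * 4 == w.size * 2
# Reconstruction error is bounded by one quantization step.
assert np.abs(w - w_hat).max() < scale
```

At this ratio, a model whose fp16 weights need hundreds of gigabytes fits in well under 100 GB, which is what makes running a 132B-parameter model on a high-memory laptop plausible at all.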

Anders Arpteg:

Amazing. Open source is amazing when it works. But just thinking about potential cons, and the potential safety aspects of it: if you were to release something and it doesn't have the safeguards in place, and people start to abuse it...

Hagay Lupesco:

I mean, it will be really hard to retract it, right? Yeah, I think that is the challenge with, I think, every software you put out there: you basically lose control over it, right. And, by the way, it doesn't matter what license you have in place, because some actors would abide by the license and some actors would not, unfortunately. So it's out there and people will do whatever they want with it. And to your point, this point is more significant for AI models, because these models have, or may have, very strong capabilities, so they can be used for different things.

Hagay Lupesco:

And I think over here there are, let's say, two major camps in the tech world. One camp says, hey, we should keep these things closed so we can control them and we can ensure safety. And the other camp is saying, no, actually the best thing is to put things out there so that the community actually helps us find the issues, so we can improve faster and really have better coverage to find issues and problems. So those are the two camps. And, I don't know if you guys remember, there was recently a call to pause development, and some companies have been arguing that, hey, we should actually put very strong protections in place, so that LLM technology, or generative AI technology, is monitored more closely and not as publicly available. I think it definitely is a tough call.

Hagay Lupesco:

My personal view of this is I think companies need to make best effort to make sure their models are safe before they release them.

Hagay Lupesco:

This includes doing things like red teaming before launching a model, and taking the time to do safety checks and evaluation and all of those things, which are things we did, by the way, before we released DBRX, and I think other companies are doing as well. I was actually super impressed by Meta's work on safety with Llama 2. Like half of their technical paper about Llama 2 is about safety, and I think that's really impressive and important work. But I also think the amount of really thorough testing and safety checks and safety work one company can do is still limited, and it's not anywhere near what you can get by putting it in front of the whole community, and I think putting it out there would yield better results. And one reference: my intuition comes from working in software for a while, and I look at the most secure server operating system, and it's not the closed one, Windows, right? It's actually the open one, Linux.

Anders Arpteg:

Yeah, I think you could argue, as you say, that the safety aspects are actually potentially better off by open sourcing it, because more people can look into the issues.

Anders Arpteg:

Yeah, yeah, exactly. Awesome. And I'm sorry for rushing a bit here, but time flies so quickly when we have fun. But if we were to move to, I think, one of the most interesting topics, let's call it the secrets of DBRX: I mean, it is a bit surprising how well it actually does work. It works really well for coding, apparently, from what I've seen, and math problems, and it also outperforms Grok-1, which has like 314 billion parameters, while this one has 132. Can you just try to elaborate a bit more: why do you think the performance is as good as it is?

Hagay Lupesco:

Yeah, I think it's a great question. I don't think there is one answer to it. I think there are multiple factors, and I can try to go through some of them; we have kind of touched on some of them already. The first thing, I think, is the model capacity. Again, it's a mixture-of-experts model, a total of 132 billion parameters with 36 billion active on every input, and we found this setup to work really well. Again, it's a good trade-off between quality and efficiency.
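The reason only 36 of the 132 billion parameters are active per input is top-k expert routing: each token's forward pass only touches the few experts a gating network selects. Here is a toy top-k MoE layer in NumPy. The expert count, dimensions, and random linear "experts" are arbitrary illustration values, not DBRX's real configuration.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=4):
    """Top-k mixture-of-experts layer: each token runs through only k experts."""
    logits = x @ gate_w                            # (tokens, num_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # k selected experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = top[t]
        weights = np.exp(logits[t, sel])
        weights /= weights.sum()                   # softmax over selected experts
        for wgt, e in zip(weights, sel):
            out[t] += wgt * experts[e](x[t])       # only k expert computations run
    return out, top

rng = np.random.default_rng(0)
num_experts, dim = 16, 8
# Toy experts: independent random linear maps standing in for expert MLPs.
experts = [lambda v, m=rng.normal(size=(dim, dim)): v @ m for _ in range(num_experts)]
x = rng.normal(size=(5, dim))
gate = rng.normal(size=(dim, num_experts))
out, top = moe_forward(x, gate, experts, top_k=4)
assert top.shape == (5, 4)  # every token activates only 4 of the 16 experts
```

The total parameter count grows with the number of experts, but per-token compute grows only with `top_k`, which is the quality-versus-efficiency trade-off described above.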

Hagay Lupesco:

Now, it's true that Grok-1 is actually, I think, maybe twice as large or whatnot. But Grok-1, from what I understand, the version that was open sourced was trained on a much smaller number of tokens, and it's important for everyone to remember the scaling laws.

Hagay Lupesco:

There's a paper from Google DeepMind a while back, you can refer to it as the Chinchilla paper, that really explored that trade-off: getting to higher quality by scaling the model capacity, the model size, and then also the data. What that paper found is that you actually need to scale both if you want to really be optimal. So we trained DBRX on a pretty large amount of pre-training data, about 12 trillion tokens. That's six times more than what Llama 2 70B was trained on, and I believe it's much more than what Grok-1 was trained on, and I think that's a key factor in why the model is performing so well.
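The Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per parameter for compute-optimal training. Using that approximation and the token counts mentioned in the conversation, the arithmetic looks like this (the 20:1 ratio is a simplification of the paper's fitted scaling laws, not an exact law):

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
def chinchilla_optimal_tokens(params):
    return 20 * params

llama2_70b_tokens = 2e12   # Llama 2 70B's reported pre-training token count
dbrx_tokens = 12e12        # DBRX's reported pre-training token count

# The "six times more" figure from the conversation:
assert dbrx_tokens / llama2_70b_tokens == 6.0

# DBRX (132B total parameters) trains well past the Chinchilla-optimal point:
ratio = dbrx_tokens / chinchilla_optimal_tokens(132e9)
assert 4 < ratio < 5   # roughly 4.5x the compute-optimal token count
```

Training past the compute-optimal point is deliberate: it trades extra training compute for a better model at a fixed parameter count, which pays off at serving time.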

Hagay Lupesco:

In terms of quality, I would say it's not just the amount of pre-training data, it's also the quality of the pre-training data, and we took a lot of effort to curate our training dataset. It's all based on open data sets, but it's curated, meaning we evaluate literally each and every token to make sure that we feed the model high quality training data. And we also implemented some form of curriculum learning, so there's a bit of that as well, and I think that was key.

Jesper Fredriksson:

Sorry, could you say something more about how do you make sure that the data is good? It seems like such a key question to have the right data. How can you make sure that you only put in the good data?

Hagay Lupesco:

Yeah, I don't think we have enough time to dive into it, sorry, maybe it can be another session. But the internet contains a lot of things: a lot of high quality content, and a lot of things that are, I'd say, low quality, like bad crap. And we built a lot of mechanisms and algorithms that allow us to differentiate the two, and ran these systems on the open data sets to differentiate them. Exactly how we did it, that's... let's move on.

Anders Arpteg:

Yeah, and just a very quick note about the curriculum learning. Can you just elaborate a bit more? Do you go for some easier tasks first and then move to more difficult ones, or how do you go about doing that?

Hagay Lupesco:

Yeah, again, there's a lot of details, but one way to think about it is differentiating the training data: not lower quality, but a simpler data set that teaches more of the basic concepts of language, and then having other data sets that contain more nuanced, more complex notions, and then just making sure the order of training and the mix of the different data sets is such that we are gradually increasing the complexity of the training data. And in our experiments that actually works pretty well.
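One simple way to implement that kind of curriculum is a mixing schedule: every batch draws from both the "simple" and "complex" pools, with the complex fraction ramped up over training. This is an invented minimal sketch of the idea, not the actual schedule used for DBRX:

```python
def complex_fraction(step, total_steps, start=0.1, end=0.7):
    """Fraction of each batch drawn from the 'complex' pool, ramped linearly."""
    t = step / max(total_steps - 1, 1)
    return start + (end - start) * t

def batch_mix(step, total_steps, batch_size=256):
    """How many samples each pool contributes to the batch at a given step."""
    frac = complex_fraction(step, total_steps)
    n_complex = round(batch_size * frac)
    return {"simple": batch_size - n_complex, "complex": n_complex}

total = 1000
early, late = batch_mix(0, total), batch_mix(total - 1, total)
assert early["complex"] < late["complex"]          # complexity ramps up over training
assert early["simple"] + early["complex"] == 256   # batch size stays constant
```

Real schedules are usually staged rather than linear and tuned empirically, but the invariant is the same: the data mix shifts from basic to complex as training progresses.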

Anders Arpteg:

Awesome, that sounds very logical and reasonable. And let's move quickly into the future of LLMs. I would love to hear if you can disclose anything about the future of DBRX and what the next version would be. There are some directions it could go in: it could go into these kinds of infinite context window sizes of one million plus tokens, or not. We also have what Yann LeCun and Meta are speaking about a lot, which is the JEPA kind of architecture, where you do the inference in the middle, in the latent space, instead of in the token space. And I guess we have the Q* thing as well that OpenAI is looking into. Do you have any thoughts about the next steps for DBRX, and where the potential changes will happen in the coming year?

Hagay Lupesco:

Yeah, I think the field again is moving so quickly that it's kind of hard to predict the future. I can tell you what our focus is. Our focus is really on enterprise applications. We want to make the Databricks customers successful, and what they're trying to do is really implement generative AI applications. So, short term at least, we'll be looking to see how we can adapt, improve, take the next step with DBRX to power as many enterprise applications as we can. And some parts of it, maybe...

Hagay Lupesco:

I can't share any concrete plan at this point.

Hagay Lupesco:

Part of it is because we're still thinking about it ourselves, but one part can be actually offering variants of DBRX that are more practical for enterprises. Part of it may be actually improving DBRX, and you can think about a lot of potential fine-tuning we can do with DBRX, as an example, to make it better at specific areas or domains. And part of it can be just, hey, let's increase the quality of what we have even further, and there are definitely paths for us to do so. In general, we want to continue contributing to the AI community by making more things available to the open source, but also continue to have a really good offering for enterprises who are using Databricks today and want to be able to adopt generative AI. Industry-wise, I think we're going to still see bigger and better models; I don't think we've hit that ceiling just yet. Training larger models on more tokens still improves quality. That is very clear from the work we've been doing and, I think, the work that is out there.

Anders Arpteg:

What about the more reasoning-like approach, where you differentiate, and for a hard task you spend more inference time, and for simpler tasks you have smaller inference time? Do you think that will be coming soon?

Hagay Lupesco:

I think that's one of the interesting approaches, and it actually touches on the point that there's a bit of a challenge here. Yes, models become better as they grow in complexity and size, but then the ability to actually serve them is reduced, because it becomes so expensive to serve. And I think there's a lot of work being done across the industry to figure out how we can reduce the cost of inference. Mixture of experts is actually one approach that helps with that. The other thing is applying sparsity at the low level of the model computation on the forward pass. But another approach is, yes, achieving sparsity in inference by differentiating based on the complexity of the task, and that I think is definitely an interesting approach.
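The complexity-based dispatch described here can be sketched as a tiny router. Everything below is invented for illustration: the keyword classifier stands in for what would, in a real system, itself be a small learned model, and the "models" are stubs.

```python
def classify(query):
    """Toy complexity classifier; a real router would use a small model here."""
    hard_markers = ("prove", "derive", "analyze", "multi-step")
    return "hard" if any(m in query.lower() for m in hard_markers) else "easy"

def small_model(query):
    return f"small:{query}"   # stub for a cheap 7B-class model

def large_model(query):
    return f"large:{query}"   # stub for an expensive frontier-class model

def route(query):
    """Dispatch easy queries to the cheap model, hard ones to the expensive one."""
    return large_model(query) if classify(query) == "hard" else small_model(query)

assert route("What is the capital of France?").startswith("small:")
assert route("Prove that sqrt(2) is irrational").startswith("large:")
```

The serving win comes from the fact that most production traffic is "easy", so the expensive model only runs on the minority of queries that need it.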

Hagay Lupesco:

What I've seen currently applied by companies in the industry is, when you think about an AI system: most applications of generative AI today are not one model doing all of the work. It's actually a system, a compound system. You have multiple LLMs, you usually have an embedding model, you have some kind of retrieval mechanism or search mechanism, and it's all orchestrated together. And one of the practical approaches is to use different models for different tasks. So maybe you have one model that's actually doing classification of the task, or some kind of annotation of the input, and then, based on that, it feeds different models, and these systems tend to be much more efficient for serving. Being able to actually consolidate that into a model that is capable of making these routing decisions within the forward pass is, I think, one interesting approach that some organizations are pursuing.

Anders Arpteg:

Do you think agency as well, being able to have models taking actions by themselves and collaborating together to solve a task, will be something we're going to see in the future?

Hagay Lupesco:

A hundred percent. Agents are the next killer application of generative AI, in my mind. And again, we can talk for an hour just about agents, but it's already coming. Models are being fine-tuned to make API calls, and that's an important enabler for agents, and it is already here, right?

Jesper Fredriksson:

So I think we'll see more agents. Let's book another hour and talk about agents.

Anders Arpteg:

That would be fun. And I know you have to leave now, in a couple of minutes. What do you say, Goran?

Hagay Lupesco:

Hagay, how much time do we have left? A few minutes.

Goran Cvetanovski:

Oh good, okay, so we are sharp on one hour, okay, cool.

Anders Arpteg:

Okay. So, final question, Hagay, and we'd love to speak more about different things, but let's go a bit philosophical and think about the potential AGI future, and assume that there will be a point where we'll have AI models that are surpassing most humans. You can think it becomes either more of a utopian future, where we'll have AI working together with humans, becoming like a paradise where we're free to pursue our passions, and solving all the crises and challenges we have in healthcare and whatnot. Or it can be more of a dystopian nightmare, the Terminator movies, machines trying to kill humans. Where do you lie on this scale? What do you think about the future?

Hagay Lupesco:

So my answer is a bit nuanced. First of all, I think technology like if you look at the history of our species right Technology is a force that is just unstoppable. You know, it's almost like we are programmed to build better and better tools, and I think that would continue and it will take us to a place where AI is super capable and, I think, definitely more capable than the average human on any given task. I would argue actually we are very close to that already.

Hagay Lupesco:

Yeah, like a number of years, I think so. And now, again, the definition of AGI is all over the place. There's actually a really interesting paper from Google that tries to define levels of AGI. But if you think about AGI as, you know, an AI that is more capable at most knowledge tasks than the average human, then I think we are very close to achieving that, and anyone who has interacted with a model like GPT-4, I think, can get a sense of that.

Hagay Lupesco:

I think the future is positive. I think it's already making people more productive. I think it's enabling people to get more done with less time and less effort, and it also enables better decision-making. And again, if we go back to our discussion of autonomous vehicles and FSD and Waymo: I think Waymo is easily safer than the average human on the task of driving a vehicle in the street. So I'm positive, cautiously optimistic, about our future with AI. But I also recognize we need to put regulation in place. Governments need to regulate the deployment of AI, just as they do with aviation and with medical devices, and I think that is key to making sure that we avoid a lot of the challenges: mass influence campaigns having a negative impact on democracies all over the world, terrorist organizations using this technology to generate harm, and a lot of other bad things that can happen. But I think with the right regulation and the right systems in place, we can avoid most of it.

Anders Arpteg:

Very well said. That's a perfect ending word, I think. So great to have you here, and that you could spend some time with us in the early morning in California. So thank you very much, Hagay Lupesko, for joining us on the AI After Work podcast.

Hagay Lupesco:

Thank you guys, I really enjoyed it.

Anders Arpteg:

Thank you, bye-bye, bye-bye. Awesome, one hour goes so fast.

Jesper Fredriksson:

I mean so many questions yeah.

Anders Arpteg:

And so many questions we skipped. Yes, I thought we had to at least include the ending and the future of LLMs, et cetera.

Jesper Fredriksson:

Yeah, I think it was interesting with the open source thing. He mentioned the positive sides, and I was reading yesterday something from Andreessen Horowitz about how enterprises, which is what they're looking for as customers, are moving towards open source. If you compare with 2023, the number of companies using open source models is increasing, because they don't have to be afraid of the data leaving the building, so to speak, and they can fine-tune more easily. So I think that's got to be a given thing: if you open source it, then the companies can take it and use it as they want, which has to be a positive, even if they don't serve it. But you still have the infrastructure in Databricks, so that's got to be a positive thing.

Anders Arpteg:

Yeah, I'm still a bit perplexed about the future. In a couple of years we will either see some really, really big foundational models taking over, or it will still...

Anders Arpteg:

I mean, we were thinking about having the news section, and we had previously spoken about the $7 trillion investment that Sam Altman spoke about, for compute and the infrastructure for AI. And if you have those kinds of compute needs in the future, there will be very few players that can train these large models from scratch. So it will really accelerate, in my view, the AI divide, and the very few tech giants that can actually build these kinds of models.

Jesper Fredriksson:

Yeah, I agree with you there. I think this is what's happening right now. For now it's good with the open source models, but once you see GPT-5, I think, will you want to go back to something that's not a super model?

Anders Arpteg:

Then, I don't know, you would wish you instead had a large number of small open source models that everyone can, you know, democratize in some way, and everyone can have access to. But I don't think that will be the future, unfortunately.

Jesper Fredriksson:

I'm so excited to see, I mean, we started talking about that with Hagay, this inference-time compute. That will be so amazing, when you can tell the LLM to just pause and think about this for a day or two. Like, what can you achieve at that point? That's got to be out there.

Anders Arpteg:

When I'm using ChatGPT and GPT-4 these days, I can notice at least that for some questions you ask it, you get the answer so much quicker than for other questions. I think they already have some kind of inference planning, or some kind of search, in place for ChatGPT.

Jesper Fredriksson:

Could be. Or maybe they're just trying to make it feel like I asked a really good question. That's what I think of myself.

Anders Arpteg:

I think, and I'm kind of sure, that OpenAI is continuously updating the existing models. I saw Sam Altman saying they want to have more of a continuous kind of improvement, not these large steps going from one version to another. And I'm sure that's very efficient from a cost perspective too: when you have a simple question, just a factual Jeopardy kind of question, there's no point in spending a huge amount of compute on it.

Anders Arpteg:

You can just either go for a smaller model that can serve that, or skip the additional inference steps.

Jesper Fredriksson:

Do you think the model will know that this is a simple question? Or do you think I will tell the model this is a simple question? How do you think it will work?

Anders Arpteg:

Yeah, good question. I guess you can take one inference step in a small model first, and if it says "I'm not sure", then you have to move on to a bigger model and perhaps do some more inference steps.

Jesper Fredriksson:

You can iterate a number of times, and you can have some verification step, like this "Let's Verify Step by Step" paper. It starts with a simple one, just to, you know...

Anders Arpteg:

If it can say with high confidence, "I know the answer", then you stop there. Otherwise you move on to other models, or you move on to more iterative inference steps.

Jesper Fredriksson:

Yeah, I always feel like I would want the strategy to stay a bit with the question. It's often that you get this reflexive kind of "yes, this is how it works", but maybe if it thought about it for a bit, that would be nice.
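The escalate-on-low-confidence idea from the last few exchanges can be sketched as a model cascade. The stub models, confidence scores, and threshold below are all invented for illustration; in practice the confidence signal might come from a verifier model or from the model's own calibrated scores.

```python
def cascade(query, models, threshold=0.8):
    """Try models cheapest-first; escalate while confidence is below threshold."""
    for name, model in models:
        answer, confidence = model(query)
        if confidence >= threshold:
            return name, answer
    return name, answer  # fall back to the last (largest) model's answer

def small(query):
    # Stub: confident on short/simple queries, unsure otherwise.
    return ("small answer", 0.9 if len(query) < 20 else 0.3)

def large(query):
    # Stub: the expensive model is (almost) always confident.
    return ("large answer", 0.95)

used, _ = cascade("2+2?", [("small", small), ("large", large)])
assert used == "small"   # easy query stops at the cheap model

used, _ = cascade("a long and tricky multi-part question",
                  [("small", small), ("large", large)])
assert used == "large"   # low confidence escalates to the big model
```

This is the serving-side counterpart of "thinking longer on hard questions": total cost tracks query difficulty rather than the worst case.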

Anders Arpteg:

What do you think, by the way, about the JEPA approach? I mean, I think it's kind of obvious that that should be the direction. Today, when you do the next-token prediction and do the inference step in the token space, so to speak, you actually have to produce a token to know what the next token should be. It's very inefficient. If you could instead do the inference in the latent space, in the middle of the model, then you just move around in the latent space and you don't have to produce the tokens, and you move away from these kinds of autoregressive models, which Yann LeCun is saying all the time are not the future; he's very negative on this kind of approach. And I must say, I agree. I think it's kind of absurd how inefficiently we are doing this.

Jesper Fredriksson:

I'm a bit on the fence there. It does seem like a nice thing, and I also have some problem with my imagination, seeing this sort of walk in latent space, how that would work. I mean, I'm sure it sounds like a good idea. Then again, if it's that good of an approach, why doesn't Meta do something like that? I haven't seen anything around JEPA that's available.

Anders Arpteg:

I think JEPA is doing that. The video generation part of it did that, similar to Sora from OpenAI, actually; they also did inference in latent space. And Stable Diffusion, when it first came out, also did inference in the latent space and didn't go to pixel space, so to speak. So I think we've actually seen a number of approaches already going in that direction.

Jesper Fredriksson:

Yeah, it could be. I'm just thinking of Llama. Is Llama 3 going to be...?

Anders Arpteg:

Why didn't they do that there? That's a good question.

Jesper Fredriksson:

I think the next Llama will potentially be some kind of JEPA-inspired model. Yeah, should we move on to the news section?

Anders Arpteg:

Let's do that. Normally we have the news section in the middle of the discussion, but given that we only had one hour with Hagay, we chose to do it afterwards. Should I start?

Jesper Fredriksson:

By the way, you can start. I only have a couple of smaller things, and I think your news item seems like a bigger one. We talked about it briefly before, so let's stay with that one for a bit longer.

Anders Arpteg:

Yeah, it's continuing the discussion we just had about how the big tech giants, the hyperscalers of the world, are basically running away now, faster and faster, to build compute. And, as Sam Altman and others are saying, compute will be the new currency of the future, not necessarily the dollars; the ones that have the most compute will win, in some respect. And Microsoft just announced, together with OpenAI of course, since they are basically very intertwined these days, that they are planning a $100 billion data center, and at the end of this year they will have a 350,000-H100 data center, which is insanely much. Google, of course, have so many TPUs et cetera, so they already have a lot of compute. But now also Meta has it, and now Microsoft will have it.

Anders Arpteg:

You know, these three companies are running away, accelerating away, I would argue, so everyone else is getting further and further behind. And this $100 billion data center approach will apparently come in five phases, and they're currently in phase three. And they call it, you know, Stargate. I love the Stargate from the movie.

Anders Arpteg:

I guess Stargate will be phase five. Will that be the AGI backbone?

Jesper Fredriksson:

I guess so. I heard some comment around Stargate and AGI recently. I'm not sure if it was somebody at OpenAI or where this person was from, but it's obvious that everybody's rushing for compute. Everybody wants to get as many H100s or A100s as they can, but they're also recruiting a lot of people. So that is an interesting thing: if we would see AGI in a couple of years, do we really need to hire more people? That's a scary question to ask, but that was thrown into this discussion as, maybe we're not as close to AGI as we think. I mean, we talked to Hagay; he seems to be on par with what we're saying, like in a couple of years we will see AGI. That's what I think, and I think what you think as well. But a counter-argument could be: why do you need more people at this stage, if the only thing you need is more GPUs and more inference?

Anders Arpteg:

I think we will simply adapt to it, similar to what we did with the printing press or spinning Jenny, or with Ford and their that was much slower. But still, what happened when you got this kind of increase in productivity, as you did with all these kind of technological revolutions, was simply that you want more of it. Yes, think that that will happen. Now, again, so now when the cost will completely, you know, reduce this elasticity thing like if you have some.

Jesper Fredriksson:

If the price goes down, then you will get more demand for it. I think it's not that we're lacking things to do. I mean, from a medical point of view, we have so many things to do.

Jesper Fredriksson:

From an education point of view, yes; from a security point of view, there are so many things that we are currently bottlenecked on. And just from my experience being at companies, it's like you can never employ enough back-end engineers or front-end engineers, people that can iterate on coding. If we now would have infinite power to do those things, I think we will just see crazy things happening. What will be the next things that we develop?

Anders Arpteg:

It will be, as Elon Musk calls it, the world of abundance, where products and services become more or less free. Yes. But I'm a bit afraid, because the ones that get to AGI first will achieve insane power and economic wealth from it.

Jesper Fredriksson:

And that's obviously why everybody's rushing right now; everybody wants to be first.

Anders Arpteg:

It's a little bit of a winner-takes-all kind of thing. Yes, potentially, or at least for the ones that are in the game, so to speak. I mean, you obviously need a huge amount of compute, and we can see these top three players in the world. And of course China is doing the same, so we shouldn't count them out. I think Amazon also: they are investing in Anthropic, with the Claude 3 thing, and I guess they are letting Anthropic use Amazon GPUs for this. So they're trying to catch up, but from a US point of view, it's like three major players.

Jesper Fredriksson:

I would say, right now. It was interesting to see Claude 3 beating GPT-4. Finally, yes, at least someone is getting close, and even better than GPT-4.

Anders Arpteg:

But this is scary. I think it's scary that we are seeing this concentration of power, with a huge amount of spending now being done on compute. And it's done very intentionally, because they know they will get it back tenfold. The ones that have sufficient compute to reach AGI, or a very high level of AI capability, will make so much money from it that it's worth every penny.

Jesper Fredriksson:

Would there be a better way of doing it? I mean, should we regulate, and how could we regulate?

Anders Arpteg:

How could you regulate? Could you regulate companies from investing in a technology? I don't know.

Jesper Fredriksson:

Another question on that theme: what will happen to OpenAI, with this non-profit thing? They have this clause that when they reach AGI, Microsoft cannot use the technology they made anymore. What do you think will happen to that? Maybe they already have AGI and they're not releasing it, just because they want to milk it?

Anders Arpteg:

I think Microsoft has all the competence, all the knowledge they need to do it themselves, even if something is kept inside OpenAI. So I don't think that will be a problem. It's hard to bottle that up, and it moves so quickly: as soon as someone comes up with a new idea, some new invention for making AI faster and better, it spreads very quickly. I think it's not the knowledge that will be the limiting factor going forward; it's simply the compute. That's exactly what we're seeing right now, with Microsoft spending $100 billion, which is, I think, like ten times more than anyone else has spent on data centers before.

Jesper Fredriksson:

Do you know if they will catch up with Google? It seems like Google is so far in front of everybody else at this point with TPUs.

Jesper Fredriksson:

Yeah, I haven't seen a number, though. I just saw a graph: this is how much Google has, and this is how much number two has, I think it was OpenAI or somebody else. And it seemed crazy; it was like double at this point.

Anders Arpteg:

And I think Google is saving so much money from this, because they're not dependent on NVIDIA. NVIDIA, of course, has had this crazy rise on the stock market, becoming more or less the number three company in the world in terms of market cap because of their dominance in this field. And Google can completely sidestep that and just build their own. I mean, imagine how much money they're saving from this.

Jesper Fredriksson:

What do you think will happen to NVIDIA?

Anders Arpteg:

I can tell you, I sold my NVIDIA stock right before GTC, the big NVIDIA conference, and I'm glad I did. But NVIDIA has so much money. I heard some number on the profit margins they have on each GPU, and it's insanely high, the capital they are now gaining from this kind of monopoly they have. Imagine how much money they can spend on research now.

Jesper Fredriksson:

I heard an interesting sort of counterargument against NVIDIA. They were talking about Cisco when the internet came: of course everybody's going to need a lot of routers, and Cisco makes the best routers, so they're going to be on top of the world, right? We all know what happened. They're still a good company, but they're not ten times everybody else. So maybe that goes for NVIDIA as well: once we reach AGI, maybe NVIDIA will not be the ones to benefit from it, but somebody else.

Anders Arpteg:

No, if they stop innovating, they will lose, for sure, but they have the money to keep innovating. But you know, my pet favorite technology, besides TPUs, is neuromorphic computing, which Sam Altman is also investing in, in Rain AI. I haven't seen anything from NVIDIA going in that direction, and if they don't move in that direction... Is it Rain AI, or was it?

Jesper Fredriksson:

Maybe you should say something more about it for the listeners. Yeah.

Anders Arpteg:

So neuromorphic computing is moving away from the standard model of computer architecture that we've had for the last 80 years or so, where you have some kind of CPU or GPU, you have memory separated from it, and you move data between them back and forth all the time. That's not how the human brain works. In the brain you have neurons, and they basically have this spiking approach: when sufficient signals come in, they fire. That is very similar to a transistor, which is the computing analogue; when you have some input, the transistor goes to one or zero. But in a neuromorphic chip you also have state in the element itself. It's called a memristor, so you have the memory as well, and it keeps state depending on what happened in the past.

Anders Arpteg:

It will have different behaviors, so there are some dynamics added to the transistor, to the memristor. Then it becomes much more similar to the human brain, and it becomes possible to not have to move data between the memory and the GPU all the time, as we do today, and that is going to be insanely much more efficient. And then you can let every memory element or neuron operate in parallel; you don't have to wait for some clock synchronization to move data back and forth, they can simply operate independently all the time. For me, at least, this is an obvious direction, and the one that gets this right, and gets it working from an economical point of view...
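For readers curious what the "state inside the compute element" idea looks like concretely, here is a toy leaky integrate-and-fire neuron in Python. It is purely illustrative: the class name, the constants, and the update rule are assumptions for this sketch, not any real neuromorphic chip's API.

```python
# Toy spiking neuron: state (the membrane potential) lives inside the
# unit itself, instead of in a separate memory that must be shuttled
# back and forth. All names and constants here are illustrative.

class SpikingNeuron:
    def __init__(self, threshold=1.0, leak=0.9):
        self.threshold = threshold  # potential needed to fire a spike
        self.leak = leak            # fraction of state kept each step
        self.potential = 0.0        # internal state, like a memristor

    def step(self, incoming: float) -> bool:
        """Accumulate input; emit a spike (True) when threshold is crossed."""
        self.potential = self.potential * self.leak + incoming
        if self.potential >= self.threshold:
            self.potential = 0.0    # reset after firing
            return True
        return False

neuron = SpikingNeuron()
# Feed a constant weak input; the neuron stays silent until its
# accumulated state crosses the threshold, then fires and resets.
spikes = [neuron.step(0.4) for _ in range(5)]
# → [False, False, True, False, False]
```

Because each unit carries its own state, many such neurons could update independently and in parallel, which is the efficiency argument made above.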

Jesper Fredriksson:

There are many different kinds of architectural innovations happening right now, with the Groq LPU et cetera. This is a super interesting direction as well, and I guess Sam Altman is spreading his bets across different things.

Anders Arpteg:

And this is probably one of his hopes, that Rain AI would be something. I know Intel is investing a lot in neuromorphic computing as well, but no one has really made it work from a practical point of view yet.

Jesper Fredriksson:

So there's a lot of research left to make it work.

Anders Arpteg:

Or engineering work, perhaps, to make it work properly. But the potential of it, I think, is insane.

Jesper Fredriksson:

Yeah, yeah. So, Stargate. That was the...

Anders Arpteg:

That was maybe the most exciting thing last week. It shows what the coming years will hold. Yes, definitely: insane investments in compute from all the top giants. Otherwise, I think it's been a bit of a slow week.

Jesper Fredriksson:

I haven't seen so much interesting stuff, as I was saying before. I noticed something released from Silo AI about a model called Viking. Interesting. I think they had some kind of checkpoint, so they've trained on, I think, a thousand billion, I guess that's a trillion, tokens so far, so maybe they released a partial model. And I thought about this when you mentioned NVIDIA, because I noticed that they're training on a supercomputer in Finland somewhere, I don't know if it's in Åbo or wherever, and they're using AMD, which is interesting. Yes.

Jesper Fredriksson:

LUMI, I think it's called. Okay, it's a European data cluster. Yeah, but it was interesting to read about somebody who's using AMD.

Anders Arpteg:

It's not so common. Yeah, and I'm happy to see that NVIDIA has some competition. At least AMD released the MI300 or something; it's a competitor to the H100, with similar compute at a smaller cost. And the dominance of NVIDIA needs to be broken, otherwise it's not going to go well.

Jesper Fredriksson:

It's a bit boring out there.

Anders Arpteg:

Yeah, and I'm glad to see that we also have some Nordic and Swedish initiatives going forward here. To me, though, I am very afraid for the future of smaller countries' and companies' ability to train models from scratch. I don't think that's a sustainable approach. I think it will be these big tech giants that have a monopoly on building these large foundational models. So either we move into more specialized models that you can train yourself, based on these foundational models, or you have more of an interaction between models. I think that is potentially a good approach going forward, because a smaller country or company, I don't think, could ever train anything that competes with the tech giants.

Jesper Fredriksson:

Maybe we don't need to. I mean, maybe fine-tuning is enough.

Anders Arpteg:

I don't think you can fine-tune either. If you take something like GPT-5, I don't think any company can fine-tune it, or has the money to do it, unless you use the API, potentially, and pay a lot of money.

Jesper Fredriksson:

Yeah, definitely. Well, if we're not talking about the open-source models: I mean, you can fine-tune GPT-4, but it's supposedly very expensive, and you can't just do it.

Anders Arpteg:

It's a process. But I think you could potentially have, as Yann LeCun sometimes talks about, a number of more specific models being used together with the foundational models. So you could train a smaller model that is specific for, say, customer support, but it interacts with a big foundational model that you can ask anything, and then you add your specific skill in a separate model that each company can train. Perhaps something like that could work.

Jesper Fredriksson:

Are you talking about cooperation between different models?

Anders Arpteg:

Yes. So you don't ask a single model once and get the answer. You have more of an agentic approach, where you first ask a model specially trained to, let's say, do customer support, and that one goes out asking the foundational model and gets some kind of general answer, but then it interacts with some other models, and you have this kind of mesh of models that together come up with the answer.
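The "mesh of models" flow described here can be sketched roughly as follows. Both "models" are hypothetical stubs invented for this illustration; in practice each would be a call to a real LLM endpoint.

```python
# Sketch of a specialist model consulting a general foundation model.
# Both functions are stand-ins, not real APIs.

def foundation_model(prompt: str) -> str:
    # Stand-in for a large general-purpose model.
    return f"[general answer to: {prompt}]"

def customer_support_agent(question: str) -> str:
    # Step 1: get general grounding from the foundation model.
    background = foundation_model(question)
    # Step 2: apply narrow, company-specific behaviour on top.
    return f"Support reply based on {background}, per company policy."

reply = customer_support_agent("How do I reset my password?")
```

The point of the pattern is that only the small specialist needs company-specific training, while the expensive general model is shared.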

Jesper Fredriksson:

I think that would be a nice future, potentially. I think maybe the future of agents could be something like that, where you have different kinds of models. One way of structuring the different agents would be as a sort of workforce: maybe you have a software agent, maybe a manager agent, maybe customer support, as you mentioned; there could be all different kinds of fine-tuned models. I've also heard about model merging, which seems to be something emerging right now, where you almost just take the average of the weights somehow. You just tweak the weights, and you get something very quickly, compared to training from scratch, and it performs quite well. Maybe that's the way forward for the future of agentic systems. Let's see. Goran?
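The weight-averaging form of model merging mentioned here can be sketched in a few lines. This assumes two checkpoints with identical architectures, represented as plain dicts of parameters for the sketch; real merging methods (spherical interpolation, task arithmetic, etc.) are more elaborate.

```python
# Minimal model merging: per-parameter linear interpolation between
# two checkpoints with the same architecture. Toy dicts stand in for
# real weight tensors.

def merge_weights(model_a: dict, model_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * a + (1 - alpha) * b, parameter by parameter."""
    assert model_a.keys() == model_b.keys(), "architectures must match"
    return {name: alpha * model_a[name] + (1 - alpha) * model_b[name]
            for name in model_a}

chat_model = {"layer1.weight": 1.0, "layer1.bias": 0.0}
code_model = {"layer1.weight": 0.5, "layer1.bias": 1.0}
merged = merge_weights(chat_model, code_model)
# → {"layer1.weight": 0.75, "layer1.bias": 0.5}
```

Compared to training from scratch, this is nearly free, which is why it is attractive for quickly combining specialized checkpoints.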

Goran Cvetanovski:

As always, I find some very boring stuff for you. It's been a dry couple of weeks, actually. Yeah, I was saying the same. Yeah, let's see if it continues like this.

Goran Cvetanovski:

But, you know, compared to what was happening in the past months, it is actually a little bit boring. But in any case, two things I wanted to bring up today. The first one: you remember OpenAI opened this marketplace where you can build your own GPTs. The latest news now, and there are many media channels reporting on this, is that it is basically filling up with garbage, and it's potentially dangerous for OpenAI, to the extent that they could actually be sued again for infringement of intellectual property rights, but this time not through their own fault, rather because of who is training the GPTs.

Goran Cvetanovski:

So on the marketplace, you remember, they give people the opportunity to train their own GPTs and then offer them as part of the platform, as a marketplace, so other people can use them. But maybe they have not thought through the entire process, because this has given a lot of people the opportunity to train their GPTs on data that is then imported directly into OpenAI. So it can be a very smart move and a very dangerous move at the same time. In any case, the latest report is that the store is filling up with a lot of garbage, people building specific GPTs that, I don't know exactly, I just read the news, so I will start looking into it.

Goran Cvetanovski:

But it makes sense, actually, because if you give people the power to train models that way, you cannot foresee what they're going to add or import into the system. Now, if their legal team was extremely smart and said, hey, we don't actually take any responsibility for people that have trained on or uploaded any type of documents to the system, then they will probably be off the hook. But let's see how it's going to go. After this, I will go home and read it, because I have also built my own GPTs.


Jesper Fredriksson:

I never find anything interesting in the store.

Goran Cvetanovski:

Me neither, actually. But the GPTs that I'm looking for are a little bit more specific, to the event industry and production and marketing. You have some for marketing, but marketing is so specific; it's almost like the fingers on my hands: they're all my fingers, but none of them are the same, right? So I cannot find anything that I can actually apply, so I started building my own ones from our data, internally, of course. Yes. And the second news was that the musicians are now up in arms. Again. You remember the film artists? They were striking against AI because of infringement of intellectual property rights, and deepfakes, and whatnot. So now the musicians have come to play: over 200 musicians have signed a petition for companies like OpenAI and Google et cetera to stop utilizing their voices or their….

Jesper Fredriksson:

It's about their voices, not about the songs?

Goran Cvetanovski:

Yes, it's not about the songs, it's more about the voices et cetera, because the songs are already out there, right? People are doing mashups on YouTube and so on. And this one is very heavy for me, because on the one hand I am completely 100% for intellectual property rights. On the other hand, I'm a little bit more angry about this, because at the same time I had an argument with, not an argument, a discussion with my daughter as well, and she was like, oh, they need to stop with this AI, because, you know, it's taking the artists' work and making it available to everybody else.

Goran Cvetanovski:

But when you go to school, they teach you how Leonardo da Vinci used different techniques to paint, how Monet did his paintings. So they teach you that art, and then you incorporate it, you create something yourself, and then it's yours. But somehow now, when it's applied to a machine, it's considered strange. So there is a hypocrisy in all of these things. I can understand, like, hey, don't use my voice, this is actually my data, this is my voice, et cetera.

Jesper Fredriksson:

I mean, this is the argument that everybody's making in favor of just reading the web: I can go read all the sources on the web, process them in my memory, in my brain, and write something out of that, and I can do something with it, right? But then people are getting sued for that.

Goran Cvetanovski:

But why are we not getting sued? I mean, we go to university, we read books, we utilize the knowledge that people have created over many years. Totally with you. And then we build something out of it, which is basically the point: intelligence is multiplied knowledge combined together.

Jesper Fredriksson:

I think the only argument against that, Anders, is sort of the scale of it.

Anders Arpteg:

I think it comes down to the question: should we have different rules and regulations for machines than for humans?

Jesper Fredriksson:

Yeah, that's a good question. I don't have an obvious answer here.


Anders Arpteg:

When we spoke about the AI Act in the EU, we always nagged on it, because we thought it was a really bad idea to have regulation based on a technology: to say that just because you use AI to look at a video, certain rules have to apply, but if you use a human to watch videos in the subway station, or whatnot, then it's okay. But I'm not so sure anymore. Perhaps we will reach a point in time where machines have different regulations, different compliance requirements, than humans. I don't see that as an unrealistic future.

Goran Cvetanovski:

But I don't understand why, if the whole purpose of knowledge is to bring innovation forward. We are who we are today because somebody actually invented how to make fire. And in the 90s, and then in the 2000s, the music industry had already been automated for many years; the same with graphics. Right now, for example, I have a piece of equipment that costs like 1,400 euros and replaces the sound engineers and mixing boards that cost thousands and thousands of dollars. And back in the 2000s, people started imagining software like eJay and virtual DJs et cetera,

Goran Cvetanovski:

where you could take samples and then create your own music, and people were selling samples. Basically, let's say Jesper is a drummer and he has this famous tone, because of how it sounds, right, and they ask him to play these tones, and then they make this drum machine that you can use on your computer at home to make music, and it's going to sound like Jesper was drumming it. So at that point in time they were capitalizing on that, and nobody complained about it, right? So the hypocrisy of this movement, for me, is just...

Goran Cvetanovski:

I agree with them, but I don't agree that we should say stop developing AI over this. Rather, they should bluntly say: pay me for my voice. And that would be super good, because if somebody paid these artists every single time somebody used their voice, they would not even blink an eye. They would be sitting in Bora Bora or somewhere, drinking mai tais and just cashing in. Do you remember when there was an AI Drake song, pretty early on, like a year ago?

Goran Cvetanovski:

Or something. Yeah, a year ago. And then Ghostwriter, or something.

Jesper Fredriksson:

Yeah, Ghostwriter, yeah. And people were freaking out about what's going to happen to all the musicians. And then Grimes came out very quickly and said: I'm fine with this. Anybody can use my voice, can train on my music; we share 50-50.

Goran Cvetanovski:

Yes, exactly. That's a good idea, and that is the right way to pose it, because it's one thing to say stop the development of the technology; when you instead say, hey, pay me for my voice, then it's completely different.

Jesper Fredriksson:

Makes sense.

Anders Arpteg:

But stopping the development of technology will not happen, of course. I think we need to figure out regulation that protects new work in some way, and this will be a challenge. There have been a number of court cases now, with the results basically coming in: New York Times sued OpenAI, et cetera, and they sued Stability AI, et cetera, for image generation, and the courts basically said that the generated images were not similar enough to claim copyright. And if that is going to be the approach, I mean, it's similar to the question of patents in some way: if someone is going to spend a huge amount of time creating things, but not be able to get paid for it, how are we going to?

Goran Cvetanovski:

Maybe that is a startup idea. We make a new startup where people can donate their voice, and whoever wants can utilize it, but then they get paid for it. Yeah, I don't think anybody would complain about that; everything would be super good.

Anders Arpteg:

Exciting times ahead, for sure. The technology is a small part of it, but the whole use of it, the regulation of it, the societal aspect of it, you know, how democracy will be impacted by it; there will be a lot of things coming. All right.

Goran Cvetanovski:

One more thing from me, and then I think the news is over. Anders, do you have something to add? You were reading some paper today. You didn't finish it.

Anders Arpteg:

It doesn't seem to have been properly published; it was on some weird pages.

Goran Cvetanovski:

So let's ignore it for this time.

Goran Cvetanovski:

So I have been traveling quite intensively the last one and a half months, across the world, executing a lot of events. So it's more of a question, and then I will leave it so you can finish the podcast. And, of course, no matter where you go, everything is about generative AI, and everything is about RAG, and today we heard from Hagay about foundation models and the new AI models et cetera. My question to you is, and I already have my thoughts on this, but I just wanted to hear from you: where do you think enterprises are right now when it comes to implementing generative AI in the organization? How much is bull, and how much is actually in practice?

Anders Arpteg:

I can start if you want.

Goran Cvetanovski:

Maybe not bull, but let's say experimentation phase. Maybe that's a better word.

Anders Arpteg:

One interesting property, so to speak, of large language models is that they are surprisingly easy to integrate into applications.

Anders Arpteg:

That's what Microsoft started doing very early on: adding Copilot to their office suite, to their operating system, to their web browser, everything. And it is simple, because you basically just feed text into something and get a reply; it's very easy integration work. And of course Google did the same, Adobe did the same, Meta did the same. So generative AI is getting surprisingly quick adoption into enterprises, I would say. Now, if you want to truly make use of it, more than just as an assistant, then it's going to be much more work, and if they need to fine-tune it, it's going to be much more difficult. But you often don't need to; that's also a very interesting aspect.

Jesper Fredriksson:

RAG is very much enough in many cases.

Anders Arpteg:

Yes. Compared to traditional AI or machine learning, where you do need to train your own model to have any chance, you can simply add your data to the prompt these days, and it works surprisingly well. So once again, it becomes surprisingly easy to find uses for it. This is a big change compared to traditional AI and machine learning, so I think generative AI is having a surprisingly big, quick impact on enterprises.
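As a rough illustration of the "add it to the prompt" (RAG) pattern discussed here, a minimal sketch might look like this. The keyword-overlap retriever and the example documents are toy stand-ins invented for the sketch; a real system would use embedding search and an actual LLM call.

```python
# Toy retrieval-augmented prompting: fetch the most relevant company
# document and paste it into the prompt, instead of fine-tuning a model.

DOCS = [
    "Refunds are processed within 14 days of the return request.",
    "Premium support is available on weekdays between 9 and 17 CET.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question, DOCS))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("How long do refunds take?")
# The assembled prompt would then be sent to any off-the-shelf LLM.
```

No training is involved at all, which is the point being made above: the company's knowledge travels in the prompt, not in the weights.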

Jesper Fredriksson:

Yes, I agree with what you're saying. I think a lot of companies are using it; I think a lot of companies are doing proofs of concept.

Anders Arpteg:

Just let me add: I think it's easy to do, but I think few companies have come to that point yet, right? I mean, Klarna is one exception to that. I was in Singapore, and the first slide that they showed was actually Klarna. Wow. They went really quickly on this.

Goran Cvetanovski:

I don't know if it's positive or negative, because the slide was like: they're firing 700 people

Anders Arpteg:

that can easily be replaced. But the big tech giants, of course, are already there, and they will simply roll it out to all the customers that are using Office and Windows and whatnot, and Google Search. Google Search is now adding SGE, the Search Generative Experience, to the normal search results already. So I mean, it's getting integrated into everything.

Goran Cvetanovski:

I was also reading that there are some indications that they will try to get paid for AI-amplified search.

Anders Arpteg:

Okay, so that is correct, and what you're saying might be the case, right. But most other enterprises, of course, move slowly. They are getting it by using the tech giants' products, and in that way they're getting it quicker. But if they truly want to change the way the enterprise works, that will take some time. Still, it is possible to do it much quicker than with traditional AI. Yes, I agree.

Jesper Fredriksson:

So my take on this: I'm working at Volvo Cars now, and that's the only enterprise that I'm seeing, and I was almost shocked when I got there to see how much they're actually doing at Volvo Cars. There are a lot of initiatives. Most of them are at the proof-of-concept stage, experimentation, yeah, but some of them are getting to a serious stage. I think the main question is: are you going to do internal things that make you more productive, or are you going to do external-facing things, like new types of services? The latter is much trickier to get right, and I see far fewer instances of those kinds of projects. The productivity-enhancing internal tools, I think everybody is experimenting with; I think quite a few are getting them into production, or are close to it. At Volvo, we're definitely sort of at that level.

Goran Cvetanovski:

What are the main challenges of actually putting something into production and scaling it?

Jesper Fredriksson:

Sorry, you mean external, or internal? Internal. I mean, it's tricky. It's easy to get to 50%, something that works 50% of the time, basically. Last time I was here as a guest, I mentioned this text-to-SQL bot that we're working on, and it's very easy to get something that works surprisingly well in many cases. But making it work, like, 90% of the time, that takes a lot of work. So that's, I think, the tricky thing. You can easily get to a proof of concept that shows the potential, and then you have to spend a lot of time to get to 90%, and then you have to do something about, okay, what do you do in the 10% that fails, et cetera. Those kinds of things.

Anders Arpteg:

That's why I think the main approach for most enterprises, and companies in general, should be not to completely automate the task, but to augment the humans doing the task. Then the human gets the assistance but has the final touch, and you don't have to be able to handle the edge cases.

Jesper Fredriksson:

That's a very deep question. I've been spending a lot of time thinking about that. I think we will get to the agent type of situation quickly, but right now, definitely not; that's not what I'm seeing right now. I'm trying to picture a world, an enterprise world, or like a mature-company world, where the agents augment the humans and you have smaller teams that can take on much more, by having a broader capability per person. Smaller teams, and more like, phrase it as: a team can handle ten times more tasks, exactly, and you can scale the capacity up and down according to the task.

Jesper Fredriksson:

So if you have a huge project now, you just switch on more of the agentic workforce, and it has a cost to it, but it's easy to turn up and down, because it's just going to be the bills to the API, or the fine-tuned models, or whatever, maybe some inference-time compute. And from what we're seeing in terms of Devin and the software agents, it seems like it's definitely possible to get to that stage. Just turn on GPT-5, and we will get to a case where we can handle a lot of cases with the agentic workforce.

Anders Arpteg:

Yeah, still to be seen, but I think that's going to be the focus for this year: much more agentic kinds of work. Cool. Okay, with that, we spent much more time than usual on the news section, but we had a lot of very fun discussions, and it's always amazing to speak with you, Jesper.

Jesper Fredriksson:

Nice to be here.

Anders Arpteg:

And awesome, of course, to have Hagay on, with direct inside information about the training process of DBRX. So with that, thank you, all of you, and have a great evening. Thank you.

Jesper Fredriksson:

Thank you.

Anders Arpteg:

Bye-bye.

Advancements in Autonomous Vehicle Technology
Professional Journey Through Tech Companies
Model Architecture and Hyperparameters Discussion
Training DBRX on Mosaic AI
Open Source AI Model Safety Discussion
Future of AI Technologies and Regulation
Race for AGI
Future of Computing Innovations and Competition
AI Regulation and Intellectual Property Rights
Current Trends in Enterprise Generative AI