AIAW Podcast

E149 - Fast-Tracking AI. The Power of Inference - Hagay Lupesko

Hyperight Season 10 Episode 7

Episode 149 is here! Get ready to dive into the future of AI hardware with Hagay Lupesko, SVP AI Inference at Cerebras Systems. In this electrifying episode, Hagay takes you on his unique journey from big tech to hardware disruptor, revealing Cerebras’ game-changing innovations—including the Wafer-Scale Engine, famously dubbed the “Godzilla of AI chips.” Discover how this marvel is redefining AI performance, from massive foundational models to agile, nimble systems. Plus, we compare Cerebras’ startup agility with the powerhouse operations of hyperscalers like Tesla, Meta, and Amazon. Tune in for deep insights on real-world performance surprises, cutting-edge LLM testing, and what the future holds for AI training and inference. Don't miss out on this must-listen conversation that’s setting the stage for the next generation of AI technology! 

Follow us on youtube: https://www.youtube.com/@aiawpodcast

Anders Arpteg:

Let me welcome you very much here, Hagay Lupesko, or how should I pronounce your name again?

Anders Arpteg:

Lupesko, Lupesko, awesome. Yeah, and you're now the Senior Vice President, right, of AI Inference at Cerebras Systems, and I know that you're doing so much cool stuff there, and I'd love to discuss more about the future of hardware and inference, how we can take advantage of that, and what the secret sauce at Cerebras is. But you know what's extra interesting as well is your very high-tech background. You worked at MosaicML, which was acquired by Databricks, and you've been working at Meta AI as a director of engineering as well, and I love, well, love is a strong word, but I'm fascinated by Yann LeCun and their approaches. You've also been working at Amazon, so you have such an interesting background, I think. So it will be super interesting to hear both.

Anders Arpteg:

Of course, what Cerebras is doing, that's the main thing, but also a bit about the different ways of working. And now, with Elon working with DOGE and everything, this is an extra interesting topic, I think: how to become efficient and have a way of working that really delivers on speed. So with that, welcome here. I'm very much looking forward to the discussion. But before that, perhaps you can give a quick background about yourself. Who is really Hagay, and what's your background?

Hagay Lupesko:

Yeah, thanks for having me, Anders and Goran. I think you pretty much summarized my last, I don't know, five to ten years of professional background. Beyond that, I would just say that I've been living in the US for the past 10 years, here in Silicon Valley, now a proud US citizen, but I was not a citizen when I came in. I was born and raised in Israel, did my undergrad and graduate studies in computer science back in Israel, and started my career as a software engineer working on systems. Later on I moved to computer vision. Back in the day, nobody called it AI, although it was actually very similar to what you would call AI today.

Anders Arpteg:

What year was this approximately?

Hagay Lupesko:

That was, so, I graduated from my undergrad in 2003 and then did a master's, which I finished in 2007. Yeah, so I worked on medical imaging back in the day. I even moved to China to join a startup there; that was quite an experience. I worked there through the big financial crisis, for the folks that still remember that financial crisis of, what was it, 2008 and 2009, the meltdown. And a fun fact: my older son was actually born in China, in Shanghai. And then at some point we moved back to Israel.

Hagay Lupesko:

Did a few other roles there, again mostly focusing on computer vision and medical imaging, and then I had an opportunity: Amazon back in the day was actually hiring people across the world to join their teams in the US. And actually, it's funny, I had never heard of Amazon when I interviewed there. So that was, when was it, 2011 or 2012 or so? Amazon was well known in the US, I think, but they didn't have much of a global presence back in the day. Still, it was a big company in 2011, and it looked awesome.

Hagay Lupesko:

I interviewed there. I always had a dream of kind of working in Silicon Valley, I think in our field, especially in software, but I guess also in hardware.

Anders Arpteg:

You know, Silicon Valley is like the Mecca, right, of our field. And I mean, Israel has an awesome tech community as well, and I guess it's one of the hot spots too, but Silicon Valley, I guess, is at the top. Or how would you rate the two, if you compare Israel and Silicon Valley?

Hagay Lupesko:

Yeah, I think there are a lot of similarities. So, you know, Silicon Valley, or I guess now we refer to it a bit more as the Bay Area, the San Francisco Bay Area, is actually a fairly small area, I think it's like 5,000 square kilometers or something like that. Smaller than Shanghai, as an example, in terms of land area, and of course much smaller than Shanghai in terms of population. But it has a magical mix of talent, of academia and of VC, which is a really important ingredient, right? So that mix gives you both the talent and also the funding to encourage this talent to start new businesses.

Anders Arpteg:

And the big tech giants working there, of course, are drawing a lot of the talent, I guess, to either get acquired or hired at those companies, right?

Hagay Lupesko:

Exactly. And so the big tech companies are an important ingredient, because they're like a magnet for the talent that comes in, and then that talent actually learns at these big companies how to work, how to build software, how to go to market, which is also an important ingredient. But then these folks, not all of them but quite a few, have their own ambitions to start their own businesses, and it's almost like these big tech companies are schools: they teach people how to do it, and then they go out and do it on their own. So I think Israel is actually very, very similar in that sense. It's also a very small piece of land in the Middle East, and actually the tech hub is even smaller; it's mostly centered around Tel Aviv, which is, I'd say, the cultural and business hub of Israel.

Hagay Lupesko:

The VC ecosystem is not as big, I'd say, as Silicon Valley's, but it's fairly big, and actually it's interesting because it was kick-started by the government in Israel. I think governments are not always very useful, I'd say that, but in this case it's an example of a really active and positive role the government played in kickstarting an ecosystem. And now, by the way, the government's role in the venture capital ecosystem in Israel is minimal to none, but they kickstarted it. They put taxpayers' money in to help seed a lot of companies. That was back in the 80s, so it was very innovative at the time, I'd say, and it really helped.

Hagay Lupesko:

And of course, the other thing is that in Israel the culture is extremely entrepreneurial, so every other person has an idea for starting a business, and many of them would do it, and most of them would fail, of course, but there are enough to create a very vibrant environment. So there are actually a lot of similarities. One thing I see that's different is that Silicon Valley is just so much larger, probably a couple of orders of magnitude larger in terms of the amount of talent and the amount of VC money that is looking for investments. And of course it's also within the US, where for most startups, at least, the initial market is the US, so that obviously helps a lot.

Anders Arpteg:

It does. But also, the government being involved in supporting entrepreneurs and actually helping startups: wouldn't you think that's potentially something that Sweden could learn from, to perhaps grow a similar kind of area? What do you think?

Hagay Lupesko:

Yeah, yeah, no, totally. I think that model, which has been working so well in the US, has already been replicated, but I think every nation out there should figure out and understand that recipe and try to replicate it. Technology: if you look at economic growth over the last 70, 80 years, it's been mostly driven by technology. Just look at the US stock market today; the top companies in terms of market capitalization are tech companies. That wasn't the case only, I don't know, 40 or 50 years ago. So a lot of growth is coming from technology, and, I'm sure we'll talk a bit more about AI, I think a lot of the growth so far has come through software, the internet, mobile, and the next step-function growth is going to come through AI.

Anders Arpteg:

Yeah, that's a super interesting topic, and I think also that the tech companies today have specialized in actually making use of data, and now AI is becoming an increasingly important way to find value in that data. So it's super interesting.

Anders Arpteg:

But moving to that topic, and perhaps thinking a bit more about the cultural differences: speaking to someone like you, who has experience from so many of them, both Amazon and Meta, and then also MosaicML and Databricks and now Cerebras, can you give a quick view of what the differences are, and how you would compare them to normal companies and their way of working?

Hagay Lupesko:

Yeah, you know, I think each one of these companies you mentioned has kind of a unique identity.

Anders Arpteg:

They're different from each other, you would say, as well? So Amazon versus Meta versus Databricks and AWS? Okay, yeah.

Hagay Lupesko:

There are similarities, for sure, but there are differences, and I feel a lot of the real core of the culture actually stems from the founder of these companies, or founders, because these are all fairly young companies if you look at the bigger perspective. So, at Amazon, the culture is very much focused on moving fast but having business impact, and being extremely rigorous; Amazon has a highly rigorous culture. And one of the things I really liked about Amazon is that many meetings start with just 10 minutes of silent reading of a document.

Anders Arpteg:

Yeah, I heard about that. Is that really true? I was wondering about that.

Hagay Lupesko:

So that's true.

Anders Arpteg:

Okay, yeah, interesting.

Hagay Lupesko:

Yeah, hardly any PowerPoint presentations or slide decks. People are expected to write good documents, and one of Jeff Bezos' quotes that I really like, and really agree with, is that he likes a meeting that starts with a crisp document and then a messy meeting. And that's true. I mean, you want to start by getting a download of the information, and you get that through a really good document that someone on the team has written. It can be a software design document, it can be a postmortem of some screw-up that happened in production, or it can be the plan for building the next awesome product or feature. And then you spend 10 minutes, everybody's literally in the room, and it's quite a thing. If you walk into this room, you'd think maybe someone had passed away or something, because everyone's silent.

Hagay Lupesko:

Reading the document, and it's usually also a paper copy, so people take small pens and write their comments. And then, after 10 minutes, and you need to be a fast reader...

Anders Arpteg:

Is it like two or three pages, or is it longer than that? How long are these documents normally?

Hagay Lupesko:

So typically, I mean, there is either a one-pager for relatively simple stuff or a six-pager for stuff that is more complex, and you're asked to keep it to no more than six pages; usually what people do is shove all the other details into an appendix. But yeah, the idea is that folks in the room can read it within 10 minutes, and then there's a discussion, and what's nice about this methodology is that the discussion is so well informed, right? Because the alternative, what many other companies do, is there's a slide deck and someone presents the slides, and slides don't contain a lot of information. There are bullet points, and the important information is often in the presenter's head, and they talk about it. It doesn't really enable a deep and detailed discussion. So that's one of the best practices of Amazon that shows the company's rigor, and there are a lot of other examples.

Hagay Lupesko:

The other thing I would highlight is post-mortems, or some companies call them retrospectives. At Amazon it's called the COE, Correction of Error, and it's a well-designed process for how you debrief failures that happen and how you learn from them, and everybody at Amazon takes it very seriously. There is a process, there is a standard, usually there'll be an executive in every post-mortem like that, and people take the follow-ups very seriously, and a lot of other details. But that just shows the rigor of the company, and I think that's one of the main reasons Amazon is so good at executing. I mean, they go into a new field and they're just executing so well, taking market share, building products that people love.

Anders Arpteg:

This is such an interesting topic, and I know we have a lot of topics, but I still want to ask one more thing about Amazon, and that's their way of innovation. I've heard about the, what's it called, the one-page, no, the press release kind of approach. Is that something?

Hagay Lupesko:

Yeah, the PR FAQ, it's called.

Anders Arpteg:

Yeah, so that's really true as well? So you have a press release? Okay.

Hagay Lupesko:

And that's another, again, one of Amazon's secret sauces, which today, by the way, many other companies, here in Silicon Valley at least, have adopted. But the idea is, when you have an idea for a new product, a new feature, or in general something meaningful to do, you start by working backwards from the press release. So when this thing is announced, what would be the press release that Amazon would publish? Usually it's maybe half a page, and it really should have a very crisp description of that new feature or product, and it has some quotes from customers. Of course those are imaginary quotes, because the thing doesn't exist yet, but you try to think: okay, how would our customers talk about it? And then, after that imaginary, forward-looking press release, you have a list of FAQs, frequently asked questions, that go into the details of: okay, what's the product, how is it going to work, how are we going to do this, how are we going to go to market?

Hagay Lupesko:

And again, it's written in a document, and there is a review where people do silent reading and then debate it, et cetera. And that's actually how Amazon works. At least when I was there, that's how it worked. I sat in dozens of these PR FAQ reviews; I wrote a few myself. It really pushes you to think about the outcome, think about the business results, and only then think about all the technical, engineering and product aspects.

Anders Arpteg:

Super cool. And then Meta, if you were to quickly go through it a bit: what's special about the culture at Meta?

Hagay Lupesko:

Yeah, I would say at Meta everything is about moving fast, and the original slogan, when I joined at least, was "move fast and break things," which was a bit controversial, and I admit it took me a while to get it, but when I got it, it made perfect sense. It's about: hey, it's really important to be successful in business, it's really important to move fast, and it's okay if, in the process of moving fast, you don't do things perfectly. That's the idea. It's okay to make compromises, as long as you're making sure that you're moving fast.

Hagay Lupesko:

I think that really stems, I feel, from the character of Mark Zuckerberg, the founder, who, I think, has become a little bit of a controversial figure in some ways. I mean, he's running the world's largest social network; obviously it'll be controversial. But it's hard to ignore that this guy, out of his dorm room at Harvard, built a juggernaut, right, built a company that is so successful and that so many people use every day. And I think one of the key things he brought to the table to make that happen is really being ruthless about moving quickly, experimenting, making some mistakes but then correcting them. That is the core thing. So if I compare Amazon to Meta, what they have in common is that focus on business impact and moving fast; both companies are pretty good at that. But where they differ is that Amazon would be super rigorous about things, sometimes to the point where it would actually hurt their speed of innovation or how fast they're moving, while Meta would be a bit less rigorous and would say it's okay not to be that rigorous as long as you're moving quickly, let's just ship things as quickly as possible.

Hagay Lupesko:

And I can tell you, in my first week at Meta, actually my third day, I already shipped code to production, and that was just incredible. And, by the way, I came in not even as a software developer; I came in as an engineering manager, where typically you focus more on building the team and on strategy, and less on coding. But as part of my onboarding I was asked to fix bugs in production, and I did it, and the infrastructure was just so amazing, again optimized for shipping stuff quickly. So I remember coming home from work on my third day and telling my wife: wow, today I shipped code to facebook.com.

Anders Arpteg:

Really cool. Which one would you prefer, I guess, the Meta approach or the Amazon approach?

Hagay Lupesko:

I think, I don't know, it's a hard question. I think both are really top-notch companies in so many different ways; they're just a bit different. I can tell you, personally, my philosophy in terms of software development and building products is that speed matters a lot. So I like this thing about both companies. I don't know if I have a particular preference in that regard. I think both of them fit my values: moving fast is really important, building software quickly, deploying it to production quickly, getting customer feedback and figuring out what we need to improve. That's the kind of value I share with both of these companies.

Anders Arpteg:

Awesome. And just a quick question, then, about Meta as well. So Mark Zuckerberg made a big change recently: he's removing some fact-checkers and moving to a, perhaps you'd call it, less censored approach. Any thoughts about that? What's your thinking?

Hagay Lupesko:

I think he's aligning with the current winds of change in the US, maybe actually even globally; we've all seen the changes happening globally. I would say Meta is in an extremely difficult position, and has been for years now, right? They are trying to be the arbiter of truth, and what is truth? Some things are clear, but there are so many things that are vague and depend on the cultural perspective, on where you stand, and other things like that. So I feel like Meta has always been put in an impossible position. That's on one hand. So Meta tried to be the arbiter of truth; they used to have, I don't know what they have now, but tens of thousands of fact-checkers assigned to handle these cases, and now Meta is cutting back on those tens of thousands of people. I think some of it is maybe recognizing that it's not as effective, some of it is probably aligning with the new winds coming out of Washington with the recent elections, and I think some of it is also just trying to be more efficient. And again, we spoke about this earlier, Anders, right? Like, look at Twitter.

Hagay Lupesko:

Once they were acquired, Elon made it private, and the guy slashed 80% of the workforce. Everybody cried that the site was going to fall apart and whatnot, but hey, it's still up and running, last time I checked. And I think everybody in Silicon Valley actually saw that, investors and Wall Street saw that, and there is a lot of pressure now on tech CEOs to be as efficient, or to get close to Elon's efficiency. So that's why we're also seeing a lot of layoffs at different tech companies. We could speak about this, I think, for two or three hours alone; it's such an interesting topic.

Anders Arpteg:

I could actually add something about Tesla, and we're limited on time here, so perhaps, if we have time, I don't think so, but if we have time, I would love to hear more about your thoughts on Elon, the DOGE approach, Twitter and the Tesla way of working and whatnot. But we have so many other things to speak about as well. And just to close out your background a bit: Databricks and MosaicML. Can you briefly mention a bit about their culture as well?

Hagay Lupesko:

Yeah, yeah. So, just on the personal level, for the listeners: I spent five years at Amazon after coming to the US, then another three years at Meta. It was a lot of fun, but I kind of realized, hey, I want to go back a little bit to my roots in startups, and although Meta is really a fantastic company, I decided it was probably time for me to make a change, and then decided to join MosaicML.

Hagay Lupesko:

MosaicML, back when I joined, I joined as the VP of engineering to build the company's engineering organization and product. It was an early-stage startup; when I joined we were, I don't know, 15 people or something like that, fairly small, working out of a very rustic office in San Francisco. It was a very funny office. And we built a platform for training large models. Later on, LLMs came in like a storm and we completely focused the company on training LLMs. That business started taking off, our customers asked for inference capabilities as well, so we added inference to the platform. It was a fantastic ride, and at some point we were acquired by Databricks. Databricks is another private company, only Databricks was at a much later stage, even when they bought us; already back then their valuation was in the tens of billions of dollars. Really good company, very strong founders, and a fantastic CEO who is one of the founders, Ali Ghodsi, who is also, I believe, Swedish.

Hagay Lupesko:

Swedish. Yeah, so I think he was born in Iran and then, following the revolution in Iran, his family migrated to Sweden and I believe Ali the CEO, he did his PhD in Sweden, if I'm not mistaken, and then at some point moved to the US, to the Bay Area.

Anders Arpteg:

It's a really well-run company. I met Patrick Wendell as well at some point, and he seems like a super smart guy too, I must say.

Hagay Lupesko:

Yeah, all of the Databricks founders, I would say, from working with these guys: very smart, very capable, very strong business acumen, and I really admire how they built the company. They built this open source project, Apache Spark, within Berkeley's AI lab, and they were able to take that open source project and build a company around it. Databricks today is still a private company, but their valuation, I believe, is around $60 billion, one of the biggest private companies out there, and it's incredible work by this set of founders. They're all smart, very capable, and they built a great company.

Hagay Lupesko:

So we were acquired by Databricks, and after the acquisition I was running Databricks' AI platform products, and we integrated MosaicML's large-scale training and serving products into Databricks; they're now called the Mosaic AI products in Databricks. We also integrated other capabilities in Databricks for predictive ML, sometimes people call it classical ML, I prefer to call it predictive AI, and integrated AI compute capabilities for people who just want a notebook and access to GPUs to run their own stuff. And it's been going super well.

Hagay Lupesko:

Yeah, and I think Databricks is also culturally very driven. The culture stems a lot from the founders, and I feel like, in some ways, they took a lot of good concepts from Amazon and other tech companies in the Bay Area. It's also a very rigorous culture, very customer-focused, and I'd say the unique twist at Databricks is that it's also somewhat academic, a little bit, in the sense that there is a lot of depth in thinking through engineering problems, research problems, but also business problems. Often things start with a hypothesis, and you need to prove or disprove the hypothesis when you write documents about new features, new capabilities or how you're building stuff. So I'd say that is kind of the unique twist of Databricks.

Anders Arpteg:

So fascinating. But let's move on, perhaps, to the main topics here. We have limited time with you, so I'm trying to manage the time the best I can, even though these are such interesting topics. But let's move into Cerebras. Could you start by describing a bit of the mission, what they're doing, and then also how you got in contact with them and what you're doing there?

Hagay Lupesko:

Yeah. So Cerebras, which I joined almost six months ago to help the company build an inference product, is a private company here in the Bay Area, founded nine years ago by a few founders who had previously built another company and sold it to AMD. The mission of the company is to accelerate AI by building the best hardware and software for that. Now, I know it's a bit broad and a bit grand, which I think, by the way, is important; companies should have really ambitious and broad missions, right? But really, if you look at what the founders did, they looked at what was emerging. Nine years ago deep learning was still very nascent, but it was also kind of clear to people who were deep in the details that it was going to play a major role in AI. Nine years ago, that was after the AlexNet moment, right, where Geoffrey Hinton's team, Alex Krizhevsky, kind of broke the world record in image classification.

Hagay Lupesko:

Yeah, with neural networks. So these founders observed that GPUs were playing a big role in this emerging deep learning movement, but also that GPUs were not designed for neural networks; they were designed for graphics processing. Now, the thing where GPUs kind of hit the jackpot is that they're really good at highly parallel linear algebra. So if you want to multiply a large batch of matrices, GPUs are good for that, which is a lot of what neural networks need. They're definitely much better than CPUs at that.

Hagay Lupesko:

But there are deficiencies with GPUs, and we can dive into these in detail. I think one of the biggest deficiencies for big neural networks is the limited bandwidth between memory and compute on the GPU, and we'll get to that in a minute, but that is a major limiting factor. Anyone who has tried to build, especially, inference products on top of GPUs like NVIDIA's or AMD's knows that it doesn't matter how much compute you're packing on the silicon; you'll be limited by the memory bandwidth. We can talk about that in a minute. But going back to the founding story of Cerebras: the founders set out to build hardware that would be designed from the ground up for large neural networks, and they came up with something pretty amazing, which is a wafer-scale chip. Now, I don't know how many people know this, I did not know it before I joined Cerebras, but the origin of the name "chip" is that you literally chip it out of a big silicon wafer. So when you manufacture the silicon...

Hagay Lupesko:

It's a big wafer, and then you chip out the individual dies.

Anders Arpteg:

Oh cool, learned something new here. That was interesting, yeah.

Hagay Lupesko:

And then Cerebras said: hey, why are we actually chipping out these chips, putting them on PCB boards, and then connecting them together again in a host? Why not just keep them as one unit, so the bandwidth, the connectivity between each die, remains high? There are a lot of advantages to keeping it that way. And they set out and built this incredible thing, which actually I can show you guys here. That's a Cerebras Wafer-Scale Engine chip. You can see it's not a chip, it's actually a whole wafer. Now, the wafer itself is round in the manufacturing process, but to fit it into a system you basically laser-cut the edges. So this is a real Cerebras Wafer-Scale Engine chip.

Hagay Lupesko:

Jesus Christ. That actually goes into systems. It's super huge, right?

Anders Arpteg:

But the original wafer is about the same size?

Hagay Lupesko:

Yeah, it's just rounded around the edges. Actually, I can share my screen just to show a little bit; I think it would be nice for the audience to see. Let me quickly share.

Anders Arpteg:

Cool. Normally, how many chips, would you say, are NVIDIA and others putting on a single wafer? Would it be thousands of them, or hundreds? How many are normally chipped out, so to speak?

Hagay Lupesko:

I would say it probably depends on the NVIDIA chip, but maybe around 50, something in the order of 50. So actually, this thing we're seeing now. Can you see my screen?

Anders Arpteg:

Yeah, yes.

Hagay Lupesko:

Yeah, so this is about 50 times larger than an NVIDIA H100 chip, so my guess would be that NVIDIA can probably extract about 50 chips from one wafer. If you look at the specs here on the right: 4 trillion transistors, 900,000 AI cores. That's about 50x more than an NVIDIA H100 in terms of the number of cores and the size. But what's also interesting here, and we'll touch on it in a moment, is the 44 gigabytes of on-chip memory. That's SRAM, static RAM, much faster than DRAM, and it's about 1,000 times more than what an H100 chip from NVIDIA has. That's a huge, huge difference, and I'll get in a minute into how it's useful, especially for inference. And obviously the bandwidth between this memory, because it's static RAM on the silicon, and the compute cores is extremely high, actually about 7,000 times higher than the bandwidth between, let's say, an NVIDIA or AMD GPU and the HBM, the high-bandwidth memory, which sits on the same card. This is just another kind of illustration.

Anders Arpteg:

That's a Cerebras Wafer-Scale Engine, the latest generation, generation three, compared to an H100 in terms of size, just to keep everyone understanding the sizes we're talking about here. I mean, this is amazing, of course, but I've heard that normally, if you try to build such a big chip, the probability of them failing, or the yield you normally get from a wafer, can make it very costly if you have to throw away a whole wafer.

Hagay Lupesko:

Yeah, definitely, that's one of the challenges. So the Cerebras founding team, in their first few years in the company, just worked hard at the challenge of: how the heck are we going to build such a big wafer, put it into a system, and make it actually work and not go up in flames or break down? And yeah, yield was one of the problems. Managing the heat, right, because that thing heats up a lot; it's almost like a sauna generator.

Anders Arpteg:

Managing all the mechanical aspects, packaging.

Hagay Lupesko:

How the heck do you package such a big piece of silicon on a PCB? How do you hook power up to it? Lots of challenges. And actually there have been a bunch of attempts, by the way, before Cerebras did this, to build a wafer-scale chip; the idea is not entirely novel. Other companies and academia tried to do it, and as far as I know it always failed before that. So the first incredible thing that the Cerebras founders were able to pull off was to make this thing work.

Hagay Lupesko:

Yeah, the yield problem, by the way, which you raised. I don't know if the audience knows, but yes, when you manufacture silicon, it's never a perfect process. We tend to think: oh, it's manufactured, it's probably perfect, everything works. But no, there are always cores that are not working, for a lot of different reasons. And what the chip manufacturers actually do today is factor in the fact that some cores will not be working, and they build in mechanisms for redundancy. And actually, if you look at Intel, AMD, NVIDIA, they sell different models of the same chip design, but one is lower capacity versus higher capacity, and that's basically just based on the yield. After manufacturing they do testing, and the chips that have more yield problems would be sold as lower-grade chips, still used by consumers.

Anders Arpteg:

So it's like a fine wine or something. Some wine, you know, gets spoiled in some way. Or it's like a diamond.

Goran Cvetanovski:

Yeah, yeah, because they have inclusions and so on. So depending on the inclusions and the color, you have the value of the diamond. Yeah.

Anders Arpteg:

So it's the same manufacturing process, but depending on the quality they are sold in different product series, or at different prices, right?

Hagay Lupesko:

Yeah, yeah. I think one difference between wine and computer chips is that computer chips lose value much faster than good wine. Yeah, good point.

Hagay Lupesko:

But anyway, going back to the story. So they designed the chip, focused a lot on packaging it into a system, and then there are other things that go on top of that. You need to build the compiler that takes high-level code and compiles it into a low-level representation, microcode, that can run on the chip. This is a whole different computer architecture, a whole different paradigm from all the traditional approaches. So you have to build the equivalent of NVIDIA's CUDA, right? And then, of course, you have to build the applications on top of it. They started with training of large models, and recently I joined to help build a new product around inference, and I'm happy to talk about that product because I'm extremely bullish about it.

Anders Arpteg:

Yeah, that sounds super interesting, so if you can speak about that product, that would be awesome. But just to get back to CUDA and the software that NVIDIA of course has, which I guess is a big advantage for them and makes a lot of companies locked into using NVIDIA products: are you building something similar here, like the ROCm kind of thing from AMD, or what's your approach?

Hagay Lupesko:

Yeah, we have already built that. To operate the accelerator we had to build a compiler pretty early on, so we have the equivalent of CUDA or ROCm. It supports PyTorch, it supports TensorFlow, and our customers are already using it today.

Anders Arpteg:

Is it like 100% feature-complete, or how close are you to that?

Hagay Lupesko:

It is feature-complete. It's working. I think we still have our work cut out for us in terms of making it more user-friendly, easier to understand how to use, easier to use, and open-sourcing parts of it, which we have not done yet.

Anders Arpteg:

But do you have a separate Python library, or can you use PyTorch directly, I guess?

Hagay Lupesko:

You can use PyTorch. If listeners are interested, they can look up our documentation online. We have PyTorch support, we have extensions for PyTorch, and there are examples of how you would train a PyTorch model on Cerebras. Of course, to be able to do that you need access to our hardware, and to get that you have to actually talk to us so we can give you access.

Anders Arpteg:

So how much does it cost to buy a wafer-scale Cerebras chip?

Hagay Lupesko:

How much money? How much money can you spend?

Anders Arpteg:

Is it a consumer product, or would it always be like an enterprise product, or what do you think?

Hagay Lupesko:

So, I wouldn't call this a consumer product, you know. The product we sell is actually not the chip itself, because the chip would be useless to you; you can't just slot it into a PCIe slot or something like that. You actually have to buy a Cerebras system, and the list price for that system is about $1 million for one system. Of course, the price varies based on different factors, like how much you're buying and things like that. But, you know, definitely not consumers.

Hagay Lupesko:

Most of our customers are either large organizations with their own data centers, government labs, or sort of semi-government labs, so those companies would buy the hardware. Let me show you real quick what it looks like. So this is the Wafer-Scale Engine, the chip. It goes into this system, which is about 16U, so roughly half a standard rack, and this is a cluster, which in this case shows eight systems. Today our typical cluster footprint is about 16 Cerebras systems, and if you think about it, 16 Cerebras systems is really a supercomputer, and each one is a 16U-size system.

Anders Arpteg:

Sorry, and each of these 16 is basically a 16U-size system?

Hagay Lupesko:

Yes, this, yeah, okay.

Anders Arpteg:

That's a big one.

Hagay Lupesko:

Yeah, and they sit in the data centers of our customers, or, we also operate our own data centers, we have a few in North America, and then our customers rent access to this compute. It also powers our inference service, so yes, we sell a cloud service as well. That's a new product we launched late last year, in late August: an inference product. Basically it's an inference API product, so you're not really exposed to the nitty-gritty details of our hardware or software stack; you access a standard OpenAI-compatible API.

Anders Arpteg:

So on that level it's not like the PyTorch level; it's basically the OpenAI kind of level, where you access the model directly, so to speak.

Hagay Lupesko:

Correct, yes. And there's been a lot of demand, right, for these services from many companies out there.

Hagay Lupesko:

They want access to the goodness of AI, and in particular LLMs, but they don't necessarily want to worry about configuring the hardware and choosing the models and configuring the models and running inference.

Hagay Lupesko:

It's actually pretty hard to do it efficiently at that scale, and if all you want to do is experiment a little bit with LLMs, it's going to be very expensive for you to have to set up GPUs and do all of that work. So we've launched an inference service that is based on our hardware, and what's really incredible is that, thanks to the hardware, our performance is just incredible. If you look at this chart, it's a screenshot from a website called Artificial Analysis. If you haven't heard of it, it's a great third-party, independent company that benchmarks different modern AI services and models. They compare models in terms of their quality, and they compare API services. This is just a screenshot from that. The organization, Artificial Analysis, is not affiliated with Cerebras in any way; they just get access to our API to run the benchmarks themselves.

Anders Arpteg:

And you mentioned Groq here as well, and I think a lot of people, or some people, know about Groq, and they have a very different approach, right? Can you compare a bit what the approach for Cerebras is compared to Groq?

Hagay Lupesko:

Yeah, yeah, definitely. Just to quickly go through what we're seeing here: that's a benchmark showing token generation speed. TPOT stands for time per output token, but really this chart is the number of tokens per second for Llama 3.3 70B. It's one of Meta's open Llama flagship models, a great model, very useful, a really good size in terms of the quality you get for the serving cost. You can see Cerebras here at 2,355 tokens per second. We can also see Groq there, with speculative decoding, at about 1,596. SambaNova, which is another custom ASIC provider, at 313. And then the rest are really all NVIDIA GPU-based or AMD GPU-based serving APIs, and you can see that Cerebras hardware is at least 10 times faster than the fastest GPU-based inference. And if you compare it to a service like Azure, Microsoft Azure, it's, I don't know, what is it, like 50 times faster or whatnot. And that's all thanks to the hardware. So what's interesting is, yes, there are a few kind of new ASIC providers; I mean, Cerebras is one, right, and we're the fastest.

Anders Arpteg:

I mean, cerebracy is one, right, and uh, we're the fastest uh, just let people understand what you mean when you say asic, but that's more like an application, specific circuit, or the ship that is specifically designed for a specific purpose, like AI, right?

Hagay Lupesko:

Yeah, correct, correct. So Cerebras, I mean, we really designed new silicon. We're not using standard NVIDIA GPUs; we designed new silicon, a new computer architecture, everything, and we designed it for deep learning workloads. Then there are Groq and SambaNova; they also have their own hardware, and each company took a bit of a different approach. I would say Groq has LPUs, which are much, much smaller accelerators, so it's almost like the inverse approach, that are interconnected with custom networking between them. So there are different trade-offs. SambaNova's approach is a bit different. I'm actually not an expert on the hardware approaches of these other companies, so I can't speak in too much depth on that.

Anders Arpteg:

But is it true what we hear about Groq, that they basically have thousands of smaller LPUs and you have to use them all, or use their kind of service? They have a very specific... Exactly, yeah.

Hagay Lupesko:

So Groq, yeah, Groq is exactly what you said. Again, the approach is different: it's many, many thousands of much smaller accelerators. And I think Groq also pivoted at some point a couple of years ago. They used to allow you to program the chip and work with it directly, and now it's more like an inference service provider that is backed by this hardware. So it's almost like they really pivoted the business completely.

Anders Arpteg:

Interesting. And this new product of yours, is it only for inference, or will there be some kind of training API as well, you think? How would you do it if you want to train on Cerebras?

Hagay Lupesko:

Yeah, so you can. Definitely, training is a big business for us. Actually, it's bigger than inference at the moment, because it's the original business of the company. And yeah, you can sign up as a customer and you'll get access to our compiler and our software libraries, and you can train models. We've trained models with companies like G42, which is a UAE-based AI company, fairly large. We worked with Mayo Clinic to build a genomics foundation model. Yeah, there are quite a few customers who've been working with us on training large models.

Anders Arpteg:

But is that a separate product then, or can you use this?

Hagay Lupesko:

It's a separate product interface. I would say it's all based on the same hardware and software stack under the hood, but the interface for the product is different.

Anders Arpteg:

Super interesting, super interesting.

Hagay Lupesko:

The other interesting metric here, the other thing that is important when you're building applications against AI services, is the latency, the time to first token. And again, this is just a benchmark from the same company, Artificial Analysis, and you can see Cerebras for this model as well, and the latency is the lowest by far across lots of other services, which is also very important.

Anders Arpteg:

And I guess that's due to the big wafer, once again, with the memory integrated into the chip, so to speak, and therefore much higher bandwidth, right?

Hagay Lupesko:

Exactly, yeah. So that memory bandwidth plays a much bigger role, I would say, in sampling, so generating the next token, the autoregressive sampling, but it's also very useful for the time to first token. The win on latency is actually more about the processing speed, the compute capacity, a bit more than the memory bandwidth; the reason Cerebras is so much faster for token generation is more about the memory bandwidth. And actually there is a slide here that explains this memory bandwidth issue a little more. So on the left, a typical GPU: there is the compute on the silicon, and then there's the HBM, which is DRAM, dynamic RAM, that typically sits outside of the silicon, and it has a relatively high-bandwidth connection with the compute silicon.

Anders Arpteg:

HBM is basically high-bandwidth memory, right?

Hagay Lupesko:

Exactly, high-bandwidth memory. The name is a bit misleading because, yes, it's high bandwidth compared to accessing the host RAM, the CPU RAM, but it's much slower than accessing the static RAM on the silicon. But the thing to understand is that these modern neural networks are so large, they have so many parameters, that you actually can't fit all their parameters in the GPU's static RAM. So what you have to do is load the first layer from the DRAM into the silicon, and then the compute cores do the matrix multiplication, which they do pretty quickly on GPUs. But then the next step is you need to go back to memory and load the next set of parameters, the next layer, and repeat that exercise of multiplication. Now, anyone who has tried to implement high-performance, low-level inference knows that you're actually bounded by loading those weights from memory; you're not bounded by the compute capacity. So, like I said earlier, even if you doubled the number of compute cores on the GPU, your speed of token generation would not double. It would maybe increase by 10 to 15% at best, because your latency, the time to generate a token, is really bounded by that memory access bandwidth. Yeah, that's cool.
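
To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the simplification that every generated token streams all of the weights past the compute once; batching, KV-cache traffic and compute time are ignored, and the numbers are illustrative, not measured figures.

```python
# Rough upper bound on autoregressive decode speed when generation is memory-bandwidth bound.
# Simplifying assumption: each new token streams (roughly) all model weights from memory once.
def max_tokens_per_second(weights_gb, bandwidth_tb_per_s):
    gb_per_s = bandwidth_tb_per_s * 1000          # convert TB/s to GB/s
    seconds_per_token = weights_gb / gb_per_s     # time to stream the weights once
    return 1 / seconds_per_token

weights_gb = 70  # e.g. a 70B-parameter model at 8-bit weights

# ~3 TB/s of HBM bandwidth (the GPU figure quoted above) -> roughly a 43 tokens/s ceiling
print(round(max_tokens_per_second(weights_gb, 3)))
# 21,000 TB/s of on-wafer SRAM bandwidth (the Cerebras figure quoted above) -> the memory
# ceiling is so high that compute and other overheads become the limit instead
print(round(max_tokens_per_second(weights_gb, 21_000)))
```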

Hagay Lupesko:

Now, on the right, on Cerebras, the Wafer-Scale Engine: within the same silicon you have both the memory, the SRAM, and we have 44 gigs of that on our current generation, and the compute cores. And even more so, they're co-located on the wafer. If you remember our wafer from earlier, it has these dies imprinted on top of it; each one is an independent unit, and each one has both static RAM and compute cores. So here, compared to a typical GPU's three terabytes per second of memory bandwidth, on Cerebras we have 21,000 terabytes per second, so 7,000 times more. And this is the secret, if you want to go back and understand: okay, how the heck is Cerebras so blazing fast in token generation compared to GPUs? This is the main reason. It's that memory bandwidth.

Anders Arpteg:

So cool. And I know you've been playing around with DeepSeek R1 as well. It would be super fun: if you take a model, Llama or DeepSeek or whichever you'd like, how would it run on Cerebras? How many wafers or chips would you need? Can you talk a bit more about how your experimentation has been there?

Hagay Lupesko:

Yeah, I mean, it's pretty simple. To do inference we implement pipeline parallelism, so some models fit on one wafer, right? We have 44 gigabytes of static RAM on the system hosting the wafer, so you can do the math. If you think about 8 bits per parameter, then, let's say, Llama 3.1 8B, the 8-billion-parameter model, easily fits on one wafer. If you take a larger model like the 70B, maybe you'll need a few wafers. We actually have different configurations, by the way, each one with different characteristics of speed versus throughput, but you can just do that math. Of course, beyond the parameters you need space for the KV cache and activations, and you need space for the code that you also host on the wafer, but it's mostly dominated by the parameters. And if you look at DeepSeek V3, that's a fairly large model, 671 billion parameters, mixture of experts; again, you'll need more wafers to host that.
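
As a rough sketch of that math, assuming 8-bit weights and the 44 GB of on-wafer SRAM mentioned above (real deployments also need room for the KV cache, activations and code, so treat these as lower bounds):

```python
# Minimum wafers needed just to hold a model's weights at 8 bits (1 byte) per parameter.
SRAM_PER_WAFER_GB = 44

def min_wafers(params_billion, bytes_per_param=1):
    weight_gb = params_billion * bytes_per_param   # 8-bit weights: 1 GB per billion parameters
    wafers = -(-weight_gb // SRAM_PER_WAFER_GB)    # ceiling division
    return weight_gb, int(wafers)

for name, size_b in [("Llama 3.1 8B", 8), ("Llama 3.3 70B", 70), ("DeepSeek V3 671B", 671)]:
    gb, wafers = min_wafers(size_b)
    print(f"{name}: ~{gb} GB of weights -> at least {wafers} wafer(s)")
```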

Hagay Lupesko:

But because our implementation is pipeline parallelism, we basically split the model across its layers and host each set of layers on a different wafer, and at any given point in time each one of these wafers is busy computing something, right? So the actual time per output token is not negatively impacted, because that pipeline is always working. The latency, the time to first token, may be impacted a bit more, but it's still fairly good. And, by the way, GPUs have the same approach, right? With GPUs, a typical, let's say, NVIDIA H100 has 80 gigabytes of HBM memory, and you almost always have to split models across multiple GPUs, sometimes even multiple nodes, each with multiple GPUs. And you can go different routes: you can do pipeline parallelism, you can do tensor parallelism, or you can do a mix of these.

Anders Arpteg:

Can you just... I don't know exactly how this works, and I'm sure a lot of others don't either, but there are different types of parallelism you can do: data parallelism, model parallelism, this pipeline parallelism by layers, etc. Can you elaborate, or just describe briefly, how they work?

Hagay Lupesko:

Yeah. So first of all, I would say there are the training approaches and the inference approaches, and they have some similarities, but they're not exactly the same. On training: training is an offline activity. You don't care too much about latency, you care a lot about throughput, and it allows you to work at very high batch sizes. I would say you want to work at as high a batch size as you can, until it actually starts degrading the quality of training. And most training approaches do data-parallel training, which means you split your data into different shards, so each set of compute sees different examples of the data, and that is how you scale up training to thousands or even tens of thousands of GPUs. Now, on top of that, we used to live in a world where models would fit on one GPU. That was a really nice world to live in.

Hagay Lupesko:

But today's models are so large that they don't fit on a single GPU, unfortunately. Then you have to figure out: okay, how am I going to fit this thing? You have to figure out how to split the model, and there are different approaches for splitting models. There is tensor parallelism, which basically means: hey, for every layer, I'm going to split this layer and place it on different GPUs. And of course, at some point I need these gather operations that allow me to sync the different parts; if you have layer norm, as an example, it forces you to sync. So that's tensor parallelism. Pipeline parallelism basically says: hey, I'm going to split the model into layers and place different layers on different GPUs. And there's also a mix that does both pipeline parallelism and tensor parallelism. It's all extremely complex, and that's part of why training large models is really difficult. A lot of it is a challenge of applied research, which looks at the model architecture, the quality of the data, hyperparameter tuning. But the other part that is also extremely difficult is more the engineering part: okay, how am I going to split this model so it fits on the accelerators while still maintaining as high a throughput as possible? So that's the challenge. Now, this is training.
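
A toy sketch of those two sharding ideas, written as plain PyTorch on CPU (my illustration, not Cerebras or Meta code; real implementations add communication collectives between devices): pipeline parallelism places whole layers on different devices, while tensor parallelism splits a single layer's weight matrix across devices and gathers the partial results.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 8)                              # a toy activation
w1, w2 = torch.randn(8, 8), torch.randn(8, 8)      # two toy "layers"

# Pipeline parallelism: whole layers live on different (here simulated) devices,
# and activations are passed from one stage to the next.
def pipeline_forward(x):
    h = x @ w1      # stage 1 on "device 0"
    return h @ w2   # stage 2 on "device 1"

# Tensor parallelism: one layer's weight matrix is split by columns across devices;
# each device computes a partial output, and a gather (concat) reassembles the result.
def tensor_parallel_layer(x, w):
    w_a, w_b = w[:, :4], w[:, 4:]                  # column shards on "device 0" / "device 1"
    return torch.cat([x @ w_a, x @ w_b], dim=-1)   # the "all-gather" step

# Sharding the columns and gathering reproduces the unsharded layer exactly.
assert torch.allclose(x @ w1, tensor_parallel_layer(x, w1))
print(pipeline_forward(x).shape)                   # torch.Size([1, 8])
```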

Hagay Lupesko:

Now, inference is a different problem, because inference is maybe sometimes an offline activity. For example, if I want to transcribe a gazillion different Zoom audio recordings into text, or if I want to process a large amount of text, that's offline; I don't care about latency, I care about throughput. But a lot of the inference use cases are actually online, where I need to return a response to the user ASAP because the user is sitting there waiting. Chatbots are kind of the poster child example for that. And typically in those regimes you figure out: okay, I maybe don't care that much about throughput, I care more about latency. So how can I optimize for latency?

Hagay Lupesko:

And again, you use data parallelism. So you have multiple serving replicas, and each one handles an example, and the other replicas don't have to handle that example because one replica already handled it. That's the notion of data parallelism in inference. The other thing, of course, is: okay, how am I going to shard the model? And there, yes, you do tensor parallelism and pipeline parallelism. So those are the different approaches, really in a nutshell. Those things are part of why, as I mentioned earlier, many customers prefer to just call an inference API rather than build their own, because, especially with today's large models, building your own is very expensive and requires a lot of expertise.

Anders Arpteg:

I mean, awesome. So if people wanted to use DeepSeek, which I wouldn't recommend, but that doesn't matter, or Llama or something else, you're basically now providing cloud APIs that they can use directly to do inference, and potentially training, and then they don't have to care about what type of parallelism to use, and it simply works, I guess, in some sense, right?

Hagay Lupesko:

Yeah, it just works. Our service offers what today is the industry standard: you have the OpenAI-compatible API, and you have the Llama Stack API, which is designed by Meta. Fairly straightforward: if you know how to write an API call in Python or JavaScript, you're going to be able to do it fairly easily, and we encapsulate all the complexity for you. So all the optimizations, making sure the model works well, redundancy, reliability, security, all of that is handled for you out of the box, and you get access to really the fastest inference out there.
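
Because the endpoint is OpenAI-compatible, calling it from Python looks like any other OpenAI-style client. Below is a minimal sketch; the base URL, environment variable and model name are illustrative assumptions, so check Cerebras's documentation for the exact values.

```python
import os
from openai import OpenAI  # the standard OpenAI Python client, pointed at a compatible endpoint

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",     # illustrative endpoint; see the Cerebras docs
    api_key=os.environ["CEREBRAS_API_KEY"],    # hypothetical env var holding your API key
)

response = client.chat.completions.create(
    model="llama3.3-70b",                      # illustrative model identifier
    messages=[{"role": "user", "content": "Explain wafer-scale inference in one sentence."}],
)
print(response.choices[0].message.content)
```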

Hagay Lupesko:

We recently onboarded big customers. Mistral is now a customer of Cerebras; we are powering Le Chat, Mistral's chat assistant. They launched a feature called Flash Answers, which gets back to you super quickly, and that's powered by Cerebras. We also recently announced our partnership with Perplexity, which is a really interesting product that tries to disrupt Google Search. They have their own model, based on Llama, called Sonar, and that is also powered by Cerebras. And, by the way, if you haven't tried Mistral or Perplexity, I highly encourage you to try them out. I think both are very interesting products and services, and each one of them is challenging the incumbents in its own way. Beyond that, we now have dozens of enterprise customers on the platform, and our developer tier, developers using our product, many of them for free within some rate limits, recently hit 100,000 developers, which is awesome. Yeah, amazing.

Anders Arpteg:

I'd just like to think about the cost a bit. I mean, there's no question that you're the fastest one, both in terms of throughput and latency, but what about cost-per-token kind of metrics?

Hagay Lupesko:

Yeah. By the way, if you go to that site, Artificial Analysis, it also compares cost; that's one of the things it compares. So we are, I'd say, competitive on price per token, although we're definitely not the cheapest, and we don't even try to be the cheapest. We think this service has a premium, right? It's a premium service because of the speed and the latency, and we also want customers to come to us not because we are the cheapest, because that's a very risky proposition, but for the quality of the service.

Hagay Lupesko:

I would say that for most of our customers, what we sell them is provisioned throughput. So as a customer you say, hey, I want to always have access to this amount of throughput in terms of tokens per second, maybe 100 tokens per second, 1,000, for bigger customers obviously more, and then we guarantee that we keep this capacity up explicitly for them. That's a business model that actually works better, I think, for the customer, because you have a production service. I've heard this complaint so many times with services like Anthropic or OpenAI or others: you build a production service that depends on that API, but then when you have high load, you actually get throttled by that service because they're experiencing high demand. So our way to solve that for customers is to always make sure they have enough capacity to power their service, and that's the notion of provisioned throughput.

Anders Arpteg:

Interesting. But speaking about that, you're fighting some really big competition here, and I'm thinking specifically about NVIDIA, of course. In some sense I guess it's a David versus Goliath kind of situation, where they are one of the top market-cap companies in the world these days. How is that going? How can you compete? I mean, you obviously are beating them in terms of performance in different ways, but how can you compete with these kinds of super big companies?

Hagay Lupesko:

Yeah, I think it's definitely a challenge. By the way, every day when I drive to work, I actually see the fancy NVIDIA office just on the other side of the road. Silicon Valley is funny, right: NVIDIA, AMD, Intel, Cerebras and a bunch of other companies are all around the same block. But NVIDIA, I think, is a fantastic company. I've been a customer of NVIDIA for quite some time, I actually partnered with them very closely in previous roles, and I think they've executed extremely well. So, first of all, kudos to them on building this incredible company and really powering a lot of the AI workloads. But I do think that's kind of a standard question you can ask every startup, right?

Hagay Lupesko:

It's like, how are you going to compete with this big company? They have so many more people, they have this incredible customer base and revenue, they can very easily beat you. But then you see time and again that startups are able to beat the bigger companies, and I think there are reasons for that, and there are some environments where this is more likely to happen. The first thing I'd say, having worked myself at big companies, is that the big-company advantages are also their weaknesses. When you're very large in terms of number of employees, you also, kind of by design, move more slowly, because you have so many layers of management. Making decisions is slow; you need to get alignment from so many different people, and not everybody agrees, right? People have different opinions. So, kind of by design, you move more slowly. The other thing is that having a large customer base is great for your revenue, but it's not great for your product innovation.

Anders Arpteg:

Ah, interesting.

Hagay Lupesko:

If you think about NVIDIA, for NVIDIA to change their computer architecture in a meaningful way is sometimes going to be extremely hard, because they're going to break backward compatibility of their software. That's a good point. Yeah, obviously there are ways to work around that, and I'm not saying that big companies can't innovate, I'm just saying it's more difficult. Now compare that to a startup, a private company, smaller, like Cerebras. We are much smaller.

Anders Arpteg:

How many employees are there, by the way, right now?

Hagay Lupesko:

At the moment we're a bit over 400 employees, and obviously the founders are very, very involved in the company. That's a really important thing, because founders can drive hard decisions very quickly. People tend to align much better with founders than with, you know, middle management or whatnot, so we can move very quickly. We are moving super, super fast; I think that's actually one of our superpowers as a company. And we can also innovate very quickly on our product roadmap. We don't have to worry as much about annoying customers or breaking backward compatibility, and that's really the reason we were able to innovate with this product that is highly differentiated, in particular for inference. So yes, going back to your question, it's definitely a tall order to compete with NVIDIA, but I think we are going to, just like David, use this slingshot and hit Goliath right between the eyes. We're going to use our assets, deploy them very effectively, and continue to move fast and innovate quickly, and I think we're already seeing real success with companies like the ones I mentioned.

Hagay Lupesko:

There are plenty of others that are actually choosing Cerebras hardware over NVIDIA because of the speed. And, by the way, maybe we can also chat a little bit about speed, because it might not be obvious why being so fast matters. Some people may ask, what's the average reading rate of a human? If you read text, it's slower than the pace at which a GPU spits out tokens, or words, for a standard model today. So why do you need to be so fast, over 2,000 tokens, roughly words, per second? How is that valuable? I think in the direction that AI is going, that's actually highly valuable, and that's what we're hearing from our customers. There are two major trends.

Hagay Lupesko:

One is agentic workflows, or agentic applications, and the other one is reasoning. Both of them, by the way, are super exciting. With agentic workflows, we're basically talking about people implementing AI applications and systems by chaining together multiple models and allowing these models to use tools. The way you integrate models into an application is not just by making one model API call and you're done. Maybe there's a first model that classifies the request, another model that uses different tools to gather more information, like with retrieval-augmented generation, or with search or other queries, then another model that aggregates all the different information into a response, and maybe another model that just verifies that response for style, or something like that. That's a pretty typical implementation of an agentic workflow. So you have this graph of different components, many of them LLMs, and the good thing about that workflow is that it allows you to produce a really high-quality response to the end user.
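A toy sketch of the kind of agentic workflow described above: one model call classifies the request, a tool gathers context, another call drafts the answer, and a final call checks style. The call_llm and retrieve_documents functions are stubs standing in for real inference and retrieval calls; the prompts and step names are illustrative, not a prescribed design.

```python
# Toy agentic workflow: a chain of model calls plus one tool call.
def call_llm(system: str, user: str) -> str:
    # Stub; replace with a real chat-completion call in practice.
    return f"<model output for: {system[:30]}...>"

def retrieve_documents(query: str) -> list[str]:
    # Stub for a search / RAG lookup tool.
    return ["doc snippet 1", "doc snippet 2"]

def agentic_answer(question: str) -> str:
    # Step 1: classify the request so later steps can specialize.
    category = call_llm("Classify this request into billing/tech/other.", question)

    # Step 2: gather extra context with a tool (retrieval-augmented generation).
    context = "\n".join(retrieve_documents(question))

    # Step 3: draft an answer that aggregates the retrieved information.
    draft = call_llm(f"Answer a {category} question using this context:\n{context}", question)

    # Step 4: a separate model pass verifies tone and style before returning.
    return call_llm("Rewrite the draft so it is concise and polite.", draft)

print(agentic_answer("Why was my invoice higher this month?"))
```

Note that each chained call adds another model round trip, which is exactly why the per-call latency discussed next starts to dominate the user experience.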

Anders Arpteg:

The downside of this is that latency builds up. So with multiple models, latency builds up, and that amounts to more compute as well, I guess, right?

Hagay Lupesko:

Yeah, there's compute, but if you think about the user experience, there's latency, and there is a difference. We all know this from the internet days, right: if a website loads in under 100 milliseconds versus in a few seconds, it makes users less happy, makes you lose users, et cetera.

Hagay Lupesko:

So the fact that with Cerebras you can compress the latency by an order of magnitude allows you to build interactive applications much more easily than you would with traditional GPUs. And yeah, Perplexity is an example. They launched publicly with us a couple of weeks ago, but they ran an A/B test for quite a long time with Cerebras powering their use case versus non-Cerebras, and they only launched when they got a really good indication that the business metrics were moving in the right direction thanks to the fast inference. That was a really interesting experiment. Obviously we were just powering it, they were running the experiment and such, but the outcome was interesting even for us: there are data points showing that fast inference is generating better user-retention metrics for them, et cetera.

Anders Arpteg:

Is Perplexity running their full service now on Cerebras?

Hagay Lupesko:

Not the full service, but if you sign up to Perplexity Pro, which I encourage you to do, then the Sonar model, which is the default model for Pro, is powered by Cerebras. Interesting, cool. Yeah, and you can actually feel the difference thanks to the speed. For me personally, it makes a huge difference, because I think we were all indoctrinated: we use Google, search results come back so quickly. And now if you try an alternative search engine like Perplexity, which I feel really gives you higher-quality results, if it's slow, there's something missing in the experience. So the speed, I believe, does matter, because we all got indoctrinated to get search results so quickly.

Anders Arpteg:

Amen yeah, please.

Hagay Lupesko:

Yeah, the second interesting trend, which I'm personally super excited about, I think next year the big accomplishments in AI are going to be around reasoning. And reasoning, in a nutshell, is basically allowing models to take more time to think about the question before answering, in order to produce better results. It's almost like you trade off compute, FLOPs, floating-point operations, for model quality. For me, when I started learning about it and seeing it in action, it made so much sense.

Hagay Lupesko:

I even think, you know, I have two young boys, and sometimes I help them with their math homework, and the first thing I tell them is, okay, think about the problem, write down a few options for how you would solve it, pick one of them, solve it, verify your answer, and only then write down your answer. That's like the basic thing in math, and that's exactly what we're teaching the models to do with reasoning. So I think that's a really exciting direction. We're already seeing incredible results with OpenAI's o1 and o3, with DeepSeek R1, and Anthropic recently came out with reasoning capabilities, and we'll definitely see more of that.

Anders Arpteg:

Yeah, it's a really exciting trend, it is. Right, and I know time is flying by here, but I hope you can stay on for like 10, 15 minutes more, or how much?

Hagay Lupesko:

Yeah, definitely, I can go for the next 20 minutes even. I'm also enjoying our conversation.

Anders Arpteg:

I mean, there's so many interesting topics and I love it as well, but just thinking a bit about the future of AI and what kind of workloads potentially Cerebras has to manage.

Anders Arpteg:

You mentioned agentic workflows and the ability, I guess, to have a graph of computation that you need to handle in different ways, and more compute will be necessary to handle that, while you still want to minimize the latency to get an answer. I guess reasoning is also moving in that direction; it requires more compute, and of course, if you have high throughput, that will be a big advantage compared to other competitors. What do you think, then, can Cerebras do in the coming years to ensure that the coming trends in AI are something they can handle? Are these the major trends, would you say? I'm also thinking about other things: could you move potentially into more edge-computing cases, or could we see a Tesla bot or Optimus bot having an on-device kind of Cerebras wafer, or is it mainly going to be in data centers?

Hagay Lupesko:

Yeah, that's a great question. Maybe just to wrap up on the reasoning bit, I would say that we maybe didn't get to that point, but the reason fast inference is also highly valuable for reasoning is because now you have the model spitting out thinking tokens, or just spending more time thinking before coming out with the result. And again, if you try to use something like OpenAI's o1 or even DeepSeek R1, you know that it actually takes some time until the model is done thinking. For some questions it even takes more than a minute, which is a horrible user experience. With Cerebras inference, because it's so fast, you compress that delay into a much shorter amount of time. So this is where fast inference is also highly valuable for reasoning models.

Hagay Lupesko:

Yeah. Now, in your question you raised a few really interesting things. I would say, first of all, my own view is that we've been in this regime of the training scaling laws, and we learned that the more we scale up models and the more we scale up the training data, we get models that are smarter, basically. Right, and we saw that with GPT-1, GPT-2, 3, 3.5, 4, and this would probably continue to hold. But there are two problems with that approach. The first one is that we've kind of run out of data. I mean, Ilya Sutskever had a really good Test of Time talk at the last NeurIPS, and he said something like, we have but one internet. That's true. And we know from the scaling laws that it's not enough to just increase the model size; you have to provide more data for training.

Anders Arpteg:

So that's kind of running out. Do you think synthetic data, though, can fix that? If we think of, you know, AlphaGo and AlphaZero, playing chess or Go, you can simply do self-play. And perhaps, you know, what we see in R1, they're basically generating their own data in some sense. Do you think that will fix it?

Hagay Lupesko:

I think that's definitely one way to overcome this, and more broadly, I would say RL is the answer, rather than the pre-training methods we've used so far, which are about generating or predicting the next token. So yes, synthetic data is one approach, but I think it's more around reinforcement learning, and I actually think that's the path to getting to intelligence that exceeds human intelligence. The paradigm where we use data generated by humans, or preferences from humans, kind of puts a ceiling on the intelligence of these models, right? How can you get better than humans if all you do is train your models on human-generated data? RL is the path to exceed that, and I think the experiments with AlphaGo kind of show that as well.

Hagay Lupesko:

But going back to my point on scaling, the second big problem, and practitioners like me, right, I'm not a researcher, I'm busy taking AI systems, building them, and deploying them in production to generate business value: it's just not practical to deploy models that are, say, one trillion parameters. In my view, that's practical for a very small number of use cases; it's just too expensive. So even though scaling the models to be larger will probably increase model quality, the cost of serving them is going to be so high that it's not very practical. And all these mini models that we're seeing everywhere are smaller, distilled versions of larger models, and I think that's going to be the path.

Anders Arpteg:

Do you subscribe to, you know, we've said on this podcast a number of times that the future will be a small set of super frontier AI models, or god models or whatever you'd like to call them, and then a huge list of potentially open-sourced, smaller, especially fine-tuned models for specific use cases, in the thousands or millions of them. Do you think that will potentially be the future?

Hagay Lupesko:

I have to admit that I was previously more in the camp of fine-tuning custom models being the way to go. I've been less in that camp recently, because I've seen a lot of data and evidence that models can be customized without fine-tuning. You can actually customize them by providing the right context in the prompt, and if the model is good enough, it will learn from that, what's called in-context learning. You give it the examples, the data, in the prompt, and the model actually produces good results. There's quite a bit of research at this point showing that you can get similar quality with that method compared to fine-tuning.
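A small sketch of the in-context-learning idea described here: instead of fine-tuning, the customization lives in the prompt as a few labeled examples that the model is expected to imitate. The example data and the prompt format are purely illustrative assumptions.

```python
# Few-shot (in-context learning) prompt construction: no fine-tuning,
# the task is demonstrated by labeled examples inside the prompt itself.
FEW_SHOT_EXAMPLES = [
    ("The checkout page crashes when I pay.", "bug_report"),
    ("Can you add a dark mode?", "feature_request"),
    ("How do I export my data?", "question"),
]

def build_prompt(new_message: str) -> str:
    # The model sees the labeled examples first, then the new input,
    # and is expected to continue the pattern with the right label.
    lines = ["Classify each customer message.\n"]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Message: {text}\nLabel: {label}\n")
    lines.append(f"Message: {new_message}\nLabel:")
    return "\n".join(lines)

prompt = build_prompt("The app logs me out every five minutes.")
print(prompt)  # send this as the user message to any capable chat model
```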

Anders Arpteg:

Agreed, but then they can't use the multi-trillion-parameter models for normal operations, right? Then you need to go down in size, wouldn't you?

Hagay Lupesko:

You need to go down in size, yeah, but you don't necessarily need fine-tuning or specific business data, things like that. Now, I think the other interesting thing is, I'm a big believer in open models versus closed frontier models, and I think we've been seeing a trend: the open models have been catching up. Today one of the best models out there, DeepSeek V3 and R1, is an open model, and it's comparable in many different ways to the best frontier models out there, and even exceeds them in some ways.

Hagay Lupesko:

I think we'll continue seeing investments from companies like Meta and others in great open models, and if you look at the history of software development, when there are new trends, new paradigm shifts, initially the software is bespoke and closed, and then open source tends to win in the long arc of things. AI is a bit different, because it's not just about developers sitting in front of a laptop writing code; you actually need a lot of compute to train. But there are incentives for some well-funded players like Meta to build open models, and I suspect the economic incentive for training closed models is going to decrease over time, because it's going to become more commoditized. But we'll see; maybe we can have a chat in two or three years and see if I was right or wrong.

Anders Arpteg:

Let us disagree there a bit and see if we can have a discussion about this.

Anders Arpteg:

But if we take Grok 3, you know, from xAI, they've previously been very, very pro open-sourcing models, and now Elon is basically saying, no, sorry, we won't open-source the latest version, only the second-to-latest model, because of the fear of, like, DeepSeek just copying everything, distilling from it, and then they lose all the money and investments they've put into it.

Anders Arpteg:

So they can't really open-source the latest one, potentially for commercial reasons and the competitiveness we have now in the AI race. And then we also have regulation in place that says, if we actually release an open-source version of this and people start using it, we may become liable in some sense, and that could cause a lot of issues as well. Do you still think the real frontier models will stay open? Even Meta, we've already seen xAI backing down from it. Do you think Meta will keep trying to open-source their models, or would they also potentially start to lock down the frontier models in the future?

Hagay Lupesko:

I think so. It's a bit hard to say, especially in the field of AI, where there are new innovations and new directions all the time, so it's really hard to say. But if you look at what has happened over the last few years, the gap between open and closed models has actually been closing very rapidly. So there probably will still be closed frontier models for a while; I just think the economic incentive for companies to build and train these frontier models is going to decrease, especially as there are more and more open models that are good. Now, it could be that there will be a small set of frontier models that are really good, and then the open models are like three or four months behind, which is kind of where we are now, maybe we've even already passed that point. And, by the way, that's a world I'm much happier about, because I think it levels the playing field much more, it allows companies to innovate much more, and I think really the next innovations in AI that I want to see are at the application level.

Hagay Lupesko:

I mean, models are nice and they're great and they're fun to play with, especially for me, I'm an engineer, I love that stuff. But where is the massive economic value from AI, to all of us, to mankind? We're starting to see that coming at the application level, where people are building fantastic applications on top of these models. The other cool thing I'll say, to your earlier point that maybe there'll be a lot of models on the edge: I do think we'll see more models on the edge, but I think those models will learn to collaborate with big, smart models in the cloud. That's what we're going to see. There's already a lot of interesting work happening on that front, and we'll see more of it.

Anders Arpteg:

I know we have maybe five minutes left, so I'm trying to really manage time here. And I know that you're also interested in the whole geopolitical situation that we're having with the US and China.

Goran Cvetanovski:

But can I ask a question? Because, thank you, Hagay, it's been a pleasure. I'm looking forward to having you here live one day. Hagay is becoming our Silicon Valley correspondent for the AIAW Podcast, so I'm super happy to have you here. Talking about that, I just had one question, actually. Tell us, what is the gossip in town, or what topics are currently being discussed in Silicon Valley and around there? What are the hottest trends?

Hagay Lupesko:

I think there are a bunch of them. AI is kind of front and center everywhere, I would say, and the competition is definitely heating up in terms of the models, who is producing the best models out there. I think Elon with xAI did really well. Maybe it's not that surprising for people who've seen Elon's track record, but the fact that xAI, within such a short amount of time, built a huge AI data center and was able to train a very, very good model, I'd say, when I looked at the benchmarks and played with it just a bit, it's definitely at the front line of frontier models. I think that's very impressive, and it also really heats up the competition, because there's also this personal rivalry between Sam Altman of OpenAI and Elon of xAI.

Anders Arpteg:

And Larry Page and Google and whatnot, so yeah, yeah.

Hagay Lupesko:

So all of that is really heating up, and that's one thing that is front and center. The other thing is there's still, I'd say, a debate about whether all these investments in AI are a bubble that is about to burst, or whether we're going to see actual economic success, business success, out of it. I really think that, again, for us as engineers, people in the field, of course we're excited about it; we're a little bit in an echo chamber. If you take a step back: so many investments have been made, in particular in AI compute, tens of billions of dollars across so many companies, and you can see that in NVIDIA's stock price. But the key question is, where is the revenue for all these companies that are building?

Anders Arpteg:

And the 500 billion investment in Stargate and now 200 billion in Europe as well.

Goran Cvetanovski:

But for me, if we ask the question: if AI disappeared today, do you think nobody would care? Because if I look at Western society right now, I think even the kids have started using AI.

Anders Arpteg:

I think they assume it's everywhere.

Goran Cvetanovski:

Spotify, we have it. Google, and everywhere, right? We have it everywhere; right now it's literally everywhere. We're just talking about AI, but if it goes away, some big things will stop working.

Anders Arpteg:

Yeah, exactly.

Goran Cvetanovski:

So that is the question: if it goes away, will we even notice? And I think we will.

Anders Arpteg:

And that is the case yeah.

Goran Cvetanovski:

So I don't think it's a bubble. It's just that now we are moving a little bit more into vertical AI, more specific AI modules that can help with different things, and agentic AI.

Anders Arpteg:

Right now you have operators and stuff. But you know, since we have a few minutes, I'd just like to continue a bit here and say: imagine that we have an AGI future. You can think about these amazing 500-billion-dollar investments now in infrastructure and whatnot as potentially a playground for an AGI. Some people believe that the companies and the people that have access to an AGI system will have extreme economic power, and that is potentially what they're seeing, and that's why, potentially, we're seeing these kinds of investments. What do you think about that? What's your thinking about AGI and the impact it will have on society?

Hagay Lupesko:

So I think, you know, largely you can classify this: there are people who are in the optimistic camp with AI and those in the pessimistic camp. It reminds me of a saying. There was a famous Israeli politician, who has passed away by now, named Shimon Peres, and he used to say: both optimists and pessimists die the same way, but they live differently.

Hagay Lupesko:

And he said: so I choose to be an optimist. Oh, that's a really good point. I really agree with that. I mean, you can look at this thing through different lenses. I really feel that the potential positive impact of AI on everyone's lives is just immense, immensely positive, and I want to believe that the human race will know how to manage the risks of AI. I'm more worried about other humans using it in the wrong way than about AI taking over. That's my main concern.

Hagay Lupesko:

So I'm an optimist and I want to believe the future will be fantastic and I want to believe we'll be smart and capable enough to manage the risks well.

Anders Arpteg:

Well said. I couldn't have said it better myself. I'm very, very glad to hear your thoughts about that, Hagay. It's been a true pleasure to have you on the podcast, and I'm very happy to hear not only about the success that Cerebras is having, but also your thinking about Silicon Valley, how the different companies work, and their cultures. I would love to speak more about the speed that Tesla and the other Elon companies have, but we got a taste of that as well, which I think is so interesting. So thank you so much again for joining us, and I hope we can meet up soon again and have another talk about what's hot in Silicon Valley.

Hagay Lupesko:

Awesome, it's been a pleasure. Thank you guys for having me, and next time maybe we do this in person.

Anders Arpteg:

Yes, I would love that. We'd love that. Thank you so much. Have a great day. Bye-bye.
