Future Ventures: Scaling with Clarity

Future Ventures: Scaling with Clarity is the podcast for founders, operators, and investors who are building companies worth owning for the long term, and who need to think clearly about capital, structure, strategy, and growth to get there.

Each episode cuts through the noise around scaling: how to structure a deal, how to position a business for institutional capital, how to build operational leverage without losing control, and how to make the high-stakes decisions that compound in value long after the moment has passed.

Hosted by Maxim Atanassov — a four-time founder and the Managing Partner of Future Ventures Corp. Since 2018, FVC has invested in, incubated, and scaled companies across sectors — with a focus on platform opportunities that compound in value. Maxim's background spans executive leadership inside Canada's largest energy companies and senior advisory at Deloitte and EY. He's a CPA-CA who has sat at the table where capital gets deployed, governance gets built, and hard decisions get made. Now he helps founders get there faster.

New episodes every week. Subscribe wherever you listen.

Nikola Borisov — Why the Real AI Battle Isn’t Training—It’s Deployment | Future Ventures Podcast Ep. 023

May 06, 2026 • Maxim Atanassov

Nikola Borisov is CEO and co-founder of DeepInfra, offering open-source models like DeepSeek, Llama, Kimi, GLM, and GPT-OSS via APIs. He previously scaled IMO Messenger to over 200 million users, handling up to a million new users daily with cheaper self-built infrastructure. A Northwestern CS grad and Bulgarian programming veteran, he learned that distributed systems succeed at the margins. This conversation is crucial because AI inference—execution—is the silent battleground of the AI economy, yet few discuss the compute layer, which now limits access outside big labs. Nikola applies his hyperscale infrastructure knowledge to compute-intensive workloads. For those in AI, understanding this layer is vital.

Topics Covered

From Sofia to scale — How Nikola progressed from Bulgarian programming contests to scaling IMO Messenger's backend to 200M+ monthly users, teaching him about doing more with less.
Why the bet is on open source — The strategic and structural case for hosting open-source models, why a startup like DeepInfra had no realistic alternative, and why the US has an open-source gap worth worrying about
The supply crunch nobody is pricing in — Demand for AI compute has roughly 4x'd in the last four months, driven mostly by coding agents finally crossing the line from "useful sometimes" to "useful always." Nikola explains why this is a GPU shortage now, not a demand problem.
How Nikola actually thinks about optimization — Cache token pricing is the key cost lever most teams overlook, especially on agentic workloads. We discuss why inference hardware is diverging from training hardware and Nikola's habit of the build-then-measure loop since IMO days.

Key Insights

AI's bottleneck is supply, not demand. Coding agents alone — Claude Code, Cursor, the rest — have driven token usage up roughly 4x in the last four months. There aren't enough GPUs to go around, and the big labs are eating most of the available capacity. Smaller buyers are getting priced out of the market entirely.
Cache token pricing is the cost lever most teams overlook. Agentic workflows reuse massive amounts of context across tool-calling loops. Pricing — and architecting around — cached tokens separately is what separates teams that can afford to scale agents from those that can't.
Open source vs. closed source won't end 100-to-zero. Nikola thinks we land at roughly 80/20 — the question is just which side gets the 80. His reasoning: information leaks, engineers move between labs, and the Anthropic GitHub commit that exposed model details earlier this year is exactly the kind of thing that's going to keep happening. You can't really isolate a model from the outside world for long.

Links

DeepInfra: https://deepinfra.com/
Nikola Borisov on LinkedIn: https://www.linkedin.com/in/nikola-borisov
Future Ventures Corp: https://ca.linkedin.com/company/future-ventures-corp
Maxim Atanassov on LinkedIn: https://ca.linkedin.com/in/maxim-atanassov

About the Guest

Nikola Borisov is the CEO and co-founder of DeepInfra, an inference-only AI cloud that hosts open-source models like DeepSeek, Llama, and GPT-OSS for developers and enterprises. Before DeepInfra, he led backend infrastructure at IMO Messenger, scaling the platform past 200 million monthly users on infrastructure his team built and ran themselves — at a fraction of what comparable cloud-native competitors were spending. He's a computer science graduate of Northwestern University with over a decade of experience building distributed systems at scale.

SPEAKER_01 0:01

Welcome to the Future Ventures podcast on scaling with uh clarity. Today we are joined by Nikola Borisov. Nikola is the CEO and co-founder of DeepInfra, an AI infrastructure company focused on making powerful open source models accessible through simple, cost-efficient APIs. With over a decade of experience building distributed systems, Nikola previously led back-end infrastructure at imo.im, scaling it to over 200 million monthly users. Today he's applying that same system level thinking to one of the biggest bottlenecks in AI, inference at scale, helping developers and companies deploy models faster, cheaper, and with more control. Welcome to the pod, Nikola.

SPEAKER_00 0:55

Thank you. Thanks so much for having me, Maxim.

SPEAKER_01 0:58

It's my absolute pleasure. I'm assuming that you're a fellow countryman, uh, based on the name.

SPEAKER_00 1:06

I'm from Bulgaria. Okay, same here. Very good. It's it's great to meet you. I was I was gonna say something that you know based on our last names, that we probably are from Bulgaria.

SPEAKER_01 1:21

Um, whereabouts are you from?

SPEAKER_00 1:23

So I was born in Sofia and then I grew up there until uh you know I did high school in Bulgaria, and uh I I got involved in like programming competitions. Um that's partly how I managed to kind of get accepted into US university. Uh I I came to study at Northwestern.

SPEAKER_01 1:46

Uh oh wow, that's an amazing school.

SPEAKER_00 1:48

It it was a great experience for me. I really loved it. Uh it's like first time living abroad. Um excellent program. I had a lot of fun. Um I I continue to do some programming competitions at Northwestern. Yeah. Did some internships in Seattle over the summers. Um one at Microsoft and and two internships at a startup. Okay. And I really love working in a startup environment. I I like the scope. I like the I guess the impact. And then that kind of guided my decision later to join IMO Messenger right after school.

SPEAKER_01 2:34

Okay, okay. Um that's that's a fantastic journey. Um, any anything that's what was the hardest part of the entire process? And um, how old were you when you came to the US? 18, 19, 20?

SPEAKER_00 2:48

Yeah, I was like around 19. Uh it's not easy, like you just show up in a new country, and uh, you know, before that you've always lived at home with your with your parents. It's like new country, like new language, yeah, new culture. It's not an easy thing, but you know, I just uh you know, just just go to classes, uh meet friends, work hard. And you know, I I was fortunate that like one of my classmates from from my same high school was also in the same school, so you kind of get like a little bit you don't feel like completely foreign.

SPEAKER_01 3:36

That's awesome. Um, was he or she there before you and kind of showed you the showed you the way?

SPEAKER_00 3:42

Yeah, it's it's a great story. It's actually one of my best friends from from high school. We both like sat next to each other and did like programming together. And then he got accepted at Northwestern like a year before me. So I basically did one year in in Sofia University, yeah. And uh, you know, he said great things about the school, so I applied the next year and got in, and I was roommate with him like all throughout school, so it was actually a great experience. Like, I I got to like be roommates with like my best friend from high school, and and we both studied computer science at the school, so it that made the transition a lot easier.

SPEAKER_01 4:25

Is he the co-founder?

SPEAKER_00 4:28

Uh no.

SPEAKER_01 4:29

Uh okay, there's a story here.

SPEAKER_00 4:32

Well, I I would love to work with him. We just took a little different path. Like, I joined a startup right after school and he went and worked at Facebook, which is probably uh you know a great choice. Like he was there when I think there were 200 people at Facebook at the time. Oh wow, wow. Um, so he did well. He founded a startup after he left Facebook with like some of his colleagues from Facebook, at the time I was you know still working at IMO Messenger. Um, so you know, timing was not like on our side. You kind of need to align on these things. Um but you know, he's a great person to to be a co-founder with, I think, you know. Excellent guy.

SPEAKER_01 5:21

Awesome. So walk me through the path in terms of how did you decide to uh to launch the current company, so like DeepInfra, like how, when, why.

SPEAKER_00 5:33

Yeah, let me tell you a little more about my journey, like career-wise. So, as I said, I did two internships at a startup company, um, and then one at Microsoft. The Microsoft internship was like great, but I I just felt like startups are more like the place that I want to try. And my mindset was always like, I'm gonna try this, and if it doesn't work, I can always go back and work for one of the big tech companies. Like there's no like long-term loss or risk of of trying a startup. And I was really fortunate. Like, one of the co-founders of IMO Messenger was one of the first uh 10 engineers at Google. His name is Georges Harik. Um at Google, he was in charge of the Gmail product. Uh he started the AdSense business for them. Oh wow. Uh, I think he was like employee number seven. And so I got to kind of join his kind of new venture after he like left Google. And he both had like seen all this kind of stuff happen and was just a really smart person, really had really clear, like long-term thinking and strategic thinking. And so I you know, I was brand new out of college. I just like came in and like worked hard. Um I got to be in charge of our backend and infrastructure at some point, and and then our service, the messenger, started kind of gaining a lot of traction. Okay. So we we built um um a calling and video calling app that worked really well in the pretty low bandwidth networks of 2G and 3G at the time where like mobile was taking off, like you know, the billions of people around the world got their first computing device, and it was a phone. They had had probably like uh not a smartphone before, and so our app made it really easy for people in India, Pakistan, Bangladesh, and the Middle East, just all around the world to like do voice and video calls with each other. Yeah, an interesting story I just want to mention here. Like, at the time I was like kind of skeptical about our product. 
I was like, man, you know, Skype is this beast, like they already have hundreds of millions of users using Skype. And you know, how are we in this small startup of like 20 people going to outcompete, like that already established product? Yeah, but I was really wrong. Like, all you need is like a couple smart people and like a brand new space that the other people are not kind of native to, and so like Skype just couldn't translate to mobile phones, it couldn't work well on like the slow internet and the limited resources on the device, and we just had a small team of 20 people that managed to kind of build this really fast-growing product. Like we went, in about three years, from like zero users on our network to over 200 million.

SPEAKER_01 9:06

So, a couple of questions there, Nikola. Um, so I'm assuming that you guys built IMO as uh uh uh mobile first rather than Skype going from a web app to uh uh a mobile and kind of like what what what were the things that were breaking? Like from zero to 200 million, that that's like an explosive growth.

SPEAKER_00 9:25

So many things were breaking. It's just really even hard to, the way to think about it is like we you know we wrote the first version rather naively, like, you know, we just read a book about how to build a messaging network and like used our own kind of background and just wrote something, yeah, and shipped it. And you know, it started gaining users, and then we had to basically scale it up. And okay, it was basically these rescaling events where we would double or 4x the amount of computing power dedicated to each service. Okay, I think it helped that you know our co-founders had experience at companies like Google, so they had the right architecture, like we were using microservices all the way back in 2010, and so we just built a bunch of microservices, and each engineer was responsible for a couple of these, and yeah, and roughly every six weeks we had doubled the user amount, so that's really hard to think about, but you'd have to be prepared to like make sure your service can scale its capacity in six weeks, and then do it again. And often it was like a round-robin exercise. We would, like, look at a service, make sure we make it 10 times better, yeah. Yeah, and then look at the next service, make sure that one is 10 times better and capable of handling 10 times as much traffic, and then you go around, and like in about six months, you're back to the first service, and you have to like now think how to make it another 10 times better. I think the thing that we really learned from this exercise and and running this backend is you can do a lot with not too many people if you know the people are really good. 
And the other thing we learned kind of by force is how to build this using our own servers and our own infrastructure because we did a lot of voice and video calls, and so we couldn't have really built this application if we were using like uh AWS, like the price of bandwidth in AWS is just prohibitive to applications like this. Uh, and we were forced to basically run on our own infrastructure. When I kind of graduated school, I didn't think I'll be doing this, but I was in charge of signing the contracts with the data centers and the ISPs and negotiating with them. Yeah, uh, and I was in charge of buying the servers and kind of making the servers into uh an internal cloud that we can use for the rest of the services together with my team and co-founders. Um and so the story we learned there is how to I guess build an efficient service. I think we had 200 million users, but like how much we spent on our servers and infrastructure was like you know, an order of magnitude less than like a similar-style company, let's say like Snapchat at the time had maybe twice as many users, but they spent like over 10 times more on you know just cloud bills.
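
To put rough numbers on the six-week doubling described above, here is a minimal sketch. It assumes clean exponential growth with a fixed doubling period, which is a simplification of what any real service sees; the function names are made up for illustration.

```python
import math


def growth_factor(weeks: float, doubling_weeks: float = 6.0) -> float:
    """Multiplicative load growth after `weeks`, if load doubles every `doubling_weeks`."""
    return 2.0 ** (weeks / doubling_weeks)


def weeks_until_saturated(headroom: float, doubling_weeks: float = 6.0) -> float:
    """Weeks until a service with `headroom`x spare capacity is fully used up."""
    return doubling_weeks * math.log2(headroom)


if __name__ == "__main__":
    # Half a year (26 weeks) of six-week doublings.
    print(f"load growth over 26 weeks: {growth_factor(26):.1f}x")
    # How long a 10x capacity improvement lasts under the same growth.
    print(f"10x headroom lasts about {weeks_until_saturated(10):.1f} weeks")
```

Under these assumptions a 10x improvement is consumed in roughly 20 weeks, which is in the same ballpark as the "back to the first service in about six months" cadence in the conversation.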

SPEAKER_01 12:53

And and how do you guys get the network effect? I mean, obviously, uh was it that you were the best tools in the low bandwidth space, or how do you get 200 million? How do you get how do you get the first forget the 200? How do you get the first 10,000, 20,000, 50,000 uh users?

SPEAKER_00 13:12

It that's an uh interesting story. Like we had a different product before, and we kind of got it into this network. Like our first product when I joined the company was uh a multi-messenger web application. Okay, so you you could log in into Facebook Messenger and Google, Google chat and MSN and AOL and ICQ, if you remember. Okay, yeah, yeah. And you can log in into all these protocols in one website, cool, and then you can have like all your friends that are on the different networks in a single interface, and it was a web interface. But you know, this was a reasonable product, it was kind of for power users, it was on the web, and it just did not have the same scale as a mobile application, especially in 2012 to like 15, as like billions of people got like their first computing device. Yeah, and so we kind of pivoted towards like having our own network. We we had like around a million users using this other product, um, accumulated like over like some period of time, and then we we basically co-marketed the new network to them, and we we got kind of our start like this, but at the end of the day, we made a simple app that like if you and I installed it on our phones and we had each other's phone numbers before, I could video call you, yeah. And Skype had like their system was based on like usernames and adding friends and approving the relationship, and so yeah, our app was just a lot simpler for these people that just use the phone book, right? Like they had yeah, yeah, um uh they had an old Nokia phone, they got like some cheap Android phone, and they plug in the SIM card, and we really focused on our data. I think that's another thing that the founders got from their Google heritage. Like we collected a lot of telemetry and we based our decisions on the telemetry, so we would see how much bandwidth the users had, and we kind of made sure that our calls work well in this very low bandwidth environment, yeah. 
You know, it it grew really well until WhatsApp essentially added voice and video calls into their app. That was always kind of the problem. Like WhatsApp was the thing that beat all the other existing protocols, but they did not have voice and video calls for a while. And and once they did add theirs, even though like I believe ours worked like better, I think the network effect that they had was kind of larger, and so that kind of flattened our growth rate.

SPEAKER_01 16:12

Got it. And and what I'm assuming this was before Facebook bought or Meta bought WhatsApp.

SPEAKER_00 16:18

It was kind of basically when when Meta bought them, they did not have voice and video calls, and it took even a couple years after that for them to add voice and video calls. I see. Um, and so it was actually what made it hard to compete is like Facebook. I don't like this. I think they used a bit their monopoly position, like they would use revenue from the Facebook to keep WhatsApp free. And we couldn't really keep our app completely free. We had to do like ads, and ads made the product not as good as the product without ads. Yeah. So that made it hard to kind of compete in that area.

SPEAKER_01 17:03

Yeah. Makes sense. Yeah, it's it's it's really hard to compete with deep pockets unless you have some kind of like proprietary moat or technology or something that gives you an edge.

SPEAKER_00 17:14

Yeah. But uh, I you know, I treat this as like uh, and I think everybody should treat all of these things as kind of a learning exercise, like you you learn a lot doing something, you meet the right people too. Yeah, without which you can't really build the next thing. And you know, when when we decided to start DeepInfra, we got a lot of advice from the same uh you know person, Georges Harik, and and the other co-founder of IMO, Ralph Harik. You know, Georges is really deeply involved in like AI research, and so he was like telling us how the AI is coming, and we should try to build a product that serves that purpose. And and given our expertise in the I guess the infrastructure plus scalable backend, for me and the team, um I felt like we can't help too much on the training side, it's too big of a kind of thing to approach initially. Um we really felt that, my mental model is like this, right? Like the models kind of have to behave in some ways somewhat like humans. Like we study for like 10 years, and then we use 40 years to like apply whatever we want. It is continuous learning, but you know, you spend less of your time now learning and more of your time like doing. And so we always thought like inference is gonna be really, really big.

SPEAKER_01 18:57

Can you for our listeners and viewers, can you just explain what is inference in in plain terms?

SPEAKER_00 19:03

Yeah, yes, so inference comes from like the word kind of to predict something, and so the way the way the models get trained is they they kind of study, they've been trained for a while, yeah, and then you actually ask them to to give you a prediction or give you an answer, and so yeah, in simple terms to think about what inference is, it's like it's like running the AI models, yeah. It's like hosting them. The AI models are kind of like a new class of applications, right? Like before that, we would write a web application and we would need a server on which to run it. In a very similar way, I think the AI models, when they finish training, they need somewhere to run. At DeepInfra, we're building uh an AI inference cloud, like a cloud that's really focused on just inference. So it's kind of like a neocloud, but in very many ways quite different than the other neoclouds, because our main business is not to rent servers with GPUs, it's for us to run AI models and give our customers a simple interface into the models.

SPEAKER_01 20:24

So okay, can we just double-click on this? And so, what kind of inference models are you running? Like, are you like, or is it all like decision tree and random forest and linear regression? Uh neural net, kind of like like help me help me understand a little better, kind of like uh, because if you're not renting space, if you're not renting capacity, kind of like how are you operating?

SPEAKER_00 20:46

Yeah, we are running mostly like deep learning models, like okay, large language models, but also diffusion models that generate images and videos, audio models that can either generate audio or or generate text out of audio. Okay. So, you know, we have gotten to an age where we have these deep learning models. They're not like written by hand, they just get get fed a lot of data and they start to understand things and are able to kind of predict, infer like information and results. And so we're focused on this category of models because they have something in common, they require an enormous amount of computing power to run. Yeah, that is. For your listeners, I want to try to explain something that is just hard to think about. You know, when you ask this large language model to generate something for you, it generates the the tokens, the the words one by one. And each token, each word that it generates takes like hundreds of billions of multiplications, like operations that the computer needs to do. And that's like a million times more than like what we did before in terms of let's say if you run a database or a web server, like fetching something from a database or returning like a web page is like a million times less compute intensive than this.
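
The "hundreds of billions of operations per token" figure can be sanity-checked with a common rule of thumb: a transformer forward pass costs about 2 floating-point operations (one multiply, one add) per active parameter per generated token. The sketch below applies that rule; the model size, hardware throughput, and utilization are hypothetical numbers chosen for illustration, not figures from the conversation.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule of thumb: ~2 FLOPs (multiply + add) per active parameter per token."""
    return 2.0 * active_params


def compute_bound_tokens_per_second(
    active_params: float, hardware_flops: float, utilization: float
) -> float:
    """Upper bound on tokens/s from arithmetic alone.

    Ignores memory bandwidth, batching, and attention over long contexts,
    which often dominate real decoding speed.
    """
    return hardware_flops * utilization / forward_flops_per_token(active_params)


if __name__ == "__main__":
    params = 100e9  # hypothetical model with 100B active parameters
    print(f"FLOPs per token: {forward_flops_per_token(params):.1e}")
    # Hypothetical accelerator: 1e15 FLOP/s peak, 30% achieved utilization.
    print(f"tokens/s bound: {compute_bound_tokens_per_second(params, 1e15, 0.3):.0f}")
```

With these assumed numbers, each token costs about 2e11 operations, which lines up with the order of magnitude quoted above.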

SPEAKER_02 22:31

Yeah, yeah.

SPEAKER_00 22:34

So so these kind of AI models basically require the appropriate hardware for you to even run them efficiently. And and we specialize in that part. Like we're always thinking about like what is if we only want to do inference, yeah, we want to host these open source AI models and let our customers basically use the intelligence out of these models in a simple API way. Yeah, what do we do? How do we build the data centers? How do we structure the clusters? How do we run the software and orchestrate it?

SPEAKER_01 23:12

So two or three questions come to mind. So with the models, is there reinforcement learning, or is it like a multi-tenant environment? Uh, if like if you're training the models, if if a client is renting them like the model, kind of like how do you like let's say it's like a real example, like uh working with a financial institution that they want to have their own model on-site, on-prem, so they can use like diff different AI models to run loan applications, determine what needs review and what doesn't. Kind of like if if it's in the cloud or kind of inference model, like how do you protect the information of a client that's renting the model from you?

SPEAKER_00 23:51

Yeah. So let me double-click on some of these things. Like, we are really focused on hosting these open source AI models, models like DeepSeek, models like GPT-OSS, Kimi, GLM. There's a a good selection of pretty nice, pretty intelligent open source AI models that are essentially you know being trained and their weights are published on the internet, and you know, anyone can take and run them. Yeah. Now, I do agree a lot of organizations are quite sensitive about their data. And so, what we are building is an inference-only cloud. We're not trying to train with other people's data. So, we have a very simple privacy policy and contracts where we would just do the inference for you, we would run the model for you on our clusters of compute, okay, we will give you back the results and we would not store your data on our platform and we will not resell it to anyone and so on. So, I feel like there are definitely a set of customers who would want the models to run inside their own like data centers and so on. Yep, but it would be really hard for us to help them just because like this is like a less repeatable business. Yeah, yeah, like you have to customize this for every particular customer. Yeah, I feel like some of these things could be solved with simple contractual obligations, right? Like even banks, they end up buying computers from let's say Dell. And so they trust that Dell will sell them a computer that will keep stuff private, and so for the moment, we have really focused on building our own cloud that is optimized to do just inference and being pretty clear contractually that we are not doing anything else with the data other than running the inference. And I think that's enough for a good portion of the I guess demand out there. It's not able to cover everything for sure, but uh you have to pick something and kind of really focus on it.

SPEAKER_01 26:10

So uh an honest question. I mean, obviously, you're doubling down and betting on the inference layer, but um Carol, if you if you if you if you listen to the pundits, the pundits believe that the application layer is is where uh the the money's gonna come. What what makes you uh uh so bullish on the inference layer versus the application layer?

SPEAKER_00 26:36

Well, I don't actually disagree with people. Okay, I feel like let me put it this way. I feel like if you figure out how to make uh let's say an AI doctor, you can make this really cheap and charge a decent amount of money for it. But we're really happy to run the inference for you. And honestly, it's an infrastructure problem. Like you you can't make like super large margins doing this, not like SaaS margins, but I do think some of these things are going away anyway. Like in a world where it's really easy to write software, yeah, you will need less of this kind of software tools that you could charge a lot for. So I do think like people in the application layers are gonna make good amount of money. I think like on the infrastructure layers, as long as they make good amounts of money, we will also make a reasonable amount of return as well. You don't have to, you know, if you make an application, you kind of have to get it really, really, really right and win your market. Yeah, yeah. And we are building a platform where as long as there's many applications that are successful that need tokens, we will have demand for like our infrastructure and we'll it's kind of a cycle, like the more tokens we do, the better we learn how to do the tokens. Yeah, yeah, yeah. And in essence, like we're building an intelligence factory, so we generate intelligence out of like our data centers that goes up to these applications and powers like all these use cases.

SPEAKER_01 28:20

That's amazing. Um, so so I mean, in essence, you're almost like a utility, you you provide a service. Now, I'm I'm curious, like a lot of people are talking about kind of like how to reduce token usage and stuff that I'm kind of like from an inference cost perspective, without kind of giving away the secrets. What are the best strategies to reduce the inference usage or inference cost?

SPEAKER_00 28:42

Yeah, I I'm really happy to talk about this. Like, I think the the most important thing that people need to like look closely into is the the price for cached tokens. And so we are one of the first kind of inference clouds where all of our models have kind of cache pricing, all of our new models have cached token pricing. Okay, and so why are the cached tokens really important? Like what ends up happening in the background is you essentially let's imagine the following use case. Like, you ask Claude Code to create something for you. What ends up happening underneath with the inference is like Claude Code asks the model what should it do? And the model reasons a little bit and returns some sort of a tool call. Says, like, I need more, I want to read this file, I want to read this web page. Then you would do whatever the model wanted to do, like fetch the file, fetch the web page, yeah, and then ask the model one more time, like what should I do now? And there's like this cycle, you keep asking the model like what to do now, yeah, given more and more context. And so you want to make sure that you don't do the work again for like the stuff you've already seen, and that's really it's not complicated, and I think like it's kind of going around the industry, but it's still not fully like understood everywhere. You have to really do well the caching of the tokens. That way you you save the cycles of your GPUs, they only have to work on like the new stuff.
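
The tool-calling loop described above can be turned into a small cost model. This is a sketch under stated assumptions: every call resends the full conversation, the provider caches the whole previously seen prefix with a 100% hit rate, and the prices and token counts are hypothetical placeholders rather than any real provider's rates.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Hypothetical prices in USD per million tokens."""
    input_per_m: float
    cached_per_m: float
    output_per_m: float


def agent_loop_cost(
    system_tokens: int,
    tool_result_tokens: int,
    output_tokens: int,
    steps: int,
    prices: Prices,
    use_cache: bool,
) -> float:
    """Total cost of an agent loop that resends its growing context every step."""
    context = system_tokens  # tokens sent as the prompt on the current step
    cached_prefix = 0        # tokens already processed on an earlier step
    total = 0.0
    for _ in range(steps):
        cached = cached_prefix if use_cache else 0
        fresh = context - cached
        total += (
            fresh * prices.input_per_m
            + cached * prices.cached_per_m
            + output_tokens * prices.output_per_m
        ) / 1e6
        # The prompt plus the model's own output is now a reusable prefix;
        # the next prompt appends the tool result as new, uncached tokens.
        cached_prefix = context + output_tokens
        context += output_tokens + tool_result_tokens
    return total


if __name__ == "__main__":
    prices = Prices(input_per_m=1.0, cached_per_m=0.1, output_per_m=3.0)
    for use_cache in (False, True):
        cost = agent_loop_cost(10_000, 2_000, 500, 20, prices, use_cache)
        print(f"use_cache={use_cache}: ${cost:.4f}")
```

With these made-up inputs, a 20-step loop costs about $0.71 without caching and about $0.14 with it, because only the newest tool result is billed at the full input rate on each step.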

SPEAKER_01 30:28

Yeah, just the incremental net new information while maintaining some of the memory that that's built or the cache that's that's that's built. Okay. Interesting.

SPEAKER_00 30:39

Um, yeah. And uh, I think one other thing I can mention is maybe we believe to do efficient inference, you kind of have to have the latest generations of chips as well. Basically, they have more memory, they're more efficient at doing this. Uh-huh. And and you have to structure them in a way where it's really, you know, your only goal is to do inference. I think the more and more time passes, the more and more these two things will diverge. Like you see NVIDIA bought Groq, right? Uh, I think more and more the the GPUs that will be sold for inference would be potentially different than the ones sold for training. So you have to pick one of these two things you want to do. You can't build one thing and kind of do well both of the things. They're different tasks. So you have to kind of, to be really efficient, you have to focus.

SPEAKER_01 31:45

And it and it makes sense. Uh, I mean, and and and you kind of see some of these, and we talk Groq, uh, like we talk GPU, TPUs for Google. Um, and and so we would see like continued specialization in in the the use. Um, do you see um kind of like from a competition perspective, like would the AWSes of the world kind of come into it and start encroaching on the on the model that you're building?

SPEAKER_00 32:12

I think for sure they've been kind of doing this since day one, to be honest. Like uh inference, I think they realize inference is important and and you know they have teams of people working on inference services that will be inside AWS. You know, I I gave you the story about like Skype and mobile. Yeah. I I am not as worried. I think our our biggest competitors would not really be, I think, AWS and Azure in the longer term. I think we'll compete with other startups that have also focused on on inference or neoclouds. Um that being said, in my mind, like it's really hard to imagine the amount of demand that is now versus how much it will be in the future.

SPEAKER_01 33:10

Like, what does the demand curve look like for you? I mean, people were saying that AI has all this negative reputation in the States, and Sam Altman's house was attacked twice, and the government showed it his house. But I see it in our companies, in portfolio companies. We use AI incessantly. And if I reflect on the last 48 hours, Claude Code has been not very good. And I don't know what's going on, but I mean, that's kind of our go-to. So my explanation is that the reason why it's not been good in the last 48 hours is that there's likely a peak in demand, or there's an increase in demand that's driving some of this.

SPEAKER_00 34:01

I think, you know, the development of AI models is really interesting, right? They keep getting better and better. And the moment they cross certain lines, the demand for them kind of explodes, right? Like, a year ago, the usage of coding agents was almost nothing compared to now. And so overall, we see a great amount of demand. It's really not a demand problem at all. Our business is kind of a supply problem. Procuring and setting up the compute hardware and scaling the service to meet the demand is really the problem, rather than demand. Something really happened, I think, in the last four months, honestly. The demand really increased, especially as OpenClaw came out. The models got better and better, but I think the orchestration layer got to a point where it's asking the right questions and framing it well to the model, and just the whole iteration. Also, the models probably got fine-tuned for this kind of workload. So things really changed in the last three, four months. We got probably four times as much traffic in the last four months. Generally, we're really bullish on the overall demand for these things. We feel like the world generally does not have enough GPUs to satisfy the demand, and basically some customers are getting priced out. There are customers who come in and are willing to pay a high price for the compute tokens, and there are a lot of use cases we could still serve tokens to at a cheaper price, but they're just not getting the priority right now because there are higher-revenue uses.

SPEAKER_01 36:28

Willingness to pay. So if we're to follow the trade, this means that the people or the companies with deeper pockets would get to benefit from this disproportionately, and we're going to see a bit of bifurcation. The people that have the capacity to consume would largely benefit from it the most.

SPEAKER_00 36:49

Yeah, I feel some of it is happening. I feel like the few big companies and labs are sucking a lot of the compute capacity out, and it makes it really hard for startups and new labs and other people to secure any amount of compute capacity. But, you know, how different things are now compared to six months ago, when people were really questioning the whole thing, and whether maybe we're in a bubble. I'm pretty bullish on the overall demand for compute in the long run. Sure, maybe some companies would overcommit to something, but I'm not worried that this compute will not get used, given the trajectory of AI.

SPEAKER_01 37:48

Yeah. No, makes sense. A question, because obviously you're making bets, and you're making this bet with a strategic mindset. Why are you betting on open source? I mean, history is on your side, but there have been a few examples when closed source systems came out on top.

SPEAKER_00 38:12

Well, you know, I want to be completely honest here. It's partly that we don't really have a choice, right? There's very little chance for a small startup like us to convince OpenAI or Claude to let us run their models on our inference-optimized cloud. So you're in the other camp. You know, they're so big they would always run their own inference. They would give it to the big three clouds, potentially, but not to a startup like DeepInfra. So from the very beginning, this was actually the biggest risk to us. I had to convince a lot of investors that it's going to be okay and there will be open source models. And you're right, the last three years have helped a lot, because when we started, the gap between the first Llamas and the OpenAI models was pretty massive, and most of our use at the time was people just trying to make something small work with an open source model. But I feel pretty fortunate that the open models have really exploded in variety, and almost every week a new one comes out. It's really hard for us. I would love to host all of them, but we don't have enough capacity to dedicate cards to every model, given that there's a lot of demand for specific models that want to take all the compute capacity. So it's kind of a balancing act, but we're really happy. One thing I want to comment on here is, you know, we do need to invest more in open source models in the US. It's kind of a gap, and it's not good structurally for the US if the top five open source models are all made in China.
Because these models are going to get used in the rest of the world, disproportionately of course, and they'll be used in many cases by US companies. And so we also try to be kind of an independent place where these things run. So if we take the DeepSeek open source model and we run it, you know, the data never leaves our data center. It comes in, we return the result, and we don't store anything. So in that sense, there's some security in the data, but it would be really nice if we had more open source models. I want to give a shout-out here: I think NVIDIA has started putting out open source models, and I think they're only going to get better at this. They definitely have the compute capacity, and I think they also have the right researchers. And I think they're strategically aligned in making sure that there are good open source models, at least a basic level of intelligence for everyone.

SPEAKER_01 41:36

So what do you think the next five or 10 years are going to look like? I mean, obviously the Anthropics and the OpenAIs are spending exorbitant amounts of money building out their models, and they're closed source models. But if I were to compare a few examples, like VHS versus Betamax, Apple versus Android, for the most part it's very rare to see a closed source system win. Now, what's your prediction in terms of how this is going to evolve in the future? Would we see the closed source models be very specific to particular use cases and the open source ones be more generic?

SPEAKER_00 42:22

No, I think all the models will be pretty generic, honestly. That notion of specializing the model, I'm not sure I buy it as much, because I feel like we have general models available that are competent in a variety of fields. Now, on your question about open source versus closed source five years from now, it's a great question. It's a bit up to the future, essentially. It can go either way. It's possible that someone at Anthropic or OpenAI comes up with something really amazing in terms of, I guess, architecture for the model or something like this. And this might stay private for a while. But I feel like people, you know, if they discover something great, they go out and start a new company doing this, and so that information gets disseminated.

SPEAKER_01 43:23

Well, I mean, particularly in the case of Anthropic, right? Like the junior engineer who made the GitHub commit that exposed some of the model already, or the new model they developed that was too powerful, and within a day somebody had hacked into it. So to me, there's no such thing as being able to protect or isolate a model from the outside world.

SPEAKER_00 43:46

So I agree with all of this, and my prediction is we can end up in one of two worlds: either 80% of the tokens get generated by closed source models and 20% by open, or the opposite, 20/80 in the other direction. And I don't see a world where it's 100 to 0 in either direction. So I feel like there will be scope for us to build, for sure, in any case. And I think down the line as we scale, a couple of years from now, hopefully we would also be allowed to run some private models, some closed source models. It's more a matter of trust. We need to get to a scale and convince the big labs that we're not going to mess up with their weights and leak them. I just like working on the problem. I think inference is very much a systems problem, very much an infrastructure problem. And it's an optimization problem. You want to create this ultimate efficiency: generate more, produce more per card per hour. It's a lot like Formula One, which I like, where they try to squeeze the final bit of power from the car and the final seconds.

SPEAKER_01 45:17

Yeah, who's your favorite team and driver?

SPEAKER_00 45:20

I've been watching Formula One for like 28 years, and I've always been a Ferrari team fan. The drivers change, but I always root for the team. Right now it's Charles. Charlie. I know Lewis is really good. Lewis Hamilton is really good, but for me it's really hard as a Ferrari fan to be like, you know, yay, Lewis, because for a while he was like, you know, the arch enemy of uh Ballettone and Mercedes, yeah, yeah, yeah.

SPEAKER_01 45:55

No, it makes sense. Um where do you stand on Michael Schumacher? Big fan.

SPEAKER_00 46:00

Big fan. Basically, when I started watching, it was two years before he started winning for Ferrari. And so I grew up basically watching him start behind Mercedes, behind McLaren at the time, and chase them all season until the last race, where they're, you know, a few points away. So I basically watched him try to chase them, and they were the underdog these first two years at Ferrari.

SPEAKER_01 46:35

I hear you. I love Formula One, and I think it was in 2012. I was working for one of the major oil and gas companies, and at that time I think we were rolling in S/4HANA. Maybe it was a little later on. And one thing that we learned at that time was that HANA is not big enough to handle the demands of Formula One teams, so they had to build proprietary data cubes and platforms to be able to ingest, because, I mean, the telemetry, I think there's, what is it, about 2,000 sensors on any F1 car, so they're just pulling in data from everywhere. So it's like, wow, this is an insane computation exercise.

SPEAKER_00 47:25

It's very much an engineering sport, I think. The drivers are great, you know, they add this human element to the cars, obviously. But I think what really wins the championships, in a big portion, is the engineering effort behind the cars. The teams are massive, and it sometimes takes years to develop things. Massive respect to the greatest engineers. Like, Adrian Newey is a rock star.

SPEAKER_01 47:55

The god, yeah. I have his book here behind me.

SPEAKER_00 47:59

Yes, I've read that book. He's just an amazing person.

SPEAKER_01 48:03

Yeah, I mean, it'll be interesting to see what happens with, uh, what was the team principal for Red Bull that got ousted and now he bought Aston Martin.

SPEAKER_00 48:14

Um yes, yes, yes. Um also bring back in Christian Horner, Christian Horner, Christian Horner, yes, yes, yes, yes.

SPEAKER_01 48:27

Because he's coming back. I mean, it's obviously been rags to riches for Red Bull in terms of what they have been able to accomplish. And especially if Max Verstappen ends up leaving Red Bull, it'd be interesting to see what comes out of it.

SPEAKER_00 48:43

Yes, they're like their own world. Like, I think Flavio Briatore came back. You know, he was with Schumacher at Benetton, and yeah.

SPEAKER_01 48:56

I can I think he went with him to F1 to to he went with him to Ferraris to screw the year, yeah.

SPEAKER_00 49:04

Flavio. Can I complain about one thing? I think the biggest thing, if someone from Formula One is watching this: we have to refuel the cars. Not refueling the cars is killing the sport. The difference between qualifying laps and race laps, when they're full of fuel, is like six, seven seconds. And the gap from the fastest to the slowest car is like a second. So imagine if we allow them to refuel: the slowest car, if it just has less fuel, will be able to overtake everyone else, and we won't need buttons for them to press. And I'm okay with some electrification, but I feel like the main thing should still be the engines. We're not going to mess up the planet if we have 20 cars that run on petrol. We can use synthetic fuel. I feel like these are the key things for them to fix the sport a little bit. Refuel.

SPEAKER_01 50:05

Yeah, I hear you. But I think the league is taking the principle of safety first.

SPEAKER_00 50:14

That's the other thing I think is a mistake. Look, it's a dangerous sport. We cannot make it 100% safe. We should use more and more technology to make refueling safer and safer, but the answer is not "let's not refuel." To be honest, if you put so much fuel in the cars and they crash in the first few laps, that is not great either.

SPEAKER_01 50:41

Yeah, so yeah, because the heavy the back end may be out of balance, yeah.

SPEAKER_00 50:47

I think we overreact, right? There were like two races where the people refueling the cars got injured. There were errors around that. It's not great that people get injured, but in general, the sport is dangerous. They're driving wheel to wheel at 300 kilometers an hour. There's inherent danger in the sport. We can't take all of it out.

SPEAKER_01 51:13

Yeah, just to bring the conversation back, I mean, where we kind of veered off was on optimization, and you see pit stops went from 10 seconds to now sub two seconds all the time. And the other thing it reminds me of is IMO. If I remember what you said, I think every four months you guys were doubling the user base. And so I'm assuming that, as we're thinking about DeepInfra, you're probably maniacally obsessed with optimization and inference cost reduction. Walk me through what your strategy looks like from an optimization perspective.

SPEAKER_00 51:58

We are really obsessed with optimizing things and with efficiency. You know, throughout the years at IMO, out of necessity, we basically optimized a lot, partly because you cannot afford not to if the users are actually doubling every six weeks. Like, if you grow around 10% a week, in about six weeks you've doubled. You start from very small numbers, but then you have this growth, and the numbers quickly get big. You cannot afford to basically buy twice as many servers every six weeks. So you have to buy more servers, you cannot make it go away with just software, but you also need to optimize the software as well. And so my advice to engineers in general is, I like the loop. I like to build something and then look at the telemetry of the thing running, look at the graphs and the numbers, see what doesn't make sense or what looks bad, and just keep iterating. You know, I add more metrics that I want to track, I let the thing run longer, and I see things that don't look as good, that I think we can do better at. And then we try to notice things that could have a large impact. So early on, we really focused on KV caching across requests. We understood that that's critical for the agents. If we don't have this caching mechanism and another competitor does, we just can't compete with them on price, given that the agent makes a big loop of requests. And so I think it's not actually some super secret sauce. You just have to have that mindset of, let's look at our metrics, look at what doesn't work. And I used to do this at the messenger too.
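[Editor's sketch] The point about KV caching across requests can be illustrated with a toy model. This is not DeepInfra's implementation; it only assumes that an agent resends its whole history on every request and that a cache lets the server skip prefill work for any prefix it has already processed. The names `PrefixCache`, `agent_loop`, `prefill_cost`, and the block size are hypothetical.

```python
from typing import List, Optional, Set, Tuple

BLOCK = 4  # tokens per cache block (hypothetical; chosen small for the demo)


class PrefixCache:
    """Toy stand-in for a cross-request KV cache.

    It only remembers which block-aligned token prefixes were seen before;
    a real server would store the attention keys/values for those blocks.
    """

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, ...]] = set()

    def cached_tokens(self, tokens: List[int]) -> int:
        """Return the length of the longest block-aligned prefix already cached."""
        hit = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if tuple(tokens[:end]) in self._seen:
                hit = end
            else:
                break
        return hit

    def insert(self, tokens: List[int]) -> None:
        """Record every block-aligned prefix of this request."""
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self._seen.add(tuple(tokens[:end]))


def agent_loop(turns: int, system_len: int, step_len: int) -> List[List[int]]:
    """Build requests where each turn resends the full history plus new tokens."""
    history = list(range(system_len))
    requests = []
    for _ in range(turns):
        start = len(history)
        history = history + list(range(start, start + step_len))
        requests.append(list(history))
    return requests


def prefill_cost(requests: List[List[int]], cache: Optional[PrefixCache] = None) -> int:
    """Count prompt tokens that must actually be processed across all requests."""
    total = 0
    for tokens in requests:
        hit = cache.cached_tokens(tokens) if cache is not None else 0
        total += len(tokens) - hit
        if cache is not None:
            cache.insert(tokens)
    return total


reqs = agent_loop(turns=5, system_len=8, step_len=4)
no_cache = prefill_cost(reqs)                   # every request pays for its full prompt
with_cache = prefill_cost(reqs, PrefixCache())  # only the new suffix is paid for
print(no_cache, with_cache)
```

In this toy run the uncached loop processes 100 prompt tokens and the cached one 28, and the gap widens as the loop gets longer, which is the pricing pressure described above.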
I would basically look at our dashboards, look at services, and look for places where I felt like we should be able to do a lot better. Funny story: at the messenger, our registration servers once crashed because a TV show aired in Colombia that featured IMO. And it was kind of embarrassing, because there weren't that many registrations, but they just kind of happened all at once, and our registration service was not that efficient. We sat down after this event and said, you know, we must be able to handle this many. Why is it taking this long per user? We both increased the number of machines on which we would run the registration service and really looked at all the pieces that go into registration, like sending the SMS, fetching the phone book for the customer, and processing the contacts, and just worked on each part until we had a much better system. And at peak we used to register like a million new users a day. So it's just an iteration process.
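[Editor's sketch] As a rough check on the growth arithmetic earlier in this answer, assuming simple weekly compounding: 10% a week doubles in a little over seven weeks, and doubling in exactly six weeks would take about 12% a week. Either way the point stands that hardware purchases alone cannot keep up. The function names are illustrative.

```python
import math


def doubling_time(rate_per_period: float) -> float:
    """Periods needed to double at a fixed compound growth rate."""
    return math.log(2) / math.log(1 + rate_per_period)


def rate_for_doubling(periods: float) -> float:
    """Compound growth rate per period that doubles in the given number of periods."""
    return 2 ** (1 / periods) - 1


weeks_to_double = doubling_time(0.10)      # a bit over 7 weeks at 10% per week
rate_for_six_weeks = rate_for_doubling(6)  # roughly 12% per week
yearly_multiple = 1.10 ** 52               # growth factor after a year at 10% per week

print(f"{weeks_to_double:.2f} weeks, {rate_for_six_weeks:.1%} per week, "
      f"{yearly_multiple:.0f}x per year")
```

At 10% a week sustained for a year, the user base grows by a factor of roughly 140, which is why buying servers in proportion is not an option.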

SPEAKER_01 55:48

Um, I don't think my camera's off. Just one last question. If I give you a hundred million dollars to invest in AI, and not invest it in DeepInfra, where would you put it? Where would you deploy that capital?

SPEAKER_00 56:16

I think data centers are a good bet. I think chips are a good bet. And if you like the public market, I think the companies that are making memory and chips are probably a good bet.

SPEAKER_01 56:36

Yeah, like the TSMCs of the world.

SPEAKER_00 56:38

Yeah, I think that's kind of the mindset. Like, what would we need?

SPEAKER_01 56:44

Got it, got it. Well, this has been a super enjoyable conversation. Love connecting with you. Um any any parting words of wisdom?

SPEAKER_00 56:56

It was great talking to you, Maxim. I really enjoyed our conversation. I don't know, I feel like, uh, we're really bullish on open source models and open source agents. Like, OpenCode is a great open source alternative. So I like open source things. So I want to encourage, I guess, your listeners to just look in that direction, for sure.

SPEAKER_01 57:24

Makes sense. Got it, love it. Uh, thank you so much.

SPEAKER_00 57:28

Thanks, Maxim. Pleasure talking to you.

SPEAKER_01 57:30

Same. Okay, bye. Bye.