The Neon Show

Why Your AI is Still a Demo: Lessons from Braintrust’s Field CTO

Siddhartha Ahluwalia Season 1 Episode 370

85% of AI teams will hit a serious production failure this year. The only thing separating them from the 15% who don't? Evals.

After nearly two decades of building AI systems at Microsoft, Facebook, and Dropbox, Ameya Bhatawdekar is now Field CTO at Braintrust, the AI observability platform used by Airtable, Notion, Stripe, Dropbox, Vercel, Cloudflare, Lovable, and Replit.

We discuss a shift that most teams underestimate. The winners in AI are not just shipping faster. They are building systems that behave predictably, improve continuously, and earn user trust over time. As traditional monitoring breaks down in a probabilistic world, observability now requires learning how an AI system reasons, not just how it performs. This leads to a new paradigm where agents are no longer just executing tasks, but also analyzing and debugging other agents.

The episode also traces the evolution of machine learning itself. From feature engineering to deep learning to transformers, each leap increased capability and reduced control. Evaluation is now where control sits.

Ameya is clear on one point. Moving fast with weak evaluations feels like velocity, but it compounds into technical debt, unpredictable failures, and ultimately a loss of user trust. The teams that win are the ones that invest early in rigor, especially in understanding context, which is quickly becoming the hardest and most critical layer in AI systems.

If you are a founder or engineer moving beyond the demo phase and trying to build durable, high-quality AI systems, this episode will change how you think about shipping.

0:00 — Trailer
00:55 — What’s Braintrust?
05:01 — What agents are shipping today
07:54 — What evals look like in practice for Notion & Zapier
09:44 — Evals vs Classic monitoring
11:33 — Who is the Field CTO?
16:35 — What goes wrong when agents fail
18:26 — Agents analyzing other agents
24:17 — Evals are existential in vibecoding
25:52 — Ship fast with weak evals or slow with strong evals?
25:41 — What makes enterprises trust an LLM?
29:25 — Do AI startups know how good their product is?
30:23 — 3 ML systems: Microsoft, Dropbox, Meta
36:30 — How the 2017 transformer paper changed everything
38:20 — All algorithms are predicting the next word
43:40 — What LLMs will do in 1 year

-------------

India’s talent has built the world’s tech—now it’s time to lead it.
This mission goes beyond startups. It’s about shifting the center of gravity in global tech to include the brilliance rising from India.

What is Neon Fund?
We invest in seed and early-stage founders from India and the diaspora building world-class Enterprise AI companies. We bring capital, conviction, and a community that’s done it before.

Subscribe for real founder stories, investor perspectives, economist breakdowns, and a behind-the-scenes look at how we’re doing it all at Neon.

-------------

Check us out on:
Website: https://neon.fund/
Instagram: https://www.instagram.com/theneonshoww/
LinkedIn: https://www.linkedin.com/company/beneon/
Twitter: https://x.com/TheNeonShoww

Connect with Siddhartha on:
LinkedIn: https://www.linkedin.com/in/siddharthaahluwalia/
Twitter: https://x.com/siddharthaa7

-------------

This video is for informational purposes only. The views expressed are those of the individuals quoted and do not constitute professional advice.

SPEAKER_00

My role at Braintrust is to help our customers who are building Gen AI systems to build their AI in a very predictable, repeatable way such that they can ship high-quality AI into production. And so the way you do that is by building eval datasets. And is vibe coding going through evals? It's really important that you have control over the application that you have built by vibe coding. Evals almost become existential. There are other agents that are able to manage your subscriptions for you. There are agents that can help you set up your itinerary, knowing your preferences, knowing your constraints. They can pick the best combination of these flights, these hotels for you. So there's lots of interesting agents that are being built out there.

SPEAKER_01

Where do you think these models and capabilities are heading in '26 and '27? Hi, this is Siddhartha Ahluwalia, your host at the Neon Show and managing partner at Neon Fund, a fund that invests at pre-seed and seed stages in the best of enterprise AI companies across the US-India corridor, like Atomicwork, SpotDraft, CloudSEK. Today I have with me Ameya. Ameya, welcome to the Neon Show. Thank you. You are at a very unique position, Field CTO at Braintrust. So Braintrust, for the audience, is one of the leading companies in the evals space for AI. I'll ask you to explain, you know, first what is a Field CTO, and second, what are evals.

SPEAKER_00

Sounds good. So my role at Braintrust is to help our customers who are building Gen AI systems to build their AI in a very predictable, repeatable way such that they can ship high-quality AI into production. And so my role is to work with our customers to build out the technical strategy, the way we deploy and use the evals platform and the observability systems, how we build out these feedback loops to help the company build the muscle to ship production-quality AI. And so I lead the post-sales field engineering team as well. And so we work with our customers to not only onboard to Braintrust, but use it in a way that can help them build these AI systems effectively.

SPEAKER_01

Can you define evals and observability now? What do they mean in this context?

SPEAKER_00

Absolutely. So, you know, as we have seen this revolution of Gen AI over the last few years, what has happened is we moved from the world of training models to do very specific things, where we would take large training datasets, build these algorithms, and then train these models to do a certain task in a predictable fashion, with a high degree of accuracy. After the Gen AI revolution, the way we now build intelligent applications is we take models off the shelf. These are general-purpose models. They can reason on a variety of tasks. And then we condition the models to work a specific way by doing prompt engineering, by doing context engineering. We are really making sure that the models work for a particular use case in a predictable fashion. And so when we are not training these models but conditioning them, we want to make sure that our instructions work well across the board, for all the anticipated and not-yet-anticipated ways in which people are going to interact with these systems. And so the way you do that is by building eval datasets. These eval datasets describe the system in terms of the expected inputs and the expected outputs. And so if your AI is now able to work across those eval datasets, you have a high degree of confidence that it will work well in production. That's evals. You also want to make sure that the system works as expected in production. And so you want to know exactly how people are interacting with your systems, how they're working with your agents. In which cases does your system work well, and in which cases doesn't it? Right. And then you want to look at those cases where it didn't quite work as expected and use that insight to continually improve your intelligent system. Can you fine-tune your prompts? Can you do better context engineering? Can you change something upstream in your system to address those shortcomings and not regress your existing capabilities? And so that's where observability plays a big part. So when you're building Gen AI systems, you really want that feedback loop of observability that helps you build better evals, which helps you ship better AI.
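
To make the pattern concrete, here is a minimal sketch of an eval in plain Python: a dataset of expected inputs and outputs, a task under test, and a scorer. This illustrates only the shape of the idea; it is not Braintrust's SDK, and `run_task` is a toy stand-in for a real prompt-plus-LLM system.

```python
# Minimal eval harness: a dataset of (input, expected) pairs, a task, a scorer.
# Illustrative sketch only -- not Braintrust's SDK.

eval_dataset = [
    {"input": "Cancel my streaming subscription", "expected": "cancel_subscription"},
    {"input": "Stop billing me every month",      "expected": "cancel_subscription"},
    {"input": "How much am I paying right now?",  "expected": "get_subscription_info"},
]

def run_task(user_input: str) -> str:
    """Toy stand-in: the real system would do prompt/context engineering and call an LLM."""
    text = user_input.lower()
    return "cancel_subscription" if "cancel" in text or "stop" in text else "get_subscription_info"

def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 if the predicted intent matches, else 0.0."""
    return 1.0 if output == expected else 0.0

def run_evals() -> float:
    scores = [exact_match(run_task(case["input"]), case["expected"]) for case in eval_dataset]
    return sum(scores) / len(scores)   # aggregate score to track across changes

print(run_evals())  # 1.0 on this toy dataset
```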

SPEAKER_01

Can you give an example using one of your customers, like what agents they are shipping, and how does Braintrust play a role in them shipping production-grade agents?

SPEAKER_00

Yeah, so we have a lot of customers, a lot of companies that are building some pretty phenomenal AI systems. They are all leveraging large language models. They are building complex agentic systems that are reasoning, that are leveraging a lot of tools and interacting with LLMs to fulfill the user intent. And so we have systems where people can work with their agents to perform specific tasks. The agents are able to understand the user intent and perform actions to fulfill those user intents. Some of these are transactional, some of these are systems that provide answers to queries, some of these are systems that do work on the user's behalf, right? An example would be systems that are being used by specialized industries, where they are trying to harvest meaningful information from their corpus of work documents to generate new content that is going to help them accelerate certain processes. For example, I've seen systems where construction companies are able to now put together effective proposals using complex engineering drawings, architectural plans, and specifications to submit proposals for new RFPs. There's a very specific format in which these companies are building out that content. It needs to be accurate, it needs to be standardized, and there are certain formats that it needs to follow. And the AI systems are able to help these companies fulfill that task with a high degree of automation. There are other agents that are able to go in and manage your subscriptions for you. They're able to make choices on your behalf in terms of which options they should exercise for you. So, for example, there are agents that can help you set up your itinerary, come up with recommendations for certain flights or certain hotels, and make that discovery cycle far shorter, so that, knowing your preferences, knowing your constraints, they can pick the best combination of these flights, these hotels for you. So there's lots of interesting agents that are being built out there.

SPEAKER_01

So let's take an example of some of your earlier adopters, like Notion, Zapier, Stripe, Airtable, and Instacart. What do evals look like in practice for them?

SPEAKER_00

Yeah. So all of these companies are building agents, intelligent systems that are performing specialized tasks for whatever use cases they have. And so all of them build evals that reflect how their systems are expected to behave in production. They're all following similar patterns. They're building these eval datasets. These eval datasets are exercised every time any component in their systems is modified. Someone changes a prompt, someone changes some logic in one of the upstream components. The eval datasets ensure that any regressions are caught early in the process. These eval datasets are not static. As the teams look at their logs, at how their systems are working in the real world, in the production use cases, they're able to leverage those insights to continually augment their eval datasets. And so these eval datasets are now highly reflective of how the system is going to perform in the real world and continue to help them improve their AI. And so the same pattern is fairly common across all of these companies. You can see how they have agents and systems that are specific to the offerings, the products they have in the market, and you can sort of see the kind of evals that they would have built to make sure that those systems work as expected.
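
One way to read "exercised every time any component is modified" is as a CI gate: re-run the eval suite on every change and fail the build if the aggregate score regresses past the last known-good baseline. A hypothetical sketch; the baseline file, tolerance, and the `run_evals` hook from the earlier sketch are all assumptions, not anything Braintrust prescribes.

```python
import json
import sys

BASELINE_FILE = "eval_baseline.json"  # assumed location of the last known-good score
TOLERANCE = 0.02                      # slack for noise in probabilistic scorers

def gate(current_score: float) -> None:
    """Fail the build if evals regress; otherwise promote the new baseline."""
    with open(BASELINE_FILE) as f:
        baseline = json.load(f)["score"]
    if current_score < baseline - TOLERANCE:
        sys.exit(f"Eval regression: {current_score:.3f} < baseline {baseline:.3f}")
    with open(BASELINE_FILE, "w") as f:
        json.dump({"score": max(baseline, current_score)}, f)

# e.g. gate(run_evals()) as a CI step after any prompt or pipeline change
```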

SPEAKER_01

Got it. And why is observability absolutely necessary for AI when classic monitoring was enough for normal software?

SPEAKER_00

Yeah. I think monitoring gives you a great understanding of how your systems are working in terms of their performance, from a latency perspective, from a cost perspective. And those are important dimensions to measure. I think in the case of these probabilistic systems, you also want to have a deep understanding of how your systems are behaving in response to these user intents that are typically being specified in natural language. There's a lot of variance in how people can express the same thing. And so you want to make sure that your AI system is robust enough to handle all of the various ways in which people express their intent, and that it is able to fulfill the intent accurately. And so it's important for AI systems to have observability that goes beyond just these performance metrics. These AI systems need to log the entire trace of how the AI reasoned on the initial input. What were the tool calls it made? How did it interact with the LLMs? How did it ultimately generate the response? And did that response actually meet the user intent? Was it accurate? Was it correct? Were there any hallucinations in the generated content or not? And so you really need to capture all of that information to be able to effectively evaluate the quality of that response and, as a result, your entire AI system. So that's why observability becomes a pretty key aspect of building these kinds of intelligent systems.
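
Concretely, the trace he describes is richer than a latency metric: the agent's reasoning, each tool call, and the final response are captured as nested spans. A simplified, hypothetical structure; the field names are illustrative, not any particular platform's schema.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step of an agent run: an LLM call, a tool call, or a sub-task."""
    name: str
    input: object
    output: object = None
    children: list = field(default_factory=list)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)

# A toy trace for an itinerary agent: the root span holds the user intent,
# child spans record how the model planned and which tools it invoked.
trace = Span(name="agent_run", input="Book me a flight to Delhi under $600")
trace.children.append(Span(name="llm.plan", input=trace.input,
                           output="search flights, filter by price, pick cheapest"))
trace.children.append(Span(name="tool.search_flights",
                           input={"dest": "DEL", "max_price": 600},
                           output=[{"flight": "UA 48", "price": 540}]))
trace.output = "Booked UA 48 at $540"  # the final answer, which scorers can evaluate
```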

SPEAKER_01

And your title is very interesting, it's Field CTO. What does it mean?

SPEAKER_00

Yeah, so all of my career I've been building AI systems. I've built internet-scale, enterprise-scale AI systems at Microsoft, at Facebook, at Dropbox. And over time, you know, I've observed how these AI systems are built, how they're deployed, how they're operationalized. I started leveraging Braintrust about a year ago, while I was at Dropbox. And it was quite the step-function change for us, where we were able to move from a fairly messy way of building enterprise AI systems powered by Gen AI to a more systematic way of how we were building evals, how we were doing observability. We were able to go from this chaotic state to a much more systematic way of doing things. And I see a lot of companies, a lot of folks who are building great AI products, who are kind of in the state that I was about a year ago. And so I was working closely with the Braintrust team, and I felt like there would be a great opportunity for me to work with a lot of world-class AI teams on building production-quality AI systems, in a way where I could share some of the learnings that I had accrued over my career and help them go from that messy, primordial state of building these AI systems to doing something that was going to set them on the path of shipping production-quality AI systems fairly quickly. And so that was sort of the draw for me. So my role at Braintrust is to work with our AI partners, our AI customers, to help them not only onboard to the Braintrust platform and use it effectively, but to systematize the way in which they build and ship AI, in a way that can guarantee high-quality AI in a predictable, repeatable fashion.

SPEAKER_01

Got it. And why did you choose to go from building AI systems to being on the field side, where you're not building but selling now?

SPEAKER_00

Yeah, as I said, I've been building AI systems for a long time. I started building my first machine learning models, which shipped in the operating system and in browsers, in 2008. So it's been 18 years of building and shipping AI systems, and that has been a wonderful journey. I'm at the point where I think I can amplify the impact that I have and not just limit it to one company, but work with many, many world-class teams that are doing very exciting things in the space of AI. And so the opportunity to work with a lot of different teams across various domains, building a variety of different AI solutions, was very appealing to me. I was also always on the product side. And so this was sort of my opportunity to build my muscles and grow a few more folds in my brain as I learned about the sales world. So this was a really interesting opportunity for me. So now you work on pre-sales or post-sales? I work on post-sales. Well, I lead the post-sales field engineering team. So my team works with our customers to deploy Braintrust and to build AI solutions. And so we have solutions architects as well as AI engineers who help our customers through the entire journey of onboarding to an evals platform, operationalizing observability, and then really integrating Braintrust into their AI SDLC, so that they are doing evals in a fairly systematic way, they have operationalized how they do error analysis, and they have built all the integrations that they need to automate a lot of this workflow. So my team helps our customers go through those phases. But my role is, you know, talking to interesting teams. I really enjoy talking to any team, from those that are either considering Braintrust or considering onboarding to an evals platform, through to teams that have built AI systems and are looking at how they can now have another step-function improvement in their processes. So I work with everyone.

SPEAKER_01

When an AI product fails or an AI agent fails, what are the few things that could have gone wrong?

SPEAKER_00

Yeah, so, you know, these are all probabilistic systems. So when an agent is reasoning on the task, the agent is trying to understand the user intent. The agent is trying to understand how it should sequence the set of tasks that it needs to perform. It needs to invoke tools. The tools are making calls to third-party systems, returning certain results. And then these agents are trying to piece together all of that information and then continue their reasoning to generate the final outcome. And so, as you can see, this is a probabilistic system with a long chain of steps that need to be orchestrated and executed correctly to yield the final outcome. And so a lot of things could go wrong. Maybe the tools don't work as well as expected. Maybe the orchestration got certain things wrong. Maybe the sequence of tasks wasn't quite right. Maybe there weren't specific enough instructions for the AI to reason in the way the system was expected to. So there are a lot of things that could go wrong. There are also a lot of operational things that could go wrong. You know, a call to an external system may fail. And so you really have a lot of moving parts when you're building out complex agentic systems, and you could have a failure across any of those. And so you really want to have that full observability, that ability to trace exactly how an agent took that user intent, orchestrated the plan, executed it, and then generated the final answer. So you want to know exactly where it missed a step or failed to yield the final answer.
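
A back-of-envelope calculation makes the fragility of long chains concrete (the numbers are illustrative, not from the episode): if each of ten orchestrated steps succeeds 95% of the time independently, the run as a whole succeeds only about 60% of the time.

```python
# Illustrative: reliability compounds multiplicatively across a chain of steps.
p_step, n_steps = 0.95, 10
print(round(p_step ** n_steps, 3))  # 0.599 -- ten 95%-reliable steps, a ~60%-reliable run
```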

SPEAKER_01

So are you hinting that what you are building on the evals and observability side is capturing what this very interesting article by Foundation Capital that went viral describes, context graphs and context engineering? Is that what you're trying to capture in evals and observability, the context of an agent and why the agent did what it did?

SPEAKER_00

Yeah, I mean, capturing the entire context is really important. When you want to observe, when you want to understand exactly what it is that the agent did over the course of its execution, you want to capture all the context that the agent had access to. That way, anyone who's debugging and troubleshooting can follow the details to understand exactly how the context was built. Was something missing in the context? Was it interpreted by the underlying system correctly? And so we are building a system where you can capture all of that context within the spans, the trace of the conversation. So you can imagine a conversation where an agent gets an instruction from the user, then it performs multiple steps, yields an answer, and then that conversation can continue. And so you can have these long-running conversations where there's a lot of context being generated and built, and all of that is really important for understanding how the agent behaved and for aiding with troubleshooting and debugging.

SPEAKER_01

So really, rather than going through a log today, a developer, or even a non-developer, can ask an agent why it did what it did through a chat-based mechanism.

SPEAKER_00

Well, yes. I mean, if you have all the context, you know, you can have other agents introspect, look at that context and try to understand how the agent worked out what it did, right? And so we are getting into this very interesting phase where you have agents that can analyze and understand and give you insights into the performance of other agents. We have an agent built into the tool itself in Braintrust; it's called Loop. And Loop has access to your entire project context. It can access all your production logs, it can look at all your eval datasets, it can look at all the prompts that you have, it can look at all the ways in which you are evaluating those agentic interactions, where you can have custom scorers that evaluate the quality of your agentic interactions. And because it has all that context, it is able to reason on how your agentic system is behaving. It can identify areas for improvement, and it can help you generate synthetic data that can be used to augment your eval datasets. And so, yeah, you can now have agents analyze other agents and give you recommendations.

SPEAKER_01

This has never happened before, right, in the history of engineering, in how you debug?

SPEAKER_00

Yeah, I mean, debugging is pretty interesting in these cases, because you really want to understand how every component of a particular trace of a conversation performed. All of that data is logged into your observability system. So in Braintrust, you can see every step that was performed by the agent as a span, and so you have the full context of the span, but you can also have scorers that evaluate the execution of a particular span. So you can have custom scorers, which are essentially functions that evaluate the quality of a particular action. You can define those as deterministic functions, implemented in code, or you can use LLMs as judges, and then you can evaluate how each span performed. And that can give you a fairly good way to zero in on the problematic areas of your agents. So within a particular span, within a particular trace, you can quickly figure out where the agent went wrong, because your scorers can now point you to that particular place fairly effectively. But then you can also look at understanding all your production traces holistically. You can do clustering analysis to see what trends you're observing in your agents as they work across multiple different interactions. Where do they get things right? Where do they get things wrong? How do they get things wrong? Why do they get things wrong? So you can now do that kind of holistic analysis on all your data, to not only do debugging and troubleshooting on a particular instance, but get an overall understanding of the behavior of your agent across a wide variety of use cases and categories.
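
As a sketch of the two scorer styles he mentions, a deterministic code scorer and an LLM-as-judge scorer might look like the following. This is illustrative only: `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompt and 0-10 scale are made up.

```python
import json

def valid_json_scorer(span_output: str) -> float:
    """Deterministic scorer: did this span emit parseable JSON?"""
    try:
        json.loads(span_output)
        return 1.0
    except (ValueError, TypeError):
        return 0.0

JUDGE_PROMPT = """You are grading one step of an agent.
User intent: {intent}
Step output: {output}
Reply with a single number from 0 to 10: how well does the output serve the intent?"""

def llm_judge_scorer(intent: str, output: str, call_llm) -> float:
    """LLM-as-judge scorer: ask a model to grade the span, normalized to 0-1."""
    reply = call_llm(JUDGE_PROMPT.format(intent=intent, output=output))
    return min(max(float(reply.strip()), 0.0), 10.0) / 10.0
```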

SPEAKER_01

So would you classify Braintrust also as a debugging tool then? Or not yet?

SPEAKER_00

It is, it is a very effective debugging tool. You know, it tells you if your agent made a mistake, and it gives you all the context and all the clues to understand exactly where that mistake happened, where things went wrong. And so, yeah, it is a very, very useful debugging tool for folks who are working on building these AI systems or continuing to improve these AI systems.

SPEAKER_01

And is vibe coding going through evals? Like, people who are using vibe coding, are they adopting tools like Braintrust?

SPEAKER_00

I think so. I mean, it's really important that you have control over the application that you have built by vibe coding. Vibe coding makes it really easy for you to take an idea and build an application. But how do you know that your application is going to work great? Especially if your application is encoding agentic capabilities, right? And so when you're building these intelligent agentic applications using vibe coding, evals almost become existential. You know, that's the only way you have a high degree of confidence that what you've built is going to work well. We are starting to see a lot of platforms that provide no-code or low-code ways of building agents. They are looking at embedding Braintrust in their platform to make it available to these no-code, low-code developers who are building agentic systems, so they can test and ensure that whatever they have built will work as expected. And so, you know, even if you're a no-code, low-code developer, it's really important that you have the rigor of building out evals, making sure what you've built works well against those evals, and continuing to update those evals as you collect more data from the real-world usage of those systems.

SPEAKER_01

You would have seen teams ship fast with weak evals and teams move slow with strong evals. Which one wins in the long term, and why?

SPEAKER_00

Yeah, I mean, you can ship very quickly without doing any evals, or by doing a very cursory job of evals. But it's going to result in some technical debt. You're going to see more frequent regressions, things that used to work stop working. Any upstream changes can have a dramatic impact on the overall outcome. And so while you can start shipping things fast, shipping things predictably with high quality becomes really hard, right? And so if you want to build something that is going to scale, that is going to consistently deliver a high-quality experience, then you have to invest in high-quality evals. That's just non-negotiable. And you can see that users are very sensitive to their systems and the quality of their output. It's very easy for someone to lose faith in an AI system, and for it to lose credibility, when they see certain mistakes or have rough experiences due to quality issues in their interactions with the system. And so users are not very forgiving. And if you make mistakes too often, you're going to see churn in usage.

SPEAKER_01

So for enterprises, you know, failure is very expensive. What makes an enterprise trust one AI system and reject another, even if they use the same model underneath?

SPEAKER_00

So these intelligent systems, these AI systems, are not just a model, right? There's a lot of layering that happens on top of these models. These systems have to really deliver specific capabilities or specific experiences that help people do certain specific tasks, right? And so you're looking at fairly complex systems that have a model at the heart of them, but there is a lot of engineering that happens on top. You want to make sure that you're able to collect the right context, that you're able to translate the user queries into rephrased queries that have a higher probability of being successfully executed. You want to make sure that the instructions you're providing to your AI systems are accurate and comprehensive, and that they ensure these systems don't hallucinate or generate unexpected results. You want to have the right guardrails in place. And so it's not just the model. The model is a part of the system, but there's a lot of engineering that happens on top of these models, and you have to get it all right in order to have a high-quality, predictable system built around what is, at heart, a probabilistic model.

SPEAKER_01

Do you think AI startups that are building AI, for example, you know, agentic companies that are providing customer support agents, actually know how good or bad their product is, or are they guessing?

SPEAKER_00

Well, if you're developing based on vibes, then you're guessing, right? I think it's going to be very hard to ship something in the real world and expect it to work well if you haven't stress-tested it. If you've just tested it, you know, with vibes, chances are your system may not work well across the variety of ways in which people are going to try and interact with it. And so I do think that systems that are built with a high degree of rigor have a good chance of becoming useful systems that your customers will engage with; otherwise, they end up looking like demos.

SPEAKER_01

So can you tell us, in absolutely layman terms, what are the different ML systems that you worked on in the last 20 years of your career before joining Braintrust? And what was the impact those systems created?

SPEAKER_00

Absolutely. So I'll talk about the three phases of ML that I've experienced in my career. I think the first phase was classic ML. These are classic algorithms, probabilistic algorithms, mostly the algorithms that were popular in the 2000s: logistic regression, decision trees, clustering algorithms. And these were fairly effective. For their time, they did a pretty good job of providing intelligent systems that could do things like risk management and risk detection. They could do spam detection. They could do things like churn analysis, anomaly detection. So they were fairly effective. The way you built these algorithms was you took a lot of data and you worked with the data. You did what you call feature engineering on that data, to transform the data into a shape and size that the algorithms could effectively work with, to produce high-quality predictive systems. And so at that time, a lot of emphasis was on feature engineering. People used to come up with a lot of different ways in which you would modify the shape and size of the data.

SPEAKER_01

I believe search became really popular starting about 20 years ago, with Amazon personalizing search.

SPEAKER_00

Every company wanted to have search. And with these algorithms, you could implement search algorithms, you could implement personalization systems, recommendation systems.

SPEAKER_01

Yeah, recommendation systems were very popular, and they were all based on these old ML models.

SPEAKER_00

Old ML models, collaborative filtering. And so it was a very interesting time to learn a lot of how ML worked from first principles, because you could really understand, under the covers, what was happening: how you were taking this data and running it through these algorithms, the kind of systematic way in which they were processing the data to generate the final outputs. Comparatively, the interpretability of those systems was much higher. Yeah. I think in the 2010s, we saw the wave of neural networks. With the abundance of compute and data, we were now able to train much more sophisticated, much more complex algorithms that could do things the previous generation of algorithms struggled with. So they could do a really good job of understanding and working with visual data. They could do a much better job of understanding and working with sequential data. You had different training techniques that came into place, where you were now doing representation learning, you were doing reinforcement learning. And so what that meant was the algorithms were now doing a lot of the feature engineering that machine learning engineers would previously do. The algorithms were now processing data at a much higher volume. They were way more complex, and they could now do way more complex tasks. So they could do visual reasoning, they could do machine translation, they could do image detection, facial detection. All kinds of amazing things were now possible.

SPEAKER_01

Yeah, I worked on facial recognition using neural networks in 2011, and it was a project for the Government of India for facial recognition across all airports in India. That's phenomenal. Yeah.

SPEAKER_00

And it was a significant breakthrough. The ability to do facial recognition would not have been technically feasible, like, five years before that.

SPEAKER_01

Yeah, I think the timing was right. And back then, gesture recognition became a big thing.

SPEAKER_00

Yeah.

SPEAKER_01

So you had retina recognition become a big thing. If you've seen older movies like Angels & Demons, all those CERN labs had retina recognition.

SPEAKER_00

Yeah, yeah. You saw these pose-detection algorithms become mainstream. You had products like Kinect.

SPEAKER_01

Flutter became popular, if you remember.

SPEAKER_00

Yeah, yeah, that takes me back quite a few years. But the whole thing there was that you now had these really powerful algorithms, but they were a lot less interpretable. These are very complex neural networks, and they took a lot of data to train. And by that time, a lot of companies had acquired, had amassed, a lot of data. And these algorithms were more compute-intensive, but GPUs were now becoming the way you trained and inferenced these models.

SPEAKER_01

All the labs globally, like Stanford, and many, many even in India, started sharing their open datasets.

SPEAKER_00

Yes. You started seeing the emergence of these fairly large datasets, right? And so now the emphasis was on doing some degree of algorithmic exploration, but, you know, a lot of these algorithms were used as-is, with a few tweaks.

SPEAKER_01

It was only reserved for specialists, right?

SPEAKER_00

The machine learning researchers were now shaping and building these algorithms, while the majority of practitioners were using those algorithms but focusing a lot on building out the datasets and training the models.

SPEAKER_01

Yeah, I think in the last three to four years, it's been really about the democratization of AI.

SPEAKER_00

Absolutely. And so what happened is, in 2017, the famous transformer paper came out, and transformers had a seminal impact on the world of machine learning.

SPEAKER_01

What are transformers?

SPEAKER_00

Well, the transformer was an architecture that allowed these models to understand very long sequences of words, of data, right? Previously, you had algorithms that could understand sequences of data, take a long sentence and translate it. But as the sentence grew longer, they struggled. They could probably deal with 50 or 100 words well, but as you started getting into a paragraph or a page of thousands of words, these algorithms really struggled to understand the entire context and pay attention to all of the information that was presented to them, right? They would pay more attention to the most recent words and then forget about the words that occurred further in the past. And so there were great limitations to the amount of data these algorithms could process. Transformers allowed these models to really work well with the large context being provided to them. So these models could now process a paragraph, a page, a whole chapter, or a whole book, to produce outputs in a meaningful, coherent way. And so transformers were this big seminal moment where the world kind of changed.
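
The mechanism behind that ability to pay attention to the whole context is scaled dot-product attention, from the 2017 paper "Attention Is All You Need": every token scores its relevance against every other token, so distant words are weighted rather than structurally forgotten. A minimal NumPy sketch:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    Each query scores every key, so far-away tokens are weighted, not forgotten."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # relevance of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the whole context
    return weights @ V                               # context-weighted mix of values

# Toy self-attention: 5 tokens with 4-dimensional embeddings attend to each other.
x = np.random.randn(5, 4)
contextualized = attention(x, x, x)   # shape (5, 4)
```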

SPEAKER_01

I'll give you an example from one of our portfolio companies, Buddy. They have built AI for senior care living homes, and this is ambient AI: it exists in all their systems, including the CRM, and it doesn't have a UI. What happens is, for example, they have built an AI teammate that can tell a salesperson, hey, the AI will create a task on your email or your calendar and tell you that this is a senior you need to reach out to, or there's a child you need to reach out to because they have been exploring a home for their grandparents, right? And this is the right moment, because usually people start getting anxious after a year or a year and a half if they're not able to find anything. Right. And because the AI can parse through their entire records, they are able to create these tasks. But what also happens is that it adapts to the context. For example, there's a book called Never Split the Difference; it's a very popular book in sales. So if the CRO of a senior care living home uses terminology from Never Split the Difference, and culturally the team uses this, the AI is able to absorb the entire language, and it uses quotes from Never Split the Difference. Yeah. So people are amazed, like, how did it adapt to our culture?

SPEAKER_00

Yeah, it's pretty amazing, because ultimately all these algorithms are essentially trying to predict the next word. Given all the previous words it has been given, the model is trying to predict the next word. And initially, this used to be a highly statistical, probabilistic operation. These models would be trained on a large corpus of information, like a Wikipedia-sized dataset, and the idea was that these algorithms were learning certain statistical patterns to predict what the next word is likely to be. But as these algorithms have been trained on more and more data, and as the number of parameters that make up these algorithms has grown exponentially, we have seen some very emergent properties in these algorithms, right? At their heart, they are a probabilistic next-word prediction system. But they now seem to encode an understanding of grammar. So the probability of the next word is not just based on its statistical occurrence in the training corpus; they now also appear to reason on whether the next word will make the entire sentence grammatically correct or not, right? They're starting to understand facts, for example. They are not only able to say what the next word is going to be, but they are able to weigh whether it is going to be factually accurate, because a word that makes the sentence factually accurate gets a higher probability than another word which also occurs very frequently in the corpus given that particular sentence, but is factually inaccurate. So, for example, if the sentence is "the boiling point of water is 100 degrees centigrade," the model will likely predict 100 degrees centigrade instead of 102 degrees centigrade, right? And so it looks like the models are starting to learn facts. The models are starting to learn to reason. Even if it's not a fact, it looks like the models are starting to understand whether a given proof is correct compared to an incorrect proof. And so if you have a statement that looks like a proof, the model will pick the word that completes it correctly, over a variant where another word was highly probable for completing that sentence but made the proof wrong. So it seems like the models have built this reasoning capability. And, you know, there's an argument to be made about how models could learn these kinds of reasoning capabilities, but there's also a counter-argument that the internet is big. Practically any question that you may have asked has probably been answered on the internet. And so, as these models have grown in size, as they've been trained on a lot more data, you're starting to see these very interesting emergent properties of intelligence in these models.
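
His boiling-point example can be made concrete: a language model assigns a probability to every candidate next token, and what looks like factual recall is the training pressure that makes "100" vastly more probable than "102" after "the boiling point of water is". A toy illustration with invented logits:

```python
import math

# Hypothetical logits a model might assign to candidate next tokens after
# "The boiling point of water is" -- the numbers are invented for illustration.
logits = {"100": 9.2, "102": 2.1, "0": 1.4, "hot": 0.7}

z = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / z for tok, v in logits.items()}
print(probs["100"])  # ~0.999: next-word prediction that behaves like knowing a fact
```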

SPEAKER_01

And as, you know, we are coming towards the conclusion of the podcast: where do you think these models and capabilities are heading in '26 and '27?

SPEAKER_00

You know, I've stopped trying to predict where these are going to be a year from now, because they seem to be rapidly developing impressive capabilities day over day. We've already seen that models are now getting extremely good at reasoning, they're getting really good at orchestration, and they have gotten really good at solving analytical problems like coding, for example. These models have become really good at generating code. These models continue to get better at all of these capabilities at an accelerated pace. So it's hard for me to say exactly what these models will be able to do a year from now. But if you look at the trajectory of these models: there was a lot of work being done maybe about a year ago to get these models to orchestrate complex tasks correctly. People built these very fancy frameworks and systems, which were fairly complex and complicated, to improve the orchestration capability of the model. And, you know, there were some very impressive engineering feats that happened as a result of that. But now the models are able to do that natively, right? And so it's pretty amazing that we have now built models that are able to do these kinds of planning and execution tasks that we would not have imagined, about a year ago, they would be capable of doing. Yeah. You know, in the early days of GPT-3.5, people spent a lot of time trying to make the models work with a very small context window of 4096 tokens, with very weak instruction following. And a lot of what people built then is now completely unnecessary; the models just do that inherently. And so, you know, who knows what the models are going to be doing a year from now, but I'm sure it'll be impressive.

SPEAKER_01

Thank you so much. I mean, I loved this conversation. And I loved the candidness you brought to the conversation and how simple you made it.

SPEAKER_00

Thank you so much for having me.