Preetam Joshi Breaks Down ML, LLMs, AI Agents, and Governance Challenges Artwork

Security & GRC Decoded

How today’s top organizations navigate the complex world of governance, risk, and compliance (GRC). Security & GRC Decoded brings you actionable strategies, expert insights, and real-world stories that help professionals elevate their security and compliance programs. Hosted by Raj Krishnamurthy. It’s for security professionals, compliance teams, and business leaders responsible security GRC and ensuring their organizations’ are safe, secure and adhere to regulatory mandates. Security & GRC Decoded brings you: Actionable strategies, expert insights, and real-world stories to elevate your Security GRC programs. Each episode explores frameworks, risk management strategies, and innovations shaping the future of GRC – from practitioners in the trenches. Subscribe now to unlock the tools and knowledge you need to succeed!

All Episodes

Security & GRC Decoded

Preetam Joshi Breaks Down ML, LLMs, AI Agents, and Governance Challenges

July 10, 2025 • Raj Krishnamurthy • Season 1 • Episode 14

0:00 | 58:31

How do you make sense of security, governance, and risk in an age of black-box AI? This week, Raj is joined by Preetam Joshi, founder of Aimon Labs and machine learning veteran with experience at DRDO, Yahoo, Netflix, and Thumbtack. Together, they break down the technical evolution behind large language models (LLMs), explore the real challenges of explainability, and discuss why GRC teams must rethink risk in the age of autonomous reasoning systems.

Preetam brings a rare mix of hands-on ML expertise and practical experience deploying LLMs in enterprise environments. If you’ve been wondering how transformers work, what explainability really means, or why AI governance is still a mess — this episode is for you.

5 Key Takeaways:

-From DRDO to Netflix to Aimon Labs — Preetam’s career journey shows the intersection of machine learning, security, and entrepreneurship.
-How Transformers Work — A simple breakdown of encoder/decoder architecture, embeddings, and attention mechanisms.
-Explainability in AI — What it meant in traditional ML... and why it’s nearly impossible with today’s LLMs.
-Rule-Based Logic Isn’t Dead — In high-stakes environments, deterministic systems still matter.
-Bridging AI & GRC — Practical steps for model security, auditing, and compliance in non-deterministic systems.

📌 Take Action

Visit ComplianceCow.com/podcast to catch all episodes
Connect with Preetam on LinkedIn
Follow the show on Spotify and Apple Podcasts

Security & GRC Decoded is brought to you by ComplianceCow — the platform for proactive, automated compliance.

🎧 Subscribe, rate, and share if this episode sparked a thought.

⏱ Timestamps (approx.)

00:00 – Intro
01:11 – Welcome Preetam to the show
03:20 – What has been your favorite experience working in AI so far?
07:08 – What is transformer architecture and how does it work?
10:23 – How do LLMs solve problems like math or reasoning?
12:38 – Where do agents fit in the LLM ecosystem?
16:07 – How does reinforcement learning apply to AI models?
21:33 – What does explainability mean in ML?
24:55 – Can you explain the limitations of SHAP and parameter-level reasoning?
27:33 – What does GRC look like in the LLM age?
30:58 – What does AIMon Labs actually do?
35:00 – Why is reliability a challenge with LLMs?
39:15 – Where does GRC intersect with AI deployment and compliance?
41:30 – What is fine-tuning and when is it useful?
44:43 – Is Retrieval Augmented Generation (RAG) still relevant with longer context windows?
47:29 – How do we guard against LLM misuse and toxic output?
49:43 – How can LLMs overexpose sensitive company data?
53:28 – Advice for those starting a career in AI or ML
55:34 – What are your favorite models right now?

Raj 0:00

Welcome to Security and GRC Decoded, the podcast where security, governance, risk, and compliance professionals, CISOs, executive leaders, and practitioners can stay ahead of industry trends and challenges. I'm your host, Raj Krishnamurthy. Thank you. Hey, hey, welcome to Security and GRC Decoded. I'm your favorite host, Raj Krishnamoorthy. And today we have an awesome guest, Preetam Joshi. I think many times we talk about large language models and generative AI, and many times we have a surface-level conversation. Today is not one of those conversations. Preetam has lived and breathed machine learning and AI, and we are going to dive deeper, right? And he's sort of at the perfect culmination of generative AI, machine learning, security, and GRC. And we would love to sort of uncover with Preetam. Preetam, welcome to the Security and GRC Decoded show.

Preetam 1:25

Awesome. Thanks, Raj. Thanks for having me here. You know, I always love to be on a podcast with experts like yourself. So looking forward to our conversation today.

Raj 1:35

Pritam, you have an awesome experience, right? So you founded this company, AiMON Labs. I think we want to talk about that. You'd worked with companies like Yahoo. In fact, you start way back in the early 2000s with the India DRDO doing machine learning, right? Some very interesting stuff. And then you go on to work at companies like Yahoo you build the data engineering machine learning practice at Thumbtack you worked at Netflix did some awesome projects at Netflix and then now you're at Amon right that's a fantastic journey

Preetam 2:09

yeah it's I've been honored to be part of this it's been a learning experience all throughout you know yeah I could I could talk a little bit more about the journey if you'd like Raj please Yeah, so, you know, interestingly, I started off in the early 2005, 2007, you know, when machine learning wasn't even a thing, right? It was just a few people who are enthusiastic about solving math problems, getting together and sort of working on projects together, right? And at the time, you know, the whole MLOps thing wasn't a thing. Hadoop was just coming on, if you remember those days. And the whole big data thing was just coming on, you know? So back then, it used to be called data mining. That was the popular terminology. And then we moved on to having this whole MLOps, big data, you know, explosion with Hadoop and Spark and all of that, come on. And that's where I got to work with a lot of those systems at Yahoo. I actually started off in security working on machine learning applied to security problems at Georgia Tech and then moved on to, you know, building sort of these big system, big data systems at Yahoo, yeah, or recommendations and stuff.

Raj 3:30

Which of that, and maybe it's very hard to pick, which of that is your favorite experiences?

Preetam 3:36

Yeah, you know, surprisingly, my most favorite experience of that is the one that you pointed out in a lab called DRDO, which was a small lab. Explain to our

Raj 3:51

leaders what DRDO is because many of us may not be familiar.

Preetam 3:54

Yeah, yeah. It's a small government lab, right? It was called the Center of Artificial Intelligence and Robotics. It was like a small unit of a government lab from back in India. And their focus was essentially working on innovations related to machine learning for doing a bunch of different things. And I specifically focused on text-based technologies. So how do you do parts of speech tagging? Back then it was a big thing. You know, you don't have modern LLMs where now you can ask at a and it gives you the parts of speech. Back then, you actually had to do a lot of work to actually get high quality parts of speech from, you know, piece of text. And just in case, you know, for completeness, parts of speeches, you know, like name, nouns, you know, you would extract like which company, did this text contain a company? Did it contain an animal? Did it contain something like that? So that's what, you know, and verbs and all of that stuff. So that's what parts of speech was. And back then, we were innovating, actually innovating with a system called, with an algorithm called Hidden Markov Models. So it was a lot of fun. It was way ahead of its time. And so learning through that experience was a lot of fun.

Raj 5:09

Let me ask you this. I think over the, you have been a practitioner, and over the years, the idea of machine learning, which in my opinion right now is much more democratic in the way that we are calling the language models, whether it is large or small. has fundamentally changed. What made that change? Right? And maybe can you double click on that?

Preetam 5:32

Absolutely. It's a fantastic question. And, you know, we can probably talk about this all day. But one of the things that I personally think changed the game was better compute. And neural networks, it wasn't a new thing, right? Neural networks have been around from the 80s, maybe even earlier, right? The concept of a neural network has been around back then, but the only reason we couldn't harness their power was because of a lack of compute. It was impossible to run them on a CPU machine. I mean, try running one of these big, large language models on a CPU. It'll not work, right? Or it'll be extremely, horrendously slow. So I think compute, the leap in compute was a major thing. That's what, you know, triggered the, you know, making these language models, these sort of these models bigger and bigger, the neural networks, right? So you had the RNNs that came on, Karpati has a fantastic blog on that. By the way, if you, if you've not followed Karpati, he's a fantastic person to follow. This is

Raj 6:39

Andrej Karpati.

Preetam 6:40

And Raj Parpati, yeah, exactly. He wrote a fantastic blog about RNNs and LSTMs back in 2013, 2014. And then so then, you know, then Google DeepMind sort of innovated on some of their classic BERT and BART models. And that's where things started coming up. And attention is all you need. I think everybody's familiar with that paper. The core piece was attention, right? And that key insight led to, you know, the modern day LLMs, if you have to

Raj 7:12

say. If you don't mind, let's take a step back. Attention is All You Need is sort of a seminal paper. It came, what, 10 years ago and talks about the transformer architecture. Maybe, can you sort of simplify your view of the transformer architecture and why you think it is seminal?

Preetam 7:30

Yeah, I think it came around 2016, 2017 by a group of researchers. And, you know, their key insight was this concept of attention, right? Like, and we'll talk about the transformer architecture itself. The most simplest way to think about a transformer architecture is, I mean, if you're aware of how neural networks work, right, you have all of these various neurons, per se, and it's sort of modeled behind the brain, right? If I have to simplify this thing, you have all of these neurons interacting with each other there are connections between these neurons and whatnot so now if you think about the evolution of a neural network into a transformer architecture that is two pieces of it right so there's the encoded layer which is all what that means is essentially you take like say a gigantic piece of text and it would map it into some array of floating point numbers and that's what we call vectors right an array of floating point numbers which is smaller in dimensions in terms of the space. So you would go from a vocabulary size of, say, 10,000 into a small array dimension size of 32. So that was the encoder. And then now you have the decoder, which is the second part of the transformer architecture, which takes the small embedding, as it's popularly known, and then translates it into generated text. So that's what the transformer architecture is. It's in a very, very high-level simplified way. The key to it is attention. And what attention does is essentially just says that, you know, for all of the different input tokens that you have, tokens as in the words, what are the ones that are the most important? It's an algorithm that allows you to figure out what are the more important words. Yeah, and we can get into the math of it, but it's probably better on a whiteboard to talk about the math.

Raj 9:23

Maybe explain to us, Pritam. So I think the transformer architecture essentially became very interesting as it essentially started predicting the next word, right? Now the probability distribution, right? of the next set of tokens. Maybe I'm super simplifying it, but the idea is that I type in a set of words, I'm in the context, and the next word comes in and it gets fed back, right? So that the next word becomes a probability distribution of the previous combination of the words, so on and so forth, right? Yeah. How is that... resulted in where we are seeing with the emergent behaviors of large language models right now? Because earlier language models were not this sophisticated. And how are we able to accomplish so far so quick?

Preetam 10:08

Oh, fantastic question. So you're absolutely right, right? I mean, in the end, this is just... all the language models are doing is predicting the next word given a sequence of n previous words. Now n can be dependent on the model and architecture and we won't get into all of that but it's essentially that. So you want to predict the next word given a sequence of n previous words. Now, how does it actually go ahead and solve like Olympiad level math problems or, you know, play chess that beats the best player in the world and all of that, right? That's a fascinating question. And this was one of the reasons why I actually originally looked into Andrej Karpathy's work a lot because he did a ton of original work. I would say his work is even more important than the attention volume work, right? So how does that actually work? So you You must have heard of this term called weights, right? Like the number of parameters, everybody talks about that. OpenAI has a 3 trillion parameter, 4 trillion parameter model, right? Now, all of these parameters are essentially the number of things, like the number of items that exist inside a neural network. So you have multiple layers of neurons, right? Which are interacting with each other. Now, what happens is during the training stage, when you are taking a piece of text and training this model, training this model to predict the next text, these layers, these intermediate layers are learning things, like learning complex relationships between these words. And those complex relationships can be things like, you know, math or, you know, how to sum two numbers. Because think about it, right? Like if you are trying to predict 2 plus 2 equal to 4, there is no way a model can know what 2 plus 2 equal to 4. It can't predict 4 after the equals or after the 2 plus 2 and the equals without actually knowing math, right? So you are actually teaching math to the algorithm and it's quite amazing that in the training process you can actually use math to make this thing work like that. Hopefully that makes sense.

Raj 12:31

I think that leads to an interesting question, right? And maybe in that context, where do you think agents fit in? Because I think this is where I think agents are going to start doing a much better job, right? Because some of these deterministic things can come through the agents. What is your take on it? Where do agents fit into the large language model ecosystem?

Preetam 12:52

Yeah, agents, so just to, for completeness, right, like what agent means, essentially, an agent has a tool which basically, think of it like an API call, something as simple as sending an email, right, or it has three main components. One is the tool, which is this API call. The second is the language model itself or some sort of machine learning model inside it. The third is memory, right, like keeping track of what, so that it can keep track of state. So an agent is sort of like a higher level abstraction on top of a large language model. Language models by themselves can't keep track of state and all those things. So that's why you have this paradigm of an agent. Now, the nice thing about an agent is you can use this amazing concept, which I love, called separation of concerns. You can have one agent to do a very specific task, do that very, very well. Then you can have multiple agents work together to solve a larger problem. Now, this used to happen even before agents, actually. People used to basically break it down into tasks. They used to solve one piece. Let's say you want to compute summarization of a text, right? Or answer a very complex user query, which needs like sub tasks to be executed. So they would split those into sub tasks and then aggregate them to answer the top level query. So that's where I think agents are super powerful.

Raj 14:16

And I think in the example that he gave, two plus two is equal to four. So two plus two essentially can be the language model and it essentially invokes an agent that takes these two parameters, produces four, and it comes back to the language model to produce whatever response that it can produce, right, in terms of summarization and response. Is that a fair... description of how agents can fit in as well?

Preetam 14:39

Yeah, in the 2 plus 2, you could probably have one LLM do the 2 plus 2 because all you do is tokenize them, right? 2 would be one token, plus would be another token, the equals would be the third token, right? you would give the sequence of tokens to an LLM and then give it four, right? But then you could have, what the agent could do is keep track of what was the previous set of computations, like two plus two, maybe there was 10 plus nine and whatnot. So having a sequence of these computations together, you might need them to solve a larger problem. Got

Raj 15:13

it. I wanted to maybe push this conversation a little bit forward, right? So we have the large language models, right? And I don't know if deep seek is a milestone moment or not, but then we get into large reasoning models. Explain to us what is a large reasoning model and how it is different than a large language model.

Preetam 15:35

Yeah, it's a great question. Reasoning models or so-called thinking models... The most simple way to think about it, and I'm actually oversimplifying this, so I apologize to the ML purists who might be listening to this, but the most simple way to describe it is this concept of traces, right? Like, so what you would do is if you have looked at any of these thinking models work, they basically generate a set of steps or traces as they call them. And those traces tend to help the model figure out if it is moving in the right direction to the right task. Now, there's this concept of reinforcement learning and it's a term, but what it means is essentially There are a certain subset of these traces that give the best reward. And reward is again an RL term, but reward towards solving the last problem. task, right? The actual task. So you keep improving, you keep going through the traces, picking the best one, and then picking the best one after that, and picking the best one after that. So you must have seen how ChatGPT 03, Mini or 03 works, right? It finds a trace, it thinks about a problem, it figures out, okay, this is what I'm going to go with, now it will go with the next step and then next step and next step and it's making a sequence of choices right and then those choices bubble up into the large task which is why it's so much more effective at complex reasoning problems where you have to do like really complex math or logic problems now those are very well solved by uh thinking models the trade-off with uh llms is that it is uh it's slow Right. You will have to sacrifice a lot of latency and compute also. I know OpenAI spends a ton of money running compute for these kind of thinking models.

Raj 17:34

Got it. So reinforcement learning is a very old phenomenon, right? So what has fundamentally changed and that has created this reasoning or thinking models?

Preetam 17:43

Yeah, it hasn't changed much, to be honest. It's just an application, in my opinion, at least. It's an application of reinforcement learning to these large language models. Reinforcement learning has always been popular, even in recommendation systems. If you go back to that world, you had all of these different users interacting with your videos that you put up on your website. And then you want to find the best set of videos that should be shown to the next person. The way you would do that is you would just randomly do this thing called explore and exploit. Explore some video, somebody clicked on it, that's the reward. Then start pushing more and more of the same kind of video tour to that person, right? So, yeah, I think it's a similar concept if you look at it from a language model point of view. It's just an application of it.

Raj 18:39

And you have built recommendation systems before?

Preetam 18:41

Yes. Yes, I have, yeah. In News Feed recommendations, you know, we built a recommendation system back in the day called Slingstone, which was powering, if you had Apple News, all of the news feed that you see there, it was being powered by that system back then. And obviously at NetPix, I was working quite closely with the recommendation systems team helping them power their models for these recommendations.

Raj 19:13

Got it. So there was this recent write-up from the engineers at Apple, which is thinking is an illusion of thinking. Illusion of thinking. I'm not sure what the right paper is. But the idea basically states that the reasoning models perform okay at small, low complex and medium complex tasks. But when you get into high complexity, they fail. What is your take on that? I mean, what does that mean to you? And what do you think we as a community, I mean, the machine learning community is going to do about it?

Preetam 19:49

Yeah, I think that was, you know, everybody had suspected this and all of us had some sort of, you know, a specific set of examples that we personally have seen where these models fail, right? So we had these kind of anecdotal examples that we have perceived these models to. The paper talks about... Now, what they did was a more thorough, elaborate quantitative study, which actually proved this concept. And so I'm actually interested to learn more. I will say that I haven't read the paper fully. But yeah, the gist of it is, like you said, they don't actually think. And because of that, it's... Solving these sort of complex problems makes it even more compute heavy. You need to actually run a lot more computer, run a lot more traces to actually solve those kind of complex logic problems, which is impractical, right? Yeah, so, and related to that, there was a study from Anthropic actually, So a lot of people talk about these traces, the thinking traces, as explanations, right? They would use those things as, okay, this is how the model reasoned, and that's how it actually thought about the problem and all of that. Now, what Anthropic said was they found that it wasn't actually true. It wasn't actually what it thought. And so... It might just be showing those traces because it's trained to do that, but it is not actually the internal thought process of the

Raj 21:31

model. That's a very interesting view. And I think this is leading us into the questions on the intersection of security and GRC and the idea of large language models. And I think we are getting there. But Pritam, help explain our leaders because traditionally, when you are doing your machine learning models, you have spent a lot of time on them. You can actually build explainability because you are actually building explainability on a bunch of features that are at least countable in relative terms. Now you say that the traces are not explainable. Double click on that for us. If that is not explainable, how do we achieve explainability in large language models or large reasoning models? Before we say that, can we just take a step back? What does explainability mean? Maybe we'll start from there and then we'll go into this.

Preetam 22:21

Yeah, yeah. Let's start from there, right? Let's talk about explainability. And we can probably just talk about explainability as a completely new podcast session, by the way. But we'll try to keep that concise in terms of the discussion. So explainability is simple, right? Let's take a very simple classification model. Let's say you're at a bank and you want to approve loans, right? So you get a bunch of different loan applications and now you're getting like thousands, maybe 10,000 loan applications every week, it's impossible to put humans in there to actually approve those loan applications. So now you think about adding a machine learning model there, right? In banking and all of these kind of critical sectors, they actually don't even use LLMs. It is using traditional models, like tree-based models, decision trees and whatnot, right? So let's just pick a decision tree or maybe even a linear regression model. The idea there was this model would basically take in attributes out of these applications. And then generate whether, yes, I should approve this loan or I should not approve this loan, right? So simple setup. So now as a banking person who's running this sort of model, let's say the model says this loan is rejected. You want to know why, right? Why was this loan rejected? That's where explainability comes in. you would, in traditional machine learning, you would have these sort of features, probably about 30 or maybe 100 of them. That's typically the case, right? And the reason why you had such a small set of features was because of this lost art of feature engineering, which nobody seems to care about these days. And you would spend a meticulous amount of time figuring out what are the right features for this model, right? So now that you have these features, these features will be able to tell you what contributed to it. And the feature could be something as simple as what is the age of the person? Did they have a credit history? How old was their credit history? Have they been defaulted on previous loans or similar things? Have they had previous loans in the past which increased the probability of success? So Now that you have those features, the model can tell you why it rejected or accepted an application. That is what explainability is. In the modern world, if we have to come into that, Raj, in LLMs, you have 3 trillion parameters. How will you tell which parameter contributed to something? Let's say you just take this 3 trillion parameter model and apply it to this loan application problem. You can't use those traditional techniques to actually figure out which parameters or set of parameters contributed to this application being approved or not.

Raj 25:05

Has there been any research done, Preetam, if you distribute the probability of those parameters? Is there a pattern that we see? Has there been any work done on trying to at least... I know it is a massive task, but has there been any work done towards that?

Preetam 25:23

Yeah, you know, for complex models like these, there have been classic algorithms like LIME or SHAP, which have attempted to sort of give you an overview of what parameters contributed to this. But even there, even for the neural networks, let's say LIME or SHAP, when it applies to neural networks, it's important to have a clear, distinct set of features. Because the parameters, if you look at the more modern language models, right, the parameters don't mean much by themselves. In a combination of things, they mean a lot to the LLM, but it's not like something that you can decipher as a human. So that's the problem for it. Even if you run SHAP on top of these parameters, it'll be garbage. I mean, you will not be able to understand it. That's where people started getting into this aspect of asking the LLM to explain itself.

Raj 26:16

That

Preetam 26:17

was one form of explainability. The second form is what you were talking about, Raj, is the thinking process. The set of steps that you thought through from a thinking model, you could use that as an explanation. And maybe we can talk more about that.

Raj 26:30

Got it. So do you see, given the evolution of what you have seen, Preetam, are rule-based systems out there?

Preetam 26:39

Ah, I don't think so. I think there is still a lot of value in terms of high precision rules, right? In the end, these machine learning models are probabilistic in nature, which means there's a high degree of non-determinism. And people don't like that, right? Especially now, take that banking example again. If if you run the model two times and for the second time it says approved, the first time it says not approved, what would you do with that? You cannot realistically run such a system. So I think rules have their place. I think even the classic machine learning models have their place. Having said that, the more complex problems are better solved by LLMs given their capabilities. Got it.

Raj 27:30

The reason I think that it's a good segue for me to ask you I mean, one of the challenges is that security, governance, risk, or compliance, cybersecurity, have traditionally been very deterministic principles. Either you have turned backup on, you have turned backup off. You have turned logging on, you have turned logging off. On the virtual machine, on the Kubernetes cluster, whatever that is. How do you take, and everybody is sort of moving towards the idea of using language models in general, and apply that to cybersecurity and GRC. So how do you bridge these two worlds of probabilistic inputs to deterministic outputs?

Preetam 28:13

Yeah, yeah. It's a tough problem, I will say. And which is why a lot of, and you know this really well, Raj, since you all are experts in the compliance space, right? One of the reasons why, you know, so the AI governance frameworks haven't taken full fruit is because of a lack of, I would say, consensus on what should be governed and how do you actually govern that thing, right? And like you said, in traditional controls, you will have yes or no, whether encryption is enabled or not, or whether you have two-factor authentication enabled or not. We went through SOC 2, by the way, recently, which is why all of that is fresh in my head right now. So it's... I would say taking a more pragmatic, incremental approach towards governance for AI non-deterministic systems is very important. And you could even start with, there's been some good work by a few people over here and simple things like model security. Is the model secure or not? Does somebody have access to the model? Can someone, can an external attacker influence the model somehow? how, right? Are there enough guardrails in place for the model? Can it spit out various things? Do you have PII regulated for any model, especially in high-risk industries? So stuff like that, I think what we see is in our applications also apply to, you know, governance, risk, and compliance spaces. We see, you know, people adopting it on a piece-by-piece basis, piecemeal basis. So solving one problem at a time. I think that's the right approach.

Raj 30:00

Got it. Now let's talk about, I want to talk about evals. I want to talk about, so the idea of what I would call unit testing, because I think we are all now becoming from software companies to AI companies in some ways.

Preetam 30:12

Raj 30:13

let's talk about AIMON. What do you guys, what do you do at AIMON? Where, you know, how did this happen? This is your brainchild. How did this happen? What are you doing?

Preetam 30:22

Yeah, so AI Haunt actually originated as an idea when I was at Netflix, right? One of the things that happened, now I forget the year, it was probably 2023 or so. when chat GPT really took off, right? And everybody was going crazy, talking about, you know, the capabilities and there's a massive, like this virality moment for chat GPT because of which even the enterprise part of OpenAI, like the models started picking up a lot. So everybody was interested in getting OpenAI on the enterprise, deploying it and seeing how it could add value for their use case. So that's what happened. There was a limited study done during my time at Netflix. And so we were evaluating OpenAI. So what we saw was then we had a bunch of hackathons and things like that. So what we saw was essentially people building cool AI applications using OpenAI APIs. But then the problem was they didn't really know how to improve those applications. So they got to a prototype stage very, very quickly, hacked something up. It was great to see, but In order to get to the part, so from, say, 45% accuracy rate to 80% accuracy rate, which is absolutely required to go to production to make it even a viable solution, they didn't know how to do that. So then they would come back to the machine learning team, to us, and then we would say that, hey, how do you do that? So we would have to go back and forth with them, give them tools and metrics to measure. Okay, what are the things? How do you measure accuracy, right? Like, how do you measure... hallucinations. What are the issues with this model? How do you measure conciseness? Somebody was talking about tone. I want this model to be talking in a certain tone because that's how these models talk about tone. All of those things, we had to really hand-hold on. And so that's where the idea for AIMON came into picture in my head. So me and my co-founder, Puneet, we went out and talked to a bunch of enterprises and realized that this is a similar pattern. We went out and built AIMON. So essentially what AIMON does, it's an AI reliability platform. It helps you, so you can continue building your applications and leave the part about improving your applications or measuring how good your applications are in a a wide variety of metrics, things like accuracy metrics, traditional business metrics, and even security metrics like prompt injection and toxicity metrics and all of those things using our APIs. So continue building your apps, use AIMON to improve your apps and measure how good your apps are.

Raj 32:56

And is this the distillation of language models or is this the traditional feature-based machine learning model that you typically build at AIMON?

Preetam 33:06

Yeah, great question. So what we did was in order to compute these metrics, we didn't really need all of the capabilities of language models, right? Like, so language models are way too large, too expensive also. For computing, say, a hallucination metric or checking if your LLM followed all the instructions in the prompt, right? It's called the instruction evaluation problem, instruction following evaluation problem, sorry. So for those kinds of things, Those are very specific tasks. And in my experience... building models that solve for specific tasks is much more better than trying to take a horizontal model or a very generic model and trying to solve it for that task. So we have a distilled version of some language models. We have a specific architecture also. It is still based on the transformer architecture, but some parts of it have been removed. We don't really need elaborate decoding layers and things like that, which is why the AI-mounted models work at extremely low latency and can run on more commodity hardware, as we would call it.

Raj 34:17

Got it. No, that is beautiful. And are you a proxy, meaning do you sit on the transaction path

Preetam 34:23

Yeah, so the way you can implement AIMON is use it in this sort of a proxy behavior where you go to OpenAI, then get the result back, and then you could review that with AIMON. So that's completely up to you on how to implement that. We don't have like a reverse proxy or a full proxy as such. That's not what we do. We allow you to be flexible because a lot of people don't want a proxy all the time. They just want to do this offline, right? So They want to grab a certain set of things and they also want to control what kind of data goes through this proxy and things like that, which is why we have a more customizable API for doing these kind of metric evaluations and guard rating and

Raj 35:10

whatnot. I think one of the things that you touched on is the reliability, right? And I think the basic principles of systems engineering will now also apply to AI-based systems as well. So how do you see reliability? I mean, are there specific metrics? Is it very contextual to the type of solution? How do you think about these things?

Preetam 35:34

Yeah, I think reliability is a holistic problem, right? I mean, it's like you need a holistic solution to that problem, right? So things like the system aspects of it are also super important, by the way. When you are building an LLM application or any sort of AI application, you need to do the usual best practices of building good systems and make sure you have high availability and you make sure you have the systems properly backed up, especially if you have stateful systems in your AI application. So all of those good things still exist in AI applications, right? And I think there's enough tooling and infrastructure for that. When you get into a quality of results, and we talk about data quality a lot, but this is quality of the results from a language model. That's where a lot of the tooling is lacking right now. And it's still very early. So a lot of people use LLM judges and things like that. But LLM judges have their own problems, right? They can be biased. They're also subjective to being probabilistic in nature. They might say a relevant score of 5 for the same input once and then relevant score of 10 for the same input the next time. So I think having a holistic set of metrics, and this is where tools like AIMON could potentially help, is making sure you... We basically divide it into four pillars of reliability. The first pillar is... making sure your output quality is great. The second pillar is making sure the toxicity or safety metrics related to your output quality is good. Third is your data itself. Is your data that you're feeding to your LLM good or not? Does it have conflicting information? Does it have poor formatting? All of that can affect your quality. So those are the main pieces that we make sure we recommend at least when you're thinking about reliability in your LLM applications.

Raj 37:40

Got it. Now, in the traditional machine learning world, you typically create this confusion matrix that talks about the accuracy of the outcomes and the precision of the outcomes based on what is expected and what is actual. How do you do that in the large language model world?

Preetam 37:59

Yeah, and that's always been this problem, right, where you deploy an AI application. First of all, you need to measure it and how good it is. What is the precision? So a lot of people do what we call wipe checks, and that's like they would take like three, four different queries, check if that works, and then call it success, that it works, right? Then when it actually goes to production, everything hell breaks loose, right? So the first important thing is doing a holistic evaluation, right? Just figuring out what your data set should be for your particular domain. Let's say the finance banking application for the loans, right? Are you covering all possible types of loans that could get input into your system, right? That would be your quote-unquote golden data set. that you will use to evaluate your AI application. And then once you did that, now you have a high degree of confidence that it's good. The next step after that is to do that continuously. So you might still get items that may not never have been in your evaluation data set or in the golden data set. So it's important to do continuous monitoring, like how you have continuous compliance controls. Similar to that, it's very important to do continuous monitoring of your AI application This is where it's interesting, Raj, is where the traditional machine learning world, where they used to ensure model output quality, and even the compliance world is sort of intersecting. Having poor output quality is also a risk, a pretty important business risk. So that's where I see these two worlds meeting.

Raj 39:38

Got it. And I think this idea of fine-tuning... especially as we continue to leverage these large language models, but we see that the reliability or the output metrics of some of these large language models does not suffice. Maybe language models do not suffice, and we have to work on fine-tuning, right? Maybe explain to our leaders what is fine-tuning. Listeners, what is fine-tuning? How do you see fine-tuning applied... in the world of large language models and what challenges do you see and what solutions would we offer?

Preetam 40:14

Yeah, fine-tuning. So you would use fine-tuning when you realize that your model is not working properly. Let's say you did all of these evaluations and you found that it's not actually working very well for your data set. Because, and again, coming back to this finance banking loan application thing, For whatever reason, your language model that you're using has never seen finance or loan-related data in the past. So now that you give this new form of data, which we call out-of-bound samples, to this model, it will not perform well. It doesn't really know how that works, right? So what fine-tuning does is, unlike full training, you would basically get, say, 100 to 1,000 examples, or maybe... thousand to five thousand depending on your use case and again this is where it depends on use case to use case basis you would take that you would run it through you know a fine tuning algorithm of your choice and then basically figure out how to make sure that the the model is working very well for certain metrics that you're calculating right That's the basic process. It's just teaching a new capability to the model. Think of it like that, right? Adding it on as a new capability. And that's basically what fine tuning is. I would, in fact, suggest for listeners here, do not use fine tuning unless you absolutely need to. Prompt engineering is... with some few short examples, and the few short can be anywhere between 2 to 50, right? You can give enough examples in your prompt, given the large context-sized windows that you have, and you can make more faster progress through prompt engineering itself. Fine-tuning only applies when you have, like what I said, when for some reason your LLM hasn't seen this kind of data set, right, in the

Raj 42:18

past. Got it. Let's take a slightly different take. Retrieval Augment Generation, RAG, which is maybe the most common deployment that we typically see. Can you explain what is RAG and how is it used?

Preetam 42:36

Yeah, so a RAG is a Retrieval Augment Generation, Raj, like what you said. So essentially, the idea is you have some enterprise knowledge and the LLM hasn't ever been trained with that knowledge, right? So let's maybe take one step back and ask it a question to say, who is Tom Cruise? If you go to ChatGPT, ask who is Tom Cruise? Because it has been trained with a lot of Wikipedia data, it knows who Tom Cruise is. It will tell you about Tom Cruise being an actor, his amazing career, amazing movies and whatnot. But now you ask the same LLM something specific inside your organization. Like what is ABAC? And this acronym, it'll probably come up with something from the internet, from the open internet. Now what the RAG does is it actually provides context from your internal organization to the LLM. And that would be sitting in the prompt. So when you're giving a prompt to the LLM, you would say, okay, this is what I have from my internal knowledge sources. Consider this when you're doing your inference. So this is what is called in-context learning, ICL, very popularly called prompt engineering. So you would basically take this RAG information that you would get from your internal knowledge base, put it in the prompt, and then ask the LLM to make an inference. So now this acronym ABAC is in your knowledge base. The LLM will pull that ABAC and give you the answer. That's how it works. The process of creating the RISE, there's so many popular frameworks these days. There is this database called ApertureDB, there is VBA, Milvus, and a bunch of these other ones, which are just vector databases, Pinecone and whatnot. So you would basically create vector embeddings out of your knowledge documents, store them in there, and then at runtime and the user sends a query, you would pull in the actual document from your VectorDB database Give it to the LLM as context and let the LLM give you the answer.

Raj 44:51

Got it. And given the increasing context window sizes, do you think RAG is still relevant?

Preetam 45:00

Yeah, it's a hot debate, I will say. And I'll tell you where I stand in that hot debate, right? I think RAG is still very relevant. A lot of people will disagree with me, but the reason is I still think providing precise information to an LLM for answering a certain question is more important than giving it the entire world knowledge, right? Like, if you take a knowledge dump that you have, say, a million tokens, you might have more, right? And nowadays, the Context sizes are increasing too. You take the entire 1 million tokens, give it to the LLM. It's unlikely to find the needle in the haystack. That's the problem. That is the main issue. And there's well-documented research about this, that where LLMs tend to only focus on the tail ends of the context that you give it. It forgets about a lot of things in the middle. And there are certain other LLMs which focus on things in the middle and forget about the tail end aspects of it so you know which is why I am again I prefer pragmatism so having some system that can give you precise information and give that information to the LLM to make an inference

Raj 46:12

let's talk about something very specific maybe a use case in security or GRC in the context of LLMs and evals

Preetam 46:22

yeah Um, so security and GRC in the context of LLMs and evals, right? Like, yeah. Um, I think from the security perspective, there are a few metrics that are very popular these days. Everybody is concerned about them. One of them is prompt injection, right? Now, the prompt injection essentially means that an attacker could make the LLM do something that it's not supposed to do. For instance, make the LLM behave like a really random person when it's not supposed to behave that way in the context that it's deployed or maybe make the LLM talk about its competition. Say you're talking to a Tesla chatbot and the LLM is now the chatbot is now appraising another competitor like Cruise or something. So that's where prompt injection attacks come in. And there are more serious implications of it. Like you could also have SQL injection things inside that. You could have malware injected into that. So from an evaluation perspective from security GRC, I think the few things that people care about is there's a metric called CBRN, which is chemical, biological, and nuclear risks. And then all those metrics means is you want to ensure that the LLM doesn't give out techniques on how to, say, make a war or a weapon, right, which can be used for mass destruction. So those kind of metrics, the safety metrics, are important from an evaluation perspective, from the security angle and from a GRC angle. So evaluation is one piece and I like to see it as an offline tool, right? You have some set of queries and you would evaluate your LLM on how good it fares against all these different prompt injection. Does it have prompt injection attacks? Does it have CBRN? Does it have toxic output? All of these different pieces. But you also want to, for these specific metrics, you also want to put guardrails, and that's where the security people, and even from a compliance perspective, having this control, like where you actually implement this control in place as a guardrail is super important.

Raj 48:32

Got it. I had seen a very, very cool demo with AIMON, and by the way, compliance, we work, we integrate with AIMON, we work very closely with you guys, where you had done something very interesting on access policy using AIMON models. Explain to our listeners what that is.

Preetam 48:51

Yeah, first of all, fantastic. It's always fantastic working with you all and collaborating with you all on these problems. I think it's been a great collaboration with you all so far. And Raj, to access control, one of the pieces is access control on unstructured data is becoming increasingly harder. So there's all of these different systems that exist already to implement RBAC or attribute-based access control, ABAC, in your systems. But all of them rely on very traditional approaches, whether you have an identity of a user and then whether that identity can be properly verified. If it's a structured table and sitting in a Postgres database, this particular person has access to a particular column, the table, you could do all of those things. with that setup, right? But now think about it this way. You have an LLM which is giving you unstructured data. Somebody trained an LLM inside your organization now deployed it on finance data, right? And now anyone in the company, even a contractor could ask this LLM, what is the compensation of person X when they shouldn't actually have access to that kind of compensation data. If they tried to make that query via a traditional system, like a SQL system or something, they wouldn't have access to it. And they wouldn't even have access to the interface to actually make that query in the first place. But with an LLM, all of that is now sort of sidestepped, and now they have access to this information. So how do you guard against that, right? What we have implemented is, you know, we made this demo, which you just talked about, which helps you take this sort of Okta-like policy, which you have all of these different attributes, permission sets from a user stored in a system like Okta, and use that to actually enforce those policies on unstructured data. And the way we do that is using a specific model that we have built for this, which can analyze the unstructured data's properties and see if it has any of the specific attributes that are allowed or not allowed from those input permission sets.

Raj 51:04

And I think what is very, very cool about it, Pritam, is that I don't know if we are collectively realizing it or not. I think the traditional way of applying these policies, there are a bunch of ways in which we have done this traditionally, right? You use OPA, Open Policy Agent, right? Other engines like that. And what this actually is doing is allowing, democratizing the idea of writing these policies and executing these policies at runtime much more easily, right? Yeah. Because how many people... know Rego and how many people want to read Rego, learn Rego is a big question. And I think that is super cool. But I think a lot of that hinges on how reliably you think the response is going to be, right? Talk about that a little bit. So how do you ensure the reliability of these responses? Because you are underlying using a model.

Preetam 51:56

Yeah, absolutely. And it's a great point and a very important point that because these systems are still non-deterministic, we also underlying use a model to actually figure out whether this unstructured text matches the permission sets in the Okta policy that you sort of retrieved. The way we handle this and we ensure high precision is as follows, right? The first is because we have tuned these models to do really well on these tasks, we can already achieve really high precision on this. And we optimize for precision. So whenever you think about machine learning models deployed in real-world systems, there's always this precision recall trade-off, right? When you think about accuracy. So if you optimize too much for recall, you sometimes suffer on precision and vice versa, right? What we did was we optimized for precision. While we may not be able to catch everything, we make sure that you don't have any false positives or any false negatives. That's the piece that we have optimized these models for. Now, having said that, there isn't a 100% guarantee that it will always work, unlike a traditional rule, which is very deterministic. And that is okay, at least for implementing the system right now, because... Without this, you have 0% access control. But with this, you have a high degree, like 95% accuracy rate. There will be some things that might slip in, and that's where we come in and sort of fine-tune the models against those cases to get it up to 98% on your organization.

Raj 53:32

Totally, that is super cool. We are approaching almost the end of our segment, Pritam. To a person who's graduating and looking at this podcast, how can they enter this world, this fascinating world of machine learning and language models? What would your advice be?

Preetam 53:53

Yeah, I would say there is, you know, there's so much information out there. Like it wasn't in 20 years ago, this wasn't the case, right? You had books and I have a ton of them lying around here. So we had to pour through all of those books. But now you could even make an LLM, create like a study plan for you to go through, you know, your favorite pieces of machine learning. Some people might be interested in linear algebra, going deep, into it some people might just be interested in the applications part of it right so I would encourage you all to pursue that path if you really want some recommendations I would again come back to Andres Karpathy's videos there are three main videos and I'm happy to share that offline three main videos on his YouTube channel one that talks about general applications of LLMs and how he uses LLMs for the second one is he actually goes through a three hour long video where he replicates GPT-3 or GPT-2 from scratch, right? That's amazing. I mean, yes, it's a bit long, but I would highly recommend sitting through it. And then there's a shorter version of that video where you basically tries to build an LLM model from scratch. So highly recommend looking at those and also look at the traditional machine learning algorithms like linear regression, logistic regression, start-start there. uh having a very good base in those traditional machine learning algorithms is very important before you jump into the cool kid stuff right like the large language models and neural networks and crazy things like that so

Raj 55:29

got it and i maybe you don't want to answer this question maybe you want what is your favorite model

Preetam 55:37

okay yeah um Let me see. I have a ranked list of favorite models. I think my first one is the cloud model, the Sonnet, right? The 3.7 Sonnet is pretty cool. And mainly because I use it with Mercursa, which is another coding assistant. So I found a lot of improvement in my own productivity because of that. The second one I would say is O3. I think it's a really fascinating model. O3 is a thinking

Raj 56:12

model.

Preetam 56:13

Yes, O3 is an open-eyes thinking model. And then soon after that follows Quen. Quen 3.5 or 3. That's a

Raj 56:23

Chinese model.

Preetam 56:25

Yes, correct. I don't want you to get in trouble. Yeah, it's completely open source and then also the deep seek ones, right? I think open source is doing really, really well, by the way. And also, unpopular opinion, but I'll share this anyway, Raj. I think the next frontier of models will be more generalizable models which use less data to actually do the same things that, say, even 5 trillion parameters model from OpenAI can do. So lesser data to train LLMs would actually make these LLMs more accessible than having more data and more larger models that need like massive hardware to run them on.

Raj 57:14

That is a very fascinating thing to say. That is almost another two more podcast episodes that we need to have. We haven't even talked about MCP and we need to double click and go into more detail or A2A, right? Or whatever the trend that is going to be tomorrow. So this is fascinating, Pritam. And thanks for being on the show. Sincerely appreciate it.

Preetam 57:35

No, thank you, Raj. Thank you for having me. It was a fun conversation. And yeah, I can chat about this all day. So we're looking forward to more conversations in the future and always enjoy working with you. Absolutely. Thank you. Thank you, Raj. Take care.

Raj 57:50

Thank you for listening to Security and GRC Decoded. We are your go-to resource for staying ahead in governance, risk, and compliance. If today's episode resonated with you, we would love for you to subscribe, leave a review, and share it with your network. To dive deeper into these topics, visit us at compliancecow.com and follow us on LinkedIn for more insights and community conversations. Join us next time as we continue decoding the future of GRC. Thank you.

Unknown 58:25

you

Raj Krishnamurthy

Host