What's Up with Tech?

How AI Chatbots Go Off The Rails And What To Do About It

Evan Kirstel

Interested in being a guest? Email us at admin@evankirstel.com

Your most powerful product might also be your biggest liability: an AI agent making decisions you can’t see and answers you can’t predict. We sat down with Andre Scott of Coralogix to unpack how to make black-box systems measurable, accountable, and—most importantly—improvable over time.

We trace the journey from monoliths to microservices to LLMs and explain why old-school “index everything, analyze later” monitoring breaks under today’s data explosion. Andre introduces an analytics-first approach that processes telemetry in-stream and then stores what matters in your own object storage. That shift delivers cost control and true data ownership, turning observability from an insurance policy into a growth engine. We dig into open tooling like an LLM trace kit built on OpenTelemetry that captures prompts, responses, and metadata, so you can evaluate correctness, flag prompt injection, and enforce guardrails at runtime.

Bias and hallucinations don’t announce themselves; they creep in through context loss, retrieval misses, and model updates. The fix is continuous evaluation with small, purpose-trained models that run outside your app to score tone, safety, factuality, and leakage risks. Think of agents like employees: give them performance reviews, train them with real data, and escalate when risk spikes. We also explore Olly, CoraLogix’ agentic SRE that reads your telemetry, answers business-grade questions, and recommends alerts and remediations—especially handy when cloud outages ripple through your stack.

Regulation is coming fast, and accountability rests with the teams who ship AI into production. If you deploy it, you own the risk. The practical playbook is clear: embrace analytics-first observability, capture LLM telemetry, make evaluators your crown jewels, and keep the data that teaches your models to improve. Subscribe, share this with your engineering and product teams, and leave a review with the one place you’d add guardrails first.

Support the show

More at https://linktr.ee/EvanKirstel

SPEAKER_00:

Completely, if not monitored properly, diving in today with Core Logics. Andre, how are you? I'm great. Evan, how are you? Good. Thanks so much for joining. Uh, before we dive in, maybe introduce yourself and uh also the mission at Core Logics.

SPEAKER_01:

Yes, no problem. Well, my name is Andre. I'm a developer advocate here at Core Logics. Uh previously worked at Microsoft and various companies around the globe as a SRE and DevOps expert. So I work for some of the largest hedge funds all the way down to the smallest animation studios. So it's safe to say I have quite a lot of stories when it comes to DevOps and observability practices.

SPEAKER_00:

Brilliant. And what's the um big idea of Core Logics for those who may not be familiar?

SPEAKER_01:

Well, we're an observability platform, so maybe I'll set some uh context around observability um and the evolution really with observability. So really it's technology's fastly changing as we know with AI that's here now. But I mean, going back at my as my time as an SRE, it was a lot of monolithic monolithic applications, which was very simple to monitor. Well, and it was a very reactive kind of state. And then now we've moved into the cloud and microservices architecture, where we're now sending you know Kubernetes clusters all around the globe, um multi-cloud. Um, so it's it's much more difficult to track. And of course, it's a it's a data explosion. So um observability now we call it 2.0, but really it's you know asking more questions of the business, being able to not just know what went wrong but why, and understanding the context for our users all around the world. So with core logics, we have seen with this trend of data explosion, a lot of these traditional, traditional legacy observability providers, um, they would index data first in storage and then provide the analytics second in an index. And that can lead it to a massive problem because you don't own that data, you're sending that data off. And as you're scaling and growing, especially with the AI explosion, of course, you know, you're you're paying a lot of money to store that data uh and observe it. CoreLogics flips that on its head where we do the analytics first with our streamer technology. So we process the data in stream, analyze it, and then we store it after the fact. So that provides a lot of cost benefits, yes, but also the big benefit, especially this day and age with AI, you own the data in your own object storage, which is very powerful.

SPEAKER_00:

Indeed, an incredibly important approach in the enterprise these days. So let's dive into this topic. What's happening behind the scenes when a chatbot starts to go off the rails, you know, wrong or weird answers? What's happening there?

SPEAKER_01:

Well, as I say, with the the shift between uh monolithic monolithic applications to this to the the uh uh microservice architecture and now AI, it's a completely different shift you once again. Um so these things are fundamentally different. Like you um they're you're sending your business logic to a foundational model. You don't understand what's happening in there, you can't you can't see inside of it. It's like a black box. You know, you cannot see. We've seen I've seen so many uh companies out there, DPD, one of them, where their chatbot actually went off and started um a disgruntled user got behind the system prompt and started asking to say some not nice things about the business. I mean, it was swearing a lot at the end of the day, it it made the company look bad. It was telling the user to go look elsewhere for different uh businesses and stuff. Like you don't want that happening in production. I would never allow that to happen as a my time as a DevOps engineer. We would never ship bad code, but we're yet we're shipping bad AI, which is meant to be the most clever thing, but it's it's actually it's actually you know could be going to be fundamentally wrong in production. So why is this happening? Because I see this all the time. Come that's just like the the shift to the cloud compute. Business are rushing to deploy cloud or rushing to deploy AI without thinking about how are we going to monitor this, how are we gonna observe it? It's I see it all the time. So with Core Logics, we developed uh the AI center. Uh, we invested 10 million pounds in a in a an AI center of excellence. So we have actual AI engineers working with observability data, but access to you know a lot of compute because this stuff's expensive to run. So they need you know supercomputers to run these types of things and build these models. So we've developed the AI Center, which an LLM trace kit, which is an open source technology. We're building open source, open telemetry for your users that don't know. Um, so that is a library that you can send your AI applications telemetry data and you can actually see what's going on. You can see the LLM uh calls, you can see the conversations. But most importantly, for a genetic AI, you can evaluate the correctness of your agent. So you can then put in guardrails. So you can stop, like the GPD store, you can stop. That's a prompt injection attack. You can stop that, you can catch that, evaluate it, evaluate the conversation, and then with scoring, you can then come in and set some guardrails and stop that from damaging your business, for example.

SPEAKER_00:

Brilliant. Wow, it's a really fascinating approach. Yeah. And you know, there's a lot of issues around bias. How do you how does bias sneak into chatbots even when they're trained on you know smaller curated data sets?

SPEAKER_01:

So I mean it's it's just the the the amount of data there is, like, and 90% of the time these applications will be right, but there is that 10 to 5% that could they can go wrong. So it just happens with hallucination when they when they lose context, when they go off, go off tangent, they they might even select because at the end of the day, they're they're selecting the the the closest match in terms of data with the vector dv, for example. And if if that goes offset and it starts going for a cache answer or something, and it's picking out the wrong answer, or it's just hallucinating, then you know, bias can slip in. And how are you going to stop that? That's the main thing. Is like a lot of companies have these AI applications out there, but they're not evaluating them, they're just letting them run blind. Um, so you really need some sort of evaluation, which would be from from our perspective, lots of small language models, like you just mentioned there. They're trained on the little data sets and they run outside the application so they can go in and evaluate you know what is the context here? Is are we being potentially biased or are we leaking sensitive data? There's there's a list that goes on, all these different kinds of evaluations, but that's critical when it comes to Suggentic AI.

SPEAKER_00:

Interesting. So this isn't a one-time snapshot. I mean, you want to uh over time keep tabs on chatbot accuracy and tone. I mean, how does that work? Is it a constant uh real-time process?

SPEAKER_01:

Yeah, it's constant. So it's it's always evolving. Um, I think that's the powerful differentiator that we have where you own your own data. Like as we say, we have the thing, the the observability maturity model. I actually just had a a uh uh a podcast there with uh one of our users who's gone from Michael from Pacview, who went through the four levels. Um, and it's uh it really is provides that structure. But what I'm trying to get to is we own your own data. That's the powerful thing with us. So telemetry data we find is very varied and rich. That's why we have the the four levels where we can we don't just treat observability as an insurance policy, but as like a driver for business growth. And that's the same with AI. If you own your telemetry data, you can see exactly how your application is running. You can then use that to train your models and become better in the future. Well, I think it's critical. I think companies need to be really, if they get AI observability right now, then they will be really set up well for the future. Fantastic.

SPEAKER_00:

Do you think we'll ever see a kind of self-healing chatbot? There's so many new LLMs coming into market, you know, one that knows when it goes wrong and fix itself, you know, in real time, even eventually.

SPEAKER_01:

I would like to eventually it'll get there. It'll get there eventually. I'm sure that I think the AI landscape is going to change fundamentally again in the next five to ten years for sure. But right now, where we're at is um we need to run evaluations outside of the fact. We need an AI to evaluate the AI that's trained on specific topics. That's where we're at right now, where we'll be in the future, who knows? But I think the most important thing right now, as I mentioned before, is the power of owning your own data, your own telemetry data, and actually owning your applications. And you're not sending it off to a third party where you have to pay a rent on that information, you know. If you own it, it's very, very powerful.

SPEAKER_00:

So we should almost think about uh these chatbots as kind of human-like, you know, employees, if you will, performance reviews, regular training.

SPEAKER_01:

Is that uh a good analogy? Really funny you mentioned that. We just uh we just uh announced Ollie, which is our agentic AI for observability. So you literally have your own SRE, like myself, in your pocket. So you it's an expert on your telemetry data. You can go in and ask it like, why is my application slow today? You know, questions that you would ask from like a business perspective or a non-techie, and you just get a full breakdown of you know the how the application is performing. You get some line graphs, pie, pie charts, whatever you need, an exact breakdown of how your application is running and what's going wrong, and it actually asks you follow-up questions, being like, Do you want an alert to potentially if there was something wrong like the AWS ortage last week? I asked Ollie, you know, why what why uh what what services have been affected? It gave me a complete service uh breakdown and actually give me some remediations to try and maybe prevent it in the future, or you know, some alerting that could help me detect it earlier and get the get the feedback out of my customers. So that's that's where we're going. Like, you know, and we want Oli to be not just an SRE, but uh why not a um you know a uh a product manager, uh security expert, you know, it's trained on your observability data. And if you have it all in the one context and then you own it, you can run this model on top of it. You know, the possibilities are endless. But yeah, so I really do think we're it's very human-like for sure. Um and uh it's quite exciting.

SPEAKER_00:

Indeed. You mentioned the AWS outage, and then there was a Microsoft outage. It's easy then to point fingers and you know at AWS and Microsoft as the root cause. But you know, who's accountable when a chatbot goes wrong or spreads misunder information? Do you blame the company? Do you blame the engineers? Do you blame Core Logics? You're the AI vendor? I mean, that is how's this going to work?

SPEAKER_01:

If you're deploying these, the buck, if you're deploying these AI applications, the buck stops with you. Like you're you're at risk, you know, and your customers are at risk. Um and this area and I have AI, there's it's it's just a lot more deeper than chatbots. Like there's companies, there's engineers I know plugging in these um coding assistants into production systems, and they have access to you know sensitive data, customer data. They're running code, they're deleting databases, they're leaking sensitive data. The code's not great, it's easy to be hacked. It opens up a whole new attack vector. So that could that could be a real problem going forward in the future as well, that I see. So we really we just really need to be evaluating these systems and uh and making sure they're correct before before we release them to production, really.

SPEAKER_00:

Interesting. You must have lots of stories and anecdotes where you're helping companies catch issues early and avoid much bigger problems, which you're probably not allowed to talk about. I'm not sure. Yeah, yeah. Um can you can you share any insights or examples there?

SPEAKER_01:

Um let me think. Uh the DPD story I've already mentioned. Yeah. Um I've seen. Yeah, well, the DPD story is this it's very simple. They had a they had uh OpenAI released any update to their foundational model, which is what they were running on. So you if you're not tracking that and you're not testing your application or evaluating it at all times, um, it's when it's in production and someone's managed to get behind the system prompt that that's behind the AI chatbot that's meant to protect it for this kind of thing. And they've managed to convince it to say some not nice things they've got behind the system, they found out a way to make it hallucinate, and then they've had a lot of fun with it. And you know, they've screenshot that and put it on Twitter across socials, and it went viral. Of course, that's true. But there's also other stories where um I don't want to name up the companies in case I'm wrong, but I know like a car dealership, for example, where you know they're using AI chatbots and they're they're in the the vector DBs, but they can't match, like say, the proper model of the car, but they just confidently give back the wrong answer, like, oh, here's the price, even though it's not correct. And that could be a legally banging offer, for example, if the user accepts it and suddenly they've got a car for a lot less or a lot more than what it's supposed to be. So things like that, that's that's where you can go wrong. So you really do need to have evaluators. Evaluators are the crown jewels currently when it comes to AI models. A lot of uh businesses are are sharing you know, prompt engineers or uh prompts that are useful, uh different contexts that are useful, but they're not sharing evaluations, evaluators because they're critical. I think they're the real crown jewels when it comes to AI observability.

SPEAKER_00:

Interesting. So, and there are of course a lot of regulatory environments, particularly in financial services and and uh healthcare, and even you know, countrywide. You're from Europe, you know that more than I do. Do you think we'll have regulations and you know to ensure AI monitoring? How will that look over time?

SPEAKER_01:

It's definitely definitely coming. Good question. I believe in the in the EU, there's the EU AI Act that's came out and where you businesses have to be responsible um for the risks. Uh, and I believe that will come over this way as well. It needs to happen, it needs to be some sort of regulation. So companies need to be prepared. You know, you can't just like as we as I said before, with the with the cloud compute um explosion where companies are just rushing to deploy applications and and not thinking about monitoring. With AI, you really need to be ahead of it before you release it to production because the the you will get a fine or you know you will fall into this risk of compliance.

SPEAKER_00:

So yeah, no, that's a great point. I mean, the the flip side is one could argue that you know constant monitoring undermines uh autonomous AI or maybe identific. Is is that a fair criticism, or is that just uh a wishful thing? Um I I wouldn't say that.

SPEAKER_01:

I mean, I I just think the really powerful thing is what does all this stuff stuff run on? I mean, it's it's all data. Um it's all it's all data. So really think monitoring it all is not a is not an issue. It can it can it can make you things better, it can help you project prevent new potential disasters in the future, it can train your model. Um but the only way you can do that in a cost-effective way is owning that telemetry data, which is very rich and can give you a lot of insights.

SPEAKER_00:

Um, you mentioned Ali. That sounds like a big release that uh you guys were working on. Any um any peek into the future uh without giving away secrets, obviously, uh roadmap or things that customers might be asking for.

SPEAKER_01:

We're hoping hoping to to go GA with it hopefully soon. Um hopefully a big announcement at reInvent. Hopefully, fingers crossed. If any of you are used there, come check us out at the reInvent and hopefully see a live demo of it. That'll be fantastic. Um, yeah, it's it's it's it's mind-bogging. It's it gave me that aha moment yet again. I keep getting I keep getting it at this company, it's amazing. Um, first with MCP server, that's like you know, as a developer, you're able to have uh a connection through like uh uh OpenAI or Claude to Core Logics and you can query your data. I thought that was really impressive. And then Ollie is just another level where any anybody from any any walk of like any expertise level can go in and be an expert in observability, which is incredible. And and then further down the line, if we think, okay, so you have an expert in your pocket, you know, can we can we automate our schedules tasks, you know, that were that at certain times at certain areas around the globe, you know, it's it's it's if you think about it that way, it's it's really, really interesting.

SPEAKER_00:

Amazing. Well, this has been eye-opening for me. Thanks so much for sharing a peek behind the curtain and uh have a great uh reIn. I decided not to go given the demographic uh chaos. Uh I decided to stay home, but hopefully you'll be posting and tweeting and uh sharing more from the show. And thanks for joining. Really amazing progress. Thanks. And thanks everyone for listening and yeah, watching and sharing the episode. Be sure to check out our sort of companion TV show, Tech Impact has TV on Bloomberg and Fox Business. Thanks, everyone. Thanks, Andre. Thank you. Bye bye. Bye bye.