Develop Yourself

#276 - Claude Code Failed: What Anthropic’s Postmortem Means for Developers

Brian Jenney

Ever had your AI pair programmer stop helping and start breaking everything? I did—and this time, the data proves it wasn’t just me.

Claude fell off. 

TypeScript that wouldn’t compile, migrations stuck in loops, refactors that went completely sideways. Turns out Anthropic’s own postmortem revealed three separate bugs causing degraded output—context routing issues, output corruption, and an approximate top-k XLA miscompilation (yeah, that one's a mouthful).

Let's dive in.


Shameless Plugs

Free 5-day email course to go from HTML to AI

Got a question you want answered on the pod? Drop it here

Apply for 1 of 12 spots at Parsity - Learn to build complex software, work with LLMs and launch your career.

AI Bootcamp (NEW) - for software developers who want to be the expert on their team when it comes to integrating AI into web applications.

SPEAKER_00:

Welcome to the Develop Yourself podcast, where we teach you everything you need to land your first job as a software developer by learning to develop yourself, your skills, your network, and more. I'm Brian, your host. What if I told you that everybody's favorite AI coding tool fell off recently, and you noticed it but had no way to prove it until right now? It turns out it wasn't all in our heads. Claude did in fact get worse over the last month and a half or so. I've been noticing this, and I've seen a lot of stories on Reddit and other forums about the degraded performance and quality of code that Claude was putting out. I want to go through the postmortem that Anthropic has released, and also a lesson I think we can all take away from this kind of incident, especially if you're a junior developer, especially if you're learning to code and beginning to rely on these AI tools. I'm a senior software engineer at a small AI startup, and we've been using AI coding tools like Cursor, Claude, and whatever else is out there that people want to try. We're pretty much open to everything. I've been using these tools for around the last year, like most of us. When they're really good, they're really, really good. And when they're really bad, they're awful, and we make fun of them and we dunk on them. But recently it felt like everybody had kind of consolidated on Claude Code. In fact, one of my coworkers said, hey, you should really try out Claude Code. And I was a little skeptical. I'm like, I like Cursor already, it already integrates into VS Code, I'm using it there. I don't really want another tool, and it's in the terminal. Like, is that really gonna work out well? To my surprise, it worked out really, really well. In fact, at one point I was using Claude way more, and you can see this in my Cursor usage history.
I'm on the Cursor dashboard where I can view my usage, and I can see that around August 30th I was basically not using Cursor at all. I had basically moved directly into VS Code, and then my usage got incrementally higher and higher until I was basically not using Claude at all and was using Cursor solely for all my coding projects, because Claude had gotten so bad. I just couldn't freaking prove it. I even wrote a small post on LinkedIn saying, hey, is anybody else noticing this issue? And other people said, oh yeah. Then I'd go on Reddit and see all these conflicting results, all these conflicting stories. Some people would say, oh, it's a skills issue. I hate that phrase, honestly. It's been so overused. Everything's a skills issue. You don't know how to do something? Skills issue. Your AI tool sucks? Skills issue. I'm like, no, there's something off here. And when I say bad results, I mean it was creating code that would not compile in TypeScript. It was making lots of changes that were not really what I asked for. So basically, the quality of the code was worse and it wouldn't work. It would make lots of changes, those changes worked less well than they used to, and often the code would be a complete mess, or it would get confused in the middle of a large refactor and end up with something I didn't really want in the first place. For example, I was having it work through some Python code to do a migration from BigQuery to MySQL. Honestly, something I considered fairly simple to do. I have no clue why it wasn't able to do it. It kept going in these awful vicious loops where it would change one thing, then change another query, then change that same query again, then the next query, and at the end of it, nothing worked at all.
And when I tried to correct it, because I could see what it was doing wrong, it would acknowledge what it was supposed to do and then continue doing the exact same thing. At that point, I'd be faster doing it myself than using the tool. And it turns out this wasn't all in my head, it wasn't all in our heads, but only some of us had this issue. So let's go through their postmortem. A postmortem, by the way, is a typical practice on software teams where, after a big incident, a critical incident, like maybe somebody dropped a database, or somebody shipped a feature that broke the entire app, the team writes up what happened and why. These things happen. They happen at really big companies and at really small companies. And I gotta give a shout-out to Anthropic. They earned a lot of respect from me for actually acknowledging this issue and making it public, because they didn't really have to. But I do think they lost a lot of trust. I'm beginning to go back to Claude, but I'm still not completely confident about it. I'm like, when is this gonna happen again? Will it happen again? What's gonna happen next time? Do I have to find this on my own? How long are they gonna take to fix it? And by that time, am I gonna be already screwed, or so far down some rabbit hole that I've just wasted tons of time? Is it worth it anymore, basically? They say that between August and early September, three infrastructure bugs degraded response quality. And if I go back to my usage report, you can see that right around August 30th, September 1st, is when I started using Cursor a little bit more, and then it just shot up in September, because I was noticing this issue, like a lot of people were. Now, there were a lot of rumors online, a lot of speculation, that they were actually reducing the quality of the model being used to write code because of demand. And that makes a lot of sense, right? These models are not free. We pay for them.
My organization pays for my use of Claude, like many organizations, especially startups that want people to move really fast, or at least think they're moving fast. Anyway, that's a whole other topic. But many people thought maybe they're just lowering the bar on Claude because they've got so many users now, like, we just can't afford to have this happen. So they go through how they actually serve Claude, which is a lot of architectural stuff. Let's see when the actual errors began happening. It looks like around August 5th they had a context window routing error. I'm not exactly sure what that means. I can only imagine they look at some context, like the amount of tokens you're going to use, and route it accordingly. A lot of tokens would probably require more GPU, and a smaller amount of tokens would probably require less, I assume. Then around August 25th or 26th, which is when I started noticing this myself, and I think a lot of people did, came output corruption errors. Not exactly sure what that means either. And the last one, this approximate top-k miscompilation. I have no clue what the heck that means. I'm just gonna be honest with you. On August 29th, they saw that a load balancing change meant the routing bug impacted 16% of Sonnet 4 requests. Sonnet 4 is the model I generally use. And between August 29th and September 4th is when this major error looks like it was present. This is exactly the time I switched to using Cursor solely. Really funny. So this completely validates what I've seen in my own Cursor usage. And they said the overlapping nature of these bugs made diagnosis really challenging. They had a bunch of bad stuff going on at the same time. Damn, that kind of sucks. And it's one of those intermittent things. It says it only affected 16% of requests.
So that means about one out of every six requests you made would be bad. I felt like all my requests were not doing super hot, but maybe that was just me. So the first issue: the context window routing issue. This is kind of what I assumed. It said Sonnet 4 requests were misrouted to servers configured for an upcoming 1 million token context window. I don't really get why this would be a bad thing. It feels like if you had a smaller context and you sent it to a server that could handle a really large context, why would that be bad? I don't know. But hey, it's not good, right? So that's one issue: routing to the incorrect server for the type of context it's meant to handle. Context basically meaning, how much stuff are you sending over there? Are we sending a lot of stuff or a small amount of stuff? How many tokens are we going to use to parse and understand this information? Issue number two is output corruption, and this is an interesting one. It assigned a high probability to tokens that should rarely be produced given the context. As we know, large language models are non-deterministic, meaning you can ask the same question and get a different answer. The model uses probability to determine what words, what phrases, what code, what tokens to use. In this case, to enable this dynamism, this dynamic response, there was an issue where high probability was assigned to tokens that should rarely be produced. So pretend I asked, what does a dog eat? A normal response would probably be, it eats dog food, or it eats leftovers, or it eats the cat, I don't know. These are fairly probable responses, with the cat being probably the least likely. An even less likely response would be something like, it eats the car, right?
And so imagine when you ask that question, it could pick one of these four choices, when it really has millions of choices, billions of parameters, right? But it kept giving you the least likely choice. It kept saying dogs eat cars, or dogs eat dinosaurs, or dogs eat, and then some random Chinese character. And that's roughly what happened. A small subset of users who asked a question in English might have seen, and I don't know what that character is, a Thai or Chinese character in response to their English prompts, which is probably the least likely thing you'd want to see when you ask a question in English. So you're not only getting bad responses, you're getting things that are nonsensical. I didn't experience this, but that's pretty funny. And this last issue, this last issue is for the big-brain people out there: the approximate top-k XLA:TPU miscompilation. That's a mouthful. I'm not gonna get deep into this one because I really don't understand what it even means. But it says, hey, we deployed some code to improve how Claude selects tokens, and apparently it didn't work. Okay. Typical software bug, not exactly sure what that means. They go deeper into the XLA compiler bug, and they even post some of the Python code that we can actually check out. And this goes a little deeper into that non-deterministic quality that large language models have, right? I mean, it's not really an issue, it's a feature. You don't want the same response every time you ask a question. You want a different one. That's how humans speak. We want something that feels more human, more dynamic, and so we want different responses depending on what we're asking, or even for the same question. So it says that when Claude generates text, it calculates probabilities for each possible word, then randomly samples from this probability distribution. They use top-p sampling to avoid nonsensical outputs.
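As a rough illustration, top-p (nucleus) sampling can be sketched in a few lines of plain Python. This is a toy, not Anthropic's actual implementation, and the tokens and probabilities below are made up to match the dog-food example:

```python
import random

# Made-up next-token probabilities for "what does a dog eat?",
# highest first. Real models have tens of thousands of tokens.
probs = {"dog food": 0.55, "leftovers": 0.25, "the cat": 0.15, "the car": 0.05}

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p; the unlikely tail is dropped."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    # Renormalize so the kept probabilities sum to 1 again.
    return {t: pr / total for t, pr in kept.items()}

filtered = top_p_filter(probs, p=0.9)
# "the car" (the unlikely tail) is filtered out before sampling,
# so the model never says something nonsensical.
assert "the car" not in filtered

# Sampling then only ever picks from the plausible tokens.
choice = random.choices(list(filtered), weights=list(filtered.values()))[0]
```

The bug described in the postmortem was, roughly, the opposite of this filter working: the compiled sampling code on TPUs would occasionally drop the *most* probable token instead of the tail, like removing "dog food" from the list above before sampling.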
But apparently, they discovered that this implementation, their TPU implementation (TPU being the Tensor Processing Unit chips Claude runs on), would occasionally drop the most probable token. So sometimes it would say, what's the most probable thing? And we'll just get rid of it. Who would want that, right? I'm not gonna go deeper into this, but they go into why it was hard to detect and what they're doing to prevent it in the future. And now they even have a /bug command in Claude Code. I noticed they started asking me, how is Claude doing so far, on a scale of one to five? So they're trying to get feedback. I really appreciate that they're doing this, but I think there's a deeper issue here that needs to be addressed, especially for those of you out there learning to code, those of you at the beginning of your career who are relying on these tools a lot. I want to give you a very clear warning here. I mean, Anthropic has given you all the red flags and warnings that you need at this point. So here's the thing that should make you, as a developer, a little bit uncomfortable. These tools are black boxes, right? We have no visibility into what's going on inside these tools. These AI coding tools are black boxes. They are not open source. We have no clue what's been deployed, for the most part. We have no clue when things break. In fact, if they hadn't released this information, we would have all just been on the internet saying, are you seeing the same thing? Trying to validate our own lived experience. And a lot of us probably would have been gaslit into thinking maybe it really is a skills issue. Maybe I'm just not using it correctly, right? Because that was the narrative from a lot of people online.
And so if you're using these tools and you don't know what good versus bad output looks like: pretend you were using this tool and actually had this problem, but you didn't know the code it was outputting was bad. You thought, well, it's Claude, it has to be good, right? That, I think, is the sinister issue creeping into the industry. We're starting to trust these tools, especially management and people outside of coding, who wonder, why aren't you using these tools? Why aren't they making you faster? Why isn't it better than you? It's a machine. How could it not be better than a human? We really have to take the responsibility and ownership back from these tools, because at a certain point, you have to know when you know more than this little computer bot whose inner workings you don't even understand. For the most part, we have no clue how these things work, obviously. These are multi-billion-dollar companies. They don't want to give away the secret sauce. So if you are a software developer out there, especially at the beginning of your career, I highly recommend that you use these tools for work, because that's just the new expectation. But if you're learning, or if you're really at that stage where you're optimizing for learning, turn them off sometimes. And don't just blindly accept their output. One thing I'm doing now is not letting Claude or Cursor do massive refactors or changes to my code base. I'm usually going file by file at this point, unless it's something so trivial that I don't really mind if it messes up a little bit. This is mostly UI stuff, like, hey, refactor this UI, or change this variable name across 20 or 30 files, or make this page look prettier. Or, here's an image of how this page should look, you go ahead and figure that out and make me a bunch of components.
But if it's something like chasing down really tough logic, or writing SQL queries, or building routes that take front-end input and then send output back to the front end, or if logic in a back-end service is failing or not quite giving me the response I want, I'm just not gonna have Claude or Cursor or really any AI tool go and do that for me all the way anymore, unless I have a really, really clear understanding of what it is I want to do and can write really detailed instructions about everything I need it to do. Because otherwise, you run into this issue where it makes lots of changes and breaks lots of things, even when these tools are working as expected. They often write more code than you want, they do things that don't really make sense in the context of your code base, and they don't know about the business logic or all the other things outside of the pure code base you're working in. They don't know why the team decided to do X, Y, and Z. They don't know why this column is named like that. They don't know everything about your SQL database or your BigQuery database or Firebase or Mongo or whatever. It can't get all that context and just output flawless code every single time. So I guess what I'm saying is: learn to trust yourself as a developer. And if you can't trust yourself as a developer, then get to the point where you do. I think that means turning off these tools a little more than you'd probably like, slowing down a little bit, and just taking time to understand what the hell it is you're actually doing, and not having a career that is dependent on a service being up. Because if your whole career depends on Anthropic or Cursor to do anything at all, what do you do when these services go down? What happens when bugs like this slip in, aren't caught for almost a full month, and you're relying on this tool for your day-to-day job? It's a recipe for disaster.
I hope this is a cautionary tale. I hope, if nothing else, you now feel validated in your sense that, yeah, Claude was kind of sucking. I'm going back to Claude. I'm curious what you're gonna do. If you have a cool AI tool out there, I don't even know if I want to hear it, but if you do have the best AI tool you've ever used, I actually am somewhat curious. Cursor and Claude are really the only tools I think I'm gonna use going forward. But if you have a super cool one that is groundbreaking, earth-shatteringly good, I'd love to hear about it. You can email me at Brian at Parsity or just leave a comment under the video. I'll check it out, and maybe I'll do a show on it, because I would love to see if there's something out there, some unlock I'm missing about using these AI tools. Anyway, hope that's helpful. See you around. That'll do it for today's episode of the Develop Yourself podcast. If you're serious about switching careers, becoming a software developer, and building complex software, and you want to work directly with me and my team, go to parsity.io. And if you want more information, feel free to schedule a chat by clicking the link in the show notes. See you next week.
