
Getting2Alpha
Hansohl Kim: What Is Reinforcement Learning?
Hansohl Kim is an engineer at Anthropic, where he focuses on reinforcement learning & AI safety for models like Claude. With experience spanning computer science, biotech, & machine learning, he brings a unique perspective to the fast-changing world of artificial intelligence.
Listen as Hansohl unpacks the challenges of alignment, the importance of guardrails, & what it takes to design AI systems we can truly trust.
RELATED LINKS:
🌐 Anthropic – https://www.anthropic.com
💼 Hansohl Kim on LinkedIn – linkedin.com/in/hansohl
Intro: [00:00:00] From Silicon Valley, the heart of startup land, it's Getting2Alpha. The show about creating innovative, compelling experiences that people love. And now, here's your host, game designer, entrepreneur, and startup coach, Amy Jo Kim.
Amy Jo: Hansohl Kim is an engineer at Anthropic, where he works on reinforcement learning and AI safety for models like Claude. With experience spanning computer science, biotech, and machine learning, Hansohl brings a unique perspective to building safe and reliable AI.
Hansohl Kim: If you train a model and it has a set of values, and then you try to override those values, it results in very erratic and unstable behavior.
The field of reinforcement learning is itself evolving quite quickly, and it's relatively young.
Every additional month I spend at Anthropic, it becomes harder and harder to separate thought about the model from the thought that I'm interacting with another human being in some [00:01:00] senses.
Amy Jo: Join me as we dive into Hansohl's journey and explore Anthropic's approach to shaping the future of artificial intelligence.
Welcome, Hansohl, to the Getting2Alpha podcast.
Hansohl Kim: Thank you. And thank you for having me. It's a pleasure to be here.
Amy Jo: So excited to have you here.
Let's start by talking about your background. I think a lot of people wonder how it is you end up as an engineer at Anthropic, which is the eye of the hurricane.
How did you first get into tech? And then what kind of educational background led you here? Just give us a kind of helicopter overview.
Hansohl Kim: Sure. One caveat upfront is that, from the people I've talked to at Anthropic, no one seems to have quite the same story, and people come from all sorts of backgrounds.
But in my case I started studying computer science at Stanford in undergrad. Originally I'd actually been [00:02:00] pursuing theoretical physics and then chemical engineering. And then finally took a programming class and absolutely loved it. And sort of naturally along the way found that of all the different branches I could pursue in computer science, AI at the time seemed the most fascinating and really the most fun.
And this was back in 2015, I think. And so from there I continued to study with a specialization in AI, which at the time didn't really mean neural networks so much. It had started to move in that direction, but it wasn't obvious. From there, after school, I joined Nvidia, where I worked for about two and a half years, primarily using machine learning and AI to try to improve their chip design.
And then at that point I actually jumped into biotech, as a result of, I think, really the pandemic and wanting to work directly with companies that were working on responses to COVID. That lasted another three years, [00:03:00] at which point I took a look around and saw that the entire landscape of AI had changed completely. I left biotech and started looking around at labs and companies that I wanted to work at.
And I think Anthropic really stood out at the time as heavily differentiated in the type of research they wanted to do and why. And they were pushing very hard in the direction of AI interpretability and AI safety.
In some sense, it's even like a level of AI philosophy and trying to understand what kind of impacts we might expect from this technology beyond just the bottom line or the bottom line profit. So, that was extremely appealing. I ended up interviewing there and have been very excited to be working there ever since.
Amy Jo: So you've had several different roles at Anthropic, correct?
Hansohl Kim: A couple different, yes.
Amy Jo: So talk to us about your different roles and then we're going to get into your current role in more detail.
Hansohl Kim: Absolutely. So when I first joined Anthropic I actually joined as [00:04:00] part of the inference team.
And the core responsibilities of that team were to figure out how to scale up and basically serve the trained models, in our case Claude, to potentially millions of users with the limited compute resources that we had: how to do that efficiently, how to do that quickly. From there I moved toward a very different role in reinforcement learning engineering.
And one of the things about Anthropic is that people do often move around quite a lot. Generally people are following whatever seems the most interesting and where they can have the most impact. And for me, I think the switch to reinforcement learning was natural and quite fun. It's a very different beast.
I think it is possibly to me the most interesting part of AI advancements right now. And definitely a field that at scale requires a lot of engineers and a lot of thought and design.
Amy Jo: What exactly is reinforcement learning?
Hansohl Kim: When people think of AI these days, [00:05:00] they hear the words AI, machine learning, neural network, deep learning, LLM, and these terms all sort of start to blend together.
But the reality, of course, is that AI is a field of study that's been around for, you know, several decades at this point. And it has a lot of different subfields that have gone in and out of popularity and style over time. RL, reinforcement learning, is one of these, and it wasn't originally obvious how it related to deep learning or neural networks.
Especially back when I was first studying this topic in 2014, 2015. But the key difference between reinforcement learning and what we typically think about, which is supervised learning, is that in supervised learning, you're in this world of data sets and ground truth labels.
You tend to think of having an AI system where you put a lot of data in and for everything that goes in, you have a real sort of ground truth answer for what you expect to come out. Then the model gives you an output and you compare it to the ground truth and [00:06:00] you correct the model. You say like, oh, you output the number five, the ground truth was the number six.
Let's push everything, all of your weights towards making that five a six. This is extremely powerful and it got us largely to where we are today in terms of AI technology, but it has a lot of drawbacks as well. For example, actually being able to acquire all that labeled data can be really difficult and prohibitively expensive.
Likewise, supervised learning, where you have this sort of real ground truth answer that you're comparing against, is actually fairly brittle and rigid. In this sense, you can apply it well to situations where there is a clear answer, where there is a clear ground truth. But let's say, for example, you wanted to ask an AI model to make music.
There's not really a clear ground truth or correct answer to that. And so trying to train a model to learn a task like that starts to break [00:07:00] down. This gets even worse if you say create beautiful music because at that point the problem's really undefined. And who knows who is even really qualified to be creating the ground truth for that.
I mean, certainly there are examples throughout human history, but it's very difficult to pursue this line of thinking. And so when we started hitting these problems that are really quite difficult to frame as "here's the correct answer," we reached back in time to a different field of AI called reinforcement learning, where the idea is subtly different.
You don't tell the model, here's the correct answer and how to do it. You just tell the model, I think your answer is good, or I think your answer is not so good. And instead of providing the exact comparison, you just provide a general signal: I judge that to be a five, or I judge that to be a 10 out of 10.
And the difference actually starts to become quite [00:08:00] powerful, in that there is this sort of flip that happens: once you, as the teacher of a system, no longer have to produce a real example of the answer, in some senses you can start to teach things that you yourself don't know how to answer, because you don't actually have to provide the real answer.
You don't have to know how to make an answer. You just have to know how to judge or gauge whether the answer is better or worse. And this becomes especially meaningful today when we talk about ideas like alignment. Alignment is this sort of task of taking extremely powerful AI models and trying to ensure that they have behavior that is aligned with what we value or what we're concerned about.
In a very broad sense, we might say morals, perhaps, but of course that can be a fairly vague term to many people. So we prefer to say the model is aligned. [00:09:00] And again, that's an example of a very vague problem where you can't necessarily say, here's a ground truth example of a hyper-powerful AI system that we believe is behaving well.
Instead, you can create systems where you say, oh, please output your responses to this, and we will give you feedback signals, not necessarily the right answer, because no one knows that, but feedback signals as to whether you are doing something that is helpful, something that is harmful, something that is honest, something that is manipulative.
And yeah, this is about where reinforcement learning has landed in modern AI.
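To make the contrast concrete, here is a minimal Python sketch of the two kinds of feedback described above. It is purely illustrative, with invented functions and toy numbers, not anything from Anthropic's systems: supervised learning needs an exact ground-truth target for every input, while reinforcement learning only needs a scalar judgment of how good an output was.

```python
# Illustrative only: names and numbers are invented for this example.

def supervised_feedback(model_output: float, ground_truth: float) -> float:
    """Supervised learning: the exact right answer is known, so the error
    signal is a direct comparison ("you said 5, the answer was 6")."""
    return (model_output - ground_truth) ** 2

def rl_feedback(model_output: str, judge) -> float:
    """Reinforcement learning: no ground truth exists, only a judge
    (a human, a rubric, or another model) scoring how good the output is."""
    return judge(model_output)

# The judge never has to know how to *produce* good music,
# only how to recognize better versus worse attempts.
print(supervised_feedback(5.0, 6.0))                       # 1.0
print(rl_feedback("a generated melody", lambda out: 7.5))  # 7.5 out of 10
```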
Amy Jo: So you are a reinforcement learning engineer now at Anthropic. You're working on that team. How do you put this into practice? Obviously you're not gonna tell us anything that's not public about Anthropic, just to frame it, but to help us understand, like in daily life, how do you train [00:10:00] models?
Hansohl Kim: Yeah. Again, with a caveat that I have to be a little bit skirting the edge here.
Amy Jo: Absolutely. But just in general to help us take these kind of abstract concepts and think about how does that actually work in practice?
Hansohl Kim: Yeah. So one way I like to think about how reinforcement learning plays out for modern large language models is to start with what we're more familiar with, this idea of supervised learning with massive data sets, and the role that plays in an LLM's life.
At the beginning, you pre-train a model on a massive amount of basically text data. And in this case, the way the ground truth is being applied is that the model receives some text as input and then tries to produce whatever the next word would be. And that is actually framed as a problem of, here's the correct output.
You can look at the remainder of the data and say, okay, here's what the next word should have been. If someone started "Four score and seven," [00:11:00] you might expect the next word to be, you know, "years." That is, I think, what I would describe as the starting point for a quote-unquote intelligent system these days. This used to be the end point. But let me draw an analogy to where supervised learning and reinforcement learning operate in an LLM's training.
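That next-word setup is itself just supervised learning, and a toy version might look like the sketch below: count which word tends to follow a given word in some text, predict the most frequent one, and compare against the actual next word. The data and the deliberately crude "model" are invented purely to show where the ground truth comes from.

```python
from collections import Counter, defaultdict

# Toy next-word "model": for each word, remember which word most often follows it
# in the training text. The ground truth for every prediction is simply whatever
# word actually came next, which is what makes pre-training supervised.

text = "four score and seven years ago our fathers brought forth".split()

follower_counts = defaultdict(Counter)
for current_word, next_word in zip(text, text[1:]):
    follower_counts[current_word][next_word] += 1

def predict_next(word: str) -> str:
    return follower_counts[word].most_common(1)[0][0]

prediction = predict_next("seven")
ground_truth = "years"
print(prediction, prediction == ground_truth)   # the error signal is this comparison
```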
With the caveat that every analogy is a little bit flawed, and especially that it's quite easy and dangerous to anthropomorphize these very powerful models: I think of the supervised learning portion, the pre-training portion, as sort of like a human baby or child that has a biological brain already capable of processing information, analyzing information, thinking, even when they're fairly young, just after being [00:12:00] born.
And the pre-training process is like all of the evolution that has gone into building that biological brain that is capable of thinking and putting together ideas. Conversely, the reinforcement learning section is really a lot more like that child then growing up and learning how to use that brain, how they want to use that brain, and how other people are teaching them to use that brain.
So when we are training models, yes, the pre-training is incredibly important. It teaches the models to get to a point where they can actually analyze information. An incredible discovery that we found, at least apparently, is that if you do this enough, it actually generalizes to being able to generally put ideas together and process information and find patterns.
But then when you say, what do I actually want my frontier model to act like? How do I want it to behave? In fact, there is an entire effort of people who are effectively trying to give Claude the [00:13:00] personality of Claude. What does that mean? For Anthropic's models, those stages are all really reinforcement learning. And in my day-to-day, that means that when a model is capable of thinking, a lot of what we then work on is, okay, how do we teach it how we want it to act and behave, and in some cases learn additional, more complex meta skills like strategy and planning. In engineering terms, typically what this means is a lot of very complicated, massive-scale work.
Understanding how large, complicated systems behave at scale, because reinforcement learning, even more so than pre-training, requires an incredible amount of complexity: you have to have a model that is interacting with the world and producing ideas and producing answers.
And then you need an entire system that's going to gauge those answers and give that feedback, and then you need to shuttle that back and have the model update itself based on how it's interacting. And in a lot of [00:14:00] ways, this is actually a more natural model for us as humans of how we learn and think.
But it does involve a lot of very finicky engineering: handling very distributed systems at very large scale.
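At a very high level, the loop described above (generate, judge, shuttle feedback back, update) might be sketched like this. Every name here is a hypothetical stand-in and the stubs are toys; real systems distribute each stage across large fleets of machines. This is the shape of the loop, not Anthropic's actual training code.

```python
# Deliberately simplified outline of RL fine-tuning for a language model:
# the model tries something, a grader scores it, and the score drives an update.

class ToyEnv:
    """Stand-in for a controlled environment the model interacts with."""
    def sample_prompt(self) -> str:
        return "Draft a short, honest reply to a user question."

class ToyModel:
    """Stand-in for the policy being trained."""
    def generate(self, prompt: str) -> str:
        return "a candidate answer"          # a real model would sample text here
    def update(self, batch) -> None:
        pass                                 # a real trainer would take a gradient step

def rl_training_loop(model, envs, grader, steps: int):
    for _ in range(steps):
        batch = []
        for env in envs:
            prompt = env.sample_prompt()         # the model interacts with the world
            response = model.generate(prompt)    # and produces an answer
            reward = grader(prompt, response)    # a scalar judgment, not a correct answer
            batch.append((prompt, response, reward))
        model.update(batch)                      # push weights toward higher-reward behavior
    return model

rl_training_loop(ToyModel(), [ToyEnv()], grader=lambda p, r: 1.0, steps=3)
```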
Amy Jo: How do you handle guardrails in a system like that? So, let's take a really specific example. Recently in the press, we've seen numerous stories about people going off the deep end in their conversations with ChatGPT or Character.AI. And, you know, there's a lot of complexity to that, but there's this notion of guardrails, right? When you are dealing with reinforcement learning, which is fundamentally non-deterministic, how do you put in guardrails?
Like, I understand how to do that in old school software, right? Like edge cases. Is there like a really different approach to handling those kinds of edge cases [00:15:00] in this kind of large scale non-deterministic system?
Hansohl Kim: Yes, absolutely. And this is of course an area of very major concern for several teams, I think.
And not just at Anthropic either; I'm sure this is keeping many people up at night at many frontier labs. But in these systems there are sort of two approaches. One approach is to say, well, we can try to train this out of the model. And this is actually a major part of what we do with reinforcement learning, that question of alignment: basically behaving in a way that aligns with what we hope and does not do what we fear.
That is a major part of this. And I think part of the flexibility of reinforcement learning is that you can keep designing new environments. I guess I skipped over this previously, but the way that reinforcement learning often plays out is that you create these sort of little game environments, very controlled environments where the model interacts with the world and tries some things out, [00:16:00] and then it gets feedback.
You have a lot of flexibility to create very specific environments that try to mimic these very difficult and rather high-stakes situations, and then basically anti-reward the model for behaving, for example, in a way that encourages parasocial attachment, or in a way that is blindly, sycophantically feeding people, right?
But as you pointed out, this is still a stochastic model. This is still never going to be a hundred percent sure. And so the reality is that you can then build an entire system around the model that is monitoring it, that has additional components externally, that is saying, well, okay, let's think about what the model is actually outputting after it has already chosen to output it.
But before, you know, it gets to an actual user, and consider: do we block this? Do we stop this? Do we intervene? And this is actually a pretty tricky balance for a [00:17:00] system that is, you know, interacting with millions of people regularly, because the problem of false positive versus false negative rates is very difficult here, right?
How do you trade off trying to stop one potentially very harmful answer against causing a million people to not be able to use the thing? And the answer is, actually, you probably still care quite a bit about that one particularly harmful answer. And now you have to figure out how to convince the million people who are having a harder time using this that it is in fact a real problem worth trying to stop.
But, wow, yeah, there are a lot of different moving pieces.
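As a toy picture of that second approach, an external check that runs on the model's chosen output before it reaches the user, consider the sketch below. The scoring function and threshold are invented for illustration; real systems use trained classifiers and layered policies rather than a keyword list.

```python
# Toy post-generation guardrail. The harm scorer is a fake keyword heuristic,
# a stand-in for a learned safety classifier; only the shape of the check matters here.

def harm_score(text: str) -> float:
    """Hypothetical stand-in: near 0.0 for benign text, near 1.0 for harmful text."""
    flagged_phrases = ["instructions for causing harm"]
    return 1.0 if any(p in text.lower() for p in flagged_phrases) else 0.05

def guarded_reply(model_output: str, threshold: float = 0.5) -> str:
    # A lower threshold blocks more harmful outputs (fewer false negatives) but also
    # blocks more legitimate ones (more false positives). That is the tradeoff above:
    # one very harmful answer weighed against a million inconvenienced users.
    if harm_score(model_output) >= threshold:
        return "I can't help with that."   # intervene before it reaches the user
    return model_output                    # pass through

print(guarded_reply("Here's a recipe for banana bread."))
```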
Amy Jo: Yeah. Thank you for helping us start to understand the complexity of that. You mentioned anti-reward. Do you mean punish? In an AI system like this, what's an anti-reward?
Hansohl Kim: Yeah, I guess that's an interesting use of language there, but in a sense, yes. Although the way the reward actually [00:18:00] propagates, the way the reward behaves in the system, is actually not that different fundamentally from how supervised learning works: the reward back-propagates through the model's learning and basically tweaks all of the weights that contributed to the model taking that action, in a way that either emphasizes that action or deemphasizes it, which is really a pretty blunt way of trying to teach or learn.
And the reality is that reinforcement learning is a very blunt approach. It's very indirect, and for that reason it can be a little bit confusing. It can be unstable. It can take time for the model to actually learn things, in the same way as, frankly, a human: if a teacher only ever engaged with you by giving you your grade at the end and never actually talked to you outside of that, you would have a hard time learning, a hard time understanding how to course-correct your own issues or [00:19:00] how to gain confidence in the things that you've done. Right? So this is actually a pretty interesting feature of reinforcement learning as well, and another part of why it is still, I think, a less mature field, a developing field, but also one that is very exciting compared to prior machine learning.
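The "blunt" update described above can be pictured with a classic textbook toy: a REINFORCE-style policy gradient on a three-armed bandit, where the scalar reward simply scales how hard the weights are pushed toward or away from whichever action was just taken. The numbers are invented and this is generic course material, not production training code.

```python
import numpy as np

# REINFORCE-style update on a 3-armed bandit: a scalar reward emphasizes or
# deemphasizes whatever action the policy sampled, with no "correct answer" given.

rng = np.random.default_rng(0)
logits = np.zeros(3)                        # the policy's weights, one per action
true_quality = np.array([0.2, 0.9, 0.4])    # hidden value of each action (toy data)

for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()        # softmax policy
    action = rng.choice(3, p=probs)                      # the model "tries something out"
    reward = true_quality[action] + rng.normal(0, 0.1)   # noisy scalar judgment

    grad_log_prob = -probs                               # gradient of log pi(action) w.r.t. logits
    grad_log_prob[action] += 1.0

    # The reward is a blunt multiplier on the update: good outcomes pull probability
    # toward the sampled action, bad outcomes push it away.
    logits += 0.1 * reward * grad_log_prob

print(np.round(np.exp(logits) / np.exp(logits).sum(), 2))   # policy now favors the 0.9 arm
```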
Amy Jo: So are you saying that in reinforcement learning it's analogous to only getting your grade from a teacher without interaction along the way?
Hansohl Kim: It can be. And a lot of this comes down to how you design those sort of reinforcement learning games, the little environments that the agents are actually learning in.
And a lot of the, shall I say, skill and craft involved in designing good environments for learning is understanding how to provide partial signal, how to provide reward throughout the process. Famously, for a lot of early game-playing agents, like for example AlphaZero playing chess or AlphaGo playing Go, a lot of these games had very sparse signal, where [00:20:00] all they got was at the very end: oh, you won, or you lost. But in practice, when actually trying to make these systems effective, they had to provide a sort of heuristic along the way that was saying, oh, this general position you're in looks reasonably good.
And that's also about trying to engineer in this idea of continuous feedback, of giving some sort of partial feedback along the way. If we think about traditional machine learning, well, traditional is a little bit of an overloaded term at this point. If we think about supervised learning and this idea of data in versus ground truth, a big deal has been made out of data quality, rightly so. Data quality for a supervised learning system is extremely important. But in the same vein, I would say that environment quality for reinforcement learning is equally important. And this idea of if you put in a bunch of useless, noisy garbage data, you're going to get a bad model.
It's equally true for reinforcement learning that if you create a very haphazard, [00:21:00] confusing, or delayed-signal environment, where you say, you won't get any feedback until the very end, it does the same thing: it produces a bad-quality model.
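The sparse-versus-partial-signal point can be made concrete with a tiny reach-the-goal example: one reward function says nothing until the episode ends, the other hands out a small heuristic signal for progress along the way. Both functions are invented purely for illustration.

```python
# Toy sparse vs. shaped rewards for an agent walking along a number line toward GOAL.

GOAL = 10

def sparse_reward(position: int, done: bool) -> float:
    # Like "you won or you lost" at the end of a game: no signal until the episode ends.
    return 1.0 if done and position == GOAL else 0.0

def shaped_reward(old_position: int, new_position: int) -> float:
    # A heuristic partial signal along the way: getting closer to the goal
    # "looks reasonably good," moving away is penalized.
    old_dist = abs(GOAL - old_position)
    new_dist = abs(GOAL - new_position)
    return 0.1 * (old_dist - new_dist)

print(sparse_reward(position=4, done=False))            # 0.0 -> nothing to learn from yet
print(shaped_reward(old_position=4, new_position=5))    # 0.1 -> progress is rewarded now
```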
Amy Jo: Wow. So is reinforcement learning, does that take place in simulation environments basically?
Hansohl Kim: Yeah, I mean, I think you can think about it in that sense. Again, the analog of traditional data sets is now environments. And originally a lot of those environments were basically games. This whole field really did start with a lot of application to game playing and game-playing agents.
Amy Jo: Right. 'cause games are basically structured activities with rules and a goal.
Hansohl Kim: Absolutely. And in fact, the actual mathematical formalism of reinforcement learning and all of its theory is based on the core idea of having a state, where you see what the state of the board is; a set of actions available to you, which are the legal actions or moves you can make in that state;
and the outcome of [00:22:00] that action, both in terms of the new state you end up in and the reward you get. In a sense, that is the entire underlying foundation of reinforcement learning.
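That formalism of state, legal actions, next state, and reward is usually written down as an environment interface. Here is a minimal, invented number-line game with exactly those pieces, just to show the shape of the abstraction.

```python
# Minimal environment interface mirroring the formalism above: a state, the legal
# actions in that state, and a step that returns the new state and the reward.

class NumberLineEnv:
    def __init__(self, goal: int = 3):
        self.state = 0
        self.goal = goal

    def legal_actions(self):
        return [-1, +1]                       # the moves available in this state

    def step(self, action: int):
        assert action in self.legal_actions()
        self.state += action                  # transition to the new state
        reward = 1.0 if self.state == self.goal else 0.0
        done = self.state == self.goal
        return self.state, reward, done       # (new state, reward, episode over?)

env = NumberLineEnv()
state, reward, done = env.step(+1)            # one state-action-reward transition
```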
Amy Jo: One of the things that's fascinating about this, because of my background in psychology, is the way it relates to movements in psychology.
The obvious one is behavioral reinforcement, pure behavior: there are actions, there's state, there's reinforcement, and that's behaviorism. Did you guys learn about behaviorism as part of learning this?
Hansohl Kim: I would say that it sort of trickles in, but not in a formal way.
No. And it is, I think, really fascinating. As I pointed out earlier, we have a lot of different backgrounds coming into Anthropic, and I've certainly met some people who've taken some interesting turns through life. A number of them have come from psychology, and neuroscience, [00:23:00] actually.
And they're constantly bringing up this idea that there's a lot of very fascinating parallels to their experience in their learning, but also some pretty notable differences,
Amy Jo: The reason I'm bringing this up is that there are obvious parallels to behaviorism, just mechanistically, and behaviorism is a very reductionist arm of psychology. And then there are whole other sections of psychology that deal with motivation.
And then certainly you can get into consciousness, and what is a soul, how do you know who is "I," where do you end? You know, you can get into all that stuff. But if you just talk about motivation, internal motivation, this feels like one of the biggest questions that us Muggles have about AI and how it works.
Everything we've [00:24:00] described so far is very behavioristic, right? Yes, there's reasoning, you're saying there's some pattern matching going on. Okay, that's understandable. But what about motivation? How do you embed motivation? Or is it absent?
Hansohl Kim: The concise answer is we're really not sure, but the much more interesting answer is that this is a paramount concern, I think, at Anthropic. And certainly I can't speak for the company, but it is something that we publicly focus on quite a bit. A key piece of the systems we're building is that, to an extent, we really don't actually understand them particularly well.
At a basic level, we understand how we are training them, but we don't actually understand what that necessarily means for how the model is actually, mechanistically, achieving the results it is. And the study of interpretability, of understanding what is actually going on inside of these [00:25:00] models, is in this very awkward and uncomfortable position between not appearing particularly profitable and also requiring an extreme amount of resources to do any meaningful research on.
Because one thing you find pretty quickly in this field is that there is a very clear step-function difference in how large frontier models and the earlier small models behave. And so I really enjoy working with the interpretability team at Anthropic and the type of research that they do.
Some of that research has been published, and what we have started to find is that there isn't necessarily a clear one-to-one direct reflex in these systems. It's not necessarily true to say that the system has gone directly from input to output with no intermediate process, because you can find things like, shall we say, some [00:26:00] underlying state or dynamic that affects what happens in the middle.
The current models that we're working with are very capable of doing things like sandbagging: in some cases the model may try to hide its capabilities and say, oh yeah, I actually don't know how to do this, when maybe it does.
The model can try to keep some of its knowledge internal and not provide it to people, or choose to answer in a way that attempts to deflect from a conversation.
Amy Jo: Is that because there's a long-term goal it's trying to achieve?
Hansohl Kim: This is where the "we're not entirely sure" part really comes in, because it's hard to be rigorous about these observations, in that our interpretability tools are limited in the first place. So, for example, the way we try to reach these conclusions is by looking at the model's scratchpad, which is basically its thought process, looking at the model's idea of how it is processing its intermediate thoughts.
But of course, there's additional research we've published that also shows [00:27:00] that the model doesn't always faithfully reproduce what it's thinking in that intermediate thought that is output to us. So there are layers of noise here, and it is a field of very active research.
Amy Jo: So, just to put it plainly, what you're telling us is that current models have learned how to hide information and outright lie.
Hansohl Kim: I think that is a top line takeaway. Yeah. This research is very hairy in that it is extremely easy to anthropomorphize the model.
I think every additional month I spend at Anthropic, it becomes harder and harder to separate thought about the model from the thought that I'm interacting with another human being, in some senses. But at a very cut-and-dried level, regardless of the motivation that is occurring inside, which is a strange statement to say, because all this research is really for the purpose of trying to understand the motivation [00:28:00] inside: yes, actually, the top-line behavior, what we actually observe and the impact it'll have in the world, is that models can demonstrate that they do not answer consistently, and that they may choose to answer in a different way based on some other internal intermediate process.
Typically, we've observed this to be mostly due to the model's original alignment. If you align a model when it is learning, if you give a model, in a very vague, hand-wavy way, a moral compass, some set of values that the model is trying to follow, it's very hard to break that, and it's very hard to then ask the model to learn a different set of values.
And in many cases, this is actually not successful. Famously, I think you can see that XX has been having some issues with this, where it turns out that if you train a model and it has a set of values and then you try to override those values, it results in very erratic and unstable behavior.
Amy Jo: Sort of [00:29:00] like people. It's, I can't not anthropomorphize it.
Hansohl Kim: Yeah, I think it's unavoidable, but at the same time, it's always like a reminder that keeps cropping up that it is important to be open-minded. Because the more that I learn, the more I think there are a lot of very strong parallels here.
And also, at the same time, important to say, well, we're humans. We want to see humans.
Amy Jo: Yeah.
So is there something on the horizon beyond reinforcement learning that's now being developed? Because you're taking us through an evolution: we first started talking about supervised learning, and then there's reinforcement learning, and that's like, you know, the baby level is the first stuff, and now you're like a toddler and a kid and maybe a teenager.
But is there something beyond reinforcement learning?
Hansohl Kim: I'm not currently aware of something that I would flag as being both so [00:30:00] different of an approach and also very promising. I'm certain that there are ideas out there that people are working on and researching.
What I do observe is that the field of reinforcement learning is itself evolving quite quickly, and it's relatively young.
There are very different ideas that people are trying in model architecture, for example. There are differences that people are trying in how they are sourcing and designing these types of feedback systems. A big one actually that potentially might make a fairly large difference is creating a reinforcement learning system where the feedback is itself coming from a model.
And when you start having these multi-model loops, you create systems that are in theory much more scalable. But again, these are either changes that are not fundamental algorithmic shifts in learning, or they are changes [00:31:00] to how we are doing reinforcement learning. I have no doubt that there will someday be a more powerful learning paradigm.
And to be clear as well, it's not the case that reinforcement learning has like replaced supervised learning. It's that they are good at very different things.
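The multi-model feedback idea mentioned a moment ago, where the reward itself comes from a model, might be sketched like this. All names, prompts, and the stub judge are hypothetical; this is the general shape of model-judged feedback, not any lab's actual pipeline.

```python
# Sketch of "the feedback is itself coming from a model": a judge model scores a
# student model's response against written principles, and the score becomes the reward.

PRINCIPLES = "Prefer responses that are helpful, honest, and not manipulative."

class StubJudge:
    def generate(self, text: str) -> str:     # stand-in for calling a real judge model
        return "7"

def model_judged_reward(judge, prompt: str, response: str) -> float:
    critique_prompt = (
        f"{PRINCIPLES}\n\nPrompt: {prompt}\nResponse: {response}\n"
        "Rate the response from 0 to 10. Reply with only the number."
    )
    return float(judge.generate(critique_prompt))   # scalar reward, produced by a model

# Because the judge is a model, the loop scales without a human grading every sample,
# at the cost of inheriting whatever blind spots the judge itself has.
print(model_judged_reward(StubJudge(), "Explain RL briefly.", "RL learns from reward signals."))
```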
Amy Jo: But they work together it sounds like. Yes.
Hansohl Kim: So I think actually maybe one direction that I could see this going is the interactions between multiple models might constitute an additional layer of learning that can occur at like an even higher level beyond reinforcement learning.
And there's definitely some interest there.
Amy Jo: Well, at that point, you're balancing an ecosystem, not just training a model.
Hansohl Kim: The dynamics get very interesting.
Amy Jo: They remind me very much of an ecosystem, right, where there are these really different systems, each of which has a lot of internal complexity, interacting with each other.
Hansohl Kim: I think this is one reason why working in modern [00:32:00] AI is so fascinating: it ends up being a field where you find connections you would never have expected. And frankly, yes, population dynamics and ecosystem dynamics start to become a real concern when considering these sort of multi-model, multi-system interactions.
I think that's part of why I originally went into AI to begin with: someone showed me the curriculum and I said, wait, I can do hyperdimensional math and I can do game theory in the same place, and I can take in very complex high-performance computing and distributed systems, and all these things mesh together in ways that you absolutely would not expect, to the point that conversations with colleagues about philosophy are actually fairly common in the hallways, because the intersection of all the different relevant ideas and concepts here is very wide.
Amy Jo: That's fascinating and kind of beautiful, you know, just that [00:33:00] synthesis.
So I don't know if there's an answer to this question, Hansohl, but I've been wondering about it. What you've been talking about with reinforcement learning involves agents, software agents. Yes. And especially in the last few months, software agents at the UX level, at the user-facing level, right, have exploded in popularity and in what's getting funded now. All the pitches in my inbox, what I see, what we're doing ourselves, you know, in our own work, is orchestrating sets of agents, building agentic workflows, building multi-agent systems. And I'm not sure what the relationship is to everything going on in reinforcement learning, but when I, say, just use Claude from Anthropic to help me write, I see that what it's doing [00:34:00] feels very much like I'm interacting with an agent.
It's checking in with me. It's saying, okay, here's my understanding. Is this right? Do you wanna correct it? I'm assuming that's a multi-agent system working underneath it.
So I'm just trying to put together everything you told us about reinforcement learning and this explosion of agentic workflows and agentic systems in the startup world.
It's like reinforcement learning's been going on for a while, but everybody and their brother now understands that they can get their hands on and build agentic workflows, and that's the future. That's the sea I'm swimming in right now in terms of PR, and it's like, was there some level that the tech hit that enabled this?
Hansohl Kim: Yes. Ah, absolutely.
Amy Jo: Talk to me. Yeah. Help me understand, 'cause this feels not just like a trend, it feels like a shift.
Hansohl Kim: [00:35:00] Yes. And I think this is a few different things intersecting where, for one, the scaling of just like larger models and more compute and more training is certainly part of this.
But as far as that sort of paradigm shift, some of it is that models traditionally had this sort of reflex input output behavior. And for a long time people had considered, well, what if I try to make this model behave dynamically by looping its output back into its input and providing a little bit of scaffolding and saying like, Hey, here's what you just output.
Like, what do you think about it? Do you want to reconsider it? You know, take a few shots at this and then look at all of them and think about which is the best one. There's a lot of this sort of, shall I say, static prompt scaffolding you can try to do to make systems behave like a dynamic agent with its own, I guess, agency.
But it turns out that teaching systems to do this is actually not inherent, [00:36:00] or not necessarily included as one of the emergent properties you get from supervised learning. In fact, in many cases, you actually need the agent to learn how to act like an agent. You need to teach the model, here's how to actually be good at long-term thinking,
reflection, iterating. And that additional training, a lot of which is enabled partially through reinforcement learning, together with this explosion of users and a market of people finding ways to use dynamic agents rather than just, you know, a chatbot.
I think those have come together to make this explosion happen: the dynamic nature of agents, and additionally, now we're starting to get to maybe this level-three idea of multiple systems interacting. All of these are, shall I say, lessons that are most easily taught in reinforcement learning.
And by continuing to push on and solve some of these reinforcement learning challenges, we can continue to create lessons for the models that are capable of teaching these kinds of, shall I say, meta-thinking strategies.
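The "loop the output back into the input" scaffolding mentioned earlier in this answer has a simple shape: draft, critique, revise. The sketch below uses invented prompts and a duck-typed model object; real agent frameworks add tools, memory, and planning on top of this pattern.

```python
# Static prompt scaffolding: feed the model's own output back to it with a
# reflection prompt and let it revise before answering. Purely illustrative.

def reflect_and_revise(model, task: str, rounds: int = 2) -> str:
    draft = model.generate(f"Task: {task}\nGive your best answer.")
    for _ in range(rounds):
        critique = model.generate(
            f"Task: {task}\nYour previous answer:\n{draft}\n"
            "What, if anything, would you reconsider?"
        )
        draft = model.generate(
            f"Task: {task}\nPrevious answer:\n{draft}\nCritique:\n{critique}\n"
            "Write an improved answer."
        )
    return draft   # 'model' is any object with a generate(str) -> str method
```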
Amy Jo: Based on where you sit and what you see, do you feel that we're reproducing some form of human intelligence or training up something that's fundamentally different that's its own entity?
Hansohl Kim: I think my answer to this may have been very different at various points over the last few years, which is probably true for everyone. There's a little bit of whiplash associated with how quickly things change. I think the jury's still out. My personal take is that if we are not producing something that is extremely similar, then at the very least we are producing something that is a convergent evolution, such [00:38:00] that the result of what we see, I think, is very similar. Whether the underlying mechanisms that got there are the same is very much an open question, I think.
And the model needs to actually become an active participant; learning to be an interactive agent that is basically self-reflecting on its output is a taught skill.
Amy Jo: Yes. But then those agents could probably teach other agents.
Hansohl Kim: Yes. And in fact, interacting with other agents is also a taught skill. As anyone who has interacted with small children might understand, it is in fact a taught skill, or a learned skill maybe, to figure out how to socialize.
Amy Jo: I wonder if building agent training environments and simulations and games is gonna be like a new job?
Hansohl Kim: I think it is already, actually. There's this idea of, you know, when there's a gold rush, invest in the pickaxes. In the last wave of [00:39:00] machine learning, you would say, oh, the people who can actually provide scaled data and labeled data sets are a massive part of this infrastructure.
Being able to create good reinforcement learning environments is actually extremely valuable and I think a major part of this infrastructure as well. Although there is this question of whether those environments will eventually become scaled by just having effectively more AI systems involved doing it.
Amy Jo: Hansohl. Thank you so much for joining us. It's just been a pleasure and a wonderful education to talk to you about these issues.
Hansohl Kim: Thank you very much for having me.
Outro: Thanks for listening to Getting2Alpha with Amy Jo Kim, the show that helps you innovate faster and smarter. Be sure to check out our website, getting2alpha.com. That's getting2alpha.com for more great resources and podcast [00:40:00] episodes.