
AI Proving Ground Podcast
AI deployment and adoption is complex — this podcast makes it actionable. Join top experts, IT leaders and innovators as we explore AI’s toughest challenges, uncover real-world case studies, and reveal practical insights that drive AI ROI. From strategy to execution, we break down what works (and what doesn’t) in enterprise AI. New episodes every week.
Beyond Chatbots: How Digital Humans Are Transforming Enterprise AI Experiences
As enterprise AI strategies mature, the user interface is evolving beyond chatbots. Enter the digital human — real-time, emotionally aware, multilingual AI avatars designed to mimic human interaction with uncanny realism. In this episode of the AI Proving Ground Podcast, WWT Chief Technology Advisor Ruben Ambrose and Area Director of Strategy and Innovation Eric Jones unpack their journey building WWT's own avatar, "Ellie," and offer a revealing look at the infrastructure, latency tradeoffs and feedback loops driving this frontier.
What happens when cutting-edge AI meets human expression? You get more than a chatbot; you get a digital human. Whereas many AI breakthroughs can seem invisible, this one looks you right in the eye. It talks, and it feels surprisingly real. Today, on the AI Proving Ground podcast, we talk with Eric Jones, an Area Director of Strategy and Innovation for WWT, who has been focused on generative AI and its applications in enterprise workflows for the last few years, and Ruben Ambrose, a Chief Technology Advisor focused on AI and digital humans. This isn't just another conversation about chatbots. It's a look at how AI is starting to walk, talk and maybe even feel just like us.
Speaker 1:Eric and Ruben will break down the full tech stack that powers a digital human. They'll talk about deployment models and use cases and, more than anything, they'll emphasize the need to consider user experience and human factors when embracing this innovative technology. So, without further ado, let's dive in. Ruben and Eric, thanks for joining on this episode of the AI Proving Ground podcast. Before we dive into the meat of digital humans, I am interested. The two of you have gone on the road with our digital human, Ellie. I'm curious as you power it up, power it down, any guilt or anything in terms of packing Ellie, who we describe as a her, into a suitcase, or how do you deal with that ability to connect with a digital human who has human-like qualities?
Speaker 2:Well, the good news is she's a very resilient lady. She's rough and tumble and she doesn't mind, you know, traveling conditions and stuff like that, just so long as she gets her Perrier sparkling water when she gets where she's going. She's content.
Speaker 1:Yeah, absolutely. So let's level-set here, because I'm not quite sure that all of our listeners or viewers out there know exactly what a digital human is, and there are a lot of terms out there, whether it be digital assistants, AI chatbots, things of that sort. So let's define digital human, and how does it differ from the traditional AI chatbots that we're interacting with seemingly on an everyday basis now?
Speaker 3:The way that I think about the digital human, Brian, is just that it's a human-like avatar representation. It's talking to you in your native language, ideally, and speaking as naturally as possible, and I think that's the simplest way of thinking about it. There are lots of specific versions of digital humans, as you mentioned with a couple of examples, but that's sort of the foundation of how I think about it.
Speaker 2:Yeah, I agree. A lot of human interaction is visual, and you look at people's expressions when you're talking to them. So it takes the interaction we're used to with a typical chatbot, which is usually text-based, and adds that visual, human dimension to it.
Speaker 1:And it's on another level in terms of the experience, and we'll dive a bit into the technical components here. But how do you get that natural, human-like quality? How do you get the speech synthesis down pat? What types of technologies are underlying there to make sure that when Ellie, or any digital human for that matter, is turned on, it actually represents what we perceive as a human?
Speaker 2:So at the end of the day, under the covers, there are a lot of parts working together to make the functioning thing you see and actually interact with. In terms of the AI aspects specifically, there are several models working in tandem, put into what's basically called an AI pipeline, and the processing goes from end to end through that pipeline. It literally starts with models that do things like speech recognition, converting what a person is saying into text. There are models that do language translation.
Speaker 2:If you determine a person is speaking Spanish but what you're actually talking to knows English, you have to do some language translation. There is what most people are used to with chatbots today: a large language model, and retrieval-augmented generation attached to that large language model, which gives it specific information pertinent to the conversation you want it to be able to have. When those responses are generated, you have to reverse the process: take the text, turn it back into speech, turn it back into the original language you heard to begin with, and then finally pass it to yet another model which handles actually rendering the face in real time and doing the lip syncing and moving the teeth and the tongue and the eyes and the blinking and stuff like that. So, just at a very high level, end to end, there are a lot of different AI models involved in generating the output. Eric, anything you want to add to that?
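For readers who want to see the shape of the pipeline Ruben just walked through, here is a minimal Python sketch. It is not WWT's implementation; every stage is a hypothetical placeholder for whichever ASR, translation, LLM-plus-RAG, TTS and rendering models you actually deploy.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DigitalHumanPipeline:
    """Each stage is a pluggable callable, so the sketch stays model-agnostic."""
    transcribe: Callable[[bytes], tuple[str, str]]   # audio -> (text, detected language)
    translate: Callable[[str, str, str], str]        # (text, source lang, target lang) -> text
    generate: Callable[[str], str]                   # prompt -> LLM + RAG answer
    synthesize: Callable[[str, str], bytes]          # (text, lang) -> speech audio
    render: Callable[[bytes], bytes]                 # speech audio -> lip-synced face video

    def handle_utterance(self, audio_in: bytes) -> bytes:
        user_text, user_lang = self.transcribe(audio_in)      # 1. speech recognition
        prompt = self.translate(user_text, user_lang, "en")   # 2. translate to the LLM's language
        answer_en = self.generate(prompt)                     # 3. LLM + retrieval-augmented generation
        answer = self.translate(answer_en, "en", user_lang)   # 4. translate back to the user's language
        speech = self.synthesize(answer, user_lang)           # 5. text-to-speech
        return self.render(speech)                            # 6. drive the animated face
```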
Speaker 3:I was actually going to ask you, Ruben, do you want to go through some of the feedback that we've gotten around our various iterations of the digital human? Because I think there's some nuance there that we've been building on over the last four or five months, and it almost seems like at every layer we get better, and then there are new pieces of feedback that we get.
Speaker 2:Sure. We started with models to do automatic speech recognition, text-to-speech and language translation. That's kind of where we started our journey. We used a bunch of software packages that NVIDIA provides. They have a suite called Riva, and it has all these various models that do these functions and support many different language pairs and that kind of thing. Our first attempt back then was just to have a simple web page where you could push a button and talk to a large language model without typing, just with your voice, in any language, and it would figure out what language you were speaking, do the translation for you, turn it into text and send it to the large language model, get your response back and then reverse the process. So the experience for you as a user is you could basically talk to this thing without typing, and it was multilingual, right, and it would figure out your language automatically. That's where we started. We had pretty good success with that, because there are quite a lot of models available to do these kinds of functions, and we thought, hey, that's kind of neat: we can now build translation services and voice and audio interaction into chatbots. It's a step forward from what we've seen before.
Speaker 2:But shortly after that, we were asked to take it to the next level, because there was a big conference happening at the end of last year that Disney hosts, and we were going to have a big booth there that we were co-sponsoring with NVIDIA. The ask was, hey, we'd like to actually have a full-fledged digital human at this thing. Do we think we could build something like that and show it off there? It would be a really nice demonstration of the technology and of our own skill set as WWT in terms of our engineering prowess. So basically we started with the kernel of what we had with the translation pipeline, and then we added to it the face, the rendering, the human aspect, the visual aspect. We basically started chaining things onto the end to get the full feature set.
Speaker 2:Every step along the way had its challenges. Some of these models are new, and they're being iterated on very quickly by the people who build them, right? So week to week you might get a new version of a model that performs better or addresses certain issues and stuff like that. But at the end of the day, we got a nice working product. We had some actual challenges in terms of deployment too, though. All these things have to run somewhere, and just like customers have to figure out where they're going to deploy this technology, we had to make the decisions on how we were going to deploy it.
Speaker 2:In this case, we had very specific parameters. We wanted to run this thing at a conference, and, you know, internet connectivity at some of these venues usually isn't very good, so the idea was, okay, we need this thing to be self-contained, but we have so many pieces that need to run. Where are we going to run this? We ended up speccing a very, very large workstation with four RTX 6000 GPUs in it, and that kind of helped guide the architecture, because we had to get everything to fit in as much GPU RAM as we had available to us in those four cards. So it was a special-order workstation, special-purpose intent and a special architecture: we chose the models and skinnied them down so that everything would just fit on the cards we had, and all the assets those AI models needed were loaded locally on that box, so that no internet connection or external server was required. When the thing actually got to the location, it was all self-contained, deployed at the edge. The good news is that works, and it works reliably. The bad news is, when you want the digital human somewhere else, you have to physically ship stuff around and deal with shipping, things getting lost, potentially broken, stuff like that. So, post that iteration of a digital human, we've since been building a completely data-center-based version of it.
Speaker 2:That has its own challenges. In this case everything's deployed in the data center, in containers, and all the models that need to talk to each other live there. Then the idea is you stream the video out of the data center off to somebody, and it runs in a browser, right? So you don't need to ship anything anywhere, no big, complicated workstation. And then we also have a third version, which is a hybrid of the two, where we have a very small workstation with one graphics card in it, one GPU, and all it does is render the face and the animation of the face, but all the backend models run in the data center. So it's kind of a hybrid.
Speaker 2:You don't have much to ship. It's not a very specialized workstation, it's more run-of-the-mill, a very small kind of thing. But you're also taking advantage of data center power to do the heavy lifting with the models, and you don't have to send much data back and forth. You're basically just sending a sound file to the thing the person interacts with, and that's what generates, you know, the face for the person talking to it. So there are different flavors, different variants. Just like I would say to any customer: if you're trying to deploy this technology, you're going to have to look at your specific scenarios and use cases and pick one of these three flavors. Eric, you were there for a lot of this. You were hands-on at a lot of these conferences. You went with the equipment, you ran the demos. Anything you want to throw in about your experience managing these different variants? Because you've used all of them.
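As a rough summary of the three flavors Ruben describes, fully at the edge, fully in the data center, and the hybrid, here is a small decision sketch. The bandwidth thresholds are illustrative assumptions, not WWT guidance; the point is only that connectivity, shippable hardware and data center access drive the choice.

```python
def pick_deployment(venue_bandwidth_mbps: float,
                    can_ship_hardware: bool,
                    has_datacenter_access: bool) -> str:
    """Return a deployment flavor; all thresholds are illustrative assumptions."""
    if has_datacenter_access and venue_bandwidth_mbps >= 25:
        # Plenty of bandwidth: run everything centrally, stream the avatar video to a browser.
        return "data center"
    if has_datacenter_access and venue_bandwidth_mbps >= 5:
        # Enough bandwidth for audio round-trips: render the face locally on one small GPU,
        # run ASR / translation / LLM / TTS in the data center.
        return "hybrid"
    if can_ship_hardware:
        # Poor or no connectivity: ship a self-contained workstation with everything loaded locally.
        return "edge"
    return "revisit connectivity or hardware constraints"

print(pick_deployment(venue_bandwidth_mbps=3, can_ship_hardware=True, has_datacenter_access=True))  # -> "edge"
```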
Speaker 3:Yeah, I think, to your point, Ruben, it's all a matter of give and take, so understanding, you know, what our different requirements are and what that's going to mean.
Speaker 3:So when you've got the fully offline version, you're shipping things around, you have slightly higher latency, and there are some requirements on the rendering side of things that are maybe not as optimal as you want. But then when you go all the way to the data center side of things, you're essentially streaming video.
Speaker 3:So if you think about how your Netflix behaves at home when your kids hop on at the same time you're trying to watch something on a slow connection, that can lead to some pretty poor outcomes. And something we have found in this space that is not surprising, but has a lot of different flavors, is that latency plays a big role in how people interact with what we've been building. Some customers are expecting a latency of sub one and a half seconds, which is nearly impossible in the space as it exists right now. And with other customers, when you start getting above three seconds, it becomes very awkward, right? So you have to find ways to either move the avatar around or put some other information on the screen that shows, hey, I'm processing. You have the latency of the internet, but if you're in a place that has a really good internet connection, that might not be an issue. Taking each one of these custom builds and taking those customer requirements in is really important.
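Eric's latency numbers suggest a simple budget check: time each stage and decide whether to show a "processing" cue. A minimal sketch, assuming the hypothetical pipeline object from the earlier example; the 3-second threshold simply echoes the point in the conversation where responses start to feel awkward.

```python
import time
from contextlib import contextmanager

AWKWARD_THRESHOLD_S = 3.0   # per the conversation, responses beyond ~3s start to feel awkward

@contextmanager
def timed(name: str, timings: dict):
    start = time.monotonic()
    yield
    timings[name] = time.monotonic() - start

def respond_with_budget(pipeline, audio_in: bytes):
    """Run the pipeline while recording per-stage latency (hypothetical stage names)."""
    timings: dict = {}
    with timed("asr+translate", timings):
        text, lang = pipeline.transcribe(audio_in)
        prompt = pipeline.translate(text, lang, "en")
    with timed("llm+rag", timings):
        answer = pipeline.generate(prompt)
    with timed("tts+render", timings):
        video = pipeline.render(pipeline.synthesize(pipeline.translate(answer, "en", lang), lang))
    total = sum(timings.values())
    if total > AWKWARD_THRESHOLD_S:
        # Past the awkward point: surface a "thinking" gesture or on-screen processing indicator.
        print(f"slow response ({total:.1f}s): {timings}")
    return video
```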
Speaker 1:And, Eric, when you mention moving the avatar around, is that referring to the blinking and just the human motion? Our first iteration had little to no blinking, and it took some people by surprise, so just adding in those very fine, maybe overlooked elements makes a world of difference.
Speaker 3:It does. It makes absolutely a world of difference, and I think even when you start looking across the different avatar types. The first digital human we deployed was hyper-realistic, if you will. You could see the pores on her face and that level of detail, and even that sort of scared people a little bit, versus if you zoom the camera out so you don't see those pores, or you go to a full cartoon version. That has its own issues, but you can probably animate the cartoon version a little bit more. So that's exactly it, and I know, Ruben, you've gotten a lot of that feedback as well over time.
Speaker 2:Yeah, you're right about that. The very first version of the rendering of the face, when it wasn't actively responding, would just kind of be frozen, right? And if the last response finished with her eyes closed because she was blinking, that's how she stayed until she got the next question. Since then, newer versions of the rendering allow you to add something called gestures. Basically, it's like what normal people do when they're waiting around: they move around slightly, they blink, their head nods, right, while they're waiting for you to say something. It's just more presentable to a normal person trying to interact with it, and it gives you, I'll go as far as to say, more of a Star Trek, "oh, this thing is actually really polished" kind of feel when you're looking at it, even before you start to interact with it.
Speaker 1:Yeah, that gets into the concept of the uncanny valley, where humans might interact with something that's human-esque, but it almost works the opposite way: this is too human. So I guess there is that balance that our clients, organizations looking to implement a digital human, have to take into consideration. It's a very fine line to walk, right?
Speaker 2:Yeah, yeah, that's true. We've found that, because we've taken this to many locations around the country, various conferences, various AI events that we hold in different cities, and Eric and I have both run it as a demo at some of these events. And it's interesting, people have very different reactions to it. Some people are fascinated and just come right up to it and want to start talking to it, and are upset that it won't speak their specific language because we don't have that particular language model loaded. Other people are more, "I'll watch somebody else use it," you know.
Speaker 2:But some people are scared and don't want to stand there and talk to it, because it's not just a screen, there's a mic involved, since a lot of these venues are loud public spaces, right? So we have a very specific mic. Eric, I think it's called a cardioid condenser mic, is that right? Yeah. It's designed to pick up just what's immediately in front of it and ignore, you know, the background noise, so you do have to step up to the thing and talk into the mic. And some people find that a little, like, oh, you know, it's not as natural as just walking up to a person, right? So the different reactions people have are interesting, and it definitely varies by personality.
Speaker 1:Yeah. Eric, what do you think that signals for enterprise organizations who perhaps want to look into deploying a digital human, the fact that there's some pause, whether it's speaking into a microphone or seeing too many or too few blinks? What does that mean for enterprise deployment, or for how a digital human might be used in any given industry?
Speaker 3:Yeah, I think when you're considering this for enterprise deployment, the first thing you need to consider is your end user and who you are deploying this for. Some enterprises are looking to deploy this internally. You have a little bit more give and take there because, as an enterprise, you can more or less say, "I need you to use the digital human," and people have some more leeway there. Versus if you're deploying this to, let's say, your customer base: does that customer base get forced into using this? How is that going to impact their satisfaction? Do they have the option of using it? There are some really interesting feedback loops there: when people choose to use it, how often do they end up going over to a human? It's almost like the kiosks you have today in quick-service restaurants and retail and things like that, where if there's a kiosk but the kiosk is frustrating enough, they just go to the human, and you want to avoid that as much as possible.
Speaker 3:But those are the things that enterprises first need to think about. Who is your customer base? How do you plan on deploying this? What's your strategy for deploying it and, as quickly as possible, getting feedback on that strategy so you can make pivots? This is not a "build a big thing over the course of several years and then deploy it." This is one of those "get it into select users' hands" situations, ideally users who would be interested in providing feedback, and possibly even struggling through it a little bit just so they can provide that feedback. Because that's honestly how most of our AI systems get made, and that's how you can make a good digital human as well.
Speaker 2:Prototyping, your keyword there is prototyping. Build a prototype, try it out, learn from your prototype, make adjustments, and repeat, repeat, repeat until you're happy with what you have and willing to put it in front of your general customer base, you know.
Speaker 1:So we're talking about use cases and how we can go about prototyping. Are we seeing any of this in the market yet? Pick your industry. Is anybody actively taking a digital human and putting it into practice in the real world?
Speaker 2:Yep. So we are seeing a lot of customers prototyping, and they're in various stages of that prototyping journey. One customer in particular, a large telecom, is interested in potentially having this in some of their retail stores. They've actually been prototyping this internally for quite some time now, and I would say they're in a similar place to where we are in terms of the maturity of their digital human and how it performs. We have another customer who's a theater chain, and they're considering it as well for their lobbies, so not quite a retail setting, and they're interested in potentially prototyping with us to see how that might look and what it might look like: how effective is it, or how willing are customers to interact with it versus actually going up to the counter and talking to somebody in person.
Speaker 2:The other thing I'd throw in here is that there are still improvements we can make to what we've built so far that we just haven't gone down the path on yet, right? It can be basic things, like how well it pronounces or understands acronyms, because that makes a big difference to the human listener. That would be a big one. And then the other one is, instead of having the mic be so obvious and making you physically interact with it, you could potentially hook up a camera, mount it to the side, and use models to do some image recognition so it can tell when somebody has walked up to it and start listening at that point, so the experience of interacting with it is a lot more seamless. So once you have good success on the base, there are definite paths you can go down to refine and improve the interaction more and more.
Speaker 1:Yeah, I had that for a later question, but since you brought it up, I'll just ask: what types of advancements in AI are we seeing that are really going to help propel the digital human forward? It could be picking up on emotional resonance, or maybe just understanding data about a user before they even start asking questions. Are there any advancements, in the recent past or upcoming, that we expect to play a big part here?
Speaker 2:I haven't seen any models that can tell mood from how you're speaking. I don't know, Eric, if you have; I've not seen anything that advanced. But from our experience, from watching people interact with it, and from feedback from people who have interacted with it, speed of response is the number one thing we hear. First of all, how quickly is this thing responding? Because when you talk to a person, you get something back in a fraction of a second, right, and that's people's baseline for what they expect. So speed of response is one. Secondly, I would say pronunciation of things, especially acronyms, can throw people off if it says something in a way a human wouldn't, like pronouncing "AI" oddly, things like that, which can be tuned in many models today.
Speaker 2:I would say some of the uncanny valley stuff, like how good the lip syncing is, is another thing people really notice, because people really pay attention to a face when they're talking to a face, right? It's just normal: what humans do is read a lot of the other person's body language, not just what they're saying to you. So I think that's the third thing. And then I'll use the word sensitivity, or awareness. What I mean by that is, if I'm talking to you and you suddenly start interrupting me, I'll stop and try to listen to what you're saying, and then I'll adjust my reaction accordingly, right? Well, what we have built now is very much "say what you have to say and then you get a response," and we haven't yet built in the interruptibility that a normal person would have. I know there are models that can handle that; that technology actually exists, and I would say it's one thing that's improved in the last few months. That's a potential thing for us to add that would make it way more human-like in terms of how you can interact with it.
Speaker 1:Yeah, so you're talking about just making it more human-like, so it's more of a human-to-human conversation. Eric, are you seeing anything on the horizon in terms of advancements in AI, or any technology for that matter, that would make a digital human more human?
Speaker 3:Yeah, to improve on the parts Ruben was mentioning around the feedback we're getting and the important pieces there: there are a couple of improvements in the models that I'm seeing that I think are going to make a big impact, and those are improvements to the speech models. We're already seeing certain companies take that on, especially in a cloud-hosted solution. Making speech sound less robotic, that's a big deal. And right on the horizon we're looking at models where you'll be able to do that training more locally a little more easily, especially from the NVIDIA stack, and have a customized model that sounds less robotic and handles those acronyms, like Ruben was saying.
Speaker 3:But also, from a cultural perspective: we were working with a customer recently, and, you know, one thing we didn't talk about is that we've also demonstrated Ellie on three different continents and in at least four different cultures, if you don't even count the subsets of cultures within the United States.
Speaker 3:And as we continue to demonstrate this digital human, you realize that certain cultures are more accepting of the experience and don't get hung up on certain things, and other cultures may get hung up on those.
Speaker 3:Going back to the audio model, right: if you think about deploying this in the UK but using a United States-based audio model that doesn't have the right accent, something like that could be anywhere from funny to culturally unacceptable, depending on where those accents lean, so we need to make sure we're sensitive to that. The other improvement I see happening, and let's just use OpenAI's models as an example, since they're one company doing the same thing, is giving you the option between a 4o model versus a 4o mini model that might cost less and return faster. Those types of things mean that, even when we're looking at these hybrid solutions, maybe we take a small language model to get your initial response, to get to that really quick response like Ruben was mentioning, and then use a larger AI system on the back end to get the more robust answer. Those are some of the things I'm really excited to start looking into, and I think we will see large improvements on this front this year.
Speaker 2:And one more thing to throw in there, Eric: I think customization of the appearance of the avatar and the face is something we get asked about a lot, and with the current versions of what we use, that can be a laborious, expensive process, especially with the high-quality, high-res version of the head model we use today.
Speaker 2:Generating a new one of those is not a simple or trivial process. So there's been a lot of movement, and there are some companies, startups out there, trying to specialize in custom avatar generation, right? Maybe not the full quality; they call it 2.5D, where it's not a fully 3D-rendered object, but it is animated and it does move when it talks and stuff like that. And the ability to generate those custom avatars off of pictures, or even just sending a prompt saying "build me an avatar that looks like this and has these characteristics" and having it generated for you, means you can just stick it into the rest of your pipeline and have a custom avatar that's relevant to either your customer base or your company culture, right, if you have a company mascot or something like that.
Speaker 1:Well, that customization makes me think of the movie Interstellar, where Matthew McConaughey's character is talking to the robot TARS and says, hey, scale back your humor setting, or scale up your honesty setting. So not only from a looks and visibility standpoint, but just in terms of how a digital human might convey information, that type of customization would be important too.
Speaker 3:Computer vision and being able to know who I am. I've been using the quick-service restaurant or retail as an example: coming in and it saying, you know, I kind of know what you ordered before. These systems exist today; being able to tie into them and maintain that session is the key. You're trying to strike a delicate balance of being as convenient as you can with a customer without being too information-seeking and off-putting in that way. Like, "oh, I know what direction you walked away from me last time," that might be a weird thing to keep track of. But knowing that, hey, when we talked last, you went to this location, how did you like it, do you want to try something different? There are subtle differences you can make in how you keep track of that interaction with a person, and even having customized avatars for an individual person is something I think will take this to the next level.
Speaker 1:Well, Ruben, how do these digital humans learn and adapt? Are they doing that right now, or is that on the horizon? And if they are doing it right now, how are they able to keep up to speed with the user?
Speaker 2:Yep. So the quality of the responses depends first and foremost on which underlying large language model you choose to power it, because that's the heart of everything, and the quality of response you get will vary depending on which one you pick, how many parameters it has, how big the training data set was and all that jazz, just like any chatbot. The second component is retrieval-augmented generation, and that's where you start to teach the large language model stuff that's specific to you. Obviously we want our digital human to know all about what kind of technology work we do, who our partners are, case studies on work we've done in the past and business outcomes for different customers. So we went to the team that built our internal chatbot; they had basically gone out and gathered this data set and attached it to the RAG on their LLM, and we, quote unquote, stole their data set, and that's what we loaded into our digital human. So you can ask it about any of these things.
Speaker 2:And we knew we were going to get relevant responses that make sense, that, you know, match the information you'd see if you were just browsing our site, right, because that's where everything is published. That's how you make it relevant to you, to your customer base. If it's a retail setting, that's where you would load in your menu, what the options are for your salads, what things cost, those kinds of things, right? Same thing if you're a hospital and you want a digital human in your lobby: people are going to come in and ask this thing, "How do I get to radiology? I don't know where to go in the hospital," right? That's where you load all that stuff in, through retrieval-augmented generation, and you make it specific to your use case and your context.
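A toy example of the retrieval-augmented generation step Ruben describes: domain documents (case studies, a menu, hospital wayfinding) are retrieved per question and prepended to the prompt. The word-overlap scoring is only a stand-in for a real embedding-based retriever, and call_llm is a hypothetical hook to whatever model you deploy.

```python
def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def answer_with_rag(question: str, documents: list[str], call_llm) -> str:
    """Ground the LLM on retrieved context so answers stay specific to your use case."""
    context = "\n".join(retrieve(question, documents))
    prompt = f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

# Example: a hospital-lobby digital human grounded on wayfinding documents.
docs = [
    "Radiology is on the second floor, east wing, past the blue elevators.",
    "The cafeteria is on the ground floor and is open 7am to 7pm.",
]
# answer_with_rag("How do I get to radiology?", docs, call_llm=my_deployed_model)
```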
Speaker 1:Well, Eric, Ruben's mentioning a couple of different physical settings where we might interact with a digital human. How should organizations think about where to run these workloads, knowing that they could take place in a lot of different areas? I'm assuming it's a hybrid of cloud, on-prem and edge, but how should we walk organizations through assessing and determining where they run those workloads?
Speaker 3:Yeah, I think, going back to my original comment for the enterprise: understand who your end users are, and then expand that to where you plan on running it and whether you have the infrastructure to connect this to a common data center. Let's continue the theme Ruben just set. If you're a hospital, for instance, you already have a data center on prem that you should be able to expand, and WWT can help with that if you need to, to give the digital human access. That's where we would do much more of a "run your workload in your data center" approach. Versus, let's just call it a national park or something like that: well, you don't have the infrastructure or a data center sitting inside your national park.
Speaker 3:So that's where we would be looking at more of the one-off solutions, where there's a central computing node there; maybe it has access to the internet every now and then so you can update the data set, but it largely needs to operate on its own during normal operating hours. And I think for most enterprise customers you'd be thinking about how this scales. Do I want to run one node? Do I want to run 50 nodes? What will really dictate this is, you know, your access to data centers, your existing infrastructure today, and what the internet connection is going to look like where you want to deploy this in the long term. That's what's going to guide which solution you start building from.
Speaker 1:Yeah. Ruben, where's the industry at? Those are a lot of considerations for any organization to weigh from a technical perspective. As you're out demoing Ellie, WWT's digital human, what other questions are you getting from IT leaders or business leaders in terms of how they're starting to think about implementing it within their own teams or organizations?
Speaker 2:Yeah. So the first thing is just upskilling people on how to install these models, spin them up, chain them together, and customize what you get out of the box in terms of how much is pre-trained and what you can adjust after the pre-training. That's probably the first question, and that's an internal development capability kind of conversation. The second one is almost always, okay, like we were just discussing, what's the best, most sensible way to deploy this? Where do these things actually run? Do I put my models at the edge? Hybrid? Do I put it completely in the data center, whether that's my data center or a cloud-based data center? That's the second conversation. I feel like the third thing people then think about, and the second and third are a little related, is what day-two support looks like once it's deployed. Because if I put everything out at the edge, then when I want to update things and things are physically spread out, like the national park example, there's a cost to that, right, to keeping things patched and updated and doing maintenance. A power supply dies on the thing that powers this at the Grand Canyon, and somebody's got to go out there in a truck, right? Very different when it's in a data center, if you need to maintain it.
Speaker 2:But then you have a single central point of failure, and you have to deal with some of those kinds of things, right? Can I take all my digital humans down and patch them, because I need to do that, right? And how do I scale up the data center deployment so I can run multiple of these simultaneously? Which components in my AI pipeline can I share, so that more than one digital human can use them at any point in time, and which things need to exist one-to-one for every digital human that's spun up as demand goes up? Because you can actually put those into two separate buckets, just to cut down on how much GPU I need in my data center to support 30 simultaneous people talking to this thing, right? There are strategies you can use to skinny that down as much as possible and make the most of the money you're spending on GPUs, because obviously they're not cheap and you need a fair amount of them to power these things properly.
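Ruben's point about splitting the pipeline into shared components versus per-session components lends itself to a back-of-envelope sizing calculation. All memory figures below are placeholder assumptions, not measured WWT numbers; the structure of the math is the takeaway.

```python
SHARED_VRAM_GB = {       # loaded once and shared across sessions (e.g. via batching)
    "asr": 4,
    "translation": 2,
    "llm_plus_rag": 40,
    "tts": 4,
}
PER_SESSION_VRAM_GB = {  # instantiated one-to-one per concurrent digital human
    "avatar_render": 6,
}

def gpus_needed(concurrent_sessions: int, gpu_vram_gb: int = 48) -> int:
    """Ceiling of total VRAM over per-card VRAM; ignores compute limits for simplicity."""
    total = sum(SHARED_VRAM_GB.values()) + concurrent_sessions * sum(PER_SESSION_VRAM_GB.values())
    return -(-total // gpu_vram_gb)

print(gpus_needed(30))   # rough card count for 30 simultaneous conversations
```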
Speaker 1:Yeah. Eric, you've been demoing as well. Are there any common themes in the questions you're seeing, or anything you're maybe surprised that people aren't asking about this technology?
Speaker 3:Yeah, and to follow up on Ruben's answer, because I think it answers this as well, Brian: the digital human actually fits really well into an enterprise's AI strategy, right?
Speaker 3:However you're planning on scaling your AI today, be it for chatbots or for other uses of AI, the digital human sits nicely on top of that and is generally going to align. So if your enterprise is moving in the direction of GCP- or Azure-hosted models, we can put a digital human on top of that. If you're working in the data center and you want to expand in that way, we can put a digital human on top of that data center. And to answer your question directly, Brian, the questions I'm not hearing from people are: how would I actually deploy this, what are the constraints of deploying it, and how does it fit into my general AI strategy over the next couple of years? That's the thing I would really like to start tying together, because I don't view the digital human as a one-off component to add to your enterprise. I think it's a component, a Lego block, that you can stitch on top of your existing AI strategy.
Speaker 1:Yeah. Well, answer your own question there a little bit, Eric. How would organizations start to think about integrating that with their broader AI strategy, or start to work toward that deployment?
Speaker 3:Yeah, I think the first thing is just to ask those inquisitive questions that Ruben kind of alluded to. Hey, how many GPUs does it take to run three streams of this at once, or six streams of this at once? How does it scale over time? How can we, as WWT, help our customers to scale that out?
Speaker 3:That's something our own engineers really pushed us on when we moved our build into the AI Proving Ground: we need to come up with a better enterprise solution for this that scales, and we need to understand the constraints of what it would take to scale it out. That's for data centers specifically, but if you're a customer, the same problem applies as you look at your cloud providers: how many EC2 instances are you going to have to set up in order to support six streams, 12 streams or something like that, and what's your scaling plan or strategy around that? How long does it take to spin one of these up in case you do hit your workload limit? Those are the things on our minds, and where we can help our customers as they start to think through how they would actually deploy this.
Speaker 2:In terms of the actual demo and people coming up, and the kinds of questions we get asked: the first thing is, what's this running on, right? So usually we can point to the workstation sitting right there and say, hey, everything fits on four GPUs, here are the models we're using, each one uses roughly this many resources and we fit it all on here. It's a similar conversation if we're showing the web-based version: hey, this is running in a bunch of containers, et cetera, et cetera, here are the models we're using. So that's usually your first and foremost question.
Speaker 2:The second thing is usually surprise when we talk about the multilingual capability. That's not something people are used to. They're not used to something just figuring out what language you're speaking and responding in kind. They're used to talking to things; we talk to our phones and say things all the time, right? Like, how many people have Alexa at home? So people aren't so taken aback by that part, but the multilingual capability usually catches people by surprise. How does that work, right? How do you actually get it to do that? How does the language detection work? A lot of questions around that as well. And then the third thing I would say is people are usually taken with the face, because we use such a high-quality one. As Eric mentioned earlier, you can literally see hairs and pores on the version that runs on the big workstation. So usually people are quite taken with that, and then they always ask, well, is it easy to change that? How difficult is it to get a different face if you want one? What does it take to generate one of those?
Speaker 1:You know, it's interesting that you bring up the realistic nature. I'm assuming it's going to continue to get more and more realistic, to the point where we'll just be walking around town or around our office buildings, or wherever we may be, and interacting with a variety of these digital humans on a consistent basis, similar to how we interact with humans. I mean, are we working towards a coexistence model here?
Speaker 2:You're talking about Skynet, is that what you're asking? Well, I guess I think it depends, right? There are some use cases where you actually want to look at a face, and that's just the best way to get what you need, to get your question answered or to get directions. There are other cases where it's a digital human, but you need everything but the face.
Speaker 2:For example, in a drive-thru, you want something you can talk to that understands the questions you're asking about an order, that you can interrupt and say, hey, actually, take that milkshake off the order, I changed my mind, and you're getting a completely fluid, natural conversation like you'd get from any other human. There is speech synthesis and speech detection involved, and there's language involved, because if somebody speaking Spanish pulls up to the drive-thru, you want them to be able to order just the same.
Speaker 2:But that's a case where you don't necessarily want the face; you want everything about the digital human except the face in terms of interaction. So that's what I mean: it depends on the modality, it depends on the situation. Sometimes the face is almost unnecessary. In a hospital setting, I would say it's way more pertinent, right, and at a drive-thru, you know, it's just not. So every time it comes back down to: what's your business use case, right? Who are your customers or your employees, what are they going to use this for, and in what situations? And you have to adjust accordingly.
Speaker 3:Yeah, and to add to that: whether or not you're going to be interacting with digital humans all over the place, and trying to make them look more realistic, which I think was part of that question, Brian. To me, I view it like the video game industry, where for a long time we pushed and pushed and pushed for better graphics, to the point where you wanted it to look super realistic, and there are some games that still push in that direction, and I think we're always going to see that drive. And then there's a large portion of the industry that has moved away from that: you know, hey, graphics sometimes take you away from the experience. Take something like Fortnite, where they're not trying to get hyper-realistic with their graphics, or the entire Nintendo Switch platform. They're trying to build much better and richer user experiences without focusing on the graphics.
Speaker 3:And I do see that as being a component. To give two sides of that: in the future, where you may want to call your doctor and have your doctor go through an actual exam, and I think we saw a lot of that through COVID, where people were doing that virtually, you're going to want that person to look hyper-realistic, almost mimicking talking with a real doctor. Now, that's a slippery-slope example we could talk about in different ways. But when you're talking about a retail example, or other areas where you're not really trying to mimic the real human experience but just a human-like experience, graphics may not matter as much. And that's, once again, a compute constraint; it's understanding your outcomes, what you're going for and how you would make those trade-offs.
Speaker 1:Well, Eric, as you mention that, I'm a little bit surprised, actually; this is turning into much more of a user-experience, user-centric solution. What do organizations need to consider in that regard? Is it just that continuous feedback, understanding how users interact with it, what they expect from it, what makes them comfortable versus uncomfortable? Is that the process, or how do they start to think about user interaction and user design?
Speaker 3:Yeah, I think about this a lot, Brian, because the digital human is a new kind of user experience, and we don't really have a clear foundation for how we're going to handle it. You can think about the iterations of the user experience: at one point it was just audio, then we had other evolutions, from mouse point-and-click to touchscreen interfaces, and there are lots of different ways you can interact. I would really consider that user experience from the ground up. What does it look like to start a session with the digital human? Do you want somebody to have to start it manually, or, like Ruben mentioned, do you want it to start automatically? What are the constraints around that?
Speaker 3:Do you want to show the text log that's going on, so the user can see what the digital human was hearing and responding with, or do you not want to show that, and what are the trade-offs there? So, as with a lot of user interfaces, there's not really a silver bullet; it's more about what we want to accomplish with this user interface and how we make it as natural as we can. But you do have to start from the ground up in a couple of different ways, in my opinion, when you're talking about the digital human experience. So that's where I would start: with an understanding that this is going to be different, and let's think through what's possible with those differences that maybe wasn't possible in the past, and then how we can even combine them with other user interfaces, such as a touchscreen or things like that.
Speaker 2:I'll add a couple of things there. I'll bring the P word up again: prototyping. Prototyping is your friend. Stand something up, have real people use it, get feedback. And the second thing I'll add is, Eric was giving some examples of interfaces that have developed over time, right, mouse and touchscreen. The difference is that with a mouse and a touchscreen there was no precedent for how that interaction should go, but with a digital human, people already have a baseline they expect, which is talking to another person, and how does that work, right? So it's a little more challenging.
Speaker 1:Let's say they deploy a digital human in a drive-thru or in a dressing room, or whatever the setting might be. What do organizations have to consider in terms of the IT lifecycle and how they're going to keep up with these rapid advancements?
Speaker 3:To me, that dives straight back into your AI strategy. That's not a question unique to the digital human. I'll use a very quick example: if you have not built into your AI strategy how you're going to upgrade from one model to the next and do that seamlessly across your enterprise, you're already set up for failure, and the digital human will be another place where you're adding difficulty for yourself. I'll use our AI Proving Ground as a great example, right: we host many, many different models in the Proving Ground, but we make it seamless to switch from Llama 3.1 to Llama 3.2 whenever it comes out, to Llama 3.x, which is actually how I describe our architecture to different customers, because that's going to change, let's call it, five times this year. And how do we make that change as easy as a single commit to a code repository, so you're just swapped over?
Speaker 3:That, to me, is the key to making sure this is maintainable and that you can actually keep up with the industry. And just one more thing there: having frameworks for evaluating the success of the deployment. So, using automated evaluation frameworks: when we move from this model to that model, how do we know we didn't actually regress, or how do we know that now we do need to change some of our prompts? Having that scoring and that transition be as automated as possible is also key to successfully upgrading along with the industry, 100 percent.
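A minimal sketch of the two ideas Eric closes with: keeping the model choice in configuration so a swap is a one-line commit, and running an automated regression check before rolling the new model out. The config key, golden set and scoring below are hypothetical, not WWT's actual framework.

```python
import os

# Model choice lives in config / environment, so switching (e.g. Llama 3.1 -> 3.2) is one change.
MODEL_NAME = os.environ.get("DIGITAL_HUMAN_LLM", "llama-3.1")

# Tiny "golden set": (question, phrase the answer must contain). A real framework would
# use many cases and richer scoring (semantic similarity, human review, and so on).
GOLDEN_SET = [
    ("How do I get to radiology?", "second floor"),
    ("What does WWT's AI Proving Ground offer?", "models"),
]

def passes_regression(call_llm, threshold: float = 0.9) -> bool:
    """Return True if the newly configured model still answers the golden set well enough."""
    hits = sum(expected.lower() in call_llm(q).lower() for q, expected in GOLDEN_SET)
    score = hits / len(GOLDEN_SET)
    print(f"{MODEL_NAME}: {score:.0%} golden answers matched")
    return score >= threshold   # gate the rollout on this check
```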
Speaker 1:Well, great, that seems like as good a spot as any to end this conversation. Eric and Ruben, fantastic conversation. Thank you so much for joining us today amid what I'm sure is a busy schedule, or at least carting Ellie all over the world, as you mentioned. Good luck on the future demos, and thanks again for joining. Thanks.
Speaker 2:Appreciate it.
Speaker 1:Thank you. Okay, so what did we just hear? We went beyond the buzzwords to explore how digital humans are becoming the next interface of human-AI interaction, not just through Ellie's lifelike presence, but through the complex systems and thoughtful design behind her. A few key lessons from today's episode stood out to me. First, innovation doesn't happen in isolation: building a digital human requires orchestration across AI models, UX design and infrastructure. Second, deployment matters: edge, cloud or hybrid strategies need to align with real-world use cases and user expectations. And third, the human factor is crucial: from blinking to latency, user trust and comfort are key to adoption. As digital humans evolve, they're not just answering our questions, they're reshaping how we connect with technology. If you liked this episode of the AI Proving Ground podcast, please consider leaving us a review or rating, and sharing with friends and colleagues is always appreciated. This episode of the AI Proving Ground podcast was co-produced by Naz Baker, Cara Kuhn, Mallory Schaffran, Stephanie Hammond and Brian Flavin. Our audio and video engineer is John Knobloch, and my name is Brian Felt. See you next time.