UXchange

UX and AI Digest Episode 2

β€’ Jeremy β€’ Season 2 β€’ Episode 2


Evaluating AI Agents, Claude's Computer Access & Prompt-Only Enterprise Software

πŸ”¬ EVA β€” A Framework for Evaluating Voice Agents

  • I hadn't realised we lacked a proper evaluation framework for voice agents β€” this one from Hugging Face caught my attention
  • What I like: it combines two dimensions I've always thought should go together β€” accuracy (task completion, faithfulness, speech fidelity) and experience (conciseness, conversational flow, turn-taking)
  • My question: is the "experience" side actually measured with real end users, or just by the designers?
  • This connects to a three-step evaluation model I keep coming back to: define your ingredients, evaluate internally, then validate with users β€” and compare the gap
  • I'll dedicate a full episode to this, but the short version is: if you want to elicit trust or satisfaction, you need to know which product attributes actually produce those outcomes

πŸ€– Claude + Cowork β€” AI With Access to Your Computer

  • Cowork now lets you authorise Claude to access your files and folders so it can act on your behalf even when you're away
  • I'm genuinely torn β€” amazed by the technology, but uncomfortable with the direction
  • My concern isn't the capability itself, it's the pattern: LLMs arrive, and suddenly we open the gates to everything β€” recording, transcription, computer access β€” as if these things naturally belong together
  • My rule of thumb: always assume your data is being used to improve the product β€” if you have doubts, assume yes
  • I'd love to see more push for private, self-hosted LLMs β€” but the honest tension is that commercial ones will keep winning on convenience because they have more data to train on
  • It's not even apples to apples β€” and that's what makes this hard

πŸ–₯️ Aragon β€” What If Enterprise Software Was Just a Prompt?

  • Startup Aragon raised $12M at a $100M valuation to replace enterprise tools like Salesforce, Jira, and Tableau with a single LLM interface
  • Their thesis: buttons and menus are dead, future business is done by prompt
  • My honest reaction: I get why this is being explored β€” we're mapping the edges of a new territory and seeing what sticks
  • But one modality for everything? I'm not convinced β€” when I was building my own website, I actually wanted both: LLM for generation, drag-and-drop for fine-tuning β€” and that product barely exists yet
  • Users have 10+ years of muscle memory with their tools β€” strip that away and you're not simplifying, you're adding friction
  • Nielsen's heuristics exist for a reason: people need control, exit doors, and multiple ways to accomplish a task

SPEAKER_01

In today's episode: what if we could use our enterprise software only through prompts? What if we had a framework to evaluate the quality of the output from LLMs and AI? And what if we gave AI total control of our computer? I'm going to cover three short stories. The first is about a framework for evaluating output from AI — the article I found focuses on evaluating voice agents, but I think it has implications for AI overall. Then we'll cover what could happen, and what the value could be, of leaving your computer open for AI to use. And finally, what if our enterprise software looked more like a prompt? So let's start. I haven't read the whole articles, of course; I'm using them as inspirational elements to discuss the implications at the user experience level. That approach is genuinely helpful to me, because I sometimes suffer from analysis paralysis, so it can be more productive to read part of a piece and discuss what the implications would be at the user experience level. First and foremost, we have a new framework for evaluating voice agents called EVA. This is an article I found on Hugging Face. It's a framework for evaluating end-to-end, multi-turn spoken conversations with voice agents, using a realistic bot-to-bot architecture. What's interesting is that there are several scores. We have accuracy — let's say the physical aspects of the interaction — and we also have experience: how was it perceived by the end user? I find this really interesting because it finally combines the two. And first and foremost, I didn't know we lacked a framework for evaluating voice agents; I honestly thought we had one.
I worked on that topic for a year, so I reviewed the literature on how to make conversations with artificial agents more natural. I discovered a whole set of things we humans do that make our conversations natural and acceptable, such as backchannels. When you speak to someone, you're waiting for cues that show they're understanding you, following you, agreeing or not with what you said — at minimum, that they're giving you feedback. In the absence of that, the conversation feels artificial, even eerie. So there's a need to take inspiration from human-to-human interaction and apply it, to some extent, to human-to-artificial interaction — because we can fall into the uncanny valley. For those not familiar, the uncanny valley describes the unease we feel as an artificial agent's behavior mimics a human's more and more closely while still lacking human capabilities: once we realize it's artificial, it creates an eerie, cold sensation. Anyway, I found this interesting because our frameworks often put a strong emphasis on measuring either the experience a user has with their AI, agent, or product, or the other side — the physical aspect. I like to call it the physical aspect, though it isn't necessarily physical: I mean the ingredients, the criteria, the attributes you put into your product. What I find interesting is that this framework combines both.
For instance, it has accuracy, by which we understand task completion; faithfulness, which measures whether the agent's responses were grounded in its instructions, policies, user inputs, and tool-call results; and speech fidelity, which measures whether the speech system faithfully reproduced the intended text in spoken audio. Then we have experience, which is really interesting because, honestly, I don't know how they measure it. It looks like they measure conciseness — whether the agent's responses were appropriately brief and focused for spoken delivery; conversational progression — whether the agent moved the conversation forward effectively; and turn-taking — whether the agent spoke at the right time, neither interrupting the user nor introducing excessive silence. That's interesting, because I don't know exactly how the experience metrics are measured. Are they measured with real end users, or by the designers? At first glance it looks like the latter — and it almost doesn't matter for the point I want to make here, which I'll dedicate a full episode to: the importance of three steps when we evaluate whatever we conceive. It looks like what they call experience is not measured with end users — maybe I'm wrong, and I'll correct that in a future episode if so; maybe they evaluate it themselves. I can't find that information right now. Either way, it sparks conversation: when you create something, you need to evaluate it on several fronts. If you want to elicit an experience — say, an increase in trust or satisfaction — you should know what to put into your product to elicit those outcomes.
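To make the two sides concrete, here's a minimal sketch of how per-conversation scores along EVA's two dimensions could be aggregated. This is an illustration only — the class, field names, and the plain averaging are my assumptions, not EVA's actual API or weighting.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class VoiceAgentScores:
    """Per-conversation scores in [0, 1]; field names mirror EVA's metric names."""
    # Accuracy side: did the agent do the right thing?
    task_completion: float
    faithfulness: float
    speech_fidelity: float
    # Experience side: how did the conversation feel?
    conciseness: float
    conversational_progression: float
    turn_taking: float

    def accuracy(self) -> float:
        # Unweighted mean is an assumption; a real framework may weight metrics.
        return mean([self.task_completion, self.faithfulness, self.speech_fidelity])

    def experience(self) -> float:
        return mean([self.conciseness, self.conversational_progression, self.turn_taking])

scores = VoiceAgentScores(0.9, 0.8, 1.0, 0.7, 0.6, 0.8)
print(f"accuracy={scores.accuracy():.2f} experience={scores.experience():.2f}")
# -> accuracy=0.90 experience=0.70
```

The useful part is the separation: an agent can score 0.9 on accuracy and 0.7 on experience, and those two numbers tell you different things to fix.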
You should be really clear: if I put in this turn-taking behavior or this voice, it will be perceived as more trustworthy, and so my user will trust it more. This three-step evaluation framework isn't new, but I'm advocating for focusing on it even more with AI, because we're shipping products without knowing the expected outcome they'll have on the user experience. It's much like reverse-engineering a really good recipe. Say your users really like your soda, but you don't know why. If you want to reproduce that — or if they like it, but not to the extent you expected — you'd want to know the delta and what to work on. It's like anything in life: if you produce an output at your company and want to improve, you must know where you stand relative to the extremes, and how your outcome is tied to your output, so you know what to change in the output. Same here: we need to define what ingredients go into the recipe, we need a way to evaluate the recipe internally before our consumers taste it, and then we need to evaluate it with them and compare. That's why I found this framework interesting — but it still really looks like they evaluate only with the designers, or with whoever conceived the model, and it would be great to separate those to some extent. None of this is new, by the way: we've been doing heuristic analysis since the beginning of time, and usability inspections internally with teams before releasing products since the beginning of time.
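The "compare the gap" step above can be sketched in a few lines: score the same attributes internally and with end users, then look at the delta. The function and attribute names are hypothetical, purely for illustration.

```python
def attribute_gap(internal: dict[str, float], users: dict[str, float]) -> dict[str, float]:
    """For each attribute scored both internally and with end users,
    return user score minus internal score. Negative values flag attributes
    where the team over-estimated the experience they would elicit."""
    return {attr: users[attr] - internal[attr] for attr in internal if attr in users}

internal = {"trust": 0.8, "satisfaction": 0.9}   # step 2: internal evaluation
users = {"trust": 0.6, "satisfaction": 0.85}     # step 3: validation with users
gaps = attribute_gap(internal, users)
worst = min(gaps, key=gaps.get)                  # the attribute to work on first
print(gaps, worst)
```

In this toy example, trust shows the larger shortfall, so that's the ingredient to rework — which is exactly the point of comparing internal scores with user validation rather than stopping at step two.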
I'm just saying we should emphasize it more — but I'll have a dedicated episode on that. Next, a reflection to spark conversation. From what I saw yesterday, we can now let Claude use our computer in Cowork. It looks like you can authorize Claude to access some of your files and folders, so that even through the app, once you're away, you can chat with it and ask it to prepare a presentation for you when you're not ready. I'm really torn between being amazed and being doubtful. Amazed because of the technology, of course; and at the same time, it's a bit like what I said yesterday: I think there's a conflation right now between what the technology can do and what we should authorize it to do. I feel like the reasoning goes: LLMs are here, so because they're here, we're opening the gates to a lot of other things — controlling my computer, recording people like in yesterday's episode, if you listened to it. Maybe it's really helpful to some people, but I'm struggling to understand why having the technology, LLMs, means we open the gates to everything else. To me, these don't necessarily go hand in hand — they can, but not necessarily — and I see a push, a little more each time, toward the end of privacy. I give the LLM permission to record my voice, I prompt it with my voice, I record my meetings, I analyze the transcripts automatically with an LLM — and it all goes to a server, gets analyzed, and gets used to improve their model.
Maybe not always, but I prefer to assume it always does. That's my rule of thumb: once you have doubts, always assume your data is being used to improve whatever product they're releasing. And I'm not anti — I work in user experience, so I think this is necessary to some extent — but I think it's good to know where you're leaving your data. Users should know where they're leaving their data. It's the same with Cowork. I'm not affiliated in any way; I really like Claude, I love what they're doing, I love the products and their commitment to improving, but I just don't understand this direction. Of course, you can stay in control — you choose which folders you give Claude access to. I'm just leaving the question open. It's not so much a commentary — maybe it is — as a reflection, almost philosophical, and maybe even more so from a user experience angle, at the level of trust: do we really want to leave our computer open like this? I'm wondering why there isn't more push — maybe there is, I think there is — for private LLMs: a self-hosted LLM on your own server that does exactly the same thing. And I see a trend over and over again — this was the case with Google — where we value convenience far more than privacy. Even if I know Google is analyzing my data left and right, I know the product is superior, and superior precisely because of that, so I'm willing to make the trade. I think this pattern is repeating with LLMs, only amplified.
It will have far more impact, because we talk to these LLMs and give them access to our files, so they can learn much more quickly. Sorry, I'm thinking out loud and sharing my thoughts as they come, but that's probably why there's so much push for this mix of using LLMs and giving them access to everything: ultimately they'll be superior to private ones, because private ones don't have as much data to train on, and the commercial ones will be more convenient because they do the job better. So comparing a private LLM to a commercial one isn't even comparing apples to apples, and that's what makes this hard. Maybe there's an in-between, where you're a pro and can train a private one so well that it does the job perfectly. But I'd really like to see more private LLMs emerge for this kind of task. Okay, and finally we have news of a startup that wants to make enterprise software look more like a prompt. That's the story of Josh Siroda, who founded the startup Aragon back in August and has just raised $12 million at a $100 million post-money valuation to build an agentic AI operating system for enterprise customers. Their simple thesis, per this TechCrunch article, is that software is dead. Siroda says buttons, dialog boxes, and pull-down menus are a thing of the past, and future business will be done by prompt. Aragon is attempting to offer the whole suite of business software — Salesforce, Snowflake, Tableau, Jira — through an LLM interface. Siroda, who worked on go-to-market teams at Oracle and Salesforce, admits to suffering a bit of a quarter-life crisis in the lead-up to moving to San Francisco and launching Aragon with a small team from a live-work loft across the street from the Giants' baseball park. So, okay.
Here are my thoughts — and I'm not even speaking about the product itself; I have no idea about Aragon as a product. I'm looking at it as I speak. On their website, it's described as a proprietary AI powering the world of bits — an enterprise AI OS. Quoting: at the foundation of every company are bits, ones and zeros created every second, stored across every system. These bits grow exponentially every second. Together they make up our entire business. No one has ever been able to see all of it, connect all of it, act on all of it — until now. So that's the vision, apparently. My unbiased — this is of course ironic — my opinion as a user experience researcher, interested in the human maybe a little more than the technology, but also in the technology (that's probably why I'm a user experience researcher), is that what we're seeing is normal. When you enter a new territory, you need to map it: you go to all the edges and extremes so you can readjust and settle where it's less shaky, less dangerous, less uncertain. We're entering an era of new toys, new capabilities — AI — and we need to experiment with all of it and see what sticks. That's what I'm seeing with these kinds of products. I'm not saying it's wrong or right; I'm just describing what I see. In my opinion, when we interact with objects, many senses are solicited: vision, touch — and among the inputs here I'd also include memory, feelings, and goals. In the model I'm describing, even those are inputs. Then you need to make a decision and act on it.
And once you act, it's the same: you have a thousand ways to do something, and those ways compete with each other, your experience included. For instance, if I'm an enterprise software user, I might have a goal — say, entering my employees' payroll data. I have several ways to do it, and this is what we're seeing right now: I might do it with voice, by typing, or by clicking on buttons. Ultimately, it will also depend on my experience. Am I a new payroll specialist, or one with 15 years of experience? How regularly do I need to do it? To what extent do I trust the technology? And so on. So it's not clear to me that we should use only one modality. To make it short, my conclusion is a question I want to leave here: do we really want to interact with technology only by prompting? Because if so, how restricting would that be? Let me tell you: I'm developing a website at the moment, and I was really torn between using only prompts to LLMs and using a drag-and-drop builder. I know how to put together a website to some extent, but it's really basic HTML and CSS — I'm not a web designer. So to get the job done, I hesitated between a drag-and-drop builder and LLMs. And from what I can see on the market, if you want to build your website with an LLM and then edit it by drag and drop — which would be the most efficient for me — no such product exists. At least, I couldn't find any. If you happen to know of one, please let me know.
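The point about one goal being reachable through several modalities can be sketched in code: the prompt box becomes one entry point among many that all dispatch to the same underlying action, rather than the only one. Everything here is hypothetical — the function names, the toy prompt parser standing in for an LLM, all of it — just to show the shape of the idea.

```python
from typing import Callable

def update_payroll(employee: str, amount: float) -> str:
    """The one underlying action every modality ultimately calls."""
    return f"payroll for {employee} set to {amount}"

def parse_prompt(text: str) -> tuple[str, float]:
    # Toy parser standing in for an LLM; a real system would do NLU here.
    name, value = text.rsplit(" to ", 1)
    return name.removeprefix("set payroll for "), float(value)

# Several entry points, one action: a classic form AND a free-text prompt.
modalities: dict[str, Callable[..., str]] = {
    "form": update_payroll,
    "prompt": lambda text: update_payroll(*parse_prompt(text)),
}

print(modalities["form"]("Ada", 5000.0))
print(modalities["prompt"]("set payroll for Ada to 5000"))
# Both routes produce: payroll for Ada set to 5000.0
```

The design choice is the interesting part: offering both routes matches Nielsen's flexibility heuristic, whereas a prompt-only product would delete the `"form"` entry and force every change, however small, through the parser.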
I think you can comment on the show on Spotify, so let me know, because I couldn't find any and it's really frustrating. It seems that again and again — maybe it's not companies, maybe it's the mindset — we tend to put products before needs, and that's probably what I'm seeing with this take that enterprise software could look more like a prompt. It's really a mindset that says you can do everything by just prompting. Well, take website design: if I need to change one thing, I have to prompt and wait for the answer, whereas it would be much easier to do it by hand with drag and drop. So I'm not sure this would work, to be honest. I might be wrong — let's see. At worst, it will be a learning experiment for this company. I'm just saying that having only one modality to interact with technology feels restrictive. And having spoken with a lot of users over my ten-ish years in user experience, I can tell you: people want to do things a certain way, and if you take that away from them, they are not happy. Sometimes we do want control over things. If you don't leave your users an exit door or a retry — and that's part of Nielsen's heuristics — they're not happy, and understandably so, because you're placing barriers between them and the job they have to accomplish. Restricting things to one modality could be one of those barriers. So that's it for today's episode — these three news items. I hope you liked it, that you learned at least one thing, or at least that it sparks conversations. I'm super happy to have people disagree with me; let me know in the comments.
I don't have full knowledge of everything behind these articles and this news. There may well be smarter people around who could complement what I'm saying — if that's you, please comment, and I'll learn from it. Until then, take care. See you tomorrow. Cheers.