
The UKRI Trustworthy Autonomous Systems (TAS) Hub Website
Living With AI Podcast: Challenges of Living with Artificial Intelligence
This podcast digs into key issues that arise when building, operating, and using machines and apps that are powered by artificial intelligence. We look at industry, homes and cities. AI is increasingly being used to help optimise our lives, making software and machines faster, more precise, and generally easier to use. However, they also raise concerns when they fail, misuse our data, or are too complex for the users to understand their implications. Set up by the UKRI Trustworthy Autonomous Systems Hub this podcast brings in experts in the field from Industry & Academia to discuss Robots in Space, Driverless Cars, Autonomous Ships, Drones, Covid-19 Track & Trace and much more.
Season: 1, Episode: 9
Machine Listening - Discussing Virtual Assistants and How they Hear
00:30 Elvira Perez
00:40 Paurav Shukla
01:00 Sean Riley
01:20 Glastonbury Cancelled
03:30 Your smart devices listening to you, explained (Vox)
03:48 Why Do We Use Virtual Assistants Like Siri, Alexa and the Like? (LWAI Podcast)
04:20 Amazon Sidewalk will create entire smart neighbourhoods. Here's what you should know (Cnet)
05:10 Wi-fi sharing plan launched in UK (BBC)
06:10 Christine Evers
44:45 The Social Dilemma (Netflix)
49:52 Jurassic Park Quote (quotes.net)
Podcast production by boardie.com
Podcast Host: Sean Riley
Producer: Louise Male
If you want to get in touch with us here at the Living with AI Podcast, you can visit the TAS Hub website at www.tas.ac.uk where you can also find out more about the Trustworthy Autonomous Systems Hub Living With AI Podcast.
Episode Transcript:
Sean: This is Living With AI, the podcast where artificial intelligence is centre stage. Can we trust these AI-powered devices? Our feature topic today is machine listening. We're talking to someone who's no stranger to the podcast, appearing on our panel quite often since the start, expert in machine listening, Christine Evers. But before that, let's meet the panel. Mulling over the topics of the day this week are Elvira Perez and Paurav Shukla. Elvira is Associate Professor of Digital Technology and Mental Health at the University of Nottingham; her work has focused on data ethics and responsible research and innovation. Welcome Elvira.
Elvira: Thank you.
Sean: And Paurav is Professor of Marketing at the Southampton Business School. One of the areas he specialises in is customer expectations, consumption experiences, luxury marketing and branding. I hope you get to do some good product testing there, Paurav.
Paurav: I'm pretty sure after your introduction, I shall.
Sean: And I'm Sean Riley, and I'll be listening without the benefit of intelligence, artificial or otherwise. And we're recording this on the 21st of January 2021. Well, not much happening in the world of AI at the moment that I saw, but there's some significant news. Joe Biden has just had his inauguration and more importantly, Paurav, Glastonbury has been cancelled again.
Paurav: I know. Second year going, isn't it? And what a pity that is. But it has a bright side in some ways, because many a time nowadays, most families don't go to Glastonbury, so they would be at home and they would be listening to more music, I guess, at home through many of those devices. Right, Sean?
Sean: They will. They will. Actually, there's a funny sort of side story to this, because I have several of these smart speakers littered around my home. Children have them in their rooms and this sort of thing. And I had reason yesterday to change the Wi-Fi password for our Wi-Fi router and never has a smart home become less smart so quickly. I don't think I've got them all running again yet. How unsmart are some of these devices?
Paurav: Yeah, if you lose the keys to the home, then that is it, isn't it? And so, they lose the password and that's it. I think in some sense, I guess we will have to talk about it, because when we think about home and privacy and smart speakers and everything, it's a very, very intriguing issue in its own way, isn't it?
Sean: Do you have a smart speaker, Elvira?
Elvira: No, I don't. But I can see, I know why I don't, and it's a bit for the reason you mentioned. We're not extremely tech savvy, and what we've experienced is that when there are multiple users and not all users are equal, there is always some type of conflict. And until recently, we didn't have a television, so I'm coming from a family that is not 100% convinced about smart speakers.
Sean: I think 'smart speaker' is purely marketing. They're not smart in the slightest.
Paurav: That's very, very true. I couldn't agree more. Anyway, they are just really, I still remember that old video by Vox, I think, which was then republished, which said, smart speakers, anyway some expert was saying, smart speakers are nothing but microphones in your house. And that's about it. All the other smart things are happening somewhere else in the world.
Sean: Obviously, we've talked to Stuart Reeves about these virtual assistants, but being as today's feature is going to be on machine listening anyway, it feels like it is another chance to have a discussion about that. I mean, there's been some interesting features on Amazon's range of Echo devices, as I understand it?
Paurav: You're absolutely right. There are some features that are very problematic in many ways in these devices that should not be named, as we know. The reason being, a feature that was very recently brought in by one of these companies is called Sidewalk. And the thing is that Sidewalk connects your smart assistant or smart speaker, as they call it, with other smart devices within your neighbourhood. So now it is not just a home phenomenon, it's a neighbourhood phenomenon.
And the thing is that it was an opt-out feature, so you had to go into your settings on your mobile phone and make sure that it was not activated; otherwise it would automatically be activated. And that's very worrying from my perspective.
Sean: I think anything that's an opt-out, that's, yeah, I don't know what the word is, influential. I know BT, I don't know if it still has it, but BT's broadband offering would, unless you opted out, share your router with anybody passing in the street who happened to pay BT as well. These sorts of opt-ins and opt-outs, they're very carefully designed, aren't they, Elvira?
Elvira: Yes, obviously there is an intention behind it, but what is really worrying is that ethically, you sometimes don't give a choice to users who may not go through the manual instructions in detail. And there is no awareness, perceptions of privacy are completely misconstructed, and expectations of privacy are something that has to be extremely clear and transparent.
Sean: Not that I'm saying that I wouldn't read the manual on something, but yeah, there is definitely a breed of people who just plug something in and turn it on and think, great, this is going to be fab, without actually looking at the implications. This week's feature is machine listening. And our guest is no stranger to repeat listeners, regular panellist Christine Evers. Christine is a lecturer in computer science at the University of Southampton. She specialises in Bayesian learning for machine listening. Her research area is the intersection of robotics, machine learning, statistical signal processing and acoustics. I don't know if I've even said Bayesian right. Is that right, Christine, Bayesian?
Christine: Yeah, I think that's, it's Bayesian. Like Bayes theorem.
Sean: Ah, from Bayes. Oh, okay. Well, I suppose to get onto this, we've had discussions before about things like smart speakers, but that's very specifically, you know, taps into your area, really, doesn't it? I mean, before the advent of smart speakers, machine listening isn't something we perhaps would have thought about so much. Is it a new field?
Christine: It's a relatively new field. I mean, machines have been listening for a long time in environments and we've deployed microphones for a long time in our home environments. But it's certainly, the voice assistants that we now can actually install in our kitchen and our living room and actually interact with, I think that's certainly changed the way that we perceive machines almost coexisting with us in our homes. But it's also opening up new areas like equipping, for example, robots that can completely autonomously behave in an environment and actually move around and support whatever we're doing or help humans perhaps take physically strenuous tasks off us as well.
Sean: There are a lot of kind of ways we could approach the trust issue with machine listening. One of the things that occurred to me when I was doing a bit of research was that one of the big problems has to be guessing the intent or analysing the intent of the speaker when you're listening to them. What is it that they want, I suppose? Is that a fair point?
Christine: Yeah, I think there's multiple levels of actually guessing intent or importance, if you want to call it that, or salience, as we would call it. So traditionally, from an automatic speech recognition point of view, so this goes more into the speech processing side of things, natural language processing. From that point of view, you would assume that you only get a microphone signal or something that's very close talking, there's not much noise to it, perhaps no reverberation. This is a long time ago that people actually dealt with this in that particular way.
Then the question is, from a natural language processing point of view, which is also active research nowadays still, how can you now translate the spoken words to text and then allow the machine to extract intent from that text? In practice, what we find though is that those machines are usually not close talking. What I mean by close talking is your microphone is close to your mouth. In practice, we would have a voice assistant standing on your kitchen counter whilst you're trying to cook, there’s sizzling from your pans going on, perhaps sort of some children screaming in the background, telephone going off, TV running in the background.
And that makes it an extremely difficult scenario to deal with, because not only do you need to now translate the spoken language to text in order to process it, you also somehow need to make sense of this acoustic scene as a whole, which part of it is actually what is important in the first place and what the machine should listen to.
Sean: And that translation to text sounds like an obvious thing to do to simplify the problem. Is anyone working on doing it before it gets to text? Is there any kind of way of doing that or does it have to go to text and then we work out what it is from the text?
[00:09:47]
Christine: So I'm not an expert in natural language processing as such, but as far as I know the best way to do it is to translate to text, because of course, once you get to the written words, that tells you a lot about the linguistic content. However, people also do take into account information like the pitch of your voice. So what does your voice actually sound like? That's important also when you try to extract emotions and that's a little bit of a controversial topic, but could you extract from the spoken as well as the textual content how a person is feeling? Which of course is very important also for perhaps applications like call centres. How do you actually engage with a human without seeing them through the tone of their voice? As I said, it's a slightly controversial topic, this emotion recognition, but you can fold in information that goes beyond just the text.
Sean: If we think about that scene there of the kitchen and the voice assistant, how do you ascertain what's important and what you don't need? I mean, I know that our virtual assistant in our kitchen sits next to the radio and half the time we have to run over, turn the radio down to ask the virtual assistant to do something, then we can turn the radio back up again when we've done it, because it can't cope with that speaker right next to it. But there's all sorts of things going on in a lot of these situations. How do you know when you've got it right or the right voice?
Christine: That's a really interesting question, and this very much falls into the area of acoustic scene analysis, which is basically what I'm working on in terms of machine listening. So what you try to do in the first place is to work out what sound sources are actually around the machines and which of these sound sources usually contain salient information or not. So this would be, for example, a human speaker versus a washing machine. We don't usually want to listen to a washing machine. If you only have a single human speaker in the room, of course, you could then estimate where that sound source is based on the sort of properties of the acoustic cues.
So our voice has a very distinct pattern to it. Even if the language changes, voices still have very distinct patterns across time, but also in frequency, depending on how we speak. So our voices go up and down and very rapidly change, actually, whereas the siren of an ambulance, perhaps, would vary over something like two seconds' time, a single phoneme in a language lasts about 50 milliseconds. So there's very rapid variations in that.
And based on those sort of signatures of the sounds, you can detect sounds, but also classify sounds. Once you've done this and once you've kind of made sense of all the different sound events that are in your environment, you can then usually take advantage of the fact that many devices contain multiple microphones. And what that does is, as the sound wave comes into or impinges on this array, so multiple microphones in a particular geometry, it hits the first microphone, then, with a small time delay, the second, and so on and so forth, due to the propagation of the wave.
And that time delay between the microphones does give you information about where the sound source is located. There are other cues that we can use in addition, but that is one of the most straightforward ways of trying to localise where a sound source is. Once you've done that, and you know roughly in which direction your sound source of interest is, so say it tells you at 30 degrees, so to your right-hand side, you've got a washing machine, and to your left-hand side, you've got a speaker, then what you can do is, or what the machine can do, is to form what is called an acoustic beam.
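To make that concrete, the delay-to-angle step can be sketched in a few lines of Python. This is a minimal illustration assuming a simple two-microphone, far-field set-up; the function and parameter names are ours, not any product's actual implementation.
```python
import numpy as np

def estimate_doa(mic_a, mic_b, fs, mic_spacing=0.07, speed_of_sound=343.0):
    """Estimate a rough direction of arrival (in degrees) from two channels.

    The cross-correlation peak gives the time delay between the microphones;
    the far-field model tau = d * sin(theta) / c turns that delay into an angle.
    """
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag_samples = np.argmax(corr) - (len(mic_b) - 1)   # lag of the correlation peak
    tau = lag_samples / fs                             # delay in seconds

    sin_theta = np.clip(speed_of_sound * tau / mic_spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```
With microphones 7 cm apart and audio sampled at 16 kHz, a single sample of delay already corresponds to roughly 18 degrees, which is part of why practical systems combine several cues and more microphones rather than relying on one delay estimate.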
And you basically use this microphone array in order to focus its energy almost onto a certain area or a certain cone of interest, so you're almost zooming into a sound source. And of course, what you do with that is that you suppress everything that's outside of this cone, and you only listen to what is inside of the cone. So you've now removed all the sort of noise that is coming from outside, you listen to the speech. And once you've got that, there are some complications to it, because there's an acoustic channel involved.
So if you listen to your own microphone signal that you recorded, for example, on your iPhone, you'll find that your voice doesn't sound like something recorded right at your mouth, it sounds almost metallic. That's what we call reverberation. It's basically the reflections of the sound in the room as they impinge on this microphone array. And you can remove that as well using enhancement techniques to improve the signal that you're actually listening to.
Once you removed all these adverse effects from your signal, you can then apply it to automatic speech recognition algorithms and natural language processing in order to make sense of the actual speech signal itself.
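The "acoustic beam" described above can be illustrated with a basic delay-and-sum beamformer. The sketch below steers a linear array towards a chosen angle by phase-shifting each channel in the frequency domain; it is a simplified, illustrative version, not the enhancement pipeline of any real assistant.
```python
import numpy as np

def delay_and_sum(channels, fs, mic_positions, look_angle_deg, c=343.0):
    """Steer a linear microphone array towards `look_angle_deg`.

    channels:      array of shape (num_mics, num_samples)
    mic_positions: microphone coordinates along the array axis, in metres
    """
    num_mics, num_samples = channels.shape
    angle = np.radians(look_angle_deg)
    # Arrival-time offset of each microphone for a plane wave from `angle`.
    delays = np.asarray(mic_positions) * np.sin(angle) / c

    spectra = np.fft.rfft(channels, axis=1)
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    # Advance each channel by its delay so the look direction adds coherently,
    # while sounds from other directions add incoherently and are attenuated.
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * steering
    return np.fft.irfft(aligned.mean(axis=0), n=num_samples)
```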
Sean: That's all pretty impressive stuff. So my takeaway from that is don't walk along while talking to the smart assistant, stand still?
Christine: Yeah, and that's actually one of my particular areas of interest. How do you make sense of dynamic scenes? Because, of course, if a machine thinks you're in a particular direction and only takes that estimate once, then as you move, you'll move outside of this cone of interest, almost, if you want to call it that, or this beam. So what I'm working on, particularly with respect to robotics, is that normally our sound scenes are highly dynamic. We as humans, even small head and body rotations that we subconsciously perform to hear better, those actually make a distinct difference on the channel between the microphone and the speaker itself or themselves.
So how can we cope with the fact that human talkers are highly dynamic, but also our microphone arrays might be highly dynamic. If you have, say, your phone, and your phone is probably carried on your person, I keep waving it in my hand. So the phone itself varies, but at the same time, I move as well. So this is a really highly dynamic scenario, and we somehow need to make sense of that.
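Tracking a moving talker like this is usually framed as a Bayesian filtering problem, which is where the earlier mention of Bayes comes in. Purely as an illustration (a textbook constant-velocity Kalman filter over an angle estimate, not Christine's actual algorithms), it might look like this:
```python
import numpy as np

class AzimuthTracker:
    """Constant-velocity Kalman filter over a talker's azimuth (degrees).

    State is [angle, angular velocity]; the measurement is one noisy angle
    estimate per frame (e.g. from a DOA estimator). Noise values are illustrative.
    """

    def __init__(self, dt=0.1):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])   # motion model
        self.H = np.array([[1.0, 0.0]])              # we only observe the angle
        self.Q = np.diag([0.5, 2.0])                 # process noise
        self.R = np.array([[9.0]])                   # measurement noise
        self.x = np.zeros((2, 1))                    # initial state
        self.P = np.eye(2) * 100.0                   # initial uncertainty

    def step(self, measured_angle):
        # Predict where the talker should be now, given the motion model.
        x_pred = self.F @ self.x
        P_pred = self.F @ self.P @ self.F.T + self.Q
        # Correct that prediction with the new measurement.
        innovation = np.array([[measured_angle]]) - self.H @ x_pred
        S = self.H @ P_pred @ self.H.T + self.R
        K = P_pred @ self.H.T @ np.linalg.inv(S)
        self.x = x_pred + K @ innovation
        self.P = (np.eye(2) - K @ self.H) @ P_pred
        return float(self.x[0, 0])                   # smoothed angle estimate
```
Each call to step() first predicts where the talker should be under the motion model, then corrects that prediction with the latest measurement, so a single noisy estimate doesn't make the beam jump around.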
Sean: And slightly separate to this, but connected in some way, is I remember when setting up the smart assistant for the first time, it asks you to pretty much identify yourself by saying various phrases or the same thing over and over again to try and get a sense of what your voice sounds like. Is that really able to determine my voice from somebody else's voice?
Christine: Well, in principle, yes, it would, depending on how they set up these algorithms, we've got very, well, very characteristic features of our voice. So there's the pitch of our voice, at what frequency you speak. So your voice is lower than mine. So your frequency is lower than mine. And that fundamental frequency is determined by your vocal tract, effectively, your physiology as a human. So that's a very characteristic feature. Also, things like intonation or rhythm of speech are very unique to humans. And depending on how many of these features those voice identification algorithms use, the more they can learn about a person and then the way that they actually speak.
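The fundamental frequency mentioned here can be estimated with something as simple as an autocorrelation peak search. The sketch below is a rough, illustrative estimator; real voice-identification systems use far more robust features and models.
```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Rough fundamental-frequency estimate for a short voiced frame.

    Picks the autocorrelation peak whose lag corresponds to a plausible
    human pitch period (roughly 60-400 Hz).
    """
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

    lag_min = int(fs / fmax)                 # shortest period considered
    lag_max = int(fs / fmin)                 # longest period considered
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return fs / lag
```
A deeper voice shows up as a longer autocorrelation period and hence a lower estimated frequency, which is the "your frequency is lower than mine" point above.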
Sean: My voice assistant will allow me to access my calendar and who knows what else in terms of my own personal data. But I'm pretty sure if one of my family members asks it, it will tell them my calendar details as well. I mean, is that something that, you know, how reliable is that from a trust point of view?
Christine: Yeah, I absolutely agree with that. I mean, it depends, again, on the device. Certainly for robotics, what you want to do is to make a robot aware of who the primary user is. And that's mainly to avoid issues where perhaps third parties could opportunistically attack the system or exploit the system in some ways. And there are very good examples also in news over the last two years where people have actually exploited especially voice assistants. So the voice identification systems would, of course, allow you to have a primary user, which would give you the confidence that it wouldn't be able to perhaps order toys for your children or that an advert might accidentally, shall we say, trigger your voice assistant.
On the other hand, that becomes slightly complicated if you live in a multi-person household. So how does your voice assistant then distinguish between your calendar and your partner's calendar? And how can it prevent that your partner can basically access your information if you share the same device? So they're all very complicated questions, really.
Sean: What sort of thing worries people about this machine listening area? You know, should they be worried? Is there any, you know, are there any problems here?
Christine: Yeah, absolutely, and I think the biggest concern that we possibly have in machine listening when it comes to user adoption is actually the trust in the system. And I think the fundamental underpinning issue here is that users perhaps need to build up the trust in the company or the provider of the system in the first place, in order to trust that that system actually does what it says on the tin, effectively. Primarily, as a user, we have the concern that, of course, you've got a listening device in your home environment, in a very private sphere. How do you ensure that you're not being constantly surveilled?
And that really comes down to how much you trust the manufacturer of that device and that what they tell you is happening in that device is actually what's going on. What is typically happening at the moment is that, of course, we have wake words. And what happens is that the machine listens, but it processes this information only on device. So, this is not sent usually to the cloud. And only when it hears that wake word is when the machine actually starts to transmit data to the cloud where it has better algorithms, more data available to actually then perform the processing chain that it needs to perform.
[00:20:01]
But, of course, the question for users is then how can I trust that the company is actually implementing an algorithm that only listens on device unless I say the trigger word? And how do I avoid scenarios where because of these very complicated scenarios that I'm in, where there's noise, there's other people, there might be words that sound very, very much alike to the wake word. How can I make sure that the machine doesn't accidentally switch on and then listen to very private conversations that you might have within a private environment?
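Schematically, the on-device gate described here looks something like the sketch below. Both detect_wake_word and cloud_asr are hypothetical placeholders rather than any vendor's real API; the point is simply that nothing is transmitted until the local detector fires.
```python
import collections

def run_assistant(frames, detect_wake_word, cloud_asr, command_frames=150):
    """frames: an iterator of short audio chunks (e.g. 20 ms) from the microphone."""
    buffer = collections.deque(maxlen=50)      # small rolling buffer, kept on device
    capturing, command = False, []
    for frame in frames:
        if not capturing:
            buffer.append(frame)
            # Everything up to this point is processed locally and then discarded.
            if detect_wake_word(buffer):
                capturing, command = True, []
        else:
            command.append(frame)
            if len(command) >= command_frames:
                # Only this bounded chunk of post-wake-word audio leaves the device.
                print(cloud_asr.transcribe(command))
                capturing = False
```
The trust question raised above is precisely whether shipped devices really behave like this loop, since users cannot normally inspect the firmware to check.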
Sean: Well, I'm thinking of those wake words and the processing that's required to do the transcription. Is that the best word for it? Not the wake words, I mean. My guess is that these are small, inexpensive devices that only really have the power to discern those wake words. How much power would you need to do the rest of it? I suppose what I'm trying to get at here is, can we set people's minds at ease that this cheap £20, £30 device that they've bought to make shopping lists or set timers with actually can't possibly do the sort of heavy lifting that needs to go to the cloud?
Christine: Well, a lot of research at the moment is actually going into reducing the complexity of the algorithms that are processing all this information. And the idea is that you bring algorithms away from the data centres back onto edge devices, like your phone or your watch. And that is basically done on two levels, A, of course, computational power goes up with time. The longer we wait, the more powerful devices we have. On the other side, can we actually help that along by reducing the computational complexity of these algorithms and the resources that those devices require in order to perform the tasks at hand?
And especially nowadays, where phones are basically at a standard that probably supersedes whatever we had 10 years ago on a high performance machine, we're very much getting to a point where a lot of the processing can be offloaded onto edge devices, as we call them, so at the edge of the network, so to say. So that we avoid that ambiguity between the information that's being sent to a data centre and that we as users have more control over the information that's actually processed on our own devices.
So a lot of, for example, smart devices like watches process a lot of the physiological signals that we record only on the device itself. And the hope is that over time, we can do similar things with speech and audio as well in order to ensure that users can retain as much data as possible.
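One common ingredient of that kind of complexity reduction, shown here purely as an illustration rather than a description of any specific product, is post-training weight quantisation:
```python
import numpy as np

def quantise_weights(weights):
    """Toy post-training quantisation: store float32 weights as int8.

    Returns the int8 weights plus the scale needed to reconstruct an
    approximation of the original values.
    """
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    """Recover an approximate float32 copy of the original weights."""
    return q.astype(np.float32) * scale
```
Storing weights as 8-bit integers instead of 32-bit floats cuts the memory and bandwidth needed by roughly a factor of four, usually at a small cost in accuracy.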
Sean: So, allaying my fears that they can't do the processing, actually they can do the processing, but that's to our benefit, because hopefully they'll only process and send what's kind of the end product eventually, right? So they'll send out the bit of text, you hope, or the bit of text that's been translated, rather than the audio. Do these devices get better the more they're used?
Christine: Yes, of course. So the more data you have available, the more you learn, of course, about the use of language, about sound scenes, how they work in general, and you're starting to detect more and more patterns in the data that the machine can then make sense of. This becomes quite important also for things like text-to-speech, where you start texting and it predicts the next word, but also for language prediction and synthesis, for example, when the machine is actually trying to speak back at you. And of course, there's other issues where you want to predict where perhaps a person might want to move to, as we talked about dynamic scenes. If you move in an environment, we of course don't tele-transport, we move in relatively smooth patterns. So can a machine then predict where you're likely to go next in order to almost anticipate what it needs to do as the next action in a few seconds' time?
Sean: And I'm just jumping back a little bit here, but when you mentioned kind of like an array of microphones, even these tiny devices have several microphones in them, do they?
Christine: Most of them do, or a lot of them do. So you can find the specifications normally online. I mean, voice assistants usually have at least four microphones. I think some of them have even eight microphones available to them. Most phones by now, I would expect to have about at least two microphones, if not more in them. So there's a lot of information that we can gather from these microphones, both in terms of time content, so how things evolve over time, but also spatial content, as in what happens around the machine, and how can we provide that machine with situational awareness like we as humans do?
Sean: We’re used to the big tech companies having control of this, so your Amazons, your Googles, your Apples, so for Siri or for whatever. Is there any progress on kind of independent systems for this kind of voice recognition and kind of machine listening? I suppose what I'm thinking is, might people at some point be able to roll their own and have very good results without worrying about these privacy issues?
Christine: Yes, and there's a lot of research going into exactly that field at the moment. What is happening in some devices, or in many devices by now, which is very promising, is that you would process, you basically pre-train a model. You have a data set that you start from that, as a designer, you believe is reliable. What we then do is to take this data set and build a model around it, and that model basically captures how information evolves over time. This might be either in terms of situational awareness, what is usually happening in a living room, for example, or it might be in terms of speech, how speech signals evolve and how language evolves.
So, you build this model up, and then you deploy this model onto these individual devices. The model is built on a server, and then you deploy it onto the individual devices. Now, of course, your individual device is individual, it's personalised, so you as a user nowadays can select that the data that you record on that device, or particular parts of data, are only to be retained on your phone, and you can use that data on the phone in order to basically let that model evolve based on your personal preferences, based on your habits.
You have the choice as a user for a lot of these algorithms to select that information may or may not be shared. The important thing about it is that at that point, you're not actually sharing the data itself. It would basically build this individualised model and it can then, depending on your preferences, either send an encrypted version of that model back to the server in order to improve the overall model, or if you don't want to do this, and you'd rather just learn or rather just use your personalised device, keep it on your device itself.
But the crucial part about that is there's a lot of research going into how can we actually keep your data safe? How can we avoid that this very, very personal data gets shared with a server, is stored on a server where we as users don't have control over it? And how can we encrypt information that might even be in a tangible way related to our data? How can we keep that information in the form of models, for example, safe?
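What is described here is very much in the spirit of federated learning: the raw data stays on the device and only a model update, ideally encrypted, is ever shared. The toy sketch below uses a simple linear model and hypothetical helper names to stand in for the real speech models.
```python
import numpy as np

def local_update(global_weights, features, targets, lr=0.01, epochs=5):
    """Fine-tune the shared model on data that never leaves the device.

    The linear model here is only a stand-in for a real speech model.
    """
    w = global_weights.copy()
    for _ in range(epochs):
        predictions = features @ w
        gradient = features.T @ (predictions - targets) / len(targets)
        w -= lr * gradient
    # Only the change to the model is shared, never the raw audio or text.
    return w - global_weights

def server_aggregate(global_weights, client_deltas):
    """Average the (ideally encrypted) updates from devices that opted in."""
    return global_weights + np.mean(client_deltas, axis=0)
```
In practice such updates are typically compressed and encrypted, and often combined with techniques like secure aggregation or differential privacy before the server ever sees them.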
Sean: I did sort of have a slightly side, completely not trust-related reason for asking that, although it is related to trust in a way, having that idea. I live very close to a couple of sports grounds and when I'm here working away, obviously not during the Covid era, over the last year or so, but you often hear big cheers go up. And the first thing you think is, oh, what's happened? Has there been a goal? Or has there been a wicket? Because there's a cricket ground and there's a football ground, there's a rugby ground, all sorts.
And I thought that the way to find that out is obviously to go and check online what the score update is. And you can wait several minutes for the score to update and you want to know what happened, right? So I thought if I get a Raspberry Pi listening to the radio, putting it all into text, I could just scroll back up the little text field and go, oh, yeah, such and such scored. Boom.
Christine: So we've actually done that for a previous project. But basically, I mean, for projects nowadays, you need demonstrators, and particularly for audio, that's really handy, especially if you work in robotics like I do. You can basically just incorporate various APIs for, say, speech recognition. So I focus on acoustic scene analysis, but I can access speech recognition platforms through APIs that are provided by very big providers who have access to massive amounts of data. And basically, you can hook into them. So once you've kind of extracted a speech signal, you can then just send it through the API, get your answer back. And it's a very powerful demonstration to actually show, especially to the public, oh, this is what the machine heard if I actually have my algorithms on. And this is what the machine heard if I have my algorithms off. It actually makes a significant difference.
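For a hobby version of Sean's scoreboard idea, one way to hook into a hosted recogniser from a Raspberry Pi is the third-party Python SpeechRecognition package, shown below purely as an example (it needs a microphone and PyAudio, and it uploads the captured audio to a remote service, which is exactly the trust trade-off discussed in this episode).
```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)               # brief calibration for room noise
    audio = recognizer.listen(source, phrase_time_limit=10)   # capture up to 10 s of audio

try:
    # This call sends the audio to a hosted speech recogniser and returns text.
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not make out any speech")
```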
Sean: Thanks to regular Living with AI panellist Christine. And talking of our panel, did you manage to hear the interview clearly or were there homeschooling children screaming over the noise of the washing machine? Or is that just my house where that happens? But homeschooling must be affecting work in some way.
Elvira: Definitely. And someone said, “Sorry to interrupt.” And I replied, you know, “Interruptions nowadays while we're working at home are extremely difficult.” 12 o'clock meetings are challenging because the children are hungry and they will come, they will feel they have the right to let you know, “Mum, when are we having lunch?” Yeah, it's extremely, and it's exhausting. At the end of the day, you are tired because you're trying to keep concentration at a level that sometimes is not feasible.
[00:30:07]
Sean: And humans find this hard. Machines also, we've discovered, find it hard. Just as a side note, I think it adds something to the day when somebody gets interrupted because, you know, we all have to, we'll have a bit of humour and it just changes things. Paurav, what were your thoughts off the back of listening to Christine's feature and the idea of these machines and, you know, listening capabilities?
Paurav: I learned a lot of new things from Christine's interview, actually, Sean. It was very interesting to know how the technology within these small little microphones, as we call them, really works, because that idea that the microphone, after hearing your first word, creates a zone, and then from that zone it pipes through all that information, so we should not move much, was really fascinating. I realised how it then operates, because in our house, when we are, you know, in the kitchen in the evening and there's lots going on with my sons and my wife and I, and all of us are there and we want to play something. And I would say, “Alexa, play Bob Dylan.” And obviously, suddenly there would be somebody saying, “Alexa,” in a very little voice, “Play Gorillaz.” And Alexa listens to that too.
And that is such a fascinating thing, because I assumed that it was just, you know, a very sensitive speaker. But that zoning was a very interesting idea. And I can now understand how it hears beyond the simmering and sizzling Indian cooking that is going on in my house and how it is able to manage all those little conversations. But at the same time, it also makes me a tad worried from a parental perspective, or just as an individual, that it's listening all the time. It's just not activating, possibly, but it's listening all the time to all those kinds of conversations that are going on.
And in my house, the conversations are happening in three different languages at the same time, which is so much fun. But I'm not sure, we do this with Alexa also, oh, I did mention the name, didn't I? But the thing is, we sometimes use the smart speakers and I can say this now, sometimes I ask it the time in my own mother tongue in Gujarati. And it comes up with such funny, funny ideas. We always try those very Hindi words and Gujarati words and all those kinds of words and we throw it at it. And right now, it is still not intelligent or smart enough to understand it.
And it makes it more fascinating for us. It comes up with interesting jokes for us for the day.
Sean: It becomes the butt of the joke sometimes, doesn’t it? Elvira, you were going to say something?
Elvira: Yeah, I was going to add that, in terms of background noise cancellation, the technology has advanced quite a lot in the last 10 years, in terms of the signal processing. What's the signal? What is the noise? And to separate that. But the technology hasn't advanced so well, as you said, with accents. And I believe that there's a training element there that is biased. So I'm sure, of the three of us, Sean, you will be the person that Alexa will understand the best. Maybe Alexa will have trouble understanding me speaking in English, or at least it will not be as accurate.
So how do you provide a machine that will understand anybody at the same level? That's still a real technological challenge. Also, what I found fascinating is the way users will communicate with these smart microphones or smart speakers. And I've noticed children tend to be extremely disrespectful. They will give orders and be rude. And I keep thinking, why, why do they do that? And I'm sure there is a really interesting psychological phenomenon where you feel that you have to give orders. And if you shout, maybe the machine is going to understand better and do it more efficiently. I don't know.
Sean: I think also I've heard people being polite to them and thanking them, and that is actually quite an interesting phenomenon in its own right. And it's interesting, as you say, about the different dialects, different accents, they are definitely biased towards certain ones. Similarly, algorithms like the one YouTube uses for automatic subtitling and captioning are clearly trained on American English. And I expect a lot of these smart speakers are too. So maybe my British English is fairly close, but even so I can look at, say, the YouTube automatic captioning when I put a video file up, and it is quite amusing sometimes what it thinks I've said. Paurav, you were going to say something?
Paurav: Yes, it was this idea, Sean, which Elvira just mentioned, about kids and smart speakers. And one of the things is, I think a new power player has now arrived in the house. And the hierarchy generally would be that when your kids are small, they see you as almost demigods, because dad knows or mum knows. At about eight or nine years old, they realise that they don't. Santa Claus is out of the window now. Santa Claus is no more an entity.
So what then happens at eight, nine years old, you start seeing them challenging some of those things. Then at about 15 years old, they almost say, “What do you know?” And so we start seeing this whole pattern within 15 years. And now there is a new little device which they can overpower, because now they are powerful and the device almost follows everything. What the parents have been doing to the kids, I think the kids are doing that to the smart speaker. So I think, Elvira, you're right, this is a new power balance that has come where they can order and be kind of direct and the device would always nicely answer, almost would say good evening and good night and all those kind of things. Which somebody who is their contemporary or somebody who is similar in power balance wouldn't do. So I think they feel a little more, I have the power type of thing.
Sean: One thing we find in our house is if the youngest decides to try and talk to the device, the next child will start making silly noises to try and make sure that the command goes wrong or whatever. These sorts of things happen. I was interested as well by the idea of the processing happening on the device that Christine started talking about towards the end of the feature and the idea that previously I've thought, oh, these things are fairly safe. They're only sending off after the watchword. They don't know what we're saying the rest of the time. And actually, maybe that's sort of called into question now?
Elvira: In relation to what Paurav says, about these children believing that what the machine says is always the correct answer, that worries me, because the kids believe that Google has all the answers. And sometimes they may not question or look for counter-facts, and you can find anything you want and the machine will agree with you. And this critical thinking is something that, within digital literacy, data literacy, schools basically have to do something about, because it's worrying, specifically when they feel that they have the power to ask and demand and believe what the machine may be saying back.
Sean: And often these machines will just literally parrot something off a website, right? So it might not be the right website. The context may be wrong. The date could be wrong. I know we've had instances, I won't use this as the exact example, but something like, who is the president of the United States of America and the device goes to the wrong date or an old article because it's been viewed more times or whatever, and it tells them the wrong answer. These things do happen. And there is an argument and Paurav, I think you've said this before, that are these smart speakers just the latest version of the microwave oven, right?
Paurav: Yes, I very much feel so in some ways, because, Sean, I have both types of those systems right now at home, the Google based system as well as the Amazon based system. And one of the things that I see myself when I was reflecting on it, is that I generally ask it to play a particular music. I generally ask it to check the weather. And I generally ask it to turn off or turn on something. And that's about it. However, every week in my inbox on a Friday, I get about 30 new commands that this thing can do.
However, at the same time on Saturday morning, I have forgotten all the 30 and it comes back to the same three commands that I keep on asking. And that makes me wonder, is it the new microwave oven, which was once upon a time touted as the complete cooker alternative and has now become a reheating device in most homes? And the same thing is happening with the smart devices: they look very smart, they are becoming even more and more elaborate, in a way the latest version of that Amazon device is so very elaborate and it has lights and this, that and everything.
So they are becoming more conspicuous, but are our interactions with them becoming any smarter? I don't know about others, but I certainly haven't asked any more than those three questions ever.
Sean: I think they sort of fulfil a function in your house, don't they? You're absolutely right. We have a microwave oven. We had to replace ours a couple of years ago and the replacement came with all these automatic cooking modes, make, I don't know, meringue, cook raw fish, automatically programme in the weight of the fish and it will, de, de, de. It still just gets used for heating up frozen peas and whatever else, milk, I think, occasionally. But yeah, things find a niche in your house, don't they? And then they fulfil that niche no matter what they're capable of.
But it makes me wonder if the big tech firms are desperate to get us to use these things for other things. I mean, they're sending these emails out saying, “Hey, did you know your device can offer you a quiz? Hey, did you know your device can plan your day for you?” But again, as we mentioned, I chatted to Christine about it, there are privacy and trust issues over all of these things. You know, we have Google devices, several of them in the house, and I know that I can go and ask, I mentioned this in the feature, but I can ask for my calendar and it will tell me what my day is looking like, which, you know, is quite convenient.
But anyone could ask it that, you know, a trades person who's, I don't know, fixing the washing machine could ask what I'm doing today and find out when I'm in the house, out of the house, whatever.
Paurav: What an interesting point you made there, Sean. And Google has, just literally about two weeks ago, actually launched a new mode called guest mode on all of its smart devices, Nest and these various other devices. And one of the things this guest mode brings about is that it limits the kind of interactions. But at the same time, what it also does, by doing that, is that it says it does not save those recordings on its servers.
So now I'm feeling like should I be operating on guest mode all the time? But then it limits a lot of other things. So if I want the perks of those other things, like the calendar and this, that and everything, then I need to keep it on. The thing is, how do I remember that I have kept it on, on one device and off on another device and so on and so forth? So there are still a number of implications, as you rightly say. In one of the earlier podcasts, we also talked about this.
There was this issue of a young lady, or young girl, eight years old, who asked Alexa, and did I name it again, but asked the one that should not be named, “What is coming for Christmas?” And the device actually told her all the surprise Christmas gifts that were supposed to be coming from her mum, and that was it. You know, all that was lost. So I think these are really serious challenges.
Sean: This is a standard thing across any technology, though. You can have it as secure and as trustworthy as you like, but it won't be as convenient as you like. And that balance is really hard to kind of, well, it's hard to get that balance sometimes, isn't it?
Elvira: What you mentioned, Paurav, is that it requires such a cognitive demand to ensure that you remember this, that you're looking on Saturday to see what's new, and maybe making notes, trying. So at the end, you become a slave of the technology, because there's so much to remember, so much to try, to test, to invite others to do, because otherwise it doesn't make any sense if it's a joint thing, whatever. I believe that there is a point where you have to decide, otherwise it's just so consuming. It's a trade-off. And I see that with the kids, the technology is so persuasive, and so it's like a magnet. I believe there is no autonomy to disengage.
What you are describing is a context where, how much time can you dedicate to make your devices more optimal and try? So it's a very interesting point where, specifically for more vulnerable groups, they do not have the capacity to decide, okay, I tried today for 10 minutes or an hour, and that's enough. It’s difficult.
Sean: We've talked about this before, The Social Dilemma documentary that's on Netflix discusses the idea of smart devices being a bit like the fruit machine, where you pull the handle for another roll of the dice and, you know, what's it going to give me today? What am I going to look at this time? Or how many likes will my photo have had? Or what new bit of clickbait can I read all about? Yeah, thinking about the education side of things, I mean, at schools, obviously, things are different during lockdown, but even with homeschooling, children at school learn about different industries, and they learn how things are, say, built or manufactured. Is the same thing happening in tech?
Paurav: This is a big issue, Sean, and I think in some sense, tech is kind of tackling it. But at the same time, there are still a number of black holes, as I see it. So when Amazon was criticised quite a bit in the wider media, and through academic articles also, in terms of its practices, its labour practices, its company practices and how it was operating, what it did, in a way, yes, it was a PR campaign, but at the same time, it opened its warehouses to people to see how it actually operated.
A number of videos emerged from top bloggers and top vloggers, who showed how the company operated behind the scenes to get us what we wanted, how everything looked seamless, but what was going on behind it. And we saw some very interesting insights, people understood how this company was operating, what it was doing. It also made the company realise that it needed to change its human-machine interaction. And so new kinds of robots were brought into Amazon warehouses and new kinds of technology and new algorithms were brought in. So the company improved its operations because of the scrutiny.
What is happening right now in the smart speaker industry is at the stage just before those warehousing types of issues came to a head. We have just touched upon the early side of the privacy issue, the transparency issue: what part of my data is being shared, what is being done with it? We've heard some vague answers, you know, Google saying that we are using it to show you some relevant advertisements on your pages and so on and so forth.
Amazon is saying that we are showing you this, we are going to use this to create a recommendation engine on our website, because they don't show ads in particular, but all those kinds of things. What is still unclear is what happens to that data which I sent to that server, how long it hangs around in there. In a way, they may have covered it in a big legal framework, but how many of us have actually read those legal frameworks? And so there is no clarity within the industry, the transparency is lacking in this industry. And I think that is the next big step the industry needs to take for people to trust it more.
Sean: There's possibly, though, an issue with education level in understanding what's going on in some of these industries?
Elvira: Well, definitely there is an education and awareness problem among the citizens, but I believe the problem comes from a lack of responsible research and innovation frameworks and sensitivity. These tech companies, they get really excited about coming out with a new product that they know many people may use or not, but they do not stop and reflect about the consequences for society, for families, for children. And they react to a problem. So, oh yeah, we're going to fix this because there is a problem.
And if there's a lobby against transparency, and there is no clear understanding of what's happening in that black box, then they react and they may simplify, I don't know, the terms and conditions, or they provide an accessible way for professionals using a specific tech to understand the steps that the algorithm is taking and which type of data is being fed in. But what I feel is that there is just a rush to deploy products without thinking of the impact they may have on society. And it's costly, and you need the resources to engage with your stakeholders, to anticipate scenarios that nobody can anticipate, and to evaluate risk. And that exercise is missing, it's completely missing right now.
Sean: There's a famous film quote that comes to mind here. Jurassic Park, “Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should.” Just time now to say thank you to our panellists for joining me again today. Thank you very much, Paurav.
Paurav: Thank you.
Sean: And thank you very much, Elvira.
Elvira: Fantastic, thanks so much.
Sean: Hopefully joining us again on another Living With AI podcast very soon. If you want to get in touch with us here at the Living With AI podcast, you can visit the TAS website at www.tas.ac.uk, where you can also find out more about the Trustworthy Autonomous Systems Hub. The Living With AI podcast is a production of the Trustworthy Autonomous Systems Hub. Audio engineering was by Boardie Limited, and it was presented by me, Sean Riley. Subscribe to us wherever you get your podcasts from, and we hope to see you again soon.
[00:50:47]