The Screen Lawyer Podcast

Sarah Silverman v. AI Lawsuit Pt. 2 #109

August 02, 2023 Pete Salsich III Season 1 Episode 9

In this episode, Pete Salsich III further explores the controversy surrounding the class action against OpenAI. Ms. Silverman alleges copyright infringement of her book, evidenced by highly detailed summaries generated by the Large Language Model (LLM). So, what constitutes "fair use"? Is it the ChatGPT-produced summary of an original copyrighted book, or the copy that exists inside the Large Language Model's training dataset?

Original Theme Song composed by Brent Johnson of Coolfire Studios.
Podcast sponsored by Capes Sokol.

Learn more about THE SCREEN LAWYER™ at TheScreenLawyer.com.

Follow THE SCREEN LAWYER™ on social media:

Facebook: https://www.facebook.com/TheScreenLawyer
YouTube: https://www.youtube.com/@TheScreenLawyer
Twitter: https://twitter.com/TheScreenLawyer
Instagram: https://instagram.com/TheScreenLawyer

The Screen Lawyer’s hair by Shelby Rippy, Idle Hands Grooming Company.

On this week's episode of The Screen Lawyer Podcast, we're going to dig a little more deeply into the lawsuit filed by Sarah Silverman and some other authors against OpenAI, based on ChatGPT's copying of their original works for training purposes and the output that comes out of that process. There's a lot more to cover here, and we're going to get right into it. Join us. Hey there. Welcome to The Screen Lawyer Podcast. I'm Pete Salsich, The Screen Lawyer. Today I want to spend some time digging more deeply into the recent lawsuit filed by Sarah Silverman and other authors against OpenAI and all of its affiliates, alleging copyright infringement arising out of the large language model's process of accessing copyrighted works to train the model, that is, to teach it how to replicate certain styles of language, text, and so on. We talked about in the last episode that there were actually two different lawsuits filed by two different sets of plaintiffs: same lawyers, same jurisdiction, same defendants, same claims. My guess is that they're going to have to be merged into one lawsuit. Each claims to be a class action, so that's going to have to get sorted out, but ultimately I think there's going to be one lawsuit alleging these claims. And as we talked about in the last episode, that raises a whole series of questions, including a fair use analysis. Part of the issue is that in the complaints, Silverman and the others allege that they were able to put a prompt into ChatGPT asking it to essentially summarize their original copyrighted works, and the summaries that came out were highly accurate. In fact, it's the type of accuracy that can really only come from having the whole book copied and accessed. We talked about the fair use analysis, and when you do a fair use analysis, the question is: what is the new work?
What is the alleged infringing work? We spent some time talking about these summaries, these outputs, because the argument is that the summary that comes out has so much of the original in it that it really is a derivative work, and under copyright law only the copyright owner has the right to make a derivative work of their original copyrighted work. So if you don't have a license or permission from the copyright owner, you can't make a derivative of the original. That's basic copyright law. The argument in those cases is that the summaries are derivatives. And we talked about how, if that's the issue, then you can't really make a strong fair use argument to protect that summary: it has essentially the same purpose as the original, it uses more than is necessary to make the point, and it could impact the market for the original. There's a whole set of reasons why the derivatives themselves probably are not fair use. But the more you start to think about it, and this came up in a conversation we had here at The Screen Lawyer after we finished recording the last episode, kicking around questions like, well, what does this mean, what about that, it prompted some further discussion. I want to dig into that a little bit, because you're probably asking yourself similar questions. So one of the scenarios that was posed was this. I had said in the first episode: can you, or can you not, sue the tool? OpenAI is the tool that is used by others to apparently create derivative works such as these summaries, and those are infringing, but OpenAI didn't put the prompt in there. OpenAI isn't offering summaries of famous works for sale. So what is the infringing act? And the first question, which seemed kind of obvious after a while, was: wait a second, let's say I'm Sarah Silverman and that's my book. I am the copyright owner of my book.
If I tell ChatGPT to produce a summary of my book, aren't I the person creating the derivative? If I'm the copyright owner, or I direct my lawyer or my assistant or somebody else to do this at my own insistence, someone is creating this new derivative work, but that's me. I can't sue myself. All I did was use a tool to create a new derivative work, and I'm really the person that did it. The lawsuit doesn't allege that other people are creating these derivatives, or that other people are selling them or marketing them or using them. It doesn't say that OpenAI is creating these things. So if the derivative, that summary, is the infringing work, one of the arguments might be: you can't sue yourself. You just created something you had the right to create, and you haven't alleged that anybody else has done anything wrong. And I think that's compelling until you start to think, wait a second, what the complaint is really saying is that the only way ChatGPT could actually create summaries as detailed and accurate as these is if it had a copy of the whole book inside its library. The complaint uses the term ingest: it literally swallowed the original book. And that's the copy, that's the infringement: making that copy, going out into the marketplace without permission, obtaining a complete copy of the original work, and putting it in its dataset. According to the complaint and the information they've found, hundreds of thousands of works are included in these datasets. Yes, many may be older public domain works, but many are copyright protected. So the argument is: if ChatGPT is able to produce a summary, it has a copy of my whole book, which it did not have a right to make.
And if that's the complaint, if that's the infringement, and the summaries are simply proof that the original infringing copy exists inside the dataset, now we're back to a fair use analysis that is a little less certain. Because again, is putting the copy into the dataset in and of itself infringement, or is it fair use? This is where I think the fair use argument at least becomes possible, more convincing perhaps, and it turns on the purpose of making the copy inside the dataset. That's what we're now talking about: the copy inside the dataset, not whatever is produced in the output, because that's what OpenAI did. OpenAI did go and get the original, make a copy, and put it in this dataset, at least according to the complaint. So if that's the copy at issue, now we ask: is that copy possibly fair use? And then you go through the factors again. Would that copy by itself impact the market for the original? Just sitting inside a dataset, arguably not, because the rest of us who might want to go buy that book don't necessarily know it's available inside a dataset. How much of the original was used for the purpose? Well, if the purpose was to train the AI on methods of communicating, sentence structure, language use, things like that, which is what they claim, then you probably need the whole book to really make that purpose valuable, because part of what you're teaching the AI is beginning, middle, and end, story arc, language use, structure, themes, all of those things that exist in the book in its entirety. It's kind of hard to say that you only needed some of the book to teach the AI about writing books. Now we're into an almost impossible proof point, but it certainly seems more logical to say: well, we used the whole thing because the whole thing is the teaching component.
And now we look back at the purpose, and the purpose is to train, to teach something new how to do its job. It isn't to put something out in the world, it isn't to make money off this particular work; it's to train. So all of these works are accessed and used without permission, but for the purpose of training something new. That new thing is going to do new things with that information, but isn't supposed to be able to simply produce copies. That's where this is going to be very interesting to see how it all turns out. When I first heard about this, I thought that purpose lined up fairly well with the Perfect 10 lawsuit about the thumbnails of adult images: this is sort of a new thing, and we're going to go access these existing copyrighted works but use them for a different purpose, in this case to train the AI, and that is perhaps fair use. Now, the recent Andy Warhol case from the Supreme Court has given us some new language to use and new ways to think about the different purpose and meaning of the new work versus the original. So it's not a foregone conclusion, but it's certainly a colorable argument that the training purpose is totally different from the purpose of the original, which was simply to entertain. This isn't to entertain the AI; it's to teach the AI things about writing. And you carry that argument along and you're like, well, that's no different from a human being going to the library, or to the Internet, wherever, and reading lots and lots of books that are copyrighted by other people to train themselves to be a better writer. Nothing wrong with that at all. In fact, it happens all the time. That's kind of the argument OpenAI might make against the allegation that the copy in the dataset is the problem. But then you go back to the analogy of a vending machine.
Is AI just like a vending machine? I'm thinking, oh, I want a book like Sarah Silverman's book, I press a button or I issue the right prompts, and out comes essentially a derivative work, a close-enough copy. That's the output. Well, the vending machine is just a vending machine, but how did it get the original in the first place in order to spit out the derivative copy? That's the issue. Can you sue the vending machine? Well, I'm not really suing the vending machine for the output; I'm suing the vending machine because it stole what's in its dataset. That's the infringement. And I think that, at the end of the day, has to be the strongest argument on behalf of the authors, and that's where this is really going to get tested. In a way, this is a good case to do it. I think the summaries that were put out and talked about in the case are interesting, but it's not really a complaint about the summaries; it's more that the summaries are proof that this particular book was copied. So you can imagine the discovery process in this lawsuit. How many other books are they going to go out and ask ChatGPT about: "Write me a summary of this book"? And every time a summary is written of a copyrighted work out there, is that going to be further proof that all of these books are actually in the dataset? It's the building of the dataset that is going to be tested in this case. If the courts decide that building the dataset is fair use, that going out and accessing these copyrighted works and building them into this giant dataset is in fact a different purpose and therefore non-infringing, then I think the lawsuit goes away, because it's now subject to these other problems, namely: hey, you, plaintiff, you created the derivative work of your own work by asking the prompt to do this.
And so therefore you can't sue about the summary, because you did it yourself. You have to sit around and wait, and if you see a new book come on the market that is virtually the same as yours, and you can prove that its author used ChatGPT or something else to copy your book, now you have a straightforward copyright infringement case against that author, but not against the tool they used. And if you're OpenAI, you're trying to say: look, accessing all of these works is necessary to train this new technology to do all these wonderful things, and it is therefore an entirely different purpose that doesn't impact the market for the originals and doesn't use more than necessary. Follow the fair use factors and you get a holding that it's fair use. I have a feeling this is going to work its way back up to the Supreme Court, because it's not really something you can contract around. It's going to get teed up, and frankly, the purpose of this lawsuit may be to do just that: to try to get it elevated as quickly as possible through the court system on that basic issue. I could see the defendants trying to test the basic theory even before class certification, perhaps at a motion to dismiss phase, or the parties agreeing to bifurcate the lawsuit into a liability issue first; then, if liability is possible, you can look at class certification, damages, and other things like that. But if it turns out that the basic premise is fair use, then you shouldn't go through all these other aspects of the case before you decide that main issue. So it wouldn't surprise me, actually, if that becomes the key point fairly early on in the process. We'll keep a close eye on that. But it's interesting: it changes my original thinking about what the summaries are as output and puts them more in the evidence category than the infringing work category. That's probably where they belong, I think.
You know, another thing that's kind of interesting, and that we were talking about, is that there's perhaps going to come a time, and I've seen lawyers and others write about this, when AIs might end up being the judge and jury in situations like this. I realize that isn't going to happen any time soon; we still need human judges, we still need human lawyers, and there still needs to be human decision making in our legal system. I don't think that's going to change any time real soon. But in a way, think about YouTube and its algorithms and their ability to immediately identify copyrighted music when people post new videos. Those are AI-driven algorithms that can spot that information and immediately act on it. There are other examples where essentially this AI tool, or similar AI tools, are already being used to cut through the clutter and hone in on what the real issues are. So in this case, are we going to open up a deposition, take the deposition of ChatGPT, turn on the video cameras, and under oath ask ChatGPT to produce things? What if you asked ChatGPT to identify all the works in its dataset? Can it do that? Probably. I don't know. As lawyers, the discovery process in a case like this, the fact investigation process, is fascinating, and it's going to cost a lot of money in legal fees too, because litigation does. And this is one where there's lots of uncertainty and lots of digging that has to be done. Now, you can certainly take discovery of OpenAI and have it provide all of its information about what's in its dataset.
And one of the things that's actually interesting in that area: the complaints themselves, if you read them, kind of tell a story about how in its earlier days OpenAI was pretty forthcoming and public about the process it was using, what it was going to get, and where its datasets were being built from. Over time, as new versions of ChatGPT came out, OpenAI got a little less forthcoming about what was used to build the datasets for each next generation. The complaint points out that where they used to provide all this information, now they're being much more secretive about where they get it. And if you measure the total number of works in the dataset, the argument is there's no possible way they could have that many works unless they were getting them from these shadow libraries and torrent sites where all of this work exists illegally, in an infringing manner. So if they're getting things from places like that, then they're not even obtaining the original through proper means. I don't know how much of a difference that's going to make to the copyright analysis. If they bought copies of the book, at least the author sold one, and then they fed it in; you still have the issue of the training, but at least you're not taking away the author's ability to make money even from the single sale. So that could be an issue. But ultimately, I think it's going to come down to whether training by itself will be considered a separate purpose, a different enough purpose to be considered fair use and withstand an infringement claim. That's where this is going to come down, and I think it's going to be interesting to see. I'm not sure that we can predict it.
Well, I was pretty sure earlier this year, when all these issues started happening, that the training portion set itself up for a pretty good fair use defense, and arguably it still does. But these complaints have been drafted in an interesting and perhaps compelling way by talking about the summaries, because now it's not just the training. We now know, and apparently they'll have proof, that ChatGPT doesn't just write in the style of Sarah Silverman or in the style of some other author; it can reproduce the original, or come awfully close to it. It can make derivative works easily. And if that's true and people know it, what's to prevent unauthorized copies of original works from being created by ChatGPT and being sold and distributed, simply because it's now very easy to do? You don't even have to go through the somewhat laborious process of copying and writing a new book; just spit it out and try to make the argument that it's not a copy, it came out of ChatGPT, it's legal to learn. I think that's an interesting argument, because you're then starting to argue that even if training in and of itself had a different purpose, you can't say that training is the only purpose, because as the output tells us, infringing is also a purpose. And now you're kind of back into the arguments that go back and forth under the safe harbor language of the DMCA: is the essential purpose of this platform to facilitate and encourage infringement, in which case it may not get to safe harbor?
Or is the possibility of infringement just one thing that's possible among many others that are not infringement, in which case you can't really blame the toolmaker if somebody uses the tool improperly once in a while, as opposed to blaming the toolmaker whose tool really only has one purpose, and that's to infringe? I think in that argument OpenAI has a pretty good case that there's lots more to do than just infringe. This is completely unscientific, but my guess is that the great majority of people using ChatGPT right now are not trying to infringe somebody else's work. They're trying to save themselves time: I've got to write something, so I describe it and have it come out right away without going through all that effort; make it sound funny, make it sound sophisticated, make it sound like whoever. People are being creative and so on, but they're not trying to create illegal derivative works of somebody else's particular work. They're entering prompts, and the AI draws from all of the works it has ingested and spits something out. If I give a prompt that says create a derivative work, basically give me a summary, well, it can do that. But now I've induced the infringement, in a way. So it's not the AI that did it, it's not OpenAI's fault; it's the actor. And if I'm the plaintiff in this case and I'm saying, give me a summary of my own work, I've induced the creation of a derivative. But guess what? I can do that. I'm the owner of the original; making derivatives is something I actually can do, and I'm the only person who can, me or anybody acting with my authorization. That's not infringement. I need to find other people doing it without my permission and then putting it out in the world and impacting my marketplace and so forth.
And then we're back to talking about individual actors and not OpenAI. So if you've followed me along this far, you can tell that I'm still sorting through this stuff, but I am still probably more on the side that the training itself is likely to have a decent shot at a fair use defense, again depending on how it's presented and on what else we see when it comes to evidence. We don't know, for example, what other instances of summaries or derivative works will come up in discovery. So we'll have to see. But we're going to continue to watch this case pretty closely, because I think this is a bellwether. It's not so much about the class certification or other things; it's about this basic test. When the AIs go out into the world and scrape, ingest, whatever the term is, copyrighted works and put them into their dataset to train from, is that act infringement or fair use? That's the key. I think the arguments can go both ways, but I'm leaning towards fair use. It's a technical argument, but I think it's likely to be the case, and it forces us back to what the Supreme Court said in the Prince case, the Andy Warhol case: you've got to look at the use itself. And if the use that's alleged to be wrongful was copying and training, that's a very different use from writing the first book. Don't be distracted by the summaries. The use isn't creating summaries, because as we've already said, those summaries are being created by the original owner. It's the original use of just accessing it and reading it, for lack of a better term, to train. That use has a different purpose, I think, and under the Supreme Court's recent case, that may be sufficient. But we'll see. Stick around, more to come on The Screen Lawyer Podcast. We will spend time with this case.
And as there are developments, we will come back and talk to you about them for sure, and we look forward to having some guests on who have spent a lot more time in this area to help us unpack some of these legal issues. Check us out wherever you get your podcasts; The Screen Lawyer Podcast can be found there. We hope you'll follow us and spread the word on social media if you like the show. And if you're watching this on YouTube, hit that like and subscribe button down below. We'd love to add you to our subscriber list so you can stay current with all the things we do here at TheScreenLawyer.com. Take care.