From Startup to Exit
Welcome to the Startup to Exit podcast where we bring you world-class entrepreneurs and VCs to share their hard-earned success stories and secrets. This podcast has been brought to you by TiE Seattle. TiE is a global non-profit that focuses on fostering entrepreneurship. TiE Seattle offers a range of programs including the GoVertical Startup Creation Weekend, TiE Entrepreneur Institute, and the TiE Seattle Angel Network. We encourage you to become a TiE member so you can gain access to these great programs. To become a member, please visit www.Seattle.tie.org.
Gen AI Series: Legalities of scraping data to build LLMs, a conversation w/ Adam Shevell, partner at Wilson Sonsini
Many of the large language model providers have built their LLMs by scraping data from websites and open-source repositories. Some have licensed the content, as OpenAI has done with Reddit. Nonetheless, many of these LLM providers have been sued for copying data without permission. In this podcast, Adam shares his thoughts on the fair use doctrine and the various legal opinions on the legality of scraping data to build LLMs. Adam has also written an interesting article on developing, extending, and using generative AI models.
Adam Shevell is a partner in Wilson Sonsini’s San Francisco office, where he co-leads the firm’s technology transactions practice in the city. Adam advises technology companies and their investors at all stages of company development, from pioneering start-ups to leading global enterprises, angel investors, venture capital firms, and other institutions in the start-up ecosystem.
Adam represents leading Silicon Valley companies on complex and strategic transactions involving cutting-edge innovations and the launch of new products. By understanding his clients’ products, markets, and business priorities, and by building deep and lasting relationships, Adam provides creative and pragmatic advice focused on delivering effective results.
Adam also works closely with Canadian start-ups on their U.S. expansion, fundraising, strategic partnerships, and exit transactions, and with Canadian venture funds investing in U.S.-based companies.
Brought to you by TiE Seattle
Hosts: Shirish Nadkarni and Gowri Shankar
Producers: Minee Verma and Eesha Jain
YouTube Channel: https://www.youtube.com/@fromstartuptoexitpodcast
Hello, everybody. Welcome to another episode of From Startup to Exit. My name is Gowri Shankar. I'm on the board of TiE Seattle, and this podcast is brought to you by TiE Seattle. It's been great fun hosting this along with my fellow TiE Seattle board member, Shirish Nadkarni. We've been doing this for years, so thank you all for your support and for spreading the word; I hope you're enjoying the content we bring you from all the guests we've had. Shirish is a serial entrepreneur based in Seattle, and now a serial author. His latest book, Winner Takes All, is available wherever books are sold. His first book, From Startup to Exit, from which we borrow the title of this podcast, has been out for a few years. It's a fantastic book and a must-read for all entrepreneurs. This is another great episode with a fantastic guest, but I'll have Shirish introduce the guest and take it away.
SPEAKER_01:Thank you, Gowri. Today's podcast is about Gen AI and the legal implications of building a model. I'm very pleased to welcome Adam Shevell, who is a partner in Wilson Sonsini's San Francisco office and had worked earlier in Seattle as well. He now co-leads the firm's technology transactions practice in the city. Adam advises technology companies and their investors at all stages of company development, from early-stage startups to leading global enterprises, as well as angel investors, VCs, and other institutions in the startup ecosystem. So welcome, Adam.
SPEAKER_04:Thank you. I appreciate the opportunity to talk with you both, and I look forward to it. Excellent.
SPEAKER_01:So let's start with your background. Can you tell us a little more about your background and how you came to develop AI/ML expertise from a legal perspective? And what kind of AI/ML work are you currently doing?
SPEAKER_04:Sure, I'd be happy to. I've been an attorney for 17 years now, and for most of that time I have focused on technology and intellectual property rights, primarily as they pertain to technology companies, startups, and larger enterprises in the global economy. As an IP attorney, my job is to understand the rules of the road for intellectual property rights in the United States and in other countries, and to understand how those apply to the technologies our clients are building, commercializing, exploiting, and protecting. That applies to any type of startup: it could be a mobile app, an enterprise SaaS product, or a large language model. One thing I like about the job is that I constantly have to learn about new technologies. As technologies develop, it's incumbent on me, to do my job well, to understand the new things coming out, and in many ways artificial intelligence and machine learning are no different. There have been lots of waves of new technology during my career, and each time my colleagues and I have to get our hands dirty understanding what it is, and then figure out what intellectual property issues apply, both generically as clients develop the technology and as they get into the market and start doing business deals with other companies: how do the specific IP rights pertaining to that technology affect the ways you should or should not do deals in the market? Now, what was a little different with AI and machine learning, and specifically generative AI, comes down to two things. One is that the questions of law were so new, in a way we haven't seen in a while. These were genuinely groundbreaking questions, so new that no one really had an idea of how the courts would answer them as they pertain to Gen AI, and we're still waiting for the courts to answer a lot of them. The other factor is the speed with which it was adopted and deployed. ChatGPT was released in November of 2022, and by the spring of 2023 it was hard to keep up with the news: new stories, new companies, new applications, news about ChatGPT and its challengers. The velocity of news coming out was almost dizzying. The speed with which generative AI was adopted, not just by the tech industry but by people of all walks of life using ChatGPT, Claude, and other generative AI products, was really amazing. So it was both the speed and the novelty of the issues that gave us an even greater challenge. As far as professional observations go, it really was the perfect storm for me and my colleagues who do this kind of work to be right in the place where intellectual property rights were so relevant to this new technology, in a way that made our jobs and our advice even more relevant to companies building and using tools in this space.
SPEAKER_01:Great. Thanks for the introduction. So, as we mentioned, there's been a dramatic interest in LLMs, starting with OpenAI, but there are many other companies, like Anthropic, Facebook, Google, and others, that have large language models. And it's rumored that the way they trained these models was by scraping content from the web. Certainly Google had its own content and Facebook has its own content, but OpenAI clearly does not have its own content. So can you talk about the various legal perspectives on what they have done, and how legal it is?
SPEAKER_04:Sure, sure. And I'll say the legal questions around scraping content from the internet are not new, because the internet has been around for almost 30 years now. We have precedents and a track record we can point to as far as the types of claims plaintiffs have made in the past, alleging wrongdoing or foul play by companies or individuals who have copied content from their websites, otherwise known as scraping. The first and obvious one is copyright infringement, and we can get to that in a second. The second is that content owners would argue there's a breach of contract: if someone comes onto their website, they've agreed to the terms of service on that website, and most terms of service advise or require that visitors not copy content without authorization. So there's an argument that scraping content from a website is a breach of the terms of service that apply to that website. Another claim that's been made is trespass to chattels. That's an old way of saying you've interfered with my property: you've come onto my property and done damage. Typically this is used when, say, your access and scraping of a site diminishes its ability to operate well: maybe it's loading slowly or throwing a lot of errors because of the amount of scraping you're doing. In those cases you might see a trespass to chattels claim as well. Another claim that's been thrown around is unfair competition, and this is as straightforward as it sounds: I've spent and invested money and resources in compiling, generating, and creating the content I own on my website; you can't just come along, grab it, and free-ride on my efforts. That's unfair.
unknown:Okay.
SPEAKER_04:The last would be the Computer Fraud and Abuse Act. This is a statute that basically prohibits actors from accessing other people's computer systems without authority and, for lack of better words, doing bad things. A classic example: if someone is hacking another person's computer, that could be a violation of the Computer Fraud and Abuse Act, the CFAA. And there are ways in which plaintiffs will argue that if you're scraping without authorization, you're illegally accessing my computer server and therefore you're in violation of the CFAA. So those are the types of claims that get thrown at companies who are scraping content from the internet. I want to go back to copyright infringement, because that's going to be the biggest one here, and probably the one we discuss the most today. The Copyright Act gives owners of works of authorship, which could be a photo, written language, videos, all sorts of things, the exclusive right to control certain things with regard to that work. They exclusively control the right to reproduce the work and to distribute it; they have the exclusive right to prepare derivative works from it; and they have the exclusive right to publicly display or perform it. In other words, the copyright owner controls who gets to do these things with their work. Obviously, if someone is copy-pasting works from the internet, there's a reproduction happening, there could be a distribution happening, and if they later make derivative works, that's another exclusive right. So on one side, people argue that scraping violates the copyright owners' exclusive rights unless they have authorization from them. That's the basis of how our copyright statute works: we give the author exclusive control over these things, and unless you have permission from the copyright owner, you're prohibited under the act from doing them.
SPEAKER_01:So let's talk a little bit about the Copyright Act. Search engines today crawl the web: they check the robots.txt file to see whether they have permission, they download the entire content of the page, and they can even produce a snippet of that page in the search results. So why are they not committing copyright infringement in that case, even if they've technically been given permission through the robots.txt file?
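For listeners curious what the robots.txt check described in that question looks like in practice, here is a minimal sketch using Python's standard urllib.robotparser module. The site URL, page URL, and "ExampleBot" user-agent name are placeholders for illustration, not anything discussed in the episode.

from urllib import robotparser

# Minimal sketch: ask a site's robots.txt whether a given crawler may fetch a page.
# The URLs and the "ExampleBot" user agent are hypothetical placeholders.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

user_agent = "ExampleBot"
page_url = "https://example.com/articles/some-post"

if rp.can_fetch(user_agent, page_url):
    print("robots.txt permits fetching", page_url)
else:
    print("robots.txt disallows fetching", page_url)

As the discussion that follows explains, robots.txt only signals crawl permission; whether downloading pages and showing snippets is lawful ultimately turns on fair use.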
SPEAKER_04:Sure, sure. I'm glad you raised that, because the counterpart to the exclusive rights of the copyright owner is the fair use right of others to use copyrighted works, and I think what you just touched on comes down to fair use. Fair use is an exception to the exclusive rights of the copyright owner. Section 107 of the Copyright Act says that even though a copyright owner has exclusive rights over their work of authorship, others can exercise those exclusive rights if the use meets certain criteria that constitute fair use. There are four primary factors the court will weigh to determine whether a particular use is fair use. The first is the purpose and character of the use. This is often the most heavily weighted factor in the calculation. Here we look at whether they are using the copyrighted work in the same manner, for the same thing, or for something completely different. The word you'll hear in this regard is transformative: are they taking something and using it in an entirely different and novel way, such that the work's use is transformed? The second factor is the nature of the copyrighted work. The more creative the work, the more protection it gets under this doctrine, and the less creative the work, the easier it is to claim fair use. Think of, say, an instruction manual for your washing machine: it's easier to claim fair use of that than of a work of art in a museum. That speaks to the nature of the copyrighted work. The third question is how much of the work you used. If you use a small snippet of the totality of the work of authorship, it's more likely to be deemed fair use than if you use, let's say, the entire thing. Again, we're trying to find the dividing line between respecting the rights of the copyright owner, who has invested time and energy to produce the work, and others who may want to repurpose it. And lastly, the fourth factor is the effect on the copyrighted work's market. Where the use of the work competes directly with the original copyright owner's use, it's less likely to be fair use; where you're using the copyrighted work in a manner that does not impact the market for the original work, it's more likely to be fair use. The challenge with fair use, and the trickiness of advising companies on whether they can take a fair use approach, is that you can't pre-clear something as fair use. Each specific use case is going to be looked at uniquely by the court. So there's no way we can say ahead of time, this is fair use and this is not. We can advise that this is more likely and that is less likely, but until you go to court and have a judge look at these factors, weigh them, and come to a conclusion as a matter of law, you just don't know whether what you're doing is fair use. Still, once you've looked at these factors, you can get a sense of whether there are ways to do it that are more or less likely to be fair use.
And that's a big challenge for companies, because uncertainty is challenging. Companies are obviously investing a lot of money in developing large language models and other forms of AI, generative or not, and not having answers to whether their conduct with respect to data and content is permitted is a big question mark hanging over the entire endeavor. For instance, I mentioned that the question of whether a use is transformative was traditionally the most important factor in this recipe for fair use. But recently, I believe in 2023, the Supreme Court ruled in a case regarding an Andy Warhol work. The Warhol work was admittedly a derivative of a photographer's photo of Prince, the recording artist. The Warhol print was licensed to a magazine that was writing a story about Prince and ran in the magazine. The court said that, putting aside the question of the creation of the work of art, the act of licensing the Warhol print of this photo to the magazine was in direct competition with the original photographer, whose business was to take photographs and license them to magazines. So it elevated that last factor, the effect on the copyrighted work's market. The court said it was not fair use, because the way the Warhol work was being used directly interfered with the market for the original photograph, which the magazine could have used instead; instead, the magazine purchased a license to use the print from Warhol rather than the photograph from the original photographer. The other thing I'd point out to companies thinking about this is that, unfortunately, the court system doesn't move very quickly. There's no move fast and break things; it's the opposite, move slow and make sure nothing breaks. That's part of the strength of the judiciary and the law: it's conservative and moves slowly so that people have predictability and continuity with the past. But these questions are very important and very pertinent to companies who are developing LLMs and other kinds of AI technologies now, and it's unlikely we're going to get good, clear answers for many years. It's going to be interesting once these cases start landing with judgments, because we'll have many years of development at a very high pace to look back at retroactively, to figure out whether this type of technology might be in peril because of the legal rulings while that type of technology is not. To give an example, the last very large fair use case that went to the Supreme Court was Google versus Oracle, which Google ultimately won. That lawsuit took almost 11 years from first filing to the Supreme Court ruling. That's an eternity in this environment, and it's hard to see how the legal process is going to keep up with the demands of the innovation cycle and how fast this technology is coming to market.
SPEAKER_01:Great. So as we know, large language models today have scraped all kinds of content, including literary works, poems, et cetera. But when they produce content, they're not regurgitating the same content. There have been some instances where people have tricked them into doing that, but in the vast majority of cases they create new content based on what they have learned. So can you not argue that this is transformative content and should be okay under the fair use doctrine?
SPEAKER_04:That's the billion-dollar question here: is the technology's use of the content transformative or not? Depending on which side of the issue you sit on, I think you'll find people with opposite views. The developers of LLMs claim, hey, we might be copying without authorization, but we're putting it to a use that is completely transformative. We're training the weights and biases of these models, and then those models are going to go out and create something entirely new, an unpredictable creation based on the totality of all the content they've been trained on. I have sympathy for that argument. On the other side, you have content owners saying, yes, theoretically that's true, but with the right prompting you can produce, word for word, the songs that were ingested by your LLM, or produce photographs with our watermark on them. Both of those cases are actually going through the courts right now, and I have some sympathy for those too. It's not a home run for either side. It's going to be very interesting, and it may come down to specific judges, specific courts, specific fact patterns. There's a saying in law that bad facts create bad judgments, bad rulings, bad case law. So it really is going to depend on who gets a judgment from where first and who's deciding on it. Okay.
SPEAKER_01:All right, over to you, Gowri, to continue the conversation.
SPEAKER_02:Yeah, this is quite fascinating. The LLM players today, although they're all in some fashion startups, all have deep-pocketed investors or partners, so they have the ability to sustain this lengthy, methodical judicial process. But if you're a startup looking down the stack and trying to figure out whether there's value you can create that solves a very specific problem, how do you think about this? As it is, startups are giving up a lot of money, starting with Nvidia and then the LLM players. Suddenly you also have to set aside a large legal escrow, for lack of a better word, to figure out whether you're within bounds, and the answers may not even be settled; it's all sort of fleeting. So how, as a lawyer, would you advise startups to approach this problem? Because this is the most-funded category at the moment from a VC perspective.
SPEAKER_04:Yeah, absolutely. I think the most important thing for startups is to become educated about all the different issues and understand what we know and don't know: where are the uncertainties, and where do we have better certainty? You need to be pragmatic and understand whether there are ways to de-risk certain facets of your product. For instance, and this may not be completely on target, you can use ChatGPT on a free basis, and OpenAI will then take your inputs and retrain the model with them. Or you can use ChatGPT on an enterprise basis, paying a monthly subscription, and OpenAI will not take your inputs and train the model with them. That's one very easy way to de-risk your use of generative AI. For companies building on the generative AI stack, you just want to get your arms around these legal questions. The question of scraping and copyright infringement, if you're building on top of the LLMs, is mostly an issue for your providers: Anthropic, OpenAI, Meta. The companies doing that ingestion are really the ones facing that risk. You, as the downstream user preparing a product powered by those LLMs, have different considerations. What you don't want to do is produce content that infringes the original content owners' rights; you're not so concerned about the scraping, copying, and training process. So downstream, you might think about what kind of guardrails you can put in place. Do we scan and screen output before it's made available, and how do we do that? You want to understand where your product is situated within the stack, which issues expose you most to risk, and which things are not of concern. Some of this can be mitigated through contracts. For example, if you have a direct-to-consumer product and your consumers have to agree to terms of service, there's a lot of risk you can mitigate through smart contracting with your consumers. One issue companies need to understand around generative AI, if they don't already, is that when the model generates the content, there's no copyright protection on that content, and companies aren't able to patent the output of generative AI either. Because of that, your rights and ability to protect that content are diminished. Knowing that, how do you mitigate certain issues in your business? For instance, if you're hiring a developer to build part of your software stack, you might consider requiring them not to use generative AI in the software they're developing, because you don't want them to deliver a bunch of software that has no copyright on it if you're worried about protecting it. On the other hand, maybe you don't care whether there's copyright on the software because you're just going to use it on the server side; no one's ever going to look at it or have access to it. So, hey, if Copilot can make the developer more efficient, great.
We save money, they save money, perfect. So it's just being aware of the issues, where the risks and pitfalls lie, and then understanding how they line up with your business strategy and making sure you're not accidentally walking into a big gaping risk.
SPEAKER_02:Right. Startups in general usually wake up and think, how do I solve problems? How do I acquire more customers, increase revenues, increase valuation? They don't wake up thinking about the legal ramifications, if any, or about putting guardrails around their fast-moving train, because it's usually an arms race. But this is a unique situation, because there are a lot of court cases going through the courts around generative AI, and especially if you're focusing on enterprises, the enterprises may share data that has never been exposed at all because it's closed; nobody could have gotten to it, and now you're using it to build a model. The minute you mingle that data and create a model to solve a problem in any fashion, it could expose you to legal challenges, if any there are. So the tricky part for a startup is how to balance these two things, because it's not just about the money; it's about building a deliberate culture, a DNA, that allows you to pause. And you've been around startups for a long time: pausing is not a feature of any startup; they just pound ahead. So if you were advising, say, Shirish and me as founders, what would you say we have to be cognizant of on a regular basis as we barrel down the road?
SPEAKER_04:Yeah, great question. First of all, I'd love to advise you two if you're working together. What I'd say is there are certain points of strategy, MVP builds, iterations where you're laying out a new roadmap, and it's worthwhile to sit down, even for a 10-minute chat with your IP lawyer. You don't need a whole team or a full week of meetings; just update them and say, hey, here's what we're doing. Any concerns? What would you flag? What are your biggest concerns? Start the process of bringing them into the tent, as they say, briefly, so they can see what's happening. They'll be able to raise a flag if there's anything. Or they might point in a certain direction and say, look, you're leaving these chips on the table, or you're exposing yourself unnecessarily to risk by doing these things; just close that door and the wind will stop blowing through. Because this is what we do day in and day out, we're really adept at seeing those pitfalls and risks that a founder, who has much more important issues to think about most of the time, isn't focused on. And they shouldn't be focused on them most of the time. Sometimes it's important to roll up your sleeves and do a deep dive, if it's required and it's going to bring real business value to the company. All of this has to be dialed in to what's valuable for the company. And sometimes what's valuable for the company in the moment is taking on risk and worrying about it later, as long as that's an eyes-wide-open decision by the company and the founder. There's nothing wrong with that. But running into risk without knowing it is the thing we're trying to avoid here. Having very informal check-ins, just to let your IP attorney know what's happening, can help you keep your eyes wide open and not miss those pitfalls.
SPEAKER_02:Got it. Yeah, I think that's a great takeaway, Adam. Startup founders have traditionally consulted lawyers only for absolute regulatory or compliance matters. Now they have to be cognizant of what they're doing here: is this building a guardrail around your product? That's a new reality a lot of them will probably learn, and hopefully the investor class will also give them a lot of support in doing that. Part of this is, for example, New York Times v. OpenAI, or GitHub being sued by open source developers. Do these cases point to anything, or are they more about establishing rights around their own content? Who is it going to benefit, and how should we all read these cases?
SPEAKER_04:Yeah, sure, that's a great question. What you see is the battle between content owners and LLM developers, and it's expected. It's hard to tell what the outcome is going to be in any one case, other than that we're hoping to get some rulings so we can start telling our clients pragmatically, here's what such-and-such court held, and here's what you can take from that. Currently we don't really have that yet, so we can't even give direction; we're waiting for it. I think what we're already seeing, and it's no surprise, is content owners being more proactive about licensing deals, and, interestingly, the larger LLM developers agreeing to deals: paying for content, paying for large license deals. Content owners are realizing they're sitting on some gold and that they can do something with it. In some ways, I suspect a lot of the lawsuits are positioning for a settlement-type license deal.
SPEAKER_02:Right, right.
SPEAKER_04:Obviously, the open source software developers who are suing GitHub aren't looking for a big payment; they literally made their software available for free. So it's not about a license. There it seems more about the principle.
SPEAKER_02:Yeah, right?
SPEAKER_04:And as a reminder, with open source software, it's free as in free speech, not free as in free beer, as they say. You're allowed to use it, but there are requirements, even if there's no cost to the user. Typically an open source software license will require at least that the user give attribution to the copyright owner: it says, hey, copyright so-and-so 2024, used under this license, warranties disclaimed. And sometimes there are even stricter requirements, like under the GPL and others. But the argument in this case is that when GitHub copied and trained its model for Copilot, it essentially didn't fulfill the notice requirements of the open source software licenses. So it's a breach of contract: you need to tell your users downstream that the code coming out of Copilot is copyright of the open source developer. And to the extent Copilot reproduces the original open source software very closely, you could argue that they've really failed to give copyright notice and notice of the original license. That's the basis of the dispute. In that case, it doesn't seem like a shakedown for a licensing fee; it really does feel like a principled approach to a real question of law that hasn't been answered yet.
SPEAKER_02:Yeah. And the open source community in general is okay with you using the software. I think the crux is the perceived undue benefit GitHub may have gotten by using their software without a reciprocal exchange of benefit back to the open source community. The principle is, hey, I'm contributing in good faith that you will also contribute so that the community gets better. In this case, it's unclear to the open source community what they're getting back, even if their code was used to train a model. That seems like the crux of the issue. The other thing is, as these cases go through the court system and rulings and settlements happen, the truth is that all the content that's available has already been used for training; the creation of content isn't keeping pace with the training of models, and compute power just keeps getting more powerful. So the next bastion is unpublished content: what sits inside the walls of companies, or your cameras, or your old VCR tapes. So now the question becomes: as a startup, I'm getting something new, making something with it, and giving it back to those who gave it to me. The exchange there could be fair, but suddenly I could also be exposing that data to others unwittingly, just because of the way the workflow works. That becomes a tricky one, because it's not something startups are adept at handling; every time they use an LLM or a model, they have to figure out, what am I giving and what am I getting? I'm jumping off from the open source GitHub example here: if you're a startup or an open source developer, the problems could be similar. How do startups manage this, especially with enterprise data?
SPEAKER_04:Yeah, I think startups should do their best to understand, again, the risks and pitfalls. For instance, early after ChatGPT was released, a team of software engineers at Samsung were using ChatGPT to debug their software, and they uploaded a bunch of confidential software to the prompt box, not realizing that it was going to be used to train ChatGPT. A lot of them lost their jobs after it was discovered. Had they known, they wouldn't have done it. Again, that points to the fact that knowledge is king here. So for startups, really understand the ecosystem and where the different exposure points are. If you're in possession of enterprise data, what does protecting that enterprise data mean in 2024, compared to what it would have meant before 2022? Are there additional protections and industry practices needed to keep your promises about how you're going to protect enterprise data from potentially being exposed to scraping and other issues?
SPEAKER_02:Got it, got it. Excellent. Adam, we could talk for a long, long time. What's interesting about this conversation is that startups don't usually think about lawyers, or even about connecting with them, other than for routine reasons. But the way you presented it today helps startups think about how to collaborate. Just as you have a finance person and an accounting person, as startups become focused on Gen AI, legal should come in further upstream, rather than being the usual afterthought of, hey, I need a contract. The core of what you're building has legal implications, and the way you helped clarify that will be very useful for our audience. Thanks for doing that; I really appreciate it. Shirish, back to you.
SPEAKER_01:Great. Yeah, again, a fascinating discussion. Adam, on that note, how can startups get a hold of you?
SPEAKER_04:Yeah, best would be by email, ashevell@wsgr.com, or on LinkedIn; you can find me there, Adam Shevell. Happy to chat with anyone who has these questions. And again, Gowri, your suggestion is one that speaks to my heart: bringing your lawyer in doesn't have to be a big job or an expensive amount of time, just short little check-ins to make sure you have alignment. If you're building LLMs, if you're building in this space, there are so many legal pitfalls that it would be great to have that product roadmap vetted by your lawyer as you go along, so that you don't have to redo a lot of your work later when you get to, say, a big investment round and your investors have concerns about the legality of what you've done.
SPEAKER_02:Excellent. Yep, appreciate it. All right, thank you, Adam, wonderful talking to you. We will also share the paper you shared with me, which is very instructive; we'll put that in the notes for the podcast. That's great, thanks for having me. Thank you, Adam. Thank you. Thank you for listening to our podcast, From Startup to Exit, brought to you by TiE Seattle. Assisting in production today are Eesha Jain and Minee Verma. Please subscribe to our podcast and rate it wherever you listen. Hope you enjoyed it.