
Heliox: Where Evidence Meets Empathy 🇨🇦
Join our hosts as they break down complex data into understandable insights, providing you with the knowledge to navigate our rapidly changing world. Tune in for a thoughtful, evidence-based discussion that bridges expert analysis with real-world implications. An SCZoomers podcast.
Independent, moderated, timely, deep, gentle, clinical, global, and community conversations about things that matter. Breathe Easy, we go deep and lightly surface the big ideas.
Curated, independent, moderated, timely, deep, gentle, evidence-based, clinical and community information regarding COVID-19. Running since 2017 and focused on COVID since February 2020, with multiple stories per day, we have built a sizeable searchable base of stories to date: more than 4,000 on COVID-19 alone, and hundreds on climate change.
Zoomers of the Sunshine Coast is a news organization with the advantages of deeply rooted connections within our local community, combined with a provincial, national and global following and exposure. In written form, audio, and video, we provide evidence-based and referenced stories interspersed with curated commentary, satire and humour. We reference where our stories come from and who wrote, published, and even inspired them. Using a social media platform means we have a much higher degree of interaction with our readers than conventional media, and it provides a significant, positive amplification effect. We expect the same courtesy of other media referencing our stories.
Heliox: Where Evidence Meets Empathy 🇨🇦
The Liquid Economy: How AI Could Finally Pay Us Back
See the related episode for the haiku, details, and comic.
Beyond copyright battles lies a revolutionary economic model that could transform how we value human creativity in the age of AI
Copyright and Artificial Intelligence
Part 3: Generative AI Training (Pre-Publication Version)
A Report of the Register of Copyrights, May 2025
U.S. Copyright Office
Recognition, Acknowledgment, Payment On Use, Non-Dilution
(spoken word, 2024)
This is an essential alternative approach that would address several ethical, informatics, and technical issues as we advance.
🧩 The Silent Revolution: When AI Learns to Teach Itself
May 15, 2025 • Season 4 • Episode 26
AI art and copyright
October 28, 2024 • Season 1 • Episode 46
Final Report – Governing AI for Humanity
September 30, 2024 • Season 1 • Episode 17
This is Heliox: Where Evidence Meets Empathy
Thanks for listening today!
Four recurring narratives underlie every episode: boundary dissolution, adaptive complexity, embodied knowledge, and quantum-like uncertainty. These aren't just philosophical musings but frameworks for understanding our modern world.
We hope you continue exploring our other podcasts, responding to the content, and checking out our related articles on the Heliox Podcast on Substack.
About SCZoomers:
https://www.facebook.com/groups/1632045180447285
https://x.com/SCZoomers
https://mstdn.ca/@SCZoomers
https://bsky.app/profile/safety.bsky.app
Spoken word, short and sweet, with rhythm and a catchy beat.
http://tinyurl.com/stonefolksongs
So you type a few words into a prompt, maybe ask for an image or some text, and boom, something amazing comes out. Yeah. It feels like magic sometimes, a poem, a picture that looks real. But, you know, how does it do that? Where does all that creativity or knowledge come from? Well, it really all boils down to data, just massive, massive amounts of it. These AI systems learn by training on enormous data sets: text, images, code, audio, whatever they're supposed to create. That training is everything. And right there, that training process, that's where we hit this huge legal and ethical wall, right? Yeah. Copyright. Exactly. If the AI is learning from potentially copyrighted stuff, which it almost certainly is, what are the rules? It gets messy fast. Super messy. So to help us unpack this, we're looking at a key source today. Yeah, a really important one, a report from the U.S. Copyright Office. They've done a deep dive specifically into generative AI training and how it intersects with copyright law. OK, so our mission for this deep dive is to walk through what the Copyright Office found. We need to understand the tech side, just enough to see the legal issues. Right. Where does copyright actually, you know, touch the process? Then we'll really dig into fair use, which seems absolutely central. Oh, it is. Big time. Explore licensing, which is popping up everywhere now. And also see what other countries are doing. The goal is to give you a framework to make sense of it all. Sounds good. So where do we start? How these things learn? Yeah, the basics. Machine learning. Okay. So machine learning, fundamentally, is about systems learning from data. They find patterns and make predictions without someone programming every single rule explicitly. Like the example in the report, right? Predicting sales based on ad spend by looking at past data, not a fixed equation. Exactly. The system adjusts itself. Generative AI takes that further. It doesn't just predict, it creates new outputs. And that uses these deep learning models, neural networks. Right. Think of them as incredibly complex networks with billions, maybe trillions, of connections: parameters, or weights. Like tiny knobs and dials. Kind of, yeah. And training is basically showing the model tons of data and tweaking all those knobs and dials so it gets better at recognizing patterns and generating stuff that makes sense. And for text, it breaks words down into tokens. Yeah. Numbers. Yep. Converts text into numbers the model can crunch. And the big thing seems to be scale. More data, more parameters, more computing power equals better AI. That's been the trend, yeah. Massive scaling leads to surprisingly capable models. But it also means this insatiable hunger for data. To the point where they might run out of text on the internet. The report mentioned that concern. It's a real concern, yeah. But it's not just about quantity. Quality is crucial too. You know, garbage in, garbage out. So developers choose data based on what the AI is for. Legal AI needs legal text. Medical AI needs medical papers. Exactly. The report mentions Meta looking for diverse language data, or video AI developers needing clips of sophisticated actions like boxing. They curate it. And this data, a lot of it is just scraped from the web. A huge amount, yes. And when the report talks about publicly available data, it often just means it's on the Internet somewhere. Not necessarily that the owner gave permission for it to be used in AI training. That's a key distinction.
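To make the hosts' "knobs and dials" picture concrete, here is a minimal Python sketch of the report's ad-spend example: a one-parameter model that adjusts itself from past data rather than following a fixed equation. The numbers are invented for illustration; real generative models run the same kind of loop across billions of parameters.

```python
# Minimal sketch of the ad-spend example: a one-parameter model "learns"
# the relationship between ad spend and sales by nudging its weight to
# shrink prediction error. All numbers are made up for illustration.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # (ad spend, sales)

w = 0.0    # the model's single "knob"
lr = 0.01  # learning rate: how far to turn the knob each step

for step in range(1000):
    for x, y in data:
        pred = w * x          # the model's guess
        error = pred - y      # how wrong it was
        w -= lr * error * x   # turn the knob to reduce the error

print(f"learned weight: {w:.2f}")  # ~2.0: sales roughly double ad spend
```

The same idea, scaled up, is why training data matters so much: every weight ends up shaped by the examples the model saw.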
Got it. So they grab all this raw data, and then what? Clean it up. Right. A big curation process. Filtering out unwanted stuff: explicit content, watermarks, low-quality material. De-duplication too. That Getty Images lawsuit against Stability AI, that was about watermarks in the training data, wasn't it? The LAION data set. That's a prime example, yeah. It highlights the legal risks right there in the curation stage. And the cleaning, removing website menus, ads. Uh-huh. And the report notes something important. Sometimes this cleaning might remove copyright info, like author names or copyright notices. Which could be a problem under a different part of the law, Section 1202. Potentially, yes. Removing that copyright management information is generally prohibited. Okay, so filter, clean. Then compile these data sets, like The Pile mentioned in the report. Yeah, combining different sources, articles, books, to create these broad training sets. So you've got the data, then the training happens. Pre-training, fine-tuning. What's the difference? So pre-training is usually the massive phase, using that huge, diverse dataset to build the model's general understanding of the world, language, images, whatever. Building the foundation. Exactly. Then fine-tuning is often done later on smaller, more specific datasets, to adapt the model for a particular task, like writing code, or to align its responses with human values. But the report says most of the actual knowledge comes from that big pre-training phase. That's a key point they make, yes. The fine-tuning refines or directs that knowledge, but the core learning happens during pre-training with the massive dataset. Okay, so data collection, curation, training. Now, where does actual copyright law potentially get, well, infringed? The report talks about prima facie infringement. Meaning on its face, or potentially infringing, yeah. The Copyright Office lays out several points. First, just collecting and copying the data. Downloading those huge data sets onto servers, that's making copies. Exactly, making reproductions, which is an exclusive right of the copyright owner. Okay, what else? During the training process itself. When the model is processing the data, temporary copies are made in the computer's RAM, its memory. Even temporary copies can count. They can, yes. It depends on the specifics, but it's another potential point of infringement. And then there's the really tricky one, the model itself, the weights. Ah, yes. This is super complex and heavily debated. The argument, which the report details, is that the trained model's weights, those adjusted knobs and dials, might themselves be considered an infringing reproduction or derivative work of the training data. Wait, how can a bunch of numbers be a copy of, say, a painting or a novel? That feels weird. It does. But the argument goes like this. If the model can output something that's substantially similar or even identical to something it was trained on... Like it memorized it. Exactly. If it can reproduce that expression, then that expressive content must be encoded somehow in the final structure of the weights. The report quotes others saying the model isn't magic; it learned from the data, and that learning is in the weights. So the pattern learned from the copyrighted work, represented by those weights, is argued to be a kind of copy or a new version. That's the core of that particular legal argument. It's applying old law to brand-new tech. Wow.
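The curation stage walked through above, filter, clean, de-duplicate, is easy to sketch. Here is a toy Python version; the specific filter rules are invented placeholders, not any developer's real pipeline. Note how the cleaning step illustrates the Section 1202 concern the report flags: crude boilerplate removal can strip copyright notices along with the menus and ads.

```python
# Toy sketch of the curation step: filter, clean, and de-duplicate
# scraped documents before training. The rules are illustrative only.

import hashlib

def clean(text: str) -> str:
    """Drop 'boilerplate' lines: menus, ads, and, notably, copyright
    notices. Stripping notices like this is the Section 1202 (copyright
    management information) concern discussed above."""
    keep = []
    for line in text.splitlines():
        if "©" in line or line.lower().startswith(("menu", "advertisement")):
            continue
        keep.append(line)
    return "\n".join(keep).strip()

def curate(documents: list[str]) -> list[str]:
    seen, kept = set(), []
    for doc in documents:
        text = clean(doc)
        if len(text) < 100:   # crude quality filter: drop short fragments
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:    # exact de-duplication
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```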
Okay. Any other points of potential infringement? Yes, during deployment, specifically with systems that use something called retrieval-augmented generation, or RAG. And what's that? It's when the AI doesn't just rely on its training. It actively goes out and retrieves current information, like search results or specific documents, to help answer your prompt. It looks things up in real time. Ah, I see. So it pulls in fresh info. Right. And in doing that, it's making copies, temporary ones, or sometimes even displaying parts of that external content, which might be copyrighted. The report mentioned news organizations are worried about this, like their articles showing up in AI answers. Big concern there, yeah. Especially if substantial chunks or the whole article get reproduced, which some lawsuits allege has happened. Okay, so potential infringement at collection, during training, including the model weights, and during deployment with RAG. That's a lot. It is. But there's a major defense in copyright law. Fair use. Exactly. Section 107. This is the heart of the legal battle the report explores. It's a defense, a balancing test, not a simple rule, super fact-specific, case-by-case. And the Supreme Court recently stressed the importance of the use itself, right? Is it transformative? Yes. That's critical. The Copyright Office agrees, too. You have to look at fair use separately for pre-training, for fine-tuning, for RAG. They might have different outcomes. So let's break down the four fair use factors. Factor one: purpose and character of the use. Transformativeness is the big one here. Huge point of contention. AI proponents argue training is highly transformative. They say the AI isn't copying the expression; it's extracting uncopyrightable things, statistical patterns, ideas, associations. A non-expressive use, they call it, or non-consumptive. Right. The AI consumes the data to learn, but the goal is a new service generating novel outputs, not just repackaging the original. But opponents strongly disagree. They do. They argue training absolutely uses the expressive value, especially if you train on art to make art, or train on stories to write stories. They question whether this non-expressive use idea even exists legally, pointing to cases where copying for convenience wasn't seen as transformative. And the Copyright Office itself seems to draw distinctions. Maybe pre-training is more transformative than later stages. The report suggests that, yes, fine-tuning or deployment that directly competes with the training data might be viewed less favorably. How well the AI's guardrails work to prevent copying is also relevant here. Okay, still under factor one: commerciality. Most AI is commercial, right? A lot of it is, yes. And the report says the factor looks at whether the use of the copyrighted work for training serves a commercial purpose. It doesn't matter if the company itself is nonprofit. If the training furthers a commercial goal, that weighs against fair use. Concerns about data laundering, too. Research use turning commercial later. That's mentioned, yeah. Commercial use generally cuts against fair use. And one more bit under factor one. Using illegally obtained data, like pirated works. The report is quite clear on that. Knowingly using pirated or unlawfully accessed data for training weighs heavily against fair use. Access controls exist for a reason, and bypassing them undermines the claim.
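The RAG flow described earlier in this exchange is simple to sketch. Below is a bare-bones Python illustration, assuming a toy word-overlap retriever and a placeholder prompt-building step rather than any real model API. The point is where the copies happen: retrieved text gets reproduced into the prompt at answer time.

```python
# Bare-bones RAG flow: retrieve documents at answer time, then paste
# them into the prompt. retrieve() scores by naive word overlap; a real
# system would use embeddings and send the prompt to a model.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    # This join is the legally interesting moment: external, possibly
    # copyrighted, text is copied into the prompt and may surface verbatim
    # in the answer shown to the user.
    context = "\n---\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```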
Okay, factor two. The nature of the copyrighted work, creative versus factual. Right. Using highly creative stuff, novels, music, art, generally weighs more against fair use than using factual works like databases or manuals. Published versus unpublished also matters. But AI trains on everything, so this factor is mixed. Often, yes. The report suggests it's usually less decisive than factors one and four in these big AI training scenarios, because the data sets are so incredibly varied. Got it. Factor three: the amount and substantiality used. Copying a lot is usually bad for fair use, but... But sometimes copying the whole thing can be fair if it's necessary for a truly transformative purpose. The classic example is Google Books scanning entire books for search indexing. So AI training copies entire works, often many times. That sounds substantial. It does. Proponents argue the model only learns a tiny fraction from each work, statistically. Opponents say taking the whole thing is inherently substantial and question whether it's always necessary. But the report highlights a specific angle for factor three with AI: the amount made available to the public. Tell me about that. This connects directly to what you might see coming out of an AI if the trained model has memorized parts of its training data. Which the report confirms happens. Yes, it states memorization occurs and verbatim extraction is possible for some works. If the model can reproduce copyrighted content, especially near verbatim, that weighs strongly against fair use under factor three. It shows a substantial part wasn't just used internally for learning, but is effectively being made available through the output. So memorization isn't just a bug. It's evidence against fair use. Precisely. It shows that an identifiable chunk of the original work is potentially accessible via the model. And developers use guardrails to try and stop this. Filters, prompts. They do, technical and policy measures to try and prevent the AI from spitting out memorized, copyrighted stuff. Does the report say if these guardrails actually work? It says if they are effective, factor three might weigh less heavily against fair use. But it also stresses that their effectiveness is disputed. It's a subject of current lawsuits, and they definitely aren't perfect. Okay. Factor four: effect on the potential market. Does the AI use hurt the market for the original work? Yeah, does it substitute for it, dilute its value? The debate here gets interesting. Is the harm just to the specific works used in training? Or is it broader, harm to the market for authorship in general? Exactly. The argument noted in the report is that if AI output competes directly with human-created work, it could reduce the value of that human work, thus harming the market for the kind of stuff used as training data. And factor four also looks at licensing, right? If there's a way to pay for this use. Yes. If a licensing market exists or is developing for this kind of use, using works for AI training, then choosing not to license weighs against fair use. Which leads us perfectly into licensing. Yeah. Is it even possible to license all the data needed for these giant models? That's the million-dollar question, or maybe trillion-dollar question. The report dives into the feasibility debate.
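Before the conversation turns to licensing, here is a minimal sketch of the kind of verbatim-extraction check behind that factor-three memorization point: does a model's output reproduce long word-for-word spans of a training work? The 8-word window is an arbitrary illustrative threshold, not a legal standard or the report's method.

```python
# Check whether an output shares any exact n-word span with a source
# work. Any hit means an n-word run was reproduced verbatim.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_spans(output: str, work: str, n: int = 8) -> int:
    return len(ngrams(output, n) & ngrams(work, n))

# verbatim_spans(model_output, training_text) > 0 would be the kind of
# evidence of memorization discussed under factor three.
```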
What are the arguments for it being feasible? Well, the report points out deals are actually happening already. Adobe Firefly licenses data specifically for training its AI. Yeah. Getty Images, Shutterstock, they're licensing their libraries. News organizations too; the AP licensed to OpenAI. Yep. Music companies are exploring it, Rightsify's Hydra and others training on in-house music. There are platforms emerging, like Created by Humans and the Dataset Providers Alliance, trying to build these markets. The report even mentions revenue sharing as a possible model. Okay, so it is happening. But what are the challenges? Why might it not be feasible? Scale is the big one. Licensing everything needed for a massive foundation model, the transaction costs seem astronomical. Finding all the copyright owners, negotiating, especially for stuff scraped from the web. Power imbalance too. Big AI labs versus individual artists or writers. That's a concern, yeah. And some worry about antitrust issues if creators try to band together for collective licensing, though others point out that music CMOs already do this. So direct licensing versus collective licensing, like music rights organizations. Those are the main voluntary models discussed. Collective licensing could lower transaction costs, but some creators might prefer direct deals. The report also mentions statutory licensing, governments stepping in. Briefly. Things like compulsory licensing, where the government sets the terms, generally not liked by rights holders, or extended collective licensing, ECL. What's ECL again? That's where a collective organization can license all works in a certain category unless the owner specifically opts out. Ah, the opt-out part sounds complicated. Controversial, too. Very. Copyright owners strongly prefer opt-in systems, where they have to give explicit permission first. So given all this debate, what's the Copyright Office's take on whether the government should create a new licensing system? Their conclusion right now is that it's premature. They see robust growth in voluntary licensing deals and note a lack of stakeholder support for a government mandate. They think the market should be allowed to develop further. And they suggest that where these voluntary licenses are available, not using them makes a fair use argument weaker under factor four. Exactly. The existence of a market affects the fair use balance. OK. Beyond fair use and licensing: competition. Yeah. Does requiring licenses hurt smaller AI companies? That's a significant policy concern raised. Could licensing costs create a barrier to entry, leading to dominance by big tech? The report acknowledges this, but says it doesn't actually change the fair use analysis itself. It might be a separate policy problem to solve. Though some argue licensing is possible for smaller players, too. Right. The debate continues on that front. And what about other countries? Is everyone wrestling with this? Oh, absolutely. It's global. The report touches on different approaches. Many places are looking at text and data mining (TDM) exceptions. Like the EU? They have one, yes, with an opt-out for rights holders. Japan has one, but it might not apply if licenses are readily available. Singapore too. The UK has one, mainly for research, for now. What about countries with fair use laws like the US? Places like Israel and Korea are figuring out how their existing flexible doctrines apply, similar to the US. Then you see other ideas, like Brazil looking at ECL, or court cases in China dealing with AI training data. Is there any push to get everyone on the same page internationally? There's definitely talk about harmonization, but it's tricky.
Some argue widespread unlicensed training might violate international treaties, like the three-step test for copyright exceptions. That test says exceptions have to be special cases, not conflict with the normal market, and not unduly harm the author. Basically, yes. And the argument is that large-scale AI training might fail that test if done without permission or compensation. Wow. Okay. So to recap: we've got the tech process of training, multiple points where copyright might be infringed, this incredibly complex four-factor fair use test with huge debates around transformativeness and memorization, the rise of voluntary licensing markets facing feasibility challenges, and this patchwork of international approaches. It's a lot. The Copyright Office report really maps it out but doesn't give final answers yet. No, it provides a framework. It acknowledges the interplay: court decisions influence tech and markets, and vice versa. It's watching how things develop. It really feels like we're in the early stages of figuring all this out. The tech, the business models, the legal interpretations are all moving targets. Definitely. The Copyright Office is essentially saying, let's see how the courts handle these fair use cases and how the voluntary licensing market evolves before considering major legislative changes. So a final thought for you, the listener, to mull over, drawing from the report. We're seeing these licensing markets for AI training starting to grow. What if licensing data for AI training becomes part of the normal exploitation of copyrighted works? Yeah, if that becomes the norm. How does that shift the fair use calculation, especially factor four, the market effect? Does unlicensed use become much harder to justify if there's a clear market path to get permission? Or, maybe looking bigger picture, how do we ultimately balance this incredible power of AI, which relies on digesting vast amounts of human creativity, with the fundamental rights of creators to control and benefit from their work? That tension isn't going away anytime soon. It's probably the defining creative and legal challenge of this technological wave. Something to keep watching in the courts and the headlines.