# Training Data on Trial: AI's First Fair Use Test

> The Framework Courts Gave Us, and What We Do With It

**By Paul Roberts** | 🌟 Inspiring Change

📊 **4,800 words** | ⏱️ **19 min read**

#Legal_Update #Fair_Use #Inspire #Copyright

December 27, 2024

---

Let me tell you about Sarah.

She's a novelist. Twelve published books. Her latest, a science fiction story about artificial intelligence and consciousness, took three years to write. She poured everything into it: late nights, early mornings, the kind of creative work that feels like mining your soul for words.

And then she learned that her book, along with thousands of others, had been ingested into training datasets for large language models. No permission. No compensation. Just... taken.

Now let me tell you about Marcus.

He's an engineer at a tech startup. His team is building an AI system that could revolutionize how students learn to write. The model needs to understand language at a deep level: syntax, style, narrative structure. To do that, it learns from millions of books, including Sarah's novel. The model doesn't reproduce those books. It learns patterns from them. Marcus believes his work will help millions of students become better writers.

Sarah and Marcus have never met. But in 2025, three federal courts issued decisions that will shape both of their futures.

This article is about those decisions. But more than that, it's about how we, as lawyers, technologists, creators, and citizens, navigate one of the most important questions of our time: How do we build transformative technology while respecting the people who create the works that make that technology possible?

Here's what's at stake.

---

## I. The Framework We've Been Given

In February and June 2025, three district courts delivered the first comprehensive opinions on whether using copyrighted works to train artificial intelligence constitutes fair use under copyright law. The cases were:

- *Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc.* (D. Del. Feb. 11, 2025)
- *Bartz v. Anthropic PBC* (N.D. Cal. June 23, 2025)
- *Kadrey v. Meta Platforms Inc.* (N.D. Cal. June 25, 2025)

On the surface, these are technical copyright cases. But they're really about something bigger: how we balance the incentive to create with the freedom to innovate.

And here's what we need to understand: The courts didn't just issue rulings. They gave us a framework, a way of thinking about these questions that we can apply whether we're building AI systems, writing novels, advising clients, or shaping policy.

---

## II. What the Courts Told Us

### The Question That Matters

The courts converged on a single, powerful question:

**Does the AI system serve the same function as the original work, or does it serve a different function?**

This isn't about the technology itself. It's not about whether you call it "AI" or use fancy neural networks. It's about what the system actually does in the marketplace.

Let's break this down.

#### When AI Training Fails Fair Use: The Ross Intelligence Story

Thomson Reuters built Westlaw, the legal research platform that generations of lawyers have relied on. Ross Intelligence wanted to build a competing platform powered by AI. They needed training data, so they contracted with a company called LegalEase to create it.

LegalEase attorneys accessed Westlaw through their firm subscriptions and created training materials based on Westlaw's copyrighted headnotes, those brief summaries that make legal research possible. Ross bought those materials and used them to train their AI.

The Delaware district court looked at this and said: You're using legal research materials to build a legal research tool. Same function. Same market. That's not transformation. That's competition. And competition using someone else's copyrighted materials, without permission, in the same market? That fails fair use.

The court was clear: "Ross meant to compete with Westlaw by developing a market substitute."
The AI technology didn't change that fundamental reality.

#### When AI Training Succeeds: The Story of Language Learning

Now consider what happened in the Bartz and Kadrey cases.

Authors sued Anthropic and Meta, claiming these companies had copied their novels to train large language models: Claude and LLaMA. The authors argued: You took our creative works without permission. That's infringement.

Two Northern District of California judges looked at the same question: What function does this copying serve? And they reached a different conclusion.

Judge Alsup, in the Anthropic case, put it this way: "The purpose of Anthropic's copying was not to communicate or repackage plaintiffs' expression but to extract statistical information about syntax, semantics, and narrative form."

Judge Chhabria, in the Meta case, agreed: "The use of copyrighted works to teach a machine to generate text is highly transformative because the models do not republish the works; they internalize patterns to create new expressions."

Think about what this means. When you read a novel to learn how to write, you're not replacing that novel. You're learning from it. The courts said: AI can do the same thing.

The novels serve one function: delivering narrative and entertainment to readers. The language models serve a different function: learning statistical patterns to generate new text. Different functions. Different markets. Transformation.

---

## III. Why This Framework Matters

Here's what's at stake: We're watching the birth of a new technological paradigm, and we're doing it in real time.

The framework these courts built matters because it does something remarkable: It protects both innovation and creation.

### For the Innovators

If you're building AI systems, these cases tell you that transformation is your safe harbor. Not the technology itself, but the function you serve.

Are you using copyrighted works to learn patterns, extract information, and create something new?
Are you serving a different purpose than the original works? Are your outputs non-substitutive? If yes, you're on solid ground.

But, and this is crucial, if you're building systems that compete directly with the works you're training on, you need permission. You need licenses. You can't shortcut the market just because you're using AI.

### For the Creators

If you're a writer, artist, or creator, these cases tell you that your work is still protected where it matters most: in the market for your creative expression.

No one can use AI to replace your novels with cheap substitutes. No one can train a system to compete with you in delivering your creative work to readers.

But, and we need to be honest about this, the law recognizes that others can learn from your work the same way humans do. They can analyze it. They can extract patterns from it. As long as they're not replacing you in the marketplace.

### For All of Us

This framework reflects something deeper: the recognition that we need both creativity and innovation.

We need Sarah to keep writing novels. We need Marcus to keep building tools that help students learn. And we need a legal system sophisticated enough to protect both.

---

## IV. The Principles That Emerged

Let me walk you through what these courts actually held. Because if we understand the principles, we can apply them.

### Principle One: Transformation Is Function-Specific

Courts look at what your AI system actually does, not how it works.

Ross Intelligence used AI technology, but it used it to build a legal research tool from legal research materials. The court said: Same function. No transformation.

Anthropic and Meta used novels, but they used them to build language models that don't compete with novels. The courts said: Different function. Transformation.

The lesson? Technology alone doesn't transform. Purpose does.

### Principle Two: Complete Copying Can Be Permissible

This one surprises people.
The authors in Bartz and Kadrey said: "They copied our entire novels! How can that be fair use?"

The courts responded: If complete copying is necessary to achieve the transformative purpose, and if the outputs don't expose the copyrighted works to users, wholesale copying can be permissible as an intermediate step.

Think about it like this: To teach a language model how language works, you need complete texts. Fragments won't capture syntax, narrative flow, or style. The courts recognized this technological reality.

But, critically, the models don't then reproduce those novels for users. The copying is intermediate, not public-facing.

### Principle Three: Evidence Beats Speculation

Here's where these cases matter most for practical decision-making.

In both Bartz and Kadrey, the authors argued: "AI-generated text will flood the market and hurt our book sales."

The courts said: Show us.

And the authors couldn't. No data. No consumer surveys. No economic analysis. Just speculation about hypothetical future harm.

The courts required evidence. Actual harm. Market displacement you can measure.

This is a principle we can work with. If you're defending AI training, you need to show that your system doesn't substitute for the originals. If you're challenging AI training, you need to prove market harm with data.

### Principle Four: Creativity Doesn't Trump Transformation

The authors in Bartz and Kadrey had strong cases on one factor: their works were highly creative novels, exactly the kind of expression copyright is designed to protect.

But the courts held that when use is analytical rather than expressive, the creative nature of the source material carries less weight.

Why? Because even highly creative works can be analyzed for informational content: linguistic patterns, structural elements, stylistic choices. That analysis doesn't exploit the creativity for expressive purposes.

---

## V. The Questions Still Open

Let me be candid: These cases don't answer everything.
We're at the beginning of this journey, not the end.

### The Shadow Library Question

Both Anthropic and Meta sourced some training data from "shadow libraries," repositories like Z-Library and Bibliotik that host unauthorized copies of books.

The courts addressed fair use on the merits without penalizing the defendants for using pirated sources. But they didn't say this was okay. They just didn't address it directly.

This question remains open: Does using materials you know are pirated affect the fair use analysis? Should it?

We need to grapple with this. Because if we're building a framework that respects both innovation and creation, we can't ignore how the data was sourced.

### The Licensing Market Question

Here's where the cases diverged.

In Ross Intelligence, the court recognized a potential market for licensing legal research materials as AI training data. Ross had sought a license. Thomson Reuters had refused. The court said: That's a cognizable market.

In Bartz and Kadrey, the courts declined to recognize nascent licensing markets. They said: Show us an established, functioning market. Not a hypothetical one you're proposing in litigation.

The difference matters. As the AI industry matures, licensing markets may emerge. Publishers may develop standard mechanisms for granting training rights. Trade associations may facilitate these licenses. When that happens, the analysis under factor four could shift.

### The Hybrid Pipeline Question

These cases addressed clean scenarios: systems that either competed (Ross) or didn't (Bartz, Kadrey).

But what about hybrid cases? What if training is analytical, but outputs occasionally reproduce copyrighted text? What if models are fine-tuned to mimic specific authors' styles? What if AI systems generate derivatives that compete in niche markets?

We'll need to refine the framework as these cases arise.

---

## VI. What We Can Do With This

So here's what we have: A framework. Principles.
A way of thinking about these questions.

Now comes the important part: What do we do with it?

### If You're Building AI Systems

You have a path forward.

**Focus on function.** Are you using copyrighted works to serve a different purpose than the originals serve? Are you extracting patterns rather than reproducing expression? Are your outputs non-substitutive? If yes, document it. Be transparent about your training process. Show that you're learning from works, not replacing them.

**Be thoughtful about sourcing.** The courts didn't penalize defendants for shadow libraries, but that doesn't mean it's wise. Build relationships with publishers. Explore licensing where appropriate. Use openly licensed materials where possible.

**Prepare for market analysis.** If you're challenged, you'll need to show that your system doesn't harm the market for the original works. Collect data. Survey users. Build the empirical record that shows you're not substituting.

### If You're Creating Content

You have protections.

**Your market is protected.** No one can use AI to replace your creative work in the marketplace without your permission. If someone builds a system that competes with your books, your art, your expression, you have recourse.

**But analytical use is different.** Others can learn from your work. They can analyze it. They can extract non-copyrightable patterns. That's part of living in a society that values both creation and innovation.

**Consider engagement.** Some creators will oppose all AI training. That's their right. Others may see opportunities in licensing. Still others may embrace the technology. There's no one right answer, but there are thoughtful ways to engage with the question.

### If You're Advising Either

You have tools.

**Apply the functional test.** Start with the question that matters: What function does the AI system serve? How does that compare to the function of the copyrighted works?

**Demand evidence.** Whether you're defending or challenging AI training, move beyond speculation. Build the empirical record. Show, don't assert, market effects.

**Think long-term.** These are district court cases. Appeals are coming. The law will continue to develop. Advise clients with the humility that this framework is emerging, not settled.

---

## VII. The Vision

Here's what I believe: We can get this right.

We can build AI systems that transform how we work, learn, and create, while respecting the people whose creativity makes those systems possible.

We can protect innovation without sacrificing the incentive to create.

We can protect creation without stifling the next generation of technology.

But we can only do it together.

The courts gave us a framework. It's not perfect. It leaves questions open. It will evolve as technology evolves.

But it's a start. And it's a framework built on something important: the recognition that both creativity and innovation matter.

Sarah, the novelist, deserves to know that no AI system will replace her books in the marketplace without her permission. That her creative work is protected where it counts.

Marcus, the engineer, deserves to know that he can build systems that learn from human expression without fear that every training use is infringement. That innovation has room to breathe.

And we, all of us navigating this space, deserve a legal system that's sophisticated enough to protect both.

---

## VIII. The Call to Action

So here's what we need to do.

**First, we need to understand the framework.** Read these cases. Understand the principles. Apply them to your own work. Whether you're building AI, creating content, or advising either, know what transformation means. Know what evidence looks like. Know the difference between competitive substitution and analytical learning.

**Second, we need to build evidence, not speculation.** The courts made clear: Evidence beats speculation.
If you're defending AI training, show that your system doesn't harm markets. If you're challenging AI training, prove the harm with data. Move beyond hypotheticals.

**Third, we need to engage in good faith.** This isn't creators versus technologists. This isn't copyright versus innovation. These are false dichotomies. We need creators and we need innovation. The question is how we make room for both.

**Fourth, we need to stay humble.** These are district court cases. The law will develop. Technology will evolve. What works today may need refinement tomorrow. We should approach these questions with the humility they deserve.

**And finally, we need to remember what we're actually doing here.**

We're not just litigating copyright cases. We're not just building AI systems. We're not just writing novels.

We're shaping how society handles transformative technology. We're deciding what kind of future we want to build. We're determining whether we can protect both the people who create and the people who innovate.

That's the real work.

The courts gave us three cases and a framework. Now it's up to us to build something that honors both creativity and progress.

We can do this. We have the tools. We have the principles. We have courts willing to apply flexible doctrine to new technology.

What we need now is the will to use these tools wisely. To apply these principles fairly. To build systems and create works and advise clients in ways that respect both innovation and creation.

Because at the end of the day, we're not just copyright lawyers or AI engineers or novelists.

We're the people who will decide whether the next generation has both the transformative technology they need and the creative works that make life worth living.

That's what's at stake.

Let's get it right.

---

## IX. The Legal Framework (For Reference)

For those who want the technical details, here's the framework the courts applied:

### The Four-Factor Test (17 U.S.C. § 107)

**Factor One: Purpose and Character of Use**

- Is the use transformative (serving a different function) or substitutive (serving the same function)?
- Is the use commercial?
- Ross: Commercial, same function → Against fair use
- Bartz/Kadrey: Commercial but transformative function → Favors fair use

**Factor Two: Nature of the Copyrighted Work**

- Creative works weigh against fair use
- But weight diminishes when use is analytical rather than expressive
- Ross: Headnotes show only limited creativity → Favors fair use
- Bartz/Kadrey: Creative works, but analytical use reduces weight

**Factor Three: Amount and Substantiality Used**

- Complete copying is permissible when necessary and outputs are non-substitutive
- Ross: Headnotes not made accessible to the public → Favors fair use
- Bartz/Kadrey: Complete copying necessary, outputs don't reproduce → Favors fair use

**Factor Four: Effect on Potential Market**

- The most important factor
- Requires evidence, not speculation
- Ross: Direct competition, licensing market harmed → Against fair use
- Bartz/Kadrey: No market substitution, no evidence of harm → Favors fair use

### The Cases

**Thomson Reuters Enter. Ctr. GmbH v. Ross Intelligence Inc.**
No. 1:20-cv-613-SB (D. Del. Feb. 11, 2025)
Holding: Fair use denied. AI system competed with original in same market.

**Bartz v. Anthropic PBC**
No. 3:24-cv-05417-WHA (N.D. Cal. June 23, 2025)
Holding: Fair use granted. Transformative analytical use, no market harm.

**Kadrey v. Meta Platforms Inc.**
No. 3:23-cv-03417-VC (N.D. Cal. June 25, 2025)
Holding: Fair use granted. Transformative analytical use, no evidence of harm.

---

**Disclaimer**

This article is for general informational purposes and does not constitute legal advice. Reading it does not create an attorney-client relationship. The views expressed are those of the author and do not necessarily reflect the views of any employer, client, or affiliate. For specific legal advice on AI, copyright, or intellectual property matters, consult a licensed attorney in your jurisdiction.