This guide navigates the complex intersection of AI development and copyright law, addressing the legal and ethical challenges developers face when training AI models on copyrighted content. It explores how copying protected materials (e.g., books, music, databases) for AI training can infringe copyright unless exceptions like fair use or permissions apply, and it outlines the relevant international frameworks and national laws, including the UK’s proposed text and data mining exception, the US’s fair use doctrine, and India’s narrow fair dealing under the Copyright Act, 1957.
Introduction
I remember the first time I tried explaining how AI learns to my aunt. She is a retired schoolteacher and an avid poet who has loved writing since she was a child. I said, “Imagine you are training a robot to write poetry. You want to build a robot that writes the best poems in the world. What do you look for?”
I continued, “So what is your first step?”
“I guess I have to show it the existing poems in the world,” my aunt replied.
“Okay,” I said. “So you feed it thousands of poems by different authors. It reads them, learns the patterns, and then writes something new.”
She paused, raised an eyebrow, and asked, “But… should I not ask the poets for permission first, before using their work to train the robot? Unless the poems are in the public domain.”
There it was. The moral and legal dilemma of our times.
Welcome to the wild west of AI training and copyright law.
If you are building or using AI tools that create content, be it legal research summaries, AI-generated art, or a chatbot that writes bedtime stories, you are probably wondering: Can I legally use books, websites, databases, or songs to train my model? And if you are not wondering that… well, you should be. Because governments, courts, and copyright owners sure are.
Recently, two major events shook the AI and legal world:
i. The UK government proposed letting AI companies use copyrighted material unless creators opt out. That is like saying, “You can eat from anyone’s fridge unless they put a lock on it.” Naturally, many artists and authors are not thrilled.
ii. Meanwhile, in the US, legal research company Ross Intelligence was sued by Thomson Reuters for using Westlaw content to train its AI. The court ruled against Ross, saying its use was not “fair.” That decision could ripple through every company building generative AI today.
There are countless AI models being built today, all for different purposes: some aim to replace front-desk personnel, others to improve legal research.
So, if you are a startup founder, developer, product manager, or even a curious lawyer dabbling in AI, you are probably wondering what the actual copyright rules are and what counts as “fair use” or “fair dealing.”
And what about scraping websites and plugging them all into your model? Most of all, how do you stay out of court, or worse, the headlines?
In this article, I will walk you through all of that: what the international treaties and local laws say, what courts are now deciding, and what best practices businesses and AI developers should follow to comply with copyright law and foster ethical AI innovation.
Whether you are training a chatbot to write legal memos or building an AI that remixes classic Bollywood songs into jazz, it is time to get smart about copyright.
Let us dive in, and yes, we will circle back to my aunt’s poetry robot soon.
What copyright rules apply to AI training?
Let us get one thing straight: just because your AI is learning does not mean it is exempt from the law.
When you feed copyrighted content, such as books, blogs, research papers, databases, films, and music, into your model, the model is not merely ‘reading’ the content the way a human would. The machine is copying, storing, processing, and sometimes transforming that data. And under most copyright laws, copying is a big deal.
So, what is the legal position?
The moment a work is created, say, a novel, a photograph, or a music track, it is protected by copyright. You do not need to register it, put a © symbol on it, or send it to yourself in the mail. It is automatic, thanks to the Berne Convention, which India, the UK, the US, and over 170 other countries have signed. Registration still has evidentiary advantages, but either way, the work is protected by copyright.
This matters because it means copyright holders do not have to “opt in” for protection. You cannot use their work for AI training unless your use falls under an exception or you have permission.
And here is where things get complicated.
International frameworks
Let us zoom out before we zoom in.
- Berne Convention (1886): Establishes automatic copyright protection across member countries. No need for formalities. If it is protected in one country, it is protected in others.
- TRIPS Agreement (WTO): Requires member countries (including India) to offer a minimum level of copyright protection.
- WIPO Internet Treaties: Add extra protections for digital environments, including database and network rights.
So these treaties shape national laws, but do not provide specific rules for AI. That is left to individual countries.
The UK: Proposed opt-out chaos?
The UK recently proposed that AI companies can use copyrighted content for training unless the owner opts out. This is being called a “text and data mining” (TDM) exception. Critics argue this undermines the Berne Convention, which emphasises automatic protection of authors’ rights without the need for formalities. By shifting the burden onto creators to actively opt out, the proposal is seen by many as weakening copyright protections.
The backlash has been fierce, especially from artists, authors, and media companies. The proposal is not law yet, but it tells us that governments are struggling to balance innovation with creator rights.
The US: No free pass for AI
In the US, copyright law offers fair use as a possible defence for using copyrighted materials. But courts have been cautious. In the Thomson Reuters v. Ross Intelligence case, the court ruled that using Westlaw’s copyrighted headnotes to train a legal AI tool was not fair use.
That is a major signal: using structured, high-value data, even if you are building a new tool, can still be infringement.
Let us look a little more into the case so that you have a better understanding going forward.
Ross Intelligence was a startup aiming to develop an AI-powered legal research tool, basically a smarter, faster alternative to traditional legal databases like Westlaw. To do that, Ross needed large volumes of legal text to train its machine learning models. So what was the problem? Ross did not license that data.
Instead, Ross allegedly hired contractors to access Westlaw (a subscription-based legal database owned by Thomson Reuters) and scraped or copied large amounts of content, including editorial headnotes and summaries, to feed its AI training pipeline.
In 2020, Thomson Reuters filed a lawsuit in the U.S. District Court for the District of Delaware, claiming that Ross had infringed its copyrights and violated its terms of service and contractual rights.
At the heart of the case were two big questions:
- Is using copyrighted material to train AI a form of fair use?
- Do AI training purposes justify reproducing and analysing large volumes of structured data?
Ross argued that its use of the content was transformative, i.e., the AI was creating something new for legal research, not competing with Westlaw directly.
But Thomson Reuters disagreed, arguing the headnotes and summaries were original, copyright-protected editorial content, not just raw legal texts. And also that Ross was building a competing product using Westlaw’s proprietary work without permission.
What did the court say?
In February 2025, Judge Stephanos Bibas in the U.S. District Court for Delaware sided with Thomson Reuters, holding that:
- Ross had copied copyright-protected content, including headnotes, and
- The copying did not qualify as fair use under U.S. law.
The ruling emphasised that Ross’s actions harmed Westlaw’s market, especially since Ross was offering a rival product. Even if Ross’s AI created new outputs, the input, obtained through unauthorised scraping and reuse of copyrighted, curated summaries, was still a copyright violation.
Though it is a U.S. case, this decision could influence courts in India, the UK, Canada, and the EU, especially as more lawsuits emerge around AI and copyrighted content.
India: No specific AI exceptions just yet
India does not have a dedicated AI law (yet), but its Copyright Act, 1957, applies just the same. India recognises a doctrine called fair dealing, which serves a purpose similar to fair use in the United States. However, Indian fair dealing is narrower in scope. It applies only to specific purposes enumerated in the statute, such as private or personal use (including research), criticism or review, and reporting of current events. Unlike the open-ended and flexible fair use doctrine in the U.S., Indian courts do not generally extend fair dealing beyond these listed categories.
As a result, uses like data scraping, training AI on copyrighted datasets, or reproducing excerpts of text for algorithmic analysis may not be protected under Indian fair dealing, especially if they fall outside the statutory purposes.
Key takeaways:
- Section 14 gives authors exclusive rights to reproduce and communicate their work.
- There is a fair-dealing exception, but it is narrower than US fair use.
- Current exceptions mostly cover private use, reporting, review, and education, not commercial AI training.
- No specific TDM exception is in place.
So, if you are training an AI model in India using third-party books, articles, or music without permission, you are likely infringing copyright unless you fall squarely within a legal exception (and chances are, you do not).
Keep in mind, though, that the absence of a specific TDM exception does not automatically make all commercial AI training infringing; rather, the narrow scope of fair dealing makes it risky without explicit permission or a clear non-commercial research purpose.
What is fair use vs. infringement?
So remember the poetry robot I told my aunt about? Let us call it Poetron. Say we train Poetron on thousands of poems written by living and dead authors without asking anyone. It starts churning out verses in the style of Sylvia Plath and Gulzar. Everyone’s impressed. My aunt even tears up reading one.
But then a question hits us: Was training Poetron legal?
Well, that depends on one big, slippery concept: fair use.
What is fair use?
Fair use is a legal doctrine, mostly in the US, that allows limited use of copyrighted material without permission under certain conditions. It is a balancing test, not a free pass.
Here is what courts consider:
- Purpose and character of the use
Is it transformative? Are you adding something new, with a different purpose, or just copying it? Training an AI to generate new content might be transformative, but not always.
- Nature of the copyrighted work
Factual works, such as legal databases, are treated more leniently than highly creative ones, such as music and poems. So, Poetron is using court judgments? Better odds. Poetron using Rupi Kaur’s work? Riskier.
- Amount and substantiality
Okay, so did you use just a snippet, or did you feed the entire corpus in? Feeding a whole novel into your model? That is heavy.
- Effect on the market
If your use hurts the original creator’s ability to make money, fair use is less likely to apply. If Poetron starts publishing books that look eerily like living poets’ work, that is a problem.
In the Ross Intelligence case, Ross claimed fair use for training its AI on Westlaw summaries. The court disagreed, saying the use was not transformative and directly threatened Westlaw’s market.
UK & EU: Not quite fair use
The UK does not have a general fair use doctrine; it relies on fair dealing, which is stricter. There are narrow exceptions like:
- Non-commercial research
- Criticism and review
- Education
But here is the thing: the UK added a TDM (text and data mining) exception in 2014 for non-commercial research. The new proposed reform would expand that to commercial AI training, unless the copyright owner opts out. This has stirred up quite the copyright hornet’s nest. In the EU, similar TDM exceptions exist but also come with opt-out clauses.
India: Narrow fair dealing
In India, we use the term fair dealing (not fair use), and it is pretty specific:
- Private or personal use
- Criticism or review
- Reporting current events
- Education and research
None of this clearly covers commercial AI training. If you are a startup scraping blogs, articles, and books to train your model, fair dealing probably will not save you.
Also, Indian courts have not yet dealt with a major AI copyright case, but you do not want to be the test case, do you?
Best practices for AI companies and developers
Let’s face it, none of us wants our next big AI breakthrough to end up as Exhibit A in a courtroom. Whether you are building the next ChatGPT, a design assistant, or our old friend Poetron, the poetry robot, the smartest move is staying ahead of legal trouble.
Here is how you can do just that.
Use licensed or public domain data
Start with content you actually have the right to use. Seems obvious, right? But you would be surprised how many models are trained on scraped web content without checking the license.
Stick to open-source datasets with clear licenses (e.g. Creative Commons). Use works in the public domain (e.g. Shakespeare, Tagore, Beethoven). Use licensed databases, where you have paid for access or have written permission.
For example, if you are training an AI to analyse legal writing, use open-access court opinions, not Westlaw headnotes (unless you want a visit from Thomson Reuters’ lawyers).
Always check what exact rights the license gives you. Some CC licenses prohibit commercial use or derivative works.
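To make that check concrete, here is a minimal sketch of an automated license gate you might run over dataset metadata before training. The license identifiers are standard Creative Commons/SPDX codes, but the policy table and function are illustrative assumptions, not legal advice; it conservatively requires both commercial-use and derivative rights, and treats unknown licenses as all rights reserved.

```python
# Illustrative license gate: the policy table is an assumption, not legal advice.
CC_POLICY = {
    "CC0-1.0":         {"commercial": True,  "derivatives": True},
    "CC-BY-4.0":       {"commercial": True,  "derivatives": True},
    "CC-BY-SA-4.0":    {"commercial": True,  "derivatives": True},  # share-alike still applies
    "CC-BY-NC-4.0":    {"commercial": False, "derivatives": True},
    "CC-BY-ND-4.0":    {"commercial": True,  "derivatives": False},
    "CC-BY-NC-ND-4.0": {"commercial": False, "derivatives": False},
}

def usable_for_commercial_training(license_id: str) -> bool:
    policy = CC_POLICY.get(license_id)
    if policy is None:
        return False  # unknown license: fail safe, since silence is not permission
    return policy["commercial"] and policy["derivatives"]

print(usable_for_commercial_training("CC-BY-4.0"))     # True
print(usable_for_commercial_training("CC-BY-NC-4.0"))  # False
```

The fail-safe default mirrors the legal reality discussed above: no known license means no permission.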
Get explicit permissions or licensing deals
This is the gold standard. If you want to use copyrighted music, books, or data sets for training, you can ask.
Some large AI companies are already striking such deals. That said, OpenAI and Google primarily argue that training on copyrighted material is fair use rather than obtaining permissions for all data. OpenAI, for instance, has claimed that training on publicly available internet materials is fair use, and Google has pushed for text and data mining exceptions without significant licensing. While they may license some data, their general practice relies on fair use.
But it is best if you get explicit permission.
Smaller startups are reaching out to niche content creators, offering licensing fees or revenue share.
It might cost more, but it is cheaper than a lawsuit, far better for your PR, and the ethically sound choice.
Be transparent about your training data
If you are using AI in your product or service, be open about what went into it. Investors, users, and regulators are starting to ask.
So what can you do? Publish a data sourcing policy, and maintain documentation of the datasets used, including licenses and permissions. Also, clearly state whether you used any third-party copyrighted works.
It goes without saying: transparency builds trust, and if a question ever comes up, you will be glad you kept receipts.
Avoid using “valuable structured data” without permission
Let us go back to the Ross Intelligence case for a second. One reason Ross lost was that it used Westlaw’s editorially created headnotes, which are not just case summaries, but proprietary content that Thomson Reuters sells.
If you are eyeing:
- Premium legal summaries
- Music with curated metadata
- Academic papers behind paywalls
- Paid-for news archives
…you better have a license.
Courts are especially protective of structured content that took effort, time, and money to produce.
Audit your model outputs
Sometimes it is not what you put in, but what comes out.
Let us say Poetron, trained on copyrighted poems, spits out a verse that closely resembles an original line from a well-known poet. That is called memorisation, and yes, it can still be infringement.
So what can you do? You can:
- Run tests on the generated outputs.
- Use plagiarism detection tools.
- Set up filters to flag overly derivative content.
This is not foolproof, but it is a solid safety net.
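One crude but useful screen is n-gram overlap between generated text and the training corpus. Below is a minimal sketch; the 8-gram window, the 20% threshold, and the sample strings are arbitrary assumptions you would tune for your own data.

```python
# Flag generated text that reproduces long runs of the training corpus verbatim.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(generated: str, training_ngrams: set, n: int = 8) -> float:
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & training_ngrams) / len(gen)

# Usage: build the index once over the training corpus, then screen each output.
corpus_index = ngrams("the full text of every poem in the training set ...")
sample = "a freshly generated verse from Poetron ..."
if overlap_ratio(sample, corpus_index) > 0.2:  # threshold is illustrative
    print("Flag for human review: possible memorisation")
```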
Keep an eye on changing laws
AI law is evolving faster than Poetron can rhyme “melancholy” with “technology.”
- India may soon release its Digital India Act, which could regulate AI and data usage.
- The EU AI Act will likely set global norms for risk-based AI regulation.
- The UK may or may not move forward with its opt-out copyright exception.
- In the US, courts are drawing the line between innovation and infringement, one lawsuit at a time.
So set up alerts, follow legal updates, and talk to an IP lawyer before scaling your AI product globally.
Common questions about AI and copyright
If I had a rupee for every time someone said, “But it is on the internet, so it must be free,” I would have enough to buy Poetron a copyright lawyer. The truth is, there is a lot of confusion around what AI developers can legally use, especially when it comes to training data.
Let us clear the air.
- If it is publicly available online, is it free to use?
This is the most common and dangerous assumption.
Just because a song is on YouTube or a research article is on someone’s blog does not mean you can download it, copy it, and use it to train your AI model. Availability ≠ permission.
Publicly accessible does not mean public domain. Copyright still applies unless the owner has clearly waived it or granted you a license.
- If you credit the creator, does that make it not infringement?
Attribution is a great ethical step, but it does not get you out of legal trouble.
Copyright infringement happens the moment you use someone’s work without permission, even if you say who made it. Attribution does not replace a license.
- If it is AI-generated, does it become original?
Developers often assume that whatever their model produces is new and unique. But AI models are known to memorise and reproduce parts of the training data, especially if the dataset is small or the material is distinctive.
If the output closely resembles the input, it can still infringe copyright, even if generated by a machine. The wave of AI-generated images in Studio Ghibli’s style is a well-known example of this.
- Is scraping data from websites always legal?
Some folks rely on automated bots to gather massive datasets from blogs, articles, forums, and more. They assume scraping is just a form of reading.
Web scraping can breach copyright law, terms of service, and data protection laws. In India, scraping protected content without permission may violate the Information Technology Act and trigger civil liability.
Also, many sites explicitly prohibit automated scraping in their terms of use.
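At a minimum, a crawler should honour robots.txt before fetching anything. Here is a minimal sketch using Python’s standard library, with a placeholder URL and user-agent; note that passing this check only means the site tolerates crawling, it does not grant copyright permission or override the terms of service.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def may_crawl(url: str, user_agent: str = "my-training-crawler") -> bool:
    """Check robots.txt before fetching. Passing this check is NOT a license."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(user_agent, url)

print(may_crawl("https://example.com/articles/some-post"))  # placeholder URL
```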
- Does fair use cover all transformative uses?
Developers love the word “transformative.” If your AI is building something new from old data, that must be fair use, right?
“Transformative” is one part of the fair use test, not a golden ticket. As we saw in the Ross Intelligence case, even using legal content to build an AI tool can fail the fair use test if it harms the original market or uses too much of the work.
- “No one will notice; I am too small to be sued.”
Small startups often think they can fly under the radar. But copyright enforcement is becoming more automated and aggressive, especially in sectors like media, publishing, and law.
You do not actually have to be famous to be noticed. All it takes is one complaint or takedown notice to turn your weekend side project into a legal headache.
- Is Indian copyright law more flexible for innovation?
There is a belief that Indian copyright laws are more lenient because we are still developing AI policies.
The Indian Copyright Act, 1957, does not provide blanket exceptions for AI training. Fair dealing is narrow, and most uses outside education, research, and private study will not qualify. If anything, India is under pressure to align with global standards, not relax them.
The AI copyright compliance checklist
Alright, so you have built something brilliant, an AI that writes legal emails, paints wildlife art, or narrates bedtime stories in the voice of a 1970s Bollywood star. Now the real question: Are you in the clear, legally?
Here is your go-to checklist to make sure your AI project respects global copyright standards. Use this like a litmus test whether you are a solo developer, a startup founder, or part of a corporate R&D team.
Step 1: Know what is in your training data
Make sure you have documented the sources used. Are they public domain, openly licensed, or explicitly permitted? Did you scrape them? If yes, check the site’s terms of service.
Keep a spreadsheet or internal log of datasets, with names, sources, license types, and notes on usage rights.
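If a spreadsheet feels too manual, the same log is easy to keep in code. A minimal sketch follows; the field names and the sample entry are illustrative, not a standard schema.

```python
import csv
from dataclasses import asdict, dataclass, fields

@dataclass
class DatasetRecord:
    name: str
    source_url: str
    license_type: str  # e.g. "CC-BY-4.0", "public domain", "licensed"
    usage_notes: str   # who granted what permission, and when

records = [
    DatasetRecord(
        name="open-court-opinions",
        source_url="https://example.org/opinions",  # placeholder URL
        license_type="public domain",
        usage_notes="Government works; no editorial headnotes included",
    ),
]

# Write the log so it can double as audit evidence later.
with open("training_data_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(DatasetRecord)])
    writer.writeheader()
    writer.writerows(asdict(r) for r in records)
```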
Step 2: Check licenses
Do not assume anything. If it is Creative Commons, check for commercial-use restrictions. If it is paid content (e.g., academic papers, books, music), make sure you have a license. And if it is user-generated content, confirm there is a consent mechanism.
Reach out to rightsholders, especially if your model will be used commercially. Draft simple license agreements if needed.
Step 3: Use protective technical practices
You should test outputs for memorisation or reproduction. Make sure you have filters to avoid verbatim copying. Also have a human-in-the-loop system for sensitive domains (e.g., law, medicine, journalism).
Run a “content similarity check” for generated samples, especially in creative or legal AI tools.
Step 4: Set up internal policies and documentation
Make sure you have a written data sourcing and content usage policy. Brief your engineers on copyright risks, and have a proper fallback plan in case takedown requests come in.
Include copyright awareness in onboarding materials or product documentation. Train your team like you would train your AI, ethically.
Step 5: Monitor laws in key jurisdictions
You will want to watch:
- India – for updates under the upcoming Digital India Act
- US – case law developments like Thomson Reuters v. Ross
- UK – if the proposed opt-out model progresses or gets dropped
- EU – enforcement under the TDM opt-out model and AI Act
Subscribe to newsletters by law firms or IP think tanks. Even a quick monthly check-in is better than nothing.
Conclusion
We are in uncharted territory, the ‘mess around and find out’ phase, which is a scary place to be.
There are no easy answers when it comes to copyright and AI. But being proactive, transparent, and respectful of creators’ rights? That is always a good call, no matter what the robot poets of the future say.
So go ahead, build your AI, change the world. Just do not forget to carry the license key along with your API key.
FAQs
1. How can developers navigate conflicts between local AI regulations and international standards?
Developers may face situations where domestic laws contradict global standards (like GDPR or OECD principles). In such cases, it is important to seek legal guidance to prioritise compliance based on jurisdictional reach, extraterritorial application, and risk exposure.
2. Are there specific strategies for integrating ethical principles into agile AI development workflows?
Yes, one strategy is to embed ethical checkpoints into sprint reviews or product backlog grooming. Another is using checklists like the IEEE Ethically Aligned Design framework during each iteration cycle.
3. What should startups with limited resources prioritise when trying to align with global AI standards?
Startups should begin with high-impact, low-cost actions such as clear documentation of data provenance, ensuring consent for personal data, and bias audits for models. They can also use open-source toolkits for fairness and transparency assessments.
4. How do international AI standards apply to pre-trained models sourced from third parties?
So, even if you did not train the model, you are still accountable for how it is used. You should assess the source’s documentation, licensing, and ethical claims, and perform independent audits before integration.
5. What role do AI lifecycle management tools play in global compliance?
Lifecycle management tools help maintain version control, track changes, and document decisions, all of which are essential for audit readiness and regulatory reporting under frameworks like the EU AI Act or NIST AI RMF.
6. How can developers ensure transparency when using black-box models?
Techniques like LIME or SHAP can be used to generate post-hoc explanations. Additionally, developers should offer clear disclaimers about the model’s limitations and ensure users understand the rationale behind outputs where possible.
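For example, with a tree-based model, a few lines of the shap library produce per-feature attributions for each prediction. A minimal sketch, assuming scikit-learn and shap are installed (output shapes vary slightly across shap versions):

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Train a simple model on a built-in dataset.
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Post-hoc explanation: how much each feature pushed each prediction.
explainer = shap.TreeExplainer(model)
explanation = explainer(X.iloc[:5])  # explain the first five rows
print(explanation.values[0])         # per-feature contributions for row 0
shap.plots.bar(explanation[0])       # the same contributions, as a chart
```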
7. Are there domain-specific standards that developers should be aware of beyond general AI principles?
Yes, for example, medical AI may be subject to FDA regulations in the U.S., or ISO 13485 for medical device software globally. Financial AI may need to comply with Basel III or local central bank guidelines.