Murtaza Hussain, Ryan Grim, and Waqas Ahmed
Aug 06, 2025
The tech giant is sidestepping guardrails that websites use to prevent being scraped, data show, in a move whistleblowers say is unethical and potentially illegal.
Meta has scraped data from the most-trafficked domains on the internet, including news organizations, education platforms, niche forums, personal blogs, and even revenge porn sites, to train its artificial intelligence models, according to a leaked list obtained by Drop Site News. By scraping roughly 6 million unique websites, including 100,000 of the top-ranked domains, Meta has generated millions of pages of content to feed its AI-training pipeline.
The sites Meta scrapes contain copyrighted content, pirated content, and adult videos, some of which may have been illegally obtained or recorded, as well as news and original content from prominent outlets and publishers. They range from mainstream businesses like Getty Images, Shopify, and Shutterstock to sites hosting extreme pornographic content, including sites advertising explicit sexual material and humiliation porn that exploits teenagers.
While high-profile sites like The New York Times, which has sued to prevent its content from being used to train AI models, are absent from the list, the leak shows that Meta often found ways around the defenses sites use against being scraped. The scrapers ignored common web protocols that site owners use to block automated scraping, including “robots.txt,” a text file placed on websites to tell crawlers which pages they may not index. The data were shared with Drop Site by whistleblowers frustrated over Meta’s support for Israel in conducting its genocide in the Gaza Strip. According to the whistleblowers, the data are indicative of Meta’s unethical and potentially illegal business practices more broadly. Andy Stone, a spokesperson for Meta, rejected the characterization. “This list is bogus,” he said.
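The robots.txt protocol is purely advisory: a well-behaved crawler checks the file before fetching a page, while a scraper that ignores the protocol simply skips that check and fetches the page anyway. The following is a minimal sketch of what a compliant check looks like, using Python’s standard-library urllib.robotparser; the user-agent string and URLs are illustrative placeholders, not Meta’s actual crawler identifiers or any address from the leaked list.

```python
# Minimal sketch of how a compliant crawler consults robots.txt before fetching.
# The user agent and URLs below are illustrative placeholders.
from urllib import robotparser

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name
SITE = "https://example.com"
TARGET = "https://example.com/articles/some-page"

rp = robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()  # download and parse the site's robots.txt

if rp.can_fetch(USER_AGENT, TARGET):
    print("robots.txt permits fetching", TARGET)
else:
    # A compliant crawler stops here; a scraper that ignores the protocol
    # never performs this check and fetches the page regardless.
    print("robots.txt disallows fetching", TARGET)
```

Because nothing in the protocol enforces the check, compliance is left entirely to whoever operates the crawler.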
A list of the roughly 100,000 top websites and content delivery network addresses scraped to train Meta’s proprietary AI models. The list comes from a query run directly on the Meta database; the internal software used to do this is called Spidermate. The list has been reformatted for source protection purposes.
The scraping of data to train AI models has become a major controversy in recent years, with publishers filing lawsuits against budding AI companies accusing them of effectively stealing content to build their AI platforms. Meta itself has been targeted by lawsuits from authors who accused the company of copyright infringement for using their work in its models. AI models require a tremendous amount of training data to work effectively. In one notorious instance that raised alarms among privacy experts, a startup known as Clearview AI, founded in 2017, scraped over 3 billion images from social media to develop a facial recognition tool used by intelligence and law enforcement agencies. The company was later hit with a wave of lawsuits for invasion of privacy.
A lack of transparency about the inputs companies use to develop their AI programs, including fears that extreme or illegal content could be shaping these models, has compounded existing ethical concerns about stealing content from writers, publishers, and ordinary people simply sharing material online. A 2023 Stanford University investigation found that the popular Stable Diffusion text-to-image AI platform had been trained on hundreds of images of child exploitation, raising major ethical questions about the data its models use and the output they produce.
“With pretty much any generative AI model across many domains, the lack of transparency about the training data has been a recurring problem. That’s led to a number of lawsuits over copyright concerns. The only reason we were able to conduct our study was because the training set was transparent enough about the imagery it used that we could programmatically analyze it,” said David Thiel, a data scientist who worked on the Stanford University study and was formerly the chief technologist at the Stanford Internet Observatory. “In this case, with no other information available, you’d just have to hope that Meta followed decent safety procedures when training the model. I have no idea whether that’s true or not.”
Many of the addresses on Meta’s leaked list are not websites themselves but Content Delivery Networks (CDNs), which websites use to cache and store information to improve site performance; rather than scraping the sites directly, Meta pulls content from these CDNs. According to company employees, Meta’s scraping bots visit the same sites repeatedly to search for updated information to scrape, and the list shows all addresses that have been scraped at least once and then used to train AI models. The data are captured using an internal tool called Web Crawler. Even if content is later removed by the site, the scraped data continues to live on Meta’s internal servers and databases, employees say.
Meta recently went on a massive hiring spree, poaching top talent from its biggest competitor, OpenAI, and offering individual bonuses of hundreds of millions of dollars to top AI researchers. Notably, Meta also scrapes data from OpenAI to train its model. Even as it spends billions to lure OpenAI researchers, Meta has drawn the ire of content creators and publishers who say their work continues to be used without their knowledge or consent to build its AI models.
A lawsuit was filed earlier this year against Meta by 13 prominent writers, including Sarah Silverman and Ta-Nehisi Coates, who alleged that copyrighted versions of their books had been used to train Meta’s LLaMA model, but it was dismissed this June on “fair use” grounds by a judge in California. The legal victory for Meta, however, was limited: In his ruling, Judge Vince Chhabria clarified that, while the case brought by these 13 authors was dismissed on the grounds that the use of the copyrighted material was sufficiently “transformative,” the potential harms from companies using scraping tools to train their AI models may just be beginning, opening the door to further legal challenges.
“This order does not stand for the proposition that Meta’s use of copyrighted materials to train its LLaMA models is lawful. Rather, the court concludes only that the allegations in the current complaint are insufficient to state a claim for relief,” Chhabria ruled, adding that, “by training generative AI models with copyrighted works, companies are creating something that often will dramatically undermine the market for those works, and thus dramatically undermine the incentive for human beings to create things the old-fashioned way.”
Meta Problems
Since the start of the Israeli genocide in the Gaza Strip, Meta has been rocked by internal discontent, including employee protests and disclosures of internal information related to corporate cooperation with the Israeli government. In addition to the most recent disclosure of the websites used to train the company’s AI models, whistleblowers have provided other confidential company data, including data revealing a massive campaign to censor and deplatform pro-Palestinian content on Meta platforms, as well as advertising data showing decreases in the effectiveness of Israeli ad expenditures. Some senior policy officials at the company who have been involved in censoring content are also veterans of the Israeli government, including Jordana Cutler, Meta’s policy chief for Israel and the Jewish Diaspora, who previously served five years as an adviser to Prime Minister Benjamin Netanyahu.
The revelation of the data scraping list, however, strikes closer to the heart of the company’s potential profit center as it competes in a growing AI arms race with other tech companies seeking to dominate the field. The use of scraped data from public sources to train AI models remains a legal gray area, with ongoing attempts to stop companies from profiting freely from content created by others online, or from violating users’ privacy by taking data from their social media posts without their knowledge or consent.
The company’s practices around AI have been a continued target of lawsuits as well as regulatory controversy. This July, Meta announced that it would not sign onto the European Union’s code of practice for artificial intelligence, stating that the measures introduce “legal uncertainty” for developers of AI models.
“On some scale, you do want people using AI tools to be able to find information, and you can imagine a use for that,” said Ken Mickles of the digital rights advocacy group Fight for the Future. “But you do not want a company like Meta, which has proven to be an absolutely irresponsible force, to be in a position where it is effectively accruing power over the entire internet by scraping it to build its AI models.”
This article was updated to include comment from Meta and additional information about the list and how it was produced and obtained.