Apple and Salesforce AI training datasets co-opt MrBeast, Marques Brownlee videos


A new investigation claims that tech companies used subtitles from more than 48,000 YouTube channels, including those of top creators like MrBeast and Marques Brownlee and of institutions of higher learning like MIT and Harvard, to train their AI models, even though YouTube prohibits the harvesting of platform content without permission.

The investigation, conducted by Proof News and published in conjunction with Wired, found that companies like Anthropic, Nvidia, Apple, and Salesforce used a dataset of 173,536 YouTube videos, including those from Khan Academy, MIT, Harvard, The Wall Street Journal, NPR, the BBC, and late-night shows like The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live.

Marques Brownlee posted an Instagram Reel noting that, in his opinion, “the real story is Apple and a whole bunch of other tech companies are training their AI models using data that they buy from third party data scraping companies some of which get their data in slightly illegal ways… Apple can technically say they’re not at fault for this.”

Wired says that representatives for EleutherAI, the nonprofit AI research lab that scraped and disseminated the YouTube dataset, did not respond to the publication’s requests for comment. The dataset is part of a compilation the nonprofit calls The Pile, which also includes material from the European Parliament, English Wikipedia, and emails from employees of the Enron Corporation released during the federal investigation into the company in the early 2000s.

Wired reports that most of the collections that make up The Pile are accessible to “anyone on the internet with enough space and computing power to access them.” Companies that have done so include Apple, Nvidia, Salesforce, Bloomberg, and Databricks, all of which have publicly acknowledged using The Pile to train AI models.

Jennifer Martinez, a spokesperson for AI startup Anthropic, said in a statement that while the company had used The Pile to train its generative AI assistant, “YouTube’s terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to the Pile authors.”

In his Instagram Reel, Brownlee added, “The double whammy is that I actually pay for more accurate manual transcriptions on every video that we put out… so that means the stolen transcriptions specifically are paid content that’s being stolen more than once.”

His concerns echo those of creators across the world who worry that their work will be consumed or exploited by AI without compensation or permission. Many are currently suing tech companies over the unapproved use of their work.

Wired reports that The Pile is still available on file-sharing services but has been removed from its official download site. Proof News has created a tool to search for creators in the YouTube AI training dataset.

