Nvidia is allegedly scraping YouTube, Netflix, and more to train AI

Can't spell fair use without AI.

Nvidia Omniverse ACE model, with its face replaced with the company logo, holding a logo for YouTube in its hand

As generative artificial intelligence grows in prevalence, it’s important that companies remain transparent about the data used to train models. Generated content doesn’t materialise from nothing, meaning that informed consent from copyright holders is paramount. This is especially true in light of the scale of scraping by Nvidia to train its ‘Cosmos’ AI.

Cosmos, in this instance, does not refer to Nvidia’s existing product of the same name. Instead, it’s the internal codename for an AI model, as reported by 404 Media. According to emails obtained by the outlet, the goal of the project is to build a video foundation model “that encapsulates simulation of light transport, physics, and intelligence in one place to unlock various downstream applications critical to Nvidia.” To train it, the company has been scraping up to a staggering “80 years worth of videos per day.” Unfortunately, the sources apparently include platforms that host copyrighted material, such as YouTube and Netflix.

Nvidia naturally denies any assertion that its practices are in breach of copyright law, also describing model training as a form of fair use due to its transformative nature. However, many platforms do not allow scraping as part of their terms of service, including Netflix and YouTube. More damningly, though, screenshots allegedly show Nvidia employees taking measures to circumvent YouTube’s scraping protections in service of Cosmos.

YouTube actively blocks the IP addresses of users running scrapers or mass-downloading tools on the platform. In response to this, Nvidia employees apparently used Amazon Web Services (AWS) to run and restart virtual machines to circumvent these protections. Curiously, this solution came from a member of the company’s Omniverse team.

To be clear, there’s no indication of Cosmos’ deployment in public or commercial products. However, 404 Media has obtained an alleged chart shared in emails that shows how the model would benefit various products, including GeForce, Omniverse, and others.

I strongly suggest reading the full report on 404 Media for a deeper look into Cosmos and its development. Nvidia has yet to comment publicly on the outlet’s findings outside of comments already provided.

In the absence of any concrete conclusions, instances like this further fuel my general distaste for generative AI. For transparency, I’m wholly comfortable with uses akin to Google’s Magic Eraser, a feature I use frequently on my Pixel smartphone. I also see no problem with DLSS Frame Generation. In terms of generating content wholesale, though, I’m far less comfortable, and I don’t see that position changing for the foreseeable future.