Where does AI training data come from?
A report from The New York Times revealed on Friday that OpenAI may have trained AI models on transcriptions of YouTube videos, and that Google may have done the same.
The report found that, in the hunt for fresh digital data to train its newer, smarter AI system, OpenAI researchers created a workaround called Whisper, which could transcribe YouTube videos into text. That text could then be fed in as new training data for a more conversational, next-generation AI.
Developing GPT-4, the powerful AI model behind OpenAI's latest version of ChatGPT, drew on more than a million hours of YouTube videos transcribed by Whisper, according to The Times' sources.
The Times reports that OpenAI employees discussed how training on YouTube transcriptions could potentially violate YouTube's rules, but OpenAI decided to move forward anyway in the belief that training AI with the videos was fair use.
Knowledge of where the training data came from extended up to senior leadership, according to The Times, with OpenAI's president Greg Brockman allegedly even helping to collect videos.
The Wall Street Journal's Joanna Stern interviewed ...