Apple Uses YouTube Subtitles for Research AI Model, Clarifies It’s Separate from Consumer Products

Apple’s OpenELM Model and YouTube Subtitles: Clarifying the Connection

In July 2024, reports emerged that Apple, along with other tech giants, had used a dataset of YouTube subtitles to train artificial intelligence (AI) models. The dataset comprised subtitles from more than 170,000 videos, including those of prominent creators such as MKBHD and MrBeast. Apple used it to train its open-source OpenELM models, which were released in April of the same year.

OpenELM, described by Apple researchers as a state-of-the-art open language model, was developed to contribute to the research community and advance open-source large language model development. Apple has clarified that OpenELM was created solely for research purposes and does not power any of its AI or machine learning features, including Apple Intelligence.

This distinction is significant because it means that the YouTube Subtitles dataset is not used to power Apple Intelligence. Apple has stated that Apple Intelligence models are trained on licensed data, including data selected to enhance specific features, as well as publicly available data collected by its web crawler.

Furthermore, Apple has indicated that it has no plans to build new versions of the OpenELM model. This decision underscores the company’s commitment to transparency and responsible data usage in AI development.

The use of publicly available data, such as YouTube subtitles, for training AI models has raised questions about data privacy and consent. Apple’s clarification provides insight into how the company approaches data usage in AI training, emphasizing the separation between research initiatives and consumer-facing products.

In summary, while Apple used YouTube subtitles to train its OpenELM model for research purposes, that data does not power Apple Intelligence. Apple says its consumer AI features rely on licensed and publicly available data, a position it presents as part of its dedication to ethical AI development and user privacy.