OpenAI wants to work with organizations to build new AI training data sets

It’s an open secret that the data sets used to train AI models are deeply flawed.

Image corpora tend to be U.S.- and Western-centric, partly because Western images dominated the internet when the data sets were compiled. And as most recently highlighted by a study out of the Allen Institute for AI, the data used to train large language models like Meta’s Llama 2 contains toxic language and biases.

Models amplify these flaws in harmful ways. Now, OpenAI says that it wants to combat them by partnering with outside institutions to create new, hopefully improved data sets.

OpenAI today announced Data Partnerships, an effort to collaborate with third-party organizations to build public and private data sets for AI model training. In a blog post, OpenAI says Data Partnerships is intended to “enable more organizations to help steer the future of AI” and “benefit from models that are more useful.”

“To ultimately make [AI] that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures and languages, which requires as broad a training data set as possible,” OpenAI writes. “Including your content can make AI models more helpful to you by increasing their understanding of your domain.”

As a part of the Data Partnerships program, OpenAI says that it’ll collect “large-scale” data sets that “reflect human society” and that aren’t easily accessible online today. While the company plans to work across a wide range of modalities, including images, audio and video, it’s particularly seeking data that “expresses human intention” (e.g. long-form writing or conversations) across different languages, topics and formats.

OpenAI says it’ll work with organizations to digitize training data where needed, using a combination of optical character recognition and automatic speech recognition tools, and to remove sensitive or personal information if necessary.

At the start, OpenAI’s looking to create two types of data sets: an open source data set that’d be public for anyone to use in AI model training and a set of private data sets for training proprietary AI models. The private sets are intended for organizations that wish to keep their data private but want OpenAI’s models to have a better understanding of their domain, OpenAI says. So far, OpenAI’s worked with the Icelandic Government and Miðeind ehf to improve GPT-4’s ability to speak Icelandic and with the Free Law Project to improve its models’ understanding of legal documents.

“Overall, we are seeking partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone,” OpenAI writes.

So, can OpenAI do better than the many data-set-building efforts that’ve come before it? I’m not so sure — minimizing data set bias is a problem that’s stumped many of the world’s experts. At the very least, I’d hope that the company’s transparent about the process — and about the challenges it inevitably encounters in creating these data sets.

Despite the blog post’s grandiose language, there also seems to be a clear commercial motivation here: to improve the performance of OpenAI’s models at the expense of others, and without compensation to the data owners to speak of. I suppose that’s well within OpenAI’s right. But it seems a little tone-deaf in light of open letters and lawsuits from creatives alleging that OpenAI’s trained many of its models on their work without their permission or payment.