
Unveiling the Shadows: The Transparency Crisis in Large Language Model Training Datasets

The Data Provenance Explorer aims to help machine-learning practitioners make better decisions about the data they train their models on, potentially improving accuracy in real-world applications.
To develop more sophisticated large language models, researchers gather extensive datasets that combine varied information from thousands of online sources.

However, during the merging and rearranging of these datasets into new collections, critical details about their origins and usage restrictions are frequently lost or obscured.

This not only raises legal and ethical issues but could also negatively affect model performance. For example, if a dataset is incorrectly labeled, someone training a machine-learning model for a specific application may unwittingly use inappropriate data.

Moreover, data sourced from unknown origins can carry biases that lead to unfair outcomes when the model is put into action.

To enhance data transparency, a group of multidisciplinary researchers from MIT and other institutions initiated a thorough audit of over 1,800 text datasets from popular hosting platforms. They discovered that more than 70 percent of these datasets lacked crucial licensing details, while around 50 percent contained incorrect information.

Building on these findings, they created a user-friendly tool named the Data Provenance Explorer, which generates clear summaries of a dataset’s creators, sources, licenses, and permitted uses.

“These tools can aid regulators and practitioners in making informed decisions when deploying AI, thus contributing to the ethical evolution of AI,” states Alex “Sandy” Pentland, an MIT professor and co-author of a new open-access paper about this initiative.

The Data Provenance Explorer could empower AI developers to construct more effective models by allowing them to choose training datasets that are aligned with their model’s intended use, potentially improving AI accuracy in real-world applications like loan assessments or customer service interactions.

“Understanding the data on which an AI model is trained is crucial for grasping its strengths and limitations. Mislabeling and confusion over data origins lead to significant transparency issues,” says Robert Mahari, a graduate student in the MIT Human Dynamics Group, JD candidate at Harvard Law School, and co-lead author of the paper.

Joining Mahari and Pentland in the research paper are co-lead author Shayne Longpre, a graduate student in the Media Lab, and Sara Hooker, who leads the Cohere for AI research lab, along with collaborators at MIT, University of California at Irvine, University of Lille (France), University of Colorado at Boulder, Olin College, Carnegie Mellon University, Contextual AI, ML Commons, and Tidelift. Their research is published today in Nature Machine Intelligence.

Focus on fine-tuning

Researchers typically employ a technique known as fine-tuning to enhance the capabilities of large language models designated for specific tasks, such as question-answering. For fine-tuning, they curate specialized datasets aimed at boosting a model’s performance for that particular task.
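
To make this concrete, the minimal sketch below shows what such a supervised fine-tuning run can look like using the Hugging Face transformers library; the gpt2 checkpoint, the qa_pairs.txt data file, and the hyperparameters are placeholder assumptions for illustration, not details from the paper.

```python
# Minimal fine-tuning sketch with Hugging Face Transformers.
# The checkpoint, data file, and hyperparameters are illustrative
# assumptions, not the setup used in the research described here.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumption: one question-answering example per line in a plain-text file.
data = load_dataset("text", data_files={"train": "qa_pairs.txt"})["train"]
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=data,
    # mlm=False selects the causal-LM objective: labels are the input ids
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```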

The research team from MIT concentrated on these fine-tuning datasets, which are often created by researchers, academic bodies, or companies and are licensed for designated uses.

When crowdsourced platforms compile such datasets into larger collections for practitioners’ use in fine-tuning, the original licensing information is frequently overlooked.

“These licenses are important and should be legally enforceable,” says Mahari.

For instance, if a dataset’s licensing terms are incorrect or missing, someone could invest considerable time and money developing a model, only to be forced to take it down later because the training data contained private information.

“Individuals might end up training models without a complete understanding of their capabilities, concerns, or risks, which ultimately derive from the data,” adds Longpre.

To initiate this study, the researchers precisely defined data provenance as encompassing a dataset’s sourcing, creation, licensing history, and characteristics. They then established a structured auditing method to trace the data provenance of more than 1,800 text datasets from well-known online repositories.
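
To make that definition concrete, a provenance record and a naive license check might be sketched as below; the field names and the audit rule are hypothetical illustrations, not the paper's actual audit methodology or the Data Provenance Explorer's schema.

```python
# Hypothetical sketch of a provenance record and a simple license audit.
# Field names and the audit rule are illustrative assumptions, not the
# schema used by the paper or the Data Provenance Explorer.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    name: str
    source: str                 # where the text was originally collected
    creators: list[str]         # who built the dataset
    license_history: list[str]  # licenses attached at each aggregation hop
    characteristics: dict = field(default_factory=dict)  # languages, tasks, ...

def audit_license(record: ProvenanceRecord) -> str:
    """Flag records whose licensing cannot be traced back to a source."""
    if not record.license_history:
        return "unspecified"  # no license info survived aggregation
    # Assumption: keep the earliest license in the chain, since the audit
    # found repositories tend to relabel datasets too permissively.
    return record.license_history[0]

record = ProvenanceRecord(
    name="example-qa",
    source="aggregated collection",
    creators=["unknown"],
    license_history=[],  # lost during merging and repackaging
)
print(audit_license(record))  # -> "unspecified"
```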

After analyzing the datasets, they found that over 70 percent had “unspecified” licenses that were missing significant information. They then worked backward to complete this information, reducing the percentage of datasets with “unspecified” licenses to about 30 percent.

Their investigation revealed that accurate licenses are often more restrictive than those assigned by repositories.

Furthermore, they noticed that nearly all dataset creators come from the global north, which may affect a model’s performance if it’s intended for deployment in other regions. For instance, a dataset in Turkish primarily produced by individuals in the U.S. and China may lack culturally relevant information, Mahari points out.

“We often deceive ourselves into believing the datasets are more diverse than they really are,” he says.

Interestingly, the researchers noted a sharp increase in restrictions on datasets created in 2023 and 2024, possibly reflecting concern among academics that their datasets could be used for unintended commercial purposes.

A user-friendly tool

To assist others in acquiring this information without necessitating a manual audit, the team developed the Data Provenance Explorer. This tool not only enables sorting and filtering of datasets according to specific criteria but also allows users to download a data provenance card that offers a concise, organized overview of dataset traits.
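
As a rough illustration of that workflow, the sketch below filters a handful of mock provenance records and prints a provenance-card-style summary; the record fields and filtering criteria are invented for illustration and do not reflect the Data Provenance Explorer's actual interface.

```python
# Hypothetical sketch of filtering provenance records and emitting a
# provenance-card summary. This is NOT the Data Provenance Explorer's API;
# it only illustrates the kind of workflow the tool supports.
import json

records = [
    {"name": "example-qa", "license": "CC-BY-4.0",
     "creators": ["Univ A"], "languages": ["en"], "commercial_use": True},
    {"name": "example-chat", "license": "unspecified",
     "creators": ["unknown"], "languages": ["tr"], "commercial_use": False},
]

# Filter to datasets usable for a commercial fine-tuning run:
# a known license that permits commercial use.
usable = [r for r in records
          if r["license"] != "unspecified" and r["commercial_use"]]

# Emit a compact "provenance card" for each selected dataset.
for r in usable:
    card = {
        "dataset": r["name"],
        "creators": r["creators"],
        "license": r["license"],
        "languages": r["languages"],
        "permitted_uses": "commercial" if r["commercial_use"] else "research-only",
    }
    print(json.dumps(card, indent=2))
```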

“We hope this serves not just to enhance understanding of the data landscape, but also to empower people to make better-informed decisions about the data they train on,” Mahari states.

Looking ahead, the researchers aim to broaden their study to explore data provenance for multimodal data, incorporating video and speech. They also plan to study how the terms of service of websites that serve as data sources are reflected in the datasets built from them.

As they extend their research, they are engaging with regulators to discuss their findings and the specific copyright implications related to fine-tuning data.

“Achieving data provenance and transparency from the very beginning, when datasets are created and released, will facilitate better access to insights for others,” Longpre concludes.