Unleashing Chaos: How Generative AI Might Disrupt the Internet as We Know It

Researchers have discovered that training newer generations of generative artificial intelligence models using synthetic data can result in self-destructive feedback cycles.

Generative AI models, such as OpenAI’s GPT-4o and Stability AI’s Stable Diffusion, show remarkable proficiency in generating text, code, images, and videos. The challenge, however, lies in the enormous volume of data required for their training. Developers are already bumping up against the limits of available data and may soon run out of fresh material to train on altogether.

Given the scarcity of data, the idea of utilizing synthetic data for training future AI models appears attractive to major technology firms. This is due to several advantages: synthetic data is less expensive than real data, is virtually unlimited in supply, carries fewer privacy concerns (especially relevant for sensitive medical information), and can sometimes enhance AI effectiveness.

Despite these potential benefits, recent research from the Digital Signal Processing team at Rice University indicates that reliance on synthetic data can have significant detrimental effects on subsequent generations of generative AI models.

“Problems arise when the training with this synthetic data is inevitably repeated, leading to a feedback loop we refer to as an autophagous or ‘self-consuming’ loop,” explained Richard Baraniuk, Rice’s C. Sidney Burrus Professor of Electrical and Computer Engineering. “Our team has extensively studied these feedback loops, and the concerning news is that new models can become irreparably damaged after just a few generations of this type of training. This phenomenon is often called ‘model collapse,’ particularly in discussions around large language models (LLMs). However, we believe the term ‘Model Autophagy Disorder’ (MAD), coined by analogy to mad cow disease, is more fitting.”

Mad cow disease is a deadly neurodegenerative condition affecting cows and has a human counterpart caused by consuming infected meat. A significant outbreak in the late 20th century highlighted how mad cow disease spread due to the practice of feeding cows processed remains of their slaughtered peers, hence the term “autophagy,” which originates from Greek and means “self-eating.”

“We shared our findings on the MAD phenomenon in a paper presented in May at the International Conference on Learning Representations (ICLR),” Baraniuk noted.

The research, titled “Self-Consuming Generative Models Go MAD,” represents the first peer-reviewed study on AI autophagy and specifically examines generative image models like the well-known DALL·E 3, Midjourney, and Stable Diffusion.

“We opted to focus on visual AI models to emphasize the potential pitfalls of autophagous training, but similar corruption issues arise with LLMs, as others in the field have recognized,” Baraniuk stated.

Typically, the internet serves as the source for training datasets for generative AI models. As synthetic data becomes more prevalent online, self-consuming loops are likely to develop with each new model generation. To explore various outcomes from these loops, Baraniuk and his team analyzed three types of self-consuming training loops that realistically illustrate how both real and synthetic data combine in generative model training datasets:

  • Fully synthetic loop — Each generation of the generative model was trained entirely on synthetic data drawn from the outputs of previous generations.
  • Synthetic augmentation loop — Each model generation’s training set consisted of a mix of synthetic data from prior generations combined with a fixed amount of real training data.
  • Fresh data loop — Each model generation trained on a blend of synthetic data from previous generations and a new set of real training data.
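To make the three regimes concrete, here is a minimal Python sketch of how the training set is assembled in each loop. All names (`train_model`, `sample_from`, `real_pool`, and so on) are hypothetical placeholders standing in for full generative-model training and sampling; this is an illustration of the loop structure, not the team’s actual code.

```python
# Minimal sketch of the three self-consuming training loops described above.
# train_model and sample_from are stand-ins for fitting and sampling a real
# generative model (e.g., a diffusion model); here they just manipulate lists.

import random

def train_model(training_data):
    """Stand-in for fitting a generative model; simply remembers the data."""
    return list(training_data)

def sample_from(model, n):
    """Stand-in for drawing n synthetic samples from a trained model."""
    return [random.choice(model) for _ in range(n)]

def run_loop(loop_type, real_pool, generations=5, n_train=1000, n_synth=1000):
    """Simulate one of the three autophagous loop regimes."""
    # Generation 1 is trained on real data only.
    model = train_model(random.sample(real_pool, n_train))
    for gen in range(2, generations + 1):
        synthetic = sample_from(model, n_synth)
        if loop_type == "fully_synthetic":
            # Only the outputs of earlier generations are reused.
            training_set = synthetic
        elif loop_type == "synthetic_augmentation":
            # Synthetic outputs plus the same fixed slice of real data each time.
            training_set = synthetic + real_pool[:n_train]
        elif loop_type == "fresh_data":
            # Synthetic outputs plus a newly drawn batch of real data each generation.
            training_set = synthetic + random.sample(real_pool, n_train)
        else:
            raise ValueError(f"unknown loop type: {loop_type}")
        model = train_model(training_set)
    return model
```

The only structural difference between the regimes is where each generation’s real data comes from: none at all, a fixed reusable pool, or a genuinely fresh draw.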

Successive iterations revealed that, without adequate fresh real data, the models began to produce increasingly distorted outputs lacking quality, diversity, or both. Simply put, the more fresh data available, the healthier the AI system.

Comparing the image datasets from successive model generations paints a concerning picture of AI’s potential future. Datasets of human faces start to exhibit grid-like scars known as “generative artifacts,” or they begin to resemble the same individual. Meanwhile, datasets of handwritten numbers can devolve into unreadable scribbles.

“Our theoretical and empirical research has allowed us to speculate on the consequences as generative models become widespread, with future models trapped in self-consuming loops,” Baraniuk explained. “Some outcomes are evident: without sufficient fresh real data, forthcoming generative models are destined for MADness.”

To enhance the realism of these simulations, the researchers included a sampling bias parameter that reflects the phenomenon of “cherry picking,” where users prioritize data quality over diversity—sacrificing variety in the types of images and texts for those that look or sound appealing.

The incentive for cherry picking is that it preserves data quality over more model generations, but this comes at the cost of an even sharper decline in diversity.
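The quality-for-diversity trade can be illustrated with a tiny sketch of biased sampling. The `quality_score` function and the `bias` parameter below are illustrative assumptions, not the paper’s implementation: they simply show how ranking samples by a quality score and discarding the rest narrows what reaches the next generation.

```python
# Hypothetical illustration of the sampling-bias ("cherry picking") idea:
# keeping only the highest-scoring synthetic samples trades away diversity.

def cherry_pick(samples, quality_score, bias):
    """Keep the top (1 - bias) fraction of samples by quality.

    bias = 0.0 keeps everything (unbiased sampling);
    bias close to 1.0 keeps only the best-scoring samples,
    shrinking the variety passed on to the next model generation.
    """
    ranked = sorted(samples, key=quality_score, reverse=True)
    keep = max(1, int(len(ranked) * (1.0 - bias)))
    return ranked[:keep]
```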

“A potential doomsday scenario is that if left unchecked for many generations, MAD could severely degrade the overall quality and diversity of data available on the internet,” Baraniuk warned. “Even short of this extreme, it seems likely that unforeseen consequences will emerge from AI autophagy in the near future.”

The research team included Baraniuk along with Rice Ph.D. students Sina Alemohammad, Josue Casco-Rodriguez, Ahmed Imtiaz Humayun, and Hossein Babaei; Rice Ph.D. alumnus Lorenzo Luzi; Stanford postdoctoral fellow and Rice Ph.D. alumnus Daniel LeJeune; and Simons Postdoctoral Fellow Ali Siahkoohi.

This research received backing from the National Science Foundation, the Office of Naval Research, the Air Force Office of Scientific Research, and the Department of Energy.