AI’s Incest Problem

Grant Viklund
4 min read · Mar 5, 2024

A big issue that AI, particularly generative solutions, will need to reckon with quickly is now on the horizon. While the technology’s growth is impressive, not everyone realizes the enormous amounts of data needed to feed these beasts as they grow. They consume large pools of data from many sources, often from the internet, ranging from scholarly articles to social media content. However, as more content on the internet is generated by AIs, a significant challenge looms: the rise of AI Incestuousness.

AI Incestuousness occurs when AI models are trained on data primarily generated by other AI systems. Experts predict that by 2025–2030, over 99% of internet content could be AI-generated (1). This poses a risk of creating an echo chamber in AI learning, limiting the diversity and richness of its knowledge base.

A fitting analogy is dog breeding. In shaping a desired bloodline, certain desirable traits are amplified, but so too are the undesirable ones. In the quest to shape something that naturally evolves over generations, the breeder often introduces unintended issues. Similarly, AI models trained predominantly on AI-generated content could absorb and amplify inherent biases and limitations, leading to poor-quality output and overfitting on a grand scale (2).
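The dynamic is easy to see in a toy simulation (purely illustrative, with made-up numbers, not any real training pipeline): treat “training” as fitting a simple statistical model to data and “generation” as sampling from that model, with each generation trained only on the previous generation’s output and a slight preference for typical, high-probability samples. The diversity of the data collapses within a handful of generations.

```python
import random
import statistics

def fit(samples):
    # "Train" a model: estimate the mean and spread of the data.
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, std, n, rng):
    # "Generate content" from the model, favoring typical outputs:
    # samples beyond 2 standard deviations are discarded, mimicking
    # models that over-produce high-probability content.
    out = []
    while len(out) < n:
        x = rng.gauss(mean, std)
        if abs(x - mean) <= 2 * std:
            out.append(x)
    return out

rng = random.Random(42)

# Generation 0: diverse "human-made" data.
data = [rng.gauss(0.0, 1.0) for _ in range(500)]

spreads = []
for gen in range(10):
    mean, std = fit(data)
    spreads.append(std)
    # The next generation trains only on this model's output.
    data = generate(mean, std, 500, rng)

print(f"spread at generation 0: {spreads[0]:.2f}")
print(f"spread at generation 9: {spreads[-1]:.2f}")
```

Each round of tail-trimming multiplies the spread by a factor below one, so diversity decays geometrically; research on “model collapse” describes a similar loss-of-tails effect in real generative models.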

Adding to the problem, the tightening of intellectual property (IP) laws is likely to shrink the pool of available data for AI training (3). This presents a challenge, as varied and vast data sources are crucial for robust model development. Furthermore, the issue of plagiarism and source citation in AI raises complex ethical and legal debates. How does an AI cite its sources? Does it have to? Should it? If humans can see, read, and learn from these materials, why can’t AI? These are all debates for a different article, but they have a direct impact on the quality of models.

Now some may argue that the use of synthetic data (AI-generated, code-generated, etc.) in training is not the same (4). That is debatable: some AI researchers frown upon the practice, while others see it as a way to remove bias (5). In any case, synthetic data often serves as a stopgap in situations where real-world data is scarce.
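A quick sketch of why it’s only a stopgap (a hypothetical toy example, not any particular vendor’s pipeline): padding a scarce real dataset with jittered copies of its own points adds volume, but every “new” sample is still derived from the originals, so it contributes little genuinely new information.

```python
import random

def augment_with_synthetic(real, target_size, noise, rng):
    # Pad scarce real-world data with synthetic samples: each one is a
    # randomly chosen real point plus a little Gaussian jitter.
    synthetic = []
    while len(real) + len(synthetic) < target_size:
        base = rng.choice(real)
        synthetic.append(base + rng.gauss(0.0, noise))
    return real + synthetic

rng = random.Random(0)
real = [rng.uniform(0.0, 10.0) for _ in range(20)]  # only 20 real samples
data = augment_with_synthetic(real, 200, noise=0.1, rng=rng)

# The dataset is 10x larger, but every synthetic point sits within a
# hair's breadth of some real one; no new regions of the space appear.
print(len(real), len(data))
```

The padded set looks bigger to the training loop, but its coverage of the underlying space is still bounded by the original twenty points — which is exactly the echo-chamber risk the article describes.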

Above all, the predominance of AI-generated content raises countless existential questions. If knowledge is almost all generated by a few AI models, what is the impact? Will AI’s omnipresence drown out diverse human perspectives? Could an over-reliance on AI homogenize knowledge and stifle human creativity? These concerns underscore the potential for societal groupthink and the devaluation of diverse knowledge.

It’s imperative to prioritize human-generated content and regulate the use of AI-generated materials in AI model training. If AI comes to dominate internet content, the quality and diversity of training data could suffer (6). For users and AI-based businesses, the well will become poisoned. On top of that, addressing legal barriers to accessing protected IP content is crucial to maintaining data variety.

AI companies are progressing rapidly, often outpacing regulatory efforts. This is a rapidly developing global market, so it’s vital to establish international “guard rails” for AI development. Different nations hold varying perspectives on regulation, making it imperative to push for global consensus swiftly.

The diversity in human thought and perspective has been a cornerstone of our survival and progress. We are genetically wired to have sometimes very different ways of looking at the world around us. That variety has helped us survive near extinctions several times (7). As we integrate AI into our lives, it’s crucial to ensure that it doesn’t create echo chambers that amplify our biases or limit our perspectives. The real challenge lies in harnessing AI’s potential while safeguarding the rich tapestry of human knowledge and creativity.

Want to read more? As always, check out the links below for the references I used in this article.

  1. “Experts Say That Soon, Almost the Entire Internet Could Be Generated by AI” — https://futurism.com/the-byte/ai-internet-generation
  2. “Overfitting” — https://en.wikipedia.org/wiki/Overfitting
  3. “Generative AI Has an Intellectual Property Problem” — https://hbr.org/2023/04/generative-ai-has-an-intellectual-property-problem
  4. “The Pros and Cons of Using Synthetic Data for Training AI” — https://www.forbes.com/sites/forbestechcouncil/2023/11/20/the-pros-and-cons-of-using-synthetic-data-for-training-ai/?sh=71425e5a10cd
  5. “Synthetic Imagery Sets New Bar in AI Training Efficiency” — https://news.mit.edu/2023/synthetic-imagery-sets-new-bar-ai-training-efficiency-1120
  6. “Stanford Scientists Find That Yes, ChatGPT Is Getting Stupider” — https://futurism.com/the-byte/stanford-chatgpt-getting-dumber
  7. “Close Calls: Three Times When Humanity Barely Escaped Extinction” — https://gizmodo.com/close-calls-three-times-when-the-human-race-barely-esc-1730998797
