Mind Cast

Beyond Correctness: A Typology of Factual Integrity in Large Language Model Training Data

Adrian Season 2 Episode 21


This report provides a comprehensive analysis of the factual correctness of the training data available to Large Language Models (LLMs). The analysis was prompted by a query rooted in a specific premise: that high-trust encyclopedic sources such as Encyclopædia Britannica are "less than 20% correct" and that Wikipedia is similarly flawed, a premise that raises a broader question about the factual integrity of all available training data.

The primary finding of this report is that this foundational premise is demonstrably false and, in fact, inverts the core problem. Decades of research, including a seminal 2005 Nature study and more recent (2019) academic analyses, confirm that while not infallible, Encyclopædia Britannica and modern Wikipedia are high-trust, high-factuality sources. Modern studies describe Wikipedia's accuracy in specialized topics as "very high" and "on par with professional sources".

The true data integrity crisis in AI is not that these gold-standard sources are wrong, but that the vast majority of LLMs are not trained on this high-trust, curated data. Instead, they are trained on data produced by a "curation cascade" that begins with the raw, unaudited public web. This "Known World" of data, primarily derived from the Common Crawl dataset, is orders of magnitude less reliable than any encyclopedia.
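To make the idea of a curation cascade concrete, the sketch below shows the kind of cheap heuristic filtering typically applied to raw web text. The specific rules, thresholds, and names are illustrative assumptions, not any lab's actual pipeline; the key point is that none of these checks measure factual accuracy.

```python
# A minimal sketch of heuristic web-text filtering, in the spirit of the
# "curation cascade" applied to Common Crawl-derived corpora. Rules and
# thresholds here are illustrative assumptions, not a production pipeline.
from dataclasses import dataclass

@dataclass
class Document:
    url: str
    text: str

# Phrases that usually signal navigation or legal boilerplate rather than prose.
BOILERPLATE_MARKERS = ("cookie policy", "all rights reserved", "click here to subscribe")

def passes_heuristics(doc: Document,
                      min_words: int = 50,
                      max_symbol_ratio: float = 0.1) -> bool:
    """Cheap syntactic checks; note that none of them measure factual accuracy."""
    words = doc.text.split()
    if len(words) < min_words:
        return False  # too short to be useful training prose
    symbols = sum(1 for ch in doc.text
                  if not (ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"-"))
    if symbols / max(len(doc.text), 1) > max_symbol_ratio:
        return False  # likely markup debris or raw code fragments
    lowered = doc.text.lower()
    if any(marker in lowered for marker in BOILERPLATE_MARKERS):
        return False  # navigation or legal boilerplate
    return True

def curate(raw_docs: list[Document]) -> list[Document]:
    """One stage of the cascade: keep documents that merely look like clean prose."""
    return [d for d in raw_docs if passes_heuristics(d)]

if __name__ == "__main__":
    sample = [
        Document("https://example.com/a", "plain readable sentence " * 20),
        Document("https://example.com/b", "Click here to subscribe! " * 30),
    ]
    print(f"kept {len(curate(sample))} of {len(sample)} documents")
```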

This report finds that a single, quantitative answer to the question of how much of this web-scale data is factually correct is not just unknown but unknowable. The sheer scale of petabytes of data makes manual review "financially and logistically impossible". Furthermore, any attempt at automated auditing falls into a deep methodological circularity, as current methods rely on "stronger" LLMs (like GPT-4) to fact-check the very data they were trained on, creating a closed system with no external ground truth.
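That circularity can be illustrated with a short sketch. The `query_stronger_llm` judge below is a hypothetical placeholder rather than a real API call; the structure shows why agreement from such a judge cannot serve as external ground truth when the judge was itself trained on the same web-scale data being audited.

```python
# A minimal sketch of an "LLM-as-judge" audit loop. `query_stronger_llm` is a
# hypothetical placeholder, not a real API: the point is the circularity, since
# the judge's notion of "true" is itself learned from the same uncurated web,
# so high agreement does not establish external ground truth.
from typing import Callable

def audit_corpus(claims: list[str],
                 query_stronger_llm: Callable[[str], bool]) -> float:
    """Fraction of sampled claims the judge model labels as factual."""
    verdicts = [query_stronger_llm(claim) for claim in claims]
    return sum(verdicts) / max(len(verdicts), 1)

def toy_judge(claim: str) -> bool:
    # Stand-in for a call to a "stronger" model such as GPT-4. In practice its
    # verdicts close the loop: a model grading the kind of data it learned from.
    return "flat" not in claim.lower()

if __name__ == "__main__":
    sampled_claims = [
        "Water boils at 100 degrees Celsius at sea level.",
        "The Earth is flat.",
    ]
    print(f"judged-factual rate: {audit_corpus(sampled_claims, toy_judge):.0%}")
```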