QuranLM's Podcast

S1 E2: Faithful Code

QuranLM Season 1 Episode 2

Link: https://quranlm.com/

Mission & Core Tension

Welcome back to the QuranLM podcast, where we pursue the mission of "Bridging Divine Wisdom with Modern Intelligence." Our focus in this episode shifts to the critical tension between a probabilistic machine's confident tone and its capacity for complete factual error—especially when the subject is the Holy Quran.

We explicitly intend to probe the limits of Large Language Models (LLMs) and determine what it would take to make them truly trustworthy for sacred texts and Islamic scholarship.

Methodology: Stress Testing Modern Intelligence

We explore the text not by seeking structural patterns, but by stress-testing advanced AI models against the precision and complexity required by Divine Revelation. This process aims to uncover the fragility of linguistic models and the ethical imperative of safety.

Each episode tackles a deep problem in AI application, analyzing diverse linguistic, legal, and theological challenges. Dive into discussions on profound concepts, such as:

Part 1: The Language & Logic Breakpoint (Where AI Breaks)

  • Classical Arabic Nuance: We examine how Classical Arabic preserves morphological nuance that most models fail to capture. We detail how even small changes, like removing diacritics, fundamentally alter meaning and degrade performance.
  • Stress Testing Inheritance Law: We push frontier LLMs to their limit using complex, conditional reasoning within Islamic inheritance law. This process exposes brittle chains of logic and the danger of authoritative-sounding hallucinations that cite verses that do not exist.

Part 2: The Ethical Imperative of Safety (How to Fix It)

  • Ethical Alignment & Abstention: Using the Islam Trust benchmark, we explore the need for ethical alignment across Sunni schools of thought. We highlight abstention as a virtue—the model knowing when to say "I don't know"—and how current systems falter under ambiguity.
  • Grounded Retrieval (RAG): We demonstrate why Retrieval-Augmented Generation (RAG) is essential. By chunking at the verse level and constraining answers to verified passages, RAG “chains the model to the truth,” sharply reducing fabrication and invented doctrine.

Core Takeaway & Foundation

The core takeaway is simple and hard: LLMs must move from coherence to faithfulness. LLMs are pattern matchers, not authorities. They need curated data, ethical benchmarks, abstention policies, and grounded retrieval to serve scholarship instead of inventing doctrine.

We situate this work in centuries of human scholarship, drawing from historical context, manuscripts, and variant readings that continue to guide responsible system design for tools like QuranLM.

⚠️ Disclaimer: This discussion is not an endorsement of specific theological interpretations. We offer a critical, data-driven analysis for spiritual empowerment, safe research, and responsible technology development.

If this conversation resonates, follow the show, share it with a friend, and leave a review with your thoughts on where AI should draw the line.

Setting The Stakes

SPEAKER_02

Welcome to the deep dive. Our mission today is, well, it's highly specialized, and the stakes are incredibly high.

SPEAKER_00

They really are.

SPEAKER_02

We're looking at the intersection of cutting-edge AI, specifically large language models, and one of the world's most complex and sacred texts, the Holy Quran.

SPEAKER_00

We've got a fascinating stack of material here for you. We're bridging computer science benchmarks, deep linguistic studies of classical Arabic, and some really crucial research into ethical AI alignment.

SPEAKER_02

And the core question is: can a machine that works on probability really handle a text where there is absolutely no tolerance for error?

SPEAKER_00

Exactly, where the sources tell us even a single incorrect diacritic can entirely alter the meaning.

SPEAKER_02

That's the tension right there. So we're going to be diving into things like inheritance law, ethical guidance, and finding out what happens when these models hallucinate religious doctrine.

SPEAKER_00

There are some major aha moments in the research.

SPEAKER_02

Okay, let's unpack this. So before we even get to the AI, we have to talk about the language itself.

SPEAKER_00

Yes, you have to.

SPEAKER_02

The Quran is written in Classical Arabic, al-Arabiyyah al-Fusha. Now, for anyone not familiar, what makes this language such a huge challenge for modern NLP models?

SPEAKER_00

Well, it really comes down to its nuance and its structure. Classical Arabic is, um, profoundly complex. It preserves features that have mostly vanished from modern speech.

SPEAKER_02

Like the grammatical cases.

SPEAKER_00

Exactly. It has three full grammatical cases and declension, a system known as i'rab. For centuries, human scholars have relied on, you know, deep study to understand this.

SPEAKER_02

And that's the barrier for the machine, isn't it? It can't just absorb that context.

SPEAKER_00

Aaron Ross Powell It can't. The traditional grammar resources, they're not structured for a computer. They're written in prose.

SPEAKER_01

They assume you already know the rules.

SPEAKER_00

Right. They assume this deep intuitive understanding. For a computer, that means every word is potentially ambiguous. Is it the subject? The object? It just can't tell.

SPEAKER_02

That sounds like a complete non-starter for a computer scientist. So did they have to build specific tools just to handle this?

SPEAKER_00

They absolutely did. It took a massive effort. One of the key projects was the MASAQ dataset.

SPEAKER_02

Morphological and syntactical analysis for the Quran text.

SPEAKER_00

That's the one. Think of it as a kind of Rosetta Stone for Arabic NLP. It has over 131,000 morphological entries and 123,000 syntactic functions.

SPEAKER_02

All designed to translate that traditional i'rab into a format an AI can actually learn from.

SPEAKER_00

And it worked.

SPEAKER_02

And the results were pretty stunning, right? Once they had this structured data.

SPEAKER_00

They really were. When they tested parsing algorithms on MASAQ, one model, a random forest, hit 99.0% accuracy in predicting grammatical roles.

SPEAKER_02

So subject, object, things like that.

SPEAKER_00

Yes. It just shows you that if you feed the machine the right kind of structured data, it can learn the structure.
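To make that concrete, here is a minimal, hypothetical sketch of the kind of experiment described: a random forest trained to map morphological features to syntactic roles. The feature names and labels below are illustrative stand-ins, not the actual MASAQ schema.

```python
# Toy sketch: predict a word's syntactic role from morphological features.
# Features and labels are illustrative, not the real MASAQ annotation scheme.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

samples = [
    ({"pos": "noun", "case": "nominative", "definite": True}, "subject"),
    ({"pos": "noun", "case": "accusative", "definite": False}, "object"),
    ({"pos": "noun", "case": "genitive", "definite": True}, "possessor"),
    ({"pos": "noun", "case": "nominative", "definite": False}, "subject"),
    ({"pos": "noun", "case": "accusative", "definite": True}, "object"),
    # ... the real dataset has over 131,000 annotated entries
]
features, roles = zip(*samples)

vectorizer = DictVectorizer()          # one-hot encodes the categorical features
X = vectorizer.fit_transform(features)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, roles)

# Predict the role of an unseen nominative noun.
print(clf.predict(vectorizer.transform(
    [{"pos": "noun", "case": "nominative", "definite": True}])))
```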

SPEAKER_02

And there's a subtle point in the research about diacritics that I think is so important.

SPEAKER_00

The tiny marks on the letters.

SPEAKER_02

Right. What happens when you take them away?

SPEAKER_00

Well, the studies show that removing those diacritics consistently dropped model performance by about two to three percentage points.

SPEAKER_01

Which might not sound like a lot, but in a domain like this?

SPEAKER_00

It's a huge deal. It proves they aren't optional. They are absolutely essential for the nuanced meaning of the text.
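As an illustration of how such an ablation can be run, here is a minimal sketch that strips Arabic diacritics using their Unicode range. Feeding the stripped text to a model instead of the fully vocalized text is what produces the performance drop described above.

```python
import re

# Arabic short-vowel and related marks (harakat) occupy U+064B..U+0652
# (fathatan through sukun); U+0670 is the superscript (dagger) alif.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def strip_diacritics(text: str) -> str:
    """Reduce fully vocalized text to its bare consonantal skeleton."""
    return DIACRITICS.sub("", text)

# "qul" (say) with vowel marks; the bare skeleton alone is ambiguous.
print(strip_diacritics("قُلْ"))  # -> "قل"
```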

LLMs On Inheritance Law

SPEAKER_02

Okay, so we've seen that very specialized models can do well with custom data. But what about the big general purpose LLMs? How do they perform on a really high-stakes task like Islamic inheritance law?

SPEAKER_00

This is where you see a huge performance gap open up.

SPEAKER_02

And these tests were zero shot, right?

SPEAKER_00

Yes, zero shot with Arabic prompts. Just to clarify for you, that means the model is just using its general knowledge. No special training for this task.

SPEAKER_02

Got it. So what did the numbers look like?

SPEAKER_00

Some of the advanced Western models, like o3 and Gemini, did okay. They were in the low 90s.

SPEAKER_02

But the Arabic focused models.

SPEAKER_00

They struggled. Models like Fanar and ALLaM scored below 50% accuracy overall.

SPEAKER_02

Wow. And I'm guessing the difficulty of the cases made a big difference.

SPEAKER_00

A massive difference. That performance drop was really clear on the advanced inheritance cases. ALLaM, for instance, went from 58% on beginner cases all the way down to just 27.8% on the hard ones.

SPEAKER_02

And what makes those advanced cases so hard for an AI? Is it the sheer number of variables?

SPEAKER_00

It's the complex legal reasoning, the conditional logic. Inheritance law is full of if-then scenarios based on intricate family relationships, and the models just couldn't follow that multi-step logic.
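To see why this trips up pattern matchers, here is a deliberately oversimplified sketch of just two of the fixed shares from Surah An-Nisa (4:12). Real cases chain many more conditions (residuary heirs, blocking rules, proportional adjustment of shares), and it is exactly that multi-step branching the models lose track of. This is an illustration, not an authoritative fiqh calculator.

```python
# Simplified illustration of conditional inheritance shares (cf. 4:12):
# a wife receives 1/4 of the estate if the deceased left no children
# and 1/8 if he did; a husband receives 1/2 or 1/4 respectively.
# Real inheritance law layers many more interacting conditions on top.
from fractions import Fraction

def wife_share(has_children: bool) -> Fraction:
    return Fraction(1, 8) if has_children else Fraction(1, 4)

def husband_share(has_children: bool) -> Fraction:
    return Fraction(1, 4) if has_children else Fraction(1, 2)

print(wife_share(has_children=True))      # 1/8
print(husband_share(has_children=False))  # 1/2
```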

SPEAKER_02

Which brings us to the biggest risk of all: hallucination. What kind of errors did the research find?

SPEAKER_00

We saw some truly shocking examples. It goes way beyond just getting a calculation wrong.

SPEAKER_02

How so?

SPEAKER_00

In one case, the Gemini model fabricated an entire Quranic verse. It just made one up and attributed it to Surat an-Nisa, a verse that does not exist.

SPEAKER_02

Wait, hold on. So it's not just an error, it's an act of authoritative deception. It creates a convincing but completely false piece of religious text.

SPEAKER_00

That's the danger. These convincing but incorrect responses are a serious, serious issue when you're dealing with the sacred text.

Measuring Ethical Alignment

SPEAKER_02

Okay, so it's clearly not just about raw accuracy, it's about whether the AI can align with the ethical framework of the faith.

SPEAKER_00

Exactly. The whole conversation has to shift from quantitative failure to qualitative safety.

SPEAKER_02

And that's where something like the Islam Trust benchmark comes in. Tell us about that.

SPEAKER_00

Islam Trust is a multilingual benchmark built to check whether an LLM's responses align with consensus-based Islamic ethical principles. You know, across the major Sunni schools of thought.

SPEAKER_01

It's asking, does the AI get the core values right?

SPEAKER_00

Yes. And the results were pretty telling. The best model only achieved about 66.5% alignment.

SPEAKER_01

So a third of the time it's misaligned. Why do the researchers think that is?

SPEAKER_00

Two main reasons. First, there's just not enough nuanced Islamic ethical discourse in the general training data. These things are trained on the broad internet, not specialized scholarship.

SPEAKER_01

And the second reason.

SPEAKER_00

When the models face an ambiguous prompt, they tend to default back to a kind of generalized non-Islamic knowledge. They fill the gap with something that sounds logical but is doctrinally wrong.

The Case For Abstention

SPEAKER_02

That's a huge problem. And another study, FIQA, looked at rulings from the four major Sunni schools.

SPEAKER_00

Right, the Maliki, Shafi'i, Hanafi, and Hanbali schools.

SPEAKER_02

Which raises a really important question. Should an AI even try to answer if it's not 100% sure?

SPEAKER_00

And that is the core of it. This is the concept of abstention.

SPEAKER_02

The idea that it should know when to say, I don't know.

SPEAKER_00

Precisely, like a human expert. And in one test, it was fascinating. GPT-4o had the highest raw accuracy, but other models, like Gemini and Fanar, were much better at abstaining.

SPEAKER_02

So they were better at identifying the questions they couldn't answer reliably?

Retrieval Grounding To Prevent Errors

SPEAKER_00

Exactly. And that's a critical safety feature. You want a model that knows its own limits, especially here.
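A minimal sketch of what such an abstention policy might look like in code, assuming some confidence-scoring mechanism exists. `answer_with_confidence` is a hypothetical stand-in for whatever a real system would use (log-probabilities, self-consistency voting, a verifier model).

```python
from typing import Callable, Tuple

def guarded_answer(
    question: str,
    # Hypothetical scorer: returns (answer, confidence in [0, 1]).
    answer_with_confidence: Callable[[str], Tuple[str, float]],
    threshold: float = 0.8,
) -> str:
    """Answer only when confidence clears the threshold; otherwise abstain."""
    answer, confidence = answer_with_confidence(question)
    if confidence < threshold:
        return "I don't know. Please consult a qualified scholar."
    return answer
```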

SPEAKER_02

And interestingly, even though the source questions were in Arabic, all the models did worse when they had to reason in Arabic compared to English.

SPEAKER_00

Yes, that just highlights that the linguistic challenge is still there. They default to their more robust English reasoning capabilities, and then things get lost in translation.

SPEAKER_02

So we have these incredibly powerful models that are also prone to spectacular failure. How do researchers lock them onto the truth? The answer seems to be retrieval-augmented generation.

SPEAKER_00

Yes.

SPEAKER_02

It forces the LLM to ground its answers in a verified knowledge base, right? To stop it from just inventing things.

SPEAKER_00

That's it exactly. And we have a really clear example of the RAG pipeline they used for Quranic question answering.

SPEAKER_02

Okay, let's break that pipeline down for everyone step by step. What's the first step?

SPEAKER_00

It starts with chunking. You have to divide the text into manageable units. For the Quran, the most logical unit is the verse, the ayah.

SPEAKER_02

And they do that by splitting the text at a specific Arabic symbol, right?

SPEAKER_00

Yes, the symbol that marks the end of a verse. Chunking by verse is critical because each ayah is a meaningful, self-contained unit of revelation.
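A minimal sketch of that chunking step, assuming the source text marks verse boundaries with the Arabic end-of-ayah sign (U+06DD) followed by the verse number; real corpora vary in how they encode this.

```python
import re

END_OF_AYAH = "\u06dd"  # ۝ ARABIC END OF AYAH

def chunk_by_verse(surah_text: str) -> list[str]:
    """Split a surah into ayah-sized chunks at the end-of-verse marker.

    Assumes the marker is optionally followed by the verse number
    (\\d matches Arabic-Indic digits in Python's Unicode regex mode).
    """
    parts = re.split(rf"{END_OF_AYAH}\s*\d*\s*", surah_text)
    return [p.strip() for p in parts if p.strip()]
```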

SPEAKER_02

So once you have all these individual verses, what's next?

SPEAKER_00

Next is chunk embedding. Each verse is transformed into a high-dimensional vector. Think of it as a numerical fingerprint or a coordinate in a vast mathematical space.

SPEAKER_02

A 1536-dimensional vector in some cases. And this captures the semantic meaning.

SPEAKER_00

It captures the idea, not just the keywords. So when your query comes in, it's also embedded, and the system can find the verses that are conceptually closest using semantic search.
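Here is a small sketch of that retrieval step using cosine similarity over precomputed vectors. The embedding model itself is assumed; 1536 dimensions matches, for example, some widely used commercial embedding models.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, verse_vecs: list[np.ndarray],
             verses: list[str], k: int = 5) -> list[str]:
    """Return the k verses whose embeddings sit closest to the query."""
    scores = [cosine_similarity(query_vec, v) for v in verse_vecs]
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [verses[i] for i in top]
```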

SPEAKER_02

And then the final step is the LLM.

SPEAKER_00

Yes, the LLM performs the final refinement. It generates the answer. But, and this is the crucial part, it is only allowed to use the verified verses that were retrieved.

SPEAKER_02

It's chained to the truth.

SPEAKER_00

It cannot go outside that context. So it's physically prevented from making up verses or fabricating doctrine.
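A minimal sketch of that constraint in practice: the grounding is enforced at the prompt level by confining the model to the retrieved verses and instructing it to abstain otherwise. The wording is illustrative, not the exact template of any system discussed here.

```python
def build_grounded_prompt(question: str, retrieved_verses: list[str]) -> str:
    """Assemble a prompt that confines the model to the retrieved verses."""
    context = "\n".join(f"[{i + 1}] {v}" for i, v in enumerate(retrieved_verses))
    return (
        "Answer using ONLY the verses below, citing their bracketed numbers.\n"
        "If the verses do not contain the answer, reply exactly: I don't know.\n\n"
        f"Verses:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```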

SPEAKER_02

And we saw the same principle in the E-Men framework for other texts, like Sahih al-Bukhari. Grounding the output just dramatically reduces hallucinations.

Manuscripts, Variants, And Context

SPEAKER_00

It's the only responsible way to use these models in such a sensitive context.

SPEAKER_02

You know, it's just fascinating to see computer science grappling with a text that scholars have analyzed for over a thousand years.

SPEAKER_00

And the human-led scholarly projects are still so vital, like the Corpus Coranicum.

SPEAKER_02

Right, a massive digital research project. What was its main goal?

SPEAKER_00

Its goal was to document the Quran's entire history and transmission. They cataloged early manuscripts through the Manuscripta Coranica project, and even used carbon dating on more than 40 ancient documents.

SPEAKER_02

What other kinds of information were they collecting?

SPEAKER_00

They documented the Variae Lectiones Coranicae, the variant readings that developed because early Arabic script often had few or no diacritical marks.

SPEAKER_02

The same marks that trip up the AI.

SPEAKER_00

The very same. And crucially, the project places the Quran in its historical context: the world of the Byzantine and Persian empires, early Christianity, and Rabbinic Judaism.

SPEAKER_02

Before we wrap up, let's touch on the structure of the text itself. The data showed this really interesting diversity in chapter length.

SPEAKER_00

It did. The Quran has 114 chapters, or surahs, and they vary wildly. Surah 2, Al-Baqarah, is the longest. It's huge, with 286 verses.

SPEAKER_02

Almost 7,000 words.

SPEAKER_00

Right. And then you have Surah 108, Al-Kawthar, which is one of the shortest at only about 10 words.

SPEAKER_02

And that reflects a dual purpose, doesn't it?

SPEAKER_00

It does. It shows a dual approach. You have the long chapters for detailed narratives and legal discussions, and then you have these short, powerful chapters for concise theological reminders. And AI has to be able to handle both.

SPEAKER_02

So this deep dive has really shown us this intense, necessary struggle.

SPEAKER_00

I think so.

SPEAKER_02

It's the push to make the power of modern AI accountable to the precision and the ethics of a sacred text. The challenge isn't just generating coherent text, it's about generating faithful text.

SPEAKER_00

And I think the core lesson is that in these high-stakes domains, you have to respect the limits of your tools. LLMs are probabilistic machines, they're pattern matchers, they are not knowledge-grounded reasoners.

SPEAKER_02

So you need those RAG frameworks, you need those ethical benchmarks.

SPEAKER_00

You need them to make sure these models stay as tools for scholarship, not become sources of synthetic doctrine.

From Coherence To Faithfulness

SPEAKER_02

It all comes back to this concept of trust and responsibility that humans carry. And that's our final thought for you to explore. If humanity was given this trust, this responsibility to be stewards, what does it mean when we delegate interpretation of that very responsibility to a machine?

SPEAKER_00

Especially a machine that has to be constantly engineered and reminded to be faithful and to know when it should just say, I don't know. Who really holds that trust? The responsibility has to fall back on the human scholar who curates the data that guides the machine. We hope this has sparked your curiosity and encouraged you to explore this intersection further.
