Heliox: Where Evidence Meets Empathy 🇨🇦

Disclosure: This podcast uses AI-generated synthetic voices for a material portion of the audio content, in line with Apple Podcasts guidelines.

We make rigorous science accessible, accurate, and unforgettable.

Produced by Michelle Bruecker and Scott Bleackley, it features reviews of emerging research and ideas from leading thinkers, curated under our creative direction with AI assistance for voice, imagery, and composition. Synthetic voices and illustrative images of people are representative tools, not depictions of specific individuals.

We dive deep into peer-reviewed research, pre-prints, and major scientific works—then bring them to life through the stories of the researchers themselves. Complex ideas become clear. Obscure discoveries become conversation starters. And you walk away understanding not just what scientists discovered, but why it matters and how they got there.

Independent, moderated, timely, deep, gentle, clinical, global, and community conversations about things that matter. Breathe Easy, we go deep and lightly surface the big ideas.

🧬 AlphaGenome: Predicting Variant Effects on Gene Regulation

July 09, 2025 • by SC Zoomers • Season 4 • Episode 70

See the accompanying Substack episode.

We've been reading our genetic code wrong this whole time.

For decades, scientists focused on the 2% of our DNA that codes for proteins—the obvious stuff, the genes that make the building blocks of life. We treated the other 98% like genetic junk mail, regulatory noise that didn't really matter. We called it "junk DNA" with the casual dismissiveness of people who think they understand something because they can label it.

Turns out, that 98% is where the real conversation is happening.

Google's DeepMind just released AlphaGenome, an AI system that can read the regulatory language hidden in our non-coding DNA with unprecedented accuracy. It's not just another incremental improvement in genetic analysis—it's a fundamental shift in how we understand the blueprint of life itself.

AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model

This is Heliox: Where Evidence Meets Empathy

About SCZoomers:

https://www.facebook.com/groups/1632045180447285
https://x.com/SCZoomers
https://mstdn.ca/@SCZoomers
https://bsky.app/profile/safety.bsky.app

Spoken word, short and sweet, with rhythm and a catchy beat.
http://tinyurl.com/stonefolksongs

Curated, independent, moderated, timely, deep, gentle, evidence-based, clinical & community information regarding COVID-19. Active since 2017, it has focused on COVID-19 since February 2020, with multiple stories per day, hence a large searchable base of stories to date. More than 4000 stories on COVID-19 alone. Hundreds of stories on Climate Change.

Zoomers of the Sunshine Coast is a news organization with the advantages of deeply rooted connections within our local community, combined with a provincial, national and global following and exposure. In written form, audio, and video, we provide evidence-based and referenced stories interspersed with curated commentary, satire and humour. We reference where our stories come from and who wrote, published, and even inspired them. Using a social media platform means we have a much higher degree of interaction with our readers than conventional media and provides a significant amplification effect, positively. We expect the same courtesy of other media referencing our stories.

Imagine for a moment you're trying to quickly grasp, well, really complex genetic information. Not just, you know, what a gene does, but how it's controlled by every little bit of DNA. Even the bits that don't code for proteins, the vast majority of it. It's like trying to decode this deep, hidden language in our genome: its regulatory code. Well, today on the deep dive, we're going to get you up to speed on a really groundbreaking development that's trying to do just that. It's called AlphaGenome, a new deep learning model from Google DeepMind. Yeah. And our mission today really is to unpack how AlphaGenome tackles this huge biological puzzle. We'll look at what makes it different, its capabilities, and importantly, what it means for understanding genetic variation and disease.
Okay, so to really appreciate what AlphaGenome does, I think we need to start with the scale of the problem. What's the core challenge in interpreting our DNA sequence today? Right, good place to start. So for a long time, we mostly thought about genes, right? The parts of DNA that code for proteins. But the thing is, research now shows that over 98% of the genetic differences we see between people are actually in the non-coding regions, the so-called dark matter of the genome. And these non-coding variants, they're really tough to figure out, aren't they? Why is that? Well, it's because they don't directly change the protein sequence. Instead, their effects can be subtle or sometimes quite dramatic. They can change how genes get turned on or off. The studies show they impact things like chromatin accessibility, basically how tightly packed the DNA is, or they affect epigenetic marks, those chemical tags on DNA, even the 3D folding of DNA, or how much messenger RNA gets made or how that RNA gets spliced together. It's incredibly complex. And trying to map all these effects across the whole genome without powerful computers? It's just impossible, right? Intractable.
So that's where these sequence-to-function models come in. What are they trying to do? Exactly. These models try to predict what we call genome tracks. Think of them as detailed maps. They assign a value, like an activity level or a specific feature, to pretty much every single base pair in the genome, based just on the DNA sequence itself. It's like predicting experimental results computationally.
Okay. But the models we had before AlphaGenome, they had some major limitations, some trade-offs you had to make. That's a key point, yeah. The first big one was this tension between seeing the big picture, the long-range context, and seeing the fine details. So you had models like Enformer or Borzoi. They could look at, say, 500,000 base pairs, a pretty long stretch, but they had to average things out, maybe into chunks of 32 or 128 base pairs. So you'd lose the really fine-grained stuff, like specific splice sites. Right, those tiny signals get blurred. Exactly. And then on the other hand, you had models like SpliceAI. Great at base-pair resolution, really sharp detail, but they could only look at maybe 10,000 base pairs. So they'd completely miss how some distant element way down the DNA strand might be influencing things. You couldn't have both.
So long-range view versus high-res detail. What was the other trade-off? That was the generalist versus specialist problem. A lot of models became really good at one specific thing. Like, ChromBPNet was great for chromatin accessibility. But it didn't tell you much about splicing or gene expression or anything else. Right. It missed the bigger picture. There were multimodal models trying to do more, but often they just weren't as good as the specialists on specific tasks. So again, a compromise.
Okay, so faced with these challenges, can't see far and fine, can't be good at everything, how does AlphaGenome break through? That's the big question. Right. And AlphaGenome tackles these head on.
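
To make the resolution trade-off described above concrete, here is a minimal sketch of what averaging a per-base track into 32 or 128 base pair bins does to a sharp, single-base signal. The track values, its length, and the helper name `bin_track` are invented for illustration; this is not code from any of the models mentioned.

```python
# Toy illustration: averaging a base-pair-resolution track into fixed-width bins.
# All values are made up for the example.

def bin_track(track, bin_size):
    """Average consecutive, non-overlapping windows of `bin_size` positions."""
    return [
        sum(track[i:i + bin_size]) / bin_size
        for i in range(0, len(track) - bin_size + 1, bin_size)
    ]

# A 256 bp track that is zero everywhere except one sharp, single-base signal,
# standing in for something like a splice site.
track = [0.0] * 256
track[100] = 1.0

binned_32 = bin_track(track, 32)
binned_128 = bin_track(track, 128)

print(max(track))       # 1.0
print(max(binned_32))   # 0.03125
print(max(binned_128))  # 0.0078125

# The bin tells you roughly where the signal is (positions 96-127),
# but no longer which base it sits on.
print(binned_32.index(max(binned_32)))  # 3
```

The signal is still there after binning, but it is diluted by the bin width and its exact position is lost, which is the blurring the hosts describe.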
First, the input size is just massive. It takes in one megabase of DNA sequence. That's a million base pairs. And that length is super important because studies show like 99% of known enhancer-gene links fall within that one-megabase distance. It captures those long-range connections. A million base pairs. Yeah. That's way more than before. Huge leap. And simultaneously, it predicts thousands of these genome tracks. We're talking 5,930 in humans, over 1,000 in mice. And here's the kicker. It does this all the way down to single-base-pair resolution across 11 different types of data: gene expression, detailed splicing, chromatin state, accessibility, histone marks, even 3D contacts.
Wow. So it's aiming to be both the wide-angle lens and the microscope. How does it actually stack up against those specialized models then? That's what's so impressive. In the evaluations, looking at predicting the effects of genetic variants, AlphaGenome matched or actually beat the best specialized models in 24 out of 26 tests. So it manages to be this broad generalist and perform like a top-tier specialist across many different tasks. That really does sound like having your cake and eating it too. So it's not just a claim. The data seems to back it up. Yeah, the evaluations are pretty strong. It suggests we might not have to make those trade-offs anymore. We can get both the breadth and the depth.
Incredible. And how does it technically manage all that data? A million base pairs is a lot to process. Well, the architecture is clever. It's based on something called a U-Net. It uses convolutional layers for picking up local patterns, like motifs right near a gene. And then it uses transformer blocks, yeah, similar tech to what powers large language models, to capture those really long-distance dependencies across the megabase sequence. Yeah. And to make it run efficiently, they use techniques like sequence parallelism on specialized hardware, basically splitting up the work smartly. Okay.
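
As a rough illustration of what "convolutional layers picking up local patterns like motifs" means mechanically, the sketch below one-hot encodes a short DNA string and slides a single hand-written filter along it. Everything here is an assumption made for the example: the `TATA` filter, the toy sequence, and the function names. A trained model learns many such filters from data, and this says nothing about AlphaGenome's actual layers.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_scan(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence; return one score per offset."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores

# Hypothetical hand-written filter that responds most strongly to "TATA".
tata_kernel = one_hot("TATA")

seq = "GGCTATAAGC"
scores = conv1d_scan(one_hot(seq), tata_kernel)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores[best])  # 3 4.0
```

A filter like this only sees a few bases at a time, which is why the architecture described in the episode pairs convolutions with transformer blocks for the long-range part.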
The specs are impressive. But let's talk about what it can actually do. Let's dig into some specific capabilities. Right. So first off, just predicting those genome tracks. Things like RNA-seq coverage, showing how much a gene is being read, or ATAC-seq, showing open DNA regions, or different histone modifications. AlphaGenome predicts these tracks across that whole one-megabase region with really high fidelity, matching the actual experimental data remarkably well. You can sort of visually see the predicted peaks lining up with the real ones.
Okay, so accurate mapping. What about splicing? You mentioned that was a challenge before. Yeah, this is where it's really groundbreaking. AlphaGenome is apparently the first system that predicts all three key levels of splicing. One, just predicting the sites where splicing happens, where introns get cut out. Two, predicting splice site usage, because often there are multiple choices. It predicts which ones are actually used. And three, predicting the specific junctions formed when exons are joined together. It gives this complete picture of splicing outcomes.
That sounds incredibly detailed. Do we have an example of that? Something concrete? Absolutely. There's a really striking example in the research involving a tiny deletion, just four base pairs, found in a patient sample. This variant, chr3:197081044:TACTC>T, was known from lab experiments to cause exon skipping in artery tissue. Basically, a whole section of the gene gets left out. AlphaGenome, just from the sequence, predicted this. It saw the usage of that exon go down, it predicted the loss of the connections, the junctions to that exon, and it predicted the appearance of a new bypass junction that skips right over it. Predicting that whole complex rearrangement is pretty amazing. There's another cool case, too, chr21:46126238:G>C, where it correctly predicted a totally new splice junction and a longer exon being created. Wow.
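
The variant identifiers read out above appear to follow the common `chrom:position:REF>ALT` convention. Assuming that reading, the sketch below parses such an identifier, checks that the chr3 variant is indeed a four base pair deletion, and applies it to a reference window. The flanking sequence is made up purely to show the mechanics, and the helper names are hypothetical.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based coordinate of the first REF base
    ref: str
    alt: str

def parse_variant(text):
    """Parse 'chrom:pos:REF>ALT' into a Variant."""
    chrom, pos, alleles = text.split(":")
    ref, alt = alleles.split(">")
    return Variant(chrom, int(pos), ref.upper(), alt.upper())

def classify(v):
    """Describe the variant by comparing REF and ALT lengths."""
    if len(v.ref) == len(v.alt):
        return "substitution"
    kind = "deletion" if len(v.ref) > len(v.alt) else "insertion"
    return f"{abs(len(v.ref) - len(v.alt))} bp {kind}"

def apply_variant(window, window_start, v):
    """Return `window` with the variant applied.

    `window_start` is the 1-based genomic coordinate of window[0].
    """
    offset = v.pos - window_start
    if window[offset:offset + len(v.ref)] != v.ref:
        raise ValueError("REF allele does not match the reference window")
    return window[:offset] + v.alt + window[offset + len(v.ref):]

v = parse_variant("chr3:197081044:TACTC>T")
print(classify(v))  # 4 bp deletion

# Made-up flanking bases around the REF allele, only to show the mechanics.
window = "GGA" + "TACTC" + "AAG"
print(apply_variant(window, v.pos - 3, v))  # GGATAAG
```

A sequence model would then be run on both the reference window and the edited window, and the two sets of predictions compared.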
And there's this tool within AlphaGenome, ISM, in silico mutagenesis. What does that let researchers do? ISM is super powerful. It's like doing millions of tiny genetic experiments inside the computer. You basically tell the model, "Okay, for this specific region of DNA, let's virtually change every single letter one by one and see what happens. See how each tiny change affects all those thousands of predicted tracks: expression, splicing, accessibility, everything." So, for example, when they did this around the U2SURP gene, which is involved in splicing, the ISM results clearly highlighted the known sequence patterns, the motifs, that control its splicing. It helps reveal the underlying rules.
Okay, so it's great for splicing. What about gene expression more broadly? How does it help understand how non-coding variants turn genes up or down? Right. Another key area is predicting the effects of eQTLs. Those are variants that affect gene expression levels. Compared to the previous best models, like Borzoi, AlphaGenome showed significantly better prediction of not just whether an eQTL affects expression, but by how much and in which direction, up or down. The numbers are pretty telling. It essentially doubled the recovery of known eQTLs from a big database called GTEx. It went from about 19% recovery with Borzoi up to 41% with AlphaGenome.
That is a big jump. Does that improvement apply across the board, or are there certain types of eQTLs where it does particularly well? Well, the improvement seems quite general, but its strength really lies in capturing effects that depend on that longer context. So variants whose effects involve interactions with distant regulatory elements, those are the ones where having that one-megabase view really pays off. Okay, and connecting this to disease research? It's especially good for those rarer variants, the ones with low minor allele frequency.
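
The "change every letter one by one" procedure can be sketched in a few lines. In the sketch below, `toy_model` is a purely hypothetical stand-in that counts a made-up motif; it is there only so the loop has something to score. With a real sequence-to-function model, each call would return thousands of tracks rather than one number.

```python
BASES = "ACGT"

def toy_model(seq):
    """Hypothetical stand-in for a sequence model: counts a made-up motif."""
    motif = "GATA"
    hits = sum(
        seq[i:i + len(motif)] == motif
        for i in range(len(seq) - len(motif) + 1)
    )
    return float(hits)

def in_silico_mutagenesis(seq, model):
    """For every position and every alternative base, record the change in
    the model's output relative to the unmodified sequence."""
    ref_score = model(seq)
    effects = []
    for pos, ref_base in enumerate(seq):
        row = {}
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            row[alt] = model(mutant) - ref_score
        effects.append(row)
    return effects

seq = "CCGATACC"
effects = in_silico_mutagenesis(seq, toy_model)

# Per-position importance: the largest absolute effect of any single-base change.
importance = [max(abs(delta) for delta in row.values()) for row in effects]
print(importance)  # [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

The positions with non-zero importance trace out the motif the toy model depends on, which is the same logic by which ISM highlights known splicing motifs in the example from the episode.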
It could resolve potential mechanisms in about four times more credible sets involving those variants compared to older approaches. So it helps pinpoint potentially causal variants and suggest how they might be acting? Exactly. And it's also state-of-the-art for linking enhancers to the genes they regulate. It outperformed Borzoi zero-shot, meaning without even being specifically trained for that task, especially for enhancers that are quite far away, like more than 10,000 base pairs from the gene start. That really highlights its ability to model those long-range interactions.
This all comes together really nicely in the case study you mentioned, the TAL1 oncogene. That gene's linked to a type of leukemia, right? Yes, T-cell acute lymphoblastic leukemia. This example really showcases the power of AlphaGenome's unified multimodal approach. They used it to basically screen for mutations that might ramp up TAL1 activity, gain-of-function mutations. And for one specific mutation known to be oncogenic, chr1:47239296:CCG, AlphaGenome predicted a whole cascade of changes that matched what scientists had seen in the lab. It predicted increases in activating histone marks, like H3K27ac, think of those as go signals. It predicted decreases in repressive marks, like H3K9me3, the stop signals. And it predicted an increase in a mark associated with active transcription, H3K36me3, all leading to the final prediction, increased TAL1 mRNA levels.
So it's not just predicting the final outcome, it's predicting the intermediate steps, the molecular mechanism. That's powerful. And did they use ISM on that TAL1 variant too? They did. And the ISM analysis revealed something fascinating. It showed that the mutated sequence actually created a new binding site, a motif, for a protein called MYB, and the model predicted that this new MYB site would increase TAL1 expression and make the chromatin more accessible right there.
This was actually the precise mechanism that had been figured out through painstaking lab experiments previously. AlphaGenome found it computationally. That's incredible. Almost like reverse engineering the disease mechanism from sequence alone.
Okay, so we see what it can do. Let's quickly touch on why it works so well. What were the key design choices? Right. The researchers did ablation studies, basically testing different components. First, that base-pair resolution, absolutely critical. Training the model at 1 bp resolution consistently gave the best results, especially for things like splicing and pinpointing accessibility peaks. It's not optional for this level of detail. OK, resolution matters. What about the one-megabase input length? Also crucial. Using the full one-megabase context, both for training the model and for making predictions, gave the best overall performance. Longer context during training is definitely better. Interestingly, even if you train it on one megabase, you can still use shorter context later for faster predictions. And it still performs well, having learned those long-range dependencies.
Got it. And then there was this distillation technique. That sounds interesting. Yeah, this is really clever for efficiency. Instead of using a whole bunch of large models together, like an ensemble, they trained those large teacher models and then distilled their combined knowledge into a single smaller student model. It's like concentrating the wisdom of experts into one highly efficient apprentice. This distilled model performed competitively with, and sometimes even better than, the big ensembles, but was much faster and cheaper to run for predictions. Smart engineering. And the final piece, multimodal learning. Right. Training the model on all those different data types together, expression, splicing, chromatin, etc., generally worked better than training separate models for each type.
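
The teacher-to-student idea can be shown with a deliberately tiny example. The sketch below assumes the simplest possible form of distillation: average the outputs of several teachers and fit one small student to those averages rather than to measured labels. The three linear "teachers", the input values, and the closed-form linear student are all invented for illustration; the episode does not describe the actual training procedure in this detail.

```python
# Three hypothetical "teacher" models, each a slightly different function of a
# single input feature. Real teachers would be large neural networks.
teachers = [
    lambda x: 2.0 * x + 1.0,
    lambda x: 2.2 * x + 0.8,
    lambda x: 1.8 * x + 1.2,
]

def ensemble_predict(x):
    """Average the teachers' predictions (every teacher runs at inference)."""
    return sum(t(x) for t in teachers) / len(teachers)

def fit_linear_student(xs, targets):
    """Closed-form least squares for a small student y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(targets) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, targets))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
# Distillation targets: the ensemble's outputs, not experimentally measured labels.
soft_targets = [ensemble_predict(x) for x in xs]
w, b = fit_linear_student(xs, soft_targets)

def student(x):
    """Single cheap model that stands in for the whole ensemble."""
    return w * x + b

print(round(student(10.0), 6), round(ensemble_predict(10.0), 6))
```

At prediction time only `student` is evaluated, which is where the speed and cost savings mentioned above come from.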
It seems that learning across modalities helps the model find shared patterns, shared representations, giving it a more robust, holistic understanding of how the genome works, like learning a language through reading, listening, and speaking.
Okay, makes sense. So looking ahead, what are the big implications? What does this mean for genomics and health? I mean, the potential is huge. AlphaGenome really pushes forward our ability to decode the genome's regulatory language. It allows for much deeper mechanistic interpretation of non-coding variants, which is critical for understanding the basis of many common diseases. It can power much larger-scale analyses than were feasible before. And it gives us a much clearer view of things like splice disruptions. It's a major step towards understanding the consequences of genetic variation.
But like any new tech, it's not perfect yet. We should probably acknowledge the current limitations, right? Where does it still need work? Absolutely. It's important to be realistic. Capturing effects from really distant elements, say beyond 100 kilobases, is still a challenge. Accurately predicting very specific patterns, like only in one specific cell type, or only under certain conditions, is also still tough. Needs more work. Right now it's only trained for human and mouse. And it hasn't really been tested for predicting outcomes based on your specific personal genome sequence. And importantly, while it's amazing at predicting molecular consequences, changes in RNA, chromatin, etc., it doesn't directly predict complex traits or disease risk itself. Those involve many genes, environment, development, much broader biology. So to really predict how deleterious a variant is for a complex trait, you'd likely need to combine AlphaGenome's predictions with other information, like evolutionary conservation or knowledge about gene function. So it's a powerful piece of the puzzle, but not the whole picture for complex traits. Exactly.
It predicts the molecular impact incredibly well, which is a massive step forward. And the good news for researchers is that it's being made available, right, through an API and SDK for non-commercial use. Yes, which is great. It allows the wider scientific community to start using it, testing it, building on it. That should really accelerate progress.
So to wrap up, AlphaGenome feels like a landmark achievement. It's this single model providing a fundamentally new way to view the genome, connecting huge stretches of DNA to tiny molecular events with unprecedented detail. Yeah, the real aha moment here is moving beyond just finding genetic variants to actually interpreting what they do at a molecular level. That interpretation is absolutely crucial if we want to get closer to personalized medicine and truly understand the blueprint of life. It definitely leaves you thinking: what questions does this kind of capability raise for you, listening now, with the hidden language inside your own DNA? And how might understanding that subtle, complex grammar change how we think about health and disease in the years to come?