Heliox: Where Evidence Meets Empathy πŸ‡¨πŸ‡¦

⚑ How MiniMax M1 Just Rewrote the Rules of AI

β€’ by SC Zoomers β€’ Season 4 β€’ Episode 66

Send us a text

See the corresponding Substack episode

Sometimes the most profound changes happen not with fanfare, but with a whisper that echoes through eternity.

We're living through one of those whisper moments right now, and most people don't even know it happened.

While the tech world obsesses over the latest chatbot drama and which billionaire said what about AI safety, a team of researchers just quietly solved one of the most fundamental problems in artificial intelligence. They didn't announce it with a Super Bowl commercial or a flashy product launch. They simply published a research paper about something called MiniMax M1, and in doing so, they may have just changed everything.

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

This is Heliox: Where Evidence Meets Empathy

Independent, moderated, timely, deep, gentle, clinical, global, and community conversations about things that matter.  Breathe Easy, we go deep and lightly surface the big ideas.

Thanks for listening today!

Four recurring narratives underlie every episode: boundary dissolution, adaptive complexity, embodied knowledge, and quantum-like uncertainty. These aren’t just philosophical musings but frameworks for understanding our modern world. 

We hope you continue exploring our other podcasts, responding to the content, and checking out our related articles on the Heliox Podcast on Substack

Support the show

About SCZoomers:

https://www.facebook.com/groups/1632045180447285
https://x.com/SCZoomers
https://mstdn.ca/@SCZoomers
https://bsky.app/profile/safety.bsky.app


Spoken word, short and sweet, with rhythm and a catchy beat.
http://tinyurl.com/stonefolksongs

Curated, independent, moderated, timely, deep, gentle, evidence-based, clinical & community information regarding COVID-19. Running since 2017 and focused on COVID-19 since February 2020, with multiple stories per day, it has built a large searchable base of stories to date: more than 4,000 stories on COVID-19 alone and hundreds of stories on climate change.

Zoomers of the Sunshine Coast is a news organization with the advantages of deeply rooted connections within our local community, combined with a provincial, national and global following and exposure. In written form, audio, and video, we provide evidence-based and referenced stories interspersed with curated commentary, satire and humour. We reference where our stories come from and who wrote, published, and even inspired them. Using a social media platform means we have a much higher degree of interaction with our readers than conventional media, and it provides a significant positive amplification effect. We expect the same courtesy of other media referencing our stories.


Welcome to the Deep Dive. Today, your curiosity is leading us into the fascinating world of large language models, specifically a groundbreaking new player called Minimax M1. We've taken your sources, the core research paper, and really distilled what you need to know to be truly well-informed about its breakthrough capabilities. That's right. Our mission in this deep dive, well, it's really to explore what makes Minimax M1 unique, how it achieves these really impressive capabilities, and crucially, why its fundamental design choices are so important, you know, from its architecture right down to its training methods, why they matter for the future of AI. We'll be pulling insights directly from the source material, which is a research paper titled Minimax M1: Scaling Test-Time Compute Efficiently with Lightning Attention. And the core promise here is, well, it's pretty groundbreaking. This model is designed to process incredibly long inputs and think extensively, as they put it, which leads to breakthrough performance in complex real-world tasks. Okay, let's unpack this. So let's start with the big picture. These large reasoning models, LRMs, they've shown incredible success, right? We're talking about AI that can tackle increasingly complex problems. They extend their reasoning processes, often using reinforcement learning. But there's a wall they hit. Yeah, a significant scaling wall. It's about test-time compute. Test-time compute. So the resources needed when the model's actually running. Exactly. And while more compute generally means better performance for these complex applications, it just becomes a huge bottleneck really fast. Why is that? What's the technical hurdle? It's fundamental. The traditional transformer architecture uses something called the softmax attention mechanism. And that has a computational complexity that scales quadratically with the input length. Quadratically. Quadratically. Meaning? Meaning as your input gets longer, the cost, the computation needed, just explodes. It goes up by the square. So trying to process really long, detailed information becomes incredibly difficult and expensive. The paper does mention, you know, other ideas people have had: sparse attention, linear attention, state space models. Right. I've heard of some of those. But they haven't really been fully validated in these big competitive reasoning models yet. Most of the top LRMs, they still rely on older designs and they hit that same quadratic wall. Okay, so if those other solutions haven't quite cut it for large-scale reasoning, what's Minimax M1 doing differently? Because this is where it gets really interesting, right? You said it's the world's first open-weight, large-scale hybrid-attention reasoning model. That's quite a claim. It absolutely is. And the core innovation is that unique... hybrid architecture. It combines a hybrid mixture-of-experts, MoE, architecture. MoE. That's where it only uses parts of the model for a specific task. It makes it more efficient. Exactly. It activates only the relevant experts. And it combines that with a novel lightning attention mechanism. Mm-hmm. And this isn't just another linear attention idea. The paper stresses it's an IO-aware implementation. IO-aware, meaning input-output. Right. Optimized for efficiently handling massive amounts of data flowing in and out, which is critical for huge contexts. And their hybrid design is quite smart: a traditional transformer block, the one with softmax attention, comes only after every seven transnormer blocks that use this new lightning attention.
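To make that interleaving concrete, here is a minimal sketch assuming a toy cost model in which softmax attention scales as n² and lightning (linear) attention as n. The 16-block depth, the cost constants, and the function names are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of the hybrid layout described above: one softmax-attention block
# after every seven lightning (linear) attention blocks. The block count and the
# toy cost model are illustrative assumptions, not values from the paper.

def hybrid_layer_pattern(num_blocks, softmax_every=8):
    """Attention type per block: 'softmax' every eighth block, 'lightning' otherwise."""
    return ["softmax" if (i + 1) % softmax_every == 0 else "lightning"
            for i in range(num_blocks)]

def rough_attention_cost(seq_len, pattern):
    """Toy per-layer cost: softmax ~ O(n^2), lightning ~ O(n)."""
    return sum(seq_len ** 2 if kind == "softmax" else seq_len for kind in pattern)

if __name__ == "__main__":
    pattern = hybrid_layer_pattern(16)          # 7 lightning, 1 softmax, repeated
    full_softmax = ["softmax"] * 16             # a conventional all-softmax stack
    for n in (8_000, 100_000, 1_000_000):
        hybrid = rough_attention_cost(n, pattern)
        dense = rough_attention_cost(n, full_softmax)
        print(f"{n:>9} tokens: hybrid/dense cost ratio ≈ {hybrid / dense:.4f}")
```

Even in this crude model, the handful of quadratic blocks dominate the total at long sequence lengths, which is the intuition behind keeping only one softmax block in every eight.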
Okay, kind of mixing the old and the new. What's the impact of that design? The impact is pretty staggering. M1 natively supports an astonishing context length of 1 million tokens. 1 million. Yeah, 1 million tokens. That's eight times the context size of, say, DeepSeek R1. And it's an order of magnitude greater than basically all other open-weight LRMs available today. That capacity is incredible. But what about the efficiency? You mentioned the bottleneck earlier. How does M1 handle that? That's the other key part. It's remarkably efficient. Compared to DeepSeek R1, Minimax M1 consumes only 25% of the FLOPs. FLOPs being floating-point operations, the measure of compute work. Exactly. Only 25% of the FLOPs when generating a response 100,000 tokens long. And it's less than 50% at 64,000 tokens. So much less computational cost for these very long reasoning processes. Precisely. So what does this mean for you? This efficiency makes M1 particularly suited for, as the paper says, complex tasks that require processing long inputs and thinking extensively. It can essentially do much more thinking for significantly less cost. That opens up a lot of possibilities for real-world AI. Okay, that really sets the stage. Let's dive into the blueprint then. How was Minimax M1 actually built? It started from a previous model, right? Minimax Text 01. That's right. Minimax Text 01 was the foundation. Itself a huge model: 456 billion total parameters, though only 45.9 billion are active per token because of that MoE structure. And building M1 involved what? A couple of main stages. A rigorous two-stage preparation. First was continual pre-training. They took Minimax Text 01 and trained it on an additional 7.5 trillion tokens. 7.5 trillion. Trillion. Yeah. And the data was very carefully curated, very reasoning-intensive. They put a lot of effort into data quality: refining web and PDF parsing, advanced cleaning rules, extracting natural question-answer pairs, and crucially, strictly avoiding synthetic data. No fake data generated by other AIs? Exactly. They wanted real-world reasoning patterns, and they boosted the proportion of STEM, code, book, and reasoning-related data to about 70% of the corpus. Okay. Was training on such long contexts tricky? Oh, absolutely. A key challenge they mentioned is preventing gradient explosion. That's when the internal calculations can kind of spiral out of control with very long sequences. They solved it with a smoother extension of context length across four stages, starting from 32,000 tokens and incrementally going up to the full 1 million. They had to carefully manage it, recognizing that different layers in their lightning attention actually have different decay rates. It was quite intricate. Makes sense. And after pre-training... Stage two. Stage two was supervised fine-tuning, or SFT. The main goal here was to explicitly inject certain chain-of-thought patterns. Chain of thought. Teaching it to show its work, basically. Think step by step. Pretty much, yes. To lay down a really strong foundation before they moved into reinforcement learning. The SFT data was high-quality examples with long chain-of-thought responses covering diverse areas, math, coding, STEM, writing, multi-turn chat, with math and coding making up around 60 percent of that specific data set. OK, so pre-training built the knowledge base and SFT taught it structured thinking.
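Here is a minimal sketch of that four-stage context-length extension from the pre-training discussion. The source only gives the endpoints, 32,000 tokens up to 1 million across four stages, so the intermediate window sizes, token budgets, and function names below are hypothetical placeholders, not the schedule used in the paper.

```python
# Minimal sketch of a staged context-length extension: train at a fixed window until the
# stage's token budget is spent, then extend the window. Intermediate sizes and budgets
# are assumptions for illustration only.

STAGES = [
    {"max_seq_len": 32_000,    "token_budget": 3.0e12},  # starting window (from the source)
    {"max_seq_len": 128_000,   "token_budget": 2.0e12},  # assumed intermediate stage
    {"max_seq_len": 512_000,   "token_budget": 1.5e12},  # assumed intermediate stage
    {"max_seq_len": 1_000_000, "token_budget": 1.0e12},  # final window (from the source)
]

def continual_pretrain(train_step, make_loader):
    """train_step(batch) performs one optimizer update; make_loader(max_seq_len) yields
    batches packed to that sequence length. Both are supplied by the caller."""
    for stage_idx, stage in enumerate(STAGES, start=1):
        tokens_seen = 0
        for batch in make_loader(stage["max_seq_len"]):
            train_step(batch)
            tokens_seen += batch["num_tokens"]
            if tokens_seen >= stage["token_budget"]:
                break  # budget spent: move on to the next, longer context window
        print(f"stage {stage_idx}: context window now {stage['max_seq_len']:,} tokens")
```

Holding each stage at a fixed window until training stabilizes is the "smoother extension" the hosts describe as the guard against gradient explosion.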
Which brings us to the reinforcement learning, the RL scaling. This seems like where M1 really learned to reason effectively. What was so intelligent or innovative about their RL approach? This is where a lot of the really unique stuff happened. First, the algorithm itself, the CISPO algorithm. CISPO. What problem did that solve? Well, traditional RL algorithms like PPO or GRPO, they often suffer from something called token clipping. This means crucial but maybe low-probability reflective behaviors can get thrown away. Right. Things like the model generating "however," or "let me recheck that," or "wait," or even an "aha" moment. These are often really important forks in a complex reasoning path. But because they might have low initial probability, standard algorithms could clip them out, essentially ignoring them during updates. So the model wouldn't learn from those critical thinking moments. Exactly. And this was especially problematic for their hybrid architecture. CISPO, which stands for Clipped IS-weight Policy Optimization, its genius is that it avoids dropping these tokens. It does this by clipping the importance sampling weights rather than the token updates themselves. Okay, a bit technical, but the outcome is? The outcome is that all tokens contribute to the learning process. It maintains diversity, keeps the training stable, and the proof is in the pudding. In a controlled study using other models, CISPO got a 2x speedup compared to DAPO, another common algorithm. It matched DAPO's performance using only half the training steps. So faster and just as good. Very cool.
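For readers who want to see the distinction in code, here is a minimal PyTorch-style sketch of that idea: clip the importance-sampling weight and detach it, so every token, including the rare reflective ones, still contributes a gradient. The clipping bound, normalization, and names are assumptions for illustration, not MiniMax's implementation.

```python
# Minimal sketch of the CISPO idea: clip (and stop gradients through) the importance
# sampling weight instead of clipping the per-token update, so no token is dropped.

import torch

def cispo_loss(logp_new, logp_old, advantages, eps_high=2.0):
    """logp_new/logp_old: per-token log-probs under current and behavior policies;
    advantages: per-token advantage estimates. Bound eps_high is an assumed value."""
    is_weight = torch.exp(logp_new - logp_old)                        # importance ratio
    clipped_w = torch.clamp(is_weight, max=1.0 + eps_high).detach()   # upper-clip only
    # REINFORCE-style objective: every token keeps a gradient through logp_new,
    # merely scaled by the (stop-gradient) clipped weight.
    return -(clipped_w * advantages * logp_new).mean()

# For contrast, a PPO/GRPO-style clipped objective zeroes the gradient of tokens whose
# ratio leaves the trust region, which is how rare "wait"/"recheck" tokens get dropped.
```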
But were there other challenges specific to using RL with this new hybrid attention? Yes. They detail a few really interesting, unique hurdles. The first was a computational precision mismatch. This was quite surprising. They found that the probabilities the model assigned to tokens during training were different from when it was just running normally in inference mode. How different? Different enough that it actually stopped the reward signal from growing. The model basically stopped learning from the RL feedback. Yeah. After digging in, they found these... unexpectedly high-magnitude, kind of erratic activations in the final output layer, the LM head. And the fix? Almost counterintuitively simple. They just increased the precision of that LM output head calculation to FP32, a higher-precision format. And boom, the correlation between training and inference probabilities jumped from about 0.9 up to 0.99. It basically fixed the issue. Wow. A subtle numerical issue grinding the whole learning process to a halt. What else? Second was optimizer hyperparameter sensitivity. The AdamW optimizer, which adjusts the learning process, turned out to be extremely sensitive. They saw gradients, the learning signals, ranging wildly from tiny values like 1e-18 up to 1e-5. It required really careful, painstaking tuning of its internal settings, the beta and epsilon values, to get stable training. So a lot of fiddly work behind the scenes. Definitely. And the third major challenge was dealing with repetition. They implemented early truncation via repetition detection. Sometimes during RL, the model could get stuck generating, you know, pathologically long and repetitive responses, just looping. Yeah, I've seen models do that. It destabilizes things. Big time. It messes up the training. So they added a clever rule. If the model generates 3,000 consecutive tokens where each one has a probability above 0.99, basically if it gets really confident and repetitive, they just cut it off early. Halt generation. It's a heuristic, but it kept things stable and maintained throughput. Smart. Okay. So beyond the algorithms and technical fixes, the data used for RL must have been critical too. You mentioned quality and variety. Hugely important. They broke the RL data down into two main types. First, tasks where they could use rule-based verification. Clear right-wrong answers. Like math problems. Exactly. They had nearly 50,000 high-quality mathematical reasoning samples. Competition-level stuff. Meticulously cleaned, filtered for difficulty, and checked to avoid any contamination from benchmark test sets. Then, logical reasoning. About 53,000 samples covering 41 different tasks, like ciphers or Sudoku. They actually synthesized these using their own framework called SynLogic, which could automatically verify the answers and adjust difficulty. Okay, math and logic. What else? Competitive programming. 30,000 data samples. For problems that didn't have good test cases, they used another LLM to generate comprehensive test suites. And then, really key for practical use, software engineering, SWE. Several thousand real-world GitHub issues and pull requests. These were run in a special containerized sandbox environment. Like a safe virtual computer. Exactly. Where the model's generated code could actually be executed against test cases. This provided direct, verifiable feedback. Did the code pass the test or fail? That's incredibly valuable for learning real-world coding. Okay, that covers things with clear rules. What about more subjective tasks? Creative writing, general instructions. That's the second category, general-domain tasks with model-based feedback. Here, they had to rely on other AI models, called generative reward models or GenRMs, to provide feedback. But there's a common challenge with GenRMs. They often develop a length bias. Length bias. They just prefer longer answers. Yeah, they start rewarding length over actual quality or conciseness. Longer isn't always better. M1's unique solution here was pretty neat: continuous online monitoring of length bias during the RL training itself. If they detected the main model starting to just generate longer stuff to chase reward, they'd immediately recalibrate the reward model. And they also used techniques on the RL side, like reward shaping and value clipping, to counteract it. So actively fighting that bias during training, very clever. It sounds like they used a kind of curriculum approach, mixing these different data types. They did. Started primarily with the rule-based tasks to build strong foundational skills, then gradually mixed in the general-domain tasks. This was crucial to prevent catastrophic forgetting, where the model learns new stuff but forgets the specialized skills it learned earlier. It helps foster broader generalization while retaining those core competencies. And they didn't stop there. They pushed the reasoning length even further. Right. They took the model trained to output up to 40,000 tokens, which they call Minimax M1 40K, and then did an extended RL scaling-to-longer-thinking phase to push it to 80,000 tokens, creating Minimax M1 80K. They used a staged window expansion RL strategy, basically carefully and incrementally increasing that output length limit from 40K to 80K based on signs that the model was ready for it.
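As a concrete illustration of the early-truncation rule described above (3,000 consecutive tokens, each above 0.99 probability), here is a minimal sketch. The sample_next_token callable is a hypothetical stand-in for the real decoding loop, not MiniMax's inference code.

```python
# Minimal sketch of the repetition guard: halt generation once 3,000 consecutive tokens
# each arrive with probability above 0.99, a sign of a confident, repetitive loop.

PROB_THRESHOLD = 0.99
CONSECUTIVE_LIMIT = 3_000

def generate_with_repetition_guard(sample_next_token, max_new_tokens=80_000):
    """sample_next_token() is assumed to return (token_id, prob_of_chosen_token)."""
    tokens, confident_run = [], 0
    for _ in range(max_new_tokens):
        token_id, prob = sample_next_token()
        tokens.append(token_id)
        confident_run = confident_run + 1 if prob > PROB_THRESHOLD else 0
        if confident_run >= CONSECUTIVE_LIMIT:
            break  # likely stuck in a repetitive loop; truncate the rollout early
    return tokens
```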
Okay, an incredible amount of work went into building and training this. So the payoff, the results, how did Minimax M1 actually perform? What stands out to you? The results are genuinely impressive, especially where that long context and deep reasoning should matter most. Let's take software engineering. On a tough benchmark called SWE-bench Verified, M1 achieved scores of 55.6% for the 40K version and 56.0% for the 80K version. That significantly beats other open-weight models. You can really see the benefit of that execution-based RL training there. Makes sense. Learning from actual code execution pays off. What about understanding long documents? That's another standout area. In long-context understanding benchmarks, the Minimax M1 models, both 40K and 80K, significantly outperform all other open-weight models. Get this, they even surpass OpenAI's o3 model and Anthropic's Claude 4 Opus. They rank second globally in the benchmarks shown, only behind Google's Gemini 2.5 Pro. That's huge validation for the 1 million token context window. Wow, beating some top closed models there. Any other areas where it really shone? Yes. Agentic tool use. This is about how well the model can use external tools or APIs to solve problems. On the TAU-bench benchmark, the Minimax M1 40K model actually surpassed all other open-weight models and even Gemini 2.5 Pro. Surpassed Gemini 2.5 Pro? On that specific benchmark, yes. Which really showcases its ability to effectively plan and use tools. And importantly, across most benchmarks, the Minimax M1 80K model consistently did better than the 40K model. That unequivocally demonstrates the benefit they were aiming for: scaling test-time compute, allowing longer reasoning, leads to better performance. So the longer thinking time really made a difference. How did it stack up in more standard areas like math or general coding compared to other top models? It's very competitive there, too. In mathematical reasoning and general coding, the paper says M1 models show strong performance, generally comparable to top closed-weight models, though they do note DeepSeek R1's latest version might have a slight edge in some specific math or coding competitions. On factuality, using a benchmark called SimpleQA, M1 outperforms most other open-weight models, although it does underperform DeepSeek R1 there. So very strong overall, exceptional in long context and tool use. Did they show clear evidence that the RL scaling itself was driving improvement? Absolutely. They tracked performance during the RL training phase. They saw consistent improvements in both benchmark scores and the average length of the model's reasoning, its response length. For instance, on tough math problems, AIME, and coding challenges, LiveCodeBench, the average response length grew to exceed 20,000 tokens. 20,000 tokens just for the reasoning chain. Yeah, really long, detailed, step-by-step thinking. And performance improved alongside it. On AIME 2024 problems, accuracy jumped substantially from 68% to 80% during RL scaling. There's a direct correlation shown between allowing longer reasoning and getting better results. It confirms that giving the model more room to think... pays off. Okay, let's try to wrap this up. Summarizing Minimax M1's significance, it feels like a truly groundbreaking open-weight model.
It powerfully demonstrates that this lightning attention, combined with the hybrid MoE architecture and all those innovative RL techniques, can achieve these unprecedented context lengths: 1 million tokens of input, 80,000 tokens of generation. And with remarkable efficiency too. Remember, the full RL training run was completed in just three weeks on 512 H800 GPUs. The paper estimates the cost at around $0.53 million, which, for developing a model this capable, is actually quite efficient. Right. So it pushes the boundaries for open-weight AI? Absolutely. Minimax M1 now clearly ranks among the world's best open-weight models. It particularly excels in those complex real-world scenarios that need deep reasoning over long contexts and effective tool use. It's really setting a new benchmark for what's possible in the open-source AI landscape. And looking ahead, the paper talks about the growing demand for models that act as agents. Agents that interact with environments, tools, computers, maybe even other agents. Exactly. And those kinds of applications require exactly what M1 is good at. They need reasoning across dozens to hundreds of turns. They need to integrate long-context information from diverse sources. Minimax M1 looks like it's being positioned as a really strong foundation for building those kinds of future AI systems. Okay, so that brings us to the end of our deep dive. But here's a final provocative thought for you to chew on. With models like Minimax M1 now capable of thinking for tens of thousands of tokens, understanding these massive contexts, how might this fundamentally change how we approach complex problem solving? Think about areas from, say, scientific discovery all the way to potentially automating entire company workflows. What new capabilities will truly emerge when AI agents can genuinely reason across dozens to hundreds of turns and pull together information from all these diverse, long sources? Something to ponder.
