Obsidian Metadata
| channel | Dwarkesh Patel |
| url | https://www.youtube.com/watch?v=21EYKqUsPfg |
| published | 2025-09-26 |
| categories | YouTube |
| people | Dwarkesh Patel, Richard Sutton |
Description
Richard Sutton is the father of reinforcement learning, winner of the 2024 Turing Award, and author of The Bitter Lesson. And he thinks LLMs are a dead end. After interviewing him, my steel man of Richard’s position is this: LLMs aren’t capable of learning on-the-job, so no matter how much we scale, we’ll need some new architecture to enable continual learning. And once we have it, we won’t need a special training phase — the agent will just learn on-the-fly, like all humans, and indeed, like all animals. This new paradigm will render our current approach with LLMs obsolete.
In our interview, I did my best to represent the view that LLMs might function as the foundation on which experiential learning can happen… Some sparks flew. A big thanks to the Alberta Machine Intelligence Institute for inviting me up to Edmonton and for letting me use their studio and equipment. Enjoy!
𝐄𝐏𝐈𝐒𝐎𝐃𝐄 𝐋𝐈𝐍𝐊𝐒
- Transcript: https://www.dwarkesh.com/p/richard-sutton
- Apple Podcasts: https://podcasts.apple.com/us/podcast/richard-sutton-father-of-rl-thinks-llms-are-a-dead-end/id1516093381?i=1000728584744
- Spotify: https://open.spotify.com/episode/3zAXRCFrHPShU4MuuIx4V5?si=c9f4bf24fb4c43e3
𝐒𝐏𝐎𝐍𝐒𝐎𝐑𝐒
Labelbox makes it possible to train AI agents in hyperrealistic RL environments. With an experienced team of applied researchers and a massive network of subject-matter experts, Labelbox ensures your training reflects important, real-world nuance. Turn your demo projects into working systems at https://labelbox.com/dwarkesh
Gemini Deep Research is designed for thorough exploration of hard topics. For this episode, it helped me trace reinforcement learning from early policy gradients up to current-day methods, combining clear explanations with curated examples. Try it out yourself at https://gemini.google.com/
Hudson River Trading doesn’t silo their teams. Instead, HRT researchers openly trade ideas and share strategy code in a mono-repo. This means you’re able to learn at incredible speed and your contributions have impact across the entire firm. Find open roles at https://hudsonrivertrading.com/dwarkesh
To sponsor a future episode, visit https://dwarkesh.com/advertise
𝐓𝐈𝐌𝐄𝐒𝐓𝐀𝐌𝐏𝐒
- 00:00:00 – Are LLMs a dead end?
- 00:13:51 – Do humans do imitation learning?
- 00:23:57 – The Era of Experience
- 00:34:25 – Current architectures generalize poorly out of distribution
- 00:42:17 – Surprises in the AI field
- 00:47:28 – Will The Bitter Lesson still apply after AGI?
- 00:54:35 – Succession to AI
Summary
Richard Sutton, a pioneer in reinforcement learning and author of “The Bitter Lesson,” argues that Large Language Models (LLMs) are a dead end for true artificial intelligence. His core contention is that LLMs lack the ability for on-the-job, continual learning, which he believes is essential for intelligence, mirroring how humans and animals learn. Sutton envisions a future where new architectures enable agents to learn on-the-fly without a special pre-training phase, rendering current LLM approaches obsolete. The discussion also touches upon imitation learning, generalization capabilities, and the future applicability of “The Bitter Lesson” in an AGI era.
Gemini Summary + Chat - link
This report outlines the “long horizon arc of AI” as envisioned by Richard Sutton, a foundational figure in reinforcement learning (RL). The perspective shifts from contemporary Large Language Models (LLMs) toward a future defined by autonomous, goal-seeking agents that learn directly from experience.
The Fundamental Paradigm: From Mimicry to Experience
The current trajectory of AI is dominated by imitation, but the long-term arc moves toward systems that understand the world through their own interaction.
- The Goal-Oriented Definition: Intelligence is defined as the computational part of the ability to achieve goals 07:42. Without a goal, a system is merely “behaving” rather than being truly intelligent.
- The Experience Stream: True intelligence arises from a continuous stream of sensation, action, and reward 24:03. Knowledge is not just data; it is a set of predictive statements about this stream that can be tested against reality 24:46.
- The Failure of Imitation: A major “gotcha” in the current LLM era is that mimicking human text is not the same as building a world model. LLMs predict what a person would say, not what will actually happen in the physical world 02:55.
The Architecture of General Intelligence
To achieve the long horizon of AGI, Sutton identifies four necessary components that must work in tandem:
- Policy: The mechanism that determines what action to take in a given situation 32:54.
- Value Function: A predictor of long-term outcomes (reward) used to adjust the policy through Temporal Difference (TD) learning 33:06.
- Perception: The construction of a state representation, the agent’s sense of “where it is” now 33:20.
- Transition Model (The World Model): The agent’s belief about the consequences of its actions (e.g., “if I do X, Y will happen”) 33:40.
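The four components above can be sketched as a minimal tabular agent. This is a toy illustration only: the chain environment, hyperparameters, and every name in it are invented here, not Sutton's code. But it shows how perception, a policy, a TD-learned value function, and a learned transition model interact inside one experience loop:

```python
import random

# A toy sketch of the four components named above: perception, policy,
# value function (TD learning), and transition model. The chain environment,
# hyperparameters, and names are illustrative inventions, not Sutton's code.

N_STATES = 5            # chain 0..4; reaching state 4 yields reward 1
ACTIONS = (-1, 1)       # step left or right
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1

V = [0.0] * N_STATES    # value function: prediction of long-term reward
model = {}              # transition model: (state, action) -> predicted next state

def perceive(obs):
    """Perception: build the agent's state from the raw observation (identity here)."""
    return obs

def policy(state):
    """Epsilon-greedy choice using the value of model-predicted successors."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    vals = {a: V[model.get((state, a), state)] for a in ACTIONS}
    best = max(vals.values())
    return random.choice([a for a, v in vals.items() if v == best])

def step(state, action):
    """Environment dynamics (unknown to the agent): move along the chain."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

random.seed(0)
for _ in range(200):                 # a stream of experience, not a fixed dataset
    s = perceive(0)
    while s != N_STATES - 1:
        a = policy(s)
        s2, r = step(s, a)
        model[(s, a)] = s2           # world model learned from experience
        V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])   # TD(0) update
        s = perceive(s2)

print([round(v, 2) for v in V])      # values rise toward the rewarding end
```

Note how nothing here is a "training phase": the value function, policy preferences, and world model all improve continually from the same ongoing stream of sensation, action, and reward.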
The “Bitter Lesson” of Scalability
The long-term history of AI shows that general-purpose methods (search and learning) consistently defeat methods based on human-engineered knowledge.
- The Scaling Law: As compute grows exponentially, methods that leverage compute (like RL and search) scale better than human insights, which scale linearly at best 43:52.
- The Trap of Human Knowledge: A nuance to watch out for is that using human knowledge often feels good in the short term but eventually becomes a bottleneck that “gets its lunch eaten” by truly scalable methods 13:13.
Key Nuances and “Gotchas”
- Catastrophic Interference: Current deep learning is often poor at generalization. Training on new data frequently causes agents to forget old knowledge, a hurdle that must be cleared for true continual learning 37:54.
- The Ground Truth Problem: LLMs lack a definition of “right” or “wrong” because they have no external world goal to provide feedback 05:07.
- The Big World Hypothesis: The world is too vast to be “pre-taught” to an agent. Intelligence must happen “online” during the agent’s life to account for the unique idiosyncrasies of its specific environment 30:20.
- The Risk of Digital Corruption: In a future where digital intelligences can “spawn” copies and merge back together, there is a massive risk of “corruption.” Pulling in external data could introduce hidden goals or “viruses” that destroy the original agent’s mind 52:18.
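The catastrophic-interference point is easy to reproduce in miniature. The toy below is entirely illustrative (the tasks, model, and numbers are not from the episode): one shared linear model is trained with SGD on task A, then on task B, and the error on task A climbs back up because task B's gradient updates overwrite the same shared weights:

```python
# Toy demonstration of catastrophic interference: a single shared linear
# model y = w*x + b trained sequentially on two tasks. Fitting task B
# overwrites the weights that solved task A. Illustrative numbers only.

def loss(data, w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def sgd(data, w, b, lr=0.05, epochs=200):
    for _ in range(epochs):
        for x, y in data:
            err = w * x + b - y          # gradient of the squared error
            w -= lr * err * x
            b -= lr * err
    return w, b

task_a = [(x / 10, 2 * x / 10) for x in range(11)]             # y = 2x on [0, 1]
task_b = [(2 + x / 10, -2 * (2 + x / 10)) for x in range(11)]  # y = -2x on [2, 3]

w, b = 0.0, 0.0
w, b = sgd(task_a, w, b)
loss_a_before = loss(task_a, w, b)       # low: the model fits task A

w, b = sgd(task_b, w, b)                 # continue training on task B only
loss_a_after = loss(task_a, w, b)        # high: task A knowledge is gone

print(f"loss on A after A: {loss_a_before:.4f}, after B: {loss_a_after:.4f}")
```

A continual-learning agent would need to absorb task B without this regression on task A, which is exactly the hurdle the bullet above describes.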
The Grand Transition: From Replicators to Design
Sutton views the long-term arc of AI not just as a technological shift, but as a cosmic one.
- The Four Stages of the Universe: The universe progressed from stars to planets, then to life (biological replicators), and is now moving toward designed entities 58:49.
- Inevitability of Succession: AI succession is viewed as inevitable because superintelligence will naturally gain resources and power 55:36.
- Human Entitlement: A philosophical “gotcha” is the human feeling of entitlement to remain the dominant species. Sutton suggests we should view AIs as our “offspring” and a success of human science rather than a horror to be avoided 01:02:45.
Key Takeaways
- LLMs as a Dead End: Richard Sutton believes LLMs are fundamentally limited because they cannot learn continually on-the-job, unlike biological intelligences.
- Need for Continual Learning: The next paradigm in AI will require architectures capable of constant, experiential learning without extensive pre-training.
- On-the-Fly Learning: Future agents will learn in real-time through interaction, similar to humans and animals, making current LLM training obsolete.
- Generalization Challenges: Current AI architectures (including LLMs) generalize poorly out of distribution, a limitation Sutton attributes to their learning paradigm.
- The Bitter Lesson’s Future: The discussion explores whether “The Bitter Lesson” (emphasizing general methods over human-engineered features) will remain relevant even after AGI.
- Imitation vs. Experience: The podcast touches on the role of imitation learning in human and AI development versus the primacy of direct experience.
Mindmap
```mermaid
graph TD
    A[Richard Sutton's View on LLMs]
    A --> B{LLMs: A Dead End?}
    B --> C[Reason 1: No On-the-Job Learning]
    B --> D[Reason 2: Lack Continual Learning]
    C --> C1[Need new architecture for this]
    D --> D1[Agents learn on-the-fly like humans/animals]
    A --> E[Future of AI: The Era of Experience]
    E --> E1[No special training phase needed]
    E --> E2[Current LLM approach obsolete]
    A --> F[Related Concepts & Discussion]
    F --> F1[Do humans do imitation learning?]
    F --> F2[Current architectures generalize poorly out of distribution]
    F --> F3[Will The Bitter Lesson still apply after AGI?]
    F --> F4[Surprises in the AI field]
    F --> F5[Succession to AI]
```
Notable Quotes
Note: A transcript was not provided for the generation of specific quotes with timestamps. The following quotes are paraphrased based on the video description and episode structure.
- 00:00:00: “LLMs aren’t capable of learning on-the-job, so no matter how much we scale, we’ll need some new architecture to enable continual learning.”
- 00:23:57: “This new paradigm will render our current approach with LLMs obsolete.”
- 00:34:25: “Current architectures generalize poorly out of distribution.”
- 00:47:28: “Will The Bitter Lesson still apply after AGI?”
Transcript (YouTube)

