The essay was published on Sutton's website incompleteideas.net in 2019 and has received hundreds of formal citations according to Google Scholar. Some of these provide alternative statements of the principle; for example, the 2022 paper "A Generalist Agent" from Google DeepMind offers its own summary of the lesson, and another phrasing appears in a Google paper on Switch Transformers coauthored by Noam Shazeer.

The principle is further referenced in many other works on artificial intelligence. For example,
From Deep Learning to Rational Machines draws a connection to long-standing debates in the field, such as
Moravec's paradox and the contrast between
neats and scruffies. In "Engineering a Less Artificial Intelligence", the authors concur that "flexible methods so far have always outperformed handcrafted domain knowledge in the long run" although note that "[w]ithout the right (implicit) assumptions,
generalization is impossible". More recently, "The Brain's Bitter Lesson: Scaling Speech Decoding With Self-Supervised Learning" continues Sutton's argument, contending that (as of 2025) the lesson has not been fully learned in the fields of speech recognition and
brain data.

Other work has sought to apply and validate the principle in new domains. For example, the 2022 paper "Beyond the Imitation Game" applies the principle to large language models, concluding that "it is vitally important that we understand their capabilities and limitations" in order to "avoid devoting research resources to problems that are likely to be solved by scale alone". In 2024, "Learning the Bitter Lesson: Empirical Evidence from 20 Years of CVPR Proceedings" examines further evidence from the field of computer vision and pattern recognition, concluding that the previous twenty years of experience in the field shows "a strong adherence to the core principles of the 'bitter lesson'". In "Overestimation, Overfitting, and Plasticity in Actor-Critic: the Bitter Lesson of Reinforcement Learning", the authors examine the generalization of
actor-critic algorithms and find that "general methods that are motivated by stabilization of
gradient-based learning significantly outperform
RL-specific algorithmic improvements across a variety of environments" and note that this is consistent with the bitter lesson.