Weekly paper roundup: s1: Simple test-time scaling (2/3/2025)

Overview

The selected papers reflect significant advances in AI across a range of domains, with a shared focus on optimizing training and improving performance. Common threads include data-centric approaches: SmolLM2 trains a small language model effectively on extensive, carefully curated datasets, while LIMO challenges conventional wisdom by achieving strong reasoning performance with minimal data. Test-time scaling and reasoning accuracy are also explored, through the budget-forcing technique proposed in s1 and the analysis of long chain-of-thought reasoning in LLMs. OmniHuman advances human animation with a scalable, versatile framework driven by multi-modal inputs, and AlphaGeometry2 pushes automated problem-solving to gold-medalist level through sophisticated architecture design. Together, these works underscore the importance of tailored training strategies, integrated conditions, and efficient data usage in advancing AI model effectiveness.

Spotlight 🔦

s1: Simple test-time scaling

Stanford University; University of Washington, Seattle; Allen Institute for AI; Contextual AI

      🤗   100

This paper introduces a simple approach to test-time scaling in language models that balances computational effort against inference accuracy. Using a technique called budget forcing, the authors control how long the model's reasoning process runs, which is especially beneficial for tasks demanding high accuracy, such as competition math. I find it impressive that their model, s1, not only exceeds OpenAI's o1-preview in reasoning accuracy but also remains fully open-source, making it accessible for further development and research. The accompanying curated dataset of 1,000 examples, s1K, underpins the model's remarkable performance and adds substantial value to the paper's contribution. Overall, this paper makes a compelling case for strategic management of test-time compute in enhancing language model outputs.
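To make the budget-forcing idea concrete, here is a minimal sketch of the decode-time control loop. A toy deterministic generator stands in for a real LLM, and the delimiter token `</think>`, the `Wait` continuation string, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of budget forcing (s1): cap or extend the model's "thinking"
# phase at decode time. A toy generator stands in for a real LLM.

END_THINK = "</think>"  # assumed end-of-thinking delimiter token
WAIT = "Wait"           # string appended to force continued reasoning

def toy_generate(tokens):
    """Toy stand-in for one LLM decoding step: emits a counter token,
    and proposes END_THINK every 5 reasoning steps."""
    n = sum(1 for t in tokens if t.startswith("step"))
    if n > 0 and n % 5 == 0:
        return END_THINK
    return f"step{n + 1}"

def budget_forced_decode(prompt, min_think=0, max_think=100):
    tokens = list(prompt)
    think = 0
    while True:
        nxt = toy_generate(tokens)
        if nxt == END_THINK:
            if think < min_think:
                # Model tried to stop thinking too early: suppress the
                # end-of-thinking token and append "Wait" to force more
                # reasoning (scaling test-time compute up).
                tokens.append(WAIT)
                think += 1
                continue
            break
        tokens.append(nxt)
        think += 1
        if think >= max_think:
            # Budget exhausted: force the end-of-thinking delimiter so
            # the model moves on to its answer (scaling compute down).
            break
    tokens.append(END_THINK)
    return tokens

short = budget_forced_decode(["Q"], max_think=3)            # capped early
long = budget_forced_decode(["Q"], min_think=8, max_think=20)  # extended
```

The appeal of the technique is that both knobs operate purely at inference time: no retraining is needed to trade accuracy against compute.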



Other papers

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

ByteDance

      🤗   171

This paper introduces OmniHuman, a novel framework built on a Diffusion Transformer that improves human animation models by mixing motion-related conditions into training, making them more scalable and flexible. The model supports multiple input modalities, such as audio and video, and produces high-quality, realistic animations of humans. I found the advances in realism and versatility over existing methods particularly impressive, with notable demos showcased.

Raw notes: Impressive demos on X


SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model

Hugging Face

      🤗   162

This paper introduces SmolLM2, a 1.7-billion-parameter language model that demonstrates the power of data-centric training at scale. By training on roughly 11 trillion tokens and mixing in specialized curated datasets, SmolLM2 outperforms other recent small language models. I find the approach of blending web text with curated data particularly impressive: it provides a solid foundation for enhancing small-model performance while also contributing valuable resources for future research.



Demystifying Long Chain-of-Thought Reasoning in LLMs

IN.AI; Tsinghua University; Carnegie Mellon University

      🤗   50

This paper delves into how large language models handle long chains-of-thought (CoT) reasoning, shedding light on crucial aspects like inference compute scaling and reinforcement learning. I found the insights into supervised fine-tuning and reward shaping particularly noteworthy, as they underline the complexity involved in optimizing these models. The research provides valuable guidance for enhancing reasoning performance in future LLMs.
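Since the summary mentions reward shaping without detail, here is a toy illustration of what length-aware reward shaping for chain-of-thought RL can look like. This is a hypothetical scheme of my own for illustration, not the paper's actual formulation; the function name, coefficients, and length budget are all assumptions.

```python
# Toy illustration of length-aware reward shaping for long
# chain-of-thought RL. Hypothetical scheme, not the paper's exact one.

def shaped_reward(correct: bool, cot_len: int, max_len: int = 1000) -> float:
    """Reward correctness, but make the reward depend on how much of
    the reasoning-length budget the chain of thought consumed."""
    frac = min(cot_len, max_len) / max_len  # fraction of budget used
    if correct:
        # Full reward, slightly discounted for very long chains so the
        # policy prefers the shortest correct reasoning.
        return 1.0 - 0.2 * frac
    # Wrong answers get a penalty that grows with length, discouraging
    # the policy from padding failures with ever-longer "thinking".
    return -0.1 - 0.3 * frac

r_short_correct = shaped_reward(True, 200)   # high reward
r_long_correct = shaped_reward(True, 900)    # still positive, discounted
r_long_wrong = shaped_reward(False, 900)     # penalized
```

The point of such shaping is to keep CoT length under control: without it, RL can reward-hack by inflating reasoning length rather than improving answers.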



LIMO: Less is More for Reasoning

SJTU; SII; GAIR

      🤗   49

This paper presents LIMO, a model that efficiently tackles complex mathematical reasoning using a remarkably small dataset. With just 817 curated examples, LIMO not only outperforms traditional models but also proves that massive datasets aren’t always necessary for superior reasoning capabilities. I’m impressed by how it supports the idea that less can actually be more when data is strategically curated and leveraged.



Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2

Google DeepMind; University of Cambridge; Georgia Institute of Technology; Brown University

      🤗   39

This paper introduces AlphaGeometry2, a significant leap in automated problem-solving for Olympiad-level geometry, surpassing the average gold medalist. The authors have enhanced the system's language coverage, optimized the search process with a new architecture, and integrated knowledge-sharing mechanisms, achieving impressive coverage and solving rates. I find it remarkable that AlphaGeometry2 not only excels at problem-solving but also moves toward solving problems directly from natural-language input.



Acknowledgements

Papers are retrieved from Hugging Face.