Overview
The papers in this collection focus on advances in the training and evaluation of foundation models, particularly around generalization, contextual understanding, and modality integration. A central theme is how traditional supervised fine-tuning (SFT) compares with newer techniques such as Critique Fine-Tuning (CFT) and reinforcement learning (RL), with an emphasis on improving generalization and decision-making. Several papers push model capabilities through new benchmarks and datasets, such as “Humanity’s Last Exam” and WILDCHAT-50M, or optimize training itself, as with FP4 quantization. Others address shortcomings of existing models, from CoRAG for complex, multi-hop information retrieval to a metric for “underthinking” in reasoning LLMs. Finally, model safety and interpretability round out the collection, underscoring the importance of reliable and understandable AI systems.
Spotlight 
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
HKU; UC Berkeley; Google DeepMind; NYU
This paper delves into a fascinating comparison between supervised fine-tuning (SFT) and reinforcement learning (RL) as post-training strategies for foundation models. I found it interesting that while RL improves generalization to new contexts, SFT tends to memorize the training data. The authors’ introduction of novel evaluation environments, GeneralPoints and V-IRL, provides fresh insights into the models’ performance. The study also highlights how RL, particularly with outcome-based rewards, enhances visual recognition and overall generalization, though SFT’s role in stabilizing model outputs shouldn’t be underestimated. Overall, it presents a nuanced perspective on the complementary roles of SFT and RL in model training.
Raw notes: RL interest renewed
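To make the contrast concrete, here is a minimal sketch (my own, not the paper’s code) of the two post-training updates being compared: token-level imitation for SFT versus a REINFORCE-style update driven by a scalar outcome reward, as in a task like GeneralPoints. Model, optimizer, and tensor shapes are assumed to follow the usual Hugging Face causal-LM conventions.

```python
import torch.nn.functional as F

def sft_step(model, input_ids, target_ids, optimizer):
    """Supervised fine-tuning: imitate the reference answer token by token."""
    logits = model(input_ids).logits                      # [B, T, V]
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           target_ids.view(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def rl_step(model, input_ids, sampled_ids, reward, optimizer):
    """Outcome-reward RL (REINFORCE-style): reinforce a whole sampled answer
    in proportion to a scalar reward, e.g. 1.0 for a correct solution.
    `input_ids` holds prompt + sample; `sampled_ids` are the aligned
    next-token targets for the sampled positions."""
    logits = model(input_ids).logits                      # [B, T, V]
    logprob = F.log_softmax(logits, dim=-1)
    chosen = logprob.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    loss = -(reward * chosen.sum(dim=-1)).mean()          # scale by outcome
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```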
Other papers
Humanity’s Last Exam
Center for AI Safety; Scale AI
This paper presents “Humanity’s Last Exam,” a challenging benchmark designed to assess large language models’ capabilities in areas where human knowledge is most advanced. By crafting 3,000 nuanced questions that resist straightforward internet searches, the authors effectively expose the gap between current model performance and expert human levels. I find it intriguing how this work not only highlights the present limitations but could also shape future developments in AI research and policy.
Raw notes: Will be interesting to see how long it takes for this benchmark to saturate
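For anyone who wants to track saturation themselves, scoring a closed-ended benchmark like this reduces to a simple loop; the sketch below is illustrative only, and `model_fn` plus the field names are placeholders rather than the benchmark’s actual interface.

```python
def evaluate(model_fn, questions):
    """Score exact-match accuracy over closed-ended questions.

    `questions` is a list of dicts with 'prompt' and 'answer' keys
    (placeholder schema); `model_fn` maps a prompt string to an answer string.
    """
    correct = sum(model_fn(q["prompt"]).strip() == q["answer"].strip()
                  for q in questions)
    return correct / len(questions)
```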
Qwen2.5-1M Technical Report
Alibaba Group
This paper presents a breakthrough with the Qwen2.5-1M models, which can handle a whopping 1 million tokens, massively upgrading their long-context abilities. I found it impressive how the open-source framework improves efficiency and performance, boasting a significant speedup in prefill tasks. The evaluation results show that these models don’t just handle longer contexts well—they outshine even top competitors in some cases.
Baichuan-Omni-1.5 Technical Report
Baichuan Inc.
This paper introduces the Baichuan-Omni-1.5 model, an omni-modal AI system designed to handle audio, text, and visual data with impressive competence. Key elements include a robust data preparation technique and a unique audio tokenizer, resulting in top-tier performance across multimodal benchmarks, especially in the medical field. The innovative training strategy fosters enhanced multimodal collaboration, setting this model apart from its predecessors.
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
University of Waterloo; Carnegie Mellon University; Vector Institute
This paper presents a novel approach called Critique Fine-Tuning (CFT), which trains language models to critique and improve incorrect responses rather than simply imitate correct ones. The results are impressive, showing a 4-10% performance improvement over traditional supervised fine-tuning across various math benchmarks. I find it particularly compelling that the method not only enhances the model’s reasoning abilities but also competes effectively with approaches trained on much larger datasets.
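As I read it, the core recipe is simple: supervise on critiques of candidate answers instead of on the answers themselves. Below is a minimal sketch of what a CFT training example might look like; the prompt template and field names are my assumptions, not the authors’ exact format.

```python
def build_cft_example(tokenizer, question, candidate_answer, critique):
    """Build one Critique Fine-Tuning example: the model sees a question and
    a (possibly wrong) candidate answer, and the training target is the
    critique. Loss on the prompt tokens is masked out with -100."""
    prompt = (f"Question: {question}\n"
              f"Candidate answer: {candidate_answer}\n"
              f"Critique:")
    prompt_ids = tokenizer(prompt)["input_ids"]
    target_ids = tokenizer(" " + critique, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [-100] * len(prompt_ids) + target_ids,
    }
```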
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Tencent AI Lab; Soochow University; Shanghai Jiao Tong University
This paper explores the issue of “underthinking” in large language models, highlighting how these models often shift reasoning paths excessively instead of thoroughly exploring potential solutions, which leads to reduced performance on complex tasks. By introducing a new metric to measure underthinking, the authors reveal its correlation with incorrect answers and demonstrate that a strategic decoding approach that discourages frequent switching can enhance model accuracy. I find it intriguing how the authors managed to improve performance by adjusting the model’s thought process without any fine-tuning, indicating a novel direction for optimizing LLMs.
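The decoding fix is the part I’d most want to try. A rough sketch of the idea as I understand it: during generation, down-weight tokens that typically open a new line of reasoning so the model finishes its current one. The trigger words and penalty value below are illustrative guesses, not the authors’ configuration.

```python
SWITCH_MARKERS = ["Alternatively", "Wait", "However"]  # illustrative triggers

def penalize_thought_switches(logits, tokenizer, penalty=3.0):
    """Subtract a fixed penalty from the logits of tokens that tend to
    start a new reasoning path, discouraging frequent switching."""
    for marker in SWITCH_MARKERS:
        for tok_id in tokenizer(marker, add_special_tokens=False)["input_ids"]:
            logits[..., tok_id] -= penalty
    return logits
```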
Chain-of-Retrieval Augmented Generation
Microsoft Corporation; Renmin University of China
This paper introduces CoRAG, a promising approach that combines dynamic query reformulation with stepwise information retrieval to strengthen retrieval-augmented generation. It notably boosts multi-hop question answering, outperforming strong baselines by a substantial margin in EM score and setting a new state of the art on the KILT benchmark. The paper provides valuable insights into how chain-of-retrieval methods can advance grounded foundation models for knowledge-intensive tasks.
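From the description, the retrieval chain can be pictured as a loop that alternates between asking the model for the next sub-query and retrieving against it. Here’s a minimal sketch under that reading; `llm` and `retriever` are placeholder callables, not CoRAG’s actual interfaces.

```python
def chain_of_retrieval(question, llm, retriever, max_hops=4, k=5):
    """Iteratively reformulate sub-queries and accumulate evidence before
    answering, instead of retrieving once up front."""
    evidence = []
    for _ in range(max_hops):
        sub_query = llm(f"Question: {question}\n"
                        f"Evidence so far: {evidence}\n"
                        f"Next sub-query (or DONE):").strip()
        if sub_query == "DONE":
            break
        evidence.extend(retriever(sub_query, k=k))
    return llm(f"Question: {question}\nEvidence: {evidence}\nFinal answer:")
```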
Optimizing Large Language Model Training Using FP4 Quantization
University of Science and Technology of China; Microsoft SIGMA Team; Microsoft Research Asia
This paper introduces a novel FP4 training framework designed to overcome common hurdles in quantizing large language models, such as quantization error and limited representational capacity. By incorporating a differentiable quantization estimator and a compensation strategy, the authors show that their approach maintains accuracy comparable to higher-precision baselines, even as it scales to large models and datasets. I find the results impressive, as they suggest a promising path toward more efficient LLM training without sacrificing performance.
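The differentiable quantization estimator is the piece that lends itself to a sketch. The toy version below quantizes to a 16-level grid in the forward pass and uses a clipped straight-through estimator in the backward pass; the real FP4 format, scaling scheme, and compensation strategy are more involved, so treat this strictly as an illustration.

```python
import torch

class ToyQuant4(torch.autograd.Function):
    """4-bit-style quantizer with a straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x, scale)
        # 16 uniform levels stand in for a real FP4 format.
        return torch.clamp(torch.round(x / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_out):
        x, scale = ctx.saved_tensors
        # Pass gradients straight through, zeroed outside the representable range.
        inside = ((x / scale) >= -8) & ((x / scale) <= 7)
        return grad_out * inside, None
```

In use, `ToyQuant4.apply(weights, scale)` would stand in for full-precision weights inside a matmul, letting gradients flow despite the non-differentiable rounding.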
o3-mini vs DeepSeek-R1: Which One is Safer?
Mondragon University; University of Seville
This paper presents a comparative safety analysis of the DeepSeek-R1 and OpenAI’s o3-mini models, revealing that o3-mini significantly outperforms DeepSeek-R1 in terms of safety. Using an automated safety testing tool, the study found that DeepSeek-R1 had a tenfold higher rate of unsafe responses compared to o3-mini. I appreciate the clear and concise manner in which the authors convey the safety metrics, making it evident that o3-mini is the more reliable choice.
Open Problems in Mechanistic Interpretability
Apollo Research; Anthropic; Decode Research; Eleuther AI; FAR AI; Google DeepMind; Leap Laboratories; Harvard University; King’s College London; Imperial College London; MATS; MIT; METR; Northeastern University; Pr(AI)2r group; Tel Aviv University; Timaeus; University of Melbourne; Goodfire
This paper surveys the intricacies of mechanistic interpretability, underlining the significance of tackling the field’s many unresolved challenges. It points out the need for both conceptual and practical breakthroughs, alongside socio-technical considerations for effective application. I appreciate how the paper frames these open problems, aiming to advance our understanding of AI system behavior and intelligence.
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
NYU
This paper presents WILDCHAT-50M, a significant contribution to the resources available for evaluating post-training performance of language models using synthetic data. I appreciated how the authors effectively demonstrated a practical application of the dataset with their RE-WILD public SFT mix, indicating its potential in optimizing model performance with fewer samples. This dataset could serve as a valuable tool for researchers looking to improve or compare language model responses across various architectures.
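If the dataset lands on the Hub as expected, experimenting should take only a few lines with the `datasets` library; the repository ID and split below are guesses for illustration, so check the release for the real identifiers.

```python
from datasets import load_dataset

# Repo ID and split are placeholders; consult the WILDCHAT-50M release page.
ds = load_dataset("nyu-dice-lab/wildchat-50m", split="train", streaming=True)
for conversation in ds.take(3):  # peek at a few rows without downloading everything
    print(conversation)
```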
Acknowledgements
Papers are retrieved from Hugging Face.