Weekly paper roundup: GroUSE: A Benchmark to Evaluate Evaluators (9/9/2024)

Overview

The primary themes across these papers revolve around the advancement and evaluation of Large Language Models (LLMs) and their application in diverse domains. They collectively emphasize improving model alignment with user preferences (Paper 1), enhancing data quality and training efficiency (Papers 5 and 8), and developing novel evaluation benchmarks for assessing LLMs in real-world tasks, including data science, clinical applications, and operating system interactions (Papers 2, 4, and 7). Additionally, these works explore innovative frameworks and architectures to integrate multimodal and retrieval-augmented capabilities in LLMs (Papers 3, 6, 9, 10, 11, and 12). They also highlight the inherent challenges and future directions in achieving robust, adaptable, and efficient LLM performance across varying contexts.

Spotlight 🔦

GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering

Illuin Technology

🤗 35   X 0   HackerNews 0   Reddit 0   YouTube 0   GitHub 0

This paper introduces GroUSE, a benchmark aimed at evaluating the effectiveness of large language models (LLMs) as judges in grounded question answering tasks. I found it particularly insightful how the authors identified seven specific failure modes of evaluation frameworks, even when using advanced models like GPT-4. Their findings underscore the limitations of relying solely on correlations with high-performing models, emphasizing the necessity of deliberate and thoughtful evaluation processes. The study’s demonstration of fine-tuning techniques to enhance evaluation accuracy is compelling, particularly for practitioners in the field. Overall, this is a crucial read for anyone involved in the development and assessment of retrieval-augmented generation (RAG) systems.

Raw notes: This paper addresses the important problem of evaluating RAG systems. There are a lot of nuances. The arrival of OpenAI’s o1 could prompt a rethink of the LLM-as-a-judge approach: a judge should be deliberate, using slow thinking rather than knee-jerk next-token generation. Recommended read for practitioners.
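To make the LLM-as-a-judge setup concrete, here is a minimal sketch of the kind of judge call a benchmark like GroUSE stress-tests: the judge sees the question, the retrieved context, a reference answer, and the candidate answer, and returns structured scores. The criteria, prompt wording, and judge model below are illustrative assumptions of mine rather than the paper’s exact protocol, and an OpenAI-compatible client is assumed.

```python
import json

from openai import OpenAI  # assumes an OpenAI-compatible endpoint is configured

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an answer for a grounded question-answering task.

Question: {question}
Retrieved context: {context}
Reference answer: {reference}
Candidate answer: {candidate}

Assess the candidate carefully, then return a JSON object with integer scores from 1 to 5 for
"faithfulness" (is every claim supported by the context?) and "completeness"
(does the answer cover the reference?), plus a short "rationale"."""


def judge_answer(question: str, context: str, reference: str, candidate: str) -> dict:
    """Score one candidate answer against its grounding context with an LLM judge."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge; GroUSE compares many judge models
        temperature=0,   # deterministic scoring makes judge failure modes easier to reproduce
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context,
            reference=reference, candidate=candidate)}],
    )
    return json.loads(response.choices[0].message.content)
```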


Other papers

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Peking University; Alibaba Group; Shanghai Jiao Tong University; Zhongguancun Laboratory; Microsoft; University of Waterloo; University of Wisconsin-Madison; Institute of Software, Chinese Academy of Sciences

🤗 67   X 21   HackerNews 0   Reddit 0   YouTube 0   GitHub 0

This paper does a great job of dissecting and categorizing different strategies for aligning large language models with human preferences. By proposing a unified framework, it deepens understanding of the various alignment methods and encourages their integration. I found it particularly useful for researchers focusing on alignment challenges and future directions.

Raw notes: Helpful for researchers working in alignment.


DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

University of Texas at Dallas; Tencent AI Lab, Seattle; University of Southern California

🤗 48   X 132   HackerNews 0   Reddit 0   YouTube 0   GitHub 0

This paper presents DSBench as a new standard for evaluating data science agents, highlighting the large gap between current agents’ performance and the demands of real-world data science work. With the best agent solving only 34% of the tasks, it’s clear we have a long way to go in developing truly effective and autonomous data science tools. I’m intrigued to see how OpenAI’s o1 might perform on this benchmark, given its design for more deliberate, complex reasoning.

Raw notes: Doing data science requires the slow thinking that OpenAI’s o1 is designed for. I’m curious what o1’s performance is on this benchmark.


LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China

🤗 47   X 1149   HackerNews 0   Reddit 0   YouTube 1   GitHub 0

This paper introduces LLaMA-Omni, a cutting-edge model that enables real-time speech interaction with large language models without needing speech transcription. The integration of a speech encoder, adaptor, and streaming decoder results in rapid and stylistically superior responses. Impressively, the model was trained quickly and efficiently, showing significant potential for advancing speech-language interaction technology.

Raw notes: This work pursues goals similar to the Mini-Omni work by Tsinghua University (covered in last week’s roundup).


MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

M42 Health, Abu Dhabi, UAE

🤗 47   X 95   HackerNews 0   Reddit 0   YouTube 0   GitHub 0

This paper presents MEDIC, a framework to assess Large Language Models in clinical settings more comprehensively. While I appreciate the focus on multiple dimensions of clinical competence, I’m skeptical about the necessity of such expansive benchmarking when practitioners often prioritize specific, context-driven metrics. The argument for a one-size-fits-all evaluation method might overlook the real-world nuances of clinical applications.

Raw notes: I don’t fully agree with the premise of this paper. I imagine that in real-world applications, practitioners typically focus on a specific set of metrics, and use those to decide which LLMs to adopt and how to fine-tune them. Without application-specific requirements, it’s unclear whether we need to benchmark LLMs at all.


MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Alibaba Group; Tongji University; Independent Researcher; The University of Sydney

🤗 42   X 29   HackerNews 0   Reddit 0   YouTube 0   GitHub 0

This paper presents MMEvol, a framework designed to enhance multimodal large language models by evolving instruction data iteratively, improving both quality and diversity. The results are impressive, showcasing significant accuracy gains and top-tier performance on vision-language tasks. I found its take on the training-data bottleneck particularly insightful: AI can be used to reduce the need for human-provided labels.

Raw notes: A good read on dealing with the training-data bottleneck by using AI to reduce reliance on human labels.
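As a rough illustration of the instruction-evolution idea, here is a sketch of an iterative loop that asks a model to rewrite seed instructions into harder, more diverse variants. The evolution directives and model name are placeholders of my own; MMEvol defines its own multimodal-specific evolution operations and quality controls.

```python
import random

from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()

# Illustrative evolution directives; MMEvol uses its own multimodal-specific operations.
EVOLUTION_OPS = [
    "Rewrite the instruction so it requires deeper reasoning about the image.",
    "Rewrite the instruction to reference finer-grained visual details.",
    "Rewrite the instruction as a multi-turn interaction with a follow-up question.",
]


def evolve(instruction: str, rounds: int = 3) -> list[str]:
    """Iteratively evolve one seed instruction into progressively harder variants."""
    lineage = [instruction]
    for _ in range(rounds):
        op = random.choice(EVOLUTION_OPS)
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder generator model
            messages=[{"role": "user", "content":
                       f"{op}\n\nInstruction: {lineage[-1]}\n\nReturn only the rewritten instruction."}],
        )
        lineage.append(response.choices[0].message.content.strip())
    return lineage  # a quality filter would normally prune degenerate rewrites here
```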


Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Microsoft; Carnegie Mellon University; Columbia University

🤗 35   X 18   HackerNews 3   Reddit 145   YouTube 1   GitHub 111

This paper introduces the Windows Agent Arena, a robust benchmarking environment designed to evaluate the capabilities of multi-modal agents within the Windows OS across over 150 diverse tasks. The study reveals significant room for improvement in agents like Navi, which only achieve a 19.5% success rate compared to the 74.5% of unassisted humans. I found the platform’s scalability and rapid evaluation capabilities particularly promising for future advancements in OS agent development.

Raw notes: This benchmark should now include the performance of o1. Discussed on Reddit.


How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

Beijing University of Posts and Telecommunications; Meituan

🤗 29   X 14   HackerNews 0   Reddit 0   YouTube 0   GitHub 26

This paper dives into the critical issue of data quality in code instruction tuning for language models, revealing how data leakage skews performance metrics. By applying a pruning strategy to refine the dataset, the authors introduce XCoder, a model that excels with less training data. Their emphasis on meticulous dataset evaluation offers valuable insights for enhancing future code LLMs.

Raw notes: Data quality is important.
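The leakage point is easy to act on even without the paper’s full pipeline. Below is a small, generic n-gram decontamination check of the kind one might run between an instruction-tuning set and benchmark test data; the n-gram length and whitespace tokenization are arbitrary illustrative choices, not XCoder’s actual pruning method.

```python
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """All n-grams of whitespace tokens in a string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_samples: list[str], test_samples: list[str], n: int = 10) -> list[str]:
    """Drop training samples that share any long n-gram with benchmark test data,
    a crude proxy for the data-leakage checks the paper argues are necessary."""
    test_grams: set[tuple[str, ...]] = set()
    for sample in test_samples:
        test_grams |= ngrams(sample, n)
    return [s for s in train_samples if not (ngrams(s, n) & test_grams)]
```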


Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Stanford University

🤗 29   X 5047   HackerNews 64   Reddit 76   YouTube 6   GitHub 0

This paper investigates the potential of large language models to generate research ideas and finds that LLM-generated concepts are often seen as more novel, albeit slightly less feasible, compared to those from expert human researchers. I think it’s fascinating that, while LLMs show promise in pushing the boundaries of novelty, there’s still a need for further validation in terms of practical impact and diversity of ideas. This study serves as an interesting early step toward AI-aided innovation in research fields.

Raw notes: Baby step toward AI-generated research ideas. Widely discussed on social media.


Configurable Foundation Models: Building LLMs from a Modular Perspective

Tsinghua University; University of California San Diego; Carnegie Mellon University; ModelBest Inc.; Renmin University of China; Princeton University; National University of Singapore; Stanford University; University of California, Los Angeles

🤗 26   X 187   HackerNews 0   Reddit 0   YouTube 3   GitHub 0

This paper explores the innovative idea of breaking down large language models into modular units for improved efficiency and adaptability. The authors present a compelling case for using these “bricks” to dynamically assemble models tailored to specific tasks, backed by empirical evidence. I appreciate the thorough analysis but would have liked to see a discussion on how sparse autoencoders could enhance model interpretability.

Raw notes: I wish there were a discussion of the sparse autoencoder-based interpretability work.


OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs

Zhejiang University; Ant Group; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph

🤗 26   X 19   HackerNews 0   Reddit 0   YouTube 0   GitHub 0

This paper introduces the innovative OneGen framework, which combines generation and retrieval tasks in large language models (LLMs) within a single forward pass. By integrating autoregressively generated retrieval tokens, OneGen improves retrieval-augmented tasks and is the first to achieve vector retrieval during generation. While the initial experimental results show some gains, I think more extensive testing is needed to fully validate the framework’s effectiveness.

Raw notes: Experiments show modest gains, likely cherry-picked. It’d be good to run more comprehensive experiments.
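My rough reading of the mechanism, sketched below: a hidden state produced during the generation pass doubles as a dense retrieval query, so retrieval needs no separate encoder call. The model, the "[RET]" placeholder string, and the random document embeddings are illustrative stand-ins, not the paper’s trained retrieval tokens or corpus.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; OneGen fine-tunes larger LLMs with learned retrieval tokens
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

docs = ["Paris is the capital of France.", "The Nile is the longest river in Africa."]
doc_emb = torch.randn(len(docs), model.config.hidden_size)  # placeholder document embeddings

# "[RET]" stands in for a special retrieval token the model would emit during generation.
prompt = "Question: What is the capital of France? [RET]"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# The final position's hidden state plays the role of the retrieval token's embedding:
# the same forward pass that drives generation also yields the dense query.
query = out.hidden_states[-1][0, -1]
scores = F.cosine_similarity(query.unsqueeze(0), doc_emb)

# With random placeholder embeddings the result is arbitrary; with embeddings produced by
# the same model (as in OneGen), the top-scoring document would be the retrieved passage.
print("retrieved:", docs[int(scores.argmax())])
```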


MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery

Beijing Academy of Artificial Intelligence; Gaoling School of Artificial Intelligence, Renmin University of China

🤗 23   X 28   HackerNews 0   Reddit 0   YouTube 0   GitHub 0

This paper introduces MemoRAG, which leverages a dual-system architecture for significantly improved knowledge retrieval in large language models, especially when dealing with ambiguous queries. The novel approach, using both lightweight and more powerful LLMs for different stages of the retrieval and generation process, shows notable performance enhancements over conventional RAG systems. I found it especially relevant for those looking to build advanced RAG models, as it tackles fundamental limitations in current systems.

Raw notes: This is a different approach from o1’s to addressing the limitation of fast, knee-jerk next-token generation. Good read for RAG builders.
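A minimal sketch of the dual-system shape described above: a lightweight “memory” model first drafts clues about what evidence the question needs, a retriever consumes those clues, and a stronger model writes the final grounded answer. Model names, prompts, and the pluggable `retrieve` helper are my own illustrative assumptions, not MemoRAG’s actual components.

```python
from openai import OpenAI  # assumes an OpenAI-compatible endpoint serves both models

client = OpenAI()


def memorag_style_answer(question: str, corpus: list[str], retrieve) -> str:
    """Two-stage sketch: a cheap model produces retrieval clues, a strong model answers from evidence."""
    # Stage 1: the lightweight "memory" model turns an ambiguous question into concrete clues.
    clues = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for the lightweight memory model
        messages=[{"role": "user", "content":
                   f"List the specific facts one would need to look up to answer: {question}"}],
    ).choices[0].message.content

    # Stage 2: any retriever can consume the clues; `retrieve` is a hypothetical helper here.
    evidence = "\n".join(retrieve(clues, corpus))

    # Stage 3: the stronger model composes the final answer grounded in the retrieved evidence.
    return client.chat.completions.create(
        model="gpt-4o",  # stand-in for the expressive generator model
        messages=[{"role": "user", "content":
                   f"Question: {question}\n\nEvidence:\n{evidence}\n\nAnswer using only the evidence."}],
    ).choices[0].message.content
```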


Acknowledgements

Papers are retrieved from Hugging Face.

Social media metrics are from Emergent Mind.