Overview
The papers collectively highlight advancements in vision-language models, large language models (LLMs), and their applications across diverse domains, including graphical user interfaces (GUIs), medical AI, and materials science. Notably, innovations like ShowUI and GMAI-VL demonstrate the integration of multimodal capabilities to optimize interactions and diagnostics, enhancing efficiency and performance in specialized tasks. The exploration of LLMs as evaluative tools in “LLM-as-a-judge” and their role in advancing materials science underscores their transformative potential and versatility. Efficiency and training methodology also feature prominently, with TÜLU 3 addressing post-training refinement and Star Attention focusing on memory and computational optimization. Overall, the papers reflect an overarching emphasis on openness, transparency, and the responsible development and deployment of AI technologies, with diverse contributions from both Western and Chinese research communities.
Spotlight
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Arizona State University; University of Illinois Chicago; University of Maryland, Baltimore County; Illinois Institute of Technology; University of California, Berkeley; Emory University
This paper dives into the intriguing concept of using Large Language Models (LLMs) as judges for AI assessments, shaking up traditional evaluation methods. It lays out a detailed taxonomy of judgment dimensions while thoughtfully examining both the opportunities and hurdles in adopting LLMs for this purpose. I appreciate how it identifies benchmarks for evaluation, which could be instrumental for future research and applications. The discussion on challenges and future directions is particularly insightful, helping to map out potential areas of exploration. Overall, this is a valuable resource for anyone interested in the growing role of LLMs in judgment and evaluation.
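To make the setup concrete, here is a minimal sketch of a pairwise LLM-as-a-judge call. The prompt wording, the gpt-4o-mini model choice, and the verdict parsing are my own illustrative assumptions, not anything prescribed by the survey.

```python
# Minimal sketch of pairwise "LLM-as-a-judge" evaluation (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
decide which answer is better. Reply with exactly "A", "B", or "TIE".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str, model: str = "gpt-4o-mini") -> str:
    """Ask the judge model to pick the better answer; returns 'A', 'B', or 'TIE'."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0.0,  # deterministic verdicts make evaluation easier to reproduce
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # fall back on unparseable output
```

In practice, verdicts are usually collected in both A/B orders and averaged, a common guard against the well-documented position bias of LLM judges.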
Raw notes: LLM-as-a-judge is fast becoming an indispensable tool for practitioners. This survey is a welcome resource and a must-read.
Other papers
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
National University of Singapore; Microsoft
This paper introduces ShowUI, a cutting-edge model aimed at improving GUI assistance through advanced vision-language-action integration. It stands out with innovative approaches like UI-Guided Visual Token Selection, which boosts computational efficiency, and Interleaved Vision-Language-Action Streaming, which enhances task management. I find its 75.1% accuracy in zero-shot screenshot grounding, along with its improved training efficiency, particularly impressive, as it sets a high standard in the rapidly evolving field of GUI agents.
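As a rough illustration of the token-selection idea (not ShowUI's actual implementation, which as I understand it builds a connected graph over screenshot patches based on their RGB values), the sketch below groups near-uniform patches into connected regions and keeps a single token per region; the patch size and uniformity threshold are arbitrary.

```python
# Toy sketch of UI-guided visual token selection: treat flat, near-uniform screenshot
# regions (e.g., backgrounds) as redundant and keep one visual token per region.
import numpy as np
from scipy.ndimage import label

def select_patch_tokens(image: np.ndarray, patch: int = 14, tol: float = 2.0) -> list[int]:
    """Return flat indices of patches to keep: every informative patch,
    plus one representative per connected region of near-uniform color."""
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    grid = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, 3)
    # A patch is "redundant" if its pixels are nearly the same color.
    redundant = grid.std(axis=(1, 3)).mean(axis=-1) < tol          # (gh, gw) boolean mask
    components, n = label(redundant)                               # 4-connected regions
    flat = components.reshape(-1)
    keep = [i for i, c in enumerate(flat) if c == 0]               # keep informative patches
    keep += [int(np.flatnonzero(flat == c)[0]) for c in range(1, n + 1)]  # one token per flat region
    return sorted(keep)
```

On screenshots dominated by flat backgrounds, pruning like this removes a large fraction of visual tokens before they reach the language model, which is where the computational savings come from.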
Raw notes: GUI agents are an active area of research. We have seen Claude making the first move here among frontier labs. OpenAI is probably announcing something soon.
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
Allen Institute for AI; University of Washington
This paper introduces TÜLU 3, a family of post-trained language models that improves on existing models, including some proprietary ones, by using advanced techniques like Reinforcement Learning with Verifiable Rewards. It offers transparency and reproducibility via openly shared resources, making it a significant contribution to the field of language model post-training. The work is positioned as a precursor to future advancements like OLMO 2 and sparks interest in open research on test-time scaling.
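The verifiable-rewards idea is simple enough to sketch: instead of scoring outputs with a learned reward model, correctness is checked programmatically and used as a binary reward during RL. The answer-extraction regex and exact-match rule below are illustrative assumptions, not TÜLU 3's actual code.

```python
# Minimal sketch of a "verifiable reward" for RL with verifiable rewards (RLVR).
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last number-like span out of a completion (hypothetical convention)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer exactly matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# During RL (e.g., PPO), this check replaces a reward-model score for prompts whose
# correctness can be verified automatically, such as grade-school math problems.
print(verifiable_reward("... so the total is 42", "42"))  # 1.0
```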
Raw notes: Open research contribution from Ai2, focusing on post-training. This leads up to OLMO 2. Request to Ai2: open research on test-time scaling (i.e., an open version of o1).
Star Attention: Efficient LLM Inference over Long Sequences
NVIDIA
This paper introduces Star Attention, which cleverly optimizes inference for Transformer-based LLMs dealing with long sequences. By employing a block-sparse approximation and sharding the context across hosts, it achieves significant gains in computational efficiency and memory usage without sacrificing accuracy. I am impressed by the reported up to 11x reduction in memory requirements and inference time, highlighting a promising approach for handling extensive data with large models.
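The piece that is easiest to sketch is the second phase, where each host attends the query over its local KV shard and the partial results are combined exactly using log-sum-exp statistics; the first phase (blockwise local encoding of the context with an anchor block) is omitted here, and the single-head NumPy layout is a simplification of mine.

```python
# Sketch of distributed softmax aggregation across hosts, in the spirit of Star Attention.
import numpy as np

def local_attention(q, k, v):
    """Attention of q (d,) over one host's shard k, v (n, d); returns (output, log-sum-exp)."""
    scores = k @ q / np.sqrt(q.shape[-1])            # (n,)
    lse = np.logaddexp.reduce(scores)                # log of the sum of exp(scores)
    weights = np.exp(scores - lse)                   # softmax over the local shard only
    return weights @ v, lse

def global_attention(q, shards):
    """Combine per-shard results into the exact global attention output."""
    outputs, lses = zip(*(local_attention(q, k, v) for k, v in shards))
    lses = np.array(lses)
    global_lse = np.logaddexp.reduce(lses)
    coeffs = np.exp(lses - global_lse)               # each shard's share of total softmax mass
    return sum(c * o for c, o in zip(coeffs, outputs))

# Sanity check: the sharded result equals attention over the concatenated sequence.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
k, v = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
full, _ = local_attention(q, k, v)
sharded = global_attention(q, [(k[:16], v[:16]), (k[16:], v[16:])])
assert np.allclose(full, sharded)
```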
Raw notes: Evolution of the attention mechanism from the efficiency standpoint.
O1 Replication Journey – Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
Shanghai Jiao Tong University; SII; NYU; Generative AI Research Lab (GAIR)
This paper explores the replication of OpenAI’s O1 model, revealing that a distillation approach can indeed surpass the performance of O1-preview on complex mathematical reasoning tasks. While the findings highlight the strengths of knowledge distillation paired with supervised fine-tuning, the authors urge the AI community to prioritize transparency and a solid grasp of foundational AI principles over shortcuts. I appreciate the paper’s critical perspective on the trade-offs involved in relying heavily on distillation techniques, as well as the call for open and accessible AI research.
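For readers unfamiliar with the recipe under discussion, a minimal sketch of distill-then-SFT follows. The use of an OpenAI-compatible endpoint, the o1-preview teacher name, and the JSONL output path are illustrative assumptions; the student fine-tuning step itself is left to any standard SFT pipeline.

```python
# Sketch of distillation data collection: sample reasoning traces from a stronger
# "teacher" model, then fine-tune a smaller student on (prompt, trace) pairs via SFT.
import json
from openai import OpenAI

client = OpenAI()

def build_distillation_set(prompts: list[str], teacher: str = "o1-preview") -> list[dict]:
    """Collect teacher responses as SFT examples in a chat-messages format."""
    examples = []
    for prompt in prompts:
        reply = client.chat.completions.create(
            model=teacher,
            messages=[{"role": "user", "content": prompt}],
        )
        examples.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": reply.choices[0].message.content},
            ]
        })
    return examples

if __name__ == "__main__":
    data = build_distillation_set(["Prove that the sum of two odd integers is even."])
    with open("distill_sft.jsonl", "w") as f:
        f.write("\n".join(json.dumps(ex) for ex in data))
    # The resulting JSONL can be fed to any standard SFT trainer; the student then
    # imitates the teacher's traces, which is exactly the shortcut the authors caution against.
```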
Raw notes: A manifesto from Chinese researchers advocating for open research, criticizing closed US-based labs (Anthropic, OpenAI). Key problem: Can o1 be distilled? This seems like an attractive shortcut on the surface. I suspect distillation has limitations that are not fully captured by the evaluations reported here.
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
Shanghai AI Laboratory; Shanghai Jiao Tong University; Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences; Nanjing University; East China Normal University; Fudan University; Xiamen University; Monash University; University of Washington; University of Cambridge; Stanford University
This paper presents GMAI-VL, an innovative vision-language model tailored for the medical field, alongside the GMAI-VL-5.5M dataset, containing substantial image-text data from specialized medical datasets. Through a novel three-stage training strategy, the model achieves impressive results in visual question answering and medical image diagnosis tasks, paving the way for advancements in general medical AI. I appreciate the authors’ commitment to open science, as they plan to make the code and dataset publicly accessible.
Raw notes: Chinese researchers are catching up, and medical AI domain is no exception. Data/code/model weights are open.
Large Language Model-Brained GUI Agents: A Survey
Microsoft AI; Microsoft, China
This paper provides a comprehensive survey of GUI agents powered by large language models, exploring how these agents use natural language commands to interact with graphical user interfaces. It effectively charts their development, vital components, and the methods used to train and test them. I find the identification of research gaps and proposed future directions particularly valuable for anyone looking to contribute to this fast-growing field.
Raw notes: As discussed earlier, GUI agents are a hot area. This massive survey is a useful resource.
University of the Punjab; University of Maryland, Baltimore County; Massachusetts Institute of Technology; Friedrich-Schiller-Universität Jena; McGill University; Acceleration Consortium; University of Liverpool; University of Toronto; University of Cambridge; University of California at Berkeley; University of Illinois at Chicago; University of Houston; EPFL; iteratec GmbH; University of Chicago; Lawrence Berkeley National Laboratory; Duke University; Humboldt University of Berlin; Technology University of Darmstadt; Argonne National Laboratory; University of Southern California; Lam Research; Université catholique de Louvain; Matgenix SRL; Queen’s University; CNR Institute for Microelectronics and Microsystems; University of California at Los Angeles; Helmholtz-Zentrum Berlin für Materialien und Energie GmbH; Soley Therapeutics; Brandeis University; Kleiner Perkins; Schott; University of Utah; Tokyo Institute of Technology; Factorial Energy; Molecular Forecaster; EP Analytics, Inc.; ETH Zurich; Fordham University; Carnegie Mellon University; University of Amsterdam; IDEAS NCBR; Federal Institute of Materials Research and Testing (BAM); Università degli Studi di Milano
This paper offers a fascinating glimpse into the dynamic applications of large language models in the realm of materials science and chemistry, as demonstrated by the 2024 LLM Hackathon. It’s impressive to see how LLMs have evolved into versatile tools for machine learning and rapid prototyping, engaging an international array of teams. The detailed summaries and resources provided make it a valuable read for anyone interested in the integration of AI within scientific research.
Raw notes: A good snapshot of what’s going on in uses of AI/LLMs in materials science and chemistry.
VisualLens: Personalization through Visual History
Meta; University of Southern California
This paper introduces VisualLens, a cutting-edge personalization technique that leverages users’ visual histories to better understand their interests and enhance recommendation accuracy. I found it particularly impressive how the authors addressed the challenge of managing diverse and irrelevant visual data, leading to improvements over existing recommendation systems. Though it’s mainly applicable to large tech companies, the results show notable gains, with a 5-10% improvement on Hit@3 and performance beyond the capabilities of GPT-4o.
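One way to picture the "irrelevant visual data" problem is as a retrieval step over the visual history. The toy sketch below scores precomputed image embeddings against a query embedding and keeps only the top-k; the embedding source and the cutoff are assumptions of mine, not VisualLens's actual pipeline.

```python
# Toy sketch: filter a user's visual history down to the images most relevant to
# the current recommendation request, using cosine similarity of precomputed embeddings.
import numpy as np

def filter_history(history_emb: np.ndarray, query_emb: np.ndarray, k: int = 16) -> np.ndarray:
    """Return indices of the k history images most similar to the query embedding."""
    h = history_emb / np.linalg.norm(history_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return np.argsort(h @ q)[::-1][:k]
```

The downstream recommendation model then only ever sees this filtered subset of the history rather than every photo the user has taken.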
Raw notes: This is the type of research that sees practical use only at big tech companies such as MAANG.
Acknowledgements
Papers are retrieved from Hugging Face.