Overview
The collection of papers primarily highlights advances in multimodal models and the growing challenge of evaluating such complex systems with sufficiently sophisticated benchmarks. Papers like Baichuan-Omni and Movie Gen introduce cutting-edge models that integrate multiple data types, enriching interaction and media generation, while others propose novel benchmarks such as MixEval-X and MMIE for evaluating model performance across diverse tasks and modalities. A recurring theme is the difficulty of evaluation (“Eval is hard”), evidenced by a host of new benchmarks tailored to tasks ranging from synthetic data detection (LOKI) to Olympiad-level mathematics (Omni-MATH) and agent-based data science (DA-Code). The innovation extends to evaluation methods themselves: Agent-as-a-Judge proposes a framework in which agents evaluate other agents, pointing to a trend towards self-assessment and more rigorous evaluation processes. Collectively, these papers underscore significant strides in both model capabilities and evaluation rigor, aimed at a more comprehensive understanding of multimodal AI systems.
Screenshot of the week: o1-preview and o1-mini are in a different league in JudgeBench
Spotlight
JudgeBench: A Benchmark for Evaluating LLM-based Judges
UC Berkeley; Washington University in St. Louis
This paper introduces JudgeBench, a novel benchmark aimed at evaluating LLM-based judges on factual and logical correctness rather than alignment with human preferences. The authors critique existing benchmarks and emphasize the importance of assessing judges in more challenging scenarios. I find the pipeline for generating difficult response pairs particularly innovative, offering a rigorous way to gauge the capabilities of LLM judges. Notably, despite the benchmark’s difficulty, OpenAI’s o1 model still outperformed the others, underscoring its robustness. As a contribution, this paper highlights the evolving art and creativity involved in crafting effective benchmarks for AI evaluation.
Raw notes: Yet another benchmark, this time to evaluate LLM judges. The pipeline to create the eval data is interesting. Noteworthy that OpenAI’s o1 dominates in this challenging benchmark. Increasingly, there is an art of generating benchmark/eval data with a lot of creativity.
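To make the pair-generation idea concrete, here is a minimal, hypothetical sketch of how hard evaluation pairs and a judge score could be constructed. The helpers `sample_response`, `is_correct`, and `judge_prefers_first` are placeholders standing in for a model call, an answer checker, and the judge under test; this illustrates the general recipe, not the paper’s released pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class JudgePair:
    question: str
    correct: str    # response verified correct against a known answer
    incorrect: str  # plausible-looking but verifiably wrong response

def sample_response(question: str) -> str:
    """Placeholder: call a strong LLM to draft an answer."""
    raise NotImplementedError

def is_correct(question: str, response: str) -> bool:
    """Placeholder: check the response against a ground-truth answer."""
    raise NotImplementedError

def build_pair(question: str, n_samples: int = 8) -> JudgePair | None:
    """Sample several responses and keep one correct / one incorrect response.

    A pair only exists when the question is hard enough that the model both
    succeeds and fails across samples, which is what makes the pair
    discriminative for judging.
    """
    responses = [sample_response(question) for _ in range(n_samples)]
    good = [r for r in responses if is_correct(question, r)]
    bad = [r for r in responses if not is_correct(question, r)]
    if not good or not bad:
        return None
    return JudgePair(question, random.choice(good), random.choice(bad))

def judge_accuracy(pairs: list[JudgePair], judge_prefers_first) -> float:
    """Score a judge by how often it prefers the verified-correct response.

    Order is randomized so a position-biased judge cannot score well by luck.
    """
    hits = 0
    for p in pairs:
        if random.random() < 0.5:
            hits += judge_prefers_first(p.question, p.correct, p.incorrect)
        else:
            hits += not judge_prefers_first(p.question, p.incorrect, p.correct)
    return hits / len(pairs)
```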
Spotlight
Movie Gen: A Cast of Media Foundation Models
Meta
This paper presents Movie Gen, an impressive set of media foundation models that can generate high-quality HD videos with synchronized audio, with advanced features such as instruction-based editing and personalized video creation. It describes a 30-billion-parameter transformer model that achieves state-of-the-art performance, outperforming offerings from major players like OpenAI (Sora) and several startups. The authors emphasize scaling of training data, compute, and model parameters, relying primarily on a simple Transformer-based model trained with Flow Matching. While Meta has demonstrated its prowess and potential to outpace competitors with substantial resources, it’s worth noting that these models are not yet available to the public. This could be a game-changer in media generation, opening up avenues for further research and innovation in this exciting field.
Raw notes: Absolute monster of a paper, almost 100 pages long, from Meta AI. Casually flexed SOTA performance, beating everyone from OpenAI Sora to a host of hot startups: Runway, Luma, Pika, Eleven Labs. On the technical front: “We find that scaling the training data, compute, and model parameters of a simple Transformer-based model trained with Flow Matching yields high quality generative models for video or audio”. So attention and flow matching are all you need? OpenAI may have shown the way, but Meta has emerged as a formidable fast follower with much better resources. Caveat: the models are not available to the public yet.
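Since the notes call out Flow Matching as the core objective, here is a minimal, generic flow-matching training step in PyTorch. It is a toy illustration of the idea (regress the network’s predicted velocity onto the straight-line velocity between noise and data), assuming the common linear interpolation path; the small MLP is only a stand-in for Movie Gen’s 30B transformer over media latents, and none of this is Meta’s actual code.

```python
import torch
import torch.nn as nn

# Toy stand-in for the media backbone: predicts a velocity field v(x_t, t).
# Movie Gen uses a large transformer over video/audio latents; this MLP is
# only here to make the objective concrete.
model = nn.Sequential(nn.Linear(17, 128), nn.SiLU(), nn.Linear(128, 16))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def flow_matching_step(x1: torch.Tensor) -> float:
    """One training step of flow matching on a batch of data x1."""
    x0 = torch.randn_like(x1)                    # noise sample
    t = torch.rand(x1.size(0), 1)                # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                   # point on the straight path
    target_v = x1 - x0                           # constant velocity along the path
    pred_v = model(torch.cat([xt, t], dim=-1))   # network conditions on (x_t, t)
    loss = (pred_v - target_v).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example: the "data" here is just random 16-d vectors standing in for latents.
for _ in range(3):
    print(flow_matching_step(torch.randn(32, 16)))
```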
Other papers
Baichuan-Omni Technical Report
Baichuan Inc.; Westlake University; Zhejiang University
This paper introduces Baichuan-Omni, touted as the first open-source 7-billion-parameter model that can handle multimodal input spanning images, videos, audio, and text, aiming to enhance interactive experiences. Despite its modest size, it demonstrates strong performance across various benchmarks, though the model is still undergoing a security review before release. I find it exciting that a model of this nature is being developed with an open-source approach, potentially setting a new standard in multimodal large language models.
Raw notes: First truly multimodal, open-weight model, courtesy of one of China’s AI Tigers. It’s a modest 7B model, with correspondingly modest benchmark numbers. The model is still “under security review” and not released yet.
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
National University of Singapore; Nanyang Technological University; Peking University; Carnegie Mellon University; University of Waterloo; Independent Researcher
This paper introduces MixEval-X, a new benchmark framework that aims to standardize evaluations for AI models dealing with varied input and output formats. I’m intrigued by how MixEval-X addresses the inconsistencies in current evaluation methods and apparently provides meaningful improvements by comparing model rankings with crowd-sourced assessments. The framework seems particularly useful for advancing research in multi-modal AI, and it’s noteworthy that Claude Sonnet 3.5 reportedly performs the best compared to other models like GPTs.
Raw notes: Eval is hard. Eval of multimodal foundation models is the hardest. This paper claims to provide SOTA eval benchmark. Claude Sonnet 3.5 seems to perform best, slightly ahead of GPTs.
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Sun Yat-sen University; Shanghai AI Laboratory; SenseTime Research; The Chinese University of Hong Kong; The Chinese University of Hong Kong (Shenzhen)
This paper introduces LOKI, a rigorous benchmark to test the effectiveness of large multimodal models in detecting synthetic data spanning video, image, text, and audio. With 18,000 curated questions, it offers a comprehensive evaluation of how well these models can discern real from fake content and understand the reasoning behind their judgments. I appreciate the paper’s balanced view, showcasing both the strengths and weaknesses of current models in performing synthetic data detection tasks.
Raw notes: Eval is hard. Eval of synthetic detection is no exception. LOKI provides a solution.
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
UNC-Chapel Hill; Microsoft Research; University of Chicago; NUS
This paper presents MMIE, a new benchmark aiming to assess the abilities of large vision-language models in handling interleaved multimodal comprehension and generation effectively. With 20,000 thoughtfully curated queries, it seeks to address gaps in existing benchmarks and provides an automated evaluation metric to enhance robustness. Despite its innovation, results indicate that current models have substantial room for growth, emphasizing the benchmark’s potential to push research boundaries.
Raw notes: Eval is hard. How many times have I said that? Yet another benchmark, this time for interleaved multimodality.
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
Tsinghua University; University of Toronto; The Chinese University of Hong Kong; Tencent Robotics X
This paper introduces VidEgoThink, a benchmark specifically developed to evaluate how well Multi-modal Large Language Models (MLLMs) understand egocentric video content in the field of Embodied AI. It outlines four tasks that aim to connect language models to basic control functions, revealing that current models struggle significantly with these tasks. I appreciate the paper’s contribution to highlighting the existing gaps in AI capabilities and look forward to seeing how this benchmark will drive future improvements.
Raw notes: Eval is hard. This is the 4th consecutive benchmark paper in the top 6 most upvoted papers.
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Alibaba Group; Skywork AI; HKUST(GZ); HKUST; Zhejiang University; UC Berkeley
This paper introduces Meissonic, a cutting-edge model that revitalizes masked image modeling for more efficient and high-quality text-to-image synthesis. By leveraging innovative architecture and optimization strategies along with robust training data, it competes with, and frequently outperforms, top diffusion models in generating high-resolution imagery. I found the approach promising, as it not only enhances the efficiency of image generation but also maintains high-quality outputs.
Raw notes: In image generation, diffusion has been dominant. New alternative approaches are arriving, including flow matching and non-autoregressive masked modeling.
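For readers unfamiliar with non-autoregressive masked modeling, below is a rough sketch of the MaskGIT-style decoding loop that this family of models builds on: start from a fully masked token sequence, predict all tokens in parallel, commit the most confident predictions, and re-mask the rest on a schedule. The `predict_logits` callable is a placeholder for the actual model; this is a generic illustration of the technique, not Meissonic’s implementation.

```python
import math
import torch

def maskgit_decode(predict_logits, seq_len: int = 256, mask_id: int = 8192, steps: int = 8):
    """Sketch of MaskGIT-style non-autoregressive decoding: start fully masked,
    predict all tokens in parallel each step, commit the most confident ones,
    and re-mask the rest on a cosine schedule.
    `predict_logits(tokens) -> (seq_len, vocab)` is a placeholder for the model."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(1, steps + 1):
        still_masked = tokens == mask_id
        logits = predict_logits(tokens)                 # (seq_len, vocab)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                  # greedy for simplicity
        tokens = torch.where(still_masked, pred, tokens)
        if step == steps:
            break                                       # everything committed
        # Cosine schedule: how many positions stay masked for the next step.
        n_mask = int(seq_len * math.cos(math.pi / 2 * step / steps))
        conf[~still_masked] = float("inf")              # never re-mask committed tokens
        remask = conf.argsort()[:n_mask]                # least confident predictions
        tokens[remask] = mask_id
    return tokens

# Toy usage: a "model" that returns random logits over a 16-token vocabulary.
out = maskgit_decode(lambda t: torch.randn(t.numel(), 16), seq_len=32, mask_id=16, steps=4)
print(out)
```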
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Simon Fraser University; University of Waterloo
This paper introduces MEGA-Bench, a new comprehensive evaluation suite designed to test multimodal models across a broad range of real-world tasks. By evaluating models with diverse output formats and extensive metrics, it offers a more nuanced understanding of model performance. I find the approach particularly valuable for its detailed analysis capabilities, offering improved insights into model behavior compared to other benchmarks.
Raw notes: Yet another benchmark paper. GPT-4o is the leader, with Claude Sonnet 3.5 second.
Roadmap towards Superhuman Speech Understanding using Large Language Models
The Chinese University of Hong Kong, Shenzhen; Noah’s Ark Lab, Huawei
This paper proposes a comprehensive roadmap to achieve superhuman speech understanding by integrating speech and audio data with large language models. It introduces a five-level framework that progresses from basic speech recognition to more sophisticated models capable of understanding non-semantic and abstract acoustic knowledge. The work is notable for introducing the SAGI Benchmark as a tool to evaluate these advancements, though it highlights that there are still significant challenges, particularly in managing paralinguistic cues.
Raw notes: Benchmarking LLMs against 5 levels of speech understanding. Significant gaps exist.
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Peking University; University of Wisconsin - Madison; Alibaba Group; Shanghai Jiao Tong University; Engineering Research Center of Information Networks; The Chinese University of Hong Kong, Shenzhen; Institute of Software, Chinese Academy of Sciences; University of Waterloo; The University of Hong Kong; Zhongguancun Laboratory
This paper introduces Omni-MATH, a benchmark specifically crafted to assess large language models’ capabilities in solving Olympiad-level mathematics problems. Even the strongest current models achieve only just over 60% accuracy on these problems, reflecting the challenges of high-level mathematical reasoning. I appreciate how the paper highlights the limitations of existing benchmarks and underscores the steep climb ahead for LLMs in mastering advanced mathematics.
Raw notes: For the n-th time this week, it’s … benchmark time! Math is the focus. OpenAI’s o1 absolutely destroys the field, yet still only performs at around 60% accuracy.
LiveXiv – A Multi-Modal Live Benchmark Based on Arxiv Papers Content
Tel-Aviv University; IBM Research; University of Michigan, USA; JKU Linz, Austria; TU Graz, Austria; MIT-IBM
This paper presents LiveXiv, an innovative live benchmark that efficiently evaluates multi-modal models using automatically generated content from ArXiv papers. I find it impressive that the approach enables the creation of visual question-answer pairs without human oversight, ensuring scalability and up-to-date testing. The paper successfully illustrates the benchmark’s challenges and contributions by showcasing evaluations of various models, emphasizing its impact on the field.
Raw notes: For the n+1st time this week, it’s benchmark time! Automatically created from arXiv papers. Claude Sonnet’s dominance is particularly noteworthy.
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Tsinghua University; ModelBest Inc.; Rice University; Northeastern University
This paper presents VisRAG, a groundbreaking system that integrates visual information into the retrieval-augmented generation process for documents with multiple modalities. By using a vision-language model to process documents as images, VisRAG retains crucial layout and image data that traditional methods often miss. While the experimental results are promising, showing a notable performance boost over standard RAG systems, its practical effectiveness outside the lab is still to be fully validated.
Raw notes: The concept makes sense, but how well it works in practice remains to be seen.
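To illustrate the idea described above, here is a minimal sketch of a vision-based RAG loop: embed whole page images rather than parsed text chunks, retrieve the most relevant pages, and let a vision-language model generate the answer. The functions `embed_image`, `embed_text`, and `vlm_generate` are hypothetical placeholders for a multimodal embedder and generator, not the paper’s actual API.

```python
import numpy as np

# Hypothetical hooks: VisRAG uses a VLM-based embedder and a vision-language
# generator; these placeholders only mark where those calls would go.
def embed_image(page_image) -> np.ndarray:
    """Placeholder: embed a document page image into a vector."""
    raise NotImplementedError

def embed_text(query: str) -> np.ndarray:
    """Placeholder: embed the user query into the same vector space."""
    raise NotImplementedError

def vlm_generate(query: str, page_images: list) -> str:
    """Placeholder: answer the query given the retrieved page images."""
    raise NotImplementedError

def visrag_answer(query: str, page_images: list, k: int = 3) -> str:
    """Vision-based RAG sketch: retrieve whole page *images* (preserving the
    layout, figures, and tables that text parsing would lose), then generate
    an answer conditioned on the retrieved pages."""
    q = embed_text(query)
    scores = []
    for img in page_images:
        d = embed_image(img)
        scores.append(float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d))))
    top = sorted(range(len(page_images)), key=lambda i: -scores[i])[:k]
    return vlm_generate(query, [page_images[i] for i in top])
```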
A Comparative Study on Reasoning Patterns of OpenAI’s o1 Model
M-A-P; University of Manchester; OpenO1 Team; 077AI; Abaka AI; Zhejiang University; University of Chinese Academy of Sciences
This paper delves into how OpenAI’s o1 model stacks up against existing Test-time Compute methods, particularly highlighting its superior performance in areas like math, coding, and commonsense reasoning. It doesn’t shy away from discussing the model’s limitations, particularly in reward models and search methods. Furthermore, it provides an insightful categorization and analysis of six distinct reasoning patterns the o1 model exhibits, offering a nuanced view of its capabilities.
Raw notes: Interesting read analyzing the performance of OpenAI’s o1 versus other test time compute methods.
Agent-as-a-Judge: Evaluate Agents with Agents
Meta AI; KAUST
This paper presents an innovative framework, Agent-as-a-Judge, designed to improve the evaluation of agentic systems by having agents assess each other, addressing the inefficiencies and blind spots of traditional methods. By applying this approach to the DevAI benchmark for code generation tasks, the study shows that the framework surpasses existing evaluation techniques in performance and reliability, providing agentic systems with valuable feedback mechanisms for self-improvement. I find it particularly noteworthy that the research is a collaboration between Meta AI and KAUST, with Jürgen Schmidhuber as the last author, hinting at a significant advancement in AI evaluation methodologies.
Raw notes: Interesting concept: agent-as-a-judge. Also interesting collaboration between Meta AI and KAUST. Really interesting that Jürgen Schmidhuber is the last author.
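A rough, hypothetical sketch of the agent-as-a-judge idea: instead of a single holistic LLM-as-a-judge call, a judge agent checks the evaluated system’s output requirement by requirement, gathering evidence before each verdict. Here `gather_evidence` and `llm_verdict` are placeholders for the judge agent’s tools and LLM call; the real framework is considerably richer (file inspection, execution, memory).

```python
from dataclasses import dataclass

@dataclass
class Requirement:
    rid: str
    description: str

def gather_evidence(workspace_dir: str, requirement: Requirement) -> str:
    """Placeholder: let the judge agent inspect files or run code relevant
    to one requirement and summarize what it finds."""
    raise NotImplementedError

def llm_verdict(requirement: Requirement, evidence: str) -> bool:
    """Placeholder: ask an LLM whether the evidence satisfies the requirement."""
    raise NotImplementedError

def agent_as_a_judge(workspace_dir: str, requirements: list[Requirement]) -> dict[str, bool]:
    """Judge an agentic system's output requirement by requirement rather than
    with one holistic judgment call, producing intermediate, auditable verdicts
    that the judged system can also use as feedback for self-improvement."""
    return {
        r.rid: llm_verdict(r, gather_evidence(workspace_dir, r))
        for r in requirements
    }
```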
MMLab, The Chinese University of Hong Kong; National Key Laboratory for Novel Software Technology, Nanjing University; Beijing Institute of Technology
This paper presents a novel framework called Retrieval Augmented Personalization (RAP) that effectively personalizes multimodal large language models by integrating user-specific data and generating customized responses. I find it compelling how RAP enables the real-time adjustment of concepts and enhances response quality without a need for further fine-tuning. The approach shows promising potential for applications in tasks like image captioning and question answering, paving the way for more adaptable and personalized AI assistants.
Raw notes: Very early research into personalization of AI.
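As a toy illustration of the retrieval-augmented personalization idea (and only that: names like `PersonalMemory` are invented here, and RAP’s actual retrieval is multimodal rather than this keyword match), the sketch below keeps a user-editable concept store and prepends retrieved entries to the prompt, so personalization changes take effect without any fine-tuning.

```python
from dataclasses import dataclass, field

@dataclass
class UserConcept:
    name: str          # e.g. "Momo"
    description: str   # user-provided facts about the concept

@dataclass
class PersonalMemory:
    """A user-editable concept store; adding or removing entries changes the
    assistant's behaviour immediately, with no fine-tuning pass."""
    concepts: list[UserConcept] = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> list[UserConcept]:
        # Toy lexical relevance; RAP itself retrieves over user images and text.
        words = query.lower().split()
        def score(c: UserConcept) -> int:
            text = (c.name + " " + c.description).lower()
            return sum(w in text for w in words)
        return sorted(self.concepts, key=score, reverse=True)[:k]

def personalized_prompt(memory: PersonalMemory, question: str) -> str:
    """Prepend retrieved user-specific concepts so a stock MLLM can answer with
    personal context (the actual generation call is omitted here)."""
    notes = "\n".join(f"- {c.name}: {c.description}" for c in memory.retrieve(question))
    return f"Known user concepts:\n{notes}\n\nQuestion: {question}"

mem = PersonalMemory([UserConcept("Momo", "the user's golden retriever, loves the beach")])
print(personalized_prompt(mem, "Where does Momo like to go?"))
```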
DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; University of California, Davis; Microsoft Research Asia; Shanghai Artificial Intelligence Laboratory
This paper introduces the DA-Code benchmark designed to assess the capability of large language models in executing intricate agent-based data science tasks. What stands out is that even the top-performing models in this evaluation only managed to achieve an accuracy of 30.5%, showing significant room for improvement. The benchmark is made accessible online, signaling an invitation for further research to bridge these performance gaps.
Raw notes: For the umpteenth time this week, another benchmark paper. Notably missing o1 models in the eval.
Acknowledgements
Papers are retrieved from Hugging Face.