Overview
The papers collectively highlight advancements in vision-language models, large language models (LLMs), and their applications across diverse domains, including graphical user interfaces (GUIs), medical AI, and materials science. Notably, innovations like ShowUI and GMAI-VL demonstrate the integration of multimodal capabilities to optimize interactions and diagnostics, enhancing efficiency and performance in specialized tasks. The exploration of LLMs as evaluative tools in “LLM-as-a-judge” and their role in advancing materials science underscores their transformative potential and versatility. Efficiency and training methodology also feature prominently, with TÜLU 3 addressing post-training refinement and Star Attention focusing on memory and computational optimization. Overall, the papers reflect an overarching emphasis on openness, transparency, and the responsible development and deployment of AI technologies, with diverse contributions from both Western and Chinese research communities.
Spotlight
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Arizona State University; University of Illinois Chicago; University of Maryland, Baltimore County; Illinois Institute of Technology; University of California, Berkeley; Emory University
This paper dives into the intriguing concept of using Large Language Models (LLMs) as judges for AI assessments, shaking up traditional evaluation methods. It lays out a detailed taxonomy of judgment dimensions while thoughtfully examining both the opportunities and hurdles in adopting LLMs for this purpose. I appreciate how it identifies benchmarks for evaluation, which could be instrumental for future research and applications. The discussion on challenges and future directions is particularly insightful, helping to map out potential areas of exploration. Overall, this is a valuable resource for anyone interested in the growing role of LLMs in judgment and evaluation.
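To make the setup concrete, here is a minimal sketch of a pairwise LLM-as-a-judge call. The prompt wording, the gpt-4o-mini model choice, and the verdict parsing are my own illustrative assumptions, not anything prescribed by the survey.

```python
# Minimal sketch of pairwise "LLM-as-a-judge" evaluation (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
decide which answer is better. Reply with exactly "A", "B", or "TIE".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str, model: str = "gpt-4o-mini") -> str:
    """Ask the judge model to pick the better answer; returns 'A', 'B', or 'TIE'."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0.0,  # deterministic verdicts make evaluation easier to reproduce
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # fall back on unparseable output
```

In practice, verdicts are usually collected in both A/B orders and averaged, a common guard against the well-documented position bias of LLM judges.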
Raw notes: LLM-as-a-judge is fast becoming an indispensable tool for practitioners. This survey is a welcome resource and a must-read.
Other papers
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
National University of Singapore; Microsoft
This paper introduces ShowUI, a cutting-edge model aimed at improving GUI assistance through advanced vision-language-action integration. It stands out with innovative approaches like UI-Guided Visual Token Selection, which boosts computational efficiency, and Interleaved Vision-Language-Action Streaming, which enhances task management. I find its 75.1% accuracy in zero-shot screenshot grounding, along with its improved training efficiency, particularly impressive, as it sets a high standard in the rapidly evolving field of GUI agents.
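As a rough illustration of the token-selection idea (not ShowUI's actual implementation, which as I understand it builds a connected graph over screenshot patches based on their RGB values), the sketch below groups near-uniform patches into connected regions and keeps a single token per region; the patch size and uniformity threshold are arbitrary.

```python
# Toy sketch of UI-guided visual token selection: treat flat, near-uniform screenshot
# regions (e.g., backgrounds) as redundant and keep one visual token per region.
import numpy as np
from scipy.ndimage import label

def select_patch_tokens(image: np.ndarray, patch: int = 14, tol: float = 2.0) -> list[int]:
    """Return flat indices of patches to keep: every informative patch,
    plus one representative per connected region of near-uniform color."""
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    grid = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, 3)
    # A patch is "redundant" if its pixels are nearly the same color.
    redundant = grid.std(axis=(1, 3)).mean(axis=-1) < tol          # (gh, gw) boolean mask
    components, n = label(redundant)                               # 4-connected regions
    flat = components.reshape(-1)
    keep = [i for i, c in enumerate(flat) if c == 0]               # keep informative patches
    keep += [int(np.flatnonzero(flat == c)[0]) for c in range(1, n + 1)]  # one token per flat region
    return sorted(keep)
```

On screenshots dominated by flat backgrounds, pruning like this removes a large fraction of visual tokens before they reach the language model, which is where the computational savings come from.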
Raw notes: GUI agents are an active area of research. We have seen Claude making the first move here among frontier labs. OpenAI is probably announcing something soon.
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
Allen Institute for AI; University of Washington
This paper introduces TÜLU 3, a family of post-trained language models that improves on existing models, including some proprietary ones, by using advanced techniques like Reinforcement Learning with Verifiable Rewards. It offers transparency and reproducibility via openly shared resources, making it a significant contribution to the field of language model post-training. The work is positioned as a precursor to future advancements like OLMO 2 and sparks interest in open research on test-time scaling.
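The verifiable-rewards idea is simple enough to sketch: instead of scoring outputs with a learned reward model, correctness is checked programmatically and used as a binary reward during RL. The answer-extraction regex and exact-match rule below are illustrative assumptions, not TÜLU 3's actual code.

```python
# Minimal sketch of a "verifiable reward" for RL with verifiable rewards (RLVR).
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last number-like span out of a completion (hypothetical convention)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer exactly matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# During RL (e.g., PPO), this check replaces a reward-model score for prompts whose
# correctness can be verified automatically, such as grade-school math problems.
print(verifiable_reward("... so the total is 42", "42"))  # 1.0
```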
Raw notes: Open research contribution from Ai2, focusing on post-training. This leads up to OLMO 2. Request to Ai2: open research on test-time scaling (i.e., an open version of o1).
Star Attention: Efficient LLM Inference over Long Sequences
NVIDIA
This paper introduces Star Attention, which cleverly optimizes inference for Transformer-based LLMs dealing with long sequences. By employing a block-sparse approximation and sharding the context across hosts, it achieves significant gains in computational efficiency and memory usage without sacrificing accuracy. I am impressed by the reported up to 11x reduction in memory requirements and inference time, highlighting a promising approach for handling extensive data with large models.
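The piece that is easiest to sketch is the second phase, where each host attends the query over its local KV shard and the partial results are combined exactly using log-sum-exp statistics; the first phase (blockwise local encoding of the context with an anchor block) is omitted here, and the single-head NumPy layout is a simplification of mine.

```python
# Sketch of distributed softmax aggregation across hosts, in the spirit of Star Attention.
import numpy as np

def local_attention(q, k, v):
    """Attention of q (d,) over one host's shard k, v (n, d); returns (output, log-sum-exp)."""
    scores = k @ q / np.sqrt(q.shape[-1])            # (n,)
    lse = np.logaddexp.reduce(scores)                # log of the sum of exp(scores)
    weights = np.exp(scores - lse)                   # softmax over the local shard only
    return weights @ v, lse

def global_attention(q, shards):
    """Combine per-shard results into the exact global attention output."""
    outputs, lses = zip(*(local_attention(q, k, v) for k, v in shards))
    lses = np.array(lses)
    global_lse = np.logaddexp.reduce(lses)
    coeffs = np.exp(lses - global_lse)               # each shard's share of total softmax mass
    return sum(c * o for c, o in zip(coeffs, outputs))

# Sanity check: the sharded result equals attention over the concatenated sequence.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
k, v = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
full, _ = local_attention(q, k, v)
sharded = global_attention(q, [(k[:16], v[:16]), (k[16:], v[16:])])
assert np.allclose(full, sharded)
```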
Raw notes: Evolution of the attention mechanism from the efficiency standpoint.
O1 Replication Journey – Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
Shanghai Jiao Tong University; SII; NYU; Generative AI Research Lab (GAIR)
This paper explores the replication of OpenAI’s O1 model, revealing that a distillation approach can indeed surpass the performance of O1-preview on complex mathematical reasoning tasks. While the findings highlight the strengths of knowledge distillation paired with supervised fine-tuning, the authors urge the AI community to prioritize transparency and a solid grasp of foundational AI principles over shortcuts. I appreciate the paper’s critical perspective on the trade-offs involved in relying heavily on distillation techniques, as well as the call for open and accessible AI research.
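For readers unfamiliar with the recipe under discussion, a minimal sketch of distill-then-SFT follows. The use of an OpenAI-compatible endpoint, the o1-preview teacher name, and the JSONL output path are illustrative assumptions; the student fine-tuning step itself is left to any standard SFT pipeline.

```python
# Sketch of distillation data collection: sample reasoning traces from a stronger
# "teacher" model, then fine-tune a smaller student on (prompt, trace) pairs via SFT.
import json
from openai import OpenAI

client = OpenAI()

def build_distillation_set(prompts: list[str], teacher: str = "o1-preview") -> list[dict]:
    """Collect teacher responses as SFT examples in a chat-messages format."""
    examples = []
    for prompt in prompts:
        reply = client.chat.completions.create(
            model=teacher,
            messages=[{"role": "user", "content": prompt}],
        )
        examples.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": reply.choices[0].message.content},
            ]
        })
    return examples

if __name__ == "__main__":
    data = build_distillation_set(["Prove that the sum of two odd integers is even."])
    with open("distill_sft.jsonl", "w") as f:
        f.write("\n".join(json.dumps(ex) for ex in data))
    # The resulting JSONL can be fed to any standard SFT trainer; the student then
    # imitates the teacher's traces, which is exactly the shortcut the authors caution against.
```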
Raw notes: A manifesto from Chinese researchers advocating for open research, criticizing closed US-based labs (Anthropic, OpenAI). Key problem: Can o1 be distilled? This seems like an attractive shortcut on the surface. I suspect distillation has limitations that are not fully captured by the evaluations reported here.
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
Shanghai AI Laboratory; Shanghai Jiao Tong University; Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences; Nanjing University; East China Normal University; Fudan University; Xiamen University; Monash University; University of Washington; University of Cambridge; Stanford University
This paper presents GMAI-VL, an innovative vision-language model tailored for the medical field, alongside the GMAI-VL-5.5M dataset, containing substantial image-text data from specialized medical datasets. Through a novel three-stage training strategy, the model achieves impressive results in visual question answering and medical image diagnosis tasks, paving the way for advancements in general medical AI. I appreciate the authors’ commitment to open science, as they plan to make the code and dataset publicly accessible.
Raw notes: Chinese researchers are catching up, and medical AI domain is no exception. Data/code/model weights are open.
Large Language Model-Brained GUI Agents: A Survey
Microsoft AI; Microsoft, China
This paper provides a comprehensive survey of GUI agents powered by large language models, exploring how these agents use natural language commands to interact with graphical user interfaces. It effectively charts their development, vital components, and the methods used to train and test them. I find the identification of research gaps and proposed future directions particularly valuable for anyone looking to contribute to this fast-growing field.
Raw notes: As discussed earlier, GUI agents are a hot area. This massive survey is a useful resource.
University of the Punjab; University of Maryland, Baltimore County; Massachusetts Institute of Technology; Friedrich-Schiller-Universität Jena; McGill University; Acceleration Consortium; University of Liverpool; University of Toronto; University of Cambridge; University of California at Berkeley; University of Illinois at Chicago; University of Houston; EPFL; iteratec GmbH; University of Chicago; Lawrence Berkeley National Laboratory; Duke University; Humboldt University of Berlin; Technology University of Darmstadt; Argonne National Laboratory; University of Southern California; Lam Research; Université catholique de Louvain; Matgenix SRL; Queen’s University; CNR Institute for Microelectronics and Microsystems; University of California at Los Angeles; Helmholtz-Zentrum Berlin für Materialien und Energie GmbH; Soley Therapeutics; Brandeis University; Kleiner Perkins; Schott; University of Utah; Tokyo Institute of Technology; Factorial Energy; Molecular Forecaster; EP Analytics, Inc.; ETH Zurich; Fordham University; Carnegie Mellon University; University of Amsterdam; IDEAS NCBR; Federal Institute of Materials Research and Testing (BAM); Università degli Studi di Milano
This paper offers a fascinating glimpse into the dynamic applications of large language models in the realm of materials science and chemistry, as demonstrated by the 2024 LLM Hackathon. It’s impressive to see how LLMs have evolved into versatile tools for machine learning and rapid prototyping, engaging an international array of teams. The detailed summaries and resources provided make it a valuable read for anyone interested in the integration of AI within scientific research.
Raw notes: A good snapshot of what’s going on in uses of AI/LLMs in materials science and chemistry.
VisualLens: Personalization through Visual History
Meta; University of Southern California
This paper introduces VisualLens, a cutting-edge personalization technique that leverages users’ visual histories to better understand their interests and enhance recommendation accuracy. I found it particularly impressive how the authors addressed the challenge of managing diverse and irrelevant visual data, leading to improvements over existing recommendation systems. Though it’s mainly applicable to large tech companies, the results show notable gains, with a 5-10% improvement on Hit@3 and performance beyond the capabilities of GPT-4o.
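One way to picture the "irrelevant visual data" problem is as a retrieval step over the visual history. The toy sketch below scores precomputed image embeddings against a query embedding and keeps only the top-k; the embedding source and the cutoff are assumptions of mine, not VisualLens's actual pipeline.

```python
# Toy sketch: filter a user's visual history down to the images most relevant to
# the current recommendation request, using cosine similarity of precomputed embeddings.
import numpy as np

def filter_history(history_emb: np.ndarray, query_emb: np.ndarray, k: int = 16) -> np.ndarray:
    """Return indices of the k history images most similar to the query embedding."""
    h = history_emb / np.linalg.norm(history_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return np.argsort(h @ q)[::-1][:k]
```

The downstream recommendation model then only ever sees this filtered subset of the history rather than every photo the user has taken.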
Raw notes: This is the type of research that sees practical use only at big tech companies such as MAANG.
Acknowledgements
Papers are retrieved from Hugging Face.