Weekly paper roundup: Survey of User Interface Design for GenAI (11/4/2024)

Overview

The reviewed papers collectively explore advancements in large language models (LLMs) and their applications across diverse domains, with an emphasis on open-source development and multi-modal integration. Several papers, such as OpenCoder, highlight the importance of open-access models that promote transparency and reproducibility, specifically in code generation and structured reasoning. Others, like OS-ATLAS and AndroidLab, focus on enhancing autonomous GUI agents and benchmarking frameworks across different platforms. Innovations in model architecture, like Mixture-of-Transformers and Hunyuan-Large, underline the need for efficiency and scalability, while personalization and interpretability are examined in both survey and application-based studies. Finally, papers such as Multi-expert Prompting and DynaMath aim to improve language model reliability, safety, and robustness in reasoning, particularly through specialized prompting techniques and new benchmarks.

Spotlight :flashlight:

Survey of User Interface Design and Interaction Techniques in Generative AI Applications

University of California – San Diego; Adobe Research; University of Waterloo; University of Maryland, College Park

      🤗   11

This paper offers a well-rounded survey of user interface designs and interaction techniques specifically tailored for generative AI applications, with an emphasis on user-guided interactions. By presenting a comprehensive taxonomy of interaction patterns, it aims to address a significant gap in the existing literature and serves as a resource for designers and developers interested in the field. I find it intriguing how the paper highlights the shift within Human-Computer Interaction towards Human-AI Interaction. It does an excellent job of lowering the barriers to designing effective generative AI interfaces, making it a valuable contribution. Overall, it stands out as a strong survey from Adobe, offering fresh insights into the evolving dynamics of interaction design in AI applications.

Raw notes: Really good survey from Adobe. Interesting quote: “the field of Human-Computer Interaction (HCI) has shifted much of its focus to study a sub-field of HCI called Human-AI Interaction”


Other papers

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

INF; M-A-P

      🤗   92

This paper presents OpenCoder, an open-access code large language model that rivals proprietary models and is designed to boost scientific research through transparency and reproducibility. The project’s commitment to releasing model weights, inference code, and exhaustive training data marks a significant step forward for AI in code generation, emphasizing collaboration and openness within the AI community. It’s exciting to see contributions from Chinese AI groups like INF, highlighting a trend towards open development and setting a benchmark in the open-source space.

Raw notes: This follows in the footsteps of Ai2’s (M)Olmo: fully open. Contributors come from INF (a rather obscure Chinese company) and Multimodal Art Projection (aka M-A-P). Everything should be reproducible. Benchmark numbers look good. Looking forward to vibe checks in real-world use cases. Huge win for the open-source community. It’s interesting that Chinese AI groups are pushing open development of AI. This week’s most upvoted paper.


ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

Google; National University of Singapore

      🤗   62

This paper introduces ReCapture, a compelling method that transforms user-provided videos by generating new camera angles and cinematic motions, thanks to advances in multiview diffusion models and masked video fine-tuning. The potential here is immense, as it allows users to engage creatively with video editing, offering a professional touch to amateur footage. However, it’s a bit disappointing that the authors did not share the code or data, limiting immediate hands-on exploration.

Raw notes: Very cool research from Google: upload a video, then use AI to generate the content of that video from different camera angles, zoom/tilt/pan/orbit. Anyone could be a cinematographer one day. Bummer that code/data are not shared.


Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Huawei Noah’s Ark; UCL; Technical University of Darmstadt

      🤗   47

This paper introduces Agent K v1.0, a data science agent that excels in automating tasks using a structured reasoning framework, achieving an impressive success rate in Kaggle competitions. I find the approach of leveraging structured reasoning both innovative and promising, especially as it bypasses the traditional need for model fine-tuning across different domains. The absence of shared code is a missed opportunity for wider community engagement, but the results indicate intriguing possibilities in the rapidly advancing field of AI-driven data science.

Raw notes: Noteworthy attempt to create a data science agent. The structured reasoning approach makes sense. While the workflow of a data scientist is currently too complex and challenging for AI to replicate, I expect that this will change rapidly. Watch this space. Bummer that code is not shared. In the real world, this part of data science is often easier than determining what problem to solve, what metrics to measure, etc.


AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

Tsinghua University; Peking University

      🤗   45

This paper presents AndroidLab, a framework designed to enhance the training and benchmarking of Android autonomous agents, which fills a gap in systematic evaluation within this field. Impressively, it reports substantial improvements in agent performance by employing an Android Instruction dataset. While the advances are noteworthy, the practical implications and broader applicability of such autonomous agents, especially compared to desktop/laptop environments, remain uncertain.

Raw notes: It’s unclear to me if this line of research will have practical uses. There’s a lot more potential for agents operating on desktop/laptop computers.


OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Shanghai AI Laboratory; Shanghai Jiaotong University; The University of Hong Kong; MIT

      🤗   44

This paper introduces OS-ATLAS, a powerful open-source model that takes the capabilities of generalist GUI agents to a new level by improving their performance in GUI grounding and handling Out-Of-Distribution tasks. With a toolkit for synthesizing data across different platforms and a massive dataset of over 13 million GUI elements, the paper sets a strong foundation for advancing the field. The evaluation shows impressive performance gains over existing models and provides valuable insights for future development in vision-language models.

Raw notes: GUI agents are getting a lot of love, with the recent release from Anthropic under the name “computer use”. This work provides an open-source platform (action model, UI grounding, datasets) to create GUI agents. And yes, it’s from Chinese institutions, again.


“Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization

Neural Magic; Institute of Science and Technology Austria

      🤗   44

This paper sheds light on the balance between accuracy and performance when using different quantization formats (FP8, INT8, INT4) for large language models, particularly the Llama-3.1 family. It effectively demonstrates how FP8 can retain accuracy across the board, while INT8 and INT4 can offer competitive alternatives with specific tuning for efficiency. I find it particularly useful that the authors provide practical guidelines to choose the best quantization strategy based on specific deployment environments and model size needs, making it a valuable read for those working with large-scale language models and aiming for performance optimization.

Raw notes: Good read on tradeoffs with quantization.
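For readers less familiar with the formats being compared, here is a minimal sketch (my own illustration, not the authors' code) of symmetric per-tensor INT8 weight quantization, the simplest of the schemes discussed, showing how the scale is chosen and how much round-trip error it introduces.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for one LLM layer.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(f"scale={scale:.6f}  mean abs error={np.abs(w - w_hat).mean():.6e}")
```

Real deployments typically go further (per-channel or per-group scales, activation quantization for INT8/FP8, group-wise INT4 weights), and those choices are where the accuracy-performance trade-offs studied in the paper come from.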


Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

FAIR at Meta; Stanford University, Department of Computer Science

      🤗   40

This paper presents a new architecture called Mixture-of-Transformers (MoT), which smartly reduces computational demands while maintaining strong performance across various modalities like text, image, and speech. I find the approach impressive because it skillfully decouples non-embedding parameters by modality, achieving or surpassing the quality of traditional dense models with far less computational effort. The paper is a solid contribution to the ongoing discussion on optimizing transformer efficiency, especially for large-scale multi-modal tasks.

Raw notes: Training efficiency improvement for multi-modal models.
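To make the core idea concrete, here is a minimal PyTorch sketch, a simplification of my own rather than the authors' architecture, of modality-decoupled parameters: each token is processed by a feed-forward block selected by its modality tag, while other components can stay shared. (The paper decouples more than just the FFN; this only illustrates the routing-by-modality pattern.)

```python
import torch
import torch.nn as nn

class ModalityDecoupledFFN(nn.Module):
    """Each modality gets its own feed-forward weights; tokens are routed by modality id."""
    def __init__(self, d_model: int, d_ff: int, modalities=("text", "image", "speech")):
        super().__init__()
        self.modalities = list(modalities)
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for m in modalities
        })

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); modality_ids: (batch, seq) indexing into self.modalities
        out = torch.zeros_like(x)
        for idx, name in enumerate(self.modalities):
            mask = modality_ids == idx          # tokens belonging to this modality
            if mask.any():
                out[mask] = self.ffn[name](x[mask])
        return out

# Usage: a 6-token sequence with 3 text, 2 image, and 1 speech token.
x = torch.randn(1, 6, 64)
ids = torch.tensor([[0, 0, 0, 1, 1, 2]])
print(ModalityDecoupledFFN(64, 256)(x, ids).shape)  # torch.Size([1, 6, 64])
```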


WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Tsinghua University; Zhipu AI

      🤗   33

This paper introduces WebRL, a novel framework that leverages self-evolving online curriculum reinforcement learning to train web agents built on open large language models. By addressing challenges such as limited training tasks and sparse feedback, the framework significantly boosts performance, allowing open models like Llama-3.1 and GLM-4 to outperform proprietary models such as GPT-4-Turbo. I find it particularly compelling that it sets a precedent for building capable autonomous web interaction systems in an open-source paradigm.

Raw notes: Another paper on GUI agents, focusing on the Web. Authored by the company that built GLM series models. Should be read in conjunction with OS-Atlas and AndroidLab papers.


Personalization of Large Language Models: A Survey

Dartmouth College; Adobe Research; Stanford University; University of Massachusetts Amherst; Pattern Data; Vanderbilt University; Dolby Research; University of California San Diego; Cisco Research; University of Oregon

      🤗   30

This paper provides a thorough overview of the personalization of large language models, highlighting the current gap between research on personalized text generation and its application in recommendation systems. It thoughtfully categorizes the field through a proposed taxonomy, addressing usage, techniques, datasets, and evaluation methods. I appreciate how it also outlines the challenges and open problems, offering an insightful framework for future research and development in this rapidly evolving area.

Raw notes: Lay of the land of personalization.


How Far is Video Generation from World Model: A Physical Law Perspective

Bytedance Research; Tsinghua University; Technion

      🤗   29

This paper critically examines video generation models, highlighting their limitations in learning and predicting physical laws accurately across diverse scenarios. By focusing on diffusion-based models, it demonstrates that while these models can generalize within known contexts, they struggle with out-of-distribution cases, relying heavily on mimicking similar past examples rather than abstracting broader physical principles. This insight resonates with the broader understanding of large language models as primarily case-based reasoners, suggesting that scaling up model size alone isn’t sufficient for deeper comprehension of underlying structures.

Raw notes: This is similar to the perspective that LLMs are approximate retrievers/case-based reasoners that don’t learn/generalize intricate patterns the way humans can. The first author shared a great video summary here: x.com


M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

UNC Chapel Hill; Bloomberg

      🤗   23

This paper introduces M3DocRAG, a novel framework designed to enhance the capabilities of document visual question answering by effectively managing both multi-page and multi-document scenarios using a multi-modal approach. The authors successfully demonstrate that their method outperforms existing techniques, particularly with complex queries, though it’s worth noting that the authors have not shared their implementation code. Overall, I find the integration of multi-modal retrieval with a language model to be a promising step forward in addressing the challenges of long and varied document types in open-domain tasks.

Raw notes: The premise makes sense, seems almost obvious. Bummer that no code is shared.
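Since no code is shared, here is a hedged sketch of the premise (retrieve the most relevant pages as images, then let a multimodal LM answer over them). CLIP is only a stand-in retriever here, not what the paper uses, and `answer_with_multimodal_lm` is a hypothetical placeholder for any multimodal answering model.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_pages(question: str, page_images: list, k: int = 4):
    """Rank document pages (stored as images) against a text question by CLIP similarity."""
    with torch.no_grad():
        img_emb = model.get_image_features(**processor(images=page_images, return_tensors="pt"))
        txt_emb = model.get_text_features(**processor(text=[question], return_tensors="pt", padding=True))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(-1)
    top = scores.topk(min(k, len(page_images))).indices.tolist()
    return [page_images[i] for i in top]

# top_pages = retrieve_pages("What was 2023 revenue?", all_pdf_pages_as_images)
# answer = answer_with_multimodal_lm(question, top_pages)  # hypothetical multimodal LM call
```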


Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Tencent

      🤗   22

This paper introduces Hunyuan-Large, an open-source Mixture of Experts (MoE) model from Tencent with 389 billion total parameters (52 billion activated), designed to push the boundaries in language processing tasks. It employs a mixed expert routing strategy, leverages large-scale synthetic data, and utilizes expert-specific learning rates to deliver impressive performance compared to other models like Llama-3.1-70B. Although primarily focused on text, I think it leaves room for future expansion into vision and speech modalities, which could further solidify its place in the competitive landscape of large language models.

Raw notes: Tencent does not want to fall behind in the LLM race. This model is text-only. I expect they would at least discuss plans to add vision/speech modalities.
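To unpack the "mixed expert routing" phrase, a pattern common in recent MoE models combines an always-on shared expert with a small number of softmax-routed specialized experts per token. The sketch below illustrates that generic pattern in PyTorch; whether it matches Tencent's exact recipe is an assumption I have not verified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPlusRoutedMoE(nn.Module):
    """Generic MoE layer: one shared expert plus top-k softmax-routed experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 1):
        super().__init__()
        make_ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = make_ffn()
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)        # (tokens, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)    # route each token to its top-k experts
        routed = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    routed[mask] = routed[mask] + weights[mask][:, k].unsqueeze(-1) * expert(x[mask])
        return self.shared(x) + routed                   # shared expert sees every token

print(SharedPlusRoutedMoE(64, 256)(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```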


LLaMo: Large Language Model-based Molecular Graph Assistant

Korea University

      🤗   20

This paper introduces LLaMo, a novel tool that bridges large language models with the molecular graph domain to enhance molecular understanding. By integrating a multi-level graph projector and leveraging machine-generated instructions during training, LLaMo excels in tasks such as molecular description generation and property prediction. Impressively, it outperforms existing approaches, highlighting its potential in advancing both molecular and language understanding.

Raw notes: Molecular graphs are a new modality for LLMs!


TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

Yale University; Allen Institute for AI

      🤗   19

This paper introduces TOMATO, a new benchmark designed to evaluate the visual temporal reasoning skills of Multimodal Foundation Models in video understanding. The authors uncover a substantial performance gap between humans and these models, pointing out fundamental weaknesses in how current models interpret video sequences. By providing a comprehensive set of human-annotated tasks, this work aims to push the research community to develop improved models that can better grasp video dynamics.

Raw notes: A recurring research theme: a) identify a potential weakness of current AI, b) create a benchmark to highlight this weakness, aiming to maximize the gap between human and AI, c) invite the research community to work on closing the gap, d) profit (in number of citations). This work is a good example of this theme.


Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

University of Cambridge; The University of Hong Kong

      🤗   19

This paper delves into how effectively large language models can retrieve information from very long contexts. It reveals that while they can skillfully follow complex threads, their effective context limits are often shorter than advertised, with accuracy degrading as the context grows. It also highlights the importance of comparing tokenizer outputs carefully, since the same token budget can represent very different character counts across models, and notes that closed frontier models generally perform significantly better than open models, with Reka models particularly trailing in performance. As someone interested in the intricacies of LLM performance, I find this exploration of context utilization both insightful and critical for future developments in the field.

Raw notes: Closed frontier models outperform open ones by a lot. Reka models in particular lag far behind.
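The tokenizer caveat is easy to check yourself: context limits are denominated in tokens, and different tokenizers map the same text to quite different token counts. A quick sketch using two tiktoken encodings (other models' tokenizers will differ further):

```python
import tiktoken

text = "Needle threading: can LLMs follow threads through near-million-scale haystacks? " * 100

for name in ("cl100k_base", "o200k_base"):   # GPT-4-era and GPT-4o-era encodings
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(text))
    print(f"{name}: {len(text)} chars -> {n_tokens} tokens "
          f"({len(text) / n_tokens:.2f} chars per token)")
```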


Analyzing The Language of Visual Tokens

University of California, Berkeley; The University of Tokyo

      🤗   17

This paper delves into the intricacies of visual tokens within transformer-based vision and language models, drawing comparisons with natural languages. I found it intriguing how visual tokens follow Zipfian distributions like natural languages but lack the same grammatical cohesion, making them less organized. These insights could significantly impact the future development of more sophisticated computer vision models.

Raw notes: Important work towards better understanding visual tokens.
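The Zipfian claim is straightforward to probe on any token stream. Here is a minimal sketch (run on synthetic token ids, not the paper's data) that estimates the rank-frequency slope in log-log space, where natural-language-like distributions show an approximately linear fit:

```python
import numpy as np
from collections import Counter

# Stand-in token stream; replace with real visual-token ids from a VQ-style tokenizer.
rng = np.random.default_rng(0)
tokens = rng.zipf(a=1.3, size=100_000)

freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# Least-squares fit of log(frequency) vs log(rank); Zipf-like data gives a clean linear trend.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"estimated rank-frequency slope: {slope:.2f}")
```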


DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

University of Illinois at Urbana-Champaign; University of California, Berkeley

      🤗   15

This paper introduces DynaMath, an innovative benchmark designed to test how well Vision-Language Models (VLMs) can handle variations in mathematical problems. I find it fascinating how the study reveals that even advanced models like GPT-4o stumble when faced with slightly altered tasks, highlighting a stark contrast with human reasoning capabilities. Overall, the research underscores the pressing need to enhance the robustness of VLMs in mathematical reasoning, positioning DynaMath as a critical tool for future model enhancements.

Raw notes: Math is a popular topic :slight_smile:
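The benchmark's central design, as I read it, is that each seed question is a program that can emit many concrete variants with known ground-truth answers. A toy, text-only sketch of that idea (not one of DynaMath's actual seed questions, which also render figures):

```python
import random

def linear_equation_variant(seed: int):
    """Generate one variant of 'solve a*x + b = c' with a fresh ground-truth answer."""
    rng = random.Random(seed)
    a = rng.randint(2, 9)
    x = rng.randint(-10, 10)        # the hidden answer
    b = rng.randint(-20, 20)
    c = a * x + b
    return f"Solve for x: {a}x + {b} = {c}", x

for seed in range(3):
    question, answer = linear_equation_variant(seed)
    print(question, "-> ground truth:", answer)
```

Evaluating a model across many such variants of the same seed is what exposes the robustness gaps the paper reports.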


M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

Yale University; Allen Institute for AI

      🤗   13

This paper presents M3SciQA, a novel benchmark aimed at pushing the boundaries of how foundation models handle complex, multimodal, and multi-document scientific question answering. It highlights the current challenges models face in integrating and reasoning with diverse information sources, showing significant shortcomings when compared to human performance. The insights from this research underscore the need for further advancements to achieve the ambitious goals of creating truly autonomous AI systems capable of scientific reasoning.

Raw notes: A virtual AI scientist (as proposed by Sakana recently) is an ambitious goal that will take a while to realize.


From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Microsoft; OpenAI

      🤗   9

This paper explores how OpenAI’s o1-preview model competes against the Medprompt approach in tackling medical challenge problems. The findings demonstrate that o1-preview consistently outperforms GPT-4 with Medprompt, except when few-shot prompting is involved. The insights provided here suggest that new advancements from major AI players like OpenAI could potentially render existing methods outdated, emphasizing the need for adaptability and innovation in benchmark development.

Raw notes: OpenAI’s o1 threw a curve ball at previous prompting techniques like Medprompt (in a good way). Takeaway: be judicious and avoid investing too much in areas where frontier work from OpenAI and the like could make them obsolete.


LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models

FPT Software AI Center, Viet Nam; VNUHCM - University of Science

      🤗   8

This paper presents LibMoE, a modular framework aimed at simplifying research and training on Mixture of Experts algorithms within large language models. By providing standardized processes and comprehensive benchmarks, it makes these advanced techniques more accessible to researchers. The study finds that different MoE algorithms perform comparably across various tasks, which highlights the need for further innovation in the field.

Raw notes: Infra enhancement for training MoEs.


Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models

Carnegie Mellon University

      🤗   6

This paper introduces Specialized Sparse Autoencoders (SSAEs) as an innovative technique to improve the interpretability of foundation models by targeting rare but important concepts. By leveraging dense retrieval for data selection and using Tilted Empirical Risk Minimization, SSAEs demonstrate superior capability over traditional Sparse Autoencoders in identifying hard-to-capture features. The effectiveness of this method is highlighted with notable improvements in classification accuracy within the Bias in Bios dataset, particularly with respect to challenging gender bias issues.

Raw notes: Keywords: interpretability, rare concepts, sparse autoencoders, SAE.
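For readers new to the SAE keyword, here is a minimal sparse autoencoder in PyTorch trained with an L1 sparsity penalty on cached model activations. The paper's actual contributions (dense-retrieval data selection and Tilted ERM) sit on top of this standard backbone and are not shown here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: activations -> sparse latent features -> reconstruction."""
    def __init__(self, d_act: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_latent)
        self.decoder = nn.Linear(d_latent, d_act)

    def forward(self, a: torch.Tensor):
        z = torch.relu(self.encoder(a))      # sparse, non-negative latent features
        return self.decoder(z), z

sae = SparseAutoencoder(d_act=768, d_latent=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

acts = torch.randn(256, 768)                 # stand-in for cached LLM activations
for _ in range(10):
    recon, z = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```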


Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models

National University of Singapore; Institute for Infocomm Research (I2R), A*STAR; Nanyang Technological University

      🤗   5

This paper presents an innovative approach called Multi-expert Prompting, which simulates multiple expert personas within a large language model and aggregates their answers to refine the final output. The method improves the overall quality and safety of LLM outputs by focusing on truthfulness, informativeness, and reducing harmful responses. The authors show that this multi-expert collaboration outperforms previous techniques and demonstrates impressive adaptability across diverse applications.

Raw notes: Is it ethical to eat meat? Apparently asking multiple LLM experts (ethicist, nutritionist, environmentalist) and aggregating into a single answer is a good idea.
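In spirit, the recipe is simple to reproduce. The sketch below uses a hypothetical `llm(prompt)` callable and a naive aggregation prompt; the paper aggregates expert answers with a more careful, structured procedure than this.

```python
def multi_expert_answer(question: str, llm,
                        experts=("ethicist", "nutritionist", "environmentalist")) -> str:
    """Ask several simulated experts, then aggregate their answers into one response.

    `llm` is any callable mapping a prompt string to a completion string (placeholder)."""
    expert_answers = {
        role: llm(f"You are an expert {role}. Answer concisely:\n{question}")
        for role in experts
    }
    joined = "\n\n".join(f"[{role}] {ans}" for role, ans in expert_answers.items())
    aggregation_prompt = (
        f"Question: {question}\n\n"
        f"Here are answers from different experts:\n{joined}\n\n"
        "Combine them into a single balanced, truthful answer, noting any disagreements."
    )
    return llm(aggregation_prompt)

# Usage with any LLM client wrapped as a prompt -> text function:
# print(multi_expert_answer("Is it ethical to eat meat?", llm=my_llm_call))
```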


Acknowledgements

Papers are retrieved from Hugging Face.