Weekly paper roundup: Inference Scaling for Long-Context RAG (10/7/2024)

Overview

Several common themes emerge across the reviewed papers. Innovations in Transformer models feature prominently, with multiple approaches focusing on enhancing attention mechanisms to optimize performance while reducing computational demands (Papers 1, 11). Multimodal and vision-language integration are pivotal in advancing real-world understanding and practical applications (Papers 3, 4, 10, 14). The focus on optimizing computational efficiency and energy consumption in AI systems is highlighted, presenting novel methods such as the L-Mul algorithm and the Mamba architecture (Papers 2, 8). The importance of benchmarks and evaluations for improving AI capabilities and understanding model limitations is evident, with new tools introduced for both software-related and reasoning tasks (Papers 13, 15, 17, 19). Lastly, AI’s potential for enhancing applications such as education and agent systems is explored, indicating a growing trend towards practical implementations (Papers 7, 5, 18).

Spotlight 🔦

Inference Scaling for Long-Context Retrieval Augmented Generation

University of Illinois Urbana-Champaign; Google DeepMind; University of Massachusetts Amherst

      🤗   8

This paper examines the benefits of optimizing inference computation in retrieval augmented generation, particularly for long-context language models. It finds that, rather than simply adding more retrieved knowledge, effectively scaling test-time computation through in-context learning and iterative prompting can substantially boost performance. The authors present a model that predicts the best inference parameters under a given compute constraint, showing performance improvements of up to 58.9% on benchmark tests. I find this particularly insightful for anyone developing RAG applications, especially those interested in tuning inference computation strategies. A must-read if you’re looking into practical applications of scaling in language model inference.

Raw notes: Required reading if you are building RAG applications and interested in inference computation scaling (o1 and such).


Other papers

Differential Transformer

Microsoft Research; Tsinghua University

      🤗   135

This paper introduces the Diff Transformer, which enhances traditional Transformer models by employing a differential attention mechanism to improve focus on relevant information while minimizing distractions. The results are quite impressive in controlled experimental setups, showcasing improvements in language modeling and context handling. However, I am curious to see how Microsoft AI might apply this architecture in real-world scenarios, as the paper doesn’t disclose Microsoft’s actual implementation strategies or potential shortcomings.

Raw notes: This week’s most upvoted paper is from Microsoft Research Asia (headed by Furu Wei), the group that brought us BitNet. Experimental results are impressive, but understandably in a highly constrained setup. I’d be more interested in what Microsoft AI is actually doing with this idea for their frontier model efforts; they are almost certainly not sharing. As usual in research, the paper did not discuss what did not work, so the insight is skewed.
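
To make the "differential attention" idea concrete, here is a minimal sketch in plain Python. It assumes the core of the mechanism is subtracting one softmax attention map from another before applying the result to the values. The function names and the fixed scalar `lam` are my own simplifications; the paper's handling of that weight (and any normalization around it) is not reproduced here.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    # Scaled dot-product attention weights, one row per query.
    d = len(keys[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]

def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    # Assumption: two query/key projections give two attention maps,
    # and the second map (scaled by lam) is subtracted from the first.
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    n_cols = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(n_cols)]
        for row in diff
    ]
```

With `lam=0.0` this reduces to ordinary softmax attention. With identical projections and `lam=1.0` the two maps cancel completely, which is the intuition behind "minimizing distractions": whatever both maps attend to gets subtracted out.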


Addition is All You Need for Energy-efficient Language Models

BitEnergy AI, Inc., Cambridge, MA 02142, USA

      🤗   129

This paper introduces the L-Mul algorithm that offers a novel approach by using integer addition to approximate floating point multiplication, promising substantial reductions in energy consumption and computational resources. Impressively, the method claims to cut energy costs by up to 95% in certain operations, while maintaining precision comparable to existing methods. Given its potential transformative impact on energy efficiency in NLP tasks, I’ll be keeping an eye on this development despite the flashy claims from a relatively unknown company.

Raw notes: Jaw-dropping claim from an obscure company called BitEnergy with a nondescript website, which suggests a founding team of two MIT alumni. Sounds almost too good to be true. The paper seems to have enough details for reproducibility. Looking forward to tracking this progress.
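
The general trick of trading a multiplication for additions can be sketched as follows. Writing each number as (1 + f) · 2^e, the exact product needs the term fx · fy; the sketch below drops that term and adds a small constant instead. This is an illustration of the arithmetic only: the `offset` default is a hypothetical value I picked, the function name is mine, and the paper's actual correction term and integer bit-level implementation are not reproduced here.

```python
import math

def approx_mul(x, y, offset=2 ** -4):
    # Approximate x * y without multiplying the mantissas.
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = -1.0 if (x < 0) != (y < 0) else 1.0
    mx, ex = math.frexp(abs(x))  # abs(x) == mx * 2**ex with 0.5 <= mx < 1
    my, ey = math.frexp(abs(y))
    # Rewrite each value as (1 + f) * 2**(e - 1) with 0 <= f < 1.
    fx, fy = 2 * mx - 1, 2 * my - 1
    # Exact mantissa product is 1 + fx + fy + fx * fy.
    # Assumption: replace fx * fy with a constant offset, leaving only additions.
    mantissa = 1 + fx + fy + offset
    # Exponents are combined by addition as well.
    return sign * math.ldexp(mantissa, (ex - 1) + (ey - 1))
```

For example, `approx_mul(3.0, 5.0)` gives 14.5 against an exact 15.0. The error depends on how far fx · fy is from the chosen offset, so how well this holds up at the precision used in real models is exactly the part worth checking in the paper.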


Aria: An Open Multimodal Native Mixture-of-Experts Model

Rhymes AI

      🤗   94

This paper introduces Aria, an open-source multimodal native mixture-of-experts model that boasts impressive benchmark results in integrating visual and textual data, positioning itself as a powerful alternative to proprietary models. The authors highlight a comprehensive pre-training approach that enhances Aria’s capabilities, despite the absence of an audio modality. I find the model’s open-weight system compelling, as it allows for adaptable use across various tasks, making it a promising tool for real-world applications.

Raw notes: New open-weight (all else is closed) multimodal vision language model trained from scratch with impressive benchmark results. It does not hear or speak though (no audio modality). Supposedly by a Tokyo-based company (Rhymes AI). Its chief multimodal scientist is based in Singapore and is a former Salesforce researcher.


Pixtral 12B

Mistral

      🤗   53

This paper presents Pixtral-12B, a 12-billion-parameter multimodal language model capable of understanding both images and text while surpassing even larger models in several benchmarks. Notably, it introduces a novel vision encoder and an impressive context window, enhancing its ability to process inputs flexibly. The authors also contribute an open-source benchmarking tool, MM-MT-Bench, to aid in the standardized evaluation of vision-language models.

Raw notes: Mistral’s multimodal team strikes back, showing superior benchmark numbers compared to other frontier models, including AI2’s Molmo. I’m actually rather lost in all these comparisons. There’s a clear need for better eval. For example, Rhymes AI’s Aria is claimed to beat Pixtral 12B.


WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents

Australian AI Institute, Faculty of Engineering and IT, University of Technology Sydney; Department of Computer Science, University of Maryland, College Park; Tencent

      🤗   44

This paper introduces “WALL-E,” an LLM agent that enhances performance in complex virtual environments by incorporating a neurosymbolic approach to align with its surroundings through rule learning. By utilizing the pre-existing knowledge of LLMs and learning a few additional rules, WALL-E demonstrates improved exploration capabilities and efficiency in environments like Minecraft, achieving better success rates with reduced replanning costs compared to current LLM agents. Although the initial results are promising, further development and exploration are needed to fully gauge the potential of this approach.

Raw notes: Worth a quick skim for agentic system builders. The premise of the paper makes sense. Still super early though.


LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Technion; Google Research; Apple

      🤗   43

This paper delves into the fascinating realm of how large language models handle factual errors, uncovering that they may internally possess more accurate information than their outputs suggest. It identifies specific tokens that encode the truthfulness, which could enhance error detection capabilities. However, the work importantly points out the inconsistency of this encoding across datasets, indicating that our approach to mitigating such errors needs to be more refined and strategic.

Raw notes: Early research into understanding how LLMs make mistakes and whether they can be prevented.


Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise

Stanford University

      🤗   24

This paper introduces Tutor CoPilot, an innovative Human-AI system aimed at enhancing tutoring effectiveness, especially in underserved educational environments. Through a rigorous randomized controlled trial, it demonstrates meaningful improvements in student outcomes, with notable gains among underperforming tutors. Though the positive results are promising for democratizing access to quality education, some limitations in the system’s guidance are acknowledged.

Raw notes: Tutor CoPilot is a promising concept. This should not be surprising. Still, it’s good to see a randomized controlled trial being carried out to validate.


Falcon Mamba: The First Competitive Attention-free 7B Language Model

Technology Innovation Institute, Abu Dhabi, United Arab Emirates

      🤗   24

This paper introduces Falcon Mamba, a 7-billion-parameter language model that leverages the innovative Mamba architecture, showcasing remarkable performance and efficiency compared to traditional Transformer-based models. I am impressed by how it challenges the current dominance of hybrid models in the field, demonstrating that attention-free architectures can be viable competitors. With its model weights publicly available, this research invites further exploration into the potential of purely Mamba-based designs.

Raw notes: Mamba research continues, despite its limited use in frontier models. Diversity is good.


FAN: Fourier Analysis Networks

School of Computer Science, Peking University; ByteDance

      🤗   23

This paper introduces Fourier Analysis Networks (FAN), which effectively utilize Fourier Series to enhance the modeling of periodic phenomena, improving upon traditional neural networks. While the approach showcases notable performance improvements in areas like time series forecasting, the experimental setup is somewhat limited, relying on small networks. Additionally, the paper could benefit from a discussion on potential limitations and areas for further research.

Raw notes: Neat paper addressing the need to model periodicity. However, the experiments are very preliminary, with small networks. The paper lacks self-criticism: there is no section discussing limitations of the findings.
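
One way to picture "building periodicity into the network" is a layer whose output mixes sine and cosine features with an ordinary nonlinear branch. The sketch below is my own illustration under that assumption; the function name, the `tanh` activation, and the weight layout are hypothetical and may differ from the layer defined in the paper.

```python
import math

def fourier_layer(x, w_periodic, w_plain, b_plain):
    # Periodic branch: cos and sin of a linear projection of the input.
    proj = [sum(w * xi for w, xi in zip(row, x)) for row in w_periodic]
    periodic = [math.cos(p) for p in proj] + [math.sin(p) for p in proj]
    # Plain branch: a standard linear projection with a nonlinearity.
    plain = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_plain, b_plain)
    ]
    # Concatenate both branches into one feature vector.
    return periodic + plain
```

With a unit weight in the periodic branch, shifting the input by 2π leaves the first two outputs unchanged while the plain branch still moves. That built-in repetition is what an ordinary MLP has to learn from data.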


NL-Eye: Abductive NLI for Images

Department of Data Science, Technion - IIT; Google Research

      🤗   22

This paper presents NL-Eye, a new benchmark aimed at testing the visual abductive reasoning of Visual Language Models. By using 350 carefully selected triplet examples, NL-Eye exposes how these models struggle with tasks that seem straightforward to humans, highlighting a significant gap in current AI capabilities. It underscores the necessity for improving VLMs to handle real-world scenarios that demand complex multimodal reasoning.

Raw notes: AI has a very long way to go to approach human intelligence in this ability.


Selective Attention Improves Transformer

Google Research

      🤗   22

This paper introduces Selective Attention, an innovative, parameter-free tweak to the standard attention mechanism that heightens the efficiency of language models by curbing attention to less relevant input elements. Impressively, this approach allows transformers to match the performance of models with many more attention heads and parameters, while dramatically cutting memory and computational needs. I find it fascinating how this method demonstrates memory reductions by factors as high as 47X without sacrificing validation perplexity, highlighting its potential for more resource-efficient model deployment.

Raw notes: Related to the Differential Transformer paper from Microsoft Research (most upvoted paper this week). Attention may be all you need, but what attention really means is a rabbit hole in itself.


F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Shanghai Jiao Tong University; University of Cambridge; Geely Automobile Research Institute (Ningbo) Company Ltd.

      🤗   22

This paper introduces F5-TTS, a non-autoregressive text-to-speech system that leverages flow matching and Diffusion Transformer technology to generate natural-sounding speech efficiently. I found it impressive how the system uses ConvNeXt for text representation and a Sway Sampling strategy, resulting in faster training and a smoother inference process with a real-time factor of 0.15. Notably, the system exhibits strong capabilities in zero-shot learning and expressive code-switching, promising a significant impact on multilingual speech applications.

Raw notes: Flow matching is emerging as a powerful alternative to diffusion. I checked out the samples. They sound good.
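
For readers unfamiliar with the metric, real-time factor (RTF) is commonly defined as synthesis time divided by the duration of the audio produced, so lower is faster. A quick sketch of what 0.15 means in practice, assuming that standard definition (the helper names are mine):

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    # RTF = time spent generating / length of the generated audio.
    return synthesis_seconds / audio_seconds

def synthesis_time(rtf, audio_seconds):
    # Seconds of compute needed to produce a clip of the given length.
    return rtf * audio_seconds

# At an RTF of 0.15, a 10-second clip takes about 1.5 seconds to generate.
seconds_needed = synthesis_time(0.15, 10.0)
```

Anything below 1.0 is faster than real time on the hardware used for the measurement.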


Benchmarking Agentic Workflow Generation

Zhejiang University; Alibaba Group

      🤗   20

This paper introduces WorFBench, a comprehensive benchmarking tool designed to evaluate the capabilities of Large Language Models (LLMs) in generating workflows through both sequence and graph planning tasks. It also develops WorFEval to provide a robust evaluation protocol using advanced matching algorithms, uncovering significant performance gaps among LLMs. I find it particularly enlightening as it highlights the challenges and potential improvements needed in creating more effective agentic systems.

Raw notes: Developers are building agentic systems everywhere. Yet there is still a large gap in our understanding of how to build effective agents. This paper is a reminder of that.


Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

The Ohio State University; Orby AI

      🤗   16

This paper introduces UGround, a novel model for visual grounding tailored for GUI agents, allowing for enhanced interaction and interpretation purely through visual cues. Impressively, the model has demonstrated up to 20% improvement over current methods, showcasing its potential in revolutionizing GUI interactions. I’m intrigued by the real-world applications and the significant performance gains, and I’ll definitely be watching Orby AI as they push boundaries in the RPA market.

Raw notes: Research paper co-authored by Orby AI, a Series A-funded ($30M, June 2024) startup that is gunning for the RPA market with cutting-edge multimodal AI. Watch this space.


GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Apple

      🤗   14

This paper explores the mathematical reasoning skills of large language models using a new benchmark called GSM-Symbolic and reveals that these models struggle significantly with variability and complexity in mathematical problems. The results underscore that LLMs might depend more on their training data than on real logical processing, exposing notable weaknesses in their mathematical reasoning. These findings might not be surprising, but they certainly highlight the ongoing challenges and potential areas for improvement in developing more robust models.

Raw notes: Widely discussed on social media, and immediately picked up by neural network skeptics Gary Marcus and Pedro Domingos. The findings are not surprising. We have work to do (as always). I’m still bullish.


RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

City University of Hong Kong; Huawei Noah’s Ark Lab; The Hong Kong University of Science and Technology (Guangzhou); McGill University; MILA

      🤗   10

This paper showcases RevisEval, an innovative evaluation framework for text generation that leverages response-adapted references to enhance how large language models evaluate content. By tailoring the evaluation references to the specific response, it achieves superior results in traditional metrics and LLM-based evaluations, showing effectiveness across different language tasks. Impressively, this approach manages to cut down on bias while staying cost-effective, offering valuable insights and tools to practitioners in natural language generation.

Raw notes: A bit difficult to grasp. Will come back to it later. May be relevant to practitioners.


CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs

FPT Software AI Center, Viet Nam; Hanoi University of Science and Technology; VNU-HCM - University of Science

      🤗   9

This paper introduces CodeMMLU, a novel benchmark that specifically tests the code understanding capabilities of Code Large Language Models across a broad range of challenging questions. It effectively highlights the existing gap in LLMs’ proficiency, demonstrating that while these models are adept at generating code, their understanding of intricate software concepts needs significant improvement. I appreciate the focus on the importance of code reasoning, as it points towards the potential development of more advanced AI tools for coding assistance.

Raw notes: Good paper from FPT AI group in Vietnam.


MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

OpenAI

      🤗   6

This paper introduces MLE-bench, a benchmark designed to evaluate AI agents on their machine learning engineering skills using 75 curated Kaggle competitions. Impressively, the study shows that the best configurations can reach a bronze medal level in about 17% of these competitions, providing a solid baseline for future research in enhancing AI agents’ ML capabilities. Open-sourcing the benchmark is a smart move that could significantly boost collaborative advancements in this field.

Raw notes: Fun paper from OpenAI. The best scaffolder is AIDE from Weco AI. o1 makes a big difference.


Acknowledgements

Papers are retrieved from Hugging Face.