Weekly paper roundup: Document Parsing Survey (10/28/2024)

Overview

The recent papers collectively focus on advances in artificial intelligence, highlighting themes such as multimodal learning, the development and evaluation of AI models, and improvements in model interpretability and efficiency. Clear connections emerge in innovations aimed at enhancing privacy, improving data labeling, and exploring the reasoning and thinking processes of AI systems, as seen in CLEAR’s unlearning benchmark and the analysis of cognitive effects on LLM performance. Several papers, such as those introducing AutoKaggle and NeuZip, emphasize optimization and efficiency, underscoring an increased focus on resource management. Additionally, the integration of AI into real-world applications and its potential societal impact are explored through works like ROCKET-1 for decision-making in open-world environments and CARE’s interface for personalized exploratory tasks. Collectively, these papers push forward our understanding of AI’s capabilities and limitations across diverse settings, from healthcare to programming and data science.

Spotlight :flashlight:

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Shanghai Artificial Intelligence Laboratory; Peking University; Tsinghua University

      🤗   26

This paper offers a comprehensive review of current document parsing techniques, underscoring their pivotal role in transforming unstructured content into machine-readable formats. I appreciate the focus on both the methodologies and the inherent challenges, such as dealing with complex layouts and integrating diverse data types. The authors effectively highlight the importance of better datasets for progressing these technologies, which seems crucial for future advancements. It’s a solid resource for anyone interested in the evolution of document parsing, especially in light of advances made possible by large language models. The emphasis on a practical, experimental approach to assessing tools and techniques adds valuable perspective for practitioners and researchers alike.

Raw notes: What is the best document parsing tool? Should I use OCR? My answer has always been: set up your experimentation workbench and answer this question empirically. The survey contains a bewildering array of techniques and tools.
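
Here’s a minimal sketch of what I mean by a workbench: a harness that runs candidate parsers over the same PDFs and scores each against reference transcriptions you curate. The file paths and evaluation set are placeholders, pypdf is just one example backend, and the similarity ratio is a crude stand-in for whatever metric fits your documents (field-level F1, edit distance, etc.).

```python
import difflib
from pathlib import Path

from pypdf import PdfReader  # one example backend; add OCR tools the same way


def parse_pypdf(path: str) -> str:
    """Baseline text extraction for born-digital PDFs."""
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)


# Register candidate parsers: name -> callable mapping a PDF path to text.
PARSERS = {"pypdf": parse_pypdf}

# Placeholder evaluation set: (pdf, hand-checked reference transcription) pairs.
EVAL_SET = [("samples/invoice.pdf", "samples/invoice.txt")]


def score(extracted: str, reference: str) -> float:
    """Crude document-level similarity; swap in a metric that fits your data."""
    return difflib.SequenceMatcher(None, extracted, reference).ratio()


for name, parse in PARSERS.items():
    ratios = [score(parse(pdf), Path(ref).read_text(encoding="utf-8"))
              for pdf, ref in EVAL_SET]
    print(f"{name}: mean similarity {sum(ratios) / len(ratios):.3f}")
```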


Other papers

CLEAR: Character Unlearning in Textual and Visual Modalities

AIRI; MIPT; Skoltech; Sber; University of Sharjah; HSE University

      🤗   193

This paper introduces a unique benchmark called CLEAR to address the relatively unexplored area of multimodal unlearning in deep learning, aiming to enhance privacy. It provides a dataset of 200 fictitious individuals with associated images and question-answer pairs to evaluate different unlearning methods, particularly focusing on the challenges inherent in this domain. I found it intriguing that the study highlights ℓ₁ regularization as a promising approach to mitigate catastrophic forgetting while preserving performance on the data that should be retained, marking a significant step forward in machine unlearning research.

Raw notes: This week’s most upvoted paper is from a group of Russian researchers. Related to model editing, machine unlearning is an interesting research topic.
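
I won’t reproduce the paper’s exact objective from memory, but a common formulation that the ℓ₁ finding plugs into looks like this: ascend on the forget set, descend on the retain set, and penalize weight magnitudes to damp catastrophic forgetting. A sketch under those assumptions:

```python
import torch


def unlearning_loss(model, forget_batch, retain_batch, lam=1e-5):
    """Illustrative objective, not CLEAR's exact one: push the model away from
    the forget data, stay anchored on the retain data, and add an l1 penalty
    on trainable weights to damp catastrophic forgetting."""
    forget_loss = model(**forget_batch).loss  # assumes an HF-style .loss output
    retain_loss = model(**retain_batch).loss
    l1 = sum(p.abs().sum() for p in model.parameters() if p.requires_grad)
    return -forget_loss + retain_loss + lam * l1
```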


GPT-4o System Card

OpenAI

      🤗   72

This paper introduces GPT-4o, a cutting-edge multimodal AI model known for its proficiency in handling diverse inputs like text, audio, images, and video, along with an enhanced performance in non-English languages. The improvements in response time and accuracy are commendable, and the dedication to safe and responsible AI through robust third-party assessments stands out. I find the insights into red teaming efforts particularly fascinating, revealing a strong focus on safety and evaluation processes.

Raw notes: The discussion on red teaming efforts is interesting.


Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

EPFL

      🤗   70

This paper delves into the mechanistic interpretability of the SDXL Turbo text-to-image diffusion model using Sparse Autoencoders. By uncovering the distinct roles of various transformer blocks in the image generation process, the study not only advances our understanding of generative models but also showcases the utility of SAEs for interpreting complex visual domains. I found this approach intriguing as it offers a promising direction for demystifying the inner workings of advanced AI models.

Raw notes: Mechanistic interpretability for diffusion models.
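
For readers new to SAEs: the tool itself is small. Below is a minimal sparse autoencoder of the kind used in interpretability work, an overcomplete ReLU dictionary trained to reconstruct intermediate activations under an ℓ₁ sparsity penalty. The dimensions and coefficient are illustrative; in the paper the inputs would be SDXL Turbo transformer-block activations rather than random vectors.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU dictionary trained to reconstruct activations."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(z), z


def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    # reconstruction error + l1 sparsity penalty on feature activations
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()


# toy usage on random "activations"
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
x = torch.randn(8, 512)
x_hat, z = sae(x)
print(sae_loss(x, x_hat, z).item())
```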


What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

University of Maryland; University of Chicago

      🤗   53

This paper delves into how training large language models with fast versus slow thinking influences layer-wise gradients, finding that fast thinking results in larger and more variable gradients. Interestingly, it highlights that pre-trained models show more resilience against these fast thinking instabilities than instruction-tuned models. The exploration provides meaningful insights into LLM training dynamics, offering pathways for creating more stable and efficient models.

Raw notes: We know so little about how LLMs “think”, fast or slow. This paper takes baby steps here.
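
The measurement itself is easy to replicate on your own models. A minimal sketch of per-layer gradient norms after a backward pass (the layer-grouping key is a naive guess; adapt it to your module naming):

```python
from collections import defaultdict

import torch


def layerwise_grad_norms(model) -> dict:
    """Aggregate gradient norms per top-level module after loss.backward()."""
    sq = defaultdict(float)
    for name, p in model.named_parameters():
        if p.grad is not None:
            sq[name.split(".")[0]] += p.grad.norm().item() ** 2
    return {layer: total ** 0.5 for layer, total in sq.items()}


# toy usage; compare the numbers across fast- vs. slow-thinking training batches
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 1))
model(torch.randn(2, 4)).sum().backward()
print(layerwise_grad_norms(model))
```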


CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmented Generation

Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Academy of Artificial Intelligence; Huawei Poisson Lab; Waseda University, Tokyo, Japan

      🤗   51

This paper introduces CORAL, a benchmark for evaluating Retrieval-Augmented Generation (RAG) systems in multi-turn conversational contexts. By addressing the limitations of previous research that predominantly focused on single-turn interactions, the study provides a standardized framework for assessing RAG methods using diverse, information-seeking conversations based on Wikipedia data. I find this work valuable because it reflects the complexities of real-world conversations and highlights opportunities for enhancing conversational RAG systems through comprehensive evaluation.

Raw notes: Multi-turn conversations are more common in the real world, so this is a welcome addition.
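
For context, the loop being benchmarked looks roughly like this; rewrite_query, retrieve, and generate below are hypothetical stand-ins (stubbed so the sketch runs), not CORAL’s code. The benchmark’s point is that each turn depends on dialogue history, which single-turn RAG evaluation never exercises.

```python
def rewrite_query(history, msg):  # stub: real systems use an LLM rewriter
    return " ".join(u for u, _ in history[-2:]) + " " + msg


def retrieve(query, k=5):  # stub: swap in BM25 or a dense retriever
    return [f"passage relevant to: {query}"][:k]


def generate(history, msg, docs):  # stub: swap in an LLM call
    return f"answer to '{msg}' grounded in {len(docs)} passage(s)"


def conversational_rag(turns):
    """Multi-turn RAG: condense history into a self-contained query per turn."""
    history = []
    for user_msg in turns:
        query = rewrite_query(history, user_msg)
        docs = retrieve(query)
        history.append((user_msg, generate(history, user_msg, docs)))
    return history


print(conversational_rag(["Who founded Wikipedia?", "When did he start it?"]))
```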


ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting

PKU; UCLA; BIGAI; Team CraftJarvis

      🤗   48

This paper introduces ROCKET-1, a method that leverages visual-temporal context prompting to significantly improve vision-language models (VLMs) for decision-making in open-world scenarios. By focusing on object segmentation from past and current observations, this approach enhances the interaction between VLMs and policy models, which is crucial for complex task handling. The experiments, conducted in a Minecraft environment, highlight the power and potential of integrating both temporal and visual contexts for embodied AI agents, achieving objectives previously out of reach.

Raw notes: Progress in embodied AI agents, based on the Minecraft environment.


A Survey of Small Language Models

University of Oregon; Northeastern University; Carnegie Mellon University; University of California, San Diego; University of Maryland, College Park; State University of New York at Buffalo; Arizona State University; Adobe Research; University of Massachusetts Amherst; Intel AI Research; Meta AI; Dartmouth College; University of Arizona

      🤗   35

This paper delivers a thorough survey of Small Language Models, highlighting their effectiveness in limited-resource settings. I found the novel taxonomy for optimization methods, such as model compression and pruning, particularly insightful. Additionally, the discussion of key challenges makes this a useful resource for anyone involved in SLM development or deployment.

Raw notes: SLMs survey.
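
As a taste of the optimization methods surveyed, here is unstructured magnitude pruning in a few lines: zero out the smallest-magnitude fraction of each weight matrix. Real pipelines add structure constraints and fine-tune afterward; this is only the core idea.

```python
import torch


def magnitude_prune(model: torch.nn.Module, sparsity: float = 0.5):
    """Zero out the smallest-magnitude `sparsity` fraction of each matrix."""
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() < 2:
                continue  # skip biases and norm parameters
            k = int(p.numel() * sparsity)
            if k == 0:
                continue
            threshold = p.abs().flatten().kthvalue(k).values
            p.mul_((p.abs() > threshold).to(p.dtype))


model = torch.nn.Linear(64, 64)
magnitude_prune(model, sparsity=0.5)
print((model.weight == 0).float().mean())  # ~0.5
```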


AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

ByteDance Inc.; University of Melbourne; Interdisciplinary Centre for Security, Reliability and Trust (SnT), Université du Luxembourg

      🤗   31

This paper introduces a collaborative framework called AutoKaggle, designed to streamline the workflow for data scientists working with tabular data through an automated, iterative process. By focusing on code correctness and logic consistency, the framework enhances productivity and performs well in Kaggle competitions. I find the exploration of automating repetitive and manual tasks particularly insightful, and it raises intriguing questions about the potential for AI to fully automate data science workflows in the future.

Raw notes: This demonstrates how much of a data scientist’s workflow is manual, repetitive, non-differentiated, and a good target for AI. Question: will it be 100% in the future?
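
AutoKaggle’s multi-agent design is more elaborate than this, but the correctness loop that makes such systems work can be sketched simply: generate code, run it, feed the traceback back, retry. Here llm is a hypothetical callable (prompt in, code out):

```python
import os
import subprocess
import sys
import tempfile


def run_until_correct(task: str, llm, max_attempts: int = 3) -> str:
    """Generate-run-repair loop; `llm` is a hypothetical prompt -> code callable."""
    prompt = f"Write a Python script that {task}. Output only code."
    for _ in range(max_attempts):
        code = llm(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True)
        os.unlink(path)
        if result.returncode == 0:
            return code
        # feed the failure back so the model can repair its own code
        prompt = f"This script failed:\n{code}\n\nstderr:\n{result.stderr}\nFix it."
    raise RuntimeError("no working script within the attempt budget")
```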


Teach Multimodal LLMs to Comprehend Electrocardiographic Images

The Ohio State University; Carnegie Mellon University

      🤗   22

This paper presents ECGInstruct, a pioneering dataset crafted to enhance ECG image interpretation by multimodal large language models (MLLMs). By introducing PULSE, an MLLM optimized specifically for ECG images, the study shows substantial improvements over general models in ECG tasks. Moreover, the development of ECGBench offers a valuable benchmark for evaluating these advancements, highlighting the potential of integrating medical data modalities in AI for improved clinical outcomes.

Raw notes: ECG as a new modality for AI. I’d love to see more work on incorporating medical data modalities into AI.


Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

Technion - Israel Institute of Technology; Google Research

      🤗   15

This paper delves into the impact of label errors on the performance evaluation of large language models (LLMs), positing that these errors might skew our perception of the models’ capabilities. By employing an ensemble of LLMs to detect and correct these errors, the study reveals that model performance improves significantly, suggesting that supposed faults often arise from mislabeled data rather than limitations of the LLMs themselves. The work is a compelling reminder of the importance of accurate data labeling in assessing artificial intelligence capabilities.

Raw notes: The findings are really interesting.
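
The paper’s protocol is richer than this, but the core mechanism is easy to picture: flag an example when most models in an ensemble agree on a label that contradicts the gold one. In this sketch, the entries of models are hypothetical callables mapping text to a label:

```python
from collections import Counter


def flag_label_errors(examples, models, agreement=0.8):
    """Flag (text, gold) pairs where the ensemble's majority label disagrees
    with the gold label at or above the given agreement rate."""
    flagged = []
    for text, gold in examples:
        votes = Counter(m(text) for m in models)
        label, count = votes.most_common(1)[0]
        if label != gold and count / len(models) >= agreement:
            flagged.append((text, gold, label))
    return flagged


# toy usage with stub "models"
models = [lambda t: "pos" if "good" in t else "neg"] * 3
print(flag_label_errors([("good movie", "neg"), ("bad movie", "neg")], models))
```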


OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

Zhejiang University; Tencent AI Lab (Seattle); Westlake University

      🤗   14

This paper presents OpenWebVoyager, an intriguing open-source framework designed for creating multimodal web agents that learn from real-world interactions. I found the iterative approach, which starts with imitation learning and moves into a cycle of exploration and policy optimization, to be a particularly strong feature that allows for significant self-improvement of the agents. The experimental results are compelling and suggest this framework could have meaningful applications in real-world scenarios.

Raw notes: Nice contribution to open research on a topic with great real-world use cases.


On Memorization of Large Language Models in Logical Reasoning

Google; University of Illinois Urbana-Champaign; Princeton University; Allen Institute for AI

      🤗   13

This paper delves into the memorization habits of large language models when tackling logical reasoning tasks, highlighting their reliance on memorized knowledge when tested with familiar puzzles. The research underscores that LLMs falter when faced with even slight changes to these puzzles, suggesting a heavy dependence on rote learning over genuine reasoning. Interestingly, the study also uncovers that while fine-tuning tends to boost generalization, it doesn’t entirely eradicate the tendency to memorize, indicating a nuanced interplay between memorization and reasoning.

Raw notes: It’s a fine line between rote memorization and deep understanding when it comes to LLMs. We still don’t know much about how they work.


Can Language Models Replace Programmers? REPOCOD Says ‘Not Yet’

Purdue University

      🤗   11

This paper evaluates the effectiveness of large language models in code generation using the REPOCOD benchmark, derived from real-world projects, and reveals that none of the tested models surpass a 30% success rate. While Claude Sonnet 3.5 is generally perceived as the leader in code generation, the paper finds that GPT-4o performs better on this benchmark. However, it’s evident that while LLMs have potential, they currently fall short of replacing human programmers for complex, real-world development tasks.

Raw notes: Code LLMs love (new) benchmarks too. General perception is Claude Sonnet 3.5 is best in code generation. This paper’s findings differ: GPT-4o is the top dog. AI’s perf is still quite low (less than 30%).


NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Dept. Computing Science & Alberta Machine Intelligence Institute (Amii), University of Alberta; Borealis AI

      🤗   11

This paper introduces NeuZip, an innovative weight compression technique that effectively halves the memory footprint of neural networks without sacrificing performance quality. I’m impressed by how this approach smartly leverages floating-point entropy to achieve significant memory savings, notably reducing the training memory of large models like Llama-3 8B from 31GB to under 16GB. It challenges the notion that further efficiency gains in neural network architectures are infeasible, suggesting a promising direction for future exploration.

Raw notes: Tim Dettmers recently claimed that “From my own experience (a lot of failed research), you cannot cheat efficiency. If quantization fails, then also sparsification fails, and other efficiency mechanisms too. If this is true, we are close to optimal now.” Interesting thread to follow.
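
My read of the trick: the exponent bits of trained weights are far more predictable than 8 random bits, so they can be compressed losslessly with an entropy coder (mantissa trimming provides further, lossy savings at inference). The intuition is easy to check; this sketch is illustrative only, not NeuZip’s implementation:

```python
import numpy as np


def exponent_entropy(weights: np.ndarray) -> float:
    """Shannon entropy (bits) of the 8-bit exponent field of float32 values."""
    bits = np.ascontiguousarray(weights, dtype=np.float32).view(np.uint32)
    exponents = (bits >> 23) & 0xFF
    counts = np.bincount(exponents, minlength=256)
    probs = counts[counts > 0] / exponents.size
    return float(-(probs * np.log2(probs)).sum())


w = (np.random.randn(1_000_000) * 0.02).astype(np.float32)  # toy weight tensor
print(f"{exponent_entropy(w):.2f} bits of 8")  # far below 8 -> compressible
```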


AAAR-1.0: Assessing AI’s Potential to Assist Research

Pennsylvania State University; Netflix; University of California, Davis; University of Illinois Chicago; Fudan University; Zhejiang University; University of Alabama at Birmingham; Ohio State University; Salesforce Research

      🤗   10

This paper presents AAAR-1.0, a novel benchmark dataset designed to evaluate how effectively large language models (LLMs) can assist in complex research tasks. I appreciate the focus on the specific needs of researchers, requiring a level of domain expertise that pushes current AI capabilities. The paper does a great job highlighting both the promise and current limitations of using AI as a supportive tool in advanced scientific domains.

Raw notes: Benchmark to measure progress toward AI Scientist. The road ahead is long.


Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

Princeton University; New York University

      🤗   9

This paper delves into how chain-of-thought (CoT) prompting can actually degrade performance in large language models for tasks that are negatively impacted by explicit reasoning, similar to certain cognitive tasks in humans. Through empirical studies, it finds that CoT can significantly reduce accuracy, highlighting the importance of prompt selection based on the task’s nature. I found it particularly intriguing how understanding human cognitive challenges could enhance AI performance by informing more effective prompting strategies.

Raw notes: “Don’t overthink, just do it” can be the optimal approach in certain scenarios.


Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

MMLab, CUHK; Shanghai AI Laboratory; Tencent

      🤗   8

This paper presents an innovative framework called Vision Search Assistant that enhances vision-language models by integrating them with web agents for real-time information retrieval. I find it notable that the proposed method significantly boosts performance in handling both familiar and unfamiliar visual content, according to extensive experiments. However, I’m curious about the real-world applicability of the question types tested in the study, as it could impact the practical utility of the approach.

Raw notes: Impressive performance gain. However, I wonder if the types of questions under study are common in real world settings.


AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Academy of Artificial Intelligence

      🤗   7

This paper introduces AutoMIR and its unique approach, Self-Learning Hypothetical Document Embeddings (SL-HyDE), which boosts zero-shot medical information retrieval without needing relevance-labeled data. By generating hypothetical documents with contextually relevant information, SL-HyDE enhances the identification process of pertinent documents. Although the performance gains may appear small, the technique’s potential applicability beyond the medical domain adds a valuable dimension to its utility.

Raw notes: Could be applicable/handy beyond the medical domain. The perf gain seems small.
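
For readers unfamiliar with HyDE, the retrieval trick is compact: instead of embedding the raw query, have an LLM write a fake answer passage and embed that, since it lives closer to real documents in embedding space. In this sketch llm and embed are hypothetical stand-ins (stubbed so it runs), and SL-HyDE’s contribution (self-training both the generator and the retriever without relevance labels) sits on top:

```python
import numpy as np


def hyde_search(query, corpus, llm, embed, k=3):
    """HyDE-style retrieval: embed an LLM-written hypothetical answer passage
    instead of the raw query, then rank real documents by cosine similarity."""
    fake_doc = llm(f"Write a passage that answers: {query}")
    q = embed(fake_doc)
    doc_vecs = np.stack([embed(d) for d in corpus])
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [corpus[i] for i in np.argsort(-sims)[:k]]


# stubs so the sketch runs; swap in a real LLM and a real text encoder
llm = lambda prompt: prompt
embed = lambda text: np.array([text.lower().count(c) for c in "etaoinshrdlu"],
                              dtype=float)

docs = ["aspirin treats headaches", "insulin regulates blood sugar"]
print(hyde_search("what drug helps a headache?", docs, llm, embed, k=1))
```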


Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks

Southeast University, China; Microsoft, China; State Key Laboratory for Novel Software Technology, Nanjing University, China; Microsoft, USA

      🤗   7

This paper introduces CARE, a chat-based system designed to personalize and improve user experience during exploratory tasks using a multi-agent large language model framework. The structured interface supports iterative query refinement and offers tailored solutions, distinguishing it from conventional LLM chatbots. Results from a user study indicate that CARE reduces cognitive load and enhances creativity, making it a valuable tool in personalized problem-solving.

Raw notes: Worth a skim for anyone interested in Human-AI Interface.


Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

Snowflake AI Research; Polish Academy of Sciences

      🤗   4

This paper presents an innovative benchmark to evaluate large language models (LLMs) in their role as data scientists, specifically in the task of feature engineering. By assessing the performance of an XGBoost model trained on datasets modified by LLM-generated feature engineering code, the research highlights the potential of AI models to enhance data science tasks. I found the results particularly intriguing, as they not only demonstrate the effectiveness of the evaluation method but also detail how certain models significantly outperform others, suggesting a promising future for AI in data exploration and feature engineering.

Raw notes: Can AI do feature engineering? There’s increasing interest in building AI/data scientists. Interesting excerpt: “Globally, the leaderboard is dominated by the O1-PREVIEW model, achieving an impressive performance of 11%+. The second tier consists of GEMINI models, the latest GPT-4O, and O1-MINI, followed by the best open source models (DEEPSEEK, MISTRAL LARGE), and CLAUDE SONNET. It takes at least a MIXTRAL 8X7B class model to offer a noticeable improvement over abstaining from processing the DataFrame.”
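
As I read it, the evaluation recipe boils down to: train the same XGBoost model with and without the LLM-written feature code and compare scores. In this sketch the toy data, the AUC metric, and the transform closure are placeholders; the benchmark’s actual datasets and metrics may differ.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier


def auc_with_features(df: pd.DataFrame, target: str, transform=None) -> float:
    """Train XGBoost on (optionally feature-engineered) data and score it.
    `transform` stands in for executing the LLM-generated feature code."""
    if transform is not None:
        df = transform(df.copy())
    X, y = df.drop(columns=[target]), df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = XGBClassifier(n_estimators=200, verbosity=0).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])


# toy data; a real run would use the benchmark's datasets
df = pd.DataFrame(np.random.randn(500, 4), columns=list("abcd"))
df["label"] = (df["a"] + df["b"] > 0).astype(int)
base = auc_with_features(df, "label")
engineered = auc_with_features(df, "label", lambda d: d.assign(ab=d.a * d.b))
print(f"baseline {base:.3f} vs engineered {engineered:.3f}")
```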


Acknowledgements

Papers are retrieved from Hugging Face.