Overview
The collection of papers broadly explores advances in multimodal intelligence, LLM training and capabilities, reasoning, and model efficiency. Papers like Emu3 and MIO enhance multimodal models by integrating different data types for improved understanding and generation, with MIO uniquely including audio. Research on LLM capabilities, as seen in Law of the Weakest Link and RATIONALYST, examines how cross-capability performance is limited by a model’s weakest skills and how rationale supervision can narrow reasoning gaps, while other work surveys honesty in LLMs. Papers like VPTQ and Hyper-Connections contribute novel techniques for model efficiency, specifically in quantization and connection design. Collectively, these studies push the boundaries of AI capabilities across different contexts and tasks, reflecting significant interdisciplinary advances in AI research.
Spotlight
Law of the Weakest Link: Cross Capabilities of Large Language Models
Llama Team, AI @ Meta; University of Illinois Urbana-Champaign
This paper delves into the concept of “cross capabilities” in Large Language Models, which is crucial for tackling real-world tasks that require multiple abilities. The authors introduce CrossEval, a benchmark focused on evaluating how well LLMs handle this intersection of skills. Interestingly, they found that LLMs tend to struggle with tasks demanding cross capabilities, hampered by their weakest skills. The paper underscores the importance of enhancing specific areas to improve overall performance. As a bonus, the annotation section is insightful and the discussion on LLMs as judges offers a thoughtful perspective on their applications.
Raw notes: Very high quality research paper from the Llama Team and a great read. The section on annotation is particularly worth studying. Also a good discussion on the use of LLMs as judges.
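To make the “weakest link” idea concrete, here is a minimal sketch assuming hypothetical capability scores rather than actual CrossEval numbers: a model’s cross-capability score is approximated by the weaker of the two individual capability scores, which is the pattern the paper reports.

```python
# Illustrative only: scores are placeholders, not CrossEval results.
individual_scores = {
    "coding": 72.0,        # hypothetical single-capability score
    "reasoning": 58.0,     # hypothetical single-capability score
}

def weakest_link_estimate(cap_a: str, cap_b: str, scores: dict[str, float]) -> float:
    """Estimate a cross-capability score as the weaker of the two capabilities."""
    return min(scores[cap_a], scores[cap_b])

observed_cross_score = 55.0  # hypothetical measured score on a coding+reasoning task
predicted = weakest_link_estimate("coding", "reasoning", individual_scores)
print(f"weakest-link prediction: {predicted:.1f}, observed: {observed_cross_score:.1f}")
```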
Other papers
Emu3: Next-Token Prediction is All You Need
BAAI
This paper introduces Emu3, a groundbreaking multimodal model that leverages next-token prediction across images, text, and videos to outperform some existing models like SDXL and LLaVA-1.6. By using a unified transformer architecture, Emu3 claims a simpler and more effective approach, possibly charting a course toward general multimodal intelligence. However, the assertion that next-token prediction is all you need might be a bit premature, especially given the omission of audio data in the evaluation.
Raw notes: This week’s most upvoted paper is from Beijing Academy of AI (BAAI), a non-profit AI lab responsible for the largest LLM to date (WuDao 2.0 with 1.75 trillion parameters). It’s noteworthy that they use GPT-4V extensively to generate training data for Emu3. The paper’s claim (NTP is all you need) is suspect given that audio is not covered at all.
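As a rough illustration of the unified next-token setup (a sketch under my own assumptions; the vocabulary sizes and token ids are invented, not Emu3’s actual tokenizer), text tokens and discrete vision tokens can share one vocabulary so that a single autoregressive objective covers both modalities.

```python
# Illustrative only: vocabulary sizes and token ids are made up.
TEXT_VOCAB_SIZE = 32_000          # hypothetical text vocabulary size
VISION_VOCAB_SIZE = 8_192         # hypothetical codebook size of a visual tokenizer
VISION_OFFSET = TEXT_VOCAB_SIZE   # vision tokens are shifted past the text id range

def to_unified_ids(text_ids: list[int], vision_codes: list[int]) -> list[int]:
    """Concatenate text tokens and offset vision tokens into one sequence over the joint vocabulary."""
    return list(text_ids) + [VISION_OFFSET + code for code in vision_codes]

# Next-token prediction targets are the sequence shifted by one position,
# regardless of whether a position holds a text or a vision token.
sequence = to_unified_ids([5, 17, 302], [44, 1023, 77])
inputs, targets = sequence[:-1], sequence[1:]
print(inputs)
print(targets)
```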
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Apple AI/ML
This paper explores the impact of using both synthetic captions and original AltTexts in training multimodal foundation models, highlighting that a hybrid approach can improve performance. It also discusses a scalable captioning pipeline that can produce diverse formats to suit different models, which is an intriguing idea. I’d be interested to see how these results compare to the voice-based annotation technique from the Molmo team at AI2, as this could add another dimension to the understanding of captioning’s role in model training.
Raw notes: I want to see a discussion of the voice-based annotation technique used by the Molmo team at AI2.
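For intuition, a tiny sketch of the hybrid idea (the field names and mixing ratio are my own placeholders, not the paper’s recipe): each image keeps both its original AltText and a synthetic caption, and training samples one of them per example.

```python
import random

def pick_caption(example: dict, synthetic_ratio: float = 0.5) -> str:
    """Choose between the noisy-but-factual AltText and the richer synthetic caption."""
    if random.random() < synthetic_ratio:
        return example["synthetic_caption"]
    return example["alt_text"]

example = {
    "alt_text": "red running shoe",
    "synthetic_caption": "A red running shoe with white laces on a wooden floor.",
}
print(pick_caption(example))
```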
MIO: A Foundation Model on Multimodal Tokens
Beihang University; AIWaves; The Hong Kong Polytechnic University; University of Alberta; University of Waterloo; University of Manchester; Institute of Automation, Chinese Academy of Sciences; Peking University; The Hong Kong University of Science and Technology; 201.AI; M-A-P
This paper presents MIO, a foundation model that adeptly handles multimodal tasks involving speech, text, images, and videos through an innovative training approach. By addressing the limitations in existing models with processes like alignment and speech-enhanced pre-training, MIO exhibits advanced capabilities such as interleaved video-text generation. It’s fascinating to see the progress from this collaborative effort, underscoring the swift pace of AI research coming from China.
Raw notes: The frightening pace of AI research from China continues. MIO is a collaboration of a dozen or so groups, most of which are Chinese. This paper should be read in conjunction with the Emu 3 paper from BAAI. Unlike Emu 3, MIO’s modalities include speech/audio.
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Apple
This paper presents MM1.5, a set of multimodal large language models that leverage diverse data sources to enhance text-rich image understanding and multi-image reasoning. I appreciate the focus on employing high-quality OCR data and synthetic captions which results in robust performance across different scales. The exploration of specialized model variants for niche applications like video and mobile UI understanding provides a solid foundation for future multimodal AI development.
Raw notes: Second paper from Apple in this week’s top 5 most upvoted papers. Kudos to Apple.
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
Johns Hopkins University; University of Notre Dame
This paper presents RATIONALYST, a model that improves reasoning in large language models by incorporating rationale annotations during pre-training. By leveraging a dataset of 79,000 rationales, the model demonstrates a notable average accuracy improvement on several reasoning benchmarks, even outperforming larger models like GPT-4. It’s impressive how academia continues to push the boundaries of AI innovation despite resource constraints, as highlighted in the paper’s limitations section.
Raw notes: This is what academia’s response to OpenAI’s strawberry looks like. Academic research is now so limited by computational and engineering resources. The limitation section says as much.
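A hedged sketch of what process supervision with rationales can look like at inference time (the scoring function below is a hard-coded stand-in, not RATIONALYST’s actual model): a rationale model scores candidate next reasoning steps, and the step best supported by an implicit rationale is kept.

```python
def rationale_score(context: str, candidate_step: str) -> float:
    """Placeholder for the rationale model's score, e.g. the probability an
    implicit-rationale model assigns to the candidate given the context.
    Hard-coded here purely for illustration."""
    fake_scores = {
        "So Tom has 3 * 4 = 12 apples.": 0.9,
        "So Tom has 3 + 4 = 7 apples.": 0.2,
    }
    return fake_scores.get(candidate_step, 0.0)

def pick_next_step(context: str, candidates: list[str]) -> str:
    """Keep the candidate reasoning step with the highest rationale score."""
    return max(candidates, key=lambda step: rationale_score(context, step))

context = "Tom has 3 boxes with 4 apples each. How many apples does Tom have?"
candidates = [
    "So Tom has 3 + 4 = 7 apples.",
    "So Tom has 3 * 4 = 12 apples.",
]
print(pick_next_step(context, candidates))  # "So Tom has 3 * 4 = 12 apples."
```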
Loong: Generating Minute-level Long Videos with Autoregressive Language Models
University of Hong Kong; ByteDance
This paper introduces Loong, an autoregressive language model that pushes the boundaries of video generation to create minute-long videos. I am impressed by how it addresses common challenges like maintaining video quality and mitigating error accumulation through a unified approach to model text and video tokens. The results illustrate its effectiveness and potential in advancing video generation technology.
Raw notes: One of the four papers from ByteDance this week, and one of the three in this week’s top 3!
Video Instruction Tuning With Synthetic Data
ByteDance; NTU; BUPT
This paper presents the creation of a synthetic dataset, LLaVA-Video-178K, to improve the training process of video large multimodal models for instruction-following tasks. By addressing the challenge of acquiring high-quality raw video data, the authors demonstrate significant performance improvements across various video benchmarks. I appreciate that they also plan to provide public access to the dataset, generation pipeline, and model checkpoints, which will likely facilitate further advancements in this field.
Raw notes: Second paper from ByteDance on video multimodal models, this time focusing on understanding instead of generation.
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
Shanghai Jiao Tong University; UC Davis; East China Normal University
This paper presents the Multi-Granularity Debugger (MGDebugger) that enhances the debugging process of code generated by large language models by tackling bugs hierarchically. By using a bottom-up approach and a simulated Python executor for error tracing, it significantly improves the accuracy and success rates in repairing various types of bugs. I find the concept promising, especially with its potential applications in real-world AI4code systems.
Raw notes: The idea makes a lot of sense. Lots of opportunities to explore when building real-world AI4code systems.
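The bottom-up idea is easy to sketch (a simplification under my own assumptions: `llm_fix` is a placeholder for the model-based repair step, and a real system would use MGDebugger’s simulated executor rather than plain `exec`): decompose generated code into sub-units, repair the lowest-level units first, and only move up once their tests pass.

```python
def llm_fix(source: str, failure: str) -> str:
    """Placeholder for an LLM repair call that rewrites `source` given a failure message."""
    return source  # stand-in: a real system would prompt a model here

def run_tests(namespace: dict, tests: list) -> str:
    """Run assertion-style tests against the defined code; return the first failure, or ''."""
    for test in tests:
        try:
            exec(test, namespace)
        except Exception as exc:
            return f"{test!r} failed: {exc}"
    return ""

def debug_bottom_up(units: list, max_attempts: int = 3) -> dict:
    """Repair code units in dependency order: lowest-level helpers first, composites after."""
    namespace: dict = {}
    for source, tests in units:
        for _ in range(max_attempts):
            exec(source, namespace)          # (re)define the unit
            failure = run_tests(namespace, tests)
            if not failure:
                break
            source = llm_fix(source, failure)
    return namespace

# Toy example: a helper and a top-level function that depends on it.
units = [
    ("def double(x):\n    return 2 * x\n", ["assert double(3) == 6"]),
    ("def quadruple(x):\n    return double(double(x))\n", ["assert quadruple(3) == 12"]),
]
fixed = debug_bottom_up(units)
print(fixed["quadruple"](5))  # 20
```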
A Survey on the Honesty of Large Language Models
The Chinese University of Hong Kong; The University of Hong Kong; Tsinghua University; University of Illinois at Urbana-Champaign; University of Virginia; Peking University; WeChat AI
This paper tackles the thorny issue of honesty in large language models (LLMs) by highlighting how these models often offer confident yet incorrect responses and struggle to admit their knowledge gaps. It thoughtfully proposes a framework aimed at understanding and enhancing these systems’ honesty, presenting an important contribution to aligning AI behavior with human values. I find the exploration of honesty in LLMs to be a fascinating intersection of technology and philosophy, suggesting ample areas for further investigation.
Raw notes: I’m not sure what to make of the concept of honesty for AI. Perhaps of philosophical interest.
Not All LLM Reasoners Are Created Equal
Mila; Google DeepMind; Microsoft Research
This paper delves into the reasoning abilities of large language models when tackling interdependent math problems, highlighting how models falter in paired problem-solving compared to individual tasks. The authors discover that smaller, specialized models exhibit significant performance disparities, attributed to contextual errors and insufficient reasoning in sequential steps rather than test-set leakage. Intriguingly, instruction-tuning and finetuning practices play critical roles, underscoring substantial differences in models’ reasoning capabilities.
Raw notes: The reasoning gap for GPT-4o-mini is particularly striking. Size does matter!
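One way to quantify the kind of gap the paper measures, sketched here with invented numbers and an independence assumption rather than the authors’ exact metric: compare measured accuracy on chained two-part problems against the accuracy expected if the two sub-solves were independent.

```python
def expected_pair_accuracy(acc_q1: float, acc_q2: float) -> float:
    """Accuracy expected on a chained pair if the two sub-problems were solved independently."""
    return acc_q1 * acc_q2

def reasoning_gap(measured_pair_acc: float, acc_q1: float, acc_q2: float) -> float:
    """Positive gap: the model does worse on pairs than the independent estimate predicts."""
    return expected_pair_accuracy(acc_q1, acc_q2) - measured_pair_acc

# Placeholder numbers, not the paper's results.
print(reasoning_gap(measured_pair_acc=0.55, acc_q1=0.85, acc_q2=0.85))  # ~0.17
```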
MinerU: An Open-Source Solution for Precise Document Content Extraction
Shanghai Artificial Intelligence Laboratory
This paper introduces MinerU, an open-source tool aimed at enhancing the precision of document content extraction through sophisticated models and processing techniques. Impressively, it surpasses other open-source solutions in accuracy, which is supported by experimental results. I found its comparison to Surya particularly intriguing, highlighting MinerU’s superior reliability and quality in content extraction tasks.
Raw notes: Another noteworthy open source tool for document context extraction. The comparison against Surya is interesting.
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Microsoft; University of Science and Technology of China
This paper presents VPTQ, a method that shines in reducing the bit requirement for Large Language Models while maintaining performance, thanks to clever use of vector quantization and optimization techniques. The approach claims to not only slash memory and bandwidth needs but also to enhance model efficiency during inference. Impressively, the results suggest this low-bit quantization method surpasses existing approaches in both perplexity and accuracy.
Raw notes: Nice progress on post training quantization.
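To show what “vector” (as opposed to scalar) quantization means here, a minimal sketch using plain k-means on a toy weight matrix; this is only an illustration of the codebook-and-index storage scheme, and the paper’s actual optimization is considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 8)).astype(np.float32)  # toy weight matrix
vec_dim, num_centroids = 4, 16                               # hypothetical settings

# Split each 8-dim row into two 4-dim vectors.
vectors = weights.reshape(-1, vec_dim)

# Tiny k-means to build the codebook.
codebook = vectors[rng.choice(len(vectors), num_centroids, replace=False)].copy()
for _ in range(10):
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)
    for k in range(num_centroids):
        members = vectors[assign == k]
        if len(members):
            codebook[k] = members.mean(axis=0)

# Quantized storage: per-vector indices (4 bits each here) plus the small codebook.
indices = assign.astype(np.uint8)
reconstructed = codebook[indices].reshape(weights.shape)
print("mean squared error:", float(((weights - reconstructed) ** 2).mean()))
```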
Distilling an End-to-End Voice Assistant Without Instruction Training Data
Georgia Institute of Technology; Stanford University; National University of Singapore; Northeastern University
This paper introduces the Distilled Voice Assistant (DiVA), a novel method leveraging self-supervision from text-only Large Language Models to enhance end-to-end voice assistant training without the need for instruction data. I find it impressive that DiVA can outperform state-of-the-art models in both performance and efficiency. However, it’s unclear whether the authors have shared any artifacts like data and code, which could influence the replicability of their results.
Raw notes: Really interesting progress on speech LLMs. It’s unclear if artifacts such as data and code are shared.
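A hedged sketch of the distillation idea as I read it (the distributions below are placeholders; in the real setup the teacher is a text-only LLM reading the transcript and the student is the end-to-end speech model): the student’s next-token distribution is pulled toward the teacher’s with a KL term, so no speech instruction data is needed.

```python
import numpy as np

def kl_divergence(teacher: np.ndarray, student: np.ndarray, eps: float = 1e-9) -> float:
    """KL(teacher || student) over one next-token distribution."""
    teacher = teacher / teacher.sum()
    student = student / student.sum()
    return float(np.sum(teacher * (np.log(teacher + eps) - np.log(student + eps))))

vocab = 8
teacher_probs = np.array([0.05, 0.6, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05])  # text LLM on transcript
student_probs = np.full(vocab, 1.0 / vocab)                               # untrained speech model

print("distillation loss:", kl_divergence(teacher_probs, student_probs))
```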
Hyper-Connections
ByteDance
This paper presents hyper-connections as a novel approach to improve upon traditional residual connections in neural networks. By allowing for dynamic adjustment of connection strengths between features at different depths, it tackles common issues like gradient vanishing and representation collapse. The reported performance improvements across various AI tasks demonstrate the potential for broad applicability, making it a promising area for further research and experimentation.
Raw notes: Sounds like a significant discovery given the widespread use of residual connections. I look forward to more studies and experiments of this idea.
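Here is a hedged sketch of the contrast with ordinary residual connections, assuming a simplified static parameterization (the paper also describes dynamic, input-dependent variants); the module names and the choice of two streams are mine, not the paper’s.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual connection: a fixed, strength-1 skip around the layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.relu(self.f(x))

class HyperConnectionBlock(nn.Module):
    """Keeps n parallel hidden streams and learns how strongly each stream feeds
    the layer and how the layer output is written back to each stream."""
    def __init__(self, dim: int, n_streams: int = 2):
        super().__init__()
        self.f = nn.Linear(dim, dim)
        self.read = nn.Parameter(torch.ones(n_streams) / n_streams)   # streams -> layer input
        self.write = nn.Parameter(torch.ones(n_streams))              # layer output -> streams
        self.mix = nn.Parameter(torch.eye(n_streams))                 # stream <-> stream mixing

    def forward(self, streams):                # streams: (n_streams, batch, dim)
        layer_in = torch.einsum("s,sbd->bd", self.read, streams)
        layer_out = torch.relu(self.f(layer_in))
        mixed = torch.einsum("ts,sbd->tbd", self.mix, streams)
        return mixed + self.write[:, None, None] * layer_out

streams = torch.randn(2, 4, 16)                # 2 streams, batch of 4, width 16
block = HyperConnectionBlock(dim=16, n_streams=2)
print(block(streams).shape)                    # torch.Size([2, 4, 16])
```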
Quantifying Generalization Complexity for Large Language Models
Harvard University; Massachusetts Institute of Technology; University of Illinois Urbana-Champaign; Meta; University of Chicago
This paper introduces Scylla, a framework designed to assess how well large language models (LLMs) generalize across different complexities of tasks. It uncovers a critical complexity threshold where models shift from utilizing understanding to relying on memorization, highlighting the impact of model size on this ability. Notably, GPT models, especially GPT-4o-mini, show impressive robustness in handling complex reasoning tasks.
Raw notes: GPTs seem most robust, with GPT-4o-mini particularly impressive.
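A small sketch of the in-distribution versus out-of-distribution comparison such a framework rests on (all numbers invented for illustration): the complexity level where the ID/OOD accuracy gap peaks is treated as the critical threshold where memorization takes over from generalization.

```python
complexity_levels = [1, 2, 3, 4, 5]
id_accuracy  = [0.98, 0.95, 0.90, 0.70, 0.40]   # hypothetical in-distribution scores
ood_accuracy = [0.97, 0.92, 0.75, 0.45, 0.20]   # hypothetical out-of-distribution scores

gaps = [i - o for i, o in zip(id_accuracy, ood_accuracy)]
critical = complexity_levels[gaps.index(max(gaps))]
print("generalization gaps:", gaps)
print("critical complexity:", critical)
```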
Acknowledgements
Papers are retrieved from Hugging Face.