Overview
The collection of papers predominantly explores advancements in multimodal models, large language models (LLMs), and foundational models for specific domains. Several works, such as xGen-MM (BLIP-3), Show-o, Transfusion, and Open-FinLLMs, focus on improving multimodal understanding and generation, leveraging state-of-the-art techniques like autoregressive and diffusion modeling. Papers like Sapiens and TWLV-I highlight domain-specific foundation models for human-centric vision tasks and video comprehension, respectively, showcasing significant performance improvements. There is a notable emphasis on model efficiency and scalability, exemplified by LongVILA and Jamba-1.5, which introduce methods for handling longer contexts and optimizing resource usage. Additionally, the importance of pre-training datasets and model compression is underscored in papers such as “To Code, or Not To Code?” and “LLM Pruning and Distillation in Practice,” indicating ongoing efforts to enhance model robustness and applicability.
Spotlight
Automated Design of Agentic Systems
University of British Columbia; Vector Institute; Canada CIFAR AI Chair
This paper delves into the burgeoning field of Automated Design of Agentic Systems (ADAS) and introduces Meta Agent Search, an algorithm that can autonomously design and optimize agentic systems. I found the concept extremely forward-thinking, especially as it leverages AI to surpass traditional hand-designed solutions. However, while the paper shows promising results, it’s clear that this is in its early stages and would require much more development before it makes a real-world impact. The ambition and potential societal benefits are enormous, but caution and rigorous testing are essential to advance this research safely. It’s no wonder it’s sparking extensive discussion on social media. Raw notes
Other papers
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Salesforce AI Research; University of Washington
This paper introduces xGen-MM (BLIP-3), an innovative framework designed for creating large multimodal models with impressive in-context learning abilities and competitive performance against other open-source models. I found it particularly noteworthy that the authors didn’t just focus on performance but also on safety-tuning, minimizing harmful outputs. The open-source nature of the project, supported by Salesforce and the University of Washington, makes it an excellent resource for further research and development in the field. Raw notes
Sapiens: Foundation for Human Vision Models
Meta
This paper presents Sapiens, a family of vision models excelling at tasks like pose estimation and depth prediction through pretraining and fine-tuning on a large dataset of human images. I found it impressive how consistently Sapiens outperforms previous state-of-the-art methods, especially in scenarios with limited labeled data. The work also reaffirms the importance of combining architectural simplicity, model scale, and data quality when training successful foundation models, making it highly relevant for practitioners in the field. Raw notes
Controllable Text Generation for Large Language Models: A Survey
Renmin University of China; Institute for Advanced Algorithms Research; China Telecom Research Institute
This paper offers a thorough survey of Controllable Text Generation (CTG) techniques for Large Language Models (LLMs). I appreciate its clear categorization of CTG tasks and insightful discussion on methods like model retraining and prompt engineering. The paper’s evaluation of these methods’ strengths and weaknesses and its suggestions for future research directions make it incredibly valuable for both researchers and practitioners in the field. Raw notes
TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
Twelve Labs
This paper offers a comprehensive evaluation framework for video foundation models and introduces TWLV-I, a model that significantly improves upon current benchmarks in video understanding. The authors’ approach addresses critical gaps in action recognition tasks and provides accessible tools for further exploration in the field. I found the combination of practical application and theoretical advancement particularly noteworthy and relevant for industries like sports and media. Raw notes
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
NVIDIA; MIT; UC Berkeley; UT Austin
This paper showcases LongVILA, a novel approach to scaling vision-language models for long videos, featuring the Multi-Modal Sequence Parallelism (MM-SP) system that drastically improves training and inference efficiency. With context lengths scaling up to 2 million tokens, this model significantly enhances video captioning and processing capabilities. It’s a notable contribution from a collaborative effort involving NVIDIA, MIT, Berkeley, and UT Austin, emphasizing both algorithmic and system advancements. Raw notes
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Beihang University; University of Waterloo; Fudan University; Beijing Information Science and Technology University
This paper introduces TableBench, a benchmark aimed at testing the TableQA capabilities of various large language models (LLMs) and highlighting the gap between academic benchmarks and real-world challenges. Despite the advancements showcased in models like GPT-4, it demonstrates that these models still struggle significantly with complex tabular data tasks in practical applications. I found it intriguing, although I wondered why Anthropic models were not included in the evaluation. Raw notes
LLM Pruning and Distillation in Practice: The Minitron Approach
NVIDIA
This paper presents a practical approach to compressing language models such as Llama 3.1 and Mistral NeMo via pruning and distillation, largely retaining performance in the smaller variants. The comparative analysis of different pruning techniques and the emphasis on fine-tuning the teacher model are particularly insightful. I find the availability of the base model weights a valuable resource for further research and application. Raw notes
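As a rough illustration of the distillation half of such a pipeline (not the paper's actual recipe, which also involves structured pruning, teacher fine-tuning, and full-scale models), here is a minimal PyTorch sketch: a smaller student is trained against a frozen teacher's temperature-softened logits plus the ordinary next-token loss. The `ToyLM` module, the temperature, and the loss weighting are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN_T, HIDDEN_S, SEQ = 1000, 256, 128, 32  # toy sizes, purely illustrative

class ToyLM(nn.Module):
    """Stand-in for a transformer LM; emits next-token logits at every position."""
    def __init__(self, hidden):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hidden)
        self.body = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, VOCAB)

    def forward(self, ids):
        h, _ = self.body(self.embed(ids))
        return self.head(h)

teacher = ToyLM(HIDDEN_T).eval()   # frozen, larger model
student = ToyLM(HIDDEN_S)          # smaller model standing in for the pruned variant
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
T = 2.0                            # distillation temperature (assumed hyperparameter)

def distill_step(ids, labels):
    with torch.no_grad():
        t_logits = teacher(ids)
    s_logits = student(ids)
    # KL divergence between temperature-softened teacher and student distributions.
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1).view(-1, VOCAB),
        F.softmax(t_logits / T, dim=-1).view(-1, VOCAB),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary next-token cross-entropy on the hard labels.
    ce = F.cross_entropy(s_logits.view(-1, VOCAB), labels.view(-1))
    loss = 0.5 * kd + 0.5 * ce     # the mixing weight is an assumption
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

ids = torch.randint(0, VOCAB, (4, SEQ))
print(distill_step(ids, labels=ids.roll(-1, dims=1)))
```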
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Show Lab, National University of Singapore; ByteDance
This paper presents Show-o, a novel transformer model that skillfully integrates autoregressive and discrete diffusion modeling to tackle both multimodal understanding and generation tasks. The results are promising, as the model outperforms specialized models on benchmarks for tasks like visual question-answering and text-to-image generation, though the experimental data appear somewhat preliminary. The collaboration between Show Lab at NUS and ByteDance suggests a strong research foundation and potential for future developments in multimodal applications. Raw notes
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Meta; Waymo; University of Southern California
This paper introduces Transfusion, a multi-modal model that combines next-token prediction and diffusion within a single transformer framework to handle both text and image data. With 7 billion parameters and modality-specific encoding and decoding layers, Transfusion outperforms traditional approaches, demonstrating strong results in both text and image generation. I find it particularly intriguing that it joins the growing research trend of integrating autoregressive and diffusion models, showing promise for advancing multimodal processing. Raw notes
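To make the "one backbone, two objectives" idea concrete, here is a minimal PyTorch sketch in the spirit of the paper: a single transformer trained with next-token cross-entropy on text (causal attention) and a denoising regression loss on noised image latents (bidirectional attention). The toy module sizes, the simple linear noise schedule, and the 50/50 loss mix are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, TXT_LEN, IMG_TOKENS = 1000, 128, 16, 8   # toy sizes

# One shared backbone for both modalities.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
tok_embed = nn.Embedding(VOCAB, DIM)   # text tokens -> hidden states
lm_head = nn.Linear(DIM, VOCAB)        # hidden states -> next-token logits
img_in = nn.Linear(DIM, DIM)           # noised image latents -> hidden states
img_out = nn.Linear(DIM, DIM)          # hidden states -> predicted noise

def joint_loss(text_ids, img_latents):
    # Text stream: causal attention and shift-by-one next-token prediction.
    causal = nn.Transformer.generate_square_subsequent_mask(text_ids.size(1))
    txt_h = backbone(tok_embed(text_ids), mask=causal)
    lm = F.cross_entropy(
        lm_head(txt_h[:, :-1]).reshape(-1, VOCAB),
        text_ids[:, 1:].reshape(-1),
    )
    # Image stream: bidirectional attention over noised latents; the model is
    # trained to predict the injected noise (a simplified diffusion objective).
    noise = torch.randn_like(img_latents)
    t = torch.rand(img_latents.size(0), 1, 1)          # per-sample noise level
    noised = (1 - t) * img_latents + t * noise
    diff = F.mse_loss(img_out(backbone(img_in(noised))), noise)
    return 0.5 * lm + 0.5 * diff                        # mixing weight assumed

text_ids = torch.randint(0, VOCAB, (2, TXT_LEN))
img_latents = torch.randn(2, IMG_TOKENS, DIM)
print(joint_loss(text_ids, img_latents))
```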
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
University of Washington; FAIR at Meta
This paper introduces JPEG-LM, an innovative approach in which large language models (LLMs) generate images and videos by directly modeling compressed file formats such as JPEG and AVC, rather than raw pixel values or vector-quantized codes. I found it particularly impressive that this method achieves a significant 31% reduction in Fréchet Inception Distance (FID), indicating more efficient and effective image generation. It also points to simpler integration of visual generation into multi-modal language models. Raw notes
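The core idea is simple enough to sketch: serialize an image through a canonical codec and let a language model treat the resulting byte stream as an ordinary token sequence. The snippet below (using Pillow, with a one-token-per-byte vocabulary as an assumed tokenization) only illustrates this data representation, not the paper's actual training setup.

```python
import io
import torch
from PIL import Image

# Encode a toy image to JPEG bytes entirely in memory.
img = Image.new("RGB", (32, 32), color=(200, 120, 40))
buf = io.BytesIO()
img.save(buf, format="JPEG", quality=25)      # lower quality -> shorter sequences
jpeg_bytes = buf.getvalue()

# Each byte becomes one token id in a 256-symbol vocabulary; an off-the-shelf
# decoder-only LM could then be trained on such sequences like any other text.
token_ids = torch.tensor(list(jpeg_bytes), dtype=torch.long).unsqueeze(0)
print(token_ids.shape, token_ids[0, :4])      # starts with the JPEG SOI marker 0xFF 0xD8

# Generation would run the reverse path: sample bytes from the LM, then decode.
roundtrip = Image.open(io.BytesIO(bytes(token_ids[0].tolist())))
print(roundtrip.size)
```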
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Cohere For AI; Cohere
This paper delves into the impact of including code in the pre-training data of large language models and demonstrates that it significantly boosts performance in various non-code tasks. The authors convincingly show that higher-quality code in the pre-training phase leads to better generalization and enhanced natural language reasoning. I found the results both expected and essential for refining future pre-training approaches. Raw notes
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
AI21
This paper introduces Jamba-1.5, which uses a hybrid Transformer-Mamba architecture to achieve efficient performance and low memory usage while maintaining high-quality output. It stands out with a new quantization technique, ExpertsInt8, and offers long-context capabilities along with publicly released model weights. I find the approach intriguing, especially given how its strategy contrasts with those of other major players in the field. Raw notes
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
The Fin AI; Wuhan University; Columbia University; The Chinese University of Hong Kong, Shenzhen; Nanjing University; Rensselaer Polytechnic Institute; The University of Manchester; Stevens Institute of Technology; Sichuan University; University of Florida; University of Montreal; Yale University; New York University; Stony Brook University; NVIDIA; Artificial Intelligence Research Centre; Archimedes/Athena Research Centre
This paper introduces Open-FinLLMs, a suite of large language models fine-tuned for financial applications and capable of processing multi-modal data. The models, including FinLLaMA and FinLLaVA, show remarkable performance improvements over existing financial language models in various tasks and trading simulations. I find the collaborative effort across numerous institutions particularly impressive, underscoring the significant strides in financial technology these models represent. Raw notes
Acknowledgements
Papers are retrieved from Hugging Face.
Social media metrics are from Emergent Mind.