Weekly paper roundup: Automated Design of Agentic Systems (8/19/2024)

Overview

The collection of papers predominantly explores advancements in multimodal models, large language models (LLMs), and foundational models for specific domains. Several works, such as xGen-MM (BLIP-3), Show-o, Transfusion, and Open-FinLLMs, focus on improving multimodal understanding and generation, leveraging state-of-the-art techniques like autoregressive and diffusion modeling. Papers like Sapiens and TWLV-I highlight domain-specific foundation models for human-centric vision tasks and video comprehension, respectively, showcasing significant performance improvements. There is a notable emphasis on model efficiency and scalability, exemplified by LongVILA and Jamba-1.5, which introduce methods for handling longer contexts and optimizing resource usage. Additionally, the importance of pre-training datasets and model compression is underscored in papers such as “To Code, or Not To Code?” and “LLM Pruning and Distillation in Practice,” indicating ongoing efforts to enhance model robustness and applicability.

Spotlight 🔦

Automated Design of Agentic Systems

University of British Columbia; Vector Institute; Canada CIFAR AI Chair

🤗 · X 2053 · HackerNews 4 · Reddit 13 · YouTube 5 · GitHub 0

This paper delves into the burgeoning field of Automated Design of Agentic Systems (ADAS) and introduces Meta Agent Search, an algorithm that can autonomously design and optimize agentic systems. I found the concept extremely forward-thinking, especially as it leverages AI to surpass traditional hand-designed solutions. However, while the paper shows promising results, it’s clear that this is in its early stages and would require much more development before it makes a real-world impact. The ambition and potential societal benefits are enormous, but caution and rigorous testing are essential to advance this research safely. It’s no wonder it’s sparking extensive discussion on social media. Raw notes
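To make the core loop concrete, here is a minimal sketch of a Meta Agent Search-style procedure: a meta agent (an LLM prompted to write code) proposes new agents as programs, each candidate is scored on validation tasks, and the growing archive of discovered agents conditions the next proposal. The helper names (`propose`, `evaluate`) are hypothetical stand-ins, not the paper's API.

```python
# Minimal sketch of a Meta Agent Search-style loop (helper names are hypothetical).
from dataclasses import dataclass

@dataclass
class Candidate:
    code: str      # source of an agent, e.g. a forward(task) -> answer function
    score: float   # mean performance on the validation tasks

def meta_agent_search(propose, evaluate, iterations=30):
    """propose(archive) -> new agent code; evaluate(code) -> score in [0, 1]."""
    archive = []                           # discovered agents and their scores
    for _ in range(iterations):
        code = propose(archive)            # meta agent (an LLM) writes a new agent in code
        score = evaluate(code)             # run the candidate agent on validation tasks
        archive.append(Candidate(code, score))
    return max(archive, key=lambda c: c.score)
```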


Other papers

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Salesforce AI Research; University of Washington

🤗 · X 745 · HackerNews 0 · Reddit 0 · YouTube 1 · GitHub 0

This paper introduces xGen-MM (BLIP-3), an innovative framework designed for creating large multimodal models with impressive in-context learning abilities and competitive performance against other open-source models. I found it particularly noteworthy that the authors didn’t just focus on performance but also on safety-tuning, minimizing harmful outputs. The open-source nature of the project, supported by Salesforce and the University of Washington, makes it an excellent resource for further research and development in the field. Raw notes


Sapiens: Foundation for Human Vision Models

Meta

🤗 · X 99 · HackerNews 0 · Reddit 0 · YouTube 3 · GitHub 0

This paper presents Sapiens, a family of vision models excelling in tasks like pose estimation and depth prediction by leveraging fine-tuning on a large dataset of human images. I found it impressive how Sapiens consistently outperforms previous state-of-the-art methods, especially in scenarios with limited labeled data. The work also reaffirms the importance of combining architectural simplicity, model scale, and data quality in training successful foundational models, making it highly relevant for practitioners in the field. Raw notes


Controllable Text Generation for Large Language Models: A Survey

Renmin University of China; Institute for Advanced Algorithms Research; China Telecom Research Institute

🤗 · X 224 · HackerNews 0 · Reddit 0 · YouTube 1 · GitHub 40

This paper offers a thorough survey of Controllable Text Generation (CTG) techniques for Large Language Models (LLMs). I appreciate its clear categorization of CTG tasks and insightful discussion on methods like model retraining and prompt engineering. The paper’s evaluation of these methods’ strengths and weaknesses and its suggestions for future research directions make it incredibly valuable for both researchers and practitioners in the field. Raw notes


TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Twelve Labs

🤗 · X 78 · HackerNews 0 · Reddit 0 · YouTube 0 · GitHub 13

This paper offers a comprehensive evaluation framework for video foundation models and introduces TWLV-I, a model that significantly improves upon current benchmarks in video understanding. The authors’ approach addresses critical gaps in action recognition tasks and provides accessible tools for further exploration in the field. I found the combination of practical application and theoretical advancement particularly noteworthy and relevant for industries like sports and media. Raw notes


LongVILA: Scaling Long-Context Visual Language Models for Long Videos

NVIDIA; MIT; UC Berkeley; UT Austin

🤗 · X 368 · HackerNews 0 · Reddit 0 · YouTube 0 · GitHub 0

This paper showcases LongVILA, a novel approach to scaling vision-language models for long videos, featuring the Multi-Modal Sequence Parallelism (MM-SP) system that drastically improves training and inference efficiency. With context lengths scaling up to 2 million tokens, this model significantly enhances video captioning and processing capabilities. It’s a notable contribution from a collaborative effort involving NVIDIA, MIT, Berkeley, and UT Austin, emphasizing both algorithmic and system advancements. Raw notes


TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Beihang University; University of Waterloo; Fudan University; Beijing Information Science and Technology University

🤗 · X 1 · HackerNews 2 · Reddit 0 · YouTube 0 · GitHub 0

This paper introduces TableBench, a benchmark aimed at testing the TableQA capabilities of various large language models (LLMs) and highlighting the gap between academic benchmarks and real-world challenges. Despite the advancements showcased in models like GPT-4, it demonstrates that these models still struggle significantly with complex tabular data tasks in practical applications. I found it intriguing, although I wondered why Anthropic models were not included in the evaluation. Raw notes


LLM Pruning and Distillation in Practice: The Minitron Approach

NVIDIA

🤗 · X 365 · HackerNews 3 · Reddit 0 · YouTube 6 · GitHub 0

This paper presents a practical approach to compressing language models like Llama 3.1 and Mistral NeMo via pruning and distillation, retaining performance in smaller variants. The comparative analysis of different pruning techniques and the emphasis on fine-tuning teacher models are particularly insightful. I find the availability of the base model weights to be a valuable resource for further research and application. Raw notes
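For readers unfamiliar with the distillation side, the sketch below shows the standard logit-distillation objective that such compression pipelines typically build on: the pruned student is trained to match the teacher's output distribution. This is generic PyTorch, not the paper's exact recipe, and the temperature is an illustrative parameter.

```python
# Generic logit-distillation loss (a standard technique, not the exact Minitron recipe).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence from the teacher's to the student's next-token distribution."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature**2
```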


Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Show Lab, National University of Singapore; ByteDance

🤗 · X 465 · HackerNews 0 · Reddit 0 · YouTube 0 · GitHub 337

This paper presents Show-o, a novel transformer model that skillfully integrates autoregressive and discrete diffusion modeling to tackle both multimodal understanding and generation tasks. The results are promising: the model matches or outperforms specialized models on benchmarks for tasks like visual question answering and text-to-image generation, though the experimental data appear somewhat preliminary. The collaboration between Show Lab at NUS and ByteDance suggests a strong research foundation and potential for future developments in multimodal applications. Raw notes


Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Meta; Waymo; University of Southern California

🤗 · X 1677 · HackerNews 1 · Reddit 84 · YouTube 7 · GitHub 0

This paper introduces Transfusion, a groundbreaking multi-modal model combining next-token prediction and diffusion techniques to handle text and image data within a single transformer framework. With 7 billion parameters and modality-specific encoding and decoding layers, Transfusion outperforms conventional approaches, demonstrating strong results in both text and image generation. I find it particularly intriguing that it joins the burgeoning research trend of integrating autoregressive and diffusion models, and it shows real promise for advancing multimodal processing. Raw notes
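As a rough illustration of how the two objectives coexist, the sketch below combines a next-token cross-entropy loss on text positions with a noise-prediction (diffusion) loss on image positions. This is a simplification, and the weighting factor is an assumed illustrative value, not the paper's reported setting.

```python
# Simplified sketch of a Transfusion-style combined objective (weight is illustrative).
import torch.nn.functional as F

def combined_loss(text_logits, text_targets, noise_pred, noise, image_weight=5.0):
    """Cross-entropy on text tokens plus noise-prediction MSE on image latents."""
    lm_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    diffusion_loss = F.mse_loss(noise_pred, noise)   # epsilon-prediction objective
    return lm_loss + image_weight * diffusion_loss
```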


JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

University of Washington; FAIR at Meta

🤗 · X 634 · HackerNews 6 · Reddit 108 · YouTube 1 · GitHub 0

This paper introduces JPEG-LM, an innovative approach where large language models (LLMs) generate images and videos by directly modeling compressed file formats such as JPEG and AVC, rather than traditional pixel values or vector quantization. I found it particularly impressive that this method achieves a significant 31% reduction in Fréchet Inception Distance (FID), indicating more efficient and effective image generation. In addition, it clearly demonstrates the potential for simpler integration and advancements in multi-modal language and visual generation tasks. Raw notes
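The core idea is easy to picture: serialize each image with an off-the-shelf JPEG encoder and let a plain decoder-only language model operate on the resulting byte stream. The sketch below (using Pillow; the quality setting is illustrative) shows the encoding direction; generation runs in reverse, sampling bytes autoregressively and then decoding the completed byte string with a standard JPEG decoder.

```python
# Sketch of the JPEG-as-tokens idea (quality setting is illustrative).
import io
from PIL import Image

def image_to_byte_tokens(path, quality=25):
    """Encode an image as canonical JPEG and return its bytes as integer tokens (0-255)."""
    buf = io.BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG", quality=quality)
    return list(buf.getvalue())   # feed these ids to a decoder-only language model
```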


To Code, or Not To Code? Exploring Impact of Code in Pre-training

Cohere For AI; Cohere

🤗 · X 899 · HackerNews 7 · Reddit 18 · YouTube 2 · GitHub 0

This paper delves into the impact of including code in the pre-training data of large language models and demonstrates that it significantly boosts performance in various non-code tasks. The authors convincingly show that higher-quality code in the pre-training phase leads to better generalization and enhanced natural language reasoning. I found the results both expected and essential for refining future pre-training approaches. Raw notes


Jamba-1.5: Hybrid Transformer-Mamba Models at Scale

AI21

🤗 · X 281 · HackerNews 0 · Reddit 0 · YouTube 1 · GitHub 0

This paper introduces Jamba-1.5, which uses a hybrid Transformer-Mamba architecture to achieve efficient performance and low memory usage while maintaining high-quality output. It stands out for its new quantization technique, ExpertsInt8, its long-context capabilities, and the public availability of its model weights. I find the approach intriguing, especially given how it contrasts with the strategies of other major players in the field. Raw notes


Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

The Fin AI; Wuhan University; Columbia University; The Chinese University of Hong Kong, Shenzhen; Nanjing University; Rensselaer Polytechnic Institute; The University of Manchester; Stevens Institute of Technology; Sichuan University; University of Florida; University of Montreal; Yale University; New York University; Stony Brook University; NVIDIA; Artificial Intelligence Research Centre; Archimedes/Athena Research Centre

🤗 · X 1 · HackerNews 0 · Reddit 0 · YouTube 1 · GitHub 0

This paper introduces Open-FinLLMs, a suite of large language models fine-tuned for financial applications and capable of processing multi-modal data. The models, including FinLLaMA and FinLLaVA, show remarkable performance improvements over existing financial language models in various tasks and trading simulations. I find the collaborative effort across numerous institutions particularly impressive, underscoring the significant strides in financial technology these models represent. Raw notes


Acknowledgements

Papers are retrieved from Hugging Face.

Social media metrics are from Emergent Mind.