Weekly paper roundup: Moshi (9/16/2024)

Overview

This week’s papers converge on advancing large language models (LLMs), multimodal models, and novel architectures across a wide range of applications: code generation, self-correction, image and music generation, mathematical reasoning, search, and personalization. Techniques such as reinforcement learning for self-correction, dynamic-resolution processing in vision-language models, and hybrid architectures that combine model types all point toward better performance and efficiency. A strong emphasis on open-source releases and high-quality datasets underscores a collaborative approach to accelerating progress, while methods like retrieval-based attention and quantization of very large LLMs reflect ongoing work to optimize computational efficiency and resource use when deploying these models.

Spotlight 🔦

Moshi: a speech-text foundation model for real-time dialogue

Kyutai

This paper was not uploaded to arXiv and thus is not covered by Hugging Face Papers. It is nonetheless a significant contribution to speech-text foundation models and a huge win for open science. Moshi can be thought of as an open-source answer to the conversational abilities of OpenAI’s GPT-4o (remember the Her-inspired demo?). The technical report is massive and worth a read for anyone interested in speech-text multimodality. Moshi can be experienced on the Kyutai website.

Spotlight 🔦

Qwen2.5-Coder Technical Report

Alibaba Group

🤗 105 · X 683 · HackerNews 0 · Reddit 0 · YouTube 1 · GitHub 0

This paper introduces the Qwen2.5-Coder series, showcasing two models, Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B, which excel in code generation tasks. The 7B model is particularly impressive, outperforming other open code LLMs under 40B parameters and matching OpenAI GPT-4 0613 on multiple benchmarks. What stands out to me is the permissive Apache 2.0 licensing and the potential real-world applications. It’s exciting to see a 32B model on the horizon and to consider the implications if Alibaba can navigate the challenges that have plagued US tech companies. This work is a significant leap forward for code intelligence research.

Raw notes: Impressive benchmark results. The 7B model outperforms other open code LLMs under 40B parameters, including Mistral’s Codestral and DeepSeek, and matches OpenAI GPT-4 0613 on various benchmarks. Released under Apache 2.0 and available on Hugging Face. A 32B model is coming. If Alibaba can avoid the problems of US big tech companies and execute well, watch out.
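For anyone who wants to kick the tires, here’s a minimal sketch of generating code with the released weights via Hugging Face transformers. The repo ID below is my assumption of how the checkpoints are named on the Hub, so double-check against the Qwen organization page.

```python
# Minimal sketch: code generation with Qwen2.5-Coder via transformers.
# The repo name is an assumption; verify it on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that checks whether a string is a palindrome."
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```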


Other papers

Training Language Models to Self-Correct via Reinforcement Learning

Google DeepMind

🤗 98 · X 2803 · HackerNews 228 · Reddit 441 · YouTube 5 · GitHub 0

This paper introduces SCoRe, a novel reinforcement learning approach designed to enhance the self-correction abilities of large language models. I found it compelling because it highlights the limitations of supervised fine-tuning and demonstrates significant improvements in model performance on challenging benchmarks. It’s a timely and relevant contribution, especially given the ongoing exploration of self-improvement capabilities within AI research.

Raw notes: Widely discussed on social media, with connections drawn to OpenAI’s o1 release. Some folks noted that Google does research, whereas OpenAI ships. It’s potentially a good example of a case where supervised learning (fine-tuning, in this case) is not as effective as RL. Time will tell.
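The mechanics of reward shaping for self-correction are easier to see in code. Below is a toy sketch in the spirit of SCoRe’s two-attempt setup, not the paper’s exact formulation: the second attempt earns the base reward, with a bonus for improving on the first attempt and a penalty for regressing (the coefficient is made up).

```python
# Toy sketch of self-correction reward shaping in the spirit of SCoRe.
# The terms and the coefficient are illustrative assumptions, not the
# paper's formulation.
def self_correction_reward(first_correct: bool, second_correct: bool,
                           improvement_bonus: float = 0.5) -> float:
    """Reward the revised (second) attempt, with a shaped bonus for
    improving on the first attempt and a penalty for regressing."""
    base = 1.0 if second_correct else 0.0
    shaping = improvement_bonus * (int(second_correct) - int(first_correct))
    return base + shaping

# Fixing a wrong answer beats repeating a right one; breaking a right
# answer is penalized, which discourages degenerate "no-edit" policies.
print(self_correction_reward(False, True))   # 1.5
print(self_correction_reward(True, True))    # 1.0
print(self_correction_reward(True, False))   # -0.5
```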


OmniGen: Unified Image Generation

Beijing Academy of Artificial Intelligence

🤗 72 · X 455 · HackerNews 0 · Reddit 25 · YouTube 5 · GitHub 489

This paper introduces OmniGen, a unified diffusion model for image generation that merges various control conditions into a single framework, streamlining tasks like text-to-image generation and image editing. I appreciate how it emphasizes ease of use and the transfer of knowledge across different tasks, making it a versatile tool in the image generation field. The emphasis on unification, simplicity, and cross-task applicability really sets this approach apart.

Raw notes: The three points around unification, simplicity, and knowledge transfer are noteworthy. Nice paper. Research papers have been coming out from Chinese institutions at a frightening pace.
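To illustrate what “unified” means here, the sketch below shows the general idea of expressing different tasks as one interleaved sequence of text and image inputs to a single model. The `<img>` marker and helper are hypothetical illustrations, not OmniGen’s actual API.

```python
# Illustrative sketch of the "unified" idea: every task (text-to-image,
# editing, subject-driven generation) becomes one interleaved sequence of
# text and image inputs to a single model. Names here are hypothetical.
def build_unified_prompt(instruction: str, images: list) -> list:
    """Interleave text and images into one stream; each <img> marker in the
    instruction is replaced by the next conditioning image."""
    sequence = []
    for i, part in enumerate(instruction.split("<img>")):
        if part:
            sequence.append(("text", part))
        if i < len(images):
            sequence.append(("image", images[i]))
    return sequence

# Text-to-image: no conditioning image at all.
print(build_unified_prompt("a watercolor fox in the snow", []))
# Editing: the source image rides along in the same sequence.
print(build_unified_prompt("remove the car from <img>", ["photo.png"]))
```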


Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

Alibaba Group

🤗 61 · X 201 · HackerNews 0 · Reddit 0 · YouTube 3 · GitHub 2145

This paper introduces the Qwen2-VL Series with innovative mechanisms like Naive Dynamic Resolution and Multimodal Rotary Position Embedding to significantly enhance vision-language models’ ability to process and integrate varied data types at different resolutions. I am impressed by how it competes with leading models such as GPT-4o and Claude3.5-Sonnet, demonstrating superior performance across multiple benchmarks. Moreover, the availability of model weights is a fantastic bonus for the research community.

Raw notes: In addition to the code model, Alibaba’s Qwen team also announced a VL model that matches or surpasses GPT-4o and Claude3.5-Sonnet on many MM benchmarks. Model weights are available. Amazing!
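A quick back-of-the-envelope sketch of what dynamic resolution implies for token budgets, using the 14x14 patches and 2x2 token merging described in the report; the real preprocessor’s resizing and rounding rules may differ.

```python
# Back-of-the-envelope sketch of "naive dynamic resolution": images keep
# their native aspect ratio, are cut into 14x14 patches, and 2x2 adjacent
# patches are merged into one visual token. Exact rounding rules in the
# real preprocessor may differ.
def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    h_patches = height // patch
    w_patches = width // patch
    return (h_patches // merge) * (w_patches // merge)

print(visual_token_count(224, 224))    # small square image -> 64 tokens
print(visual_token_count(448, 1344))   # wide image keeps its shape -> 768 tokens
```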


NVLM: Open Frontier-Class Multimodal LLMs

NVIDIA

🤗 54 · X 673 · HackerNews 0 · Reddit 0 · YouTube 0 · GitHub 1

This paper introduces the NVLM 1.0 models, which excel in vision-language tasks through an innovative architecture and unique 1-D tile-tagging design, outperforming existing models. I found the emphasis on high-quality, diverse datasets over sheer volume during pretraining particularly insightful. The planned release of model weights and code is a commendable step towards fostering further research in the field.

Raw notes: Strong MLLMs from NVIDIA. It’d be interesting to compare them to Qwen2-VL (see the previous paper).
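As I read the tile-tagging design, it amounts to interleaving plain-text tile tags with each tile’s visual features so the LLM knows where each tile of a high-resolution image came from. The sketch below illustrates that idea with a hypothetical tag format, not the paper’s exact tokens.

```python
# Sketch of the 1-D tile-tagging idea: a high-resolution image is split
# into tiles, and a plain-text tag precedes each tile's visual features so
# the LLM can track tile order. The tag format is an assumption.
def tag_tiles(tile_features: list) -> list:
    sequence = []
    for i, feats in enumerate(tile_features, start=1):
        sequence.append(f"<tile_{i}>")  # text token(s), hypothetical format
        sequence.append(feats)          # the tile's flattened visual features
    return sequence

tiles = ["feat_a", "feat_b", "feat_c"]  # stand-ins for feature tensors
print(tag_tiles(tiles))  # ['<tile_1>', 'feat_a', '<tile_2>', 'feat_b', ...]
```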


Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

ByteDance

🤗 43 · X 6 · HackerNews 0 · Reddit 0 · YouTube 0 · GitHub 0

This paper presents an innovative framework that enhances music generation by combining auto-regressive and diffusion techniques. I was particularly impressed by its interactive capabilities for editing lyrics and melodies, which could significantly streamline the music production process. The contribution from ByteDance promises to set a new standard for both quality and control in automated music creation.

Raw notes: Interesting work from ByteDance.


InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

ByteDance, Inc; Chinese Academy of Sciences

🤗 43 · X 94 · HackerNews 0 · Reddit 0 · YouTube 1 · GitHub 0

This paper introduces InfiMM-WebMath-40B, a robust open-source dataset aimed at enhancing mathematical reasoning in Multimodal Large Language Models. I found it impressive that the dataset, with its extensive collection of 24 million web pages, significantly boosts performance across multi-modal math benchmarks, achieving state-of-the-art results. The authors provide a thorough overview of their data collection process, making this a noteworthy contribution to the field.

Raw notes: A welcome dataset contribution from ByteDance. RL is still only on the roadmap. It’d be interesting to run o1-preview and o1-mini on this.


MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines

CUHK MMLab; ByteDance; CUHK MiuLar Lab; Shanghai AI Laboratory; Peking University; Stanford University; Sensetime Research

🤗 33 · X 1 · HackerNews 0 · Reddit 0 · YouTube 0 · GitHub 57

This paper presents MMSearch, a benchmarking framework designed to evaluate how well Large Multimodal Models (LMMs) can function as search engines handling multimodal queries. I found the experiments insightful, especially noting how the GPT-4o model outperformed existing commercial solutions, albeit without including Gemini. The work sheds light on the potential and challenges of developing multimodal AI search engines, offering valuable data and direction for future advancements.

Raw notes: It’s unclear how common MM search is. Gemini is not included in the experiments. Interesting to note the poor performance of Perplexity.


Kolmogorov-Arnold Transformer

National University of Singapore

🤗 31 · X 1205 · HackerNews 2 · Reddit 0 · YouTube 2 · GitHub 0

This paper introduces the Kolmogorov-Arnold Transformer (KAT), which replaces traditional MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance performance and expressiveness. It addresses significant challenges in optimizing KANs for practical deployment, providing effective solutions such as rational functions and variance-preserving initialization. The results indicate that KAT scales well and outperforms conventional transformers, making it an intriguing advancement especially for vision models.

Raw notes: Early work building on KAN. Vision models only.
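The core swap is easy to sketch: instead of KAN’s B-splines, each unit is a learnable rational function phi(x) = P(x)/Q(x). The parameterization below (degrees, and the 1 + |Q| denominator that keeps it pole-free) follows common rational-activation practice and may differ from the paper’s exact choice.

```python
# Minimal sketch of a rational-function unit of the kind KAT uses in place
# of KAN's B-splines: phi(x) = P(x) / Q(x) with learnable coefficients.
# The exact degrees and parameterization are assumptions.
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    def __init__(self, p_degree: int = 5, q_degree: int = 4):
        super().__init__()
        self.p = nn.Parameter(torch.randn(p_degree + 1) * 0.1)
        self.q = nn.Parameter(torch.randn(q_degree) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        num = sum(c * x**i for i, c in enumerate(self.p))
        # 1 + |...| keeps the denominator positive, so there are no poles.
        den = 1.0 + torch.abs(sum(c * x**(i + 1) for i, c in enumerate(self.q)))
        return num / den

act = RationalActivation()
print(act(torch.linspace(-1, 1, 5)).shape)  # torch.Size([5])
```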


RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Microsoft Research; Shanghai Jiao Tong University; Fudan University

🤗 28 · X 183 · HackerNews 0 · Reddit 0 · YouTube 1 · GitHub 0

This paper presents RetrievalAttention, a method that accelerates the inference of long-context large language models using approximate nearest neighbor search to efficiently retrieve key-value vectors. The approach significantly cuts down inference latency and GPU memory usage while maintaining accuracy, allowing the model to handle 128K tokens with just 16GB of GPU memory. However, I noticed that no code is provided, which might be a bit of a letdown for those wanting to implement this immediately.

Raw notes: Good read for those interested in long context and LLM inference efficiency. No code is shared though.
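Since no code ships with the paper, here’s a conceptual stand-in for the idea: attend only over the top-k keys most similar to the current query. In the real system an ANN index on CPU surfaces that subset cheaply; the dense scan below is just for illustration.

```python
# Conceptual sketch of retrieval-based attention: instead of attending over
# all cached keys, retrieve the top-k best-matching keys for the current
# query and run softmax attention over only that subset.
import numpy as np

def retrieval_attention(q, K, V, k=8):
    """q: (d,), K: (n, d), V: (n, d). Attend over only k retrieved keys."""
    scores = K @ q                      # similarity of the query to every key
    top = np.argsort(scores)[-k:]       # an ANN index would return this cheaply
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                        # softmax over the retrieved subset only
    return w @ V[top]

rng = np.random.default_rng(0)
n, d = 10_000, 64                       # long context, small head dim
K, V, q = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=d)
print(retrieval_attention(q, K, V).shape)  # (64,)
```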


LLMs + Persona-Plug = Personalized LLMs

Gaoling School of Artificial Intelligence, Renmin University of China; Baidu Inc.

🤗 28 · X 164 · HackerNews 0 · Reddit 0 · YouTube 1 · GitHub 0

This paper presents a fresh approach to personalizing large language models (LLMs) using a lightweight plug-in user embedder to generate user-specific embeddings efficiently. I appreciate how the authors manage to improve personalization without the high costs of fine-tuning for each user, and the experimental results demonstrating superior performance are convincing. This innovation could be a game-changer for consumer products requiring personalized user experiences.

Raw notes: Personalization is a timeless topic. All consumer products need it.
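A minimal sketch of the plug-in idea as I understand it: a small user encoder turns a user’s history into a soft-prompt vector that is prepended to the frozen LLM’s input embeddings. The shapes and the mean-pooling encoder are my assumptions.

```python
# Sketch of plug-in personalization: a lightweight user encoder maps a
# user's history to one soft-prompt vector prepended to the frozen LLM's
# input embeddings. Shapes and the pooling scheme are assumptions.
import torch
import torch.nn as nn

class UserEmbedder(nn.Module):
    def __init__(self, hist_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Linear(hist_dim, model_dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        """history: (num_items, hist_dim) -> one user embedding (1, model_dim)."""
        return self.proj(history.mean(dim=0, keepdim=True))

embedder = UserEmbedder(hist_dim=256, model_dim=4096)
history = torch.randn(20, 256)               # encoded past documents of a user
user_vec = embedder(history)                 # (1, 4096)
token_embeds = torch.randn(1, 50, 4096)      # the frozen LLM's input embeddings
inputs = torch.cat([user_vec.unsqueeze(0), token_embeds], dim=1)  # prepend
print(inputs.shape)                          # torch.Size([1, 51, 4096])
```

Only the embedder trains, so serving one personalized model per user costs a vector, not a fine-tune.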


To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

The University of Texas at Austin; Johns Hopkins University; Princeton University

🤗 27 · X 1041 · HackerNews 1 · Reddit 0 · YouTube 5 · GitHub 0

This paper does a fantastic job of showing how chain-of-thought (CoT) prompting significantly boosts large language models’ performance on math and symbolic reasoning tasks. It points out, though, that the benefits don’t extend much to other areas. I do wish it had addressed test-time scaling, as an update on that front seems crucial.

Raw notes: Good analysis of CoT. This paper needs an urgent update that includes a discussion of test-time scaling (o1).
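For readers new to the distinction being measured, here’s the simplest possible illustration of a direct prompt versus a chain-of-thought prompt on a math question (generic wording, not the paper’s exact templates).

```python
# Tiny illustration of the distinction the paper measures: a direct prompt
# vs. a chain-of-thought prompt for the same math question.
question = "A train travels 60 km in 40 minutes. What is its speed in km/h?"

direct_prompt = f"Q: {question}\nA:"
cot_prompt = f"Q: {question}\nA: Let's think step by step."

# CoT elicits intermediate steps (40 min = 2/3 h; 60 / (2/3) = 90 km/h),
# which is exactly where the paper finds it helps: math and symbolic tasks.
print(direct_prompt)
print(cot_prompt)
```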


Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models

Johns Hopkins University; Samaya AI

🤗 19 · X 261 · HackerNews 0 · Reddit 0 · YouTube 0 · GitHub 0

This paper introduces Promptriever, a retrieval model capable of interacting through prompts, akin to instruction-tuned language models. With training on a large dataset, Promptriever achieves state-of-the-art results and demonstrates flexibility in handling diverse queries and relevance instructions. I find this work particularly exciting for its potential to revolutionize how information retrieval systems are integrated with language models.

Raw notes: LLMs continue their encroachment on IR.
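The interface idea is simple enough to sketch: concatenate a free-form relevance instruction with the query before embedding, so the same corpus can be searched under different notions of relevance. The encoder below is a mock, so the printed scores are placeholders; a real Promptriever checkpoint would replace `embed`.

```python
# Sketch of prompting a dense retriever: the relevance instruction rides
# with the query, steering what counts as a match. The encoder is mocked.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Mock text encoder standing in for the retriever's embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

query = "treatments for migraine"
instruction = "Relevant documents must report results from clinical trials."
q_vec = embed(f"{query} {instruction}")   # the instruction rides with the query

docs = ["A 2023 randomized trial of rimegepant...",
        "A blog post about home remedies..."]
for doc in docs:
    # Placeholder scores; a trained model would rank trial reports higher.
    print(round(float(embed(doc) @ q_vec), 3), doc)
```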


jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Jina AI GmbH

🤗 18 · X 565 · HackerNews 0 · Reddit 0 · YouTube 1 · GitHub 0

This paper introduces jina-embeddings-v3, a powerful multilingual text embedding model that leverages Low-Rank Adaptation (LoRA) adapters and Matryoshka Representation Learning for enhanced performance and flexible embedding dimensions. It outperforms leading models from OpenAI and Cohere in English and multilingual tasks, demonstrating significant advancements in multilingual data and long-context retrieval tasks. I appreciate its versatility and state-of-the-art results in handling up to 8192 tokens effectively.

Raw notes: Advances in text embedding, focusing on multilingual scenarios.
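One practical consequence of the Matryoshka training is worth spelling out: at inference time you can truncate the embedding to a prefix and renormalize, trading a little accuracy for much cheaper storage and search. A minimal sketch, assuming unit-normalized outputs:

```python
# Sketch of Matryoshka Representation Learning at inference time: prefixes
# of the embedding are trained to be good embeddings themselves, so you can
# truncate to a smaller dimension and renormalize.
import numpy as np

def truncate_embedding(e: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    v = e[:dim]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=1024)   # full-size embedding
small = truncate_embedding(full, 256)               # 4x cheaper to store/search
print(small.shape, float(np.linalg.norm(small)))    # (256,) 1.0
```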


A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

ETRI; KETI; Neubla

🤗 15 · X 148 · HackerNews 0 · Reddit 1 · YouTube 0 · GitHub 0

This paper dives deep into how different quantization methods impact the performance of instruction-tuned large language models, specifically those up to 405 billion parameters. I was impressed by the nuanced findings that larger models tend to retain more performance post-quantization compared to their smaller FP16 counterparts. However, the revelation that the MT-Bench evaluation method struggles to distinguish among high-performing models adds an interesting layer to the discussion.

Raw notes: Good reading wrt quantization/performance tradeoffs.
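For context on what’s being benchmarked, here is the simplest possible instance of weight quantization: plain round-to-nearest INT8 with a single symmetric scale. The methods the paper evaluates are more sophisticated, but the storage/accuracy tradeoff is the same in kind.

```python
# Minimal round-to-nearest INT8 weight quantization, the simplest instance
# of the general idea the paper benchmarks with more advanced methods.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0          # one symmetric scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4096, 4096)).astype(np.float32)
q, s = quantize_int8(w)                      # 4x smaller than FP32 storage
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean abs rounding error: {err:.5f}")  # small, but nonzero
```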


Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization

Apple

🤗 15 · X 0 · HackerNews 0 · Reddit 0 · YouTube 0 · GitHub 0

This paper introduces HyperCloning, an innovative method leveraging smaller pre-trained models to initialize larger language models, significantly streamlining the pre-training process and slashing GPU hours required. While the technique shows promise for improving efficiency and accuracy, its practical impact might be limited due to the small number of entities capable of conducting pre-training. I find the approach clever, but its broader applicability could be a concern.

Raw notes: Very few entities can do pre-training, so the impact here is small.
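The function-preserving trick is neat enough to sketch: tile a small layer’s weight matrix into a 2x-wider layer, scaled so that a duplicated input produces a duplicated output. This is my reading of the idea for a single linear layer; the paper also has to handle biases, attention, and normalization layers.

```python
# Sketch of function-preserving upscaling in the spirit of HyperCloning:
# tile a small linear layer's weights into a 2x-wider layer so a duplicated
# input yields a duplicated output. Details beyond one linear layer omitted.
import numpy as np

def clone_linear(W: np.ndarray) -> np.ndarray:
    """(out, in) -> (2*out, 2*in), halved so duplicated inputs are preserved."""
    return np.block([[W, W], [W, W]]) / 2.0

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
x = rng.normal(size=16)

big_out = clone_linear(W) @ np.concatenate([x, x])
assert np.allclose(big_out, np.concatenate([W @ x, W @ x]))  # same function, 2x width
print("function preserved after cloning")
```

The big model thus starts training from the small model’s loss rather than from scratch, which is where the claimed GPU-hour savings come from.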


Acknowledgements

Papers are retrieved from Hugging Face.

Social media metrics are from Emergent Mind.