Weekly paper roundup: Nemotron-4-340B (6/10/2024)

Overview

Our pick for this week’s spotlight is NVIDIA’s pair of papers on the Nemotron-4-340B models and the HelpSteer2 dataset. It’s striking that Nemotron-4-340B-Instruct was fine-tuned on data that is 98% synthetic. AI-powered synthetic data generation is an emerging trend that we predict will accelerate for a long time to come. This week’s other hot topics are evaluation and state-space models (SSMs), with papers from Google, Meta, Microsoft, AI2, Together AI, and top universities.

Spotlight

Nemotron-4 340B Technical Report and the HelpSteer2 paper. Authors: NVIDIA.

Practical score: :star::star::star::star:

Barely a week goes by without the announcement of new LLMs. This past week it was NVIDIA’s turn, with the release of three 340B-parameter models in their Nemotron family: Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. The Base and Instruct models are SOTA or competitive with open-weight models such as Llama3, Mistral/Mixtral, and Qwen-2, whereas the Reward model ranks first among all models on AI2’s RewardBench.

The most interesting thing about this release is NVIDIA’s permissive license, which allows distribution, modification, and use of the models and their outputs. This is a departure from other key players such as OpenAI and Meta, whose licenses restrict the use of their models’ outputs as training data for other LLMs (OpenAI’s terms being the most restrictive). NVIDIA wants to sell as many GPUs as possible and thus, unsurprisingly, wants to support an open AI ecosystem. Nevertheless, I wholeheartedly applaud the move and hope this permissive licensing approach will be adopted by the rest of the industry. What’s the point of a restrictive license when other SOTA models are available without your restrictions?

This move from NVIDIA comes at a time when there is increasing reported use of LLMs to generate synthetic training or eval data for AI systems (see previous Harmonious spotlight papers such as Google’s Gecko embeddings and factuality eval work). Nemotron-4-340B-Instruct itself was fine-tuned on data that is 98% synthetic; that synthetic data was generated with the help of the Nemotron Reward model. NVIDIA makes it possible for anyone to use both the Nemotron Reward model and the synthetic data generation pipeline behind all of this (ideally on NVIDIA’s stack, of course). In addition, NVIDIA published HelpSteer2, the dataset they used to train the Nemotron Reward model. Amazing! Is it that surprising that NVIDIA became the most valuable company in the world?
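
To make this concrete, below is a minimal sketch of a reward-model-filtered synthetic data generation loop in the spirit of what the report describes. The `generate_response` and `score_with_reward_model` callables are hypothetical stand-ins (e.g., for an instruct model and Nemotron-4-340B-Reward); this illustrates the pattern, not NVIDIA’s actual pipeline.

```python
# Hypothetical sketch of reward-model-filtered synthetic data generation.
# `generate_response` and `score_with_reward_model` are stand-ins for calls
# to an instruct model and a reward model (e.g., Nemotron-4-340B-Reward).

from typing import Callable

def build_sft_dataset(
    prompts: list[str],
    generate_response: Callable[[str], str],                # instruct model call (assumed)
    score_with_reward_model: Callable[[str, str], float],   # reward model call (assumed)
    n_candidates: int = 4,
    min_score: float = 0.5,
) -> list[dict]:
    """For each prompt, sample several candidate responses, keep the one the
    reward model scores highest, and drop pairs below a quality threshold."""
    dataset = []
    for prompt in prompts:
        candidates = [generate_response(prompt) for _ in range(n_candidates)]
        scored = [(score_with_reward_model(prompt, c), c) for c in candidates]
        best_score, best_response = max(scored, key=lambda x: x[0])
        if best_score >= min_score:
            dataset.append({"prompt": prompt, "response": best_response,
                            "reward": best_score})
    return dataset
```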

For practitioners, ask yourself the question: are you using (LLM-generated) synthetic data? If not, you may be falling behind. For this reason, I picked the Nemotron-4 tech report and the HelpSteer2 paper as this week’s spotlight.

Why 340B parameters? NVIDIA says that a 340B-parameter model can be deployed on a single DGX H100 node with 8 GPUs at FP8 precision. Reading between the lines, I suspect that a 70B-parameter Nemotron, if it existed, would perform worse than, say, Llama3-70B. Still, this is impressive work given that it uses only 10K human-labeled fine-tuning datapoints, three orders of magnitude fewer than Llama3’s 10M.
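
A back-of-the-envelope calculation (my numbers, not NVIDIA’s) shows why FP8 matters for single-node deployment: at roughly one byte per parameter, the weights alone just fit into the aggregate memory of an 8x H100 80 GB node, whereas BF16 weights would not.

```python
# Back-of-the-envelope memory check for serving a 340B model at FP8
# on a single 8x H100 (80 GB) node. Illustrative numbers only.

params = 340e9
bytes_per_param_fp8 = 1          # FP8 weights
bytes_per_param_bf16 = 2         # BF16 weights, for comparison

node_memory_gb = 8 * 80          # 8x H100 80 GB

weights_fp8_gb = params * bytes_per_param_fp8 / 1e9    # ~340 GB
weights_bf16_gb = params * bytes_per_param_bf16 / 1e9  # ~680 GB

print(f"FP8 weights:  {weights_fp8_gb:.0f} GB of {node_memory_gb} GB")
print(f"BF16 weights: {weights_bf16_gb:.0f} GB of {node_memory_gb} GB")
# FP8 leaves ~300 GB for KV cache and activations; BF16 already overflows.
```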

Noteworthy papers

LLM applications

Proofread: Fixes All Errors with One Tap. Authors: Google.

  • TLDR: How to build a real-world LLM-powered application (Gboard’s Proofread), from (synthetic) data generation to metrics to inference optimization.
  • The good: Interesting case study that covers key aspects of building real-world LLM applications.
  • The bad: The quality of the writing is subpar. Not a true research paper, in that results are reported as-is, without analyses such as ablations.
  • Practical score: :star::star:

LLM agents

Mixture-of-Agents Enhances Large Language Model Capabilities. Authors: Together AI.

  • TLDR: Proposes a layered MoA architecture wherein each layer comprises multiple LLM agents, and each agent takes all the outputs from agents in the previous layer as auxiliary information when generating its response (see the sketch after this list).
  • The good: Achieves 65.1% on AlpacaEval 2.0, beating GPT-4o’s 57.5%. Identifies the collaborativeness of LLMs: models tend to generate better responses when shown other models’ outputs.
  • The bad: It’s not clear how practitioners building today’s AI systems should interpret the paper’s findings. Why were those specific benchmarks chosen? How relevant is AlpacaEval’s LC win rate in the context of building a RAG application?
  • Practical score: :star::star:
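
Here is a minimal sketch of the layered MoA pattern described above, assuming a generic `call_llm(model, prompt)` helper; the prompt wording and layer configuration are illustrative, not the paper’s exact setup.

```python
# Minimal sketch of a Mixture-of-Agents (MoA) loop: each layer's agents see
# the previous layer's answers as auxiliary context; a final aggregator
# synthesizes the answer. `call_llm(model, prompt)` is a hypothetical stand-in
# for your model API; the prompt wording is illustrative, not the paper's.

from typing import Callable

def mixture_of_agents(
    question: str,
    layers: list[list[str]],           # e.g. [["model-a", "model-b"], ["model-a"]]
    aggregator: str,
    call_llm: Callable[[str, str], str],
) -> str:
    previous_answers: list[str] = []
    for layer in layers:
        current_answers = []
        for model in layer:
            context = "\n\n".join(
                f"Response {i + 1}: {a}" for i, a in enumerate(previous_answers)
            )
            prompt = (
                f"{question}\n\n"
                + (f"Here are responses from other assistants:\n{context}\n\n"
                   "Use them as auxiliary information.\n" if context else "")
            )
            current_answers.append(call_llm(model, prompt))
        previous_answers = current_answers
    # Final aggregation step over the last layer's answers.
    synthesis_prompt = (
        f"{question}\n\nSynthesize a single best answer from these responses:\n"
        + "\n\n".join(previous_answers)
    )
    return call_llm(aggregator, synthesis_prompt)
```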

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning. Authors: UW, Meta, AI2.

  • TLDR: See the title.
  • The good: 7B model-based agents can beat vanilla frontier models in reasoning. Also releases HuskyQA, a new eval set that tests agents for mixed-tool reasoning, with a focus on retrieving missing knowledge and performing numerical reasoning.
  • The bad: No discussion on limitations and future directions.
  • Practical score: :star::star::star:

LLM prompting

The Prompt Report: A Systematic Survey of Prompting Techniques. Authors: various.

  • TLDR: See the title.
  • The good: A good snapshot of the prompt landscape.
  • The bad: This should live in an online place that allows regular updates (e.g. wiki-style).
  • Practical score: :star::star::star:

Multimodal LLMs

Tx-LLM: A Large Language Model for Therapeutics. Authors: Google.

  • TLDR: See the title. Fine-tuned from PaLM-2. Progress on LLM encoding of biochemical knowledge.
  • The good: Competitive with SOTA on 43 out of 66 tasks and exceeds SOTA on 22. Potential use in end-to-end drug discovery and development. Datasets are available.
  • The bad: Model not available.
  • Practical score: :star::star::star:

Synthetic data

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Authors: U Washington and AI2.

  • TLDR: A novel approach to generating alignment data for LLM fine-tuning; a sketch of the core idea follows this list. This paper should be read in conjunction with the Nemotron-4 and HelpSteer2 papers.
  • The good: An 8B model fine-tuned on Magpie data achieves performance comparable to Llama3-8B-Instruct.
  • The bad: None.
  • Practical score: :star::star::star:
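
As I understand the approach, the trick is to prompt an aligned model with only the pre-query portion of its chat template, so that it generates a plausible user instruction on its own, and then to generate a response to that instruction. The sketch below assumes a hypothetical raw-completion helper `complete(prefix)` and an illustrative Llama-3-style template; it is not the paper’s exact code.

```python
# Rough sketch of Magpie-style data synthesis: prompt an aligned model with
# only the pre-query portion of its chat template so it generates a plausible
# user instruction, then generate a response to that instruction.
# `complete(prefix)` is a hypothetical raw-completion call; the template below
# is an illustrative Llama-3-style header, not necessarily the paper's exact one.

from typing import Callable

PRE_QUERY_TEMPLATE = "<|start_header_id|>user<|end_header_id|>\n\n"

def synthesize_pair(complete: Callable[[str], str]) -> dict:
    # Step 1: the model fills in the "user turn" itself -> a synthetic instruction.
    instruction = complete(PRE_QUERY_TEMPLATE).strip()
    # Step 2: generate a response to the synthetic instruction.
    response_prefix = (
        f"{PRE_QUERY_TEMPLATE}{instruction}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    response = complete(response_prefix).strip()
    return {"instruction": instruction, "response": response}
```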

What If We Recaption Billions of Web Images with LLaMA-3? Authors: UC Santa Cruz, University of Edinburgh, JHU, Adobe, UT Austin.

  • TLDR: Fine-tune a Llama3-8B-powered LLaVA-1.5 vision-language model to caption 1.3 billion images from the DataComp-1B dataset. Another example of AI-powered synthetic data generation.
  • The good: Improvements for zero-shot discriminative models (CLIP) and for text-to-image diffusion transformers. Everything is shared.
  • The bad: None.
  • Practical score: :star::star::star:

Benchmarks and evaluations

A number of papers were published on benchmarks and evaluations, a sign that this continues to be an active area of research.

NATURAL PLAN: Benchmarking LLMs on Natural Language Planning. Authors: Google Deepmind.

  • TLDR: Benchmarks for Trip Planning, Meeting Planning, and Calendar Scheduling. It’s a challenging benchmark in that frontier models score only in the low 30% range.
  • The good: Extensive ablation studies showing limitations of self-correction, few-shot generalization, and in-context planning.
  • The bad: None.
  • Practical score: :star::star:

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild. Authors: AI2 and UW.

  • TLDR: An automated benchmark for LLMs that uses frontier models such as GPT-4-Turbo as judges (a generic sketch of the judging pattern follows this list).
  • The good: Strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks.
  • The bad: None.
  • Practical score: :star::star:
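
For readers unfamiliar with the setup, here is a generic sketch of LLM-as-judge scoring, the pattern such automated benchmarks rely on. The rubric and prompt are illustrative rather than WildBench’s actual checklist, and `call_judge` is a hypothetical call to a strong judge model (e.g., GPT-4-Turbo).

```python
# Generic sketch of LLM-as-judge scoring. The rubric and prompt are
# illustrative, not WildBench's actual checklist; `call_judge(prompt)` is a
# hypothetical call to a strong judge model.

import re
from typing import Callable

def judge_response(task: str, response: str, call_judge: Callable[[str], str]) -> int:
    prompt = (
        "You are grading an assistant's answer to a user task.\n"
        f"Task:\n{task}\n\nAnswer:\n{response}\n\n"
        "Rate the answer from 1 (poor) to 10 (excellent) and reply with\n"
        "'Score: <number>' on the last line."
    )
    verdict = call_judge(prompt)
    match = re.search(r"Score:\s*(\d+)", verdict)
    return int(match.group(1)) if match else 0
```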

CRAG – Comprehensive RAG Benchmark. Authors: Meta and HKUST.

  • TLDR: A RAG benchmark with 4,409 question-answer pairs plus mock APIs that simulate web and Knowledge Graph (KG) search; a sketch of a straightforward RAG baseline follows this list.
  • The good: Frontier LLMs get <34% accuracy, improving to 44% with straightforward RAG; SOTA RAG systems reach 63%. It will take a while for this benchmark to saturate.
  • The bad: None.
  • Practical score: :star::star::star:
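
“Straightforward RAG” here means something like the sketch below: retrieve a few passages, stuff them into the prompt, and answer. The `search` and `call_llm` helpers are my own placeholders, not CRAG’s actual mock APIs.

```python
# Sketch of a "straightforward RAG" baseline: retrieve a handful of passages,
# stuff them into the prompt, and answer. `search` and `call_llm` are
# placeholder callables, not CRAG's actual mock API.

from typing import Callable

def answer_with_rag(
    question: str,
    search: Callable[[str], list[str]],   # e.g., mock web / KG search
    call_llm: Callable[[str], str],
    top_k: int = 5,
) -> str:
    passages = search(question)[:top_k]
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using the references below. "
        "If they are insufficient, say you don't know.\n\n"
        f"References:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)
```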

Are We Done with MMLU? Authors: University of Edinburgh and others.

  • TLDR: MMLU contains many erroneous questions; the authors identify them and create MMLU-Redux, a re-annotated subset.
  • The good: It’s good to know that 57% of the analysed questions in the Virology subset contain errors. Yikes.
  • The bad: None.
  • Practical score: :star::star::star:

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning. Authors: Google.

  • TLDR: See the title. Temporal reasoning is reasoning about events and the temporal relationships between them.
  • The good: Extensive comparison between Claude 3 Sonnet, GPT-4, and Gemini 1.5 Pro.
  • The bad: None.
  • Practical score: :star::star:

LLM efficiency

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters. Authors: Shanghai Jiaotong, Tsinghua U.

PowerInfer-2: Fast Large Language Model Inference on a Smartphone. Authors: Shanghai Jiaotong.

  • TLDR: A pair of related papers: Turbo Sparse sparsifies activations so that only a small fraction of parameters are active per token, while PowerInfer-2 is a framework that exploits this sparsity to deploy LLMs with strong inference performance (e.g., tokens/sec, memory footprint), particularly on smartphones.
  • The good: 11 tokens/sec for TurboSparse-Mixtral-47B.
  • The bad: While the models are shared on Hugging Face, the code is not.
  • Practical score: :star::star::star:

LLM frontier: State-Space Models

An Empirical Study of Mamba-based Language Models. Authors: NVIDIA, University of Wisconsin-Madison, Princeton University, Together AI, Carnegie Mellon University, Cartesia AI.

  • TLDR: Direct comparison between 8B-parameter Mamba, Mamba-2, Mamba/MLP hybrids, and Transformer models trained on the same datasets of up to 3.5T tokens.
  • The good: Hybrids look promising: compared to Transformers, the 8B Mamba-2-Hybrid achieves better benchmark performance (+2.65 points on average) while being up to 8x faster at token generation. Code and checkpoints are released.
  • The bad: None.
  • Practical score: :star::star::star:

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling. Authors: Microsoft and UIUC.

  • TLDR: A different kind of Mamba hybrid, using sliding-window attention: SAMBA selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism (see the mask sketch after this list).
  • The good: Promising results on long-context benchmarks, and a 3.6x speedup on long-context generation compared to Transformers. Code is released.
  • The bad: None.
  • Practical score: :star::star::star:
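
To illustrate the attention half of the hybrid, here is a small sketch of a sliding-window causal mask; the window size is an arbitrary choice for the example, and the recurrent (Mamba) half that compresses history into hidden states is not shown.

```python
# Sketch of a sliding-window causal attention mask: position i may attend only
# to positions in [i - window + 1, i]. Window size here is arbitrary; the
# recurrent (Mamba) half of the hybrid is not shown.

import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask of shape (seq_len, seq_len); True = may attend."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=3)
print(mask.astype(int))
# Each row has at most `window` ones, so attention cost per token is O(window)
# instead of O(seq_len), which is what enables efficient long-context decoding.
```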
