Spotlight
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
Authors: Google Research and DeepMind
Venue: ICLR 2024.
Summary
In real-world uses of LLMs, fine-tuning can be an effective technique for lifting their performance to the level users require. Fine-tuning research, however, is still a nascent area. This paper is a useful read for practitioners who are interested in the scary frontier that is LLM fine-tuning.
Details
Given that little is known about best practices for fine-tuning LLMs, we at the AI2 Incubator advise startups to focus on prompt optimization and delay fine-tuning as long as possible. Teams that have done all they can with prompt optimization and still have not reached the required performance may then look into fine-tuning. This paper presents a few findings about LLM fine-tuning, covering the impact of model size, pre-training data size, fine-tuning parameter size, and fine-tuning data size on three fine-tuning techniques: full-model tuning (FMT), prompt tuning, and low-rank adaptation (LoRA):
- Fine-tuning benefits more from scaling up the model size than from scaling up the pre-training data size.
- For both prompt tuning and LoRA, scaling up the number of tuning parameters yields little benefit.
- FMT is data hungry, requiring million-scale fine-tuning data to be effective.
- When only a few thousand fine-tuning examples are available, use prompt tuning or LoRA. With slightly larger datasets, LoRA is preferred due to its stability and slightly better fine-tuning data scalability.
- The scaling law for fine-tuning is multiplicative: performance improves multiplicatively as the various scaling factors increase (a sketch of this form follows the list).
- These are just rules of thumb. The specific task matters a great deal.
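To make the multiplicative point concrete, here is a sketch of the form such a joint scaling law takes (our paraphrase; the paper's exact parameterization may differ). X stands for one scaling factor at a time (model size, pre-training data size, or tuning parameter size), D_f is the fine-tuning data size, and A, E, alpha, and beta are fitted constants:

```latex
% Sketch of a multiplicative joint scaling law (our paraphrase, not copied
% verbatim from the paper): the contributions of the scaling factor X and
% the fine-tuning data size D_f multiply, so improving either factor
% rescales the whole loss term instead of shifting it additively.
\hat{\mathcal{L}}(X, D_f) = A \cdot \frac{1}{X^{\alpha}} \cdot \frac{1}{D_f^{\beta}} + E
```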
We expect to see a lot of research activity around LLM fine-tuning that will shed more light on this important area.
Noteworthy papers
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web. Authors: CMU and Writer.com
- Summary: If there is any lingering doubt about the limitations of the current generation of LLMs, this paper does a fine job dispelling it. The task of building AI agents that can interact with a computer on the user’s behalf proves to be extremely challenging. This paper introduces a benchmark for the task that is so difficult that the strongest performer, unsurprisingly based on GPT-4, reaches only 15% of human-level performance. Projects such as Adept.ai and Open Interpreter would need to significantly simplify the tasks they aim to automate.
Evaluating Very Long-Term Conversational Memory of LLM Agents. Authors: UNC, USC, and Snap Inc.
- Summary: This paper introduces a benchmark called LoCoMo to measure the long-term memory of LLM chatbots. Experiments show that LLM chatbots suffer greatly from long-term memory loss. Long-context models and retrieval-augmented generation (RAG) bring strong improvements but still lag far behind human performance. A RAG-based solution works best when the dialogs are recorded into a database of assertions, as in the sketch below.
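To illustrate the assertion-database idea, here is a minimal sketch (not the paper's system; the class and function names are our own, and word overlap stands in for a real retriever) of recording dialog turns as short assertions and pulling the most relevant ones into a prompt:

```python
import re

# Hedged sketch: record each dialog turn as a short factual assertion, then
# retrieve the assertions most relevant to a new question and place them in
# the prompt. A real system would extract assertions with an LLM and
# retrieve with embeddings; word overlap keeps this example runnable.

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

class AssertionStore:
    def __init__(self) -> None:
        self.assertions: list[str] = []

    def add(self, speaker: str, fact: str) -> None:
        self.assertions.append(f"{speaker} {fact}")

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = tokens(query)
        ranked = sorted(self.assertions,
                        key=lambda a: len(q & tokens(a)),
                        reverse=True)
        return ranked[:k]

store = AssertionStore()
store.add("Alice", "adopted a dog named Max in June 2022")
store.add("Alice", "moved from Seattle to Austin for a new job")
store.add("Bob", "is training for a marathon in October")

question = "What is the name of Alice's dog?"
facts = "\n".join(store.retrieve(question))
print(f"Known facts:\n{facts}\n\nQuestion: {question}")
# The resulting prompt would then be sent to the chatbot.
```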
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. Authors: Microsoft Research and University of Chinese Academy of Sciences.
- Summary: What if we replace the floating-point weights in LLMs with just -1, 0, and 1? We no longer need floating-point multiplication and instead only need integer addition, as in the toy illustration below. This paper shows early, promising results that such LLMs can be trained much faster and with less power while remaining competitive with their regular floating-point counterparts. It is early days, since the authors have only trained LLMs of up to 3 billion parameters; they are currently training larger models. If the results hold at larger scale, then the statement about a new era will be justified. The paper already has 500+ upvotes on its Hugging Face page, where it has generated a lot of discussion.
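To make the arithmetic concrete, here is a small toy sketch (our own illustration under simplifying assumptions, not the BitNet b1.58 implementation; the function names and the thresholding rule are hypothetical): once every weight is -1, 0, or +1, a matrix-vector product reduces to adding and subtracting activations.

```python
import numpy as np

# Hedged toy illustration: round weights to the ternary set {-1, 0, +1},
# then compute a matrix-vector product using only additions and
# subtractions of the activations -- no multiplications.

def ternarize(w: np.ndarray) -> np.ndarray:
    """Map each weight to -1, 0, or +1 using a simple magnitude threshold."""
    threshold = 0.5 * np.abs(w).mean()   # heuristic threshold, an assumption
    return np.sign(w) * (np.abs(w) > threshold)

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y[i] = sum of x[j] where the weight is +1, minus the sum where it is -1."""
    y = np.zeros(w_ternary.shape[0])
    for i, row in enumerate(w_ternary):
        y[i] = x[row == 1].sum() - x[row == -1].sum()
    return y

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))       # full-precision weights
x = rng.normal(size=8)            # activations

w_t = ternarize(w)
print(w_t)                        # entries are only -1.0, 0.0, or 1.0
print(ternary_matvec(w_t, x))     # matches w_t @ x, computed without multiplies
```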