This is the 10th edition of Harmonious’ weekly paper roundup series. This past week I did not find any paper that merits the spotlight designation, so I am experimenting with giving a brief overview of papers organized around the following topics:
- LLM applications: agents, chatbots, RAG, document understanding, coding, and others.
- LLM prompting: techniques such as CoT to help us get the most out of LLMs.
- Multimodal LLMs. This is an area where many folks expect a lot of advances in the next wave.
- Synthetic data and other novel ways to generate training data. Despite the rise of LLMs and zero/few-shot learning, the data bottleneck is still present, so it’s helpful to find creative ways to get data, not only for fine-tuning but also for prior-generation ML approaches.
- Benchmarks and evaluations: we need to understand the strengths and limitations of LLMs.
- LLM fine-tuning/many-shot learning. This is an important option whenever prompting alone is not sufficient.
- Context: topics such as context length and limits, effective use of context, context compression, etc.
- LLM efficiency, primarily for fine tuning and inference. This is obviously important for real world deployment.
- LLM internals: how they work.
- LLM frontier: what’s the next big leap beyond transformers? State-space models? Self-evolution?
- LLM announcements: e.g. Llama, Phi, etc.
I created this taxonomy based on reading a few hundred papers for the first 9 editions of the weekly paper roundup series. The topics are roughly ordered by relevance to practitioners (obvious caveat: this is highly subjective). I may adjust the taxonomy if necessary, and not every topic will have papers in a given week.
Let’s look at the papers for the week of April 22, 2024.
LLM use cases
LLM-code
How Far Can We Go with Practical Function-Level Program Repair? Authors: Southern U of Science and Technology, Shenzhen and Kwai Inc.
- TLDR: Studies LLM-based function-level automatic program repair (APR), focusing on few-shot learning and auxiliary repair-relevant information. Proposes a function-level APR technique that uses a dual-LLM framework to leverage the auxiliary repair-relevant information and improve repair performance (a toy sketch follows this entry).
- Assessment: The good: interesting insight into using LLMs to fix bugs. The bad: unclear why GPT-4 is not included in the study, given its superior code capabilities compared to GPT-3.5.
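To make the dual-LLM idea concrete, here is a minimal sketch under my reading of the TLDR: one model distills repair-relevant information from the failing test and error output, and a second model generates the patched function. The prompt wording and the split of responsibilities are my assumptions, not the paper’s exact design; `complete` stands in for any callable that sends a prompt to an LLM and returns its text reply.

```python
def repair_function(buggy_code: str, failing_test: str, error_log: str, complete) -> str:
    # LLM #1: distill repair-relevant information (symptoms, suspicious lines, intent).
    analysis = complete(
        "Summarize the likely root cause of the bug given this function, its failing test, "
        f"and the error output.\n\nFunction:\n{buggy_code}\n\n"
        f"Failing test:\n{failing_test}\n\nError output:\n{error_log}"
    )
    # LLM #2: generate the fixed function, conditioned on the distilled analysis.
    return complete(
        "Rewrite the function so the failing test passes. Return only the full fixed function.\n\n"
        f"Function:\n{buggy_code}\n\nRepair-relevant analysis:\n{analysis}"
    )
```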
LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency. Authors: Nanyang Technological U, Singapore U of Technology, Alibaba.
- TLDR: Uses LLMs for database query rewriting to improve efficiency. Trains a contrastive model via curriculum learning to learn query representations and select effective query demonstrations for the LLM.
- Assessment: The good: paper seems thorough and data/code is made available. The bad: unclear how this work can have impact outside of the niche area of database query optimization.
LLM-agents
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation. Authors: various institutions and companies based in China.
- TLDR: Uses LLM agents to generate code for extracting information from webpages.
- Assessment: The good: this is a useful task for LLMs to solve. The bad: this is not about crawlers; it’s about webpage parsers. Also, the findings seem inconclusive.
A Multimodal Automated Interpretability Agent. Authors: MIT.
- TLDR: LLM agents as researchers doing interpretability analysis on machine learning models.
- Assessment: The good: bold exploration pushing the boundaries of LLM agents; no job is safe from LLM agents’ encroachment. The bad: modest success; a fair amount of real researchers’ babysitting is still needed.
FlowMind: Automatic Workflow Generation with LLMs. Authors: JP Morgan.
- TLDR: LLM agents to write one-off, API-driven Python scripts for non-technical finance folks.
- Assessment: The good: the so-called “lecture” prompting technique is useful to know (a toy sketch follows this entry). The bad: the benchmark is too easy; it’s almost already saturated.
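The gist of “lecture” prompting, as I understand it, is to brief the LLM on the task context and the callable APIs before asking it for code. A minimal sketch: the API names are hypothetical, the prompt wording is my own rather than the paper’s, and `complete` is any callable that returns an LLM’s text reply.

```python
# Hypothetical API descriptions the LLM is "lectured" on before writing any code.
API_DOCS = {
    "get_positions(account_id)": "Return the current holdings for an account.",
    "get_price_history(ticker, days)": "Return daily closing prices for a ticker.",
}

def lecture_prompt(user_request: str) -> str:
    api_lines = "\n".join(f"- {sig}: {doc}" for sig, doc in API_DOCS.items())
    return (
        "You write short Python workflows for finance analysts.\n"
        "You may only call these APIs:\n"
        f"{api_lines}\n\n"
        "Write a complete Python function that fulfills the request below, "
        "calling only the APIs listed above.\n\n"
        f"Request: {user_request}"
    )

def generate_workflow(user_request: str, complete) -> str:
    return complete(lecture_prompt(user_request))
```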
Prompting
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners. Authors: Wuhan U, U of Sydney, Nanyang Tech.
- TLDR: A prompting technique to implore LLMs to think deeply about reasoning problems (e.g. math) before answering (a toy sketch follows this entry).
- Assessment: The good: incremental gains over baseline. The bad: code is not shared.
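Here is a minimal sketch of the staged “understand the problem before solving it” idea: first ask for the core question, then for the relevant facts, then solve with both in hand. The number of stages and the wording are my assumptions rather than the paper’s precise prompts; `complete` is any callable that returns an LLM’s text reply.

```python
def deep_understanding_solve(problem: str, complete) -> str:
    # Stage 1: restate the core question being asked.
    core_question = complete(
        f"Problem:\n{problem}\n\nExtract and state only the core question being asked."
    )
    # Stage 2: list the information needed to answer that question.
    key_info = complete(
        f"Problem:\n{problem}\n\nCore question: {core_question}\n\n"
        "List the facts and quantities from the problem needed to answer the core question."
    )
    # Stage 3: solve using the distilled question and information.
    return complete(
        f"Problem:\n{problem}\n\nCore question: {core_question}\n"
        f"Relevant information: {key_info}\n\n"
        "Using only the information above, reason step by step and give the final answer."
    )
```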
Multimodal
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites. Authors: various China-based institutions and companies.
- TLDR: InternVL 1.5 multimodal LLM is competitive with GPT-4V and other MLLMs.
- Assessment: The good: probably SOTA MLLM for Chinese language. Model/code is shared. The bad: no discussion about limitations or future work. WYSIWYG.
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension. Authors: Tencent, the Chinese U of Hong Kong.
- TLDR: New benchmark for text-rich document understanding and another win for GPT-4V over Gemini Pro and Claude 3 Opus.
- Assessment: The good: welcome benchmark addition for an important, highly practical task. The bad: none.
Long context
LongEmbed: Extending Embedding Models for Long Context Retrieval. Authors: Peking U and Microsoft.
- TLDR: Extends the context length of pre-trained short-context embedders instead of training long-context ones from scratch, plus a new benchmark for long-context retrieval tasks (a toy sketch of one such extension trick follows this entry).
- Assessment: The good: looks like it can be done effectively, and the benchmark seems to be well designed. The bad: analysis covers only training-free techniques, and there’s no baseline of long-context embedders trained from scratch.
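One well-known training-free extension trick in this space is position interpolation for rotary embeddings: scale positions down so a model trained on short contexts can be queried with longer inputs. Whether this is exactly one of the techniques the paper evaluates is my assumption; the sketch below only illustrates the mechanism.

```python
import torch

def rope_angles(positions, dim: int, base: float = 10000.0, scale: float = 1.0):
    """Rotary-embedding angles; scale < 1 interpolates positions (e.g. train_len / target_len)
    so positions beyond the training range map back into the range the model has seen."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float() * scale, inv_freq)  # [seq_len, dim / 2]

short = rope_angles(torch.arange(512), dim=64)                        # original range
extended = rope_angles(torch.arange(2048), dim=64, scale=512 / 2048)  # interpolated range
```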
SnapKV: LLM Knows What You are Looking for Before Generation. Authors: UIUC, Cohere, Princeton.
- TLDR: Clever idea to compress the KV cache to speed up long-context processing (a toy sketch follows this entry).
- Assessment: The good: 3.6X faster generation and an 8.2X smaller memory footprint. The bad: none.
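To make the idea concrete, here is a toy sketch of attention-guided KV-cache pruning in the spirit of the paper: use attention from an observation window at the end of the prompt to decide which earlier positions to keep. The tensor layout and the omission of details such as score pooling/clustering are my simplifications, not the paper’s exact algorithm.

```python
import torch

def compress_kv(keys, values, attn, window: int, keep: int):
    """keys/values: [heads, seq, dim]; attn: [heads, q_len, seq] attention weights whose
    last `window` rows come from the observation window at the end of the prompt.
    Keeps the `keep` most-attended earlier positions per head, plus the window itself."""
    heads, seq, dim = keys.shape
    prefix = seq - window
    # Importance of each prefix position = total attention it gets from the window queries.
    scores = attn[:, -window:, :prefix].sum(dim=1)                    # [heads, prefix]
    top = scores.topk(min(keep, prefix), dim=-1).indices              # [heads, keep]
    idx = torch.cat([top, torch.arange(prefix, seq).expand(heads, -1)], dim=-1)
    idx = idx.sort(dim=-1).values
    gather = idx.unsqueeze(-1).expand(-1, -1, dim)
    return keys.gather(1, gather), values.gather(1, gather)
```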
LLM efficiency
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study. Authors: various China-based institutions.
- TLDR: Significant degradation is observed, especially at ultra-low bit-widths (a toy illustration of quantization error follows this entry).
- Assessment: The good: useful study + code is shared. The bad: none.
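As a back-of-the-envelope illustration (not the paper’s methodology), the round-trip error of plain uniform symmetric quantization grows quickly as the bit-width shrinks, which is consistent with the degradation the study reports at ultra-low bit-widths.

```python
import numpy as np

def quantize_roundtrip(w: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric quantization to `bits` bits, then back to float."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=100_000).astype(np.float32)   # stand-in for a weight matrix
for bits in (8, 4, 3, 2):
    mse = float(np.mean((w - quantize_roundtrip(w, bits)) ** 2))
    print(f"{bits}-bit round-trip MSE: {mse:.5f}")
```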
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference. Authors: ServiceNow, MILA.
- TLDR: Highly technical idea to significantly reduce the memory footprint of the KV cache.
- Assessment: The good: improvement seems impressive with real world implications. The bad: eval is done on QA benchmarks only.
LLM analysis
Retrieval Head Mechanistically Explains Long-Context Factuality. Authors: Peking U, U of Washington, MIT, UIUC, U of Edinburgh.
- TLDR: Retrieval from long contexts is carried out by a specific set of attention heads (“retrieval heads”); a toy scoring sketch follows this entry.
- Assessment: The good: fascinating insight with potential impact across many high-level LLM tasks. The bad: none.
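A toy version of how one might score heads for this behavior: during copy-style (needle-in-a-haystack) generation, a head counts as “retrieving” when its strongest attention from the generated token lands on the prompt position being copied. This is my simplified reading of the idea, not the paper’s exact scoring procedure.

```python
import torch

def retrieval_scores(attn, copy_positions):
    """attn: [heads, gen_len, prompt_len] attention from generated tokens back to the prompt;
    copy_positions: [gen_len] prompt index of the token being copied at each step.
    Returns a per-head score: the fraction of steps where the head's argmax hits that index."""
    top_attended = attn.argmax(dim=-1)                  # [heads, gen_len]
    hits = top_attended == copy_positions.unsqueeze(0)  # [heads, gen_len]
    return hits.float().mean(dim=-1)                    # [heads]
```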
LLM frontier
A Survey on Self-Evolution of Large Language Models. Authors: various China-based institutions.
- TLDR: Survey of how LLMs may create a self-training loop to continuously improve themselves.
- Assessment: The good: a peek into potentially the next big breakthrough. The bad: none.
LLM releases
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework. Authors: Apple.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Authors: Microsoft.