Paper spotlight
Large Language Models as Zero-shot Dialogue State Tracker through Function Calling
Authors: researchers from UC Santa Barbara, CMU, and Meta AI.
Summary
This paper introduces the idea of using LLM function calling to track key information and maintain memory in task-oriented chatbots. While demonstrating impressive new SOTA performance on a few benchmarks, the paper also suggests that LLMs will likely complement but not completely replace non-LLM techniques in the near term.
Details
Open-ended chatbots such as ChatGPT can converse with users over a wide range of topics. In the vast majority of real-world use cases, however, chatbots are deployed to serve specific goals, helping users accomplish specific tasks such as customer service and support, internal business operations, etc. Such chatbots are called task-oriented, and building them has traditionally involved significant complexity and cost around intent understanding, dialog management, training data curation, and (non-LLM) models. It is natural to ask whether LLMs can speed things up here, especially since task-oriented chatbots have traditionally been considered an easier problem than open-ended ones.
The answer suggested by this paper seems to be: not yet. A key challenge for task-oriented chatbots is keeping an accurate memory of the information collected from the user over a multi-turn dialog. This is called the dialog state tracking (DST) problem. Forgetting or mixing up user inputs is typically unacceptable in practice; imagine a chatbot that books a flight to the wrong destination. The paper's main idea is to use LLM function calling for single-turn DST, which is often referred to as slot filling in the dialog research literature. Below is an example of function calling as slot filling (screenshot taken from the paper):
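The mechanics can be sketched as follows: the chatbot exposes a function schema per domain, the LLM emits a function call whose arguments are the slots mentioned in the current turn, and accumulating those arguments across turns yields the tracked dialog state. This is a minimal illustration, not the paper's implementation; the schema and function names are hypothetical.

```python
import json

# Hypothetical function schema a task-oriented chatbot might expose to an
# LLM for a hotel domain (names are illustrative, not from the paper).
HOTEL_SCHEMA = {
    "name": "find_hotel",
    "description": "Search for a hotel matching the user's constraints.",
    "parameters": {
        "type": "object",
        "properties": {
            "area": {"type": "string", "description": "Part of town"},
            "price_range": {"type": "string",
                            "enum": ["cheap", "moderate", "expensive"]},
            "stars": {"type": "integer", "description": "Hotel star rating"},
        },
    },
}

def update_dialog_state(state: dict, function_call_json: str) -> dict:
    """Merge the slots from one turn's function call into the dialog state.

    The LLM is assumed to emit a function call per turn containing only the
    slots mentioned in that turn; the dialog state is their accumulation.
    """
    args = json.loads(function_call_json)
    return {**state, **{k: v for k, v in args.items() if v is not None}}

# Simulated LLM function-call outputs over a two-turn dialog.
state = {}
state = update_dialog_state(state, '{"area": "north", "price_range": "cheap"}')
state = update_dialog_state(state, '{"stars": 4}')
print(state)  # {'area': 'north', 'price_range': 'cheap', 'stars': 4}
```

The appeal of this framing is that the function schema doubles as the slot ontology, so adding a domain means writing a schema rather than curating training data for a new slot-filling model.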
Experiments show that this approach achieves new SOTA performance on a number of DST benchmarks, covering the Attraction, Hotel, Restaurant, Taxi, and Train domains. But as pointed out by the authors in the section on limitations, this performance is still far from the threshold required by real-world use. The main reported metric is Joint Goal Accuracy (JGA), and the new SOTA, achieved with GPT-4, is only 62.6%. The authors also note that delexicalization (replacing concrete slot values in the dialog with placeholders) is used in this study, adding a further caveat to practical considerations.
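JGA is a strict metric, which helps explain why 62.6% is far from deployable: a turn counts as correct only if every predicted slot exactly matches the gold state, so one wrong slot fails the whole turn. A small sketch of how it is typically computed (toy data, not from the paper):

```python
def joint_goal_accuracy(predictions: list[dict], references: list[dict]) -> float:
    """Fraction of turns whose full predicted slot set exactly matches the
    gold slot set; a single wrong or missing slot fails the entire turn."""
    correct = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return correct / len(references)

# Toy example: dialog states as slot -> value dicts, one per turn.
preds = [
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "taxi-destination": "station"},
]
golds = [
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "taxi-destination": "airport"},
]
# Turn 1 matches exactly; turn 2 fails on one slot, so JGA is 0.5.
print(joint_goal_accuracy(preds, golds))  # 0.5
```

Under this all-or-nothing scoring, 62.6% means that in more than a third of turns at least one tracked slot is wrong, which is the gap the authors flag for real-world deployment.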
The authors have promised to release the code and models behind this work.