In the dynamic landscape of customer service and support, the integration of Large Language Models (LLMs) has opened up new avenues for efficiency and scale. Beyond simple question-and-answer tasks, LLMs can power interactive chatbots that hold conversations, understand context, and sustain a dialogue across multiple exchanges. This capability can improve customer interactions, particularly in sectors such as retail, telecommunications, and hospitality, where consumer engagement is crucial. Surveys indicate that nearly half (49%) of US adults have turned to an AI chatbot for customer service in the past 12 months (eMarketer).
PwC’s most recent Global CEO Survey found that the risks as well as opportunities around AI are a key focus for top executives. Eighty-five percent of CEOs agree that AI will significantly change the way they do business in the next five years. However, with great power comes great cost, and the utilization of LLM tokens can pose significant challenges, particularly in sustaining operations over time.
In this article, we explore how the principles of Pareto optimization and caching can offer transformative solutions to mitigate these challenges, ensuring sustainable and effective customer service applications.
The Cost Challenge
Every LLM has a price per token, a unit roughly the size of a short word or word fragment. These tokens serve as the currency for interactions with language models, representing computational resources and, consequently, real-world expenses. That is why monitoring how much data you send in and how much output you expect back is central to the cost of running your application. LLM usage pricing can be tricky to understand and simulate: even for a simple chatbot, you need to estimate how long a typical question will be and how long your answers will be.
If you use GPT-3.5-turbo with 4k tokens split across input and output, that comes to roughly $0.000002 per prediction, or $2 per 1M predictions (as of May 2024). DoorDash's ML models made around 10 billion predictions per day, so if each prediction cost $0.000002, that would be $20,000 a day.
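A quick back-of-the-envelope script makes that scaling concrete. The per-prediction price below simply reuses the figure quoted above; for a real estimate, substitute your provider's current per-token rates.

def daily_cost(price_per_prediction: float, predictions_per_day: int) -> float:
    # Scale an assumed per-prediction price to a daily bill.
    return price_per_prediction * predictions_per_day

print(f"${daily_cost(0.000002, 1_000_000):,.2f} per 1M predictions")               # $2.00
print(f"${daily_cost(0.000002, 10_000_000_000):,.2f} per day at 10B predictions")  # $20,000.00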
As applications powered by LLMs continue to grow in popularity, cutting costs will become increasingly important for ensuring a good return on investment (ROI). With token usage rising and more advanced, but also more expensive, models becoming available, businesses will need to keep their expenses in check. By adopting methods to reduce spend, companies can fully exploit the potential of generative AI while keeping their budgets under control, and those that prioritize optimization will position themselves ahead of the competition in a business landscape increasingly driven by AI.
The Pareto Principle
Enter the Pareto principle, a concept widely applicable across various domains, including customer service.
The principle suggests that roughly 80% of effects come from 20% of causes.
In the context of customer queries, this translates to the observation that a significant majority of inquiries stem from a relatively small subset of content. Following this intuition, we can mitigate both the financial cost and the memory demands involved in deploying LLMs for customer service use cases.
Firstly, consider the optimization of LLMs. Traditionally, the success of LLMs was attributed to their scale, with larger models boasting impressive performance. However, recent research suggests that the quality of the dataset used for training plays a crucial role in achieving high-performance models. This shift in paradigm aligns with the Pareto principle, as it emphasizes the importance of focusing on the critical factors (such as dataset curation) that contribute the most to model performance, rather than solely relying on scale.
Similarly, when deploying Small Language Models (SLMs) for resource-constrained users, the Pareto principle comes into play by guiding the allocation of limited computational resources. Given the constraints on memory and processing power, it becomes essential to prioritize the optimization of SLMs for the most impactful tasks or queries. By identifying the subset of queries that contribute the most to user satisfaction or business objectives, organizations can allocate resources effectively, ensuring that the most critical tasks are handled efficiently.
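In practice, this prioritization often takes the shape of a lightweight router in front of the models. The sketch below is only illustrative: the intent keywords and model labels are assumptions, not a production classifier.

# Pareto-style routing: a handful of high-frequency intents go to a cheaper
# small model, everything else to the full-scale LLM.
COMMON_INTENTS = {"billing", "shipping", "password_reset"}

def detect_intent(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("invoice", "charge", "bill")):
        return "billing"
    if any(w in q for w in ("ship", "deliver", "track")):
        return "shipping"
    if any(w in q for w in ("password", "login", "sign in")):
        return "password_reset"
    return "other"

def pick_model(query: str) -> str:
    return "slm" if detect_intent(query) in COMMON_INTENTS else "llm"

print(pick_model("Where is my delivery?"))                      # slm
print(pick_model("Can you draft a complaint letter for me?"))   # llm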
Let's delve into a real-life scenario in the context of a customer service application for an e-commerce platform. Suppose the platform receives a significant volume of inquiries related to order tracking, product returns, and account management. These queries represent the most common and high-impact interactions that customers have with the platform's customer service team.
Now, let's consider the traditional approach where every customer query is processed using a full-scale Large Language Model (LLM). For simplicity, let's assume the cost of processing each query with the LLM is $0.10 per token, and the average token count per query is 100 tokens. With this information, we can calculate the total cost of processing all customer queries over a given period, say one month.
Assuming the platform receives 100,000 customer queries per month, the total cost of processing these queries using the LLM can be calculated as follows:
Total cost = Cost per token * Average token count per query * Number of queries
= $0.10 * 100 * 100,000
= $1,000,000 (per month)
Now, let's explore the scenario where Pareto optimization and caching are implemented, and Small Language Models (SLMs) are deployed to handle the most common queries. Suppose Pareto analysis determines that 80% of customer queries fall into the categories of order tracking, product returns, and account management, which can be handled efficiently by SLMs at a reduced cost of $0.05 per token.
With this optimization strategy, only 20% of customer queries are processed using the full-scale LLM, while the remaining 80% are handled by the SLMs. Let's recalculate the total cost of processing customer queries using this optimized approach:
Total cost = (Cost per token * Average token count per query * Number of queries for LLM) + (Cost per token * Average token count per query * Number of queries for SLM)
= ($0.10 * 100 * 20,000) + ($0.05 * 100 * 80,000)
= $200,000 + $400,000
= $600,000 (per month)
By implementing Pareto optimization and caching with SLMs, the total cost of processing customer queries is reduced from $1,000,000 to $600,000 per month, resulting in significant cost savings of $400,000.
This calculation demonstrates how the Pareto principle, coupled with the use of SLMs, can lead to substantial cost reductions in real-life scenarios. By focusing resources on optimizing the handling of the most common and high-impact queries, organizations can achieve significant efficiency gains and cost savings in their customer service operations.
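For readers who want to rerun the arithmetic with their own traffic mix and prices, the hypothetical figures above reduce to a few lines of Python:

def monthly_cost(queries: int, tokens_per_query: int, price_per_token: float) -> float:
    # Total spend for one segment of traffic.
    return queries * tokens_per_query * price_per_token

baseline = monthly_cost(100_000, 100, 0.10)                                    # all traffic on the LLM
optimized = monthly_cost(20_000, 100, 0.10) + monthly_cost(80_000, 100, 0.05)  # 20/80 LLM/SLM split
print(f"baseline ${baseline:,.0f}, optimized ${optimized:,.0f}, savings ${baseline - optimized:,.0f}")
# baseline $1,000,000, optimized $600,000, savings $400,000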
Caching as a Solution
Frequently asked questions, greetings, and feedback can burden LLMs unnecessarily. This is where caching comes into the picture, a technique long championed in computer science for its ability to store frequently accessed data and retrieve it swiftly when needed. This is useful for two reasons:
- It can save you money by reducing the number of API calls you make to the LLM provider when the same completion is requested repeatedly.
- It can speed up your application, because a cached response is returned immediately instead of waiting on another round trip to the provider.
KV caching
One of the proposed solutions to accelerate LLM inference is through Key-Value (KV) caching for the attention layers in Transformers. This approach substitutes quadratic-complexity computation with linear-complexity memory accesses, effectively improving inference speed. However, as the demand grows for processing longer sequences, the memory overhead associated with KV caching becomes a bottleneck, leading to reduced throughput and potential out-of-memory errors, especially on resource-constrained systems like single commodity GPUs.
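To make the mechanism concrete, here is a minimal single-head decode step with a KV cache, written in NumPy purely for illustration; the random weights stand in for trained parameters, and real serving stacks implement this inside frameworks such as PyTorch or vLLM.

import numpy as np

d = 64
k_cache, v_cache = [], []               # grow by one entry per generated token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x):                     # x: hidden state of the newest token, shape (d,)
    q, k, v = Wq @ x, Wk @ x, Wv @ x    # project only the new token
    k_cache.append(k)                   # cached keys/values mean earlier tokens
    v_cache.append(v)                   # are never re-projected on later steps
    K, V = np.stack(k_cache), np.stack(v_cache)
    attn = softmax(K @ q / np.sqrt(d))  # attend over every cached position
    return attn @ V                     # attention output for the new token

rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
for _ in range(5):                      # each decode step reuses all cached K/V
    out = decode_step(rng.standard_normal(d))

The memory cost is visible here: k_cache and v_cache grow linearly with sequence length, which is exactly the overhead that techniques like ALISA's sparse attention and scheduling aim to tame.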
To address these challenges, ALISA, a novel algorithm-system co-design solution, operates at both the algorithm and system levels to optimize LLM inference performance while mitigating memory constraints:
- Algorithm Level: ALISA introduces Sparse Window Attention (SWA), which prioritizes tokens that are most important in generating a new token. SWA introduces high sparsity in attention layers, reducing the memory footprint of KV caching with minimal impact on accuracy.
- System Level: ALISA employs three-phase token-level dynamical scheduling to optimize the trade-off between caching and recomputation. This optimization strategy maximizes overall performance in resource-constrained systems by dynamically allocating resources based on workload characteristics.
In the world of customer service, response time is crucial for providing timely assistance to customers. Caching mechanisms like GPTCache, which store and retrieve commonly used responses, save LLM calls and improve response time. Combined with KV caching algorithms such as ALISA, organizations can significantly reduce the time required to process customer queries.
Moreover, by optimizing memory usage and improving throughput, approaches like ALISA can also help reduce the cost of deploying LLMs in customer service applications. At the response level, answers to frequently asked queries can be precomputed and stored in a cache, so fewer tokens are required for both input and output during inference, saving computational resources and reducing costs associated with token consumption.
Prompt Caching
Prompt caching is a technique employed to optimize the performance and reduce the computational cost associated with running Large Language Models (LLMs). It revolves around the idea of storing the responses generated by LLMs for specific prompts or inputs, allowing for quicker retrieval of responses when encountering similar prompts in the future. This not only conserves computational resources but also dramatically improves response times, making it particularly beneficial for real-time applications like customer service chatbots.
Here's how prompt caching works:
- Prompt Identification: When a new prompt or input is received by the system, it first undergoes identification and hashing. This process converts the prompt into a unique hash value, which serves as a reference for future retrieval.
- Cache Lookup: The system then checks if the hash of the current prompt exists in the cache. The cache stores previously encountered prompts along with their corresponding responses.
- Response Retrieval: If a match is found in the cache, indicating that the current prompt has been encountered before, the cached response associated with that prompt is retrieved and returned to the user. This bypasses the need for the LLM to process the prompt again from scratch.
- Processing and Caching: In cases where no match is found in the cache, implying that the prompt is new, the LLM proceeds to process the prompt and generate a response. Subsequently, both the prompt's hash and the generated response are stored in the cache for future use.
This process requires a well-designed caching strategy to balance between cache size, retrieval speed, and the freshness of the cached responses.
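A minimal exact-match version of this flow fits in a few lines. Here, call_llm is a stand-in for whatever client your application actually uses, and a production cache would add eviction and expiry on top of this sketch.

import hashlib

cache = {}

def normalize(prompt: str) -> str:
    return " ".join(prompt.lower().split())

def prompt_key(prompt: str) -> str:
    return hashlib.sha256(normalize(prompt).encode()).hexdigest()

def answer(prompt: str, call_llm) -> str:
    key = prompt_key(prompt)          # 1. prompt identification and hashing
    if key in cache:                  # 2. cache lookup
        return cache[key]             # 3. response retrieval on a hit
    response = call_llm(prompt)       # 4. processing on a miss...
    cache[key] = response             #    ...and caching for next time
    return response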
LangChain integrates many caching tools; you can read more about them here: LLM Caching integrations. You can also use Python's functools library to implement simple in-process caching. The snippet below ensures that previously computed results are returned from memory instead of being recomputed.
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_function(query):
    # Your function logic here, e.g., the call to the LLM provider
    ...
Customer Service Use Cases
Let's look at customer service use cases where Pareto optimization and caching can improve customer experience metrics.
Technical Support
In a technical support scenario, customers may encounter issues with products or services and seek assistance from a customer service LLM. These inquiries can range from troubleshooting steps to account-related issues or software bugs.
By caching solutions to common technical issues along with their respective prompts, the LLM can efficiently handle technical support inquiries without the need for repetitive computation. This streamlines the resolution process, enabling faster problem-solving and reducing the workload on human support agents. Additionally, by storing successful resolution paths, the LLM can learn from past interactions and continuously improve its troubleshooting capabilities over time.
Alltius' conversational AI platform simplifies technical support by caching solutions for common issues and automating responses, reducing human workload and delivering faster resolutions. By storing successful resolution paths, Alltius enables its conversational AI agents to continuously learn and improve, providing efficient support at scale.
Case Study: Assurance IQ leveraged Alltius' AI agents for sales teams, boosting their call-to-sale conversion by 300%, showcasing how fast responses enhance agent performance.
Personalized Responses
In another scenario, consider a customer service LLM that provides personalized responses based on the user's previous interactions or purchase history. For instance, a customer may inquire about the status of their recent order or seek recommendations for products based on their browsing history.
By caching personalized responses tailored to individual users or specific contexts, the LLM can deliver more relevant and timely assistance. This enhances the overall customer experience by providing personalized support and recommendations, leading to increased customer loyalty and engagement.
Alltius supports personalized customer service by tailoring responses to a user's previous interactions or purchase history, delivering timely, relevant support that strengthens customer loyalty. Its AI-driven recommendations lead to higher engagement and retention. See how it works.
Frequently Asked Questions (FAQs)
Imagine a customer service chatbot deployed on an insurance platform. This chatbot handles a wide range of customer inquiries, including questions about insurance matching, updating policy information, or adding new users. Many of these queries are repetitive and fall under the category of frequently asked questions (FAQs).
By caching responses to frequently asked questions along with their corresponding prompts, the chatbot can quickly retrieve pre-computed responses from the cache when similar queries are submitted. This significantly reduces the time and computational resources required to process these common inquiries. As a result, customers receive instant responses to their queries, leading to improved satisfaction and efficiency in customer service operations.
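Exact hashing only catches identical wording, so a common refinement is to compare query embeddings and serve a cached answer when a new question is close enough to one seen before; tools such as GPTCache take this approach with a vector store. The sketch below assumes you already have embeddings from some model, and the 0.9 threshold is an arbitrary value to tune.

import numpy as np

faq_cache = []   # list of (query_embedding, cached_answer) pairs

def lookup(query_embedding, threshold=0.9):
    # Return the cached answer of the most similar stored query, if close enough.
    best_score, best_answer = -1.0, None
    for emb, answer in faq_cache:
        score = float(np.dot(emb, query_embedding) /
                      (np.linalg.norm(emb) * np.linalg.norm(query_embedding)))
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None

def store(query_embedding, answer):
    faq_cache.append((query_embedding, answer))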
Alltius addresses FAQs efficiently by caching answers to repetitive queries, reducing processing times. When customers ask common questions, the chatbot retrieves cached responses instantly, improving both response time and customer satisfaction. This streamlines operations, allowing human agents to focus on more complex queries.
Case Study: Using Alltius' technology, Matchbook AI enhanced efficiency in handling frequent queries, enabling the team to handle 95% of queries without human intervention.
The reduction in token consumption resulting from prompt caching translates directly into cost savings for organizations deploying customer service LLMs. With fewer tokens expended on redundant processing of similar queries, organizations can achieve more cost-effective operation of their LLM-powered customer service systems. This cost reduction can be particularly significant in scenarios with high query volumes, where token costs can quickly accumulate over time.
Conclusion
Let's recap what we've seen: Pareto optimization and caching offer numerous benefits for customer service operations, ranging from cost savings to enhanced customer experience.
Cost Savings - Think of Pareto optimization and caching like finding the sweet spot in your budget. By focusing on the 20% of queries that make up 80% of your customer interactions, you're not wasting resources on the less common stuff. This targeted approach means you're not burning through your token budget on questions that hardly come up, saving you some serious cash in the long run.
Improved Response Times - Ever waited ages for a response from customer service? Yeah, it's not fun. With caching, it's like having answers on speed dial. When a common question pops up, the system doesn't need to go back to square one to find the answer—it's already cached and ready to go. That means lightning-fast responses for your customers, keeping them happy and engaged.
Enhanced Customer Experience - With optimization and caching techniques, you're not just saving time and money; you're also delivering top-notch service. Customers feel valued when they get the help they need right when they need it, making them more likely to stick around and come back for more.
Enhanced Scalability - Pareto optimization and caching strategies improve the scalability of customer service operations by enabling organizations to handle a larger volume of inquiries with existing resources. By focusing on optimizing the handling of common queries, organizations can efficiently scale their customer service systems to accommodate growing demand without proportional increases in costs or response times.
In a nutshell, Pareto optimization and caching are like your secret weapons for running a smooth and efficient customer service operation. They save you money, make your responses lightning-fast, and leave your customers feeling like they're getting VIP treatment. What's not to love?