Published: Feb 04, 2025

How CAG Saves Teams $100,000s in Runway

LLMs have changed how the world works and plays day to day. They've gone from novelty to everyday tool, and it's no secret they are incredibly powerful – they are also expensive to operate. Startups and teams building AI products should keep this in mind when thinking about runway and the longevity of the company, because inference costs will be one of the biggest financial and technical constraints.

The good news is there’s an effective strategy for cutting both technical and financial costs when building: cache-augmented generation (CAG). The idea is to reuse previously computed activations to speed up inference. Say you were building a LEGO model – traditionally you’d dig through a giant box of LEGOs every time you needed a specific piece, which is tedious and inefficient. If instead you had a second box holding just the commonly used pieces you’d already found, your build would come to life much quicker. CAG is the equivalent of that second box.

CAG builds on KV caching, a related technique that is already well known in the community. Hardware also plays a big role here, and developers must understand how the hardware (AMD vs. NVIDIA) interacts with the caching strategy – especially teams focused on optimizing cost, speed, and overall performance.

This post explores cache-augmented generation, its impact on inference performance, and how a team might think about this in relation to different GPUs available.

Why is Inference So Expensive?!

Imagine that every time you did a task, you had to revisit every lesson in your life that taught you the skills needed to perform it. Simply doing your laundry would take ages: relearning how to walk, how to pick up the basket, how to load the machine, and so on. That is how transformer-based LLMs work today – producing text one token at a time and, at every step, recomputing attention across all previous tokens x_1, x_2, ..., x_T.

This means that if you’re generating a 1,000-token response, the model re-attends over the ever-growing prefix at each of those 1,000 steps, redoing work it has already done.
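To make that concrete, here’s a toy PyTorch sketch (dimensions and step count are made up) of uncached autoregressive generation. Notice that attention is rebuilt over the entire sequence on every step even though only the last position is new:

```python
import torch

def naive_generate(prompt_embeds: torch.Tensor, steps: int = 16) -> torch.Tensor:
    """Toy loop: with no cache, every step recomputes attention over
    ALL previous positions, so total work grows quadratically with length."""
    seq = prompt_embeds                          # (seq_len, d_model)
    scale = seq.shape[-1] ** 0.5
    for _ in range(steps):
        q = k = v = seq                          # rebuilt from scratch each step
        attn = torch.softmax(q @ k.T / scale, dim=-1) @ v
        next_tok = attn[-1:]                     # only the last row is genuinely new work
        seq = torch.cat([seq, next_tok], dim=0)
    return seq

out = naive_generate(torch.randn(8, 64))         # 8-token prompt, 64-dim toy embeddings
```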

Teams beware – this directly impacts inference latency and GPU memory bandwidth, driving up compute costs and straining hardware in ways that can lead to further unnecessary spending.

Example Considerations

Hypothetically, if you were building an AI-powered customer service agent startup in the world of e-commerce, you’d want it to digest customer inquiries, retrieve relevant product information, and generate custom responses in real time. Keep in mind that:

  • Latency matters: Users expect an immediate response, ideally within 1 second of asking a question.
  • Costs can soar: As the user base grows, so do interactions and, ultimately, LLM inference requests. At that scale, even small inefficiencies can cost the company thousands of dollars.
  • Hardware costs dominate: Cloud compute bills scale quickly with usage, and much of that spend is potentially avoidable.

Scaling with proper hardware

As previously mentioned, it’s more important than ever to consider the long-term goals and needs of the business when building with AI. Development teams don’t have an endless budget or free rein to spend company resources at will. With hardware being a huge component here, keep in mind that AMD offers strong options:

  • Costs: Lower upfront investment compared to competitors in the industry.
  • Availability: Readily available, giving builders a competitive advantage on speed.
  • Performance: Competitive workload handling for LLMs.

Putting it all together

CAG starts with your development team extending traditional KV caching to reduce redundant computation and optimize inference. We’ll put it all together below; the following assumes you’re running on an AMD GPU.

1. Activation Caching Strategy

Your KV caching works the same as usual: the keys (K) and values (V) from the self-attention layers are cached to avoid recomputing them – i.e., no more digging through the unorganized LEGO box. This is done by storing K/V pairs in a global cache accessible across inference steps, as sketched below.
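Here’s a rough sketch of that global cache; the class and method names are illustrative, not a specific library’s API:

```python
import torch

class KVCache:
    """Global key/value store shared across inference steps: K/V for each
    layer are appended once per token instead of being recomputed."""

    def __init__(self):
        self._cache = {}  # layer index -> (K, V), each of shape (seq_len, d_head)

    def append(self, layer: int, k: torch.Tensor, v: torch.Tensor) -> None:
        if layer not in self._cache:
            self._cache[layer] = (k, v)
        else:
            K, V = self._cache[layer]
            self._cache[layer] = (torch.cat([K, k]), torch.cat([V, v]))

    def get(self, layer: int):
        return self._cache.get(layer)
```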

The new piece of the strategy is storing activations from deeper layers, such as FFN outputs and residual connections. These are usually recalculated for each token, but with CAG we cache these intermediate activations as well.
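A minimal sketch of that extension, following the same pattern as the K/V store; the "ffn" and "residual" labels are just placeholders for whichever intermediate activations you decide to keep:

```python
import torch

class ActivationCache:
    """CAG extension of the K/V idea: also keep per-token FFN outputs and
    residual-stream activations so deeper layers aren't recomputed for
    tokens we've already processed."""

    def __init__(self):
        self._store = {}  # (layer, kind) -> tensor of shape (seq_len, d_model)

    def put(self, layer: int, kind: str, act: torch.Tensor) -> None:
        key = (layer, kind)                      # kind: "ffn" or "residual"
        prev = self._store.get(key)
        self._store[key] = act if prev is None else torch.cat([prev, act])

    def get(self, layer: int, kind: str):
        return self._store.get((layer, kind))
```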

2. Leveraging AMD GPUs for Activation Caching

Memory capacity is another thing to manage, and AMD GPUs – especially those with HBM – handle large activation matrices efficiently. To get activation caching going on AMD GPUs, you’ll also want to lean on the ROCm software stack to get the most bang for your buck.

Keep in mind you’ll have to optimize memory access: store activations and KV pairs in HBM so they can be read quickly during inference, using ROCm’s memory allocation facilities. The data can be kept in FP16 or INT8 precision to minimize memory usage while taking advantage of the GPU’s reduced-precision throughput.
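For a rough picture: ROCm builds of PyTorch expose AMD GPUs through the usual `cuda` device string, so keeping cached tensors resident in HBM at reduced precision can look something like this (shapes and the simple scaling scheme are made up for illustration):

```python
import torch

device = torch.device("cuda")  # a ROCm build of PyTorch exposes AMD GPUs via "cuda"

# Keep cached activations resident in GPU HBM at half precision to save memory.
ffn_acts = torch.randn(4096, 4096, dtype=torch.float16, device=device)

# INT8 is another option for data you only read back, at the cost of dequantizing later.
scale = ffn_acts.abs().max().float() / 127.0
ffn_acts_int8 = (ffn_acts.float() / scale).round().clamp(-127, 127).to(torch.int8)
```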

We are now building our second, organized LEGO bucket.

3. Efficient Activation Retrieval

Now it’s time to generate the next token, reusing the cached activations computed earlier. Only the new tokens need to be processed through the remaining layers. Two things to think about:

Parallelism: AMD’s many-core architecture and SIMD capabilities can retrieve activations for multiple tokens in parallel, so you aren’t stuck processing them one by one in slow, sequential order.

Activation reuse: When the model processes the next token, it pulls from cached activations – reaching into our organized LEGO bucket instead of digging back through the old one. This matters most for products and services operating at or near real time, where you want as little per-token work as possible; see the sketch below.
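A rough illustration of that batched retrieval, assuming cached activations already live on the GPU: a single vectorized gather replaces a token-by-token host loop.

```python
import torch

# Suppose FP16 activations for positions 0..1023 are already resident on the GPU.
cached_acts = torch.randn(1024, 4096, dtype=torch.float16, device="cuda")

# One vectorized gather pulls activations for many positions at once; the GPU's
# SIMD hardware handles the parallelism instead of a slow host-side loop.
needed_positions = torch.tensor([10, 57, 311, 900], device="cuda")
reused = cached_acts.index_select(0, needed_positions)
```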

4. Handling Dynamic Contexts

Sometimes a situation will require dynamic caching in your build. This could be something as ‘simple’ as a conversation changing topics. When that happens, cached activations must be updated or evicted accordingly, with eviction policies based on token relevance or context-window size.
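One simple way to picture eviction is a sliding window; the window size and function here are placeholders for whatever policy fits your product:

```python
import torch

def evict_to_window(cache: dict, window: int = 2048) -> None:
    """Keep only the most recent `window` positions of each cached tensor.
    A relevance-based policy would score entries instead of trimming by age."""
    for key, acts in cache.items():
        if acts.shape[0] > window:
            cache[key] = acts[-window:]
```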

5. Inference Loop Optimization

Lastly, once the activation cache is in place, the inference loop changes significantly. The organized LEGO box has been built, and our model comes to life much faster. Instead of recalculating activations and KV pairs for every token, we compute values only for each newly generated token and pull everything else from the cache.
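Putting the loop together, it ends up looking roughly like this sketch (`prefill` and `decode_one` are stand-ins for your model’s actual prompt-processing and single-token-decoding steps, not a real API):

```python
def generate_with_cag(model, prompt_ids, max_new_tokens, kv_cache, act_cache):
    """Toy CAG loop: the prompt is processed once to fill the caches, then each
    step feeds only the newest token and reuses everything already cached."""
    model.prefill(prompt_ids, kv_cache, act_cache)        # one full pass over the prompt
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model.decode_one(tokens[-1], kv_cache, act_cache)  # new token only
        tokens.append(next_id)
    return tokens
```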

The TLDR:

Teams have to be mindful of technical and financial constraints when building. Money doesn’t grow on trees, and there’s an important overall business context to keep in mind (if you like your job 🙂).

Running LLMs is like digging through a messy LEGO bin for every piece—slow and expensive. CAG fixes this by storing and reusing activations, ultimately saving the company time and money.

  • Start reducing redundancy with CAG.
  • Leverage AMD GPUs and ROCm for efficient memory use.
  • Fetch cached activations faster using AMD’s SIMD architecture.
  • Only compute new activations, reusing the rest for faster responses.

CAG on AMD GPUs slashes costs, improves speed, and extends runway—a must for AI startups.

Your team’s accountant will love you for using CAG on AMD GPUs – costs going down and runway going up.