Unlocking real-time chat with 1M context Llama3-70B model on AMD's MI300X

Jul 05, 2024

A killer application of large language models (LLMs) is being able to intelligently interact with vast amounts of text. Imagine conversing with a model that has comprehensive knowledge of entire codebases, literary works, or legal documents for unlocking valuable insights.

Recently, long context windows have emerged as a promising approach, with Google’s Gemini 1.5 Pro and Flash models supporting million-token context and outperforming popular techniques such as RAG.

While Google’s long context models are undeniably impressive, they are unfortunately locked behind an API and come with significant limitations:

  1. Lack of Customization: Limited fine-tuning or modification capabilities
  2. Scalability Constraints: API rate limits hinder large-scale deployments
  3. Cost Inefficiency: Prohibitive expenses for long context token utilization
  4. Data Security Risks: Reliance on third-party API for sensitive data processing
  5. Feature Gaps: Absence of real-time chat with cached context

To address these challenges, we’ve developed an alternative approach that gives users the ability to run long context models fully under their own control.

At a recent event hosted by TensorWave in San Francisco for the developer community, we showed an open-source 1M context Llama3-70B model running on AMD MI300X hardware.

This project was a collaboration between MK1, who developed the inference software stack, Gradient, who provided the long context model, and TensorWave, who built and hosted the MI300X cloud.

Demonstrating Two Real-World Applications

  • Apollo 11 Transcript: demonstrates the model’s ability to maintain context across the entire transcript of the historic moon landing.
  • Three.js code examples: demonstrates the model’s proficiency at handling queries in complex coding environments.

A key breakthrough is a feature we have not seen offered elsewhere: persistent context caching, which enables real-time multi-user interaction with the same document. It greatly enhances development workflows by allowing large document sets to be loaded once, with no additional processing time for subsequent interactions.

For example, in our Three.js demo we loaded the cache into memory in 15 seconds, versus 8 minutes for the original prefill stage, more than 30 times faster. Subsequent prompts incurred no additional overhead.
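To make the workflow concrete, here is a minimal sketch of what persistent context caching looks like from the client’s side. The names (prefill, save_cache, load_cache, chat) and the disk-backed cache are illustrative placeholders only, not MK1’s actual API; the timings in the comments refer to the demo figures above.

```python
# Minimal sketch of the persistent context caching workflow.
# All names here are hypothetical placeholders, not MK1's actual API;
# the cache is simulated with a pickled dict so the script runs stand-alone.

import pickle
import time
from pathlib import Path

CACHE_PATH = Path("threejs_context.cache")


def prefill(document: str) -> dict:
    """Simulate the expensive one-time prefill over a long document
    (the ~8 minute pass that builds the 1M-token KV cache)."""
    time.sleep(2)  # stand-in for the long prefill
    return {"doc_chars": len(document)}


def save_cache(cache: dict, path: Path) -> None:
    """Persist the context cache so later sessions skip the prefill."""
    path.write_bytes(pickle.dumps(cache))


def load_cache(path: Path) -> dict:
    """Reload a previously built cache (seconds instead of minutes)."""
    return pickle.loads(path.read_bytes())


def chat(cache: dict, prompt: str) -> str:
    """Answer a prompt against the cached context; only the new prompt
    tokens would need processing, so this runs at interactive speed."""
    return f"[answer to {prompt!r} using a {cache['doc_chars']}-char context]"


if __name__ == "__main__":
    document = "x" * 1_000_000  # placeholder for the Three.js examples

    if CACHE_PATH.exists():
        cache = load_cache(CACHE_PATH)   # fast path: cache already built
    else:
        cache = prefill(document)        # slow path: one-time prefill
        save_cache(cache, CACHE_PATH)

    # Multiple users can now chat against the same cached context.
    print(chat(cache, "How do I set up an OrbitControls camera?"))
    print(chat(cache, "Which examples use InstancedMesh?"))
```

The key design point is that the expensive prefill happens once; every later session, and every additional user, starts from the saved cache rather than reprocessing the document.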

Leveraging AMD MI300X Capabilities

AMD’s MI300X accelerators, with their massive 192GB of memory per card, are crucial for running long context models efficiently. Specifically:

  • Model parameters and large in-memory context caches can be stored on fewer accelerators.
  • Multiple caches can be stored in memory, enabling rapid context switching.
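For intuition on why the 192GB per card matters, here is a rough, back-of-envelope memory estimate based on the published Llama3-70B architecture (80 layers, grouped-query attention with 8 KV heads, head dimension 128), assuming 16-bit weights and KV cache. The actual deployment may use different precision or sharding, so treat the numbers as order-of-magnitude only.

```python
# Back-of-envelope memory estimate: Llama3-70B weights plus a 1M-token
# KV cache, assuming fp16 throughout. Real deployments may quantize or
# shard differently; this is an order-of-magnitude sketch only.

PARAMS = 70e9            # Llama3-70B parameter count
N_LAYERS = 80            # transformer layers
N_KV_HEADS = 8           # grouped-query attention KV heads
HEAD_DIM = 128
BYTES_FP16 = 2
CONTEXT_TOKENS = 1_000_000
MI300X_GB = 192          # HBM per MI300X card

weights_gb = PARAMS * BYTES_FP16 / 1e9                      # ~140 GB

# Per token: keys + values, for every layer and KV head.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16
kv_cache_gb = kv_bytes_per_token * CONTEXT_TOKENS / 1e9     # ~328 GB

total_gb = weights_gb + kv_cache_gb                         # ~468 GB
print(f"weights  ~{weights_gb:.0f} GB, KV cache ~{kv_cache_gb:.0f} GB")
print(f"total    ~{total_gb:.0f} GB -> ~{total_gb / MI300X_GB:.1f} MI300X cards")
```

Under these assumptions, the weights and a full 1M-token cache fit within roughly three cards, leaving ample headroom on an 8-card MI300X node for additional cached contexts and the rapid context switching described above.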

Our work is a significant step towards more accessible, customizable, and efficient deployment of long context models. For the first time, users can run an open-source 1M context model on infrastructure they control, tailoring it to their specific needs.

For businesses requiring long context windows, this opens up new possibilities for:

  • Enhanced data privacy and security
  • Customized model fine-tuning for domain-specific applications
  • Scalable deployments without API limitations
  • Cost-effective solutions for processing extensive documents or conversations

The system is ready for evaluation and can be adapted to a variety of production environments.

We invite you to explore the capabilities of this system on the TensorWave Cloud and consider how it could be integrated into your workflows to unlock new levels of language model performance and utility for your use case.

Reach out for more information.

About TensorWave and MK1

TensorWave - The premier cloud service provider for AMD MI300X accelerators. Try it today at TensorWave.com

MK1 - Engines for the AI Economy. Visit us online at mk1.ai