Simplifying AI Infrastructure: dstack’s Open Source Alternative to Kubernetes

Apr 22, 2025

At the Beyond CUDA Summit 2025, Andrey Cheptsov, CEO and Founder of dstack, unveiled a bold vision:
Simplify container orchestration for AI teams — without the complexity of Kubernetes or Slurm.

Here’s everything you need to know about this open-source innovation designed to accelerate AI development and deployment.👇

🔧 The Problem: Why Kubernetes and Slurm Fall Short for AI

While Kubernetes and Slurm are widely used to orchestrate workloads, they weren’t built with AI in mind:

  • Kubernetes ➔ Great for DevOps, but too low-level and manual for AI engineers
  • Slurm ➔ Built for HPC, not modern cloud-native AI workflows

Result?
AI teams waste valuable time building internal platforms instead of focusing on models, training, and data.

🛠️ The Solution: dstack — AI-Native Container Orchestration

dstack offers a simple, cloud-agnostic container orchestrator built specifically for AI.

Key features:

  • Works with any accelerator: NVIDIA, AMD, Google TPUs, Intel Gaudi
  • Supports any cloud: Hyperscalers, private clouds, and even on-prem clusters
  • Vendor agnostic: Total freedom over frameworks, data, and models
  • Integrated with TensorWave for high-performance AMD MI300X and MI325X cloud deployments

dstack abstracts away infrastructure complexity — letting AI teams focus only on building and shipping models.

Unified Interfaces for the Entire AI Workflow

dstack provides five simple interfaces to cover all AI team needs:

  • Dev Environments ➔ Spin up remote workspaces instantly from your desktop IDE
  • Tasks ➔ Launch training, fine-tuning, and batch jobs across clouds or on-prem
  • Services ➔ Deploy scalable inference endpoints (e.g., using vLLM, SGLang)
  • Fleets ➔ Manage distributed GPU clusters
  • Volumes ➔ Use persistent storage across runs for checkpoints, caching, and datasets

All controlled by a few YAML specs and a simple CLI:
dstack apply ➔ Done. ✅
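To make that concrete, here is a minimal sketch of the workflow, using a dev environment as the example. The field names follow dstack's documented YAML configuration style, but treat the specific values (file name, IDE, GPU size) as illustrative placeholders rather than a definitive spec:

    # .dstack.yml — minimal dev environment configuration (illustrative)
    type: dev-environment
    ide: vscode          # open the remote workspace from your desktop IDE
    resources:
      gpu: 24GB          # placeholder: any accelerator/memory spec dstack supports

    # Provision it with a single CLI call:
    dstack apply -f .dstack.yml

dstack takes care of finding capacity, provisioning the instance, and attaching your IDE.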

🧠 Real-World Examples: Development to Large-Scale Training

🔹 Dev Environments:
Spin up a remote GPU-powered coding environment from your laptop in minutes.

🔹 Training with Tasks:
Define distributed jobs using any framework (Megatron, DeepSpeed, HuggingFace Accelerate) and let dstack handle cluster provisioning.
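As a rough sketch of what such a task might look like (the config keys follow dstack's YAML format; the script name, node count, and GPU spec are placeholder assumptions):

    # train.dstack.yml — hypothetical multi-node training task
    type: task
    name: train-llm
    nodes: 2                     # dstack provisions and connects both nodes
    commands:
      - pip install -r requirements.txt                    # placeholder dependencies
      - torchrun --nnodes=2 --nproc-per-node=8 train.py    # placeholder entrypoint
    resources:
      gpu: 80GB:8                # eight 80GB GPUs per node (illustrative)

The same pattern applies whether the launcher is torchrun, DeepSpeed, or Accelerate; dstack handles the cluster, not the framework.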

🔹 Inference with Services:
Auto-scale your LLM inference endpoints based on demand — without worrying about infrastructure plumbing.
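A hedged sketch of a service definition: the type, port, and replica keys mirror dstack's documented service format, while the model name, image tag, and scaling targets are purely illustrative assumptions:

    # serve.dstack.yml — hypothetical vLLM inference service
    type: service
    name: llm-endpoint
    image: vllm/vllm-openai:latest              # official vLLM serving image
    commands:
      - vllm serve Qwen/Qwen2.5-7B-Instruct     # placeholder model
    port: 8000
    resources:
      gpu: 24GB                                 # illustrative GPU requirement
    replicas: 1..4                              # scale between 1 and 4 replicas
    scaling:
      metric: rps                               # scale on requests per second
      target: 10

dstack routes traffic to the replicas and scales them within the declared range, so the endpoint grows and shrinks with demand.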

🔹 Persistent Storage:
Cache models, save training checkpoints, and manage data across sessions — cloud and on-prem supported.
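For example, a network volume can be declared once and then mounted into any run. The outline below follows dstack's volume configuration style; the backend, region, and size are placeholder assumptions:

    # volume.dstack.yml — hypothetical persistent volume
    type: volume
    name: checkpoints-vol
    backend: aws                 # placeholder backend
    region: us-east-1            # placeholder region
    size: 200GB

    # Inside a task or service, mount it at a path (illustrative):
    #   volumes:
    #     - name: checkpoints-vol
    #       path: /checkpoints

Checkpoints written to the mounted path survive the run, so a later job can resume from them.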

Built for Flexibility: Cloud, On-Prem, and Hybrid

Whether you run on TensorWave's AMD AI Cloud, AWS, GCP, Azure, or your own GPU servers:

  • Cloud-native ➔ Native integrations with all major providers
  • On-prem friendly ➔ Just register your GPU hosts via SSH (see the fleet sketch below)
  • Hybrid-ready ➔ Combine cloud and on-prem seamlessly
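
A sketch of how on-prem registration might look: the fleet and ssh_config structure follows dstack's documented SSH fleet format, while the user, key path, and host IPs are placeholders:

    # fleet.dstack.yml — hypothetical on-prem GPU fleet
    type: fleet
    name: on-prem-fleet
    ssh_config:
      user: ubuntu                     # placeholder SSH user
      identity_file: ~/.ssh/id_rsa     # placeholder key
      hosts:
        - 10.0.0.1                     # placeholder GPU host IPs
        - 10.0.0.2

Once applied, those hosts show up as capacity that dev environments, tasks, and services can run on, alongside any cloud backends you have configured.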

You get full control — no lock-in, no compromises.

💬 Final Takeaway: Open Source Simplicity for AI Builders

Andrey closed the session by inviting everyone to try out dstack:

  • 100% Open Source
  • Fast-moving development
  • Designed to make AI infrastructure effortless

👉 Explore the dstack GitHub repo and start building smarter, not harder.

The future of AI infrastructure is open, simple, and accelerator-agnostic — and dstack is leading the way. 🚀

📺 Watch the Full Talk 👉 Simplifying Container Orchestration for AI | Beyond CUDA Summit 2025

🚀 Deploy AI Workloads on AMD MI300X and MI325X Cloud 👉 Explore TensorWave’s AI Cloud Solutions for training, inference, and scaling LLMs at cost-effective speeds.

About TensorWave

TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.

Ready to get started? Connect with a Sales Engineer.