Simplifying AI Infrastructure: dstack’s Open Source Alternative to Kubernetes
Apr 22, 2025

At the Beyond CUDA Summit 2025, Andrey Cheptsov, CEO and Founder of dstack, unveiled a bold vision:
Simplify container orchestration for AI teams — without the complexity of Kubernetes or Slurm.
Here’s everything you need to know about this open-source innovation designed to accelerate AI development and deployment.👇
🔧 The Problem: Why Kubernetes and Slurm Fall Short for AI
While Kubernetes and Slurm are widely used to orchestrate workloads, they weren’t built with AI in mind:
- Kubernetes ➔ Great for DevOps, but too low-level and manual for AI engineers
- Slurm ➔ Built for HPC, not modern cloud-native AI workflows
Result?
AI teams waste valuable time building internal platforms instead of focusing on models, training, and data.
🛠️ The Solution: dstack — AI-Native Container Orchestration
dstack offers a simple, cloud-agnostic container orchestrator built specifically for AI.
Key features:
- Works with any accelerator: NVIDIA, AMD, Google TPUs, Intel Gaudi
- Supports any cloud: Hyperscalers, private clouds, and even on-prem clusters
- Vendor agnostic: Total freedom over frameworks, data, and models
- Integrated with TensorWave for high-performance AMD MI300X and MI325X cloud deployments
dstack abstracts away infrastructure complexity — letting AI teams focus only on building and shipping models.
Unified Interfaces for the Entire AI Workflow
dstack provides five simple interfaces to cover all AI team needs:
- Dev Environments ➔ Spin up remote workspaces instantly from your desktop IDE
- Tasks ➔ Launch training, fine-tuning, and batch jobs across clouds or on-prem
- Services ➔ Deploy scalable inference endpoints (e.g., using vLLM, SGLang)
- Fleets ➔ Manage distributed GPU clusters
- Volumes ➔ Use persistent storage across runs for checkpoints, caching, and datasets
All of it is controlled by a few YAML specs and a single CLI command: dstack apply (see the example below).
➔ Done. ✅
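For example, a dev environment is just a short YAML file. The spec below is an illustrative sketch rather than a verbatim example from the dstack docs; field names like ide and resources.gpu, and the name vscode-dev, are assumptions worth checking against the current documentation.

```yaml
# Hypothetical minimal dev environment spec (.dstack.yml)
type: dev-environment
name: vscode-dev        # illustrative name
python: "3.11"
ide: vscode             # connect from your desktop IDE
resources:
  gpu: 24GB             # any GPU with at least 24 GB of memory

# Provision it with the CLI:
#   dstack apply -f .dstack.yml
```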
🧠 Real-World Examples: Development to Large-Scale Training
🔹 Dev Environments:
Spin up a remote GPU-powered coding environment from your laptop in minutes.
🔹 Training with Tasks:
Define distributed jobs using any framework (Megatron, DeepSpeed, HuggingFace Accelerate) and let dstack handle cluster provisioning.
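A distributed training task follows the same pattern. The sketch below is hedged: the node count, GPU spec, and torchrun command are placeholders, not a verified recipe.

```yaml
# Hypothetical multi-node training task
type: task
name: train-llm
nodes: 2                     # dstack provisions and connects the cluster
python: "3.11"
commands:
  - pip install -r requirements.txt
  - torchrun --nproc_per_node=8 train.py   # or Megatron, DeepSpeed, Accelerate
resources:
  gpu: 80GB:8                # eight 80 GB GPUs per node
```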
🔹 Inference with Services:
Auto-scale your LLM inference endpoints based on demand — without worrying about infrastructure plumbing.
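Inference endpoints are declared as services. The example below assumes a vLLM serving image and an open Llama checkpoint; the image, model, and replica settings are illustrative and should be checked against the dstack docs.

```yaml
# Hypothetical auto-scaling inference service
type: service
name: llama-endpoint
image: vllm/vllm-openai:latest                  # assumed serving image
env:
  - MODEL_ID=meta-llama/Llama-3.1-8B-Instruct   # placeholder model
commands:
  - vllm serve $MODEL_ID --port 8000
port: 8000
resources:
  gpu: 24GB
replicas: 1..4                                  # scale out with demand
```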
🔹 Persistent Storage:
Cache models, save training checkpoints, and manage data across sessions — cloud and on-prem supported.
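Persistent storage follows the same declarative pattern: define a volume once, then attach it to runs. The backend, region, and mount path below are assumptions for illustration.

```yaml
# Hypothetical persistent volume for checkpoints and caches
type: volume
name: training-data
backend: aws            # or another supported backend
region: us-east-1
size: 500GB

# Attached to a task or service along these lines:
#   volumes:
#     - name: training-data
#       path: /checkpoints
```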
Built for Flexibility: Cloud, On-Prem, and Hybrid
Whether you run on TensorWave's AMD AI Cloud, AWS, GCP, Azure, or your own GPU servers:
- Cloud-native ➔ Native integrations with all major providers
- On-prem friendly ➔ Just register your GPU hosts via SSH (see the fleet sketch below)
- Hybrid-ready ➔ Combine cloud and on-prem seamlessly
You get full control — no lock-in, no compromises.
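Registering your own GPU servers amounts to describing an SSH fleet. The user, key path, and host addresses below are placeholders; confirm the exact schema in the dstack docs.

```yaml
# Hypothetical on-prem fleet registered over SSH
type: fleet
name: onprem-mi300x
ssh_config:
  user: ubuntu                   # placeholder login user
  identity_file: ~/.ssh/id_rsa   # placeholder key
  hosts:
    - 10.0.0.11
    - 10.0.0.12
```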
💬 Final Takeaway: Open Source Simplicity for AI Builders
Andrey closed the session by inviting everyone to try out dstack:
- 100% Open Source
- Fast-moving development
- Designed to make AI infrastructure effortless
👉 Explore the dstack GitHub repo and start building smarter, not harder.
The future of AI infrastructure is open, simple, and accelerator-agnostic — and dstack is leading the way. 🚀
📺 Watch the Full Talk 👉 Simplifying Container Orchestration for AI | Beyond CUDA Summit 2025
🚀 Deploy AI Workloads on AMD MI300X and MI325X Cloud 👉 Explore TensorWave’s AI Cloud Solutions for training, inference, and scaling LLMs at cost-effective speeds.
About TensorWave
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.
Ready to get started? Connect with a Sales Engineer.