Databricks Adds Serverless NVIDIA GPUs

Databricks launches AI Runtime, offering serverless NVIDIA GPUs for simplified AI model training and fine-tuning directly within the Lakehouse.


Databricks is bringing scalable, serverless NVIDIA GPUs directly to its Lakehouse platform with the introduction of its new AI Runtime. This move aims to eliminate the infrastructure headaches typically associated with training and fine-tuning complex AI models, particularly large language models (LLMs).

The company announced the public preview of the AI Runtime (AIR), which provides on-demand access to NVIDIA A10 and H100 GPUs. Users can now configure these GPUs within their Databricks notebooks in just a few clicks, without having to provision and manage their own clusters. This aligns with Databricks' broader push toward simplifying data operations, following its earlier serverless initiatives (covered in "Databricks Serverless Simplifies Data Ops").
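For a sense of what "a few clicks" looks like from the notebook side, the snippet below is a minimal, illustrative check that the serverless GPU is actually visible to the runtime. It uses only standard PyTorch calls and assumes PyTorch is pre-installed in the environment, as the announcement describes; it is not Databricks-specific code.

```python
import torch

# Confirm the serverless GPU is visible to the notebook runtime.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU attached: {torch.cuda.get_device_name(0)}")
    print(f"Total memory: {props.total_memory / 1e9:.1f} GB")
    device = torch.device("cuda")
else:
    print("No GPU attached -- check the notebook's environment configuration.")
    device = torch.device("cpu")
```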

On-Demand Power for AI Model Training

Traditionally, deep learning researchers and engineers spend significant time wrestling with GPU procurement, environment configuration, and data loading bottlenecks. The AI Runtime is designed to abstract away these complexities, allowing teams to focus on model development rather than infrastructure troubleshooting. This is a critical step for organizations that want to scale AI model training without building and operating their own GPU infrastructure.

AIR comes pre-loaded with PyTorch and the CUDA toolkit, along with optimized support for distributed training libraries such as Ray and Hugging Face Transformers. This "batteries included" approach means users can start training immediately, whether they are working on computer vision models, LLMs, or recommendation systems.
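As an illustration of that "batteries included" workflow, the sketch below fine-tunes a small text classifier with the standard Hugging Face Trainer API. It is a generic example rather than Databricks-provided code: the model name, dataset, and hyperparameters are placeholders, and it assumes transformers and datasets are available in the runtime as described.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder base model and dataset -- swap in your own.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a small text-classification dataset.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# The Trainer picks up the attached GPU automatically when CUDA is available.
args = TrainingArguments(
    output_dir="/tmp/finetune-demo",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    fp16=True,           # mixed precision on the GPU
    logging_steps=50,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].select(range(500)),
)
trainer.train()
```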

For production-ready workloads, the AI Runtime integrates with Databricks' Lakeflow orchestration tools and supports Databricks Asset Bundles (DABs) for CI/CD pipelines. This ensures that model training and fine-tuning can stay tightly synchronized with existing data pipelines and production systems.
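One common orchestration pattern is to register the training notebook as a scheduled job so it runs alongside existing pipelines. The sketch below uses the Databricks Python SDK; the job name, notebook path, and schedule are illustrative placeholders, and binding the notebook to a serverless GPU environment is assumed to happen in the notebook's own environment configuration rather than in this job definition.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

# Credentials are read from the environment or notebook context.
w = WorkspaceClient()

# Register a scheduled job that runs the fine-tuning notebook nightly.
job = w.jobs.create(
    name="nightly-finetune",  # placeholder job name
    tasks=[
        jobs.Task(
            task_key="train",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/ml/finetune"),  # placeholder path
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",  # 02:00 every day
        timezone_id="UTC",
    ),
)
print(f"Created job {job.job_id}")
```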

Integrated Governance and Observability

A key advantage highlighted by Databricks is the native integration of AI Runtime with the Lakehouse. This means GPU workloads run directly where the data resides, simplifying governance and observability. Unity Catalog provides centralized access controls and lineage tracking, while MLflow offers built-in experiment management and automatic tracking of GPU utilization.
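As a concrete sketch of that experiment-tracking story, the snippet below wraps a run in MLflow and registers a Transformers model under a Unity Catalog three-level name. The experiment path, metric value, and catalog.schema.model names are placeholders, and the automatic GPU-utilization tracking mentioned above is handled by the platform rather than shown in code.

```python
import mlflow
from transformers import pipeline

# Point the MLflow model registry at Unity Catalog.
mlflow.set_registry_uri("databricks-uc")
mlflow.set_experiment("/Shared/air-demo")  # placeholder experiment path

with mlflow.start_run(run_name="gpu-finetune-demo"):
    # In a real run this would be the fine-tuned model from the training sketch;
    # a stock pipeline keeps the example self-contained.
    clf = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

    mlflow.log_param("base_model", clf.model.name_or_path)
    mlflow.log_metric("eval_accuracy", 0.93)  # placeholder metric

    # Log and register the model under a Unity Catalog name (placeholders).
    mlflow.transformers.log_model(
        transformers_model=clf,
        artifact_path="model",
        registered_model_name="main.ml.sentiment_demo",
    )
```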

This unified approach keeps AI workloads within the enterprise data perimeter, offering robust security and compliance without sacrificing flexibility, and it lets teams bring GPU compute to the massive datasets the Lakehouse already manages instead of copying data out to a separate training environment.

Partnership with NVIDIA

The collaboration with NVIDIA is central to this offering. By integrating the latest NVIDIA hardware, including H100 GPUs, Databricks aims to provide customers with cutting-edge performance for their most demanding AI tasks. NVIDIA sees this as a crucial step in enabling broader AI adoption across industries.

Looking ahead, Databricks points to a continued partnership to bring future NVIDIA hardware, such as the RTX PRO 4500 Blackwell Server Edition, to its platform.

The public preview of AI Runtime is now available, with template notebooks and starter guides provided to help users get started quickly.