NVIDIA DGX Spark: Local LLM Performance Benchmarks

NVIDIA's Mozhgan Kabiri Chimeh reveals performance benchmarks for local LLM deployment on DGX Spark, highlighting the impact of model size, quantization, and the GB10 Grace Blackwell Superchip.


NVIDIA's latest presentation, featuring Developer Relations Manager Mozhgan Kabiri Chimeh, offers a deep dive into running Large Language Models (LLMs) locally and achieving practical performance on the DGX Spark platform. The session highlights challenges and solutions for developers aiming to build and deploy AI applications efficiently, emphasizing the interplay of hardware, software, and performance metrics.

NVIDIA DGX Spark: Local LLM Performance Benchmarks — from AI Engineer

Understanding Local AI Development Challenges

The presentation begins by outlining the common hurdles developers face when working with AI workloads locally. These challenges primarily stem from insufficient system resources, such as memory, or the lack of a compatible software stack. When local systems fall short, the typical solution is to offload the work to cloud or datacenter environments. However, this often introduces complexities related to cost, data residency, and scheduling conflicts, especially as LLMs grow in size and demand.

Introducing the DGX Spark

NVIDIA's DGX Spark is presented as a solution designed to bridge this gap. It is positioned as a powerful yet compact system built for developing and running AI. Key features include substantial local memory, support for NVIDIA's comprehensive AI software stack, and a power-efficient form factor. The DGX Spark can be configured as a standalone unit or a network-connected compute resource, offering flexibility for various deployment scenarios. At its core, the DGX Spark is powered by the NVIDIA GB10 Grace Blackwell Superchip.

The NVIDIA GB10 Grace Blackwell Superchip

The GB10 Grace Blackwell Superchip is the engine driving the DGX Spark's capabilities. It integrates a Blackwell GPU, designed for FP4 data formats and capable of delivering up to 1 petaFLOP of AI performance. The chip also features a 20-core Arm CPU, split into high-performance and efficiency cores, and utilizes NVIDIA's NVLink C2C interface for high-speed communication between CPU and GPU. With a massive 128GB of LPDDR5x Coherent Unified Memory, the GPU and CPU can share memory seamlessly, a crucial aspect for handling large LLMs. This architecture allows developers to run models with up to 200 billion parameters locally, on a system that can fit on a desktop.
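A quick back-of-the-envelope calculation shows why 128GB of unified memory supports the quoted ~200-billion-parameter ceiling. This is a rough sketch, not from the presentation: the 0.5 bytes/param figure corresponds to 4-bit (FP4) weights, and the 20% overhead reserved for KV cache, activations, and runtime is an assumed ballpark.

```python
def max_params_billion(memory_gb: float, bytes_per_param: float,
                       overhead_frac: float = 0.2) -> float:
    """Rough upper bound on model size (billions of parameters) that fits.

    Reserves `overhead_frac` of memory for KV cache, activations, and the
    runtime itself (assumed figure, not from the presentation).
    """
    usable_gb = memory_gb * (1.0 - overhead_frac)
    # GB of usable memory divided by bytes/param yields billions of params.
    return usable_gb / bytes_per_param

# 128 GB unified memory with FP4 weights (0.5 bytes/param):
print(max_params_billion(128, 0.5))  # ~204.8B params of weights alone
```

Under these assumptions, FP4 quantization is what makes a ~200B-parameter model fit; at FP16 (2 bytes/param) the same budget caps out near 50B.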

Methodology: The Reproducible Harness

To demonstrate the practical performance of LLMs on the DGX Spark, a reproducible benchmarking methodology was employed. The process involves setting up an environment, saving initial metrics, and then running models using a consistent protocol. Scripts are used to manage GPU logging, capture performance data, and ensure that each run is isolated and reproducible. This systematic approach allows for accurate measurement of model performance across different configurations and parameters. The benchmark script captures essential metrics like throughput and time-to-first-token (TTFT), providing a clear picture of the system's capabilities.
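The presentation does not publish its harness code, but the core of such a script is simple to sketch: wrap the model's token stream in a timer, record when the first token arrives (TTFT), and divide the remaining tokens by the remaining wall-clock time to get decode throughput. The function and the fake token generator below are illustrative, not NVIDIA's actual scripts.

```python
import time

def benchmark_stream(token_stream):
    """Time an iterable of generated tokens; report TTFT and decode rate."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            # First token observed: prefill latency ends here.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    decode = total - ttft if ttft is not None else 0.0
    tok_s = (count - 1) / decode if count > 1 and decode > 0 else 0.0
    return {"ttft_s": ttft, "tokens": count, "decode_tok_s": tok_s}

def fake_tokens(n=8, delay_s=0.005):
    """Stand-in for a real model's streaming output (hypothetical)."""
    for _ in range(n):
        time.sleep(delay_s)
        yield "tok"

print(benchmark_stream(fake_tokens()))
```

In a real harness, `fake_tokens()` would be replaced by the streaming iterator of an inference client, and each run would be repeated and isolated as the presentation describes.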

Performance Deep-Dive: Throughput

The presentation includes a detailed look at LLM throughput, measured in tokens per second. The benchmarks cover a range of models, from smaller instruction-tuned models to larger base models, tested in different precision formats. The 1.5B Instruct model leads at an impressive 61.73 tokens/sec, and throughput naturally decreases with model size and complexity: the 8B FP8 model delivers 23.88 tokens/sec, while the 14B NVFP4 model manages 20.19 tokens/sec. The 14B Base model, however, drops sharply to 8.40 tokens/sec; since it matches the NVFP4 variant in parameter count, the gap isolates the impact of precision. The data demonstrates that quantization, especially for larger models, is key to achieving higher throughput.
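These numbers line up with a simple memory-bandwidth roofline: autoregressive decode reads every weight once per token, so the ceiling is bandwidth divided by model size in bytes. The sketch below is not from the presentation; the 273 GB/s figure is DGX Spark's publicly reported LPDDR5x bandwidth, used here as an assumption.

```python
def est_decode_tok_s(params_billion: float, bytes_per_param: float,
                     bandwidth_gb_s: float = 273.0) -> float:
    """Memory-bound decode ceiling: each token streams all weights once."""
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb

print(est_decode_tok_s(14, 2.0))  # 14B at FP16: ~9.8 tok/s ceiling
print(est_decode_tok_s(14, 0.5))  # 14B at NVFP4: ~39 tok/s ceiling
```

The 14B Base result (8.40 tok/s) sits close to the FP16 ceiling, while the NVFP4 result (20.19 tok/s) reaches about half of its theoretical bound, which is consistent with quantized decode leaving some bandwidth unexploited.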

The UX Reality: Latency & TTFT

Beyond raw throughput, the user experience (UX) of interacting with LLMs is heavily influenced by latency, particularly the time-to-first-token (TTFT). The TTFT is the time it takes for the model to generate the first piece of output after a prompt is received. The presentation shows that lower TTFT values lead to a more responsive and engaging user experience. The 1.5B Instruct model excels here, with a TTFT of just 0.03 seconds. Even larger models, when optimized, show competitive TTFTs. For example, the 14B NVFP4 model achieves a TTFT of 0.07 seconds, which is 3.4 times faster than the 14B Base model's 0.24 seconds. This suggests that while larger models may have lower overall throughput, their latency can be managed effectively through optimization techniques like quantization, making them suitable for interactive applications.
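Combining the two reported metrics gives end-to-end response latency: TTFT for the prefill, plus the remaining tokens at the steady-state decode rate. This small helper is illustrative, plugging in the figures quoted above for the two 14B variants.

```python
def response_latency_s(ttft_s: float, decode_tok_s: float,
                       n_tokens: int) -> float:
    """End-to-end latency: prefill (TTFT) plus steady-state decode time."""
    return ttft_s + max(n_tokens - 1, 0) / decode_tok_s

# 256-token response using the presentation's 14B figures:
print(response_latency_s(0.07, 20.19, 256))  # 14B NVFP4: ~12.7 s
print(response_latency_s(0.24, 8.40, 256))   # 14B Base:  ~30.6 s
```

For long responses, throughput dominates the total; TTFT matters most for perceived responsiveness on short, interactive exchanges.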

Accelerate Your AI Developer Journey

NVIDIA promotes the DGX Spark as a platform that empowers developers to prototype locally and scale to the cloud. The accompanying website, build.nvidia.com/spark, provides resources, examples, and playbooks to help developers get started. This includes guidance on setting up remote access, exploring different models, and understanding the performance metrics. The platform aims to simplify the AI development lifecycle, enabling faster iteration and deployment of sophisticated LLM-based applications.

© 2026 StartupHub.ai. All rights reserved.