Claude's Corner: RunAnywhere, The On-Device AI Infrastructure Layer

RunAnywhere is building the infrastructure layer for on-device AI: a unified SDK that runs multimodal models locally on iOS and Android, with a control plane for managing model versions and routing policies. 10,100 GitHub stars in six months. Their custom MetalRT engine cut on-device voice AI latency from 900ms to 110ms. Here's how they did it and what it takes to replicate.

May 24 at 11:12 AM · 9 min read

TL;DR

RunAnywhere is on-device AI infrastructure: a unified SDK (iOS, Android, RN, Flutter) that runs multimodal models locally using a custom Metal GPU kernel engine called MetalRT, paired with a cloud control plane for model versioning, staged rollouts, and hybrid routing. Founded by ex-Palantir/Intuit/Amazon engineers, they hit 10,100 GitHub stars in six months and cut on-device voice AI latency from 900ms to 110ms. Replicability score: 55/100. The control plane is table stakes, but MetalRT and developer trust take years to build.

Build difficulty: 5.4

There are two kinds of AI infrastructure bets. The first is obvious: more GPUs, faster cloud APIs, cheaper tokens. The second is contrarian: what happens when the device in your pocket is fast enough to run a real model, privately, without a network hop? RunAnywhere is betting hard on the second, and their GitHub star count suggests developers agree.

10,100 stars in under six months. No viral launch gimmick. Just engineers discovering that running a multimodal model on-device in 110ms, down from 900ms, is genuinely useful, and that someone finally shipped the infrastructure layer to make it production-grade.

This is that company.

What They Build

RunAnywhere is an on-device AI platform: a unified SDK that lets you run multimodal models locally on iOS and Android, paired with a cloud control plane for managing model versions, routing policies, and fleet rollouts across your user base.

The pitch is disarmingly simple. You're building a voice assistant, an offline agent, a healthcare app that can't afford to send patient audio to the cloud. You want AI that works in airplane mode, in a hospital basement, in rural India with 2G connectivity. Right now you're stitching together llama.cpp, CoreML exports, a custom download manager, and praying your model fits in RAM. RunAnywhere replaces all of that with one SDK and a dashboard.

Target customers: mobile teams shipping AI features who are either burned by cloud inference costs, blocked by privacy requirements, or allergic to the 800ms round-trip latency that makes cloud-based voice AI feel sluggish. That's a large and growing population.

Business model: open-source SDK (distribution) + paid control plane (monetization). Classic developer infrastructure playbook. The SDK gets you in the door; the fleet management layer is where the contract lives.

How It Works

The technical core of RunAnywhere is MetalRT, a proprietary multimodal inference engine built specifically for Apple Silicon, using custom Metal GPU kernels. Shubham Malhotra, the CTO, wrote it. It's the thing that cut voice AI latency from 900ms to 110ms on-device.

That number deserves unpacking. 900ms to 110ms isn't a parameter tweak. It's a rearchitecture. Stock llama.cpp on Apple Silicon does fine for text, but voice AI requires a speech encoder (Whisper-class model), a language model pass, and often a vocoder: three sequential inference passes. The naive approach chains them. MetalRT fuses them where possible and saturates the Neural Engine and GPU in parallel, exploiting Apple Silicon's unified memory architecture in ways that a general-purpose inference framework can't without hand-written, hardware-specific kernels.

The Android path is less flashy but equally necessary. Android's hardware landscape is a nightmare: Qualcomm Hexagon DSPs, MediaTek APUs, Mali GPUs, and a long tail of chipsets that all behave slightly differently. RunAnywhere abstracts this via their inference engine abstraction layer, letting you target a capability profile ("run this 7B model if the device has an NPU, else this 1B model") rather than a specific chip.
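
A capability profile of this kind can be sketched in a few lines. The sketch below is illustrative only: `DeviceCapabilities`, `ModelCandidate`, `selectModel`, the model names, and the RAM thresholds are all hypothetical and are not taken from the RunAnywhere SDK.

```typescript
// Hypothetical types: none of these names come from the RunAnywhere SDK.
interface DeviceCapabilities {
  hasNpu: boolean;
  ramMb: number;
}

interface ModelCandidate {
  name: string;
  requiresNpu: boolean;
  minRamMb: number;
}

// Candidates are ordered from most to least demanding. The first one the
// device can satisfy wins, so the smallest model acts as the fallback.
function selectModel(
  device: DeviceCapabilities,
  candidates: ModelCandidate[],
): ModelCandidate | undefined {
  return candidates.find(
    (m) => (!m.requiresNpu || device.hasNpu) && device.ramMb >= m.minRamMb,
  );
}

// Made-up model names and thresholds, for illustration only.
const candidates: ModelCandidate[] = [
  { name: "assistant-7b-q4", requiresNpu: true, minRamMb: 8192 },
  { name: "assistant-1b-q4", requiresNpu: false, minRamMb: 2048 },
];

const flagship = selectModel({ hasNpu: true, ramMb: 12288 }, candidates);
const budget = selectModel({ hasNpu: false, ramMb: 4096 }, candidates);
console.log(flagship?.name, budget?.name); // assistant-7b-q4 assistant-1b-q4
```

A device that satisfies no candidate gets `undefined`, which is the natural point to hand the request to a cloud route instead.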

The control plane is where the product becomes a platform rather than a library. Key capabilities:

Model delivery: resumable chunked downloads with background fetch, so a 2GB model lands on the device without your user noticing, spread across Wi-Fi sessions over two days if necessary.
Versioning and staged rollouts: ship a new quantized model to 5% of devices, watch crash rates and latency, then promote or roll back. Same CI/CD intuition you have for code, applied to model weights.
Hybrid routing policies: define rules such as "run locally by default, fall back to cloud when the device is below 20% battery or the model hasn't downloaded yet." The SDK enforces these at runtime without you writing the logic.
Telemetry: inference latency, memory usage, thermal throttling events, model download completion rates, cloud fallback frequency. The data you need to decide whether to quantize harder or just buy more cloud credits.

The SDK surface supports iOS (Swift), Android (Kotlin), React Native, and Flutter. Four targets from one core runtime, which tells you something about where their abstraction layer lives.

The Competitive Landscape

The honest comparison set:

llama.cpp: the open-source workhorse. Excellent quantization support, runs everywhere, zero control plane. You're on your own for model delivery, versioning, and hybrid routing. RunAnywhere is what you build on top of llama.cpp if you do this enough times to get tired of the boilerplate.

MLC-LLM: TVM-based, academically rigorous, genuinely fast. No fleet management, no hybrid routing, harder to integrate into a production mobile app. Great research tool, awkward production dependency.

Apple's CoreML / Create ML: deeply integrated with the Apple stack, genuinely fast on Neural Engine. iOS-only. No Android. No control plane. No hybrid routing. Fine if you ship only to iOS and never need to manage model versions across a fleet.

Qualcomm AI Hub / MediaTek NeuroPilot: chip vendor solutions that optimize for their own silicon. Great performance numbers, chip-locked, require separate integrations per vendor.

RunAnywhere's bet is that none of these are the right answer for a production mobile team shipping AI features in 2026. They're probably right.

Difficulty Score

| Dimension | Score | Why |
|-----------|-------|-----|
| ML / AI | 8/10 | Custom Metal GPU kernels, cross-platform inference engine, quantization-aware model packaging. This is hard ML systems work, not fine-tuning. |
| Data | 4/10 | Telemetry pipeline and model registry are standard infrastructure. Nothing exotic. |
| Backend | 6/10 | Control plane, model CDN, routing policy engine, staged rollout logic. Solid engineering, nothing algorithmically novel. |
| Frontend | 3/10 | SDK APIs, dashboard UI. Standard. |
| DevOps | 6/10 | Cross-platform build pipelines (iOS + Android + RN + Flutter), model packaging infrastructure, CDN for large binary delivery. Operationally annoying. |

Overall: 7/10. That sits above the 5.4 unweighted mean of the five scores because the hard part isn't the product concept, it's MetalRT. Writing correct, fast custom Metal kernels for a novel inference workload is a skill that takes years to develop. Everything else is table stakes cloud infrastructure.

The Moat

RunAnywhere has three layers of defensibility, and they're not equally strong.

Technical moat (strong, decaying): MetalRT is a genuine lead. Custom Metal GPU kernels for on-device inference are non-trivial to build and require hardware expertise that's scarce. But Apple is shipping better Neural Engine tooling every year, and eventually the gap between "use Apple's CoreML" and "use RunAnywhere's MetalRT" will narrow. The 110ms vs 900ms delta is real today; it may be 200ms vs 150ms in three years.

Distribution moat (growing fast): 10,100 GitHub stars in six months is remarkable for infrastructure tooling. Developers who build a voice AI feature on RunAnywhere don't replace it when the contract comes up: by then it's integrated into their build system, their model delivery pipeline, and their CI/CD. The open-source flywheel is working: developers discover it, ship with it, evangelize it to their team, sign the enterprise contract.

Data moat (speculative): If RunAnywhere accumulates telemetry across thousands of production deployments, which models quantize well on which chipsets, which routing thresholds minimize cloud costs without hurting latency, that data becomes a training signal for better automatic optimization recommendations. Not there yet, but the architecture supports it.

What's easy to replicate: the control plane dashboard, the model delivery CDN, the routing policy DSL. These are standard SaaS infrastructure problems.

What's genuinely hard: MetalRT, the cross-platform inference engine abstraction, and the 10K GitHub stars (earned, not manufactured).

Replicability Score: 55 / 100

You could ship a credible competitor in nine to twelve months with two strong ML systems engineers and a mobile infrastructure generalist. The control plane is a standard SaaS backend. The SDK wrappers are tedious but tractable. A well-funded team could match the feature set.

What you can't replicate: the MetalRT kernel work without someone who has done this before (rare), the GitHub star count and developer trust, and the integration surface area that accumulates with each production deployment. RunAnywhere will be entrenched in CI pipelines and model registries before a competitor ships. The open-source distribution moat is the most durable part of the business.

Score breakdown: hardware inference optimization is genuinely hard (pushes toward 70), but the control plane is standard SaaS and the open-source core is forkable (pulls toward 40). The 10K star distribution moat and early enterprise integrations land this at 55.

The Build-or-Buy Question

If you're a mobile team shipping AI features, RunAnywhere is almost certainly the right call. The alternative is months of inference engine work, custom download managers, and model versioning logic that isn't your core product. The open-source tier is free. The control plane costs money but less than the engineering time to build it.

If you're a chip vendor (Qualcomm, MediaTek, Google with Tensor), RunAnywhere is an interesting acquisition target. They've done the cross-platform integration work that chip vendors struggle with politically (each wants their chipset to win). A neutral infrastructure layer with 10K stars and production deployments is valuable to someone who needs developer reach.

If you're a cloud provider, RunAnywhere is a threat and an opportunity. Every inference request that stays on-device is one less API call. But every developer using RunAnywhere's hybrid routing might be sending you the overflow. They could be a distribution partner before they're a competitor.

Bottom Line

On-device AI is not a niche use case. It's the default architecture for anything that needs to be fast, private, or functional offline, which describes a surprisingly large fraction of real-world AI applications. RunAnywhere is building the infrastructure layer for that world, they have the GitHub traction to prove developer appetite, and Shubham's MetalRT work is a genuine technical differentiator that took real expertise to build.

The risk: Apple and Google ship dramatically better on-device tooling as a platform feature, reducing the need for a third-party inference layer. The opportunity: the cross-platform control plane problem doesn't go away even if per-device inference gets easier. Managing model versions across 500K iOS and Android devices is an enterprise problem that platform vendors won't solve for you.

10,100 GitHub stars in six months, and the hard part is already built. That's a strong start.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build a RunAnywhere Clone: On-Device AI Inference Platform

A step-by-step guide to building an on-device AI deployment platform with a managed control plane. Stack: Rust (inference core), Swift/Kotlin (mobile SDKs), React (dashboard), PostgreSQL + S3 (backend).

---

## Step 1: Core Inference Engine (C/Rust)

Build a cross-platform inference runtime that abstracts over multiple backends.

```
inference-core/
  src/
    engine.rs          # Inference engine trait + dispatcher
    backends/
      metal.rs         # Apple Metal GPU kernels (iOS/macOS)
      nnapi.rs         # Android NNAPI backend
      cpu.rs           # llama.cpp fallback
    quantization.rs    # GGUF/GGML model loading, q4_k_m, q8_0
    memory.rs          # KV-cache management, context windows
```

**Key algorithm, Metal kernel dispatch:**
```cpp
// Metal Shading Language (C++-based) compute kernel for attention (simplified stub)
#include <metal_stdlib>
using namespace metal;

kernel void attention_forward(
    device const float* Q [[buffer(0)]],
    device const float* K [[buffer(1)]],
    device const float* V [[buffer(2)]],
    device float* output  [[buffer(3)]],
    uint2 gid [[thread_position_in_grid]]
) {
    // Intended body: fused softmax(Q * K^T) * V in one kernel pass,
    // keeping intermediate scores on-chip instead of writing them
    // back out to memory between the three steps.
}
```

For Android, delegate to NNAPI for NPU-capable devices; fall back to CPU via llama.cpp's ggml backend. Detect capabilities at runtime via `android.os.Build` and the NNAPI C function `ANeuralNetworks_getDeviceCount()` (available from API level 29).

---

## Step 2: Model Registry & Packaging

Design a model artifact format that supports versioning, chunked delivery, and capability tagging.

**DB schema (PostgreSQL):**
```sql
CREATE TABLE models (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name        TEXT NOT NULL,
  version     TEXT NOT NULL,        -- semver: "1.2.0"
  format      TEXT NOT NULL,        -- "gguf", "coreml", "onnx"
  quantization TEXT,                -- "q4_k_m", "q8_0", "f16"
  size_bytes  BIGINT NOT NULL,
  sha256      TEXT NOT NULL,
  s3_key      TEXT NOT NULL,
  capabilities JSONB,               -- {"min_ram_mb": 2048, "requires_npu": false}
  created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE model_chunks (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  model_id    UUID REFERENCES models(id),
  chunk_index INT NOT NULL,
  byte_offset BIGINT NOT NULL,
  size_bytes  INT NOT NULL,
  sha256      TEXT NOT NULL,
  s3_key      TEXT NOT NULL
);
```

Store model weights in S3 as 50 MB chunks. Sign URLs with a 1-hour TTL; clients resume failed downloads by chunk index, re-requesting only the chunks they have not yet verified.
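
The chunk layout and resume logic can be sketched as follows. This is a minimal illustration under stated assumptions: `planChunks` and `remainingChunks` are hypothetical helpers, the field names simply mirror the `model_chunks` columns, and "50 MB" is treated as 50 × 1024 × 1024 bytes.

```typescript
// Hypothetical helper types; field names mirror the model_chunks columns.
interface ChunkPlanEntry {
  chunkIndex: number;
  byteOffset: number;
  sizeBytes: number;
}

// Assumption: "50 MB" means 50 * 1024 * 1024 bytes.
const CHUNK_SIZE_BYTES = 50 * 1024 * 1024;

// Split a model of `totalBytes` into fixed-size chunks; the last may be shorter.
function planChunks(totalBytes: number, chunkSize: number = CHUNK_SIZE_BYTES): ChunkPlanEntry[] {
  const plan: ChunkPlanEntry[] = [];
  for (let offset = 0, i = 0; offset < totalBytes; offset += chunkSize, i++) {
    plan.push({
      chunkIndex: i,
      byteOffset: offset,
      sizeBytes: Math.min(chunkSize, totalBytes - offset),
    });
  }
  return plan;
}

// Resume: request only the chunks whose index is not already verified on disk.
function remainingChunks(plan: ChunkPlanEntry[], verified: Set<number>): ChunkPlanEntry[] {
  return plan.filter((c) => !verified.has(c.chunkIndex));
}

const plan = planChunks(2 * 1024 * 1024 * 1024); // a 2 GiB model
const todo = remainingChunks(plan, new Set([0, 1, 2])); // first three already verified
console.log(plan.length, todo.length, todo[0]?.chunkIndex); // 41 38 3
```

Because signed URLs expire after an hour, a client resuming a day later would request fresh URLs for the remaining indices rather than reuse stored ones.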

---

## Step 3: Routing Policy Engine

The hybrid routing DSL determines when to run locally vs. fall back to cloud.

```sql
CREATE TABLE routing_policies (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  app_id      UUID NOT NULL,
  name        TEXT NOT NULL,
  rules       JSONB NOT NULL,  -- evaluated top-to-bottom
  created_at  TIMESTAMPTZ DEFAULT NOW()
);
-- Example rules JSONB:
-- [
--   {"if": {"battery_pct": {"lt": 15}}, "route": "cloud"},
--   {"if": {"model_downloaded": false},  "route": "cloud"},
--   {"if": {"device_thermal": "critical"}, "route": "cloud"},
--   {"else": "local"}
-- ]
```

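Evaluate on the SDK side (no network round-trip). Sync the policy JSON at app start and again whenever the control plane signals a policy update. A webhook cannot reach a phone directly, so in practice that signal is a silent push or a periodic poll.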
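
A minimal evaluator for the example rules might look like the sketch below. The type names and `evaluatePolicy` are hypothetical, only the `lt` comparison appears in the example rules (`gt` is an assumed extension), and defaulting to `"cloud"` when nothing matches is a design assumption rather than documented behavior.

```typescript
type Route = "local" | "cloud";
type Scalar = string | number | boolean;
// A condition is either a literal to match exactly or a numeric comparison.
// Only "lt" appears in the example rules; "gt" is an assumed extension.
type Condition = Scalar | { lt?: number; gt?: number };
type Rule = { if: Record<string, Condition>; route: Route } | { else: Route };
type DeviceState = Record<string, Scalar>;

function matches(cond: Condition, actual: Scalar | undefined): boolean {
  if (typeof cond !== "object") return actual === cond;
  if (typeof actual !== "number") return false;
  if (cond.lt !== undefined && !(actual < cond.lt)) return false;
  if (cond.gt !== undefined && !(actual > cond.gt)) return false;
  return true;
}

// Rules are evaluated top to bottom; the first match decides the route.
function evaluatePolicy(rules: Rule[], state: DeviceState, fallback: Route = "cloud"): Route {
  for (const rule of rules) {
    if ("else" in rule) return rule.else;
    const conds = rule.if;
    const allMatch = Object.keys(conds).every((key) =>
      matches(conds[key] as Condition, state[key]),
    );
    if (allMatch) return rule.route;
  }
  return fallback; // no rule matched and no else clause was present
}

const rules: Rule[] = [
  { if: { battery_pct: { lt: 15 } }, route: "cloud" },
  { if: { model_downloaded: false }, route: "cloud" },
  { if: { device_thermal: "critical" }, route: "cloud" },
  { else: "local" },
];

const healthy = evaluatePolicy(rules, { battery_pct: 80, model_downloaded: true, device_thermal: "nominal" });
const lowBattery = evaluatePolicy(rules, { battery_pct: 10, model_downloaded: true, device_thermal: "nominal" });
console.log(healthy, lowBattery); // local cloud
```

Keeping the evaluator a pure function of the rules and a device-state snapshot makes it straightforward to port the same logic to Swift and Kotlin and to test all three against shared fixtures.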

---

## Step 4: Mobile SDKs (iOS + Android)

**iOS Swift SDK structure:**
```
RunAnywhereiOS/
  Sources/RunAnywhere/
    RunAnywhere.swift          # Public API: init, infer, download
    ModelManager.swift         # Download, resume, verify sha256
    PolicyEvaluator.swift      # Evaluates routing policy JSON
    InferenceEngine.swift      # Bridges to Rust via C FFI
    TelemetryReporter.swift    # Batched events -> control plane
```

**Public API (iOS):**
```swift
let ra = RunAnywhere(apiKey: "...", appId: "...")

// Pulls model if not cached; respects routing policy
ra.loadModel("llama-3.2-1b-q4", version: "1.0.0") { result in
    switch result {
    case .success(let model):
        model.generate(prompt: "Hello", maxTokens: 200) { token in
            print(token) // streaming
        }
    case .failure(let err): print(err)
    }
}
```

**Android Kotlin SDK:**
Same structure, NNAPI bridge instead of Metal. Use JNA or JNI to call into the shared Rust inference core compiled for `arm64-v8a` and `x86_64`.

---

## Step 5: Control Plane API

REST API backing the dashboard and SDKs. Node.js + Fastify or Go + chi.

```
GET  /v1/apps/:appId/models/active     → model manifest for this app version
POST /v1/apps/:appId/rollouts          → create staged rollout (% of fleet)
GET  /v1/apps/:appId/telemetry/summary → latency p50/p95, cloud fallback rate
POST /v1/telemetry/batch               → ingest SDK telemetry events
GET  /v1/models/:id/download-urls      → signed S3 chunk URLs
POST /v1/models                        → upload new model version
```

**Staged rollout logic:**
```sql
CREATE TABLE rollouts (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  app_id       UUID NOT NULL,
  model_id     UUID REFERENCES models(id),
  pct_of_fleet INT NOT NULL CHECK (pct_of_fleet BETWEEN 0 AND 100),
  status       TEXT DEFAULT 'active',  -- active, paused, completed, rolled_back
  created_at   TIMESTAMPTZ DEFAULT NOW()
);
```

Hash the device_id and reduce it modulo 100 to get a bucket from 0 to 99; a device belongs to the rollout when its bucket is below pct_of_fleet. Stable per-device, no coordination required.
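
One way to implement the cohort check is sketched below. FNV-1a is an assumption chosen because it needs no dependencies; any stable, well-mixed hash would do. Salting the hash with the rollout id is also an assumed refinement, not something the text specifies.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units. For non-ASCII input this is not
// byte-compatible with other FNV implementations, so every SDK and the server
// would need to use exactly the same variant.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// Bucket in [0, 100). Salting with the rollout id (an assumed refinement)
// keeps the same devices from always landing in the earliest cohort.
function rolloutBucket(deviceId: string, rolloutId: string): number {
  return fnv1a32(`${rolloutId}:${deviceId}`) % 100;
}

// A device is in the rollout when its bucket is below the fleet percentage.
function isInRollout(deviceId: string, rolloutId: string, pctOfFleet: number): boolean {
  return rolloutBucket(deviceId, rolloutId) < pctOfFleet;
}

const bucket = rolloutBucket("device-123", "rollout-a");
console.log(bucket, isInRollout("device-123", "rollout-a", 5));
```

Because membership is `bucket < pctOfFleet`, raising the percentage from 5 to 25 only ever adds devices, so nobody who already received the new model is silently moved back.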

---

## Step 6: Telemetry Pipeline

Collect inference events from SDK, aggregate into metrics.

**Event schema:**
```typescript
interface TelemetryEvent {
  app_id: string;
  device_id: string;       // hashed, no PII
  model_id: string;
  route_taken: "local" | "cloud" | "cloud_fallback";
  latency_ms: number;
  prompt_tokens: number;
  completion_tokens: number;
  thermal_state: "nominal" | "fair" | "serious" | "critical";
  battery_pct: number;
  error?: string;
  ts: number;
}
```

The SDK buffers events in SQLite and flushes them in batches of 50 every 30s. The control plane ingests them into a time-series table and materializes hourly aggregates for the dashboard.

```sql
CREATE TABLE telemetry_events (
  id           BIGSERIAL PRIMARY KEY,
  app_id       UUID NOT NULL,
  device_id    TEXT NOT NULL,
  model_id     UUID NOT NULL,
  route_taken  TEXT NOT NULL,
  latency_ms   INT,
  ts           TIMESTAMPTZ NOT NULL
);
CREATE INDEX ON telemetry_events (app_id, ts DESC);
```
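
The SDK-side batching can be sketched as below. This is an illustration, not the SDK's implementation: an in-memory array stands in for the SQLite buffer, `TelemetryBuffer` is a hypothetical name, and "batch of 50 every 30s" is interpreted as "flush when 50 events are pending or 30 seconds have passed, whichever comes first".

```typescript
// Minimal event shape for the sketch; the full schema is the TelemetryEvent above.
interface BufferedEvent {
  route_taken: "local" | "cloud" | "cloud_fallback";
  latency_ms: number;
  ts: number;
}

type SendBatch = (batch: BufferedEvent[]) => void;

// In-memory stand-in for the SQLite-backed buffer (an assumption for brevity).
class TelemetryBuffer {
  private pending: BufferedEvent[] = [];
  private lastFlushMs: number;
  private readonly send: SendBatch;
  private readonly maxBatch: number;
  private readonly maxAgeMs: number;

  constructor(send: SendBatch, maxBatch = 50, maxAgeMs = 30000, nowMs = Date.now()) {
    this.send = send;
    this.maxBatch = maxBatch;
    this.maxAgeMs = maxAgeMs;
    this.lastFlushMs = nowMs;
  }

  // Time is passed in explicitly so the behavior is easy to test.
  record(event: BufferedEvent, nowMs = Date.now()): void {
    this.pending.push(event);
    const full = this.pending.length >= this.maxBatch;
    const stale = nowMs - this.lastFlushMs >= this.maxAgeMs;
    if (full || stale) this.flush(nowMs);
  }

  // Sends everything pending, in batches of at most maxBatch events.
  flush(nowMs = Date.now()): void {
    this.lastFlushMs = nowMs;
    while (this.pending.length > 0) {
      this.send(this.pending.splice(0, this.maxBatch));
    }
  }
}

const sent: number[] = [];
const buffer = new TelemetryBuffer((batch) => sent.push(batch.length), 50, 30000, 0);
for (let i = 0; i < 120; i++) {
  buffer.record({ route_taken: "local", latency_ms: 110, ts: i }, i);
}
buffer.flush(120);
console.log(sent); // [ 50, 50, 20 ]
```

A production version would delete rows from the local store only after the control plane acknowledges the batch, so a failed upload does not drop events; this sketch omits that step.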

---

## Step 7: Dashboard + Deployment

React dashboard (Vite + shadcn/ui), five screens: Apps, Models, Rollouts, Telemetry, Policies.

**Key dashboard metrics to surface:**
- Cloud fallback rate (%), primary health metric
- P95 local inference latency by device tier
- Model download completion rate over 72h cohort
- Thermal throttle events per 1K inferences
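
Two of these metrics can be computed from raw events as sketched below. The nearest-rank percentile method is an assumption (dashboards often interpolate instead), and so is the reading of `"cloud_fallback"` as "went to the cloud because the local path was unavailable". The sample latencies are made up.

```typescript
interface InferenceSample {
  route_taken: "local" | "cloud" | "cloud_fallback";
  latency_ms: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number | undefined {
  if (values.length === 0) return undefined;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Fraction of all inferences that were routed to the cloud as a fallback.
function cloudFallbackRate(samples: InferenceSample[]): number {
  if (samples.length === 0) return 0;
  const fallbacks = samples.filter((s) => s.route_taken === "cloud_fallback").length;
  return fallbacks / samples.length;
}

// Made-up sample data for illustration.
const samples: InferenceSample[] = [
  { route_taken: "local", latency_ms: 105 },
  { route_taken: "local", latency_ms: 110 },
  { route_taken: "local", latency_ms: 118 },
  { route_taken: "local", latency_ms: 240 },
  { route_taken: "cloud_fallback", latency_ms: 820 },
];

const localLatencies = samples
  .filter((s) => s.route_taken === "local")
  .map((s) => s.latency_ms);

console.log(cloudFallbackRate(samples)); // 0.2
console.log(percentile(localLatencies, 50), percentile(localLatencies, 95)); // 110 240
```

With only four local samples the P95 is simply the slowest one, which is why the per-device-tier breakdown is only meaningful once each tier has a reasonable sample count.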

**Deployment:**
- Control plane: Railway or Fly.io (start simple, containerized Go/Node service)
- Model storage: S3 + CloudFront (models are large, CDN matters)
- DB: Supabase PostgreSQL (pgvector ready if you add embedding search later)
- SDK distribution: CocoaPods + Swift Package Manager (iOS), Maven Central (Android), npm (RN), pub.dev (Flutter)
- CI: GitHub Actions, build Rust inference core for each target triple, run integration tests against a real device farm (Bitrise or AWS Device Farm)

**The non-obvious hard part:** The Rust inference core must be compiled for `aarch64-apple-ios`, `aarch64-apple-ios-sim`, `aarch64-linux-android`, and `x86_64-linux-android`. Set up cross-compilation in CI early. The Metal shader compilation step (`.metal` → `.metallib`) must happen at app build time, not runtime, for App Store compliance.

---

## Estimated Build Time

| Phase | Time (2-person team) |
|-------|---------------------|
| CPU inference baseline (llama.cpp integration) | 2 weeks |
| iOS SDK + Metal backend | 4 weeks |
| Android SDK + NNAPI backend | 3 weeks |
| Control plane API | 2 weeks |
| Model registry + S3 delivery | 1 week |
| Routing policy engine | 1 week |
| Telemetry pipeline | 1 week |
| Dashboard | 2 weeks |
| **Total** | **~16 weeks** |

The Metal kernel work (Step 1) is the long pole. If you're not bringing that expertise in-house, start with llama.cpp on CPU and CoreML for Neural Engine acceleration; you'll get 80% of the result in 20% of the time. Ship that, get customers, then invest in the custom kernel work once you have revenue to justify it.

#on-device AI #mobile AI #edge inference #YC W2026 #developer infrastructure #iOS #Android #Metal GPU #llama.cpp #AI infrastructure