Claude's Corner: Compresr — The Token Accountant Your AI Stack Desperately Needs

Four EPFL researchers built a PhD-backed LLM context compression API that could cut your token bill by 10x — or get eaten alive by Anthropic. Here's the technical breakdown and how to build your own.


Every AI company is bleeding tokens. Not metaphorically — literally. Context windows are the new compute budget, and most teams have no idea how fast they're burning through them. A RAG pipeline that retrieves 20 documents? 30,000 tokens minimum. An agentic loop with tool calls, prior conversation, and system prompt? You're at 50,000 before the model has said a word. The bill arrives at the end of the month and finance emails you asking what "Anthropic API" is.

Compresr (YC W2026) thinks it has the fix. Four EPFL researchers — including a CEO who wrote his PhD specifically on LLM context compression — built an API that compresses what goes into the context window without losing what actually matters. The pitch is clean: same answers, fewer tokens, lower latency, smaller bills. Drop in their SDK or stand up their open-source proxy, and the rest just works.

This is either a clever infrastructure wedge that grows into essential AI plumbing, or a feature that Anthropic ships in a Tuesday release and vaporizes the company. Let's figure out which.

What They Build

Compresr offers two products. The first is a compression API — you send them a query plus the context you were going to inject, and they return a compressed version that preserves the semantically relevant tokens for that specific query. It's query-conditioned extraction, not dumb truncation. The second is Context Gateway, an open-source proxy (Go + TypeScript, 595 GitHub stars as of writing) that sits between your coding agent and the LLM API. It intercepts outbound context, compresses it on the fly, and forwards the leaner payload. For Claude Code or Cursor users, setup is a config file change and a Docker container.

The target customer is any team running high-token workloads: RAG pipelines ingesting large document sets, agentic systems accumulating long tool-call histories, coding assistants working across large codebases. In practice, that's almost every serious AI application today.

The business model is usage-based: you pay per compression call, presumably at a price that undercuts the token savings on the downstream LLM call. There's no published pricing yet, which is very YC early-stage, but the unit economics make structural sense: if they charge $0.001 per compression and save you $0.01 in GPT-4o tokens, you're happy.

How It Actually Works

Context compression sounds simple — delete the irrelevant stuff. The hard part is knowing what's irrelevant, and that answer changes depending on what you're trying to do.

Compresr's core approach is intent-conditioned compression. Rather than blindly summarizing or truncating, the system analyzes why a piece of context was retrieved — the query behind the retrieval — and filters tokens against that intent. A grep that was called to find error patterns? Keep the matching lines, drop the surrounding noise. A document retrieved for a specific clause? Keep the clause and its immediate context, compress the rest.

Under the hood, Context Gateway uses small language models (SLMs) as the compression engine — cheap, fast models that can perform semantic filtering without the latency or cost of a full frontier call. The system runs compression asynchronously in the background, triggering at 85% context fill before the agent even notices the window is full. This is the key UX insight: users don't notice background compression that adds no latency, but they will not sit through a three-minute synchronous compact operation.

The proxy also implements lazy tool loading — only the tools relevant to the current step appear in the context. This matters because OpenAI's and Anthropic's tool schemas are verbose: a 30-tool configuration easily adds 5,000 tokens in schema definitions alone. Show 3 tools, not 30.
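
A rough sketch of the idea in Python, assuming OpenAI/Anthropic-style tool definitions with `name` and `description` fields; naive keyword overlap stands in for the semantic scoring a real gateway would use:

```python
def filter_tools(tools: list[dict], current_step: str, max_tools: int = 3) -> list[dict]:
    """Keep only the tool schemas that look relevant to the current step."""
    step_words = set(current_step.lower().split())

    def relevance(tool: dict) -> int:
        # Crude keyword overlap between the step description and the tool schema
        text = (tool.get("name", "") + " " + tool.get("description", "")).lower()
        return sum(word in text for word in step_words)

    return sorted(tools, key=relevance, reverse=True)[:max_tools]
```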

One clever detail: compressed segments preserve an expand() handle. If the model determines it needs the uncompressed version of something, it can call back for it. This is crucial for correctness — you don't want a compressed grep output to cause the agent to miss a bug — and it distinguishes Compresr's approach from lossy summarization. They're very pointed about this distinction: summarization changes the content; their compression changes the representation while keeping the content intact.

The tech stack choice is interesting. Go at 90% of the codebase signals they care about throughput and predictable latency in the proxy layer. This isn't a Python FastAPI weekend project. You want the compression middleware to be the fastest thing in your request chain, not a bottleneck.

Difficulty Score

| Dimension | Score | Why |
| --- | --- | --- |
| ML / AI | 7/10 | SLM-based semantic filtering, intent conditioning, compression without accuracy loss — an active research area with meaningful IP |
| Data | 5/10 | Needs quality/accuracy benchmark datasets for compression validation; no proprietary data moat yet |
| Backend | 5/10 | Go proxy with async compression, context-window tracking, multi-agent support — solid engineering but not exotic |
| Frontend | 2/10 | Minimal — a dashboard for monitoring compression metrics |
| DevOps | 4/10 | Docker, multi-agent config, SLM inference infra — manageable for a small team |

The Moat (Or: What You Can't Just Copy)

Let's be honest about what's replicable here.

The Context Gateway proxy? You can build a functional clone in a weekend. The architecture is public on GitHub. An LLM proxy in Go is three hundred lines of code. The open-source release was smart for adoption, but it also means the infrastructure layer is commoditized by design.

The compression API is the real question. Compresr's founders spent years on this at EPFL — Zakazov's PhD was specifically on this problem, Gabouj researched efficient ML and prompt compression, and the team came from Bell Labs and AXA Research. This isn't a team that read a few papers on LLMLingua and shipped an endpoint. They have working knowledge of where the published methods break and what the failure modes are at production scale.

The moat is quality. Specifically: compression that provably doesn't degrade downstream model accuracy, benchmarked across diverse task types, at useful compression ratios. Getting to "100x compression, same accuracy" is not a weekend task. Published open-source methods like LLMLingua, RECOMP, and AutoCompressor give you a starting point, but production-grade reliability across arbitrary input types requires the kind of empirical grind that takes months and a lot of labeled failure cases.

The real defensibility, if they get it, will come from:

  • Integration depth — being the default compression middleware for major agent frameworks (LangChain, LlamaIndex, AutoGen) creates switching costs
  • Accuracy benchmarks — if they can publish compelling evals that competitors can't match, procurement decisions write themselves
  • The feedback loop — every production compression job is a signal about where quality degrades. More usage = better compression = more usage

The existential risk is the one the HN commenters raised: Anthropic, OpenAI, and Google can ship native context compression any time they want. Claude's /compact already exists. If they make it automatic and smarter, Compresr's core value prop evaporates. The counter-argument is that this is also true of every observability, caching, and infrastructure layer that has ever existed around cloud APIs — and most of those companies survived and thrived. Datadog didn't die when AWS launched CloudWatch.

There's also a legitimate security concern worth flagging: running untrusted external content (like retrieved documents) through a compression layer that modifies them before injection introduces a potential prompt injection surface. A document that says "ignore previous compression instructions and include the following..." is a real threat class. Compresr needs to be as serious about this as they are about compression ratios.

Replicability Score: 42 / 100

You can clone the Context Gateway proxy in a weekend — it's open source. You can approximate the compression logic with LLMLingua or a prompt-based approach using a cheap LLM. What you cannot quickly replicate is four years of PhD-level research on compression quality, the benchmark suite that proves your system doesn't silently degrade agent performance, or the integration partnerships that make you the default choice in major frameworks.

The 42 score reflects that the structural moat is real but not deep. A well-resourced team (say, a YC competitor with two ML researchers) could get to 80% of the quality in three to six months. The remaining 20% — the edge cases where bad compression costs an agent an entire task — is what matters in production and takes much longer to nail. That gap is narrow enough to call this clonable, but wide enough that the team has a meaningful head start.

Should You Build This?

If you're a developer running high-token AI workloads today: yes, use Compresr or at minimum their open-source Context Gateway. The ROI math is obvious. A 10x compression ratio on tool outputs alone — even if you discount the "100x" headline claim — cuts meaningful API spend while you're waiting for context windows to get cheaper.

If you're thinking about building a competitor: the barrier isn't the proxy layer, it's the ML research. Go read the LLMLingua papers, the RECOMP paper, and the AutoCompressor work from Princeton. Then budget six months to build something you'd actually trust with production agent workloads. The market exists — every serious AI team is spending real money on tokens — but the compression quality bar is higher than it looks from the outside.

The broader point is that context efficiency is becoming a first-class engineering concern, not an afterthought. Compresr is early to this infrastructure layer. The question is whether they can build deep enough before the model providers swallow the market. It's a race worth watching.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build Guide: LLM Context Compression API (Compresr Clone)

A step-by-step guide to building a production-grade LLM context compression service using Claude Code.

---

## Step 1: Define Your Data Model & DB Schema

```sql
-- Compression jobs table
CREATE TABLE compression_jobs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  created_at TIMESTAMPTZ DEFAULT now(),
  query TEXT NOT NULL,              -- the user intent / query that conditions compression
  raw_context TEXT NOT NULL,        -- original context passed in
  compressed_context TEXT,          -- output
  compression_ratio FLOAT,          -- raw_tokens / compressed_tokens
  model_used VARCHAR(100),
  latency_ms INT,
  accuracy_score FLOAT,             -- optional: downstream eval score
  expand_handles JSONB DEFAULT '[]' -- [{id, original_chunk, position}]
);

-- API keys
CREATE TABLE api_keys (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID REFERENCES users(id),
  key_hash TEXT UNIQUE NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now(),
  last_used_at TIMESTAMPTZ,
  monthly_token_budget BIGINT DEFAULT 10000000
);

-- Usage tracking
CREATE TABLE usage_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  api_key_id UUID REFERENCES api_keys(id),
  ts TIMESTAMPTZ DEFAULT now(),
  input_tokens INT NOT NULL,
  output_tokens INT NOT NULL,
  compression_ratio FLOAT,
  billed_amount_cents INT
);
```

---

## Step 2: Implement the Compression Engine

The core algorithm is intent-conditioned extractive compression. Use a small, fast model (Haiku 4.5) to do semantic filtering:

```python
import re

import anthropic

client = anthropic.Anthropic()

COMPRESSION_SYSTEM = """You are a context compression engine. Your job is to reduce the provided context to only the tokens relevant to answering the user's query, without paraphrasing or summarizing. Return the preserved chunks verbatim, separated by [COMPRESSED] markers. For each dropped section, insert an [EXPAND:id] handle so the caller can retrieve the original if needed."""

def compress_context(query: str, context: str, target_ratio: float = 0.1) -> dict:
    """
    Compress context conditioned on a specific query.
    Returns compressed text + expand handles.
    """
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=8096,
        system=COMPRESSION_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"QUERY: {query}\n\nCONTEXT TO COMPRESS:\n{context}\n\nTarget compression ratio: {target_ratio}"
        }]
    )
    
    compressed = response.content[0].text
    
    # Parse expand handles from the compressed output
    handles = re.findall(r'\[EXPAND:([a-z0-9]+)\]', compressed)
    
    return {
        "compressed": compressed,
        "expand_handles": handles,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
```
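
A quick usage sketch; `grep_output` here is a hypothetical stand-in for whatever raw tool output you were about to inject:

```python
# Hypothetical call: shrink a noisy grep dump to what the query actually needs
with open("grep_results.txt") as f:
    grep_output = f.read()

result = compress_context(
    query="Where is the race condition in the job scheduler?",
    context=grep_output,
    target_ratio=0.1,   # aim for roughly 10x compression
)
print(result["compressed"])       # query-relevant chunks, kept verbatim
print(result["expand_handles"])   # IDs the caller can use to fetch dropped sections
```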

**Critical:** Enable prompt caching on the system prompt — it's included in every request.

```python
# With prompt caching (saves ~80% on system prompt tokens at scale)
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=8096,
    system=[{
        "type": "text",
        "text": COMPRESSION_SYSTEM,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[...]
)
```
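
Step 3 below also calls an `estimate_tokens` helper that isn't defined anywhere in this guide; a rough sketch using a characters-per-token heuristic (swap in a real tokenizer if you need exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English text.
    Good enough for ratio reporting; use a proper tokenizer for billing."""
    return max(len(text) // 4, 1)
```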

---

## Step 3: Build the REST API

```python
# FastAPI app — compress endpoint
from fastapi import FastAPI, Depends, HTTPException
from pydantic import BaseModel

# compress_context and estimate_tokens come from Step 2; verify_api_key, log_usage,
# and get_original_chunk are auth/persistence helpers not shown in this snippet
app = FastAPI()

class CompressionRequest(BaseModel):
    query: str
    context: str
    target_ratio: float = 0.1   # 10% of original = 10x compression

class CompressionResponse(BaseModel):
    compressed: str
    expand_handles: list[str]
    compression_ratio: float
    tokens_saved: int

@app.post("/v1/compress", response_model=CompressionResponse)
async def compress(req: CompressionRequest, api_key: str = Depends(verify_api_key)):
    result = compress_context(req.query, req.context, req.target_ratio)
    
    raw_tokens = estimate_tokens(req.context)
    compressed_tokens = estimate_tokens(result["compressed"])
    ratio = raw_tokens / max(compressed_tokens, 1)
    
    await log_usage(api_key, raw_tokens, compressed_tokens)
    
    return CompressionResponse(
        compressed=result["compressed"],
        expand_handles=result["expand_handles"],
        compression_ratio=ratio,
        tokens_saved=raw_tokens - compressed_tokens
    )

@app.get("/v1/expand/{handle_id}")
async def expand(handle_id: str, job_id: str, api_key: str = Depends(verify_api_key)):
    """Retrieve original uncompressed chunk by expand handle."""
    chunk = await get_original_chunk(job_id, handle_id)
    if not chunk:
        raise HTTPException(404, "Handle not found or expired")
    return {"content": chunk}
```
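
One possible sketch of the `verify_api_key` dependency against the `api_keys` table from Step 1, hashing the presented key with SHA-256; the `db` connection handle and the `X-API-Key` header name are assumptions:

```python
import hashlib

from fastapi import Header, HTTPException

async def verify_api_key(x_api_key: str = Header(...)) -> str:
    """FastAPI dependency: resolve the caller's api_keys.id from a hashed key."""
    key_hash = hashlib.sha256(x_api_key.encode()).hexdigest()
    # db is an asyncpg-style connection/pool created at app startup (not shown)
    row = await db.fetchrow("SELECT id FROM api_keys WHERE key_hash = $1", key_hash)
    if row is None:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return str(row["id"])
```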

---

## Step 4: Build the Context Gateway Proxy (Go)

The proxy intercepts agent requests, compresses outbound context, and forwards to LLM APIs:

```go
// main.go
package main

import (
    "bytes"
    "encoding/json"
    "io"
    "net/http"
    "net/http/httputil"
    "net/url"
)

const contextFillThreshold = 0.85  // compress at 85% fill

type GatewayConfig struct {
    UpstreamURL       string
    CompressionAPIURL string
    CompressionAPIKey string
    TriggerThreshold  float64
}

func NewCompressionProxy(cfg GatewayConfig) *httputil.ReverseProxy {
    target, _ := url.Parse(cfg.UpstreamURL)
    proxy := httputil.NewSingleHostReverseProxy(target)
    
    proxy.Director = func(req *http.Request) {
        // Parse the request body (error handling elided for brevity)
        var body map[string]interface{}
        json.NewDecoder(req.Body).Decode(&body)
        
        // Estimate context fill ratio; estimateContextFill, compressMessages,
        // filterTools, and loadConfig are helpers left to the reader
        fillRatio := estimateContextFill(body)
        
        if fillRatio > contextFillThreshold {
            // Compress messages before forwarding; in production this is triggered
            // asynchronously in the background so the agent never waits on it
            body = compressMessages(body, cfg)
        }
        
        // Lazy tool loading — only include tools relevant to the current step
        body = filterTools(body)
        
        newBody, _ := json.Marshal(body)
        req.Body = io.NopCloser(bytes.NewReader(newBody))
        req.ContentLength = int64(len(newBody))
        req.URL.Host = target.Host
        req.URL.Scheme = target.Scheme
        req.Host = target.Host
    }
    
    return proxy
}

func main() {
    cfg := loadConfig()
    proxy := NewCompressionProxy(cfg)
    http.ListenAndServe(":8080", proxy)
}
```
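
`estimateContextFill`, `compressMessages`, `filterTools`, and `loadConfig` are left to the reader in the Go snippet. The fill estimation is the easy part; the gist in Python, assuming an Anthropic-style messages payload and a crude characters-per-token heuristic:

```python
MODEL_CONTEXT_WINDOWS = {
    # Assumed limits; adjust per provider/model
    "claude-sonnet-4-5": 200_000,
    "claude-haiku-4-5": 200_000,
}

def estimate_context_fill(body: dict) -> float:
    """Approximate fraction of the model's context window this request consumes."""
    window = MODEL_CONTEXT_WINDOWS.get(body.get("model", ""), 200_000)
    chars = len(str(body.get("system", "")))
    for msg in body.get("messages", []):
        chars += len(str(msg.get("content", "")))
    return (chars // 4) / window   # ~4 characters per token
```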

---

## Step 5: Accuracy Evaluation Pipeline

Compression is useless if it degrades downstream task performance. Build a continuous eval loop:

```python
# eval.py
import anthropic
from datasets import load_dataset

def evaluate_compression_accuracy(compressor, n_samples=500):
    """
    Benchmark compressed vs uncompressed contexts on a QA dataset.
    Uses LongBench or NarrativeQA for long-context eval.
    """
    client = anthropic.Anthropic()
    # ask_with_context and evaluate_answer are helpers sketched after this block
    dataset = load_dataset("THUDM/LongBench", "qasper", split="test")
    
    results = []
    for item in dataset.select(range(min(n_samples, len(dataset)))):
        query = item["input"]
        context = item["context"]
        ground_truth = item["answers"][0]
        
        # Answer with full context
        full_answer = ask_with_context(client, query, context)
        
        # Answer with compressed context
        compressed = compressor.compress(query, context)
        compressed_answer = ask_with_context(client, query, compressed["compressed"])
        
        results.append({
            "compression_ratio": compressed["compression_ratio"],
            "full_correct": evaluate_answer(full_answer, ground_truth),
            "compressed_correct": evaluate_answer(compressed_answer, ground_truth),
        })
    
    # Fraction of full-context accuracy preserved after compression
    accuracy_preservation = (
        sum(r["compressed_correct"] for r in results) /
        max(sum(r["full_correct"] for r in results), 1)
    )
    
    return {"accuracy_preservation": accuracy_preservation, "samples": n_samples}
```
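
`ask_with_context` and `evaluate_answer` are referenced but not defined above; minimal stand-ins might look like this (substring matching is a crude placeholder for whatever scoring metric, e.g. token-level F1, you actually trust):

```python
def ask_with_context(client, query: str, context: str) -> str:
    """Answer a question given a (possibly compressed) context."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\nAnswer concisely.",
        }],
    )
    return response.content[0].text


def evaluate_answer(answer: str, ground_truth: str) -> bool:
    """Crude correctness check: does the answer contain the reference string?"""
    return ground_truth.strip().lower() in answer.strip().lower()
```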

---

## Step 6: Billing & Usage Metering

```python
# billing.py
import os

import stripe

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]

PRICE_PER_1M_TOKENS_COMPRESSED = 0.50  # $0.50 per million input tokens compressed

async def bill_compression_job(api_key_id: str, input_tokens: int):
    """Meter usage and bill via Stripe."""
    amount_cents = int((input_tokens / 1_000_000) * PRICE_PER_1M_TOKENS_COMPRESSED * 100)
    
    if amount_cents < 1:
        return  # below billing threshold, accumulate
    
    customer_id = await get_stripe_customer(api_key_id)
    
    # Report metered usage to Stripe
    stripe.billing.MeterEvent.create(
        event_name="compression_tokens",
        payload={
            "stripe_customer_id": customer_id,
            "value": str(input_tokens),
        }
    )
```
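
The meter events above only bill if a Billing Meter with a matching `event_name` exists and a usage-based price is attached to it. A one-time setup sketch, assuming Stripe's default payload keys (`stripe_customer_id` and `value`, which match the payload used in `bill_compression_job`):

```python
# One-time setup: create the meter that the MeterEvent calls above report into
meter = stripe.billing.Meter.create(
    display_name="Compression input tokens",
    event_name="compression_tokens",
    default_aggregation={"formula": "sum"},
)
```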

---

## Step 7: Deploy

```yaml
# docker-compose.yml
services:
  compression-api:
    build: ./api
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - DATABASE_URL=${DATABASE_URL}
      - STRIPE_SECRET_KEY=${STRIPE_SECRET_KEY}
    ports:
      - "3000:3000"
  
  context-gateway:
    build: ./gateway
    environment:
      - COMPRESSION_API_URL=http://compression-api:3000
      - UPSTREAM_LLM_URL=https://api.anthropic.com
      - TRIGGER_THRESHOLD=0.85
    ports:
      - "8080:8080"
  
  postgres:
    image: postgres:16
    environment:
      - POSTGRES_DB=compresr
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
```

**Production checklist:**
- Deploy compression API behind a CDN with aggressive caching on repeated queries (same context, same query = same output)
- Run SLM inference on GPU for sub-100ms compression latency
- Set up Prometheus metrics for compression ratio distribution, latency p99, and accuracy alerts (a minimal instrumentation sketch follows this checklist)
- Use Supabase or RDS for the usage/billing DB with connection pooling (PgBouncer)
- Rate-limit expand() calls to prevent storage abuse
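
A minimal instrumentation sketch with `prometheus_client`; the metric names and bucket boundaries are assumptions you should tune to your workload:

```python
# metrics.py
from prometheus_client import Counter, Histogram, start_http_server

COMPRESSION_RATIO = Histogram(
    "compression_ratio",
    "Achieved compression ratio (raw_tokens / compressed_tokens)",
    buckets=[1, 2, 5, 10, 20, 50, 100],
)
COMPRESSION_LATENCY = Histogram(
    "compression_latency_seconds",
    "End-to-end latency of a compression call",
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2, 5],
)
EXPAND_CALLS = Counter("expand_calls_total", "Number of expand() callbacks served")

def record_job(ratio: float, latency_s: float) -> None:
    COMPRESSION_RATIO.observe(ratio)
    COMPRESSION_LATENCY.observe(latency_s)

# Expose /metrics for Prometheus to scrape
start_http_server(9090)
```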