Claude's Corner: CellType — Teaching LLMs to Speak Biology

CellType is the two-person YC W2026 company building an agentic drug discovery platform on top of a 27B biological foundation model. Their Cell2Sentence technique translates single-cell gene expression into sequences LLMs can learn from — and they've already validated a cancer immunotherapy prediction in living cells. Here's how they built it, why it's hard to replicate, and a step-by-step guide to building a clone.


TL;DR

CellType is the agentic drug company using a 27B biological foundation model to simulate human cellular responses to drugs — replacing mice with math. Their Cell2Sentence technique converts gene expression profiles into text that LLMs can learn from, and they've already validated a cancer immunotherapy prediction in living cells.

Build difficulty: 6.8 (C)

Drug discovery is one of the most expensive failures in modern science. A drug that works in mice fails in humans 90% of the time. The industry spends a decade and a billion dollars finding out. And yet, for fifty years, we kept doing the same thing — because there was no better model of "human" to test against.

CellType thinks that era is over.

The two-person YC W2026 company from New Haven and New York has built what they call "the agentic drug company" — a platform where AI agents run the full drug discovery pipeline on top of biological foundation models that simulate human biology at the cellular level. Instead of mice, they simulate patients. Instead of wet-lab trial-and-error, they run computational screens across thousands of drug candidates in days.

The kicker: they've already validated it. A model prediction about cancer immunotherapy — identifying a compound that could make "cold tumors" visible to the immune system — was subsequently confirmed in living cells, increasing antigen presentation by roughly 50%. That's not a demo. That's a result.

What They Do

CellType's product is a B2B platform for pharmaceutical companies. Pharma brings their drug candidates, disease areas, and biological questions. CellType runs AI-driven discovery workflows to prioritize which molecules are worth taking into expensive wet-lab experiments or clinical trials.

The target customer is a Top 10 pharma company that spends north of $1B per approved drug and has a preclinical attrition rate that keeps their CFO awake. According to the founders, all current pharma deals came inbound — a strong signal they've hit something real.

Revenue is almost certainly service-based for now: pharma companies pay for discovery runs, hypothesis validation, and platform access. The long-term play is more interesting — if you are the virtual human that all pharma companies query before they run a single animal study, you're upstream of every drug on the planet.

They've signed a strategic MOU with Senhwa Biosciences (March 2026) to integrate their AI platform into the clinical development of CX-4945 (Silmitasertib), a lead cancer compound. That's not a tech partnership — that's a drug company betting their lead asset on CellType's predictions.


How It Works

The core insight is elegant and surprisingly simple in its framing: cells are already speaking a language. Every cell in your body expresses genes at different levels, and the pattern of which genes are active — and by how much — encodes everything about what that cell is doing, what disease state it's in, and how it will respond to a drug.

The problem is that pattern lives in a high-dimensional numerical space that neural networks struggle with and biologists can barely interpret. CellType's founders, David van Dijk and Ivan Vrkic, asked a different question: what if you just translated it into English?

Cell2Sentence

Cell2Sentence (published at ICML 2024) is that translation layer. The technique is almost brutally simple: take a single-cell RNA sequencing (scRNA-seq) profile — a vector of tens of thousands of gene expression measurements for one cell — and rank the genes by expression level in descending order. The resulting sequence of gene names, separated by spaces, is a "cell sentence."

GENE_A GENE_B GENE_C GENE_D ... GENE_Z

That's it. A cell becomes a sentence. Thousands of cells become a corpus. And now you can fine-tune any large language model on it using standard next-token prediction.
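To make the conversion concrete, here's a toy sketch with invented gene names and expression values (real profiles have tens of thousands of genes, and published Cell2Sentence keeps the top-ranked subset):

```python
# Toy example: rank genes by expression level, descending, to form a "cell sentence".
expression = {"CD3D": 4.1, "GAPDH": 9.2, "IL2RA": 0.7, "FOXP3": 2.5}
sentence = " ".join(g for g, _ in sorted(expression.items(), key=lambda kv: -kv[1]))
print(sentence)  # GAPDH CD3D FOXP3 IL2RA
```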

The model learns that certain gene co-expression patterns always appear together in cancer cells. It learns that certain drug perturbations shift a cell's gene sentence in predictable ways. It learns the "grammar" of cellular biology from data, not from hand-crafted biological rules.

The Foundation Model

CellType's production model is a 27-billion parameter model built on Google's Gemma architecture, developed in collaboration with Google DeepMind. At 27B parameters, this isn't a fine-tuned toy — it's a serious foundation model trained on billions of cell sentences across diverse cell types, tissues, disease states, and perturbation conditions.

The scale matters. Smaller models can do cell type annotation. At 27B, trained on the right data, you can start asking questions like "how will this specific cell type in a tumor microenvironment respond to this specific drug at this dose?" — and get a biologically grounded answer.

Google CEO Sundar Pichai highlighted the work, which reached 7 million views. The Yale/Google blog post about C2S-Scale (the 27B model) describes it generating a novel hypothesis about cancer immunotherapy, which was then experimentally confirmed. In living cells.

The Agentic Layer

On top of the foundation model sits an agentic orchestration layer. Drug discovery isn't one query — it's a pipeline. You start with a disease hypothesis, screen thousands of compounds, prioritize lead candidates, model mechanism of action, predict off-target effects, simulate patient-relevant biology, and flag what's worth taking into expensive in vitro validation.

CellType's agents chain these steps together, using the foundation model as the reasoning core at each stage. A human scientist sets the goal; the agents run the pipeline. This is not "AI-assisted drug discovery" where a researcher still does 95% of the work. It's closer to an autonomous research associate that runs weeks of computational experiments overnight.

The Team

David van Dijk is a Yale professor with 11,000+ citations and publications in Cell, Nature, NeurIPS, and ICML. He turned down Google to start CellType. Ivan Vrkic co-developed Cell2Sentence at Yale, previously led foundation model training at a biotech, and — in what is possibly the most unusual line on any founder's CV — wrote software that helped control CERN's Large Hadron Collider. Two people. This is the whole company.

Difficulty Score

| Dimension | Score | Why |
| --- | --- | --- |
| ML / AI | 9/10 | 27B biological foundation model, novel tokenization of scRNA-seq data, validated cancer prediction |
| Data | 9/10 | Single-cell RNA-seq datasets are expensive and proprietary; curating a training corpus at scale requires wet-lab partnerships |
| Backend | 6/10 | Agentic pipeline orchestration, inference serving, experiment tracking — hard but tractable |
| Frontend | 3/10 | A pharma researcher dashboard; not the hard part |
| DevOps | 7/10 | Serving 27B inference at reasonable latency, GPU cluster management, handling long-running agentic jobs |

The Moat

What's hard to replicate: The Cell2Sentence methodology is published. The idea is in the open. What you cannot replicate easily is the trained 27B model — which required enormous compute and carefully curated training data. You also cannot replicate the validated experimental result. That cancer finding is CellType's best sales asset, and getting a comparable validation in living cells requires a wet-lab, biological expertise, and time.

The pharma relationships are equally hard to fake. When a Top 10 pharma company integrates your platform into their discovery pipeline, they're betting their internal drug programs on your model's quality. That trust was earned through results. No clone starts with results.

David van Dijk spent years building Cell2Sentence at Yale. The research credibility (11k citations, ICML paper, Google collaboration, published in Cell and Nature) is the trust signal that gets CellType into rooms where two-person startups normally don't go. A competitor without that track record would spend years proving what CellType has already proven.

What's easy to replicate: The agentic orchestration layer. The frontend. The concept of "use an LLM on biological data." Several well-funded startups (Recursion, Genentech, Insitro) are doing adjacent things with larger teams and more capital. But none of them trained a 27B model on Cell2Sentence representations, and none of them have that specific validated cancer result yet.

The real question is whether CellType can raise enough money, fast enough, to stay ahead of better-capitalized competitors who will read the ICML paper and try to reproduce it. Two people cannot fight that battle forever.

Replicability Score: 72 / 100

The Cell2Sentence paper is on arXiv. The GitHub repo is public. A competent ML team with deep pockets could, in theory, reproduce the approach. The methodology is not secret.

What scores this a 72 rather than a 40: the training compute cost for a 27B biological model is in the millions of dollars. The proprietary scRNA-seq training data took years to curate. The experimental validation in living cells requires a wet-lab setup most ML startups don't have. The pharma relationships took David van Dijk's entire academic career to build.

You could clone the idea. You cannot clone the head start. Not without serious capital, deep biological domain expertise, a wet-lab partner, and years of time. In a slow-moving industry like pharma, that head start compounds.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build Guide: Cloning CellType with Claude Code

**Goal:** Build a minimal agentic drug discovery platform that uses LLMs to predict drug responses from single-cell gene expression data.

**Estimated time:** 4-6 weeks solo with Claude Code assistance
**Cost to get running:** ~$2,000-$10,000 (GPU compute for training, data access)

---

## Step 1: Data Pipeline — Acquire and Process scRNA-seq Data

**What you need:** Single-cell RNA sequencing datasets with drug perturbation labels.

- Download the **LINCS L1000** dataset (gene expression after drug perturbation across 978 landmark genes) from lincsproject.org — free, ~100GB
- Optionally supplement with **CELLxGENE** (cellxgene.cziscience.com) for diverse cell types
- Write a Cell2Sentence tokenizer:

```python
def cell_to_sentence(gene_expression_vector: dict[str, float], top_n: int = 200) -> str:
    """Convert a gene expression profile to a ranked gene sentence."""
    ranked = sorted(gene_expression_vector.items(), key=lambda x: x[1], reverse=True)
    return " ".join(gene for gene, _ in ranked[:top_n])
```

- Output format: `cell_id | cell_type | drug | dose | cell_sentence`
- Target corpus: 10M+ cell sentences minimum for meaningful training signal

**DB schema (PostgreSQL):**
```sql
CREATE TABLE cells (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  cell_type TEXT NOT NULL,
  tissue TEXT,
  disease_state TEXT,
  cell_sentence TEXT NOT NULL,  -- ranked gene names
  metadata JSONB
);

CREATE TABLE drug_perturbations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  cell_id UUID REFERENCES cells(id),
  drug_name TEXT NOT NULL,
  smiles TEXT,                    -- molecular structure
  dose_nm FLOAT,
  perturbed_sentence TEXT NOT NULL,  -- cell sentence post-treatment
  delta_genes JSONB               -- genes that shifted significantly
);
```
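The `delta_genes` JSONB column above implies a diff step in the ETL. A minimal sketch, assuming pre- and post-treatment expression values are in comparable log units; the 1.0 threshold is an arbitrary assumption you'd tune per dataset:

```python
def delta_genes(pre: dict[str, float], post: dict[str, float],
                threshold: float = 1.0) -> dict[str, float]:
    """Return genes whose expression shifted by more than `threshold` after treatment.

    Illustrative helper for populating the delta_genes column; absent genes
    are treated as zero expression.
    """
    genes = set(pre) | set(post)
    deltas = {g: post.get(g, 0.0) - pre.get(g, 0.0) for g in genes}
    return {g: d for g, d in deltas.items() if abs(d) > threshold}
```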

---

## Step 2: Fine-Tune an LLM on Cell Sentences

**Start small:** Use Gemma 7B (not 27B) to validate the approach before spending on compute.

- Use HuggingFace `transformers` + `trl` library for SFT (Supervised Fine-Tuning)
- Training objective: next-token prediction on cell sentences (standard causal LM)
- Key training tasks to include:
  1. **Cell type prediction**: Given cell sentence → predict cell type label
  2. **Drug response prediction**: Given cell sentence + drug → predict perturbed sentence
  3. **Q&A pairs**: "Is this cell cancerous?" / "What drug would activate antigen presentation?"
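One way to materialize the three task types as plain-text training examples; the prompt templates and record field names here are illustrative assumptions, not CellType's actual format:

```python
def make_training_examples(rec: dict) -> list[str]:
    """Format one perturbation record into the three training tasks listed above."""
    return [
        # Task 1: cell type prediction
        f"Cell: {rec['sentence']}\nCell type: {rec['cell_type']}",
        # Task 2: drug response prediction (pre-treatment sentence + drug -> perturbed sentence)
        f"Cell: {rec['sentence']}\nDrug: {rec['drug']}\nPerturbed cell: {rec['perturbed_sentence']}",
        # Task 3: Q&A pair derived from the same record
        f"Cell: {rec['perturbed_sentence']}\nQuestion: What drug produced this shift?\nAnswer: {rec['drug']}",
    ]
```

Mixing task types in one corpus is what lets a single causal LM answer free-form questions later instead of only completing sentences.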

```python
# Fine-tuning config (QLoRA for memory efficiency)
from trl import SFTTrainer
from peft import LoraConfig

lora_config = LoraConfig(r=64, lora_alpha=128, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
trainer = SFTTrainer(
    model="google/gemma-7b",
    train_dataset=cell_sentence_dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```

- **Compute budget:** 7B model × 10M samples ≈ 4× A100 × ~48h ≈ ~$800 on Lambda Labs
- Checkpoint frequently; evaluate on held-out drug perturbations

---

## Step 3: Build the Drug Response Prediction API

Wrap the fine-tuned model in a FastAPI service that pharma researchers can query.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    cell_sentence: str      # pre-treatment cell state
    drug_smiles: str        # molecular structure
    question: str           # "Will this enhance antigen presentation?"

@app.post("/predict")
async def predict_drug_response(req: PredictionRequest):
    prompt = f"""Cell state: {req.cell_sentence}
Drug: {req.drug_smiles}
Question: {req.question}
Answer:"""
    response = model.generate(prompt, max_new_tokens=200)  # fine-tuned model from Step 2
    # extract_confidence: your own parser for a self-reported confidence in the generation
    return {"prediction": response, "confidence": extract_confidence(response)}
```

- Add a `/screen` endpoint that accepts a list of drug SMILES and a target cell state, returns ranked candidates
- Cache common cell state + drug combinations in Redis
- Rate-limit by API key; each pharma customer gets a key
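The ranking logic behind `/screen` can live in a plain helper that the endpoint wraps. A sketch, where `score_fn` stands in for a call to the fine-tuned model (any real scoring call is an assumption here):

```python
def screen(cell_sentence: str, candidates: list[str], score_fn,
           top_k: int = 20) -> list[tuple[str, float]]:
    """Score each candidate SMILES against a target cell state and return
    the top_k highest-scoring (smiles, score) pairs, best first."""
    scored = sorted(
        ((smiles, score_fn(cell_sentence, smiles)) for smiles in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:top_k]
```

Keeping the helper free of FastAPI makes it trivial to unit-test and to reuse inside the Step 4 pipeline nodes.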

---

## Step 4: Build the Agentic Discovery Pipeline

Chain multiple prediction steps into an autonomous workflow using LangGraph.

```python
from langgraph.graph import StateGraph

def build_discovery_pipeline():
    graph = StateGraph(DiscoveryState)
    
    # Node 1: Load disease cell state from DB
    graph.add_node("load_disease_state", load_patient_cell_profile)
    
    # Node 2: Screen compound library (top 1000 candidates by embedding similarity)
    graph.add_node("screen_compounds", screen_compound_library)
    
    # Node 3: Predict top-20 responses in detail
    graph.add_node("predict_responses", predict_drug_responses)
    
    # Node 4: Filter by toxicity / off-target predictions
    graph.add_node("filter_toxicity", run_toxicity_screen)
    
    # Node 5: Generate hypothesis report
    graph.add_node("generate_report", generate_scientific_report)
    
    graph.set_entry_point("load_disease_state")
    graph.add_edge("load_disease_state", "screen_compounds")
    graph.add_edge("screen_compounds", "predict_responses")
    graph.add_edge("predict_responses", "filter_toxicity")
    graph.add_edge("filter_toxicity", "generate_report")
    
    return graph.compile()
```

- Run pipelines as background jobs (Celery + Redis); pharma teams submit jobs and get email/webhook when done
- Store all pipeline runs, inputs, and outputs for reproducibility

---

## Step 5: Build the Pharma Researcher Dashboard (Next.js)

Simple but useful — researchers need to submit jobs, browse results, and export reports.

**Key pages:**
1. **Job submission** — upload a target cell profile (CSV scRNA-seq), select a compound library, set the question
2. **Results viewer** — ranked compound list, per-compound predicted response, confidence scores, downloadable report
3. **Compound explorer** — search your compound library, view molecular structure (using RDKit), see historical predictions

```typescript
// Example: Job submission form
interface DiscoveryJob {
  diseaseState: File;      // scRNA-seq CSV
  compoundLibrary: string; // "FDA-approved" | "custom" | user upload
  targetQuestion: string;  // free text
  notifyEmail: string;
}
```

- Auth: NextAuth with Google OAuth (pharma IT will insist on SSO — add SAML next)
- Deploy on Vercel; keep the UI fast, pharma researchers are not patient

---

## Step 6: Database Schema and Experiment Tracking

Track every prediction for reproducibility and model improvement.

```sql
-- Core tables
CREATE TABLE experiments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  org_id UUID NOT NULL,
  disease_state JSONB NOT NULL,
  compound_library TEXT NOT NULL,
  target_question TEXT NOT NULL,
  status TEXT DEFAULT 'queued',  -- queued | running | complete | failed
  submitted_at TIMESTAMPTZ DEFAULT now(),
  completed_at TIMESTAMPTZ,
  result_summary JSONB
);

CREATE TABLE compound_predictions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  experiment_id UUID REFERENCES experiments(id),
  compound_smiles TEXT NOT NULL,
  compound_name TEXT,
  predicted_response TEXT NOT NULL,
  confidence_score FLOAT,
  rank_in_experiment INT,
  model_version TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

-- For learning: track which predictions got wet-lab validation
CREATE TABLE wet_lab_validations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  compound_prediction_id UUID REFERENCES compound_predictions(id),
  validated_by TEXT,
  outcome TEXT,  -- 'confirmed' | 'refuted' | 'partial'
  effect_size FLOAT,
  notes TEXT,
  validated_at TIMESTAMPTZ
);
```

- Feed confirmed validations back into the training pipeline to continuously improve the model
- This is the flywheel: more predictions → more validations → better model → more pharma customers
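The flywheel's first hop can be sketched as a function that turns confirmed wet-lab rows into new fine-tuning text. Field names follow the schema above; the example text format is an assumption:

```python
def validations_to_examples(rows: list[dict]) -> list[str]:
    """Convert wet-lab-confirmed predictions into plain-text training examples
    for the next fine-tuning round; refuted/partial rows are skipped here."""
    examples = []
    for row in rows:
        if row["outcome"] != "confirmed":
            continue
        examples.append(
            f"Drug: {row['compound_smiles']}\n"
            f"Predicted response: {row['predicted_response']}\n"
            f"Validated outcome: confirmed, effect size {row['effect_size']:+.2f}"
        )
    return examples
```

Refuted rows are worth keeping too (as negative or contrastive examples); they're skipped above only to keep the sketch short.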

---

## Step 7: Deployment — GPU Infrastructure and Scale

Serving a 7B+ model in production requires attention to cost and latency.

**Inference serving:**
- Use **Modal** (modal.com) for on-demand GPU inference — you pay per second, cold starts in ~30s
- For high-traffic: deploy on A10G instances behind a load balancer
- Model quantization: INT8 via bitsandbytes cuts GPU memory in half with <2% accuracy loss
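The idea behind 8-bit loading can be shown with a toy absmax quantizer: store weights as int8 plus one float scale per tensor, trading a small reconstruction error for half the memory (this illustrates the principle only; bitsandbytes' actual scheme is per-block and more sophisticated):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Absmax quantization: map floats into [-127, 127] with a single scale."""
    scale = float(np.abs(w).max()) / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.02, -0.5, 0.31], dtype=np.float32)
q, s = quantize_int8(w)
err = float(np.abs(dequantize(q, s) - w).max())  # small reconstruction error
```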

```python
import modal

app = modal.App("celltype-inference")

@app.function(gpu="A10G", memory=40960)
def predict(cell_sentence: str, drug: str, question: str) -> str:
    model = load_model_from_volume()  # cached on Modal Volume
    return run_inference(model, cell_sentence, drug, question)
```

**Monitoring:**
- Log every inference call with model version, latency, and output length
- Track prediction confidence distributions over time — drift means your biology data has shifted
- Set up alerts if pipeline jobs fail > 5% of the time

**Cost at scale:**
- 1,000 pharma predictions/day on A10G ≈ ~$50/day inference cost
- Justifiable at $500-$5,000/experiment pharma pricing

**The wet-lab validation loop:**
- Partner with a CRO (contract research organization) or academic lab early
- Even one validated prediction per quarter is worth more than 1,000 unvalidated ones
- Document the validation pipeline rigorously — pharma due diligence will ask for it