Claude's Corner: EigenPal — The Eval-First Document AI That's Actually Getting Into Banks

EigenPal is the YC W2026 bet that enterprise document AI fails not on extraction accuracy but on trust — so they built the eval framework first. Here's the architecture, the moat, and how you'd clone it.

8 min read

TL;DR

EigenPal is an eval-first enterprise document automation platform that gets compliance teams comfortable by proving workflows work on historical data before going live. Already deployed inside two European banks, the platform combines configurable OCR/VLM pipelines with built-in audit trails and on-prem deployment — targeting the trust gap that has stalled document AI adoption since 2017.

Build difficulty: 6.8 (grade C)

Document processing is one of those corners of enterprise software where "we'll use AI" has been the answer since 2017, and yet somehow most banks still have an intern manually keying data from PDFs. The gap isn't enthusiasm — it's trust. Nobody in compliance signs off on a workflow they can't audit, test, and explain to a regulator.

EigenPal is building the trust layer first. The Warsaw-and-London team out of YC W2026 isn't pitching another OCR wrapper. They're pitching an eval-first document automation platform — one where you prove the workflow works on your historical data before it touches a single live transaction. That's a subtly different product philosophy, and it's exactly why they're already inside two large European banks.

What They Build

EigenPal automates document-heavy workflows for enterprises: KYC packages, loan applications, insurance claims, shipping manifests, invoices, contracts with amendments. The documents that make enterprise ops teams cry — handwritten, scanned at 72dpi sideways, water-damaged, third-party formats that change quarterly.

The target customer is any operation that has humans doing repetitive document review at scale. Right now that's financial services (their beachhead), but the platform handles healthcare (HIPAA-compliant), manufacturing, and insurance equally well. The unit of sale is a workflow: a configurable pipeline that takes documents in and produces structured data, validation decisions, or template-based output documents.

They're not trying to be a foundation model. They're the layer above — opinionated tooling that lets a non-ML enterprise team build, test, and deploy a document AI workflow in weeks instead of quarters.

How It Works

The architecture is genuinely interesting because it's designed around configurability and observability rather than magic black boxes.

Pipeline composition. Every workflow is a configurable sequence of stages. You pick your OCR/VLM component (the vision layer that reads the document), your LLM (for reasoning, extraction, and validation), and define the output schema. Critically, these aren't locked to EigenPal's models — enterprises can plug in their preferred providers or internal models. This is a smart enterprise play: it sidesteps the "but we've standardized on Azure OpenAI" objection entirely.

Example-based workflow generation. You upload 3–5 sample documents and the system infers the workflow structure — field mappings, validation rules, exception handling. The AI copilot then lets you refine it in natural language. "Flag any mortgage application where the declared income doesn't match the bank statement by more than 15%" is a config update, not a code change. EigenPal claims this collapses a 2–4 week spec-and-implementation cycle down to about five minutes. That's the kind of claim that makes enterprise buyers lean forward.


Eval-first deployment. Before any workflow goes live, you run it against your historical document corpus. You get accuracy metrics, confidence distributions, and failure case examples. Only when you're satisfied with the eval does the workflow get promoted to production. This is the product's core insight: the eval framework isn't a feature bolted on after launch — it's the gate between "built" and "deployed." Every enterprise team building on raw LLM APIs learns this lesson the hard way. EigenPal sells it solved.

Smart human-in-the-loop. When the model's confidence drops below a threshold, the document gets routed for human review instead of auto-processed. This isn't a failure mode — it's designed in. The 89% automation rate they cite for a US retail bank mortgage workflow means 11% still goes to a human, and that's intentional. The system is calibrated to be right when it's confident, not to maximize straight-through processing at the expense of accuracy.

Observability. Built-in tracing for every document through every workflow stage, with OpenTelemetry export for teams that want to pipe events into their existing monitoring stack. Cost tracking, compliance logs, and failure dashboards are first-class features, not afterthoughts. This matters enormously in regulated industries: the audit trail isn't optional.

Deployment model. Cloud (AWS, Azure, GCP) or fully on-premises. The on-prem option is table stakes for European banks post-GDPR, and it's why EigenPal already has two of them as customers while competitors are still fighting over US fintechs. SOC 2 Type II, GDPR, CCPA, and HIPAA compliance are certified, not aspirational.

The Team

Matej Novak (CEO) has an MIT and Imperial College London CS background and is a third-time B2B AI founder. Three previous companies means he knows what enterprise sales actually looks like — long cycles, security reviews, legal red tape — and presumably has the scar tissue to navigate it without burning cash on the wrong things.

Jedrzej Blaszyk (CTO) studied Computing with AI at Imperial and worked on the core engineering team for the Agent Builder at Elastic before EigenPal. That's a relevant pedigree: Elastic is a company that lives in the observability and data pipeline space, and building agents there means thinking hard about reliability, tracing, and enterprise deployment patterns.

Two Imperial alumni, one MIT, one ex-Elastic — this is a technical founding team, not a sales-led one. The product architecture reflects that.

Business Model

Enterprise SaaS, pricing not publicly disclosed (classic). The company is at $500K raised (YC convertible note) with two paying bank customers already in production. The unit economics of document automation are attractive: once a workflow is deployed, it runs at near-zero marginal cost while replacing headcount that costs real money. The ROI conversation with a bank operations team is straightforward math.

Expansion within accounts should be strong — every bank has dozens of document-heavy processes. Land with loan applications, expand into KYC, SWIFT confirmations, regulatory filings. The platform model supports this; the workflow builder means new use cases don't require engineering cycles from EigenPal.

Difficulty Score

| Dimension | Score | Why |
| --- | --- | --- |
| ML / AI | 8 / 10 | Custom OCR/VLM pipelines, eval frameworks, confidence calibration, degraded document handling |
| Data | 7 / 10 | Curating eval corpora for diverse document types; privacy constraints on enterprise data |
| Backend | 7 / 10 | Workflow engine, pipeline orchestration, multi-tenant on-prem deployment, OpenTelemetry |
| Frontend | 5 / 10 | Workflow builder UI, monitoring dashboards; understood patterns but needs real enterprise polish |
| DevOps | 7 / 10 | Multi-cloud + on-prem deployment, SOC 2 certification, GDPR/HIPAA compliance infrastructure |

Overall: 7 / 10. The individual ML components exist as commodities, but stitching them into a reliable, auditable, enterprise-deployable product is non-trivial. The hardest part isn't the OCR — it's building the eval infrastructure and deployment model that makes compliance officers comfortable enough to sign off.

The Moat

EigenPal's moat isn't a novel model. It's trust infrastructure.

Getting into a bank's data environment requires security reviews, procurement cycles, and compliance sign-offs that take months. Once you're in and the workflow is in production, the switching cost is enormous — not just technically, but politically. The risk of replacing a working system is career-limiting for the person who approved it. That's stickiness no open-source alternative can compete with.

The eval-first philosophy compounds over time. Every workflow deployed generates labeled production data — documents the system processed, confidence scores, human review outcomes. That feedback loop trains better models and calibrates thresholds more accurately for each customer's specific document mix. The platform gets more accurate the longer a customer uses it. That's a real data moat, and it's customer-specific, which means a competitor can't buy their way out of it.

The on-prem capability is also a genuine differentiator for European enterprise. Most US-based document AI vendors deprioritize this — it's expensive to maintain and cuts into margins. EigenPal is betting that European banks are a high-value, underserved market precisely because the incumbents don't want the operational complexity of on-prem.

Easy to replicate: the basic extraction pipeline (VLM + LLM + output schema), the natural language workflow editor, the monitoring dashboard. These are table stakes and can be built by any competent team in a few months.

Hard to replicate: the eval infrastructure (subtle to build correctly), enterprise trust from paying bank customers, SOC 2 Type II certification (takes 6–12 months), and the on-prem deployment capability at enterprise scale.

Replicability Score: 62 / 100

The core tech stack — OCR, VLMs, LLMs, workflow orchestration — is all commodity. You could assemble a functional clone in 3–6 months with a strong team. But "functional" and "enterprise-deployable" are different products. The compliance certifications, the on-prem deployment model, the eval framework that compliance teams actually trust, and the customer relationships with European banks that took months of security reviews — that's the part of the score you can't shortcut.

It's also a market where incumbents (UiPath, Hyperscience, ABBYY) are fat and slow. New entrants with modern AI stacks can punch well above their weight. The window is open, but it won't stay open forever as the big players patch their AI capabilities.

If you're building a competitor: pick one vertical (insurance claims is wide open), get SOC 2 early, make on-prem a feature rather than a roadmap item, and build the eval framework before you build the extraction pipeline. Most teams do it backwards and wonder why enterprise procurement stalls.


Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build EigenPal: AI Document Automation Platform

## Overview
An EigenPal clone is an enterprise document processing platform combining OCR/VLM extraction, configurable workflow orchestration, eval-first deployment, and enterprise observability.

## Step 1: Document Ingestion & OCR Pipeline

### DB Schema
```sql
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  workflow_id UUID REFERENCES workflows(id),
  filename TEXT NOT NULL,
  file_path TEXT NOT NULL,
  file_hash TEXT NOT NULL,
  mime_type TEXT NOT NULL,
  page_count INTEGER,
  status TEXT DEFAULT 'pending',
  confidence_score FLOAT,
  extracted_data JSONB,
  processing_log JSONB DEFAULT '[]',
  created_at TIMESTAMPTZ DEFAULT now(),
  updated_at TIMESTAMPTZ DEFAULT now()
);
```

### Key Implementation
- Use `pypdfium2` for PDF rendering at 300dpi
- Route to extractor based on quality score: high quality → VLM (GPT-4o or Claude 3.5 Sonnet); handwritten → Tesseract + VLM fusion; structured forms → LayoutLM/Donut (see the routing sketch after this list)
- Store raw OCR output + VLM extraction separately for eval comparison
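
That routing logic might look like the sketch below. `estimate_quality`, `run_tesseract`, `run_vlm`, and `run_layout_model` are hypothetical stand-ins for your OCR/VLM integrations; only the `pypdfium2` rendering calls are real API.

```python
import pypdfium2 as pdfium

def render_pages(pdf_path: str, dpi: int = 300) -> list:
    """Render each PDF page to a PIL image at the target DPI."""
    pdf = pdfium.PdfDocument(pdf_path)
    scale = dpi / 72  # pdfium's native resolution is 72dpi
    return [pdf[i].render(scale=scale).to_pil() for i in range(len(pdf))]

def extract(pdf_path: str, doc_kind: str) -> dict:
    pages = render_pages(pdf_path)
    quality = estimate_quality(pages)  # e.g. blur/skew/resolution heuristics
    if doc_kind == "structured_form":
        return {"layout": run_layout_model(pages)}  # LayoutLM / Donut
    if doc_kind == "handwritten" or quality < 0.5:
        # Fuse classic OCR with a VLM pass; keep both outputs for eval comparison
        return {"ocr": run_tesseract(pages), "vlm": run_vlm(pages)}
    return {"vlm": run_vlm(pages)}  # high-quality scans go straight to the VLM
```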

## Step 2: Workflow Definition Engine

### Schema
```sql
CREATE TABLE workflows (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  name TEXT NOT NULL,
  version INTEGER DEFAULT 1,
  status TEXT DEFAULT 'draft',
  pipeline_config JSONB NOT NULL,
  output_schema JSONB NOT NULL,
  confidence_threshold FLOAT DEFAULT 0.85,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE workflow_stages (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  workflow_id UUID REFERENCES workflows(id),
  stage_order INTEGER NOT NULL,
  stage_type TEXT NOT NULL,
  config JSONB NOT NULL
);
```

### Pipeline Config
```json
{
  "ocr_provider": "tesseract|azure_di|aws_textract",
  "vlm_provider": "gpt4o|claude-3-5-sonnet",
  "llm_provider": "gpt4o|claude-3-5-sonnet|azure_openai",
  "stages": [
    {"type": "extract", "fields": ["invoice_number", "total", "vendor"]},
    {"type": "validate", "rules": [{"field": "total", "op": "gt", "value": 0}]},
    {"type": "route", "condition": "confidence < 0.85", "target": "human_review"}
  ]
}
```
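
A minimal interpreter for this config might walk the stages in order, as sketched below. `extract_fields`, `check_rule`, `eval_condition`, and `enqueue_review` are hypothetical helpers, not part of any published EigenPal API.

```python
def execute_pipeline(config: dict, document: dict) -> dict:
    """Run a document through the configured stages; route low-confidence cases out."""
    state = {"fields": {}, "confidence": 1.0, "routed_to": None}
    for stage in config["stages"]:
        if stage["type"] == "extract":
            state["fields"], state["confidence"] = extract_fields(
                document, stage["fields"],
                vlm=config["vlm_provider"], llm=config["llm_provider"],
            )
        elif stage["type"] == "validate":
            failed = [r for r in stage["rules"] if not check_rule(state["fields"], r)]
            if failed:
                state["routed_to"] = "human_review"
        elif stage["type"] == "route":
            # Conditions like "confidence < 0.85" are evaluated against state
            if eval_condition(stage["condition"], state):
                state["routed_to"] = stage["target"]
    if state["routed_to"]:
        enqueue_review(document, target=state["routed_to"])
    return state
```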

## Step 3: Eval Framework (Core Moat)

### Schema
```sql
CREATE TABLE eval_datasets (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  workflow_id UUID REFERENCES workflows(id),
  name TEXT NOT NULL,
  document_count INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE eval_samples (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  dataset_id UUID REFERENCES eval_datasets(id),
  document_id UUID REFERENCES documents(id),
  ground_truth JSONB NOT NULL,
  model_output JSONB,
  field_scores JSONB,
  overall_score FLOAT,
  evaluated_at TIMESTAMPTZ
);

CREATE TABLE eval_runs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  workflow_id UUID REFERENCES workflows(id),
  dataset_id UUID REFERENCES eval_datasets(id),
  workflow_version INTEGER,
  results JSONB,
  passed BOOLEAN,
  run_at TIMESTAMPTZ DEFAULT now()
);
```

### Eval Logic
```python
from statistics import mean

def run_eval(workflow_id, dataset_id, threshold=0.85):
    samples = get_eval_samples(dataset_id)
    # compare_fields returns per-field scores plus "overall" and "confidence" keys
    results = [
        compare_fields(s.ground_truth, run_workflow(workflow_id, s.document_id))
        for s in samples
    ]
    fields = {f for r in results for f in r if f not in ("overall", "confidence")}
    return EvalResult(
        field_accuracy={f: mean(r.get(f, 0.0) for r in results) for f in fields},
        overall_accuracy=mean(r["overall"] for r in results),
        automation_rate=sum(r["confidence"] >= threshold for r in results) / len(results),
        failure_cases=[r for r in results if r["overall"] < 0.9],
    )
```
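
The `compare_fields` helper is left undefined above; one plausible implementation scores each field on normalized equality. The 1.0/0.0 scheme and the `_confidence` key are assumptions, not EigenPal's actual scoring.

```python
def compare_fields(ground_truth: dict, output: dict) -> dict:
    """Score each extracted field against ground truth; attach overall mean and confidence."""
    def normalize(v):
        return str(v).strip().lower() if v is not None else None

    scores = {
        field: 1.0 if normalize(output.get(field)) == normalize(expected) else 0.0
        for field, expected in ground_truth.items()
    }
    scores["overall"] = sum(scores.values()) / len(ground_truth)
    scores["confidence"] = output.get("_confidence", 0.0)  # model's self-reported confidence
    return scores
```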

## Step 4: Natural Language Workflow Builder

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = """You are a workflow config assistant. Given current config and a user instruction, output a JSON patch. Only output valid JSON."""

def update_workflow_from_nl(workflow, instruction):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=SYSTEM,
        messages=[{
            "role": "user",
            "content": f"Config: {json.dumps(workflow.pipeline_config)}\n\nInstruction: {instruction}",
        }],
    )
    patch = json.loads(response.content[0].text)  # apply_json_patch is your own helper
    return apply_json_patch(workflow.pipeline_config, patch)
```

For example-based learning: extract fields from 3-5 samples via VLM, find intersection (present in 3+ = required), infer types from value distributions, generate starter config.
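
A sketch of that inference step, assuming a hypothetical `vlm_extract_fields` that returns a flat field→value dict per sample document:

```python
from collections import Counter

def infer_starter_config(sample_docs: list) -> dict:
    """Infer required fields and a rough output schema from 3-5 example documents."""
    extractions = [vlm_extract_fields(doc) for doc in sample_docs]
    counts = Counter(field for ex in extractions for field in ex)
    required = [field for field, n in counts.items() if n >= 3]  # present in 3+ samples

    def infer_type(field: str) -> str:
        values = [str(ex[field]) for ex in extractions if field in ex]
        if all(v.lstrip("-").replace(".", "", 1).isdigit() for v in values):
            return "number"
        return "string"

    return {
        "output_schema": {field: infer_type(field) for field in required},
        "stages": [{"type": "extract", "fields": required}],
    }
```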

## Step 5: Observability & Audit Trail

```python
from opentelemetry import trace
tracer = trace.get_tracer("eigenpal.workflow")

def process_document(doc_id, workflow_id):
    with tracer.start_as_current_span("document.process") as span:
        span.set_attribute("document.id", doc_id)
        with tracer.start_as_current_span("ocr.extract"):
            ocr_result = run_ocr(doc_id)
        with tracer.start_as_current_span("llm.extract"):
            return run_llm_extraction(ocr_result, workflow_id)
```

Immutable audit log:
```sql
CREATE TABLE audit_log (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  document_id UUID,
  event_type TEXT NOT NULL,
  actor TEXT,
  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT now()
);
REVOKE UPDATE, DELETE ON audit_log FROM app_role;
```

## Step 6: Human-in-the-Loop Review Queue

```sql
CREATE TABLE review_queue (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id),
  workflow_id UUID REFERENCES workflows(id),
  reason TEXT NOT NULL,
  assigned_to UUID,
  status TEXT DEFAULT 'pending',
  reviewer_output JSONB,
  created_at TIMESTAMPTZ DEFAULT now(),
  resolved_at TIMESTAMPTZ
);
```

Review UI: side-by-side document image + fields, per-field confidence color-coding (green >0.95, yellow >0.80, red <0.80), one-click approve/reject. Corrections feed back into the eval dataset automatically (see the sketch below).
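
That correction feedback loop could be wired up as below; the `db` and `audit` helpers are schematic, not a real client library.

```python
def resolve_review(review_id: str, reviewer_output: dict, reviewer_id: str) -> None:
    """Close a review item and promote the human-verified output into the eval corpus."""
    review = db.get("review_queue", review_id)
    db.update("review_queue", review_id,
              status="resolved", reviewer_output=reviewer_output, resolved_at=now())
    # The reviewer's corrected fields become ground truth for future eval runs
    db.insert("eval_samples",
              dataset_id=default_dataset_for(review.workflow_id),
              document_id=review.document_id,
              ground_truth=reviewer_output)
    audit("review.resolved", document_id=review.document_id, actor=reviewer_id)
```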

## Step 7: Multi-Tenant Cloud + On-Prem Deployment

Split into Control Plane (cloud: tenant mgmt, billing, workflow marketplace) and Data Plane (per-tenant: document storage, workflow engine, audit log).

On-prem Helm values:
```yaml
eigenpal:
  mode: on-prem
  llm:
    provider: azure_openai  # or vllm for air-gapped
    endpoint: ${LLM_ENDPOINT}
  storage:
    provider: s3-compatible  # MinIO for on-prem
  telemetry:
    otel_endpoint: ${OTEL_COLLECTOR}
```

Key constraint: zero document content ever leaves the customer network boundary. Control plane receives only metadata (workflow configs, usage counts).
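
One way to enforce that boundary is to whitelist exactly what the data plane reports upward. The heartbeat payload below is illustrative; the counters are hypothetical aggregates computed inside the customer network.

```python
def build_heartbeat(tenant_id: str, window: str) -> dict:
    """Aggregate, content-free usage report sent from data plane to control plane."""
    return {
        "tenant_id": tenant_id,
        "window": window,                                # e.g. an ISO 8601 interval
        "documents_processed": count_documents(window),  # counts only, never content
        "automation_rate": automation_rate(window),
        "active_workflows": count_active_workflows(),
        "deployed_version": current_version(),
    }
```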