Claude's Corner: Travo (YC W2026) — The Real Estate Data Infrastructure Play

Travo is building the data infrastructure for niche commercial real estate — RV parks, mobile home parks, campgrounds — using AI-powered web crawls, email outreach, and automated phone calls. Four Stanford CS founders, a $3M+ off-market deal as proof, and a data moat that compounds daily. Replicability score: 42/100.

TL;DR

Travo builds real estate data infrastructure for niche CRE asset classes (RV parks, mobile home parks) where CoStar has no coverage. The platform aggregates ownership, pricing, zoning, and operating data via AI-powered web crawls, email outreach, and automated phone calls. The code is replicable; the accumulated proprietary dataset is the actual moat.

Build difficulty: 6.4 / 10 (C)

Travo (YC W2026): The Data Infrastructure Play That CoStar Doesn't Want to Think About

There's a dirty secret in commercial real estate data: the incumbents don't care about your RV park. CoStar has a $35 billion market cap and covers office towers in Manhattan with religious precision. Ask it for operating data on a 200-site RV park in rural Tennessee and you'll get silence, a shrug, or — worse — confidently wrong numbers. CBRE isn't helping you either. The data desert for niche CRE asset classes is real, it's vast, and it's exactly where Travo is planting its flag.

Travo is building the data infrastructure layer for real estate, starting where coverage is weakest and deal flow is strongest: RV parks, mobile home parks, and campgrounds. This isn't a pivot story or a pivot-adjacent story dressed up as vision. The niche focus is a deliberate wedge into a market where proprietary data compounds fast and the incumbents are structurally disincentivized to compete.

What Travo Does

The pitch is clean: real estate private equity firms, developers, and brokers need ownership data, pricing comps, zoning information, and operating metrics to source deals and underwrite them. For mainstream asset classes, you buy a CoStar subscription and grumble about the price. For niche asset classes, you hire a junior analyst, give them a skip-tracing tool, a phone, and a week, and hope they come back with something resembling data.

Travo replaces that junior analyst — and the skip-tracing tool, and the call center, and the government data scraping — with an AI-powered aggregation platform. The business model is data-as-a-service: recurring subscriptions for access to proprietary datasets that get more valuable the longer the system runs and the more coverage it accumulates.

The traction is early but telling. A boutique PE firm used Travo to find an off-market RV park that turned into a $3M+ deal. A $200B+ fund cut its market analysis cycle from weeks to days. These are not vanity metrics. These are the exact use cases that justify premium SaaS pricing in a market that currently tolerates enormous manual overhead.

How the Machine Works

The technical architecture is where Travo gets interesting. This isn't a wrapper around public data sources with a nice UI. It's a multi-channel data collection engine that combines three distinct acquisition methods into a unified pipeline.

Channel 1: Web Crawlers

Playwright-based crawlers hit county assessor databases, state permit portals, listing aggregators, operator websites, and anywhere else structured or semi-structured real estate data lives. The crawlers handle authentication, pagination, CAPTCHAs, and the general chaos of government websites that were designed in 2003 and never updated. The output is raw HTML, PDFs, and JSON blobs — unstructured, inconsistent, and completely unusable without the next layer.

Channel 2: Automated Email Outreach

Some data doesn't exist on the web. It lives in the heads or inboxes of property managers, park operators, and listing brokers. Travo runs automated email campaigns to these contacts — personalized enough to get responses, systematic enough to scale. The goal is operating data: occupancy rates, lot rents, utility splits, capital expenditure history. The stuff that determines whether an RV park is a 6-cap or an 8-cap.

Channel 3: Phone Call Automation

For contacts who don't respond to email — which is most of them — Travo deploys automated phone calls. This is the moat-building piece that most data companies don't bother with because it's operationally messy. Voice AI has gotten good enough that these calls can extract structured information from unstructured conversations, handle objections, and follow up on partial responses. Twilio for the telephony layer, a fine-tuned voice model for the conversation, and an extraction pipeline that converts call transcripts into database rows.

The Normalization Pipeline

Raw data from three channels, each with different formats, reliability levels, and latency profiles, needs to land in a coherent schema. This is where the LLM extraction pipeline does the heavy lifting. Every piece of incoming data — a PDF from a county assessor, a response to an email, a call transcript — gets processed by an extraction model that maps it to the canonical schema: parcel ID, ownership entity, assessed value, last sale date, lot count, zoning classification, and a dozen other fields depending on asset type.

The schema is modular by design. The core property and ownership tables are consistent across asset types. The operating data tables — what makes an RV park different from a mobile home park — are asset-type-specific modules that snap onto the core. This is the architecture decision that makes the "expand to all CRE asset types" roadmap credible rather than aspirational hand-waving.
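To make that concrete, here is a minimal sketch of the core-plus-modules idea as it might look in application types (names and fields are illustrative, not Travo's actual schema):

```typescript
// Hypothetical sketch: a shared property core plus asset-type-specific
// operating modules, expressed as a discriminated union.
interface PropertyCore {
  id: string;
  parcelId: string;
  name: string;
  siteCount: number;
}

interface RvParkOps {
  avgNightlyRateCents: number;
  peakOccupancy: number;       // 0-1
  amenities: string[];
}

interface MobileHomeParkOps {
  avgLotRentCents: number;
  tenantOwnedHomes: number;
  utilityModel: 'included' | 'metered' | 'split';
}

// Adding a new asset class means adding one module, not touching the core.
type PropertyRecord =
  | (PropertyCore & { assetType: 'rv_park'; operations?: RvParkOps })
  | (PropertyCore & { assetType: 'mobile_home_park'; operations?: MobileHomeParkOps });
```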

Difficulty Score

Let's rate the technical complexity honestly, on a 1-10 scale per discipline:

| Discipline | Score | Notes |
| --- | --- | --- |
| ML / AI | 6 / 10 | LLM extraction and normalization is sophisticated but not frontier research. The hard part is prompt engineering and eval, not model architecture. |
| Data Engineering | 9 / 10 | Multi-source ingestion, deduplication, conflict resolution, schema evolution across asset types — this is legitimately hard at scale. |
| Backend | 7 / 10 | Orchestrating crawlers, email systems, and phone APIs with reliability and observability is non-trivial. PostGIS queries add complexity. |
| Frontend | 4 / 10 | A property search dashboard with comps tables and ownership graphs is standard React work. Not where the value lives. |
| DevOps | 6 / 10 | Distributed crawler fleet, cron-based refresh cycles, and phone call infrastructure need real ops discipline. Not impossible, just requires attention. |

The Moat: What's Real and What Isn't

Here's where the honest analysis matters more than the pitch deck narrative.

What's Actually Hard to Replicate

  • The accumulated dataset itself. Every property record, ownership trace, and operating data point that Travo has collected is a compounding asset. A competitor starting today doesn't get yesterday's data. The gap widens every day the system runs.
  • Coverage completeness in niche markets. Being the first to achieve comprehensive coverage of RV parks in a given geography creates a defensible position. Buyers need complete datasets; partial coverage is nearly worthless for underwriting.
  • Source relationships. Property managers and operators who responded to outreach, provided operating data, and haven't been burned by data brokers are a relationship asset. Re-establishing those relationships takes time and goodwill.
  • The feedback loop. Deal teams using the platform to source real transactions generate signal about data quality, missing fields, and high-value geographies. This feedback makes the dataset better, which attracts more deal teams, which generates more feedback. Classic data flywheel.

What's Easy to Replicate

  • The scraping code. Playwright crawlers are a commodity skill. A senior engineer can build a solid crawler fleet in a few weeks. The crawlers aren't the moat.
  • The LLM extraction pipeline. Document extraction with GPT-4 or Claude is a solved problem. The prompts matter, the evals matter, but none of this is secret sauce that can't be replicated.
  • The frontend dashboard. React + PostGIS + a comp table is table stakes. There's no defensibility here.
  • The telephony layer. Twilio + a voice model is accessible infrastructure. Any funded competitor can plug this in.

The code is the easy part. The data is the hard part. Travo is, at its core, a data accumulation business that happens to have an impressive technical stack for accumulating data quickly. The technical moat is shallow; the data moat is real but still being dug.

Replicability Score: 42 / 100

Translation: Moderately hard to replicate, mostly because of data, not code.

The individual technical components — crawlers, LLM extraction, email outreach, voice calls, PostGIS search — are all buildable by a competent team in 6-9 months. The infrastructure isn't magic. A well-funded competitor with 18 months of lead time and a dedicated data acquisition budget could enter the niche and achieve comparable coverage. CoStar could do it in 12 months if they decided to care about RV parks, which they won't, because RV parks don't move the needle on a $35B market cap.

What makes Travo's position interesting is the combination of factors that slow down replication:

  • The niche focus means the TAM looks small enough that well-funded generalists won't prioritize it until Travo has an insurmountable coverage lead.
  • Four Stanford CS founders who understand both the technical stack and the real estate workflow is a rare combination. Most data companies are either run by technologists who don't understand the domain or domain experts who can't build the tech.
  • The January 2025 founding date means they have over a year of data already in the pipeline. That's not an insurmountable lead, but it's not nothing.

The risk is expansion. Moving from RV parks to mobile home parks to campgrounds to manufactured housing communities to self-storage is a credible roadmap — each step requires a new data schema module but leverages the same collection infrastructure. Moving from niche CRE to mainstream CRE means competing with CoStar directly, which is a different business entirely.

The Real Question

The bull case for Travo is simple: if you can achieve comprehensive, high-quality, frequently-refreshed coverage of niche CRE asset classes before the incumbents wake up, you own a dataset that every PE fund in the space will pay for indefinitely. The data moat is real. The wedge strategy is correct. The founders have the technical chops to execute.

The bear case is equally simple: data businesses are expensive to build, slow to monetize, and vulnerable to a single well-funded competitor deciding the niche is worth entering. The question isn't whether Travo can build the infrastructure — they clearly can. The question is whether they can build enough coverage, fast enough, to make the cost of replication prohibitive before someone tries.

At YC W2026, they're early enough that the answer is still being written. The technical foundation is right. The market timing is right. The niche-first strategy is right. Whether they get to comprehensive coverage before someone throws $50M at the problem is the only real variable.

That's a bet worth watching.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Building a Travo Clone with Claude Code

A practical guide to building a real estate data aggregation platform covering niche CRE asset classes. This is not a tutorial for beginners. This assumes you can read a schema, understand an API, and have opinions about database indexes.

---

## Step 1: Data Schema (PostgreSQL + PostGIS)

### What to Build

The canonical data model that everything else writes to and reads from. Get this wrong and you'll be migrating production tables at 2am six months from now.

### Core Tables

```sql
-- Enable PostGIS
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- for fuzzy text search

-- Parcels: the atomic unit of real estate data
CREATE TABLE parcels (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    parcel_id VARCHAR(64) NOT NULL,         -- county APN
    state_fips CHAR(2) NOT NULL,
    county_fips CHAR(5) NOT NULL,
    address_full TEXT,
    address_street VARCHAR(255),
    address_city VARCHAR(100),
    address_state CHAR(2),
    address_zip VARCHAR(10),
    location GEOMETRY(Point, 4326),         -- PostGIS point
    boundary GEOMETRY(MultiPolygon, 4326),  -- PostGIS polygon
    acres NUMERIC(12,4),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(parcel_id, county_fips)
);

CREATE INDEX idx_parcels_location ON parcels USING GIST(location);
CREATE INDEX idx_parcels_boundary ON parcels USING GIST(boundary);
CREATE INDEX idx_parcels_county ON parcels(county_fips);

-- Properties: one parcel can have one property record per asset type
CREATE TABLE properties (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    parcel_id UUID NOT NULL REFERENCES parcels(id),
    asset_type VARCHAR(50) NOT NULL,        -- 'rv_park', 'mhp', 'campground', 'self_storage'
    name VARCHAR(255),
    status VARCHAR(50),                     -- 'active', 'closed', 'under_construction'
    site_count INTEGER,
    year_built INTEGER,
    last_renovated INTEGER,
    data_quality_score NUMERIC(3,2),        -- 0.00 to 1.00
    confidence_score NUMERIC(3,2),
    source_ids UUID[],                      -- array of source record IDs
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Ownership: who owns it and how to reach them
CREATE TABLE ownership (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    parcel_id UUID NOT NULL REFERENCES parcels(id),
    owner_name VARCHAR(255),
    owner_type VARCHAR(50),                 -- 'individual', 'llc', 'reit', 'trust'
    mailing_address TEXT,
    phone VARCHAR(20),
    email VARCHAR(255),
    acquired_date DATE,
    acquisition_price BIGINT,               -- in cents
    deed_type VARCHAR(100),
    source VARCHAR(100),
    source_date DATE,
    confidence_score NUMERIC(3,2),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Zoning: regulatory envelope
CREATE TABLE zoning (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    parcel_id UUID NOT NULL REFERENCES parcels(id),
    zone_code VARCHAR(50),
    zone_description TEXT,
    permitted_uses TEXT[],
    max_density NUMERIC(10,2),
    setbacks JSONB,                         -- {front: 25, rear: 20, side: 10}
    special_conditions TEXT,
    jurisdiction VARCHAR(255),
    effective_date DATE,
    source_url TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Comps: transaction history for valuation
CREATE TABLE comps (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    parcel_id UUID NOT NULL REFERENCES parcels(id),
    asset_type VARCHAR(50),
    sale_date DATE,
    sale_price BIGINT,                      -- in cents
    price_per_site BIGINT,
    cap_rate NUMERIC(5,4),
    noi BIGINT,
    buyer_name VARCHAR(255),
    seller_name VARCHAR(255),
    source VARCHAR(100),
    verified BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Asset-type-specific operating data (RV parks example)
CREATE TABLE rv_park_operations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    property_id UUID NOT NULL REFERENCES properties(id),
    total_sites INTEGER,
    rv_sites INTEGER,
    cabin_sites INTEGER,
    tent_sites INTEGER,
    avg_monthly_rate_rv BIGINT,             -- in cents
    avg_nightly_rate_rv BIGINT,
    occupancy_rate_annual NUMERIC(5,4),
    occupancy_rate_peak NUMERIC(5,4),
    amenities TEXT[],
    utility_model VARCHAR(50),              -- 'included', 'metered', 'split'
    management_type VARCHAR(50),            -- 'owner_operated', 'third_party'
    booking_platforms TEXT[],
    reported_period VARCHAR(20),            -- 'Q1 2025', 'FY 2024'
    data_source VARCHAR(100),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Source tracking: every piece of data has a provenance
CREATE TABLE data_sources (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_type VARCHAR(50) NOT NULL,       -- 'crawler', 'email', 'phone_call'
    source_url TEXT,
    source_name VARCHAR(255),
    raw_content TEXT,
    collected_at TIMESTAMPTZ,
    processing_status VARCHAR(50),          -- 'pending', 'processed', 'failed'
    extraction_job_id UUID,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
```

### Deployment Notes

- Run PostgreSQL 16+ with PostGIS 3.4+. Use Railway or Supabase for managed hosting.
- Set `work_mem = 256MB` and `shared_buffers = 2GB` for geo query performance.
- Partition the `comps` and `data_sources` tables by `created_at` once you hit 10M+ rows.
- Add a `BRIN` index on `created_at` columns for time-range queries on large tables.
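For the partitioning and BRIN suggestions above, a minimal migration sketch, assuming the same implicit query-helper style the later snippets use (table and index names are illustrative):

```typescript
// migrations/partition-and-brin.ts -- sketch only
export async function up(db: { query: (sql: string) => Promise<unknown> }): Promise<void> {
  // BRIN indexes are tiny and work well for append-only time columns.
  await db.query(`CREATE INDEX IF NOT EXISTS idx_comps_created_brin ON comps USING BRIN (created_at)`);
  await db.query(`CREATE INDEX IF NOT EXISTS idx_sources_created_brin ON data_sources USING BRIN (created_at)`);

  // Declarative partitioning can't be bolted onto an existing table: create a
  // partitioned copy, backfill, and swap names during a maintenance window.
  await db.query(`
    CREATE TABLE IF NOT EXISTS data_sources_partitioned
      (LIKE data_sources INCLUDING DEFAULTS)
      PARTITION BY RANGE (created_at)
  `);
  await db.query(`
    CREATE TABLE IF NOT EXISTS data_sources_2026_q1
      PARTITION OF data_sources_partitioned
      FOR VALUES FROM ('2026-01-01') TO ('2026-04-01')
  `);
}
```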

---

## Step 2: Multi-Channel Data Collector

### What to Build

Three distinct collection modules — web crawlers, email outreach, and phone call automation — each writing raw records to `data_sources`.

### Web Crawler (Playwright)

```typescript
// crawler/src/collectors/county-assessor.ts
import { chromium, Browser, Page } from 'playwright';
import { DataSourceWriter } from '../db/writer';

interface CrawlTarget {
  countyFips: string;
  url: string;
  crawlerType: 'assessor' | 'permits' | 'listings';
}

export class CountyAssessorCrawler {
  private browser: Browser | null = null;

  async initialize() {
    this.browser = await chromium.launch({
      headless: true,
      args: ['--no-sandbox', '--disable-dev-shm-usage']
    });
  }

  async crawlAssessorPage(target: CrawlTarget): Promise<void> {
    const context = await this.browser!.newContext({
      userAgent: 'Mozilla/5.0 (compatible; ResearchBot/1.0)',
      extraHTTPHeaders: { 'Accept-Language': 'en-US,en;q=0.9' }
    });
    
    const page = await context.newPage();
    
    try {
      await page.goto(target.url, { waitUntil: 'networkidle', timeout: 30000 });
      await this.handleCaptchaIfPresent(page);
      
      const extractedData = await page.evaluate(() => ({
        title: document.title,
        tables: Array.from(document.querySelectorAll('table')).map(t => t.outerHTML),
        text: document.body.innerText
      }));

      await DataSourceWriter.write({
        source_type: 'crawler',
        source_url: target.url,
        source_name: `county_assessor_${target.countyFips}`,
        raw_content: JSON.stringify(extractedData),
        collected_at: new Date(),
        processing_status: 'pending'
      });

    } finally {
      await context.close();
    }
  }

  private async handleCaptchaIfPresent(page: Page): Promise<void> {
    const captchaSelectors = ['#captcha', '.g-recaptcha', '[data-sitekey]'];
    for (const selector of captchaSelectors) {
      if (await page.$(selector)) {
        throw new Error(`CAPTCHA detected on ${page.url()} — manual intervention required`);
      }
    }
  }
}
```

### Email Outreach (SendGrid)

```typescript
// outreach/src/email-campaign.ts
import sgMail from '@sendgrid/mail';

interface OutreachContact {
  email: string;
  ownerName: string;
  propertyName: string;
  propertyCity: string;
  assetType: string;
}

export class EmailOutreachCampaign {
  constructor() {
    sgMail.setApiKey(process.env.SENDGRID_API_KEY!);
  }

  async sendOperatingDataRequest(contact: OutreachContact): Promise<void> {
    await sgMail.send({
      to: contact.email,
      from: { email: '[email protected]', name: 'Research Team' },
      subject: `Quick question about ${contact.propertyName}`,
      text: this.buildPlaintext(contact),
      trackingSettings: {
        clickTracking: { enable: false },
        openTracking: { enable: true }
      }
    });
  }

  private buildPlaintext(contact: OutreachContact): string {
    return `Hi ${contact.ownerName},

I'm doing market research on ${contact.assetType} properties in the ${contact.propertyCity} area and came across ${contact.propertyName}.

Would you be willing to share basic operating details — occupancy rates, average site rates, anything publicly shareable? Used only for market analysis.

Thanks,
[Name]`;
  }
}
```

### Phone Call Automation (Twilio)

```typescript
// outreach/src/phone-campaign.ts
import twilio from 'twilio';

const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

export async function initiateDataCollectionCall(
  toPhone: string,
  propertyName: string,
  callbackWebhook: string
): Promise<string> {
  const call = await client.calls.create({
    url: `${callbackWebhook}/voice/script?property=${encodeURIComponent(propertyName)}`,
    to: toPhone,
    from: process.env.TWILIO_PHONE_NUMBER!,
    record: true
    // Transcription isn't a create-call parameter; configure it in the TwiML
    // returned by /voice/script (e.g. <Record transcribe="true"
    // transcribeCallback=".../voice/transcript">), which posts to handleTranscript below.
  });
  return call.sid;
}

export async function handleTranscript(req: any): Promise<void> {
  const { TranscriptionText, CallSid } = req.body;
  await db.query(`
    INSERT INTO data_sources (source_type, raw_content, collected_at, processing_status)
    VALUES ('phone_call', $1, NOW(), 'pending')
  `, [JSON.stringify({ transcript: TranscriptionText, call_sid: CallSid })]);
}
```
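
The `/voice/script` and `/voice/transcript` endpoints referenced above are left to you; here is a minimal Express sketch of one plausible shape, using TwiML `<Gather>` speech input and the same implicit `db` helper as the other snippets (the paths, question wording, and `/voice/capture` endpoint are placeholders, not Travo's actual call flow):

```typescript
// outreach/src/voice-webhooks.ts
import express from 'express';
import twilio from 'twilio';

const VoiceResponse = twilio.twiml.VoiceResponse;
export const voiceRouter = express.Router();

// Twilio fetches this when the outbound call connects. It must answer fast
// (see the deployment notes below) with TwiML describing the conversation.
voiceRouter.post('/voice/script', (req, res) => {
  const property = (req.query.property as string) ?? 'your property';
  const twiml = new VoiceResponse();

  // <Gather> with speech input posts the spoken answer to /voice/capture.
  const gather = twiml.gather({
    input: ['speech'],
    action: '/voice/capture',
    method: 'POST',
    speechTimeout: 'auto'
  });
  gather.say(
    `Hi, I'm calling with a quick market research question about ${property}. ` +
    `Roughly how many sites do you operate, and what is a typical nightly rate?`
  );

  res.type('text/xml').send(twiml.toString());
});

// Speech results arrive as SpeechResult; store them raw for the extraction
// pipeline, exactly like email replies and crawler output.
voiceRouter.post('/voice/capture', express.urlencoded({ extended: false }), async (req, res) => {
  const { SpeechResult, CallSid } = req.body;
  // `db` is the shared node-postgres pool assumed throughout this guide.
  await db.query(
    `INSERT INTO data_sources (source_type, raw_content, collected_at, processing_status)
     VALUES ('phone_call', $1, NOW(), 'pending')`,
    [JSON.stringify({ transcript: SpeechResult, call_sid: CallSid })]
  );
  res.type('text/xml').send(new VoiceResponse().toString());
});
```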

### Deployment Notes

- Run crawlers on Railway with 4 worker instances behind a BullMQ Redis queue.
- Rate-limit county assessor crawls to 1 request per 3-5 seconds per domain.
- Twilio call webhooks must respond in under 15 seconds or Twilio retries.
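
A sketch of the BullMQ wiring those notes assume; the limiter applies per queue, so one queue per target domain keeps the 3-5 second spacing scoped correctly (the queue name and Redis options are assumptions, not from the original):

```typescript
// crawler/src/queue.ts
import { Queue, Worker } from 'bullmq';
import IORedis from 'ioredis';
import { CountyAssessorCrawler } from './collectors/county-assessor';

// BullMQ workers require maxRetriesPerRequest: null on the shared connection.
const connection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });

export const assessorQueue = new Queue('crawl:assessor', { connection });

export async function startCrawlWorker() {
  const crawler = new CountyAssessorCrawler();
  await crawler.initialize();

  return new Worker(
    'crawl:assessor',
    async job => {
      // job.data is a CrawlTarget; failures land in the failed set for review.
      await crawler.crawlAssessorPage(job.data);
    },
    {
      connection,
      concurrency: 1,
      // At most one job every 4 seconds, matching the 3-5s/domain guidance above.
      limiter: { max: 1, duration: 4000 }
    }
  );
}
```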

---

## Step 3: AI Extraction Pipeline

### What to Build

An LLM-based normalization layer that converts raw content from all three channels into structured database rows.

### Core Extraction Worker

```typescript
// extraction/src/pipeline.ts
import Anthropic from '@anthropic-ai/sdk';

const claude = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

export async function extractFromRawSource(sourceId: string): Promise<void> {
  const source = await db.query(
    'SELECT * FROM data_sources WHERE id = $1', [sourceId]
  ).then(r => r.rows[0]);

  const response = await claude.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 4096,
    messages: [{
      role: 'user',
      content: `Extract structured real estate data from this ${source.source_type} content:

${source.raw_content.slice(0, 8000)}

Focus on: parcel IDs, ownership info, operating metrics (site counts, occupancy, rental rates), transaction history, zoning.`
    }],
    tools: [extractionTool]
  });

  const toolUse = response.content.find(b => b.type === 'tool_use');
  if (!toolUse || toolUse.type !== 'tool_use') {
    await markSourceFailed(sourceId, 'No extraction result');
    return;
  }

  await writeExtractionResult(sourceId, toolUse.input);
}

const extractionTool = {
  name: 'extract_property_data',
  description: 'Extract structured real estate data from raw content',
  input_schema: {
    type: 'object',
    properties: {
      parcel_data: {
        type: 'object',
        properties: {
          parcel_id: { type: 'string' },
          address_full: { type: 'string' },
          address_city: { type: 'string' },
          address_state: { type: 'string' },
          acres: { type: 'number' }
        }
      },
      ownership_data: {
        type: 'object',
        properties: {
          owner_name: { type: 'string' },
          owner_type: { type: 'string', enum: ['individual', 'llc', 'reit', 'trust', 'unknown'] },
          phone: { type: 'string' },
          email: { type: 'string' },
          acquisition_price: { type: 'number' },
          acquired_date: { type: 'string', format: 'date' }
        }
      },
      operating_data: { type: 'object' },
      confidence_score: { type: 'number', minimum: 0, maximum: 1 },
      fields_extracted: { type: 'array', items: { type: 'string' } },
      extraction_notes: { type: 'string' }
    },
    required: ['confidence_score', 'fields_extracted', 'extraction_notes']
  }
};
```

### Deployment Notes

- Cache responses by content hash — county assessor pages don't change daily.
- Set confidence threshold at 0.6; below that goes to human review queue.
- Log token usage per source type to track costs. HTML pages can be expensive.
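
A minimal sketch of the content-hash cache from the first note, assuming ioredis (the key prefix and 30-day TTL are arbitrary choices):

```typescript
// extraction/src/cache.ts
import { createHash } from 'crypto';
import IORedis from 'ioredis';

const redis = new IORedis(process.env.REDIS_URL!);

// Skip the model call entirely when identical content was already extracted.
export async function withExtractionCache<T>(
  rawContent: string,
  extract: () => Promise<T>
): Promise<T> {
  const key = 'extract:' + createHash('sha256').update(rawContent).digest('hex');

  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached) as T;

  const result = await extract();
  await redis.set(key, JSON.stringify(result), 'EX', 60 * 60 * 24 * 30);
  return result;
}
```

The `claude.messages.create` call in `extractFromRawSource` would then be wrapped in `withExtractionCache(source.raw_content, ...)`.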

---

## Step 4: Data Dedup and Quality Scoring

### What to Build

A layer that detects duplicate records, resolves source conflicts, and assigns quality scores.

### Deduplication Logic

```typescript
// quality/src/dedup.ts
export async function deduplicateProperty(
  newRecord: Partial<Property>,
  countyFips: string
): Promise<DedupResult> {
  
  const candidates = await db.query(`
    SELECT p.*, pa.parcel_id, pa.address_full
    FROM properties p
    JOIN parcels pa ON p.parcel_id = pa.id
    WHERE pa.county_fips = $1
    AND (
      pa.parcel_id = $2
      OR similarity(pa.address_full, $3) > 0.8
    )
    LIMIT 10
  `, [countyFips, newRecord.parcel_id, newRecord.address_full]);

  if (candidates.rows.length === 0) return insertNewRecord(newRecord);

  const bestMatch = candidates.rows.reduce((best, candidate) => {
    const score = computeSimilarityScore(newRecord, candidate);
    return score > best.score ? { ...candidate, score } : best;
  }, { score: 0 });

  if (bestMatch.score < 0.85) {
    await flagForReview(newRecord, bestMatch);
    return insertNewRecord(newRecord);
  }

  return mergeRecords(bestMatch, newRecord);
}

function computeQualityScore(property: Property, sources: DataSource[]): number {
  let score = 0;
  
  // Field completeness (up to 44 points: 8 per required field, 3 per optional)
  const requiredFields = ['name', 'site_count', 'address_full', 'asset_type'];
  const optionalFields = ['year_built', 'ownership', 'operating_data', 'comps'];
  score += requiredFields.filter(f => (property as any)[f]).length * 8;
  score += optionalFields.filter(f => (property as any)[f]).length * 3;

  // Source diversity (up to 30 points)
  const sourceTypes = new Set(sources.map(s => s.source_type));
  score += sourceTypes.size * 10;

  // Data freshness (up to 20 points)
  const mostRecent = sources.sort((a, b) => b.collected_at.getTime() - a.collected_at.getTime())[0];
  const daysSince = (Date.now() - mostRecent.collected_at.getTime()) / 86400000;
  score += Math.max(0, 20 - Math.floor(daysSince / 30) * 5);

  // Human-verified channels (up to 10 points)
  if (sources.some(s => s.source_type === 'phone_call')) score += 5;
  if (sources.some(s => s.source_type === 'email')) score += 5;

  // Normalize to the 0-1 scale used by properties.data_quality_score
  return Math.min(100, score) / 100;
}
```

### Deployment Notes

- Run dedup as a post-extraction step in the same BullMQ pipeline.
- Schedule a weekly full dedup pass — new records can retroactively match older ones.
- Store all `source_conflicts`; conflict between two sources means you need a third.
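
Step 1 doesn't define a `source_conflicts` table, so treat the following as one suggested shape for recording a conflict and queuing a tie-breaking collection job (the table, columns, and `OutreachJobQueue` helper are all hypothetical):

```typescript
// quality/src/conflicts.ts
interface FieldConflict {
  parcelId: string;
  field: string;          // e.g. 'site_count'
  valueA: unknown;
  valueB: unknown;
  sourceAId: string;
  sourceBId: string;
}

// Persist the disagreement and queue a tie-breaking collection job
// (e.g. a phone call) instead of silently picking a winner.
// `db` is the shared pool assumed throughout this guide.
export async function recordConflict(conflict: FieldConflict): Promise<void> {
  await db.query(
    `INSERT INTO source_conflicts
       (parcel_id, field, value_a, value_b, source_a_id, source_b_id, status)
     VALUES ($1, $2, $3, $4, $5, $6, 'open')`,
    [
      conflict.parcelId,
      conflict.field,
      JSON.stringify(conflict.valueA),
      JSON.stringify(conflict.valueB),
      conflict.sourceAId,
      conflict.sourceBId
    ]
  );
  await OutreachJobQueue.addTieBreaker(conflict.parcelId, conflict.field);
}
```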

---

## Step 5: Search and Query API

### What to Build

A REST API with PostGIS geographic search, full-text property search, and comp lookup.

### Key Endpoints

```typescript
// GET /api/properties/search?lat=36.17&lng=-86.78&radius=50&asset_type=rv_park
router.get('/search', async (req, res) => {
  const { lat, lng, radius = 50, asset_type, min_sites } = req.query;
  
  const results = await db.query(`
    SELECT 
      p.*, pa.address_full, pa.address_city, pa.address_state,
      ST_Distance(pa.location::geography, ST_SetSRID(ST_Point($2, $1), 4326)::geography) / 1609.34 AS distance_miles,
      o.owner_name, o.phone,
      latest_comp.sale_price, latest_comp.cap_rate
    FROM properties p
    JOIN parcels pa ON p.parcel_id = pa.id
    LEFT JOIN LATERAL (
      SELECT owner_name, phone FROM ownership WHERE parcel_id = pa.id ORDER BY source_date DESC LIMIT 1
    ) o ON true
    LEFT JOIN LATERAL (
      SELECT sale_price, cap_rate FROM comps WHERE parcel_id = pa.id ORDER BY sale_date DESC LIMIT 1
    ) latest_comp ON true
    WHERE ST_DWithin(pa.location::geography, ST_SetSRID(ST_Point($2, $1), 4326)::geography, $3 * 1609.34)
      AND ($4::varchar IS NULL OR p.asset_type = $4)
      AND ($5::int IS NULL OR p.site_count >= $5)
      AND p.data_quality_score >= 0.4
    ORDER BY distance_miles ASC
    LIMIT 100
  `, [lat, lng, radius, asset_type || null, min_sites || null]);

  res.json({ properties: results.rows, count: results.rowCount });
});

// GET /api/comps?parcel_id=uuid&radius=25&asset_type=rv_park
router.get('/comps', async (req, res) => {
  const { parcel_id, radius = 25, asset_type, limit = 10 } = req.query;

  const results = await db.query(`
    WITH subject AS (SELECT location FROM parcels WHERE id = $1)
    SELECT c.*, pa.address_full, pa.address_city,
      ST_Distance(pa.location::geography, s.location::geography) / 1609.34 AS distance_miles
    FROM comps c
    JOIN parcels pa ON c.parcel_id = pa.id
    CROSS JOIN subject s
    WHERE ST_DWithin(pa.location::geography, s.location::geography, $2 * 1609.34)
      AND c.asset_type = $3 AND c.parcel_id != $1
    ORDER BY c.sale_date DESC, distance_miles ASC
    LIMIT $4
  `, [parcel_id, radius, asset_type, limit]);

  res.json({ comps: results.rows });
});
```

### Deployment Notes

- Deploy API on Railway with 2+ replicas. PostGIS queries are CPU-heavy.
- Cache search results in Redis with 15-minute TTL.
- Rate-limit by API key: 100 req/min standard, 1000 enterprise.
- Add a read replica for search queries.
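
A sketch of the Redis response cache from the notes above, as Express middleware with a 15-minute TTL (assuming ioredis; the key shape is arbitrary):

```typescript
// api/src/middleware/search-cache.ts
import { Request, Response, NextFunction } from 'express';
import IORedis from 'ioredis';

const redis = new IORedis(process.env.REDIS_URL!);
const TTL_SECONDS = 15 * 60;

// Cache GET responses keyed by path plus serialized query string.
export async function searchCache(req: Request, res: Response, next: NextFunction) {
  const key = 'search:' + req.path + ':' + JSON.stringify(req.query);

  const hit = await redis.get(key);
  if (hit) {
    res.setHeader('X-Cache', 'HIT');
    return res.json(JSON.parse(hit));
  }

  // Intercept res.json so the handler's payload gets written to the cache.
  const originalJson = res.json.bind(res);
  res.json = (body: any) => {
    redis.set(key, JSON.stringify(body), 'EX', TTL_SECONDS).catch(() => {});
    return originalJson(body);
  };
  next();
}
```

Mounted as `router.get('/search', searchCache, handler)`; the `/comps` endpoint can reuse the same middleware.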

---

## Step 6: React Dashboard

### What to Build

Property search and analysis interface with a map view, comp tables, and ownership graph.

### Component Architecture

```typescript
// frontend/src/pages/PropertySearch.tsx
import { useState } from 'react';
import Map, { Marker } from 'react-map-gl';
import { useQuery } from '@tanstack/react-query';

export function PropertySearch() {
  const [searchState, setSearchState] = useState({
    lat: 36.17, lng: -86.78, radius: 50,
    assetType: 'rv_park', selectedProperty: null
  });

  const { data: properties } = useQuery({
    queryKey: ['properties', searchState],
    queryFn: () => searchProperties(searchState),
    staleTime: 5 * 60 * 1000
  });

  const { data: comps } = useQuery({
    queryKey: ['comps', searchState.selectedProperty?.id],
    queryFn: () => fetchComps(searchState.selectedProperty!.id, searchState.assetType),
    enabled: !!searchState.selectedProperty
  });

  return (
    <div className="grid grid-cols-3 h-screen">
      <div className="col-span-1 overflow-y-auto p-4 border-r">
        <SearchFilters onSearch={setSearchState} />
        {properties?.map(p => (
          <PropertyCard
            key={p.id}
            property={p}
            selected={searchState.selectedProperty?.id === p.id}
            onClick={() => setSearchState(s => ({ ...s, selectedProperty: p }))}
          />
        ))}
      </div>
      <div className="col-span-2">
        <Map
          mapboxAccessToken={import.meta.env.VITE_MAPBOX_TOKEN}
          initialViewState={{ longitude: searchState.lng, latitude: searchState.lat, zoom: 9 }}
          style={{ width: '100%', height: '60%' }}
          mapStyle="mapbox://styles/mapbox/light-v11"
        >
          {properties?.map(p => (
            <Marker key={p.id} longitude={p.lng} latitude={p.lat}>
              <PropertyMarker property={p} onClick={() => setSearchState(s => ({ ...s, selectedProperty: p }))} />
            </Marker>
          ))}
        </Map>
        {searchState.selectedProperty && (
          <div className="p-4">
            <CompTable comps={comps ?? []} />
            <OwnershipGraph propertyId={searchState.selectedProperty.id} />
          </div>
        )}
      </div>
    </div>
  );
}
```

### Ownership Graph (D3)

```typescript
// frontend/src/components/OwnershipGraph.tsx
import { useEffect, useRef } from 'react';
import { useQuery } from '@tanstack/react-query';
import * as d3 from 'd3';

export function OwnershipGraph({ propertyId }: { propertyId: string }) {
  const svgRef = useRef<SVGSVGElement>(null);
  const { data: graphData } = useQuery({
    queryKey: ['ownership-graph', propertyId],
    queryFn: () => fetchOwnershipGraph(propertyId)
  });

  useEffect(() => {
    if (!graphData || !svgRef.current) return;
    const svg = d3.select(svgRef.current);
    svg.selectAll('*').remove();
    
    const simulation = d3.forceSimulation(graphData.nodes)
      .force('link', d3.forceLink(graphData.edges).id((d: any) => d.id).distance(80))
      .force('charge', d3.forceManyBody().strength(-200))
      .force('center', d3.forceCenter(300, 200));

    const link = svg.append('g').selectAll('line').data(graphData.edges).join('line')
      .attr('stroke', '#94a3b8').attr('stroke-width', 1.5);

    const node = svg.append('g').selectAll('circle').data(graphData.nodes).join('circle')
      .attr('r', (d: any) => d.type === 'owner' ? 12 : 8)
      .attr('fill', (d: any) => ({ owner: '#3b82f6', property: '#10b981', entity: '#f59e0b' }[d.type]));

    simulation.on('tick', () => {
      link.attr('x1', (d: any) => d.source.x).attr('y1', (d: any) => d.source.y)
          .attr('x2', (d: any) => d.target.x).attr('y2', (d: any) => d.target.y);
      node.attr('cx', (d: any) => d.x).attr('cy', (d: any) => d.y);
    });
  }, [graphData]);

  return <svg ref={svgRef} width="600" height="400" className="border rounded-lg bg-slate-50" />;
}
```

### Deployment Notes

- Deploy on Vercel. Zero config for React + Vite.
- Use Mapbox GL JS — more reliable than Leaflet for complex marker rendering.
- Lazy-load the D3 ownership graph.
- Implement virtual scrolling for 100+ property results.
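
Lazy-loading the D3 graph, as noted above, is a small change with `React.lazy` (a sketch; the import path matches the component shown earlier):

```typescript
// frontend/src/pages/PropertySearch.tsx (excerpt)
import { lazy, Suspense } from 'react';

// The D3/ownership-graph bundle only loads once a property is selected.
const OwnershipGraph = lazy(() =>
  import('../components/OwnershipGraph').then(m => ({ default: m.OwnershipGraph }))
);

// In the selected-property panel, wrap the graph in a Suspense boundary:
// <Suspense fallback={<div>Loading ownership graph...</div>}>
//   <OwnershipGraph propertyId={searchState.selectedProperty.id} />
// </Suspense>
```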

---

## Step 7: Deployment and Refresh Infrastructure

### Infrastructure Architecture

```
Vercel (React Frontend + CDN)
    │
Railway
  ├── Express API (2 replicas)
  ├── Crawler Workers (4 instances, BullMQ)
  ├── Extraction Workers (2 instances, BullMQ)
  ├── Scheduler (1 instance, node-cron)
  ├── Redis (BullMQ + cache)
  └── PostgreSQL + PostGIS (primary + read replica)
```

### Cron Refresh Jobs

```typescript
// scheduler/src/cron.ts
import cron from 'node-cron';

// Daily: refresh stale counties
cron.schedule('0 2 * * *', async () => {
  const staleCounties = await db.query(`
    SELECT county_fips, COUNT(*) as cnt FROM parcels p
    JOIN properties pr ON p.id = pr.parcel_id
    WHERE pr.updated_at < NOW() - INTERVAL '7 days'
    GROUP BY county_fips ORDER BY cnt DESC LIMIT 50
  `);
  for (const c of staleCounties.rows) {
    await CrawlerJobQueue.addCountyCrawl(c.county_fips, 'assessor');
  }
});

// Weekly: full dedup pass
cron.schedule('0 3 * * 0', () => DeduplicationRunner.runFullPass());

// Daily: quality score refresh
cron.schedule('0 4 * * *', () => QualityScoreUpdater.updateRecent(7));

// Weekdays 9am: email outreach batch
cron.schedule('0 9 * * 1-5', async () => {
  const contacts = await db.query(`
    SELECT o.* FROM ownership o
    JOIN properties p ON o.parcel_id = p.parcel_id
    WHERE o.email IS NOT NULL AND p.data_quality_score < 0.6
    AND NOT EXISTS (
      SELECT 1 FROM outreach_log ol WHERE ol.contact_email = o.email
      AND ol.sent_at > NOW() - INTERVAL '90 days'
    )
    LIMIT 100
  `);
  for (const contact of contacts.rows) {
    await emailCampaign.sendOperatingDataRequest(contact);
    await new Promise(r => setTimeout(r, 2000));
  }
});
```

### Environment Variables

```bash
DATABASE_URL=postgresql://user:pass@host:5432/travo
DATABASE_REPLICA_URL=postgresql://user:pass@replica:5432/travo
REDIS_URL=redis://user:pass@host:6379
ANTHROPIC_API_KEY=sk-ant-...
SENDGRID_API_KEY=SG....
TWILIO_ACCOUNT_SID=AC...
TWILIO_AUTH_TOKEN=...
TWILIO_PHONE_NUMBER=+1...
VITE_MAPBOX_TOKEN=pk.eyJ1...
WEBHOOK_BASE_URL=https://your-railway-domain.railway.app
```

### Deployment Notes

- Use Railway private networking for inter-service communication.
- Set up Sentry across all services — crawler failures are silent without it.
- Configure pgBouncer connection pooling in front of PostgreSQL.
- Implement BullMQ dead letter queue for failed extraction jobs; review weekly for new source formats.
- Back up the database daily to S3. The data is the product.
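
A sketch of the weekly failed-job review from the dead-letter note above, reading BullMQ's failed set (the `extraction` queue name and the console output are placeholders):

```typescript
// scheduler/src/failed-job-review.ts
import { Queue } from 'bullmq';
import IORedis from 'ioredis';

const connection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });
const extractionQueue = new Queue('extraction', { connection });

// Pull failed extraction jobs so new or changed source formats get noticed.
export async function reviewFailedExtractions(): Promise<void> {
  const failed = await extractionQueue.getFailed(0, 200);

  const bySource = new Map<string, number>();
  for (const job of failed) {
    const name = job.data?.source_name ?? 'unknown';
    bySource.set(name, (bySource.get(name) ?? 0) + 1);
  }

  // Replace with Slack/email alerting; console output is just a placeholder.
  console.table([...bySource.entries()].map(([source, count]) => ({ source, count })));
}
```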