Plan · v1 · 2026-05-11

Practice Data Enrichment

One-time scrape of the 1,245 Elite-flagged practice websites to find gaps in our CRM: missing prescribers and missing secondary office locations. A Python validator splits extracted rows into auto-skip, auto-import, and human-review buckets, so most rows never need per-row human review.

Status: Draft · pending Sarah review · ~2 days build · ~$50 cost · validator-as-reviewer

  • 1,245 practice websites in scope (from the Elite-rollup output)
  • ~30% expected genuine gap-fill rate (~2,250 of ~7,500 extracted provider rows)
  • ~10% of rows needing human review (~6 hours of reviewer time, not 16)

§1 What this fills in — gap-fill, not roster-grab

We already have NPs, MDs, and PAs in CRM today, including many at Elite-flagged practices. This plan is not "find all the providers"; it is "find what's missing."

§2 Pipeline — five stages, validator is the review

The Python validator at Stage 4 is what answers "why doesn't a human need to review every row." Most rows clear the validator and route to auto_import.csv for a mechanical Data Loader upload; only ambiguous rows land in human_review.csv.

  1. URL discovery: read Account.Website; Google fallback for blanks; regex provider + location slugs. Output: ~2,500 URLs.
  2. Headless fetch: Playwright Chromium, readability cleanup. Rate limit 1 req / 3 s / domain. Honest UA. Skip Akamai 403s. Output: ~2,400 HTML documents.
  3. Batch LLM extract: Anthropic Batch API, Haiku 4.5, JSON schema, anti-hallucination prompt. ~24 hr turnaround. Output: 1,245 JSON docs.
  4. Python validator: the review layer. Luhn-check NPIs, fuzzy-match against existing Contacts, credential allowlist, confidence thresholds. Output: per-row verdict.
  5. Route to outputs: three CSVs split by verdict. Auto-import is Data Loader–ready; human review only for the ambiguous tail. Output: 3 CSVs.
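Stage 2's per-domain throttle can live in a small helper. A minimal sketch (the class name is illustrative; clock and sleep are injectable so it can be tested without waiting):

```python
import time


class DomainRateLimiter:
    """Allow at most one request per `interval` seconds per domain."""

    def __init__(self, interval: float = 3.0):
        self.interval = interval
        self._last: dict[str, float] = {}  # domain -> time of last fetch

    def wait(self, domain: str, now=time.monotonic, sleep=time.sleep) -> None:
        """Block until `domain` may be fetched again, then record the fetch."""
        t = now()
        last = self._last.get(domain)
        if last is not None:
            remaining = self.interval - (t - last)
            if remaining > 0:
                sleep(remaining)
                t = now()
        self._last[domain] = t
```

In the Playwright fetch loop this would be called with the URL's hostname before each navigation; different domains never block each other, so a first request to any domain proceeds immediately.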

§3 Scope — one tier, no Tier 2

Earlier drafts of this plan included EHR system detection, accreditations, languages spoken, sub-specialty taxonomies, office-manager extraction. Cut. None of that is what Tyler asked for. Scope is two things: missing prescribers + secondary locations. Stop.

provider record
One per provider found on the practice website. Required fields plus validator-populated fields:
practice_account_id, provider_name, credentials, specialty, npi, photo_url, bio_url, confidence, evidence_span, match_to_existing, validator_verdict
location record
One per office location per practice. Fills the secondary-location gap. Required fields plus validator-populated fields:
practice_account_id, location_label, street_address, city, state, zip, phone, fax, is_primary, confidence, evidence_span, match_to_existing, validator_verdict
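One way to carry a provider row through the pipeline is a typed record with the field names above. A sketch only: the types and defaults are assumptions, and the split between extracted and validator-populated fields in the comments is an assumption too.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProviderRecord:
    # Extracted at Stage 3 (assumed split; see the schema in the plan)
    practice_account_id: str
    provider_name: str
    credentials: str
    specialty: Optional[str] = None
    npi: Optional[str] = None
    photo_url: Optional[str] = None
    bio_url: Optional[str] = None
    confidence: float = 0.0
    evidence_span: str = ""
    # Populated by the validator at Stage 4 (assumed split)
    match_to_existing: Optional[str] = None  # Contact Id if matched
    validator_verdict: str = ""              # auto_skip / auto_import / human_review
```

The location record would get an analogous dataclass with its own fields.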

§4 How the validator routes rows

For each extracted row, the Python validator applies confidence thresholds, NPI Luhn checks, fuzzy matching against existing Salesforce Contacts under the same AccountId, and a credential allowlist. Expected split across ~7,500 extracted provider rows:
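The NPI check digit is standard Luhn computed over the 10-digit number prefixed with the card-issuer identifier 80840, so the check is a few lines:

```python
def npi_luhn_valid(npi: str) -> bool:
    """True if `npi` is a well-formed 10-digit NPI with a valid Luhn check digit.

    NPIs validate as plain Luhn over '80840' + NPI, where 80840 is the
    card-issuer prefix assigned to US health identifiers.
    """
    if len(npi) != 10 or not npi.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed("80840" + npi)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:       # equivalent to summing the two digits
                d -= 9
        total += d
    return total % 10 == 0
```

This catches transcription-style hallucinations (a single wrong digit) but not an invented NPI that happens to checksum correctly; the fuzzy match and confidence threshold cover the rest.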

~60% · auto_skip.csv
Already in CRM
NPI exact match OR high-confidence name fuzzy match (Jaro-Winkler ≥ 0.92). We already have this person. No action.
  • ~4,500 rows
  • Audit-only — reviewer skims for false positives
~30% · auto_import.csv
Genuine gap fill
High-confidence row, no existing match. Credential in allowlist. The actual prescribers we want to add.
  • ~2,250 rows
  • Data Loader–ready — admin uploads, no per-row review
~10% · human_review.csv
Ambiguous
Low confidence, ambiguous fuzzy match (Jaro-Winkler 0.75–0.92), NPI failing the Luhn check, credential outside the allowlist, or missing required fields.
  • ~750 rows
  • Reviewer works the queue at ~30 sec per row ≈ 6 hours
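The three-way split above can be sketched as a pure routing function. The 0.92 and 0.75 Jaro-Winkler bounds come from the buckets above; the allowlist contents and the confidence floor are illustrative assumptions.

```python
ALLOWED_CREDENTIALS = {"MD", "DO", "NP", "PA"}  # illustrative allowlist
CONFIDENCE_FLOOR = 0.80                          # assumed auto-import threshold


def route_row(
    npi_exact_match: bool,
    name_similarity: float,  # Jaro-Winkler vs. existing Contacts on the Account
    npi_luhn_ok: bool,
    credential: str,
    confidence: float,
    has_required_fields: bool,
) -> str:
    """Return the output CSV a row routes to."""
    # Already in CRM: exact NPI match or a very strong name match.
    if npi_exact_match or name_similarity >= 0.92:
        return "auto_skip.csv"
    # Anything ambiguous goes to the human tail.
    if (
        not has_required_fields
        or not npi_luhn_ok
        or credential not in ALLOWED_CREDENTIALS
        or confidence < CONFIDENCE_FLOOR
        or 0.75 <= name_similarity < 0.92
    ):
        return "human_review.csv"
    # High confidence, no existing match, allowlisted credential: genuine gap fill.
    return "auto_import.csv"
```

Note the ordering: the skip check runs first, so a confident match to an existing Contact never reaches the review queue even if some other field is odd.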

§5 Three prototype targets — validate before the full run

If the pipeline handles these three cleanly, it handles the remaining 1,242. Selected from the deep research to cover each difficulty tier.

Easy
Pinnacle ENT Associates
pentadocs.com
WordPress archetype. 9 locations, ~30–50 providers. Slug-based URLs, likely open /wp-json/wp/v2/ REST API. Server-rendered HTML.
Medium
South Florida ENT Associates
sfenta.org
Modern SPA archetype. DatoCMS + Next.js/React. Tab-controlled providers index (?tab=providers). 50+ providers, 30+ locations.
Hard
Cleveland Clinic Head & Neck Institute
my.clevelandclinic.org/...
Hospital-network archetype. Sitecore + React, Akamai Bot Manager. Tests graceful degradation — correct behavior is log + skip, not anti-bot evasion.

§6 Cost

All-in under $50 for the one-time pass.

One-time v1 line items:
  • Claude Haiku 4.5 Batch API (~37M tokens): ~$24
  • Azure Container Instance compute (Playwright + validator): ~$15
  • Azure Blob Storage (~1 GB): <$5
  • Salesforce Bulk API 2.0 query for cross-check: $0
  • Total: ~$44

Inside the $1,000 budget Kyle authorized for research. ~95% of the budget remains for unforeseen costs or for CarePrecise Advanced ($879 one-time) as a secondary hallucination-check signal layer if v1 surfaces invented NPIs.

§7 Legal posture — proceed with normal precautions

The case law is settled in favor of scrapers as of 2026. Logged-off scraping of public provider rosters for B2B sales targeting carries no realistic CFAA, HIPAA, state-privacy, or contract exposure.
  • hiQ Labs v. LinkedIn (9th Cir. 2022) — public-data scraping does not violate CFAA
  • Meta v. Bright Data (N.D. Cal. Jan 2024) — logged-off scrapers not bound by ToS
  • X Corp. v. Bright Data (N.D. Cal. May 2024) — state breach-of-contract preempted by federal copyright law
  • HIPAA does not apply — provider names + credentials + offices are not PHI
Required precautions, encoded in the pipeline:
  • Honest User-Agent
  • Rate limit of 1 req / 3 s / domain
  • Honor robots.txt disallows on provider pages (rare in practice)
  • Do not bypass CAPTCHAs or Akamai blocks
  • Do not republish bios verbatim or provider photos
No outside counsel review is warranted for this scope.
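The robots.txt precaution is covered by the standard library. A sketch using `urllib.robotparser`; the user-agent string is illustrative:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "PracticeEnrichmentBot/1.0"  # illustrative honest UA


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """True if `url` may be fetched under the given robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

In production the robots.txt body would be fetched once per domain (RobotFileParser can also do that itself via set_url + read) and cached alongside the rate-limit state, so the check costs nothing per page.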

§8 What's deferred to v2

Deferred v2 items are documented to keep them out of v1, not as build commitments. Source design for the recurring infrastructure: research/quarterly-reconciliation-architecture.pdf.

§9 Build sequence — what unblocks what

  1. Elite rollup ships to production (defines the 1,245-practice list).
  2. Sarah confirms scraping from Azure infra is OK (UA, rate limits).
  3. Phase A — Prototype against 3 sites (1 day).
  4. Phase B — Full 1,245-site batch run (1 day + ~24 hr Batch API).
  5. Phase C — Run the validator (no human time). Splits into three CSVs.
  6. Phase D — Admin imports auto_import.csv via Data Loader (~1 hour).
  7. Phase E — Reviewer walks human_review.csv (~6 hours).
  8. Phase F — Retrospective ~2 weeks post-import. Did sales convert any gap-filled prescribers? Decide v2 scope.