Plan · v1 · 2026-05-11

Practice Data Enrichment

One-time scrape of the 1,245 Elite-flagged practice websites to find gaps in our CRM: missing prescribers and missing secondary office locations. A Python validator splits extracted rows into auto-skip, auto-import, and human-review buckets, so most rows never need per-row human review.

Status: Draft · pending Sarah review · ~2 days build · ~$50 cost · validator-as-reviewer

  • 1,245 practice websites in scope (from the Elite-rollup output)
  • ~30% expected genuine gap-fill rate (~2,250 of ~7,500 extracted provider rows)
  • ~10% of rows needing human review (~6 hours of reviewer time, not 16)

§1 What this fills in — gap-fill, not roster-grab

We already have NPs, MDs, and PAs in CRM today, including many at Elite-flagged practices. This plan is not "find all the providers"; it is "find what's missing."

§2 Pipeline — five stages, validator is the review

The Python validator at Stage 4 is what answers "why doesn't a human need to review every row." Most rows clear the validator and route to auto_import.csv for a mechanical Data Loader upload; only ambiguous rows land in human_review.csv.

  1. URL discovery: read Account.Website; Google fallback for blanks; regex provider + location slugs. Output: ~2,500 URLs.
  2. Headless fetch: Playwright Chromium, readability cleanup. Rate limit 1 req / 3 s / domain. Honest UA. Skip Akamai 403s. Output: ~2,400 HTML documents.
  3. Batch LLM extract: Anthropic Batch API, Haiku 4.5, JSON schema, anti-hallucination prompt. ~24 hr turnaround. Output: 1,245 JSON docs.
  4. Python validator: the review layer. Luhn-check NPIs, fuzzy-match against existing Contacts, credential allowlist, confidence thresholds. Output: per-row verdict.
  5. Route to outputs: three CSVs split by verdict. Auto-import is Data Loader–ready; human review only for the ambiguous tail. Output: 3 CSVs.
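Stage 2's per-domain throttle can live in a small helper. A minimal sketch (the class name is illustrative; clock and sleep are injectable so it can be tested without waiting):

```python
import time


class DomainRateLimiter:
    """Allow at most one request per `interval` seconds per domain."""

    def __init__(self, interval: float = 3.0):
        self.interval = interval
        self._last: dict[str, float] = {}  # domain -> time of last fetch

    def wait(self, domain: str, now=time.monotonic, sleep=time.sleep) -> None:
        """Block until `domain` may be fetched again, then record the fetch."""
        t = now()
        last = self._last.get(domain)
        if last is not None:
            remaining = self.interval - (t - last)
            if remaining > 0:
                sleep(remaining)
                t = now()
        self._last[domain] = t
```

In the Playwright fetch loop this would be called with the URL's hostname before each navigation; different domains never block each other, so a first request to any domain proceeds immediately.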

§3 Scope — one tier, no Tier 2

Earlier drafts of this plan included EHR system detection, accreditations, languages spoken, sub-specialty taxonomies, office-manager extraction. Cut. None of that is what Tyler asked for. Scope is two things: missing prescribers + secondary locations. Stop.

provider record
One per provider found on the practice website. Required fields plus validator-populated fields:
practice_account_id, provider_name, credentials, specialty, npi, photo_url, bio_url, confidence, evidence_span, match_to_existing, validator_verdict
location record
One per office location per practice. Fills the secondary-location gap. Required fields plus validator-populated fields:
practice_account_id, location_label, street_address, city, state, zip, phone, fax, is_primary, confidence, evidence_span, match_to_existing, validator_verdict
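One way to carry a provider row through the pipeline is a typed record with the field names above. A sketch only: the types and defaults are assumptions, and the split between extracted and validator-populated fields in the comments is an assumption too.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProviderRecord:
    # Extracted at Stage 3 (assumed split; see the schema in the plan)
    practice_account_id: str
    provider_name: str
    credentials: str
    specialty: Optional[str] = None
    npi: Optional[str] = None
    photo_url: Optional[str] = None
    bio_url: Optional[str] = None
    confidence: float = 0.0
    evidence_span: str = ""
    # Populated by the validator at Stage 4 (assumed split)
    match_to_existing: Optional[str] = None  # Contact Id if matched
    validator_verdict: str = ""              # auto_skip / auto_import / human_review
```

The location record would get an analogous dataclass with its own fields.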

§4 How the validator routes rows

For each extracted row, the Python validator applies confidence thresholds, NPI Luhn checks, fuzzy matching against existing Salesforce Contacts under the same AccountId, and a credential allowlist. Expected split across ~7,500 extracted provider rows:
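The NPI check digit is standard Luhn computed over the 10-digit number prefixed with the card-issuer identifier 80840, so the check is a few lines:

```python
def npi_luhn_valid(npi: str) -> bool:
    """True if `npi` is a well-formed 10-digit NPI with a valid Luhn check digit.

    NPIs validate as plain Luhn over '80840' + NPI, where 80840 is the
    card-issuer prefix assigned to US health identifiers.
    """
    if len(npi) != 10 or not npi.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed("80840" + npi)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:       # equivalent to summing the two digits
                d -= 9
        total += d
    return total % 10 == 0
```

This catches transcription-style hallucinations (a single wrong digit) but not an invented NPI that happens to checksum correctly; the fuzzy match and confidence threshold cover the rest.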

~60% · auto_skip.csv
Already in CRM
NPI exact match OR high-confidence name fuzzy match (Jaro-Winkler ≥ 0.92). We already have this person. No action.
  • ~4,500 rows
  • Audit-only — reviewer skims for false positives
~30% · auto_import.csv
Genuine gap fill
High-confidence row, no existing match. Credential in allowlist. The actual prescribers we want to add.
  • ~2,250 rows
  • Data Loader–ready — admin uploads, no per-row review
~10% · human_review.csv
Ambiguous
Low confidence, ambiguous fuzzy match (Jaro-Winkler 0.75–0.92), NPI failing the Luhn check, credential outside the allowlist, or missing required fields.
  • ~750 rows
  • Reviewer works the queue at ~30 sec per row ≈ 6 hours
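The three-way split above can be sketched as a pure routing function. The 0.92 and 0.75 Jaro-Winkler bounds come from the buckets above; the allowlist contents and the confidence floor are illustrative assumptions.

```python
ALLOWED_CREDENTIALS = {"MD", "DO", "NP", "PA"}  # illustrative allowlist
CONFIDENCE_FLOOR = 0.80                          # assumed auto-import threshold


def route_row(
    npi_exact_match: bool,
    name_similarity: float,  # Jaro-Winkler vs. existing Contacts on the Account
    npi_luhn_ok: bool,
    credential: str,
    confidence: float,
    has_required_fields: bool,
) -> str:
    """Return the output CSV a row routes to."""
    # Already in CRM: exact NPI match or a very strong name match.
    if npi_exact_match or name_similarity >= 0.92:
        return "auto_skip.csv"
    # Anything ambiguous goes to the human tail.
    if (
        not has_required_fields
        or not npi_luhn_ok
        or credential not in ALLOWED_CREDENTIALS
        or confidence < CONFIDENCE_FLOOR
        or 0.75 <= name_similarity < 0.92
    ):
        return "human_review.csv"
    # High confidence, no existing match, allowlisted credential: genuine gap fill.
    return "auto_import.csv"
```

Note the ordering: the skip check runs first, so a confident match to an existing Contact never reaches the review queue even if some other field is odd.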

§5 Three prototype targets — validate before the full run

If the pipeline handles these three cleanly, it handles the remaining 1,242. Selected from the deep research to cover each difficulty tier.

Easy
Pinnacle ENT Associates
pentadocs.com
WordPress archetype. 9 locations, ~30–50 providers. Slug-based URLs, likely open /wp-json/wp/v2/ REST API. Server-rendered HTML.
Medium
South Florida ENT Associates
sfenta.org
Modern SPA archetype. DatoCMS + Next.js/React. Tab-controlled providers index (?tab=providers). 50+ providers, 30+ locations.
Hard
Cleveland Clinic Head & Neck Institute
my.clevelandclinic.org/...
Hospital-network archetype. Sitecore + React, Akamai Bot Manager. Tests graceful degradation — correct behavior is log + skip, not anti-bot evasion.

§6 Cost

All-in under $50 for the one-time pass.

One-time v1 line items:
  • Claude Haiku 4.5 Batch API (~37M tokens): ~$24
  • Azure Container Instance compute (Playwright + validator): ~$15
  • Azure Blob Storage (~1 GB): <$5
  • Salesforce Bulk API 2.0 query for cross-check: $0
  • Total: ~$44

Inside the $1,000 budget Kyle authorized for research. ~95% of the budget remains for unforeseen costs or for CarePrecise Advanced ($879 one-time) as a secondary hallucination-check signal layer if v1 surfaces invented NPIs.

§7 Legal posture — proceed with normal precautions

The case law is settled in favor of scrapers as of 2026. Logged-off scraping of public provider rosters for B2B sales targeting carries no realistic CFAA, HIPAA, state-privacy, or contract exposure.
  • hiQ Labs v. LinkedIn (9th Cir. 2022) — public-data scraping does not violate CFAA
  • Meta v. Bright Data (N.D. Cal. Jan 2024) — logged-off scrapers not bound by ToS
  • X Corp. v. Bright Data (N.D. Cal. May 2024) — state breach-of-contract preempted by federal copyright law
  • HIPAA does not apply — provider names + credentials + offices are not PHI
Required precautions, encoded in the pipeline:
  • Honest User-Agent
  • Rate limit of 1 req / 3 s / domain
  • Honor robots.txt disallows on provider pages (rare in practice)
  • Do not bypass CAPTCHAs or Akamai blocks
  • Do not republish bios verbatim or provider photos
No outside counsel review is warranted for this scope.
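The robots.txt precaution is covered by the standard library. A sketch using `urllib.robotparser`; the user-agent string is illustrative:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "PracticeEnrichmentBot/1.0"  # illustrative honest UA


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """True if `url` may be fetched under the given robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

In production the robots.txt body would be fetched once per domain (RobotFileParser can also do that itself via set_url + read) and cached alongside the rate-limit state, so the check costs nothing per page.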

§8 What's deferred to v2

Deferred v2 items are documented to keep them out of v1, not as build commitments. Source design for the recurring infrastructure: research/quarterly-reconciliation-architecture.pdf.

§9 Build sequence — what unblocks what

  1. Elite rollup ships to production (defines the 1,245-practice list).
  2. Sarah confirms scraping from Azure infra is OK (UA, rate limits).
  3. Phase A — Prototype against 3 sites (1 day).
  4. Phase B — Full 1,245-site batch run (1 day + ~24 hr Batch API).
  5. Phase C — Run the validator (no human time). Splits into three CSVs.
  6. Phase D — Admin imports auto_import.csv via Data Loader (~1 hour).
  7. Phase E — Reviewer walks human_review.csv (~6 hours).
  8. Phase F — Retrospective ~2 weeks post-import. Did sales convert any gap-filled prescribers? Decide v2 scope.