One-time scrape of the 1,245 Elite-flagged practice websites to find
gaps in our CRM — missing prescribers and missing secondary office
locations. A Python validator splits rows into auto-import vs
human-review so most rows skip per-row review.
Draft — pending Sarah review · ~2 days build · ~$50 cost · Validator-as-reviewer
1,245
Practice websites in scope
From the Elite-rollup output
~30%
Expected genuine gap-fill rate
~2,250 of ~7,500 extracted provider rows
~10%
Rows needing human review
~6 hours of reviewer time, not 16
§1 What this fills in — gap-fill, not roster-grab
We already have NPs, MDs, PAs in CRM today —
including many at Elite-flagged practices. This plan is
not "find all the providers." It's
find what's missing.
Missing prescribers.
Some prescribers — usually NPs and PAs added more recently,
sometimes longstanding gaps in our coverage — work at Elite
practices but aren't tied to that Account in CRM. Sales has no way
to discover them short of manually browsing practice websites.
Missing secondary office locations.
492 Elite Prescribers cover multiple practices per the rollup audit.
Primary-AccountId rollup misses them. NPI Registry is too stale to
trust here; practice websites are the most current source, since
practices have a strong incentive to keep "Our Providers" pages
current for new-patient acquisition.
Fax is load-bearing for a compounding pharmacy
(Rx routing) — high priority for the locations capture.
§2 Pipeline — five stages, the validator is the review
The Python validator at Stage 4 is what answers "we don't need a human
reviewing every row." Most rows clear the validator and route to
auto_import.csv for a no-thinking Data Loader upload.
Only ambiguous rows hit human_review.csv.
Stage 1
URL discovery
Read Account.Website; Google fallback for blanks;
regex provider + location slugs.
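The slug regexes for Stage 1 could look like the sketch below. The patterns are illustrative assumptions — the actual slug list should come from the prototype run, not this guess.

```python
import re

# Hypothetical slug patterns -- tune against the three prototype sites.
PROVIDER_SLUG = re.compile(
    r"/(our-)?(providers?|team|staff|physicians?|meet-the-team)(/|$)", re.I
)
LOCATION_SLUG = re.compile(
    r"/(locations?|offices?|contact(-us)?|find-us)(/|$)", re.I
)

def classify_links(links):
    """Bucket candidate URLs from a practice homepage into
    provider-roster pages vs office-location pages."""
    buckets = {"provider": [], "location": []}
    for url in links:
        if PROVIDER_SLUG.search(url):
            buckets["provider"].append(url)
        elif LOCATION_SLUG.search(url):
            buckets["location"].append(url)
    return buckets
```

A page matching neither bucket is simply skipped; the Google fallback only fires when `Account.Website` is blank.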
Stage 4
Validator
The review layer: Luhn-check NPIs, fuzzy-match against existing
Contacts, credential allowlist, confidence threshold.
Per-row verdict
Stage 5
Route to outputs
Three CSVs split by verdict. Auto-import is Data Loader–ready.
Human-review only for the ambiguous tail.
3 CSVs
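The Stage 4 NPI check is standard: a Luhn checksum over the NPI prefixed with the card-issuer constant 80840, per the CMS check-digit scheme. A minimal sketch:

```python
def npi_luhn_ok(npi: str) -> bool:
    """Validate a 10-digit NPI via the Luhn algorithm over '80840' + NPI
    (the CMS check-digit scheme). Catches typos and many hallucinated
    NPIs -- it does NOT prove the number is actually assigned."""
    if len(npi) != 10 or not npi.isdigit():
        return False
    digits = [int(d) for d in "80840" + npi]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Rows that fail this check route to human_review.csv rather than being dropped, since a scraper OCR-style typo may still point at a real provider.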
§3 Scope — one tier, no Tier 2
Earlier drafts of this plan included EHR system detection,
accreditations, languages spoken, sub-specialty taxonomies,
office-manager extraction. Cut. None of that is what
Tyler asked for. Scope is two things: missing prescribers + secondary
locations. Stop.
§4 The provider record
One per provider found on the practice website. Required fields in
blue, validator-populated in amber.
For each extracted row, the Python validator applies confidence
thresholds, NPI Luhn checks, fuzzy matching against existing
Salesforce Contacts under the same AccountId, and a credential
allowlist. Expected split across ~7,500 extracted provider rows:
~60% · auto_skip.csv
Already in CRM
NPI exact match OR high-confidence name fuzzy match (Jaro-Winkler
≥ 0.92). We already have this person. No action.
~4,500 rows
Audit-only — reviewer skims for false positives
~30% · auto_import.csv
Genuine gap fill
High-confidence row, no existing match. Credential in allowlist.
The actual prescribers we want to add.
~2,250 rows
Data Loader–ready — admin uploads, no per-row review
~10% · human_review.csv
Ambiguous
Low confidence, ambiguous fuzzy match (0.75–0.92), invalid Luhn
NPI, weird credential, or missing required fields.
~750 rows
Reviewer walks at ~30 sec each = ~6 hours
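The three-way split above can be sketched as a pure-Python routing function. The Jaro-Winkler implementation below is self-contained (the real pipeline might use a library such as jellyfish instead), and the allowlist is illustrative; thresholds match the split described (≥ 0.92 skip, 0.75–0.92 review).

```python
ALLOWED_CREDS = {"MD", "DO", "NP", "PA"}  # illustrative allowlist

def jaro_winkler(s1: str, s2: str) -> float:
    """Pure-Python Jaro-Winkler similarity (prefix scale 0.1, max prefix 4)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(0, max(len1, len2) // 2 - 1)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):                    # count matches in window
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                                   # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    jaro = (matches / len1 + matches / len2 + (matches - t / 2) / matches) / 3
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * 0.1 * (1 - jaro)

def verdict(row: dict, existing_names: list) -> str:
    """Route one extracted row. The full validator also applies the NPI
    Luhn check and required-field checks before this name match."""
    best = max(
        (jaro_winkler(row["name"].lower(), n.lower()) for n in existing_names),
        default=0.0,
    )
    if best >= 0.92:
        return "auto_skip"      # already in CRM
    if best >= 0.75:
        return "human_review"   # ambiguous fuzzy match
    if row.get("credential") in ALLOWED_CREDS:
        return "auto_import"    # genuine gap fill
    return "human_review"       # unexpected credential
```

NPI exact match (not shown) short-circuits to auto_skip before any name comparison runs.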
§5 Three prototype targets — validate before the full run
If the pipeline handles these three cleanly, it handles the remaining
1,242. Selected from the deep research to cover each difficulty tier.
Inside the $1,000 budget Kyle authorized for research. ~95% of the
budget remains for unforeseen costs or for CarePrecise Advanced ($879
one-time) as a secondary hallucination-check signal layer if v1
surfaces invented NPIs.
§7 Legal posture — proceed with normal precautions
Case law has settled in favor of scrapers as of 2026. Logged-off
scraping of public provider rosters for B2B sales targeting has no
realistic CFAA, HIPAA, state-privacy, or contract exposure.
hiQ Labs v. LinkedIn (9th Cir. 2022) — public-data
scraping does not violate CFAA
Meta v. Bright Data (N.D. Cal. Jan 2024) — logged-off
scrapers not bound by ToS
X Corp. v. Bright Data (N.D. Cal. May 2024) — state
breach-of-contract preempted by federal copyright law
HIPAA does not apply — provider names + credentials + offices are
not PHI
Required precautions, encoded in the pipeline:
Honest User-Agent, rate-limit 1 req/3s/domain, honor robots.txt
disallows on provider pages (rare in practice), do not bypass
CAPTCHAs/Akamai blocks, do not republish bios verbatim or photos.
No outside counsel review warranted for this scope.
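The precautions above can be encoded directly in the fetcher. A minimal sketch, assuming a hypothetical User-Agent string and that robots.txt is fetched separately per domain; the injectable clock/sleep is just to keep the class testable.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

# Hypothetical honest UA -- substitute the real contact address.
USER_AGENT = "GapFill-Scraper/1.0 (+mailto:dataops@example.com)"

class PoliteFetcher:
    """Honors robots.txt disallows and enforces >= 3 s between
    requests to the same domain."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable for testing
        self.sleep = sleep
        self.last_hit = {}      # domain -> timestamp of last request
        self.robots = {}        # domain -> parsed RobotFileParser

    def allowed(self, url, robots_txt_lines):
        """Check a URL against its domain's robots.txt (lines fetched elsewhere)."""
        domain = urlparse(url).netloc
        rp = self.robots.get(domain)
        if rp is None:
            rp = robotparser.RobotFileParser()
            rp.parse(robots_txt_lines)
            self.robots[domain] = rp
        return rp.can_fetch(USER_AGENT, url)

    def wait_turn(self, url):
        """Block until the per-domain rate limit allows the next request."""
        domain = urlparse(url).netloc
        elapsed = self.clock() - self.last_hit.get(domain, float("-inf"))
        if elapsed < self.min_interval:
            self.sleep(self.min_interval - elapsed)
        self.last_hit[domain] = self.clock()
```

A CAPTCHA or Akamai block is treated as a hard stop for that domain, not something to route around.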
§8 Deferred — not in v1
Quarterly reconciliation engine
— Logic App + Container Apps Jobs + Splink fuzzy matching +
Salesforce custom review objects + Composite-API write-back with
idempotency. Annual infrastructure $240–$1,005. Build only if v1
proves recurring value.
Autonomous Salesforce write-back via API.
v1 outputs Data Loader–ready CSV; admin runs Data Loader manually.
Direct API integration doesn't materially change v1's reviewer
burden — defer until quarterly recurrence makes the engineering
worth it.
Tier 2 enrichment — EHR
system detection, sub-specialty taxonomy, languages spoken,
office-manager extraction, accreditations. Not requested. Don't
build.
AccountContactRelation backfill
for multi-practice prescribers. The locations CSV surfaces the gaps;
admin decides which to add manually.
NPPES cross-validation of
extracted NPIs (as a hallucination check, not source). Build only if
NPI hallucinations show up in v1.
§9 Build sequence — what unblocks what
Elite rollup ships to production
(defines the 1,245-practice list).
Sarah confirms scraping from Azure infra is OK
(UA, rate limits).
Phase A — Prototype
against 3 sites (1 day).
Phase B — Full 1,245-site batch run
(1 day + ~24 hr Batch API).
Phase C — Run the validator
(no human time). Splits into three CSVs.
Phase D — Admin imports auto_import.csv
via Data Loader (~1 hour).
Phase E — Reviewer walks human_review.csv
(~6 hours).
Phase F — Retrospective
~2 weeks post-import. Did sales convert any gap-filled prescribers?
Decide v2 scope.