---
title: Practice data enrichment via website scrape
last_verified: 2026-05-11
verification_type: eyeball
verification_ttl_days: 30
status: draft
---

# Practice Data Enrichment — v1

> **Parent program:** [`docs/plans/account-enhancement/`](../plan.md).
> **Status:** draft — pending Sarah's review and the ship of [`elite-prescriber-rollup/`](../elite-prescriber-rollup/) (which produces the practice list this plan operates on).
> **Owner:** Kyle. **Build venue:** existing AdvancedRx Azure platform.
> **Build effort:** ~2 focused days of engineering. ~$50 in API/compute for the one-time pass.

## 1. Goal

For each of the ~1,245 Practice Accounts flagged `Has_ENT_Elite_Prescriber__c = TRUE`, visit the practice's website once and find **the gaps in what we already have** — prescribers working at the practice who aren't in our CRM yet, plus secondary office locations we don't have on file. A Python validator scores each row against existing Salesforce data; clean rows go straight to a Data Loader–ready CSV, uncertain rows go to a smaller human-review CSV.

## 2. What we're actually filling in

We **already have** NPs / MDs / PAs in our CRM today — including many at Elite-flagged practices. This plan is **not** "find all the providers." It's **gap-fill**: surface what's on each practice's website that doesn't already match a Contact under that Account.

Two real gaps practice websites can close, both surfaced during the rollup audit:

- **Missing prescribers.** Some prescribers — usually NPs and PAs hired recently, occasionally longstanding coverage gaps — work at Elite practices but aren't tied to that Account in CRM. Sales has no way to discover them short of manually browsing the practice's website. Practice websites are the freshest source because practices keep their "Our Providers" pages current for new-patient acquisition.
- **Missing secondary office locations.** The rollup audit found 492 Elite Prescribers with non-direct `AccountContactRelation` rows — these are prescribers who cover multiple practices. The primary-`AccountId` rollup misses them. NPI Registry is too stale to trust here (Kyle, 2026-05-09). Practice websites list every location the practice operates at.

For each gap row found, the Python validator (§4.3) determines whether it's clean enough to auto-import or needs a human eye.

## 3. Scope — what we extract

**One tier. No Tier 2.** Earlier drafts of this plan included EHR system detection, accreditations, languages spoken, sub-specialty taxonomies, office-manager extraction. **Cut.** None of that is what Tyler asked for. Scope = missing prescribers + secondary locations. Stop.

The extraction emits two output object shapes (not CSVs; the Python validator decides where each row routes after extraction).

### 3.1 Provider records

| Field                 | Type          | Notes                                                                                                                                                       |
| --------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `practice_account_id` | Salesforce Id | From the seed CSV                                                                                                                                           |
| `practice_name`       | string        | From the seed CSV                                                                                                                                           |
| `practice_website`    | URL           | The URL scraped                                                                                                                                             |
| `provider_name`       | string        | Full name as it appeared on the page                                                                                                                        |
| `credentials`         | string        | MD / DO / NP / PA-C / AuD / DPT / etc.                                                                                                                      |
| `specialty`           | string        | "Otolaryngology" / "Allergy & Immunology" / etc. — verbatim from the page                                                                                   |
| `npi`                 | string(10)    | **Will be null for most rows.** Strict anti-hallucination prompt prevents inventing 10-digit numbers. Luhn-validated post-extraction; invalid drops to null |
| `photo_url`           | URL           | Preserved from `<img src>`                                                                                                                                  |
| `bio_url`             | URL           | If linked from the provider listing                                                                                                                         |
| `confidence`          | number(0–1)   | LLM-reported per-row confidence                                                                                                                             |
| `evidence_span`       | string(≤200)  | Verbatim source text supporting this row                                                                                                                    |
| `match_to_existing`   | enum          | `npi_exact` / `name_fuzzy_high` / `name_fuzzy_low` / `none` — populated by the validator                                                                    |
| `validator_verdict`   | enum          | `auto_import` / `auto_skip` / `human_review` — populated by the validator                                                                                   |

### 3.2 Location records

| Field                 | Type            | Notes                                                                |
| --------------------- | --------------- | -------------------------------------------------------------------- |
| `practice_account_id` | Salesforce Id   |                                                                      |
| `practice_name`       | string          |                                                                      |
| `location_label`      | string          | "Main Office" / "North Office" — as listed                           |
| `street_address`      | string          |                                                                      |
| `city`                | string          |                                                                      |
| `state`               | string(2)       |                                                                      |
| `zip`                 | string(5 or 10) | Preserve ZIP+4                                                       |
| `phone`               | string          | E.164 if extractable                                                 |
| `fax`                 | string          | E.164 if extractable. **High priority — Rx routing depends on this** |
| `is_primary`          | bool            |                                                                      |
| `confidence`          | number(0–1)     |                                                                      |
| `evidence_span`       | string(≤200)    |                                                                      |
| `match_to_existing`   | enum            | `exact` / `fuzzy` / `none`                                           |
| `validator_verdict`   | enum            | `auto_import` / `auto_skip` / `human_review`                         |

That's it. No other fields. No `practice_summary.csv`. No EHR system. No languages. No accreditations.

## 4. Pipeline architecture

Source-of-truth design lives in [`research/build-vs-buy-analysis.pdf`](research/build-vs-buy-analysis.pdf). This section is the buildable summary, updated to include the validator stage Kyle called for.

### 4.1 Five-stage pipeline (each stage idempotent and resumable)

```
Stage 1: URL discovery       1,245 Practices → ~2,500 candidate URLs
        │   Read Account.Website + heuristic page-type discovery
        ▼
Stage 2: Headless fetch       Playwright + readability-style cleanup
        │   Rate-limit, honest UA, log 403s/skip Akamai
        ▼   Output: ~2,400 cleaned HTML blobs in Blob Storage
Stage 3: Batch LLM extract    Anthropic Batch API, all 1,245 sites, one job
        │   Haiku 4.5 + JSON-schema-constrained, anti-hallucination prompt
        ▼   Output: 1,245 JSON documents
Stage 4: Python validator     Rules + Salesforce cross-check (THIS IS THE REVIEW)
        │   Luhn-check NPIs, fuzzy-match against existing Contacts,
        │   credential allowlist, confidence threshold
        ▼   Output: per-row verdict (auto_import / auto_skip / human_review)
Stage 5: Route to outputs     Three CSVs:
        │     • auto_import.csv  — Data Loader–ready, no human review
        │     • human_review.csv — smaller; only ambiguous rows
        │     • auto_skip.csv    — audit-only; rows already in CRM
        ▼   Drop all three to SharePoint
```

The Python validator at Stage 4 is what answers the "we don't need a human reviewing every row" question. Most rows clear the validator and go straight to `auto_import.csv` for a no-thinking Data Loader upload. Only ambiguous rows hit `human_review.csv`.

### 4.2 Stages 1–3 — unchanged from prior research

URL discovery, Playwright fetch with rate-limiting and Akamai-skip handling, and the Haiku 4.5 batch extraction work exactly as `research/build-vs-buy-analysis.pdf` §6 (Approach 1) recommends. The LLM prompt is in §4.4 below; the JSON schema mirrors the field tables in §3.
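
A minimal sketch of the Stage 2 fetch loop using Playwright's sync Python API. The flat 3-second sleep stands in for a real per-domain limiter, robots.txt checking is omitted, and `fetch_pages` is an illustrative name:

```python
import time
from playwright.sync_api import sync_playwright

UA = "AdvancedRxBot/1.0; +https://advancedrx.com/bot; contact@advancedrx.co"

def fetch_pages(urls: list[str]) -> dict[str, str]:
    """Fetch each candidate URL; log 403s (Akamai) and move on."""
    pages = {}
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context(user_agent=UA)
        page = context.new_page()
        for url in urls:
            try:
                resp = page.goto(url, timeout=30_000)
            except Exception as exc:          # timeouts etc.: log and move on
                print(f"skip {url}: {exc}")
                time.sleep(3)
                continue
            if resp is None or resp.status == 403:
                print(f"skip {url}: {resp.status if resp else 'no response'}")
            else:
                pages[url] = page.content()   # raw HTML; readability cleanup follows
            time.sleep(3)                     # crude stand-in for per-domain limiting
        browser.close()
    return pages
```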

### 4.3 Stage 4 — Python validator (the review layer)

A single Python script — call it `validate.py` — reads the extraction JSON, queries Salesforce once for all existing Contacts and addresses under the 1,245 practice Account.Ids (one Bulk API 2.0 query, ~10 seconds), runs the rules below, and writes the three output CSVs. **The script IS the review.**
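
A sketch of the one-shot cross-check query, shown via simple_salesforce's plain REST query for brevity where the plan specifies Bulk API 2.0. `NPI__c` is a hypothetical custom-field name, and with 1,245 Ids the IN clause may need chunking to stay under SOQL length limits:

```python
from simple_salesforce import Salesforce

sf = Salesforce(username="...", password="...", security_token="...")

account_ids: list[str] = []  # populate from the seed CSV of 1,245 Account.Ids
soql = (
    "SELECT Id, AccountId, Name, NPI__c FROM Contact "
    "WHERE AccountId IN ({})".format(",".join(f"'{i}'" for i in account_ids))
)
existing_contacts = sf.query_all(soql)["records"]  # one round trip per chunk
```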

For each extracted **provider** record:

1. **Confidence floor.** If `confidence < 0.75` → `human_review`. Stop.
2. **NPI sanity.** If `npi` is non-null, validate Luhn check digit (per `research/quarterly-reconciliation-architecture.pdf` Section 3, the NPI format spec). Invalid Luhn → drop NPI to null, set `validator_verdict = human_review`.
3. **Match to existing Contact under this AccountId.**
   - If `npi` matches an existing Contact's NPI → `match_to_existing = npi_exact`, `validator_verdict = auto_skip` (we already have this person).
   - Else if name (normalized) fuzzy-matches an existing Contact at ≥ 0.92 Jaro-Winkler → `name_fuzzy_high`, `auto_skip`.
   - Else if name fuzzy-matches at 0.75–0.92 → `name_fuzzy_low`, `human_review` (ambiguous — same person with credentials misread? or different person?).
   - Else → `none`, **`auto_import`** (the gap-fill we wanted).
4. **Credential sanity.** If `credentials` is in the allowlist `{MD, DO, NP, NP-C, PA, PA-C, AuD, DPT, RN, LCSW, PhD}`, allow auto-import. If something weird (e.g., "Office Manager", "Marketing Director"), force `auto_skip` with a note — the LLM probably misclassified office staff.
5. **Provider-vs-staff signal.** If the section header that surrounded the `evidence_span` contains obvious staff signals ("Office Staff", "Management Team", "Our Team"), force `auto_skip` regardless of credentials.
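
A sketch of how `validate.py` might implement rules 1–5. The NPI check digit follows the CMS spec (standard Luhn over the 9-digit base prefixed with the issuer constant 80840); the `jellyfish` dependency and the field names on `existing_contacts` rows are assumptions:

```python
import jellyfish  # assumed dependency for Jaro-Winkler; rapidfuzz would also work

ALLOWED_CREDS = {"MD", "DO", "NP", "NP-C", "PA", "PA-C",
                 "AuD", "DPT", "RN", "LCSW", "PhD"}
STAFF_SIGNALS = ("office staff", "management team", "our team")

def npi_is_valid(npi: str) -> bool:
    """CMS NPI check digit: Luhn over the 9-digit base prefixed with 80840."""
    if not (npi.isdigit() and len(npi) == 10):
        return False
    total = 0
    for i, d in enumerate(int(c) for c in reversed("80840" + npi[:9])):
        if i % 2 == 0:            # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10 == int(npi[9])

def provider_verdict(row: dict, existing_contacts: list[dict],
                     section_header: str = "") -> tuple[str, str]:
    """Apply rules 1-5; returns (match_to_existing, validator_verdict)."""
    if row["confidence"] < 0.75:                       # rule 1: confidence floor
        return "none", "human_review"
    if row.get("npi") and not npi_is_valid(row["npi"]):
        row["npi"] = None                              # rule 2: drop bad NPI, escalate
        return "none", "human_review"
    # Rule 3: match against existing Contacts under this AccountId.
    if row.get("npi") and any(c.get("npi") == row["npi"] for c in existing_contacts):
        match, verdict = "npi_exact", "auto_skip"
    else:
        best = max((jellyfish.jaro_winkler_similarity(
                        row["provider_name"].lower(), c["name"].lower())
                    for c in existing_contacts), default=0.0)
        if best >= 0.92:
            match, verdict = "name_fuzzy_high", "auto_skip"
        elif best >= 0.75:
            match, verdict = "name_fuzzy_low", "human_review"
        else:
            match, verdict = "none", "auto_import"     # the gap-fill we wanted
    if row.get("credentials") not in ALLOWED_CREDS:    # rule 4: credential allowlist
        verdict = "auto_skip"                          # likely misclassified staff
    if any(s in section_header.lower() for s in STAFF_SIGNALS):
        verdict = "auto_skip"                          # rule 5: staff-section signal
    return match, verdict
```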

For each extracted **location** record:

1. **Confidence floor.** Same `< 0.75` → `human_review` rule.
2. **Required-field check.** If `street_address` or `zip` is null → `human_review`.
3. **Match against existing addresses under this AccountId** (Account.BillingStreet, Account.ShippingStreet, ContactPointAddress rows): exact ZIP+4 match with street similarity ≥ 0.85 → `auto_skip`. Less than 0.85 with same ZIP5 → `human_review`. Different ZIP5 → `auto_import`.
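
The location rules in the same style. The plan doesn't pin down a street-similarity metric, so Jaro-Winkler is reused here as an assumption, and the field names on `existing_addresses` rows are illustrative:

```python
def location_verdict(row: dict, existing_addresses: list[dict]) -> tuple[str, str]:
    """Apply the location rules; `existing_addresses` rows carry 'street'
    and 'zip' pulled from Account billing/shipping and ContactPointAddress."""
    if row["confidence"] < 0.75:                       # rule 1: confidence floor
        return "none", "human_review"
    if not row.get("street_address") or not row.get("zip"):
        return "none", "human_review"                  # rule 2: required fields
    # Rule 3: compare against known addresses under this AccountId.
    same_zip5 = [a for a in existing_addresses if a["zip"][:5] == row["zip"][:5]]
    for addr in same_zip5:
        street_sim = jellyfish.jaro_winkler_similarity(
            row["street_address"].lower(), addr["street"].lower())
        if addr["zip"] == row["zip"] and street_sim >= 0.85:
            return "exact", "auto_skip"                # already on file
    if same_zip5:
        return "fuzzy", "human_review"                 # same ZIP5, streets disagree
    return "none", "auto_import"                       # genuinely new location
```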

### 4.4 LLM prompt (Stage 3)

Verbatim:

```
You are an information extraction system for ENT medical practice websites.

You will be given the cleaned text content of one or two pages from a single
medical practice's website (a "providers" page listing physicians and staff,
and/or a "locations" page listing office addresses), plus the practice's
known name and Salesforce Account.Id for reference.

Extract a structured JSON object matching the provided schema.

CRITICAL extraction rules:

1. If a field is not present on the page, return null. Do NOT guess, infer,
   or complete plausible-looking values. This is especially important for
   NPI (10-digit numbers) — return null unless explicitly stated on the page.

2. Only extract people who are clearly clinical providers (physicians, NPs,
   PAs, audiologists, speech pathologists). Do NOT extract office staff,
   receptionists, billing personnel, or marketing staff as providers.

3. For each extracted field where applicable, populate `confidence` (0.0–1.0)
   and `evidence_span` (the verbatim text from the source page that supports
   the extraction, max 200 characters).

4. If the page appears to be about a different medical practice or specialty
   (e.g., the page name doesn't match the provided practice name, or the
   specialty is something other than ENT/allergy/audiology), return all
   fields as null and set `extraction_status = "wrong_page_likely"`.

5. If the providers page lists more than 50 providers, it is likely a
   directory page, not a single practice's roster — return null providers
   and set `notes_for_review = "possible directory page — >50 providers"`.

6. Phone numbers: prefer E.164 format (+1XXXXXXXXXX). If you cannot parse
   confidently, return verbatim from the page.

Return ONLY the JSON object matching the provided schema. No preamble,
no explanation, no markdown fences.
```
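
A sketch of the Stage 3 submission, assuming the Anthropic Python SDK's Message Batches API. The model id, the `EXTRACTION_PROMPT` constant (the verbatim prompt above), and the `cleaned_pages` mapping are assumptions; wiring for JSON-schema-constrained output is omitted:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# cleaned_pages: {account_id: cleaned page text} produced by Stage 2.
requests = [
    {
        "custom_id": account_id,          # keys each result back to its Account
        "params": {
            "model": "claude-haiku-4-5",  # assumed model id for Haiku 4.5
            "max_tokens": 2000,
            "system": EXTRACTION_PROMPT,  # the verbatim prompt above
            "messages": [{"role": "user", "content": page_text}],
        },
    }
    for account_id, page_text in cleaned_pages.items()
]

batch = client.messages.batches.create(requests=requests)
print(batch.id)  # poll until processing_status == "ended", then fetch results
```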

Token budget per site: ~30K input + ~1.5K output. Total: ~37M input + ~1.9M output across 1,245 sites. Cost at Haiku 4.5 batch rates ≈ **$24** for the run.
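
The arithmetic behind that figure, with per-MTok batch rates back-solved from ~$24; treat the rates as placeholders and substitute published Haiku 4.5 batch pricing before budgeting:

```python
SITES = 1_245
IN_TOK, OUT_TOK = 30_000, 1_500        # per-site budget above
RATE_IN, RATE_OUT = 0.50, 2.50         # $/MTok, assumed (back-solved, not quoted)

total_in = SITES * IN_TOK / 1e6        # ~37.4 MTok input
total_out = SITES * OUT_TOK / 1e6      # ~1.9 MTok output
print(f"${total_in * RATE_IN + total_out * RATE_OUT:.2f}")  # $23.34, i.e. ~$24
```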

### 4.5 Stage 5 — three output CSVs

- `auto_import.csv` — providers + locations with `validator_verdict = auto_import`. Data Loader–ready. Admin uploads via Data Loader; one round trip, no per-row review.
- `human_review.csv` — providers + locations with `validator_verdict = human_review`. Sorted by confidence ascending (lowest first). Reviewer eyeballs ambiguous rows.
- `auto_skip.csv` — for audit. Rows we already have. Reviewer skims to catch any false-positive matches.
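
Stage 5 is a plain group-and-write; a standard-library sketch (provider and location rows would be routed in separate passes, since their columns differ):

```python
import csv
from collections import defaultdict

def route(rows: list[dict], out_dir: str = ".") -> None:
    """Split validated rows into auto_import / human_review / auto_skip CSVs."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["validator_verdict"]].append(row)
    for verdict, bucket in buckets.items():
        if verdict == "human_review":
            bucket.sort(key=lambda r: r["confidence"])  # lowest confidence first
        with open(f"{out_dir}/{verdict}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(bucket[0].keys()))
            writer.writeheader()
            writer.writerows(bucket)
```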

**Expected split** based on the rollup audit numbers (1,245 practices × ~6 providers/practice = ~7,500 provider rows extracted):

- **~60% already in CRM** (NPI or high-confidence name match) → `auto_skip` (~4,500 rows). Pure audit volume; the reviewer skims.
- **~30% genuine gap fills** with high confidence → `auto_import` (~2,250 rows). No per-row review.
- **~10% ambiguous** → `human_review` (~750 rows). Reviewer walks at ~30 sec each = **~6 hours** of human review.

That's **~90% fewer rows needing per-row review** (750 of ~7,500) than a flat "every row goes to human review" design, while keeping a human on the calls that actually need judgment.

## 5. Cost and budget

| Line                                                      | One-time v1 |
| --------------------------------------------------------- | ----------: |
| Claude Haiku 4.5 Batch API (~37M tokens)                  |        ~$24 |
| Azure Container Instance compute (Playwright + validator) |        ~$15 |
| Azure Blob Storage (cleaned HTML + JSON)                  |         <$5 |
| Salesforce Bulk API 2.0 query for cross-check             |          $0 |
| **Total**                                                 |    **~$44** |

## 6. Three prototype targets — validate before the full run

From [`research/build-vs-buy-analysis.pdf`](research/build-vs-buy-analysis.pdf) §10. Covers the three difficulty tiers.

1. **Pinnacle ENT Associates** (pentadocs.com) — WordPress archetype, easy.
2. **South Florida ENT Associates** (sfenta.org) — modern SPA, medium.
3. **Cleveland Clinic Head & Neck Institute** — Akamai-blocked hospital network, hard. Tests graceful degradation.

If the pipeline handles all three cleanly — including the validator splitting their output sensibly into auto_import vs human_review — it'll handle the remaining 1,242.

## 7. Legal posture — proceed with normal precautions

Source: [`research/build-vs-buy-analysis.pdf`](research/build-vs-buy-analysis.pdf) §4.

- As of 2026, _hiQ v. LinkedIn_, _Meta v. Bright Data_, and _X v. Bright Data_ all resolve the public-data-scraping question in scrapers' favor.
- HIPAA doesn't apply — provider names + credentials + office addresses are not PHI.
- Honest User-Agent (`AdvancedRxBot/1.0; +https://advancedrx.com/bot; contact@advancedrx.co`), at most one request per domain every 3 seconds, honor robots.txt disallows, no CAPTCHA bypass.
- No outside counsel review warranted for this scope.

## 8. Success criteria

A successful v1 run satisfies all of:

1. **≥ 90% of practices yield ≥ 1 extracted provider row.** Leaves ~10% headroom for Akamai blocks and unusual layouts.
2. **Validator splits the output sensibly.** Expected ~60% auto_skip, ~30% auto_import, ~10% human_review. Significant deviation (e.g., 50%+ human_review) means the validator rules are wrong and should be tuned before treating the run as authoritative.
3. **Zero invalid-Luhn NPIs in auto_import.csv.**
4. **Zero rows in auto_import.csv where the credential is not in the allowlist.**
5. **Three prototype sites (§6) pass spot-check** against manually-extracted ground truth before the full 1,245-site batch runs.
6. **Human-review CSV walks in ≤ 6 hours.** If it blows past 10 hours, validator rules are too strict.

## 9. Build sequence

1. [`elite-prescriber-rollup/`](../elite-prescriber-rollup/) ships to production (defines the 1,245-practice list).
2. Sarah confirms scraping from AdvancedRx Azure infrastructure is OK.
3. **Phase A — Prototype** (1 day): build pipeline against the 3 sites in §6. Validate JSON schema, validator rules, error handling.
4. **Phase B — Full run** (1 day): execute against all 1,245 practices. Batch API runs overnight (~24 hr).
5. **Phase C — Validator review** (no human time): run `validate.py` to split into the three output CSVs.
6. **Phase D — Admin imports** `auto_import.csv` via Data Loader (~1 hour, mostly Data Loader's batch time).
7. **Phase E — Reviewer walks** `human_review.csv` (~6 hours).
8. **Phase F — Retrospective** ~2 weeks post-import. Did sales convert any of the gap-filled prescribers? Decide whether to scope a quarterly recurrence.

## 10. Deferred to v2 / out of scope

- **Quarterly reconciliation engine** — comprehensive design in [`research/quarterly-reconciliation-architecture.pdf`](research/quarterly-reconciliation-architecture.pdf). Defer until v1 proves recurring value.
- **Autonomous Salesforce write-back via Composite/Bulk API.** v1 outputs Data Loader–ready CSV; admin runs Data Loader. Direct API integration is straightforward but doesn't materially change v1's reviewer burden — defer until quarterly recurrence makes the engineering worth it.
- **Tier 2 enrichment** — EHR system detection, sub-specialty taxonomy, languages spoken, office manager extraction, accreditations. None of these were requested. Don't build.
- **AccountContactRelation backfill** for multi-practice prescribers. The locations CSV surfaces the gaps; admin decides which to add manually.

## 11. Cross-references

- **Parent program:** [`../plan.md`](../plan.md).
- **Sibling plan:** [`../elite-prescriber-rollup/`](../elite-prescriber-rollup/) — produces the practice list this plan operates on.
- **Source research artifacts:** [`research/`](research/)
  - [`research/build-vs-buy-analysis.pdf`](research/build-vs-buy-analysis.pdf) — v1 architecture, tool survey, cost math, legal analysis.
  - [`research/quarterly-reconciliation-architecture.pdf`](research/quarterly-reconciliation-architecture.pdf) — v2+ infrastructure (deferred).
- **Top-level inventory:** [`docs/plans/README.md`](../../README.md).
- **Tyler email thread:** 2026-05-08 (original ask) → 2026-05-11 (confirmed Option A on rollup scope, raised secondary locations as a future endeavor).
