Scraping Local Business Data for SEO Audits: A Practical Cookbook
A practical cookbook for collecting, normalizing, and analyzing local business listings across maps, social, and directories to find SEO gaps.
Hook: Stop chasing inconsistent citations — automate a repeatable local listings audit
If you manage multi-location SEO, nothing is more frustrating than inconsistent business listings across maps, directories, and social platforms. In 2026, local visibility depends on an ecosystem of maps, TikTok, Instagram, Reddit, and AI-powered answers. This cookbook shows how to collect, normalize, and analyze business listings at scale using a practical scraping stack (Scrapy, Playwright/Puppeteer, Selenium, HTTP clients) and modern normalization and deduplication techniques, so you can surface citation inconsistencies and local SEO gaps reliably.
Why this matters in 2026
Search behavior has continued to fragment: consumers discover brands via maps, TikTok, Instagram, Reddit, AI assistants, and traditional search. As Search Engine Land noted in January 2026, discoverability is now cross-platform and authority is a combined signal across social and search. For local businesses, inconsistent NAP (Name, Address, Phone), missing categories, or stale hours create ranking leakage and confuse AI summarizers and map ranking models.
“Discoverability is no longer about ranking first on a single platform. It’s about showing up consistently across the touchpoints that make up your audience’s search universe.” — Search Engine Land, Jan 2026
What this cookbook delivers
- Concrete scraping patterns for maps, directories, and social platforms
- Anti-blocking and proxy strategies for 2026 (residential/mobile proxies, browser contexts, fingerprint rotation)
- Normalization and entity resolution pipelines with code examples
- Metrics and queries to surface local SEO gaps and citation inconsistencies
- Compliance checklist and alternatives (APIs, provider partnerships)
Overview: data model you should collect
Define a canonical schema before scraping; collecting extra fields upfront saves time during normalization. A minimal schema sketch follows the field list below.
- core: source, source_id/place_id, scraped_at
- identity: business_name, alternate_names
- contact: phone, website, email (if public)
- address: street_address, city, region, postal_code, country, lat, lon
- attributes: categories (primary + secondary), hours, price_range, services
- engagement: review_count, rating, recent_reviews (text + date)
- meta: screenshot_url, raw_html, crawl_headers, cookies
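A minimal sketch of that schema as a Python dataclass (the Listing class, its field types, and defaults are illustrative, not a required structure; store it however your pipeline prefers):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Listing:
    # core
    source: str
    source_id: str
    scraped_at: str                       # ISO 8601 timestamp
    # identity
    business_name: Optional[str] = None
    alternate_names: list = field(default_factory=list)
    # contact
    phone: Optional[str] = None
    website: Optional[str] = None
    email: Optional[str] = None
    # address
    street_address: Optional[str] = None
    city: Optional[str] = None
    region: Optional[str] = None
    postal_code: Optional[str] = None
    country: Optional[str] = None
    lat: Optional[float] = None
    lon: Optional[float] = None
    # attributes
    categories: list = field(default_factory=list)    # primary first, then secondary
    hours: dict = field(default_factory=dict)
    price_range: Optional[str] = None
    services: list = field(default_factory=list)
    # engagement
    review_count: Optional[int] = None
    rating: Optional[float] = None
    recent_reviews: list = field(default_factory=list)
    # meta
    screenshot_url: Optional[str] = None
    raw_html: Optional[str] = None
    crawl_headers: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)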
Step 1 — Choose the right acquisition method
Maps and major platforms (Google Maps, Apple Maps, Yelp, Facebook/Meta, Bing, TripAdvisor, Mapbox-powered directories) often block raw HTTP scraping and rely on dynamic JavaScript and bot detection. Your options:
- Official APIs (preferred) — Google Places API, Yelp Fusion, Facebook Graph. Use where coverage and quota fit your needs. Faster, lawful, and stable.
- Headless browser scraping — Playwright or Puppeteer for JS-heavy pages and when you need front-end-only fields (hours UI, structured JSON-LD rendered after scripts).
- HTTP scraping (Scrapy / requests) — Fast for directories with server-side rendered HTML (YellowPages, local chambers).
- Hybrid — Use HTTP for directories; fall back to Playwright when you detect dynamic content.
Practical rule
Prefer APIs → HTTP scrapers → headless browsers. But real audits often need the extra fields only present in map UIs, so include Playwright in your stack.
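To make the hybrid rule concrete, here is a minimal sketch of an HTTP-first fetcher that falls back to Playwright when the server-side response looks JS-rendered (the .listing-card selector and the crude detection heuristic are assumptions to adapt per target):

import requests
from playwright.sync_api import sync_playwright

LISTING_SELECTOR = '.listing-card'   # assumed marker for server-rendered results

def fetch_listings_html(url, timeout=30):
    """Try plain HTTP first; fall back to a headless browser if listings are missing."""
    resp = requests.get(url, timeout=timeout,
                        headers={'User-Agent': 'Mozilla/5.0 (compatible; audit-bot)'})
    # crude heuristic: the listing class name appears in the raw HTML
    if resp.ok and LISTING_SELECTOR.lstrip('.') in resp.text:
        return resp.text
    # fallback: render the page so client-side listings appear
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, timeout=60000)
        page.wait_for_selector(LISTING_SELECTOR, timeout=15000)
        html = page.content()
        browser.close()
    return html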
Step 2 — Example spiders and scripts
Scrapy example: scrape a server-side directory
# scrapy spider: local_directory_spider.py
import scrapy

class DirectorySpider(scrapy.Spider):
    name = 'directory'
    start_urls = ['https://example-directory.com/search?city=seattle']

    def parse(self, response):
        for card in response.css('.listing-card'):
            yield {
                'source': 'example-directory',
                'source_id': card.attrib.get('data-id'),
                'business_name': card.css('.title::text').get(),
                'street_address': card.css('.addr::text').get(),
                'phone': card.css('.phone::text').get(),
                'website': card.css('a.website::attr(href)').get(),
            }
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Playwright (Python) example: scrape Google Maps results page
Note: Google Maps is heavily defended. Use official Places API where possible; use Playwright for UI-only fields or visual verification. Keep sessions short, randomize contexts, use residential proxies.
from playwright.sync_api import sync_playwright
import time

def scrape_maps(query):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...')
        page = context.new_page()
        page.goto(f'https://www.google.com/maps/search/{query}', timeout=60000)
        time.sleep(3)  # let UI render
        cards = page.locator('div[aria-label][role=listitem]')
        results = []
        for i in range(min(cards.count(), 10)):
            card = cards.nth(i)
            name = card.locator('h3').inner_text()
            addr = card.locator('.section-result-location').inner_text()
            results.append({'name': name, 'address': addr})
        browser.close()
        return results
Puppeteer + stealth example (Node.js)
const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

;(async () => {
  const browser = await puppeteer.launch({ headless: true })
  const page = await browser.newPage()
  await page.goto('https://www.yelp.com/search?find_desc=coffee&find_loc=Seattle')
  const listings = await page.$$eval('.container__09f24', els => els.slice(0, 10).map(e => ({
    name: e.querySelector('a.link')?.innerText,
    url: e.querySelector('a.link')?.href,
  })))
  console.log(listings)
  await browser.close()
})()
Step 3 — Anti-blocking & scaling strategies (2026)
Anti-bot tech has matured: fingerprinting, behavioral analysis, and ML-based detection. Your tooling must be up to date.
- Residential & mobile proxies — Mobile proxies mimic carrier IP ranges and reduce detection when scraping mobile-first map UIs.
- Browser contexts & session rotation — Use one browser per proxy with a fresh context and cleared storage per session; Playwright's browser.new_context makes this cheap (see the rotation sketch after this list).
- Fingerprint diversity — Rotate user-agents, timezone, viewport, fonts and accept-language. Use stealth plugins but combine with proxy rotation.
- Rate-limiting & randomized timing — Implement per-target politeness (ex: 1–5s jitter for map UIs) and exponential backoff on 429/503.
- Headless detection avoidance — Run headed (headless=False) where feasible; patch navigator.webdriver and other vendor-sniffing signals.
- Use managed extraction services — For high-risk targets, consider provider partnerships (scraping-as-a-service) or paid datasets to avoid interruptions.
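A minimal sketch of per-request context rotation with jitter and exponential backoff (the proxy pool, user-agent list, and thresholds below are placeholders to replace with your own):

import random
import time
from playwright.sync_api import sync_playwright

PROXIES = ['http://proxy-1.example:8000', 'http://proxy-2.example:8000']   # placeholder pool
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]

def fetch_with_rotation(urls, max_retries=3):
    results = {}
    with sync_playwright() as p:
        for url in urls:
            for attempt in range(max_retries):
                browser = p.chromium.launch(headless=True,
                                            proxy={'server': random.choice(PROXIES)})
                context = browser.new_context(
                    user_agent=random.choice(USER_AGENTS),
                    locale='en-US',
                    viewport={'width': random.randint(1200, 1600),
                              'height': random.randint(700, 1000)},
                )
                page = context.new_page()
                try:
                    response = page.goto(url, timeout=60000)
                    if response and response.status in (429, 503):
                        raise RuntimeError(f'throttled: {response.status}')
                    results[url] = page.content()
                    break
                except Exception:
                    # exponential backoff plus jitter before retrying on a fresh proxy/context
                    time.sleep((2 ** attempt) + random.uniform(1, 5))
                finally:
                    browser.close()
            # per-target politeness delay between URLs (1-5s jitter)
            time.sleep(random.uniform(1, 5))
    return results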
Step 4 — Normalization & entity resolution
Raw scraped data is noisy. Normalization standardizes formats; entity resolution groups records for the same physical business.
Normalization checklist
- Normalize phone numbers to E.164 using libphonenumber.
- Canonicalize addresses via geocoding (Google / OpenCage / Nominatim) and store lat/lon.
- Lowercase and strip punctuation from names for comparison; keep original for display.
- Standardize categories to a controlled taxonomy (Google/Bing categories) and map synonyms.
- Parse and canonicalize operating hours into ISO intervals.
Fast Python pipeline: normalization + fuzzy dedupe
import pandas as pd
from phonenumbers import parse, format_number, PhoneNumberFormat, NumberParseException
from rapidfuzz import fuzz

# expected columns: business_name, street_address, city, phone, lat, lon, source
df = pd.read_csv('listings_raw.csv')   # placeholder path for your scraped output

def normalize_phone(raw):
    try:
        return format_number(parse(raw, 'US'), PhoneNumberFormat.E164)
    except (NumberParseException, TypeError):
        return None

df['phone_norm'] = df['phone'].apply(normalize_phone)
df['name_key'] = df['business_name'].str.lower().str.replace(r"[^a-z0-9]", ' ', regex=True).str.strip()

# simple blocking on street + normalized phone
blocks = df.groupby(['street_address', 'phone_norm'])
clusters = []
for _, block in blocks:
    names = block['name_key'].tolist()
    # merge near-duplicate names within a block
    while names:
        base = names.pop(0)
        group = [base]
        matches = [m for m in names if fuzz.ratio(base, m) > 85]
        for m in matches:
            names.remove(m)
            group.append(m)
        clusters.append(group)
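The checklist above also calls for canonicalizing operating hours. A minimal sketch of that step, assuming the scraper returns per-day strings like 'Monday': '9:00 AM - 5:30 PM' (the normalize_hours helper and the day codes are illustrative, not a standard API):

from datetime import datetime

DAY_CODES = {'monday': 'MO', 'tuesday': 'TU', 'wednesday': 'WE', 'thursday': 'TH',
             'friday': 'FR', 'saturday': 'SA', 'sunday': 'SU'}

def to_24h(t):
    """'9:00 AM' -> '09:00' (24-hour, zero-padded)."""
    return datetime.strptime(t.strip(), '%I:%M %p').strftime('%H:%M')

def normalize_hours(raw_hours):
    """Map {'Monday': '9:00 AM - 5:30 PM', ...} to {'MO': ('09:00', '17:30'), ...}."""
    out = {}
    for day, span in raw_hours.items():
        code = DAY_CODES.get(day.strip().lower())
        if not code or '-' not in span:
            continue   # skip unknown days and 'Closed' entries
        open_t, close_t = [to_24h(part) for part in span.split('-', 1)]
        out[code] = (open_t, close_t)
    return out

print(normalize_hours({'Monday': '9:00 AM - 5:30 PM', 'Sunday': 'Closed'}))
# expected: {'MO': ('09:00', '17:30')}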
Advanced entity resolution (2026)
Through 2025–2026, embedding-based entity resolution became practical at scale: encode name + address + website into dense vectors (OpenAI or local models) and use a vector DB (Milvus, Pinecone, Weaviate) to cluster similar entities across sources. This handles messy edge cases such as multiple brands at the same address and franchise vs. independent listings.
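A minimal sketch of that idea using a small local sentence-transformers model and plain cosine similarity in place of a full vector DB (the model choice, the 0.85 threshold, and the sample records are assumptions to calibrate on labeled pairs):

from sentence_transformers import SentenceTransformer

records = [
    {'id': 1, 'name': 'Smile Dental Seattle', 'address': '123 Pine St, Seattle, WA', 'website': 'smiledental.com'},
    {'id': 2, 'name': 'Smile Dental', 'address': '123 Pine Street, Seattle WA 98101', 'website': 'smiledental.com'},
    {'id': 3, 'name': 'Pine Street Coffee', 'address': '125 Pine St, Seattle, WA', 'website': 'pinestcoffee.com'},
]

model = SentenceTransformer('all-MiniLM-L6-v2')   # small local encoder; swap for your preferred model
texts = [f"{r['name']} | {r['address']} | {r['website']}" for r in records]
emb = model.encode(texts, normalize_embeddings=True)   # unit vectors, so dot product == cosine similarity

sim = emb @ emb.T
THRESHOLD = 0.85   # too low over-merges franchises; calibrate on labeled pairs

candidate_pairs = [
    (records[i]['id'], records[j]['id'], float(sim[i, j]))
    for i in range(len(records))
    for j in range(i + 1, len(records))
    if sim[i, j] >= THRESHOLD
]
print(candidate_pairs)   # duplicate candidates to feed into clustering or human review

At scale, swap the in-memory similarity matrix for approximate nearest-neighbor search in Milvus, Pinecone, or Weaviate.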
Step 5 — Metrics and queries to surface SEO gaps
Compute signals that drive business decisions. Store normalized, deduped entities in a relational DB and create materialized views for these scores.
Suggested metrics
- NAP completeness: percent of listings with name, address, phone, website, hours.
- Consistency score: pairwise similarity across sources averaged per entity (0–100).
- Primary category mismatch: how many sources disagree on the primary category.
- Duplicate count: number of duplicate/conflicting listings (same phone or address but different names/URLs).
- Review delta: differences in review count between Google/Yelp/Facebook — large deltas indicate missing listings or suppressed results.
- Map presence: presence on major maps (Google, Apple, Bing, Waze) — use Places API presence as a boolean.
SQL examples
-- NAP completeness per location
SELECT
entity_id,
AVG((business_name IS NOT NULL)::int + (street_address IS NOT NULL)::int + (phone IS NOT NULL)::int + (website IS NOT NULL)::int + (hours IS NOT NULL)::int) / 5.0 AS nap_completeness
FROM listings_normalized
GROUP BY entity_id;
-- consistency score (simplified)
SELECT entity_id, AVG(similarity_score) AS consistency
FROM name_address_similarity
GROUP BY entity_id;
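The second query assumes a name_address_similarity table of pairwise per-entity scores. A minimal sketch of how to populate it from the deduped listings (the file paths and the 50/50 name/address weighting are assumptions to tune):

import pandas as pd
from rapidfuzz import fuzz

# one row per (entity_id, source) with normalized name and address columns
listings = pd.read_csv('listings_normalized.csv')   # placeholder path

rows = []
for entity_id, group in listings.groupby('entity_id'):
    recs = group.to_dict('records')
    for i in range(len(recs)):
        for j in range(i + 1, len(recs)):
            a, b = recs[i], recs[j]
            name_sim = fuzz.ratio(str(a['business_name']), str(b['business_name']))
            addr_sim = fuzz.token_sort_ratio(str(a['street_address']), str(b['street_address']))
            rows.append({
                'entity_id': entity_id,
                'source_a': a['source'],
                'source_b': b['source'],
                'similarity_score': 0.5 * name_sim + 0.5 * addr_sim,   # 0-100
            })

pd.DataFrame(rows).to_csv('name_address_similarity.csv', index=False)   # load into your DB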
How to prioritize fixes
Don't fix every inconsistency — prioritize by business impact (a simple scoring sketch follows this list):
- High-traffic locations with low NAP completeness
- Listings present on Google but with conflicting phone/URL
- Primary category mismatches for top converting locations
- Duplicate Google listings (can split reviews and lower ranking)
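One way to turn those criteria into a ranked worklist, assuming per-location traffic estimates joined onto the entity scores (the column names and weights below are illustrative, not a standard formula):

import pandas as pd

# expected columns: entity_id, monthly_traffic, nap_completeness (0-1),
# consistency (0-100), category_mismatches, duplicate_count
entities = pd.read_csv('entity_scores.csv')   # placeholder path

entities['issue_score'] = (
    (1 - entities['nap_completeness']) * 40            # missing NAP fields
    + (1 - entities['consistency'] / 100) * 30         # cross-source disagreement
    + entities['category_mismatches'].clip(upper=3) * 5
    + entities['duplicate_count'].clip(upper=3) * 5
)
# weight issues by traffic so high-impact locations float to the top
entities['priority'] = entities['issue_score'] * entities['monthly_traffic'].rank(pct=True)

worklist = entities.sort_values('priority', ascending=False)
print(worklist[['entity_id', 'priority', 'issue_score']].head(10))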
Visualizations that help stakeholders
- Map heatmap: completeness score by lat/lon
- Bar chart: inconsistent fields per source (Google vs Yelp vs FB)
- Time-series: resolved duplicates vs conversions
- Tabular actionable report: per-location fixes, suggested canonical name/address, owner action (claim listing URL, call directory support)
Operational & legal checklist
- Respect robots.txt and platform terms — use APIs where required.
- Rate limits — back off on 429s, monitor 403/429 spikes and throttle.
- Data privacy — do not collect PII beyond what's publicly available; anonymize reviewer identities where needed.
- Logging & retention — store raw snapshots (HTML/screenshots) for auditability and to reprocess after normalization improvements.
- Team process — have an approvals flow for bulk edits to live listings (avoid mass mistaken changes).
Real-world cookbook: auditing a 50-location dental chain (case example)
Scenario: a regional dental group suspects ranking drops after a rebrand. They want to find:
- Duplicate Google listings still using old brand name
- Franchise vs. corporate page mismatches
- Missing hours or contact info on Facebook and Apple Maps
Execution summary:
- Seed list of known locations from the website (50 records).
- Query Google Places API for place_id and basic fields (first pass).
- Parallel Playwright runs (pool of 8 contexts + residential proxies) to fetch front-end-only fields and screenshots for verification.
- Scrape Yelp, Facebook, local directories with Scrapy and rate-limits.
- Normalize addresses with Google Geocoding + libpostal, normalize phones to E.164.
- Run an embedding-based dedupe pass in Milvus to resolve tricky duplicates and franchise vs. single-location pages.
- Generate per-location actionable report: claim/merge suggestions, category corrections, and a list of duplicates to request removal.
Outcome: 38 of 50 locations had at least one inconsistent listing. After prioritizing the top 10 high-traffic locations, the chain saw a 12% uplift in map impressions within eight weeks of the fixes, combined with local content updates.
Advanced strategies and future-proofing (2026+)
- Embed signals into ranking models — build a local ranking predictor using features like citation consistency, review velocity, and presence on AI knowledge graphs.
- Automated remediation workflows — automate sending claim requests and template-based profile updates, and monitor status changes via webhooks or scheduled re-checks.
- Vectorized entity store — keep embeddings for each normalized entity to accelerate cross-source matching as datasets grow.
- AI-assisted classification — use modern classifiers to map free-text categories to standard taxonomies with confidence scores and human-in-the-loop review for low-confidence cases.
When to stop scraping and use partnerships
If your footprint grows (hundreds of locations) or a target platform blocks aggressively, evaluate paid data providers or channel partnerships (Yext-style listings management, Moz Local, BrightLocal-style services, or direct platform partnerships). These reduce operational risk and give you SLA-backed remediation options.
Quick troubleshooting & tips
- Seeing lots of CAPTCHAs? Switch to a higher-quality residential mobile proxy pool and spread requests over time.
- Inconsistent geocoding? Always store original address and lat/lon; prefer a consistent geocoder for canonicalization.
- False duplicates (same building multiple businesses)? Use category and website similarity as tie-breakers, and keep human review for borderline cases.
- Audit drift: schedule monthly re-crawls and keep a changelog of source vs canonical values.
Closing: actionable takeaways
- Define a canonical schema (NAP + lat/lon + categories + source metadata) before you scrape.
- Use APIs where possible; use Playwright for UI-only fields and Scrapy for server-side directories.
- Invest in residential/mobile proxies, browser context rotation, and fingerprint diversity to reduce blocks.
- Normalize early: phone → E.164, addresses → canonical geocode, categories → controlled taxonomy.
- Resolve entities using blocking + fuzzy matching or embedding vectors for hard cases.
- Prioritize fixes by business impact (traffic, conversions, top markets).
Resources & further reading
- Google Places & Geocoding APIs (use where possible)
- RapidFuzz / fuzzywuzzy for string matching
- Milvus / Pinecone / Weaviate for vector matching
- Playwright & Puppeteer stealth plugins for headless detection workarounds
- Search Engine Land — Discoverability in 2026 (Jan 2026)
Final note on compliance
Always evaluate the legal and ToS risks before scraping. Public data used for business intelligence is different from bulk republishing or competitive reuse. When in doubt, prefer APIs or licensed datasets — that also reduces operational churn from anti-bot countermeasures.
Call to action
Ready to build a repeatable local listings audit? Start with a 30-location pilot: define the canonical schema, run one pass of Google Places plus Playwright verification, and run the normalization pipeline. If you want a starter repo (Scrapy + Playwright + normalization notebook + example SQL views), I'll share a GitHub template and a checklist you can run in two days — reply with your stack preference (Python/Node) and I'll tailor it for you.