Scraping Local Business Data for SEO Audits: A Practical Cookbook

scraper
2026-01-31

A practical cookbook for collecting, normalizing, and analyzing local business listings across maps, social, and directories to find SEO gaps.

Stop chasing inconsistent citations — automate a repeatable local listings audit

If you manage multi-location SEO, nothing is more frustrating than inconsistent business listings across maps, directories, and social platforms. In 2026, local visibility depends on an ecosystem of maps, TikTok, Instagram, Reddit, and AI-powered answers. This cookbook shows how to collect, normalize, and analyze business listings at scale using a practical scraping stack (Scrapy, Playwright/Puppeteer, Selenium, HTTP clients) and modern normalization and deduplication techniques, so you can surface citation inconsistencies and local SEO gaps reliably.

Why this matters in 2026

Search behavior has continued to fragment: consumers discover brands via maps, TikTok, Instagram, Reddit, AI assistants, and traditional search. As Search Engine Land noted in January 2026, discoverability is now cross-platform and authority is a combined signal across social and search. For local businesses, inconsistent NAP (Name, Address, Phone), missing categories, or stale hours create ranking leakage and confuse AI summarizers and map ranking models.

“Discoverability is no longer about ranking first on a single platform. It’s about showing up consistently across the touchpoints that make up your audience’s search universe.” — Search Engine Land, Jan 2026

What this cookbook delivers

A canonical data model to collect, acquisition recipes (APIs, Scrapy, Playwright/Puppeteer), anti-blocking tactics, a normalization and entity-resolution pipeline, and the metrics and queries that turn listings data into a prioritized fix list.

Overview: data model you should collect

Define a canonical schema before scraping. Collecting extra fields upfront saves time during normalization; a minimal dataclass sketch follows the field list below.

  • core: source, source_id/place_id, scraped_at
  • identity: business_name, alternate_names
  • contact: phone, website, email (if public)
  • address: street_address, city, region, postal_code, country, lat, lon
  • attributes: categories (primary + secondary), hours, price_range, services
  • engagement: review_count, rating, recent_reviews (text + date)
  • meta: screenshot_url, raw_html, crawl_headers, cookies
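
As a reference point, here is a minimal sketch of that schema as a Python dataclass. Field names mirror the list above; the types and defaults are assumptions to adapt to your own storage layer.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Listing:
    # core
    source: str
    source_id: str
    scraped_at: datetime
    # identity
    business_name: str
    alternate_names: list[str] = field(default_factory=list)
    # contact
    phone: Optional[str] = None
    website: Optional[str] = None
    email: Optional[str] = None
    # address
    street_address: Optional[str] = None
    city: Optional[str] = None
    region: Optional[str] = None
    postal_code: Optional[str] = None
    country: Optional[str] = None
    lat: Optional[float] = None
    lon: Optional[float] = None
    # attributes
    categories: list[str] = field(default_factory=list)
    hours: dict = field(default_factory=dict)
    price_range: Optional[str] = None
    services: list[str] = field(default_factory=list)
    # engagement
    review_count: Optional[int] = None
    rating: Optional[float] = None
    recent_reviews: list[dict] = field(default_factory=list)
    # meta
    screenshot_url: Optional[str] = None
    raw_html: Optional[str] = None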

Step 1 — Choose the right acquisition method

Maps and major platforms (Google Maps, Apple Maps, Yelp, Facebook/Meta, Bing, TripAdvisor, Mapbox-powered directories) often block raw HTTP scraping and rely on dynamic JavaScript rendering and bot detection. Your options:

  1. Official APIs (preferred) — Google Places API, Yelp Fusion, Facebook Graph. Use these where coverage and quota fit your needs; they are faster, more stable, and lower-risk than scraping.
  2. Headless browser scraping — Playwright or Puppeteer for JS-heavy pages and when you need front-end-only fields (hours UI, structured JSON-LD rendered after scripts).
  3. HTTP scraping (Scrapy / requests) — Fast for directories with server-side rendered HTML (YellowPages, local chambers).
  4. Hybrid — Use HTTP for directories; fall back to Playwright when you detect dynamic content (see the sketch after this list).
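
As a rough illustration of option 4, the sketch below tries plain HTTP first and escalates to Playwright only when the expected listing markup is missing. The URL handling and the .listing-card selector are placeholders, not a real directory's markup.

import requests
from playwright.sync_api import sync_playwright

LISTING_SELECTOR = '.listing-card'  # placeholder; match your target's actual markup

def fetch_listings_html(url):
    """Return page HTML, escalating to a headless browser if the page looks JS-rendered."""
    resp = requests.get(url, timeout=30, headers={'User-Agent': 'Mozilla/5.0 ...'})
    # Crude heuristic: if the listing class never appears in the raw HTML, assume client-side rendering
    if resp.ok and LISTING_SELECTOR.lstrip('.') in resp.text:
        return resp.text
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, timeout=60000)
        page.wait_for_selector(LISTING_SELECTOR, timeout=15000)
        html = page.content()
        browser.close()
        return html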

Practical rule

Prefer APIs → HTTP scrapers → headless browsers. But real audits often need the extra fields only present in map UIs, so include Playwright in your stack.

Step 2 — Example spiders and scripts

Scrapy example: scrape a server-side directory

# scrapy spider: local_directory_spider.py
import scrapy

class DirectorySpider(scrapy.Spider):
    name = 'directory'
    start_urls = ['https://example-directory.com/search?city=seattle']

    def parse(self, response):
        for card in response.css('.listing-card'):
            yield {
                'source': 'example-directory',
                'source_id': card.attrib.get('data-id'),
                'business_name': card.css('.title::text').get(),
                'street_address': card.css('.addr::text').get(),
                'phone': card.css('.phone::text').get(),
                'website': card.css('a.website::attr(href)').get(),
            }

        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Playwright (Python) example: scrape Google Maps results page

Note: Google Maps is heavily defended. Use official Places API where possible; use Playwright for UI-only fields or visual verification. Keep sessions short, randomize contexts, use residential proxies.

from playwright.sync_api import sync_playwright

def scrape_maps(query):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...')
        page = context.new_page()
        page.goto(f'https://www.google.com/maps/search/{query}', timeout=60000)
        page.wait_for_timeout(3000)  # give the results panel time to render

        # NOTE: Maps selectors change frequently; treat these as illustrative.
        cards = page.locator('div[aria-label][role=listitem]')
        results = []
        for i in range(min(cards.count(), 10)):
            card = cards.nth(i)
            name = card.locator('h3').inner_text()
            addr = card.locator('.section-result-location').inner_text()
            results.append({'name': name, 'address': addr})

        browser.close()
        return results

Puppeteer + stealth example (Node.js)

const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

;(async () => {
  const browser = await puppeteer.launch({ headless: true })
  const page = await browser.newPage()
  await page.goto('https://www.yelp.com/search?find_desc=coffee&find_loc=Seattle')
  // NOTE: Yelp's hashed class names change often; treat these selectors as illustrative.
  const listings = await page.$$eval('.container__09f24', els => els.slice(0,10).map(e => ({
    name: e.querySelector('a.link')?.innerText,
    url: e.querySelector('a.link')?.href,
  })))
  console.log(listings)
  await browser.close()
})()

Step 3 — Anti-blocking & scaling strategies (2026)

Anti-bot tech has matured: fingerprinting, behavioral analysis, and ML-based detection. Your tooling must be up to date.

  • Residential & mobile proxies — Mobile proxies mimic carrier IP ranges and reduce detection when scraping mobile-first map UIs.
  • Browser contexts & session rotation — Use one browser per proxy + fresh context + cleanable storage. Playwright's browser.new_context is essential.
  • Fingerprint diversity — Rotate user-agents, timezone, viewport, fonts and accept-language. Use stealth plugins but combine with proxy rotation.
  • Rate-limiting & randomized timing — Implement per-target politeness (e.g. 1–5s jitter for map UIs) and exponential backoff on 429/503 (a backoff sketch follows this list).
  • Headless detection avoidance — Run headed (headless=False) where feasible; disable webdriver flags and avoid vendor-sniffing giveaways.
  • Use managed extraction services — For high-risk targets, consider provider partnerships (scraping-as-a-service) or paid datasets to avoid interruptions.
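
A minimal politeness and backoff sketch under the assumptions above (1–5 s jitter, exponential backoff on 429/503); tune the bounds per target.

import random
import time

import requests

def polite_get(url, session=None, max_retries=5, min_delay=1.0, max_delay=5.0):
    """GET with randomized politeness delay and exponential backoff on 429/503."""
    session = session or requests.Session()
    for attempt in range(max_retries):
        time.sleep(random.uniform(min_delay, max_delay))  # jitter between requests
        resp = session.get(url, timeout=30)
        if resp.status_code in (429, 503):
            # honor a numeric Retry-After header if present, otherwise back off exponentially
            retry_after = resp.headers.get('Retry-After')
            wait = float(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f'Giving up on {url} after {max_retries} attempts')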

Step 4 — Normalization & entity resolution

Raw scraped data is noisy. Normalization standardizes formats; entity resolution groups records for the same physical business.

Normalization checklist

  • Normalize phone numbers to E.164 using libphonenumber.
  • Canonicalize addresses via geocoding (Google / OpenCage / Nominatim) and store lat/lon (a geocoding sketch follows this checklist).
  • Lowercase and strip punctuation from names for comparison; keep original for display.
  • Standardize categories to a controlled taxonomy (Google/Bing categories) and map synonyms.
  • Parse and canonicalize operating hours into ISO intervals.
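
For the geocoding step, a hedged sketch using geopy's Nominatim client; mind Nominatim's roughly one-request-per-second usage policy, and swap in Google or OpenCage for volume and consistency.

from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent='local-seo-audit')  # Nominatim policy requires an identifying user agent

def canonicalize_address(raw_address):
    """Return (canonical_address, lat, lon); fall back to the raw string if no match is found."""
    location = geocoder.geocode(raw_address, timeout=10)
    if location is None:
        return raw_address, None, None
    return location.address, location.latitude, location.longitude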

Fast Python pipeline: normalization + fuzzy dedupe

import pandas as pd
from phonenumbers import NumberParseException, parse, format_number, PhoneNumberFormat
from rapidfuzz import fuzz

# Expected columns: business_name, street_address, city, phone, lat, lon, source
df = pd.read_csv('listings_raw.csv')  # assumed export from the crawl; adjust to your storage

def normalize_phone(raw):
    try:
        return format_number(parse(raw, 'US'), PhoneNumberFormat.E164)
    except (NumberParseException, TypeError):
        return None

df['phone_norm'] = df['phone'].apply(normalize_phone)
df['name_key'] = (df['business_name'].str.lower()
                  .str.replace(r'[^a-z0-9]', ' ', regex=True)
                  .str.strip())

# Simple blocking on street + normalized phone, then fuzzy-merge near-duplicate names within each block
clusters = []
for _, block in df.groupby(['street_address', 'phone_norm']):
    names = block['name_key'].tolist()
    while names:
        base = names.pop(0)
        group = [base]
        matches = [m for m in names if fuzz.ratio(base, m) > 85]
        for m in matches:
            names.remove(m)
            group.append(m)
        clusters.append(group)

Advanced entity resolution (2026)

In 2025–2026, embedding-based entity resolution became practical at scale: encode name + address + website into dense vectors (OpenAI embeddings or local models) and use a vector DB (Milvus, Pinecone, Weaviate) to cluster similar entities across sources. This handles messy edge cases — multiple brands at the same address, franchise vs. independent listings. A minimal local sketch follows.
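
An in-memory sketch of the idea using sentence-transformers and cosine similarity; the model name and the 0.85 threshold are illustrative, and a vector DB replaces the brute-force pairwise comparison once the dataset grows.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # small local model; any embedder works

def entity_text(row):
    # concatenate the fields that identify a physical business
    return f"{row['business_name']} | {row['street_address']} {row['city']} | {row.get('website') or ''}"

def candidate_pairs(rows, threshold=0.85):
    """Yield index pairs whose embeddings are similar enough to review or merge."""
    texts = [entity_text(r) for r in rows]
    emb = model.encode(texts, normalize_embeddings=True)  # unit vectors -> dot product == cosine
    sims = np.dot(emb, emb.T)
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if sims[i, j] >= threshold:
                yield i, j, float(sims[i, j])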

Step 5 — Metrics and queries to surface SEO gaps

Compute signals that drive business decisions. Store normalized, deduped entities in a relational DB and create materialized views for these scores.

Suggested metrics

  • NAP completeness: percent of listings with name, address, phone, website, hours.
  • Consistency score: pairwise similarity across sources averaged per entity (0–100).
  • Primary category mismatch: how many sources disagree on the primary category.
  • Duplicate count: number of duplicate/conflicting listings (same phone or address but different names/URLs).
  • Review delta: differences in review count between Google/Yelp/Facebook — large deltas indicate missing listings or suppressed results.
  • Map presence: presence on major maps (Google, Apple, Bing, Waze) — use Places API presence as a boolean.

SQL examples

-- NAP completeness per location
SELECT
  entity_id,
  AVG((business_name IS NOT NULL)::int + (street_address IS NOT NULL)::int + (phone IS NOT NULL)::int + (website IS NOT NULL)::int + (hours IS NOT NULL)::int) / 5.0 AS nap_completeness
FROM listings_normalized
GROUP BY entity_id;

-- consistency score (simplified)
SELECT entity_id, AVG(similarity_score) AS consistency
FROM name_address_similarity
GROUP BY entity_id;
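
The name_address_similarity table above is assumed to be precomputed; one way to derive those pairwise scores per entity with RapidFuzz might look like this (field names follow the schema earlier).

from itertools import combinations

from rapidfuzz import fuzz

def consistency_score(listings):
    """Average pairwise name+address similarity (0-100) across one entity's listings."""
    pairs = list(combinations(listings, 2))
    if not pairs:
        return 100.0  # only one source: nothing to disagree with
    scores = []
    for a, b in pairs:
        name_sim = fuzz.ratio(a['business_name'].lower(), b['business_name'].lower())
        addr_sim = fuzz.token_sort_ratio(a['street_address'].lower(), b['street_address'].lower())
        scores.append((name_sim + addr_sim) / 2)
    return sum(scores) / len(scores)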

How to prioritize fixes

Don't fix every inconsistency — prioritize by business impact (a simple scoring sketch follows this list):

  1. High-traffic locations with low NAP completeness
  2. Listings present on Google but with conflicting phone/URL
  3. Primary category mismatches for top converting locations
  4. Duplicate Google listings (can split reviews and lower ranking)
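
One way to operationalize this prioritization, assuming per-location traffic has been joined onto the audit table; the column names and weights are placeholders to tune with stakeholders.

import pandas as pd

def priority_score(audit: pd.DataFrame) -> pd.DataFrame:
    """Rank locations by impact: busy locations with weak or inconsistent listings float to the top."""
    out = audit.copy()
    out['priority'] = (
        out['monthly_traffic']                      # assumed column: visits or map views
        * (1 - out['nap_completeness'])             # 0..1 from the SQL view above
        * (1 + (100 - out['consistency']) / 100)    # boost inconsistent entities
        + out['duplicate_count'] * 50               # duplicates are disproportionately harmful
    )
    return out.sort_values('priority', ascending=False)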

Visualizations that help stakeholders

  • Map heatmap: completeness score by lat/lon
  • Bar chart: inconsistent fields per source (Google vs Yelp vs FB)
  • Time-series: resolved duplicates vs conversions
  • Tabular actionable report: per-location fixes, suggested canonical name/address, owner action (claim listing URL, call directory support)

Governance, compliance, and operations

  • Respect robots.txt and platform terms — use APIs where required.
  • Rate limits — back off on 429s, monitor 403/429 spikes and throttle.
  • Data privacy — do not collect PII beyond what's publicly available; anonymize reviewer identities where needed.
  • Logging & retention — store raw snapshots (HTML/screenshots) for auditability and to reprocess after normalization improvements.
  • Team process — have an approvals flow for bulk edits to live listings (avoid mass mistaken changes).

Real-world cookbook: auditing a 50-location dental chain (case example)

Scenario: a regional dental group suspects ranking drops after a rebrand. They want to find:

  • Duplicate Google listings still using old brand name
  • Franchise vs. corporate page mismatches
  • Missing hours or contact info on Facebook and Apple Maps

Execution summary:

  1. Seed list of known locations from the website (50 records).
  2. Query Google Places API for place_id and basic fields (first pass).
  3. Parallel Playwright runs (pool of 8 contexts + residential proxies) to fetch front-end-only fields and screenshots for verification (see the pool sketch after this list).
  4. Scrape Yelp, Facebook, local directories with Scrapy and rate-limits.
  5. Normalize addresses with Google Geocoding + libpostal, normalize phones to E.164.
  6. Run an embedding-based dedupe pass in Milvus to resolve tricky duplicates and franchise vs. single-location pages.
  7. Generate per-location actionable report: claim/merge suggestions, category corrections, and a list of duplicates to request removal.
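
Step 3's parallel Playwright runs might be sketched with the async API and per-context proxies; the proxy URLs, the concurrency of 8, and the screenshot path are assumptions from the summary above, not a tested production setup.

import asyncio
import os
from playwright.async_api import async_playwright

PROXIES = [f'http://proxy-{i}.example:8000' for i in range(8)]  # placeholder residential proxies

async def fetch_one(browser, proxy, url):
    # one fresh context per request, tied to one proxy, so sessions stay short and isolated
    context = await browser.new_context(proxy={'server': proxy})
    page = await context.new_page()
    await page.goto(url, timeout=60000)
    html = await page.content()
    await page.screenshot(path=f'screenshots/{abs(hash(url))}.png', full_page=True)  # visual verification
    await context.close()
    return html

async def run(urls):
    os.makedirs('screenshots', exist_ok=True)
    sem = asyncio.Semaphore(len(PROXIES))  # at most one in-flight page per proxy
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def bound(i, url):
            async with sem:
                return await fetch_one(browser, PROXIES[i % len(PROXIES)], url)

        results = await asyncio.gather(*(bound(i, u) for i, u in enumerate(urls)))
        await browser.close()
        return results

# asyncio.run(run(location_urls))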

Outcome: 38/50 locations had at least one inconsistent listing; after prioritizing the top 10 high-traffic locations, the chain saw a 12% uplift in map impressions within 8 weeks following fixes combined with local content updates.

Advanced strategies and future-proofing (2026+)

  • Embed signals into ranking models — build a local ranking predictor using features like citation consistency, review velocity, and presence on AI knowledge graphs.
  • Automated remediation workflows — automate claim requests and template-based profile updates, and monitor status changes via webhooks or scheduled re-checks.
  • Vectorized entity store — keep embeddings for each normalized entity to accelerate cross-source matching as datasets grow.
  • AI-assisted classification — use modern classifiers to map free-text categories to standard taxonomies with confidence scores and human-in-the-loop review for low-confidence cases (a sketch follows).
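
A hedged sketch of confidence-scored category mapping with RapidFuzz; the taxonomy list and the 80/60 thresholds are illustrative, and an embedding classifier can replace the fuzzy scorer for messier inputs.

from rapidfuzz import fuzz, process

TAXONOMY = ['Dentist', 'Orthodontist', 'Coffee Shop', 'Cafe', 'Plumber']  # controlled category list

def map_category(free_text, auto_accept=80, needs_review=60):
    """Map a scraped free-text category to the taxonomy with a confidence-based routing decision."""
    match = process.extractOne(free_text, TAXONOMY, scorer=fuzz.WRatio)
    if match is None:
        return None, 0, 'reject'
    category, score, _ = match
    if score >= auto_accept:
        return category, score, 'accept'
    if score >= needs_review:
        return category, score, 'human_review'
    return None, score, 'reject'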

When to stop scraping and use partnerships

If your footprint grows (hundreds of locations) or a target platform blocks aggressively, evaluate paid data providers or channel partnerships (Yext-style partnerships, Moz Local, BrightLocal-style services, or direct platform partnerships). These reduce operational risk and improve SLA-backed fixing options.

Quick troubleshooting & tips

  • Seeing lots of CAPTCHAs? Switch to a higher-quality residential mobile proxy pool and spread requests over time.
  • Inconsistent geocoding? Always store original address and lat/lon; prefer a consistent geocoder for canonicalization.
  • False duplicates (same building multiple businesses)? Use category and website similarity as tie-breakers, and keep human review for borderline cases.
  • Audit drift: schedule monthly re-crawls and keep a changelog of source vs canonical values.

Closing: actionable takeaways

  • Define a canonical schema (NAP + lat/lon + categories + source metadata) before you scrape.
  • Use APIs where possible; use Playwright for UI-only fields and Scrapy for server-side directories.
  • Invest in residential/mobile proxies, browser context rotation, and fingerprint diversity to reduce blocks.
  • Normalize early: phone → E.164, addresses → canonical geocode, categories → controlled taxonomy.
  • Resolve entities using blocking + fuzzy matching or embedding vectors for hard cases.
  • Prioritize fixes by business impact (traffic, conversions, top markets).

Resources & further reading

  • Google Places & Geocoding APIs (use where possible)
  • RapidFuzz / fuzzywuzzy for string matching
  • Milvus / Pinecone / Weaviate for vector matching
  • Playwright & Puppeteer stealth plugins for headless detection workarounds
  • Search Engine Land — Discoverability in 2026 (Jan 2026)

Final note on compliance

Always evaluate the legal and ToS risks before scraping. Public data used for business intelligence is different from bulk republishing or competitive reuse. When in doubt, prefer APIs or licensed datasets — that also reduces operational churn from anti-bot countermeasures.

Call to action

Ready to build a repeatable local listings audit? Start with a 30-location pilot: collect canonical schema, run one pass of Google Places + Playwright verification, and run the normalization pipeline. If you want a starter repo (Scrapy + Playwright + normalization notebook + example SQL views) I’ll share a GitHub template and a checklist you can run in two days — reply with your stack preference (Python/Node) and I’ll tailor it for you.
