Automated SEO Audits Using Scrapy and Playwright
A practical cookbook (2026) pairing Scrapy for crawl scheduling with Playwright-rendered crawls to catch modern SEO and page-speed issues.
Hook: If your SEO audits miss JavaScript-only content, render-dependent metadata, or page-speed metrics because you’re crawling with a plain, non-rendering HTTP client, you’re under-reporting the problems that actually block rankings. This cookbook shows how to combine Scrapy for crawl scheduling and link discovery with Playwright for accurate rendered crawls — so your automated SEO audit finds real-world technical and on-page issues at scale.
Why Scrapy + Playwright in 2026?
By 2026 search discoverability spans social, AI answers, and traditional search engines. Pages are increasingly rendered client-side, rely on dynamic structured data, and use client-only hydration for critical content. A modern audit crawler must do two things well:
- Scale and schedule link discovery, throttling, retries and export formats — Scrapy is battle-tested here.
- Accurately render JavaScript-driven pages and capture browser-only metrics (LCP, CLS, FCP, JS-generated meta tags, JSON-LD) — Playwright gives deterministic automation and CDP-level access.
Architecture overview — two-pass rendered crawl
Use a two-pass approach per URL to avoid the cost of rendering every page and to keep the audit fast and repeatable:
- HTTP pass (Scrapy): Fast, non-rendered request to collect raw HTTP headers, redirect chains, robots headers, sitemap hints, hreflang link tags and to decide if a page needs rendering.
- Render pass (Playwright via scrapy-playwright): Only for pages flagged as dynamic, or a sample set for metrics. Use Playwright to capture rendered HTML, compute web vitals (LCP/CLS/TBT), extract JSON-LD, verify that the rendered <title> and meta[name="description"] are present, and assert content quality.
Prerequisites & 2026 considerations
- Python 3.11+ (for speed and newer asyncio features)
- Scrapy (>=2.8) and scrapy-playwright (maintained and production-ready)
- Playwright browsers installed (chromium is preferred for measurability)
- Proxy provider (residential or datacenter) with session pinning — fingerprinting and rate-limiting are more aggressive in 2025–2026
- Optional: memory-backed queue (Redis) and persistent storage (Elasticsearch, BigQuery, or ClickHouse) for audit histories
Install & minimal settings
Install packages:
pip install scrapy scrapy-playwright playwright
playwright install chromium
Minimal Scrapy settings (settings.py):
# settings.py
BOT_NAME = 'seo_audit'
SPIDER_MODULES = ['seo_audit.spiders']
NEWSPIDER_MODULE = 'seo_audit.spiders'
# scrapy-playwright integration
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
# Reduce Playwright concurrency; render passes are expensive
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 8
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,
    'args': ['--no-sandbox', '--disable-dev-shm-usage'],
}
# Respect robots.txt for audit scope
ROBOTSTXT_OBEY = True
# Export settings
FEEDS = {
    'results/audit.jl': {'format': 'jsonlines'},
}
Spider: two-pass crawl with conditional rendering
The spider below demonstrates:
- Start with an HTTP GET to collect headers and redirect chain
- Decide whether to render (simple heuristics: heavy JS frameworks, presence of placeholder elements, or a sampled render percentage)
- When rendering, use async parse to access Playwright's page API and collect web vitals and rendered DOM
import hashlib
import json

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_playwright.page import PageMethod


class SeoAuditSpider(scrapy.Spider):
    name = 'seo_audit'
    start_urls = ['https://example.com']
    custom_settings = {
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': 30000,
    }

    def parse(self, response):
        # HTTP pass: grab headers, redirect chain, and a small HTML snapshot
        item = {
            'url': response.url,
            'status': response.status,
            'headers': {k.decode(): v[0].decode() for k, v in response.headers.items()},
            'redirect_chain': response.request.meta.get('redirect_urls', []),
        }
        item.update(self.static_checks(response))

        # Heuristic: render if content is short or contains scripts that hint at client-side routing
        # (tune the hints and threshold per site)
        body_text = response.text[:4000]
        needs_render = len(body_text) < 1500 or any(
            hint in body_text
            for hint in ('__NEXT_DATA__', 'data-reactroot', 'ng-version', 'id="app"')
        )

        if needs_render:
            # Render pass: re-fetch through Playwright and keep the page object available
            yield scrapy.Request(
                response.url,
                callback=self.parse_rendered,
                dont_filter=True,
                meta={
                    'playwright': True,
                    'playwright_include_page': True,
                    'playwright_page_methods': [
                        PageMethod('wait_for_load_state', 'networkidle'),
                    ],
                    'audit_item': item,
                },
            )
        else:
            yield item

        # Continue link discovery from the cheap HTTP pass (restrict allow_domains in production)
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)

    async def parse_rendered(self, response):
        page = response.meta['playwright_page']
        item = response.meta['audit_item']

        # Rendered title and meta description
        item['title'] = await page.title()
        meta_desc = await page.evaluate(
            "() => { const n = document.querySelector('meta[name=\"description\"]'); return n ? n.content : null; }"
        )
        item['meta_description'] = meta_desc

        # Word count and H1s from rendered DOM
        word_count = await page.evaluate(r"() => document.body.innerText.split(/\s+/).filter(Boolean).length")
        item['word_count'] = word_count
        h1 = await page.evaluate("() => { const n = document.querySelector('h1'); return n ? n.innerText.trim() : null; }")
        item['h1'] = h1
        # Image ALT checks
        images_missing_alt = await page.evaluate("() => Array.from(document.images).filter(img => !img.alt || img.alt.trim() === '').length")
        item['images_missing_alt'] = images_missing_alt

        # Count internal & external links
        links_count = await page.evaluate("() => { const a = Array.from(document.querySelectorAll('a[href]')); return {total: a.length, internal: a.filter(l=> new URL(l.href, location.origin).origin === location.origin).length}; }")
        item['links'] = links_count

        # Structured data: JSON-LD scripts
        jsonlds = await page.evaluate("() => Array.from(document.querySelectorAll('script[type=\'application/ld+json\']')).map(s=>s.innerText)")
        parsed_ld = []
        for s in jsonlds:
            try:
                parsed_ld.append(json.loads(s))
            except Exception:
                parsed_ld.append({'_raw': s[:200]})
        item['structured_data'] = parsed_ld

        # Collect simple Web Vitals via the Performance API (best-effort)
        perf = await page.evaluate("() => ({nav: performance.getEntriesByType('navigation').map(e=>e.toJSON()), paints: performance.getEntriesByType('paint').map(e=>e.toJSON())})")
        item['performance'] = perf

        # Compute a quick content signature for duplicate detection
        body_inner_text = await page.evaluate('() => document.body.innerText')
        item['content_hash'] = hashlib.sha1(body_inner_text.encode('utf-8')).hexdigest()

        await page.close()
        yield item
    def static_checks(self, response):
        data = {}
        # Canonical tag
        canonical = response.xpath('//link[@rel="canonical"]/@href').get()
        data['canonical'] = canonical
        # Robots meta
        robots_meta = response.xpath('//meta[@name="robots"]/@content').get()
        data['robots_meta'] = robots_meta
        # hreflang links
        hreflangs = response.xpath('//link[@rel="alternate" and @hreflang]').getall()
        data['hreflang_count'] = len(hreflangs)
        # Basic title & description (non-rendered)
        data['title_static'] = response.xpath('//title/text()').get()
        data['meta_description_static'] = response.xpath('//meta[@name="description"]/@content').get()
        return data
Notes on the example
- parse_rendered is async so you can await Playwright page methods. This pattern is supported by scrapy-playwright.
- We close page objects to avoid leaking pages. Use PLAYWRIGHT_MAX_PAGES_PER_CONTEXT to bound memory.
- For heavy sites, implement a render-decider service that caches “render needed” decisions per path pattern (a sketch follows below).
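A minimal sketch of such a decider, assuming decisions are keyed by a normalized path pattern; the module name, framework hints, and length threshold are illustrative choices, not part of the original setup:

# render_decider.py -- hypothetical helper; hints and thresholds are illustrative
import re
from urllib.parse import urlsplit

RENDER_HINTS = ('__NEXT_DATA__', 'data-reactroot', 'ng-version', 'id="app"')


class RenderDecider:
    """Caches render-needed decisions per path pattern so repeated templates skip the heuristic."""

    def __init__(self, min_text_length: int = 1500):
        self.min_text_length = min_text_length
        self._cache: dict[str, bool] = {}

    def _pattern(self, url: str) -> str:
        # Collapse slug/ID segments so /blog/post-1 and /blog/post-2 share one decision
        path = urlsplit(url).path
        return re.sub(r'/[^/]*\d[^/]*', '/<var>', path) or '/'

    def needs_render(self, url: str, html_snippet: str) -> bool:
        key = self._pattern(url)
        if key not in self._cache:
            self._cache[key] = (
                len(html_snippet) < self.min_text_length
                or any(hint in html_snippet for hint in RENDER_HINTS)
            )
        return self._cache[key]

In the spider's parse method, the inline needs_render heuristic could then be replaced by a call like decider.needs_render(response.url, body_text).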
Audit checks: what to detect and example heuristics
Below are common audit items and practical ways to detect them in your spider/pipelines.
1) HTTP and headers
- Status codes & redirects: Use Scrapy response.status and response.request.meta.get('redirect_urls'). Flag 4xx/5xx and long redirect chains.
- Cache headers: Check cache-control and expires for static assets; missing cache-control on images/CSS is a performance hit.
- Server-Timing: If present, parse the Server-Timing header for backend timing (see the header-check sketch below).
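These header checks can live in a small helper called from an item pipeline; a hedged sketch, where the flag names and the redirect-chain threshold are arbitrary illustrations:

def header_flags(item: dict) -> list[str]:
    """Derive issue flags from the HTTP-pass fields collected by the spider."""
    flags = []
    headers = {k.lower(): v for k, v in item.get('headers', {}).items()}
    if item.get('status', 200) >= 400:
        flags.append(f"http-error-{item['status']}")
    if len(item.get('redirect_chain', [])) > 2:
        flags.append('long-redirect-chain')
    if 'cache-control' not in headers and 'expires' not in headers:
        flags.append('missing-cache-headers')
    # Server-Timing looks like: db;dur=53, cache;desc="hit";dur=0.1
    if 'server-timing' in headers:
        timings = {}
        for entry in headers['server-timing'].split(','):
            name, *params = [p.strip() for p in entry.split(';')]
            dur = next((p[4:] for p in params if p.startswith('dur=')), None)
            timings[name] = float(dur) if dur is not None else None
        item['server_timing'] = timings
    return flags

The returned flags can be appended to item['flags'] and picked up by the triage pipeline later in this cookbook.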
2) Canonicalization & redirects
- Compare the <link rel="canonical"> to the resolved URL. Flag mismatches or self-references that point elsewhere.
- Detect duplicate content via content_hash and group identical pages; surface canonical or redirect fixes.
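A minimal canonical comparison, assuming the canonical and url fields collected by the spider above; the normalization (lowercased host, trailing-slash-insensitive path) is a judgment call:

from urllib.parse import urljoin, urlsplit


def canonical_issue(item: dict) -> str | None:
    """Flag canonical tags that are missing or point somewhere other than the resolved URL."""
    canonical = item.get('canonical')
    if not canonical:
        return 'canonical-missing'
    canonical = urljoin(item['url'], canonical)  # resolve relative canonicals

    def norm(url: str) -> tuple[str, str]:
        parts = urlsplit(url)
        return parts.netloc.lower(), (parts.path.rstrip('/') or '/')

    # response.url is already the post-redirect URL in Scrapy, so compare against it directly
    return 'canonical-mismatch' if norm(canonical) != norm(item['url']) else None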
3) On-page metadata
- Title presence and length (non-empty, roughly under 70 characters). Flag duplicate titles across the domain.
- Meta description presence and length (50–160 chars guidance).
- H1 presence and whether it matches title (helpful for intent alignment checks).
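The same thresholds expressed as a helper; the field names match the spider above, and the exact limits are guidance rather than hard rules:

def metadata_issues(item: dict) -> list[str]:
    """Apply the title/description/H1 heuristics described above."""
    issues = []
    title = (item.get('title') or item.get('title_static') or '').strip()
    desc = (item.get('meta_description') or item.get('meta_description_static') or '').strip()
    if not title:
        issues.append('title-missing')
    elif len(title) > 70:
        issues.append('title-too-long')
    if not desc:
        issues.append('meta-description-missing')
    elif not 50 <= len(desc) <= 160:
        issues.append('meta-description-length')
    if not item.get('h1'):
        issues.append('h1-missing')
    return issues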
4) Structured data
- Parse application/ld+json. Validate basic schema types (Article, Product, BreadcrumbList). Missing required properties should be flagged.
- Detect duplicate or conflicting schema (e.g., two WebPage objects with different names).
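A hedged validation sketch over the structured_data list the render pass captures; the required-property map is deliberately minimal and should be extended per the documentation for each type:

REQUIRED_PROPS = {
    # Minimal property sets for illustration only
    'Article': {'headline', 'datePublished'},
    'Product': {'name', 'offers'},
    'BreadcrumbList': {'itemListElement'},
}


def structured_data_issues(blocks: list) -> list[str]:
    """Check captured JSON-LD blocks for missing required properties and duplicated schema."""
    issues = []
    seen_types = []
    for block in blocks:
        nodes = block.get('@graph', [block]) if isinstance(block, dict) else []
        for node in nodes:
            if not isinstance(node, dict):
                continue
            node_type = node.get('@type')
            if isinstance(node_type, list):  # @type may be a list; take the first for simplicity
                node_type = node_type[0] if node_type else None
            seen_types.append(node_type)
            missing = REQUIRED_PROPS.get(node_type, set()) - node.keys()
            if missing:
                issues.append(f"schema-{node_type}-missing-{'+'.join(sorted(missing))}")
    if seen_types.count('WebPage') > 1:
        issues.append('duplicate-webpage-schema')
    return issues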
5) Content quality heuristics
- Word count thresholds for page templates; flag pages under 200 words for content intent pages.
- Image alt attribute counts; flag images missing alt text.
- Shallow internal linking (pages with fewer than 3 internal links) — these may be effectively orphaned.
6) Render-only checks
- Rendered title/meta differ from non-rendered source — indicates client-side injection that previous crawls missed.
- Structured data inserted client-side — capture and validate.
- Client-only redirects (meta refresh or JS-based navigations) — capture navigation in Playwright and compare the initial and final URLs.
7) Page speed & Web Vitals (best-effort)
Accurate LCP/CLS/TBT requires monitoring during load. Use a combination of these techniques:
- Inject a small script that uses the PerformanceObserver API and buffers entries. Use page.add_init_script before navigation to collect LCP/CLS reliably.
- As a fallback, collect performance.getEntriesByType('paint') and navigation timing.
- For production-grade metrics, integrate a headless Lighthouse run or connect to CDP to export trace events — but these are heavier and usually done in sampled audits.
# example: add_init_script to capture LCP/CLS before page scripts run
# (pass raw JS statements -- an uncalled arrow-function string would never execute)
await page.add_init_script("""
    window.__auditVitals = [];
    new PerformanceObserver((list) => list.getEntries().forEach((e) => window.__auditVitals.push(e.toJSON())))
        .observe({type: 'largest-contentful-paint', buffered: true});
    new PerformanceObserver((list) => list.getEntries().forEach((e) => window.__auditVitals.push(e.toJSON())))
        .observe({type: 'layout-shift', buffered: true});
""")
# then navigate and read window.__auditVitals
vitals = await page.evaluate('() => window.__auditVitals')
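From the buffered entries, a first-cut LCP is the startTime of the last largest-contentful-paint candidate and a rough CLS is the sum of layout shifts without recent input. A minimal aggregation sketch that skips the session-window grouping the official web-vitals library applies:

# Aggregate buffered entries into first-cut metrics (simplified on purpose)
lcp_entries = [e for e in vitals if e.get('entryType') == 'largest-contentful-paint']
shifts = [e for e in vitals if e.get('entryType') == 'layout-shift' and not e.get('hadRecentInput')]
item['lcp_ms'] = lcp_entries[-1]['startTime'] if lcp_entries else None
item['cls'] = round(sum(e.get('value', 0) for e in shifts), 4)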
Scaling & operational tips
- Reserve rendering for where it matters: Use a render-decider service or simple heuristics. Rendering every URL is expensive and slows iterations.
- Proxy pooling & fingerprinting: Use session-pinned residential proxies for sensitive sites. Rotate User-Agent and viewport sizes to reduce fingerprinting signals.
- Resource blocking for speed: When measuring content and metadata only, block images/fonts to speed up the render (see the sketch after this list). For true LCP or CLS measurement, don’t block critical assets.
- Queueing & concurrency: Use a Redis-backed queue (e.g., scrapy-redis) for distributed crawling. Keep Playwright concurrency lower than HTTP workers to limit memory spikes.
- Sampled Lighthouse runs: Run Lighthouse or Chrome tracing on a subset of pages nightly for deeper audits.
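For the blocking-for-speed case, a sketch using Playwright request interception; the blocked resource types are a judgment call, and the hook must be registered before navigation:

BLOCKED_RESOURCE_TYPES = {'image', 'font', 'media'}


async def block_heavy_resources(page):
    """Abort image/font/media requests on metadata-only render passes.
    Do NOT use this when measuring LCP/CLS -- blocking assets skews the metrics."""
    async def handler(route):
        if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
            await route.abort()
        else:
            await route.continue_()
    await page.route('**/*', handler)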
Integrations & exporting results
Store audit outputs in systems your team already uses:
- Elasticsearch / OpenSearch for fast querying and dashboards
- BigQuery / ClickHouse for large-scale historical analysis and rollups
- Slack/Email alerts for pages with new critical regressions (e.g., indexability lost, canonical changed)
- Integrate with issue trackers (Jira/GitHub) via pipeline to create remediation tickets with example failing URLs.
2026 trends & future-proofing your audit
Recent trends from late 2025 and early 2026 change how audits should be designed:
- AI-driven SERP answers: Pages must surface clear entity-first markup (structured data + consistent on-page signals) because AI summarizers favor explicit entity schemas.
- Cross-platform discoverability: Audit internal sharing metadata (Open Graph, Twitter Card) since social discovery influences search intent prior to queries.
- Increased bot mitigation: Fingerprint-based blocking is more common — audits must include proxy rotation, realistic browser behavior and session pinning for scale. Many publishers now deploy more aggressive anti-bot controls.
- Privacy & compliance: Cookie walls and consent UIs alter what a crawler sees. Automate consent flows (safe, limited) or annotate audits to note consent gating.
Advanced strategies & troubleshooting
1) When LCP/CLS numbers are inconsistent
If your Playwright-based measurements jump around, ensure you:
- Run add_init_script before navigation to capture early metrics
- Set a consistent viewport and device emulation (avoid throttling unless you’re explicitly simulating slow networks)
- Sample pages multiple times and store medians — single-run noise is common
2) Handling SPA navigation
SPAs often change content after initial load. Use PageMethod('wait_for_selector', <selector>) or wait for specific network idle events. For auditing routes, programmatically navigate the SPA to each route using page.evaluate + history.pushState and measure after each navigation.
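A sketch of that route-by-route approach; the routes list, the wait selector, and the explicit popstate dispatch are assumptions that depend on how the app's router listens for navigation:

async def audit_spa_routes(page, routes, wait_selector='main'):
    """Push each route onto the history stack, wait for the app to re-render, then re-measure."""
    results = {}
    for route in routes:
        await page.evaluate("path => history.pushState({}, '', path)", route)
        # Many routers only react to popstate, so dispatch it explicitly after pushState
        await page.evaluate("() => window.dispatchEvent(new PopStateEvent('popstate'))")
        await page.wait_for_selector(wait_selector, timeout=10_000)
        results[route] = {
            'title': await page.title(),
            'h1': await page.evaluate(
                "() => { const n = document.querySelector('h1'); return n ? n.innerText.trim() : null; }"
            ),
        }
    return results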
3) Detecting cloaking
Compare server-side (HTTP pass) and client-side (render pass) outputs for title, meta, and structured data. Significant differences can indicate cloaking or bot-targeted content.
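A small comparison helper over fields the spider above already collects (title_static and meta_description_static from the HTTP pass versus title and meta_description from the render pass); the signal names are illustrative:

def cloaking_signals(item: dict) -> list[str]:
    """Compare HTTP-pass and render-pass values; large divergence can indicate cloaking."""
    signals = []
    pairs = [('title_static', 'title'), ('meta_description_static', 'meta_description')]
    for static_key, rendered_key in pairs:
        static_value = (item.get(static_key) or '').strip()
        rendered_value = (item.get(rendered_key) or '').strip()
        if static_value and rendered_value and static_value != rendered_value:
            signals.append(f'{rendered_key}-differs-after-render')
        elif not static_value and rendered_value:
            signals.append(f'{rendered_key}-injected-client-side')
    return signals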
Actionable audit checklist (for every run)
- HTTP pass for all URLs: capture headers, redirect chain, robots/meta, canonical
- Render pass for dynamic templates or sampled URLs: capture rendered title/meta/h1, JSON-LD
- Compute content_hash for duplicate detection and canonical mapping
- Gather web-vitals samples for a representative set of landing pages
- Export results to a searchable store and produce triaged lists: Critical (indexability, 500s, canonical mismatch), High (noindex, missing title), Medium (image alts, low word count), Low (meta length)
Example pipeline: triage & remediation ticket
Implement an item pipeline that scores each issue and emits remediation tasks (a minimal sketch follows the severity list below):
- Critical: status >= 500, robots noindex present, canonical pointing to competitor
- High: title missing, meta description missing, structured data invalid
- Medium: image alts missing, low word count
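A minimal sketch of such a pipeline, assuming the item carries the spider's fields plus a flags list from the helper checks earlier in this cookbook; enable it via ITEM_PIPELINES in settings.py:

# pipelines.py
class TriagePipeline:
    """Attach a severity to each audited URL and log critical regressions for ticketing."""

    def process_item(self, item, spider):
        flags = set(item.get('flags', []))
        noindex = 'noindex' in (item.get('robots_meta') or '')
        if item.get('status', 200) >= 500 or noindex or 'canonical-mismatch' in flags:
            severity = 'critical'
        elif 'title-missing' in flags or 'meta-description-missing' in flags or item.get('structured_data_issues'):
            severity = 'high'
        elif item.get('images_missing_alt', 0) > 0 or item.get('word_count', 0) < 200:
            severity = 'medium'
        else:
            severity = 'low'
        item['severity'] = severity
        if severity == 'critical':
            # Replace with a Jira/GitHub/Slack integration in production
            spider.logger.warning('Critical SEO issue at %s: %s', item.get('url'), sorted(flags))
        return item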
Limitations & legal compliance
Automated crawlers must respect terms-of-service, robots.txt, and privacy constraints. In 2026 many publishers and CDNs have stricter anti-bot controls — always:
- Respect robots.txt (Scrapy setting)
- Throttle requests and use exponential backoff on 429/403
- Use legal review for high-volume crawling of third-party sites
Takeaways — what to implement this week
- Set up a two-pass crawl: fast HTTP pass + conditional Playwright render pass.
- Use scrapy-playwright and write async parse methods to extract rendered metadata, JSON-LD and basic web vitals.
- Sample Lighthouse or CDP traces for a weekly deep audit; keep most checks lightweight.
- Export to a searchable store and auto-create remediation tickets for critical issues.
Real audits in 2026 require render-aware crawling plus scale. Use Scrapy for orchestration and Playwright for truth.
Call to action
Ready to operationalize this cookbook? Clone the starter repo (includes settings, spider templates, and pipeline examples) and run a 1,000-URL audit in a staging environment. If you want, paste one failing URL here and I’ll provide a focused Playwright script and remediation ticket template you can drop into your workflow.