Automated SEO Audits Using Scrapy and Playwright

A practical cookbook (2026) pairing Scrapy for crawl scheduling with Playwright-rendered crawls to catch modern SEO and page-speed issues.

If your SEO audits miss JavaScript-only content, render-dependent metadata, or page-speed metrics because you’re crawling with a plain HTTP client only, you’re under-reporting the problems that actually block rankings. This cookbook shows how to combine Scrapy for crawl scheduling and link discovery with Playwright for accurate rendered crawls — so your automated SEO audit finds real-world technical and on-page issues at scale.

Why Scrapy + Playwright in 2026?

By 2026 search discoverability spans social, AI answers, and traditional search engines. Pages are increasingly rendered client-side, rely on dynamic structured data, and use client-only hydration for critical content. A modern audit crawler must do two things well:

  • Scale and schedule link discovery, throttling, retries and export formats — Scrapy is battle-tested here.
  • Accurately render JavaScript-driven pages and capture browser-only metrics (LCP, CLS, FCP, JS-generated meta tags, JSON-LD) — Playwright gives deterministic automation and CDP-level access.

Architecture overview — two-pass rendered crawl

Use a two-pass approach per URL to avoid the cost of rendering every page and to keep the audit fast and repeatable:

  1. HTTP pass (Scrapy): Fast, non-rendered request to collect raw HTTP headers, redirect chains, robots headers, sitemap hints, hreflang link tags and to decide if a page needs rendering.
  2. Render pass (Playwright via scrapy-playwright): Only for pages flagged as dynamic, or a sample set for metrics. Use Playwright to capture rendered HTML, compute web-vitals (LCP/CLS/TBT), extract JSON-LD, ensure rendered <title>/meta[name="description"] presence, and assert content quality.
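
As a minimal sketch of how pass 2 is issued (assuming the scrapy-playwright settings shown later in this post; the helper name render_request is illustrative), the render pass is just a normal Scrapy request carrying Playwright metadata:

import scrapy
from scrapy_playwright.page import PageMethod

def render_request(url, callback):
    # Build a request that is fetched through Playwright instead of plain HTTP
    return scrapy.Request(
        url,
        callback=callback,                      # the spider's async parse method
        dont_filter=True,                       # the URL was already fetched in pass 1
        meta={
            'playwright': True,                 # route through the Playwright download handler
            'playwright_include_page': True,    # expose the page object to the callback
            'playwright_page_methods': [
                PageMethod('wait_for_load_state', 'networkidle'),
            ],
        },
    )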

Prerequisites & 2026 considerations

  • Python 3.11+ (for speed and newer asyncio features)
  • Scrapy (>=2.8) and scrapy-playwright (maintained and production-ready)
  • Playwright browsers installed (chromium is preferred for measurability)
  • Proxy provider (residential or datacenter) with session pinning — fingerprinting and rate-limiting are more aggressive in 2025–2026
  • Optional: memory-backed queue (Redis) and persistent storage (Elasticsearch, BigQuery, or ClickHouse) for audit histories

Install & minimal settings

Install packages:

pip install scrapy scrapy-playwright playwright
playwright install chromium

Minimal Scrapy settings (settings.py):

# settings.py
BOT_NAME = 'seo_audit'
SPIDER_MODULES = ['seo_audit.spiders']
NEWSPIDER_MODULE = 'seo_audit.spiders'

# scrapy-playwright integration
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}

# Reduce Playwright concurrency; render passes are expensive
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 8
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,
    'args': ['--no-sandbox', '--disable-dev-shm-usage']
}

# Respect robots.txt for audit scope
ROBOTSTXT_OBEY = True

# Export settings
FEEDS = {
    'results/audit.jl': {'format': 'jsonlines'},
}

Spider: two-pass crawl with conditional rendering

The spider below demonstrates:

  • Start with an HTTP GET to collect headers and redirect chain
  • Decide whether to render (simple heuristics: heavy JS frameworks, presence of placeholder elements, or a sampled render percentage)
  • When rendering, use async parse to access Playwright's page API and collect web vitals and rendered DOM

import json
import hashlib
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_playwright.page import PageMethod

class SeoAuditSpider(scrapy.Spider):
    name = 'seo_audit'
    start_urls = ['https://example.com']

    custom_settings = {
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': 30000,
    }

    def parse(self, response):
        # HTTP pass: grab headers, redirect chain, and a small HTML snapshot
        item = {
            'url': response.url,
            'status': response.status,
            'headers': {k.decode(): v[0].decode() for k, v in response.headers.items()},
            'redirect_chain': response.request.meta.get('redirect_urls', []),
        }

        # Heuristic: render if content is short or contains scripts that hint at client-side routing
        body_text = response.text[:4000]
        needs_render = ('__NEXT_DATA__' in body_text or 'data-reactroot' in body_text
                        or 'ng-version' in body_text or len(body_text) < 1500)

        # Static (non-rendered) checks are cheap, so collect them for every page
        item.update(self.static_checks(response))

        if needs_render:
            yield scrapy.Request(
                response.url,
                callback=self.parse_rendered,
                dont_filter=True,
                meta={
                    'playwright': True,
                    'playwright_include_page': True,
                    'playwright_page_methods': [
                        PageMethod('wait_for_load_state', 'networkidle'),
                    ],
                    'audit_item': item,
                },
            )
        else:
            yield item

        # Follow internal links discovered during the HTTP pass
        for link in LinkExtractor(allow_domains=['example.com']).extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)

    async def parse_rendered(self, response):
        page = response.meta['playwright_page']
        item = response.meta['audit_item']

        # Rendered <title> and meta description
        item['title'] = await page.title()
        meta_desc = await page.evaluate(
            "() => { const n = document.querySelector('meta[name=\"description\"]'); return n ? n.content : null; }")
        item['meta_description'] = meta_desc

        # Word count and H1 from the rendered DOM
        word_count = await page.evaluate(
            r"() => document.body.innerText.split(/\s+/).filter(Boolean).length")
        item['word_count'] = word_count
        h1 = await page.evaluate(
            "() => { const n = document.querySelector('h1'); return n ? n.innerText.trim() : null; }")
        item['h1'] = h1

        # Image ALT checks
        images_missing_alt = await page.evaluate(
            "() => Array.from(document.images).filter(img => !img.alt || img.alt.trim() === '').length")
        item['images_missing_alt'] = images_missing_alt

        # Count total and internal links
        links_count = await page.evaluate(
            "() => { const a = Array.from(document.querySelectorAll('a[href]')); "
            "return {total: a.length, internal: a.filter(l => new URL(l.href, location.origin).origin === location.origin).length}; }")
        item['links'] = links_count

        # Structured data: JSON-LD scripts
        jsonlds = await page.evaluate(
            "() => Array.from(document.querySelectorAll('script[type=\"application/ld+json\"]')).map(s => s.innerText)")
        parsed_ld = []
        for s in jsonlds:
            try:
                parsed_ld.append(json.loads(s))
            except Exception:
                parsed_ld.append({'_raw': s[:200]})
        item['structured_data'] = parsed_ld

        # Collect simple Web Vitals via the Performance API (best-effort)
        perf = await page.evaluate(
            "() => ({nav: performance.getEntriesByType('navigation').map(e => e.toJSON()), "
            "paints: performance.getEntriesByType('paint').map(e => e.toJSON())})")
        item['performance'] = perf

        # Quick content signature for duplicate detection
        rendered_text = await page.evaluate('() => document.body.innerText')
        item['content_hash'] = hashlib.sha1(rendered_text.encode('utf-8')).hexdigest()

        await page.close()
        yield item

    def static_checks(self, response):
        data = {}
        # Canonical tag
        data['canonical'] = response.xpath('//link[@rel="canonical"]/@href').get()
        # Robots meta
        data['robots_meta'] = response.xpath('//meta[@name="robots"]/@content').get()
        # hreflang links
        hreflangs = response.xpath('//link[@rel="alternate" and @hreflang]').getall()
        data['hreflang_count'] = len(hreflangs)
        # Basic title & description (non-rendered)
        data['title_static'] = response.xpath('//title/text()').get()
        data['meta_description_static'] = response.xpath('//meta[@name="description"]/@content').get()
        return data

Notes on the example

  • parse_rendered is async so you can await Playwright page methods. This pattern is supported by scrapy-playwright.
  • We close page objects to avoid leaking pages. Use PLAYWRIGHT_MAX_PAGES_PER_CONTEXT to bound memory.
  • For heavy sites, implement a render-decider service that caches “render needed” decisions per path pattern (see the sketch below).
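
A minimal sketch of such a render-decider cache; the path patterns and hydration markers are assumptions to tune per site:

import re

# Illustrative patterns and markers; adjust for your templates
RENDER_PATTERNS = [re.compile(p) for p in (r'^/app/', r'^/products/')]
_render_cache = {}

def needs_render(path, body_snippet):
    # Collapse numeric IDs so /blog/123 and /blog/456 share one cached decision
    key = re.sub(r'\d+', '<n>', path)
    if key in _render_cache:
        return _render_cache[key]
    decision = (
        any(p.search(path) for p in RENDER_PATTERNS)
        or '__NEXT_DATA__' in body_snippet      # hydration markers hint at client rendering
        or len(body_snippet) < 1500
    )
    _render_cache[key] = decision
    return decision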

Audit checks: what to detect and example heuristics

Below are common audit items and practical ways to detect them in your spider/pipelines.

1) HTTP and headers

  • Status codes & redirects: Use Scrapy response.status and response.request.meta.get('redirect_urls'). Flag 4xx/5xx and long redirect chains.
  • Cache headers: Check cache-control and expires for static assets; missing cache-control on images/CSS is a performance hit.
  • Server-Timing: If present, parse Server-Timing header for backend timing.
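
A sketch of these header checks, operating on the item built in the spider's HTTP pass (the redirect-chain threshold and Server-Timing parsing are illustrative):

def check_headers(item):
    # Derive header-level issues from the HTTP-pass item built in parse()
    issues = []
    headers = {k.lower(): v for k, v in item.get('headers', {}).items()}

    if item.get('status', 200) >= 400:
        issues.append('HTTP %s' % item['status'])
    if len(item.get('redirect_chain', [])) > 2:
        issues.append('redirect chain longer than 2 hops')
    if 'cache-control' not in headers and 'expires' not in headers:
        issues.append('no cache-control/expires header')

    # Parse Server-Timing, e.g. "db;dur=53, app;dur=47.2" -> {'db': 53.0, 'app': 47.2}
    if 'server-timing' in headers:
        timings = {}
        for part in headers['server-timing'].split(','):
            name, _, rest = part.strip().partition(';')
            if 'dur=' in rest:
                timings[name] = float(rest.split('dur=')[1].split(';')[0])
        item['server_timing'] = timings
    return issues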

2) Canonicalization & redirects

  • Compare the <link rel="canonical"> to the resolved URL. Flag mismatches, or canonicals that should be self-referencing but point elsewhere.
  • Detect duplicate content via content_hash and group identical pages; surface canonical or redirect fixes.
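
A post-processing sketch for the duplicate grouping, assuming items is the list of exported audit records (e.g. loaded from results/audit.jl):

from collections import defaultdict

def duplicate_groups(items):
    # Map content_hash -> URLs, keeping only clusters with more than one page
    groups = defaultdict(list)
    for it in items:
        if it.get('content_hash'):
            groups[it['content_hash']].append(it['url'])
    return {h: urls for h, urls in groups.items() if len(urls) > 1}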

3) On-page metadata

  • Title presence and length (non-empty, under ~70 characters). Flag duplicates across the domain.
  • Meta description presence and length (50–160 chars guidance).
  • H1 presence and whether it matches title (helpful for intent alignment checks).
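
These checks reduce to a few threshold comparisons over the audited item; a sketch, with field names following the spider above and thresholds following the guidance in this list:

def check_onpage(item):
    issues = []
    title = (item.get('title') or item.get('title_static') or '').strip()
    desc = (item.get('meta_description') or item.get('meta_description_static') or '').strip()
    if not title:
        issues.append('missing <title>')
    elif len(title) > 70:
        issues.append('title too long (%d chars)' % len(title))
    if not desc:
        issues.append('missing meta description')
    elif not 50 <= len(desc) <= 160:
        issues.append('meta description length %d outside 50-160' % len(desc))
    if not item.get('h1'):
        issues.append('missing h1')
    return issues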

4) Structured data

  • Parse application/ld+json. Validate basic schema types (Article, Product, BreadcrumbList). Missing required properties should be flagged.
  • Detect duplicate or conflicting schema (e.g., two WebPage objects with different names).
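
A validation sketch covering both bullets; the required-property map is an assumption to adapt to the schema.org types you rely on, and it only handles simple single-@type objects:

from collections import Counter

REQUIRED_PROPS = {
    'Article': {'headline'},
    'Product': {'name', 'offers'},
    'BreadcrumbList': {'itemListElement'},
}

def check_structured_data(blocks):
    # `blocks` is the parsed JSON-LD list collected by the render pass
    issues = []
    types = Counter()
    for block in blocks:
        schema_type = block.get('@type')
        if not isinstance(schema_type, str):
            continue  # arrays of types / @graph containers are out of scope for this sketch
        types[schema_type] += 1
        missing = REQUIRED_PROPS.get(schema_type, set()) - block.keys()
        if missing:
            issues.append('%s: missing %s' % (schema_type, sorted(missing)))
    issues += ['duplicate %s blocks' % t for t, n in types.items() if n > 1]
    return issues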

5) Content quality heuristics

  • Word count thresholds for page templates; flag pages under 200 words for content intent pages.
  • Image alt attribute counts; flag images missing alt text.
  • Shallow internal linking (pages with < 3 internal links) — might be orphaned.

6) Render-only checks

  • Rendered title/meta differ from non-rendered source — indicates client-side injection that previous crawls missed.
  • Structured data inserted client-side — capture and validate.
  • Client-only redirects (meta refresh or JS-based navigations) — capture navigation in Playwright and compare initial and final URLs.

7) Page speed & Web Vitals (best-effort)

Accurate LCP/CLS/TBT requires monitoring during load. Use a combination of these techniques:

  • Inject a small script that uses the PerformanceObserver and buffers entries. Use page.add_init_script before navigation to collect LCP/CLS reliably.
  • As a fallback, collect performance.getEntriesByType('paint') and navigation timing.
  • For production-grade metrics, integrate a headless Lighthouse run or connect to CDP to export trace events — but these are heavier and usually done in sampled audits.

# example: add_init_script to capture LCP/CLS before any page scripts run
await page.add_init_script("""
    window.__auditVitals = [];
    new PerformanceObserver((list) =>
        list.getEntries().forEach((e) => window.__auditVitals.push(e.toJSON()))
    ).observe({type: 'largest-contentful-paint', buffered: true});
    new PerformanceObserver((list) =>
        list.getEntries().forEach((e) => window.__auditVitals.push(e.toJSON()))
    ).observe({type: 'layout-shift', buffered: true});
""")
# then navigate, and after load read the buffered entries
vitals = await page.evaluate('() => window.__auditVitals')

Scaling & operational tips

  • Reserve rendering for where it matters: Use a render-decider service or simple heuristics. Rendering every URL is expensive and slows iterations.
  • Proxy pooling & fingerprinting: Use session-pinned residential proxies for sensitive sites. Rotate User-Agent and viewport sizes to reduce fingerprinting signals.
  • Resource blocking for speed: When measuring content and metadata only, block images/fonts to speed up rendering (see the sketch after this list). For true LCP or CLS measurement, don’t block critical assets.
  • Queueing & concurrency: Use a Redis-backed queue (e.g. scrapy-redis) for distributed crawling. Keep Playwright concurrency lower than HTTP workers to limit memory spikes.
  • Sampled Lighthouse runs: Run Lighthouse or Chrome tracing on a subset of pages nightly for deeper audits.
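
For the resource-blocking tip, scrapy-playwright exposes a request-abort hook (PLAYWRIGHT_ABORT_REQUEST); a settings sketch, to be disabled on runs that measure LCP/CLS:

# settings.py: block heavy assets during content-only render passes
BLOCKED_RESOURCE_TYPES = {'image', 'font', 'media'}

def should_abort_request(request):
    # `request` is a Playwright request; resource_type is e.g. 'image', 'script', 'document'
    return request.resource_type in BLOCKED_RESOURCE_TYPES

PLAYWRIGHT_ABORT_REQUEST = should_abort_request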

Integrations & exporting results

Store audit outputs in systems your team already uses:

  • Elasticsearch / OpenSearch for fast querying and dashboards
  • BigQuery / ClickHouse for large-scale historical analysis and rollups
  • Slack/Email alerts for pages with new critical regressions (e.g., indexability lost, canonical changed)
  • Integrate with issue trackers (Jira/GitHub) via pipeline to create remediation tickets with example failing URLs.

2026 trends that change audit design

Recent trends from late 2025 and early 2026 change how audits should be designed:

  • AI-driven SERP answers: Pages must surface clear entity-first markup (structured data + consistent on-page signals) because AI summarizers favor explicit entity schemas.
  • Cross-platform discoverability: Audit internal sharing metadata (Open Graph, Twitter Card) since social discovery influences search intent prior to queries.
  • Increased bot mitigation: Fingerprint-based blocking is more common — audits must include proxy rotation, realistic browser behavior and session pinning for scale. Many publishers now deploy more aggressive anti-bot controls.
  • Privacy & compliance: Cookie walls and consent UIs alter what a crawler sees. Automate consent flows (safe, limited) or annotate audits to note consent gating.

Advanced strategies & troubleshooting

1) When LCP/CLS numbers are inconsistent

If your Playwright-based measurements jump around, ensure you:

  • Run add_init_script before navigation to capture early metrics
  • Set a consistent viewport and device emulation (avoid throttling unless you’re explicitly simulating slow networks)
  • Sample pages multiple times and store medians — single-run noise is common

2) Handling SPA navigation

SPAs often change content after initial load. Use PageMethod('wait_for_selector', <selector>) or wait for specific network idle events. For auditing routes, programmatically navigate the SPA to each route using page.evaluate + history.pushState and measure after each navigation.
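
A sketch of route-by-route SPA auditing inside an already-rendered page; the routes list, the popstate dispatch, and the settle delay are assumptions that depend on the site's router:

async def audit_spa_routes(page, routes):
    # Navigate the SPA client-side and re-check rendered metadata after each route change
    results = {}
    for route in routes:
        await page.evaluate(
            "path => { history.pushState({}, '', path); "
            "window.dispatchEvent(new PopStateEvent('popstate')); }",
            route,
        )
        await page.wait_for_timeout(1000)  # crude settle; prefer a route-specific selector
        results[route] = {
            'title': await page.title(),
            'h1': await page.evaluate(
                "() => { const n = document.querySelector('h1'); return n ? n.innerText.trim() : null; }"),
        }
    return results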

3) Detecting cloaking

Compare server-side (HTTP pass) and client-side (render pass) outputs for title, meta, and structured data. Significant differences can indicate cloaking or bot-targeted content.
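
A sketch of that comparison, assuming the item carries both the static_checks fields and the rendered fields from the spider above:

def cloaking_signals(item):
    signals = []
    for static_key, rendered_key in (('title_static', 'title'),
                                     ('meta_description_static', 'meta_description')):
        a = (item.get(static_key) or '').strip()
        b = (item.get(rendered_key) or '').strip()
        if a and b and a != b:
            signals.append('%s differs between raw HTML and rendered DOM' % rendered_key)
    return signals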

Actionable audit checklist (for every run)

  1. HTTP pass for all URLs: capture headers, redirect chain, robots/meta, canonical
  2. Render pass for dynamic templates or sampled URLs: capture rendered title/meta/h1, JSON-LD
  3. Compute content_hash for duplicate detection and canonical mapping
  4. Gather web-vitals samples for a representative set of landing pages
  5. Export results to a searchable store and produce triaged lists: Critical (indexability, 500s, canonical mismatch), High (noindex, missing title), Medium (image alts, low word count), Low (meta length)

Example pipeline: triage & remediation ticket

Implement an item pipeline that scores each issue and emits remediation tasks:

  • Critical: status >= 500, robots noindex present, canonical pointing to competitor
  • High: title missing, meta description missing, structured data invalid
  • Medium: image alts missing, low word count
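
A minimal pipeline sketch along these lines; field names follow the spider above, and ticket creation is left as a stub to wire into your tracker:

class TriagePipeline:
    def process_item(self, item, spider):
        title = item.get('title') or item.get('title_static')
        desc = item.get('meta_description') or item.get('meta_description_static')
        severity = None
        if item.get('status', 200) >= 500 or 'noindex' in (item.get('robots_meta') or ''):
            severity = 'critical'
        elif not title or not desc:
            severity = 'high'
        elif item.get('images_missing_alt', 0) > 0 or item.get('word_count', 0) < 200:
            severity = 'medium'
        item['severity'] = severity
        if severity == 'critical':
            spider.logger.warning('CRITICAL SEO issue: %s', item['url'])
            # self.create_ticket(item)  # hypothetical hook into Jira/GitHub
        return item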

Ethics, compliance & crawl etiquette

Automated crawlers must respect terms-of-service, robots.txt, and privacy constraints. In 2026 many publishers and CDNs have stricter anti-bot controls — always:

  • Respect robots.txt (Scrapy setting)
  • Throttle requests and use exponential backoff on 429/403 (see the settings sketch after this list)
  • Use legal review for high-volume crawling of third-party sites
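
Alongside ROBOTSTXT_OBEY, a sketch of politeness settings (values are illustrative); note that Scrapy's built-in retry middleware re-requests these codes at a fixed delay rather than with true exponential backoff, so consider a custom retry middleware if you need backoff:

# settings.py additions
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
DOWNLOAD_DELAY = 0.5
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 403, 500, 502, 503, 504]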

Takeaways — what to implement this week

  • Set up a two-pass crawl: fast HTTP pass + conditional Playwright render pass.
  • Use scrapy-playwright and write async parse methods to extract rendered metadata, JSON-LD and basic web vitals.
  • Sample Lighthouse or CDP traces for a weekly deep audit; keep most checks lightweight.
  • Export to a searchable store and auto-create remediation tickets for critical issues.

Real audits in 2026 require render-aware crawling plus scale. Use Scrapy for orchestration and Playwright for truth.

Call to action

Ready to operationalize this cookbook? Clone the starter repo (includes settings, spider templates, and pipeline examples) and run a 1,000-URL audit in a staging environment. If you want, paste one failing URL here and I’ll provide a focused Playwright script and remediation ticket template you can drop into your workflow.
