Serverless Scraping Pipelines to Feed Analytics in ClickHouse


Unknown
2026-02-10
10 min read

Blueprint for building cost-efficient, autoscaling serverless scrapers that stage batches to S3 and bulk-load into ClickHouse for analytics.

Hook: Stop paying for idle VMs and brittle scrapers — build ephemeral serverless scrapers that bulk-load into ClickHouse

If your team is fighting IP bans, escalating cloud bills, and fragile scrapers that break every UI change, you need a different blueprint. In 2026 the right approach is to run ephemeral, autoscaling serverless scraping jobs that push compressed, deduplicated batches into ClickHouse for analytics. This minimizes cost, limits attack surface, and scales to millions of pages with predictable ingest performance.

Executive summary — what this blueprint delivers

  • Architecture: A queue-driven serverless executor + staging (S3) → bulk ingest into ClickHouse.
  • Autoscaling: Use serverless for bursts, worker pools for steady-state, and queue backpressure to control concurrency.
  • Cost control: Short-lived functions, compressed batched uploads, spot/preemptible compute for heavy work, and S3 staging to decouple scraping from DB ingestion.
  • Backfill strategy: Partitioned, idempotent loads using Parquet/Native files and S3-native file functions to bulk replay historic scrapes.
  • Observability & resilience: Checkpointing, dedupe keys, replayable tasks, and throttling to survive anti-bot defenses.

Why ClickHouse in 2026? Quick context

ClickHouse is now a mainstream OLAP backend for high-cardinality, high-throughput analytics. After major funding and rapid feature additions in late 2025, ClickHouse improved cloud integrations, S3-native ingestion patterns, and better native parsers—making it cost-effective for storing scraped telemetry at scale. That makes it an ideal target for bulk ingest from ephemeral scrapers.

Blueprint components — the end-to-end stack

  1. Task generator / scheduler — creates scraping tasks (URLs, selectors, metadata)
  2. Message queue — SQS, Pub/Sub, Kafka, or Pulsar to smooth bursts
  3. Serverless executor — Lambda, Cloud Run, Google Cloud Functions, Cloudflare Workers, or edge functions running short jobs
  4. Proxy & anti-blocking layer — residential/rotating proxies, headless browser service, fingerprinting controls
  5. Staging storage — S3-compatible bucket for compressed batch files (Parquet/NDJSON/GZIP)
  6. Bulk ingest worker — small container or function that reads staging, validates, and performs bulk INSERT into ClickHouse
  7. ClickHouse — partitioned MergeTree tables tuned for bulk load

How data flows

Tasks are queued → serverless functions fetch pages and produce NDJSON/Parquet blobs → blobs uploaded to S3 → bulk ingest worker consumes blobs and issues compressed INSERTs or uses ClickHouse S3 table function → materialized views populate aggregates.

Design patterns and concrete configs

1) Queue-first, throttle-last

Don’t let your scrapers directly hit targets uncontrolled. Always push work into a durable queue. The queue is your control plane for autoscaling and cost control.

  • Use visibility timeouts and retries with backoff.
  • Use per-domain rate tokens stored in Redis or DynamoDB to avoid bursting a single host.
  • Implement a cheap admission controller function that enforces global concurrency limits.

2) Serverless executor: keep functions tiny and stateless

Functions should do one thing: fetch a URL (via proxy or browserless), normalize output, and append to a local micro-batch. Upload the micro-batch to S3 when either a size threshold or time threshold is reached.

# Python-style pseudo handler (AWS Lambda example)
# fetch_with_proxy, parse, and flush_to_s3 are pseudo helpers
BATCH_SIZE = 50  # records per micro-batch before flushing

def handler(event, ctx):
    batch = []
    for task in event['records']:
        html = fetch_with_proxy(task['url'])  # fetch via proxy/browserless
        data = parse(html, task['spec'])      # normalize to a flat record
        batch.append(data)
        if len(batch) >= BATCH_SIZE:
            flush_to_s3(batch)                # compress + upload micro-batch
            batch = []
    if batch:                                 # flush the remainder
        flush_to_s3(batch)

Keep function memory and runtime small. For JavaScript/Node, prefer edge runtimes for tiny fetches; for heavier headless tasks use container-based serverless (Cloud Run, AWS Lambda container images) to include Playwright or Chromium.
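
The `flush_to_s3` call in the handler is left abstract. One plausible shape, assuming boto3 and a hypothetical staging bucket name, serializes the batch as NDJSON (JSONEachRow-compatible), gzips it, and uploads under a time-bucketed key:

```python
import gzip
import json
import time
import uuid

def encode_batch(batch: list) -> bytes:
    """Serialize a micro-batch as gzipped NDJSON (JSONEachRow-compatible)."""
    ndjson = "\n".join(json.dumps(row) for row in batch)
    return gzip.compress(ndjson.encode("utf-8"))

def flush_to_s3(batch, s3_client, bucket="my-scrape-staging"):
    """Upload one compressed micro-batch under a time-bucketed key.
    s3_client is e.g. boto3.client('s3'); the bucket name is hypothetical."""
    key = f"scrapes/{time.strftime('%Y-%m-%d/%H')}/{uuid.uuid4()}.jsonl.gz"
    s3_client.put_object(Bucket=bucket, Key=key, Body=encode_batch(batch),
                         ContentType="application/x-ndjson",
                         ContentEncoding="gzip")
    return key
```

Time-bucketed keys make later bulk loads and lifecycle rules trivial to scope by hour or day.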

3) Bulk staging formats — prefer Parquet or compressed NDJSON

Small inserts into ClickHouse are expensive. Aggregate into compressed batches. Two common formats:

  • Parquet: Columnar, efficient, best for large numeric datasets and fast ClickHouse ingestion using clickhouse-local or s3 table functions.
  • NDJSON (JSONEachRow) gzipped: Simple and easy; ClickHouse accepts JSONEachRow directly over the HTTP interface with compressed payloads.

4) Efficient ClickHouse ingest

Three practical ingestion methods:

  1. HTTP POST with compressed payloads: fast for medium-sized batches.
  2. S3 staging + ClickHouse S3 table function: best for very large backfills or parallel distributed loading.
  3. Native protocol via ClickHouse client libraries (bulk native format): low CPU overhead, good for high-concurrency pipelines.

# Example: HTTP bulk insert (bash)
gzip -c batch.jsonl | \
  curl -sS -X POST 'https://clickhouse.example.com:8443/?query=INSERT+INTO+analytics.scrapes+FORMAT+JSONEachRow' \
  --data-binary @- \
  -H 'Content-Encoding: gzip' -u user:pass

For huge backfills generate Parquet files and use ClickHouse's built-in S3 reader:

INSERT INTO analytics.scrapes
SELECT *
FROM s3('https://s3.amazonaws.com/my-bucket/path/*.parquet', 'AWS_KEY', 'AWS_SECRET', 'Parquet')

Autoscaling & cost control tactics

Use queue depth + consumer throttle

Autoscaling serverless purely on queue length is simple but dangerous during anti-bot escalations. Add a per-target concurrency counter and a dynamic delay: when error rates or CAPTCHA incidence spike, throttle consumers, route tasks to lower-cost reprocessing paths, and consider predictive detection of automated blocking patterns.
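
One way to sketch the dynamic delay: scale sleep time with the recent block/CAPTCHA rate, so escalations slow consumers down instead of amplifying traffic. The class and thresholds are illustrative:

```python
# Illustrative consumer throttle: delay grows linearly with the recent
# block/CAPTCHA rate observed over a sliding window of outcomes.
from collections import deque

class AdaptiveThrottle:
    def __init__(self, window: int = 100, base_delay: float = 0.5,
                 max_delay: float = 30.0):
        self.outcomes = deque(maxlen=window)  # True = blocked/CAPTCHA
        self.base_delay = base_delay
        self.max_delay = max_delay

    def record(self, blocked: bool) -> None:
        self.outcomes.append(blocked)

    def delay(self) -> float:
        """Seconds to sleep before the next fetch."""
        if not self.outcomes:
            return self.base_delay
        error_rate = sum(self.outcomes) / len(self.outcomes)
        # 0% errors -> base delay; 100% errors -> max delay
        return self.base_delay + error_rate * (self.max_delay - self.base_delay)
```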

Prefer micro-batches to single-row inserts

Bursting many tiny inserts into ClickHouse increases CPU and connection churn. Aim for 1–20 MB compressed batch sizes per upload (tradeoff between latency and efficiency).
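
A micro-batcher honoring both a size and an age threshold might look like the sketch below. Compressed size is only estimated from raw bytes (assuming roughly 5:1 for gzipped NDJSON), since compressing on every append would waste CPU; the class name and defaults are illustrative:

```python
import json
import time

class MicroBatcher:
    """Accumulate records; flush when the batch is big enough or old
    enough. Compressed size is estimated from raw bytes (assumed ~5:1
    ratio for gzipped NDJSON)."""

    def __init__(self, flush_fn, target_compressed_bytes=5_000_000,
                 max_age_s=30.0, assumed_ratio=5):
        self.flush_fn = flush_fn  # e.g. gzip + upload to S3
        self.raw_limit = target_compressed_bytes * assumed_ratio
        self.max_age_s = max_age_s
        self.rows = []
        self.raw_bytes = 0
        self.started = time.monotonic()

    def add(self, record: dict) -> None:
        line = json.dumps(record)
        self.rows.append(line)
        self.raw_bytes += len(line) + 1  # +1 for the newline
        too_big = self.raw_bytes >= self.raw_limit
        too_old = time.monotonic() - self.started >= self.max_age_s
        if too_big or too_old:
            self.flush_fn("\n".join(self.rows))
            self.rows, self.raw_bytes = [], 0
            self.started = time.monotonic()
```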

Pick the right serverless execution for the job

  • Tiny fetches: edge functions (Cloudflare Workers, Vercel Edge) — super low-latency and cheap.
  • Moderate scraping with JS execution: Cloud Run or Lambda with container image — supports headless browsers in 2026 via smaller browser runtimes and Wasm-based Playwright bits (edge & Wasm).
  • Heavy rendering / anti-bot fights: managed browser services (browserless, Playwright Cloud), or dedicated fleets (Spot/Preemptible instances) for cost savings.

Use pre-signed uploads and workerless transforms

Let scrapers upload directly to S3 via pre-signed URLs to offload network bandwidth from your control plane. Bulk ingest workers can then validate and ingest—this also reduces function execution time (lower cost).

Backfill strategies — make historical runs predictable

Backfills are when costs explode. Use controlled backfill patterns:

  • Partitioned target tables: Partition by date (event_date) and shard sensibly — ClickHouse MergeTree works well with date partitions for dropping/TTL-ing stale data.
  • Staged Parquet batches: Write historic snapshots to S3 in parallel, then run parallel INSERT FROM s3() queries to ClickHouse.
  • Throttled parallelism: Limit parallel backfill workers per-host and per-provider to avoid bans and cost spikes.
  • Idempotency: Attach a stable id (url hash + scrape timestamp + schema version). ClickHouse has no INSERT ... ON CONFLICT; dedupe instead with a ReplacingMergeTree keyed on that id in the ORDER BY, or stage rows in a dedupe table before moving them to the final tables.
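
The stable id from the idempotency point can be as simple as a SHA-256 over url, scrape timestamp, and schema version; the helper name and separator are illustrative:

```python
import hashlib

def scrape_id(url: str, scrape_ts: str, schema_version: int) -> str:
    """Stable dedupe key: identical replays of the same snapshot collide."""
    raw = f"{url}|{scrape_ts}|{schema_version}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

sid = scrape_id("https://example.com/p/1", "2026-02-10T00:00:00Z", 3)
```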

Example backfill flow

  1. Spawn N workers on spot instances to crawl a historic list.
  2. Workers write compressed Parquet files partitioned by date to s3://backfills/2025-11/.
  3. Run parallel ClickHouse queries: INSERT INTO table SELECT * FROM s3(..., 'Parquet').
  4. Use system.mutations and system.replication_queue to monitor progress.
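
Step 3 above can be driven by a small fan-out script over ClickHouse's HTTP interface. The host, credentials, and bucket layout are placeholders, and `run_backfill` is deliberately not invoked here:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

CH_URL = "https://clickhouse.example.com:8443/"  # placeholder host
AUTH = ("user", "pass")                          # placeholder creds

def partition_query(day: str) -> str:
    """Build one INSERT ... FROM s3() statement for a daily partition."""
    return (
        "INSERT INTO analytics.scrapes SELECT * FROM s3("
        f"'https://s3.amazonaws.com/backfills/2025-11/{day}/*.parquet', "
        "'AWS_KEY', 'AWS_SECRET', 'Parquet')"
    )

def load_partition(day: str) -> None:
    resp = requests.post(CH_URL, params={"query": partition_query(day)},
                         auth=AUTH, timeout=3600)
    resp.raise_for_status()

def run_backfill(days, workers: int = 4) -> None:
    """Throttled fan-out: a small worker pool keeps ClickHouse busy
    without overwhelming merges or S3 rate limits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(load_partition, days))
```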

Anti-bot, proxies and resilience in 2026

The anti-bot arms race continued through 2025. In 2026, fingerprinting gets harder to evade, and defenders use ML-based behavioral detection. The practical response:

  • Use high-quality proxy pools (residential/ISP routes) and rotate per-request.
  • Prefer managed browserless solutions that keep up with fingerprinting mitigations.
  • Implement adaptive retry: after a CAPTCHA or block, move the task into a quarantine queue for human review or a heavier worker.
  • Collect telemetry to detect site-side blocking signals early (JS challenges, geo redirects).

Pro tip: In 2026 many teams combine Wasm edge scraping for simple pages with on-demand containerized Playwright for heavy pages. This hybrid reduces both cost and block risk.

ClickHouse table design for scraped data

Design your table for both fast writes and flexible analytics:

CREATE TABLE analytics.scrapes (
  event_date Date DEFAULT toDate(scrape_ts),
  scrape_ts DateTime64(3),
  url String,
  domain String,
  user_agent String,
  status UInt16,
  body String,
  metadata JSON,
  content_hash String,
  schema_version UInt8
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (domain, content_hash, scrape_ts)
SETTINGS index_granularity = 8192;

Keys:

  • Partitioning by month allows efficient bulk backfills and TTL-based data lifecycle.
  • ORDER BY that includes domain and a stable hash simplifies dedupe and upsert-like operations.
  • JSON or nested columns for extracted fields (use ClickHouse's JSON functions or virtual columns) for flexible downstream parsing.

Monitoring, alerting and SLOs

Track these metrics:

  • Queue depth and task latency
  • Function error rates and cold start metrics
  • Proxy failure rates and CAPTCHA incidence
  • ClickHouse insert latency, mutation queue length, and disk pressure

Set SLOs like: 99% of scrapes uploaded to S3 within 60s, 99% of staging files ingested to ClickHouse within 10 minutes. Use dashboards and playbooks to route high-failure patterns to human triage.

Legal, privacy and compliance

Even with technical excellence, legal and privacy compliance is crucial:

  • Respect robots.txt and site ToS where required by policy.
  • Mask or redact PII before storing in ClickHouse. Use tokenization and encryption for sensitive fields.
  • Maintain provenance: store scrape job id, user agent, proxy id and timestamp for auditability.
  • Keep a legal contact & escalation playbook for takedown requests.
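
Masking PII before insert can be as simple as replacing emails with salted hash tokens, so analysts can still join on a stable token without ever seeing the raw address. The regex, salt handling, and token format here are illustrative:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def tokenize_pii(text: str, salt: str = "rotate-me") -> str:
    """Replace each email with a salted, truncated SHA-256 token."""
    def repl(m):
        digest = hashlib.sha256((salt + m.group()).encode()).hexdigest()
        return f"<email:{digest[:12]}>"
    return EMAIL_RE.sub(repl, text)
```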

Real-world example: scaling a product price indexer

Scenario: you need a daily product price index across 2M SKUs across 1,000 retailers.

  1. Task generation: nightly delta split into 100k tasks/day; queue depth controls spawning.
  2. Execution: edge functions scrape trivial APIs; containerized Playwright fetches dynamic pages. Micro-batches of 200 records uploaded to S3.
  3. Ingest: 2000 compressed Parquet files landed per hour → ClickHouse INSERT FROM s3() in parallel using 16 ingress workers.
  4. Backfill: historical 6 months regen done with 100 spot instances writing Parquet to S3, then staged into ClickHouse during off-peak hours.

Outcome: steady-state cost dropped 60% vs VM fleet model, ingestion latency kept under 10 minutes for 95% of records, and backfills completed predictably using staged batch loads.

Where the ecosystem is heading

  • Deeper ClickHouse cloud integrations and built-in S3 ingestion primitives are reducing coding overhead for bulk loads.
  • Serverless containers and Wasm runtimes continue to mature, making heavy JS rendering cheaper and more portable.
  • Anti-bot detection is increasingly ML-driven; expect vendor-specific countermeasures and more managed browser services.
  • Open-source projects now provide standardized NDJSON→Parquet transformers and ClickHouse loaders as battle-tested components.

Checklist: Deploy this blueprint in 10 steps

  1. Define schema and partition strategy in ClickHouse.
  2. Implement a task generator that emits domain-aware tasks to a queue.
  3. Build tiny serverless fetchers with pre-signed S3 uploads.
  4. Use a proxy manager and fingerprint telemetry collector.
  5. Aggregate micro-batches, compress, and upload to S3 in Parquet or gzipped NDJSON.
  6. Create an ingest worker that validates and inserts via HTTP or s3() reads.
  7. Instrument metrics and alarms for queue depth, ingestion lag, and error spikes.
  8. Plan backfill windows and use spot instances for heavy historic crawls.
  9. Enforce legal & privacy guardrails; store provenance metadata.
  10. Run dry-runs, test dedupe logic, and iteratively tune batch sizes.

Actionable takeaways

  • Decouple scraping from ingestion: use S3 staging to transform and dedupe before ClickHouse.
  • Batch, compress, and bulk insert: prefer Parquet or gzipped NDJSON over single-row writes.
  • Queue-driven autoscaling: control concurrency with tokens and backpressure to avoid bans and cost shocks.
  • Backfill with S3+ClickHouse: staged Parquet files let you replay historical data efficiently.

Tooling quick reference

  • Queues: AWS SQS, Google Pub/Sub, Apache Pulsar for durable control planes.
  • Serverless: Cloud Run / Lambda containers for heavy pages; edge workers for tiny fetches.
  • Headless: Playwright Cloud, browserless, or custom Lambda container with Chromium when needed.
  • Proxy providers: residential pools and ISP routes; always instrument failure rates.
  • ClickHouse: managed ClickHouse Cloud or Altinity for operational ease; use HTTP & s3() methods for ingest.
  • Format tools: pyarrow / duckdb for Parquet transforms in serverless-friendly runtimes.

Call to action

If you want a ready-to-deploy template, download our 2026 serverless scraping starter kit (includes Lambda/Cloud Run handlers, S3 staging scripts, and ClickHouse DDLs). Or contact our engineering team to run a 2-week proof-of-concept that backfills one month of historical scrapes into ClickHouse with predictable cost and latency.


Related Topics

#serverless #ClickHouse #architecture