
Build a Raspberry Pi 5 Edge Scraper with the AI HAT+ 2

scraper
2026-01-22
9 min read

Quickstart: Turn a Raspberry Pi 5 + AI HAT+ 2 into an edge scraper that parses HTML, classifies pages on-device, and pushes cleaned JSON upstream.

Run resilient scrapers at the edge — no cloud GPU required

If you manage scraping at scale you know the drill: IP bans, JS-heavy sites, cloud inference bills that balloon overnight, and fragile parsers that break when a DOM class changes. In 2026 the smarter move is to push parsing and simple classification to the edge. This quickstart shows how to turn a Raspberry Pi 5 with the new AI HAT+ 2 into an efficient, low-cost Pi scraper that performs local HTML parsing, runs on-device ML for page-type or spam detection, and reliably pushes cleaned JSON to a central store.

Why Pi5 + AI HAT+ 2 makes sense in 2026

Edge inference matured in 2024–2026. Vendors shipped more energy-efficient NPUs and compact LLM runtimes; open-source toolchains for ONNX/TFLite quantization stabilized in late 2025. According to industry coverage (Forbes, Jan 2026), structured/tabular pipelines and on-device processing are a major growth area. For scraping specifically, running classification and cleaning locally reduces egress costs, improves privacy, and makes your scraping fleet less dependent on cloud inference quotas.

Key benefits

  • Lower cloud costs — only send valuable, cleaned rows to central storage.
  • Resilience — inference continues even with transient network issues.
  • Privacy — sensitive data can be filtered before leaving the device.
  • Scalability — the Pi 5's CPU provides a compact, easily replicated scraping node.

What you'll build (in 30–60 minutes)

By the end of this guide you'll have a working Pi scraper that:

  1. Fetches HTML pages and extracts main text and metadata.
  2. Runs a lightweight on-device classifier (page-type / spam) using an exported ONNX or local rule-based fallback.
  3. Sends cleaned JSON to a central HTTP ingestion endpoint or S3 bucket with retries and exponential backoff.

Requirements

Hardware

  • Raspberry Pi 5 (64-bit OS recommended)
  • AI HAT+ 2 (vendor SDK + drivers)
  • MicroSD or NVMe (for storage), network connectivity (Ethernet recommended)

Software & libraries

  • Raspberry Pi OS (64-bit) or Ubuntu 22.04/24.04 for ARM64
  • Python 3.11+ (system Python or pyenv)
  • requests, beautifulsoup4, and trafilatura or readability-lxml
  • onnxruntime (or the vendor runtime, if the AI HAT+ 2 provides a specialized one)
  • boto3 (optional, for S3), confluent-kafka (optional)

Step 1 — Hardware & OS setup (fast)

Flash a 64-bit OS image and enable SSH. Use Raspberry Pi Imager or a headless installer. Boot and update packages:

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git python3-pip python3-venv

Attach the AI HAT+ 2 per the vendor instructions. In late 2025 many vendors shipped a one-command installer for HAT firmware — check the vendor README and run their installer to provision the NPU drivers and a user-space runtime (we'll reference a generic ai-hat-sdk in the examples below; replace it with the real vendor package name).

Step 2 — Python environment and SDK

Create a virtualenv and install core libraries:

python3 -m venv ~/pi-scraper/venv
source ~/pi-scraper/venv/bin/activate
pip install --upgrade pip
pip install requests beautifulsoup4 trafilatura onnxruntime boto3 backoff
# If vendor provides an SDK, install it too (example):
# pip install ai-hat-sdk

Note: if the AI HAT+ 2 includes a vendor runtime that accelerates ONNX, install that runtime instead of stock onnxruntime; the vendor docs will show how.

Step 3 — HTML parsing pipeline (robust extraction)

For scraping, get the main article/content block and metadata reliably. Two pragmatic options:

  • trafilatura — lightweight, precise extraction for articles and pages.
  • readability-lxml / BeautifulSoup — useful when you need fine-grained control.

Example extractor using trafilatura:

import requests
import trafilatura

def fetch_html(url, headers=None, timeout=15):
    headers = headers or {"User-Agent": "PiScraper/1.0 (+https://your.company)"}
    r = requests.get(url, headers=headers, timeout=timeout)
    r.raise_for_status()
    return r.text

def extract_main(html, url=None):
    result = trafilatura.extract(html, url=url, include_comments=False, include_tables=False)
    return {
        "text": result or "",
        "length": len(result or "")
    }

# Usage
# html = fetch_html("https://example.com/article")
# doc = extract_main(html, url="https://example.com/article")
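
If you also need the title and other metadata, trafilatura exposes a metadata extractor. A minimal sketch, assuming a recent trafilatura release where extract_metadata returns a document object with title/author/date attributes:

import trafilatura

def extract_meta(html, url=None):
    # Returns {} when trafilatura can't find any usable metadata
    meta = trafilatura.extract_metadata(html, default_url=url)
    if meta is None:
        return {}
    return {"title": meta.title, "author": meta.author, "date": meta.date}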

Tip: Add randomized delays and jitter between requests and rotate user-agents to reduce blocking. Pi5's CPU lets you run a moderate level of concurrency — tune to the target site's robots.txt and rate limits.
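
A minimal sketch of that advice, reusing fetch_html from above (the delay values and user-agent strings are illustrative):

import random
import time

USER_AGENTS = [
    "PiScraper/1.0 (+https://your.company)",
    "PiScraper/1.0 (node-02; +https://your.company)",
]

def polite_fetch(url, base_delay=2.0, jitter=1.5):
    # Randomized delay before each request to avoid a detectable cadence
    time.sleep(base_delay + random.uniform(0, jitter))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return fetch_html(url, headers=headers)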

Step 4 — On-device inference: rules + ONNX fallback

For quick deployment, use a two-tier approach:

  1. A tiny rule-based filter for clear cases (e.g., pages containing lots of spammy keywords).
  2. An exported ONNX classifier for ambiguous cases, accelerated by the AI HAT+ 2 NPU.

Rule-based example

def rule_classify(text):
    text_l = text.lower()
    spam_signals = ["buy now", "click here", "free trial", "limited time"]
    score = sum(1 for s in spam_signals if s in text_l)
    if score >= 2:
        return {"label": "spam", "score": 0.95}
    return {"label": "unknown", "score": 0.5}

ONNX inference example

Assume you've trained a small classifier offline (e.g., a logistic regression or a tiny transformer) and exported it to ONNX. Copy the model (classifier.onnx) and any preprocessing artifacts (e.g., the fitted TF-IDF vectorizer) onto the Pi.
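
For reference, here is a hedged sketch of that offline export step using scikit-learn and skl2onnx; run it on your training machine, not the Pi (load_training_data is a hypothetical helper):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import joblib

texts, labels = load_training_data()  # hypothetical helper returning two lists

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# zipmap=False makes the converter emit a plain probability array instead of
# a list of dicts; session.run then returns [labels, probabilities]
onnx_model = convert_sklearn(
    clf,
    initial_types=[("input", FloatTensorType([None, X.shape[1]]))],
    options={id(clf): {"zipmap": False}},
)
with open("classifier.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
joblib.dump(vectorizer, "tfidf_vectorizer.joblib")

Note that with zipmap=False the probabilities come back as result[1] rather than result[0], so adjust the indexing in the Pi-side code below if you export this way. On the Pi, inference then looks like this: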

import onnxruntime as ort
import joblib
import numpy as np

# Load artifacts
session = ort.InferenceSession("/home/pi/models/classifier.onnx")
vectorizer = joblib.load('/home/pi/models/tfidf_vectorizer.joblib')

def onnx_classify(text):
    X = vectorizer.transform([text])  # sparse matrix
    # onnxruntime expects a dense float32 array here; densify the sparse matrix
    input_name = session.get_inputs()[0].name
    X_arr = X.astype(np.float32).toarray()
    result = session.run(None, {input_name: X_arr})
    probs = result[0][0]  # assumes the model's first output is a probability vector
    label_idx = int(np.argmax(probs))
    labels = ["article", "spam", "list", "ad"]
    return {"label": labels[label_idx], "score": float(probs[label_idx])}

Vendor acceleration: if the AI HAT+ 2 provides an accelerated runtime, swap in its session loader (for example, a hypothetical ai_hat_rt.load_model()) to use the NPU. On a Pi 5, NPU offload can cut small-model inference latency from the low hundreds of milliseconds to 10–30 ms, though your mileage will vary.
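
To tie the two tiers together, a small dispatcher can try the cheap rules first and only pay for model inference on ambiguous pages. A sketch, assuming rule_classify and onnx_classify above are in scope:

def classify(text):
    # Tier 1: cheap rule-based filter for the obvious cases
    verdict = rule_classify(text)
    if verdict["label"] != "unknown":
        return verdict
    # Tier 2: ONNX model for ambiguous pages; degrade gracefully on failure
    try:
        return onnx_classify(text)
    except Exception:
        return {"label": "unknown", "score": 0.0}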

Step 5 — Clean & normalize (schema design)

Standardize your output so downstream pipelines can consume it easily. A minimal JSON schema:

{
  "url": "https://example.com/article",
  "fetched_at": "2026-01-18T12:34:56Z",
  "title": "...",
  "text": "...",
  "language": "en",
  "length": 1234,
  "classification": { "label": "article", "score": 0.97 },
  "source": "pi-node-01"
}

Include provenance fields (source node, model version, and confidence). This makes debugging and model rollbacks far easier in production.
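
A minimal builder that stamps those provenance fields onto every record (NODE_ID and MODEL_VERSION are illustrative values you would set per node and per deployed model):

import hashlib
from datetime import datetime, timezone

NODE_ID = "pi-node-01"
MODEL_VERSION = "2026-01-tfidf-lr-v1"  # illustrative version tag

def build_document(url, extracted, classification):
    # Add title/language here if your extractor provides them
    return {
        "url": url,
        "url_hash": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "text": extracted["text"],
        "length": extracted["length"],
        "classification": classification,
        "source": NODE_ID,
        "model_version": MODEL_VERSION,
    }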

Step 6 — Reliable ingestion: HTTP with retries and S3 as fallback

Send cleaned JSON to a central HTTP ingestion API. Use idempotency keys and retries. Here's a resilient sender using backoff:

import requests
import backoff
import json

INGEST_URL = "https://ingest.your.company/api/v1/documents"
API_KEY = "YOUR_API_KEY"

@backoff.on_exception(backoff.expo, (requests.exceptions.RequestException,), max_time=60)
def push_document(payload):
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    resp = requests.post(INGEST_URL, headers=headers, data=json.dumps(payload), timeout=10)
    resp.raise_for_status()
    return resp.json()

# Optional S3 fallback
import boto3
s3 = boto3.client('s3')

def s3_fallback(bucket, key, payload):
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(payload).encode('utf-8'))

# Usage in scraper flow:
# try: push_document(doc)
# except Exception:
#     s3_fallback('pi-scraper-fallback', f"{doc['source']}/{doc['url_hash']}.json", doc)

Idempotency: compute a deterministic hash of the canonicalized URL and include it as an idempotency key so the ingestion system can dedupe retried sends.
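
A sketch of that key (the canonicalization rules here are illustrative; pick a scheme and keep it stable across the fleet):

import hashlib
from urllib.parse import urlsplit, urlunsplit

def idempotency_key(url):
    # Lowercase scheme and host, drop the fragment, normalize the trailing slash
    parts = urlsplit(url)
    canonical = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",
    ))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# e.g. attach it as a header: {"Idempotency-Key": idempotency_key(url)}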

Step 7 — Run as a service and scale

Run your scraper as a systemd service for resilience. Example unit file (/etc/systemd/system/pi-scraper.service):

[Unit]
Description=Pi Scraper Node
After=network.target

[Service]
User=pi
WorkingDirectory=/home/pi/pi-scraper
ExecStart=/home/pi/pi-scraper/venv/bin/python -u run_scraper.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
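
Then reload systemd, enable the unit at boot, and tail its logs:

sudo systemctl daemon-reload
sudo systemctl enable --now pi-scraper.service
journalctl -u pi-scraper -f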

For fleet management and remote updates, use lightweight orchestration: Watchtower-style update scripts, or Mender / Canonical snaps for OTA delivery. For monitoring, log to a central Fluentd/Fluent Bit agent and emit metrics (scrapes/sec, inference latency, failure rates) to Prometheus remote_write or a local metrics endpoint, as sketched below.
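
A sketch of the metrics side, assuming the prometheus_client package and a scrape target on port 9101 (pick any free port):

from prometheus_client import Counter, Histogram, start_http_server

SCRAPES = Counter("pi_scraper_scrapes_total", "Pages fetched", ["status"])
INFER_LATENCY = Histogram("pi_scraper_inference_seconds", "Classifier latency")

start_http_server(9101)  # exposes /metrics for Prometheus to pull

# In the scraper loop:
# SCRAPES.labels(status="ok").inc()
# with INFER_LATENCY.time():
#     verdict = classify(text)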

Operational considerations & anti-blocking

  • Respect robots.txt and site ToS unless you have a legal review and explicit exemptions.
  • Proxying — use an internal proxy pool or commercial residential proxies if required. Keep local caching to lower request volume.
  • Headless execution — Playwright gained better ARM64 support in 2025; for JS-heavy pages run a headless browser selectively. Limit headless runs because they are expensive.
  • Adaptive scraping — detect 403/429 responses, honor Retry-After, and back off exponentially to avoid bans (see the sketch below).
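
A sketch of the first and last points: check robots.txt with the standard library, and back off on 403/429, honoring Retry-After when the server sends one.

import time
import urllib.robotparser
from urllib.parse import urlsplit
import requests

def allowed_by_robots(url, user_agent="PiScraper/1.0"):
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def fetch_with_backoff(url, max_retries=5):
    delay = 5
    for _ in range(max_retries):
        r = requests.get(url, timeout=15)
        if r.status_code in (403, 429):
            retry_after = r.headers.get("Retry-After", "")
            # Retry-After can also be an HTTP date; we only handle seconds here
            time.sleep(int(retry_after) if retry_after.isdigit() else delay)
            delay *= 2
            continue
        r.raise_for_status()
        return r.text
    raise RuntimeError(f"giving up on {url} after repeated 403/429 responses")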

Security, compliance and ethics

Filter PII on-device where possible. Store only the fields you need. Keep a data retention policy and log model decisions for audit. In 2026 regulators are increasingly focused on how scraped personal data is stored and used — involve legal early if your pipeline touches PII or regulated content.

Advanced topics & future-proofing

Model updates and federated learning

Instead of pushing all data up, consider an update pipeline where the central server aggregates feature deltas and ships periodic model updates (quantized ONNX) to nodes. Federated averaging and centralized fine-tuning became more practical in late 2025–2026 with compact differential update formats.

Tabular-first workflows

"From Text To Tables" — structured data extraction is a major AI frontier in 2026.

Extracting normalized tabular outputs (price, name, sku, phone) and pushing them as structured rows drastically increases downstream utility and reduces model costs. Store these rows in columnar stores (Parquet), DuckDB, or an OLAP engine.
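
A sketch of that hand-off, assuming pandas with a Parquet engine (pyarrow or fastparquet) installed; the row fields are illustrative:

import pandas as pd

rows = [
    {"url": "https://example.com/p/1", "name": "Widget", "sku": "W-100", "price": 9.99},
]
pd.DataFrame(rows).to_parquet("extracted_rows.parquet", index=False)

# Query locally with DuckDB, no server required:
# import duckdb
# duckdb.sql("SELECT sku, avg(price) FROM 'extracted_rows.parquet' GROUP BY sku").show()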

Observability

Log model version, input hashes, and inference latencies. For data drift, compute daily summary stats of predicted labels vs. signals (e.g., traffic, manual labels) and schedule retraining when distribution shifts.
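
One cheap drift signal is the total variation distance between today's predicted-label distribution and a trailing baseline; a sketch (the 0.15 threshold and schedule_retraining hook are hypothetical):

from collections import Counter

def label_distribution(labels):
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    # Half the L1 distance between two label-frequency dicts
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

# if total_variation(label_distribution(todays_labels), baseline) > 0.15:
#     schedule_retraining()  # hypothetical hook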

Quick checklist before you go to production

  • CPU + NPU utilization profiling — ensure model latency < target.
  • Model size & quantization — aim for sub-10MB models for the fastest updates.
  • Backoff & retry policies coded and tested against 429/503 errors.
  • Idempotency, logging, and schema versioning implemented.
  • Legal review for scraping targets and PII handling.

Example end-to-end flow (summarized)

  1. Scheduler picks a URL.
  2. Fetcher retrieves HTML (requests or Playwright if needed).
  3. Extractor pulls main text and metadata (trafilatura or readability).
  4. Local classifier runs (rule -> ONNX -> fallback).
  5. Cleaner normalizes fields and computes idempotency key.
  6. Sender posts to ingestion API with retries; S3 fallback if offline.
  7. Metrics, logs, and model traces flow to central monitoring.

Actionable takeaways

  • Start small: ship rule-based filters to unblock; add ONNX models once you confirm economics.
  • Model export: train on a beefy server, export quantized ONNX for Pi deployment.
  • Provenance: include node ID, model version, and timestamp in every record.
  • Cost-first design: filter and compress locally — only valuable rows are sent upstream (for a deeper cost playbook, see the call to action below).

Looking ahead

  • Smaller foundation models & tabular adapters: cheaper on-device classification and structured extraction.
  • Improved ARM runtimes: wider adoption of vendor-accelerated ONNX runtimes for NPUs.
  • Federated update pipelines: safer model distribution without shipping raw data.

Final notes

Deploying Pi5 nodes with AI HAT+ 2 gives you a high ROI for many scraping use-cases in 2026 — lower cloud cost, faster response to site changes, and stronger privacy controls. This quickstart is intentionally pragmatic: rule-based checks first, ONNX for ambiguous cases, and robust ingestion with backoff to keep your central store clean.

Call to action

Ready to prototype? Clone the sample repo (starter code, systemd unit, model conversion scripts) and run a single Pi node this afternoon. If you want a curated checklist and a production-ready model conversion script (TF-IDF -> ONNX, quantization presets), download our Pi Scraper starter kit and follow the vendor AI HAT+ 2 SDK notes to enable NPU acceleration. Reach out for an architecture review of your fleet and a cost estimate for scaling to 100+ nodes.
