Designing a Schema for Aggregating Local Reviews from Maps, Social and Directories
Practical guide to unify maps, social and directories into a canonical reviews table for analytics and sentiment training in 2026.
Stop losing insight to fragmented local reviews
Collecting local reviews from maps, social platforms and directories is routine. Turning those heterogeneous blobs into a reliable dataset for analytics and sentiment training is not. Teams waste weeks reconciling rating scales, deduplicating cross-posts, and tracking provenance before models see a single clean row. This guide gives a practical, production-ready schema and field mapping strategy that merges noisy inputs into a canonical reviews table fit for analytics, dashboards and sentiment model training in 2026.
Why canonicalization matters in 2026
Two trends make canonicalization urgent this year. First, tabular foundation models and AI systems increasingly expect wide, clean tables rather than free text, so structured review tables unlock higher-quality fine-tuning and retrieval-augmented tasks. Second, platforms tightened rate limits and enforcement in late 2024 through 2025, making every successful fetch more valuable. Your pipeline must preserve provenance and maximize reuse of fetched payloads.
Key outcomes this schema delivers
- Single source of truth for each review with reliable provenance
- Normalized ratings, timestamps and languages for cross-source comparability
- Deduplication and clustering logic to merge cross-posts while preserving source ids
- Fields engineered for ML including embeddings, sentiment scores and model versioning
- Compliance-ready flags for PII, consent and data retention
Design principles
- Store raw payloads in a cold lake and never discard them. Raw payloads are the fallback for disputes and model re-training.
- Keep provenance immutable. Each ingest should record source, API response headers and fetch metadata.
- Normalize aggressively but store originals. Normalized fields plus original fields let you debug mapping errors.
- Version everything - schema, mapping rules, embedding models and sentiment models.
- Make dedupe probabilistic and auditable. Avoid opaque heuristics; log cluster scores and decisions.
Canonical reviews table: schema and rationale
Below is a practical CREATE TABLE that works, with minor type adjustments, in Snowflake, BigQuery or Postgres (VARIANT is Snowflake's semi-structured type; use JSON in BigQuery and JSONB in Postgres, and TEXT/VARCHAR where STRING is not supported). Fields are grouped by intent: identity, content, normalization, provenance, ML, and compliance.
CREATE TABLE canonical_reviews (
canonical_review_id STRING PRIMARY KEY,
cluster_id STRING, -- group of near-duplicate reviews
canonical_business_id STRING, -- linked canonical business record
-- Source identity
source_domain STRING, -- e.g. google_maps, yelp, facebook
source_review_id STRING, -- review id from source
source_user_id STRING, -- user id on source platform
-- Raw payload
raw_payload VARIANT, -- full JSON from the API or scrape
raw_text STRING, -- original review text
-- Normalized content
content_normalized STRING, -- language-normalized, whitespace trimmed
content_lang STRING, -- detected language code
rating_orig FLOAT, -- original rating value as scraped
rating_norm FLOAT, -- normalized 1-5 float
-- Metrics and engagement
helpful_votes INT,
reply_text STRING,
reply_date TIMESTAMP,
-- Timestamps
source_created_at TIMESTAMP, -- timestamp from source
ingested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Location data
business_name STRING,
business_address STRING,
business_phone STRING,
business_lat FLOAT,
business_lng FLOAT,
geohash STRING,
-- ML fields
sentiment_score FLOAT, -- model polarity, e.g. -1..1
sentiment_label STRING, -- negative, neutral, positive
embedding_ref STRING, -- pointer to vector store id
embedding_model STRING,
sentiment_model_version STRING,
-- Deduplication / provenance
dedupe_confidence FLOAT,
canonicalized BOOLEAN DEFAULT FALSE,
chosen_variant STRING, -- which source review was used as canonical content
-- Compliance
pii_flag BOOLEAN DEFAULT FALSE,
consent_flag BOOLEAN DEFAULT TRUE,
retention_expire_at TIMESTAMP,
-- Audit
mapping_version STRING,
ingest_job_id STRING
);
Notes on key columns
- canonical_review_id should be a deterministic hash combining normalized content, normalized timestamp and business id. This makes re-ingests idempotent.
- cluster_id groups near-duplicates across sources. Keep cluster metadata in a separate clusters table for explainability.
- raw_payload is required for defense and feature work. Keep it compressed in your lake if size is a concern.
- rating_norm is normalized to a 1-5 scale as a float to simplify aggregation and model input.
- pii_flag should be set by automated PII detection and manual review workflows.
Practical field mappings for common sources
Different sources use different field names and semantics. Below are mapping snippets and normalization rules you can implement in your ETL. The field names are illustrative; validate them against the payloads your integrations actually return, since platforms rename and deprecate fields over time.
Google Business Profile (maps)
// mapping rules
source_domain = 'google_maps'
source_review_id = payload.reviewId
source_user_id = payload.authorId
raw_text = payload.comment
rating_orig = payload.starRating // 1-5; may arrive as an enum such as FIVE that needs mapping to a number
source_created_at = parse_timestamp(payload.createTime)
helpful_votes = payload.voteCount
Yelp
source_domain = 'yelp'
source_review_id = payload.id
source_user_id = payload.user.id
raw_text = payload.text
rating_orig = payload.rating // 1-5
source_created_at = parse_timestamp(payload.time_created)
helpful_votes = payload.useful + payload.funny + payload.cool // optional aggregation
Facebook / Meta
source_domain = 'facebook'
source_review_id = payload.review_id
source_user_id = payload.reviewer.id
raw_text = payload.review_text
rating_orig = payload.rating // sometimes stars or thumbs
if payload.recommendation_type == 'RECOMMEND' then rating_norm = 4.0 else rating_norm = 2.0 // heuristic for the recommend/not-recommend model
Directories and niche sites
Directories may use 1-10 scales, text like 'Good', or thumbs. Normalize with a mapping table maintained as code or config.
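For teams that normalize in the ETL layer rather than the warehouse, a minimal sketch of such a mapping table in Python follows; the label values and scale ranges are illustrative assumptions, and the SQL in the next section does the same job warehouse-side.
# Sketch: rating mapping table as code; labels, scales and values are illustrative
CATEGORICAL_RATINGS = {
    "thumbs_up": 5.0, "recommend": 5.0, "positive": 5.0,
    "good": 4.0, "average": 3.0, "poor": 2.0,
    "thumbs_down": 1.0, "not_recommend": 1.0, "negative": 1.0,
}
NUMERIC_SCALES = {  # source_domain -> (min, max) of its native numeric scale
    "google_maps": (1, 5),
    "yelp": (1, 5),
    "example_directory": (1, 10),  # hypothetical 1-10 directory
}

def normalize_rating(source_domain, rating_orig):
    """Return a 1-5 float, or None when the raw value cannot be interpreted."""
    if rating_orig is None:
        return None
    if isinstance(rating_orig, str) and rating_orig.lower() in CATEGORICAL_RATINGS:
        return CATEGORICAL_RATINGS[rating_orig.lower()]
    try:
        value = float(rating_orig)
    except (TypeError, ValueError):
        return None
    lo, hi = NUMERIC_SCALES.get(source_domain, (1, 5))
    if not lo <= value <= hi:
        return None
    return 1.0 + (value - lo) * 4.0 / (hi - lo)  # linear rescale onto 1-5
Keeping this table under version control and bumping mapping_version whenever it changes makes normalization decisions auditable.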
Rating normalization patterns
Common practice is to convert any scale to a 1-5 float. Below is a resilient SQL expression that handles numeric and categorical inputs; it assumes it runs against the staged raw value, before rating_orig is cast to FLOAT in the canonical table.
-- pseudocode SQL; substitute TRY_CAST (Snowflake) or SAFE_CAST (BigQuery) for CAST_SAFE_FLOAT
CASE
  WHEN rating_orig IS NULL THEN NULL
  WHEN rating_orig IN ('thumbs_up','positive','recommend') THEN 5.0
  WHEN rating_orig IN ('thumbs_down','negative','not_recommend') THEN 1.0
  WHEN CAST_SAFE_FLOAT(rating_orig) BETWEEN 1 AND 5 THEN CAST_SAFE_FLOAT(rating_orig)
  WHEN CAST_SAFE_FLOAT(rating_orig) BETWEEN 0 AND 10 THEN CAST_SAFE_FLOAT(rating_orig) / 2.0
  ELSE NULL
END AS rating_norm
Deduplication strategies
The same review often appears on multiple platforms from the same user with identical or lightly edited text. Dedupe strategies should be layered:
- Exact match on normalized content and business id. Use a canonical hash for immediate dedupe.
- Deterministic near-match using a normalized timestamp window (e.g. within 7 days), identical user phone or email, and high name similarity.
- Semantic dedupe with embeddings and ANN search to detect paraphrases. Use a conservative threshold like cosine > 0.92 for exact paraphrase and 0.85 for close paraphrase that needs manual review.
- Human-in-the-loop for borderline clusters. Maintain a review queue storing cluster candidates and decision metadata.
Example dedupe flow using embeddings
- Generate a 768-d embedding for each raw_text using a sentence transformer or small instruction-tuned model. Record embedding_model name and version.
- Index embeddings in FAISS or Milvus with timestamp windows to limit comparisons to recent reviews for the same business.
- For a new review, query the index. If nearest neighbor cosine > 0.92 and business proximity matches, assign cluster_id and set dedupe_confidence.
- Log the reason and retain all source ids. Choose canonical content based on earliest source_created_at or highest helpful_votes depending on downstream requirements. A minimal sketch of this flow follows.
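The sketch assumes the sentence-transformers and faiss packages; the model name, in-memory index and 0.92 threshold are illustrative choices rather than requirements.
# Sketch: embedding-based near-duplicate lookup (model and threshold are illustrative)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # 768-d; record the name as embedding_model

def build_index(texts):
    # normalized vectors make inner product equal to cosine similarity
    vecs = model.encode(texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def nearest_duplicate(index, new_text, threshold=0.92):
    vec = model.encode([new_text], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(vec, 1)
    score, idx = float(scores[0][0]), int(ids[0][0])
    # the caller assigns cluster_id and stores the score as dedupe_confidence when it clears the threshold
    return (idx, score) if score >= threshold else (None, score)
In production you would scope the index per business and time window, as described above, rather than search one global index.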
Identity resolution for businesses
Canonicalizing reviews without a canonical business table will fail. Build a business canonicalization process that combines:
- Normalized address comparison (split, street suffix normalization)
- Phone match after stripping country codes
- Geospatial proximity using haversine or geohash buckets for fuzzy matches
- Name fuzzy matching with token set ratio
Persist business ids in a canonical_business table and use foreign keys in canonical_reviews.
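A sketch of a composite match score combining these signals, assuming the rapidfuzz package; the weights and distance cutoffs are illustrative and should be tuned against labeled pairs.
# Sketch: composite business match score (weights and cutoffs are illustrative)
import math
from rapidfuzz import fuzz

def haversine_m(lat1, lng1, lat2, lng2):
    # great-circle distance in meters
    r = 6371000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def business_match_score(a, b):
    """a, b: dicts with name, phone (digits only), lat, lng. Returns a 0-1 score."""
    name_sim = fuzz.token_set_ratio(a["name"], b["name"]) / 100.0
    phone_match = 1.0 if a.get("phone") and a.get("phone") == b.get("phone") else 0.0
    dist = haversine_m(a["lat"], a["lng"], b["lat"], b["lng"])
    proximity = 1.0 if dist < 100 else max(0.0, 1.0 - dist / 1000.0)
    return 0.5 * name_sim + 0.3 * phone_match + 0.2 * proximity
Pairs above a high score can merge automatically; mid-range scores can share the human review queue used for borderline dedupe clusters.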
ML-ready columns and training artifacts
When preparing data for sentiment training, you want repeatable provenance and stable features. Include these fields in both the canonical table and exported training datasets.
- content_normalized - preprocessed text with lowercasing, unicode normalization and minimal tokenization; keep a copy with punctuation preserved for transformer models (a sketch follows this list)
- content_lang - language detection output so you can filter or train per-language models
- rating_norm - target for supervised sentiment if you want to predict star ratings
- sentiment_label - human labels or weak labels derived from rating ranges for classification tasks
- embedding_ref - reference to vector stores to avoid re-computing embeddings at training time
- model version fields - sentiment_model_version and embedding_model to track feature drift
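A minimal version of that preprocessing step; the content_preserved key is a hypothetical companion field, and NFKC plus whitespace collapsing are assumptions to adapt to your tokenizer.
# Sketch: build content_normalized plus a punctuation-preserved copy (field name is hypothetical)
import re
import unicodedata

def normalize_content(raw_text):
    text = unicodedata.normalize("NFKC", raw_text)
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    preserved = text                                     # punctuation kept for transformer models
    minimal = re.sub(r"[^\w\s]", "", text).lower()       # lowercased, punctuation stripped
    return {"content_normalized": minimal, "content_preserved": preserved}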
Weak labeling example for multi-task training
-- generate polarity label from normalized rating
sentiment_label = CASE
WHEN rating_norm IS NULL THEN 'unknown'
WHEN rating_norm <= 2.0 THEN 'negative'
WHEN rating_norm <= 3.5 THEN 'neutral'
ELSE 'positive'
END
Provenance and observability
Provenance is essential for auditing and legal compliance. Capture the following (a small example record follows the list):
- fetch_time, fetch_method (API vs scrape), http_response_code
- rate_limit headers and remaining quota
- mapping_version and ingest_job_id, so every row can be traced to the exact mapping rules and pipeline run that produced it
- raw_payload persisted in cold storage with a pointer in canonical_reviews
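The example record below mirrors the list above; the field names and values are illustrative.
# Sketch: per-fetch provenance record (values are illustrative)
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class FetchProvenance:
    fetch_time: str
    fetch_method: str            # "api" or "scrape"
    http_response_code: int
    rate_limit_remaining: int
    mapping_version: str
    ingest_job_id: str
    raw_payload_uri: str         # pointer to the payload persisted in cold storage

record = FetchProvenance(
    fetch_time=datetime.now(timezone.utc).isoformat(),
    fetch_method="api",
    http_response_code=200,
    rate_limit_remaining=118,
    mapping_version="2026-01-15",
    ingest_job_id="job_0421",
    raw_payload_uri="s3://reviews-lake/google_maps/2026-01-15/part-000.json.gz",
)
row_metadata = asdict(record)  # attach to the staged row or store in a sidecar table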
Compliance and privacy in 2026
Regulators and platforms have increased scrutiny of scraped data and PII since 2024. Best practices in 2026 include:
- Automated PII detection to set pii_flag and trigger redaction or restricted storage (a rough first pass is sketched after this list)
- Consent and retention flags aligned with your legal team and platform ToS
- Minimize storage of unique user identifiers unless you have a lawful basis
- Document your lawful basis and keep an audit trail for each ingest
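The sketch below is deliberately crude; treat the regexes as assumptions and back them with a dedicated PII detection service and human review.
# Sketch: rough first-pass PII detection to set pii_flag (patterns are not exhaustive)
import re

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),                                  # email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),                                      # phone-like digit runs
    re.compile(r"\b\d{1,5}\s+\w+\s+(street|st|ave|avenue|road|rd)\b", re.I),   # crude street addresses
]

def detect_pii(text):
    return any(p.search(text) for p in PII_PATTERNS)

# Reviews where detect_pii(...) is True should route to redaction or restricted storage.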
Performance and storage recommendations
- Keep raw payloads in compressed columnar Parquet in a data lake, partitioned by source_domain and ingest date (a write sketch follows this list).
- Store canonical_reviews in your data warehouse for fast analytics; consider clustering by canonical_business_id and geohash.
- Keep embeddings in a dedicated vector index and store only pointers in the warehouse to reduce cost.
- Use incremental upserts keyed by canonical_review_id and mapping_version to support backfills without duplicates.
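The sketch below assumes pyarrow and a pandas DataFrame of staged fetches; the output path and columns are illustrative.
# Sketch: persist raw payloads as Parquet partitioned by source_domain and ingest date
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "source_domain": ["google_maps", "yelp"],
    "ingest_date": ["2026-01-15", "2026-01-15"],
    "source_review_id": ["r1", "r2"],
    "raw_payload": ['{"comment": "Great service"}', '{"text": "Too slow"}'],
})

# Parquet files are compressed by default; pass write options if you prefer a different codec
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="data/raw_reviews",                     # swap for your s3:// or gcs:// lake path
    partition_cols=["source_domain", "ingest_date"],
)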
Operational patterns and code snippets
Deterministic canonical id
# python
import hashlib

def canonical_id(business_id, normalized_text, timestamp):
    # timestamp should already be normalized to UTC so re-ingests hash identically
    key = f"{business_id}|{normalized_text}|{timestamp.isoformat()}"
    return hashlib.sha256(key.encode('utf-8')).hexdigest()
Upsert pattern in SQL
-- assumes mapping_version compares correctly as a string (e.g. date-based or zero-padded)
MERGE INTO canonical_reviews t
USING staged_reviews s
ON t.canonical_review_id = s.canonical_review_id
WHEN MATCHED AND s.mapping_version > t.mapping_version THEN
UPDATE SET
t.raw_payload = s.raw_payload,
t.content_normalized = s.content_normalized,
t.rating_norm = s.rating_norm,
t.ingested_at = CURRENT_TIMESTAMP,
t.mapping_version = s.mapping_version
WHEN NOT MATCHED THEN
INSERT (canonical_review_id, source_domain, source_review_id, raw_text, content_normalized, rating_norm, ...)
VALUES (s.canonical_review_id, s.source_domain, s.source_review_id, s.raw_text, s.content_normalized, s.rating_norm, ...);
Advanced strategies and 2026 trends
Leverage recent advances to improve dedupe and feature engineering:
- Tabular foundation models now accept large structured tables directly. Use consistent column names and types to improve model transferability, and include mapping_version columns as model features to let models learn mapping drift.
- Embeddings at scale are cheaper in 2026. Use hybrid dedupe combining token-based fingerprints and dense embeddings for precision and recall balance.
- Data contracts and schema registries are mainstream. Publish your canonical schema to a registry and validate ingest pipelines against it to reduce downstream regressions.
- Model-in-the-loop where ML suggests dedupe clusters and human reviewers confirm borderline cases, speeding operations while preserving accuracy.
Common pitfalls and how to avoid them
- Discarding raw payloads. You will need them for disputes and model audits.
- Using opaque dedupe thresholds. Always record scores and reasons for auditability.
- Mismatched timezone handling. Normalize timestamps to UTC and store original timezone when available.
- Recomputing embeddings on every query. Store embeddings or pointers to avoid repeated costs.
Actionable checklist to implement today
- Start persisting raw payloads in compressed Parquet with source and ingest metadata.
- Implement the canonical_reviews schema in your warehouse and load a sample week of data.
- Build deterministic canonical ids and test idempotent upserts.
- Run a first-pass dedupe using exact and timestamp-window rules, then add embedding-based clustering for the remaining cases.
- Add mapping_version and model_version fields and wire them into your CI for retraining and rollback. Tie these signals into your observability and audit dashboards (see the provenance and observability section above).
Takeaways
- Canonicalization is a product and engineering effort, not just a one-off script. Expect iterative improvements.
- Store both raw and normalized data so you can audit and scale ML workflows.
- Deduplicate with explainability to support business decisions and compliance.
- Design for 2026 by integrating vector stores, tabular model readiness and schema registries from day one.
Practical canonicalization reduces noise, saves API calls and unlocks higher-quality sentiment models.
Next steps and call to action
Use the provided schema as a starting point and run a 2-week pilot: ingest one city across three sources, build canonical ids, run dedupe and export a training set. If you want a checklist, mapping templates for Google, Yelp, Facebook and a sample Airflow DAG that implements the flow above, download the starter kit on our engineering repo or reach out for a tailored schema review.