Designing a Schema for Aggregating Local Reviews from Maps, Social and Directories
Practical guide to unify maps, social and directories into a canonical reviews table for analytics and sentiment training in 2026.
Stop losing insight to fragmented local reviews
Collecting local reviews from maps, social platforms and directories is routine. Turning those heterogeneous blobs into a reliable dataset for analytics and sentiment training is not. Teams waste weeks reconciling rating scales, deduplicating cross-posts, and tracking provenance before models see a single clean row. This guide gives a practical, production-ready schema and field mapping strategy that merges noisy inputs into a canonical reviews table fit for analytics, dashboards and sentiment model training in 2026.
Why canonicalization matters in 2026
Two trends make canonicalization urgent this year. First, tabular foundation models and AI systems increasingly expect wide, clean tables rather than free text, so structured review tables unlock higher-quality fine-tuning and retrieval-augmented tasks. Second, platforms tightened rate limits and enforcement in late 2024 through 2025, making every successful fetch more valuable. Your pipeline must preserve provenance and maximize reuse of fetched payloads.
Key outcomes this schema delivers
- Single source of truth for each review with reliable provenance
- Normalized ratings, timestamps and languages for cross-source comparability
- Deduplication and clustering logic to merge cross-posts while preserving source ids
- Fields engineered for ML including embeddings, sentiment scores and model versioning
- Compliance-ready flags for PII, consent and data retention
Design principles
- Store raw payloads in a cold lake and never discard them. Raw payloads are the fallback for disputes and model re-training.
- Keep provenance immutable. Each ingest should record source, API response headers and fetch metadata.
- Normalize aggressively but store originals. Normalized fields plus original fields let you debug mapping errors.
- Version everything - schema, mapping rules, embedding models and sentiment models.
- Make dedupe probabilistic and auditable. Avoid opaque heuristics; log cluster scores and decisions.
Canonical reviews table: schema and rationale
Below is a practical CREATE TABLE that works, with minor type adjustments, in Snowflake, BigQuery or Postgres (VARIANT is Snowflake's semi-structured type; use JSON in BigQuery and JSONB in Postgres, and TEXT/VARCHAR where STRING is not supported). Fields are grouped by intent: identity, content, normalization, provenance, ML, and compliance.
CREATE TABLE canonical_reviews (
canonical_review_id STRING PRIMARY KEY,
cluster_id STRING, -- group of near-duplicate reviews
canonical_business_id STRING, -- linked canonical business record
-- Source identity
source_domain STRING, -- e.g. google_maps, yelp, facebook
source_review_id STRING, -- review id from source
source_user_id STRING, -- user id on source platform
-- Raw payload
raw_payload VARIANT, -- full JSON from the API or scrape
raw_text STRING, -- original review text
-- Normalized content
content_normalized STRING, -- language-normalized, whitespace trimmed
content_lang STRING, -- detected language code
rating_orig FLOAT, -- original rating value as scraped
rating_norm FLOAT, -- normalized 1-5 float
-- Metrics and engagement
helpful_votes INT,
reply_text STRING,
reply_date TIMESTAMP,
-- Timestamps
source_created_at TIMESTAMP, -- timestamp from source
ingested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Location data
business_name STRING,
business_address STRING,
business_phone STRING,
business_lat FLOAT,
business_lng FLOAT,
geohash STRING,
-- ML fields
sentiment_score FLOAT, -- model polarity, e.g. -1..1
sentiment_label STRING, -- negative, neutral, positive
embedding_ref STRING, -- pointer to vector store id
embedding_model STRING,
sentiment_model_version STRING,
-- Deduplication / provenance
dedupe_confidence FLOAT,
canonicalized BOOLEAN DEFAULT FALSE,
chosen_variant STRING, -- which source review was used as canonical content
-- Compliance
pii_flag BOOLEAN DEFAULT FALSE,
consent_flag BOOLEAN DEFAULT TRUE,
retention_expire_at TIMESTAMP,
-- Audit
mapping_version STRING,
ingest_job_id STRING
);
Notes on key columns
- canonical_review_id should be a deterministic hash combining normalized content, normalized timestamp and business id. This makes re-ingests idempotent.
- cluster_id groups near-duplicates across sources. Keep cluster metadata in a separate clusters table for explainability.
- raw_payload is required for defense and feature work. Keep it compressed in your lake if size is a concern.
- rating_norm is normalized to a 1-5 scale as a float to simplify aggregation and model input.
- pii_flag should be set by automated PII detection and manual review workflows.
Practical field mappings for common sources
Different sources use different field names and semantics. Below are mapping snippets and normalization rules you can implement in your ETL. The field names are illustrative; validate them against the payloads your integrations actually return, since platforms rename and deprecate fields over time.
Google Business Profile (maps)
// mapping rules
source_domain = 'google_maps'
source_review_id = payload.reviewId
source_user_id = payload.authorId
raw_text = payload.comment
rating_orig = payload.starRating // 1-5; may arrive as an enum such as FIVE that needs mapping to a number
source_created_at = parse_timestamp(payload.createTime)
helpful_votes = payload.voteCount
Yelp
source_domain = 'yelp'
source_review_id = payload.id
source_user_id = payload.user.id
raw_text = payload.text
rating_orig = payload.rating // 1-5
source_created_at = parse_timestamp(payload.time_created)
helpful_votes = payload.useful + payload.funny + payload.cool // optional aggregation
Facebook / Meta
source_domain = 'facebook'
source_review_id = payload.review_id
source_user_id = payload.reviewer.id
raw_text = payload.review_text
rating_orig = payload.rating // sometimes stars or thumbs
if payload.recommendation_type == 'RECOMMEND' then rating_norm = 4.0 else rating_norm = 2.0 // heuristic for the recommend/not-recommend model
Directories and niche sites
Directories may use 1-10 scales, text like 'Good', or thumbs. Normalize with a mapping table maintained as code or config.
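For teams that normalize in the ETL layer rather than the warehouse, a minimal sketch of such a mapping table in Python follows; the label values and scale ranges are illustrative assumptions, and the SQL in the next section does the same job warehouse-side.
# Sketch: rating mapping table as code; labels, scales and values are illustrative
CATEGORICAL_RATINGS = {
    "thumbs_up": 5.0, "recommend": 5.0, "positive": 5.0,
    "good": 4.0, "average": 3.0, "poor": 2.0,
    "thumbs_down": 1.0, "not_recommend": 1.0, "negative": 1.0,
}
NUMERIC_SCALES = {  # source_domain -> (min, max) of its native numeric scale
    "google_maps": (1, 5),
    "yelp": (1, 5),
    "example_directory": (1, 10),  # hypothetical 1-10 directory
}

def normalize_rating(source_domain, rating_orig):
    """Return a 1-5 float, or None when the raw value cannot be interpreted."""
    if rating_orig is None:
        return None
    if isinstance(rating_orig, str) and rating_orig.lower() in CATEGORICAL_RATINGS:
        return CATEGORICAL_RATINGS[rating_orig.lower()]
    try:
        value = float(rating_orig)
    except (TypeError, ValueError):
        return None
    lo, hi = NUMERIC_SCALES.get(source_domain, (1, 5))
    if not lo <= value <= hi:
        return None
    return 1.0 + (value - lo) * 4.0 / (hi - lo)  # linear rescale onto 1-5
Keeping this table under version control and bumping mapping_version whenever it changes makes normalization decisions auditable.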
Rating normalization patterns
Common practice is to convert any scale to a 1-5 float. Below is a resilient SQL expression that handles numeric and categorical inputs; it assumes it runs against the staged raw value, before rating_orig is cast to FLOAT in the canonical table.
-- pseudocode SQL; substitute TRY_CAST (Snowflake) or SAFE_CAST (BigQuery) for CAST_SAFE_FLOAT
CASE
  WHEN rating_orig IS NULL THEN NULL
  WHEN rating_orig IN ('thumbs_up','positive','recommend') THEN 5.0
  WHEN rating_orig IN ('thumbs_down','negative','not_recommend') THEN 1.0
  WHEN CAST_SAFE_FLOAT(rating_orig) BETWEEN 1 AND 5 THEN CAST_SAFE_FLOAT(rating_orig)
  WHEN CAST_SAFE_FLOAT(rating_orig) BETWEEN 0 AND 10 THEN CAST_SAFE_FLOAT(rating_orig) / 2.0
  ELSE NULL
END AS rating_norm
Deduplication strategies
The same review often appears on multiple platforms from the same user with identical or lightly edited text. Dedupe strategies should be layered:
- Exact match on normalized content and business id. Use a canonical hash for immediate dedupe.
- Deterministic near-match using a normalized timestamp window (e.g. within 7 days), identical user phone or email, and high name similarity.
- Semantic dedupe with embeddings and ANN search to detect paraphrases. Use a conservative threshold like cosine > 0.92 for exact paraphrase and 0.85 for close paraphrase that needs manual review.
- Human-in-the-loop for borderline clusters. Maintain a review queue storing cluster candidates and decision metadata.
Example dedupe flow using embeddings
- Generate a 768-d embedding for each raw_text using a sentence transformer or small instruction-tuned model. Record embedding_model name and version.
- Index embeddings in FAISS or Milvus with timestamp windows to limit comparisons to recent reviews for the same business.
- For a new review, query the index. If nearest neighbor cosine > 0.92 and business proximity matches, assign cluster_id and set dedupe_confidence.
- Log the reason and retain all source ids. Choose canonical content based on earliest source_created_at or highest helpful_votes depending on downstream requirements. A minimal sketch of this flow follows.
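The sketch assumes the sentence-transformers and faiss packages; the model name, in-memory index and 0.92 threshold are illustrative choices rather than requirements.
# Sketch: embedding-based near-duplicate lookup (model and threshold are illustrative)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # 768-d; record the name as embedding_model

def build_index(texts):
    # normalized vectors make inner product equal to cosine similarity
    vecs = model.encode(texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def nearest_duplicate(index, new_text, threshold=0.92):
    vec = model.encode([new_text], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(vec, 1)
    score, idx = float(scores[0][0]), int(ids[0][0])
    # the caller assigns cluster_id and stores the score as dedupe_confidence when it clears the threshold
    return (idx, score) if score >= threshold else (None, score)
In production you would scope the index per business and time window, as described above, rather than search one global index.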
Identity resolution for businesses
Canonicalizing reviews without a canonical business table will fail. Build a business canonicalization process that combines:
- Normalized address comparison (split, street suffix normalization)
- Phone match after stripping country codes
- Geospatial proximity using haversine or geohash buckets for fuzzy matches
- Name fuzzy matching with token set ratio
Persist business ids in a canonical_business table and use foreign keys in canonical_reviews.
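A sketch of a composite match score combining these signals, assuming the rapidfuzz package; the weights and distance cutoffs are illustrative and should be tuned against labeled pairs.
# Sketch: composite business match score (weights and cutoffs are illustrative)
import math
from rapidfuzz import fuzz

def haversine_m(lat1, lng1, lat2, lng2):
    # great-circle distance in meters
    r = 6371000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def business_match_score(a, b):
    """a, b: dicts with name, phone (digits only), lat, lng. Returns a 0-1 score."""
    name_sim = fuzz.token_set_ratio(a["name"], b["name"]) / 100.0
    phone_match = 1.0 if a.get("phone") and a.get("phone") == b.get("phone") else 0.0
    dist = haversine_m(a["lat"], a["lng"], b["lat"], b["lng"])
    proximity = 1.0 if dist < 100 else max(0.0, 1.0 - dist / 1000.0)
    return 0.5 * name_sim + 0.3 * phone_match + 0.2 * proximity
Pairs above a high score can merge automatically; mid-range scores can share the human review queue used for borderline dedupe clusters.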
ML-ready columns and training artifacts
When preparing data for sentiment training, you want repeatable provenance and stable features. Include these fields in both the canonical table and exported training datasets.
- content_normalized - preprocessed text with lowercasing, unicode normalization and minimal tokenization; keep a copy with punctuation preserved for transformer models (a sketch follows this list)
- content_lang - language detection output so you can filter or train per-language models
- rating_norm - target for supervised sentiment if you want to predict star ratings
- sentiment_label - human labels or weak labels derived from rating ranges for classification tasks
- embedding_ref - reference to vector stores to avoid re-computing embeddings at training time
- model version fields - sentiment_model_version and embedding_model to track feature drift
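A minimal version of that preprocessing step; the content_preserved key is a hypothetical companion field, and NFKC plus whitespace collapsing are assumptions to adapt to your tokenizer.
# Sketch: build content_normalized plus a punctuation-preserved copy (field name is hypothetical)
import re
import unicodedata

def normalize_content(raw_text):
    text = unicodedata.normalize("NFKC", raw_text)
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    preserved = text                                     # punctuation kept for transformer models
    minimal = re.sub(r"[^\w\s]", "", text).lower()       # lowercased, punctuation stripped
    return {"content_normalized": minimal, "content_preserved": preserved}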
Weak labeling example for multi-task training
-- generate polarity label from normalized rating
sentiment_label = CASE
WHEN rating_norm IS NULL THEN 'unknown'
WHEN rating_norm <= 2.0 THEN 'negative'
WHEN rating_norm <= 3.5 THEN 'neutral'
ELSE 'positive'
END
Provenance and observability
Provenance is essential for auditing and legal compliance. Capture the following (a small example record follows the list):
- fetch_time, fetch_method (API vs scrape), http_response_code
- rate_limit headers and remaining quota
- mapping_version and ingest_job_id, so every row can be traced to the exact mapping rules and pipeline run that produced it
- raw_payload persisted in cold storage with a pointer in canonical_reviews
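The example record below mirrors the list above; the field names and values are illustrative.
# Sketch: per-fetch provenance record (values are illustrative)
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class FetchProvenance:
    fetch_time: str
    fetch_method: str            # "api" or "scrape"
    http_response_code: int
    rate_limit_remaining: int
    mapping_version: str
    ingest_job_id: str
    raw_payload_uri: str         # pointer to the payload persisted in cold storage

record = FetchProvenance(
    fetch_time=datetime.now(timezone.utc).isoformat(),
    fetch_method="api",
    http_response_code=200,
    rate_limit_remaining=118,
    mapping_version="2026-01-15",
    ingest_job_id="job_0421",
    raw_payload_uri="s3://reviews-lake/google_maps/2026-01-15/part-000.json.gz",
)
row_metadata = asdict(record)  # attach to the staged row or store in a sidecar table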
Compliance and privacy in 2026
Regulators and platforms have increased scrutiny of scraped data and PII since 2024. Best practices in 2026 include:
- Automated PII detection to set pii_flag and trigger redaction or restricted storage (a rough first pass is sketched after this list)
- Consent and retention flags aligned with your legal team and platform ToS
- Minimize storage of unique user identifiers unless you have a lawful basis
- Document your lawful basis and keep an audit trail for each ingest
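The sketch below is deliberately crude; treat the regexes as assumptions and back them with a dedicated PII detection service and human review.
# Sketch: rough first-pass PII detection to set pii_flag (patterns are not exhaustive)
import re

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),                                  # email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),                                      # phone-like digit runs
    re.compile(r"\b\d{1,5}\s+\w+\s+(street|st|ave|avenue|road|rd)\b", re.I),   # crude street addresses
]

def detect_pii(text):
    return any(p.search(text) for p in PII_PATTERNS)

# Reviews where detect_pii(...) is True should route to redaction or restricted storage.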
Performance and storage recommendations
- Keep raw payloads in compressed columnar Parquet in a data lake, partitioned by source_domain and ingest date (a write sketch follows this list).
- Store canonical_reviews in your data warehouse for fast analytics; consider clustering by canonical_business_id and geohash.
- Keep embeddings in a dedicated vector index and store only pointers in the warehouse to reduce cost.
- Use incremental upserts keyed by canonical_review_id and mapping_version to support backfills without duplicates.
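The sketch below assumes pyarrow and a pandas DataFrame of staged fetches; the output path and columns are illustrative.
# Sketch: persist raw payloads as Parquet partitioned by source_domain and ingest date
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "source_domain": ["google_maps", "yelp"],
    "ingest_date": ["2026-01-15", "2026-01-15"],
    "source_review_id": ["r1", "r2"],
    "raw_payload": ['{"comment": "Great service"}', '{"text": "Too slow"}'],
})

# Parquet files are compressed by default; pass write options if you prefer a different codec
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="data/raw_reviews",                     # swap for your s3:// or gcs:// lake path
    partition_cols=["source_domain", "ingest_date"],
)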
Operational patterns and code snippets
Deterministic canonical id
# python
import hashlib

def canonical_id(business_id, normalized_text, timestamp):
    # timestamp should already be normalized to UTC so re-ingests hash identically
    key = f"{business_id}|{normalized_text}|{timestamp.isoformat()}"
    return hashlib.sha256(key.encode('utf-8')).hexdigest()
Upsert pattern in SQL
-- assumes mapping_version compares correctly as a string (e.g. date-based or zero-padded)
MERGE INTO canonical_reviews t
USING staged_reviews s
ON t.canonical_review_id = s.canonical_review_id
WHEN MATCHED AND s.mapping_version > t.mapping_version THEN
UPDATE SET
t.raw_payload = s.raw_payload,
t.content_normalized = s.content_normalized,
t.rating_norm = s.rating_norm,
t.ingested_at = CURRENT_TIMESTAMP,
t.mapping_version = s.mapping_version
WHEN NOT MATCHED THEN
INSERT (canonical_review_id, source_domain, source_review_id, raw_text, content_normalized, rating_norm, ...)
VALUES (s.canonical_review_id, s.source_domain, s.source_review_id, s.raw_text, s.content_normalized, s.rating_norm, ...);
Advanced strategies and 2026 trends
Leverage recent advances to improve dedupe and feature engineering:
- Tabular foundation models now accept large structured tables directly. Use consistent column names and types to improve model transferability, and include mapping_version columns as model features to let models learn mapping drift.
- Embeddings at scale are cheaper in 2026. Use hybrid dedupe combining token-based fingerprints and dense embeddings for precision and recall balance.
- Data contracts and schema registries are mainstream. Publish your canonical schema to a registry and validate ingest pipelines against it to reduce downstream regressions.
- Model-in-the-loop where ML suggests dedupe clusters and human reviewers confirm borderline cases, speeding operations while preserving accuracy.
Common pitfalls and how to avoid them
- Discarding raw payloads. You will need them for disputes and model audits.
- Using opaque dedupe thresholds. Always record scores and reasons for auditability.
- Mismatched timezone handling. Normalize timestamps to UTC and store original timezone when available.
- Recomputing embeddings on every query. Store embeddings or pointers to avoid repeated costs.
Actionable checklist to implement today
- Start persisting raw payloads in compressed Parquet with source and ingest metadata.
- Implement the canonical_reviews schema in your warehouse and load a sample week of data.
- Build deterministic canonical ids and test idempotent upserts.
- Run a first-pass dedupe using exact and timestamp-window rules, then add embedding-based clustering for the remaining cases.
- Add mapping_version and model_version fields and wire them into your CI for retraining and rollback. Tie these signals into your observability and audit dashboards (see the provenance and observability section above).
Takeaways
- Canonicalization is a product and engineering effort, not just a one-off script. Expect iterative improvements.
- Store both raw and normalized data so you can audit and scale ML workflows.
- Deduplicate with explainability to support business decisions and compliance.
- Design for 2026 by integrating vector stores, tabular model readiness and schema registries from day one.
Practical canonicalization reduces noise, saves API calls and unlocks higher-quality sentiment models.
Next steps and call to action
Use the provided schema as a starting point and run a 2-week pilot: ingest one city across three sources, build canonical ids, run dedupe and export a training set. If you want a checklist, mapping templates for Google, Yelp, Facebook and a sample Airflow DAG that implements the flow above, download the starter kit on our engineering repo or reach out for a tailored schema review.