coefficient-portfolio

Review Intelligence: Multi-Agent Review Collection Pipeline

A six-agent system that collects, processes, and manages product reviews and competitive intelligence across multiple marketplaces. Features progressive fallback web scraping, multi-level deduplication, and automated Google Sheets insertion.

Architecture

Master Orchestration Agent (coordinates sequential execution)
    ↓
┌──────────────────────────────────────────────┐
│ Data Collection Layer                         │
│ ├── Product Scraper (5 marketplaces, ALL ratings) │
│ ├── Competitor Scraper (marketplace sources, 1-3★) │
│ └── Alternative Sources Scraper (Capterra/Reddit) │
└──────────────────────────────────────────────┘
    ↓
Data Cleaning Agent (normalization + deduplication)
    ↓
Google Sheets Insertion Agent (batch insert + duplicate detection)

Pipeline Stages

Stage 1: Product Review Collection

Scrapes ALL ratings from 5 marketplace platforms
Pre-scraping deduplication check against existing data
Progressive fallback: Exa Search → Firecrawl → Browserbase
Platform-specific scraping strategies

Stage 2: Competitor Review Collection

Collects only 1-3 star (negative) reviews from 10+ competitor tools
Strict URL constraints per marketplace
Browser automation for filtering-restricted platforms

Stage 3: Alternative Source Collection

Targets competitors not on traditional marketplaces
Sources: Capterra, SoftwareAdvice, Reddit
Reddit sentiment analysis maps posts to 1-3 star scale:
- 1★: “terrible”, “awful”, “complete waste”
- 2★: “disappointing”, “doesn’t work well”
- 3★: “okay but…”, “has issues”

Stage 4: Data Cleaning

Date normalization (handles relative dates: “2 days ago”, various formats)
Rating standardization (text → numeric 1-5 scale)
Content processing (HTML removal, encoding fixes, truncation at 2000 chars)
Reviewer name cleaning (anonymous handling, 50-char limit)
URL validation (HTTPS enforcement, tracking param removal)

Stage 5: Multi-Level Deduplication

Level 1 (Exact): URL match, content + name + date match
Level 2 (Fuzzy): >90% content similarity, same reviewer near-date match
Level 3 (Cross-Source): Same review appearing in both product and competitor sheets

Stage 6: Google Sheets Insertion

Batch processing with exponential backoff for rate limiting
Always inserts at row 2 (newest first) to keep recent reviews visible
Pre-insertion deduplication against existing sheet data
Post-insertion verification with row count validation

Agent Specs

Agent	Role	Key Capability
Master Orchestrator	Pipeline coordinator	Sequential execution, error handling, partial-result acceptance
Product Scraper	Own-product review collection	5-marketplace coverage, all ratings
Competitor Scraper	Competitive intelligence	10+ tools, negative reviews only (1-3★)
Alt Sources Scraper	Non-marketplace collection	Capterra, SoftwareAdvice, Reddit sentiment analysis
Data Cleaner	Normalization + dedup	3-level deduplication, date/rating/content standardization
Sheets Inserter	Data persistence	Batch insert, row-2 strategy, duplicate prevention

Progressive Fallback Tool Strategy

Tier 1 (Cost-Effective): Exa Search + Firecrawl
    ↓ (if blocked/restricted)
Tier 2 (Browser Automation): Browserbase

Try primary tools first → detect failure type → use Browserbase only when needed
Platform-specific preferences (Exa preferred for app stores, Firecrawl for review sites)
Session reuse for multiple pages from same domain
Immediate session cleanup to minimize cost

Data Schema

Product Reviews (7 columns)

Competitor Reviews (8 columns)

Results

36 reviews collected across 4 platforms in initial run
80% platform coverage (4/5 platforms fully accessible)
Browserbase breakthrough: Bypassed cookie consent restrictions on one marketplace, yielding 10 previously inaccessible reviews (38% increase in collection rate)
High data quality with complete metadata
Zero duplicates in final dataset

Stack

Agents: Claude Code multi-agent (6 agents, orchestrated sequential pipeline)
Web scraping: Firecrawl MCP, Exa MCP (primary), Browserbase MCP (fallback)
Data storage: Google Sheets MCP (2-sheet architecture)
Community scraping: Reddit MCP (sentiment-based review extraction)

Key Decisions

Negative reviews only for competitors — Collecting 1-3 star reviews provides competitive intelligence on weaknesses without overwhelming data volume.
Progressive fallback over single tool — Exa/Firecrawl handle 80% of cases cheaply; Browserbase reserved for restricted platforms.
Row-2 insertion strategy — Newest reviews always at top for visibility. Avoids append-to-bottom which buries recent data.
Reddit sentiment mapping — Converts freeform community complaints into structured 1-3 star ratings using keyword analysis.
Cross-source deduplication — Prevents the same review from appearing in both product and competitor sheets.

Visual Direction

Diagram: Layered architecture showing collection layer (3 scrapers) feeding into cleaning layer (normalization + 3-level dedup), then insertion layer (batch processing + row-2 strategy). Side panel shows progressive fallback decision tree (Exa → Firecrawl → Browserbase). Color-coded by marketplace source.