
How do I collect Yelp or Amazon reviews every week and keep a clean historical dataset?
Most teams that depend on Yelp or Amazon reviews hit the same wall: grabbing a one-off snapshot is easy, but keeping a weekly, deduplicated, historically consistent dataset is where things fall apart. Scripts break, sites change their HTML, and you end up with messy CSVs no one trusts.
This guide walks through how to set up a weekly review collection pipeline for Yelp and Amazon using Apify, keep it running in the cloud, and maintain a clean historical dataset you can query, export, or feed into dashboards and AI models.
Quick Answer: Use Apify Actors like Yelp Scraper and Amazon review scrapers, schedule weekly runs in Apify Console, and write each run’s output into an append-only historical dataset (via API or integrations like Airbyte/BigQuery). Use stable IDs (review ID + product/business ID + timestamp) to deduplicate and track changes over time.
The Quick Overview
- What It Is: A cloud-based workflow using Apify to scrape Yelp and Amazon reviews on a schedule, store them in structured datasets, and keep a consistent historical record.
- Who It Is For: Data teams, product managers, and analysts who need recurring review data for sentiment analysis, competitive tracking, or AI training—without babysitting fragile scrapers.
- Core Problem Solved: You stop manually scraping or maintaining DIY scripts and instead run repeatable, monitored Actors that output clean, exportable datasets every week.
How It Works
At a high level, you:
- Pick or configure review-scraping Actors in the Apify Store (e.g., Yelp Reviews Scraper, Yelp Business Search Scraper, Amazon review scrapers).
- Define the products (Amazon) or businesses (Yelp) you want to track and how many reviews you want per run.
- Schedule weekly runs in Apify Console and send each run’s output into a historical store (Apify dataset, warehouse, or lake) with deduplication rules.
Each Actor run is a unit of work: you give it input (URLs, IDs, filters), Apify handles proxies/unblocking/cloud execution, and you get a structured dataset you can export as JSON/CSV/Excel or pull via API.
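As a minimal sketch of the "pull via API" step, the snippet below builds the URL for the Apify API's dataset items endpoint and downloads a run's items with the standard library only. The dataset ID and token are placeholders you would supply, and the `clean=true` option should be checked against the current API reference before you rely on it.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "json", clean: bool = True) -> str:
    """Build the download URL for all items of one dataset."""
    params = {"format": fmt}
    if clean:
        # Assumption: ask the API to omit empty items and hidden fields.
        params["clean"] = "true"
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


def fetch_items(dataset_id: str, token: str) -> list:
    """Download a run's dataset as a list of dicts (performs a network call)."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_items("<DATASET_ID>", "<APIFY_TOKEN>")` after a run finishes would return that run's reviews, ready to be appended to your historical store.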
1. Define your review scope
First decide:
- Platforms: Yelp, Amazon, or both.
- Entities:
  - Yelp: specific business URLs or search queries (e.g., “coffee near Austin”).
  - Amazon: ASINs or product URLs in specific marketplaces (.com, .de, etc.).
- Depth & cadence:
  - Weekly is typical; some teams run daily for very active products.
  - Decide whether each run should fetch:
    - Only new reviews since last run (preferred), or
    - A rolling window (e.g., last 3 months) with deduplication.
Write this down as a list of “tracked entities”—a simple table with an ID, URL, platform, and status.
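One possible shape for that table, sketched in Python. The class name, the status values ("active" / "paused"), and the sample rows are illustrative assumptions; in practice the list could just as well live in a spreadsheet or a warehouse table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedEntity:
    entity_id: str  # Yelp business ID or Amazon ASIN
    url: str
    platform: str   # "yelp" or "amazon"
    status: str     # "active" or "paused" (hypothetical status values)


# Illustrative rows only; replace with your own businesses and products.
TRACKED = [
    TrackedEntity("coffee-shop-austin", "https://www.yelp.com/biz/coffee-shop-austin", "yelp", "active"),
    TrackedEntity("coffee-shop-dallas", "https://www.yelp.com/biz/coffee-shop-dallas", "yelp", "paused"),
    TrackedEntity("B08N5WRWNW", "https://www.amazon.com/dp/B08N5WRWNW", "amazon", "active"),
]


def active_entities(entities, platform):
    """Return the entities of one platform that should be scraped this run."""
    return [e for e in entities if e.platform == platform and e.status == "active"]
```

Keeping a status column lets you pause an entity without losing its history or its place in the list.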
2. Choose Actors for Yelp and Amazon
For Yelp:
Use Actors from the Apify Store like:
- Yelp Reviews Scraper
  - Input: Yelp business URLs, Yelp business IDs, or Yelp search URLs (by keyword, location, or custom URL).
  - Output: review text, rating, author info, vote counts (useful/funny/cool), photos, plus optional business details.
  - Works across all Yelp country-specific domains.
  - Can scrape up to ~10,000 reviews in a few minutes, depending on proxy configuration and limits.
  - Includes a personal data toggle to comply with GDPR/CCPA if you don’t want reviewer profiles.
- Yelp Business Search Scraper
  - Input: search queries on Yelp.com (keywords, locations).
  - Output: business names, addresses, reviews, ratings, hours, and more in structured JSON.
  - Ideal to discover which businesses you want to track, then feed those into the Reviews Scraper.
For Amazon:
In the Apify Store you’ll find multiple Actors for Amazon products and reviews (by ASIN, URL, or search). Typical capabilities:
- Input: product URLs, ASINs, marketplace code, review filters (e.g., star rating, “verified purchase”).
- Output: review text, rating, title, date, reviewer name, verified flag, helpful votes, product metadata.
Pick an Actor that:
- Supports your marketplaces (e.g., amazon.com, amazon.co.uk).
- Exposes stable identifiers (ASIN + review ID) in the output.
- Has options to control how many pages or reviews per run.
3. Configure your initial runs
In Apify Console:
- Open your chosen Yelp/Amazon Actor from the Apify Store.
- Click Try for free or Run to open the configuration UI.
- Add a small set of test inputs:
  - Yelp: 1–2 business URLs.
  - Amazon: 1–2 product URLs or ASINs.
- Set limits:
  - Maximum number of review pages per input.
  - Any filters (date ranges, ratings).
- Run the Actor once and inspect:
  - Logs for errors/blocks.
  - Dataset tab for fields: make sure you see review ID, product/business ID, timestamps, rating, and text.
Once the test run looks good, you can scale out to your full list of entities.
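The field check on a test run can also be automated. This is a rough sketch: the field names in `REQUIRED_FIELDS` are assumptions, since each Actor names its output fields differently, so map them to whatever your chosen Actor actually emits.

```python
# Assumed field names; adjust to the output schema of the Actor you picked.
REQUIRED_FIELDS = ("review_id", "entity_id", "scraped_at", "rating", "text")


def missing_fields(item: dict) -> list:
    """List required fields that are absent or empty in one scraped item."""
    return [name for name in REQUIRED_FIELDS if item.get(name) in (None, "")]


def validate_run(items: list) -> dict:
    """Map item index -> missing fields, only for items that fail the check."""
    problems = {}
    for index, item in enumerate(items):
        missing = missing_fields(item)
        if missing:
            problems[index] = missing
    return problems
```

An empty result from `validate_run` on the test dataset is a reasonable signal that the output has what deduplication will need later.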
4. Turn runs into a historical dataset
The core trick is: treat each run as an incremental batch and write its results into an append-only historical store with deduplication.
You have three main patterns:
Pattern A: Historical dataset in Apify
- Each Actor run writes its results to that run's own default Apify dataset.
- You can:
  - Push every run's results into one shared, named dataset (if the Actor or a follow-up step supports it) and deduplicate downstream, or
  - Keep each run in its own dataset (the default) and add a separate consolidation step that:
    - Reads all runs via Apify API.
    - Writes into a single “golden” dataset or external warehouse.
Use a simple dedup key like:
platform + business_or_asin + review_id
If you want to track edits/changes to reviews (e.g., user updates their review), add:
platform + business_or_asin + review_id + scraped_at_week
Then you can see how reviews evolve over time.
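The two keys above might be built like this. The item field names (`platform`, `entity_id`, `review_id`, `scraped_at`) are assumptions about your normalized schema, and `scraped_at` is assumed to be an ISO 8601 timestamp without a trailing "Z".

```python
from datetime import datetime


def dedup_key(item: dict) -> tuple:
    """Identity of a review, independent of when it was scraped."""
    return (item["platform"], item["entity_id"], item["review_id"])


def versioned_key(item: dict) -> tuple:
    """Identity of a review as seen in one ISO week, for tracking edits."""
    scraped = datetime.fromisoformat(item["scraped_at"])
    year, week, _ = scraped.isocalendar()
    return dedup_key(item) + (f"{year}-W{week:02d}",)
```

Two scrapes of the same review in different weeks share a `dedup_key` but get distinct `versioned_key` values, so both versions can coexist in the history.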
Pattern B: Direct to data warehouse / lake
If you’re already running BigQuery, Snowflake, Redshift, or a lake like S3/Delta:
- Use integrations (e.g., Airbyte, webhooks, custom Actor) to:
  - Pull dataset output (JSON, CSV).
  - Load into your historical table.
- Define your table schema with:
  - platform (yelp/amazon)
  - entity_id (business ID or ASIN)
  - review_id (site-specific ID)
  - review_text, rating, author, etc.
  - scraped_at (run timestamp)
  - Optional review_created_at and review_updated_at if available.
Apply deduplication in SQL:
ROW_NUMBER() OVER (
PARTITION BY platform, entity_id, review_id
ORDER BY scraped_at DESC
)
Then keep only the rows where this row number equals 1 to get the “latest view” of each review.
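To try the window-function approach end to end without a warehouse, here is a small sketch using Python's built-in `sqlite3` (window functions need SQLite 3.25 or newer). The table name `reviews_history` and the reduced column set are illustrative assumptions; the same query shape should carry over to BigQuery, Snowflake, or Redshift with minor dialect changes.

```python
import sqlite3

# Rank each review's snapshots newest-first, then keep only the newest one.
LATEST_VIEW_SQL = """
SELECT platform, entity_id, review_id, rating, scraped_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY platform, entity_id, review_id
               ORDER BY scraped_at DESC
           ) AS rn
    FROM reviews_history
) AS ranked
WHERE rn = 1
ORDER BY platform, entity_id, review_id
"""


def latest_view(rows):
    """Load (platform, entity_id, review_id, rating, scraped_at) rows and dedupe."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reviews_history ("
        "platform TEXT, entity_id TEXT, review_id TEXT, rating INTEGER, scraped_at TEXT)"
    )
    conn.executemany("INSERT INTO reviews_history VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(LATEST_VIEW_SQL).fetchall()
    conn.close()
    return result
```

The history table stays append-only; the "latest view" is just a query over it, so older snapshots remain available for edit tracking.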
Pattern C: Direct feed into AI/RAG workflows
If your end goal is to feed reviews into vector DBs (Pinecone, Qdrant, etc.) for RAG pipelines:
- Treat Apify datasets as your raw layer.
- After each run:
  - Trigger a pipeline (via webhook) that:
    - Reads new reviews from the dataset.
    - Cleans text (strip markup, normalize).
    - Computes embeddings.
    - Upserts into the vector DB with a stable review_id.
This way, your LLM always has up-to-date review signals.
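A rough sketch of the clean-and-upsert steps follows. The `index` dict only stands in for a vector database, and `embed` is a callable you would supply (for example a wrapper around your embedding model); neither represents the API of Pinecone, Qdrant, or any other product.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse runs of whitespace."""
    without_tags = _TAG.sub(" ", raw)
    return _WHITESPACE.sub(" ", html.unescape(without_tags)).strip()


def upsert_reviews(index: dict, reviews: list, embed) -> int:
    """Write reviews into `index`, keyed by review_id so re-runs overwrite.

    `index` is a plain dict standing in for a vector DB (assumption);
    `embed` is any callable mapping text to a vector (assumption).
    """
    written = 0
    for review in reviews:
        text = clean_text(review["text"])
        if not text:
            continue  # nothing useful to embed
        index[review["review_id"]] = {
            "vector": embed(text),
            "text": text,
            "rating": review.get("rating"),
        }
        written += 1
    return written
```

Keying the upsert on `review_id` means an edited review replaces its old vector instead of creating a second, stale entry.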
5. Schedule weekly runs
Once you’re happy with a configuration:
- In Apify Console, go to your Actor.
- Click Schedules.
- Create a new schedule:
  - Frequency: weekly (or daily, if you need more granularity).
  - Time: align with your ETL windows (e.g., Sunday night).
  - Input: store a JSON config that includes your full list of entities.
Typical JSON input pattern:
{
"businessUrls": [
"https://www.yelp.com/biz/coffee-shop-austin",
"https://www.yelp.com/biz/coffee-shop-dallas"
],
"maxReviewPages": 10,
"includeReviewerProfiles": false
}
Or for Amazon:
{
"asins": ["B08N5WRWNW", "B07FZ8S74R"],
"marketplace": "US",
"maxReviewPages": 15,
"minStars": 1,
"maxStars": 5
}
Apify will:
- Run the Actor according to schedule.
- Handle proxies and unblocking under the hood.
- Store each run’s dataset and logs for you to inspect or pull via API.
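Rather than editing the schedule's JSON by hand, the input can be generated from your tracked-entity list. This is a sketch under assumptions: the field names mirror the example patterns above and may not match the input schema of the Actor you picked, and the cron string is just one way to express "Sunday night".

```python
# Standard 5-field cron: minute hour day-of-month month day-of-week (0 = Sunday).
WEEKLY_CRON = "0 22 * * 0"  # Sundays at 22:00 in the schedule's timezone


def build_yelp_input(business_urls, max_review_pages=10):
    """Assemble a Yelp input dict using the field names from the example above."""
    return {
        "businessUrls": sorted(set(business_urls)),  # de-duplicate, stable order
        "maxReviewPages": max_review_pages,
        "includeReviewerProfiles": False,
    }


def build_amazon_input(asins, marketplace="US", max_review_pages=15):
    """Assemble an Amazon input dict using the field names from the example above."""
    return {
        "asins": sorted(set(asins)),
        "marketplace": marketplace,
        "maxReviewPages": max_review_pages,
        "minStars": 1,
        "maxStars": 5,
    }
```

Serializing the result with `json.dumps(...)` gives the text to paste into (or send to) the schedule's input.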
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Scheduled review collection | Runs Yelp/Amazon review Actors weekly (or daily) from Apify Console. | No manual scraping; guaranteed cadence for your datasets. |
| Structured datasets per run | Outputs JSON/CSV/Excel or via API with review IDs, ratings, text, and metadata. | Easy to integrate with warehouses, BI tools, or AI pipelines. |
| Append-only historical storage | Appends new runs to a historical dataset or warehouse table. | Clean history for trend analysis and time-based reporting. |
| Proxy & unblocking layer | Uses Apify’s proxy and unblocking infrastructure for Yelp/Amazon. | Higher reliability, fewer blocks, less maintenance. |
| Monitoring & alerts | Logs, run statuses, and optional notifications when something breaks. | You know when scrapers fail before stakeholders do. |
| API & integrations | Access via Apify API, Python/JS clients, Airbyte, or webhooks to downstream tools. | Fits into existing ETL/ELT and AI pipelines with minimal glue. |
Ideal Use Cases
- Best for weekly review analytics and trend tracking: Because you can maintain a long-term history of ratings, sentiment, and review volume by product or location, without rewriting scrapers every few months.
- Best for feeding AI models and RAG pipelines: Because structured, timestamped review data is easy to transform into embeddings and load into vector databases, giving your LLM context-rich, up-to-date consumer feedback.
Limitations & Considerations
- Site terms and compliance: Always review Yelp’s and Amazon’s terms of service and ensure your use of public data is compliant. For Yelp, you can disable reviewer profiles via the personal data toggle to align with GDPR/CCPA requirements and only collect non-personal review content.
- Platform changes and rate limits: Sites can change HTML structures or introduce new anti-bot measures. Using Apify’s maintained Actors and proxy/unblocking stack reduces breakage, but you should:
  - Monitor failure rates and error logs.
  - Keep alerting in place.
  - Consider Apify Professional Services if you need guaranteed maintenance.
- Not a replacement for official APIs where required: If you’re bound by specific partner agreements, you may need to use official APIs where available. The Apify-based approach is for public web data where scraping is permitted.
Pricing & Plans
Apify uses a usage-based model:
- You pay for platform resources (compute units, storage, proxies) used by your Actor runs.
- Many Store Actors (including Yelp scrapers) offer free trial usage so you can validate the workflow.
Typical approach for weekly review scraping:
- Start with a pay-as-you-go or entry plan to estimate:
  - Cost per run (Yelp + Amazon).
  - Weekly/monthly volume.
- Scale up to a higher plan or enterprise if:
  - You track hundreds or thousands of entities.
  - You need dedicated support, SLAs, and custom compliance.
Example fit (not official plan names):
- Starter / Pay-as-you-go: Best for small teams tracking a limited set of products or locations, experimenting with weekly review pulls and simple exports.
- Business / Enterprise: Best for companies tracking large portfolios across multiple markets, needing guaranteed uptime (99.95%), SOC2/GDPR/CCPA compliance, and possibly custom Actors maintained by Apify Professional Services.
For precise pricing and fit, it’s worth having Apify sales size your workload based on approximate number of tracked entities and runs per week.
Frequently Asked Questions
How do I avoid duplicate reviews when scraping every week?
Short Answer: Use stable IDs (review ID + product/business ID) and deduplicate in your historical dataset or warehouse.
Details:
Both Yelp and Amazon expose identifiers you can use for deduplication—either explicit review IDs or URLs that contain them. In your schema, define a unique key like:
platform + entity_id (ASIN or business ID) + review_id
On each weekly run:
- Append all scraped reviews into a staging table or dataset.
- Use SQL (or your data tool) to:
  - Keep only the latest row per unique key (for “current” view), or
  - Store versions over time with scraped_at as part of the key if you want to track edits.
If you stay entirely in Apify, you can write a small custom Actor that:
- Reads the latest run’s dataset.
- Merges it into a “golden” dataset with your dedup logic.
- Exposes that dataset as your primary API/export surface.
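The merge step of such a custom Actor could look roughly like the sketch below. Here `golden` is an in-memory dict standing in for the "golden" dataset, the field names are assumptions about your schema, and `scraped_at` values are assumed to be ISO 8601 strings in one consistent format so that string comparison orders them correctly.

```python
# Fields whose change counts as an edit to the review (assumed names).
CONTENT_FIELDS = ("rating", "text")


def merge_into_golden(golden: dict, run_items: list) -> dict:
    """Merge one run into `golden` in place and report what happened."""
    stats = {"inserted": 0, "updated": 0, "unchanged": 0}
    for item in run_items:
        key = (item["platform"], item["entity_id"], item["review_id"])
        current = golden.get(key)
        if current is None:
            golden[key] = item
            stats["inserted"] += 1
        elif item["scraped_at"] > current["scraped_at"] and any(
            item.get(field) != current.get(field) for field in CONTENT_FIELDS
        ):
            # Newer snapshot with different content: treat as an edited review.
            golden[key] = item
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1
    return stats
```

Logging the returned counts on every weekly run gives a cheap health signal: a week with zero inserted reviews across all entities is worth a look.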
How can I keep the pipeline stable if Yelp or Amazon change their HTML?
Short Answer: Rely on maintained Apify Store Actors and monitoring, and consider Apify Professional Services if uptime is critical.
Details:
DIY scrapers break because they sit on top of:
- Changing HTML and CSS selectors.
- New anti-bot tactics.
- Proxy/IP reputation issues.
With Apify:
- Store Actors like Yelp Reviews Scraper and Amazon scrapers are actively maintained by their authors and community.
- Apify’s infrastructure handles proxies, unblocking, cloud execution, and retries.
- You get logs and status for each run; you can set up alerts via webhooks, Slack, or monitoring tools.
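For the alerting piece, a webhook receiver mostly needs to decide whether an incoming event is worth a notification. The sketch below assumes Apify's default webhook payload shape (an `eventType` string plus an `eventData` object carrying `actorRunId`); verify both against the webhook documentation, and treat the message wording as a placeholder.

```python
# Run events that should page someone; successful runs are ignored.
FAILURE_EVENTS = {"ACTOR.RUN.FAILED", "ACTOR.RUN.ABORTED", "ACTOR.RUN.TIMED_OUT"}


def alert_message(payload: dict):
    """Return an alert string for failed runs, or None if no alert is needed."""
    event = payload.get("eventType")
    if event not in FAILURE_EVENTS:
        return None
    data = payload.get("eventData") or {}
    run_id = data.get("actorRunId", "unknown")
    return f"Review scraper run {run_id} ended with {event}"
```

The returned string can then be forwarded to Slack, email, or whatever channel your team already watches.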
For mission-critical workloads (e.g., enterprise reporting or AI pipelines that must refresh on schedule), Apify Professional Services can:
- Build and maintain custom scrapers tailored to your targets.
- Monitor them for breakages.
- Update them as sites evolve.
That way, your team focuses on analysis and modeling, not HTML changes.
Summary
If you want weekly Yelp or Amazon reviews with a clean historical record, you don’t need a fragile zoo of scripts. You need:
- A stable scraping unit – Apify Actors from the Store (Yelp Reviews Scraper, Yelp Business Search Scraper, Amazon review Actors) that handle the heavy lifting: crawling, parsing, proxies, unblocking.
- A schedule – weekly runs from Apify Console with well-defined inputs (business URLs, ASINs).
- A historical store – an append-only dataset or warehouse table with stable IDs and deduplication logic.
Once in place, this pipeline quietly collects review data week after week, giving you a trusted source for trend analysis, dashboards, and AI workflows.