
My scheduled headless browser scraper keeps failing overnight—how do I add retries, run logs, and reliable storage without building a full orchestration system?
If you’re waking up to failed Playwright/Puppeteer jobs and half-written JSON files, you’re hitting the classic “scraper as a cron script” ceiling. You don’t actually need to build a full orchestration system from scratch—you need retries, logs, storage, and scheduling bundled into something you can treat as a single deployable unit.
That’s exactly where Apify’s Actor model helps: you wrap your headless browser logic into an Actor, let the platform handle retries, run logs, storage, and scheduling, then consume the resulting dataset via API.
Quick Answer: Apify Actors let you run your existing headless browser scraper on a managed platform with built-in retries, run logs, key-value storage, and datasets. You keep your scraping logic; Apify takes over scheduling, unblocking, monitoring, and data delivery, so the job doesn’t fall over at 3 a.m.
The Quick Overview
- What It Is: A cloud runtime for your headless browser scrapers where each scraper is an “Actor” with automatic retries, logs, storage, and datasets you can fetch via API.
- Who It Is For: Engineers and data teams running scheduled Playwright/Puppeteer/Selenium/Crawlee scrapers that are brittle in cron/CI and need real observability and reliability without building a full control plane.
- Core Problem Solved: Turning “a flaky headless browser script on a server” into “a monitored, scheduled Actor with retries and structured output” so you stop firefighting overnight failures.
How It Works
At a high level, you take the scraper you already have and wrap it in an Apify Actor. The Actor runs in Apify’s cloud with proxies, unblocking, retries, logs, and storage out of the box. Every run produces a dataset you can inspect in Apify Console or pull via API into your app, data warehouse, or AI pipeline.
The lifecycle looks like this:
1. Wrap & Deploy:
   - Use Apify SDK + Crawlee, Playwright, or Puppeteer.
   - Push your project from local dev to Apify as an Actor.
2. Configure & Schedule:
   - Define input (URLs, search terms, dates, etc.).
   - Configure retries, concurrency, and timeouts.
   - Set a schedule (e.g., every night at 01:00 UTC).
3. Run, Monitor, Consume:
   - Apify executes the Actor, manages proxies/unblocking, and persists logs, key-value storage, and datasets.
   - You inspect run logs, get alerts on failures, and read the final dataset via HTTP API / Python / JavaScript / MCP client.
Under the hood, Apify handles the bits you don’t want to reinvent: cloud deployment, proxies, unblocking, monitoring, and data processing—so your scraper stops being a fragile cron script and becomes a reusable data service.
How to add retries, logs, and storage to your headless browser scraper
1. Add robust retries at the Actor level
Instead of baking ad-hoc try/catch + setTimeout logic into your code, you configure retries at the crawling layer: Crawlee re-queues each failed request up to a limit you set, and if the run as a whole still fails, it can be re-run on the platform.
Example with Crawlee + Playwright in an Actor:
```javascript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3, // retry each failed URL up to 3 times
    requestHandlerTimeoutSecs: 60, // a handler running longer than 60 s counts as failed
    browserPoolOptions: {
        maxOpenPagesPerBrowser: 5,
    },
    async requestHandler({ page, request, log, pushData }) {
        log.info(`Processing ${request.url}`);

        // The crawler has already navigated to request.url, so wait for the
        // network to settle instead of calling page.goto() a second time.
        await page.waitForLoadState('networkidle');

        const data = await page.evaluate(() => {
            // your scraping logic here
            return {
                title: document.title,
                url: location.href,
            };
        });

        await pushData(data); // writes to the run dataset
    },
});

await crawler.run(['https://example.com']);

await Actor.exit();
```
If the whole run fails (e.g., a site outage), you can resurrect or re-trigger the run from Apify Console or the API, without touching your code.
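If you re-trigger failed runs from your own code, the decision logic can stay small. The sketch below is only an illustration, not a built-in Apify mechanism: `shouldRetryRun` and `backoffDelayMs` are hypothetical helpers, and the status strings assume the terminal run statuses SUCCEEDED, FAILED, TIMED-OUT, and ABORTED.

```javascript
// Hypothetical helpers for deciding when to re-trigger a whole Actor run.
// Assumption: terminal run statuses are SUCCEEDED, FAILED, TIMED-OUT, ABORTED.
const RETRYABLE_STATUSES = new Set(['FAILED', 'TIMED-OUT']);

// Returns true when a finished run should be started again.
// `attempt` is the number of attempts already made (1 after the first run).
function shouldRetryRun(status, attempt, maxAttempts = 3) {
    return RETRYABLE_STATUSES.has(status) && attempt < maxAttempts;
}

// Exponential backoff: 1 min, 2 min, 4 min, ... capped at 30 min.
function backoffDelayMs(attempt, baseMs = 60_000, capMs = 30 * 60_000) {
    return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}

console.log(shouldRetryRun('FAILED', 1)); // true
console.log(backoffDelayMs(3)); // 240000
```

ABORTED is deliberately left out of the retryable set here, on the assumption that an aborted run was stopped on purpose.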
2. Get run logs you can actually debug from
When your script runs as an Actor, every log.info, log.warning, and log.error is collected and stored with the run. In Apify Console, each run has:
- Real-time logs (with search and filtering).
- Screenshots or HTML snapshots you save into key-value storage.
- Exit codes and error messages, including stack traces.
Inside your Actor code:
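```javascript
import { Actor, log } from 'apify';

await Actor.init();

const url = 'https://example.com'; // placeholder for the page being processed

log.info('Run started');
// ...
log.debug('Loaded page X'); // only shown when the log level is set to DEBUG
// ...
log.error('Unexpected DOM structure', { url });

await Actor.exit();
```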
If a run fails overnight, you open the run detail, see exactly where it died, inspect stored snapshots, and adjust your selectors or timing without SSH-ing into a random VM.
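Snapshots are easier to find afterwards if every page maps to a predictable key. The following is a minimal sketch under stated assumptions: `snapshotKey` is a hypothetical helper, and the key rules it enforces (letters, digits, `-`, `_`, `.`, well under 256 characters) are a conservative guess that you should check against the key-value store documentation.

```javascript
// Hypothetical helper: turn a URL into a key that is safe for a key-value store.
// Assumption: keys should use only letters, digits, '-', '_' and '.', and stay
// well under 256 characters; check the platform docs for the exact rules.
function snapshotKey(url, suffix) {
    const slug = url
        .replace(/^https?:\/\//, '') // drop the scheme
        .replace(/[^a-zA-Z0-9_.-]+/g, '-') // collapse runs of unsafe characters
        .replace(/^-+|-+$/g, ''); // trim leading/trailing dashes
    return `${slug.slice(0, 200)}.${suffix}`;
}

// Inside a Playwright request handler you might then write (not executed here):
//   const png = await page.screenshot({ fullPage: true });
//   await Actor.setValue(snapshotKey(request.url, 'png'), png, { contentType: 'image/png' });
//   await Actor.setValue(snapshotKey(request.url, 'html'), await page.content(), { contentType: 'text/html' });

console.log(snapshotKey('https://example.com/products?page=2', 'png'));
// example.com-products-page-2.png
```

Two long URLs that share their first 200 slug characters would collide under this scheme; appending a short hash of the full URL is one way around that.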
3. Use reliable storage instead of ad-hoc files
With Actors, you get three main storage primitives out of the box:
- Dataset: Append-only collection of items (JSON) for your scraped records.
- Key-value store (KVS): Arbitrary blobs (HTML, JSON, screenshots).
- Request queue: Managed queue for URLs to crawl with built-in persistence and state.
For example:
```javascript
import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const item = {
            url: request.url,
            title: await page.title(),
        };
        await Dataset.pushData(item); // goes into the dataset of this run
    },
});

await crawler.run(['https://example.com']);

// Actor.openDataset() with no arguments opens the run's default dataset.
const dataset = await Actor.openDataset();
console.log(`Dataset ID: ${dataset.id}`);

await Actor.exit();
```
You no longer worry about half-written local files or S3 mismatches. The dataset is the contract:
- Inspect in the Apify Console.
- Export to JSON, CSV, Excel, or NDJSON.
- Access via HTTP endpoint or SDK: https://api.apify.com/v2/datasets/{datasetId}/items
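Reading the dataset over HTTP could look roughly like the sketch below. `datasetItemsUrl` and `fetchDatasetItems` are hypothetical helpers; the query parameters (`format`, `clean`, `limit`, `offset`) and the Bearer-token header are assumptions to verify against the API reference, and the global `fetch` requires Node 18 or newer.

```javascript
// Build the items URL for a dataset. The query parameters used here (format,
// clean, limit, offset) are assumptions to verify against the API reference.
function datasetItemsUrl(datasetId, { format = 'json', clean = true, limit, offset } = {}) {
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', format);
    if (clean) url.searchParams.set('clean', 'true');
    if (limit !== undefined) url.searchParams.set('limit', String(limit));
    if (offset !== undefined) url.searchParams.set('offset', String(offset));
    return url.toString();
}

// Fetch items with the global fetch available in Node 18+.
// Assumption: the API token is sent as a Bearer token in the Authorization header.
async function fetchDatasetItems(datasetId, token, options) {
    const response = await fetch(datasetItemsUrl(datasetId, options), {
        headers: { Authorization: `Bearer ${token}` },
    });
    if (!response.ok) throw new Error(`Dataset request failed: HTTP ${response.status}`);
    return response.json();
}

console.log(datasetItemsUrl('abc123', { limit: 100 }));
// https://api.apify.com/v2/datasets/abc123/items?format=json&clean=true&limit=100
```

Keeping the URL builder separate from the network call makes it easy to log or unit-test the exact request before pointing it at a real dataset.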
4. Schedule runs instead of brittle cron
Instead of maintaining cron on a server, you define a schedule in Apify:
- Every night at 02:00 UTC.
- Every 10 minutes.
- Specific days of the week.
You can configure this in the UI or via API. Apify triggers each run on schedule and records its outcome. If you need backfills, you can trigger manual runs or programmatic runs with different input payloads.
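Schedules like the three listed above are usually written as five-field cron expressions. The snippet below is a small sketch: the object keys are illustrative names, the "specific days" entry picks Monday, Wednesday, and Friday as an example, and the times are UTC only if the schedule's time zone is set to UTC.

```javascript
// Cron expressions (minute hour day-of-month month day-of-week) for the
// example schedules. The object keys are illustrative names only.
const SCHEDULES = {
    nightlyAt2am: '0 2 * * *',
    everyTenMinutes: '*/10 * * * *',
    monWedFriAt2am: '0 2 * * 1,3,5',
};

// Minimal sanity check: a standard cron expression has exactly five fields.
// This does not validate the values inside each field.
function hasFiveFields(expression) {
    return expression.trim().split(/\s+/).length === 5;
}

for (const [name, expression] of Object.entries(SCHEDULES)) {
    console.log(name, expression, hasFiveFields(expression));
}
```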
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Actor-based deployment | Wraps your headless browser scraper as an Actor with input, runs, and datasets. | Avoid building your own orchestration; treat scrapers as services. |
| Built-in retries & monitoring | Retries failed requests/runs, tracks status, stores logs and errors per run. | Fewer overnight failures, faster debugging. |
| Managed storage & exports | Provides datasets, key-value stores, and request queues with HTTP/SDK access and exports. | Reliable, API-ready data for apps, dashboards, or AI pipelines. |
Ideal Use Cases
- Best for scheduled Playwright/Puppeteer jobs: Because it turns your single script into a monitored Actor with retries, logs, and storage, so running every night or every hour doesn’t require a custom orchestration stack.
- Best for feeding AI/RAG pipelines: Because Actors like Website Content Crawler and your own scrapers can output clean text or Markdown into datasets that you can stream into vector databases like Pinecone, then query via LangChain or LlamaIndex.
Limitations & Considerations
- Still need to maintain scraping logic: Apify handles infra, retries, and monitoring, but if the target site’s DOM changes, you still have to update your selectors and parsing logic. The upside: debugging and deploy cycles are much faster in the Console.
- Platform quotas and plan limits: High-frequency or large-scale scraping (millions of pages/month, heavy headless usage) requires the right plan. The platform is built for scale, but you should size your plan according to concurrency and data volume.
Pricing & Plans
Apify has usage-based pricing tied to compute units, storage, and proxies, with plans that fit both individual developers and enterprise teams. You can start on a free or lower-tier plan, run your Actor on a schedule, and only upgrade when you hit scale or need enterprise features (like SOC2-compliant environments and 99.95% uptime guarantees).
- Self-serve plans: Best for developers and small teams needing reliable scheduled scraping, API access to datasets, and occasional headless workloads.
- Enterprise plans: Best for organizations needing guaranteed capacity, compliance (SOC2, GDPR, CCPA), SSO, dedicated support, and large-scale proxy/unblocking infrastructure.
For an exact fit, especially if you’re migrating multiple headless browser scrapers or an existing Scrapy/Playwright fleet, it’s worth talking to Apify sales.
Frequently Asked Questions
Can I run my existing Playwright/Puppeteer script as-is on Apify?
Short Answer: Yes, with minor wrapping to fit the Actor model.
Details:
You typically:
- Add apify and optionally crawlee as dependencies.
- Wrap your main function with Actor.init() and Actor.exit().
- Replace local file writes with Dataset.pushData or Actor.setValue.
- Use Apify’s input schema for parameters like URL lists, date ranges, or search terms.
In many cases, your core page interaction logic (selectors, navigation, evaluation) stays unchanged—you just gain retries, logs, and storage.
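When parameters move from hard-coded constants to Actor input, it helps to validate them once at startup. Here is a minimal sketch, assuming an input shape that is not taken from the article: `normalizeInput` is a hypothetical helper, and the field names `startUrls` and `maxPages` and the default of 100 pages are illustrative.

```javascript
// Hypothetical input normalizer for an Actor wrapping an existing script.
// Assumption: the input has a `startUrls` array of strings or { url } objects
// and an optional positive integer `maxPages`; both names are illustrative.
function normalizeInput(input) {
    const raw = Array.isArray(input?.startUrls) ? input.startUrls : [];
    const startUrls = raw
        .map((entry) => (typeof entry === 'string' ? entry : entry?.url))
        .filter((url) => typeof url === 'string' && /^https?:\/\//.test(url));
    if (startUrls.length === 0) {
        throw new Error('Input must contain at least one http(s) URL in startUrls');
    }
    const maxPages = Number.isInteger(input?.maxPages) && input.maxPages > 0 ? input.maxPages : 100;
    return { startUrls, maxPages };
}

// In the Actor entry point you might then write (not executed here):
//   await Actor.init();
//   const { startUrls, maxPages } = normalizeInput(await Actor.getInput());
//   ... run your existing Playwright/Puppeteer logic ...
//   await Actor.exit();

console.log(normalizeInput({ startUrls: ['https://example.com', { url: 'https://example.org' }] }));
```

Failing fast on bad input means a misconfigured scheduled run stops immediately with a clear error message rather than crawling nothing.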
How do I consume the data from my scheduled Actor in other systems?
Short Answer: Read the dataset via HTTP or SDK, or sync it using integrations.
Details:
Every Actor run produces a dataset with a stable ID. You can:
- Fetch it via HTTP: GET https://api.apify.com/v2/datasets/{datasetId}/items?format=json
- Use official clients (Python, JavaScript, CLI, MCP clients) to read items programmatically.
- Export as CSV/Excel for BI tools.
- Push data downstream using integrations like Zapier, Google Sheets, Airbyte, Slack, Google Drive, or Pinecone for vector search.
For LLM workflows, you can schedule a Website Content Crawler or your own text-focused Actor to output clean Markdown, then embed it and load it into your RAG stack.
Summary
If your scheduled headless browser scraper keeps failing overnight, the problem usually isn’t Playwright or Puppeteer—it’s that a single script on cron has to pretend to be an orchestration system. By wrapping your scraper as an Apify Actor, you offload retries, logs, storage, proxies, unblocking, and scheduling to a platform that’s built for running web scrapers in production. You keep the scraping logic; Apify turns it into a monitored, API-consumable service that your team—and your AI workflows—can rely on.