Why do my scrapers start returning 403/429 after a few hundred requests even with delays?

Most scrapers work perfectly for the first few hundred requests, then suddenly start returning a wall of HTTP 403 and 429 errors—even if you’ve added generous delays. If that’s what you’re seeing, you’re not unlucky; you’re running into how modern anti‑bot systems work.

As someone who spent years hand‑rolling scrapers (Scrapy + Playwright, DIY proxies, on‑call alerts at 3 a.m.) and then moved everything to Apify Actors, I’ll walk through why this happens and how to design scrapers that stay alive at scale.

The Quick Overview

What It Is: A technical breakdown of why websites start serving 403/429 responses to scrapers after a short burst of success, and how to avoid these bans with modern scraping patterns.
Who It Is For: Developers, data engineers, and product teams running crawlers for price intelligence, market research, lead gen, or feeding AI/RAG pipelines.
Core Problem Solved: Your scrapers don’t just need “delay between requests”—they need IP rotation, browser simulation, fingerprinting control, and monitoring, or they will keep getting blocked.

How HTTP 403 and 429 Actually Work For Scrapers

A 403 or 429 after a few hundred requests is almost never “just bad luck.” It’s usually the result of layered detection:

Rate limiting (429):
The server thinks your client is sending “too many” requests in a given time window—from a single IP, account, or device fingerprint.
Access denial (403):
The server has decided you’re not allowed to access a resource at all anymore, often because of:
- IP reputation or region
- Detected bot fingerprint
- Suspicious behavior patterns (e.g., perfect intervals, unusual navigation)

Modern anti‑bot systems rarely block you immediately. They let you have a few hundred requests, build up a behavioral profile, then quietly move you to a stricter bucket.

Delays alone don’t fix that because:

You’re still coming from the same IP (or small pool).
Your browser fingerprint doesn’t change.
Your navigation pattern looks nothing like a human session.
You’re triggering JavaScript checks you don’t execute properly.

The Real Reasons You Get 403/429 After a Few Hundred Requests

1. Single-IP or Small-Pool Exhaustion

Even with generous sleep() calls of 2–5 seconds, if every request comes from:

One IP (your server),
One datacenter ASN (e.g., AWS EC2, GCP),
Or a tiny residential proxy pool,

you’ll quickly hit:

Global per-IP rate limits → 429
Reputation filters (datacenter IPs flagged as bots) → 403

What you see in practice:

First 200–500 requests: 200 OK, everything looks fine.
Then a growing share of responses: 429 Too Many Requests.
Eventually: mix of 403 Forbidden and CAPTCHAs.

Why delay doesn’t help:
The site isn’t just counting raw QPS. It’s applying IP reputation and per-IP quotas over rolling windows. Once an IP is “hot,” the delay can be 30 seconds and you’ll still be throttled.
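To make the rolling-window point concrete, here is a minimal sketch of a per-IP quota. The numbers (300 requests per hour) and the names `RollingWindowQuota` and `first_rejected_request` are my own assumptions for illustration; real sites don't publish their thresholds, and production limiters are more elaborate than this.

```python
from collections import deque
from typing import Deque, Optional


class RollingWindowQuota:
    """Hypothetical per-IP quota: at most `limit` requests in any `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have slid out of the rolling window.
        while self._hits and now - self._hits[0] >= self.window_s:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False  # a real server would answer 429 here
        self._hits.append(now)
        return True


def first_rejected_request(
    delay_s: float, limit: int, window_s: float, total: int
) -> Optional[int]:
    """Return the 1-based index of the first throttled request, or None."""
    quota = RollingWindowQuota(limit, window_s)
    for i in range(total):
        if not quota.allow(now=i * delay_s):
            return i + 1
    return None


# One request every 5 seconds against an assumed 300-per-hour quota.
print(first_rejected_request(delay_s=5, limit=300, window_s=3600, total=1000))  # 301
```

Under these assumed numbers, a polite 5-second delay still trips the limit on request 301, which is the "few hundred OK, then blocked" shape. Staying under this particular quota from one IP would take a delay of at least 12 seconds, and that does nothing about reputation filters.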

2. No Real Browser or JavaScript Execution

If you’re using a bare HTTP client or simple HTML parser:

No JavaScript execution
No proper cookies/session storage
No WebGL / Canvas / audio / fonts environment

Anti‑bot vendors use that to distinguish bots from users. Results:

Some endpoints always 403 for “basic” clients.
After a few hundred requests, your client score drops below the “allow” threshold → all future requests get 403/429.

Symptoms:

Your script works when tested in Chrome/Playwright.
Same URL returns 403 when fetched via requests, fetch, or simple curl.
Sometimes you see “Access denied” messages from services like Cloudflare, Akamai, PerimeterX, DataDome, etc.

3. Reused or Static Fingerprints

If every request or every “session” looks like the exact same device:

Same User-Agent
Same screen resolution
Same timezone / language
Same browser version and minor quirks
Same TCP / TLS fingerprint

…you’re easy to cluster as one automated agent.

Anti‑bot systems don’t need to see high RPS. They see:

Unrealistically consistent fingerprints
Long-running sessions with no random noise
No normal user actions (scrolling, inter-page delays, back/forward navigation)

After N requests (often a few hundred), you’re bucketed as a bot and start getting 403/429.
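A rough sketch of why static fingerprints are easy to cluster: if a detector hashes a handful of client attributes, every request from an unchanging client lands in the same bucket. The attribute names and the two helper functions below are hypothetical; commercial anti‑bot products use far more signals than this.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def fingerprint_key(attrs: Dict[str, str]) -> str:
    """Hash a set of client attributes into a stable cluster key."""
    # Sort keys so the same attributes always produce the same string.
    canonical = "|".join(f"{name}={attrs[name]}" for name in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def largest_cluster_share(requests: List[Dict[str, str]]) -> float:
    """Fraction of requests that share the single most common fingerprint."""
    counts = Counter(fingerprint_key(attrs) for attrs in requests)
    return counts.most_common(1)[0][1] / len(requests)


# A scraper that never changes its (illustrative) attributes.
static_client = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "timezone": "UTC",
    "language": "en-US",
}
print(largest_cluster_share([static_client] * 300))  # 1.0
```

A share of 1.0 over hundreds of requests means a single identity is doing all the work, regardless of how slowly it does it.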

4. Highly Predictable Request Patterns

Even with delays, many scrapers:

Fetch pages in numeric order (page=1,2,3,…)
Use exact same URL patterns and query strings
Never fetch assets (CSS, images) or API auxiliary endpoints
Hit only product or listing pages, never home or category pages

From a detection perspective, that is extremely non‑human. Humans:

Land on random pages via search, social, referral
Navigate unpredictably
Sometimes bounce quickly, sometimes stay long

If the site uses behavior‑based detection, your “perfectly regular” scraper becomes suspicious as soon as it has enough data—again, usually after a few hundred requests.
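One simple way to see how "perfectly regular" your own traffic looks is the coefficient of variation of the gaps between requests (standard deviation divided by mean). This is a sketch of a self-check you could run on your logs, not a description of any vendor's detector; `interval_cv` is a name I made up.

```python
import statistics
from typing import List


def interval_cv(timestamps: List[float]) -> float:
    """Coefficient of variation of the gaps between consecutive requests.

    0.0 means perfectly even spacing; larger values mean more irregular timing.
    """
    if len(timestamps) < 3:
        raise ValueError("need at least three timestamps")
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


# A fixed sleep(5) produces identical gaps.
print(interval_cv([0, 5, 10, 15, 20]))  # 0.0
```

A value at or near zero across hundreds of requests is timing that no human browsing session would produce.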

5. Missing or Mishandled Cookies and Sessions

Lots of sites expect:

A login cookie
Session cookies set by initial landing pages
CSRF tokens set in hidden fields or headers
Consent / GDPR banners toggled

If you skip those by:

Calling API endpoints directly without going through normal pages
Dropping cookies between requests
Ignoring redirects that initialize the session

…you may pass for a while, then the server elevates suspicion and starts returning 403/429 because you look like a replayed or synthetic client.

6. Backend-Level Rate Limits & Soft Bans

Some 429 patterns aren’t even “bot protection”—they are raw backend protection:

Per-IP quotas on expensive endpoints
Per-account quotas (if authenticated)
Global concurrency limits

When you ignore Retry-After headers or fail to back off on 429s, you quickly escalate into:

Longer 429 durations
Temporary or permanent 403 blocks for that IP/session
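Honoring Retry-After starts with parsing it. The header can carry either a number of seconds or an HTTP date, so a small helper is useful. The sketch below uses only the standard library; the function name `retry_after_seconds` is my own.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    header_value: Optional[str], now: Optional[datetime] = None
) -> Optional[float]:
    """Convert a Retry-After header into seconds to wait, or None if unusable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable header: let the caller fall back to its own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())


print(retry_after_seconds("120"))  # 120.0
```

Returning None for a missing or malformed header keeps the decision about a default wait in one place, with whatever backoff policy the caller already uses.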

7. Monitoring Blind Spots

Another subtle reason: you don’t see the early warning signs.

Scraper runs nightly; you only notice when the output dataset is suddenly tiny.
You ignore increasing 301/302, 5xx, or partial HTML responses that signal ramping mitigation.
You don’t log response headers (e.g., Retry-After, anti‑bot headers, CAPTCHA hints).

The “few hundred OK → then blocked” pattern is often preceded by gradual degradation you’re not monitoring.
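Catching that gradual degradation does not need a full observability stack. Here is a minimal sketch of a rolling block-rate check over the most recent responses; the window size, the 5% threshold, and the class name `BlockRateMonitor` are assumptions you would tune per site.

```python
from collections import deque
from typing import Deque


class BlockRateMonitor:
    """Tracks the share of 403/429 responses among the last `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self._recent: Deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def block_rate(self) -> float:
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def record(self, status: int) -> bool:
        """Record one response; return True when the block rate exceeds the threshold."""
        self._recent.append(status in (403, 429))
        return self.block_rate() > self.threshold


monitor = BlockRateMonitor(window=10, threshold=0.2)
for status in [200] * 10 + [429, 403, 429]:
    if monitor.record(status):
        print(f"block rate {monitor.block_rate():.0%}: slow down or rotate")
```

Wiring the True result to an alert or to a concurrency reduction gives you the early warning before the dataset shrinks.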

Why Adding More Delay Is Not Enough

A lot of teams hit 403/429 and reflexively try:

Increasing delay between requests (sleep 1s → 5s → 10s)
Slowing down concurrency
Spreading work across hours or days

This reduces raw RPS, but:

IP reputation issues persist.
Fingerprint and session anomalies remain.
Behavioral profile still looks synthetic.
You’re not respecting server‑communicated limits (e.g., Retry-After).

Delay is one dimension of “don’t be noisy,” but modern defenses operate on multi-dimensional signals:

Client & transport fingerprint
IP / ASN / geography
Session length and flow
Navigation graph
Header and cookie patterns

To keep scrapers alive, you need to attack the problem on all these axes.

A Better Mental Model: Treat Scrapers Like “Many Real Users”

Instead of thinking:

“I have one script that hits N URLs with sleep()”

Think:

“I have thousands of simulated users, each with:

Their own IP,
Their own browser,
Their own session,
Their own navigation pattern,

…all operating below detection thresholds.”

In practice, that means:

Proxies with rotation: Residential or datacenter pools, rotated per session or per small batch.
Browser automation: Playwright or Puppeteer (or Selenium), preferably via a framework like Crawlee or an Apify Actor.
Session management: Cookies + localStorage + tokens preserved across small bursts of requests.
Smart scheduling: Rates tuned by site, respecting 429 and Retry-After.
Monitoring & recovery: Automatic retry, backoff, and alerting when 403/429 spikes.

This is exactly the operational stack I ended up delegating to Apify after years of running my own.

How Apify Helps You Avoid 403/429 at Scale

If you don’t want to line‑item “proxies, unblocking, browser automation, monitoring” into your own infrastructure, Apify is pretty much built to absorb that complexity.

1. Actors as the Deployment Unit

Instead of a script on a random VM, you package your scraper as an Actor:

Input: URLs, search params, pagination limits, etc.
Run: Browser automation or HTTP requests with built‑in Crawlee helpers.
Output: Structured datasets you can export or consume via API (JSON, CSV, Excel, etc.)

The platform handles:

Cloud execution
Scaling and concurrency
Storages (dataset, key‑value store, request queue)
Logs & run history

2. Proxies and Unblocking Built In

This is the biggest factor in reducing 403/429:

Apify Proxy: Global pool with:
- Datacenter and residential IPs
- Country targeting, rotation strategies
IP rotation per request or per session
Tuned for web scraping use cases (not generic web proxies)

You stop:

Burning a single IP or a tiny provider pool
Fighting bans at the network layer
Rewriting your code every time a provider changes

3. Browser Automation with Crawlee + Playwright/Puppeteer

Apify’s Crawlee library (open source) + Actors support:

Headless or headed Playwright/Puppeteer
Randomized fingerprints and headers
Session pools (per‑browser “users”)
Auto‑retry and backoff logic

In code terms:

You operate with high‑level crawlers (e.g., PlaywrightCrawler) that implement:
- Automatic retries on 403/429
- Delay & concurrency controls
- Rotating proxies and sessions

This alone solves a big chunk of “works for 300 requests, then dies” issues.

4. Scheduling, Monitoring, and Alerting

With Apify Console:

Schedule Actors to run on cron (e.g., every 10 minutes, hourly, daily).
Monitor:
- Run duration
- Error rate (including 403/429 spikes)
- Dataset size and trends
Integrate with:
- Slack (webhooks)
- Zapier / Google Sheets / Airbyte
- Your own systems via the Apify API

Practically, this means you:

Notice when a site tightens protections.
Adjust configuration (proxy type, concurrency, delays) from the UI.
Avoid silent data drift in your downstream systems or AI pipelines.

Practical Strategies to Stop 403/429 in Your Own Stack

Whether you use Apify or not, you can apply the same principles:

1. Introduce Real IP Rotation

Use a proxy provider with:
- A large pool
- Rotation controls (per request/session)
- Geographic targeting, if needed
Avoid a single static IP or a tiny VPS pool.
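If you roll this yourself, per-session rotation can be as simple as pinning each session to the next proxy in a pool. This is a sketch only: `ProxyRotator` is a hypothetical helper, and the proxy URLs are placeholders, not real endpoints.

```python
import itertools
from typing import Dict, List


class ProxyRotator:
    """Assigns each session a proxy from the pool and keeps it sticky."""

    def __init__(self, proxy_urls: List[str]) -> None:
        if not proxy_urls:
            raise ValueError("proxy pool must not be empty")
        self._cycle = itertools.cycle(proxy_urls)
        self._by_session: Dict[str, str] = {}

    def proxy_for(self, session_id: str) -> str:
        # A session keeps its proxy so cookies and IP stay consistent.
        if session_id not in self._by_session:
            self._by_session[session_id] = next(self._cycle)
        return self._by_session[session_id]

    def release(self, session_id: str) -> None:
        # Forget the mapping so the session gets a fresh proxy next time.
        self._by_session.pop(session_id, None)


rotator = ProxyRotator([
    "http://proxy-1.example:8000",  # placeholder URLs
    "http://proxy-2.example:8000",
])
print(rotator.proxy_for("session-a"))  # http://proxy-1.example:8000
```

The returned URL is what you would hand to your HTTP client's proxy setting. Keeping it sticky per session matters because an IP that changes mid-session is its own anomaly.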

In Apify:

Configure your Actor to use Apify Proxy.
Choose residential / datacenter pools per target site.
Reduce concurrency until 429/403 rates drop to a stable baseline.

2. Move to Browser-Based Scraping When Necessary

If you see:

Obfuscated HTML
Heavy JS rendering
CAPTCHA / anti‑bot pages

…switch to Playwright/Puppeteer:

Run pages in a headless browser.
Wait for specific selectors or network idle events.
Let the browser handle cookies, sessions, and scripts.

In Apify:

Either pick a ready‑made Playwright-based Actor from the Apify Store (there are 20,000+ Actors).
Or build your own Actor using Crawlee + Playwright templates.

3. Randomize and Rotate Fingerprints

Even within one browser automation stack:

Randomize:
- User-Agent and platform
- Screen size and viewport
- Language / timezone where appropriate
Use a session pool so multiple “users” appear with slightly different fingerprints.

In Crawlee/Apify:

Use the built‑in SessionPool.
Attach each session to a proxy IP.
Let Crawlee handle session retirement when a session gets too many 403/429 responses.
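To show the retirement idea without tying it to a specific library, here is a plain-Python sketch. It is not Crawlee's SessionPool and does not mirror its API; `Session`, `SimpleSessionPool`, and the `max_blocked` parameter are names invented for this example.

```python
from dataclasses import dataclass
from typing import List

BLOCK_STATUSES = {403, 429}


@dataclass
class Session:
    session_id: int
    proxy_url: str
    blocked: int = 0
    retired: bool = False


class SimpleSessionPool:
    """Hands out sessions and retires one after too many blocked responses."""

    def __init__(self, proxy_urls: List[str], max_blocked: int = 3) -> None:
        if not proxy_urls:
            raise ValueError("need at least one proxy URL")
        self._proxy_urls = list(proxy_urls)
        self._max_blocked = max_blocked
        self._sessions: List[Session] = []

    def _create(self) -> Session:
        index = len(self._sessions)
        # Each new session takes the next proxy, wrapping around the pool.
        proxy = self._proxy_urls[index % len(self._proxy_urls)]
        session = Session(session_id=index + 1, proxy_url=proxy)
        self._sessions.append(session)
        return session

    def get(self) -> Session:
        for session in self._sessions:
            if not session.retired:
                return session
        return self._create()

    def report(self, session: Session, status: int) -> None:
        if status in BLOCK_STATUSES:
            session.blocked += 1
            if session.blocked >= self._max_blocked:
                session.retired = True  # treat this identity as burned
```

A typical loop would call `get()` before each request and `report()` after it, so a burned session stops being reused and the next request starts on a different proxy with a clean slate.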

4. Respect Rate Limits and Backoff

When you receive 429:

Read Retry-After headers.
Back off exponentially.
Drop concurrency and alert someone if it persists.

When you see 403:

Treat it as a signal that:
- The current session/IP may be burned.
- Fingerprints or behavior likely need tuning.
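The 429 handling above can be sketched as a single delay function: exponential growth, a cap, some jitter so parallel workers don't retry in lockstep, and the server's Retry-After value as a floor when it is present. The defaults (1-second base, 300-second cap) and the name `backoff_delay` are assumptions, not recommendations for any particular site.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 300.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    source = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter: pick a point in the upper half of the exponential step.
    delay = source.uniform(ceiling / 2, ceiling)
    if retry_after is not None:
        # Never retry sooner than the server asked.
        return max(retry_after, delay)
    return delay


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 2))
```

With these defaults the waits fall roughly in the ranges 0.5–1 s, 1–2 s, 2–4 s, and 4–8 s for the first four attempts, and never exceed 300 s unless the server explicitly asks for longer.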

In Apify:

Crawlee already includes retry/backoff logic.
You can configure max retries and custom logic for 403/429 (e.g., rotate session, lower concurrency).

5. Simulate More Natural Behavior

You don’t always need full human mimicry, but small tweaks help:

Randomize small delays between actions (2–7 seconds rather than fixed 5).
Occasionally fetch related pages (e.g., categories, home) instead of just item detail endpoints.
Avoid hammering a single endpoint for hours; spread your access over time.
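Those tweaks fit in a small planning step before the crawl starts. The sketch below shuffles the item URLs, attaches a random 2–7 second delay to each visit, and mixes in a related page every few requests. `build_visit_plan` and its parameters are hypothetical, and the URLs are placeholders.

```python
import random
from typing import List, Optional, Tuple


def build_visit_plan(
    item_urls: List[str],
    related_urls: List[str],
    related_every: int = 10,
    min_delay: float = 2.0,
    max_delay: float = 7.0,
    seed: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return (url, delay_before_request_seconds) pairs in visiting order."""
    rng = random.Random(seed)
    urls = list(item_urls)
    rng.shuffle(urls)  # avoid page=1,2,3,... ordering
    plan: List[Tuple[str, float]] = []
    for position, url in enumerate(urls, start=1):
        plan.append((url, rng.uniform(min_delay, max_delay)))
        if related_urls and position % related_every == 0:
            # Occasionally visit a category or home page between items.
            plan.append((rng.choice(related_urls), rng.uniform(min_delay, max_delay)))
    return plan


items = [f"https://shop.example/item/{n}" for n in range(1, 21)]
related = ["https://shop.example/", "https://shop.example/category/shoes"]
plan = build_visit_plan(items, related, seed=1)
print(len(plan))  # 22
```

Passing a seed makes a run reproducible for debugging; leaving it as None gives a different order and timing on every run.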

6. Monitor and Log Richly

Track, per run:

Count and rate of 200 vs 403 vs 429.
IP / proxy pool used.
Headers like Retry-After, server banners, anti‑bot hints.
Example HTML of error pages (often contain vendor names like Cloudflare, DataDome, etc.).

With even simple dashboards, you’ll see:

When to switch to a new approach (e.g., from raw HTTP to Playwright).
When a site changed its anti‑bot rules.
Which proxy pools are working best.
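A per-run summary along these lines takes only a few lines once you log status and proxy pool for each request. The record shape (a `(status, proxy_pool)` tuple), the pool labels, and the `summarize_run` name are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, List, Tuple, Union

Summary = Dict[str, Union[int, float, Dict]]


def summarize_run(records: List[Tuple[int, str]]) -> Summary:
    """Aggregate (status_code, proxy_pool) records into per-run metrics."""
    status_counts = Counter(status for status, _ in records)
    blocked_by_pool = Counter(
        pool for status, pool in records if status in (403, 429)
    )
    total = len(records)
    blocked = status_counts[403] + status_counts[429]
    return {
        "total": total,
        "status_counts": dict(status_counts),
        "block_rate": blocked / total if total else 0.0,
        "blocked_by_pool": dict(blocked_by_pool),
    }


run = [(200, "datacenter")] * 8 + [(429, "datacenter"), (403, "datacenter")]
print(summarize_run(run)["block_rate"])  # 0.2
```

Comparing `block_rate` and `blocked_by_pool` across runs is usually enough to spot a tightened site or a proxy pool that has gone bad.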

How This Connects To AI and GEO (Generative Engine Optimization)

If you’re feeding AI models, RAG pipelines, or vector databases with web content, sudden 403/429 spikes are not just an ops annoyance; they break your data freshness and model behavior.

For example:

Price monitoring datasets go stale, and your AI assistant suggests outdated prices.
SEO/GEO analyses miss new pages or links because crawls fail silently.
Website Content Crawler runs that used to build your Markdown corpus start returning tiny datasets.

Using Apify Actors with built‑in resiliency (proxies, unblocking, monitoring) helps you:

Keep a steady stream of clean, structured text for your LLMs.
Re‑crawl important pages on schedules without hitting bans.
Export data directly into:
- JSON/CSV/Excel for ETL
- Vector DBs (e.g., Pinecone) via integrations
- LLM frameworks like LangChain/LlamaIndex through APIs

In other words: you maintain reliable web data inputs, which is the foundation of any meaningful GEO strategy.

Limitations & Considerations

Legal and ethical constraints:
403/429 can be a signal not just of technical limits but of policy. Always check:
- Terms of service
- robots.txt
- GDPR/CCPA implications, especially when scraping personal data such as details about Yelp reviewers (Apify Actors generally expose an include_personal_data toggle to help you stay compliant).
Some sites are extremely hardened:
A few targets invest heavily in anti‑bot tech. Even with proxies, Playwright, and backoff, you may need:
- Custom JS challenge solving
- More advanced fingerprinting
- Professional help (Apify offers Professional Services for exactly these “hard mode” targets).

Summary

If your scrapers start returning 403 and 429 after a few hundred requests, the root issue is not just that you’re “too fast.” It’s that:

You’re coming from a small IP pool.
Your client fingerprint is obviously non‑human.
Your behavior is too regular and concentrated.
You have no retry/backoff logic or monitoring.

The fix is to stop thinking in terms of “add more delay” and start designing scrapers as many realistic users:

Distributed IPs (proxies and unblocking).
Browsers, not raw HTTP, where needed.
Session and fingerprint diversity.
Backoff and observability.

You can either build and maintain that stack yourself, or offload most of it to a platform like Apify—where Actors, proxies, unblocking, and monitoring are built in, and each run gives you a dataset you can inspect, export, or pipe into your AI workflows.

Next Step

If you’d like help turning your brittle scripts into monitored, resilient Actors that don’t fall over with 403/429 every few hundred requests, you can talk directly to the Apify team:

Get Started
