
Why do my scrapers start returning 403/429 after a few hundred requests even with delays?
Most scrapers work perfectly for the first few hundred requests, then suddenly start returning a wall of HTTP 403 and 429 errors—even if you’ve added generous delays. If that’s what you’re seeing, you’re not unlucky; you’re running into how modern anti‑bot systems work.
As someone who spent years hand‑rolling scrapers (Scrapy + Playwright, DIY proxies, on‑call alerts at 3 a.m.) and then moved everything to Apify Actors, I’ll walk through why this happens and how to design scrapers that stay alive at scale.
The Quick Overview
- What It Is: A technical breakdown of why websites start serving 403/429 responses to scrapers after a short burst of success, and how to avoid these bans with modern scraping patterns.
- Who It Is For: Developers, data engineers, and product teams running crawlers for price intelligence, market research, lead gen, or feeding AI/RAG pipelines.
- Core Problem Solved: Your scrapers don’t just need “delay between requests”—they need IP rotation, browser simulation, fingerprinting control, and monitoring, or they will keep getting blocked.
How HTTP 403 and 429 Actually Work For Scrapers
A 403 or 429 after a few hundred requests is almost never “just bad luck.” It’s usually the result of layered detection:
- Rate limiting (429): The server thinks your client is sending “too many” requests in a given time window, from a single IP, account, or device fingerprint.
- Access denial (403): The server has decided you’re not allowed to access a resource at all anymore, often because of:
- IP reputation or region
- Detected bot fingerprint
- Suspicious behavior patterns (e.g., perfect intervals, unusual navigation)
Modern anti‑bot systems rarely block you immediately. They let you have a few hundred requests, build up a behavioral profile, then quietly move you to a stricter bucket.
Delays alone don’t fix that because:
- You’re still coming from the same IP (or small pool).
- Your browser fingerprint doesn’t change.
- Your navigation pattern looks nothing like a human session.
- You’re triggering JavaScript checks you don’t execute properly.
The Real Reasons You Get 403/429 After a Few Hundred Requests
1. Single-IP or Small-Pool Exhaustion
Even with generous sleep(2–5s) calls, if everything is hitting from:
- One IP (your server),
- One datacenter ASN (e.g., AWS EC2, GCP),
- Or a tiny residential proxy pool,
you’ll quickly hit:
- Global per-IP rate limits → 429
- Reputation filters (datacenter IPs flagged as bots) → 403
What you see in practice:
- First 200–500 requests: 200 OK, everything looks fine.
- Then, a growing share of responses: 429 Too Many Requests.
- Eventually: mix of 403 Forbidden and CAPTCHAs.
Why delay doesn’t help:
The site isn’t just counting raw QPS. It’s applying IP reputation and per-IP quotas over rolling windows. Once an IP is “hot,” the delay can be 30 seconds and you’ll still be throttled.
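To make the rolling-window point concrete, here is a minimal sketch of a per-IP sliding-window quota as a server might apply it. The quota (300 requests per 24 hours), the IP address, and every name in the code are illustrative assumptions, not the behavior of any specific site or vendor.

```python
from collections import deque
from typing import Optional


class SlidingWindowQuota:
    """Allow at most `limit` requests per IP within any `window_s`-second window."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self.hits = {}  # ip -> deque of accepted request timestamps

    def allow(self, ip: str, now: float) -> bool:
        q = self.hits.setdefault(ip, deque())
        # Drop timestamps that have fallen out of the rolling window.
        while q and now - q[0] >= self.window_s:
            q.popleft()
        if len(q) >= self.limit:
            return False  # a real server would answer 429 here
        q.append(now)
        return True


def first_rejected_request(delay_s: float, limit: int, window_s: float,
                           total: int = 2000) -> Optional[int]:
    """Return the 1-based index of the first throttled request, or None."""
    quota = SlidingWindowQuota(limit, window_s)
    for i in range(total):
        # Hypothetical single client IP sending one request every `delay_s` seconds.
        if not quota.allow("203.0.113.7", now=i * delay_s):
            return i + 1
    return None


DAY = 24 * 60 * 60
for delay in (3, 30, 300):
    print(delay, first_rejected_request(delay, limit=300, window_s=DAY))
```

Under these assumed numbers, a 3-second delay and a 30-second delay are both throttled at the same request count, because the quota counts requests per window rather than requests per second. Only a delay slow enough to stay under the quota for the whole window avoids the limit.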
2. No Real Browser or JavaScript Execution
If you’re using a bare HTTP client or simple HTML parser:
- No JavaScript execution
- No proper cookies/session storage
- No WebGL / Canvas / audio / fonts environment
Anti‑bot vendors use that to distinguish bots from users. Results:
- Some endpoints always 403 for “basic” clients.
- After a few hundred requests, your client score drops below the “allow” threshold → all future requests get 403/429.
Symptoms:
- Your script works when tested in Chrome/Playwright.
- Same URL returns 403 when fetched via requests, fetch, or simple curl.
- Sometimes you see “Access denied” messages from services like Cloudflare, Akamai, PerimeterX, DataDome, etc.
3. Reused or Static Fingerprints
If every request or every “session” looks like the exact same device:
- Same User-Agent
- Same screen resolution
- Same timezone / language
- Same browser version and minor quirks
- Same TCP / TLS fingerprint
…you’re easy to cluster as one automated agent.
Anti‑bot systems don’t need to see high RPS. They see:
- Unrealistically consistent fingerprints
- Long-running sessions with no random noise
- No normal user actions (scrolling, inter-page delays, back/forward navigation)
After N requests (often a few hundred), you’re bucketed as a bot and start getting 403/429.
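A rough sketch of how this clustering could work: hash the attributes a client exposes and count requests per hash. The attribute names, the threshold of 200, and the helper functions are assumptions for illustration; real systems combine many more signals (including TLS-level ones) in ways that are not public.

```python
import hashlib
from collections import Counter


def fingerprint(attrs: dict) -> str:
    """Collapse client attributes into a short, order-independent hash."""
    canonical = "|".join(f"{key}={attrs[key]}" for key in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def flagged_fingerprints(requests: list, threshold: int) -> set:
    """Return fingerprints seen at least `threshold` times."""
    counts = Counter(fingerprint(r) for r in requests)
    return {fp for fp, n in counts.items() if n >= threshold}


# One static "device" sending 300 requests.
static_client = [{
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "viewport": "1920x1080",
    "timezone": "UTC",
    "language": "en-US",
}] * 300

# 30 slightly different "devices" sending 10 requests each.
varied_clients = [
    {
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "viewport": f"{1280 + 16 * i}x{720 + 9 * i}",
        "timezone": "UTC",
        "language": "en-US",
    }
    for i in range(30)
    for _ in range(10)
]

print(len(flagged_fingerprints(static_client, threshold=200)))
print(len(flagged_fingerprints(varied_clients, threshold=200)))
```

The static client collapses into a single heavily used fingerprint, while the varied clients stay below the assumed threshold even though the total request count is identical.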
4. Highly Predictable Request Patterns
Even with delays, many scrapers:
- Fetch pages in numeric order (page=1,2,3,…)
- Use the exact same URL patterns and query strings
- Never fetch assets (CSS, images) or API auxiliary endpoints
- Hit only product or listing pages, never home or category pages
From a detection perspective, that is extremely non‑human. Humans:
- Land on random pages via search, social, referral
- Navigate unpredictably
- Sometimes bounce quickly, sometimes stay long
If the site uses behavior‑based detection, your “perfectly regular” scraper becomes suspicious as soon as it has enough data—again, usually after a few hundred requests.
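One simple statistic a detector could use for “perfectly regular” traffic is the coefficient of variation of the gaps between requests. The sketch below is an assumption about how such a check might look, not a description of any real product; the function names and the 300-request sample are hypothetical.

```python
import random
import statistics


def interval_cv(timestamps: list) -> float:
    """Coefficient of variation (stdev / mean) of gaps between requests."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


def make_timestamps(n: int, delay_fn) -> list:
    """Generate n request times, calling delay_fn() for each gap."""
    t, out = 0.0, []
    for _ in range(n):
        out.append(t)
        t += delay_fn()
    return out


rng = random.Random(42)  # seeded so the demo is repeatable
fixed = make_timestamps(300, lambda: 5.0)
jittered = make_timestamps(300, lambda: rng.uniform(2.0, 7.0))

print(interval_cv(fixed))     # fixed sleep: no variation at all
print(interval_cv(jittered))  # randomized sleep: clearly non-zero
```

A fixed sleep produces a variation of exactly zero, which human browsing sessions essentially never do. A non-zero value does not make traffic look human on its own; it only removes one very cheap signal.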
5. Missing or Mishandled Cookies and Sessions
Lots of sites expect:
- A login cookie
- Session cookies set by initial landing pages
- CSRF tokens set in hidden fields or headers
- Consent / GDPR banners toggled
If you skip those by:
- Calling API endpoints directly without going through normal pages
- Dropping cookies between requests
- Ignoring redirects that initialize the session
…you may pass for a while, then the server elevates suspicion and starts returning 403/429 because you look like a replayed or synthetic client.
6. Backend-Level Rate Limits & Soft Bans
Some 429 patterns aren’t even “bot protection”—they are raw backend protection:
- Per-IP quotas on expensive endpoints
- Per-account quotas (if authenticated)
- Global concurrency limits
When you ignore Retry-After headers or fail to back off on 429s, you quickly escalate into:
- Longer 429 durations
- Temporary 403 or full forbidden blocks for that IP/session
7. Monitoring Blind Spots
Another subtle reason: you don’t see the early warning signs.
- Scraper runs nightly; you only notice when the output dataset is suddenly tiny.
- You ignore increasing 301/302, 5xx, or partial HTML responses that signal ramping mitigation.
- You don’t log response headers (e.g., Retry-After, anti‑bot headers, CAPTCHA hints).
The “few hundred OK → then blocked” pattern is often preceded by gradual degradation you’re not monitoring.
Why Adding More Delay Is Not Enough
A lot of teams hit 403/429 and reflexively try:
- Increasing delay between requests (sleep 1s → 5s → 10s)
- Slowing down concurrency
- Spreading work across hours or days
This reduces raw RPS, but:
- IP Reputation issues persist.
- Fingerprint and session anomalies remain.
- Behavioral profile still looks synthetic.
- You’re not respecting server‑communicated limits (e.g., Retry-After).
Delay is one dimension of “don’t be noisy,” but modern defenses operate on multi-dimensional signals:
- Client & transport fingerprint
- IP / ASN / geography
- Session length and flow
- Navigation graph
- Header and cookie patterns
To keep scrapers alive, you need to attack the problem on all these axes.
A Better Mental Model: Treat Scrapers Like “Many Real Users”
Instead of thinking:
“I have one script that hits N URLs with sleep()”
Think:
“I have thousands of simulated users, each with:
- Their own IP,
- Their own browser,
- Their own session,
- Their own navigation pattern,
…all operating below detection thresholds.”
In practice, that means:
- Proxies with rotation: Residential or data center pools, rotated per session or per small batch.
- Browser automation: Playwright or Puppeteer (or Selenium), preferably via a framework like Crawlee or an Apify Actor.
- Session management: Cookies + localStorage + tokens preserved across small bursts of requests.
- Smart scheduling: Rates tuned by site, respecting 429 and Retry-After.
- Monitoring & recovery: Automatic retry, backoff, and alerting when 403/429 spikes.
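As a sketch of the “many users” model, the workload can be split into small batches, each owned by its own simulated user with a separate proxy, user agent, and cookie store. The class, the function, the batch size of 25, and the example.com hosts are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedUser:
    """One identity: its own exit IP, browser string, cookies, and URL batch."""
    name: str
    proxy_url: str
    user_agent: str
    urls: list
    cookies: dict = field(default_factory=dict)  # never shared between users


def plan_users(urls: list, proxies: list, user_agents: list,
               batch_size: int = 25) -> list:
    """Create one simulated user per small batch of URLs."""
    users = []
    for n, start in enumerate(range(0, len(urls), batch_size)):
        users.append(SimulatedUser(
            name=f"user-{n + 1}",
            proxy_url=proxies[n % len(proxies)],
            user_agent=user_agents[n % len(user_agents)],
            urls=urls[start:start + batch_size],
        ))
    return users


# Placeholder inputs for illustration only.
urls = [f"https://shop.example.com/item/{i}" for i in range(100)]
proxies = [f"http://proxy-{i}.example.com:8000" for i in range(1, 5)]
agents = ["agent-a", "agent-b"]

for user in plan_users(urls, proxies, agents):
    print(user.name, user.proxy_url, len(user.urls))
```

The design choice here is that no single identity ever carries more than a small slice of the crawl, so each one individually stays well below the “few hundred requests” point where profiles tend to harden.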
This is exactly the operational stack I ended up delegating to Apify after years of running my own.
How Apify Helps You Avoid 403/429 at Scale
If you don’t want to line‑item “proxies, unblocking, browser automation, monitoring” into your own infrastructure, Apify is pretty much built to absorb that complexity.
1. Actors as the Deployment Unit
Instead of a script on a random VM, you package your scraper as an Actor:
- Input: URLs, search params, pagination limits, etc.
- Run: Browser automation or HTTP requests with built‑in Crawlee helpers.
- Output: Structured datasets you can export or consume via API (JSON, CSV, Excel, etc.)
The platform handles:
- Cloud execution
- Scaling and concurrency
- Storages (dataset, key‑value store, request queue)
- Logs & run history
2. Proxies and Unblocking Built In
This is the biggest factor in reducing 403/429:
- Apify Proxy: Global pool with:
- Datacenter and residential IPs
- Country targeting, rotation strategies
- IP rotation per request or per session
- Tuned for web scraping use cases (not generic web proxies)
You stop:
- Burning a single IP or a tiny provider pool
- Fighting bans at the network layer
- Rewriting your code every time a provider changes
3. Browser Automation with Crawlee + Playwright/Puppeteer
Apify’s Crawlee library (open source) + Actors support:
- Headless/“headed” Playwright/Puppeteer
- Randomized fingerprints and headers
- Session pools (per‑browser “users”)
- Auto‑retry and backoff logic
In code terms:
- You operate with high‑level crawlers (e.g., PlaywrightCrawler) that implement:
- Automatic retries on 403/429
- Delay & concurrency controls
- Rotating proxies and sessions
This alone solves a big chunk of “works for 300 requests, then dies” issues.
4. Scheduling, Monitoring, and Alerting
With Apify Console:
- Schedule Actors to run on cron (e.g., every 10 minutes, hourly, daily).
- Monitor:
- Run duration
- Error rate (including 403/429 spikes)
- Dataset size and trends
- Integrate with:
- Slack (webhooks)
- Zapier / Google Sheets / Airbyte
- Your own systems via the Apify API
Practically, this means you:
- Notice when a site tightens protections.
- Adjust configuration (proxy type, concurrency, delays) from the UI.
- Avoid silent data drift in your downstream systems or AI pipelines.
Practical Strategies to Stop 403/429 in Your Own Stack
Whether you use Apify or not, you can apply the same principles:
1. Introduce Real IP Rotation
- Use a proxy provider with:
- A large pool
- Rotation controls (per request/session)
- Geographic targeting, if needed
- Avoid a single static IP or a tiny VPS pool.
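The difference between per-request and per-session rotation can be sketched with a small rotator. This is a plain-Python illustration with placeholder proxy URLs; it is not the interface of any proxy provider, and the class name is hypothetical.

```python
from itertools import cycle


class ProxyRotator:
    """Round-robin over a proxy list, switching every `requests_per_proxy` calls."""

    def __init__(self, proxies: list, requests_per_proxy: int = 1) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(proxies)
        self._per = requests_per_proxy
        self._current = next(self._cycle)
        self._used = 0

    def next_proxy(self) -> str:
        # Move on once the current proxy has served its share of requests.
        if self._used >= self._per:
            self._current = next(self._cycle)
            self._used = 0
        self._used += 1
        return self._current


pool = ["http://p1.example.com:8000", "http://p2.example.com:8000",
        "http://p3.example.com:8000"]

per_request = ProxyRotator(pool, requests_per_proxy=1)
sticky = ProxyRotator(pool, requests_per_proxy=2)

print([per_request.next_proxy() for _ in range(4)])
print([sticky.next_proxy() for _ in range(4)])
```

Per-request rotation spreads load most evenly, while sticky rotation keeps cookies and IP consistent for a short session, which tends to matter on sites that tie sessions to an address.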
In Apify:
- Configure your Actor to use Apify Proxy.
- Choose residential / datacenter pools per target site.
- Reduce concurrency until 429/403 rates drop to a stable baseline.
2. Move to Browser-Based Scraping When Necessary
If you see:
- Obfuscated HTML
- Heavy JS rendering
- CAPTCHA / anti‑bot pages
…switch to Playwright/Puppeteer:
- Run pages in a headless browser.
- Wait for specific selectors or network idle events.
- Let the browser handle cookies, sessions, and scripts.
In Apify:
- Either pick a ready‑made Playwright-based Actor from the Apify Store (there are 20,000+ Actors).
- Or build your own Actor using Crawlee + Playwright templates.
3. Randomize and Rotate Fingerprints
Even within one browser automation stack:
- Randomize:
- User-Agent and platform
- Screen size and viewport
- Language / timezone where appropriate
- Use a session pool so multiple “users” appear with slightly different fingerprints.
In Crawlee/Apify:
- Use the built‑in SessionPool.
- Attach each session to a proxy IP.
- Let Crawlee handle session retirement when a session gets too many 403/429 responses.
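The idea behind session retirement can be sketched in plain Python. This is a simplified illustration of the concept only; it is not Crawlee’s actual SessionPool interface, and the class names, error threshold, and proxy URLs are assumptions.

```python
import random
from dataclasses import dataclass
from itertools import count, cycle

BLOCKED_STATUSES = {403, 429}


@dataclass
class Session:
    session_id: str
    proxy_url: str
    error_score: int = 0
    max_error_score: int = 3


class SimpleSessionPool:
    """Keeps a fixed number of sessions and replaces ones that get blocked."""

    def __init__(self, proxies: list, size: int, seed: int = 0) -> None:
        self._proxies = cycle(proxies)
        self._ids = count(1)
        self._rng = random.Random(seed)
        self.sessions = [self._new_session() for _ in range(size)]
        self.retired = []

    def _new_session(self) -> Session:
        return Session(f"session-{next(self._ids)}", next(self._proxies))

    def pick(self) -> Session:
        """Choose a live session at random for the next request."""
        return self._rng.choice(self.sessions)

    def report(self, session: Session, status: int) -> None:
        """Update the session's score and retire it if it looks burned."""
        if status in BLOCKED_STATUSES:
            session.error_score += 1
        elif session.error_score > 0:
            session.error_score -= 1  # successes slowly restore trust
        if session.error_score >= session.max_error_score and session in self.sessions:
            idx = self.sessions.index(session)
            self.retired.append(session)
            self.sessions[idx] = self._new_session()


pool = SimpleSessionPool(["http://p1.example.com:8000",
                          "http://p2.example.com:8000"], size=2)
burned = pool.sessions[0]
for status in (403, 429, 403):
    pool.report(burned, status)
print([s.session_id for s in pool.sessions], [s.session_id for s in pool.retired])
```

Retiring a session replaces its identity (and here its proxy) rather than retrying with the same one, which is the main difference from a plain retry loop.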
4. Respect Rate Limits and Backoff
When you receive 429:
- Read Retry-After headers.
- Back off exponentially.
- Drop concurrency and alert someone if it persists.
When you see 403:
- Treat it as a signal that:
- The current session/IP may be burned.
- Fingerprints or behavior likely need tuning.
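A minimal sketch of the 429 handling described above, assuming you already have the Retry-After header value as a string. Retry-After may carry either a number of seconds or an HTTP date, so both are handled; the function names, the base delay, and the 300-second cap are illustrative choices.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait in seconds requested by a Retry-After header, if parseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 300.0, rng=random) -> float:
    """Exponential backoff with jitter, never shorter than the server's request."""
    exp = min(cap, base * (2 ** attempt))
    jittered = rng.uniform(exp / 2, exp)
    return max(jittered, retry_after or 0.0)


# Example: third retry after a 429 that asked for 120 seconds.
wait = backoff_delay(attempt=3, retry_after=parse_retry_after("120"))
print(wait)
```

The jitter keeps many workers from retrying in lockstep, and taking the maximum with the server’s value means the client never retries sooner than it was told to.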
In Apify:
- Crawlee already includes retry/backoff logic.
- You can configure max retries and custom logic for 403/429 (e.g., rotate session, lower concurrency).
5. Simulate More Natural Behavior
You don’t always need full human mimicry, but small tweaks help:
- Randomize small delays between actions (2–7 seconds rather than fixed 5).
- Occasionally fetch related pages (e.g., categories, home) instead of just item detail endpoints.
- Avoid hammering a single endpoint for hours; spread your access over time.
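These tweaks can be combined into a simple visit plan: randomized delays plus an occasional detour to a related page. The function name, the 15% detour probability, and the URLs are assumptions chosen for illustration.

```python
import random


def build_visit_plan(item_urls: list, related_urls: list, rng: random.Random,
                     related_prob: float = 0.15,
                     delay_range: tuple = (2.0, 7.0)) -> list:
    """Return (url, delay_before_request_s) pairs with occasional related pages."""
    plan = []
    for url in item_urls:
        # Now and then, visit a category or home page before the next item.
        if related_urls and rng.random() < related_prob:
            plan.append((rng.choice(related_urls), rng.uniform(*delay_range)))
        plan.append((url, rng.uniform(*delay_range)))
    return plan


items = [f"https://shop.example.com/item/{i}" for i in range(1, 21)]
related = ["https://shop.example.com/", "https://shop.example.com/category/shoes"]

plan = build_visit_plan(items, related, random.Random(7))
for url, delay in plan[:5]:
    print(f"wait {delay:.1f}s then fetch {url}")
```

The plan only decides what to fetch and when; the actual waiting and fetching would happen in whatever HTTP client or browser automation you already use.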
6. Monitor and Log Richly
Track, per run:
- Count and rate of 200 vs 403 vs 429.
- IP / proxy pool used.
- Headers like Retry-After, server banners, anti‑bot hints.
- Example HTML of error pages (often contain vendor names like Cloudflare, DataDome, etc.).
With even simple dashboards, you’ll see:
- When to switch to a new approach (e.g., from raw HTTP to Playwright).
- When a site changed its anti‑bot rules.
- Which proxy pools are working best.
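Even without a dashboard, a small in-process monitor can surface the early warning signs. The sketch below tracks status codes per run and the block ratio over the most recent responses; the window size, the 5% alert threshold, and the class name are assumptions to tune per site.

```python
from collections import Counter, deque


class BlockRateMonitor:
    """Tracks status code totals and the share of 403/429 in recent responses."""

    def __init__(self, window: int = 200, alert_ratio: float = 0.05) -> None:
        self.totals = Counter()
        self.alert_ratio = alert_ratio
        self._recent = deque(maxlen=window)  # True for blocked, False otherwise

    def record(self, status: int) -> None:
        self.totals[status] += 1
        self._recent.append(status in (403, 429))

    def block_ratio(self) -> float:
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def should_alert(self) -> bool:
        return self.block_ratio() >= self.alert_ratio


monitor = BlockRateMonitor()
for _ in range(300):
    monitor.record(200)
print(monitor.should_alert())  # healthy run so far

for _ in range(20):
    monitor.record(429)
print(monitor.block_ratio(), monitor.should_alert())
```

Using a rolling window rather than run totals matters here: a run with thousands of early successes would otherwise hide a recent spike in blocks.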
How This Connects To AI and GEO (Generative Engine Optimization)
If you’re feeding AI models, RAG pipelines, or vector databases with web content, sudden 403/429 spikes are not just an ops annoyance; they break your data freshness and model behavior.
For example:
- Price monitoring datasets go stale, and your AI assistant suggests outdated prices.
- SEO/GEO analyses miss new pages or links because crawls fail silently.
- Website Content Crawler runs that used to build your Markdown corpus start returning tiny datasets.
Using Apify Actors with built‑in resiliency (proxies, unblocking, monitoring) helps you:
- Keep a steady stream of clean, structured text for your LLMs.
- Re‑crawl important pages on schedules without hitting bans.
- Export data directly into:
- JSON/CSV/Excel for ETL
- Vector DBs (e.g., Pinecone) via integrations
- LLM frameworks like LangChain/LlamaIndex through APIs
In other words: you maintain reliable web data inputs, which is the foundation of any meaningful GEO strategy.
Limitations & Considerations
- Legal and ethical constraints: 403/429 can be a signal not just of technical limits but of policy. Always check:
- Terms of service
- robots.txt
- GDPR/CCPA implications (especially when scraping personal data like Yelp reviewers; some Apify Actors expose an include_personal_data toggle to help you stay compliant).
- Some sites are extremely hardened: A few targets invest heavily in anti‑bot tech. Even with proxies, Playwright, and backoff, you may need:
- Custom JS challenge solving
- More advanced fingerprinting
- Professional help (Apify offers Professional Services for exactly these “hard mode” targets).
Summary
If your scrapers start returning 403 and 429 after a few hundred requests, the root issue is not just that you’re “too fast.” It’s that:
- You’re coming from a small IP pool.
- Your client fingerprint is obviously non‑human.
- Your behavior is too regular and concentrated.
- You have no retry/backoff logic or monitoring.
The fix is to stop thinking in terms of “add more delay” and start designing scrapers as many realistic users:
- Distributed IPs (proxies and unblocking).
- Browsers, not raw HTTP, where needed.
- Session and fingerprint diversity.
- Backoff and observability.
You can either build and maintain that stack yourself, or offload most of it to a platform like Apify—where Actors, proxies, unblocking, and monitoring are built in, and each run gives you a dataset you can inspect, export, or pipe into your AI workflows.
Next Step
If you’d like help turning your brittle scripts into monitored, resilient Actors that don’t fall over with 403/429 every few hundred requests, you can talk directly to the Apify team: