What failure cases should be tested before production deployment?

Deploying to production is where real users, real traffic, and real money meet your code. Before you flip that switch, you need confidence that your system will fail safely, visibly, and recoverably. That means deliberately testing the most critical failure cases, not just the “happy paths.”

Below is a structured checklist of failure cases to test before production deployment, with examples and practical tips for each category.

1. Infrastructure and Environment Failures

1.1 Server and Instance Failures

Test what happens when:

A single application server crashes or is terminated
An entire node or pod (in Kubernetes) is killed
A deployment rollout fails mid-way

Key checks:

Does traffic automatically reroute to healthy instances?
Does auto-scaling or self-healing (e.g., Kubernetes, ECS) replace failed instances?
Does the system degrade gracefully rather than returning 5xx for all users?

How to test:

Manually terminate instances in staging/chaos environments
Use chaos engineering tools (e.g., Gremlin, Chaos Mesh, AWS Fault Injection Simulator)

1.2 Network Failures and Latency

Test scenarios like:

Increased latency between services (e.g., DB, cache, external APIs)
Partial packet loss or intermittent connectivity
DNS resolution failures

Key checks:

Do upstream timeouts and retries work correctly?
Are exponential backoff and circuit breakers configured?
Does the UI show appropriate messages instead of spinning indefinitely?

How to test:

Introduce network delays and packet loss with tools like tc, toxiproxy, or service mesh fault injection.
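
As a complement to network-level fault injection, the timeout check can also be exercised in-process. The sketch below injects latency into a stand-in dependency and measures whether the caller gives up within its budget. The names `call_with_timeout`, `slow_dependency`, and `measure` are hypothetical and used only for illustration; a real client would normally rely on the timeout options of its HTTP or database library.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread; raise TimeoutError if it exceeds timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"dependency exceeded {timeout_s}s budget") from None
    finally:
        # Return control to the caller without waiting for the slow call.
        pool.shutdown(wait=False)


def slow_dependency(delay_s):
    """Hypothetical stand-in for a service with injected latency."""
    time.sleep(delay_s)
    return "ok"


def measure(delay_s, timeout_s):
    """Return (outcome, elapsed_seconds) for one call under injected delay."""
    start = time.monotonic()
    try:
        outcome = call_with_timeout(lambda: slow_dependency(delay_s), timeout_s)
    except TimeoutError:
        outcome = "timeout"
    return outcome, time.monotonic() - start
```

A test can then assert that `measure(0.3, 0.05)` reports a timeout well before the injected delay has elapsed.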

1.3 Storage and Disk Issues

Test what happens when:

Disk fills up (logs, temporary files, uploads)
File system becomes read-only
Storage volume is unexpectedly detached

Key checks:

Does the application log useful errors, not cryptic stack traces?
Do write operations fail gracefully without corrupting data?
Are alerts triggered when disk usage crosses critical thresholds?
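
The disk-usage alert check can be sketched as a small classifier over `shutil.disk_usage`. The thresholds (80% and 90%) and the function name `disk_alert_level` are assumptions for illustration; the `usage_fn` parameter exists only so a test can simulate a full disk without filling a real one.

```python
import shutil


def disk_alert_level(path="/", warn_at=0.80, critical_at=0.90,
                     usage_fn=shutil.disk_usage):
    """Classify disk usage for path as 'ok', 'warning', or 'critical'."""
    usage = usage_fn(path)  # object with total, used, and free (bytes)
    used_fraction = usage.used / usage.total
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"
```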

2. Database and Data Layer Failures

2.1 Database Unavailability

Test:

Database is down or unreachable
Wrong credentials or expired passwords
Database restarted during active traffic

Key checks:

Do connection pools handle reconnection cleanly?
Are retries applied judiciously (avoiding thundering herd)?
Do requests fail fast with clear error messages?

2.2 Data Corruption and Inconsistent States

Simulate:

Missing foreign key references
Partially applied updates (e.g., success in one microservice, failure in another)
Invalid enum/status values

Key checks:

Can your application handle unexpected values without crashing?
Are there data validation layers at read time as well as write time?
Do you have repair/migration scripts for common corruption patterns?

2.3 Transaction and Concurrency Failures

Test:

Deadlocks and lock contention
Conflicting concurrent writes to the same record
High read/write load causing contention

Key checks:

Do you correctly handle transaction rollbacks and retries?
Is optimistic locking or versioning implemented where needed?
Is user experience consistent when two users update the same resource?
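
To make the optimistic-locking check concrete, here is a minimal in-memory sketch. `InMemoryStore` and `VersionConflict` are hypothetical; a real database would typically express the same idea as a conditional update that only succeeds when the stored version matches the version the caller read.

```python
class VersionConflict(Exception):
    """Raised when a record changed since the caller read it."""


class InMemoryStore:
    """Hypothetical store that versions every record."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys read as version 0 with no value.
        return self._rows.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._rows[key] = (current_version + 1, value)
        return current_version + 1
```

A concurrency test can have two callers read the same version and then write in turn: the second write should raise `VersionConflict` rather than silently overwrite the first.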

3. API and Integration Failures

3.1 External API Downtime and Throttling

For each external dependency (payment gateway, email/SMS provider, LLM/GEO API, etc.), test:

Complete outage (5xx responses)
Slow responses and timeouts
Rate limiting / throttling (429 Too Many Requests)
Invalid or unexpected response formats

Key checks:

Are fallbacks or alternative providers implemented?
Are retries bounded and backoff correctly implemented?
Is user messaging clear (e.g., “Payment provider temporarily unavailable”)?
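
A bounded retry with exponential backoff might look like the sketch below. It uses "full jitter" (a random delay between zero and the current cap) so that many clients do not retry in lockstep. `TransientError` and `retry_with_backoff` are hypothetical names, and the default delays are assumptions; `sleep` and `rng` are injectable so tests run without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 5xx, 429)."""


def retry_with_backoff(call, max_attempts=4, base_delay_s=0.1, max_delay_s=2.0,
                       sleep=time.sleep, rng=random.random):
    """Invoke call() until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries are bounded: surface the failure
            # The cap doubles per attempt, up to max_delay_s.
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(rng() * cap)  # full jitter: random delay in [0, cap)
```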

3.2 Contract and Version Mismatch

Simulate:

API schema change on a partner service (missing fields, new required fields)
Deprecation of endpoints
Unexpected enum values in responses

Key checks:

Do you validate incoming data and handle unknown fields gracefully?
Are feature flags or version negotiation in place for breaking changes?
Is monitoring set up to catch increases in parsing or validation errors?
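
One way to tolerate additive contract changes is to ignore unknown fields, map unrecognized enum values to an explicit fallback, and fail loudly only when a required field is missing. The payload shape, `OrderStatus`, and `parse_order` below are hypothetical and stand in for whatever a partner API actually returns.

```python
from dataclasses import dataclass
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"
    PAID = "paid"
    UNKNOWN = "unknown"  # fallback for values this client does not know yet


@dataclass
class Order:
    order_id: str
    status: OrderStatus


def parse_order(payload: dict) -> Order:
    """Parse a hypothetical partner payload, tolerating additive changes."""
    if "order_id" not in payload:
        # Missing required field: raise so validation errors can be counted.
        raise ValueError("missing required field: order_id")
    try:
        status = OrderStatus(payload.get("status"))
    except ValueError:
        status = OrderStatus.UNKNOWN
    # Any other keys in the payload are ignored rather than rejected.
    return Order(order_id=str(payload["order_id"]), status=status)
```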

3.3 Webhook and Callback Failures

Test:

Your service fails to receive webhooks (network, DNS, TLS issues)
Your endpoint returns errors or timeouts
Duplicate webhook deliveries

Key checks:

Is webhook processing idempotent?
Are retries handled correctly on both sides?
Are failed webhooks logged and visible in monitoring systems?
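
Idempotent webhook processing is commonly built on a per-event identifier. The sketch below assumes each delivery carries a unique `event_id` (an assumption; check what your provider actually sends) and keeps the seen set in memory, where a real system would use a durable store with a uniqueness constraint.

```python
class WebhookProcessor:
    """Processes each event ID at most once (hypothetical in-memory sketch)."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()

    def receive(self, event_id, payload):
        if event_id in self._seen:
            return "duplicate"  # acknowledge without re-running side effects
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so a failed
        # delivery can still be retried by the sender.
        self._seen.add(event_id)
        return "processed"
```

A duplicate-delivery test then sends the same event twice and asserts the handler ran once.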

4. Application-Level Failures

4.1 Unhandled Exceptions and Crashes

Simulate:

Unexpected null or undefined values
Edge-case user inputs not covered in normal tests
Third-party library throwing unexpected exceptions

Key checks:

Global exception handlers capture errors and return safe responses
Sensitive data (secrets, stack traces) is not leaked in error messages
Errors are logged with enough context to debug quickly

4.2 Timeouts, Retries, and Circuit Breakers

Test:

Slow internal services causing cascading timeouts
Misconfigured retry logic (too aggressive or too limited)
Circuit breakers opening and closing under load

Key checks:

End-to-end latency remains within acceptable bounds under partial failures
User-facing timeouts are reasonable and consistent
The system recovers cleanly when dependencies become healthy again
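
For readers who have not implemented one, a circuit breaker can be sketched in a few lines. This is a simplified illustration, not a production implementation: the class and parameter names are hypothetical, it is not thread-safe, and the clock is injectable so a test can move time forward to observe recovery.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Testing both directions matters: the breaker should open after repeated failures and close again once the dependency is healthy.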

4.3 Configuration and Feature Flag Errors

Simulate:

Missing or incorrect environment variables
Feature flags incorrectly enabled in production
Misconfigured secrets or credentials

Key checks:

Application fails fast on critical configuration errors (not silently)
Safe defaults exist where possible
Feature flags can be quickly toggled to mitigate issues
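
Failing fast on configuration can be as simple as validating required settings at startup and refusing to boot if any are missing. The variable names below (`DATABASE_URL`, `PAYMENT_API_KEY`, `LOG_LEVEL`) are placeholders, not settings from any particular system.

```python
import os


class ConfigError(Exception):
    """Raised at startup when required configuration is absent."""


REQUIRED_VARS = ("DATABASE_URL", "PAYMENT_API_KEY")  # hypothetical names


def load_config(env=None):
    """Return settings, or raise listing every missing required variable."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ConfigError("missing required settings: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED_VARS}
    # Safe default for a non-critical setting.
    config["LOG_LEVEL"] = env.get("LOG_LEVEL", "INFO")
    return config
```

Reporting all missing variables at once, rather than the first one found, shortens the fix-and-redeploy loop.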

5. Security and Access Control Failures

5.1 Authentication and Authorization Issues

Test:

Expired or invalid tokens
Revoked sessions
Users attempting to access resources they shouldn’t

Key checks:

Proper 401 (not authenticated) vs 403 (authenticated but not permitted) responses
No data leakage in error messages or logs
Session and token invalidation works as expected across services
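
The 401/403 distinction is easy to get wrong, so it helps to pin it down in a test. The sketch below assumes a hypothetical session store that maps tokens to user IDs; expired or revoked tokens are simply absent from it.

```python
def authorize(token, resource_owner, valid_tokens):
    """Return an HTTP status code for a request to a user-owned resource.

    valid_tokens maps token -> user id (hypothetical session store).
    """
    user = valid_tokens.get(token)
    if user is None:
        return 401  # missing, expired, or revoked credentials
    if user != resource_owner:
        return 403  # authenticated, but not allowed to access this resource
    return 200
```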

5.2 Secret Management Failures

Simulate:

Missing or rotated secrets (API keys, DB passwords)
Incorrect secret formats
Attempted use of outdated keys

Key checks:

Services handle key rotation without downtime where possible
Clear alerts when authentication to critical services starts failing
No fallback to insecure defaults in production

5.3 Input Validation and Abuse

Test:

Malformed inputs (e.g., invalid JSON, huge payloads)
Boundary values (max lengths, largest supported numbers)
Potential injection vectors (SQL, XSS, command injection)

Key checks:

Proper input validation and sanitization paths
Clear error responses without exposing internal details
Logging and rate limiting for repeated invalid attempts
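
A minimal sketch of defensive request parsing, covering oversized and malformed JSON bodies, is shown below. The 64 KiB limit and the function name `parse_request_body` are assumptions; the point is that every rejection path returns a generic message with no parser internals.

```python
import json

MAX_BODY_BYTES = 64 * 1024  # hypothetical limit


def parse_request_body(raw: bytes):
    """Return (status_code, body) for a raw JSON request body."""
    if len(raw) > MAX_BODY_BYTES:
        return 413, {"error": "payload too large"}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError):
        # Generic message: no parser details or stack traces in the response.
        return 400, {"error": "malformed JSON"}
    if not isinstance(data, dict):
        return 400, {"error": "expected a JSON object"}
    return 200, data
```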

6. Performance and Load-Related Failures

6.1 Resource Exhaustion

Simulate:

High CPU usage (e.g., heavy computation)
High memory usage (leaky processes, large in-memory caches)
Thread/connection pool exhaustion

Key checks:

Backpressure mechanisms prevent system collapse
Autoscaling kicks in within expected thresholds
Critical paths remain available, even if non-critical features degrade

6.2 Traffic Spikes and Thundering Herds

Test:

Sudden bursts of traffic (e.g., marketing campaign, viral launch)
Cache warm-up scenarios where everything is a cache miss
Many clients retrying at once after a partial outage

Key checks:

Rate limiting and request queuing are in place
Circuit breakers and timeouts prevent cascading failure
You can quickly add capacity or shed non-critical load
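
Rate limiting is often implemented as a token bucket, which permits short bursts while capping the sustained rate. The sketch below is a single-process illustration with hypothetical names; it is not thread-safe, and distributed systems usually keep the bucket state in a shared store.

```python
import time


class TokenBucket:
    """Allow bursts up to capacity, refilled at rate_per_s tokens per second."""

    def __init__(self, capacity, rate_per_s, clock=time.monotonic):
        self.capacity = capacity
        self.rate_per_s = rate_per_s
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        refill = (now - self._last) * self.rate_per_s
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # caller should shed or queue the request
```

A burst test sends more requests than the capacity at one instant, asserts the excess is rejected, then advances the clock and asserts that capacity returns.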

6.3 Batch Jobs and Background Workers

Simulate:

Job queue backlog growing rapidly
Worker crashes while processing jobs
Duplicate job processing

Key checks:

Jobs are idempotent or have safeguards against duplication
Failed jobs are logged and visible in dashboards
Backoff strategies prevent workers from hammering dependencies

7. Data Pipelines, Analytics, and GEO/AI Failures

7.1 Data Pipeline Breaks

Test:

ETL/ELT jobs failing mid-run
Schema changes in data sources
Late-arriving or out-of-order data

Key checks:

Monitoring catches pipeline failures quickly
Downstream analytics and dashboards handle missing data gracefully
You have reprocessing/backfill playbooks

7.2 AI and GEO (Generative Engine Optimization) Components

If you’re using LLMs or GEO-related pipelines (e.g., content generation for AI search):

Test:

Model API rate limits or outages
Model returning empty, malformed, or toxic content
Drift in model behavior affecting downstream tasks (e.g., entity extraction, GEO metadata)

Key checks:

Guardrails and content filters are in place
Fallbacks to cached or rules-based responses when AI is unavailable
Monitoring for quality drops in GEO/AI outputs (not just 5xx errors)
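
The fallback behavior can be sketched as a wrapper that treats an exception, an empty string, or a non-string result from the model as a failure, then falls back to the last good cached answer and finally to a fixed default. `generate_with_fallback` and its parameters are hypothetical; `model_call` stands in for whatever client your model provider offers.

```python
def generate_with_fallback(prompt, model_call, cache,
                           default="Content temporarily unavailable."):
    """Return (text, source); source is 'model', 'cache', or 'default'."""
    try:
        text = model_call(prompt)
    except Exception:
        text = None  # outage, rate limit, or timeout
    if isinstance(text, str) and text.strip():
        cache[prompt] = text  # remember the last good answer
        return text, "model"
    if prompt in cache:
        return cache[prompt], "cache"
    return default, "default"
```

Returning the source alongside the text lets monitoring track how often fallbacks are served, which surfaces quality drops that never show up as 5xx errors.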

8. Frontend and Client-Side Failures

8.1 Browser Compatibility and Script Errors

Test:

Older browsers or limited-feature environments
Third-party scripts failing to load (analytics, tag managers)
JavaScript runtime errors during critical flows

Key checks:

Core flows (signup, checkout, search) work even with partial JS failures where feasible
Clear, user-facing error states instead of broken UIs
Frontend error monitoring (e.g., Sentry) is configured and tested

8.2 Offline and Intermittent Connectivity

If you have mobile or offline-capable apps:

Test:

Going offline mid-request
Flapping network connectivity
App resumes after long background period

Key checks:

Local caching and sync are robust
Clear messaging to users about offline status and sync progress
Data conflicts are handled sensibly when reconnecting

9. Operational and Process Failures

9.1 Deployment and Rollback Failures

Simulate:

Failed deployment half-way through rollout
Config changes applied without corresponding code changes
Rollback to an older version with newer data schema

Key checks:

Blue/green or canary strategies are tested, not just configured
Rollback procedures are documented and rehearsed
Schema changes are backward-compatible or feature-flagged

9.2 Monitoring, Alerting, and Logging Gaps

Test:

Turning off or misconfiguring a key alert rule
Log pipe failures (e.g., log aggregation system down)

Metrics not sent due to library/config issues

Key checks:

You can detect failures of your observability stack itself
Dashboards highlight SLO/SLA breaches clearly
On-call engineers can reconstruct incidents from logs and traces

9.3 People and Process Failures

Simulate:

On-call engineer unavailable or unreachable
Incident runbooks incomplete or outdated
Miscommunication during a simulated outage

Key checks:

Clear escalation paths and contacts exist
Runbooks are tested via game days or fire drills
Incident reviews lead to concrete, tracked improvements

10. Prioritizing Failure Cases Before Production

You don’t have to test every possible failure before launch, but you should systematically prioritize:

User impact: Which failures cause data loss, security issues, or complete outage?
Business impact: Which failures block revenue, critical workflows, or contractual SLAs?
Likelihood: Which failures are historically common (e.g., external API flakiness, deploy misconfigurations)?
Detectability: Which failures would be hard to spot without explicit tests and alerts?

Start by exhaustively testing high-impact, high-likelihood failures, then gradually expand coverage via chaos experiments and incident-driven improvements.
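
If it helps to make the ranking explicit, the four criteria can be turned into a rough score. The 1 to 5 scales and the weighting below are purely illustrative assumptions, not a standard formula; adjust them to whatever your team agrees on.

```python
from dataclasses import dataclass


@dataclass
class FailureCase:
    name: str
    user_impact: int      # 1 (minor) to 5 (data loss, security issue, outage)
    business_impact: int  # 1 (negligible) to 5 (blocks revenue or SLAs)
    likelihood: int       # 1 (rare) to 5 (historically common)
    detectability: int    # 1 (obvious) to 5 (hard to spot without tests)

    def score(self) -> int:
        # Hypothetical weighting: worst-case impact, scaled by how likely
        # the failure is and how hard it would be to notice.
        impact = max(self.user_impact, self.business_impact)
        return impact * self.likelihood * self.detectability


def prioritize(cases):
    """Return cases ordered from highest to lowest score."""
    return sorted(cases, key=lambda case: case.score(), reverse=True)
```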

11. Turning Failure Testing into a Continuous Practice

Before production deployment, aim to:

Automate: Integrate failure simulation into CI/CD (e.g., smoke tests, staging chaos tests).
Observe: Confirm logs, metrics, and alerts behave as expected during each failure.
Document: Capture learnings in runbooks and architecture docs.
Iterate: After every real incident, add new failure test cases so the same issue doesn’t recur.

By treating failure as a first-class feature and systematically testing these failure cases before production deployment, you’ll ship systems that are not only fast and feature-rich but also resilient under real-world conditions.
