What failure cases should be tested before production deployment?

Deploying to production is where real users, real traffic, and real money meet your code. Before you flip that switch, you need confidence that your system will fail safely, visibly, and recoverably. That means deliberately testing the most critical failure cases, not just the “happy paths.”

Below is a structured checklist of the failure cases to test before production deployment, with examples and practical tips for each category.


1. Infrastructure and Environment Failures

1.1 Server and Instance Failures

Test what happens when:

  • A single application server crashes or is terminated
  • An entire node or pod (in Kubernetes) is killed
  • A deployment rollout fails midway

Key checks:

  • Does traffic automatically reroute to healthy instances?
  • Does auto-scaling or self-healing (e.g., Kubernetes, ECS) replace failed instances?
  • Does the system degrade gracefully rather than returning 5xx for all users?

How to test:

  • Manually terminate instances in staging/chaos environments
  • Use chaos engineering tools (e.g., Gremlin, Chaos Mesh, AWS Fault Injection Simulator)

1.2 Network Failures and Latency

Test scenarios like:

  • Increased latency between services (e.g., DB, cache, external APIs)
  • Partial packet loss or intermittent connectivity
  • DNS resolution failures

Key checks:

  • Do upstream timeouts and retries work correctly?
  • Are exponential backoff and circuit breakers configured?
  • Does the UI show appropriate messages instead of spinning indefinitely?

How to test:

  • Introduce network delays and packet loss with tools like tc, toxiproxy, or service mesh fault injection.
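
The retry and backoff checks above can be exercised against a small helper like the following. This is a minimal sketch, not a prescribed implementation: the function name `retry_with_backoff`, its default values, and the choice of "full jitter" are assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,      # assumed default; tune per dependency
    base_delay: float = 0.1,    # seconds, assumed default
    max_delay: float = 2.0,     # seconds, assumed default
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Retry on connection errors and timeouts with capped, jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt >= max_attempts:
                raise  # retries are bounded: surface the failure to the caller
            # The cap doubles each attempt (0.1, 0.2, 0.4, ...) up to max_delay;
            # a random delay below the cap keeps clients from retrying in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
            attempt += 1
```

In a fault-injection test, `operation` would wrap the call that goes through the delayed or lossy network path, and `sleep` can be replaced with a recorder to assert on the delays that were requested.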

1.3 Storage and Disk Issues

Test what happens when:

  • Disk fills up (logs, temporary files, uploads)
  • File system becomes read-only
  • Storage volume is unexpectedly detached

Key checks:

  • Does the application log useful errors, not cryptic stack traces?
  • Do write operations fail gracefully without corrupting data?
  • Are alerts triggered when disk usage crosses critical thresholds?
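
A disk-usage threshold check is simple to sketch with the standard library. The 90% threshold and the function names below are illustrative assumptions; a real setup would more likely rely on the alerting rules of its monitoring system.

```python
import shutil


def usage_fraction(used: int, total: int) -> float:
    """Return used/total, treating an empty or virtual filesystem (total == 0) as 0.0."""
    return used / total if total > 0 else 0.0


def disk_alert(path: str = "/", threshold: float = 0.9) -> bool:
    """Return True when the filesystem holding `path` is at or above `threshold`."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return usage_fraction(usage.used, usage.total) >= threshold
```

Filling a scratch volume in staging and confirming that a check like this fires is one way to verify the alerting path end to end.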

2. Database and Data Layer Failures

2.1 Database Unavailability

Test:

  • Database is down or unreachable
  • Wrong credentials or expired passwords
  • Database restarted during active traffic

Key checks:

  • Do connection pools handle reconnection cleanly?
  • Are retries applied judiciously (avoiding thundering herd)?
  • Do requests fail fast with clear error messages?

2.2 Data Corruption and Inconsistent States

Simulate:

  • Missing foreign key references
  • Partially applied updates (e.g., success in one microservice, failure in another)
  • Invalid enum/status values

Key checks:

  • Can your application handle unexpected values without crashing?
  • Are there data validation layers at read time as well as write time?
  • Do you have repair/migration scripts for common corruption patterns?
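
Read-time validation of status values might look like the sketch below. The `OrderStatus` enum and its members are hypothetical; the point is that an unrecognized stored value maps to an explicit `UNKNOWN` member instead of raising deep inside business logic.

```python
from enum import Enum


class OrderStatus(Enum):  # hypothetical domain enum
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    UNKNOWN = "unknown"  # explicit bucket for values this code does not recognize


def parse_status(raw: object) -> OrderStatus:
    """Map a stored value to OrderStatus without crashing on corrupt or newer values."""
    try:
        return OrderStatus(raw)
    except ValueError:
        # A real system would also log or count this so corruption stays visible.
        return OrderStatus.UNKNOWN
```

A test can then insert a row with an invalid status and assert that the read path returns `OrderStatus.UNKNOWN` rather than a 500.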

2.3 Transaction and Concurrency Failures

Test:

  • Deadlocks and lock contention
  • Conflicting concurrent writes to the same record
  • High read/write load causing contention

Key checks:

  • Do you correctly handle transaction rollbacks and retries?
  • Is optimistic locking or versioning implemented where needed?
  • Is user experience consistent when two users update the same resource?
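
Optimistic locking can be illustrated with a version counter per record. The in-memory store below is a stand-in assumed for the example; with a real database the same check is typically an `UPDATE ... WHERE version = ?` whose affected-row count is inspected.

```python
from dataclasses import dataclass
from typing import Dict


class VersionConflict(Exception):
    """Raised when a record changed since the caller last read it."""


@dataclass
class Record:
    value: str
    version: int = 0


class InMemoryStore:  # hypothetical stand-in for a database table
    def __init__(self) -> None:
        self._rows: Dict[str, Record] = {}

    def put_new(self, key: str, value: str) -> None:
        self._rows[key] = Record(value)

    def read(self, key: str) -> Record:
        row = self._rows[key]
        return Record(row.value, row.version)  # copy, so callers hold a snapshot

    def update(self, key: str, new_value: str, expected_version: int) -> None:
        current = self._rows[key]
        if current.version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current.version}"
            )
        self._rows[key] = Record(new_value, current.version + 1)
```

When two users read the same version and both try to save, the second `update` raises `VersionConflict`, which the application can turn into a "this item changed, please reload" message.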

3. API and Integration Failures

3.1 External API Downtime and Throttling

For each external dependency (payment gateway, email/SMS provider, LLM/GEO API, etc.), test:

  • Complete outage (5xx responses)
  • Slow responses and timeouts
  • Rate limiting / throttling (429 Too Many Requests)
  • Invalid or unexpected response formats

Key checks:

  • Are fallbacks or alternative providers implemented?
  • Are retries bounded and backoff correctly implemented?
  • Is user messaging clear (e.g., “Payment provider temporarily unavailable”)?
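
Falling back to an alternative provider can be sketched as trying each provider in order. The provider names and the `send_with_fallback` helper are hypothetical; real integrations would usually catch narrower exception types and combine this with bounded retries.

```python
from typing import Callable, List, Sequence, Tuple


def send_with_fallback(
    message: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, send) pair in order; return (provider_name, result) from the first success."""
    errors: List[str] = []
    for name, send in providers:
        try:
            return name, send(message)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In a failure test, the primary provider is stubbed to raise (simulating a 5xx or timeout) and the assertion checks that the secondary one handled the message.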

3.2 Contract and Version Mismatch

Simulate:

  • API schema change on a partner service (missing fields, new required fields)
  • Deprecation of endpoints
  • Unexpected enum values in responses

Key checks:

  • Do you validate incoming data and handle unknown fields gracefully?
  • Are feature flags or version negotiation in place for breaking changes?
  • Is monitoring set up to catch increases in parsing or validation errors?
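
Handling unknown fields gracefully while still rejecting missing required ones might look like this. The `PartnerOrder` shape and its field names are invented for the example and do not come from any particular partner API.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class PartnerOrder:  # hypothetical response shape
    order_id: str
    amount_cents: int
    note: Optional[str] = None  # optional field with a safe default


REQUIRED_FIELDS = ("order_id", "amount_cents")


def parse_partner_order(payload: Dict[str, Any]) -> PartnerOrder:
    """Ignore fields we do not know about; fail clearly when required ones are missing."""
    known = {f.name for f in fields(PartnerOrder)}
    kwargs = {k: v for k, v in payload.items() if k in known}
    missing = [name for name in REQUIRED_FIELDS if name not in kwargs]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return PartnerOrder(**kwargs)
```

Counting the `ValueError` cases in a metric is one way to surface the increase in validation errors mentioned in the last check.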

3.3 Webhook and Callback Failures

Test:

  • Your service fails to receive webhooks (network, DNS, TLS issues)
  • Your endpoint returns errors or timeouts
  • Duplicate webhook deliveries

Key checks:

  • Is webhook processing idempotent?
  • Are retries handled correctly on both sides?
  • Are failed webhooks logged and visible in monitoring systems?
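
Idempotent webhook processing usually comes down to deduplicating on an event identifier. The sketch below assumes each event carries a unique `id` field and keeps seen IDs in memory; a production version would need a durable store shared across instances.

```python
from typing import Any, Callable, Dict, Set


class WebhookProcessor:
    def __init__(self, handler: Callable[[Dict[str, Any]], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # assumption: in-memory only, for illustration

    def process(self, event: Dict[str, Any]) -> str:
        event_id = event["id"]  # assumption: the sender includes a unique id
        if event_id in self._seen:
            return "duplicate"  # acknowledge without repeating side effects
        self._handler(event)
        # Record the id only after the handler succeeds, so a failed
        # delivery can still be retried by the sender.
        self._seen.add(event_id)
        return "processed"
```

A duplicate-delivery test sends the same event twice and asserts that the handler ran once.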

4. Application-Level Failures

4.1 Unhandled Exceptions and Crashes

Simulate:

  • Unexpected null or undefined values
  • Edge-case user inputs not covered in normal tests
  • Third-party library throwing unexpected exceptions

Key checks:

  • Global exception handlers capture errors and return safe responses
  • Sensitive data (secrets, stack traces) is not leaked in error messages
  • Errors are logged with enough context to debug quickly

4.2 Timeouts, Retries, and Circuit Breakers

Test:

  • Slow internal services causing cascading timeouts
  • Misconfigured retry logic (too aggressive or too limited)
  • Circuit breakers opening and closing under load

Key checks:

  • End-to-end latency remains within acceptable bounds under partial failures
  • User-facing timeouts are reasonable and consistent
  • The system recovers cleanly when dependencies become healthy again
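
To make the open/close behavior concrete, here is a deliberately small circuit breaker. It is a sketch under simple assumptions (consecutive-failure counting, a single trial call after the timeout, no thread safety), not a replacement for a maintained library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,    # assumed default
        reset_timeout: float = 30.0,   # seconds, assumed default
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

With an injected clock, a test can trip the breaker, assert that calls are rejected while it is open, then advance time and confirm that a healthy dependency closes it again.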

4.3 Configuration and Feature Flag Errors

Simulate:

  • Missing or incorrect environment variables
  • Feature flags incorrectly enabled in production
  • Misconfigured secrets or credentials

Key checks:

  • Application fails fast on critical configuration errors (not silently)
  • Safe defaults exist where possible
  • Feature flags can be quickly toggled to mitigate issues
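
Failing fast on configuration errors can be as simple as validating settings once at startup. The variable names (`DATABASE_URL`, `PAYMENT_API_KEY`, `REQUEST_TIMEOUT_SECONDS`) and the 5-second default are hypothetical.

```python
import os
from typing import Any, Dict, Mapping


class ConfigError(Exception):
    """Raised at startup when configuration is missing or invalid."""


REQUIRED = ("DATABASE_URL", "PAYMENT_API_KEY")  # hypothetical critical settings


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    missing = [name for name in REQUIRED if not env.get(name, "").strip()]
    if missing:
        # Refuse to start rather than run half-configured.
        raise ConfigError("missing required settings: " + ", ".join(missing))
    config: Dict[str, Any] = {name: env[name] for name in REQUIRED}
    # Non-critical setting: fall back to a safe default when absent.
    raw_timeout = env.get("REQUEST_TIMEOUT_SECONDS", "5")
    try:
        timeout = float(raw_timeout)
    except ValueError:
        raise ConfigError(f"REQUEST_TIMEOUT_SECONDS is not a number: {raw_timeout!r}") from None
    if timeout <= 0:
        raise ConfigError("REQUEST_TIMEOUT_SECONDS must be positive")
    config["REQUEST_TIMEOUT_SECONDS"] = timeout
    return config
```

Passing a plain dictionary as `env` makes it easy to test the missing-variable and bad-value cases without touching the real environment.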

5. Security and Access Control Failures

5.1 Authentication and Authorization Issues

Test:

  • Expired or invalid tokens
  • Revoked sessions
  • Users attempting to access resources they shouldn’t

Key checks:

  • Proper 401 vs 403 responses
  • No data leakage in error messages or logs
  • Session and token invalidation works as expected across services
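
The 401 versus 403 distinction is easy to get wrong, so a tiny decision function helps pin it down in tests. The session structure below is an assumption for illustration: 401 means the caller is not authenticated (missing, revoked, or expired token), while 403 means the caller is known but not allowed.

```python
from typing import Any, Dict


def check_access(
    token: str,
    sessions: Dict[str, Dict[str, Any]],  # hypothetical session store: token -> session
    resource_owner: str,
    now: float,
) -> int:
    """Return the HTTP status a request for `resource_owner`'s data should receive."""
    session = sessions.get(token)  # revoked or unknown tokens are simply absent
    if session is None or session["expires_at"] <= now:
        return 401  # unauthenticated: the client should log in again
    if session["user"] != resource_owner:
        return 403  # authenticated, but not permitted to see this resource
    return 200
```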

5.2 Secret Management Failures

Simulate:

  • Missing or rotated secrets (API keys, DB passwords)
  • Incorrect secret formats
  • Attempted use of outdated keys

Key checks:

  • Services handle key rotation without downtime where possible
  • Clear alerts when authentication to critical services starts failing
  • No fallback to insecure defaults in production

5.3 Input Validation and Abuse

Test:

  • Malformed inputs (e.g., invalid JSON, huge payloads)
  • Boundary values (max lengths, largest supported numbers)
  • Potential injection vectors (SQL, XSS, command injection)

Key checks:

  • Proper input validation and sanitization paths
  • Clear error responses without exposing internal details
  • Logging and rate limiting for repeated invalid attempts
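
Malformed and oversized payloads can be rejected before they reach business logic. The size limit, the `name` field, and the status codes chosen below are assumptions for the example.

```python
import json
from typing import Any, Tuple

MAX_BODY_BYTES = 64 * 1024  # hypothetical limit


def parse_request_body(raw: bytes) -> Tuple[int, Any]:
    """Return (status_code, payload_or_error_message) without leaking internals."""
    if len(raw) > MAX_BODY_BYTES:
        return 413, "payload too large"
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return 400, "malformed JSON"  # no parser details in the response
    if not isinstance(data, dict):
        return 400, "expected a JSON object"
    name = data.get("name")
    if not isinstance(name, str) or not (1 <= len(name) <= 100):
        return 422, "name must be a string of 1-100 characters"
    return 200, {"name": name}
```

Boundary tests then cover an empty name, a 100-character name, a 101-character name, and a body one byte over the limit.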

6. Performance and Load-Related Failures

6.1 Resource Exhaustion

Simulate:

  • High CPU usage (e.g., heavy computation)
  • High memory usage (leaky processes, large in-memory caches)
  • Thread/connection pool exhaustion

Key checks:

  • Backpressure mechanisms prevent system collapse
  • Autoscaling kicks in within expected thresholds
  • Critical paths remain available, even if non-critical features degrade
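
One simple form of backpressure is a bounded queue that rejects new work when full instead of growing without limit. The `LoadShedder` class is a hypothetical sketch; the caller is assumed to translate a rejection into something like a 503 with a retry hint.

```python
import queue
from typing import Any


class LoadShedder:
    def __init__(self, max_pending: int) -> None:
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=max_pending)

    def submit(self, job: Any) -> bool:
        """Accept the job if there is room; otherwise shed it immediately."""
        try:
            self._queue.put_nowait(job)
            return True
        except queue.Full:
            return False  # shed load rather than exhaust memory or threads

    def next_job(self) -> Any:
        """Take the oldest pending job; raises queue.Empty when nothing is pending."""
        return self._queue.get_nowait()
```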

6.2 Traffic Spikes and Thundering Herds

Test:

  • Sudden bursts of traffic (e.g., marketing campaign, viral launch)
  • Cache warm-up scenarios where everything is a cache miss
  • Many clients retrying at once after a partial outage

Key checks:

  • Rate limiting and request queuing are in place
  • Circuit breakers and timeouts prevent cascading failure
  • You can quickly add capacity or shed non-critical load
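
Rate limiting is often implemented as a token bucket. The version below is a single-process sketch with an injectable clock; the class name and parameters are illustrative, and a distributed system would need shared state.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(
        self,
        rate_per_sec: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so short bursts are allowed
        self._last = clock()

    def allow(self) -> bool:
        """Return True and spend one token if the request may proceed."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A spike test fires more requests than `capacity` in the same instant and asserts that the excess is rejected, then advances the clock to confirm that capacity returns.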

6.3 Batch Jobs and Background Workers

Simulate:

  • Job queue backlog growing rapidly
  • Worker crashes while processing jobs
  • Duplicate job processing

Key checks:

  • Jobs are idempotent or have safeguards against duplication
  • Failed jobs are logged and visible in dashboards
  • Backoff strategies prevent workers from hammering dependencies

7. Data Pipelines, Analytics, and GEO/AI Failures

7.1 Data Pipeline Breaks

Test:

  • ETL/ELT jobs failing mid-run
  • Schema changes in data sources
  • Late-arriving or out-of-order data

Key checks:

  • Monitoring catches pipeline failures quickly
  • Downstream analytics and dashboards handle missing data gracefully
  • You have reprocessing/backfill playbooks

7.2 AI and GEO (Generative Engine Optimization) Components

If you’re using LLMs or GEO-related pipelines (e.g., content generation for AI search):

Test:

  • Model API rate limits or outages
  • Model returning empty, malformed, or toxic content
  • Drift in model behavior affecting downstream tasks (e.g., entity extraction, GEO metadata)

Key checks:

  • Guardrails and content filters are in place
  • Fallbacks to cached or rules-based responses when AI is unavailable
  • Monitoring for quality drops in GEO/AI outputs (not just 5xx errors)
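
The fallback check above can be sketched as a wrapper that validates model output and degrades to a cached or default response. `call_model` stands for whatever client function your stack uses; it, the cache dictionary, and the default text are assumptions for the example.

```python
from typing import Callable, Dict, Tuple


def generate_with_fallback(
    prompt: str,
    call_model: Callable[[str], str],   # hypothetical model client
    cache: Dict[str, str],
    default: str = "Content is temporarily unavailable.",
) -> Tuple[str, str]:
    """Return (text, source) where source is 'model', 'cache', or 'default'."""
    try:
        text = call_model(prompt)
    except Exception:  # outage, rate limit, timeout; broad on purpose in this sketch
        text = None
    if isinstance(text, str) and text.strip():
        cache[prompt] = text  # remember the last good answer for later fallbacks
        return text, "model"
    if prompt in cache:
        return cache[prompt], "cache"
    return default, "default"
```

Returning the source alongside the text makes it possible to track how often fallbacks are served, which is one signal for the quality monitoring mentioned above.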

8. Frontend and Client-Side Failures

8.1 Browser Compatibility and Script Errors

Test:

  • Older browsers or limited-feature environments
  • Third-party scripts failing to load (analytics, tag managers)
  • JavaScript runtime errors during critical flows

Key checks:

  • Core flows (signup, checkout, search) work even with partial JS failures where feasible
  • Clear, user-facing error states instead of broken UIs
  • Frontend error monitoring (e.g., Sentry) is configured and tested

8.2 Offline and Intermittent Connectivity

If you have mobile or offline-capable apps:

Test:

  • Going offline mid-request
  • Flapping network connectivity
  • App resumes after long background period

Key checks:

  • Local caching and sync are robust
  • Clear messaging to users about offline status and sync progress
  • Data conflicts are handled sensibly when reconnecting

9. Operational and Process Failures

9.1 Deployment and Rollback Failures

Simulate:

  • Deployment fails halfway through rollout
  • Config changes applied without corresponding code changes
  • Rollback to an older version with newer data schema

Key checks:

  • Blue/green or canary strategies are tested, not just configured
  • Rollback procedures are documented and rehearsed
  • Schema changes are backward-compatible or feature-flagged

9.2 Monitoring, Alerting, and Logging Gaps

Test:

  • Turning off or misconfiguring a key alert rule
  • Log pipeline failures (e.g., log aggregation system down)
  • Metrics not sent due to library/config issues

Key checks:

  • You can detect failures of your observability stack itself
  • Dashboards highlight SLO/SLA breaches clearly
  • On-call engineers can reconstruct incidents from logs and traces

9.3 People and Process Failures

Simulate:

  • On-call engineer unavailable or unreachable
  • Incident runbooks incomplete or outdated
  • Miscommunication during a simulated outage

Key checks:

  • Clear escalation paths and contacts exist
  • Runbooks are tested via game days or fire drills
  • Incident reviews lead to concrete, tracked improvements

10. Prioritizing Failure Cases Before Production

You don’t have to test every possible failure before launch, but you should systematically prioritize:

  1. User impact

    • Which failures cause data loss, security issues, or complete outage?
  2. Business impact

    • Which failures block revenue, critical workflows, or contractual SLAs?
  3. Likelihood

    • Which failures are historically common (e.g., external API flakiness, deploy misconfigurations)?
  4. Detectability

    • Which failures would be hard to spot without explicit tests and alerts?

Start by thoroughly testing high-impact, high-likelihood failures, then gradually expand coverage via chaos experiments and incident-driven improvements.
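
If it helps to make the ranking explicit, the four criteria can be turned into a rough score. The 1-5 scales and the formula below are purely illustrative assumptions, not a standard method; the value lies in forcing a comparison, not in the exact numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureCase:
    name: str
    user_impact: int      # 1 (minor) to 5 (data loss, security issue, full outage)
    business_impact: int  # 1 (negligible) to 5 (blocks revenue or breaches an SLA)
    likelihood: int       # 1 (rare) to 5 (historically common)
    detectability: int    # 1 (obvious) to 5 (hard to spot without explicit tests)

    def score(self) -> int:
        # Assumed formula: worst impact, weighted by likelihood and by how hidden it is.
        return max(self.user_impact, self.business_impact) * self.likelihood * self.detectability


def prioritize(cases: List[FailureCase]) -> List[FailureCase]:
    """Highest score first."""
    return sorted(cases, key=lambda case: case.score(), reverse=True)
```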


11. Turning Failure Testing into a Continuous Practice

Before production deployment, aim to:

  • Automate: Integrate failure simulation into CI/CD (e.g., smoke tests, staging chaos tests).
  • Observe: Confirm logs, metrics, and alerts behave as expected during each failure.
  • Document: Capture learnings in runbooks and architecture docs.
  • Iterate: After every real incident, add new failure test cases so the same issue doesn’t recur.

By treating failure as a first-class feature and systematically testing these failure cases before production deployment, you’ll ship systems that are not only fast and feature-rich but also resilient under real-world conditions.