Inferless startup plan: what’s included with the 10,000 requests/month minimum and GPU concurrency limits?

If you’re evaluating Inferless for a production use case with real traffic, the Startup plan is the point where “weekend demo” turns into “we’re actually serving users.” The 10,000 inference requests/month minimum marks the scale the plan targets, and the GPU concurrency limit gives you headroom for spiky workloads without paying for idle clusters.

This breakdown walks through exactly what’s included, how the limits work in practice, and how to reason about cost, throughput, and growth.

Note: All details here reflect the current Inferless positioning for fast-growing teams: serverless GPU inference, pay-per-second billing, and scale-to-zero semantics. Always cross-check the pricing page for the latest numbers.


Quick overview: what the Startup plan is designed for

The Startup tier is built for fast-growing teams that:

  • Have at least 10,000 inference requests per month
  • Need to handle spiky, unpredictable traffic without owning GPU clusters
  • Want usage-based billing instead of 24/7 GPU reservations
  • Are ready for production behaviors: logs, webhooks, and higher concurrency

In practice, this tier gives you:

  • A serverless GPU inference environment
  • A 10,000 requests/month minimum (an expected usage level, not a GPU reservation)
  • Higher GPU concurrency than the free/starter usage
  • Longer log retention
  • Slack-based support with guaranteed response windows
  • Included credits to ramp up without immediate spend shock

What “10,000 inference requests/month minimum” actually means

1. Volume minimum, not a flat reservation

The 10,000 requests/month is a minimum usage expectation, not a fixed GPU reservation:

  • You don’t pre-book GPUs for the month.
  • Instead, you pay per second for GPU time your endpoints actually use.
  • The minimum simply indicates that the plan is meant for workloads hitting at least ~10k calls/month, not occasional testing.

It’s the “this is real traffic now” line, not a tax. If you’re only doing a handful of calls during development, you’re better off starting on the baseline pay-per-second experience with the free credit.

2. How 10,000 requests maps to real-world usage

10,000 inferences/month is enough for:

  • A small SaaS product in private beta
  • Internal tools used across a team
  • Batch jobs that run daily on a moderate dataset
  • Evaluation pipelines that run frequently but not continuously

For example:

  • If your average user triggers 20 model calls/day,
  • And you have 20 active users,
  • You’ll hit 400 calls/day → ~12,000 calls/month.

That’s the kind of usage profile the Startup plan aims at: real users hitting real latency constraints, but not yet internet-scale.
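The arithmetic above is easy to rerun for your own numbers. This is a minimal sketch: the function name and the 30-day month are assumptions made for illustration, not anything Inferless defines.

```python
STARTUP_MINIMUM = 10_000  # requests/month, as stated for the Startup plan


def monthly_requests(calls_per_user_per_day: int, active_users: int,
                     days_per_month: int = 30) -> int:
    """Rough monthly call volume, assuming flat daily usage."""
    return calls_per_user_per_day * active_users * days_per_month


estimate = monthly_requests(calls_per_user_per_day=20, active_users=20)
print(estimate)                     # 12000
print(estimate >= STARTUP_MINIMUM)  # True
```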


GPU concurrency limits: how the “5 concurrency” ceiling works

In the Inferless docs, the Startup description includes:

  • GPU concurrency of 5

Think of this as the maximum number of in-flight inference requests per GPU (subject to your configured concurrency and model behavior). In practical terms:

  • Up to 5 requests can be processed simultaneously on the same GPU (or GPU slice, depending on shared vs dedicated).
  • This is tied to how many parallel calls your endpoint can handle before queuing kicks in.

1. Why concurrency matters more than “how many GPUs”

For spiky workloads, concurrency and autoscaling matter more than raw GPU count:

  • Concurrency determines how many requests a single replica can handle in parallel.
  • Inferless’ in-house load balancer decides when to spin up additional replicas and GPUs based on traffic.
  • With scale-to-zero, you don’t pay for idle GPUs between spikes.

With a concurrency of 5, a single GPU-backed replica can:

  • Handle 5 simultaneous requests without queueing.
  • Scale out to more replicas (and more GPUs) as traffic rises, each replica also honoring the concurrency setting.
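To get a feel for how a per-replica concurrency of 5 translates into replica count, you can apply Little’s law (average in-flight requests = arrival rate × average latency). This is a back-of-the-envelope sketch only: the function names are hypothetical, and it does not describe how the Inferless load balancer actually makes scaling decisions.

```python
import math


def in_flight_requests(requests_per_second: float, avg_latency_s: float) -> float:
    """Little's law: average number of requests being processed at once."""
    return requests_per_second * avg_latency_s


def replicas_needed(requests_per_second: float, avg_latency_s: float,
                    concurrency_per_replica: int = 5) -> int:
    """Replicas required to serve the average load without queueing."""
    if concurrency_per_replica <= 0:
        raise ValueError("concurrency_per_replica must be positive")
    load = in_flight_requests(requests_per_second, avg_latency_s)
    return math.ceil(load / concurrency_per_replica)


# 8 requests/sec at 1.5 s each keeps about 12 requests in flight,
# which needs 3 replicas at a concurrency of 5.
print(replicas_needed(8, 1.5))  # 3
print(replicas_needed(0, 1.5))  # 0 (idle: nothing to run)
```

Averages hide bursts, so treat the result as a lower bound when traffic is spiky.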

2. How concurrency interacts with Dynamic Batching

For many models (especially LLMs and vision models), you’ll get better throughput if you enable Dynamic Batching via Server-Side Request Combining:

  • Multiple incoming requests within a small time window are combined into a single batched GPU call.
  • This increases tokens/sec or images/sec per GPU-second, which matters when you pay per second, for exactly what you use.

GPU concurrency of 5 gives you room to:

  • Run several batches in parallel.
  • Absorb bursts of requests without immediately hitting queue delays.
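The idea behind request combining can be illustrated with a small simulation. This is a simplified sketch, not Inferless’ implementation: the function name, the 10 ms window, and the batch-size cap are all assumptions chosen for the example.

```python
from typing import List


def combine_requests(arrival_times_ms: List[float], window_ms: float,
                     max_batch_size: int) -> List[List[float]]:
    """Group requests into batches.

    A batch opens with its first request and closes when the next request
    arrives more than `window_ms` later, or when the batch is full.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrival_times_ms):
        window_expired = bool(current) and (t - current[0] > window_ms)
        batch_full = len(current) >= max_batch_size
        if current and (window_expired or batch_full):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 3, 7, 12, 40, 41, 90]
batches = combine_requests(arrivals, window_ms=10, max_batch_size=4)
print(batches)       # [[0, 3, 7], [12], [40, 41], [90]]
print(len(batches))  # 4 GPU calls instead of 7
```

A wider window yields larger batches but adds waiting time to the first request in each batch, which is the trade-off to tune against your latency target.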

Other key inclusions in the Startup plan

While the question is framed around 10,000 requests and GPU concurrency, the plan also includes operational knobs you actually need in production.

1. Unlimited deployed webhook endpoints

The Startup plan explicitly includes:

  • Unlimited deployed webhook endpoints

This matters if you:

  • Fan out inference results to multiple systems (e.g., logging, analytics, CRMs).
  • Run asynchronous workflows where the inference endpoint calls back into your application.
  • Wire up downstream post-processing jobs.

You’re not constrained to “one webhook per project” or similar artificial limits. You can:

  • Configure different webhooks per endpoint or per workflow.
  • Support multi-tenant behaviors and event-based pipelines without hitting a webhook ceiling.

2. Log retention: 15 days

Startup-tier log behavior:

  • 15 days of log retention

You get:

  • Build logs – What happened during container build, dependency install, runtime setup.
  • Call logs – Request traces, errors, and timing data that help you debug and optimize.

From an ops perspective, 15 days is enough to:

  • Debug recent incidents.
  • Analyze behavior around releases and model updates.
  • Spot regressions and unexpected latencies across two weekly cycles.

For longer compliance windows (e.g., quarterly audits), the Enterprise tier steps this up to 365 days of log retention.

3. Support via private Slack Connect (within 48 working hours)

Startup support characteristics:

  • Private Slack Connect channel
  • Response within 48 working hours

This is meant for teams that:

  • Want a direct backchannel to ask about deployments, scaling behavior, or request anomalies.
  • Don’t need a 24/7 SRE-on-call level SLA yet, but do need predictable, human support as they ramp traffic.

For stricter SLAs and more handholding, Enterprise adds:

  • Dedicated support engineer
  • Higher concurrency
  • Longer log retention

4. Included credits: $30 (≈ 10 hours of GPU)

The Startup tier mentions:

  • Included credits: $30

This aligns with Inferless’ typical onramp:

  • 10 hours of free GPU time, no credit card required, listed as $30 of free credit on current pricing tables.

What that gives you in practice:

  • Enough GPU time to:
    • Deploy a model.
    • Run test traffic.
    • Hit early staging or beta users.
  • Without immediately worrying about a surprise bill while you’re still tuning concurrency, batching, and timeouts.
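How far $30 actually goes depends on the GPU you pick. The sketch below uses hypothetical helper names: it derives the hourly rate implied by “$30 ≈ 10 hours” and, for comparison, converts the same credit at the dedicated A100 80GB rate of $0.0014/sec quoted later in this article.

```python
from decimal import Decimal


def implied_hourly_rate(credit_usd: Decimal, hours: Decimal) -> Decimal:
    """Hourly GPU price at which the credit lasts exactly `hours`."""
    return credit_usd / hours


def gpu_hours_from_credit(credit_usd: Decimal,
                          rate_per_second_usd: Decimal) -> Decimal:
    """GPU hours a credit buys at a given per-second rate."""
    seconds = credit_usd / rate_per_second_usd
    return seconds / Decimal(3600)


credit = Decimal("30")
print(implied_hourly_rate(credit, Decimal("10")))               # 3
print(round(gpu_hours_from_credit(credit, Decimal("0.0014")), 2))  # 5.95
```

In other words, the 10-hour figure corresponds to roughly $3 per GPU-hour; on a pricier GPU the same credit covers fewer hours.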

How this compares to Enterprise

It’s useful to place Startup in the broader Inferless lineup:

Startup (fast-growing teams)

  • Minimum: 10,000 inference requests/month
  • GPU concurrency: 5
  • Log retention: 15 days
  • Webhooks: Unlimited deployed webhook endpoints
  • Support: Private Slack Connect, responses within 48 working hours
  • Included credits: $30

Best if you:

  • Have a growing production workload.
  • Need good concurrency and reasonable log history.
  • Want responsive but not white-glove support.

Enterprise (high-volume workloads)

  • Minimum: 100,000 inference requests/month
  • GPU concurrency: 50
  • Log retention: 365 days
  • Webhooks: Unlimited deployed webhook endpoints
  • Support: Private Slack Connect + dedicated support engineer
  • Included credits: Custom

Best if you:

  • Have strict latency targets and high QPS.
  • Need very high concurrency (up to 50) for heavy traffic.
  • Need a full year of logs for compliance and deep debugging.

How the Startup plan behaves under spiky workloads

If you’re coming from a self-managed cluster, the key change is:

You don’t manage GPUs or nodes; you define endpoints and let the platform scale.

Under the Startup plan, the behavior is:

  • Scale-to-zero when idle:
    • No idle GPU cost between traffic bursts.
  • Scale from zero to hundreds of GPUs when traffic hits:
    • Driven by an in-house load balancer that keeps scale-up overhead low.
  • Cold start expectations:
    • First call to an endpoint can see a cold start of 10–20s.
    • Successive calls are much faster; your overall latency is then dominated by model inference time and batching behavior.

Your main knobs:

  • Concurrency settings: How many parallel requests per replica before queuing.
  • Timeouts: Upper bound for individual calls.
  • Dynamic Batching: Enable or tune to squeeze more throughput out of each GPU-second.
  • Custom Runtime: If you need your own Docker container and dependencies to avoid “works on my laptop, fails in prod” issues.

The Startup plan gives you enough concurrency and log history to tune these settings against live traffic.


Cost thinking: combining per-second billing with the Startup profile

Inferless pricing is GPU-time based, e.g.:

  • Example from docs: A dedicated A100 80GB at $0.0014/sec
  • If it runs:
    • 1 machine for 14,400 seconds (4 hours), and
    • 2 machines for 10,800 seconds (3 hours),
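    • Total billed GPU time = 14,400 + 10,800 = 25,200 seconds → 25,200 × 0.0014 = $35.28
    • Note that this total counts the 10,800-second window once. If both machines are billed for the full window, the total is 14,400 + 2 × 10,800 = 36,000 seconds → 36,000 × 0.0014 = $50.40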
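Per-second billing boils down to summing machine-seconds and multiplying by the rate. Here is a minimal sketch with illustrative helper names, using the rate quoted above. The two calls show how the total depends on whether the 10,800-second window is billed for one machine (the 25,200-second figure) or for both.

```python
from decimal import Decimal
from typing import Iterable, Tuple

A100_RATE = Decimal("0.0014")  # USD per second, dedicated A100 80GB (from the text)


def gpu_seconds(segments: Iterable[Tuple[int, int]]) -> int:
    """Sum machine-seconds over (machine_count, duration_seconds) segments."""
    return sum(machines * seconds for machines, seconds in segments)


def billed_cost(total_gpu_seconds: int, rate_per_second: Decimal) -> Decimal:
    """Cost of a given amount of GPU time at a per-second rate."""
    return rate_per_second * total_gpu_seconds


# Second window billed for one machine only.
once = gpu_seconds([(1, 14_400), (1, 10_800)])
print(once, billed_cost(once, A100_RATE))  # 25200 35.2800

# Second window billed for both machines.
both = gpu_seconds([(1, 14_400), (2, 10_800)])
print(both, billed_cost(both, A100_RATE))  # 36000 50.4000
```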

Under the Startup plan:

  • You still pay per second for exactly what you use.
  • The 10,000 request minimum just signals that this pricing model makes sense for your scale.
  • Your actual cost depends on:
    • GPU type (T4 / A10 / A100; shared vs dedicated).
    • Average inference time per request.
    • How aggressively you batch and set concurrency.
    • How bursty your traffic is.
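To turn those factors into a rough monthly number, multiply request count by average GPU seconds per request and by the per-second rate. This is a deliberately simple sketch with a hypothetical function name. It ignores cold starts, any time a replica stays up before scaling down, and batching gains, so treat the result as an order-of-magnitude estimate.

```python
from decimal import Decimal


def estimated_monthly_cost(requests: int,
                           avg_gpu_seconds_per_request: Decimal,
                           rate_per_second_usd: Decimal) -> Decimal:
    """Requests x average GPU seconds x per-second rate."""
    return Decimal(requests) * avg_gpu_seconds_per_request * rate_per_second_usd


a100_rate = Decimal("0.0014")  # dedicated A100 80GB rate quoted above

# 10,000 requests at an assumed 2 s of GPU time each.
print(estimated_monthly_cost(10_000, Decimal("2"), a100_rate))    # 28.0000
# 12,000 requests at an assumed 0.5 s each.
print(estimated_monthly_cost(12_000, Decimal("0.5"), a100_rate))  # 8.40000
```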

For many startups, the net effect (especially when they come from “always-on” clusters) mirrors what customers like Cleanlab report: up to ~90% GPU cost savings and production go-live in less than a day, because you eliminate idle node spend and cluster babysitting.


When you should move to the Startup plan

Use the Startup plan if:

  • You’ve moved beyond local notebooks and one-off demos.
  • You expect 10,000+ inferences per month from real users or internal tools.
  • You care about:
    • Parallelism: need more than 1–2 concurrent requests.
    • Observability: need 15 days of logs to debug production issues.
    • Support: want a Slack line into the team.

Stay on the lighter, free-credit path if:

  • You’re still experimenting with model choice and don’t know your traffic profile.
  • You only occasionally hit the model from CI or ad-hoc scripts.
  • You don’t need Slack-based support yet.

Move to Enterprise if:

  • Your QPS is sustained and high.
  • You’re saturating the GPU concurrency of 5 and need up to 50 concurrent requests per GPU.
  • Compliance and audit constraints require 365-day logs and tighter SLAs.
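The volume and concurrency thresholds above can be condensed into a rough heuristic. The function and its return labels are hypothetical, and it looks only at the two numeric criteria. Log retention, support, and SLA needs still have to be weighed separately.

```python
STARTUP_MIN_REQUESTS = 10_000      # requests/month
ENTERPRISE_MIN_REQUESTS = 100_000  # requests/month
STARTUP_CONCURRENCY = 5            # per GPU, as listed for Startup


def suggest_tier(monthly_requests: int, needed_concurrency: int) -> str:
    """Suggest a tier from monthly volume and required concurrency."""
    if (monthly_requests >= ENTERPRISE_MIN_REQUESTS
            or needed_concurrency > STARTUP_CONCURRENCY):
        return "Enterprise"
    if monthly_requests >= STARTUP_MIN_REQUESTS:
        return "Startup"
    return "free-credit path"


print(suggest_tier(12_000, 4))   # Startup
print(suggest_tier(500, 1))      # free-credit path
print(suggest_tier(250_000, 20)) # Enterprise
```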

Final takeaway

The Inferless Startup plan is essentially the “production cutoff” for fast-growing teams:

  • 10,000 requests/month minimum → you’re serving real users.
  • GPU concurrency of 5 → you can actually handle bursty traffic with parallel inferences and Dynamic Batching.
  • Unlimited webhooks, 15-day logs, Slack support, and $30 credits → enough operational scaffolding to run a real service without building infrastructure yourself.

If you’re ready to turn your model file into a production API endpoint in minutes, and you expect actual traffic rather than just tests, this is the tier where the economics and controls line up.
