API Rate Limiting — definition

API rate limiting is the enforcement of maximum request rates on an API — typically measured in requests per minute, tokens per minute, or requests per day — to protect the provider's infrastructure and ensure fair allocation among customers. All major AI model APIs (Anthropic, OpenAI, Google) enforce rate limits that must be accounted for in production system design.
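As an illustration of what enforcing a maximum request rate can look like, here is a minimal token-bucket sketch. The token bucket is one common rate-limiting algorithm; nothing here claims that any particular provider uses it. The `TokenBucket` class and its `rate`, `capacity`, and `cost` parameters are hypothetical names chosen for this example. A `cost` above 1 could stand for a token-based quota such as tokens per minute.

```python
import time


class TokenBucket:
    """Allow `rate` units per second on average, with bursts up to `capacity`.

    Illustrative sketch only; the names are assumptions, not any provider's API.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # units refilled per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and deduct `cost` if enough budget is available."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over the limit; a server would typically answer 429 here
```

A caller would check `bucket.allow()` before forwarding each request and reject or queue the request when it returns False.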

Strategies for handling rate limits in AI systems include exponential backoff with jitter on HTTP 429 (Too Many Requests) responses, request queuing with priority levels, caching of repeated identical requests, routing to multiple model endpoints across providers, and using an AI gateway to manage limits centrally. Hitting rate limits in production without a backoff strategy can cause cascading failures: clients that retry immediately add load while the limit is still in effect, and a single stalled AI call can block an entire workflow queue.
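The backoff strategy named above can be sketched as follows. This is a minimal example under stated assumptions, not a definitive implementation: `RateLimitError` is a hypothetical exception standing in for a 429 response, `send` is any caller-supplied function that performs the request, and the defaults (`base=1.0`, `cap=60.0`, `max_retries=5`) are illustrative values. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and the exponential bound.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(send, max_retries=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call send(); on RateLimitError, wait a jittered, growing delay and retry."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the error to the caller
            sleep(backoff_delay(attempt, base, cap))
```

The jitter spreads retries from many clients over time so they do not all return at the same instant, and the cap keeps the wait bounded. If the response carries a `Retry-After` header, honoring that value instead of the computed delay is usually the better choice.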

Related terms

API Gateway (AI) — An AI API gateway is a proxy layer that sits between internal applications and external AI model APIs, enforcing rate limits, cost controls, authentication, PII redaction, audit logging, and fallback routing across multiple model providers.

LLM Integration — LLM integration is the work of embedding a large language model into existing business systems — CRM, ERP, ticketing, document repositories — so that AI capabilities are accessible from the tools employees already use, with proper authentication, audit logging, and data isolation.

Inference — Inference is the process of running a trained AI model to produce outputs from inputs.