
Designing a 3-Provider AI Cascade with Circuit Breakers

How Elizabeth.ai routes AI requests through Claude, Gemini, and GPT-4o with circuit breakers — achieving 99.9% availability despite individual provider outages.

By Elizabeth.ai Team

The Reliability Problem with AI APIs

If your product depends on a single AI provider, you have a single point of failure. AI APIs go down. They have rate limits. They experience degraded performance. OpenAI had 12 notable incidents in 2025 alone. Anthropic and Google have had their share too.

For Elizabeth.ai, AI parsing failures directly impact merchants. A customer sends an order, the AI is down, the order is not confirmed, the customer moves on to another seller. That is unacceptable.

We needed an architecture that survives individual provider failures without degraded user experience.

The Cascade Pattern

Our solution is a three-provider cascade with circuit breakers. When an AI parsing request comes in:

  1. Try the primary provider (Claude Sonnet 4)
  2. If it fails or is circuit-broken, try the secondary (Gemini 2.5 Flash)
  3. If that also fails or is circuit-broken, try the tertiary (GPT-4o)
  4. If all three are down, return a graceful degradation (queue for retry, notify merchant)

The cascade is ordered by our preference: Claude Sonnet 4 is our primary because it consistently delivers the best structured extraction quality for Taglish content. Gemini 2.5 Flash is second because of its speed and multilingual strength. GPT-4o is the final fallback with the broadest language coverage.
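The cascade loop can be sketched as follows. This is a minimal illustration, not Elizabeth.ai's actual code: the provider call functions, the `SimpleBreaker` stand-in, and all names are hypothetical, and a real breaker would also track timeouts and error rates (covered below).

```python
class SimpleBreaker:
    """Minimal stand-in for a per-provider circuit breaker (illustrative)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def allows_request(self) -> bool:
        return self.consecutive_failures < self.threshold

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1


def parse_with_cascade(message, providers, breakers):
    """Try each provider in preference order, skipping open circuits.

    `providers` is an ordered list of (name, call) pairs; `call` returns a
    parsed result or raises on failure. Returns (provider_name, result),
    or (None, None) when every provider is down.
    """
    for name, call in providers:
        breaker = breakers[name]
        if not breaker.allows_request():
            continue  # circuit is open: skip instantly, no API call made
        try:
            result = call(message)
            breaker.record_success()
            return name, result
        except Exception:
            breaker.record_failure()
    # All three down: caller queues the order for retry and notifies the merchant.
    return None, None
```

With this shape, a failing primary simply falls through to the next provider on the same request, which is what makes the failover invisible to the user.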

Circuit Breaker Implementation

The circuit breaker pattern, borrowed from electrical engineering, prevents a system from repeatedly calling a service that is known to be failing. Our implementation tracks three states per provider:

Closed (Normal)

All requests flow through normally. The circuit breaker monitors:

  • Failure count: Number of consecutive failures
  • Error rate: Percentage of failures in a rolling window
  • Latency: Average response time

Open (Tripped)

When failures exceed a threshold, the circuit "trips open." All requests immediately skip this provider and route to the next in the cascade. No API call is made — the skip is instantaneous.

The circuit stays open for a configurable timeout period (we use 60 seconds for transient issues, 5 minutes for sustained outages).

Half-Open (Testing)

After the timeout expires, the circuit enters a half-open state. It allows a single test request through. If the request succeeds, the circuit closes (provider is healthy again). If it fails, the circuit re-opens for another timeout period.

Closed → (failures exceed threshold) → Open
Open → (timeout expires) → Half-Open
Half-Open → (test succeeds) → Closed
Half-Open → (test fails) → Open
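The state machine above might be implemented along these lines, using the 3-failure / 60-second thresholds described in this post. This is a sketch with illustrative names, and it is simplified: a production version would limit the half-open state to a single in-flight probe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Per-provider breaker: trips open after N consecutive failures,
    then allows a probe request once the timeout expires."""

    def __init__(self, failure_threshold=3, timeout_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_s = timeout_s
        self.clock = clock  # injectable for testing
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allows_request(self) -> bool:
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.timeout_s:
                self.state = State.HALF_OPEN  # timeout expired: allow a probe
                return True
            return False  # still open: skip instantly, no API call
        return True  # CLOSED and HALF_OPEN both allow requests (simplified)

    def record_success(self) -> None:
        self.state = State.CLOSED  # provider healthy again
        self.failures = 0

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()  # probe failed: re-open for another timeout period
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

Keeping all of this state in plain attributes (no I/O) is what makes the open-circuit skip effectively free on the request path.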

Provider Health Tracking

Each provider in our cascade has its own circuit breaker instance with independent health tracking. The health state is stored in memory (not database) for speed — circuit breaker decisions must be near-instantaneous to avoid adding latency to the request path.

We track the following metrics per provider:

  • Consecutive failures: Triggers circuit open after 3 consecutive failures
  • Rolling error rate: If > 50% of requests fail in a 2-minute window, trip the circuit
  • Latency degradation: If average latency exceeds 3x the provider's baseline, treat as degraded
  • Rate limit signals: HTTP 429 responses trigger immediate circuit open with a longer timeout
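These trip conditions could be captured in a small per-provider config. The numeric values come from the list above; the field names and the `should_trip` helper are illustrative, not the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthThresholds:
    """Per-provider trip conditions (values as described above)."""
    max_consecutive_failures: int = 3     # trip after 3 consecutive failures
    error_rate_limit: float = 0.5         # >50% failures in the window trips
    error_window_s: float = 120.0         # 2-minute rolling window
    latency_degraded_factor: float = 3.0  # avg latency > 3x baseline = degraded
    rate_limit_timeout_s: float = 300.0   # HTTP 429: open with a longer timeout


def should_trip(consecutive_failures: int, error_rate: float,
                avg_latency_ms: float, baseline_latency_ms: float,
                cfg: HealthThresholds = HealthThresholds()) -> bool:
    """Evaluate the trip conditions from the list above (429s handled separately)."""
    return (consecutive_failures >= cfg.max_consecutive_failures
            or error_rate > cfg.error_rate_limit
            or avg_latency_ms > cfg.latency_degraded_factor * baseline_latency_ms)
```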

Why Per-Provider, Not Global?

A global circuit breaker that trips when "AI" is unhealthy would be too coarse. Provider A might be down while B and C are fine. By tracking each provider independently, we route around failures with surgical precision.

The Cascade in Practice

Here is what a typical failure scenario looks like:

  1. 10:00:00 — Claude API starts returning 500 errors
  2. 10:00:02 — Circuit breaker for Claude trips after 3 consecutive failures
  3. 10:00:02 — All new requests automatically route to Gemini (next in cascade)
  4. 10:00:02 — No user-visible impact. Orders continue to be parsed.
  5. 10:01:02 — Circuit enters half-open, sends a test request to Claude
  6. 10:01:02 — Test request fails → circuit re-opens for another 60 seconds
  7. 10:02:02 — Another test request → Claude returns 200 → circuit closes
  8. 10:02:02 — New requests resume routing to Claude as primary

Total user-visible downtime: zero. The cascade failover happened in under 2 milliseconds.

Token Tracking Without Blocking

Every AI call consumes tokens, and we track usage for billing and monitoring. But token tracking must never block the response path. If tracking fails (database issue, network blip), the order confirmation must still go through.

Our approach: fire-and-forget token tracking. After the AI response is received and the parsed order is being processed, we dispatch a non-blocking async call to log the token usage. If it fails, we log the error but do not retry — the data point is lost, which is acceptable. Losing a billing data point is far less costly than delaying an order confirmation.

AI Response received
├── Parse order → confirm to customer (BLOCKING - critical path)
└── Log token usage → billing database (FIRE-AND-FORGET - non-critical)
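One way to sketch the fire-and-forget dispatch is with a background thread; the post does not specify the async mechanism, so treat this as an assumed approach with hypothetical names. The key property is that the function never raises and never blocks the caller on the billing write.

```python
import threading


def track_tokens_async(logger, provider: str, tokens: int) -> threading.Thread:
    """Dispatch token logging off the critical path; never raises.

    `logger` is the billing write (e.g. a database insert). Returns the
    thread so callers/tests can join it, but the request path never does.
    """
    def _run():
        try:
            logger(provider, tokens)
        except Exception as exc:
            # Log and drop, no retry: losing one billing data point is
            # acceptable; delaying an order confirmation is not.
            print(f"token tracking failed for {provider}: {exc}")

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread
```

The order-confirmation code calls this after the AI response arrives and immediately continues with the blocking critical path.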

Model Selection by Call Type

Not all AI calls are equal. Order parsing requires strong structured extraction. Intent classification needs speed. Customer-facing responses need natural language quality.

We use an AICallType enum to select the appropriate model and parameters for each call type:

  • ORDER_PARSING: Highest accuracy model, structured output, temperature = 0
  • INTENT_CLASSIFICATION: Actually handled by pure regex (no AI at all) — this is a deliberate design choice to keep latency minimal for the most frequent operation
  • CUSTOMER_RESPONSE: Natural language model, moderate temperature, persona-aware

This ensures we are not over-provisioning expensive models for simple tasks or under-provisioning for critical parsing.
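A sketch of what the call-type mapping might look like. The enum members come from the list above; the model names, temperature values, and `needs_ai` helper are illustrative assumptions (note `INTENT_CLASSIFICATION` maps to no model at all, matching the regex-only design choice).

```python
from enum import Enum, auto


class AICallType(Enum):
    ORDER_PARSING = auto()
    INTENT_CLASSIFICATION = auto()  # served by pure regex, no AI call
    CUSTOMER_RESPONSE = auto()


# Illustrative per-call-type settings; model=None means no AI call is made.
CALL_CONFIG = {
    AICallType.ORDER_PARSING: {
        "model": "claude-sonnet-4", "temperature": 0.0, "structured_output": True,
    },
    AICallType.INTENT_CLASSIFICATION: {"model": None},  # regex path
    AICallType.CUSTOMER_RESPONSE: {
        "model": "claude-sonnet-4", "temperature": 0.7, "structured_output": False,
    },
}


def needs_ai(call_type: AICallType) -> bool:
    """True when this call type actually hits the provider cascade."""
    return CALL_CONFIG[call_type]["model"] is not None
```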

Lessons Learned

  1. Circuit breakers must be fast. Any latency added by the failover logic defeats the purpose. Keep state in memory, make decisions in microseconds.

  2. Three providers is the sweet spot. Two providers give you one failover. Three give you two, which covers simultaneous degradation of one provider and rate limiting of another — a scenario we have simulated in our test environment.

  3. Do not retry on the same provider. When a provider is failing, retrying just adds load to an already stressed system. Trip the circuit and move on.

  4. Monitor the cascade, not just individual providers. Track how often each provider is the one that actually serves the request. If your tertiary is serving 30% of traffic, something is wrong with your primary.

  5. Graceful degradation matters. Even with three providers, total AI failure is possible (however unlikely). Have a plan: queue orders for retry, notify the merchant, or fall back to a simplified regex-only parse with manual review.

For more on our parsing architecture, see Hybrid Regex + AI: Why We Parse 60% Without AI and Building a Taglish-Aware NLP Parser.


Interested in the engineering behind Elizabeth.ai? Follow our engineering blog for more deep-dives into the systems that power Filipino commerce automation.