Application Circuit Breakers
Saleor delegates critical tasks such as payments and shipping to external integrations. When these external services are down or unreliable, they can slow down every request that relies on them, eventually making storefronts unresponsive.
To mitigate this risk, Saleor implements circuit breakers for synchronous webhooks. Circuit breakers are a mechanism that temporarily shuts down any app that is observed to misbehave, preventing it from affecting the overall system performance.
Each application has its own circuit breaker that monitors the synchronous webhooks sent to it. When the number of failed requests exceeds a certain threshold, the breaker trips and all further requests to that app are rejected until a cooldown period has passed.
Breaker States
Closed
This is the normal operation state for a healthy app. All webhooks are sent as usual.
Open
The breaker becomes open when the number of errors exceeds the threshold. In this state, all webhooks are rejected and the app is not contacted at all, because it is assumed to be down.
Half-Open
Once a cooldown period has passed, the breaker allows the requests to be sent, but the threshold for tripping back to the open state is much lower than for the Closed state.
This is different from other implementations of breakers that may instead throttle requests while in the half-open state.
State Transition Diagram
Observability
The breaker states can be observed on the App type using the following fields: App.breakerState and App.breakerLastStateChange.
A mutation exists to manually reset the breaker state: appReenableSyncWebhooks. Using it allows the cooldown period to be skipped and forces the breaker to return to the closed state immediately.
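As a rough illustration, the fields and mutation named above could be requested with GraphQL documents like the ones below, wrapped here in Python so they can be serialized into a request body. The field names breakerState and breakerLastStateChange and the mutation name come from the text. The `appId` argument name, the selection sets, and the example ID are assumptions of this sketch and should be checked against the API reference.

```python
import json

# Query for the two breaker fields on the App type named in the text above.
APP_BREAKER_QUERY = """
query AppBreaker($id: ID!) {
  app(id: $id) {
    id
    name
    breakerState
    breakerLastStateChange
  }
}
"""

# ASSUMPTION: the argument name (appId) and the returned fields are guesses
# made for illustration; only the mutation name is taken from the text.
REENABLE_MUTATION = """
mutation Reenable($appId: ID!) {
  appReenableSyncWebhooks(appId: $appId) {
    app {
      id
      breakerState
    }
    errors {
      field
      message
    }
  }
}
"""


def build_payload(document: str, variables: dict) -> str:
    """Serialize a GraphQL document and its variables as a JSON request body."""
    return json.dumps({"query": document, "variables": variables})


# "QXBwOjE=" is a hypothetical app ID used only for this example.
query_body = build_payload(APP_BREAKER_QUERY, {"id": "QXBwOjE="})
reset_body = build_payload(REENABLE_MUTATION, {"appId": "QXBwOjE="})
```

Either body can then be sent as a POST request to the GraphQL endpoint with an appropriately authorized token.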
Configuring the Feature
To turn the feature on, set the BREAKER_BOARD_ENABLED environment variable to True.
BREAKER_BOARD_SYNC_EVENTS specifies the events that the circuit breaker should monitor, as a comma-separated list.
For example: "checkout_calculate_taxes, shipping_list_methods_for_checkout".
BREAKER_BOARD_DRY_RUN_SYNC_EVENTS is off by default. If you set it to True, the breaker acts as if it were on for purposes such as logging, but it does not block any requests.
This is useful for testing the breaker without affecting webhook behavior.
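To make the comma-separated format concrete, here is a minimal sketch of how such values might be read from the environment. The helper names (`parse_bool`, `parse_event_list`, `read_breaker_settings`) are hypothetical and are not Saleor's own settings code; only the two variable names are taken from the text above.

```python
import os


def parse_bool(value: str) -> bool:
    """Interpret common truthy strings such as "True" or "1"."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def parse_event_list(value: str) -> list[str]:
    """Split a comma-separated string, dropping surrounding whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def read_breaker_settings(env=os.environ) -> dict:
    """Collect the breaker-related settings from a mapping of env variables."""
    return {
        "enabled": parse_bool(env.get("BREAKER_BOARD_ENABLED", "False")),
        "sync_events": parse_event_list(env.get("BREAKER_BOARD_SYNC_EVENTS", "")),
    }


example_env = {
    "BREAKER_BOARD_ENABLED": "True",
    "BREAKER_BOARD_SYNC_EVENTS": "checkout_calculate_taxes, shipping_list_methods_for_checkout",
}
settings = read_breaker_settings(example_env)
```

With the example values, `settings["sync_events"]` holds the two event names without the space that follows the comma.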
Transitions Between States
Breaker states change automatically as requests go by. The breaker monitors the events and acts accordingly. Note that all circuit breaker events (both successes and failures) are stored in Redis with a 5-minute TTL (time to live). On each transition, the events are purged, so each state starts fresh and is not contaminated by older events.
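The windowing behavior described above can be sketched with an in-memory structure. This is only an illustration: the real events live in Redis, and the `EventWindow` class and its method names are hypothetical. The sketch also assumes events are recorded in chronological order.

```python
from collections import deque

TTL_SECONDS = 5 * 60  # the 5-minute TTL mentioned in the text


class EventWindow:
    """In-memory stand-in for the Redis-backed event store (an assumption)."""

    def __init__(self, ttl_seconds: float = TTL_SECONDS):
        self.ttl_seconds = ttl_seconds
        self._events = deque()  # items are (timestamp, succeeded) tuples

    def record(self, now: float, succeeded: bool) -> None:
        """Store one webhook outcome with its timestamp."""
        self._events.append((now, succeeded))

    def _expire(self, now: float) -> None:
        # Drop events whose age has reached the TTL.
        while self._events and now - self._events[0][0] >= self.ttl_seconds:
            self._events.popleft()

    def counts(self, now: float) -> tuple[int, int]:
        """Return (failures, total) for events still inside the window."""
        self._expire(now)
        failures = sum(1 for _, ok in self._events if not ok)
        return failures, len(self._events)

    def purge(self) -> None:
        """Drop everything, as happens on each state transition."""
        self._events.clear()
```

For example, a failure recorded at t=0 no longer counts when the window is inspected at t=305, while events recorded at t=10 and t=200 still do.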
Closed to Open
The conditions under which the breaker enters the open state are controlled by the following constants:
BREAKER_BOARD_FAILURE_MIN_COUNT - set to 100; the minimum number of failures that must occur before the breaker can trip.
BREAKER_BOARD_FAILURE_THRESHOLD_PERCENTAGE - set to 35 (percent); the ratio of failures to total requests that must be exceeded before the breaker trips, assuming BREAKER_BOARD_FAILURE_MIN_COUNT is satisfied.
BREAKER_BOARD_TTL_SECONDS - set to 5 minutes; the time window (the last 5 minutes) in which the breaker monitors events. Older events are automatically removed.
These constants can be found here.
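A small worked example may help show how the minimum count and the percentage threshold combine. The function below is a sketch, not Saleor's code: `should_trip` is a hypothetical name, and the boundary handling (failures must reach the minimum count, the ratio must be strictly greater than the threshold) is an interpretation of the wording above.

```python
FAILURE_MIN_COUNT = 100             # BREAKER_BOARD_FAILURE_MIN_COUNT
FAILURE_THRESHOLD_PERCENTAGE = 35   # BREAKER_BOARD_FAILURE_THRESHOLD_PERCENTAGE


def should_trip(
    failures: int,
    total: int,
    min_count: int = FAILURE_MIN_COUNT,
    threshold_percentage: float = FAILURE_THRESHOLD_PERCENTAGE,
) -> bool:
    """Return True when both trip conditions hold for the current window."""
    if total == 0 or failures < min_count:
        return False  # not enough failures yet, whatever the ratio is
    return failures * 100 / total > threshold_percentage


# 120 failures out of 300 requests is 40%, above 35%, with enough failures.
high_volume = should_trip(120, 300)
# 90 failures out of 100 requests is 90%, but below the minimum count of 100.
low_volume = should_trip(90, 100)
# 100 failures out of 400 requests meets the count but is only 25%.
low_ratio = should_trip(100, 400)
```

Only the first case trips the breaker. The minimum count keeps a handful of failures on a quiet app from opening the breaker, and the percentage keeps a busy but mostly healthy app from tripping on absolute numbers alone.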
Open to Half-open
After the breaker enters the open state, no webhooks are sent for a period of time called the cooldown period, which defaults to 2 minutes. The breaker then automatically transitions to the half-open state.
Half-open back to Open
The breaker trips again if the errors satisfy both the recovery minimum count and the recovery threshold. The conditions under which the breaker re-enters the open state are controlled by the following constants:
BREAKER_BOARD_FAILURE_MIN_COUNT_RECOVERY - similar to BREAKER_BOARD_FAILURE_MIN_COUNT, but it defaults to 20; the minimum number of failures that must occur before the breaker can trip again.
BREAKER_BOARD_FAILURE_THRESHOLD_PERCENTAGE_RECOVERY - similar to BREAKER_BOARD_FAILURE_THRESHOLD_PERCENTAGE, but it defaults to 30 (percent); the ratio of failures to total requests that must be exceeded before the breaker trips again.
BREAKER_BOARD_TTL_SECONDS - set to 5 minutes; the time window (the last 5 minutes) in which the breaker monitors events. Older events are automatically removed.
These constants can be found here.
Half-open back to Closed
Once the number of successful requests reaches the required count, the breaker transitions back to the closed state. This count is controlled by the BREAKER_BOARD_SUCCESS_COUNT_RECOVERY constant, which defaults to 50.
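Putting the transitions together, the following is a simplified sketch of the state machine described in this section. It is not Saleor's implementation: the `Breaker` class and its method names are hypothetical, counters are kept in memory rather than in Redis, and the 5-minute event TTL is left out for brevity. The numeric values are the defaults quoted above.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


FAILURE_MIN_COUNT = 100
FAILURE_THRESHOLD_PERCENTAGE = 35
FAILURE_MIN_COUNT_RECOVERY = 20
FAILURE_THRESHOLD_PERCENTAGE_RECOVERY = 30
SUCCESS_COUNT_RECOVERY = 50
COOLDOWN_SECONDS = 2 * 60


class Breaker:
    def __init__(self):
        self.state = State.CLOSED
        self.changed_at = 0.0
        self.failures = 0
        self.successes = 0

    def _transition(self, state: State, now: float) -> None:
        self.state = state
        self.changed_at = now
        self.failures = self.successes = 0  # events are purged on transition

    def allow_request(self, now: float) -> bool:
        """Return False while open; move to half-open once cooldown passes."""
        if self.state is State.OPEN and now - self.changed_at >= COOLDOWN_SECONDS:
            self._transition(State.HALF_OPEN, now)
        return self.state is not State.OPEN

    def record(self, succeeded: bool, now: float) -> None:
        """Register one webhook outcome and apply the transition rules."""
        if self.state is State.OPEN:
            return  # nothing is sent while open, so nothing is recorded
        if succeeded:
            self.successes += 1
        else:
            self.failures += 1
        total = self.failures + self.successes
        if self.state is State.CLOSED:
            min_count, threshold = FAILURE_MIN_COUNT, FAILURE_THRESHOLD_PERCENTAGE
        else:  # half-open uses the lower recovery limits
            min_count = FAILURE_MIN_COUNT_RECOVERY
            threshold = FAILURE_THRESHOLD_PERCENTAGE_RECOVERY
        if self.failures >= min_count and self.failures * 100 / total > threshold:
            self._transition(State.OPEN, now)
        elif self.state is State.HALF_OPEN and self.successes >= SUCCESS_COUNT_RECOVERY:
            self._transition(State.CLOSED, now)
```

In this sketch, 100 consecutive failures open a closed breaker, requests are rejected until 120 seconds have passed, and 50 consecutive successes in the half-open state close it again, whereas 20 consecutive failures in the half-open state reopen it.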