On 23 November, the week of Black Friday and our busiest of the year, Reward Gateway suffered a critical database cluster issue. This caused a complete outage of our service for 29 minutes. This outage was not as a result of a change made by our Operations or Engineering team, but due to a race condition in the underlying database software.
At Reward Gateway, we use Percona’s XtraDB Cluster to hold application data. This is a derivative of MySQL that provides a platform of greater performance and scalability, with replication using the Galera library. This allows us to run multiple database nodes in a high-availability configuration within each region we operate.
On 23 November, the Operations team was alerted to a critical outage of our service. This was late in the afternoon when people were returning home and not related to any change.
When the team looked at the MySQL error logs, they saw something that looked like a data consistency issue. The server was reporting an error like:
2021-11-23T16:37:24.862227Z 1 [ERROR] Slave SQL: Could not execute Delete_rows event on table database.tablename; Can't find record in 'tablename', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 4317, Error_code: 1032
2021-11-23T16:37:24.862232Z 1 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 1390133966
2021-11-23T16:37:24.862368Z 1 [Note] WSREP: Applier statement rollback needed
This error indicated that one of the database nodes had received a delete statement and replicated it to another only for the delete to not be valid. This was the same error we had seen in an outage on 17 March this year. However, at that time, we had just finished an online schema upgrade and completed a failover between database nodes. This time nothing had changed.
2021-03-17T07:51:06.899222Z 8 [ERROR] Slave SQL: Could not execute Delete_rows event on table database.tablename; Can't find record in 'tablename', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 6803, Error_code: 1032
2021-03-17T07:51:06.899243Z 8 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 325387220
2021-03-17T07:51:06.899786Z 8 [Note] WSREP: Applier statement rollback needed
In a failure like this, our first goal is to restore our service to customers. We followed our Incident Response Plans and validated the state of the database nodes. We then used the (single) writer as the primary node and used it to bootstrap the other nodes in the database cluster. This recovered end-user service quickly and allowed us time to copy relevant logs for review.
When we looked at application logs as well as the database node log, we could identify a single web request that modified data in the table. This table looked simple with a standard structure and a foreign key in place. The application queries related to the request also looked harmless, but we did notice a sequence of
DELETE wrapped in a transaction.
Given that this was not the first time this sequence had run, that we had an earlier outage with similar logs, and how logically simple the statements were, we started to suspect that this was caused by a bug in the underlying database cluster itself. Googling some of the error message fragments also hinted at this and, with it causing such a serious failure, we were determined to find a resolution.
When a statement is executed by XtraDB Cluster, it is first executed optimistically against the original node it is received by and a write-set is compiled. At the point of commit, the replication engine then replicates the write-set operations, while the commit is blocked, to all the other database nodes. All of the database nodes – including the original node – then perform a certification test, to check for conflicts. If the certification test fails, because another statement has modified the same rows, the write-set is dropped and the database cluster rolls back to the original transaction. If the certification test passes, the changes are applied and the transaction can be committed.
This certification and writing process is handled by a multi-threaded process. This means, it is possible for a database node to handle multiple write-set operations from the original node simultaneously. So while this is logically synchronous, the actual writing and committing to the tablespace happens asynchronously on each database node with a guarantee that the change propagates to all nodes.
This opens up a small window for a race condition. A typical cause for this is when a foreign key relation is in place and there is a dependency between a child and parent table in a transaction. Neither write-set would conflict and would be accepted on the original node, but if the other database nodes receive the write-sets in the wrong order, the certification test and integrity checks can cause the cluster to enter an indeterminate state. To prevent any further damage, XtraDB will then refuse to accept further write-sets without a manual resolution.
Although we only have one writer, resolving conflicts at this level is risk-prone and not possible (to our knowledge) with standard tools. The only option here is to bootstrap the whole database cluster again from the original node. We do not consider this a sustainable or a reliable long-term option.
Reading in detail the documentation, and similar issues reported with Galera rollbacks, we found that there are two potential workarounds. One was to stop parallelism, but this would have led to underutilised multicore machines and performance issues when introducing new nodes. This is not something we considered.
The other workaround we found, and have applied, was to change how write-sets are ordered and how the certification test works. This is possible with two flags that are available on each node:
cert-optimistic-pa=NOto the replication parameters makes sure that the order writes are certified and applied in is exactly the same as it is on the original database node, and starts no sooner than the write-set on the source is committed.
- Changing the
OPTIMIZEDwhich relaxes the certification test to allow more concurrency. This specifically affects how foreign keys are handled, for example, if two write-sets contain non-conflicting inserts to a child table referencing a single parent.
This problem with foreign key relations is something that Percona are aware of and have acknowledged is not adequately documented. Although there has been some improvement, to us, it is still not clear enough that caution must be used with foreign keys and Galera – and the settings which affect this. We will now be seeing how we can improve the documentation to highlight it.
We hope that this post is useful to others who encounter this issue and will save them time.