How we migrated our Webhook Service to Kafka
Webhooks as part of our open platform
Olo uses webhooks extensively to notify our partners when important events take place in our ordering platform. Webhooks are an important part of our open platform philosophy and allow integrations between Olo and external systems to be customized in a way that suits our customers’ specific operational needs.
Consider a webhook that signals when an order has been placed. Restaurants might use this webhook information to:
- display the order details on a kitchen display system (KDS) to help with food preparation
- store the information in a database for payments reconciliation
- ship information to a 3rd party business intelligence tool for advanced reporting
In another scenario, a webhook triggers whenever a new user has signed up, and a customer may use that information to keep their mailing list or CRM system up to date.
The design gives partners an efficient way to respond to the events they are interested in without needing to poll for changes (and it saves us the associated load on our system).
At a technical level, webhooks are implemented by posting a message to an HTTP endpoint that our partner provides. We try to ensure that all messages reach their destination, and have implemented several resilience measures such as retries and circuit breakers in cases of transient issues.
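The retry side of this can be sketched as a small delivery loop. This is a hypothetical Python sketch, not Olo's actual implementation: `send` stands in for whatever posts the payload to the partner's HTTP endpoint, and the real service also layers a circuit breaker on top.

```python
import time

def deliver_with_retries(send, payload, max_attempts=3, base_delay=0.5):
    """Post a webhook payload, retrying transient failures with backoff.

    `send` is any callable that delivers `payload` to the partner's HTTP
    endpoint and returns True on success. (Illustrative helper only; a
    production sender would also apply a circuit breaker and jitter.)
    """
    for attempt in range(max_attempts):
        try:
            if send(payload):
                return True
        except Exception:
            pass  # treat exceptions as transient failures and retry
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return False  # retries exhausted; the message would be retried later or dead-lettered
```

The key property is that a transient endpoint failure never loses the message outright; it simply schedules another attempt.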
Webhooks as a potential scaling issue
Our Webhook Sending Service is responsible for composing and scheduling webhooks for delivery. In our first incarnation of this service, when our ordering platform wanted to publish a webhook, it would send a small message to the Webhook Sending Service using AWS Simple Queue Service (SQS). The service would then query our transactional database for the necessary data before building and sending the webhook.
Applications could send webhooks easily since they only needed a few key pieces of information. For example: the Order ID and the type of event that had taken place. The Webhook Sending Service took care of the rest.
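For illustration, such a trigger message could look like the following. The field names here are hypothetical; the real schema may differ.

```python
import json

# Hypothetical shape of the small SQS trigger message: just enough for the
# Webhook Sending Service to look up the full details in the database.
trigger_message = {
    "eventType": "OrderPlaced",  # the type of event that took place
    "orderId": "123456",         # key for querying the transactional database
}

serialized = json.dumps(trigger_message)
```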
One downside of this approach was the additional load on a transactional database that was already busy doing other important work, like processing orders. We were hoping to address scaling concerns by using an alternative source for this data.
We already relied on Kafka for internal communication between Olo subsystems and services. As an example, we publish an event whenever an order has been placed, and if an internal service needs to do something when an order is placed, the service simply subscribes to this Kafka topic.
Kafka and SQS are both first-class messaging technologies at Olo, and over time we’ve developed guidance on when to prefer one over the other. We generally prefer SQS as a means of providing inexpensive point-to-point messaging when an event can be treated as an instruction to a service working asynchronously, and when there’s only a single subscriber. Kafka is our preferred option for events we describe in the past tense (Order Placed, Credit Card Billed) or when we’d need to have multiple (or dynamic) subscribers.
In our case, the subscriber-agnostic nature of Kafka meant that our order submission code would not have to change to add a new subscriber. The code has no awareness of which subscribers are reading from the topic at all. All of this results in low coupling between our services.
Leveraging Kafka streams instead of SQS
We determined that the Kafka events would likely work well as the triggering mechanism for sending our webhooks. The events already contained the vast majority of the data we needed (we use the fat events pattern), and we wouldn’t have to make too many changes in order to be able to construct our webhooks without having to query the database at all.
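The fat events pattern is what makes this possible: because the event already carries the data a consumer needs, the webhook payload can be assembled without any database lookup. A minimal sketch, with illustrative field names:

```python
# Hypothetical fat event: it already carries the fields the webhook needs,
# so the builder never touches the database. Field names are illustrative.
def build_webhook_payload(event):
    return {
        "orderId": event["orderId"],
        "status": event["status"],
        "total": event["total"],
        "customerName": event["customer"]["name"],
    }

fat_event = {
    "orderId": "123456",
    "status": "Placed",
    "total": "24.99",
    "customer": {"name": "Jane Diner", "email": "jane@example.com"},
}
```

Note that the builder is also free to project out only the fields the webhook exposes; the event can carry more than any one consumer uses.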
On top of that - and almost as exciting - this design would allow us to add new types of webhooks in the future without having to modify our core order submission logic. This would reduce the risk of such changes significantly, and the team in charge of the Webhook Sending Service would be able to iterate on webhook-related functionality quickly and with a high degree of autonomy.
An approach we considered but decided against was to add the missing data to SQS. This would have allowed the Webhook Sending Service to avoid querying the database, but it would not have given us the low coupling between our order submission logic and our webhook sending logic that we got from using Kafka instead.
A challenging migration
While migrating from one source of data to another may sound easy in theory, the devil is in the details. We had a number of challenges during this project, including:
- Coordination was required in order to avoid sending duplicates or dropping webhooks during the migration period.
- We wanted to ensure that we were sending the exact same data as before and we needed to preserve formatting for things like phone numbers and decimal places.
Our primary goal was to perform the migration without changing any of the webhooks sent to our partners. We decided that the safest way would be to divide the migration into two distinct steps:
- Compare: Build webhooks using Kafka as the source, but don’t send these yet. Instead, compare the payloads with the payloads that use SQS and the database as their source, and iterate to fix any discrepancies.
- Migrate: Once we reached a 100% match, we could confidently switch to sending webhooks based on Kafka data.
Each of the two steps has some interesting challenges.
Sampled comparisons using Scientist.NET
We were able to reuse an existing homegrown framework for comparisons at Olo, so implementing this part was quick. The framework uses Scientist.NET to perform the comparisons, streams the mismatches to a Redshift database, and reports the results to Datadog for observation.
From our application, all we had to do was generate the legacy and Kafka-based webhook payloads and hand these to the framework along with instructions on how to compare the two JSON blobs; everything else just worked out of the box. Whenever a mismatch showed up we would use the data in Redshift to manually compare the two payloads and work to correct the discrepancy.
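Conceptually, each comparison works like a Scientist experiment: the legacy payload remains the one that gets sent, while the Kafka-based candidate is built alongside it and any difference is recorded. A simplified Python sketch of that idea (not the actual Scientist.NET API):

```python
import json

def run_comparison(build_legacy, build_candidate, record_mismatch):
    """Scientist-style experiment: the legacy payload is still the source of
    truth, while the Kafka-based candidate is built in parallel and compared.
    `record_mismatch` stands in for streaming the pair to Redshift."""
    legacy = build_legacy()
    candidate = build_candidate()
    # Compare as parsed JSON so key order and whitespace don't matter.
    if json.loads(legacy) != json.loads(candidate):
        record_mismatch(legacy, candidate)
    return legacy  # callers keep sending the legacy payload for now
```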
One interesting thing to note about the comparisons is that due to how the data from SQS and Kafka is received, doing the comparisons required us to make additional database calls. To avoid unnecessarily increasing database load, we decided to use sampling to specify how many webhooks should go through the comparison logic. A feature flag allowed us to run comparisons for a small percentage of webhooks and in turn let us control the number of additional database calls we were comfortable making. After running comparisons for a few hours we felt confident that we had processed a representative sample of webhooks.
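One way to implement percentage-based sampling like this is to hash a stable identifier into a bucket, which makes the sampling decision deterministic per webhook. This is a hypothetical sketch; the actual feature-flag tooling may work differently.

```python
import zlib

def should_compare(webhook_id: str, sample_percent: float) -> bool:
    """Deterministically select roughly `sample_percent`% of webhooks for
    the comparison path. Hashing the id (rather than rolling a random
    number) keeps the decision stable across retries and hosts."""
    bucket = zlib.crc32(webhook_id.encode()) % 10000  # bucket in [0, 9999]
    return bucket < sample_percent * 100
```

Raising the flag from, say, 5% toward 100% then simply widens the range of buckets that go through the comparison logic.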
As a result of the comparisons we found a few different classes of discrepancies:
- Missing data: The Kafka events were missing a few pieces of information needed for our webhooks. In these cases we modified the events to start including the data we needed.
- Null vs blank string: We had some cases where legacy payloads would use `null` to indicate that data was missing while our new payloads had a blank string instead. In order to maintain backwards compatibility we updated the new payloads to match the legacy behavior.
- Number precision: Floating point fields didn’t always serialize consistently when running experiments. We accepted the few discrepancies as false positives.
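The last two discrepancy classes can be illustrated with two small helpers. These are hypothetical sketches of the ideas, not the actual fixes, which lived in the payload builders and comparison configuration:

```python
import math

def normalize(value):
    # Null vs blank string: legacy payloads used null for missing data, so a
    # blank string in the new payload is normalized back to None to match.
    return None if value == "" else value

def numbers_match(a, b, rel_tol=1e-9):
    # Number precision: treat tiny floating point serialization drift as a
    # match (a false positive) rather than a real discrepancy.
    return math.isclose(a, b, rel_tol=rel_tol)
```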
We spent a few weeks running comparisons and fixing discrepancies. Eventually we were ready to proceed with the migration.
Migrating safely with a simple Redis lock
The SQS-based requests to send a webhook and the Kafka events could be off in either direction by a few seconds due to things like batching or a backed up queue. In the Webhook Sending Service we had different threads for reading from SQS and from Kafka.
All of this would have made it challenging to switch from SQS to Kafka in one move. We could theoretically have ended up with duplicates if the same webhook were sent both from SQS and from Kafka. We could also have dropped messages if we switched over after the corresponding Kafka event had already passed but before the SQS message was processed.
To protect against these issues we decided to use a simple Redis lock. The SQS and Kafka consumers both try to acquire a semaphore - implemented as a SETNX command - and whichever one succeeds sends the webhook message. The strategy was to run both consumers in parallel for about a minute, making sure that we didn’t drop any webhooks, and using the Redis lock to avoid any duplicates.
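The locking idea can be sketched as follows. To keep the sketch self-contained it uses a tiny in-memory stand-in for Redis; in production this would be a real Redis client issuing SETNX, and the key would carry a TTL.

```python
class InMemoryRedis:
    """Minimal stand-in for Redis so this sketch runs without a server."""
    def __init__(self):
        self._data = {}

    def setnx(self, key, value):
        # SETNX semantics: set the key only if it does not already exist,
        # and report whether this caller won the race.
        if key in self._data:
            return False
        self._data[key] = value
        return True

def try_send_webhook(redis, webhook_id, consumer_name, send):
    """Both the SQS and Kafka consumers call this for the same webhook;
    whichever acquires the per-webhook lock first actually sends it."""
    if redis.setnx(f"webhook-lock:{webhook_id}", consumer_name):
        send(webhook_id)
        return True
    return False  # the other consumer already sent this webhook
```

Because both consumers race for the same per-webhook key, the webhook goes out exactly once no matter which pipeline gets there first.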
Using SETNX is very simple, but it has some limitations when used for locking, so we wanted to verify whether it was good enough for our specific, short-lived scenario or whether we would have to spend more effort implementing something like Redlock. In our staging environment we set up a custom webhook recipient endpoint and used a load test to send a few thousand webhooks with both SQS and Kafka enabled, using SETNX as a locking mechanism. This allowed us to confirm that we got exactly one copy of each webhook and that the simple lock would do the trick.
Big Enough design upfront
On this project we did a fair amount of design work upfront and iterated on the design as we learned more about the problem at hand. Creating diagrams made it easier to discuss and update the design, allowing us to ensure that everyone on the team was on the same page. Spending a couple of extra days ensuring we had a decent design in place seemed like a good investment, and limited the number of surprises during the implementation phase. We also found it helpful to be able to reference the diagram later in the project, during implementation and testing.
Our design relied on feature flags to control the migration process. First, they allowed us to enable Kafka processing and payload sampling until we were happy with the result. Next, we used them to run Kafka in parallel with SQS using the Redis lock to prevent duplicates, before shutting off SQS processing and relying only on the Kafka pipeline.
Doing the migration
We decided to sample and migrate a few types of webhooks at a time, starting with a few low-volume ones and gradually working our way to the busier ones as we gained confidence. We ran several of these migrations in parallel, sampling some types of webhooks while completing the migrations for others.
In order to keep track of all of the feature flags and ensure that we got it right, we created a step by step process that we followed for each webhook. While the actual plan is slightly more detailed, the overall workflow looks something like the following:
- Turn on Kafka processing, enabling reading off the stream but not sending.
- Sample Kafka vs. SQS payload accuracy at 5%. Report any inaccuracies and fix.
- When satisfied, shut off sampling.
- Shut off Kafka processing and observe that the stream is no longer being read from.
- Turn on the sending mechanism for Kafka. This will enable the sending feature for Kafka, but not the processing feature.
- Enable the Redis lock. At this point SQS will always grab the lock as we haven’t yet enabled Kafka stream processing.
- Turn on Kafka processing. Now both SQS and Kafka will attempt to send the webhook, and the locking mechanism ensures only one of them actually sends the payload.
- Turn off SQS processing, leaving only the Kafka process running.
- Turn off the Redis locking mechanism.
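The interplay of these flags can be sketched as a single send decision evaluated by each consumer. The flag names and helper below are illustrative, not the actual feature-flag setup:

```python
def consumer_should_send(consumer, flags, acquire_lock):
    """Decide whether a consumer ("sqs" or "kafka") sends a given webhook,
    based on the migration feature flags. `acquire_lock` stands in for the
    per-webhook Redis SETNX attempt."""
    if not flags.get(f"{consumer}_processing"):
        return False  # this consumer isn't reading from its source at all
    if consumer == "kafka" and not flags.get("kafka_sending"):
        return False  # compare-only mode: build payloads but never send
    if flags.get("redis_lock"):
        return acquire_lock()  # during the overlap, only the lock winner sends
    return True
```

During the parallel phase all four flags are on, so both consumers reach the lock and exactly one of them sends each webhook.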
To keep track of how far along we were in the migration process for each webhook, we created a simple checklist in Jira.
Our service was already instrumented quite extensively, and during the migration process we kept an eye on the volume of webhooks sent from our system, verifying that it did not change.
Where we did see a big change, however, was in the number of database calls made from our service. Post migration, we reduced the number of database calls per second from within the Webhook Sending Service by almost three orders of magnitude.
Wrapping it up
By re-using our Kafka stream, we were able to:
- Remove millions of extra database queries a day.
- Eliminate extra HTTP calls made to SQS from within our highest-traffic applications, saving ~15ms on average per order.
- Future-proof our webhook sending service by leveraging an ever-expanding stream of information.