Bulletproofing Kafka and avoiding the poison pill

Whenever we build a system component that is a crucial part of our customer journey, it's critical that every effort goes into making it as bulletproof as possible. We don't want blind spots to become the reason for unavailable services for hundreds or thousands of customers. But if making something bulletproof means risking the security and integrity of the banking platform itself, that’s equally disastrous.

After battle-testing some microservices that rely on a Kafka event stream, we learned a few things about resilience. From our experience, major issues can be avoided and easily addressed.

Why we can't overlook non-technical requirements

Building a platform that can scale up to serve a billion global customers for any client isn’t easy. And when those clients are reputable banks, and the platform is designed to handle real money, it’s near impossible to do so without focusing on customer security.

Here at 10x, we place great importance on ensuring our platform is safe and secure for our customers. That's why we have several teams dedicated to making this possible. Our Customer Screening team, for instance, works to validate that only those users with legitimate proof of identity can open a bank account. To achieve this, we perform a set of pre-flight checks for each user during the onboarding process to confirm that:

We are dealing with a real-life person
There are no illegal activities
Occupation-related risks are mitigated (i.e. bribery prevention)

This is managed by building a Screening Engine – a system that reacts to events read from a Kafka Topic. And for each event, it conducts mandatory security checks. The goal is to ensure the process is smooth and transparent, but also set up to react in a timely fashion if misuse is detected.

Screening System (Sml)3

Because of this, it’s crucial the system is highly reliable because onboarding should be fast and should work every single time for every customer. If something goes wrong, it may cause reputational damage, and regulatory penalties could also be applied.

In addition to that, the exposure to exploitation and scams could drive customers off the platform. That’s why this needs to withstand just about any failure scenario we throw at it.

The challenges of building a customer screening system

Building a system based around a stream of events flowing over Kafka has its benefits, but there are implications to consider. We can’t, for instance, have our topic partitions blocked for any reason. If this happened, it would stop the current customer from being onboarded, and everyone after them.

In building our platform, we brainstormed all the things that could cause this to happen and put countermeasures in place to deal with each scenario. Here are the key measures that were considered…

We went ahead and implemented everything that was needed with very few issues. By the end of it, we felt relieved that everything looked good and safe.

The problem with Kafka Consumers

Issues began to emerge from the woodwork when we opened up our onboarding topic to the wider platform. After a few days, several teams started using our topic to invoke the customer screening process for various reasons. Sometime later, we noticed a huge consumer lag on one of the partitions. Everything was blocked and our system couldn’t process anything. We believed we had considered all scenarios and applied countermeasures. How could anything go wrong?

During our investigation, we discovered that one of our Kafka Consumers was stuck trying to process a record, but failed each time during deserialization, hence not incrementing the partition offset. So, when the consumer returned for the next record, it continuously fetched the dud record it couldn’t interpret.

"As long as both the producer and the consumer are using the same compatible serializers and deserializers, everything works fine.

Compatible serializers and deserializers are the key.

You will end up in a poison pill scenario when the producer serializer and the consumer(s) deserializer are incompatible. This incompatibility can occur in both key and value deserializers."

It turns out that despite using Avro to enforce the schema on our topic, some producer libraries may not respect this. As consumers, we’re not protected by the Schema Registry unless the producer is actually using it. This left us exposed to unanticipated deserialisation errors. Read on to discover what we did to find a solution.

Handling Kafka poison pills

We were, in essence, dealing with a poison pill: a corrupted Kafka record that cannot be deserialized. If this were to happen in production, it would block all event processing from the affected topic partition. This would go on regardless of how many attempts were made to reprocess the event, at least until the data retention period for this topic expired. This means a proportion of users waiting an unreasonable amount of time to open an account. (There’s a great article about this written by Tim van Baarsen at Confluent, where you can discover more about Kafka poison pills.)

The way Spring Kafka deals with this sort of error by default is through a handler called SeekToCurrentErrorHandler which logs consumption errors that occur in the container layer and then skips to the next offset. However, this handler cannot cope with deserialisation exceptions and will enter an infinite retry loop. To fix this, consumers can use a different handler called ErrorHandlingDeserializer. This would be configured to deserialize keys and values by delegating the processing to the original deserializers being used. In our case, these were UUID for Keys and Avro for values.

Screenshot 2021-11-02 at 11.33.22

To override the default behaviour of just logging the error, a custom implementation of the Spring Kafka interface ConsumerRecordRecoverer can be created.

Screenshot 2021-11-02 at 11.39.18

This should be plugged back into the SeekToCurrentErrorHandler when configuring the container factory.

Screenshot 2021-11-02 at 11.58.02

By over-relying on what Avro provides when it comes to payload structure, we had overlooked the fundamental issue of what happens if deserialisation itself doesn't succeed. It’s still possible to end up with incompatible producer/consumer pairing despite using Avro and the Confluent Schema Registry, e.g., a new producer that ignores schema checks. We chose to sidestep the non-deserializable records using a tried and tested approach of placing them onto an error topic before triggering an alert to prompt investigation.

The actual fix we made to resolve the issue was straightforward to implement, thanks to the awesome team at Spring (reference documentation here). However, our lesson in humility could have been very expensive, and is something any company could experience for themselves. In fact, we’ve come across many others discussing the same issues.

Final thoughts and learnings from the experience

Once everything was back to normal, we decided to ensure that we wouldn't face this again even if we changed the consumer deserialisation configuration. We couldn't test this by placing poison pills on our onboarding topic because that was being used by multiple teams with different backlog priorities and velocities. That could block consumers that have not yet implemented a defense against poison pills.

Instead, we created a different topic where we could produce a range of badly serialized records without any risks. So now, whenever we make changes to our consumers and want to check if we are still protected, we can reconfigure them to read from the poison topic instead.

No matter how edge-case a situation may seem, consider what’s at stake and go the extra mile with your testing scenarios. It's not always possible to think of everything that can go wrong ahead of time. However, the risk of having issues show up in production environments can be mitigated by having a strategy that tests a lot of different unhappy paths that could potentially break your system.

Having alerts on errors and throughput also provides the observability needed to discover issues early in the development life cycle. By not accepting code as production-ready until it has been battle-tested, we can avoid the possibility of issues occurring for real customers.