
Bulletproofing Kafka and avoiding the poison pill

Whenever we build a system component that is a crucial part of our customer journey, it's critical that every effort goes into making it as bulletproof as possible. We don't want blind spots to become the reason for unavailable services for hundreds (or even thousands) of customers. But if making something bulletproof means risking the security and integrity of the banking platform itself, that’s equally disastrous.
 
After battle-testing some microservices that rely on a Kafka event stream, we learned a few things about resilience. This blog post shows how, in our experience, major issues can be avoided and easily addressed. We hope you find a few useful insights if you’re building something similar.
 
Why we can’t overlook non-technical requirements  
 
Building a platform that can scale up to serve a billion global customers for any given client isn’t easy. And when those clients are reputable banks, and the platform is designed to handle real money, it’s near impossible to do so without focusing on customer security.
 
Here at 10x, we place great importance on ensuring our platform is safe and secure for our customers. That's why we have several teams dedicated to making this possible. Our Customer Screening team, for instance, works to validate that only those users with legitimate proof of identity can open a bank account. To achieve this, we perform a set of pre-flight checks for each user during the onboarding process to confirm that:
 
  • We are dealing with a real-life person 
  • There are no illegal activities 
  • Occupation-related risks are mitigated (i.e. bribery prevention) 

This is managed by a Screening Engine: a system that reacts to events read from a Kafka topic and conducts mandatory security checks for each event. The goal is to ensure the process is smooth and transparent, but also set up to react in a timely fashion if misuse is detected.

[Figure: Screening System]

It’s therefore crucial that the system is highly reliable and resilient: onboarding should be fast, and it should work every single time for every customer. If something goes wrong, it not only causes huge reputational damage, but regulatory penalties could also be applied, and exposure to exploitation and scams could drive customers off the platform. There's no sugarcoating this: it needs to be indestructible!

The challenges of building a customer screening system

Building a system based around a stream of events flowing over Kafka has its benefits, but there are implications to consider. We can’t, for instance, have our topic partitions blocked for any reason. If this happened, it would stop the current customer from being onboarded, and everyone after them.
 
In building our platform, we brainstormed all the things that could cause this to happen and put countermeasures in place to deal with each scenario. Here are the key measures that were considered…
[Figure: Bulletproof Decisions]
 

We went ahead and implemented everything that was needed with very few issues. By the end of it, we felt relieved that everything looked good and safe.

The problem with Kafka Consumers 

Issues began to emerge from the woodwork when we opened up our onboarding topic to the wider platform. After a few days, several teams started using our topic to invoke the customer screening process for various reasons. Some time later, we noticed a huge consumer lag on one of the partitions. Everything was blocked and our system couldn’t process anything. We believed we had considered all scenarios and applied countermeasures. How could anything go wrong?
 
During our investigation, we discovered that one of our Kafka Consumers was stuck on a single record. Deserialisation failed every time, so the partition offset was never advanced. Each time the consumer came back for the next record, it fetched the same dud record it couldn’t interpret.
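The loop described above can be illustrated with a small, library-free simulation. This is only a sketch under stated assumptions: the partition is modelled as an in-memory list, and `deserialize` is a hypothetical stand-in for the real deserialiser that accepts only payloads that parse as an integer. It is not Kafka client code.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Library-free simulation of a consumer stuck on a poison pill (not real Kafka code). */
class PoisonPillLoopDemo {

    /** Hypothetical deserialiser: only payloads that parse as an integer are valid. */
    static int deserialize(byte[] payload) {
        return Integer.parseInt(new String(payload, StandardCharsets.UTF_8));
    }

    /**
     * Polls the partition up to maxPolls times. The offset only moves forward
     * after a record has been deserialised successfully.
     * Returns the offset the consumer ends up on.
     */
    static int consume(List<byte[]> partition, int maxPolls) {
        int offset = 0;
        for (int poll = 0; poll < maxPolls && offset < partition.size(); poll++) {
            try {
                deserialize(partition.get(offset));
                offset++; // success: move on to the next record
            } catch (NumberFormatException e) {
                // failure: offset unchanged, so the next poll fetches the same record
            }
        }
        return offset;
    }

    public static void main(String[] args) {
        List<byte[]> partition = List.of(
                "1".getBytes(StandardCharsets.UTF_8),
                "not-a-number".getBytes(StandardCharsets.UTF_8), // the poison pill
                "3".getBytes(StandardCharsets.UTF_8));
        // The third record is never reached, however many polls are made.
        System.out.println("Offset after 100 polls: " + consume(partition, 100));
    }
}
```

However large `maxPolls` is, the consumer in this sketch never gets past offset 1, which is the consumer lag we saw on the blocked partition.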
 

"As long as both the producer and the consumer are using the same compatible serializers and deserializers, everything works fine. 
 
Compatible serializers and deserializers are the key. 

You will end up in a poison pill scenario when the producer serializer and the consumer(s) deserializer are incompatible. This incompatibility can occur in both key and value deserializers."

It turns out that, despite our using Avro to enforce the schema on our topic, some producer libraries may not respect it. As consumers, we’re not protected by the Schema Registry unless the producer is actually using it. This left us exposed to unanticipated deserialisation errors. Read on to discover what we did to find a solution…
 
Handling Kafka Poison pills  
 

We were, in essence, dealing with a poison pill: a corrupted Kafka record that cannot be deserialised. If this were to happen in production, it would block all event processing on the affected topic partition. The blockage would persist regardless of how many attempts were made to reprocess the event, at least until the data retention period for the topic expired. That would leave a proportion of users waiting an unreasonable amount of time to open an account. (There’s a great article about this written by Tim van Baarsen at Confluent, where you can discover more about Kafka poison pills.)

By default, Spring Kafka deals with this sort of error through a handler called SeekToCurrentErrorHandler, which seeks back to the failed record so that it is redelivered and, once its retries are exhausted, logs the error and skips to the next offset. However, this handler cannot cope with deserialisation exceptions on its own: they are thrown before the record ever reaches the container layer, so the consumer enters an infinite retry loop. To fix this, consumers can use a wrapper deserialiser called ErrorHandlingDeserializer. It is configured as the deserialiser for keys and values and delegates the actual work to the original deserialisers, catching any exception they throw so that the failure can be passed to the error handler instead. In our case, the delegates were UUID for keys and Avro for values.
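As a rough sketch, the consumer properties for this setup could look like the map below. The property keys are the standard Kafka and Spring Kafka ones, written as plain strings so the example needs no dependencies. The delegate class names are assumptions based on "UUID for keys and Avro for values" (the Confluent Avro deserialiser in particular is a guess), and connection settings such as the bootstrap servers, group ID and Schema Registry URL are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of consumer properties that wrap the real deserialisers (class names are assumptions). */
class ScreeningConsumerProps {

    static final String ERROR_HANDLING_DESERIALIZER =
            "org.springframework.kafka.support.serializer.ErrorHandlingDeserializer";

    static Map<String, Object> consumerProps() {
        Map<String, Object> props = new LinkedHashMap<>();
        // Kafka itself is given the wrapper for both keys and values...
        props.put("key.deserializer", ERROR_HANDLING_DESERIALIZER);
        props.put("value.deserializer", ERROR_HANDLING_DESERIALIZER);
        // ...and the wrapper is told which deserialisers to delegate to.
        props.put("spring.deserializer.key.delegate.class",
                "org.apache.kafka.common.serialization.UUIDDeserializer");
        props.put("spring.deserializer.value.delegate.class",
                "io.confluent.kafka.serializers.KafkaAvroDeserializer"); // assumed Avro deserialiser
        return props;
    }

    public static void main(String[] args) {
        consumerProps().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In a Spring application, a map like this would typically be passed to the consumer factory, or the same keys set in the application properties.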


To override the default behaviour of just logging the error, a custom implementation of the Spring Kafka interface ConsumerRecordRecoverer can be created.


This should be plugged back into the SeekToCurrentErrorHandler when configuring the container factory. 


By over-relying on what Avro provides when it comes to payload structure, we had overlooked the fundamental issue of what happens if deserialisation itself doesn't succeed. It’s still possible to end up with an incompatible producer/consumer pairing despite using Avro and the Confluent Schema Registry, for example when a new producer ignores schema checks. We chose to sidestep the non-deserialisable records using a tried and tested approach: placing them onto an error topic and then triggering an alert to prompt investigation.
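The skip-and-park approach can be sketched without any Kafka or Spring dependencies. In the simulation below, the recoverer is modelled as a plain `BiConsumer` that receives the raw payload and the exception. The error topic and the alerting channel are in-memory lists, and `deserialize` is a hypothetical stand-in, so this shows the idea only and is not our actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/** Library-free simulation of parking poison pills on an error topic (not real Kafka code). */
class SkipToErrorTopicDemo {

    final List<byte[]> errorTopic = new ArrayList<>();  // stand-in for the error topic
    final List<String> alerts = new ArrayList<>();      // stand-in for the alerting channel
    final List<Integer> processed = new ArrayList<>();  // successfully handled values

    /** Hypothetical deserialiser: only payloads that parse as an integer are valid. */
    static int deserialize(byte[] payload) {
        return Integer.parseInt(new String(payload, StandardCharsets.UTF_8));
    }

    /** Recoverer: park the raw bytes for later inspection, then raise an alert. */
    BiConsumer<byte[], Exception> recoverer() {
        return (rawPayload, cause) -> {
            errorTopic.add(rawPayload);
            alerts.add("Undeserialisable record: " + cause.getMessage());
        };
    }

    /** Consumes the whole partition and returns the final offset. */
    int consume(List<byte[]> partition) {
        BiConsumer<byte[], Exception> recoverer = recoverer();
        int offset = 0;
        while (offset < partition.size()) {
            byte[] payload = partition.get(offset);
            try {
                processed.add(deserialize(payload));
            } catch (NumberFormatException e) {
                recoverer.accept(payload, e); // hand the dud record to the recoverer
            }
            offset++; // the offset advances either way, so the partition is never blocked
        }
        return offset;
    }

    public static void main(String[] args) {
        SkipToErrorTopicDemo demo = new SkipToErrorTopicDemo();
        int offset = demo.consume(List.of(
                "1".getBytes(StandardCharsets.UTF_8),
                "not-a-number".getBytes(StandardCharsets.UTF_8), // the poison pill
                "3".getBytes(StandardCharsets.UTF_8)));
        System.out.println("Final offset: " + offset);
        System.out.println("Processed: " + demo.processed);
        System.out.println("Parked on error topic: " + demo.errorTopic.size());
    }
}
```

The key design choice is that a failed record is recorded somewhere durable and visible before the offset moves on, so nothing is silently lost and later records are no longer held up.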
 
[Figure: Screening System]
 
The actual fix we made to resolve the issue was straightforward to implement, thanks to the awesome team at Spring (reference documentation here). However, our lesson in humility could have been very expensive, and is something any company could experience for themselves. In fact, we’ve come across many others discussing the same issues.
 
Final thoughts and learnings from the experience 
 

Once everything was back to normal, we had to ensure that we wouldn't face this again even if we changed the consumer deserialisation configuration. We couldn't test this by placing poison pills on our onboarding topic because that is being used by multiple teams with different backlog priorities and velocities. That could block consumers that have not yet implemented a defence against poison pills.

What we did instead was to create a different topic where we could produce a range of badly serialized records without any risks. So now, whenever we make changes to our consumers and want to check if we are still protected, we can reconfigure them to read from the poison topic instead.
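A range of bad records for such a poison topic could be generated along the lines below. This is a sketch only: the specific cases and the schema ID 42 are hypothetical, and it assumes values are normally written in the Confluent wire format (a zero magic byte, a four-byte schema ID, then the Avro body). The `hasConfluentHeader` helper is illustrative and checks the header only. A payload that passes it can still fail later, during schema lookup or Avro decoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: hypothetical bad payloads that could be produced onto a poison topic. */
class PoisonPayloads {

    static final byte MAGIC_BYTE = 0x0;  // first byte of the Confluent wire format
    static final int HEADER_LENGTH = 5;  // magic byte + 4-byte schema ID

    static Map<String, byte[]> samples() {
        Map<String, byte[]> samples = new LinkedHashMap<>();
        samples.put("empty", new byte[0]);
        samples.put("plain-text", "not avro".getBytes(StandardCharsets.UTF_8));
        samples.put("wrong-magic-byte",
                ByteBuffer.allocate(6).put((byte) 1).putInt(42).put((byte) 0).array());
        samples.put("truncated-header", new byte[] {MAGIC_BYTE, 0, 0});
        // Well-formed header, but the body is not valid for any real schema.
        samples.put("garbage-body",
                ByteBuffer.allocate(7).put(MAGIC_BYTE).putInt(42)
                        .put((byte) 0xFF).put((byte) 0xFF).array());
        return samples;
    }

    /** True if the payload at least starts with a plausible wire-format header. */
    static boolean hasConfluentHeader(byte[] payload) {
        return payload != null
                && payload.length >= HEADER_LENGTH
                && payload[0] == MAGIC_BYTE;
    }

    public static void main(String[] args) {
        samples().forEach((name, bytes) ->
                System.out.println(name + ": header ok = " + hasConfluentHeader(bytes)));
    }
}
```

Each sample would be sent to the poison topic as raw bytes, for example with a byte-array serialiser, so that the consumer under test has to cope with it.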

No matter how edge-case a situation may seem, consider what’s at stake and stay humble. It's not always possible to think of everything that can go wrong ahead of time. However, the risk of having issues show up in production environments can be mitigated by having a strategy that tests a lot of different unhappy paths that could potentially break your system. Besides this, having alerts on errors and throughput provides the observability needed to discover issues early in the development life cycle. By not accepting code as production-ready until it has been battle-tested, we can avoid the possibility of issues occurring for real customers.

Credits:
Adam Smalley
Steve Murkin