- We are dealing with a real-life person
- There are no illegal activities
- Occupation-related risks are mitigated (e.g. bribery prevention)
This is managed by building a Screening Engine – a system that reacts to events read from a Kafka topic and, for each event, conducts mandatory security checks. The goal is to ensure the process is smooth and transparent, but also set up to react in a timely fashion if misuse is detected.
Because of this, it's crucial that the system is highly reliable and resilient: onboarding should be fast, and it should work every single time, for every customer. If something goes wrong, it not only causes serious reputational damage; regulatory penalties could also be applied, and exposure to exploitation and scams could drive customers off the platform. There's no sugarcoating this: it needs to be indestructible!
The challenges of building a customer screening system
We went ahead and implemented everything that was needed with very few issues. By the end of it, we felt relieved that everything looked good and safe.
The problem with Kafka Consumers
"As long as both the producer and the consumer are using the same compatible serializers and deserializers, everything works fine.
Compatible serializers and deserializers are the key.
You will end up in a poison pill scenario when the producer serializer and the consumer(s) deserializer are incompatible. This incompatibility can occur in both key and value deserializers."
We were in essence dealing with a poison pill: a corrupted Kafka record that cannot be deserialised. If this were to happen in production, it would block all event processing from the affected topic partition. This would go on regardless of how many attempts were made to reprocess the event, at least until the data retention period for this topic expired. This means a proportion of users waiting an unreasonable amount of time to open an account. (There’s a great article about this written by Tim van Baarsen at Confluent, where you can discover more about Kafka poison pills.)
The way Spring Kafka deals with this sort of error by default is through a handler called SeekToCurrentErrorHandler, which logs consumption errors that occur in the container layer and then skips to the next offset. However, this handler cannot cope with deserialisation exceptions, because they are thrown before the record ever reaches the listener, so the consumer enters an infinite retry loop. To fix this, consumers can wrap their deserialisers in Spring Kafka's ErrorHandlingDeserializer. This is configured to delegate the actual deserialisation of keys and values to the original deserialisers being used. In our case, these were UUID for keys and Avro for values. When delegation fails, the ErrorHandlingDeserializer surfaces a DeserializationException that the error handler can deal with, instead of blowing up inside the poll loop.
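A minimal consumer configuration along these lines might look like the following sketch. The ErrorHandlingDeserializer class and its `spring.deserializer.*.delegate.class` properties come from Spring Kafka; the specific delegate classes shown reflect our UUID-key/Avro-value setup:

```
# Wrap both deserialisers so failures surface as DeserializationExceptions
# instead of being thrown inside the poll loop
spring.kafka.consumer.key-deserializer=org.springframework.kafka.support.serializer.ErrorHandlingDeserializer
spring.kafka.consumer.value-deserializer=org.springframework.kafka.support.serializer.ErrorHandlingDeserializer

# Delegate the real work to the original deserialisers
spring.kafka.consumer.properties.spring.deserializer.key.delegate.class=org.apache.kafka.common.serialization.UUIDDeserializer
spring.kafka.consumer.properties.spring.deserializer.value.delegate.class=io.confluent.kafka.serializers.KafkaAvroDeserializer
```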
To override the default behaviour of just logging the error, a custom implementation of the Spring Kafka interface ConsumerRecordRecoverer can be created.
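As a sketch of what such a recoverer could look like (the class name and log message are illustrative, not our actual implementation), it might log the failure with enough context to investigate, and raise an alert or forward the record elsewhere:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.listener.ConsumerRecordRecoverer;

public class ScreeningRecordRecoverer implements ConsumerRecordRecoverer {

    private static final Logger log =
            LoggerFactory.getLogger(ScreeningRecordRecoverer.class);

    @Override
    public void accept(ConsumerRecord<?, ?> record, Exception exception) {
        // Invoked once retries are exhausted; returning normally lets the
        // container commit the offset, unblocking the partition.
        log.error("Skipping unprocessable record from {}-{} at offset {}",
                record.topic(), record.partition(), record.offset(), exception);
        // This is also the place to raise a metric/alert, or forward the raw
        // record to a dead-letter topic (Spring Kafka's
        // DeadLetterPublishingRecoverer does the latter out of the box).
    }
}
```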
This should be plugged back into the SeekToCurrentErrorHandler when configuring the container factory.
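Wiring it up could look roughly like this (`ScreeningEvent` is a stand-in for the actual Avro-generated value type, and the back-off values are illustrative):

```java
import java.util.UUID;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ConsumerRecordRecoverer;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

public class KafkaConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<UUID, ScreeningEvent> containerFactory(
            ConsumerFactory<UUID, ScreeningEvent> consumerFactory,
            ConsumerRecordRecoverer recoverer) {
        var factory = new ConcurrentKafkaListenerContainerFactory<UUID, ScreeningEvent>();
        factory.setConsumerFactory(consumerFactory);
        // Retry twice with a 1s back-off, then hand the record to the recoverer
        factory.setErrorHandler(
                new SeekToCurrentErrorHandler(recoverer, new FixedBackOff(1000L, 2L)));
        return factory;
    }
}
```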
Once everything was back to normal, we had to ensure that we wouldn't face this again even if we changed the consumer deserialisation configuration. We couldn't test this by placing poison pills on our onboarding topic because that is being used by multiple teams with different backlog priorities and velocities. That could block consumers that have not yet implemented a defence against poison pills.
What we did instead was to create a different topic where we could produce a range of badly serialized records without any risks. So now, whenever we make changes to our consumers and want to check if we are still protected, we can reconfigure them to read from the poison topic instead.
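One simple way to seed such a topic is to produce records that the real consumers cannot deserialise, for example plain strings where a UUID key and an Avro value are expected. The topic name and payloads below are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PoisonPillSeeder {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (var producer = new KafkaProducer<String, String>(props)) {
            // Neither the key nor the value will deserialise on the consumer
            // side, so each record acts as a poison pill.
            producer.send(new ProducerRecord<>("onboarding.poison",
                    "not-a-uuid", "definitely-not-avro"));
        }
    }
}
```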
No matter how edge-case a situation may seem, consider what's at stake and stay humble. It's not always possible to think of everything that can go wrong ahead of time. However, the risk of issues showing up in production can be mitigated by a strategy that tests the many unhappy paths that could potentially break your system. Beyond that, alerts on errors and throughput provide the observability needed to discover issues early in the development life cycle. By not accepting code as production-ready until it has been battle-tested, we can avoid issues occurring for real customers.