19

According to the Amazon Kinesis Streams documentation, a record can be delivered multiple times.

The only way to be sure to process every record just once is to temporarily store them in a database that supports integrity checks (e.g. DynamoDB, ElastiCache, or MySQL/PostgreSQL), or simply to checkpoint the RecordId for each Kinesis shard.
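
For illustration, the "integrity check" option could look roughly like this with PostgreSQL and psycopg2; the table, column, and the process_record callback are placeholders, not part of any real setup:

```python
# Rough sketch: deduplicate Kinesis records via a UNIQUE/PRIMARY KEY constraint.
# Table "processed_records", column "record_id" and process_record() are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=dedup user=app")  # placeholder connection string

def handle(record_id, payload, process_record):
    """Process a Kinesis record only if its id has not been seen before."""
    with conn, conn.cursor() as cur:
        # The primary-key constraint on record_id is the integrity check:
        # a duplicate insert simply affects zero rows.
        cur.execute(
            "INSERT INTO processed_records (record_id) VALUES (%s) "
            "ON CONFLICT (record_id) DO NOTHING",
            (record_id,),
        )
        if cur.rowcount == 1:      # first time this record is seen
            process_record(payload)
        # rowcount == 0 means a duplicate delivery; skip it
```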

Do you know a better / more efficient way of handling duplicates?

Antonio

2 Answers

20

We had exactly that problem when building a telemetry system for a mobile app. In our case we were also not sure that producers were sending each message exactly once, so for each received record we calculated its MD5 on the fly and checked whether it was already present in some form of persistent storage; which storage to use is indeed the trickiest bit.

Firstly, we tried a plain relational database, but it quickly became a major bottleneck of the whole system, as this is not just a read-heavy but also a write-heavy case, and the volume of data going through Kinesis was quite significant.

We ended up with a DynamoDB table storing the MD5 of each unique message. The issue we had was that it wasn't easy to delete old messages: even though our table had partition and sort keys, DynamoDB does not allow dropping all records with a given partition key, so we had to query all of them to get the sort key values (which wastes time and capacity). Unfortunately, we just had to drop the whole table once in a while. Another, also suboptimal, solution is to regularly rotate the DynamoDB tables that store the message identifiers.

However, DynamoDB recently introduced a very handy feature, Time To Live, which means we can now control the size of the table by enabling auto-expiry on a per-record basis. In that sense DynamoDB seems quite similar to ElastiCache; however, ElastiCache (at least a Memcached cluster) is much less durable: there is no redundancy, and all data residing on terminated nodes is lost during a scale-in operation or a failure.
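
For illustration, the conditional-put-plus-TTL approach described above might look roughly like this with boto3; the table name, attribute names, and TTL window are placeholders, and the table is assumed to have TTL enabled on the chosen attribute:

```python
# Sketch: one DynamoDB item per message MD5, written with a conditional put so
# duplicates are rejected, plus a TTL attribute so old entries expire automatically.
# Table "message_dedup" and attributes "md5" / "expires_at" are made up here.
import hashlib
import time

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("message_dedup")

def seen_before(message_bytes, ttl_seconds=24 * 3600):
    digest = hashlib.md5(message_bytes).hexdigest()
    try:
        table.put_item(
            Item={"md5": digest, "expires_at": int(time.time()) + ttl_seconds},
            # Fails if an item with this MD5 already exists -> duplicate message
            ConditionExpression="attribute_not_exists(md5)",
        )
        return False          # new message, safe to process
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True       # duplicate, skip it
        raise
```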

Dmitry Deryabin
  • Hi Dmitry. I was running several benchmarks using something similar to the JustGiving infrastructure explained here: https://aws.amazon.com/blogs/compute/serverless-cross-account-stream-replication-using-aws-lambda-amazon-dynamodb-and-amazon-kinesis-firehose/ . Why did you compute an MD5 checksum instead of using ShardId + SequenceNumber for your DDB table? – Antonio Apr 06 '17 at 23:16
  • Hi @Antonio. In our case it was possible that a producer would post the same message multiple times. When that happened, Kinesis would consider them different messages anyway (simply because there were two or more posts from the producer). As we knew that every message had to be unique, we simply disregarded messages whose MD5 had already been seen. Also, the MD5 was calculated by the producers, saving some compute time for the consumers (given the relatively large volume of data going through Kinesis). – Dmitry Deryabin Apr 07 '17 at 10:35
  • Just wanted to throw out there - AWS notes that different producers can naturally produce the same record multiple times due to error cases, and, more commonly, multiple consumers can pull the same set of records. I'm dealing with this on our system now too. We use Elasticsearch, and the plan for the moment is to use Elasticsearch's built-in versioning to ensure that the same record isn't updated at the same time, and then memoize a list of recent events applied to a record on the record itself. – genexp Sep 05 '17 at 13:29
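
For illustration, the optimistic-concurrency idea genexp mentions above could be sketched with Elasticsearch's external versioning over the plain REST API; the index name, document id, and version source are assumptions, and a reasonably recent Elasticsearch is assumed for the _doc endpoint:

```python
# Sketch: with version_type=external, Elasticsearch rejects a write (HTTP 409)
# unless the supplied version is higher than the stored one, so a stale or
# duplicate update of the same record is refused. Index "records" is hypothetical.
import requests

ES = "http://localhost:9200"

def upsert_with_version(doc_id, doc, version):
    r = requests.put(
        f"{ES}/records/_doc/{doc_id}",
        params={"version": version, "version_type": "external"},
        json=doc,
    )
    if r.status_code == 409:
        return False   # conflict: the record already exists at this or a newer version
    r.raise_for_status()
    return True
```
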
14

The problem you mention is a general problem of all queue systems that use an "at least once" approach. And it is not just the queue systems: both producers and consumers may process the same message multiple times (due to ReadTimeout errors etc.). Kinesis and Kafka both use that paradigm. Unfortunately, there is no easy answer.

You may also try an "exactly-once" message queue with a stricter transaction approach. For example, AWS SQS does that with FIFO queues: https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/ . Be aware that SQS throughput is far smaller than Kinesis's.
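
For illustration, sending to an SQS FIFO queue with boto3 looks roughly like this; the queue URL and ids are placeholders. SQS drops a second send that reuses the same MessageDeduplicationId within its 5-minute deduplication window:

```python
# Sketch: exactly-once submission to an SQS FIFO queue.
import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/events.fifo",  # placeholder
    MessageBody='{"event": "user_signup"}',
    MessageGroupId="events",                 # ordering scope within the FIFO queue
    MessageDeduplicationId="event-0001",     # duplicates with the same id are dropped
)
```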

To solve your problem, you should be aware of your application domain and try to solve it internally, as you suggested (database checks). Especially when you communicate with an external service (say, an email server), you should be able to recover the operation state in order to prevent double processing (because double sending, in the email server example, may result in multiple copies of the same message in the recipient's mailbox).
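
For illustration, a minimal "idempotent processor" for the email example could claim each record with a conditional write before performing the side effect; the table name, key, and send_email function below are hypothetical:

```python
# Sketch: mark the operation as claimed *before* the side effect, so a replayed
# record cannot trigger a second email. Table "email_operations" is made up here.
import boto3
from botocore.exceptions import ClientError

ops = boto3.resource("dynamodb").Table("email_operations")

def send_once(record_id, recipient, body, send_email):
    try:
        ops.put_item(
            Item={"record_id": record_id, "status": "claimed"},
            ConditionExpression="attribute_not_exists(record_id)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # this record was already claimed (and presumably sent); skip
        raise
    send_email(recipient, body)
    ops.update_item(
        Key={"record_id": record_id},
        UpdateExpression="SET #s = :sent",
        ExpressionAttributeNames={"#s": "status"},   # "status" is a reserved word
        ExpressionAttributeValues={":sent": "sent"},
    )
```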

See also the following concepts:

  1. At-least-once Delivery: http://www.cloudcomputingpatterns.org/at_least_once_delivery/
  2. Exactly-once Delivery: http://www.cloudcomputingpatterns.org/exactly_once_delivery/
  3. Idempotent Processor: http://www.cloudcomputingpatterns.org/idempotent_processor/
az3
  • Thank you for your answer. I cannot use SQS due to the high throughput. The high throughput is also the reason why I'm benchmarking several solutions with different durable storages (MySQL / PgSQL / Aurora / Elasticsearch / DynamoDB). The best way to temporarily store the event IDs would be Redis, but ElastiCache cannot guarantee data durability. That's why I was looking for alternative ways of doing it. – Antonio Mar 28 '17 at 13:37
  • Redis gives you strict transaction tracking, but it is single-node, and RDS is too slow, you are right. DynamoDB seems to be your only PaaS option. If you are willing to manage EC2 instances, however, you could try in-memory clustered solutions such as Hazelcast or VoltDB (on a lot of r3 nodes)? – az3 Mar 28 '17 at 14:44
  • In-memory databases are not durable. If your Hazelcast cluster fails, you cannot tell which messages you have already processed. :( – Antonio Mar 28 '17 at 19:09