19

According to the Amazon Kinesis Streams documentation, a record can be delivered multiple times.

The only way to be sure to process every record just once is to temporarily store them in a database that supports integrity checks (e.g. DynamoDB, ElastiCache, or MySQL/PostgreSQL), or simply to checkpoint the RecordId for each Kinesis shard.
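
For illustration, the "integrity check" option could look roughly like this with PostgreSQL and psycopg2; the table, column, and the process_record callback are placeholders, not part of any real setup:

```python
# Rough sketch: deduplicate Kinesis records via a UNIQUE/PRIMARY KEY constraint.
# Table "processed_records", column "record_id" and process_record() are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=dedup user=app")  # placeholder connection string

def handle(record_id, payload, process_record):
    """Process a Kinesis record only if its id has not been seen before."""
    with conn, conn.cursor() as cur:
        # The primary-key constraint on record_id is the integrity check:
        # a duplicate insert simply affects zero rows.
        cur.execute(
            "INSERT INTO processed_records (record_id) VALUES (%s) "
            "ON CONFLICT (record_id) DO NOTHING",
            (record_id,),
        )
        if cur.rowcount == 1:      # first time this record is seen
            process_record(payload)
        # rowcount == 0 means a duplicate delivery; skip it
```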

Do you know a better / more efficient way of handling duplicates?

Antonio

2 Answers

20

We had exactly that problem when building a telemetry system for a mobile app. In our case we were also not sure that producers were sending each message exactly once, so for each received record we calculated its MD5 on the fly and checked whether it was already present in some form of persistent storage; which storage to use is indeed the trickiest bit.

Firstly, we tried a plain relational database, but it quickly became a major bottleneck of the whole system, as this is not just a read-heavy but also a write-heavy case, and the volume of data going through Kinesis was quite significant.

We ended up with a DynamoDB table storing the MD5 of each unique message. The issue we had was that it wasn't easy to delete old messages: even though our table had partition and sort keys, DynamoDB does not allow dropping all records with a given partition key, so we had to query all of them to get the sort key values (which wastes time and capacity). Unfortunately, we just had to drop the whole table once in a while. Another, also suboptimal, solution is to regularly rotate the DynamoDB tables that store the message identifiers.

However, DynamoDB recently introduced a very handy feature, Time To Live, which means we can now control the size of the table by enabling auto-expiry on a per-record basis. In that sense DynamoDB seems quite similar to ElastiCache; however, ElastiCache (at least a Memcached cluster) is much less durable: there is no redundancy, and all data residing on terminated nodes is lost during a scale-in operation or a failure.
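
For illustration, the conditional-put-plus-TTL approach described above might look roughly like this with boto3; the table name, attribute names, and TTL window are placeholders, and the table is assumed to have TTL enabled on the chosen attribute:

```python
# Sketch: one DynamoDB item per message MD5, written with a conditional put so
# duplicates are rejected, plus a TTL attribute so old entries expire automatically.
# Table "message_dedup" and attributes "md5" / "expires_at" are made up here.
import hashlib
import time

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("message_dedup")

def seen_before(message_bytes, ttl_seconds=24 * 3600):
    digest = hashlib.md5(message_bytes).hexdigest()
    try:
        table.put_item(
            Item={"md5": digest, "expires_at": int(time.time()) + ttl_seconds},
            # Fails if an item with this MD5 already exists -> duplicate message
            ConditionExpression="attribute_not_exists(md5)",
        )
        return False          # new message, safe to process
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True       # duplicate, skip it
        raise
```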

Dmitry Deryabin
  • Hi Dmitry. I was running several benchmarks using something similar to the JustGiving infrastructure explained here: https://aws.amazon.com/blogs/compute/serverless-cross-account-stream-replication-using-aws-lambda-amazon-dynamodb-and-amazon-kinesis-firehose/ . Why did you compute an MD5 checksum instead of using ShardId + SequenceNumber for your DDB table? – Antonio Apr 06 '17 at 23:16
  • Hi @Antonio. In our case it was possible that a producer would post the same message multiple times. When that happened, Kinesis would consider them different messages anyway (simply because there were two or more posts from the producer). As we knew that every message had to be unique, we simply disregarded messages whose MD5 had already been seen. Also, the MD5 was calculated by the producers, saving some compute time for the consumers (given the relatively large volume of data going through Kinesis). – Dmitry Deryabin Apr 07 '17 at 10:35
  • Just wanted to throw out there - AWS notes that different producers can naturally produce the same record multiple times due to error cases, and, more commonly, multiple consumers can pull the same set of records. I'm dealing with this on our system now too. We use Elasticsearch, and the plan for the moment is to use Elasticsearch's built-in versioning to ensure that the same record isn't updated at the same time, and then memoize a list of recent events applied to a record on the record itself. – genexp Sep 05 '17 at 13:29
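
For illustration, the optimistic-concurrency idea genexp mentions above could be sketched with Elasticsearch's external versioning over the plain REST API; the index name, document id, and version source are assumptions, and a reasonably recent Elasticsearch is assumed for the _doc endpoint:

```python
# Sketch: with version_type=external, Elasticsearch rejects a write (HTTP 409)
# unless the supplied version is higher than the stored one, so a stale or
# duplicate update of the same record is refused. Index "records" is hypothetical.
import requests

ES = "http://localhost:9200"

def upsert_with_version(doc_id, doc, version):
    r = requests.put(
        f"{ES}/records/_doc/{doc_id}",
        params={"version": version, "version_type": "external"},
        json=doc,
    )
    if r.status_code == 409:
        return False   # conflict: the record already exists at this or a newer version
    r.raise_for_status()
    return True
```
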
14

The problem you mention is a general problem of all queue systems that use an "at least once" approach. And it is not just the queue systems: both producers and consumers may process the same message multiple times (due to ReadTimeout errors etc.). Kinesis and Kafka both use that paradigm. Unfortunately, there is no easy answer.

You may also try an "exactly-once" message queue with a stricter transaction approach. For example, AWS SQS does that with FIFO queues: https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/ . Be aware that SQS throughput is far smaller than Kinesis's.
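
For illustration, sending to an SQS FIFO queue with boto3 looks roughly like this; the queue URL and ids are placeholders. SQS drops a second send that reuses the same MessageDeduplicationId within its 5-minute deduplication window:

```python
# Sketch: exactly-once submission to an SQS FIFO queue.
import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/events.fifo",  # placeholder
    MessageBody='{"event": "user_signup"}',
    MessageGroupId="events",                 # ordering scope within the FIFO queue
    MessageDeduplicationId="event-0001",     # duplicates with the same id are dropped
)
```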

To solve your problem, you should be aware of your application domain and try to solve it internally, as you suggested (database checks). Especially when you communicate with an external service (say, an email server), you should be able to recover the operation state in order to prevent double processing (because double sending, in the email server example, may result in multiple copies of the same message in the recipient's mailbox).
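
For illustration, a minimal "idempotent processor" for the email example could claim each record with a conditional write before performing the side effect; the table name, key, and send_email function below are hypothetical:

```python
# Sketch: mark the operation as claimed *before* the side effect, so a replayed
# record cannot trigger a second email. Table "email_operations" is made up here.
import boto3
from botocore.exceptions import ClientError

ops = boto3.resource("dynamodb").Table("email_operations")

def send_once(record_id, recipient, body, send_email):
    try:
        ops.put_item(
            Item={"record_id": record_id, "status": "claimed"},
            ConditionExpression="attribute_not_exists(record_id)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # this record was already claimed (and presumably sent); skip
        raise
    send_email(recipient, body)
    ops.update_item(
        Key={"record_id": record_id},
        UpdateExpression="SET #s = :sent",
        ExpressionAttributeNames={"#s": "status"},   # "status" is a reserved word
        ExpressionAttributeValues={":sent": "sent"},
    )
```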

See also the following concepts:

  1. At-least-once Delivery: http://www.cloudcomputingpatterns.org/at_least_once_delivery/
  2. Exactly-once Delivery: http://www.cloudcomputingpatterns.org/exactly_once_delivery/
  3. Idempotent Processor: http://www.cloudcomputingpatterns.org/idempotent_processor/
az3
  • Thank you for your answer. I cannot use SQS due to the high throughput. The high throughput is also the reason why I'm benchmarking several solutions with different durable storages (MySQL / PgSQL / Aurora / Elasticsearch / DynamoDB). The best way to temporarily store the event IDs would be Redis, but ElastiCache cannot guarantee data durability. That's why I was looking for alternative ways of doing it. – Antonio Mar 28 '17 at 13:37
  • Redis gives you strict transaction tracking, but it is single-node, and RDS is too slow, you are right. DynamoDB seems to be your only PaaS option. If you are willing to manage EC2 instances, however, you could try in-memory clustered solutions such as Hazelcast or VoltDB (on a lot of r3 nodes)? – az3 Mar 28 '17 at 14:44
  • In-memory databases are not durable. If your Hazelcast cluster fails, you cannot tell which messages you have already processed. :( – Antonio Mar 28 '17 at 19:09