9

What are shards and partition keys in a Kinesis data stream? I read the AWS documentation but I don't get it. Can someone explain them in simple terms?

John Rotenstein
Desp

1 Answer

13

From Amazon Kinesis Data Streams Terminology and Concepts - Amazon Kinesis Data Streams:

A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards.

So, a shard has two purposes:

  • A certain amount of capacity/throughput
  • An ordered list of messages
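To make the capacity side concrete, here is a rough sizing sketch based only on the per-shard limits quoted above (1 MB/s and 1,000 records/s for writes, 2 MB/s for reads). The function name `shards_needed` and the example traffic figures are made up for illustration, and real sizing should leave headroom for spikes.

```python
import math

# Per-shard limits taken from the AWS documentation quoted above.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_sec / READ_MB_PER_SHARD),
    )


# Hypothetical workload: 3.5 MB/s written as 2,500 records/s, 7 MB/s read.
print(shards_needed(3.5, 2500, 7.0))
```

Whichever limit is hit first decides the shard count, so a stream of many tiny records can need more shards than its byte rate alone would suggest.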

If your application must process all messages in order, then you can only use one shard. Think of it as a line at a bank — if there is one line, then everybody gets served in order.

However, if messages only need to be ordered within a certain subset of messages, the subsets can be sent to separate shards. For example, think of multiple lines in a bank, where each line gets served in order. Or, think of buses sending GPS coordinates. Each bus sends messages to only a single shard. A shard might contain messages from multiple buses, but each bus only sends to one shard. This way, when the messages from that shard are processed, all messages from a particular bus are processed in order.

This is controlled by using a Partition Key, which identifies the source. The partition key is hashed and assigned to a shard. Thus, all messages with the same partition key will go to the same shard.
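As a rough illustration of the hashing step: Kinesis takes the MD5 hash of the partition key, treats it as a 128-bit integer, and picks the shard whose hash key range contains that value. The sketch below assumes the ranges are split evenly across the shards, which may not hold after a stream has been resharded. The helper names and the `bus-42` key are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_ranges(shard_count):
    """Split the hash space into contiguous, evenly sized ranges (an assumed layout)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the range that contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value is outside every range")


ranges = shard_ranges(4)
# The same key always lands on the same shard.
print(shard_for_key("bus-42", ranges), shard_for_key("bus-42", ranges))
```

Because the mapping depends only on the key, every message from one bus follows the same path, which is what keeps that bus's messages in order.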

At the back-end, there is typically one worker per shard that processes the messages, in order, from that shard.
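A per-shard worker can be sketched roughly as a polling loop. The version below assumes a boto3 Kinesis client is passed in as `client` and uses its `get_shard_iterator` and `get_records` calls. The function name `process_shard`, the `handle` callback, and the `poll_interval` and `max_polls` parameters are hypothetical. In practice the Kinesis Client Library or a Lambda trigger usually handles this for you.

```python
import time


def process_shard(client, stream_name, shard_id, handle,
                  poll_interval=0.2, max_polls=None):
    """Read one shard from its oldest record and pass each record to handle()."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest available record
    )["ShardIterator"]

    polls = 0
    while iterator and (max_polls is None or polls < max_polls):
        response = client.get_records(ShardIterator=iterator, Limit=1000)
        for record in response["Records"]:
            handle(record)  # records arrive in the order they were written
        # The next iterator is missing only once a closed shard is fully read.
        iterator = response.get("NextShardIterator")
        polls += 1
        # 0.2 s between calls stays within 5 read transactions per second.
        time.sleep(poll_interval)


# Usage sketch (needs boto3 and AWS credentials; names are placeholders):
#   import boto3
#   process_shard(boto3.client("kinesis"), "my-stream",
#                 "shardId-000000000000", print)
```

Since a single loop owns the shard, records are handled strictly in sequence; running one such loop per shard is what gives per-shard ordering with parallelism across shards.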

If your system does not care about preserving message order, then use a random partition key. This means the message will be sent to any shard.
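To see the effect of a random key, the simulation below compares random UUID keys with a single fixed key. It approximates the shard choice by scaling the MD5 hash of the key onto an evenly divided hash space, which is an assumption about the shard layout. The names `shard_index` and `spread` and the `bus-42` key are hypothetical.

```python
import hashlib
import uuid
from collections import Counter


def shard_index(partition_key, shard_count):
    """Map a key to a shard index, assuming evenly divided hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // 2 ** 128


def spread(partition_keys, shard_count):
    """Count how many keys land on each shard."""
    return Counter(shard_index(key, shard_count) for key in partition_keys)


# Random keys: messages are scattered across all shards.
random_keys = [str(uuid.uuid4()) for _ in range(10_000)]
print(spread(random_keys, 4))

# One fixed key: every message lands on the same shard.
print(spread(["bus-42"] * 10_000, 4))
```

Random keys use the capacity of every shard but give up ordering between related messages, while a fixed key keeps ordering but is limited to one shard's throughput.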

John Rotenstein
  • 2
    So shards contain a set of data records. The maximum size of a data record is 1 MB and a shard can have at most 1,000 records. So does that mean the maximum size of a shard is 1,000 MB? – Desp Jun 10 '19 at 16:23
  • 2
    Shards do not quite "contain" records. Think of them as a hose, through which records pass. The hose can only accept a certain quantity of messages (water) over a period of time. If you want to send more water, you need more hoses in parallel (more shards in parallel). – John Rotenstein Jun 10 '19 at 22:29
  • @JohnRotenstein Do you know exactly what they mean by a `transaction` in `5 transactions per second for reads`? The docs don't explicitly say what a transaction is. – Paras Diwan Jan 16 '20 at 12:04
  • Found it. `GetRecords` is a read transaction which can get 10k records, and they allow 5 of these read transactions per second https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html – Paras Diwan Jan 16 '20 at 13:45
  • @JohnRotenstein do you mean *sub**s**et* of messages? – peer Apr 26 '21 at 08:33
  • @Peer Oopsie! Fixed. – John Rotenstein Apr 26 '21 at 09:18