I am trying to move my Python code from using dynamodb to dynamodb2 to get access to the global secondary index capability. One concept that is a lot less clear to me in ddb2 than in ddb is that of a batch. Here's one version of my new code, which is basically my original ddb code, modified:
item_pIds = []
batch = table.batch_write()
count = 0
while True:
    m = inq.read()
    count = count + 1
    mStr = json.dumps(m)
    pid = m['primaryId']
    if pid in item_pIds:
        print "pid=%d already exists in the batch, ignoring" % pid
        continue
    item_pIds.append(pid)
    sid = m['secondaryId']
    item_data = {"primaryId": pid, "secondaryId": sid, "message": mStr}
    batch.put_item(data=item_data)
    if count >= 25:
        batch = table.batch_write()
        count = 0
        item_pIds = []
So what I am doing here is getting (JSON) messages from a queue. Each message has a primaryId and a secondaryId. The secondaryId is not unique, in that I might get several messages at about the same time that have the same secondaryId. The primaryId is sort of unique: if I get a set of messages at about the same time that have the same primaryId, that's bad. However, from time to time, say once in a few hours, I may get a message that needs to override an existing message with the same primaryId. So this seems to align well with the statement from the dynamodb2 documentation page, similar to that of ddb:
DynamoDB’s maximum batch size is 25 items per request. If you attempt to put/delete more than that, the context manager will batch as many as it can up to that number, then flush them to DynamoDB and continue batching as more calls come in.
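To make sure I'm reading that correctly, here is a pure-Python sketch (no boto calls; flush_in_batches and flush are names I made up) of how I picture the context manager accumulating puts and flushing every 25:

```python
def flush_in_batches(items, flush, batch_size=25):
    """Accumulate items and call flush(chunk) every batch_size items,
    plus a final flush for any remainder. This is just my mental model
    of what batch_write's context manager does, not boto's actual code."""
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) >= batch_size:
            flush(buf)   # would correspond to one BatchWriteItem request
            buf = []
    if buf:
        flush(buf)       # final partial batch on exit
```

If that model is right, then with 60 items I'd expect three requests of sizes 25, 25 and 10.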
However, what I noticed is that a large chunk of messages that I get through the queue never make it to the database. That is, when I try to retrieve them later, they are not there. So I was told that a better way of handling batch writes is by doing something like this:
with table.batch_write() as batch:
    while True:
        m = inq.read()
        mStr = json.dumps(m)
        pid = m['primaryId']
        sid = m['secondaryId']
        item_data = {"primaryId": pid, "secondaryId": sid, "message": mStr}
        batch.put_item(data=item_data)
That is, I only call batch_write() once, similar to how I would open a file once and then write into it continuously. But in this case, I don't understand what the "rule of 25 max" means: when does a batch start and end? And how do I check for duplicate primaryIds? Remembering all messages I have ever received through the queue is not realistic, since (i) I have too many of them (the system runs 24/7) and (ii) as I stated before, occasional repeated ids are OK.
Sorry for the long message.