What's the best practice for using kedro with MongoDB or other document databases? MongoDB, for example, doesn't have a query language analogous to SQL. Most Mongo "queries" in Python (using PyMongo) will look something like this:

from pymongo import MongoClient
client = MongoClient(...)  # credentials go here

posts = client.test_database.posts

posts.find_one({"author": "Mike"})

And then you'll get something back like this:

{'_id': ObjectId('...'),
 'author': 'Mike',
 'date': datetime.datetime(...),
 'tags': ['mongodb', 'python', 'pymongo'],
 'text': 'My first blog post!'}

Now my question is: where should the logic go to find this post and then parse it into a dataframe? It doesn't seem appropriate to try to create a MongoQueryDataSet class, because you'll end up having to wrap the entire PyMongo API with clunky yaml arguments if you want to support things like inserts, aggregations, etc.

Should a MongoDataSet class just return a MongoClient object and capture any further logic in a Kedro node?

In general, where should data loading logic live when you're working with databases that have these functional (non-SQL) APIs without simple query strings?

1 Answer

Where should the logic go to find this post and then parse it into a dataframe?

IMO, a MongoDataSet is not such a bad idea. Kedro already ships quite a number of contrib datasets wrapping IO logic for various sources, so a MongoDataSet fits this pattern fairly naturally.

You'll end up having to wrap the entire PyMongo API with clunky yaml arguments if you want to support things like inserts, aggregations, etc.

I would say it's not a strong requirement to wrap the entire PyMongo API right away. Even if your dataset is only capable of doing find() on load and insert_many() on save, that's already a good start. A sketch of such a minimal dataset follows below.
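
For illustration, here is a minimal sketch of what that could look like against Kedro's AbstractDataSet interface. The class name, constructor arguments, and query semantics are all illustrative choices, not an established Kedro dataset:

import pandas as pd
from pymongo import MongoClient
from kedro.io import AbstractDataSet

class MongoDataSet(AbstractDataSet):
    """Minimal sketch: find() on load, insert_many() on save."""

    def __init__(self, uri, database, collection, query=None):
        self._uri = uri
        self._database = database
        self._collection = collection
        self._query = query or {}

    def _load(self) -> pd.DataFrame:
        collection = MongoClient(self._uri)[self._database][self._collection]
        # find() yields plain dicts, which pandas can consume directly
        return pd.DataFrame(list(collection.find(self._query)))

    def _save(self, data: pd.DataFrame) -> None:
        collection = MongoClient(self._uri)[self._database][self._collection]
        collection.insert_many(data.to_dict("records"))

    def _describe(self):
        return dict(database=self._database,
                    collection=self._collection,
                    query=self._query)

With that in place, a catalog entry could look roughly like this (the module path is an assumption about where you put the class in your project):

# catalog.yml
mike_posts:
  type: my_project.io.mongo_dataset.MongoDataSet
  uri: mongodb://localhost:27017
  database: test_database
  collection: posts
  query:
    author: Mike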

Should a MongoDataSet class just return a MongoClient object and capture any further logic in a kedro node?

Kedro's philosophy is that nodes should be pure Python functions, and returning a client gives nodes too much "control" over how data is loaded and saved. It also breaks interchangeability between datasets: if you (or someone else) later decide to drop MongoDataSet and swap in something else (e.g., JSONLocalDataSet or JSONBlobDataSet), a "pure" node will keep working unchanged, whereas a node that receives a MongoClient will need its logic rewritten too, which is exactly what Kedro recommends avoiding.
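
To make the contrast concrete, here is an illustrative pure node (hypothetical name and column) that only sees a DataFrame, so the dataset feeding it can be swapped freely:

import pandas as pd

def count_posts_per_author(posts: pd.DataFrame) -> pd.DataFrame:
    # This node is agnostic to the source: the "posts" catalog entry can be
    # backed by MongoDB, a local JSON file, or blob storage without any
    # change to this function.
    return posts.groupby("author").size().reset_index(name="n_posts")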


As another option that doesn't require creating a new dataset, you may also consider kedro.io.LambdaDataSet, where you provide your own hooks for save and load. Note, however, that LambdaDataSet can't be defined in catalog.yml and has to be added to the DataCatalog "manually" on the Python side.
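
A rough sketch of that wiring, with hypothetical function names and connection details:

import pandas as pd
from pymongo import MongoClient
from kedro.io import DataCatalog, LambdaDataSet

client = MongoClient("mongodb://localhost:27017")  # credentials go here
posts = client.test_database.posts

def load_mike_posts() -> pd.DataFrame:
    # any PyMongo call can live here, e.g. an aggregation pipeline
    return pd.DataFrame(list(posts.find({"author": "Mike"})))

def save_mike_posts(data: pd.DataFrame) -> None:
    posts.insert_many(data.to_dict("records"))

catalog = DataCatalog({
    "mike_posts": LambdaDataSet(load=load_mike_posts, save=save_mike_posts),
})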

Dmitry Deryabin