1

I'm setting up a Kubernetes cluster with many different components for our application stack, and I'm trying to balance storage requirements while minimizing the number of components.

We have a web scraper that downloads tens of thousands of HTML files (and maybe PDFs) every day, and I want to store these somewhere (along with some JSON metadata). I want the files stored in a redundant, scalable way, but millions of small files seem like a bad fit for e.g. GlusterFS.

At the same time we have some very large binary files used by our system (several gigabytes each) and also probably many smaller binary files (tens of MB). These do not seem like a good fit for any distributed NoSQL DB like MongoDB.

So I'm considering using MongoDB + GlusterFS to address these two needs separately, but I would rather reduce the number of moving pieces and just use one system. I have also read various warnings about using GlusterFS without e.g. Red Hat support (which we definitely will not have).

Can anyone recommend an alternative? I am looking for a distributed binary object store that is easy to set up and maintain and supports both small and large files. One advantage of our setup is that files will rarely be updated or deleted (just written and then read), and we don't even need indexing (that will be handled separately by Elasticsearch) or high-speed reads.

  • Ceph, maybe? Both are well supported by k8s per http://kubernetes.io/docs/user-guide/persistent-volumes/, so I don't think this has anything to do with k8s; it's more about Ceph vs. GlusterFS. I don't want to post any Google result here, as every comparison has a certain level of bias. You'd better avoid MongoDB if possible. – Hang Jan 05 '17 at 02:48
  • Thanks for your response. Can you comment on why I should avoid Mongo? Is Ceph much better for many small files than GlusterFS? For the latter, latency becomes a problem. – Prefer Anon Jan 05 '17 at 13:41
  • It's really not a question for SO; I know moderators don't like "which one is better" questions. There were some helpful discussions here: https://www.reddit.com/r/sysadmin/comments/2t85ya/anyone_using_glusterfs/. In short, the key point is to build your own test cases (nobody knows how "small" is small) and try them out on both systems. For MongoDB, start from here: https://www.reddit.com/r/programming/search?q=mongodb – Hang Jan 10 '17 at 12:07
  • You may find [this](https://stackoverflow.com/a/53159654/2361497) answer interesting. – Vitaly Isaev Nov 05 '18 at 17:58

1 Answer

2

Are you in a cloud? If you're in AWS, S3 would be a good fit. Object storage sounds like what you want, but I'm not sure of your requirements.

If not in a cloud, you could run Minio (https://www.minio.io/), which would give you the same type of object storage that S3 gives you.
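
To give a sense of what that looks like, here's a minimal sketch of pushing scraped files into a Minio bucket through its S3-compatible API with boto3. The endpoint, bucket name, credentials, and file names below are placeholders for whatever your deployment uses:

```python
# Minimal sketch: writing scraped files to a Minio bucket via the S3 API.
# Endpoint, bucket, credentials, and file names are placeholders.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example.local:9000",  # Minio speaks the S3 API
    aws_access_key_id="MINIO_ACCESS_KEY",
    aws_secret_access_key="MINIO_SECRET_KEY",
)

# Create the bucket once; ignore the error if it already exists
try:
    s3.create_bucket(Bucket="scraped-pages")
except ClientError:
    pass

# One scraped HTML page plus its JSON metadata, keyed by date
s3.upload_file("page-12345.html", "scraped-pages", "2017/01/05/page-12345.html")
s3.upload_file("page-12345.json", "scraped-pages", "2017/01/05/page-12345.json")

# Multi-gigabyte binaries go through the same call; boto3 switches to
# multipart upload automatically for large files
s3.upload_file("big-blob.bin", "scraped-pages", "blobs/big-blob.bin")
```

Since objects are written once and then only read, a flat write-once key scheme like this is usually all you need; listing and search can stay in Elasticsearch as you planned.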

I do something similar now: I store binary documents in MongoDB, and we back the nodes with EBS volumes.
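
The MongoDB side of that is presumably GridFS. For reference, a minimal sketch of that pattern with pymongo looks like the following (local mongod and file names are placeholders):

```python
# Minimal sketch of the GridFS pattern: mongod address and file names are
# placeholders. GridFS splits files into chunks (255 kB by default), so the
# same collections hold both small and large binaries.
from pymongo import MongoClient
import gridfs

client = MongoClient("mongodb://localhost:27017")
db = client["documents"]
fs = gridfs.GridFS(db)

# Store a binary document; the returned ObjectId is your reference to it
with open("report.pdf", "rb") as f:
    file_id = fs.put(f, filename="report.pdf", contentType="application/pdf")

# Read it back chunk by chunk
with open("report-copy.pdf", "wb") as out:
    for chunk in fs.get(file_id):
        out.write(chunk)
```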

Steve Sloka
  • Thanks Steve, Minio looks very interesting; I will check it out (I'm not in a cloud, these are bare-metal servers). My concern is that some of my binary docs will be gigabytes in size, and apparently GridFS is not built for this (because it has to reconstruct the objects in memory). – Prefer Anon Jan 11 '17 at 09:07