9

So currently I am storing all thumbnails in a single directory, with the file name being the md5 hash of the full path to the full-size image. But I've read that this causes issues once a directory reaches thousands of files: the Linux file system locates them more and more slowly.

What alternatives do I have, considering I can only locate the thumbnail by the original image path? Dates would be the best option, like year/month/day/md5_hash.jpg, but that would require me to store and read the date from somewhere, so it would add some extra steps.

I was thinking of splitting the md5: first two characters = subfolder name, rest = file name. That would give me 16*16 = 256 subfolders, but I'd like to hear better options, thanks!
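
For concreteness, a minimal sketch of that split (the root directory and function name are mine, not anything existing):

```python
import hashlib
from pathlib import Path

THUMB_ROOT = Path("/var/thumbs")  # hypothetical root for all thumbnails

def thumb_path(original_path: str) -> Path:
    """Map an original image path to its sharded thumbnail path."""
    digest = hashlib.md5(original_path.encode("utf-8")).hexdigest()
    # The first two hex characters select one of 16*16 = 256 subfolders.
    return THUMB_ROOT / digest[:2] / (digest[2:] + ".jpg")
```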


Another idea I just got: create a separate server for organizing thumbnails. The server would keep track of thumbnail counts, create additional folders when a certain limit is reached, and reuse old folders when thumbs are removed. The downside is that I'd need a separate DB that maps hashes to thumbnail paths :(

Alex
  • It's not clear to me what problem you are trying to solve, or even whether it really is a problem. You want to optimise *"efficiency"*, but what do you mean? Least wasted space on disk? Fastest lookup time? Do you need the reverse mapping, where you have the thumbnail name and want the hi-res image, or just the forward one, where you have the hi-res image and want the thumbnail? How many images do you have? What happens if you rename a directory of hi-res images? – Mark Setchell Jul 11 '20 at 11:35
  • How big are the hi-res images? How big are the thumbnails? Are the hi-res images JPEG? Have you considered storing the thumbnails inside the hi-res images? Is startup time important? Is your app distributed - you could load thumbnails into Redis maybe. – Mark Setchell Jul 11 '20 at 11:40

5 Answers

6

We use FreeBSD (file system UFS), not Linux, so some details may be different.

Background

We have several million files on this system that need to be served as quickly as possible from a website, for individual access. The system we have been using has worked very well over the last 16 years.

Server 1 (named: Tom) has the main user website with a fairly standard Apache set-up and a MySQL database. Nothing special at all.

Server 2 (named: Jerry) is where the user files are stored and has been customised for speedy delivery of these small files.

Jerry's hard drive is tweaked at filesystem-creation time to make sure we do not run out of inodes - something you need to consider when storing millions of small files.

Jerry's Apache config is tweaked for very short connection times and single file access per connection. Without these tweaks, you will have open connections sitting there wasting resources. This Apache config would not suit the main system (Tom) at all and would cause a number of issues.

As you are serving "thumbnails", which tend to be requested many at a time rather than as individual files, you might need a slightly different structure. To be honest, I do not know enough about your needs to really advise what would be best for your webserver config.

Historically, we used multiple SCSI drives across a number of servers. At the moment we have a single server with 300 MB/s drives. The business has been in decline for a while (thanks to Facebook), but we are still doing more than 2 million file requests per day. At our peak it was more like 10 million per day.

Our structure (a possible answer)

Everything on Jerry is tweaked for the small file delivery and nothing else.

Jerry is a webserver, but we treat it more like a database. Everything that is not needed is removed.

Each file is given a 4-character ID. The ID is alphanumeric (0-9, a-z, A-Z). This gives you 61*61*61*61 combinations (13,845,841 IDs).

We have multiple domains as well, so each domain has its own pool of 13,845,841 IDs. On the popular domains we got very close to this limit before Facebook came along; we had plans ready to go that would allow 5-character IDs, but never needed them in the end.

File system look-ups are very fast if you know the full path to the file. It is only slow if you need to scan for file matches. We took full advantage of this.

Each 4-character ID maps to a series of directories: for example, aBc9 is stored at /path/to/a/B/c/9.

That is a very high number of unique IDs across only 4 directory levels, with each directory holding at most 61 sub-directories - fast look-ups without flooding the file system index.
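
A sketch of that mapping, assuming the /path/to root from the example above (the helper names are mine):

```python
import random
import string
from pathlib import Path

DATA_ROOT = Path("/path/to")                     # root from the example above
ALPHABET = string.digits + string.ascii_letters  # 0-9, a-z, A-Z

def id_to_dir(file_id: str) -> Path:
    """aBc9 -> /path/to/a/B/c/9: one directory level per ID character."""
    return DATA_ROOT.joinpath(*file_id)

def new_id() -> str:
    """Draw a random 4-character ID (collision handling omitted)."""
    return "".join(random.choices(ALPHABET, k=4))
```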

Located in directory ./9 (the last directory of the ID) are the necessary metadata files and the raw data file. The metadata lives under a known file name, and so does the data file. We also have other known files in each folder, but you get the idea.

If a user is updating or checking the metadata, the ID is known, so the metadata is simply returned.

If the data file is requested, again, the ID is known, so the data is returned. No scanning or complex checking is performed.

If the ID is invalid, an invalid result is returned.

Nothing complex, everything for speed.

Our issues

When you are talking about millions of small files, it is possible to run out of inodes. Be sure to factor this into the disk creation for the server from the start. Plan ahead.
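
One way to keep an eye on inode headroom from code - a sketch using os.statvfs, which reports inode totals on POSIX systems:

```python
import os

def inode_usage(mount_point: str) -> float:
    """Fraction of inodes in use on the filesystem holding mount_point."""
    st = os.statvfs(mount_point)
    if st.f_files == 0:  # some filesystems report no inode counts
        return 0.0
    return 1.0 - st.f_ffree / st.f_files

# e.g. have a cron job alert when the thumbnail volume passes 90%:
# if inode_usage("/var/thumbs") > 0.90: ...
```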

We disabled and / or edited a number of the FreeBSD system checks. The maintenance cronjobs are not designed for systems with so many files.

The Apache configuration was a bit of trial and error to get just right. When you do get it, the relief is huge. Apache's mod_status is very helpful.

The very first thing to do is disable all log files. Next, disable everything and re-add only what you need.

The code for the delivery (and saving) of the metadata and raw data is also very optimised. Forget code libraries. Every line of code has been checked and re-checked over the years for speed.

Conclusion

If you really do have a lot of thumbnails, split the system. Serve the small files from a dedicated server that has been optimised for that reason. Keep the main system tweaked for more standard usage.

A directory based ID system (be that random 4 characters or parts of an MD5) can be fast so long as you do not need to scan for files.

Your base operating system will need to be tweaked so the system checks are not sucking up your system resources.

Disable the webserver logfile creation. You are almost never going to need it, and it will create a bottleneck on the file system. If you need stats, you can get a general overview from mod_status.

To be very honest, not enough information is really known about your individual case and needs. I am unsure if any of my personal experience would be of help.

Good luck!

Tigger
    I've done a very similar thing using GUIDs for filenames in a tree of directories named from the first few chars. This was not as efficient as yours (I'd have only 16 directories in each directory). A refinement was to use file date as a determinant for how deep in the tree a file would be. So I started off with 4 levels of directory, and when that started looking too busy at the 4th level I'd create a fifth, with new files going in at that level. Remembering the date of that decision was all that was needed to decide whether to plumb 4 or 5 directories deep when looking up a file. And, so on. – bazza Jul 13 '20 at 19:12
3

The simplest, most efficient and most minimal method I know is SeaweedFS.

Since 2017 I have been using SeaweedFS to store about 4 million JPEGs every 24 hours. Currently the DB holds over 2 billion records. I have never had an issue with it, and it saves a lot of disk space compared to storing the images as individual file-system files.

Below is the author's intro:

SeaweedFS is a simple and highly scalable distributed file system. There are two objectives:

  1. to store billions of files!
  2. to serve the files fast!

Details:

My project stores 2 images for each event: one thumbnail and one full frame. In the first phase of the project I stored the images as files with the directory structure year/month/day/[thumb|full].jpg, but after a few days browsing through the files became a nightmare and the disk response was slow, and deleting a large number of files (over a million) would take hours. So I researched how the big guys (Google, Facebook, Instagram, Twitter) store billions of images, found a couple of YouTube videos explaining parts of their architectures, and then came across SeaweedFS. I gave it a try, took a quick look at the source code (release 0.76), and everything seemed fine - no fishy code. The only note was that the logo is fetched over a CDN rather than locally.

The beauty of SeaweedFS lies in its simplicity and stability; it's something of a hidden gem (until now, I guess). Besides its ability to store billions of files and serve them in a flash of milliseconds, it auto-purges files based on a TTL. That is a very useful feature, since most customers have a finite amount of storage and can't keep all the data forever. The second thing I love about it is how much storage it saves. For example:

On my server each file consumed a multiple of 8 KB of disk space (due to the file system's block size), so even though most of my thumbnails were only 1 or 2 KB, each one consumed 8 KB. Add up all those wasted bytes and you lose a large percentage of your storage. In SeaweedFS, each file's metadata takes only an extra 40 bytes.
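
For reference, the basic write path against SeaweedFS's documented HTTP API looks roughly like this - a sketch assuming a master on localhost:9333 and the Python requests library; check the docs for your version:

```python
import requests

MASTER = "http://localhost:9333"  # assumed address of the SeaweedFS master

def store_thumbnail(data: bytes, ttl: str = "30d") -> str:
    """Upload one file and return the fid needed to read it back."""
    # Ask the master to assign a file id and a volume server.
    assign = requests.get(f"{MASTER}/dir/assign", params={"ttl": ttl}).json()
    fid, volume = assign["fid"], assign["url"]
    # Upload the bytes to the assigned volume server; the TTL must match.
    resp = requests.post(
        f"http://{volume}/{fid}",
        params={"ttl": ttl},
        files={"file": ("thumb.jpg", data, "image/jpeg")},
    )
    resp.raise_for_status()
    return fid  # read back later with GET http://<volume>/<fid>
```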

Hope that helps.

Jawad Al Shaikh
1

If you use the first 2 characters of the md5 as the folder name, the hashes spread evenly, so every subfolder grows at roughly the same rate; once you have enough thumbnails, each subfolder again holds thousands of files and you run into the same slow-filesystem problem.

Can you please share the directory structure where the original images are stored?

Maybe you can create the thumbnail directory structure based on the creation date of the original image?

Suppose the original image was created on 3rd May 2019; then the thumbnail path could be thumbnails/52019/abc123.jpg (consider abc123 to be the hash).

So, to locate the above thumbnail, you need to (sketched in code after the list):

  1. Read the creation date of the original image
  2. Compute the md5 hash of the original image's full path (In this case, it's abc123)
  3. Go to the thumbnails folder
  4. Locate the subfolder, based on creation date of original image. In this case, it's 52019
  5. Search for the file, using hash of the original image's full path
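
A sketch of those five steps (helper and root names are mine; the month+year folder follows the 52019 example):

```python
import hashlib
from datetime import datetime
from pathlib import Path

THUMB_ROOT = Path("thumbnails")  # assumed root of the thumbnail tree

def find_thumbnail(original: Path) -> Path:
    # Step 1: read the creation date of the original image
    # (st_mtime used here; true creation time is filesystem-dependent).
    created = datetime.fromtimestamp(original.stat().st_mtime)
    # Step 2: compute the md5 hash of the original image's full path.
    digest = hashlib.md5(str(original.resolve()).encode()).hexdigest()
    # Steps 3-5: build thumbnails/<month><year>/<hash>.jpg, e.g. 52019.
    return THUMB_ROOT / f"{created.month}{created.year}" / f"{digest}.jpg"
```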

Hope this answers your question well.

Faraaz Malak
1

I've read that this causes issues once a directory reaches thousands of files

  1. Looks like premature optimization to me. You worry about thousands, but right now I have around 10,000 files in the ~/.cache/thumbnails directory and I have no problems with that. How many thumbnails do you really need? Make them! And then test your performance.

  2. Where have you read it? What were the exact issues described there? Because from this and this you can see that even with half a million files in a single directory you can access them quite fast. Yes, you'll have a hard time with huge directories when you use some tools (like ls), but you can certainly write your server to do better.

  3. And, as an option, you can create a parallel directory structure, so that for a file z/y/x/image.png the thumbnail goes to thumbnails/z/y/x/image.png (sketched after this list). That way you'll have the benefits of:

    1. human readability
    2. easy diff of directory trees of original images and thumbnails in case of bugs
    3. no need for md5 hashes
    4. simpler code in case you need some batch operations (like deleting all thumbnails for files from z/y/x/)

    It can also be more efficient. But I'm not sure - test it.
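
A sketch of that parallel layout, with assumed roots for the originals and the thumbnails:

```python
import shutil
from pathlib import Path

ORIGINALS = Path("/data/images")      # assumed root of the originals
THUMBS = Path("/data/thumbnails")     # assumed root of the mirror tree

def thumb_for(original: Path) -> Path:
    """/data/images/z/y/x/image.png -> /data/thumbnails/z/y/x/image.png"""
    return THUMBS / original.relative_to(ORIGINALS)

def drop_thumbs_for(subdir: Path) -> None:
    """Batch operation: delete all thumbnails for originals under subdir."""
    shutil.rmtree(THUMBS / subdir.relative_to(ORIGINALS), ignore_errors=True)
```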

x00
  • ok maybe it is premature ejaculation, but why is that a bad thing? It's better to optimize the software now, than later when it becomes much more complex – Alex Jul 11 '20 at 18:13
  • @Alex, yes, but no :) Sometimes that's true, but this particular decision will surely be local to a single function, so if you decide to change it later, that change touches one function only. I imagine that will take less time than you'll spend guessing at the best option now, so the complexity of the whole app shouldn't matter – x00 Jul 11 '20 at 18:19
1

I'm not sure what kind of application you're building, but depending on the number of users, the speed of your server, and how often thumbnails are accessed, you could maybe use a cache-like system: store the generated thumbnails as you propose, with MD5 hashes, and remove them after a certain amount of time. If thumbnails are accessed mostly when images are first put on the server, and their use declines over time, you can just remove them (in the middle of the night, or whenever the system is least used) and regenerate them if they are needed again, provided that doesn't happen a lot.
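
A sketch of such an expiry sweep, assuming all thumbnails live in one flat directory and file access time is a usable signal:

```python
import os
import time
from pathlib import Path

MAX_AGE = 30 * 24 * 3600  # hypothetical: drop thumbnails unused for 30 days

def sweep(thumb_root: Path) -> None:
    """Run nightly; deleted thumbnails are regenerated on the next request.
    Note: st_atime is only meaningful if the volume is not mounted noatime."""
    cutoff = time.time() - MAX_AGE
    for entry in os.scandir(thumb_root):
        if entry.is_file() and entry.stat().st_atime < cutoff:
            os.unlink(entry.path)
```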

Another option, depending on the directory structure of your original files, is to separate the originals into directories and store each thumbnail in a directory alongside its original. That way, if you know the path of the original, you already know a large part of the path of the thumbnail.
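
That mapping becomes a one-liner (the .thumbs directory name is my own choice):

```python
from pathlib import Path

def thumb_path(original: Path) -> Path:
    """Keep each thumbnail in a .thumbs folder next to its original."""
    return original.parent / ".thumbs" / original.name
```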

Luctia