We use FreeBSD (file system UFS), not Linux, so some details may be different.
Background
We have several million files on this system that need to be served as quickly as possible from a website, for individual access. The system we have been using has worked very well over the last 16 years.
Server 1 (named: Tom) has the main user website with a fairly standard Apache set-up and a MySQL data base. Nothing special at all.
Server 2 (named: Jerry) is where the user files are stored and has been customised for speedy delivery of these small files.
Jerry's file system is tuned at creation time to make sure we do not run out of inodes, something you need to consider when creating millions of small files.
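On FreeBSD/UFS the inode count is fixed when the file system is created, so the density has to be chosen up front. A hypothetical sketch (the device name and density value are placeholders, not our actual values):

```shell
# Allocate one inode per 4 KB of data instead of the default density.
# /dev/ada0p2 is a placeholder device name.
newfs -U -i 4096 /dev/ada0p2

# Later, watch inode usage (the iused/ifree columns):
df -i
```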
Jerry's Apache config is tweaked for very short connection times and single file access per connection. Without these tweaks, you will have open connections sitting there wasting resources. This Apache config would not suit the main system (Tom) at all and would cause a number of issues.
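As a rough illustration of the kind of tweak meant here (the directive values are hypothetical, not our production settings):

```apache
# Close each connection after a single request so idle keep-alive
# connections never sit around holding worker slots.
KeepAlive Off
# Give up quickly on slow clients.
Timeout 15
```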
As you are serving "thumbnails", not individual requests, you might need a slightly different structure. To be honest, I do not know enough about your needs to really advise what would be best for your webserver config.
Historically, we used multiple SCSI drives across a number of servers. At the moment, we have a single server with 300MB/s drives. The business has been in decline for a while (thanks to Facebook), but we are still doing more than 2 million file requests per day. At our peak it was more like 10 million per day.
Our structure (a possible answer)
Everything on Jerry is tweaked for the small file delivery and nothing else.
Jerry is a webserver, but we treat it more like a database. Everything that is not needed is removed.
Each file is given a 4-character ID. The ID is alphanumeric (0-9, a-z, A-Z), which gives you 62*62*62*62 combinations (or 14,776,336 IDs).
We have multiple domains as well, so each domain has a maximum of 14,776,336 IDs. We got very close to this limit on the popular "domains" before Facebook came along, and we had plans ready to go that would allow for 5-character IDs, but did not need them in the end.
File-system look-ups are very fast if you know the full path to the file. They are only slow if you need to scan for file matches. We took full advantage of this.
Each 4-character ID maps to a series of directories; for example, aBc9 is stored at /path/to/a/B/c/9.
That is a very high number of unique IDs across only four directory levels. Each directory has a maximum of 62 sub-directories, giving fast look-ups without flooding the file-system index.
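The ID-to-path mapping described above can be sketched in a few lines. This is a minimal sketch, not the author's actual code; the base path and the validation are assumptions:

```python
import os
import string

# Full alphanumeric alphabet: 0-9, a-z, A-Z (62 characters).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def id_to_path(file_id, base="/path/to"):
    """Map a 4-character ID such as 'aBc9' to its directory path."""
    if len(file_id) != 4 or any(c not in ALPHABET for c in file_id):
        raise ValueError("invalid ID: %r" % file_id)
    # Each character becomes one directory level: a/B/c/9
    return os.path.join(base, *file_id)
```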
The last directory in the ID (./9 in the example above) holds the necessary metadata files and the raw data file. The metadata has a known file name and so does the data file. We also have other known files in each folder, but you get the idea.
If a user is updating or checking the metadata, the ID is known, so a request for the metadata is returned.
If the data file is requested, again, the ID is known, so the data is returned. No scanning or complex checking is performed.
If the ID is invalid, an invalid result is returned.
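The lookup logic just described can be sketched as follows. The file names "meta" and "data" are hypothetical stand-ins for the known file names; the point is that a single path build either hits or misses, with no scanning:

```python
import os

def fetch(file_id, kind="data", base="/path/to"):
    """Return the contents of a file for a known ID, or None for an
    invalid ID. 'meta' and 'data' are hypothetical known file names."""
    name = "meta" if kind == "meta" else "data"
    # Build the full path directly from the ID: no directory scan needed.
    path = os.path.join(base, *file_id, name)
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:  # path does not exist, so the ID is invalid
        return None
```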
Nothing complex, everything for speed.
Our issues
When you are talking about millions of small files, it is possible to run out of inodes. Be sure to factor this into the disk creation for the server from the start. Plan ahead.
We disabled and/or edited a number of the FreeBSD system checks. The maintenance cron jobs are not designed for systems with this many files.
The Apache configuration was a bit of trial and error to get just right. When you do get it, the relief is huge. Apache's mod_status is very helpful.
The very first thing to do is disable all log files. Next, disable everything and re-add only what you need.
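A hypothetical example of silencing the logs (the paths and level are illustrative, not the author's exact config):

```apache
# Discard access logs entirely and keep only emergency-level errors.
CustomLog /dev/null common
ErrorLog /dev/null
LogLevel emerg
```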
The code for the delivery (and saving) of the metadata and raw data is also very optimised. Forget code libraries. Every line of code has been checked and re-checked over the years for speed.
Conclusion
If you really do have a lot of thumbnails, split the system. Serve the small files from a dedicated server that has been optimised for that purpose. Keep the main system tweaked for more standard usage.
A directory-based ID system (whether random 4 characters or parts of an MD5) can be fast as long as you do not need to scan for files.
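For the MD5 variant mentioned here, the same idea looks like this (a sketch; the depth and base path are assumptions):

```python
import hashlib
import os

def md5_path(name, base="/path/to", depth=4):
    """Derive a directory path from the first `depth` hex characters
    of the name's MD5, one character per directory level."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return os.path.join(base, *digest[:depth], digest)
```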
Your base operating system will need to be tweaked so the system checks are not sucking up your system resources.
Disable the webserver log-file creation. You are almost never going to need it, and it will create a bottleneck on the file system. If you need stats, you can get a general overview from mod_status.
To be very honest, I do not know enough about your individual case and needs, so I am unsure how much of my personal experience will be of help.
Good luck!