
I am working on a script that downloads all of my images, calculates the MD5 hash, and then stores that hash in a new column in the database. I have a script that selects the images from the database and saves them locally. The image's unique id becomes the filename.

My problem is that, while cURLQueue works great for quickly downloading many files, calculating the MD5 hash of each file in a callback slows the downloads down. That was my first attempt. For my next attempt, I would like to separate the downloading and hashing parts of my code. What is the best way to do this? I would prefer to use PHP, as that is what I am most familiar with and what our servers run, but PHP's thread support is lacking, to say the least.

Thoughts are to have a parent process that establishes a SQLite connection, then spawn many children that choose an image, calculate the hash of it, store it in the database, and then delete the image. Am I going down the right path?
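
Roughly, this is what I am picturing (just a sketch; it assumes the pcntl extension is available, the table, column, and path names are made up, and the parent splits the work up front rather than having each child pick images itself):

    <?php
    // Parent reads the ids of images that still need a hash, then forks a few
    // children; each child hashes its share and writes the result back to SQLite.
    // Sketch only: `images` table with `id`/`md5` columns and the /tmp/images
    // path are made-up names.

    $db = new SQLite3('images.db');
    $ids = [];
    $result = $db->query("SELECT id FROM images WHERE md5 IS NULL");
    while ($row = $result->fetchArray(SQLITE3_ASSOC)) {
        $ids[] = $row['id'];
    }
    $db->close(); // children open their own connections; sharing one across fork() is risky

    $workers = 4;
    $chunks  = array_chunk($ids, max(1, (int) ceil(count($ids) / $workers)));

    foreach ($chunks as $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        } elseif ($pid === 0) {
            // Child process
            $db = new SQLite3('images.db');
            $db->busyTimeout(5000); // SQLite locks the whole file on write
            foreach ($chunk as $id) {
                $file = "/tmp/images/$id.jpg";
                $stmt = $db->prepare("UPDATE images SET md5 = :md5 WHERE id = :id");
                $stmt->bindValue(':md5', md5_file($file), SQLITE3_TEXT);
                $stmt->bindValue(':id', $id, SQLITE3_INTEGER);
                $stmt->execute();
                unlink($file);
            }
            exit(0);
        }
    }

    // Parent waits for all children to finish
    while (pcntl_waitpid(-1, $status) !== -1);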

Mechcozmo
  • save the file, call md5_file() in a loop – the latter – Nov 19 '12 at 20:17
  • See http://stackoverflow.com/questions/13322901/making-a-large-processing-job-small-er/13393543#13393543 – almost the same issue. – Baba Nov 19 '12 at 20:34

2 Answers


There are a number of ways to approach this, but which you choose really depends on the particulars of your project.

A simple way would be to download the images with one PHP program, place them on the file system, and add an entry to a queue table in the database. A second PHP program would then read the queue and process the entries that are waiting.

For the second PHP program, you could set up a cron job that checks regularly and processes everything that is waiting. Alternatively, you could spawn the PHP program in the background every time a download finishes. The second method is better, but a little more involved. Check out the post below for how to run a PHP script in the background.

Is there a way to use shell_exec without waiting for the command to complete?
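
For the queue-processing program, a minimal sketch could look something like this (I'm assuming SQLite via PDO since that's what you mention, and the hash_queue/images table names are just placeholders):

    <?php
    // Sketch of the second PHP program: read the queue table, hash each waiting
    // file, store the hash, then clean up. Table/column names are placeholders.

    $db = new PDO('sqlite:images.db');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $pending = $db->query("SELECT image_id, path FROM hash_queue")->fetchAll(PDO::FETCH_ASSOC);

    foreach ($pending as $job) {
        $hash = md5_file($job['path']);

        $db->prepare("UPDATE images SET md5 = ? WHERE id = ?")
           ->execute([$hash, $job['image_id']]);

        $db->prepare("DELETE FROM hash_queue WHERE image_id = ?")
           ->execute([$job['image_id']]);

        unlink($job['path']); // the local copy is no longer needed
    }

And to launch it in the background whenever a download finishes (the technique from the post linked above), something along these lines works:

    shell_exec('php /path/to/hash_worker.php > /dev/null 2>&1 &');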

Jordan Mack

I've dealt with a similar issue at work, but it needs an AMQP server like RabbitMQ.

Imagine having three PHP scripts:

  • first: adds the URLs to the queue
  • second: gets a URL from the queue, downloads the file, and adds the downloaded filename to a second queue
  • third: gets a filename from that queue and stores the MD5 in the database (see the sketch further down)

We use this approach to handle multiple image downloads/processing with Python scripts (PHP is not that far off).

You can check out some PHP libraries here and some basic examples here.

This way you can scale each worker based on its queue length: if you have tons of URLs waiting to be downloaded, you just start another instance of script #2; if you have a lot of unprocessed files, you just start another instance of script #3, and so on.
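
As a rough idea, the third script could look something like this with a recent php-amqplib (the queue name, table names, and connection details are placeholders):

    <?php
    // Sketch of script #3: consume filenames from a queue, hash each file,
    // and store the MD5 in the database. Names/paths are placeholders.
    require_once __DIR__ . '/vendor/autoload.php';

    use PhpAmqpLib\Connection\AMQPStreamConnection;

    $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
    $channel = $connection->channel();
    $channel->queue_declare('downloaded_files', false, true, false, false);

    $db = new PDO('sqlite:images.db');

    $callback = function ($msg) use ($db) {
        $path = $msg->body;          // filename published by script #2
        $hash = md5_file($path);

        // the question says each image's unique id is its filename
        $stmt = $db->prepare("UPDATE images SET md5 = ? WHERE id = ?");
        $stmt->execute([$hash, basename($path, '.jpg')]);

        unlink($path);
        $msg->ack();
    };

    $channel->basic_qos(null, 1, null); // one unacked message per worker
    $channel->basic_consume('downloaded_files', '', false, false, false, false, $callback);

    while ($channel->is_consuming()) {
        $channel->wait();
    }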

alex88