
I have over 1.3 million images that I have to compare with each other, and a few hundred more are added every day.

My company takes an image and creates a version that can be used by our vendors.

The files are often very similar to each other. For example, two different companies can send us two different images, a JPG and a GIF, both with the McDonald's logo, with months between the submissions.

As a result, we end up creating the same logo twice when we could simply copy/paste the one already created, or at least suggest it as a possible starting point for the artists.

I have looked around for algorithms that create a fingerprint, or something similar, that would let me run a simple query when a new image is uploaded. Processing time is not much of an issue: at 1 second per fingerprint, the existing 1.3 million images would take about 15 days on one machine, and the savings would be large enough that we might even get 3 or 4 servers to do it.

I am fluent in PHP, but if the algorithm is in pseudocode or even C I can read it and try to translate it (unless it uses some C-specific libraries).

Currently I compute an MD5 of every image to catch the ones that are exactly the same. This question came up when I was thinking of resizing each image and running the MD5 on the resized version, to catch images that have been saved in a different format or at a different size, but even then the recognition would not be good enough.
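To illustrate why an MD5 of a resized image still misses near-duplicates, here is a minimal sketch in Python (chosen for brevity; the same idea applies to PHP's `md5`). The two byte strings are made-up stand-ins for the pixel data of two almost identical images, not real files.

```python
import hashlib

# Hypothetical pixel data: 64 grayscale values, identical except for one pixel
original = bytes([200] * 64)
recompressed = bytes([200] * 63 + [201])  # last pixel is one brightness level off

digest_a = hashlib.md5(original).hexdigest()
digest_b = hashlib.md5(recompressed).hexdigest()

# A cryptographic hash only answers "identical or not"; it carries no notion of "close"
print(digest_a == digest_b)  # False
```

Any difference introduced by a format conversion or a resize changes the digest completely, so the two digests cannot be used to measure similarity.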

If I didn't mention it, I will be happy with something that just suggests possible "similar" images.

EDIT

Keep in mind that the check needs to be done multiple times per minute, so the best solution is one that gives me some values per image that I can store and use in the future to compare with the image that I am looking at without having to re-scan the whole server.
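One way this could look, as a rough sketch: assume each image has already been reduced to a 64-bit integer fingerprint (how that fingerprint is produced is a separate question), and compare fingerprints by counting differing bits. The function names, the file names, and the `max_distance` threshold below are all hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def find_similar(new_fingerprint, stored, max_distance=5):
    """Return (image_id, distance) pairs for stored fingerprints close to the new one."""
    matches = []
    for image_id, fingerprint in stored.items():
        distance = hamming_distance(new_fingerprint, fingerprint)
        if distance <= max_distance:
            matches.append((image_id, distance))
    return sorted(matches, key=lambda match: match[1])


# Made-up fingerprints standing in for values loaded from storage
stored = {
    "logo_a.jpg": 0x0F0F0F0F0F0F0F0F,
    "logo_b.gif": 0x0F0F0F0F0F0F0F1F,  # differs from logo_a in one bit
    "photo.png": 0x00000000FFFFFFFF,
}
print(find_similar(0x0F0F0F0F0F0F0F0F, stored))
# [('logo_a.jpg', 0), ('logo_b.gif', 1)]
```

The lookup only touches the stored integers, never the image files themselves, which is the property asked for here.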

I am reading some pages that mention histograms, or resizing the image to a very small size, stripping any tags, converting it to grayscale, hashing that file, and using the hash for comparison. If I am successful I will post the code/answer here.
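The "shrink, grayscale, hash" idea is often implemented as what is commonly called an average hash. Below is a minimal sketch of that technique, not the code I ended up with. To stay self-contained it assumes the image has already been decoded to grayscale and is passed in as a list of rows of 0-255 values; decoding real JPG/GIF files would need an image library. The function name and the 8x8 grid size are assumptions.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint for a grayscale image.

    pixels is a list of rows, each row a list of 0-255 brightness values.
    """
    height, width = len(pixels), len(pixels[0])
    small = []
    # Shrink to size x size by averaging the source pixels that fall into each cell
    for y in range(size):
        y0 = y * height // size
        y1 = max((y + 1) * height // size, y0 + 1)
        for x in range(size):
            x0 = x * width // size
            x1 = max((x + 1) * width // size, x0 + 1)
            block = [pixels[r][c] for r in range(y0, y1) for c in range(x0, x1)]
            small.append(sum(block) / len(block))
    # One bit per cell: 1 if the cell is brighter than the overall mean
    mean = sum(small) / len(small)
    bits = 0
    for value in small:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


# Synthetic test images: dark left half, bright right half, at two sizes
logo = [[0] * 16 + [200] * 16 for _ in range(32)]
bigger = [[0] * 32 + [200] * 32 for _ in range(64)]
print(average_hash(logo) == average_hash(bigger))  # True
```

Unlike MD5, small changes to the image flip only a few bits of this fingerprint, so two fingerprints can be compared by counting the bits in which they differ.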

Fabrizio
  • Look at some related questions about how to get an image signature/fingerprint which you can then compare for similarity: [OpenCV / SURF How to generate a image hash / fingerprint / signature out of the descriptors?](http://stackoverflow.com/questions/7205489/opencv-fingerprint-image-and-compare-against-database), [Near-Duplicate Image Detection](http://stackoverflow.com/questions/1034900/near-duplicate-image-detection/), [Image fingerprint to compare similarity of many images](http://stackoverflow.com/questions/596262/image-fingerprint-to-compare-similarity-of-many-images), ... – Albert Jan 15 '13 at 08:45

2 Answers


Try using file_get_contents together with a hash function, or hash_file, which reads the file for you: http://www.php.net/manual/en/function.hash-file.php

If the hashes match, then you know the files are exactly the same.

EDIT: If possible, storing the image hashes and the image paths in a database table might help you limit server load. It is much easier to run the hash algorithm once on your initial images and store each hash in a table. Then, when a new image is submitted, you can hash it and do a lookup on the database table. If the hash is already there, discard the image. You can use the hash as the table index, so once you find a match you don't need to check the rest.
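A minimal sketch of that lookup, written in Python with SQLite purely for illustration (the table name, column names, and `register_image` helper are hypothetical, and the in-memory database stands in for whatever database you already use):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: a real setup would use a persistent database
# The hash is the primary key, so lookups go through an index
conn.execute("CREATE TABLE images (md5 TEXT PRIMARY KEY, path TEXT NOT NULL)")


def register_image(conn, path, data):
    """Store the image's hash, or return the path of an identical image already stored."""
    digest = hashlib.md5(data).hexdigest()
    row = conn.execute("SELECT path FROM images WHERE md5 = ?", (digest,)).fetchone()
    if row is not None:
        return row[0]  # exact duplicate of an earlier upload
    conn.execute("INSERT INTO images (md5, path) VALUES (?, ?)", (digest, path))
    return None


print(register_image(conn, "first.jpg", b"same bytes"))   # None
print(register_image(conn, "second.jpg", b"same bytes"))  # first.jpg
```

This only catches byte-for-byte identical files; it does nothing for images that merely look alike.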

The other option is to not use a database. But then you would always have to do an O(n) lookup: hash the incoming image, then search in memory through the hashes of all saved images.

EDIT #2: Please view the solution here: Image comparison - fast algorithm

Shawn
  • That is what I am currently doing. I store the size, creation time, name, and MD5 of the files, but this doesn't help me with similar images, only with exact matches – Fabrizio Aug 01 '12 at 13:24
  • Then I don't think there is a possible solution for you, if only because: how would code tell whether a McDonald's logo with green arches is the same as one with gold arches? For a human it's easy, but the only way to tell the code is to look for byte data related to a certain color. But then what happens if two images have mostly the same RGB values, such as McDonald's yellow and a school bus? I have edited my answer to include a previous solution. – Shawn Aug 01 '12 at 14:04
  • Also, to reduce the maintenance work in the future: maybe you can store your images by client and have a sales member (or customer service) view each incoming image and reject it if that client already has the image? Just trying to give a less technical solution as well... – Shawn Aug 01 '12 at 14:08
  • Shawn, we already do that. The problem is the same client creating new accounts, since we offer one free service for new accounts, or companies where each employee uploads the same logo. In those cases we wouldn't want to recreate the logo, but reuse what we have already done once. – Fabrizio Aug 01 '12 at 16:04
  • A way to prevent abuse from the same company is to require their fed_tax_id at signup and then have your team verify it on irs.gov. One of my employers did this. – Shawn Aug 01 '12 at 19:06
  • Shawn, thank you for your suggestion, but I am not trying to verify whether they are real customers; I am just trying to save time by not redoing work that has already been done. For example, if 25 different McDonald's send us their logos, we will work on them for free, but once the first one is done there is no need for us to recreate the other 24 if a simple copy/paste would suffice. – Fabrizio Aug 01 '12 at 21:05

To speed up the process, sort all the files by size and compare their contents only if two sizes are equal. For comparing the contents, a hash comparison is also the fastest way. Hope this helps.
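A rough sketch of that approach, in Python for illustration. The `exact_duplicates` name is made up, and the input is assumed to be a mapping from file name to file contents already held in memory; real code would read sizes from the filesystem and hash files only when needed.

```python
import hashlib
from collections import defaultdict


def exact_duplicates(files):
    """files maps a name to its bytes; returns lists of names with identical content."""
    by_size = defaultdict(list)
    for name, data in files.items():
        by_size[len(data)].append(name)

    groups = []
    for names in by_size.values():
        if len(names) < 2:
            continue  # a unique size cannot have an exact duplicate, so skip hashing
        by_hash = defaultdict(list)
        for name in names:
            by_hash[hashlib.md5(files[name]).hexdigest()].append(name)
        groups.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return groups


files = {"a.jpg": b"xxxx", "b.jpg": b"xxxx", "c.jpg": b"yyyy", "d.jpg": b"zz"}
print(exact_duplicates(files))  # [['a.jpg', 'b.jpg']]
```

Like any exact hash comparison, this finds identical files only, not similar images.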

mrd081
  • The MD5 comparison is what I am already doing. Sorting by size would only help if I didn't store the final hash somewhere, but I am storing the hashes in a database. MD5 only gives me the exact matches, while I am looking for similar images – Fabrizio Aug 01 '12 at 13:26
  • For the comparison, how about storing the size in the database? Comparing an int size will be faster than comparing every md5 with the others, won't it? – mrd081 Aug 03 '12 at 08:55
  • Yes, it will be faster but very unreliable. You can have two completely different images with the same size, for example a white pixel image and a black pixel image. – Fabrizio Aug 06 '12 at 19:30
  • I am saying use md5 only if the size matches. – mrd081 Aug 08 '12 at 05:56
  • I am storing the md5 in a database together with other file information. The query to check the md5 is really fast (0.0007), so there is no need to check the size – Fabrizio Aug 10 '12 at 16:06