10

I am using the MinHash algorithm to find similar images. I ran across this post, How can I recognize slightly modified images?, which pointed me to the MinHash algorithm.

I was using a C# implementation from this blog post, Set Similarity and Min Hash.

But while trying to use the implementation, I have run into two problems.

  • What value should I set the universe size to?
  • When I pass the image byte array to a HashSet, it only keeps distinct byte values, so the comparison is over at most 256 possible values (0–255).

What is this universe in MinHash?
And what can I do to improve the C# MinHash implementation?

Since a HashSet<byte> can hold at most 256 distinct values, the similarity value always comes out as 1.

Here is the source that uses the C# MinHash implementation from Set Similarity and Min Hash:

using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        var imageSet1 = GetImageByte(@".\Images\01.JPG");
        var imageSet2 = GetImageByte(@".\Images\02.TIF");
        //var app = new MinHash(256);
        var app = new MinHash(Math.Min(imageSet1.Count, imageSet2.Count));
        double imageSimilarity = app.Similarity(imageSet1, imageSet2);
        Console.WriteLine("similarity = {0}", imageSimilarity);
    }

    private static HashSet<byte> GetImageByte(string imagePath)
    {
        using (var fs = new FileStream(imagePath, FileMode.Open, FileAccess.Read))
        using (var br = new BinaryReader(fs))
        {
            // ReadBytes already returns a byte[], so no intermediate List is needed;
            // the HashSet then collapses it to the distinct byte values.
            return new HashSet<byte>(br.ReadBytes((int)fs.Length));
        }
    }
}
dance2die

2 Answers

11

Taking your second question first:

And what can I do to improve the C# MinHash implementation?

You are trying to compare images at the byte level for files that are inherently structured very differently (you are using a JPEG as one image and a TIFF as the other). Even if these files were visually identical, your implementation would never find the duplicates unless the files were of the same type.

That said, your minhash implementation should depend on comparable attributes of the images, which you then hash in order to create the signatures.

While the byte values are definitely attributes of the image, they can't be compared to each other if the images are in different formats.

For images, you could use, for example, the RGB (and possibly alpha) values for each pixel in the image. These values are comparable no matter what format the image is in (you could use CMYK, or any other color space you want).
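For example, here's a minimal sketch of extracting per-pixel colors, assuming System.Drawing is available (GetPixelColors is a hypothetical helper, not part of the referenced implementation):

using System.Collections.Generic;
using System.Drawing;

static List<Color> GetPixelColors(string imagePath)
{
    var colors = new List<Color>();
    using (var bitmap = new Bitmap(imagePath))
    {
        // Decode the file first, then walk pixels left-to-right, top-to-bottom,
        // so images are compared on decoded color values rather than on the
        // encoded file bytes.
        for (int y = 0; y < bitmap.Height; y++)
            for (int x = 0; x < bitmap.Width; x++)
                colors.Add(bitmap.GetPixel(x, y));
    }
    return colors;
}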

However, using the individual values for each pixel will give you poor results. The Jaccard similarity is used to compare the values from each set (regardless of whether or not you hash anything), and because sets have no order, images that have the same number of pixels of each color, but arranged in different positions, will produce a false positive.

Take for example the following images:

[two 100 × 100 px images, each split into a red half and a green half, with the two halves in opposite positions]

They are both 100px x 100px, with half the pixels red and half green.

Using the Jaccard similarity to compare the two, you'd get the following. (Note that since the pixel values repeat, each set contains only one element per color. If you want, you can use the Jaccard bag comparison instead, which counts multiple occurrences of the same item, but in this case the result will turn out to be the same.)

Legend:
    g = green
    r = red

left image = { r, g }
right image = { r, g }

similarity = intersection(left, right) / union(left, right)

similarity = 2 / 2 = 100%

A note about the representation right image = { r, g }: because sets are unordered, { r, g } is the same as { g, r }. The two sets are identical, which makes the 100% result obvious even before the Jaccard comparison is calculated.
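To make that concrete, here's a small sketch of the plain Jaccard computation over HashSets; this is the exact value that minhashing approximates (JaccardSimilarity is a hypothetical helper):

using System.Collections.Generic;
using System.Linq;

static double JaccardSimilarity<T>(HashSet<T> left, HashSet<T> right)
{
    // |A ∩ B| / |A ∪ B|
    int intersection = left.Intersect(right).Count();
    int union = left.Union(right).Count();
    return union == 0 ? 1.0 : (double)intersection / union;
}

For the two single-color sets above, the intersection and the union are both { r, g }, so this returns 2 / 2 = 100% even though the images clearly differ.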

But obviously, these images are not the same.

This is why shingling is usually employed, in order to find distinct mini-regions within the set that can be used collectively to uniquely identify an item.

For images, you can use consecutive RGB values (in this case, going from left-to-right, top-to-bottom, wrapping around when an edge is hit) of a fixed length to generate shingles. In this case, assuming a shingle length of three, your sets look like this (note I'm using square brackets to indicate attributes/vectors, as the shingles are not sets in themselves):

left image = { [r, r, r], [r, r, g], [r, g, g], [g, g, g] }
right image = { [g, g, g], [g, g, r], [g, r, r], [r, r, r] } 

This gives you a Jaccard similarity of:

intersection(left, right) = 2
union(left, right) = 6

similarity(left, right) = 2 / 6 = 33.33%

This is a much closer estimate of how similar these images are (in that they're not) than the original.

Note that shingles can be of any length you choose. You'll have to decide which shingle length produces Jaccard similarity results (and a threshold) that appropriately answer the question "how similar are these?"
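As a sketch of what shingle generation over the pixel sequence might look like (GetPixelShingles is a hypothetical helper; the string encoding of a shingle is an arbitrary choice, and wrap-around at the edges is omitted for brevity):

using System.Collections.Generic;
using System.Drawing;
using System.Linq;

static HashSet<string> GetPixelShingles(IList<Color> pixels, int shingleLength)
{
    var shingles = new HashSet<string>();
    for (int i = 0; i + shingleLength <= pixels.Count; i++)
    {
        // Encode each run of pixels as one string, e.g. "FFFF0000|FFFF0000|FF00FF00";
        // any stable encoding of the run would work equally well.
        string shingle = string.Join("|",
            pixels.Skip(i).Take(shingleLength).Select(c => c.ToArgb().ToString("X8")));
        shingles.Add(shingle);
    }
    return shingles;
}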

Now, answering your first question:

what is the universe value?

In this particular case, it's the number of items that can possibly exist in the universe. If you were using single RGB pixels (256 possible values per channel), the universe would be:

256 * 256 * 256 = 16,777,216

With shingling, the value is much, much higher, as you're dealing with combinations of these items. Ideally, you would want a family of perfect hash functions over this universe for your minhash signatures. However, because of the limitations of type systems (or because you don't want to store very large numbers in another storage medium), your focus should be on hash functions that minimize collisions.

If you know the number of possible items in the universe of items, then it can help you generate hash functions that reduce the number of collisions.

In the implementation that you reference, this universe size is used to generate random numbers, which are then used to construct the multiple hash functions for minhashing; ideally, these would produce minimal collisions.
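For illustration, here is one common way such a family of hash functions can be generated, using the universal-hashing form h(x) = (a*x + b) mod p; this is a sketch of the general idea, not the referenced blog's exact code:

using System;

static Func<int, int>[] CreateHashFunctions(int count, Random rng)
{
    // p should be a prime at least as large as the universe size; 2^31 - 1
    // (a Mersenne prime) comfortably covers a universe of single RGB pixel
    // values. Inputs are assumed to be non-negative.
    const int p = 2147483647;
    var functions = new Func<int, int>[count];
    for (int i = 0; i < count; i++)
    {
        long a = rng.Next(1, p); // fresh per function, captured by the lambda
        long b = rng.Next(0, p);
        functions[i] = x => (int)((a * x + b) % p);
    }
    return functions;
}

Taking the minimum value of each of these functions over all elements of a set yields one component of that set's minhash signature.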

casperOne
0

Briefly, Minhash alone is a poor solution for finding similar images. When used in conjunction with appropriate image feature extraction, it should work well. But this is far from straightforward. I'll explain:

Broadly speaking, Minhash calculates similarities based on the number of shared features. Choosing appropriate features to generate your minhashes from is critical. In your case, you must choose features that are likely to be shared by similar images (but unlikely to be shared by dissimilar images). By "shared", I mean the same feature is found identically in both images.

For text documents, this is easy: the features used are typically shingles of text, e.g. "the cat sat", "cat sat on", "sat on the", "on the mat". These are simple to generate and likely to be shared between similar documents.
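For instance, a quick sketch of that kind of word-level shingling (GetWordShingles is a hypothetical helper):

using System.Collections.Generic;
using System.Linq;

static HashSet<string> GetWordShingles(string text, int shingleLength)
{
    string[] words = text.Split(' ');
    var shingles = new HashSet<string>();
    // Slide a window of shingleLength words across the text.
    for (int i = 0; i + shingleLength <= words.Length; i++)
        shingles.Add(string.Join(" ", words.Skip(i).Take(shingleLength)));
    return shingles;
}

// GetWordShingles("the cat sat on the mat", 3) yields exactly the four
// shingles listed above.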

With images it's much harder. You can't compare runs of bytes, since a JPEG and a PNG of the same image will have entirely different byte patterns. Neither can you compare runs of pixel colour values, since these colour values will differ slightly between JPEG and PNG images. And then consider what happens if one image is scaled slightly, or blurred, or slightly rotated, or has its colour-balance adjusted: these are all the kinds of modification that similarity detection should be robust against, but any of these will result in changes to all pixels, so if features are based simply on pixel runs, the images will be considered entirely non-similar.

Image similarity detection is complex, and relies on using features that don't change with scaling, rotation, cropping, blurring or substantial colour adjustments. There are many techniques out there, and they generally detect broad shapes and colour relationships within the image. This is not for the novice, and you'll need to be willing to read some fairly mathematical papers in the field of computer vision. Once you can detect features like these, they can be usefully fed into the minhash algorithm and give good results.

What is this universe in MinHash?

That blog's author apparently intends universeSize to be the number of different features that could ever possibly exist, but doesn't do anything sensible with the value; it's merely used to reduce the randomness of the hash functions, which is always a bad idea. The universe should be considered infinite for all practical purposes, and there's no reason to use such a variable in a minhash implementation. I see a lot of problems in that code.

Ben Whitmore