I have the following table storing data about images:
images
- id (int)
- sample_1_1 (int)
- sample_1_2 (int)
- sample_1_3 (int)
- sample_2_1 (int)
- sample_2_2 (int)
- sample_2_3 (int)
- ... # Up until sample_25_3
The task is to calcuate the distance between the collected data. Currently, I am using a 75-dimensional (that's right, 3 * 25 = 75) euclid distance calculation programmed as stored procedures into the database:
CREATE DEFINER=`root`@`localhost`
FUNCTION `distanceBetween`(compareId INT, toId INT) RETURNS double
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE distance DOUBLE;
SELECT euclidDistance(
i1.sample_1_1, i1.sample_1_2, i1.sample_1_3,
i2.sample_1_1, i2.sample_1_2, i2.sample_1_3,
...
) INTO distance
FROM images i1, (SELECT * FROM images WHERE id = toId) i2
WHERE i1.id = compareId;
RETURN distance;
END
With another subroutine calculating the actual distance between 2 75-dim. vectors:
CREATE DEFINER=`root`@`localhost`
FUNCTION `euclidDistance`(
img1_sample1_1 INT, img1_sample1_2 INT, img1_sample1_3 INT,
img2_sample1_1 INT, img2_sample1_2 INT, img2_sample1_3 INT,
...
) RETURNS double
RETURN SQRT(
quadDiff(img1_sample1_1, img2_sample1_1)
+ quadDiff(img1_sample1_2, img2_sample1_2)
+ quadDiff(img1_sample1_3, img2_sample1_3)
+ ...
)
And another subroutine to calculate the squared difference between two values:
CREATE DEFINER=`root`@`localhost`
FUNCTION `quadDiff`(var1 INT, var2 INT) RETURNS int(11)
RETURN POW(var1 - var2, 2)
The functions itself are perfectly fine and return deterministic results that are both mathematically and logically correct.
The problem comes when I want to get the "closest" images to a given image - meaning the images that have the least distance to any given image. To do so, I use another procedure:
CREATE DEFINER=`root`@`localhost`
PROCEDURE `getSimilarImages`(imageId INT, `limit` INT)
BEGIN
SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance
FROM images i1, (SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2
WHERE i1.id = imageId
ORDER BY distance
LIMIT 10;
END
The database currently features around 30.000 images. This means that a CALL getSimilarImages(123, 10);
takes about 12 seconds to finish. This is way too long for any application, be it web or application based.
So, I want to speed things up. What are my options? Do you see any potential to optimize the process of image comparing or calculating the distances?
I thought about caching the result of the procedure, but I have no clue how to do so. I could also compare every image to every other image as soon as new images are added, but that will adding images a very long process, which is also not acceptable.
I can provide more information on the system setup if it helps, but I appreciate any pointers you can provide. The current situation is not good and I really need to do something, because the image database will only grow with every hour the system is up.