
Introduction: I want to replace about 280'000 images of math formulas on the Encyclopedia of Mathematics with their corresponding TeX code. To do this, I have classified all of these images (or rather, their URLs) into a list of 100'000 lists.

Each "sublist" contains URL strings such that every URL in that sublist links to the same image. The list looks something like [["https://www.encyclopediaofmath.org/legacyimages/a/a130/a130010/a1300105.png", "https://www.encyclopediaofmath.org/legacyimages/a/a010/a010080/a01008021.png", ...], ["https://www.encyclopediaofmath.org/legacyimages/w/w130/w130080/w1300801.png", "https://www.encyclopediaofmath.org/legacyimages/w/w130/w130080/w130080211.png"], ...].

For each sublist, I have determined (or am still in the process of determining) the corresponding TeX code for one image of that sublist. Since the images within each sublist are identical, this gives me the TeX code for every image URL in the whole list.

Now I want to replace the images in each article (such as this one) with the known TeX code. This requires looking up the image URLs of each article in this list of sublists.

My question: Do you know of any better data structures than a list of lists for the above task?

Example code:

dups = [[i, i + 1] for i in range(100000)]
for i in range(100000):
    for j in range(100000):
        if i in dups[j]:
            print(f"Found number {i} in list {j}")
            break

In the above example, dups is a simplified version of my list of lists (using numbers instead of strings). As you will notice, the above program takes quite some time to finish. I would like to restructure dups so that this kind of lookup can be done faster.

Remark 1: The above code makes up to 1 + 2 + 3 + ... + n comparisons if dups has length n, i.e. n(n + 1)/2 comparisons in total. Since n = 100'000 in my case, this is already a lot of comparisons.

Remark 2: An obvious improvement would be to transform each sublist into a Python set and work with a list of sets. However, most of my sublists contain fewer than three images, so I doubt that this would greatly improve the runtime.
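To illustrate Remark 2 on the same toy data: a minimal sketch of the list-of-sets variant. The inner membership test becomes O(1), but the outer scan over the sublists is still linear, which is why short sublists gain little.

```python
# Same toy data as above, but each sublist is now a set.
dups = [{i, i + 1} for i in range(100000)]

def find(x):
    # Still a linear scan over the sublists; only the inner
    # membership test (x in s) is O(1) now.
    for j, s in enumerate(dups):
        if x in s:
            return j
    return None
```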

Remark 3: Note that I can hardly control the order of the "incoming" images (I basically have to follow the article structure), and I cannot impose a total order on the list of lists (since I cannot break the sublists apart). I have thus not figured out a way to implement binary search.

1 Answer


While it might introduce some data redundancy, I would propose using a binary search tree (BST). Your list of lists is a good idea for indexing, but it has one significant issue, which is indeed the runtime.

Your comparison key for the tree could simply be lexicographic (alphabetical) order on the links ("a" < "aa" < "z", and so on). That essentially gives you binary search at the cost of some redundant data. If we do the math: you have 280,000 images, so an average search in a balanced BST takes about log2(280,000) ≈ 18 steps. That roughly three links share the same TeX code really does not matter considering the improvement in speed; just store it three times. Treat each node as a key-value pair: the key is the link, and the value stored with it is the TeX code (or the index of the sublist the link belongs to, so you can reuse your list of lists). My general suggestion, though, would be to ignore your sublists while searching and return to them once you're done.
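In Python, the same O(log n) lookup over lexicographically sorted links can be had without writing a tree, using the standard-library bisect module on a sorted list of (link, value) pairs. A sketch with hypothetical URLs and TeX codes:

```python
import bisect

# Hypothetical (link, TeX code) pairs; in practice there would be ~280,000.
pairs = [
    ("https://example.org/a1.png", r"\alpha"),
    ("https://example.org/b2.png", r"\beta"),
    ("https://example.org/c3.png", r"\gamma"),
]
pairs.sort()  # sort once, lexicographically by link
keys = [link for link, _ in pairs]

def lookup(link):
    # Binary search over the sorted keys: O(log n) comparisons.
    i = bisect.bisect_left(keys, link)
    if i < len(keys) and keys[i] == link:
        return pairs[i][1]
    return None
```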

A tree would look something like this:

                      (link, code/index)
                     /                  \
          (link, code/index)      (link, code/index)
               /    \                  /    \
             etc.                    etc.

If you want to, or have to, stick with your sublist idea, then my only suggestion is to build a dictionary from your lists. See here for the time complexity of dictionary operations.
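The dictionary variant can be built in one pass over the existing sublists. A minimal sketch with hypothetical data (two sublists and their TeX codes), flattening everything into a single URL-to-code mapping with average O(1) lookup:

```python
# Hypothetical sublists and their TeX codes; the real data
# has ~100,000 sublists with ~280,000 URLs in total.
sublists = [
    ["https://example.org/a1.png", "https://example.org/a2.png"],
    ["https://example.org/b1.png"],
]
tex_codes = [r"\alpha", r"\beta"]  # one code per sublist

# Flatten: every URL maps directly to the TeX code of its sublist,
# so a lookup is a single hash-table access.
tex_by_url = {
    url: code
    for sublist, code in zip(sublists, tex_codes)
    for url in sublist
}
```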

Though if possible, I would implement this in a language with pointers, or at least in such a way that the TeX code for every link in a sublist is one shared object, to save space.
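For what it's worth, in Python this sharing comes for free when the same str object is reused as the value for every link in a sublist; only the dictionary slots cost extra memory, not copies of the string. A tiny sketch with hypothetical keys:

```python
code = r"\alpha"  # one TeX string object

# All three keys reference the very same string object.
d = {"link1": code, "link2": code, "link3": code}

# Identity check: the values are one shared object, not three copies.
assert d["link1"] is d["link2"] is d["link3"]
```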

MrDeal
  • I like the tree idea, and I think it can be implemented even in Python. Since lists are not hashable, I guess I would have to index the lists in the dictionary by their TeX code? Anyway, I will accept this as an answer soon, provided that no one makes some other ground-breaking suggestion :D – Maximilian Janisch Apr 15 '19 at 20:50
  • @MaximilianJanisch Perhaps I am misunderstanding, but indexing lists by TeX code sounds a bit odd. While lists are not hashable, strings are! Maybe look at something like this: `sublist_a = [l1, l2, l3]` (for, let's say, \Alpha); now we build a dict `{l1: "\Alpha", l2: "\Alpha", l3: "\Alpha", ... (other stuff here)}`. This will allow you to simply throw in a link and get back the code. It takes up a lot of memory but is fast to access. Maybe also use only the file paths of your images (i.e. just the differing parts) to make things faster. – MrDeal Apr 17 '19 at 01:46
  • The dictionary sounds very easy to implement too! It will also not greatly change my storage usage, I think (not that it really matters in this case), since I would only have to store about 180'000 more TeX codes. – Maximilian Janisch Apr 17 '19 at 08:16