Introduction: I want to replace about 280'000 images of math formulas on the Encyclopedia of Mathematics with their corresponding TeX code. To do this, I have classified all of these images (or rather, their URLs) into a list of 100'000 lists.
Each "sublist" contains URL strings such that every URL in a given sublist links to the same image. The list looks something like [["https://www.encyclopediaofmath.org/legacyimages/a/a130/a130010/a1300105.png", "https://www.encyclopediaofmath.org/legacyimages/a/a010/a010080/a01008021.png", ...], ["https://www.encyclopediaofmath.org/legacyimages/w/w130/w130080/w1300801.png", "https://www.encyclopediaofmath.org/legacyimages/w/w130/w130080/w130080211.png"], ...].
For each sublist, I have determined (or am still in the process of determining) the corresponding TeX code for one image of that sublist. Since the images within each sublist are identical, this gives me (or will give me) the TeX code for every image URL in the whole list.
Now I want to replace the images in each article (such as this one) with the known TeX code. This means I have to look up each article's image URLs in this list of sublists.
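Since every URL in a sublist shares one TeX string, the grouping can be flattened into a single dict from URL to TeX code, so the per-article replacement step never has to scan the sublists at all. This is only a sketch under assumed data; `groups`, `tex_codes`, and the example.org URLs are illustrative stand-ins for the real list and the real TeX results.

```python
# Hypothetical stand-ins: one sublist of duplicate URLs per group,
# and one TeX string per group (determined from a representative image).
groups = [
    ["https://example.org/a.png", "https://example.org/b.png"],
    ["https://example.org/c.png"],
]
tex_codes = [r"\alpha + \beta", r"\gamma"]

# Flatten: every URL in a group maps to that group's TeX code.
url_to_tex = {
    url: tex
    for group, tex in zip(groups, tex_codes)
    for url in group
}

print(url_to_tex["https://example.org/b.png"])  # -> \alpha + \beta
```

With this dict, replacing an image in an article is a single `url_to_tex.get(url)` call per URL, regardless of how the URLs arrive.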
My question: Do you know of any better data structures than a list of lists for the above task?
Example code:
dups = [[i, i + 1] for i in range(100000)]
for i in range(10000):
    for j in range(100000):
        if i in dups[j]:
            print(f"Found number {i} in {j}-th list")
            break
In the above example, dups is a simplified version of my list of lists (using numbers instead of strings). As you can see, the program takes quite some time to finish. I would like to improve dups so that this kind of lookup can be done faster.
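One possible improvement, sketched below on the same simplified data: invert the list of lists once into a dict that maps each element to the index of its sublist. Construction is a single pass over all elements, and each subsequent lookup is an O(1) average-time dict access instead of a scan over all 100'000 sublists.

```python
# Same simplified data as in the question.
dups = [[i, i + 1] for i in range(100000)]

# Build the inverted index once: element -> index of its sublist.
index = {}
for j, sub in enumerate(dups):
    for x in sub:
        # Keep the first sublist containing x, matching the original
        # loop's behavior of breaking at the first hit.
        index.setdefault(x, j)

# Each query is now a dict lookup instead of a scan over all sublists.
for i in range(10000):
    if i in index:
        print(f"Found number {i} in {index[i]}-th list")
```

For the real data, the keys would be the URL strings and the values either the sublist index or, more directly, the TeX code itself.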
Remark 1: The above code essentially makes 1 + 2 + 3 + ... + n sublist checks in the worst case if dups has length n, i.e. n * (n + 1) / 2 checks. Since n = 100'000 in my case, this is already a lot of comparisons.
Remark 2: An obvious improvement would be to turn each sublist into a Python set and work with a list of sets. However, most of my sublists contain fewer than 3 images, so I doubt that this would greatly improve the runtime.
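The doubt in Remark 2 can be made concrete with a small sketch (illustrative data only): a set makes the inner membership test O(1), but the outer scan over the sublists is untouched, so each query is still linear in the number of sublists.

```python
# Illustrative small instance of the same structure.
dups = [[i, i + 1] for i in range(1000)]

# Converting each sublist to a set speeds up the inner membership
# test, but the outer loop still visits up to len(dups) sets.
dup_sets = [set(sub) for sub in dups]

def find(i):
    for j, s in enumerate(dup_sets):
        if i in s:
            return j  # first sublist containing i
    return None

print(find(5))  # -> 4
```

With sublists of length 2 or 3, the set conversion saves at most a couple of comparisons per sublist, which confirms that the dominant cost is the linear scan itself.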
Remark 3: Note that I can hardly control the order of the "incoming" images (basically I have to follow the article structure), and I cannot define a total order on the list of lists (since I cannot break the sublists apart). I have therefore not found a way to implement binary search.