I have a large array that looks something like this:
import numpy as np

np.random.seed(42)
arr = np.random.permutation(np.array([
    (1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4),
    (8, 9, 3, 4, 7, 9, 1, 9, 3, 4, 50000)
]).T)
It isn't sorted, and the rows of this array are unique. I also know the bounds for the values in both columns: they are [0, n] and [0, k] respectively. So the maximum possible number of rows is (n+1)*(k+1), but the actual number of rows is closer to the log of that.
I need to search the array by both columns to find the row such that arr[row, :] == (i, j), and return -1 when (i, j) is absent from the array. The naive implementation of such a function is:
def get(arr, i, j):
    cond = (arr[:, 0] == i) & (arr[:, 1] == j)
    if np.any(cond):
        return np.where(cond)[0][0]
    else:
        return -1
Unfortunately, since in my case arr is very large (>90M rows), this is very inefficient, especially since I would need to call get() multiple times.
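For completeness, the same scan can be batched over many queries (the helper name get_many is my own), which makes the cost explicit: one full pass over arr per query.

```python
import numpy as np

np.random.seed(42)
arr = np.random.permutation(np.array([
    (1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4),
    (8, 9, 3, 4, 7, 9, 1, 9, 3, 4, 50000)
]).T)

def get_many(arr, queries):
    # Batched naive lookup: still one O(m) scan of arr per query,
    # so the total cost is O(len(queries) * m).
    out = np.full(len(queries), -1, dtype=np.int64)
    for q, (i, j) in enumerate(queries):
        hits = np.flatnonzero((arr[:, 0] == i) & (arr[:, 1] == j))
        if hits.size > 0:
            out[q] = hits[0]
    return out

res = get_many(arr, [(2, 3), (4, 100)])
```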
Alternatively, I tried translating this to a dict with (i, j) keys, such that index[(i, j)] = row, which can be accessed by:
def get(index, i, j):
    try:
        return index[(i, j)]
    except KeyError:
        return -1
This works (and is much faster when tested on data smaller than mine), but again, building the dict on the fly with

index = {}
for row in range(arr.shape[0]):
    i, j = arr[row, :]
    index[(i, j)] = row

takes a huge amount of time and eats lots of RAM in my case. I was also thinking of first sorting arr and then using something like np.searchsorted, but this didn't lead me anywhere.
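A sketch of the searchsorted direction I had in mind, in case it clarifies the idea (encoding each row as the scalar key i*(k+1)+j is my own assumption, using the known bound k on the second column):

```python
import numpy as np

np.random.seed(42)
arr = np.random.permutation(np.array([
    (1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4),
    (8, 9, 3, 4, 7, 9, 1, 9, 3, 4, 50000)
]).T)

k = 50000  # known upper bound on the second column (assumed given)

# Encode each row (i, j) as a single integer i*(k+1) + j; this is
# injective because 0 <= j <= k.
keys = arr[:, 0].astype(np.int64) * (k + 1) + arr[:, 1]
order = np.argsort(keys)        # one-time O(m log m) preprocessing
sorted_keys = keys[order]

def get(arr, i, j):
    key = np.int64(i) * (k + 1) + np.int64(j)
    pos = np.searchsorted(sorted_keys, key)  # binary search, O(log m)
    if pos < sorted_keys.size and sorted_keys[pos] == key:
        return int(order[pos])  # row index in the original, unsorted arr
    return -1
```

The sort and key arrays are built once, so each subsequent lookup is a single binary search instead of a full scan.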
So what I need is a fast function get(arr, i, j) that returns:
>>> get(arr, 2, 3)
4
>>> get(arr, 4, 100)
-1