
I have a large array that looks something like the one below:

np.random.seed(42)

arr = np.random.permutation(np.array([
    (1,1,2,2,2,2,3,3,4,4,4),
    (8,9,3,4,7,9,1,9,3,4,50000)
]).T)

The array isn't sorted and its rows are unique. I also know the bounds for the values in both columns: [0, n] for the first and [0, k] for the second. So the maximum possible number of rows is (n+1)*(k+1), but the actual number is closer to the log of that.

I need to search the array by both columns to find the row such that arr[row,:] == (i,j), and to return -1 when (i,j) is absent from the array. The naive implementation of such a function is:

def get(arr, i, j):
    cond = (arr[:,0] == i) & (arr[:,1] == j)
    if np.any(cond):
        return np.where(cond)[0][0]
    else:
        return -1

Unfortunately, in my case arr is very large (>90M rows), so this is very inefficient, especially since I need to call get() multiple times.

Alternatively I tried translating this to a dict with (i,j) keys, such that

index[(i,j)] = row

that can be accessed by:

def get(index, i, j):
    try:
        return index[(i,j)]
    except KeyError:
        return -1

This works (and is much faster when tested on smaller data than I have), but again, creating the dict on-the-fly by

index = {}
for row in range(arr.shape[0]):
    i,j = arr[row, :]
    index[(i,j)] = row

takes a huge amount of time and eats a lot of RAM in my case. I was also thinking of first sorting arr and then using something like np.searchsorted, but this didn't lead me anywhere.
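
For reference, a minimal sketch of that sort-and-searchsorted idea: each row is encoded as a single integer key i*(k+1) + j (with k the known bound of the second column), which assumes (n+1)*(k+1) fits into an int64. Sorting is done once, after which each lookup is O(log m).

keys = arr[:, 0].astype(np.int64) * (k + 1) + arr[:, 1]  # k = bound of the second column
order = np.argsort(keys)        # done once
sorted_keys = keys[order]

def get_sorted(i, j):
    key = i * (k + 1) + j
    pos = np.searchsorted(sorted_keys, key)
    if pos < sorted_keys.size and sorted_keys[pos] == key:
        return order[pos]       # row index in the original arr
    return -1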

So what I need is a fast function get(arr, i, j) that returns

>>> get(arr, 2, 3)
4
>>> get(arr, 4, 100)
-1 
Tim

4 Answers


A partial solution would be:

In [36]: arr
Out[36]: 
array([[    2,     9],
       [    1,     8],
       [    4,     4],
       [    4, 50000],
       [    2,     3],
       [    1,     9],
       [    4,     3],
       [    2,     7],
       [    3,     9],
       [    2,     4],
       [    3,     1]])

In [37]: (i,j) = (2, 3)

# we can use `assume_unique=True` which can speed up the calculation    
In [38]: np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)
Out[38]: 
array([[False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False]])

# we can use `assume_unique=True` which can speed up the calculation
In [39]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)

In [40]: np.argwhere(mask)
Out[40]: array([[4, 0]])

If you need the final result as a scalar, then don't use the keepdims argument, and cast the array to a scalar like:

# we can use `assume_unique=True` which can speed up the calculation
In [41]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1)

In [42]: np.argwhere(mask)
Out[42]: array([[4]])

In [43]: np.asscalar(np.argwhere(mask))
Out[43]: 4
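
Note: np.asscalar has been deprecated in newer NumPy releases; where it is unavailable, .item() returns the same scalar:

np.argwhere(mask).item()   # -> 4, same as np.asscalar above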
kmario23
  • This is about a factor of 3-4 slower than the `get` in the question, according to my testing. Is it faster for larger arrays? – Jan Christoph Terasa Jun 20 '18 at 14:48
  • The problem is that this searches all the rows each time (slow), while I'd need something faster, like hashing table, that does not loop through the rows – Tim Jun 20 '18 at 14:49
  • @Tim ok I see, I will have to do some more tests to give a concrete answer. – kmario23 Jun 20 '18 at 14:50
  • The dict solution I mention is a great benchmark, but the time/memory used for creating dict with >90M entries is unacceptable (once it hanged my computer). – Tim Jun 20 '18 at 14:59
  • @Tim can you test this solution with the `assume_unique=True` flag and see whether you gain some speedup? (see my updated answer). The documentation says speedup can be achieved on using this flag: https://docs.scipy.org/doc/numpy/reference/generated/numpy.isin.html – kmario23 Jun 20 '18 at 14:59

Solution

Python offers a set type to store unique values, but sadly no ordered version of a set. You can, however, use the ordered-set package.

Create an OrderedSet from the data. Fortunately, this only needs to be done once:

import ordered_set

o = ordered_set.OrderedSet(map(tuple, arr))

def ordered_get(o, i, j):
    try:
        return o.index((i,j))
    except KeyError:
        return -1
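
With the example array from the question (where the row (2, 3) sits at index 4), this matches the requested get() behaviour:

ordered_get(o, 2, 3)    # -> 4
ordered_get(o, 4, 100)  # -> -1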

Runtime

Finding the index of a value should be O(1), according to the documentation:

In [46]: %timeit get(arr, 2, 3)
10.6 µs ± 39 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [47]: %timeit ordered_get(o, 2, 3)
1.16 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [48]: %timeit ordered_get(o, 2, 300)
1.05 µs ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Testing this for a much larger array:

a2 = np.random.randint(10000, size=1000000).reshape(-1,2)
o2 = ordered_set.OrderedSet()
for t in map(tuple, a2):
    o2.add(t)

In [65]: %timeit get(a2, 2, 3)
1.05 ms ± 2.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [66]: %timeit ordered_get(o2, 2, 3)
1.03 µs ± 2.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [67]: %timeit ordered_get(o2, 2, 30000)
1.06 µs ± 28.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Looks like it indeed is O(1) runtime.

Jan Christoph Terasa
  • I just tried this with 63M items, and the ipython instance uses 11.7G of memory. It might be worthwhile to divide the dataset, or just use a machine with sufficient memory, if this has to be done in one go. – Jan Christoph Terasa Jun 20 '18 at 18:16
  • Thanks. Unfortunately, as with the dict solution, with approx 100M items I also run into RAM issues (I tested different variants of my ideas, your solution and other approaches). If you look at the source code, OrderedSet uses a python dict under the hood (https://github.com/LuminosoInsight/ordered-set/blob/master/ordered_set.py), so it is not a surprise. – Tim Jun 20 '18 at 21:39
  • @Tim I took a look at how it is implemented, and as you said it's a pretty straightforward implementation. Maybe it is possible to implement a more space-efficient set, possibly by sacrificing the O(1) runtime. – Jan Christoph Terasa Jun 20 '18 at 21:43
  • On a sidenote, if this is work-related, there is a certain point where further study is more expensive than just slapping in another 32 gigs of RAM. – Jan Christoph Terasa Jun 21 '18 at 04:38
  • I know, but I thought that if the data fits in RAM and I am able to work with it, then creating something like a hash table for it should be manageable. It seems more complicated than I thought... – Tim Jun 21 '18 at 05:02
  • Maybe it's worthwhile then to use a language where you can control the size of the data structures more directly. Depending on `n` and `k`, the default size of a python `int` stored in a `tuple` on a 64-bit platform (8 bytes!) might be too large. Also, a tuple has an overhead of 48 bytes as well. So, let's say your numbers fit into `int32`, instead of using 8 bytes for each 2-tuple/struct, you end up with 64 bytes per tuple! – Jan Christoph Terasa Jun 21 '18 at 07:49
  • Yes, probably. Thanks for help and many good suggestions. FYI, I also found this: https://github.com/nnemkin/sparsedict However in the end, I am afraid that I would need to re-design my problem to work with different data structures. – Tim Jun 21 '18 at 07:52
A NumPy-only alternative to the get() from the question:

def get_agn(arr, i, j):
    idx = np.flatnonzero((arr[:,0] == i) & (arr[:,1] == j))
    return -1 if idx.size == 0 else idx[0]

Also, just in case you are thinking about the ordered_set solution, here is a better one (however, in both cases see timing tests below):

d = { (i, j): k for k, (i, j) in enumerate(arr)}
def unordered_get(d, i, j):
    return d.get((i, j), -1)

and its "full" equivalent (which builds the dictionary inside the function):

def unordered_get_full(arr, i, j):
    d = { (i, j): k for k, (i, j) in enumerate(arr)}
    return d.get((i, j), -1)
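
Assuming d is built from the question's example array, both variants reproduce the expected results:

unordered_get(d, 2, 3)           # -> 4
unordered_get(d, 4, 100)         # -> -1
unordered_get_full(arr, 2, 3)    # -> 4 (but rebuilds the dict on every call)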

Timing tests:

First, define @kmario23's function:

def get_kmario23(arr, i, j):
    # fundamentally, kmario23's code rearranged to return scalars
    # and -1 when (i, j) is not found:
    mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1)
    idx = np.argwhere(mask)
    return -1 if idx.size == 0 else np.asscalar(idx[0])

Second, define @ChristophTerasa's functions (the original and a "full" version):

import ordered_set
o = ordered_set.OrderedSet(map(tuple, arr))
def ordered_get(o, i, j):
    try:
        return o.index((i,j))
    except KeyError:
        return -1

def ordered_get_full(arr, i, j):
    # "Full" version that builds ordered set inside the function
    o = ordered_set.OrderedSet(map(tuple, arr))
    try:
        return o.index((i,j))
    except KeyError:
        return -1

Generate some large data:

arr = np.random.randint(1, 2000, 200000).reshape((-1, 2))

Timing results:

In [55]: %timeit get_agn(arr, *arr[-1])
149 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [56]: %timeit get_kmario23(arr, *arr[-1])
1.42 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [57]: %timeit get_kmario23(arr, *arr[0])
1.2 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Ordered set tests:

In [80]: o = ordered_set.OrderedSet(map(tuple, arr))

In [81]: %timeit ordered_get(o, *arr[-1])
1.74 µs ± 32.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [82]: %timeit ordered_get_full(arr, *arr[-1]) # include ordered set creation time
166 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Unordered dictionary tests:

In [83]: d = { (i, j): k for k, (i, j) in enumerate(arr)}

In [84]: %timeit unordered_get(d, *arr[-1])
1.18 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [85]: %timeit unordered_get_full(arr, *arr[-1])
102 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, when the time needed to create either the ordered set or the unordered dictionary is taken into account, these methods are quite slow. You would have to run several hundred searches on the same data for them to make sense. Even then, there is no need to use the ordered_set package - regular dictionaries are faster.

AGN Gazer

It seems I was over-thinking this problem; there is an easy solution. I was considering either filtering and subsetting the array, or using a dict index[(i,j)] = row. Filtering and subsetting was slow (O(n) per search), while using a dict was fast (O(1) access time), but creating the dict was slow and memory-intensive.

The simple solution for this problem is using nested dicts.

index = {}

for row in range(arr.shape[0]):
    i,j = arr[row, :]
    try:
        index[i][j] = row
    except KeyError:
        index[i] = {}
        index[i][j] = row

def get(index, i, j):
    try:
        return index[i][j]
    except KeyError:
        return -1

Alternatively, instead of a plain dict at the top level, I could use index = defaultdict(dict), which would allow assigning index[i][j] = row directly, without the try ... except clause. However, the defaultdict(dict) object would then create an empty {} whenever get(index, i, j) is queried for a nonexistent i, so it would expand the index unnecessarily.
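
A small sketch of that variant: if the index is read with dict.get() instead of [] indexing, the defaultdict never gets a chance to insert an empty sub-dict, because .get() does not trigger the default factory:

from collections import defaultdict

index = defaultdict(dict)
for row in range(arr.shape[0]):
    i, j = arr[row, :]
    index[i][j] = row            # no try/except needed while building

def get(index, i, j):
    # .get() does not create an empty sub-dict for a missing i
    return index.get(i, {}).get(j, -1)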

The access time is O(1) for the outer dict and O(1) for the nested dicts, so overall it's O(1). The top-level dict has a manageable size (at most n+1 keys, far fewer than (n+1)*(k+1)), while the nested dicts are small (the nesting order is chosen because in my case k << n). Building the nested dict is also very fast, even for >90M rows in the array. Moreover, it can easily be extended to more complicated cases.
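
With the example array from the question, this reproduces the expected results:

get(index, 2, 3)    # -> 4
get(index, 4, 100)  # -> -1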

Tim