
I have a big array of data:

>>> len(b)
6636849
>>> print(b)
[['60D19E9E-4E2C-11E2-AA9A-52540027E502' '100015361']
 ['60D19EB6-4E2C-11E2-AA9A-52540027E502' '100015385']
 ['60D19ECE-4E2C-11E2-AA9A-52540027E502' '100015409']
 ..., 
 ['8CC90633-447E-11E6-B010-005056A76B49' '106636785']
 ['F8C74244-447E-11E6-B010-005056A76B49' '106636809']
 ['F8C7425C-447E-11E6-B010-005056A76B49' '106636833']]

I need to get a filtered dataset, i.e. everything whose second string contains (or starts with) '106'. Something like the following code, but with a substring test instead of a comparison:

>>> len(b[b[:,1] > '10660600'])
30850
user1464922

3 Answers


I don't think numpy is well suited to this type of operation; you can do it simply with basic Python operations. Here it is with some sample data a:

import random  # for the test data

# build 10000 [letters, digits] rows of sample data
a = []
for i in range(10000):
    a.append(["".join(random.sample('abcdefg', 3)),
              "".join(random.sample('01234567890', 8))])

# keep rows whose second string contains '106' anywhere
answer = [i for i in a if i[1].find('106') != -1]

Keep in mind that startswith is going to be a lot faster than find, because find has to look for a matching substring at every position, while startswith only checks the beginning of the string (see the sketch below).
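
For reference, here's a minimal sketch of the prefix-only variant, using the same sample list a from above (answer_prefix is just an illustrative name):

# prefix match only: checks position 0 instead of scanning the whole string
answer_prefix = [i for i in a if i[1].startswith('106')]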

It's not too clear why you need to do this with such a large list/array in the first place; there may be a better solution that avoids including these values in the list at all.

Benjamin
  • Cool, indeed: len(b[np.char.startswith(b[:,1], "1001")]) – user1464922 Aug 26 '16 at 09:50
  • Actually, I want to create my own multidimensional cube with good querying performance, just for fun and better understanding. Something like an on-disk MOLAP engine. I may be moving in the wrong direction, but these are my very first steps in this realm. – user1464922 Aug 26 '16 at 09:57

Here's a simple pandas solution:

import pandas as pd

# b is the original array from the question
df = pd.DataFrame(b, columns=['1st String', '2nd String'])
df_filtered = df[df['2nd String'].str.contains('106')]

This gives you

In [29]: df_filtered
Out[29]: 
                             1st String 2nd String
3  8CC90633-447E-11E6-B010-005056A76B49  106636785
4  F8C74244-447E-11E6-B010-005056A76B49  106636809
5  F8C7425C-447E-11E6-B010-005056A76B49  106636833
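
Since the question also mentions matching a prefix specifically, a startswith variant is a one-line change; a small sketch on the same DataFrame:

# prefix match instead of an anywhere-substring search
df_filtered = df[df['2nd String'].str.startswith('106')]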

Update: Timing Results

Using Benjamin's list a as the test sample:

In [20]: %timeit [i for i in a if i[1].find('106') != -1]
100 loops, best of 3: 2.2 ms per loop

In [21]: %timeit df[df['2nd String'].str.contains('106')]
100 loops, best of 3: 5.94 ms per loop

So it looks like Benjamin's answer is actually about 3x faster. This surprises me since I was under the impression that the operation in pandas is vectorized; as far as I can tell, the .str methods actually loop over the column at the Python level rather than running compiled vectorized code, which would explain the overhead. Moreover, the speed ratio does not change when a is 100 times longer.

lanery

Look at the functions in the np.char submodule:

import numpy as np

data = [['60D19E9E-4E2C-11E2-AA9A-52540027E502', '100015361'],
        ['60D19EB6-4E2C-11E2-AA9A-52540027E502', '100015385'],
        ['60D19ECE-4E2C-11E2-AA9A-52540027E502', '100015409'],
        ['8CC90633-447E-11E6-B010-005056A76B49', '106636785'],
        ['F8C74244-447E-11E6-B010-005056A76B49', '106636809'],
        ['F8C7425C-447E-11E6-B010-005056A76B49', '106636833']]

# pull out the second column as an array of strings
col = np.array([r[1] for r in data], str)
idx = np.char.startswith(col, '106')
print(idx)  # [False False False  True  True  True]
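
To mirror the indexing in the question, the boolean mask can be applied back to the full 2-D array; a small sketch, assuming b is a numpy string array shaped like the one the question prints:

b = np.array(data)                        # stand-in for the question's array
mask = np.char.startswith(b[:, 1], '106')
print(len(b[mask]))                       # 3 rows match in this sample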
Eelco Hoogendoorn