
I have a big array of data:

>>> len(b)
6636849
>>> print(b)
[['60D19E9E-4E2C-11E2-AA9A-52540027E502' '100015361']
 ['60D19EB6-4E2C-11E2-AA9A-52540027E502' '100015385']
 ['60D19ECE-4E2C-11E2-AA9A-52540027E502' '100015409']
 ..., 
 ['8CC90633-447E-11E6-B010-005056A76B49' '106636785']
 ['F8C74244-447E-11E6-B010-005056A76B49' '106636809']
 ['F8C7425C-447E-11E6-B010-005056A76B49' '106636833']]

I need to get a filtered dataset, i.e. everything whose second string contains (or starts with) '106'. Something like the following code, but with a substring test instead of a comparison:

>>> len(b[b[:,1] > '10660600'])
30850
user1464922

3 Answers


I don't think numpy is well suited to this type of operation; you can do it simply with basic Python operations. Here it is with some sample data a:

import random  # for the test data

# build 10000 [letters, digits] rows of sample data
a = []
for i in range(10000):
    a.append(["".join(random.sample('abcdefg', 3)),
              "".join(random.sample('01234567890', 8))])

# keep rows whose second string contains '106' anywhere
answer = [i for i in a if i[1].find('106') != -1]

Keep in mind that startswith is going to be a lot faster than find, because find has to look for a matching substring at every position, while startswith only checks the beginning of the string (see the sketch below).
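
For reference, here's a minimal sketch of the prefix-only variant, using the same sample list a from above (answer_prefix is just an illustrative name):

# prefix match only: checks position 0 instead of scanning the whole string
answer_prefix = [i for i in a if i[1].startswith('106')]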

It's not too clear why you need to do this with such a large list/array in the first place; there may be a better solution that avoids including these values in the list at all.

Benjamin
  • Cool, indeed: len(b[np.char.startswith(b[:,1], "1001")]) – user1464922 Aug 26 '16 at 09:50
  • Actually, I want to create my own multidimensional cube with good querying performance, just for fun and better understanding. Something like an on-disk MOLAP engine. I may be moving in the wrong direction, but these are my very first steps in this realm. – user1464922 Aug 26 '16 at 09:57

Here's a simple pandas solution:

import pandas as pd

# b is the original array from the question
df = pd.DataFrame(b, columns=['1st String', '2nd String'])
df_filtered = df[df['2nd String'].str.contains('106')]

This gives you

In [29]: df_filtered
Out[29]: 
                             1st String 2nd String
3  8CC90633-447E-11E6-B010-005056A76B49  106636785
4  F8C74244-447E-11E6-B010-005056A76B49  106636809
5  F8C7425C-447E-11E6-B010-005056A76B49  106636833
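
Since the question also mentions matching a prefix specifically, a startswith variant is a one-line change; a small sketch on the same DataFrame:

# prefix match instead of an anywhere-substring search
df_filtered = df[df['2nd String'].str.startswith('106')]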

Update: Timing Results

Using Benjamin's list a as the test sample:

In [20]: %timeit [i for i in a if i[1].find('106') != -1]
100 loops, best of 3: 2.2 ms per loop

In [21]: %timeit df[df['2nd String'].str.contains('106')]
100 loops, best of 3: 5.94 ms per loop

So it looks like Benjamin's answer is actually about 3x faster. This surprises me since I was under the impression that the operation in pandas is vectorized; as far as I can tell, the .str methods actually loop over the column at the Python level rather than running compiled vectorized code, which would explain the overhead. Moreover, the speed ratio does not change when a is 100 times longer.

lanery

Look at the functions in the np.char submodule:

import numpy as np

data = [['60D19E9E-4E2C-11E2-AA9A-52540027E502', '100015361'],
        ['60D19EB6-4E2C-11E2-AA9A-52540027E502', '100015385'],
        ['60D19ECE-4E2C-11E2-AA9A-52540027E502', '100015409'],
        ['8CC90633-447E-11E6-B010-005056A76B49', '106636785'],
        ['F8C74244-447E-11E6-B010-005056A76B49', '106636809'],
        ['F8C7425C-447E-11E6-B010-005056A76B49', '106636833']]

# pull out the second column as an array of strings
col = np.array([r[1] for r in data], str)
idx = np.char.startswith(col, '106')
print(idx)  # [False False False  True  True  True]
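
To mirror the indexing in the question, the boolean mask can be applied back to the full 2-D array; a small sketch, assuming b is a numpy string array shaped like the one the question prints:

b = np.array(data)                        # stand-in for the question's array
mask = np.char.startswith(b[:, 1], '106')
print(len(b[mask]))                       # 3 rows match in this sample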
Eelco Hoogendoorn