
The quick problem

I would like to compare specific dtype fields from two numpy structured arrays that are guaranteed to have the same dtype. The fields being compared need to be able to differ each time a function is called, based on the given inputs (i.e. I can't easily hard-code the comparisons for each individual field).

The long problem with examples

I am trying to compare specific fields from two numpy structured arrays with the same dtype. For instance, say we have

import numpy as np
from io import BytesIO

a = np.genfromtxt(BytesIO('12 23 0|23.2|17.9|0\n12 23 1|13.4|16.9|0'.encode()),dtype=[('id','U7'),('pos',[('x',float),('y',float)]),('flag','U1')],delimiter='|')

b = np.genfromtxt(BytesIO(' |23.0|17.91|0'.encode()),dtype=[('id','U7'),('pos',[('x',float),('y',float)]),('flag','U1')],delimiter='|')

which gives

In[156]: a
Out[154]: 
array([('12 23 0', (23.2, 17.9), '0'), ('12 23 1', (13.4, 16.9), '0')], 
      dtype=[('id', '<U7'), ('pos', [('x', '<f8'), ('y', '<f8')]), ('flag', '<U1')])

and

In[153]: b
Out[151]: 
array([('', (23.0, 17.91), '0')], 
      dtype=[('id', '<U7'), ('pos', [('x', '<f8'), ('y', '<f8')]), ('flag', '<U1')])

Now let's say that I want to find any entries in a whose a['pos']['x'] field is greater than the b['pos']['x'] field and return these entries in a new numpy array. Something like this would work:

newArr = a[a["pos"]["x"]>b["pos"]["x"]]

Now imagine we want to keep only entries in a where both the x and y fields are greater than their counterparts in b. This is fairly simple as we could again do

newArr = a[np.array([a['pos']['x']>b['pos']['x'], a['pos']['y']>b['pos']['y']]).all(axis=0)]

which returns an empty array, which is the correct answer.
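For a fixed pair of fields, the same selection can be written by combining the boolean arrays with `&`. This is only a minimal sketch; the arrays are rebuilt here with np.array (rather than genfromtxt) so that the snippet stands on its own, and the names `both_greater` and `new_arr` are just illustrative.

```python
import numpy as np

dt = [('id', 'U7'), ('pos', [('x', float), ('y', float)]), ('flag', 'U1')]
a = np.array([('12 23 0', (23.2, 17.9), '0'),
              ('12 23 1', (13.4, 16.9), '0')], dtype=dt)
b = np.array([('', (23.0, 17.91), '0')], dtype=dt)

# Each comparison broadcasts b's single row against every row of a.
both_greater = (a['pos']['x'] > b['pos']['x']) & (a['pos']['y'] > b['pos']['y'])

# Boolean indexing keeps only the rows where both comparisons hold.
new_arr = a[both_greater]
```

With the sample data no row passes both tests, so `new_arr` is empty.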

Now, however, imagine that we have a very complicated dtype for these arrays (say with 34 fields; see here for an example of the dtype I'm working with) and we want to be able to compare any of them, but likely not all of them (similar to the previous example, but with more dtype fields overall and more of them to compare). Further, what if the fields we want to compare can change from run to run, so we can't really hard-code the comparison the way I did above? That is the problem I am trying to solve.

My current (unfinished) attempts at solutions

Using masked arrays

My first thought for solving this problem was to use masked arrays to select the dtype fields that we want to compare. Something like this (assuming we can make all our comparisons the same):

mask = np.ones(a.shape,dtype=[('id',bool),('pos',[('x',bool),('y',bool)]),('flag',bool)])
# unmask the x and y fields so we can compare them 
mask['pos']['x']=0
mask['pos']['y']=0

maskedA = np.ma.masked_array(a, mask=mask)
# We need to do this or the masked array gets angry (at least in python 3)
b.shape = (1,)

maskedB = np.ma.masked_array(b, mask=mask)

Now I would want to do something like

test = (maskedA>maskedB).any(axis=1)

but this doesn't work because you can't compare structured arrays like this:

TypeError: unorderable types: MaskedArray() > MaskedArray()

I've also tried compressing the masked arrays

test = (maskedA.compressed()>maskedB.compressed()).any(axis=1)

which results in a different error

TypeError: ufunc 'logical_not' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Now, I realize that the above errors are likely because I don't fully understand how structured and masked arrays work, but that is partly why I am asking this question. Is there any way to do something like this using masked arrays?

The solution I just thought of that will probably work and is probably better overall...

So the other option that I just thought of while writing this up is to just do the comparisons when I would be parsing the user's input to form array b anyway. It would really just be adding a couple of lines to each conditional in the parser to do the comparison and tack the results into a numpy boolean array that I could then use to extract the proper entries from a. Now that I think about it this is probably the way to go.
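A rough sketch of that idea, under the assumption that the parser produces one (field path, comparison, value) triple per condition. The `criteria` list and its thresholds are hypothetical stand-ins for the parsed user input, not anything taken from my real parser.

```python
import operator
import numpy as np

dt = [('id', 'U7'), ('pos', [('x', float), ('y', float)]), ('flag', 'U1')]
a = np.array([('12 23 0', (23.2, 17.9), '0'),
              ('12 23 1', (13.4, 16.9), '0')], dtype=dt)

# Hypothetical parser output: (path of field names, comparison function, value).
criteria = [
    (('pos', 'x'), operator.gt, 13.0),
    (('pos', 'y'), operator.lt, 17.0),
]

# Start with every row selected and narrow the selection one condition at a time.
keep = np.ones(a.shape, dtype=bool)
for path, compare, value in criteria:
    column = a
    for name in path:          # walk down through the nested field names
        column = column[name]
    keep &= compare(column, value)

selected = a[keep]
```

Because each condition carries its own operator, different fields can use different comparisons, and the set of fields can change from run to run without touching the loop.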

The conclusion to my long and rambling problem.

Although I think I have found a solution to this problem, I am still going to post this question, at least for a little while, to see (a) if anyone has ideas about how to do logical comparisons with structured/masked numpy arrays, because I think it would be a useful thing to know, and (b) if anyone has a better idea than what I came up with. Note that you can very easily form a MWE by copying, line by line, the snippets in the "The long problem with examples" section, so I don't see any reason to take up more space by repeating them.

Andrew

2 Answers

To apply a comparison to a sequence of columns, you must use a Python loop. The loop can come in the form of a list comprehension for example:

In [87]: np.all([a['pos'][key] > b['pos'][key] for key in a['pos'].dtype.names], axis=0)
Out[87]: array([False, False], dtype=bool)

This computes a['pos'][key] > b['pos'][key] for every field in a['pos'], and then reduces the arrays using np.all along the 0-axis.

If you wish to apply the comparison to some list of fields, you could of course replace a['pos'].dtype.names with that list.
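The same loop can be extended to nested dtypes by describing each field as a path of names. The following is only a sketch: `get_field` and `all_greater` are hypothetical helper names, and the sample arrays are rebuilt with np.array so the snippet runs on its own.

```python
from functools import reduce
import numpy as np

dt = [('id', 'U7'), ('pos', [('x', float), ('y', float)]), ('flag', 'U1')]
a = np.array([('12 23 0', (23.2, 17.9), '0'),
              ('12 23 1', (13.4, 16.9), '0')], dtype=dt)
b = np.array([('', (23.0, 17.91), '0')], dtype=dt)

def get_field(arr, path):
    """Index into a (possibly nested) structured array, one field name at a time."""
    return reduce(lambda sub, name: sub[name], path, arr)

def all_greater(a, b, paths):
    """Boolean array: True where every listed field of a exceeds that field of b."""
    return np.all([get_field(a, p) > get_field(b, p) for p in paths], axis=0)

# Each path lists the field names from the outermost level inwards.
mask_xy = all_greater(a, b, [('pos', 'x'), ('pos', 'y')])
mask_x = all_greater(a, b, [('pos', 'x')])
```

A path can be as long as the nesting requires, so a three-level dtype would simply use three-element tuples.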

unutbu
  • Thanks. I thought of that as well; however, I don't know whether it can be made to work when you have a dtype with nested levels, i.e. `dtype= [('firstGroup',[('homework',int),('classwork',int)]),('roomsOfTheHouse',[('bathroom',[('sink',str),('tub',str)]),('kitchen',[('floor',str),('counter',str)])])]` – Andrew Dec 22 '15 at 20:30

I've answered a lot of structured array questions, and some masked array ones, but have never explored their combination. Masking has been a part of numpy for a long time. Structured arrays are newer. It's unclear whether developers ever put special effort into making the two work together. I'd have to look at the code in /usr/lib/python3/dist-packages/numpy/ma/core.py.

But it is clear that functionality across fields is limited.

You can 'view' a subset of the fields:

In [116]: a['pos'][['y','x']]
Out[116]: 
array([(17.9, 23.2), (16.9, 13.4)], 
      dtype=[('y', '<f8'), ('x', '<f8')])

but you can't set several fields at once:

In [117]: a['pos'][['y','x']]=0
...
IndexError: unsupported iterator index

and comparisons (and probably other operations) with those column views are not implemented.

In [123]: a['pos'][['y','x']]>b['pos'][['y','x']]
...
TypeError: unorderable types: numpy.ndarray() > numpy.ndarray()

unutbu has already suggested the iterative approach:

In [127]: [a['pos'][name]>b['pos'][name] for name in ['x','y']]
Out[127]: [array([ True, False], dtype=bool), array([False, False], dtype=bool)]

Iterating over the names of a dtype is quite common when dealing with structured arrays. The recarray functions that copy arrays do this sort of field-by-field copy (recursively if needed). genfromtxt probably does some sort of name iteration when it converts your flat list of inputs to a nested set of tuples that match the dtype.

It might help to convert deeply nested levels to arrays. For example I could convert the ('x','y') to a (2,) array:

In [141]: a1=np.array([('12 23 0', (23.2, 17.9), '0'), ('12 23 1', (13.4, 16.9), '0')], 
      dtype=[('id', '<U7'), ('pos', '<f8',(2,)), ('flag', '<U1')])
In [142]: b1=np.array([('', (23.0, 17.91), '0')], dtype=a1.dtype)
In [143]: a1['pos']>b1['pos']
Out[143]: 
array([[ True, False],
       [False, False]], dtype=bool)
In [145]: a1['pos']
Out[145]: 
array([[ 23.2,  17.9],
       [ 13.4,  16.9]])

I can do the same comparison with the original a by converting it to a numeric array, using a copy, a view and a reshape. copy puts the desired data elements together in a contiguous buffer; view changes the dtype (without changing the data buffer).

In [150]: a['pos'].copy().view(float)
Out[150]: array([ 23.2,  17.9,  13.4,  16.9])

In [153]: a['pos'].copy().view(float).reshape(-1,2)>b['pos'].copy().view(float)
Out[153]: 
array([[ True, False],
       [False, False]], dtype=bool)
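As a possible alternative to the copy/view/reshape steps, more recent NumPy releases (1.16 and later, as far as I know) provide numpy.lib.recfunctions.structured_to_unstructured, which does the conversion in one call. This is a sketch on the same sample data, assuming all the selected fields share a compatible numeric type.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

dt = [('id', 'U7'), ('pos', [('x', float), ('y', float)]), ('flag', 'U1')]
a = np.array([('12 23 0', (23.2, 17.9), '0'),
              ('12 23 1', (13.4, 16.9), '0')], dtype=dt)
b = np.array([('', (23.0, 17.91), '0')], dtype=dt)

# The fields of 'pos' become the last axis of a plain float array.
a_pos = structured_to_unstructured(a['pos'])   # shape (2, 2)
b_pos = structured_to_unstructured(b['pos'])   # shape (1, 2)

# Ordinary broadcasting now compares x with x and y with y for every row.
result = a_pos > b_pos
```

The result matches the array shown above: only the x value of the first row exceeds its counterpart in b.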
hpaulj
  • I'm accepting this answer because it does a great job of considering the topic in depth (including discussing the other great answer from unutbu) and also explaining why things are the way they are. – Andrew Dec 28 '15 at 14:00