3

I'm trying to initialize a NumPy structured matrix of size (x,y) where the value of x is ~ 10^3 and y's value is ~ 10^6.

The first column of the matrix is an ID (integer), and the rest are triplets (int8), where each member of the triplet should have a different default value.

i.e. assuming the default values are [2,5,9] I'd like to initialize the following matrix:

0 2 5 9 2 5 9 2 5 9 ...
0 2 5 9 2 5 9 2 5 9 ...
0 2 5 9 2 5 9 2 5 9 ...
0 2 5 9 2 5 9 2 5 9 ...
...

The difference here (vs. this similar question) is that each column has its own unique name, which must be recorded.

The fastest way I could think of initializing the matrix is:

default_age       = 2
default_height    = 5
default_shoe_size = 9

columns = ["id", 
           "a_age", 
           "a_height", 
           "a_shoe_size", 
           "b_age", 
           "b_height", 
           "b_shoe_size",
           #...
           ]

y = len(columns)    
x = 10**4

import numpy

# generate matrix
mat = numpy.zeros(shape=x,
                  dtype={"names"   : columns,
                         "formats" : ['i'] + ['int8'] * (len(columns) - 1)})
# fill the triplets with default values
for i in range((y - 1) // 3):  # number of triplets; the "id" column is not part of one
    j = i * 3
    mat[mat.dtype.names[j+1]] = default_age
    mat[mat.dtype.names[j+2]] = default_height
    mat[mat.dtype.names[j+3]] = default_shoe_size

What is the fastest way to initialize such a matrix?

Thanks!

NStiner
  • Is there a reason you'd rather not just use [`pandas`](http://pandas.pydata.org) dataframes? – jme May 02 '15 at 19:16
  • Something is fishy here. You are creating a 2-d array (with shape `(x, len(columns))`), and each element of this array is itself a structure with `len(columns)` fields. Are you sure that is what you intended? (My guess is that you really want a *one-dimensional* structured array.) – Warren Weckesser May 02 '15 at 19:21
  • While I haven't digested your structure description, my experience is that copying data to a structured array, field by field, is generally the fastest way to go. Either that or make a list of all the necessary tuples. – hpaulj May 02 '15 at 19:31
  • @Warren Weckesser You're right- I did mean to create a 1D structured array, edited the question to reflect that. Thanks! – NStiner May 02 '15 at 19:37
  • @hpaulj The matrix has to be structured. It is also huge, and would consist mainly of default values, so I'd like to fill it with those as fast as possible. – NStiner May 02 '15 at 19:44
  • Before worrying about 'fastest', you should give us a working example. You don't specify `x` or `y`, and your `mat[:,i+1]` indexing will not work with a structured array. – hpaulj May 03 '15 at 00:20
  • @hpaulj Thanks, I corrected the code so it now works. – NStiner May 04 '15 at 10:57

4 Answers

3

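This is my tweak of your sample, adjusted so it runs. Note that I iterate over the columns by field name:

import numpy as np

# `columns` and the default_* values are as defined in the question
dt=np.dtype({"names": columns, "formats" : ['i'] + ['int8'] * (len(columns) - 1)})
mat=np.zeros((10,),dtype=dt)
# i is the index of the first field of each triplet; field 0 ("id") is skipped
for i in range(1,len(dt.names),3):
    mat[dt.names[i]]=default_age
    mat[dt.names[i+1]]=default_height
    mat[dt.names[i+2]]=default_shoe_size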

producing

array([(0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9)], 
      dtype=[('id', '<i4'), ('a_age', 'i1'), ('a_height', 'i1'), ('a_shoe_size', 'i1'), ('b_age', 'i1'), ('b_height', 'i1'), ('b_shoe_size', 'i1')])

As long as the number of fields is substantially smaller than the number of rows, I think this will be as fast as, or faster than, any other way, because each field is filled with a single vectorized assignment.
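A minimal sketch of the same field-by-field fill, generalized to any number of triplets by cycling through the defaults. The names `defaults` and `n_rows` are introduced here for illustration and are not part of the original code.

```python
import itertools
import numpy as np

columns = ["id", "a_age", "a_height", "a_shoe_size",
           "b_age", "b_height", "b_shoe_size"]
defaults = (2, 5, 9)  # age, height, shoe size (illustrative name)
n_rows = 10           # illustrative; the question uses a much larger value

dt = np.dtype({"names": columns,
               "formats": ["i4"] + ["i1"] * (len(columns) - 1)})
mat = np.zeros(n_rows, dtype=dt)

# One vectorized assignment per field: skip "id", then cycle through the defaults
for name, value in zip(dt.names[1:], itertools.cycle(defaults)):
    mat[name] = value
```

The Python loop runs once per field, not once per row, so its cost grows with the number of columns while each assignment is handled inside NumPy.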

In my sample x=(10,). Your mat[:,j+1] expression has not been corrected to handle a structured 1d array.

A structured array is probably not the best way to go if you have very many columns (fields) (compared to the number of rows).

If all of your fields are 'int', I'd use a regular 2d array. Structured arrays are most useful when fields have differing types of elements.


Here's a way of initializing a regular 2d array with these values, and optionally casting it to a structured array:

values=np.array([2,5,9])
x, y = 10, 2
mat1=np.repeat(np.repeat(values[None,:],y,0).reshape(1,3*y),x,0)

producing:

array([[2, 5, 9, 2, 5, 9],
       [2, 5, 9, 2, 5, 9],
       ...,
       [2, 5, 9, 2, 5, 9]])

Add the id column:

mat1=np.concatenate([np.zeros((x,1),int),mat1],1)

producing:

array([[0, 2, 5, 9, 2, 5, 9],
       [0, 2, 5, 9, 2, 5, 9],
       ...
       [0, 2, 5, 9, 2, 5, 9],
       [0, 2, 5, 9, 2, 5, 9]])

A new dtype - with all plain 'int':

dt1=np.dtype({"names"   : columns, "formats" : ['i'] + ['int'] * (len(columns) - 1)})
mat2=np.empty((x,),dtype=dt1)

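If done right, the data for mat1 should be the same size and byte order as for mat2, in which case I can 'copy' it (actually just change pointers). This requires every field of dt1 to have the same item size as the elements of mat1; the output below comes from a platform where both 'i' and 'int' are 4 bytes.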

mat2.data=mat1.data

mat2 looks just like the earlier mat, except the dtype is a little different (with i4 instead of i1 fields)

array([(0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9)], 
      dtype=[('id', '<i4'), ('a_age', '<i4'), ('a_height', '<i4'), ('a_shoe_size', '<i4'), ('b_age', '<i4'), ('b_height', '<i4'), ('b_shoe_size', '<i4')])
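Newer NumPy versions warn against assigning to `.data`, so as a hedged alternative the sketch below reinterprets the 2d array as a structured array with `view`, which also shares memory instead of copying. It assumes every field has the same fixed-width type (`i4` here); that choice, and the name `dt_view`, are made for this example and are not from the original.

```python
import numpy as np

columns = ["id", "a_age", "a_height", "a_shoe_size",
           "b_age", "b_height", "b_shoe_size"]
x = 10

# Plain 2d array with a fixed 4-byte integer type, one row per record
row = np.array([0, 2, 5, 9, 2, 5, 9], dtype=np.int32)
mat1 = np.tile(row, (x, 1))  # shape (x, 7), C-contiguous

# Structured dtype whose record is exactly as wide as one row of mat1
dt_view = np.dtype({"names": columns, "formats": ["i4"] * len(columns)})

# view() turns each 7-element row into one record: shape (x, 1), then (x,)
mat2 = mat1.view(dt_view).reshape(x)
```

Because `mat2` is a view, writing to `mat2["a_age"]` also changes column 1 of `mat1`.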

Another way to use mat1 values to initialize a structured array is with an intermediary list of tuples:

np.array([tuple(row) for row in mat1],dtype=dt)

producing:

array([(0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9)], 
      dtype=[('id', '<i4'), ('a_age', 'i1'), ('a_height', 'i1'), ('a_shoe_size', 'i1'), ('b_age', 'i1'), ('b_height', 'i1'), ('b_shoe_size', 'i1')])

I haven't run time tests, in part because I don't have an idea of what your x,y values are like.

Related: Convert structured array with various numeric data types to regular array

Alternatively, following the answer at https://stackoverflow.com/a/21818731/901925, the np.ndarray constructor can be used to create a new array from a preexisting data buffer. It still needs to use dt1, the all-integer dtype.

np.ndarray((x,), dt1, mat1)
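A small sketch of the constructor approach with the size check made explicit. It assumes a 2d `int32` array and an all-`i4` version of `dt1`, so that one record is exactly as wide as one row; these types are chosen for the example, not taken from the original.

```python
import numpy as np

columns = ["id", "a_age", "a_height", "a_shoe_size",
           "b_age", "b_height", "b_shoe_size"]
x = 10

mat1 = np.tile(np.array([0, 2, 5, 9, 2, 5, 9], dtype=np.int32), (x, 1))
dt1 = np.dtype({"names": columns, "formats": ["i4"] * len(columns)})

# One structured record must occupy exactly as many bytes as one row of mat1
assert dt1.itemsize == mat1.shape[1] * mat1.itemsize

# Build a structured array on top of mat1's existing buffer (no copy)
mat3 = np.ndarray((x,), dtype=dt1, buffer=mat1)
```

If the item sizes differ (for example an `int64` array paired with `i4` fields), the records would not line up with the rows, which is why the check comes first.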

See also: ndarray to structured_array and float to int, which has more on using view vs. astype for this conversion.

hpaulj
1

You can build a plain array with NumPy's tile and column_stack, then convert it with np.core.records.fromarrays (also exposed as np.rec.fromarrays):

import numpy as np

default_age       = 2
default_height    = 5
default_shoe_size = 9
n_rows = 10

columns = [
    "id", 
    "a_age", 
    "a_height", 
    "a_shoe_size", 
    "b_age", 
    "b_height", 
    "b_shoe_size",
    ]

# generate matrix
dtype = {
    "names": columns,
    "formats": ['i'] + ['int8'] * (len(columns) - 1)
    }

ids = np.zeros(n_rows)
people = np.tile([default_age, default_height, default_shoe_size], (n_rows,2))
data = np.column_stack((ids, people))

# np.rec is the public alias of np.core.records
mat = np.rec.fromarrays(list(data.T), dtype=dtype)

Which gives:

>>> mat
rec.array([(0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9)], 
      dtype=[('id', '<i4'), ('a_age', 'i1'), ('a_height', 'i1'), ('a_shoe_size', 'i1'), ('b_age', 'i1'), ('b_height', 'i1'), ('b_shoe_size', 'i1')])
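The intermediate `data` array above is float64, which is eight times wider than the int8 fields it ends up in. As a hedged variation, the sketch below builds the columns with their final integer types before calling `np.rec.fromarrays`; the names `defaults` and `n_triplets` are introduced for this example.

```python
import numpy as np

defaults = [2, 5, 9]  # age, height, shoe size (illustrative name)
n_rows = 10
columns = ["id", "a_age", "a_height", "a_shoe_size",
           "b_age", "b_height", "b_shoe_size"]

dtype = np.dtype({"names": columns,
                  "formats": ["i4"] + ["i1"] * (len(columns) - 1)})

# Number of (age, height, shoe_size) triplets, excluding the "id" column
n_triplets = (len(columns) - 1) // 3

# Build the triplet block directly as int8, and the ids as int32
people = np.tile(np.array(defaults, dtype=np.int8), (n_rows, n_triplets))
ids = np.zeros(n_rows, dtype=np.int32)

# One 1d array per field, in the same order as `columns`
mat = np.rec.fromarrays([ids] + list(people.T), dtype=dtype)
```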
jme
-1

You can fill in the default values with a for loop, for example if the defaults are stored in a dictionary keyed by column name:

default_values = {
    "a_age": 3,
    "a_height": 5,
}
for column, value in default_values.items():
    mat[column] = value
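With many columns, the dictionary itself can be generated from the field names. This is a minimal sketch that assumes the naming scheme from the question (a prefix followed by `_age`, `_height`, or `_shoe_size`); the name `suffix_defaults` is hypothetical.

```python
import numpy as np

columns = ["id", "a_age", "a_height", "a_shoe_size",
           "b_age", "b_height", "b_shoe_size"]
mat = np.zeros(10, dtype={"names": columns,
                          "formats": ["i4"] + ["i1"] * (len(columns) - 1)})

# Default per field suffix (values taken from the question)
suffix_defaults = {"_age": 2, "_height": 5, "_shoe_size": 9}

# Map every matching field name to its default; "id" matches no suffix
default_values = {
    name: value
    for name in mat.dtype.names
    for suffix, value in suffix_defaults.items()
    if name.endswith(suffix)
}

for column, value in default_values.items():
    mat[column] = value
```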
Daniel
-1

You could use an enum to represent the column names:

from enum import Enum

class Columns(Enum):
    id = 0
    a_age = 1
    a_height = 2
    a_shoe_size = 3
    b_age = 4
    b_height = 5
    b_shoe_size = 6
    ...

Then use the normal array-of-arrays initialization and access syntax, or whatever container you prefer. In place of a numeric column index you would use, for example, Columns.a_age (with a plain Enum, the integer index is Columns.a_age.value). For more information on enums, see "How can I represent an 'Enum' in Python?"
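A brief sketch of this idea with a plain 2d NumPy array. It swaps in `IntEnum` for `Enum`, since `IntEnum` members are integers and can be used directly as indices; the array contents, the `int8` type, and `n_rows` are illustrative assumptions.

```python
from enum import IntEnum

import numpy as np


class Columns(IntEnum):
    id = 0
    a_age = 1
    a_height = 2
    a_shoe_size = 3
    b_age = 4
    b_height = 5
    b_shoe_size = 6


n_rows = 10
# Every column, including "id", shares one dtype in a plain 2d array
mat = np.tile(np.array([0, 2, 5, 9, 2, 5, 9], dtype=np.int8), (n_rows, 1))

# IntEnum members behave as integers, so they work as column indices
mat[:, Columns.a_age] = 3
ages = mat[:, Columns.a_age]
```

The trade-off is that the column names live in the enum and not in the array's dtype, so they are not carried along when the array is saved or passed around on its own.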

Mitchell Carroll