13

I have a numpy array of dtype object. I want to find the columns with numerical values and cast them to float. I also want to find the indices of the columns with object values. This is my attempt:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A' : [1,2,3,4,5],'B' : ['A', 'A', 'C', 'D','B']})
X = df.values.copy()
obj_ind = []
for ind in range(X.shape[1]):
    try:
        X[:,ind] = X[:,ind].astype(np.float32)
    except:
        obj_ind = np.append(obj_ind,ind)

print(obj_ind)

print(X.dtype)

and this is the output I get:

[ 1.]
object
EdChum
MAS
  • It's unclear what you're expecting here, your output shows that the second column could not be cast to float and that the dtype is `object` which is correct as this is a `str` dtype, if you wanted the column name then you return `obj_ind = np.append(obj_ind,x.columns[ind])` – EdChum Aug 25 '15 at 15:10
  • I want to convert my first columns to type float @EdChum – MAS Aug 25 '15 at 15:53
  • the elements of `numpy` arrays can't have different `dtypes`. You might need a structured array instead – tmdavison Aug 25 '15 at 16:22
  • Does this answer your question? [Converting numpy dtypes to native python types](https://stackoverflow.com/questions/9452775/converting-numpy-dtypes-to-native-python-types) – Trilarion May 15 '20 at 15:27

3 Answers

17

Generally your idea of trying to apply astype to each column is fine.

In [590]: X[:,0].astype(int)
Out[590]: array([1, 2, 3, 4, 5])

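But you have to collect the results in a separate list. You can't just put them back in X, because X has dtype object and anything assigned into it is stored as an object again. The list can then be stacked into a single array.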

In [601]: numlist=[]; obj_ind=[]

In [602]: for ind in range(X.shape[1]):
   .....:     try:
   .....:         x = X[:,ind].astype(np.float32)
   .....:         numlist.append(x)
   .....:     except (ValueError, TypeError):
   .....:         obj_ind.append(ind)

In [603]: numlist
Out[603]: [array([ 1.,  2.,  3.,  4.,  5.], dtype=float32)]

In [604]: np.column_stack(numlist)
Out[604]:
array([[ 1.],
       [ 2.],
       [ 3.],
       [ 4.],
       [ 5.]], dtype=float32)

In [606]: obj_ind
Out[606]: [1]
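
To see why the converted columns cannot simply be written back, here is a minimal self-contained sketch. It rebuilds the question's array by hand (assumed equivalent to df.values), and the helper name split_numeric_columns is hypothetical, not something from the question or from numpy.

```python
import numpy as np

# Rebuild the question's object array directly (assumed equivalent to df.values).
X = np.array([[1, 'A'], [2, 'A'], [3, 'C'], [4, 'D'], [5, 'B']], dtype=object)

# Writing a float32 column back into X does not change the container:
# the values are stored as object elements and X.dtype stays object.
X_copy = X.copy()
X_copy[:, 0] = X_copy[:, 0].astype(np.float32)
print(X_copy.dtype)  # object


def split_numeric_columns(arr):
    """Return (float32 array of the castable columns, indices of the rest)."""
    numeric, obj_ind = [], []
    for ind in range(arr.shape[1]):
        try:
            numeric.append(arr[:, ind].astype(np.float32))
        except (ValueError, TypeError):
            obj_ind.append(ind)
    # column_stack needs at least one array, so handle the all-object case.
    if numeric:
        stacked = np.column_stack(numeric)
    else:
        stacked = np.empty((arr.shape[0], 0), dtype=np.float32)
    return stacked, obj_ind


numeric, obj_ind = split_numeric_columns(X)
print(numeric.dtype, numeric.shape)  # float32 (5, 1)
print(obj_ind)                       # [1]
```

Because obj_ind is a plain list of ints, it can be used directly for indexing, e.g. X[:, obj_ind], which the float array produced by np.append in the question cannot.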

X is a numpy array with dtype object:

In [582]: X
Out[582]: 
array([[1, 'A'],
       [2, 'A'],
       [3, 'C'],
       [4, 'D'],
       [5, 'B']], dtype=object)

You could use the same conversion logic to create a structured array with a mix of int and object fields.

In [616]: ytype=[]

In [617]: for ind in range(X.shape[1]):
   .....:     try:
   .....:         x = X[:,ind].astype(np.float32)
   .....:         ytype.append('i4')
   .....:     except (ValueError, TypeError):
   .....:         ytype.append('O')

In [618]: ytype
Out[618]: ['i4', 'O']

In [620]: Y=np.zeros(X.shape[0],dtype=','.join(ytype))

In [621]: for i in range(X.shape[1]):
   .....:     Y[Y.dtype.names[i]] = X[:,i]

In [622]: Y
Out[622]:
array([(1, 'A'), (2, 'A'), (3, 'C'), (4, 'D'), (5, 'B')],
      dtype=[('f0', '<i4'), ('f1', 'O')])

Y['f0'] gives the numeric field.
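
The same idea as a standalone script, as a sketch only. It varies the session above in two assumed ways: the numeric fields are float32 rather than 'i4' (since the question asks for floats), and the dtype is built from explicit (name, type) pairs instead of a joined string. The field names f0, f1 mirror numpy's defaults.

```python
import numpy as np

# Rebuild the question's object array directly (assumed equivalent to df.values).
X = np.array([[1, 'A'], [2, 'A'], [3, 'C'], [4, 'D'], [5, 'B']], dtype=object)

# Choose a field type per column: float32 where the cast succeeds, object otherwise.
fields = []
for ind in range(X.shape[1]):
    try:
        X[:, ind].astype(np.float32)
        fields.append(('f%d' % ind, np.float32))
    except (ValueError, TypeError):
        fields.append(('f%d' % ind, object))

# One record per row; each column of X is copied into its own field.
Y = np.zeros(X.shape[0], dtype=fields)
for ind, name in enumerate(Y.dtype.names):
    Y[name] = X[:, ind]

# Fields whose dtype kind is 'f' are the float ones.
numeric_names = [n for n in Y.dtype.names if Y.dtype[n].kind == 'f']
print(numeric_names)    # ['f0']
print(Y['f0'].dtype)    # float32
```

Each field keeps its own dtype, so Y['f0'] is a genuine float32 array while Y['f1'] still holds the original Python strings.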

hpaulj
1

df.dtypes returns a pandas Series, which you can operate on further:

# find columns of integer dtype
mask = df.dtypes == int
# select the names of those columns
cols = df.dtypes[mask].index
# convert the selected columns to float (column-wise, no row-by-row apply needed)
new_cols_df = df[cols].astype(float)
# replace these columns in the original df
df[new_cols_df.columns] = new_cols_df
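
If only the object array is available rather than the DataFrame, a similar dtype-based selection can be sketched by wrapping the array in a temporary frame. This is an assumption-laden sketch, not part of the answer above: it relies on DataFrame.infer_objects and DataFrame.select_dtypes, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the question's object array directly (assumed equivalent to df.values).
X = np.array([[1, 'A'], [2, 'A'], [3, 'C'], [4, 'D'], [5, 'B']], dtype=object)

# Wrap the array in a frame and let pandas infer a better dtype per column.
frame = pd.DataFrame(X).infer_objects()

# Column labels are the positional indices 0, 1, ... of X.
num_cols = frame.select_dtypes(include='number').columns.tolist()
obj_cols = [c for c in frame.columns if c not in num_cols]

# Numeric part as a plain float32 numpy array.
X_num = frame[num_cols].to_numpy(dtype=np.float32)

print(num_cols, obj_cols)      # [0] [1]
print(X_num.dtype, X_num.shape)  # float32 (5, 1)
```

Note that infer_objects only performs a soft conversion: a column of numeric strings such as '1' stays non-numeric here, whereas astype(float) would parse it.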
shanmuga
  • what I have posted is a minimal working example. In my full code I will not have access to df. Only to X. @shanmuga – MAS Aug 25 '15 at 16:05
1

I think this might help

obj = []  # names of the columns that could not be cast to float

def func(x):
    try:
        return x.astype(float)
    except (ValueError, TypeError):
        # x.name is the current index value,
        # which is the column name in this case
        obj.append(x.name)
        return x

new_df = df.apply(func, axis=0)

This leaves the object columns unchanged and records their names in obj, which you can use later.

Note: When working with a pandas.DataFrame, avoid explicit Python loops where you can; they are usually slower than performing the same operation with apply or with a column-wise method such as astype.

shanmuga