
dataset is a pandas DataFrame, and this is how I use sklearn.cluster.KMeans:

 km = KMeans(n_clusters=n_Clusters)

 km.fit(dataset)

 prediction = km.predict(dataset)

This is how I decide which entity belongs to which cluster:

 for i in range(len(prediction)):
     cluster_fit_dict[dataset.index[i]] = prediction[i]

This is how dataset looks:

 A 1 2 3 4 5 6
 B 2 3 4 5 6 7
 C 1 4 2 7 8 1
 ...

where A, B, C are the index labels.

Is this the correct way of using k-means?

Jason Sundram
Dark Knight
  • Your question is a little unclear. sklearn generally accepts numpy arrays as input, so pandas dataframes are compatible; in certain cases I have found that you need to ask for a numpy array back, e.g. `df.values` or `df.col.values`. So basically it should work; please try, and if you hit a snag come back with code and data – EdChum Jan 19 '15 at 07:57

2 Answers


Assuming all the values in the dataframe are numeric:

import pandas
import sklearn.cluster

# Convert DataFrame to a numpy matrix
mat = dataset.values
# Fit k-means with sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame
results = pandas.DataFrame([dataset.index, labels]).T

Alternatively, you could try KMeans++ for Pandas.

user666
  • Note that a much better way to create the results is `results = pd.DataFrame(data=labels, columns=['cluster'], index=collapsed.index)`, which removes the need for the transpose and adds proper indexing and labels – FooBar Mar 07 '16 at 14:53
  • @FooBar what is collapsed / collapsed.index? – 3pitt Dec 19 '17 at 19:36
  • dataset.as_matrix() is deprecated, use DataFrame.values instead https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html – mjimcua Aug 22 '18 at 08:57

To check whether your dataframe dataset has suitable content, you can explicitly convert it to a numpy array:

dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)

If the array has a homogeneous numerical dtype (typically numpy.float64), then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, for instance with sklearn.preprocessing.StandardScaler.
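That normalization step could be sketched like this (the toy data, column names, and k-means parameters here are made up for illustration, not taken from the question):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy frame whose columns live on very different scales
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.random((10, 3)) * [1, 100, 10000],
                       columns=["a", "b", "c"], index=list("ABCDEFGHIJ"))

# Standardize each column to zero mean and unit variance before clustering,
# so that the large-scale column does not dominate the distance computation
scaled = StandardScaler().fit_transform(dataset.values)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
print(dict(zip(dataset.index, km.labels_)))
```

Without the scaling, the column with the largest magnitude would effectively decide all cluster assignments on its own.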

If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
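A hedged sketch of that preprocessing, using pandas.get_dummies on a hypothetical heterogeneous frame (the column names and values are invented for this example):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical frame: "id" is a sample identifier, "color" is categorical
df = pd.DataFrame({
    "id":    ["A", "B", "C", "D"],
    "size":  [1.0, 2.0, 3.0, 4.0],
    "color": ["red", "blue", "red", "blue"],
})

# Drop the identifier column and one-hot encode the categorical column,
# leaving a purely numeric feature matrix
features = pd.get_dummies(df.drop(columns="id"), columns=["color"], dtype=float)
print(features.values.dtype)  # a numeric dtype, suitable for scikit-learn
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features.values)
```

Here `df.values` on the original frame would have dtype object because of the string columns, whereas `features.values` is a plain float array that KMeans accepts.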

ogrisel