K-Means on categorical data

Bibek Dhakal
4 min read · May 20, 2020
Source: https://www.reddit.com/r/datascience/comments/d6buto/kmeans_be_like_mine_mine_mine/

Just as supervised data is used for predictive modelling, unsupervised data is mostly used for grouping observations with similar features. Numerical data is easy to handle: it can be label-encoded or left as it is. But what about categorical data?

Well, categorical data is data that takes values from a fixed set of categories, for example Name, Food, Place or Group.

Let us walk through an example of handling categorical data and clustering it with the K-Means algorithm.

We have a hospital dataset with attributes like Age, Sex, Final Diagnosis and Place (where the patient comes from).
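The hospital data itself is not shared, so here is a small mock-up (all values invented) of what the frame is assumed to look like:

import pandas as pd

sample = pd.DataFrame({
    "Age": [34, 7, 58, 23, 45],
    "Sex": ["M", "F", "F", "M", "F"],
    "FinalDiagnosis": ["Pneumonia", "Asthma", "Diabetes", "Asthma", "Pneumonia"],
    "Place": ["Kathmandu", "Pokhara", "Kathmandu", "Butwal", "Pokhara"],
})
print(sample.head())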

Let us import the essential libraries first:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

After importing the libraries, we read the .csv file with pandas.

nch = pd.read_csv("final.csv")

After reading the CSV, we remove duplicate rows so that the same record does not appear twice in our dataset.

nch.drop_duplicates(keep=False, inplace=True)

After dropping the duplicates, let us label-encode the dataset for further processing. We need to convert the categorical columns to numerical ones, because K-Means computes distances and cannot work with raw categorical data.

Here we use scikit-learn's LabelEncoder to encode our data.

from sklearn.preprocessing import LabelEncoder #changing to numerical by label encoder
number = LabelEncoder()
nch["Sex"] = number.fit_transform(nch["Sex"].astype('str'))
nch["Place"] = number.fit_transform(nch["Place"].astype('str'))
nch["FinalDiagnosis"] = number.fit_transform(nch["FinalDiagnosis"].astype('str'))

Hurrah, we have converted the columns to numbers. It is still worth verifying that pandas treats them as numeric; if any column is not, we can coerce it with pd.to_numeric:

nch.Age = pd.to_numeric(nch.Age)  #change to numeric type data
nch.Sex = pd.to_numeric(nch.Sex)
nch.Place = pd.to_numeric(nch.Place)
nch.FinalDiagnosis = pd.to_numeric(nch.FinalDiagnosis)

Our dataset is now numeric, but we should not apply the algorithm yet. Label encoding imposes an artificial ordering: places encoded as 0, 1 and 2 make place 0 look twice as far from place 2 as from place 1, even though all three are simply different. K-Means would be misled by these distances, so we one-hot encode the categorical columns instead.

nch_onehot = nch.copy()
nch_onehot = pd.get_dummies(nch_onehot, columns = ['Sex','FinalDiagnosis','Place'], prefix = ['Sex','FinalDiagnosis','Place'] )
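A toy check of the distance argument (place codes invented): with label encoding some pairs of places look farther apart than others, while one-hot encoding keeps every distinct pair at the same Euclidean distance.

import numpy as np

# Label-encoded places: Butwal=0, Kathmandu=1, Pokhara=2
label = np.array([[0.0], [1.0], [2.0]])
print(np.abs(label - label.T))  # |0-2| = 2 but |0-1| = 1: an artificial ordering

# One-hot encoded places: every distinct pair sits at distance sqrt(2)
onehot = np.eye(3)
print(np.linalg.norm(onehot[0] - onehot[1]))  # 1.414...
print(np.linalg.norm(onehot[0] - onehot[2]))  # 1.414...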

After one-hot encoding, our dataset grows to 42 attributes. Before worrying about that, let us standardize the data with StandardScaler so that every column contributes comparably to the distance computation.

from sklearn.preprocessing import StandardScaler
SS = StandardScaler()
Nchs = SS.fit_transform(nch_onehot)
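A quick sanity check that the scaler did its job: after StandardScaler every column should have mean ≈ 0 and standard deviation ≈ 1.

print(Nchs.mean(axis=0).round(2))  # all ~0
print(Nchs.std(axis=0).round(2))   # all ~1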

Now we need to reduce the dimensionality of our data, so we apply PCA.

from sklearn.decomposition import PCA

# Fit PCA on the scaled data
pca = PCA().fit(Nchs)

# Plot the cumulative sum of the explained variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Explained variance')
plt.show()
Fig: Cumulative explained variance (PCA)

From the figure above, the curve flattens after approximately 37 components, so we select 37 components to retain close to 100% of the variance.

pca = PCA(n_components=37)
pca.fit(Nchs)
x_pca = pca.transform(Nchs)
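Rather than eyeballing the curve, we can confirm how much variance the 37 components actually retain; PCA can also pick the component count for a variance target directly.

print(pca.explained_variance_ratio_.sum())  # fraction of variance kept by 37 components

# Alternative: let PCA choose the number of components for, say, 99% variance
# pca = PCA(n_components=0.99)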

Now comes the last part: applying the K-Means algorithm to our data. But how do we determine the value of K? It is easy: we use the Elbow method. We plot the within-cluster sum of squared distances for a range of K values and look for the point where the curve bends into an elbow.

from sklearn.cluster import KMeans

sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k, n_init=10)
    km = km.fit(x_pca)
    sum_of_squared_distances.append(km.inertia_)

# Visualize the elbow plot
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Fig: Elbow Method

The curve forms an elbow at about K = 2, so we take k = 2 for our K-Means clustering.
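As a cross-check on the elbow reading (an addition to the original walkthrough, not part of it), the silhouette score offers a second opinion on k; higher is better.

from sklearn.metrics import silhouette_score

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(x_pca)
    print(k, round(silhouette_score(x_pca, labels), 3))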

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000,
                tol=0.0001, random_state=42)  # k-means++ init with a fixed seed for reproducibility
y_kmeans=kmeans.fit_predict(x_pca)
#Plotting the clusters on the first two principal components
plt.scatter(x_pca[y_kmeans==0,0],x_pca[y_kmeans==0,1],s=100,c='red',label='Cluster1')
plt.scatter(x_pca[y_kmeans==1,0],x_pca[y_kmeans==1,1],s=100,c='blue',label='Cluster2')
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],s=300,c='yellow',label='Centroids')
plt.title('Clusters of data')
plt.legend()
plt.show()
Fig:K-Means with K=2
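Finally, to make the two clusters interpretable, one option (a suggestion beyond the original post) is to attach the labels back to the frame and profile each group:

nch["cluster"] = y_kmeans
print(nch["cluster"].value_counts())   # cluster sizes
print(nch.groupby("cluster").mean())   # per-cluster averages of the encoded columns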

In this way, we can prepare categorical data for clustering with K-Means.

The full code is available on GitHub here.

