K-means Clustering in R: An Example with the Iris Data

Created by

Rischan Mafrur

Chonnam National University of South Korea

http://rischanlab.github.io

May 27, 2014

In this tutorial I want to show you how to use k-means clustering in R, using the iris data as an example.

We can preview the iris data with the command below, which prints the first six rows. To show all the data, just type "iris":

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Or we can use the "names" command to show the column names (labels):

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

Here we assign the data from columns 1-4 (the features) to the variable x, and the class (Species) to the variable y:

x <- iris[, -5]
y <- iris$Species

Create the k-means model with the command below. You must specify how many clusters you want; in this case I use 3 because we already know that the iris data has 3 classes.

kc <- kmeans(x,3)
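kmeans picks its starting centers at random, so the cluster numbers (and occasionally the clusters themselves) can change from one run to the next. The following is a minimal sketch of one way to make the run repeatable, assuming that is what you want. The seed 42, the value nstart = 25, and the name kc_repro are arbitrary choices for illustration and are not part of the original tutorial.

```r
# Fix the random number generator so the run can be repeated exactly.
set.seed(42)  # arbitrary seed, chosen only for illustration

x <- iris[, -5]

# nstart = 25 tries 25 random sets of starting centers and keeps the
# solution with the smallest total within-cluster sum of squares.
kc_repro <- kmeans(x, centers = 3, nstart = 25)

kc_repro$size          # number of observations in each cluster
kc_repro$tot.withinss  # total within-cluster sum of squares
```

Even with a fixed seed, the numbers 1, 2 and 3 given to the clusters carry no meaning of their own, so your output may label the clusters in a different order than the output shown in this tutorial.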

Type "kc" (the name of the k-means model) to show a summary:

kc
## K-means clustering with 3 clusters of sizes 62, 38, 50
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1        5.902       2.748        4.394       1.434
## 2        6.850       3.074        5.742       2.071
## 3        5.006       3.428        1.462       0.246
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
## [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2
## [141] 2 2 1 2 2 2 1 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 39.82 23.88 15.15
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
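Each entry under "Available components" can be read from the model with the $ operator. As a small sketch, the code below pulls out a few of them and recomputes the between_SS / total_SS ratio printed in the summary. It refits the model so that it runs on its own, so the cluster numbering may differ from the output above.

```r
x <- iris[, -5]
kc <- kmeans(x, centers = 3)

kc$centers        # matrix of cluster means, one row per cluster
kc$size           # number of observations in each cluster
head(kc$cluster)  # cluster assigned to each of the first six rows

# Share of the total sum of squares that lies between clusters; this is
# the "between_SS / total_SS" figure printed in the summary.
kc$betweenss / kc$totss
```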

Now that we have the result, we want to know how many observations were put in the wrong cluster, so we compare the clustering result with the species (classes) in the iris data.

We use table to compare them:

table(y,kc$cluster)
##             
## y             1  2  3
##   setosa      0  0 50
##   versicolor 48  2  0
##   virginica  14 36  0
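The table can also be reduced to a single error count. The sketch below assumes that each cluster stands for whichever species is most common in it, and counts every other observation in that cluster as an error. This is one simple convention, not the only one, and it can understate the errors if two clusters end up with the same majority species. The names tab, majority and n_errors are made up for this illustration.

```r
x <- iris[, -5]
y <- iris$Species
kc <- kmeans(x, centers = 3)

tab <- table(y, kc$cluster)

# For each cluster (column), the largest count is its majority species.
majority <- apply(tab, 2, max)

# Every observation outside its cluster's majority species is an error.
n_errors <- sum(tab) - sum(majority)
n_errors
n_errors / sum(tab)  # error rate
```

For the table shown above, the majority counts are 48, 36 and 50, so 150 - 134 = 16 observations fall outside the majority species of their cluster: 2 versicolor and 14 virginica.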

For plotting we can use the plot function. In this case I plot Sepal.Length on the x-axis and Sepal.Width on the y-axis, colored by cluster, and then mark the cluster centers with points; you can choose a different pair of columns.

plot(x[c("Sepal.Length", "Sepal.Width")], col=kc$cluster)
points(kc$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=23, cex=3)
[Figure: scatter plot of Sepal.Length against Sepal.Width, colored by cluster, with the three cluster centers marked]
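As an example of choosing different columns, the sketch below draws the same kind of plot with the two petal measurements. The legend call is an addition that is not in the original tutorial; it only labels which color belongs to which cluster number.

```r
x <- iris[, -5]
kc <- kmeans(x, centers = 3)

# Same kind of plot as before, but using the two petal measurements.
plot(x[c("Petal.Length", "Petal.Width")], col = kc$cluster)

# Mark the three cluster centers with large diamonds.
points(kc$centers[, c("Petal.Length", "Petal.Width")],
       col = 1:3, pch = 23, cex = 3)

# Show which color belongs to which cluster number.
legend("topleft", legend = paste("Cluster", 1:3), col = 1:3, pch = 1)
```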