Outlier Detection in R

Created by

Rischan Mafrur

Chonnam National University of South Korea

http://rischanlab.github.io

May 28, 2014

This page shows how to detect and remove outliers in R, first for a single variable (univariate detection) and then for multivariate data.

First step, generating data

set.seed(1234)
x <- rnorm(1000)
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.400  -0.673  -0.040  -0.027   0.616   3.200

Outlier Detection with box plot

boxplot.stats(x)$out
##  [1]  3.044 -2.732 -2.856  2.919 -3.233 -2.652 -3.396  3.196 -2.730 -2.704
## [11] -2.864 -2.661  2.706 -2.907 -2.874 -2.757 -2.740
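With its default `coef = 1.5`, boxplot.stats() reports the values that lie more than 1.5 times the hinge spread (roughly the interquartile range) beyond the lower or upper hinge. As a minimal sketch, the same rule can be written out by hand. The names `hinges`, `iqr`, `lower`, `upper` and `manual_out` are illustrative and do not appear in the original code.

```r
set.seed(1234)
x <- rnorm(1000)

# Lower and upper hinges (roughly the 1st and 3rd quartiles), as used by boxplot.stats()
hinges <- fivenum(x)[c(2, 4)]
iqr <- diff(hinges)

# Fences: 1.5 * IQR beyond each hinge (the default coef = 1.5)
lower <- hinges[1] - 1.5 * iqr
upper <- hinges[2] + 1.5 * iqr

# Values outside the fences are the ones reported as outliers
manual_out <- x[x < lower | x > upper]

# Check that the hand-written rule agrees with boxplot.stats()
identical(manual_out, boxplot.stats(x)$out)
```

Passing a larger `coef` to boxplot.stats() widens the fences and flags fewer points.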

Plot the box plot

boxplot(x)
[Figure: box plot of x]

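That covers the univariate case. The next step is multivariate data, held here in a data frame. Note that y has 10,000 values while x has only 1,000, so data.frame() recycles x ten times to fill 10,000 rows; this is why the outlier indices for x below repeat every 1,000 rows.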

y <- rnorm(10000)
d <-data.frame(x,y)
rm(x,y)
head(d,3)
##         x       y
## 1 -1.2071 -1.2053
## 2  0.2774  0.3015
## 3  1.0844 -1.5391
dim(d)
## [1] 10000     2
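The data frame has 10,000 rows even though x was generated with only 1,000 values, because data.frame() recycles a shorter vector a whole number of times. A small sketch of that behaviour follows; the toy data frame `small` is a made-up example, not part of the original data.

```r
# Toy example: a length-2 vector combined with a length-4 vector
small <- data.frame(x = c(10, 20), y = 1:4)
small$x   # x is repeated to fill all four rows

# The same thing happens with the data used on this page
set.seed(1234)
x <- rnorm(1000)
y <- rnorm(10000)
d <- data.frame(x, y)

# Rows 1001-2000 of d$x are a copy of rows 1-1000
identical(d$x[1:1000], d$x[1001:2000])
```

If independent values were intended for every row, both vectors would need the same length.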

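We now have the data frame d, and the standalone vectors x and y have been removed with rm(). Calling attach(d) puts the columns of d on the search path, so we can write x instead of spelling out d$x explicitly. Removing the standalone vectors first matters: a global x would otherwise take precedence over the attached column.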

attach(d)
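Because attached columns can be masked by other objects with the same name, a common alternative is to refer to the columns explicitly or to use with(). This is only a sketch of that alternative on a hypothetical toy data frame `toy`; the original page uses attach().

```r
# Hypothetical toy data: the value 100 in x is far from the rest
toy <- data.frame(x = c(1:9, 100), y = 1:10)

# Explicit column access with $
out_explicit <- which(toy$x %in% boxplot.stats(toy$x)$out)

# with() evaluates the expression using the columns of the data frame
out_with <- with(toy, which(x %in% boxplot.stats(x)$out))

# Both approaches return the same row index
identical(out_explicit, out_with)
```

Neither form changes the search path, so there is nothing to detach afterwards.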

Find the indices of the outliers in x

(a <- which(x %in% boxplot.stats(x)$out))
##   [1]  178  181  192  227  237  382  392  486  487  517  558  717  771  788
##  [15]  901  949  967 1178 1181 1192 1227 1237 1382 1392 1486 1487 1517 1558
##  [29] 1717 1771 1788 1901 1949 1967 2178 2181 2192 2227 2237 2382 2392 2486
##  [43] 2487 2517 2558 2717 2771 2788 2901 2949 2967 3178 3181 3192 3227 3237
##  [57] 3382 3392 3486 3487 3517 3558 3717 3771 3788 3901 3949 3967 4178 4181
##  [71] 4192 4227 4237 4382 4392 4486 4487 4517 4558 4717 4771 4788 4901 4949
##  [85] 4967 5178 5181 5192 5227 5237 5382 5392 5486 5487 5517 5558 5717 5771
##  [99] 5788 5901 5949 5967 6178 6181 6192 6227 6237 6382 6392 6486 6487 6517
## [113] 6558 6717 6771 6788 6901 6949 6967 7178 7181 7192 7227 7237 7382 7392
## [127] 7486 7487 7517 7558 7717 7771 7788 7901 7949 7967 8178 8181 8192 8227
## [141] 8237 8382 8392 8486 8487 8517 8558 8717 8771 8788 8901 8949 8967 9178
## [155] 9181 9192 9227 9237 9382 9392 9486 9487 9517 9558 9717 9771 9788 9901
## [169] 9949 9967

Find the indices of the outliers in y

(b<- which(y %in% boxplot.stats(y)$out))
##  [1]  121  317  359  517  660  815 1024 1111 1264 1355 1414 1844 2053 2343
## [15] 2440 2650 2819 2976 3391 3983 4092 4101 4119 4346 4449 4612 5086 5162
## [29] 5358 5484 5557 5983 6044 6106 6128 6282 6289 6294 6318 6384 6707 6906
## [43] 6999 7034 7476 7659 7666 7794 7802 8100 8116 8156 8255 8445 8655 8961
## [57] 9209 9490 9598 9673

Intersection: rows that are outliers in both x and y

(outlier_d <- intersect(a,b))
## [1] 517
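The per-column steps above can be wrapped in a small helper and applied to every column at once. This is a sketch under assumptions: the function name `outlier_idx` and the toy data frame `toy` are hypothetical, and the helper simply repeats the `which(... %in% boxplot.stats(...)$out)` idiom used above.

```r
# Hypothetical helper: row indices of box-plot outliers in one numeric vector
outlier_idx <- function(v) which(v %in% boxplot.stats(v)$out)

# Toy data: row 1 is extreme in x only, row 10 is extreme in both columns
toy <- data.frame(
  x = c(-100, 2:9, 100),
  y = c(1:9, 200)
)

# One index vector per column
idx <- lapply(toy, outlier_idx)

# Rows flagged in every column, and rows flagged in at least one column
both   <- Reduce(intersect, idx)
either <- Reduce(union, idx)

both
either
```

Reduce() keeps the code unchanged when the data frame has more than two numeric columns.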

Plot the data frame d and mark the rows that are outliers in both variables

plot(d)
points(d[outlier_d,], col ="red",pch="+", cex=2)
[Figure: scatter plot of d, with the row that is an outlier in both x and y marked by a red +]

Union: rows that are outliers in x or in y

(outlier_du <- union(a,b))
##   [1]  178  181  192  227  237  382  392  486  487  517  558  717  771  788
##  [15]  901  949  967 1178 1181 1192 1227 1237 1382 1392 1486 1487 1517 1558
##  [29] 1717 1771 1788 1901 1949 1967 2178 2181 2192 2227 2237 2382 2392 2486
##  [43] 2487 2517 2558 2717 2771 2788 2901 2949 2967 3178 3181 3192 3227 3237
##  [57] 3382 3392 3486 3487 3517 3558 3717 3771 3788 3901 3949 3967 4178 4181
##  [71] 4192 4227 4237 4382 4392 4486 4487 4517 4558 4717 4771 4788 4901 4949
##  [85] 4967 5178 5181 5192 5227 5237 5382 5392 5486 5487 5517 5558 5717 5771
##  [99] 5788 5901 5949 5967 6178 6181 6192 6227 6237 6382 6392 6486 6487 6517
## [113] 6558 6717 6771 6788 6901 6949 6967 7178 7181 7192 7227 7237 7382 7392
## [127] 7486 7487 7517 7558 7717 7771 7788 7901 7949 7967 8178 8181 8192 8227
## [141] 8237 8382 8392 8486 8487 8517 8558 8717 8771 8788 8901 8949 8967 9178
## [155] 9181 9192 9227 9237 9382 9392 9486 9487 9517 9558 9717 9771 9788 9901
## [169] 9949 9967  121  317  359  660  815 1024 1111 1264 1355 1414 1844 2053
## [183] 2343 2440 2650 2819 2976 3391 3983 4092 4101 4119 4346 4449 4612 5086
## [197] 5162 5358 5484 5557 5983 6044 6106 6128 6282 6289 6294 6318 6384 6707
## [211] 6906 6999 7034 7476 7659 7666 7794 7802 8100 8116 8156 8255 8445 8655
## [225] 8961 9209 9490 9598 9673

Plot d again and mark the rows in the union

plot(d)
points(d[outlier_du,], col ="red",pch="+", cex=2)
[Figure: scatter plot of d, with the rows that are outliers in x or in y marked by red + symbols]
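Once the outlier rows are known, they can be dropped with negative indexing. The following is a minimal sketch on a hypothetical toy data frame; `toy`, `outlier_idx`, `rows_out` and `clean` are illustrative names, and using the union (rather than the intersection) as the removal rule is an assumption.

```r
# Hypothetical helper: row indices of box-plot outliers in one numeric vector
outlier_idx <- function(v) which(v %in% boxplot.stats(v)$out)

# Toy data: row 1 is extreme in x, row 10 is extreme in both columns
toy <- data.frame(x = c(-100, 2:9, 100), y = c(1:9, 200))

# Rows that are outliers in at least one column
rows_out <- union(outlier_idx(toy$x), outlier_idx(toy$y))

# Guard against an empty index: toy[-integer(0), ] would drop every row
clean <- if (length(rows_out) > 0) toy[-rows_out, ] else toy

nrow(toy)    # rows before removal
nrow(clean)  # rows after removal
```

The same pattern would apply to the data on this page with d in place of toy and outlier_du (or outlier_d) in place of rows_out.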