K-Means Clustering

Steps

Only applicable if the mean is defined (otherwise, use K-modes for categorical data)
Need to specify K
Sensitive to outliers
- Remove some data points
- Random sampling, only select subset of the dataset at a time
Sensitive to initial seeds
Not suitable for discovering clusters that are not hyper-ellipsoids (hyper-spheres)
More computation is needed if there are more data points or more feature dimensions
- Principal Component Analysis can be used to reduce dimensionality while preserving as much of the variation present in the dataset as possible
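
As a small illustration of the outlier sensitivity noted above, the sketch below computes a cluster mean with and without one extreme point. The helper name `centroid` and the sample points are made up for this example.

```python
def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

# Hypothetical 2-D cluster, then the same cluster plus one outlier.
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
with_outlier = cluster + [(100.0, 100.0)]

clean_mean = centroid(cluster)         # (2.0, 2.0)
shifted_mean = centroid(with_outlier)  # (26.5, 26.5)
print(clean_mean, shifted_mean)
```

A single far-away point pulls the centroid well outside the original three points, which is why removing such points or working on random subsets can help.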

No/Minimum re-assignments of data points to different clusters
No/Minimum change of centroids
Minimum decrease in the sum of squared error (SSE, or Sum of Intra-Cluster Distance) between successive iterations
- $C_{j}$ is the j-th cluster
- $m_{j}$ is the centroid of cluster $C_{j}$
- $dist(x, m_{j})$ is the distance between data point $x$ and centroid $m_{j}$

$$SSE = \sum_{j=1}^{k} \sum_{x \in C_{j}} dist(x, m_{j})^{2}$$
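
The SSE and the third stopping condition above (minimum decrease in SSE between successive iterations) can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: it assumes Euclidean distance for $dist$, takes the initial centroids from the caller, and uses illustrative names (`k_means`, `tol`, `max_iter`) that do not come from the notes.

```python
import math

def dist(x, m):
    """Euclidean distance between data point x and centroid m (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m)))

def sse(clusters, centroids):
    """Sum over clusters C_j of dist(x, m_j)^2 for every x in C_j."""
    return sum(dist(x, m) ** 2 for c, m in zip(clusters, centroids) for x in c)

def k_means(points, centroids, tol=1e-9, max_iter=100):
    """Alternate assignment and update steps until the SSE stops decreasing."""
    prev_sse = math.inf
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            j = min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))
            clusters[j].append(x)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else m
            for c, m in zip(clusters, centroids)
        ]
        cur_sse = sse(clusters, centroids)
        # Stop when the SSE decrease between successive iterations is at most tol.
        if prev_sse - cur_sse <= tol:
            break
        prev_sse = cur_sse
    return clusters, centroids, cur_sse

# Hypothetical data: two well-separated groups and two initial seeds.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
clusters, centroids, total = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, total)
```

Because the initial centroids are an argument, calling `k_means` with different seeds on the same data is a simple way to observe the sensitivity to initial seeds; the run with the lowest SSE can then be kept.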