Steps
Characteristics
- Usually use euclidean distance
- Data should be standardized
Pros
- Easy to understand/implement
- Efficient if K and number of iteration is small
Cons
- Only applicable if mean is defined (other wise K-mode for categorical)
- Need to specify K
- Sensitive to outliers
- Remove some data points
- Random sampling, only select subset of the dataset at a time
- Sensitive to initial seeds
- Not suitable for discovering clusters that are not hyper-ellipsoids (hyper-spheres)
- More computation is needed if there are more data pointers or more feature dimensions
- Principal Component Analysis can be used to reduce dimensionality while preserving variation present in the dataset, up to the maximum extend
Stopping Criterion
- No/Minimum re-assignments of data pointers to different clusters
- No/Minimum change of centroids
- Minimum decrease in the sum of squared error (SSE, or Sum of Intra Cluster Distnace) between successive iteration
- Cj is the j-th cluster
- mj is the centroid of cluster Cj
- dist(x,mj) is the distance between data pointer x and centroid mj
SSE=j=1∑kx∈Cj∑dist(x,mj)2
Clustering Quality
- Maximizes inter-clusters distance (isolation)
- Minimized intra-cluster distance (compactness)
Choosing K
- Elbow method
- Silhouette method