Steps

Characteristics

  • Usually use euclidean distance
  • Data should be standardized

Pros

  • Easy to understand/implement
  • Efficient if K and number of iteration is small

Cons

  • Only applicable if mean is defined (other wise K-mode for categorical)
  • Need to specify K
  • Sensitive to outliers
    • Remove some data points
    • Random sampling, only select subset of the dataset at a time
  • Sensitive to initial seeds
  • Not suitable for discovering clusters that are not hyper-ellipsoids (hyper-spheres)
  • More computation is needed if there are more data pointers or more feature dimensions
    • Principal Component Analysis can be used to reduce dimensionality while preserving variation present in the dataset, up to the maximum extend

Stopping Criterion

  • No/Minimum re-assignments of data pointers to different clusters
  • No/Minimum change of centroids
  • Minimum decrease in the sum of squared error (SSE, or Sum of Intra Cluster Distnace) between successive iteration
    • is the j-th cluster
    • is the centroid of cluster
    • is the distance between data pointer and centroid

Clustering Quality

  • Maximizes inter-clusters distance (isolation)
  • Minimized intra-cluster distance (compactness)

Choosing K

  • Elbow method
  • Silhouette method