K-means

Original

Disadvantages

  • If the initial guess is bad, an initial mean may have no points assigned to it, leaving an empty cluster
  • The value of k must be chosen in advance, but the right number of clusters is usually unknown
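
The first disadvantage can be seen in the assignment step alone. The following is a minimal sketch, assuming squared Euclidean distance; the function name `assign` and the sample data are made up for illustration.

```python
def assign(points, means):
    """Return, for each point, the index of its nearest mean (squared Euclidean distance)."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
        labels.append(dists.index(min(dists)))
    return labels


# Illustrative data: four points near the origin.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# The second initial mean is a deliberately bad guess, far from every point.
means = [(0.5, 0.5), (100.0, 100.0)]

labels = assign(points, means)
counts = [labels.count(i) for i in range(len(means))]
print(counts)  # the second cluster receives no points
```

An implementation has to decide what to do with such an empty cluster, for example by re-seeding the mean; that choice is outside what these notes specify.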

Sequential

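Each new point $x$ is assigned to the cluster with the closest mean, and that mean is updated:

$m_i \leftarrow m_i + \frac{1}{n_i}(x - m_i)$

  • $m_i$: the mean of cluster $C_i$
  • $C_i$: the cluster whose mean is closest to $x$
  • $n_i$: the number of points in the new cluster, counting $x$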
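
A sequential update moves the closest mean toward each new point by a step of one over the new cluster size, which keeps the mean equal to the average of all points seen so far. The sketch below assumes squared Euclidean distance and assumes each initial mean counts as one point; `sequential_update` is an illustrative name.

```python
def sequential_update(means, counts, x):
    """Assign x to the nearest mean and update that mean in place.

    means  : list of tuples, one per cluster
    counts : list of ints, number of points already in each cluster
    Returns the index of the cluster that received x.
    """
    dists = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in means]
    i = dists.index(min(dists))
    counts[i] += 1  # number of points in the new cluster
    # m <- m + (x - m) / n
    means[i] = tuple(m + (a - m) / counts[i] for a, m in zip(x, means[i]))
    return i


# Assumption: each initial mean is treated as a cluster of one point.
means = [(0.0, 0.0), (10.0, 10.0)]
counts = [1, 1]
for x in [(2.0, 0.0), (10.0, 12.0), (1.0, 3.0)]:
    sequential_update(means, counts, x)

print(means, counts)
```

The points are processed one at a time, so the data never has to be held in memory all at once.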

Forgetful Sequential

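Same as sequential k-means, but the weight given to a new point $x$ is a constant instead of the reciprocal of the number of points in the new cluster, so older points are gradually forgotten:

$m_i \leftarrow m_i + a(x - m_i)$

  • $a$: a constant where $0 < a < 1$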
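
With a constant step size the mean becomes an exponentially weighted average, so recent points count more than old ones. A minimal sketch, assuming the constant is called `a` and lies strictly between 0 and 1; `forgetful_update` is an illustrative name.

```python
def forgetful_update(mean, x, a):
    """Move mean toward x by a constant fraction a: m <- m + a * (x - m)."""
    if not 0.0 < a < 1.0:
        raise ValueError("a must satisfy 0 < a < 1")
    return tuple(m + a * (xi - m) for xi, m in zip(x, mean))


mean = (0.0, 0.0)
for x in [(4.0, 0.0), (4.0, 0.0)]:
    mean = forgetful_update(mean, x, a=0.5)

print(mean)  # (3.0, 0.0): halfway to x after one step, three quarters after two
```

A larger `a` forgets the past faster; a smaller `a` gives a smoother but slower-moving mean.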

Hierarchical

Agglomerative (bottom-up)

Divisive (top-down)

Polythetic

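  1. Calculate the average dissimilarity of each object to all other objects, choose the highest, and split that object off into a splinter group
  2. Calculate the average dissimilarity of each remaining object to the other remaining objects, where the average does not include the splinter group
  3. Calculate the average dissimilarity of each remaining object to the splinter group, where the average does not include the remaining objects
  4. Calculate the difference (step 2 minus step 3) for each remaining object, choose the highest, and move that object to the splinter group
  5. Repeat steps 2 to 4 until all differences are negative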
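
One common form of polythetic divisive splitting is the splinter-group procedure, sketched below for a single split. It assumes a symmetric dissimilarity matrix `d` and assumes that an object is moved only while its difference is strictly positive; the function name `splinter_split` and the sample positions are illustrative.

```python
def splinter_split(d):
    """Split objects 0..n-1 into (main, splinter) using dissimilarity matrix d."""

    def avg(i, group):
        # Average dissimilarity of object i to the members of group (excluding i itself).
        others = [j for j in group if j != i]
        return sum(d[i][j] for j in others) / len(others) if others else 0.0

    main = list(range(len(d)))
    # Seed the splinter group with the object most dissimilar to all the others.
    first = max(main, key=lambda i: avg(i, main))
    main.remove(first)
    splinter = [first]

    while len(main) > 1:
        # Difference: average dissimilarity to the main group minus that to the splinter group.
        diffs = {i: avg(i, main) - avg(i, splinter) for i in main}
        best = max(diffs, key=diffs.get)
        if diffs[best] <= 0:  # assumption: stop once no difference is positive
            break
        main.remove(best)
        splinter.append(best)
    return main, splinter


# Illustrative data: five points on a line, dissimilarity = absolute difference.
pos = [0, 1, 2, 10, 11]
d = [[abs(p - q) for q in pos] for p in pos]
main, splinter = splinter_split(d)
print(main, splinter)  # the points at 10 and 11 end up in the splinter group
```

A full divisive clustering would apply the same split recursively to each resulting group.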

Monothetic

Usually used when the data consists of binary variables

Chi-Square Measure

The correlation between two attributes

  1. Calculate Chi-Square for every pair of attributes
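  2. For each attribute, calculate its average correlation with the other attributes (only the summation is needed for comparison, since every attribute has the same number of pairs)
  3. Choose the attribute with the highest value for splitting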
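
For binary attributes, the chi-square of a pair can be computed from its 2x2 contingency table as N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). The sketch below assumes that form and 0/1 data; the function names and the sample rows are made up for illustration.

```python
def chi_square(col_x, col_y):
    """Chi-square statistic of the 2x2 contingency table of two binary attributes."""
    pairs = list(zip(col_x, col_y))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # Assumption: a constant attribute (zero denominator) contributes 0.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def best_split_attribute(rows):
    """Return (index of the attribute with the highest summed chi-square, all sums)."""
    cols = list(zip(*rows))
    scores = [
        sum(chi_square(cols[i], cols[j]) for j in range(len(cols)) if j != i)
        for i in range(len(cols))
    ]
    return scores.index(max(scores)), scores


# Illustrative data: six records with three binary attributes.
rows = [
    (1, 1, 1),
    (1, 1, 0),
    (1, 0, 1),
    (0, 0, 0),
    (0, 0, 0),
    (0, 1, 0),
]
best, scores = best_split_attribute(rows)
print(best, scores)  # attribute 0 has the highest sum
```

The records are then split into those with the chosen attribute equal to 1 and those with it equal to 0.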

Common

Linkage Methods

  • Single
    • Nearest neighbor
  • Complete
    • Most distant member
  • Group average
    • Average of the distances between all pairs of records
  • Centroid
    • Distance between the mean vectors of the two clusters
    • When a large cluster is merged with a small one, the centroid of the combined cluster will be closer to that of the large one
  • Median
    • Mid-point of the original two cluster centres
  • McQuitty’s method
  • Ward’s method
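
The first four linkage methods differ only in how the distance between two clusters is derived from point distances. A minimal sketch, assuming Euclidean distance and clusters given as lists of coordinate tuples; the function names are illustrative, and McQuitty's, median and Ward's methods are not covered here.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b):
    """Distance between the nearest pair of members."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Distance between the most distant pair of members."""
    return max(dist(p, q) for p, q in product(a, b))


def group_average_linkage(a, b):
    """Average of the distances between all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Mean vector of a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def centroid_linkage(a, b):
    """Distance between the mean vectors of the two clusters."""
    return dist(centroid(a), centroid(b))


# Illustrative clusters: two vertical pairs, three units apart.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
for f in (single_linkage, complete_linkage, group_average_linkage, centroid_linkage):
    print(f.__name__, f(a, b))
```

On this data single and centroid linkage both give 3, while complete linkage gives the diagonal distance and group average falls between the two.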

Distance Measurements

  • Jaccard’s coefficient
  • Matching coefficients
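
Both coefficients compare two binary records by counting agreements. The sketch below assumes the usual definitions: Jaccard's coefficient ignores joint absences (0-0 matches), while the simple matching coefficient counts them. The function names and sample vectors are illustrative.

```python
def _counts(x, y):
    """Return (a, b, c, d): counts of 1-1, 1-0, 0-1 and 0-0 positions."""
    pairs = list(zip(x, y))
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 1)), pairs.count((0, 0)))


def jaccard_coefficient(x, y):
    """a / (a + b + c): joint absences are ignored."""
    a, b, c, _ = _counts(x, y)
    # Assumption: two all-zero records get similarity 0.0 (conventions differ).
    return a / (a + b + c) if (a + b + c) else 0.0


def matching_coefficient(x, y):
    """(a + d) / (a + b + c + d): joint absences count as agreement."""
    a, b, c, d = _counts(x, y)
    return (a + d) / (a + b + c + d)


x = (1, 1, 0, 0, 1)
y = (1, 0, 0, 0, 1)
print(jaccard_coefficient(x, y))   # 2 matches out of 3 positions where either is 1
print(matching_coefficient(x, y))  # 4 agreements out of 5 positions
```

Both values are similarities; a distance can be obtained as one minus the coefficient.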