K-means
Original
Disadvantages
- If the initial guess is bad, an initial mean may end up with no points assigned to it
- The value of k must be fixed in advance, but the right value is usually unknown
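
The first disadvantage can be reproduced with a minimal sketch of the batch procedure. Everything here is an assumption for illustration rather than something stated in these notes: squared Euclidean distance, a fixed number of iterations, the hypothetical name `kmeans`, and the choice to leave a mean untouched when no points are assigned to it.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, means, iterations=10):
    """Batch k-means sketch: assign every point, then recompute every mean."""
    means = [tuple(m) for m in means]
    clusters = [[] for _ in means]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: sq_dist(p, means[i]))
            clusters[nearest].append(p)
        # Update step: each mean becomes the average of its points.
        new_means = []
        for m, members in zip(means, clusters):
            if members:
                new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its old mean.
                new_means.append(m)
        means = new_means
    return means, clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

# Reasonable initial guess: both clusters receive points.
good_means, good_clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])

# Bad initial guess: the second mean is so far away that it never gets a point.
bad_means, bad_clusters = kmeans(points, [(0.0, 0.0), (100.0, 100.0)])
```

With the bad guess the second cluster stays empty on every iteration, so the run effectively uses one cluster even though k is 2.
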
Sequential
| The mean of cluster with points | |
| The cluster | |
| The number of points in the new cluster |
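
The update formula itself did not survive in these notes, so the sketch below assumes the standard incremental-mean form: when a new point x joins a cluster that then holds n points, its mean m becomes m + (x - m) / n. The function name `sequential_update` and the choice to count each initial mean as one point are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sequential_update(means, counts, x):
    """Assign one new point to its nearest mean and update that mean in place."""
    i = min(range(len(means)), key=lambda j: sq_dist(x, means[j]))
    counts[i] += 1  # number of points in the cluster after adding x
    means[i] = [m + (xi - m) / counts[i] for m, xi in zip(means[i], x)]
    return i


means = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]  # assumption: each initial mean counts as one observed point

# Points arrive one at a time; no earlier points need to be stored.
for x in [(2.0, 0.0), (1.0, 3.0), (12.0, 10.0)]:
    sequential_update(means, counts, x)
```

After the three updates the first mean equals the plain average of (0, 0), (2, 0) and (1, 3), which is what the incremental form is meant to guarantee.
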
Forgetful Sequential
| The number of points in the new cluster | |
| A constant strictly between 0 and 1, used in place of the reciprocal of the number of points |
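
A minimal sketch of the forgetful variant, assuming the usual form in which a fixed constant a (strictly between 0 and 1) replaces the 1/n weight, so the mean moves to m + a(x - m). The name `forgetful_update` and the default a = 0.5 are hypothetical choices for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((c - d) ** 2 for c, d in zip(p, q))


def forgetful_update(means, x, a=0.5):
    """Move the nearest mean a fixed fraction `a` of the way towards x."""
    if not 0.0 < a < 1.0:
        raise ValueError("a must be strictly between 0 and 1")
    i = min(range(len(means)), key=lambda j: sq_dist(x, means[j]))
    means[i] = [m + a * (xi - m) for m, xi in zip(means[i], x)]
    return i


means = [[0.0, 0.0], [10.0, 10.0]]

# The same point arrives twice; each arrival halves the remaining gap.
forgetful_update(means, (4.0, 0.0))
forgetful_update(means, (4.0, 0.0))
```

Because the weight never shrinks, older points lose influence geometrically, which lets the means follow data that drifts over time.
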
Hierarchical
Agglomerative (bottom-up)
Divisive (top-down)
Polythetic
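- Calculate the average distance from each point to all other points, choose the highest, and split that point off into a new group
- For each remaining point, calculate its average distance to the other points in the main group, where the main group does not include the new group
- For each remaining point, calculate its average distance to the points in the new group, where the new group does not include that point
- Calculate the difference between the two averages, choose the highest, and move that point to the new group
- Repeat until all differences are negative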
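
One reading of the steps above is the splinter-group procedure: seed a new group with the most isolated point, then keep moving over the point that is, on average, farther from the main group than from the new group. The sketch below follows that reading; the name `polythetic_split`, the precomputed distance matrix, and stopping when the best difference is zero or negative are assumptions.

```python
def avg_dist(i, group, dist):
    """Average distance from point i to the members of group, excluding i itself."""
    others = [j for j in group if j != i]
    if not others:
        return 0.0
    return sum(dist[i][j] for j in others) / len(others)


def polythetic_split(dist):
    """Split the index set of a distance matrix into (main, splinter) groups."""
    main = list(range(len(dist)))
    # Step 1: the point with the highest average distance seeds the new group.
    seed = max(main, key=lambda i: avg_dist(i, main, dist))
    main.remove(seed)
    splinter = [seed]
    while len(main) > 1:
        # Steps 2-4: difference between distance to main and distance to splinter.
        diffs = {i: avg_dist(i, main, dist) - avg_dist(i, splinter, dist) for i in main}
        best = max(diffs, key=diffs.get)
        # Step 5: stop once no point is closer to the splinter group.
        if diffs[best] <= 0:
            break
        main.remove(best)
        splinter.append(best)
    return main, splinter


# Five points on a line: three near 0 and two near 10.
values = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(u - v) for v in values] for u in values]
main, splinter = polythetic_split(dist)
```

Here the point at 11 seeds the new group, the point at 10 follows it, and the remaining three differences are all negative, so the split stops.
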
Monothetic
Usually used when the data consists of binary variables
Chi-Square Measure
The correlation between two attributes
- Calculate Chi-Square for every pair of attributes
- Calculate the average correlation (only summation is needed for comparison)
- Choose the attribute with the highest value for splitting
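
The selection step can be sketched for binary data as follows. The formula is the standard chi-square statistic for a 2x2 contingency table, N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), which is assumed here because the notes do not spell it out; the names `chi_square` and `best_split_attribute` and the sample records are hypothetical.

```python
def chi_square(col_j, col_k):
    """Chi-square statistic for two binary attributes (2x2 contingency table)."""
    n = len(col_j)
    a = sum(1 for x, y in zip(col_j, col_k) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(col_j, col_k) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(col_j, col_k) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(col_j, col_k) if x == 0 and y == 0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # Assumption: a constant attribute contributes 0 instead of dividing by zero.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def best_split_attribute(rows):
    """Return (index of the attribute with the highest summed chi-square, all sums)."""
    columns = list(zip(*rows))
    totals = [
        sum(chi_square(columns[j], columns[k]) for k in range(len(columns)) if k != j)
        for j in range(len(columns))
    ]
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals


rows = [
    (1, 1, 1),
    (1, 1, 1),
    (1, 0, 1),
    (0, 0, 1),
    (0, 0, 0),
    (0, 0, 0),
]
best, totals = best_split_attribute(rows)

# Split the records on the chosen attribute.
group_one = [r for r in rows if r[best] == 1]
group_zero = [r for r in rows if r[best] == 0]
```

Summing instead of averaging gives the same ranking because every attribute is compared against the same number of other attributes.
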
Common
Linkage Methods
- Single
- Nearest neighbor
- Complete
- Most distant member
- Group average
  - Average of the distances between all pairs of records, one from each cluster
- Centroid
  - Distance between the mean vectors of the two clusters
  - When a large cluster is merged with a small one, the centroid of the combined cluster will be closer to that of the large one
- Median
- Mid-point of the original two cluster centres
- McQuitty’s method
- Ward’s method
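
The linkage methods described above can be compared on a small example. This is a sketch assuming Euclidean distance and clusters given as lists of coordinate tuples; the function names are hypothetical, and McQuitty's and Ward's methods are left out because the notes do not describe them.

```python
import math
from itertools import product


def single_linkage(a, b):
    """Distance between the nearest pair of members."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Distance between the most distant pair of members."""
    return max(math.dist(p, q) for p, q in product(a, b))


def group_average(a, b):
    """Average distance over all pairs with one record from each cluster."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Mean vector of a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def centroid_linkage(a, b):
    """Distance between the two mean vectors."""
    return math.dist(centroid(a), centroid(b))


def merged_centroid(centre_a, size_a, centre_b, size_b):
    """Centroid method: the new centre is weighted by cluster size."""
    total = size_a + size_b
    return tuple((size_a * x + size_b * y) / total for x, y in zip(centre_a, centre_b))


def median_centre(centre_a, centre_b):
    """Median method: the new centre is the mid-point, ignoring cluster size."""
    return tuple((x + y) / 2 for x, y in zip(centre_a, centre_b))


a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
single = single_linkage(a, b)
complete = complete_linkage(a, b)
average = group_average(a, b)
between_centroids = centroid_linkage(a, b)

# A cluster of 9 points centred at (0, 0) merged with 1 point at (10, 0).
weighted_centre = merged_centroid((0.0, 0.0), 9, (10.0, 0.0), 1)
mid_centre = median_centre((0.0, 0.0), (10.0, 0.0))
```

The last two results show the difference between the centroid and median methods: the weighted centre stays close to the large cluster, while the mid-point treats both clusters equally.
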
Distance Measurements
- Jaccard’s coefficient
- Matching coefficients
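
Both measures can be sketched for binary records using their usual definitions, which are assumed here because the notes give only the names: Jaccard's coefficient counts shared 1s and ignores shared 0s, while the simple matching coefficient counts every agreement. The function names and the value returned when two records contain no 1s at all are hypothetical choices.

```python
def match_counts(x, y):
    """Count the four agreement/disagreement cases between two binary records."""
    both_one = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    only_x = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    only_y = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    both_zero = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return both_one, only_x, only_y, both_zero


def jaccard(x, y):
    """Shared 1s divided by positions where at least one record has a 1."""
    both_one, only_x, only_y, _ = match_counts(x, y)
    denom = both_one + only_x + only_y
    # Assumption: two all-zero records are given similarity 0.0.
    return both_one / denom if denom else 0.0


def simple_matching(x, y):
    """All agreeing positions (1-1 and 0-0) divided by the number of positions."""
    both_one, only_x, only_y, both_zero = match_counts(x, y)
    return (both_one + both_zero) / (both_one + only_x + only_y + both_zero)


x = (1, 1, 0, 0, 0)
y = (1, 0, 0, 0, 1)
j = jaccard(x, y)
m = simple_matching(x, y)
```

For these two records the simple matching coefficient is higher than Jaccard's because the two shared 0s count as agreement only in the former. A distance can be taken as one minus either similarity.
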