Clustering

K-means

Original

m_{i}(t - 1) = \frac{x_{1} + x_{2} + \dots + x_{t - 1}}{t - 1}

x_{1} + x_{2} + \dots + x_{t - 1} = m_{i}(t - 1) \cdot (t - 1)

m_{i}(t) = \frac{(x_{1} + x_{2} + \dots + x_{t - 1}) + x_{t}}{t} = \frac{m_{i}(t - 1) \cdot (t - 1) + x_{t}}{t} = m_{i}(t - 1) - \frac{m_{i}(t - 1)}{t} + \frac{x_{t}}{t} = m_{i}(t - 1) + \frac{1}{t} (x_{t} - m_{i}(t - 1))


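$m_{i}(t)$	The mean of cluster $i$ when it contains $t$ points
$i$	The index of the cluster (the $i$-th cluster)
$t$	The number of points in the cluster after the new point is added
$x_{t}$	The $t$-th point added to the cluster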
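The update rule above can be checked numerically. The following is a minimal sketch, assuming scalar points for brevity (in K-means the points are vectors and the same update applies per coordinate); the function name `update_mean` and the sample data are illustrative, not from the source.

```python
def update_mean(prev_mean, t, x_t):
    """Return m(t) from m(t-1), the new point count t (t >= 1), and the new point x_t."""
    return prev_mean + (x_t - prev_mean) / t


points = [2.0, 4.0, 9.0, 1.0]  # hypothetical sample data

mean = 0.0  # m(0); its value does not matter because the first update uses t = 1
for t, x in enumerate(points, start=1):
    mean = update_mean(mean, t, x)

# The incremental result should agree with the batch mean of all points.
batch_mean = sum(points) / len(points)
print(mean, batch_mean)
```

The incremental form avoids storing every point of the cluster: only the current mean and the point count are needed.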

m_{n} = m_{n - 1} + a (x_{n} - m_{n - 1})
= (1 - a) m_{n - 1} + a x_{n}
= (1 - a) [(1 - a) m_{n - 2} + a x_{n - 1}] + a x_{n}
= (1 - a)^{2} m_{n - 2} + (1 - a) a x_{n - 1} + a x_{n}
= (1 - a)^{2} [(1 - a) m_{n - 3} + a x_{n - 2}] + (1 - a) a x_{n - 1} + a x_{n}
= (1 - a)^{3} m_{n - 3} + (1 - a)^{2} a x_{n - 2} + (1 - a) a x_{n - 1} + a x_{n}
= \dots
= (1 - a)^{n} m_{0} + (1 - a)^{n - 1} a x_{1} + (1 - a)^{n - 2} a x_{2} + \dots + (1 - a)^{2} a x_{n - 2} + (1 - a) a x_{n - 1} + a x_{n}
= (1 - a)^{n} m_{0} + \sum_{k = 1}^{n} (1 - a)^{n - k} a x_{k}


$n$	The number of points in the new cluster
$a$	A constant where $0 \leq a \leq 1$
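To see that the recursive update and the expanded sum agree, here is a small sketch that evaluates both. The function names, the initial mean `m0`, and the sample values are assumptions chosen for illustration.

```python
def recursive_mean(m0, xs, a):
    """Apply m_n = m_{n-1} + a * (x_n - m_{n-1}) once per point."""
    m = m0
    for x in xs:
        m = m + a * (x - m)
    return m


def closed_form_mean(m0, xs, a):
    """Evaluate (1 - a)^n * m_0 + sum_{k=1..n} (1 - a)^(n - k) * a * x_k."""
    n = len(xs)
    weighted = sum((1 - a) ** (n - k) * a * x for k, x in enumerate(xs, start=1))
    return (1 - a) ** n * m0 + weighted


xs = [1.0, 3.0, 2.0, 5.0]  # hypothetical points x_1 .. x_4
m0, a = 0.0, 0.5

r = recursive_mean(m0, xs, a)
c = closed_form_mean(m0, xs, a)
print(r, c)
```

With a constant $a$, the weight $(1 - a)^{n - k} a$ shrinks geometrically as a point gets older, so recent points dominate the mean.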

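The chi-squared statistic $X^{2}$ is usually used when the data consist of binary variables. For the 2×2 contingency table of two binary attributes:

X^{2} = \frac{(a d - b c)^{2} N}{(a + b) (a + c) (b + d) (c + d)}

$a, b, c, d$	The four cell counts of the 2×2 table (first row $a, b$; second row $c, d$)
$N$	The total count, $a + b + c + d$

$X^{2}$ grows with the strength of the correlation between the two attributes: $X^{2} = N \phi^{2}$, where $\phi$ is the phi correlation coefficient of the table.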
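As a worked illustration of the formula, the sketch below computes $X^{2}$ for a 2×2 table. It assumes the layout with $a, b$ in the first row and $c, d$ in the second, and $N = a + b + c + d$; the function name and the example counts are hypothetical.

```python
def chi_squared_2x2(a, b, c, d):
    """X^2 for a 2x2 contingency table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # A zero row or column total leaves the statistic undefined.
        raise ValueError("a row or column total is zero")
    return (a * d - b * c) ** 2 * n / denom


independent = chi_squared_2x2(10, 10, 10, 10)  # attributes unrelated
perfect = chi_squared_2x2(5, 0, 0, 5)          # attributes always agree
print(independent, perfect)
```

In the first table the attributes are unrelated and the statistic is zero; in the second they always agree and the statistic reaches its maximum, $N$.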

Single
- Nearest neighbor
Complete
- Most distant member
Group average
- Average of the distances between all pairs of records
Centroid
- Distance between the mean vectors of the two clusters
- When a large cluster is merged with a small one $\to$ the centroid of the combined cluster will be closer to the large one
Median
- Mid-point of the original two cluster centres
McQuitty’s method

Suppose there are clusters $C_{1}$, $C_{2}$, and $C_{x}$. $C_{1}$ and $C_{2}$ are merged to become $C_{1, 2}$:

D(C_{x}, C_{1, 2}) = \frac{D(C_{1}, C_{x}) + D(C_{2}, C_{x})}{2}
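The linkage rules listed above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between records and clusters stored as lists of coordinate tuples; for group average it assumes each pair takes one record from each cluster. All function names and sample points are hypothetical.

```python
from itertools import product
from math import dist  # Euclidean distance, available from Python 3.8


def single_link(A, B):
    """Nearest neighbour: smallest distance over all cross-cluster pairs."""
    return min(dist(p, q) for p, q in product(A, B))


def complete_link(A, B):
    """Most distant member: largest distance over all cross-cluster pairs."""
    return max(dist(p, q) for p, q in product(A, B))


def group_average(A, B):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(A, B)) / (len(A) * len(B))


def centroid(A):
    """Mean vector of a cluster."""
    return tuple(sum(col) / len(A) for col in zip(*A))


def centroid_link(A, B):
    """Distance between the two mean vectors."""
    return dist(centroid(A), centroid(B))


def mcquitty_update(d_1x, d_2x):
    """McQuitty: distance from C_x to the merged C_{1,2} is the plain average."""
    return (d_1x + d_2x) / 2


A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]
print(single_link(A, B), complete_link(A, B), group_average(A, B), centroid_link(A, B))
print(mcquitty_update(3.0, 5.0))
```

Unlike the centroid rule, the McQuitty update ignores cluster sizes: the two old distances get equal weight however many records each cluster holds.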

Information loss = Error sum-of-squares (ESS)
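The notes do not spell out how ESS is computed or applied. The sketch below takes the standard definition (the sum of squared distances from each record to its cluster mean) and assumes that the information lost by a merge is the resulting increase in total ESS. The function names and data are illustrative.

```python
def mean_vector(cluster):
    """Mean vector of a cluster given as a list of coordinate tuples."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def ess(cluster):
    """Error sum-of-squares: squared distances from each record to the cluster mean."""
    centre = mean_vector(cluster)
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centre))
        for point in cluster
    )


def merge_loss(A, B):
    """Assumed information loss of a merge: the increase in total ESS."""
    return ess(A + B) - ess(A) - ess(B)


A = [(0.0, 0.0), (2.0, 0.0)]
B = [(10.0, 0.0)]
loss = merge_loss(A, B)
print(ess(A), ess(B), loss)
```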

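J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}

M(A, B) = \frac{M_{00} + M_{11}}{n}

$J$	The Jaccard coefficient of two binary records $A$ and $B$
$M(A, B)$	The simple matching coefficient of $A$ and $B$
$M_{pq}$	The number of attributes where $A$ has value $p$ and $B$ has value $q$
$n$	The total number of attributes, $M_{00} + M_{01} + M_{10} + M_{11}$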
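Both coefficients follow directly from the four match counts. The following is a small sketch, assuming records are equal-length sequences of 0s and 1s; the function names are hypothetical, and returning 0.0 for a Jaccard coefficient with no 1s in either record is a convention chosen here, not something the source specifies.

```python
def match_counts(u, v):
    """Count attribute pairs by value: returns (m00, m01, m10, m11)."""
    if len(u) != len(v):
        raise ValueError("records must have the same length")
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for p, q in zip(u, v):
        counts[(p, q)] += 1
    return counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)]


def jaccard(u, v):
    """M11 / (M01 + M10 + M11); joint absences (M00) are ignored."""
    _, m01, m10, m11 = match_counts(u, v)
    denom = m01 + m10 + m11
    return m11 / denom if denom else 0.0  # assumed convention for all-zero records


def simple_matching(u, v):
    """(M00 + M11) / n; joint absences count as agreement."""
    m00, _, _, m11 = match_counts(u, v)
    return (m00 + m11) / len(u)


u = [1, 0, 1, 1, 0, 0]
v = [1, 1, 0, 1, 0, 0]
print(jaccard(u, v), simple_matching(u, v))
```

The two measures differ only in how they treat attributes that are 0 in both records, which matters when 1s are rare.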