k-means
algorithm for finding the best clustering scheme
procedure
distortion (sum of square errors sse)
is the sum of all distances between an element $x_{i}$ of the dataset and it’s encoded/decoded output squared
$$ \sum_{i=1}^{N}{(x_{i}-Decode(Encode(x_{i}))^2} \space with: $$ $$ Encode: \mathbb{R}^{D}\to [1 … K] $$ $$ Decode: [1 … K]\to \mathbb{R}^{D} $$
it measures how much the clustering scheme change the dataset
so in order to have a minimal distortion of the dataset:
- the $Encode$ function must translate $x_{i}$ in the nearest center
- the gradient of the distortion function w.r.t. the center of the cluster must be $0$ so
$$ c_{j}= \frac{1}{|OwnedBy(c_{j})|}\sum_{i\in OwnedBy(c_{j})}{x_{i}} $$
- each center must be the centroid of the points it owns
choosing starting point
choosing the starting point correctly is important, some possible choices are:
- select random starting points
- choose the $2…k$ starting point as far as possible from the previious ones
choosing the $k$ number of clusters
choosing the $k$ number of clusters correctly is important, one possible strategy is to use a quantitive evaluation of the quality of the clustering scheme.
The best value to aim to is a compromise between the minimization of intra-cluster distances and the maximization of the inter-cluster distances
empty cluster
during the clustering some clusters can become empty, so in this case there are 2 choices:
- choose a new centroid away from the empty one
- choose a new centroid at random with the maximum SSE in order to split in half the cluster with the lowest quality
outliers
there can be points far away from the centroid, this points are a bad influence for the SSE, in some cases this points need to be removed
complexity
the complexity of the algorithm is :
$$ \mathcal{O}(TKND)\space with: $$
- $T$ as the number of iterations
- $K$ as the number of clusters
- $N$ as the number of data points
- $D$ as the number of dimensions
pros
- kmeans is efficient nearly linear in the number of datapoints
cons
- k-means cannot work in space where distance cannot be computed
- cannot work with nominal data
- requires the K parameter (it can be computed but it is a cost)
- it is very sensitive to outliers
- does not deal with noise
- does not deal properly with non convex clusters