k-means

algorithm for finding the best clustering scheme

procedure

flowchart TD A[acquire the number of cluster k as an input of the user]; B[chose k random points as temprary centers]; C[find the nearest center and define \n the first temprary clustering scheme]; D[computes the centroid of the cluster]; A-->B B-->C C-->D D-->|repeat until the best \n clustering scheme|A

distortion (sum of square errors sse)

is the sum of all distances between an element $x_{i}$ of the dataset and it’s encoded/decoded output squared

$$ \sum_{i=1}^{N}{(x_{i}-Decode(Encode(x_{i}))^2} \space with: $$ $$ Encode: \mathbb{R}^{D}\to [1 … K] $$ $$ Decode: [1 … K]\to \mathbb{R}^{D} $$

it measures how much the clustering scheme change the dataset

so in order to have a minimal distortion of the dataset:

$$ c_{j}= \frac{1}{|OwnedBy(c_{j})|}\sum_{i\in OwnedBy(c_{j})}{x_{i}} $$

choosing starting point

choosing the starting point correctly is important, some possible choices are:

choosing the $k$ number of clusters

choosing the $k$ number of clusters correctly is important, one possible strategy is to use a quantitive evaluation of the quality of the clustering scheme.

The best value to aim to is a compromise between the minimization of intra-cluster distances and the maximization of the inter-cluster distances

empty cluster

during the clustering some clusters can become empty, so in this case there are 2 choices:

outliers

there can be points far away from the centroid, this points are a bad influence for the SSE, in some cases this points need to be removed

complexity

the complexity of the algorithm is :

$$ \mathcal{O}(TKND)\space with: $$

pros

cons

Link map