distances
properties of distances
- symmetry $Dist(p,q) = Dist(q, p)$
- positive definiteness $Dist(p,q) >= 0 \forall\space{p,q}$
- Triangle inequality $Dist(p,q) <= Dist(p,r) + Dist(r,q) \forall\space{p,q,r}$
euclidean distance
$$ \sqrt{\sum_{d=1}^{D}{(p_{d}-q_{d})^2}} $$
Where $D$ is the number of dimensions (attributes) and $p_{d}$ and $q_{d}$ are, respectively, the d-th attributes (components) of data objects p and q. Standardization/Rescaling is necessary if scales differ
minkowski distance $l_{r}$
generalization of euclidean distance
$$ (\sum_{d=1}^{D}{|p_{d}-q_{d}|^r})^{\frac{1}{r}} $$
Where $D$ is the number of dimensions (attributes) and $p_{d}$ and $q_{d}$ are, respectively, the d-th attributes (components) of data objects p and q. Standardization/Rescaling is necessary if scales differ. $r$ is a parameter which is chosen depending on the data set and the application
cases
-
$r=1$ Manhattan distance best at discriminate between 0 distance and near 0 distance
-
$r=2$ Euclidean distance
-
$r=\infty$ Chebyshev, supremum, $L_{max}$ norm, $L_{\infty}$ norm considers only the feature with the maximum difference $$ \max_{d}{|p_{d}-q_{d}|} $$
mahalanobis distance
The Mahlanobis distance between two points p and q decreases if, keeping the same euclidean distance, the segment connecting the points is stretched along a direction of greater variation of data. The distribution is described by the covariance matrix of the data set
$$ \sqrt{(p-q)\sum^{-1}{(p-q)^T}} $$