Multiplication of a spectrum by a set of eigenvectors yields a set of scores, which can be interpreted
as coordinates of the spectrum in the principal component space:
$$s = AE$$
It is also possible to reconstruct a spectrum from its principal component scores, knowing the
transformation eigenvectors:
$$A_r = sE^T$$
If the transformed spectrum belongs to the training set and the number of principal components in the model accounts for 100 % of the variance, the reconstructed spectrum is identical to the transformed spectrum. Normally, since only the primary principal components are used for reconstruction, the two spectra differ slightly. The difference is called the residual spectrum:
$$R = A - A_r$$
The variance of the residual spectrum:
$$V_r = RR^T$$
can be used as an indicator of whether the spectrum belongs to the same distribution as the training set spectra; this is the so-called Residual Variance method.
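As an illustration, the following Python sketch (assuming NumPy; the array shapes, the variable names, and the use of an SVD to obtain the eigenvectors are ours, not part of the Vision software) walks through the score transformation, the reconstruction, and the residual variance described above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))   # 50 training spectra, 200 wavelengths
X = X - X.mean(axis=0)           # mean-center the training set

# Eigenvectors of the covariance matrix, obtained here via SVD of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 5                            # keep only the primary principal components
E = Vt[:k].T                     # eigenvector matrix, shape (wavelengths, k)

A = X[0]                         # a spectrum to transform
s = A @ E                        # scores:            s   = A E
A_r = s @ E.T                    # reconstruction:    A_r = s E^T
R = A - A_r                      # residual spectrum: R   = A - A_r
V_r = R @ R                      # residual variance: V_r = R R^T
print(V_r)
```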
3.2 Distance Metrics
3.2.1 Euclidean Distance
The Euclidean distance between two objects (e.g., spectra) x and y with n coordinates (wavelengths or principal component scores) is calculated according to the following formula:

$$D_E = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
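A direct transcription of this formula into Python (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def euclidean_distance(x, y):
    """D_E = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 )."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two 3-point "spectra": D_E = sqrt(1 + 4 + 9) = 3.742
print(euclidean_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```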
3.2.2 Mahalanobis Distance
The Mahalanobis distance, on the other hand, is the distance between a spectrum A and the center of the distribution of a set of spectra, represented by a covariance matrix C:

$$D_M^2 = (A - \mu)^T C^{-1} (A - \mu)$$
where µ denotes the mean spectrum of the distribution. The spectrum A may or may not belong to the training set. Calculating the Mahalanobis distance in this way has disadvantages: the covariance matrix C is large, and when the spectra contain more wavelengths than the training set has samples, C is singular and cannot be inverted. An alternative, simpler approach calculates the Mahalanobis distance from the primary principal component scores after the secondary PCs have been rejected from the model:
$$D_{M,i}^2 = (n-1) \sum_{j=1}^{k} \frac{s_{ij}^2}{\sum_{i=1}^{n} s_{ij}^2}$$
where the index j runs over all k principal components used in the model, and the index i runs over all n samples in the training set.
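The following Python sketch (NumPy assumed; function and variable names are illustrative, not part of the Vision software) implements both forms. The score-based form is written for an arbitrary score vector s; for a training spectrum i, s is simply the i-th row of the score matrix. Because the principal component scores are uncorrelated, the covariance matrix in score space is diagonal, so the matrix inversion reduces to divisions by the per-component score variances:

```python
import numpy as np

def mahalanobis_full(A, X):
    """D_M^2 = (A - mu)^T C^{-1} (A - mu), full covariance form."""
    mu = X.mean(axis=0)              # mean spectrum of the training set
    C = np.cov(X, rowvar=False)      # covariance matrix (wavelengths x wavelengths)
    d = A - mu
    return d @ np.linalg.inv(C) @ d  # fails if C is singular

def mahalanobis_scores(s, S):
    """D_M^2 from PC scores: (n - 1) * sum_j s_j^2 / sum_i S_ij^2.

    s -- scores of the spectrum under test, shape (k,)
    S -- score matrix of the n training spectra, shape (n, k)
    """
    n = S.shape[0]
    return (n - 1) * np.sum(s ** 2 / np.sum(S ** 2, axis=0))
```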