4 sample selection methods, 1 outlier detection, 2 redundant samples selection – Metrohm Vision – Theory User Manual
4 Sample Selection Methods
4.1 Mahalanobis Distance in Principal Component Space
4.1.1 Outlier Detection
Principal Component Analysis (PCA) performed on the spectra of a given product yields a set of
eigenvectors with corresponding eigenvalues. The number of principal components (PCs) retained in
the model is determined by the cumulative variance threshold defined for the model. Multiplying the
spectra by the eigenvectors yields the scores (spectral coordinates in PC space).
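The steps in the preceding paragraph can be sketched in Python with NumPy. This is a minimal illustration, not Vision's implementation: the function name `pca_scores`, the mean-centering step, the use of SVD, and the default threshold of 0.99 are all assumptions made for the example.

```python
import numpy as np

def pca_scores(spectra, variance_threshold=0.99):
    """Return (scores, eigenvectors, eigenvalues) for the retained PCs.

    Hypothetical helper: mean-centering and the SVD route are assumptions,
    not details taken from the manual.
    """
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)  # assumed preprocessing: center each wavelength
    # Rows of Vt are the eigenvectors of the covariance matrix of Xc.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = s**2 / (X.shape[0] - 1)
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Smallest number of PCs whose cumulative variance reaches the threshold.
    n_pcs = int(np.searchsorted(cumulative, variance_threshold)) + 1
    n_pcs = min(n_pcs, len(eigenvalues))
    eigenvectors = Vt[:n_pcs].T
    scores = Xc @ eigenvectors  # spectra multiplied by eigenvectors
    return scores, eigenvectors, eigenvalues[:n_pcs]
```

Each row of `scores` holds the coordinates of one spectrum in PC space, and the variance of each score column equals the corresponding eigenvalue.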
The Mahalanobis distance is calculated on the PC scores of all product spectra. Assuming that
the spectra in the training set are normally distributed, the Mahalanobis distances follow a
chi-square distribution, whose properties are well known. From the chi-square distribution one can
calculate the probability that a given sample belongs to the distribution represented by the
training set.
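As an illustration, the textbook squared Mahalanobis distance of samples from a training set of PC scores might be computed as follows. The function name `mahalanobis_sq` is hypothetical, and the sketch uses the sample covariance with the usual n - 1 normalization; Vision's own normalization may differ, so absolute values need not match what the software reports.

```python
import numpy as np

def mahalanobis_sq(sample_scores, training_scores):
    """Squared Mahalanobis distance of each sample from the training-set center.

    Hypothetical helper using the standard definition d^T S^-1 d.
    """
    train = np.asarray(training_scores, dtype=float)
    center = train.mean(axis=0)
    # Sample covariance of the training scores (at least 2-D for a single PC).
    cov = np.atleast_2d(np.cov(train, rowvar=False))
    inv_cov = np.linalg.inv(cov)
    diff = np.asarray(sample_scores, dtype=float) - center
    # Row-wise quadratic form diff_i^T * inv_cov * diff_i.
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
```

Under the normality assumption stated above, it is this squared distance that is compared against a chi-square distribution.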
The method’s threshold defines the boundaries of the model ellipsoid. During analysis, samples
outside the ellipsoid fail identification or qualification. The threshold can be of two types:
probability or match value.
A threshold expressed as a probability is the recommended type. Vision has a built-in chi-square
distribution function. A sample’s Mahalanobis distance and the number of degrees of freedom of the
training set are passed to this function, which returns the probability that the sample does not
belong to the distribution represented by the training set of spectra.
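Vision's built-in function is not documented here, but a chi-square cumulative distribution function behaves in the way described. The sketch below uses only the Python standard library and evaluates the series expansion of the regularized lower incomplete gamma function. The name `chi2_cdf` is hypothetical, and treating the returned value as the "does not belong" probability is an assumption. In practice `scipy.stats.chi2.cdf` would be the usual choice.

```python
import math

def chi2_cdf(x, dof):
    """P(X <= x) for a chi-square variable with `dof` degrees of freedom.

    Hypothetical helper; the series is adequate for moderate x and is
    not tuned for extreme arguments.
    """
    if x <= 0:
        return 0.0
    a, half_x = dof / 2.0, x / 2.0
    # Series: sum over n of half_x^n / (a * (a + 1) * ... * (a + n)).
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= half_x / (a + n)
        total += term
        n += 1
    # Prefactor half_x^a * exp(-half_x) / Gamma(a), computed in log space.
    log_prefactor = a * math.log(half_x) - half_x - math.lgamma(a)
    return min(1.0, total * math.exp(log_prefactor))
```

A value close to 1 means the sample lies far out in the tail of the training distribution, so a probability threshold such as 0.95 would flag it as an outlier.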
The chi-square distribution (and consequently the Mahalanobis distance value for samples in the
training set) depends strongly on the number of samples in the training set. For example, if a
training set contains hundreds of spectra, the Mahalanobis distance value is expected to exceed one
hundred even for good samples. For this reason, scaling is usually performed by dividing the
distance value by a mean value.
However, Vision uses a different scaling factor: the number of degrees of freedom. The Mahalanobis
distance scaled in this way is used when Match Value is chosen as the type of outlier threshold.
The default threshold, 0.6, is not statistically meaningful; it was established experimentally.
Samples with a scaled Mahalanobis distance above this value are tagged as outliers.
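The Match Value test reduces to a single comparison. The sketch below assumes that "scaling" means dividing the distance by the number of degrees of freedom. The function name and signature are hypothetical; only the default of 0.6 comes from the text.

```python
def is_outlier_by_match_value(distance, dof, threshold=0.6):
    """Tag a sample as an outlier when its scaled distance exceeds the threshold.

    Assumption: the scaled distance is distance / dof.
    """
    if dof <= 0:
        raise ValueError("dof must be positive")
    scaled = distance / dof
    return scaled > threshold
```

For example, with 200 degrees of freedom a distance of 100 scales to 0.5 and passes, while a distance of 150 scales to 0.75 and is tagged as an outlier.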
4.1.2 Redundant Samples Selection
Redundant samples are detected based on Euclidean distances in PC space (calculated on PC scores).
After removal of outlier samples, remaining samples undergo redundant sample selection.
If the distance threshold method is used to select redundant samples, Vision randomly picks a
spectrum and calculates distances from this spectrum to all other spectra. This spectrum is placed in
the training (or calibration) set, and all spectra with distances smaller than the threshold are placed in
the acceptance (validation) set. The process continues until all spectra are distributed between
appropriate sets.
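The distance threshold procedure can be sketched as follows. This is one reading of the description, not Vision's code: it assumes that each new spectrum is picked at random from those not yet assigned, and the function name, the `seed` parameter, and the use of index lists are choices made for the example.

```python
import math
import random

def split_by_distance(scores, threshold, seed=None):
    """Split samples into training and acceptance index lists.

    Hypothetical helper; `scores` is a sequence of PC-score vectors.
    """
    rng = random.Random(seed)
    unassigned = list(range(len(scores)))
    training, acceptance = [], []
    while unassigned:
        # Randomly pick a spectrum and place it in the training set.
        pick = rng.choice(unassigned)
        training.append(pick)
        unassigned.remove(pick)
        still_unassigned = []
        for i in unassigned:
            # Unscaled Euclidean distance in PC space.
            if math.dist(scores[pick], scores[i]) < threshold:
                acceptance.append(i)  # redundant: close to the picked spectrum
            else:
                still_unassigned.append(i)
        unassigned = still_unassigned
    return training, acceptance
```

Under this reading, no two training spectra end up closer to each other than the threshold, and the size of the training set depends on both the threshold and the random picks.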
Because the calculated distances are not scaled, threshold values depend on the product spectra.
Therefore, to optimize sample selection for a given product, several runs may be required. For this
reason, By Number of Samples is the preferred option for sample selection. In this case Vision