2 preprocessing calibration data, 3 selecting the number of factors, 4 cross validation – Metrohm Vision – Theory User Manual

1.3.2 Preprocessing Calibration Data

Spectral and constituent data are routinely preprocessed before the PLS calibration. (Here, the term
“preprocessing” refers to calculations performed during the PLS calibration, not operations on
spectra before calibration such as taking the second derivative.) The vector of constituent values is
mean centered and scaled to a variance of one (1). The set of training spectra is always mean centered.
Of course, the mean values and scaling factors are accounted for in the final calibration equation.
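
The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not Vision's implementation: the function names, the toy data, and the choice of the sample standard deviation (n - 1 divisor) are assumptions made here.

```python
from statistics import mean, stdev

def preprocess(spectra, constituents):
    """Mean-center the spectra; mean-center and scale the constituent values."""
    n_wavelengths = len(spectra[0])
    # Mean absorbance at each wavelength across the training set.
    spectrum_mean = [mean(row[j] for row in spectra) for j in range(n_wavelengths)]
    centered = [[row[j] - spectrum_mean[j] for j in range(n_wavelengths)]
                for row in spectra]
    y_mean = mean(constituents)
    y_scale = stdev(constituents)  # assumption: sample standard deviation (n - 1)
    scaled = [(y - y_mean) / y_scale for y in constituents]
    return centered, scaled, spectrum_mean, y_mean, y_scale

def to_original_units(y_scaled, y_mean, y_scale):
    """Undo the centering and scaling, as the final calibration equation must."""
    return y_scaled * y_scale + y_mean

# Toy data (hypothetical): 4 training samples, 3 wavelengths.
spectra = [[0.10, 0.52, 0.31],
           [0.12, 0.58, 0.35],
           [0.09, 0.49, 0.30],
           [0.15, 0.63, 0.38]]
constituents = [10.0, 12.0, 9.0, 14.0]

centered, scaled, spectrum_mean, y_mean, y_scale = preprocess(spectra, constituents)
```

Keeping `spectrum_mean`, `y_mean`, and `y_scale` alongside the model is what allows a later prediction to be converted back to the original constituent units.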

Before calculating spectral loadings, another calculation is performed to determine how effectively
the data at each wavelength explain residual concentration values. (It is analogous to the correlation
coefficient described for MLR.) The result is called a weighting vector, or simply a weight. Vision
scales the data so weights are proportional to the product of correlation and variance in the spectral
data, with the result that wavelengths with high absorptivity are emphasized.
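
As an illustration of why such weights favor wavelengths with large spectral variation, the sketch below uses the textbook PLS1 form, in which the first weight vector is proportional to the product of the centered spectra (transposed) and the centered, scaled constituent vector. In that form each weight is proportional to the correlation at that wavelength multiplied by the standard deviation of the spectral data there. Whether Vision uses exactly this scaling is an assumption; the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def center(values):
    m = mean(values)
    return [v - m for v in values]

def normalize(vector):
    length = sqrt(sum(v * v for v in vector))
    return [v / length for v in vector]

def first_weights(spectra, constituents):
    """Textbook PLS1 first weight vector: normalized X^T y on preprocessed data."""
    sy = stdev(constituents)
    y = [v / sy for v in center(constituents)]
    raw = []
    for j in range(len(spectra[0])):
        column = center([row[j] for row in spectra])
        raw.append(sum(x * yi for x, yi in zip(column, y)))
    return normalize(raw)

def correlation_times_spread(spectra, constituents):
    """Pearson correlation with y times the standard deviation, per wavelength."""
    y = center(constituents)
    result = []
    for j in range(len(spectra[0])):
        column = center([row[j] for row in spectra])
        r = (sum(a * b for a, b in zip(column, y))
             / sqrt(sum(a * a for a in column) * sum(b * b for b in y)))
        result.append(r * stdev(column))
    return normalize(result)

# Toy data (hypothetical): the second wavelength varies most and tracks y closely.
spectra = [[0.10, 0.50, 0.30],
           [0.11, 0.60, 0.29],
           [0.09, 0.45, 0.31],
           [0.12, 0.70, 0.30]]
constituents = [10.0, 12.0, 9.0, 14.0]

weights = first_weights(spectra, constituents)
reference = correlation_times_spread(spectra, constituents)
```

After normalization the two vectors agree element by element, and the wavelength with the largest spectral variation receives the largest weight.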

1.3.3 Selecting the Number of Factors

Generally, more PLS factors can be calculated than are appropriate for use in a final calibration.
Deciding how many factors to use is an important part of PLS calibration. With too few factors, the
calibration accounts for too little information and gives correspondingly high prediction errors. When
too many factors are used, the model overfits the calibration data (noise or systematic errors unique
to the training set are included in the model), resulting in a model that is not robust or stable.

Usually the optimal number of factors is established during cross validation or by using an external
prediction set. When a validation set is available, Vision calculates the PRESS (Prediction Residual
Error – Sum of Squares) for each number of factors, then recommends the number of factors that gives
the minimum PRESS value.
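
The selection rule can be sketched as follows. The sketch assumes that predictions for the validation samples are already available from models built with 1, 2, 3, ... factors; the function names and the numbers are illustrative only and do not come from Vision.

```python
def press(reference, predicted):
    """Sum of squared differences between reference and predicted values."""
    return sum((r - p) ** 2 for r, p in zip(reference, predicted))

def recommend_factors(reference, predictions_by_factor):
    """predictions_by_factor[k] holds the predictions of the model with k + 1 factors."""
    press_values = [press(reference, pred) for pred in predictions_by_factor]
    best_index = min(range(len(press_values)), key=press_values.__getitem__)
    return best_index + 1, press_values

# Hypothetical validation set: reference (lab) values and predictions per model.
reference = [10.0, 12.0, 9.0, 14.0]
predictions_by_factor = [
    [11.0, 11.0, 10.0, 13.0],  # 1 factor: underfit
    [10.5, 12.0, 9.0, 13.5],   # 2 factors
    [10.0, 13.0, 9.0, 13.0],   # 3 factors: starting to overfit
]

best_factors, press_values = recommend_factors(reference, predictions_by_factor)
print(best_factors, press_values)  # prints: 2 [4.0, 0.5, 2.0]
```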

1.3.4 Cross Validation

As an alternative to using validation samples in the calculation of PRESS, a cross validation can be
performed using the training set. In cross validation, samples in the training set are grouped into
subsets. A subset may contain several samples or only one.

During cross validation, one subset is withheld while a calibration is created with the remaining
training samples. Then, the resulting calibration is used to analyze samples in the subset as
unknowns. Finally, the predicted constituent values are subtracted from the reference (lab) values,
and the differences are squared and summed. The first subset is returned to the training set, and in
turn every remaining subset is analyzed in the same fashion as the first. The resulting PRESS value at
each factor is an indicator of how well the PLS model performs. A related indicator of performance is
MSECV (Mean Squared Error of Cross Validation).
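
The procedure above can be sketched as a short loop. Several things here are assumptions: the interleaved grouping of samples into subsets, MSECV taken as PRESS divided by the number of predicted samples, and above all the model, which is a one-variable least-squares line standing in for a PLS calibration so that the example stays self-contained. For PLS, the same loop would be repeated for each number of factors.

```python
from statistics import mean

def cross_validation_press(x, y, n_subsets, fit, predict):
    """Withhold each subset in turn, calibrate on the rest, accumulate PRESS."""
    n = len(y)
    # Assumption: samples are assigned to subsets in interleaved order.
    subsets = [list(range(start, n, n_subsets)) for start in range(n_subsets)]
    press = 0.0
    for held_out in subsets:
        held = set(held_out)
        train_x = [x[i] for i in range(n) if i not in held]
        train_y = [y[i] for i in range(n) if i not in held]
        model = fit(train_x, train_y)
        for i in held_out:
            # Withheld samples are predicted as unknowns.
            press += (y[i] - predict(model, x[i])) ** 2
    msecv = press / n  # mean squared error over all predicted samples
    return press, msecv

# Stand-in calibration (NOT PLS): a straight line fitted by least squares.
def fit_line(xs, ys):
    mx, my = mean(xs), mean(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data: one spectral value per sample and its lab value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

press_3, msecv_3 = cross_validation_press(x, y, 3, fit_line, predict_line)
press_loo, msecv_loo = cross_validation_press(x, y, len(y), fit_line, predict_line)
```

Setting `n_subsets` equal to the number of samples gives the case in which each subset contains only one sample.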

1.4 Statistical Evaluation of Calibration Equations

Several parameters calculated during calibration or prediction of the validation set indicate the quality
of the calibration equation and its usefulness in predicting unknowns.

1.4.1 Multiple Correlation Coefficient

The Multiple Correlation Coefficient (R²) is a measure of how well the spectral data fit the constituent
values. This statistical quantity, also called the Coefficient of Multiple Determination, is equal to zero
(0) when the spectral response is unrelated to the constituent data (the relationship is statistically random). A
value of one (1) signifies that the constituent values fit spectral data perfectly and all residuals are