
3 Qualitative Library Development

3.1 Principal Component Analysis

The near-IR spectrum comprises intensity measurements at hundreds of wavelengths. Because there is always some correlation among the absorbance values within a spectrum, the information contained in the spectra of a set of related materials is highly redundant. Thus, although there are many wavelengths, there may be relatively little unique spectral information.

Principal Component Analysis (PCA) is a method capable of describing the unique variances in a set of
spectra using linear combinations of wavelengths (Principal Components, or PCs). These PCs are
uncorrelated and, because information in the data set is redundant, relatively few are required to
account for the significant information in the spectral data set.

Mathematically, Principal Component Analysis is performed by calculating a set of eigenvectors that diagonalizes the covariance matrix C of the training set of spectra:

$$C = E\,D\,E^{-1}$$

where C is the covariance matrix, E is the matrix of eigenvectors, and D is a square diagonal matrix of eigenvalues. The eigenvectors have the same length as the spectra in the training set and are orthonormal. From this property it follows that:

$$E\,E^{T} = 1$$

There are numerous algorithms that can be used for the eigenvalue decomposition of a matrix. Vision software uses a well-established and numerically stable algorithm called Singular Value Decomposition (SVD). The algorithm operates on a mean-centered training set of spectra and returns a set of eigenvectors together with their associated eigenvalues. The eigenvectors are arranged in decreasing order of their eigenvalues.
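As an illustration only (not the actual Vision implementation), the decomposition described above can be sketched in Python with NumPy; the array `spectra` (one spectrum per row) is a hypothetical stand-in for a real training set:

```python
import numpy as np

# Hypothetical training set: N = 20 spectra, each with W = 700 wavelengths
rng = np.random.default_rng(0)
spectra = rng.random((20, 700))

# Mean-center the training set (this costs one degree of freedom)
mean_spectrum = spectra.mean(axis=0)
centered = spectra - mean_spectrum

# Singular Value Decomposition of the mean-centered spectra.
# The rows of Vt are the eigenvectors of the covariance matrix,
# already orthonormal and sorted by decreasing singular value.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
eigenvectors = Vt  # each row has the same length as a spectrum

# Orthonormality: E E^T is the identity matrix
assert np.allclose(eigenvectors @ eigenvectors.T, np.eye(len(s)))
```

Working on the mean-centered data directly also avoids forming the covariance matrix C explicitly, which contributes to the numerical stability mentioned above.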

If all possible eigenvectors were included in a PC model, it would account for 100 % of the variance in the spectral training set. This would happen if N-1 PCs were used (N is the number of spectra in the training set; one degree of freedom is lost due to mean-centering). Normally, only a few PCs (the primary PCs) account for the majority of the variance in the training set. The remaining principal components (called secondary PCs) are usually attributed to noise.

A quantity called cumulative variance gives the percentage of the total variance described by a given number of PCs (m in this case):

$$V_c = \frac{\sum_{i=1}^{m} \lambda_i^2}{\sum_{i=1}^{N} \lambda_i^2}$$
where λ_i is the eigenvalue corresponding to the i-th eigenvector. The default value of the cumulative variance is 95 %.
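Continuing the hypothetical sketch above, the cumulative variance can be computed from the singular values `s` returned by the SVD (squared, as in the formula) and used to select the smallest m that reaches the 95 % default:

```python
# Squared singular values measure the variance captured by each PC
variances = s ** 2

# Cumulative variance V_c, as a percentage, for m = 1, 2, ...
cumulative = 100.0 * np.cumsum(variances) / variances.sum()

# Smallest number of primary PCs reaching the 95 % default
m = int(np.searchsorted(cumulative, 95.0)) + 1
print(f"{m} PCs describe {cumulative[m - 1]:.1f} % of the total variance")
```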

Principal components can be interpreted as the axes of a new, orthogonal coordinate system. Multiplication of a spectrum by an eigenvector yields a number, called a score:

$$s_i = A\,E_i$$
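To close the sketch, and assuming A stands for a mean-centered spectrum, each score is a plain dot product of the spectrum with an eigenvector; all scores for all spectra can be computed at once:

```python
# Project every mean-centered spectrum onto every eigenvector E_i;
# row j holds the scores of spectrum j on PC 1, PC 2, ...
scores = centered @ eigenvectors.T  # shape: (N, number of PCs)

# Score of the first spectrum on the first principal component
s_1 = scores[0, 0]
```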