6 library clustering, 1 general description, 2 the minimal spanning tree algorithm – Metrohm Vision – Theory User Manual
Page 26: 3 clustering algorithm, Library clustering, General description, The minimal spanning tree algorithm, Clustering algorithm, 6library clustering
24
▪▪▪▪▪▪▪
6
Library Clustering
6.1
General Description
Library clustering is an algorithm designed to generate a simplified representation of a large, diverse
library. At the first level of clustering, a library is separated into clusters that contain similar spectra.
At the second level, each of these clusters is in turn subdivided into sub-clusters. This process can
continue until clusters can no longer be subdivided by application of a set of clustering parameters,
or until the maximum number of clustering levels (32) has been reached. The lowest, non-divisible
level clusters are called leaf clusters.
6.2
The Minimal Spanning Tree Algorithm
To describe the mechanism of clustering, it is necessary to understand the basic algorithm used in
clustering, the minimal spanning tree algorithm.
Imagine a library with n products, each represented by a mean product spectrum. The number of all
possible distances between products is of the order of n². The purpose of the minimal spanning tree
algorithm is to reduce the number of distances to the most significant products.
1.
The algorithm starts with all products in a bin designated A.
2.
A product is randomly chosen as a current product and moved to a bin designated B.
3.
The distances from the current product to all remaining products in Bin A are calculated.
4.
The shortest distance from current product to all remaining products in Bin A is determined.
5.
The product for which the shortest distance was found is made the current product and
moved to Bin B. The distance value and the names of products to which it connects are saved.
If Bin A is empty, the algorithm stops here.
6.
The distances from the current product to all remaining products in the Bin A are calculated. If
the shortest distance is smaller than the distance saved in (5.), go to (4.).
7.
If the shortest distance is longer than the distance saved in (5.), move back to the last current
product, make it current product and go to (4.).
The result of the algorithm is a minimal spanning tree, which is a representation of the library with
the following properties:
•
The number of distances between library products is reduced from n(n-1)/2 to n-1.
•
The sum of those distances is the smallest for all possible trees.
•
The final result of the algorithm does not depend on the choice of the starting point.
6.3
Clustering Algorithm
Three parameters must be defined to create a clustering method:
1.
PC cumulative variance
2.
Variance radius scale
3.
The maximum leaf cluster size