
SPARC Enterprise M8000/M9000 Servers Overview Guide • December 2010
The RAS functions of the M8000/M9000 servers minimize system downtime by 
performing error checking at appropriate locations and by centrally monitoring and 
controlling that error checking. 
M8000/M9000 servers can also be configured with clustering software or 
centralized management software to enhance the RAS functions.
Any scheduled system halt, such as periodic maintenance or a system configuration 
change, can also be performed without affecting operating resources. This can 
significantly improve service uptime.
2.4.1 Reliability
Reliability represents the length of time the server can operate normally without 
failure.
Reliability is equally important for hardware and software.
To improve quality, suitable components must be selected with consideration given 
to the product service life and the required response in case of a failure. In 
evaluations such as stress tests that check the service life, components and products 
are inspected to determine whether they meet the target reliability levels.
Furthermore, software errors are triggered not only by program errors but also by 
hardware errors.
M8000/M9000 servers provide the following functions to achieve high reliability:
■ The XSCF periodically checks whether software such as the Oracle Solaris OS 
is running in each domain (host watchdog monitoring).
■ Memory patrol runs periodically to detect memory soft errors and stuck 
faults, even in memory areas not normally used. This prevents faulty memory from 
being used and causing system failures.
■ ECC protects data on all paths, including the computing units, registers, 
cache memory, and system bus, so all 1-bit errors are corrected automatically by 
hardware to ensure data integrity.
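
The single-bit correction that ECC provides can be illustrated with a small sketch. The manual does not state which ECC code the M8000/M9000 hardware uses, so the example below uses a textbook Hamming(7,4) code as an assumption, purely to show how a 1-bit error is located and corrected. The function names are hypothetical and are not part of any server interface.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword layout by position 1..7: p1 p2 d1 p3 d2 d3 d4
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, error position); position 0 means no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position
    # of the flipped bit.
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1
    return c, position


def hamming74_decode(codeword):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    damaged = list(word)
    damaged[4] ^= 1  # simulate a 1-bit error at position 5
    repaired, position = hamming74_correct(damaged)
    print("error position:", position)
    print("data recovered:", hamming74_decode(repaired))
```

A code of this kind corrects any single flipped bit in a codeword. Production ECC schemes typically add the ability to detect, though not correct, 2-bit errors.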
2.4.2 Availability
Availability is characterized by how infrequently the server fails and how quickly 
it can be recovered after a failure. It is expressed as the percentage of time 
during which the system is usable.
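
As a worked illustration of availability expressed as a percentage, the sketch below computes it in two ways: from measured uptime and downtime, and from the commonly used steady-state relation MTBF / (MTBF + MTTR). That relation and the function names are assumptions added for illustration. They do not come from the manual, and no figures here describe actual M8000/M9000 behavior.

```python
def availability_percent(uptime_hours, downtime_hours):
    """Percentage of the observed period during which the system was usable."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("observed period must be positive")
    return 100.0 * uptime_hours / total


def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair (MTTR), both in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: one 8760-hour year with 8.76 hours of downtime.
    print(f"{availability_percent(8760 - 8.76, 8.76):.3f} %")
    # Hypothetical figures: a failure every 10,000 hours, 2 hours to recover.
    print(f"{availability_from_mtbf(10_000, 2):.3f} %")
```

Both inputs to the second function correspond to the two factors named above: a larger MTBF means the server fails less often, and a smaller MTTR means it is recovered more quickly.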
