An apparatus for and method of enhancing reliability within a cluster lock
processing system having a relatively large number of commodity cluster
instruction processors which are managed by a cluster lock manager.
Because the commodity processors have virtually no system viability
features such as memory protection, failure recovery, etc., the
cluster/lock processors assume the responsibility for providing these
functions. The low cost of the commodity cluster instruction processors
makes the system almost linearly scalable. The cluster/locking, caching,
and mass storage accessing functions are fully integrated into a single
hardware platform which performs the role of the cluster/lock master.
Upon failure of this hardware platform, a second redundant hardware
platform converts from slave to master role. The logic for failure
detection and role swapping is implemented in software, which can run as
an application under a commonly available operating system. Furthermore,
recovery is accomplished entirely without the assistance of the Host
computer(s) or the ultimate user(s) coupled via the Host computer(s).
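
The failure detection and role swap described above can be illustrated
with a minimal sketch. It assumes a heartbeat-with-timeout scheme, which
the text does not specify; the names LockServer, Role, check_failover, the
FAILED state, and the 3-second timeout are all hypothetical and chosen
only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    MASTER = "master"
    SLAVE = "slave"
    FAILED = "failed"  # assumed intermediate state until the server is repaired


@dataclass
class LockServer:
    """Hypothetical model of one cluster/lock hardware platform."""
    name: str
    role: Role
    last_heartbeat: float = 0.0  # time at which this server last reported in

    def heartbeat(self, now: float) -> None:
        self.last_heartbeat = now


def check_failover(master: LockServer, slave: LockServer,
                   now: float, timeout: float = 3.0) -> bool:
    """Promote the slave if the master has been silent longer than `timeout`.

    Returns True when a role swap took place. The heartbeat mechanism and
    the timeout value are assumptions, not taken from the source text.
    """
    if master.role is Role.MASTER and now - master.last_heartbeat > timeout:
        master.role = Role.FAILED  # stays failed until repaired and restarted
        slave.role = Role.MASTER   # redundant platform takes over as master
        return True
    return False


a = LockServer("A", Role.MASTER)
b = LockServer("B", Role.SLAVE)
a.heartbeat(now=0.0)
swapped_early = check_failover(a, b, now=2.0)   # master still within timeout
swapped_late = check_failover(a, b, now=10.0)   # master silent too long
```

In this sketch the decision is made entirely between the two servers,
which mirrors the point that neither the Host computer(s) nor the users
take part in recovery.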
After the failed server is repaired, it is restarted in an orderly
fashion to resume a slave role. For the server to be completely restored,
coherent memory must be copied from master to slave. Because cluster lock
processing must be paused throughout the system while the copy is
transferred, the transfer time should be kept as short as possible to
limit the impact on system throughput.
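
The pause, copy, and resume sequence can be sketched as follows. This is
an illustrative model only: LockManager, paused_processing, and
resync_slave are hypothetical names, and a Python dictionary stands in
for the coherent memory. The returned duration is the window during which
lock processing is stalled, which is the quantity the text says should be
minimized.

```python
import time
from contextlib import contextmanager


class LockManager:
    """Hypothetical stand-in for a server's cluster lock service."""

    def __init__(self) -> None:
        self.paused = False
        self.coherent_memory = {}  # stands in for lock tables, cache state, etc.

    @contextmanager
    def paused_processing(self):
        """Pause lock processing for the duration of the with-block."""
        self.paused = True
        try:
            yield
        finally:
            self.paused = False  # always resume, even if the copy fails


def resync_slave(master: LockManager, slave: LockManager) -> float:
    """Copy coherent memory from master to slave; return seconds paused."""
    start = time.perf_counter()
    with master.paused_processing():
        # Copy while no lock requests can change the master's state.
        slave.coherent_memory = dict(master.coherent_memory)
    return time.perf_counter() - start


master = LockManager()
master.coherent_memory = {"lock:42": "held", "lock:43": "free"}
slave = LockManager()
pause_seconds = resync_slave(master, slave)
```

Because the pause lasts exactly as long as the copy, any reduction in
transfer time translates directly into less lost throughput.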