Fault tolerance and dynamic partitioning in large-scale parallel systems
Fault tolerance and dynamic partitioning are two important issues in the design of large-scale parallel systems. Most previous work on the fault-tolerant design of multistage interconnection networks (MINs) has focused on improving the reliability of the MINs themselves. This study investigates the possibility of adding redundancy to MINs, as well as to other subsystems, to enhance overall system reliability, and analyzes the improvement that can be obtained. The Dynamic Redundancy (DR) network presented provides the full capability of a Generalized Cube, can tolerate network faults, and supports a system in tolerating processing element (PE) faults without degradation in performance. It is shown that no matter how much redundancy is added to a MIN, the system reliability cannot exceed a certain bound; however, using the DR network and spare PEs, this bound can be exceeded. The incorporation of the DR network and spare PEs into the basic PASM structure is examined.

The problem of partitioning parallel systems is also discussed. Many parallel systems can be partitioned into independent subsystems of different sizes, each subsystem having the characteristics of a complete system of the same size. A parallel system can thus be partitioned to execute simultaneously tasks of various sizes and computation structures. Inappropriate partitioning strategies may create many resource fragments, similar to the fragmentation problem in paged memory, and may cause a loss of computation power. Dynamic partitioning, which can alleviate the resource fragmentation problem, is studied based on a lattice model, a special partial ordering relation on a set. Procedures to manage resources in partitionable systems are presented. These procedures can be applied each time a subsystem changes its status.
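The reliability bound mentioned above can be illustrated with a deliberately simplified model, which is an assumption made here for illustration and not the analysis used in the dissertation. Suppose a system needs N working PEs, each PE survives independently with probability r_pe, and the network survives with probability r_net. Without spare PEs, even a perfect network (r_net = 1) cannot lift system reliability above r_pe^N. If the network can also route around faulty PEs, as the DR network is described as doing, then with some spare PEs the system survives whenever at least N of the PEs work. The function names and the numeric values below are hypothetical.

```python
from math import comb


def reliability_without_spares(r_pe, n, r_net=1.0):
    """All n PEs and the network must work (independent failures assumed)."""
    return r_net * r_pe ** n


def reliability_with_spares(r_pe, n, spares, r_net=1.0):
    """At least n of the n + spares PEs must work, and the network must work.

    Assumes the network can substitute any spare for any faulty PE,
    which is a simplification of a real spare-switching scheme.
    """
    total = n + spares
    p_enough_pes = sum(
        comb(total, k) * r_pe ** k * (1.0 - r_pe) ** (total - k)
        for k in range(n, total + 1)
    )
    return r_net * p_enough_pes


# Hypothetical numbers: 16 PEs, each 99% reliable.
bound = reliability_without_spares(0.99, 16)  # perfect network, no spares
spared = reliability_with_spares(0.99, 16, spares=1, r_net=0.999)

print(f"bound without spare PEs (perfect network): {bound:.4f}")
print(f"one spare PE, slightly imperfect network:  {spared:.4f}")
```

Under these assumed values, a single spare PE combined with an imperfect network already exceeds the bound that a perfect network alone cannot pass, which is the qualitative point made in the abstract.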
Major Professor: Howard Jay Siegel, Purdue University.