#### Abstract

The mirin contribution of this report is the development of novel algorithms {that make efficient use of the communication system in distributed memory architectures with processing elements interconnected by a hypercube network. These algorithms achieve almost optimal overlap of communication delays by computation, leading to a minimization of communication overhead. Rigorous ana1ytical and numerical performance analysis of our parallel algorithms are presented as well. The parallel algorithm under study in this report is the parallel Gauss-Jordan matrix inversion algorithm. Many of the ideas introduced in this report, however, apply to other linear algebra algorithms as well. Parallel Gauss-Jordan matrix inversion algorithms on the hypercube multiprocessors have been extensively studied in the literature. Two common data partitioning strategies for matrix algorithms are row-wise partitioning and submatrix partitioning. It has been claimed that for the parallel G,J inversion algorithm, submatrix partitioning scheme exhibit commlinication overhead advantages; not shared by partitions limited to rows or columns. Most parallel algorithms proposed in the literature, however, do not attempt to overlap inter-processor communication by computation. As zr result, the formula execution time=computation time + communication time is used to analyze the complexity of the parallel algorithm. However, during most of the communication time, the processors are actually idle, waiting for the data to arrive. Most commercially available parallel machines provide communication interrupt handling capability. By utilizing this feature, we believe a lot of parallel matrix algorithms can be improved by overlapping interprocessor communication and computation. In this report we piropose and analyze new parallel GJ inversion algorithms under different data partitioning strategies, with or without partial pivoting. We first propose a new broadcasting algorithm on the hypercube multiprocessor for parallel GJ algorithm. This algorithm ensures that the data are sent out from the source and arrives at the destinations at the earliest possible time. We then give the parallel GJ inversion algorithm using row partitioning. The strategy to overlap communication by computation is for each processor to compute arnd send out the data needed by the other processors as early as possible. We prove a lower bouind on the matrix size such that data transmission is flilly overlapped by computation. We also prove that the message length in the input buffer of each processor is art most 2. We also consider the algorithms under submatrix partitioning, with or without pivoting. We show that when submatrix partitioning is used, even when t,he communication is fully overlapped by computation, the communication overhead is larger than when using row partitioning. Thus, we show that by minimizing communication overhead, the row partitioning scheme can indeed have better overall performance than the submatrix partitioning scheme. Finally we extend the idea of overlapping communication and computation to the parallel LU factorization algorithm.

#### Date of this Version

June 1995