This paper explores area/parallelism tradeoffs in the design of distributed shared-memory (DSM) multiprocessors built out of large single-chip computing nodes. In this context, area-efficiency arguments motivate a heterogeneous organization consisting of a few nodes with large caches designed for single-thread parallelism, and a larger number of nodes with smaller caches designed for multi-thread parallelism. This paper quantitatively studies the performance of such an organization for a set of homogeneous multiprocessor programs from the SPLASH-2 benchmark suite. These programs are mapped onto the heterogeneous processors without source code modifications via static thread assignment policies. A constant-area simulation analysis shows that a 4-node heterogeneous DSM with 21 processors outperforms its homogeneous counterpart with 4 processors by an average of 36% for the studied multiprocessor workload, while having the same performance for sequential codes. Also studied are the implications of the degree of heterogeneity in the functional units of such a heterogeneous DSM on overall system cost and performance. This paper presents a sensitivity analysis based on a factorial design experiment that determines the relative impact of heterogeneity on performance. The studied benchmarks are affected, on average, primarily by heterogeneity in processor performance (59.9%), followed by cache sizes (18.2%), memory latency (14.6%) and network latency (5.6%).