Abstract
The technology-push of die stacking and application-pull of
Big Data machine learning (BDML) have created a unique
opportunity for processing-near-memory (PNM). This paper
makes four contributions: (1) While previous PNM work
explores general MapReduce workloads, we identify three
workload characteristics: (a) irregular-and-compute-light (i.e.,
perform only a few operations per input word, including
data-dependent branches and indirect memory accesses); (b)
compact (i.e., the computation has a small intermediate live
data and uses only a small amount of contiguous input data);
and (c) memory-row-dense (i.e., process the input data without
skipping over many bytes). We show that BDMLs have
or can be transformed to have these characteristics which,
except for irregularity, are necessary for bandwidth- and
energy-efficient PNM, irrespective of the architecture. (2) Based on
these characteristics, we propose RowCore, a row-oriented
PNM architecture, which (pre)fetches and operates on entire
memory rows to exploit BDMLs’ row-density. In contrast to
this row-centric access-and-compute schedule, traditional
architectures fetch and operate on cache blocks, improving
row locality only opportunistically. (3) RowCore employs
well-known MIMD execution to handle BDMLs’ irregularity,
and sequential prefetch of input data to hide memory
latency. In RowCore, however, one corelet prefetches
a row for all the corelets, which may stray far from each
other due to their MIMD execution. Consequently, a leading
corelet may prematurely evict the prefetched data before
a lagging corelet has consumed the data. RowCore employs
novel cross-corelet flow-control to prevent such eviction. (4)
RowCore further exploits its flow-controlled prefetch for frequency
scaling based on novel coarse-grain compute-memory
rate-matching which decreases (increases) the processor clock
speed when the prefetch buffers are empty (full). Using simulations,
we show that RowCore improves performance and
energy by 135% and 20%, respectively, over a GPGPU with prefetch,
and by 35% and 34% over a multicore with prefetch, when
all three architectures use the same resources (i.e., number
of cores, and on-processor-die memory) and identical die-stacking
(i.e., GPGPUs/multicores/RowCore and DRAM).
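The flow-controlled prefetch of contribution (3) and the coarse-grain compute-memory rate matching of contribution (4) can be sketched as follows. This is an illustrative model under stated assumptions, not the paper's implementation; the names (`RowBuffer`, `next_frequency`) and parameters (capacity, frequency step) are hypothetical:

```python
# Illustrative sketch (names and parameters are assumptions, not the
# paper's implementation). One corelet prefetches consecutive memory rows
# into a buffer shared by all corelets; eviction is gated on the
# most-lagging corelet, and clock speed tracks buffer occupancy.

class RowBuffer:
    """Shared prefetch buffer holding consecutive rows [head, tail)."""
    def __init__(self, capacity, n_corelets):
        self.capacity = capacity
        self.head = 0                     # oldest row still buffered
        self.tail = 0                     # next row to prefetch
        self.progress = [0] * n_corelets  # next row each corelet will read

    def occupancy(self):
        return self.tail - self.head

    def try_prefetch(self):
        # Cross-corelet flow control: only evict rows that every corelet
        # (which may stray far apart under MIMD execution) has consumed.
        while self.head < min(self.progress) and self.head < self.tail:
            self.head += 1
        if self.occupancy() < self.capacity:
            self.tail += 1                # room available: fetch one more row
            return True
        return False                      # stalled by the lagging corelet

    def try_consume(self, corelet):
        if self.head <= self.progress[corelet] < self.tail:
            self.progress[corelet] += 1
            return True
        return False                      # row not yet prefetched


def next_frequency(freq, occupancy, capacity,
                   f_min=0.5, f_max=2.0, step=0.1):
    # Coarse-grain compute-memory rate matching: empty buffers mean the
    # cores outrun memory (lower the clock); full buffers mean memory
    # outruns the cores (raise the clock).
    if occupancy == 0:
        return max(f_min, freq - step)
    if occupancy == capacity:
        return min(f_max, freq + step)
    return freq
```

For example, with a capacity of two rows and two corelets, a third prefetch stalls until both corelets have consumed row 0, and the clock drifts down only while the buffer sits empty.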
Keywords
Processing-Near-Memory, Big Data, Machine Learning
Date of this Version
10-17-2016