The technology-push of die stacking and the application-pull of Big Data machine learning (BDML) have created a unique opportunity for processing-near-memory (PNM). This paper makes four contributions. (1) While previous PNM work explores general MapReduce workloads, we identify three workload characteristics: (a) irregular-and-compute-light (i.e., the computation performs only a few operations per input word, which include data-dependent branches and indirect memory accesses); (b) compact (i.e., the computation has a small amount of intermediate live data and uses only a small amount of contiguous input data); and (c) memory-row-dense (i.e., the computation processes the input data without skipping over many bytes). We show that BDMLs have, or can be transformed to have, these characteristics, which, except for irregularity, are necessary for bandwidth- and energy-efficient PNM, irrespective of the architecture. (2) Based on these characteristics, we propose RowCore, a row-oriented PNM architecture, which (pre)fetches and operates on entire memory rows to exploit BDMLs’ row-density. In contrast to this row-centric access and compute schedule, traditional architectures fetch and operate on cache blocks and improve row locality only opportunistically. (3) RowCore employs well-known MIMD execution to handle BDMLs’ irregularity, and sequential prefetch of input data to hide memory latency. In RowCore, however, one corelet prefetches a row for all the corelets, which may stray far from each other due to their MIMD execution. Consequently, a leading corelet may prematurely evict the prefetched data before a lagging corelet has consumed the data. RowCore employs novel cross-corelet flow-control to prevent such eviction. (4) RowCore further exploits its flow-controlled prefetch for frequency scaling based on novel coarse-grain compute-memory rate-matching, which decreases (increases) the processor clock speed when the prefetch buffers are empty (full). Using simulations, we show that RowCore improves performance and energy by 135% and 20% over a GPGPU with prefetch, and by 35% and 34% over a multicore with prefetch, when all three architectures use the same resources (i.e., number of cores and on-processor-die memory) and identical die-stacking (i.e., GPGPUs/multicores/RowCore and DRAM).
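
The cross-corelet flow control of contribution (3) can be illustrated with a toy software model. This is only a sketch under stated assumptions, not RowCore's hardware design: the class `RowPrefetchBuffer`, its fixed number of row slots, and the per-corelet consumption counters are all hypothetical names and mechanisms introduced here for illustration. The idea shown is that a new row may overwrite an old one only after every corelet, including the slowest, has finished with the old row.

```python
class RowPrefetchBuffer:
    """Toy model of a shared row buffer with cross-corelet flow control.

    Hypothetical illustration: slot count and counters are assumptions,
    not details taken from the paper.
    """

    def __init__(self, num_slots, num_corelets):
        self.num_slots = num_slots
        self.next_row = 0                    # index of the next row to prefetch
        self.consumed = [0] * num_corelets   # rows each corelet has finished

    def oldest_live_row(self):
        # The lagging corelet determines which rows are still needed.
        return min(self.consumed)

    def can_prefetch(self):
        # Prefetching row `next_row` would overwrite row `next_row - num_slots`;
        # allow it only if every corelet is already done with that row.
        return self.next_row - self.oldest_live_row() < self.num_slots

    def prefetch(self):
        if not self.can_prefetch():
            return False                     # stall instead of evicting
        self.next_row += 1
        return True

    def consume(self, corelet):
        # A corelet can only consume rows that have already been prefetched.
        if self.consumed[corelet] >= self.next_row:
            return False                     # corelet waits for the prefetch
        self.consumed[corelet] += 1
        return True
```

In this model a leading corelet that has consumed every buffered row simply waits, because `prefetch()` refuses to advance until the lagging corelet frees the oldest slot.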

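The coarse-grain compute-memory rate-matching of contribution (4) can likewise be sketched as a simple governor driven by prefetch-buffer occupancy. This is a minimal sketch, assuming a threshold-based policy: the function name `next_frequency`, the thresholds, the step size, and the frequency limits are hypothetical values chosen for illustration, not parameters from the paper. It follows the stated direction only: lower the clock when the buffers are empty, raise it when they are full.

```python
def next_frequency(occupancy, capacity, freq_mhz,
                   f_min=500, f_max=2000, step=100, low=0.25, high=0.75):
    """Return the next clock frequency (MHz) from prefetch-buffer occupancy.

    All defaults are hypothetical illustration values.
    """
    fill = occupancy / capacity
    if fill <= low:
        # Buffers (nearly) empty: compute outruns memory, so slow the clock.
        return max(f_min, freq_mhz - step)
    if fill >= high:
        # Buffers (nearly) full: memory outruns compute, so speed the clock up.
        return min(f_max, freq_mhz + step)
    return freq_mhz                          # rates roughly matched
```

Because the decision depends only on buffer occupancy, such a policy can be evaluated at coarse intervals rather than on every access.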

Processing-Near-Memory, Big Data, Machine Learning

Date of this Version