Hardware Architectures For Long Short Term Memory
Long Short-Term Memory (LSTM) networks can retain memory and learn from data sequences. LSTMs are computationally expensive, so general-purpose processors consume large amounts of power to deliver the desired performance: CPUs do not currently offer large parallelism, while large GPUs are power hungry. Low-power, high-performance hardware implementations are therefore needed before LSTM algorithms can be deployed in embedded systems. In this thesis, we present three hardware architectures that accelerate LSTM on a Xilinx FPGA. Each design uses a different strategy to balance off-chip memory bandwidth against area in order to achieve high performance and scalability. The first architecture streams all data from off-chip memory to the co-processor. It achieves high performance but is limited by its high memory bandwidth requirement. The second design stores all necessary data in on-chip memory. This keeps the memory bandwidth requirement low, but the design is limited by the available on-chip memory. The third design balances these two constraints, off-chip memory bandwidth and on-chip resources, to achieve high performance and scalability. Each co-processor was tested with a character-level LSTM language model with two layers and 128 hidden units. All of the implemented designs are faster and more power efficient than general-purpose processors. One of the designs achieved up to 63x better performance per unit power than a dual-core ARM Cortex-A9. This work could evolve into an LSTM co-processor for future embedded systems and mobile devices.
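To make the trade-off between streaming weights from off-chip memory and holding them on-chip more concrete, the following is a minimal sketch that counts the parameters of a two-layer LSTM with 128 hidden units. It assumes the standard four-gate LSTM formulation without peephole connections, and it excludes any output layer. The vocabulary size (65) and the weight width (16 bits) are illustrative assumptions only; the abstract does not state them, and the function names are hypothetical.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Count weights and biases in one LSTM layer.

    Assumes the standard formulation: four gates, each with an input
    matrix, a recurrent matrix, and a bias vector (no peepholes).
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return 4 * per_gate


def model_params(input_size: int, hidden_size: int, num_layers: int) -> int:
    """Count parameters in a stack of LSTM layers (output layer excluded)."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        total += lstm_layer_params(layer_input, hidden_size)
        # Every layer after the first takes the previous hidden state as input.
        layer_input = hidden_size
    return total


# ASSUMPTIONS (not stated in the abstract): a 65-symbol character
# vocabulary and 16-bit fixed-point weights.
VOCAB_SIZE = 65
BITS_PER_WEIGHT = 16

params = model_params(VOCAB_SIZE, hidden_size=128, num_layers=2)
kib = params * BITS_PER_WEIGHT / 8 / 1024
print(f"{params} parameters, about {kib:.0f} KiB at {BITS_PER_WEIGHT} bits each")
```

Under these assumptions, a fully streaming design would need to fetch roughly this much weight data from off-chip memory for every time step, while a fully on-chip design would need at least this much on-chip storage. Both figures grow roughly quadratically with the number of hidden units.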
Culurciello, Purdue University.
Electrical engineering|Artificial intelligence