Application of Machine Learning in Improving System Reliability and Performance
Improving reliability and performance is of utmost importance for any system. This thesis presents two machine learning based techniques: one that improves the reliability of parallel programs by detecting silent data corruption, and another that improves the performance of factory operations by optimizing the scheduling algorithm and detecting bottlenecks. The size and complexity of supercomputing clusters are rapidly increasing to cater to the needs of complex scientific applications. At the same time, the feature size and operating voltage of their internal components are decreasing. This dual trend makes these machines extremely vulnerable to soft errors, or random bit flips. In complex parallel applications, soft errors can cause silent data corruption, which can lead to large inaccuracies in the final computational results. Hence, it is important to determine the presence and severity of such errors early, so that proper countermeasures can be taken. In this thesis, we introduce a tool called Sirius, which accurately identifies silent data corruptions based on the simple insight that most variables in such programs exhibit spatial and temporal locality. Spatial locality means that the values of a variable at nodes that are close by in a network sense are also close numerically. Similarly, temporal locality means that the values change slowly and continuously over time. Sirius uses neural networks to learn such locality patterns, separately for each critical variable, and produces probabilistic assertions that can be embedded in the code of the parallel program to detect silent data corruptions. We have implemented this technique on the parallel benchmark programs LULESH and CoMD. Our evaluations show that Sirius detects silent errors with much higher accuracy than previously proposed methods.
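The locality insight can be illustrated with a minimal sketch. This is not the Sirius implementation: Sirius learns locality patterns with neural networks, whereas the sketch below substitutes two hand-written estimators (the mean of neighboring nodes and a linear extrapolation from earlier timesteps) and a fixed threshold. The names `flip_bit` and `is_suspect`, the array layout, the threshold value, and the rule of flagging only when both estimates disagree with the observed value are all assumptions made for illustration.

```python
import struct

import numpy as np


def flip_bit(value, bit):
    """Return `value` (a 64-bit float) with one bit flipped, emulating a soft error."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def is_suspect(history, neighbors, node, threshold):
    """Hypothetical locality check for one variable.

    history   -- array of shape (timesteps, nodes), latest timestep last,
                 with at least three timesteps
    neighbors -- indices of nodes adjacent to `node`
    threshold -- maximum deviation tolerated from each estimate (assumed fixed)
    """
    current = history[-1, node]
    # Spatial locality: nearby nodes should hold numerically close values.
    spatial_estimate = history[-1, neighbors].mean()
    # Temporal locality: values change slowly, so extrapolate linearly
    # from the two previous timesteps.
    temporal_estimate = 2.0 * history[-2, node] - history[-3, node]
    spatial_dev = abs(current - spatial_estimate)
    temporal_dev = abs(current - temporal_estimate)
    # Assumed rule: flag only when both estimates disagree with the value.
    return bool(spatial_dev > threshold and temporal_dev > threshold)


# A smooth synthetic field over 4 timesteps and 5 nodes.
steps, nodes = np.meshgrid(np.arange(4), np.arange(5), indexing="ij")
clean = 1.0 + 0.01 * steps + 0.02 * nodes

# Corrupt the latest value at node 2 by flipping a high exponent bit.
corrupted = clean.copy()
corrupted[-1, 2] = flip_bit(corrupted[-1, 2], 61)

# Flip the lowest mantissa bit instead: the change is far below the threshold.
benign = clean.copy()
benign[-1, 2] = flip_bit(benign[-1, 2], 0)

print(is_suspect(clean, [1, 3], 2, threshold=0.05))
print(is_suspect(corrupted, [1, 3], 2, threshold=0.05))
print(is_suspect(benign, [1, 3], 2, threshold=0.05))
```

In this sketch a flip in a high exponent bit is flagged, while a flip in the lowest mantissa bit is not, which mirrors the point that severity matters as well as presence. A learned model would replace both the fixed estimators and the fixed threshold.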
Sirius detected 98% of the silent data corruptions with a false positive rate of less than 0.02, compared to the false positive rate of 0.06 incurred by the state-of-the-art acceleration based prediction (ABP) technique.
As advancements in electronics and computer engineering have led to improved simulation tools and software, there is a strong push toward simulation based optimization of factory operations. This is particularly useful because it lets the person running the simulation observe the effect of changes on the factory model without impacting actual production. Improper and inefficient scheduling and the presence of system bottlenecks are two major factors that affect the throughput, and thereby the profits, of a factory. We introduce Minerva, a machine learning based technique that can be applied to simulations of factory models to ensure optimal scheduling and to identify bottlenecks. Minerva uses reinforcement learning to provide a schedule that performs significantly better than popular scheduling techniques on a more realistic extension of Job Shop Scheduling Problems. Minerva also uses neural networks to detect bottleneck resources in the system with much higher accuracy than traditional bottleneck identification methods. We evaluated Minerva on two representative benchmarks and found that it performs significantly better than popular scheduling techniques on a more realistic factory model. For a given scheduling algorithm, Minerva detects the system bottleneck with an accuracy of 95.2%, which is almost 25% better than the best of the popular bottleneck identification methods.
Bagchi, Purdue University.