A unified framework for transparent parallelism and fault-tolerance in distributed systems

Sunghwan Yoo, Purdue University


Today, many distributed systems are deployed in high-performance computing environments such as multi-core architectures or managed networks like data centers. As new computing architectures require more parallelism to improve performance and responsiveness, implementing distributed applications that work consistently on parallel architectures without deadlocks or data races has become a challenging task. Moreover, data center applications must also handle fault tolerance, because random or correlated crash-restart failures can happen in data centers. Many approaches have been proposed independently to make data center applications concurrent, fault-tolerant, or both. Popular applications such as graph computing systems and non-relational database systems have their own mechanisms for handling concurrency and failures. There are also more generic frameworks that provide both parallelism and fault tolerance, including data computing frameworks, message-passing interfaces, and software transactional memory systems. However, building a data center application on these generic frameworks may require major restructuring or learning a new paradigm. In this dissertation, we present one solution that provides parallelism, another that provides fault tolerance, and a third that provides both, all transparently within an event-driven system framework. First, we present InContext, a concurrent event execution model that runs events in parallel by associating access behaviors with shared variables. Second, we present Ken, an uncoordinated rollback recovery protocol for event-driven systems that masks crash-restart failures and guarantees composable reliability. We also present MaceKen, an integration of Ken with the Mace framework that transparently provides crash-restart fault tolerance for legacy Mace applications. Finally, we propose MultiKen, a combined framework for parallelism and fault tolerance in event-driven systems.




Xu, Purdue University.

Subject Area

Computer science
