Data-driven approaches to improve dependability of cloud services
The growing demand for always-on and low-latency cloud services is driving the creation of globally distributed datacenters. A major factor affecting service availability is the reliability of the network, both inside the datacenters and across the wide-area links connecting them. While several research efforts focus on building scale-out datacenter networks, little has been reported on real network failures and how they impact geo-distributed services. Towards improving the dependability of the underlying datacenter networks, in this dissertation, we make one of the first attempts to characterize intra-datacenter and inter-datacenter network failures from a service perspective. Specifically, we make the following contributions:
1. Analysis Methodology for Structured Data: Our dataset includes multiple sources of structured network telemetry data spanning three years, logged in monitoring servers of a large cloud provider comprising 100k+ servers, 10k+ core network devices, 2k+ middleboxes and 100k+ network links across 10+ datacenters. This dataset covers a wide range of network data sources, including syslog and SNMP alerts, and traffic carried by links. We describe a systematic methodology, based on event processing, for analyzing this structured data to extract events having service-level impact.
2. Analysis Methodology for Unstructured Data: Our dataset also includes an important piece of operational knowledge: network trouble tickets, which are diaries written by network operators to keep track of their troubleshooting efforts while fixing a problem. We take a practical step towards automatically analyzing natural language text in network trouble tickets to infer the problem symptoms, troubleshooting activities and resolution actions. Our system, NetSieve, combines statistical natural language processing (NLP), knowledge representation, and ontology modeling to achieve these goals.
3. Data-Driven Approaches to Deriving Actionable Insights: Our overarching goal in this dissertation is to enable operators to understand global problem trends instead of making decisions based on isolated incidents. We outline several analyses rooted in reliability analysis and applied statistics for characterizing network failures and deriving actionable insights from them. Our study reveals several important findings on (a) the failure characteristics of network elements, (b) the availability of network domains, (c) service impact, (d) causes of network failures, (e) effectiveness of repairs, and (f) modeling failures.
As part of this dissertation, we have built a broad range of systems including real-time network dashboards, a big data analytics system for analyzing network telemetry data, and an inference tool for root cause analysis in network troubleshooting. Several components of the dissertation work have either undergone tech transfer or are being used by multiple business groups inside Microsoft. NetWiser, a Microsoft Research project entailing this dissertation, was awarded the Microsoft Trustworthy Computing Reliability Award for 2013.
NetSieve, the problem inference system developed as part of this dissertation, is currently being used across different teams within Microsoft to improve network management: the Network Architecture team for comparing device reliability across platforms and vendors, the Capacity Planning team for understanding why network redundancy is ineffective in masking failures, and the Incident Management and Operations team for finding the top-k problems and failing components while troubleshooting devices and determining whether past repairs were effective. Since its inception, NetSieve has also been used to automate root cause analysis of security incidents within Microsoft's datacenters, and it recently found its way into commercial use through Microsoft's System Center Advisor (http://www.systemcenteradvisor.com).
Cristina Nita-Rotaru, Purdue University.
Information Technology|Information Science|Computer Science