In recent years, large distributed systems have taken a prominent role not just in scientific inquiry, but also in our daily lives. When we perform a search on Google, stream content from Netflix, place an order on Amazon, or catch up on the latest comings-and-goings on Facebook, our seemingly minute requests are processed by complex systems that sometimes include hundreds of thousands of computers, connected by both local and wide area networks. Distributed systems help programmers aggregate the resources of many networked computers to construct highly available and scalable services.

Recent research in the field of distributed systems has described several solutions for managing large-scale data and computation. However, building and using these systems poses a number of more fundamental challenges: How do we keep the system operating correctly even when individual machines fail? How do we ensure that all the machines have a consistent view of the system’s state, even in the presence of failures? How can we determine the order of events in a system where we can’t assume a single global clock? Many of these fundamental problems were identified and solved over the course of several decades.

This class provides an overview of the influential research that forms the basis of most large-scale cloud infrastructures today. Students read, review, and discuss papers on important distributed systems topics, including distributed consensus, consistency models and algorithms, service-oriented architectures, large-scale data storage, distributed transactions, big-data processing frameworks, and distributed systems security.