In recent years, large distributed systems have taken a prominent
role not just in scientific inquiry, but also in our daily lives. When
we perform a search on Google, stream content from Netflix, place an
order on Amazon, or catch up on the latest comings and goings on
Facebook, our seemingly minute requests are processed by complex systems
that sometimes include hundreds of thousands of computers, connected by
both local-area and wide-area networks. Distributed systems help programmers aggregate the resources of many
networked computers to construct highly available and scalable services.
Recent research in the field of distributed systems has described several solutions for managing large-scale data and computation. However, building and using these systems poses a number of more fundamental challenges: How do we keep the system operating correctly even when individual machines fail? How do we ensure that all machines have a consistent view of the system’s state, even in the presence of failures? How can we determine the order of events in a system where we can’t assume a single global clock? Many of these fundamental problems were identified and solved over the course of several decades.
This class provides an overview of the influential research that forms the
basis of most large-scale cloud infrastructure today. Students read, review, and discuss papers
on important distributed systems topics, including distributed consensus, consistency
models and algorithms, service-oriented architectures, large-scale data storage, distributed
transactions, big-data processing frameworks, and distributed systems security.