Foundations of Distributed Systems
January 7, 2026
Note: All code examples are in Rust. You should be comfortable with basic Rust: ownership, borrowing, structs, enums, Result, and match. If terms like impl or &mut self are unfamiliar, work through the Rust Book concepts first; I already have notes on those.
How This Document Is Organized
This guide has two distinct parts:
| Part | What It Covers | Why You Need It |
|---|---|---|
| Part A: Prerequisites | Networking fundamentals, how computers communicate | You can't understand "the network is unreliable" without knowing what a network actually is |
| Part B: Distributed Systems Concepts | The actual distributed systems theory — fallacies, failures, time, ordering | This is the core content |
Suggested approach:
- Read Part A if networking is new to you (or skim if you're familiar)
- Read Part B carefully — this is the main part (code examples are embedded with concepts)
PART A: PREREQUISITES
Networking Fundamentals & Rust Basics
Skip this part if: You already understand TCP/IP, can explain what happens when you visit a website, and know what ports are.
A.1: What is a Distributed System?
A distributed system is a collection of independent computers that appears to its users as a single coherent system.
When you use Google Search, you don't think "I'm querying thousands of machines across dozens of data centers." You just see a search box and get results. That's the magic, and the challenge, of distributed systems.
But let's make this concrete. What does "computers talking to each other" actually mean?
What's Actually Happening (The Physical Reality)
When your computer sends data to a server, here's the physical reality:
- Your computer converts data into electrical signals
- Those signals travel through cables (ethernet, fiber optic) or radio waves (WiFi)
- They pass through multiple intermediate devices (routers, switches)
- Eventually they reach the destination computer
- That computer converts the signals back into data
Key insight: At the physical level, all you can do is send electrical pulses. Everything else (files, messages, web pages) is abstraction built on top of this.
The Communication Problem
Imagine you can only communicate by flashing a light on and off. How do you send the message "Hello"?
You need to agree on:
- Encoding: What patterns mean what? (Like Morse code)
- Timing: How long is a dot vs. a dash? When does one letter end and the next begin?
- Error handling: What if the other person blinks and misses a flash?
- Addressing: If multiple people are watching, who is this message for?
Networks solve all these problems through protocols: agreed-upon rules for communication. The internet is built from layers of protocols, each solving different problems.
Layers of Abstraction
Computer networking is built in layers:
┌─────────────────────────────────────────┐
│ Application Layer (HTTP, SMTP, etc.) │ ← "I want to load a webpage"
├─────────────────────────────────────────┤
│ Transport Layer (TCP, UDP) │ ← "Deliver this data reliably"
├─────────────────────────────────────────┤
│ Network Layer (IP) │ ← "Route this to the right machine"
├─────────────────────────────────────────┤
│ Link Layer (Ethernet, WiFi) │ ← "Send bits to the next hop"
├─────────────────────────────────────────┤
│ Physical Layer │ ← Actual electrical/light signals
└─────────────────────────────────────────┘
Why layers? Each layer only talks to the layers directly above and below it. This means:
- You can change the physical layer (switch from ethernet to WiFi) without rewriting your web browser
- You can change the application (switch from HTTP to email) without changing how packets are routed
Analogy: Sending a letter.
- You write a letter (Application layer: your content)
- You put it in an envelope with an address (Transport/Network: addressing)
- The postal service figures out how to route it through sorting facilities (Network: routing)
- Trucks and planes physically carry it (Physical: actual movement)
You don't think about the trucks when writing. The layers hide complexity.
A.2: IP Addresses and Packets
IP Addresses: How Machines Are Identified
Every device on the internet has an IP address: a unique identifier, like a phone number for computers.
IPv4 addresses look like: 192.168.1.1 or 142.250.80.46
- Four numbers separated by dots
- Each number is 0-255 (one byte each, so 4 bytes total)
- About 4.3 billion possible addresses (2^32)
- We've basically run out of these, which is why IPv6 exists
IPv6 addresses look like: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
- Much longer (16 bytes = 128 bits)
- Enough addresses for every grain of sand on Earth
- Slowly being adopted
Special addresses you'll see often:
- 127.0.0.1: Called "localhost"; always means "this computer I'm on right now"
- 192.168.x.x, 10.x.x.x, and 172.16.x.x to 172.31.x.x: Private addresses (your home network uses these)
- 0.0.0.0: Means "all interfaces" when binding a server (we'll explain this later)
When you type google.com in your browser, your computer first asks "what's the IP address for google.com?" The answer might be 142.250.80.46. Then your computer talks to that IP address. The translation from name to IP is done by DNS (Domain Name System): itself a massive distributed system.
Packets: Data Travels in Chunks
Here's something that might surprise you: when you download a 1GB file, it doesn't travel as one continuous stream of 1 billion bytes. Instead, it's broken into thousands of small packets, typically around 1,500 bytes each.
Why packets?
Sharing: Imagine a single phone line between two cities. If one person made a call that lasted an hour, no one else could use the line. But if we break conversations into tiny chunks and interleave them, many conversations can share the same line. That's what packet switching does: your Netflix stream, your neighbor's video call, and a business's database queries all share the same wires, packet by packet.
Reliability: If you sent 1GB as one continuous stream and there was an error at byte 999,999,999, you'd have to resend the entire gigabyte. With packets, you only resend the one packet that failed.
Routing: Different packets can take different paths. If one route is congested, packets can go around it.
What's in a packet?
Every packet has two parts: a header (metadata) and a payload (actual data).
┌──────────────────────────────────────────────────────┐
│ HEADER (metadata about this packet) │
│ ┌─────────────────┬──────────────────────────────┐ │
│ │ Source IP │ 192.168.1.100 │ │ ← Who sent this
│ │ Destination IP │ 142.250.80.46 │ │ ← Where it's going
│ │ Protocol │ TCP │ │ ← How to interpret payload
│ │ Length │ 1420 bytes │ │ ← How big the payload is
│ │ Checksum │ 0x3a7f │ │ ← For error detection
│ │ TTL │ 64 │ │ ← Hops before packet dies
│ │ ... more fields │ │ │
│ └─────────────────┴──────────────────────────────┘ │
├──────────────────────────────────────────────────────┤
│ PAYLOAD (actual data being sent) │
│ "GET /index.html HTTP/1.1\r\nHost: google.com..." │
│ (up to ~1460 bytes for typical internet traffic) │
└──────────────────────────────────────────────────────┘
Think of it like a postcard: the address side is the header, the message side is the payload.
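To make the header concrete, here's a toy Rust model of those fields. This is an illustrative sketch only; a real IPv4 header packs these fields into a precise bit layout, which this struct doesn't attempt to reproduce.
use std::net::Ipv4Addr;
// A simplified, illustrative packet model - NOT the actual IPv4 wire format
#[derive(Debug)]
struct Packet {
    source_ip: Ipv4Addr,      // who sent this
    destination_ip: Ipv4Addr, // where it's going
    ttl: u8,                  // hops remaining before the packet is discarded
    checksum: u16,            // for error detection
    payload: Vec<u8>,         // the actual data (typically up to ~1,500 bytes)
}
fn main() {
    let packet = Packet {
        source_ip: Ipv4Addr::new(192, 168, 1, 100),
        destination_ip: Ipv4Addr::new(142, 250, 80, 46),
        ttl: 64,
        checksum: 0x3a7f,
        payload: b"GET /index.html HTTP/1.1\r\n".to_vec(),
    };
    println!("{} payload bytes headed to {}", packet.payload.len(), packet.destination_ip);
}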
Routing: How Packets Find Their Way
Your packet doesn't teleport directly to its destination. It hops through multiple routers: devices whose job is to forward packets toward their destination.
Your Computer
↓
Your Home Router (192.168.1.1)
↓
ISP's Router
↓
Regional Hub
↓
... (5-20 more hops typically) ...
↓
Google's Edge Router
↓
Google's Server (142.250.80.46)
Each router looks at the destination IP address and decides: "based on my routing table, which of my neighbors is closest to this destination?" Then it forwards the packet that direction.
You can see this yourself! In a terminal, run:
# On Mac/Linux:
traceroute google.com
# On Windows:
tracert google.com
You'll see each hop your packets take, with timing for each. This is invaluable for debugging network issues.
What Can Go Wrong at the IP Layer
Here's the crucial thing: IP provides no guarantees. It's "best effort" delivery.
Packets can be lost: A router gets overloaded and drops packets. A cable gets damaged. Radio interference corrupts a WiFi packet. The packet just... disappears.
Packets can arrive out of order: Packet 3 might take a different route than packets 1 and 2, and arrive first.
Packets can be duplicated: A router isn't sure if it forwarded a packet successfully, so it sends it twice.
Packets can be corrupted: A cosmic ray flips a bit. Electrical interference on a wire changes a 1 to a 0. (Checksums catch most of these, but not all.)
Packets can be delayed: Traffic jam at a router. Your packet waits in a queue.
This is why distributed systems are hard. The network will fail. The question is how you handle it.
A.3: TCP: Making Communication Reliable
The Problem
IP gives us the ability to send packets between machines, but with no guarantees. For many applications, that's not good enough:
- If you're loading a webpage, you need ALL the HTML, in order
- If you're downloading a file, a single missing byte corrupts the whole thing
- If you're sending a bank transfer, you really need to know it went through
TCP: Transmission Control Protocol
TCP is a protocol built on top of IP. IP handles getting individual packets to the right machine. TCP handles making that communication reliable and ordered.
TCP provides:
- Reliable delivery: If a packet is lost, TCP automatically resends it
- Ordering: TCP reassembles packets in the correct order, even if they arrive out of order
- Flow control: TCP slows down if the receiver can't keep up
- Congestion control: TCP slows down if the network is overloaded
How TCP Knows When Packets Are Lost: Acknowledgments
The key mechanism is acknowledgments (ACKs). When the receiver gets data, it sends back a message saying "I got it."
Sender Receiver
│ │
│──── Data: bytes 0-999 ────────>│
│ │ Receiver got it!
│<─────────────── ACK 1000 ──────│ "I've received up to byte 999,
│ │ send byte 1000 next"
│ │
│──── Data: bytes 1000-1999 ────>│ (this packet gets lost!)
│ │
│ ... sender waits ... │
│ ... no ACK comes ... │
│ ... timeout! ... │
│ │
│──── Data: bytes 1000-1999 ────>│ (resend!)
│ │
│<─────────────── ACK 2000 ──────│ "Got it, send byte 2000 next"
Key insight: TCP uses timeouts to detect loss. If no acknowledgment arrives within a certain time, TCP assumes the packet was lost and resends it. This is why "timeout" is such an important concept in distributed systems.
Sequence Numbers: Handling Ordering and Duplicates
Each byte of data in a TCP connection has a sequence number. This solves two problems:
Ordering: If packet with bytes 2000-2999 arrives before packet with bytes 1000-1999, TCP holds onto the later packet and waits for the earlier one. It delivers data to your application in order.
Duplicates: If the network accidentally duplicates a packet, TCP sees "I already have bytes 1000-1999" and discards the duplicate.
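Sequence numbers make reordering and deduplication mechanical. Here's a toy receiver-side reassembly buffer that captures the idea; it's a sketch only, not TCP's actual algorithm (which also handles overlapping segments, windows, and much more):
use std::collections::BTreeMap;
/// Toy reassembly buffer: delivers bytes in order, discards duplicates.
struct Reassembler {
    next_seq: u64,                    // next byte offset we can deliver
    buffered: BTreeMap<u64, Vec<u8>>, // out-of-order segments, keyed by offset
}
impl Reassembler {
    fn new() -> Self {
        Reassembler { next_seq: 0, buffered: BTreeMap::new() }
    }
    /// Accept a segment starting at `seq`; return bytes now deliverable in order.
    fn receive(&mut self, seq: u64, data: Vec<u8>) -> Vec<u8> {
        if seq < self.next_seq {
            return Vec::new(); // duplicate of data we've already delivered
        }
        self.buffered.entry(seq).or_insert(data);
        let mut deliverable = Vec::new();
        // Drain segments that are now contiguous with delivered data
        while let Some(chunk) = self.buffered.remove(&self.next_seq) {
            self.next_seq += chunk.len() as u64;
            deliverable.extend(chunk);
        }
        deliverable
    }
}
fn main() {
    let mut r = Reassembler::new();
    let out = r.receive(5, b"world".to_vec()); // arrives early: buffered
    println!("{:?}", String::from_utf8_lossy(&out)); // ""
    let out = r.receive(0, b"hello".to_vec()); // fills the gap
    println!("{:?}", String::from_utf8_lossy(&out)); // "helloworld"
    let out = r.receive(0, b"hello".to_vec()); // duplicate: discarded
    println!("{:?}", String::from_utf8_lossy(&out)); // ""
}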
TCP Connections and the Three-Way Handshake
Unlike IP (which just fires packets into the void), TCP establishes a connection before sending data. Both sides agree: "we're now in a conversation."
This happens through the three-way handshake:
Client Server
│ │
│──── SYN ─────────────────────────>│
│ "I want to connect. │
│ My starting sequence │
│ number is 1000." │
│ │
│<─────────────────────── SYN-ACK ──│
│ "OK, I accept. │
│ My starting sequence │
│ number is 5000. │
│ I acknowledge your 1000." │
│ │
│──── ACK ─────────────────────────>│
│ "Great, I acknowledge │
│ your 5000." │
│ │
│ === Connection Established === │
│ │
Why three messages?
- SYN (synchronize): Client says "I want to connect" and shares its initial sequence number
- SYN-ACK: Server says "I accept" and shares its sequence number, while acknowledging the client's
- ACK: Client confirms it received the server's info
After this, both sides know the connection is established and have agreed on starting sequence numbers.
Important: A "connection" is a logical concept, not physical. There's no dedicated wire between you and Google. A connection is just both computers agreeing "we're having a conversation" and keeping track of state (sequence numbers, what's been acknowledged, etc.).
Ports: Multiple Connections Per Machine
An IP address identifies a machine. But a machine runs many programs: a web server, a database, an SSH daemon. How does the machine know which program should receive which packets?
Ports. A port is a 16-bit number (0-65535) that identifies a specific application or service.
A full network address is IP:port, like 142.250.80.46:443.
Well-known ports (you'll see these constantly):
| Port | Service | Description |
|---|---|---|
| 22 | SSH | Secure shell (remote login) |
| 80 | HTTP | Web traffic (unencrypted) |
| 443 | HTTPS | Web traffic (encrypted) |
| 5432 | PostgreSQL | Postgres database |
| 6379 | Redis | Redis cache/database |
| 27017 | MongoDB | MongoDB database |
When you visit https://google.com, your browser connects to Google's IP on port 443.
But what about your side? Your computer also needs a port for the connection. The operating system assigns a random high port (like 54321) for outgoing connections. So a full connection looks like:
Your machine (192.168.1.100:54321) <═══> Google (142.250.80.46:443)
This is called a socket: the combination of IP address + port on each side uniquely identifies a connection.
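In Rust, an IP:port pair is a std::net::SocketAddr, which parses directly from a string. A quick sketch:
use std::net::SocketAddr;
fn main() {
    let addr: SocketAddr = "142.250.80.46:443".parse().expect("invalid address");
    println!("IP: {}, port: {}", addr.ip(), addr.port());
    // IPv6 addresses need brackets around the IP part
    let v6: SocketAddr = "[2001:db8::1]:8080".parse().expect("invalid address");
    println!("IP: {}, port: {}", v6.ip(), v6.port());
}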
What TCP Still Can't Solve
TCP makes communication reliable, but it can't work miracles:
If the other machine crashes, TCP can't magically resurrect it. Eventually TCP will timeout and report an error.
If the network is partitioned (all routes between two machines are broken), no amount of retrying will help.
TCP doesn't guarantee speed. It guarantees delivery, but if the network is slow, your data is slow.
Connections can be killed. If a connection is idle for too long, intermediate routers or firewalls might close it.
A.4: UDP: When You Don't Want TCP's Guarantees
Sometimes Reliability Isn't Worth It
TCP's reliability comes at a cost:
- Latency: If a packet is lost, TCP waits for a timeout, then resends. This adds delay.
- Head-of-line blocking: TCP delivers data in order. If packet 1 is lost but packets 2, 3, 4 arrive, TCP holds them until packet 1 is resent and arrives. Your application can't see packets 2-4 yet.
- Connection overhead: The three-way handshake takes time.
For some applications, these tradeoffs don't make sense:
Video calls: If a video frame is lost, you don't want to pause the call and wait. Just skip that frame and keep going. A brief glitch is better than freezing.
Real-time games: You want the player's current position, not their position from 500ms ago. If a packet is lost, just use the next one.
DNS lookups: The query and response are each a single small packet. If it doesn't arrive, just send another query. No need for connection setup overhead.
UDP: User Datagram Protocol
UDP is another protocol built on IP, but much simpler than TCP.
UDP provides:
- Ports (so multiple applications can share one IP)
- Checksums (so you can detect corrupted data)
UDP does NOT provide:
- Reliability (packets might be lost)
- Ordering (packets might arrive out of order)
- Duplicate detection (you might receive the same packet twice)
- Flow control (sender can overwhelm receiver)
Mental model:
- TCP is like a phone call: you establish a connection, have a back-and-forth conversation, both sides know the connection exists
- UDP is like mailing postcards: you write, drop in mailbox, hope it arrives, no confirmation
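Rust's std::net::UdpSocket makes the postcard model concrete: no handshake, no connection, just individual datagrams. A minimal sketch with both sides on localhost (port 9999 here is an arbitrary choice):
use std::net::UdpSocket;
fn main() -> std::io::Result<()> {
    // "Server": bind to a port and wait for a datagram
    let receiver = UdpSocket::bind("127.0.0.1:9999")?;
    // "Client": bind to any free port (0 means the OS picks), then send
    let sender = UdpSocket::bind("127.0.0.1:0")?;
    sender.send_to(b"hello via UDP", "127.0.0.1:9999")?;
    // No handshake, no ACK, no delivery guarantee: fire and forget
    let mut buf = [0u8; 1024];
    let (n, from) = receiver.recv_from(&mut buf)?;
    println!("Got {} bytes from {}: {}", n, from, String::from_utf8_lossy(&buf[..n]));
    Ok(())
}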
When to Use Each
| Use TCP when... | Use UDP when... |
|---|---|
| Data must be complete and correct | Missing data is acceptable |
| Order matters | Latest data is more important than complete data |
| You're building request/response patterns | You need lowest possible latency |
| Examples: HTTP, databases, file transfer | Examples: video streaming, games, DNS |
Most distributed systems you'll build use TCP. But understanding UDP helps you appreciate what TCP does for you.
A.5: DNS: Translating Names to Addresses
The Problem
Humans like names: google.com, github.com, api.mycompany.io
Computers need numbers: 142.250.80.46
DNS (Domain Name System) translates names to IP addresses. It's like the internet's phone book.
How DNS Works
When you type github.com in your browser:
1. Browser: "What's the IP for github.com?"
│
▼
2. Check local cache on your machine
Found and not expired? → Use it
Not found? → Ask DNS resolver
│
▼
3. DNS Resolver (your ISP, or 8.8.8.8 for Google, 1.1.1.1 for Cloudflare)
Has it cached? → Return it
Not cached? → Ask authoritative servers
│
▼
4. Ask Root servers: "Who handles .com?"
│
▼
5. Ask .com TLD servers: "Who handles github.com?"
│
▼
6. Ask GitHub's nameservers: "What's the IP for github.com?"
│
▼
7. Answer: "github.com = 140.82.112.4"
│
▼
8. Response flows back up, cached at each level
│
▼
9. Browser connects to 140.82.112.4
DNS Is Itself a Distributed System!
Here's something beautiful: DNS is one of the oldest and most successful distributed systems in existence.
- Hierarchical: Root servers → TLD servers (.com, .org, .io) → domain nameservers
- Replicated: There are 13 root server "identities," but each is actually many physical servers (over 1,000 total)
- Cached: Results are cached at every level with TTL (time-to-live) values
- Eventually consistent: When you change a DNS record, it can take hours to propagate worldwide
Fun fact: When DNS has problems, the internet appears broken even though all the underlying servers are fine. You can't reach google.com by name, even though 142.250.80.46 is perfectly reachable.
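You can trigger a DNS lookup yourself from Rust's standard library: the ToSocketAddrs trait asks the OS resolver (and DNS behind it) to translate a name. A minimal sketch:
use std::net::ToSocketAddrs;
fn main() -> std::io::Result<()> {
    // A port is required because the API returns full socket addresses
    for addr in "github.com:443".to_socket_addrs()? {
        println!("github.com resolves to {}", addr.ip());
    }
    Ok(())
}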
What Can Go Wrong with DNS
- DNS server failure: That's why there are many servers with replication
- Slow lookups: Usually cached, but first lookup for a domain takes time
- DNS spoofing/poisoning: Attacker gives you wrong IP address
- Propagation delays: DNS changes can take hours to spread globally
- TTL expiration: Cached entries expire; then you need a fresh lookup
A.6: Sockets: How Your Code Talks to the Network
Now we get to code. Everything above is handled by the operating system. Your code interacts with the network through sockets.
What Is a Socket?
A socket is an endpoint for communication. It's the interface between your application and the TCP/IP stack in the operating system.
Think of it like this:
- The OS handles all the complexity of packets, TCP, routing
- Your application just reads and writes bytes to a socket
- The socket is the "portal" between your code and the network
The Server/Client Pattern
Most network programs follow a pattern:
Server:
- Create a socket
- Bind it to an address (IP + port): "I will listen on port 8080"
- Listen for incoming connections
- Accept a connection when a client connects
- Read/write data
- Close the connection
Client:
- Create a socket
- Connect to the server's address
- Read/write data
- Close the connection
Let's implement both in Rust, explaining every line.
A TCP Server in Rust, Explained Line by Line
// === Imports ===
// std::net contains networking primitives
// TcpListener: for accepting incoming connections (server-side)
// TcpStream: a single TCP connection (used by both client and server)
use std::net::{TcpListener, TcpStream};
// std::io contains input/output traits and types
// Read: trait that provides the `read` method
// Write: trait that provides `write` and `write_all` methods
use std::io::{Read, Write};
// === Helper Function to Handle One Connection ===
// This function takes ownership of a TcpStream (one client connection)
// and returns a Result (Ok if everything worked, Err if something failed)
fn handle_client(mut stream: TcpStream) -> std::io::Result<()> {
// peer_addr() returns the remote (client's) address
// The ? operator: if this returns an Err, immediately return that Err
// If it returns Ok(addr), unwrap the addr and continue
let peer_addr = stream.peer_addr()?;
println!("Handling connection from {}", peer_addr);
// Create a buffer to store incoming data
// [0u8; 1024] means: an array of 1024 bytes, all initialized to 0
// We need `mut` because `read` will modify this buffer
let mut buffer = [0u8; 1024];
// Read data from the connection into our buffer
// read() returns how many bytes were actually read
// This BLOCKS until data arrives or the connection closes
let bytes_read = stream.read(&mut buffer)?;
// What if the client closed the connection without sending anything?
if bytes_read == 0 {
println!("Client closed connection without sending data");
return Ok(());
}
// Convert the bytes we received to a string for display
// &buffer[..bytes_read] is a slice of just the bytes that were filled in
// from_utf8_lossy: if there are invalid UTF-8 bytes, replace them with �
// (We use this because network data might not be valid UTF-8)
let received = String::from_utf8_lossy(&buffer[..bytes_read]);
println!("Received {} bytes: {}", bytes_read, received);
// Create a response
let response = format!("Echo: {}", received);
// Send the response back
// write_all() sends ALL the bytes, retrying if necessary
// (plain write() might only send some bytes if the buffer is full)
// as_bytes() converts the String to &[u8] (a byte slice)
stream.write_all(response.as_bytes())?;
println!("Sent response to {}", peer_addr);
// Connection is automatically closed when `stream` goes out of scope
// (Rust's Drop trait handles this)
Ok(())
}
// === Main Function ===
fn main() -> std::io::Result<()> {
// === Step 1: Create and bind a listener socket ===
// TcpListener::bind() does several things:
// 1. Creates a socket
// 2. Binds it to the specified address (IP:port)
// 3. Puts it in listening mode
//
// "127.0.0.1:8888" means:
// - 127.0.0.1 = localhost (only accept connections from this machine)
// - 8888 = the port number
//
// If you used "0.0.0.0:8888", it would accept connections from any IP
// (useful for servers that should be reachable from other machines)
let listener = TcpListener::bind("127.0.0.1:8888")?;
println!("Server listening on 127.0.0.1:8888");
println!("Connect with: nc localhost 8888");
println!("Or run the client program");
println!();
// === Step 2: Accept connections in a loop ===
// listener.incoming() returns an iterator of connection attempts
// Each item is a Result<TcpStream, Error>
for stream_result in listener.incoming() {
// Handle the Result - did the connection attempt succeed?
match stream_result {
Ok(stream) => {
// We got a connection!
// Note: This is a simple single-threaded server.
// While we're handling this client, other clients wait.
// A real server would spawn a thread or use async.
if let Err(e) = handle_client(stream) {
// Log the error but keep serving other clients
eprintln!("Error handling client: {}", e);
}
}
Err(e) => {
// The connection attempt itself failed
// (Different from errors while handling a connection)
eprintln!("Failed to accept connection: {}", e);
}
}
}
Ok(())
}
Key Concepts in the Server
Binding: When we call bind("127.0.0.1:8888"), we're telling the OS: "I want to receive TCP connections sent to this IP address on port 8888." If another program is already using that port, binding fails.
Listening: After binding, the socket is in listening mode, ready to receive connections.
Accepting: incoming() blocks until a client connects, then returns a TcpStream representing that connection. The original listener keeps listening for more connections.
Blocking I/O: When we call stream.read(), our thread stops and waits until data arrives. This is called "blocking." It's simple but means we can only handle one client at a time (in this code). Real servers use threads or async to handle many clients concurrently.
A TCP Client in Rust, Explained Line by Line
use std::net::TcpStream;
use std::io::{Read, Write};
fn main() -> std::io::Result<()> {
// === Step 1: Connect to the server ===
// TcpStream::connect() does the three-way handshake:
// 1. Sends SYN to server
// 2. Receives SYN-ACK from server
// 3. Sends ACK to server
//
// If successful, returns a TcpStream representing the connection.
// If the server isn't running, or the port is wrong, this fails.
//
// Common errors:
// - ConnectionRefused: server isn't listening on that port
// - TimedOut: server didn't respond (maybe wrong IP, or firewall)
println!("Connecting to server...");
let mut stream = TcpStream::connect("127.0.0.1:8888")?;
// local_addr() returns OUR side of the connection
// peer_addr() returns the SERVER's side
println!("Connected!");
println!(" Local address: {}", stream.local_addr()?);
println!(" Server address: {}", stream.peer_addr()?);
// === Step 2: Send data to the server ===
let message = "Hello from Rust client!";
println!("\nSending: {}", message);
// write_all() ensures all bytes are sent
stream.write_all(message.as_bytes())?;
// === Step 3: Receive response from server ===
let mut buffer = [0u8; 1024];
let bytes_read = stream.read(&mut buffer)?;
if bytes_read == 0 {
println!("Server closed connection without responding");
} else {
let response = String::from_utf8_lossy(&buffer[..bytes_read]);
println!("Received: {}", response);
}
// Connection is closed when `stream` goes out of scope
println!("\nConnection closed.");
Ok(())
}
Running the Example
Create a new Rust project:
cargo new tcp_demo
cd tcp_demo
Replace src/main.rs with the server code. Create src/bin/client.rs with the client code.
Update Cargo.toml:
[package]
name = "tcp_demo"
version = "0.1.0"
edition = "2021"
[[bin]]
name = "server"
path = "src/main.rs"
[[bin]]
name = "client"
path = "src/bin/client.rs"
Run the server:
cargo run --bin server
In another terminal, run the client:
cargo run --bin client
You should see the message sent and echoed back!
A.7: Timeouts and Error Handling
The Fundamental Problem with Networks
When you send a request over a network and don't get a response, what happened?
- The request never arrived: Network problem on the way there
- The request arrived but the server crashed before processing: Server died
- The server processed it but crashed before responding: Work was done but you don't know!
- The response was sent but got lost: Network problem on the way back
- Everything is fine, just slow: Network congestion, server overloaded
Here's the crucial point: From the client's perspective, all of these look the same. You sent a request. You're waiting. No response yet. Is the server dead? Is it slow? Did your message get lost? You can't tell.
This is the core challenge of distributed systems.
Timeouts: Giving Up After a While
A timeout says: "If I don't hear back within X seconds, assume something went wrong and give up."
use std::net::TcpStream;
use std::time::Duration;
use std::io::{Read, Write};
fn main() -> std::io::Result<()> {
// Connect with a timeout
// If the server doesn't respond to our SYN within 5 seconds, give up
let address = "127.0.0.1:8888".parse().unwrap();
println!("Attempting to connect (5 second timeout)...");
let mut stream = TcpStream::connect_timeout(
&address,
Duration::from_secs(5)
)?;
// Set timeouts for reading and writing
// These apply to EVERY read/write operation on this stream
stream.set_read_timeout(Some(Duration::from_secs(5)))?;
stream.set_write_timeout(Some(Duration::from_secs(5)))?;
println!("Connected! Sending request...");
stream.write_all(b"Hello?")?;
// Now read the response
// If no data arrives within 5 seconds, read() will return an error
let mut buffer = [0u8; 1024];
match stream.read(&mut buffer) {
Ok(0) => {
println!("Server closed connection without responding");
}
Ok(n) => {
println!("Received: {}", String::from_utf8_lossy(&buffer[..n]));
}
Err(e) => {
// Check what kind of error
if e.kind() == std::io::ErrorKind::WouldBlock
|| e.kind() == std::io::ErrorKind::TimedOut {
println!("TIMEOUT: No response within 5 seconds");
} else {
println!("Error: {}", e);
}
}
}
Ok(())
}
The Timeout Dilemma
How long should the timeout be?
Too short:
- You give up on servers that are slow but healthy
- You might retry unnecessarily, creating more load
- Users see errors when things would have worked
Too long:
- Users wait forever for dead servers
- Resources (memory, connections) are tied up waiting
- System throughput drops because threads are blocked waiting
There's no perfect answer. It depends on:
- What's normal latency for this operation?
- How bad is a false positive (giving up when we shouldn't)?
- How bad is waiting too long?
Typical timeouts:
- Database queries: 1-30 seconds
- HTTP APIs: 5-30 seconds
- Health checks: 1-5 seconds
- Internal microservices: 100ms-5 seconds
Types of Network Errors
Let's understand the different errors you'll encounter:
use std::net::TcpStream;
use std::io;
use std::time::Duration;
fn demonstrate_errors() {
// === Connection Refused ===
// Server exists but nothing is listening on that port
// You'll see this if you run the client without starting the server
match TcpStream::connect("127.0.0.1:9999") {
Err(e) if e.kind() == io::ErrorKind::ConnectionRefused => {
println!("Connection refused: nothing listening on that port");
}
_ => {}
}
// === Connection Timeout ===
// Can't reach the server at all (wrong IP, firewall, server down)
// Using a non-routable IP to demonstrate
let addr = "10.255.255.1:80".parse().unwrap();
match TcpStream::connect_timeout(&addr, Duration::from_secs(2)) {
Err(e) if e.kind() == io::ErrorKind::TimedOut => {
println!("Connection timeout: couldn't reach server");
}
_ => {}
}
// === Read Timeout ===
// Connected, but server isn't sending data
// (Need a server that accepts but doesn't respond to demonstrate)
// === Connection Reset ===
// Server abruptly closed the connection
// Usually means the server crashed or explicitly killed the connection
}

fn main() {
    demonstrate_errors();
}
Common Error Kinds
| ErrorKind | What it means | Common causes |
|---|---|---|
| ConnectionRefused | Server rejected connection | Nothing listening on port, firewall |
| TimedOut | Operation took too long | Server overloaded, network problem |
| ConnectionReset | Connection forcibly closed | Server crashed, server rejected request |
| BrokenPipe | Writing to a closed connection | Server closed while you were sending |
| NotConnected | Socket isn't connected | Forgot to connect, or already disconnected |
| WouldBlock | Non-blocking operation can't complete | Used with non-blocking I/O |
Retries with Exponential Backoff
When something fails, a natural instinct is to try again. But naive retries can make problems worse.
Scenario: Server is overloaded, responding slowly. 100 clients timeout and all immediately retry. Now the server has 200 requests to handle. More timeouts. More retries. Server dies.
Solution: Exponential backoff. Wait longer between each retry.
use std::net::TcpStream;
use std::io::{Read, Write};
use std::time::Duration;
use std::thread;
/// Attempts an operation with retries and exponential backoff.
///
/// # Arguments
/// * `max_retries` - How many times to try before giving up
/// * `operation` - A closure that attempts the operation
///
/// # Returns
/// The result of the first successful attempt, or the last error
fn with_retry<T, E, F>(max_retries: u32, mut operation: F) -> Result<T, E>
where
F: FnMut() -> Result<T, E>,
E: std::fmt::Display,
{
let mut last_error = None;
for attempt in 0..max_retries {
if attempt > 0 {
            // Exponential backoff: 1s, 2s, 4s, 8s, ...
            // `1 << (attempt - 1)` is 2^(attempt-1) (bit shifting)
            let wait_secs = 1 << (attempt - 1);
println!(" Waiting {} seconds before retry...", wait_secs);
thread::sleep(Duration::from_secs(wait_secs));
}
println!("Attempt {} of {}...", attempt + 1, max_retries);
match operation() {
Ok(result) => {
println!(" Success!");
return Ok(result);
}
Err(e) => {
println!(" Failed: {}", e);
last_error = Some(e);
}
}
}
Err(last_error.expect("Should have at least one error"))
}
fn make_request() -> std::io::Result<String> {
let mut stream = TcpStream::connect_timeout(
&"127.0.0.1:8888".parse().unwrap(),
Duration::from_secs(2),
)?;
stream.set_read_timeout(Some(Duration::from_secs(2)))?;
stream.write_all(b"Hello")?;
let mut buffer = [0u8; 1024];
let n = stream.read(&mut buffer)?;
Ok(String::from_utf8_lossy(&buffer[..n]).to_string())
}
fn main() {
match with_retry(4, make_request) {
Ok(response) => println!("\nFinal result: {}", response),
Err(e) => println!("\nAll retries failed: {}", e),
}
}
Idempotency: A Critical Concept
Question: If a request times out, should you retry?
It depends on whether the operation is idempotent.
Idempotent means: doing it multiple times has the same effect as doing it once.
| Operation | Idempotent? | Safe to retry? |
|---|---|---|
| GET a webpage | Yes | ✓ |
| Set user's email to "alice@example.com" | Yes | ✓ |
| Delete record with ID 123 | Yes | ✓ (if not exists, that's fine) |
| Add $100 to account | NO | ✗ (might add $200!) |
| Create new order | NO | ✗ (might create duplicate!) |
| Increment counter | NO | ✗ |
Why this matters: Remember, you don't know if your request succeeded. Maybe the server processed it and the response got lost. If you retry a non-idempotent operation, you might do it twice!
Making operations idempotent:
- Use unique IDs: "Create order with ID abc123": if it already exists, just return it
- Use conditional updates: "Set counter to 5 if current value is 4"
- Use request IDs: Server tracks which requests it's seen and deduplicates (sketched below)
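Here's a minimal sketch of that last approach: the server remembers results by request ID, so a retried request returns the original answer instead of repeating the work. (A real implementation would also expire old entries and persist them across restarts.)
use std::collections::HashMap;
/// Toy server-side state: a balance plus a memo of already-processed requests.
struct Account {
    balance: i64,
    seen: HashMap<String, i64>, // request_id -> balance after that request
}
impl Account {
    /// Add money, deduplicating on request_id so retries are safe.
    fn deposit(&mut self, request_id: &str, amount: i64) -> i64 {
        if let Some(&result) = self.seen.get(request_id) {
            return result; // already processed: return the same answer, don't redo it
        }
        self.balance += amount;
        self.seen.insert(request_id.to_string(), self.balance);
        self.balance
    }
}
fn main() {
    let mut account = Account { balance: 0, seen: HashMap::new() };
    println!("{}", account.deposit("req-abc123", 100)); // 100
    // A retry of the same request (say, after a timeout) is now harmless:
    println!("{}", account.deposit("req-abc123", 100)); // still 100, not 200
}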
A.8: HTTP: The Protocol of the Web
What is HTTP?
HTTP (HyperText Transfer Protocol) is an application-layer protocol built on TCP. It's how web browsers talk to web servers, how APIs communicate, and probably the most common protocol you'll work with.
The Request-Response Pattern
HTTP follows a simple pattern:
- Client sends a request (method, path, headers, optional body)
- Server sends a response (status code, headers, optional body)
Client Server
│ │
│──── HTTP Request ─────────────────>
│ GET /users/42 HTTP/1.1 │
│ Host: api.example.com │
│ Accept: application/json │
│ │
│<───────────────── HTTP Response ──│
│ HTTP/1.1 200 OK │
│ Content-Type: application/json│
│ {"id": 42, "name": "Alice"} │
│ │
Anatomy of an HTTP Request
GET /search?q=rust HTTP/1.1 ← Request line: METHOD PATH VERSION
Host: www.google.com ← Headers start
User-Agent: Mozilla/5.0
Accept: text/html
Accept-Language: en-US
← Empty line marks end of headers
← Body would go here (for POST/PUT)
Request line components:
- Method: What you want to do (GET, POST, PUT, DELETE, etc.)
- Path: What resource you're accessing
- Version: HTTP protocol version
Common HTTP methods:
| Method | Meaning | Has body? | Idempotent? |
|---|---|---|---|
| GET | Retrieve data | No | Yes |
| POST | Create new data | Yes | No |
| PUT | Replace data entirely | Yes | Yes |
| PATCH | Update data partially | Yes | Not always |
| DELETE | Remove data | No | Yes |
| HEAD | Like GET but no body | No | Yes |
Anatomy of an HTTP Response
HTTP/1.1 200 OK ← Status line: VERSION CODE MESSAGE
Content-Type: application/json ← Headers start
Content-Length: 42
Date: Mon, 01 Jan 2024 12:00:00 GMT
← Empty line marks end of headers
{"id": 42, "name": "Alice"} ← Body
Common status codes:
| Code | Meaning | When you'll see it |
|---|---|---|
| 200 | OK | Request succeeded |
| 201 | Created | POST succeeded, resource created |
| 204 | No Content | Success, but no body to return |
| 301 | Moved Permanently | Resource has a new URL (redirect) |
| 302 | Found | Temporary redirect |
| 400 | Bad Request | Your request was malformed |
| 401 | Unauthorized | Need to authenticate |
| 403 | Forbidden | Authenticated but not allowed |
| 404 | Not Found | Resource doesn't exist |
| 429 | Too Many Requests | Rate limited |
| 500 | Internal Server Error | Server crashed |
| 502 | Bad Gateway | Proxy/load balancer can't reach backend |
| 503 | Service Unavailable | Server overloaded or down |
| 504 | Gateway Timeout | Proxy/load balancer timeout |
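Before reaching for a library, it's worth seeing once that an HTTP request is just text written over a TCP connection. A minimal hand-rolled sketch (fine for learning; use a real client, like the one below, for anything serious):
use std::io::{Read, Write};
use std::net::TcpStream;
fn main() -> std::io::Result<()> {
    let mut stream = TcpStream::connect("example.com:80")?;
    // Request line + headers, each ending in \r\n; a blank line ends the headers.
    // "Connection: close" tells the server to close when done, so read_to_end finishes.
    stream.write_all(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")?;
    let mut response = Vec::new();
    stream.read_to_end(&mut response)?;
    let text = String::from_utf8_lossy(&response);
    // The first line is the status line, e.g. "HTTP/1.1 200 OK"
    println!("{}", text.lines().next().unwrap_or(""));
    Ok(())
}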
Making HTTP Requests in Rust
For HTTP in Rust, we'll use the reqwest crate. It handles all the complexity (TLS, connection pooling, etc.).
# Cargo.toml
[dependencies]
reqwest = { version = "0.11", features = ["blocking", "json"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
use serde::Deserialize;
// Define a struct matching the JSON we expect
// `Deserialize` is a serde trait that lets us parse JSON into this struct
#[derive(Debug, Deserialize)]
struct User {
login: String,
id: u64,
name: Option<String>, // Option because this field might be null
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
// === Simple GET request ===
println!("Fetching user from GitHub API...\n");
    // The `blocking` module provides synchronous (blocking) HTTP
    // There's also async reqwest for production use
    //
    // Note: GitHub's API requires a User-Agent header and may return
    // 403 without one; the Client example below sets one explicitly.
    let response = reqwest::blocking::get(
        "https://api.github.com/users/octocat"
    )?;
// Check the status
println!("Status: {}", response.status());
println!("Headers:");
for (name, value) in response.headers() {
println!(" {}: {:?}", name, value);
}
    // Parse JSON body into our struct
    // The `: User` annotation tells serde what type to deserialize into
    // (equivalently, response.json::<User>() with "turbofish" syntax)
let user: User = response.json()?;
println!("\nParsed user:");
println!(" Login: {}", user.login);
println!(" ID: {}", user.id);
println!(" Name: {:?}", user.name);
Ok(())
}
Building More Complex Requests
use reqwest::blocking::Client;
use reqwest::header::{AUTHORIZATION, CONTENT_TYPE, USER_AGENT};
use serde::{Deserialize, Serialize};
use std::time::Duration;
// For request body
#[derive(Serialize)]
struct CreateIssue {
title: String,
body: String,
}
// For response body
#[derive(Debug, Deserialize)]
struct Issue {
id: u64,
number: u64,
title: String,
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create a reusable client with configuration
// In production, you'd typically create ONE client and reuse it
// (it maintains a connection pool)
let client = Client::builder()
.timeout(Duration::from_secs(10)) // Request timeout
.user_agent("my-rust-app/1.0") // Identify ourselves
.build()?;
// Build a POST request
let issue = CreateIssue {
title: "Bug report".to_string(),
body: "Something is broken!".to_string(),
};
let response = client
.post("https://api.github.com/repos/owner/repo/issues")
.header(AUTHORIZATION, "Bearer your-token-here")
.header(CONTENT_TYPE, "application/json")
.json(&issue) // Serialize `issue` to JSON and set as body
.send()?;
// Handle different status codes
match response.status().as_u16() {
201 => {
let created: Issue = response.json()?;
println!("Created issue #{}: {}", created.number, created.title);
}
401 => {
println!("Authentication failed - check your token");
}
403 => {
println!("Forbidden - you don't have permission");
}
404 => {
println!("Repository not found");
}
422 => {
// Unprocessable Entity - validation error
let error: serde_json::Value = response.json()?;
println!("Validation error: {}", error);
}
code => {
println!("Unexpected status: {}", code);
println!("Body: {}", response.text()?);
}
}
Ok(())
}
PART B: DISTRIBUTED SYSTEMS CONCEPTS
The Core Theory
This is the main content. Part A was background; this is what you're here to learn.
B.1: Why Distributed Systems Are Hard (The Eight Fallacies)
Now that you understand how networks actually work, let's revisit why distributed systems are challenging.
In 1994, Peter Deutsch wrote down a list of false assumptions that developers new to distributed systems make; James Gosling later added the eighth. Together they're known as the Eight Fallacies of Distributed Computing.
Fallacy 1: The Network is Reliable
The assumption: When I send a message, it will arrive.
Reality: You now understand why this is false. Packets can be:
- Dropped by overloaded routers
- Lost due to cable damage or radio interference
- Blocked by firewalls
- Delayed indefinitely
Even with TCP, which handles retransmission, connections can fail entirely if the network is partitioned.
What this means for your code:
- Every network call can fail
- You need timeouts on everything
- You need to decide: retry? fail? use cached data?
Fallacy 2: Latency is Zero
The assumption: Network communication is instant.
Reality: The speed of light is a real constraint.
| Distance | One-way latency (approximate minimum in fiber) |
|---|---|
| Same rack | 0.0005 ms |
| Same datacenter | 0.5 ms |
| NYC to London | 28 ms |
| NYC to Tokyo | 50 ms |
And that's just physics. Real latency includes:
- Routing decisions at each hop
- Queuing delays at routers
- Processing time at the server
- TCP acknowledgment round-trips
What this means for your code:
- A function that makes 10 sequential network calls takes 10× the latency
- Batch requests when possible
- Consider where data lives relative to computation
Fallacy 3: Bandwidth is Infinite
The assumption: You can send as much data as you want.
Reality: Links have limited capacity, and you share them with everyone else.
What this means:
- Compress data before sending
- Only send what you need
- Consider moving computation to the data instead of moving data to computation
Fallacy 4: The Network is Secure
The assumption: Data sent over the network is safe.
Reality: Your packets pass through many routers you don't control. Without encryption, anyone can read or modify them.
What this means:
- Use TLS (HTTPS) for everything
- Authenticate the parties you're communicating with
- Don't trust data just because it came from "inside" your network
Fallacy 5: Topology Doesn't Change
The assumption: The network structure stays the same.
Reality: Servers come and go. Routes change. Cloud instances are ephemeral.
What this means:
- Don't hardcode IP addresses
- Use service discovery (DNS, Consul, Kubernetes services)
- Design for nodes joining and leaving
Fallacy 6: There is One Administrator
The assumption: One person or team controls the entire system.
Reality: You depend on your ISP, cloud provider, third-party APIs, DNS providers, CDNs...
What this means:
- External services will fail in ways you don't expect
- Have fallbacks for critical dependencies
- Monitor third-party services
Fallacy 7: Transport Cost is Zero
The assumption: Sending data is free.
Reality: Network calls cost:
- CPU time for serialization/deserialization
- Memory for buffers
- Actual money (cloud egress fees!)
Example: AWS charges $0.09/GB for data leaving their network. A service sending 100TB/month pays $9,000 just for egress.
Fallacy 8: The Network is Homogeneous
The assumption: All parts of the network are the same.
Reality: Different links have different speeds, reliability, and characteristics. The last mile to a mobile phone is very different from a datacenter interconnect.
A.9: Concurrency: Hard Even on One Machine
Before we get deeper into distributed systems, we need to understand concurrency. This matters because:
- Concurrency bugs are hard to find on one machine
- In distributed systems, they're even harder
- You can't use simple locks across machines
Why Concurrency Matters
A web server needs to handle thousands of connections. If each request takes 100ms, and you handle them one at a time, you can only do 10 requests/second. That's terrible.
You need to handle multiple requests concurrently.
Processes vs. Threads
Process: A running program with its own isolated memory space. Processes can't accidentally corrupt each other's memory.
Thread: A lightweight unit of execution within a process. All threads in a process share the same memory space.
┌────────────────────────────────────────────────────────┐
│ Process │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Shared Memory Space │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌──────────┐ │ │
│ │ │ Thread 1 │ │ Thread 2 │ │ Thread 3 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ Can access │ │ Can access │ │Can access│ │ │
│ │ │ any variable│ │ any variable│ │ any │ │ │
│ │ │ in shared │ │ in shared │ │ variable │ │ │
│ │ │ memory! │ │ memory! │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └──────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
Sharing memory is powerful but dangerous.
Race Conditions
A race condition occurs when the result of a program depends on the timing of uncontrollable events (like thread scheduling).
Let's see a classic example. This code is intentionally broken to demonstrate the problem:
use std::thread;
// This is DELIBERATELY BROKEN CODE to demonstrate race conditions
// DON'T DO THIS - Rust actually prevents this at compile time!
fn main() {
// Imagine we could share this between threads (we can't without synchronization)
let mut counter = 0;
// Thread 1 wants to increment counter
// Internally this is: read counter → add 1 → write counter
// Thread 2 also wants to increment counter
// Same steps: read counter → add 1 → write counter
// If both threads run truly concurrently:
//
// Thread 1 Thread 2
// -------- --------
// read counter (0)
// read counter (0)
// add 1 (result: 1)
// add 1 (result: 1)
// write counter (1)
// write counter (1)
//
// Final value: 1 (should be 2!)
// This is a race condition.
}
Both threads read 0, both add 1, both write 1. One increment is lost!
Rust's protection: Rust won't let you write this code. The compiler enforces that mutable data is either:
- Owned by exactly one thread, OR
- Protected by synchronization primitives
Mutex: Mutual Exclusion
A Mutex (mutual exclusion) is a lock that ensures only one thread can access data at a time.
use std::sync::{Arc, Mutex};
use std::thread;
fn main() {
// === Breaking down the types ===
//
// Mutex<i32>: A mutex protecting an i32
// - Can only access the i32 by "locking" the mutex
// - Lock is released automatically when it goes out of scope
//
// Arc<T>: "Atomically Reference Counted" smart pointer
// - Allows multiple ownership of T
// - When all Arcs are dropped, T is dropped
// - "Atomic" means the reference counting is thread-safe
//
// Why Arc<Mutex<T>>?
// - Mutex: provides safe mutable access from multiple threads
// - Arc: allows multiple threads to own the mutex
// - Together: multiple threads can safely share and modify data
let counter = Arc::new(Mutex::new(0));
let mut handles = vec![];
for i in 0..10 {
// Clone the Arc to get a new reference to the same Mutex
// This DOESN'T clone the underlying data, just increments the ref count
let counter_clone = Arc::clone(&counter);
// spawn() requires the closure to have 'static lifetime
// Using `move` transfers ownership of `counter_clone` into the thread
let handle = thread::spawn(move || {
// === Locking the mutex ===
//
// lock() does two things:
// 1. Blocks until we acquire the lock (waits if another thread holds it)
// 2. Returns a MutexGuard<i32> that lets us access the data
//
// unwrap() is needed because lock() returns a Result
// It can fail if a thread panicked while holding the lock ("poisoned")
let mut num = counter_clone.lock().unwrap();
// `num` is a MutexGuard<i32>, which implements DerefMut
// So `*num` gives us &mut i32
*num += 1;
println!("Thread {} incremented counter to {}", i, *num);
// When `num` goes out of scope here, the MutexGuard is dropped
// This automatically releases the lock
});
handles.push(handle);
}
// Wait for all threads to complete
for handle in handles {
handle.join().unwrap();
}
// Get the final value
// Again, we lock to access the data
println!("\nFinal counter value: {}", *counter.lock().unwrap());
}
Breaking Down the Types
Let's make sure the types are crystal clear:
use std::rc::Rc;
use std::sync::{Arc, Mutex};

fn main() {
    // Just the value - no sharing possible
    let _x: i32 = 0;
    // Mutex around the value - one thread at a time can access it
    // But we can't hand this single Mutex to multiple threads!
    let _m: Mutex<i32> = Mutex::new(0);
    // Rc (Reference Counted) - multiple ownership, but NOT thread-safe
    let _r: Rc<Mutex<i32>> = Rc::new(Mutex::new(0)); // Can't send this to threads!
    // Arc (Atomic Reference Counted) - multiple ownership, IS thread-safe
    let a: Arc<Mutex<i32>> = Arc::new(Mutex::new(0));
    // NOW we can clone `a` and send clones to different threads
    let _ = Arc::clone(&a);
}
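As an aside: for a plain counter, there's a lighter-weight tool than Arc<Mutex<i32>>. Atomic integers make the read-modify-write a single indivisible hardware operation, so no increment can be lost. A minimal sketch:
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
fn main() {
    let counter = Arc::new(AtomicU64::new(0));
    let mut handles = vec![];
    for _ in 0..10 {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..1000 {
                // fetch_add does read + add + write as one atomic step
                counter.fetch_add(1, Ordering::SeqCst);
            }
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("Final: {}", counter.load(Ordering::SeqCst)); // always 10000
}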
Deadlock
Deadlock occurs when two or more threads are each waiting for the other to release a resource.
use std::sync::{Arc, Mutex};
use std::thread;
fn main() {
let lock_a = Arc::new(Mutex::new(0));
let lock_b = Arc::new(Mutex::new(0));
let lock_a_clone = Arc::clone(&lock_a);
let lock_b_clone = Arc::clone(&lock_b);
// Thread 1: Locks A, then tries to lock B
let handle1 = thread::spawn(move || {
let _a = lock_a.lock().unwrap();
println!("Thread 1 has lock A, trying to get lock B...");
thread::sleep(std::time::Duration::from_millis(100)); // Give thread 2 time
let _b = lock_b.lock().unwrap(); // WAITS FOREVER - thread 2 has B
println!("Thread 1 has both locks");
});
// Thread 2: Locks B, then tries to lock A
let handle2 = thread::spawn(move || {
let _b = lock_b_clone.lock().unwrap();
println!("Thread 2 has lock B, trying to get lock A...");
thread::sleep(std::time::Duration::from_millis(100));
let _a = lock_a_clone.lock().unwrap(); // WAITS FOREVER - thread 1 has A
println!("Thread 2 has both locks");
});
// This program will hang forever!
// Thread 1: Holds A, waiting for B
// Thread 2: Holds B, waiting for A
handle1.join().unwrap();
handle2.join().unwrap();
}
Avoiding deadlock:
- Always acquire locks in the same order (see the sketch after this list)
- Use timeout-based lock acquisition
- Design to avoid needing multiple locks
- Use lock-free data structures when possible
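Here's the first rule applied to the deadlocking example: both threads acquire lock A before lock B, so a circular wait can never form. A minimal sketch:
use std::sync::{Arc, Mutex};
use std::thread;
fn main() {
    let lock_a = Arc::new(Mutex::new(0));
    let lock_b = Arc::new(Mutex::new(0));
    let (a1, b1) = (Arc::clone(&lock_a), Arc::clone(&lock_b));
    let handle1 = thread::spawn(move || {
        let _a = a1.lock().unwrap(); // A first...
        let _b = b1.lock().unwrap(); // ...then B
        println!("Thread 1 has both locks");
    });
    let (a2, b2) = (Arc::clone(&lock_a), Arc::clone(&lock_b));
    let handle2 = thread::spawn(move || {
        // Same order as thread 1: one thread may briefly block on A,
        // but no cycle of waits can form.
        let _a = a2.lock().unwrap();
        let _b = b2.lock().unwrap();
        println!("Thread 2 has both locks");
    });
    handle1.join().unwrap();
    handle2.join().unwrap(); // completes: no deadlock
}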
Why This Matters for Distributed Systems
In a distributed system, you can't use Mutex across machines. Each server has its own memory. There's no shared Arc<Mutex<T>>.
But the same problems exist! Imagine two database servers:
Server A reads balance: 100
Server B reads balance: 100
Server A writes balance: 50
Server B writes balance: 70 ← Server A's write is lost!
This is the same race condition, but across machines. Solving it requires distributed coordination mechanisms: consensus algorithms, distributed transactions, etc. You'll learn these in later phases.
A.10: Serialization: Turning Data into Bytes
When machines communicate, they need to agree on how to represent data as bytes.
The Problem
Your Rust program has types:
struct User {
name: String,
age: u32,
friends: Vec<String>,
}
The network only carries bytes. How do you send this struct to another machine (which might be running Python or Go)?
Serialization and Deserialization
Serialization: Converting data structures to bytes (also called encoding, marshaling)
Deserialization: Converting bytes back to data structures (also called decoding, unmarshaling)
Rust struct                        Bytes                     (Maybe) Rust struct
User { name: "Alice", ... }  →  [123, 34, 110, 97, ...]  →  User { name: "Alice", ... }
                   serialize                    deserialize
Serde: Rust's Serialization Framework
Serde is the de facto standard for serialization in Rust. It's a framework that lets you support many formats (JSON, YAML, TOML, MessagePack, etc.) with the same code.
# Cargo.toml
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0" # JSON format
# bincode = "1.3" # Binary format (faster, smaller)
# toml = "0.8" # TOML format
Basic Usage
use serde::{Serialize, Deserialize};
// === The derive macros ===
//
// #[derive(Serialize, Deserialize)] automatically implements
// the Serialize and Deserialize traits for your struct.
//
// This generates code that knows how to convert your struct
// to/from various formats (JSON, binary, etc.)
#[derive(Debug, Serialize, Deserialize)]
struct User {
name: String,
age: u32,
email: Option<String>, // Optional fields become null in JSON
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
let user = User {
name: "Alice".to_string(),
age: 30,
email: Some("alice@example.com".to_string()),
};
// === Serialize to JSON string ===
let json_string: String = serde_json::to_string(&user)?;
println!("JSON string: {}", json_string);
// Output: {"name":"Alice","age":30,"email":"alice@example.com"}
// === Serialize to "pretty" JSON (with indentation) ===
let pretty_json: String = serde_json::to_string_pretty(&user)?;
println!("\nPretty JSON:\n{}", pretty_json);
// === Serialize to bytes (Vec<u8>) ===
let json_bytes: Vec<u8> = serde_json::to_vec(&user)?;
println!("\nBytes: {:?}", &json_bytes[..20]); // First 20 bytes
// === Deserialize from JSON string ===
let json_str = r#"{"name":"Bob","age":25,"email":null}"#;
let user2: User = serde_json::from_str(json_str)?;
println!("\nDeserialized: {:?}", user2);
// === Deserialize from bytes ===
let user3: User = serde_json::from_slice(&json_bytes)?;
println!("From bytes: {:?}", user3);
Ok(())
}
Serde Attributes for Customization
use serde::{Serialize, Deserialize};
#[derive(Debug, Serialize, Deserialize)]
struct Message {
// Rename field in JSON
#[serde(rename = "messageId")]
id: u64,
// Use different name for serializing vs deserializing
#[serde(rename(serialize = "msg", deserialize = "message"))]
content: String,
// Skip this field entirely
#[serde(skip)]
internal_state: u32,
// Use default value if missing during deserialization
#[serde(default)]
retries: u32,
// Custom default value
#[serde(default = "default_priority")]
priority: u32,
}
fn default_priority() -> u32 {
5
}
Binary vs Text Formats
JSON (serde_json):
- Human-readable
- Widely supported (every language has JSON)
- Larger size
- Slower to parse
Bincode:
- Binary (not human-readable)
- Very fast
- Very compact
- Rust-specific (though spec exists)
// With bincode (add to Cargo.toml: bincode = "1.3")
use serde::{Serialize, Deserialize};
#[derive(Serialize, Deserialize, Debug)]
struct Data {
values: Vec<u64>,
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
let data = Data {
values: (0..1000).collect(),
};
// JSON serialization
let json = serde_json::to_vec(&data)?;
println!("JSON size: {} bytes", json.len());
// Bincode serialization
let binary = bincode::serialize(&data)?;
println!("Bincode size: {} bytes", binary.len());
    // Surprise: for SMALL numbers, JSON is actually more compact here,
    // because "7" is one ASCII byte while bincode spends a fixed 8 bytes per u64.
    // JSON: ~3,900 bytes; bincode: 8,008 bytes (1000 × 8 + an 8-byte length).
    // Bincode wins on parse speed, and on size once the values get large.
Ok(())
}
Why Serialization Matters for Distributed Systems
All network communication involves serialization. Every RPC call, every message to a queue, every HTTP request serializes data.
Format choice affects performance. In high-throughput systems, serialization can be a bottleneck.
Schema evolution is tricky. What happens when you add a field? Remove one? Old code needs to read new data and vice versa.
Cross-language compatibility. If you're talking to a Python service, you need a format both understand. JSON and Protocol Buffers are common choices.
B.2: Latency vs. Throughput
Two critical metrics for distributed systems.
Definitions
Latency: How long it takes to complete one operation.
- Measured in time units (ms, seconds)
- "How long does the user wait?"
Throughput: How many operations you complete in a given time.
- Measured in operations per time (requests/sec, MB/s)
- "How much work can the system do?"
The Highway Analogy
Imagine a highway between two cities, 100 miles apart. Cars travel at 50 mph.
Latency = Time for one car to travel from City A to City B = 2 hours
Throughput = Cars arriving at City B per hour
Now imagine we add more lanes:
- 1 lane: maybe 1,000 cars/hour throughput
- 4 lanes: maybe 4,000 cars/hour throughput
We quadrupled throughput! But latency is still 2 hours. More lanes don't make individual cars go faster.
Key insight: You can often improve throughput without improving latency, and vice versa.
In Distributed Systems
| Strategy | Affects latency? | Affects throughput? |
|---|---|---|
| Put server closer to user | ✓ Reduces | Minimal |
| Add more servers | Minimal | ✓ Increases |
| Add caching | ✓ Reduces (cache hits) | ✓ Increases |
| Batch operations | Usually increases | ✓ Increases |
| Use faster hardware | ✓ Reduces | ✓ Increases |
Latency-Throughput Tradeoffs
Sometimes you must choose:
Batching: Instead of processing requests one at a time, wait until you have 100 and process them together.
- Throughput improves (amortize fixed costs over many requests)
- Latency gets worse (first request in the batch waits for the batch to fill)
Replication: Write data to 3 servers before acknowledging.
- Throughput decreases (more work per request)
- Latency increases (wait for 3 servers)
- But reliability improves!
Measuring Latency: Percentiles
Don't use averages. They hide problems.
Example: 99 requests take 10ms. 1 request takes 10,000ms.
- Average: ~110ms (seems fine!)
- But 1% of users waited 10 seconds
Use percentiles instead:
- p50 (median): 50% of requests are faster than this
- p95: 95% of requests are faster than this
- p99: 99% of requests are faster than this
- p99.9: 99.9% of requests are faster than this
Why high percentiles matter:
If a user's page load requires 10 backend calls, and each has p99 = 1 second:
- Probability that ALL calls are under 1s = 0.99^10 = 90.4%
- So ~10% of users experience >1s latency on at least one call!
This is why Amazon famously targets p99.9 latency for critical services.
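Computing percentiles from a batch of recorded latencies is just a sort and an index. A minimal sketch using the nearest-rank method (real monitoring systems typically use streaming approximations rather than storing every sample):
/// Nearest-rank percentile over a sorted slice; `p` is in (0, 100].
fn percentile(sorted: &[u64], p: f64) -> u64 {
    let rank = (p / 100.0 * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}
fn main() {
    // The example above: 99 fast requests and one terrible one (milliseconds)
    let mut latencies: Vec<u64> = vec![10; 99];
    latencies.push(10_000);
    latencies.sort_unstable();
    let avg = latencies.iter().sum::<u64>() as f64 / latencies.len() as f64;
    println!("avg:   {} ms", avg);                            // 109.9 - looks fine!
    println!("p50:   {} ms", percentile(&latencies, 50.0));   // 10
    println!("p99:   {} ms", percentile(&latencies, 99.0));   // 10
    println!("p99.9: {} ms", percentile(&latencies, 99.9));   // 10000 - there's the pain
}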
B.3: Time in Distributed Systems
The Problem with Clocks
On a single computer, time is simple. There's one clock, events happen in sequence.
In a distributed system, each computer has its own clock. These clocks:
Drift: Even the best quartz clocks drift ~1 second per month. Cheap ones drift more.
Jump: When a computer synchronizes with NTP (Network Time Protocol), its clock might jump forward or even backward.
Disagree: Right now, two servers in your datacenter probably show slightly different times.
Why This Matters
Scenario: A distributed database. Two servers receive writes:
Server A (clock: 10:00:00.000): SET user.email = "alice@new.com"
Server B (clock: 10:00:00.001): SET user.email = "alice@old.com"
Which write wins? If we pick "latest timestamp wins," Server B's write wins. But what if Server B's clock is 5ms fast? Server A's write was actually later!
Result: User sets their email to "new.com", but the system stores "old.com". User is very confused.
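A minimal sketch of the failure, assuming a naive "latest timestamp wins" rule (the Write struct and the timestamps are illustrative):
#[derive(Debug)]
struct Write {
    value: &'static str,
    wall_clock_ms: u64, // the accepting server's local clock
}
// Naive last-writer-wins resolution: trust the wall clocks.
fn last_writer_wins(a: Write, b: Write) -> Write {
    if a.wall_clock_ms >= b.wall_clock_ms { a } else { b }
}
fn main() {
    // In real time, the "new.com" write happened LAST. But Server B's
    // clock runs 5 ms fast, so its stale write carries a later timestamp.
    let server_a = Write { value: "alice@new.com", wall_clock_ms: 10_000 };
    let server_b = Write { value: "alice@old.com", wall_clock_ms: 10_001 };
    println!("stored: {:?}", last_writer_wins(server_a, server_b));
    // stored: Write { value: "alice@old.com", ... } -- the wrong write wins
}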
Physical Time vs. Logical Time
Physical time: What the clock on the wall says.
- Subject to drift, jumps, disagreement
- Unreliable for ordering events across machines
Logical time: A way to order events based on causality, not wall clocks.
- If A could have influenced B, then A "happened before" B
- Based on message passing, not physics
This is what Leslie Lamport figured out in 1978.
B.4: Lamport Clocks
The Insight
Leslie Lamport realized:
We don't need to know when events happened. We just need to know the order in which they could have affected each other.
If I send you a message, my sending definitely happened before your receiving. That's causality, and it doesn't require synchronized clocks.
The "Happened-Before" Relation
Lamport defined a partial ordering called "happened-before" (written →):
Event A happened-before Event B (A → B) if:
Same process, sequential order: A and B are events in the same process, and A occurred before B.
Message send/receive: A is the sending of a message, and B is the receiving of that same message.
Transitivity: If A → B and B → C, then A → C.
If neither A → B nor B → A, the events are "concurrent" (written A || B). Neither could have influenced the other.
The Lamport Clock Algorithm
Each process maintains a logical clock: just a counter.
Rules:
- Before any event: Increment your clock.
- When sending a message: Include your current clock value.
- When receiving a message with timestamp t: Set your clock to max(your_clock, t) + 1.
Let's trace through an example:
Process P1 (clock=0) Process P2 (clock=0)
│ │
│ [Internal event] │
│ clock = 0 + 1 = 1 │
│ │
│ [Send message to P2] │
│ clock = 1 + 1 = 2 │
│ msg contains timestamp 2 │
│─────────────────────────────────>
│ │
│ [Receive message with ts=2]
│ clock = max(0, 2) + 1 = 3
│ │
│ [Internal event]
│ clock = 3 + 1 = 4
│ │
│ [Send message to P1]
│ clock = 4 + 1 = 5
│<─────────────────────────────────
│ msg contains timestamp 5
│ │
[Receive message with ts=5] │
clock = max(2, 5) + 1 = 6 │
│ │
The Guarantee
If A → B, then LC(A) < LC(B).
If A happened-before B, then A's Lamport clock is strictly less than B's.
This means: sorting events by Lamport clock gives you an ordering consistent with causality.
The Limitation
The converse is NOT true.
If LC(A) < LC(B), we CANNOT conclude A → B. The events might be concurrent.
Example:
P1: Event X with LC = 5
P2: Event Y with LC = 3
Since LC(X) = 5 > 3 = LC(Y), we can rule out X → Y. But did Y happen-before X, or are X and Y concurrent?
We can't tell from Lamport clocks alone!
B.4: Lamport Clocks (continued)
Implementation in Rust
/// A Lamport clock for logical time ordering.
///
/// # Guarantees
/// If event A happened-before event B, then A's timestamp < B's timestamp.
///
/// # Non-guarantees
/// If A's timestamp < B's timestamp, we cannot conclude A happened-before B.
/// The events might be concurrent.
#[derive(Debug, Clone)]
pub struct LamportClock {
// The current logical time
// Using u64 gives us ~584 years at 1 billion events/second
time: u64,
}
impl LamportClock {
/// Create a new clock starting at 0.
pub fn new() -> Self {
LamportClock { time: 0 }
}
/// Increment the clock before any local event.
/// Returns the new timestamp.
///
/// Call this before any local operation that you want to timestamp:
/// - Writing to a database
/// - Sending a message
/// - Any state change
pub fn tick(&mut self) -> u64 {
self.time += 1;
self.time
}
/// Get a timestamp to include in an outgoing message.
/// This increments the clock (sending is an event).
pub fn send_timestamp(&mut self) -> u64 {
self.tick()
}
/// Update the clock when receiving a message.
///
/// This merges our local time with the sender's time,
/// ensuring our clock is at least as high as the sender's.
///
/// # Arguments
/// * `received_timestamp` - The timestamp from the received message
///
/// # Returns
/// The new local timestamp after receiving
pub fn receive(&mut self, received_timestamp: u64) -> u64 {
// Take the max of our time and received time
// Then increment (receiving is an event)
self.time = self.time.max(received_timestamp) + 1;
self.time
}
/// Get current time without incrementing.
/// Useful for debugging/logging.
pub fn current_time(&self) -> u64 {
self.time
}
}
// Let's verify the behavior
fn main() {
println!("=== Lamport Clock Demo ===\n");
let mut clock_p1 = LamportClock::new();
let mut clock_p2 = LamportClock::new();
// P1 does an internal event
let t1 = clock_p1.tick();
println!("P1 internal event: timestamp = {}", t1); // 1
// P1 sends a message
let send_ts = clock_p1.send_timestamp();
println!("P1 sends message with timestamp = {}", send_ts); // 2
// P2 receives the message
let recv_ts = clock_p2.receive(send_ts);
println!("P2 receives message, new timestamp = {}", recv_ts); // 3
// P2 does an internal event
let t2 = clock_p2.tick();
println!("P2 internal event: timestamp = {}", t2); // 4
// Meanwhile, P1 does another event (hasn't heard from P2 yet)
let t3 = clock_p1.tick();
println!("P1 internal event (no sync): timestamp = {}", t3); // 3
// P2 sends a message
let send_ts_2 = clock_p2.send_timestamp();
println!("P2 sends message with timestamp = {}", send_ts_2); // 5
// P1 receives it
let recv_ts_2 = clock_p1.receive(send_ts_2);
println!("P1 receives message, new timestamp = {}", recv_ts_2); // 6
println!("\n=== Observations ===");
println!("Notice how P1's clock jumped from 3 to 6 after receiving P2's message.");
println!("This ensures P1's subsequent events have timestamps > P2's message.");
}
Output:
=== Lamport Clock Demo ===
P1 internal event: timestamp = 1
P1 sends message with timestamp = 2
P2 receives message, new timestamp = 3
P2 internal event: timestamp = 4
P1 internal event (no sync): timestamp = 3
P2 sends message with timestamp = 5
P1 receives message, new timestamp = 6
=== Observations ===
Notice how P1's clock jumped from 3 to 6 after receiving P2's message.
This ensures P1's subsequent events have timestamps > P2's message.
B.5: Vector Clocks
The Problem with Lamport Clocks
Lamport clocks guarantee: If A → B, then LC(A) < LC(B).
But given two timestamps, we can't determine if they're causally related or concurrent.
Example:
Event X has timestamp 5
Event Y has timestamp 7
Possibilities:
- X → Y (X happened before Y)
- Y → X (Y happened before X)... wait, that's impossible if LC(X) < LC(Y)
- X || Y (concurrent)
Actually, we know Y didn't happen before X (since 7 > 5).
But we can't distinguish X → Y from X || Y!
This matters for conflict detection. If two writes are concurrent, we have a conflict. If one happened-before the other, the later one wins.
Vector Clocks: Capturing Full Causality
Idea: Instead of one counter, each process maintains a vector of counters: one for each process in the system.
If there are N processes, each maintains: [clock_for_P0, clock_for_P1, ..., clock_for_P(N-1)]
Rules:
- Initialize: All zeros: [0, 0, ..., 0]
- Before any event: Increment YOUR entry: VC[my_id] += 1
- When sending: Include the entire vector.
- When receiving vector V:
  - For each entry, take the max: VC[i] = max(VC[i], V[i])
  - Then increment your entry: VC[my_id] += 1
Let's trace through:
Process P0: VC=[0,0,0] Process P1: VC=[0,0,0] Process P2: VC=[0,0,0]
│ │ │
[Event A] │ │
VC=[1,0,0] │ │
│ │ │
[Send to P1]─────────────────────>│ │
VC=[2,0,0] │ │
│ [Receive] │
│ VC = max([0,0,0], [2,0,0]) │
│ = [2,0,0] │
│ Then VC[1] += 1 │
│ VC = [2,1,0] │
│ │ │
│ [Event B] │
│ VC=[2,2,0] │
│ │ │
│ [Send to P2]───────────────────>│
│ VC=[2,3,0] │
│ │ [Receive]
│ │ VC=[2,3,1]
│ │ │
[Event C] │ [Event D]
VC=[3,0,0] │ VC=[2,3,2]
Comparing Vector Clocks
Given two vectors V and W, we can determine their causal relationship:
V = W : All entries are equal → Same logical time
V < W : All entries of V ≤ corresponding entries of W,
AND at least one is strictly less
→ V happened-before W
V > W : All entries of V ≥ corresponding entries of W,
AND at least one is strictly greater
→ W happened-before V
Otherwise : Some entries of V < W, some entries of V > W
→ V and W are CONCURRENT (neither happened-before the other)
Examples:
[1,2,0] vs [1,3,1]
1≤1 ✓ 2≤3 ✓ 0≤1 ✓
At least one strictly less? Yes (2<3 and 0<1)
→ [1,2,0] < [1,3,1] (first happened-before second)
[2,0,0] vs [1,2,1]
2≤1? NO (2>1)
0≤2? Yes (and 0≤1 in the third entry)
→ Neither all ≤ nor all ≥
→ CONCURRENT!
[1,2,3] vs [1,2,3]
All equal
→ EQUAL (same logical time)
B.5: Vector Clocks (continued)
Implementation in Rust
use std::cmp::Ordering;
/// The causal relationship between two vector clock timestamps.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CausalOrdering {
/// First timestamp happened-before the second
Before,
/// Second timestamp happened-before the first
After,
/// Neither happened-before the other (concurrent)
Concurrent,
/// Timestamps are identical
Equal,
}
/// A vector clock for capturing causality in distributed systems.
///
/// Unlike Lamport clocks, vector clocks can determine whether two events
/// are causally related or concurrent.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VectorClock {
/// This node's ID (index into the vector)
node_id: usize,
/// The vector of logical times, one per node
vector: Vec<u64>,
}
impl VectorClock {
/// Create a new vector clock for a node in an N-node system.
///
/// # Arguments
/// * `node_id` - This node's ID (0 to num_nodes-1)
/// * `num_nodes` - Total number of nodes in the system
pub fn new(node_id: usize, num_nodes: usize) -> Self {
assert!(node_id < num_nodes, "node_id must be < num_nodes");
VectorClock {
node_id,
vector: vec![0; num_nodes],
}
}
/// Increment this node's entry before any local event.
/// Returns a copy of the new vector.
pub fn tick(&mut self) -> Vec<u64> {
self.vector[self.node_id] += 1;
self.vector.clone()
}
/// Get a timestamp for an outgoing message.
/// Increments the clock and returns a copy of the vector.
pub fn send_timestamp(&mut self) -> Vec<u64> {
self.tick()
}
/// Update the clock when receiving a message.
///
/// Merges the received vector with our local vector (taking max of each entry),
/// then increments our own entry.
pub fn receive(&mut self, received: &[u64]) -> Vec<u64> {
assert_eq!(
received.len(),
self.vector.len(),
"Vector lengths must match"
);
// Merge: take max of each entry
for i in 0..self.vector.len() {
self.vector[i] = self.vector[i].max(received[i]);
}
// Receiving is an event, so increment our entry
self.vector[self.node_id] += 1;
self.vector.clone()
}
/// Compare two vector clock timestamps to determine causal ordering.
///
/// # Returns
/// - `Before` if v1 happened-before v2
/// - `After` if v2 happened-before v1
/// - `Concurrent` if neither happened-before the other
/// - `Equal` if they're the same timestamp
pub fn compare(v1: &[u64], v2: &[u64]) -> CausalOrdering {
assert_eq!(v1.len(), v2.len(), "Vectors must have same length");
let mut has_less = false; // Is any v1[i] < v2[i]?
let mut has_greater = false; // Is any v1[i] > v2[i]?
for (a, b) in v1.iter().zip(v2.iter()) {
match a.cmp(b) {
Ordering::Less => has_less = true,
Ordering::Greater => has_greater = true,
Ordering::Equal => {}
}
}
match (has_less, has_greater) {
(false, false) => CausalOrdering::Equal, // All equal
(true, false) => CausalOrdering::Before, // v1 < v2
(false, true) => CausalOrdering::After, // v1 > v2
(true, true) => CausalOrdering::Concurrent, // Neither
}
}
/// Get current vector without incrementing.
pub fn current(&self) -> &[u64] {
&self.vector
}
}
fn main() {
println!("=== Vector Clock Demo ===\n");
// Three-node system
let mut vc0 = VectorClock::new(0, 3);
let mut vc1 = VectorClock::new(1, 3);
let mut vc2 = VectorClock::new(2, 3);
// Node 0 does an event
let t_a = vc0.tick();
println!("Node 0 event A: {:?}", t_a); // [1, 0, 0]
// Node 0 sends to Node 1
let msg1 = vc0.send_timestamp();
println!("Node 0 sends message: {:?}", msg1); // [2, 0, 0]
// Node 1 receives
let t_b = vc1.receive(&msg1);
println!("Node 1 receives, event B: {:?}", t_b); // [2, 1, 0]
// Node 2 does an independent event
let t_c = vc2.tick();
println!("Node 2 event C (independent): {:?}", t_c); // [0, 0, 1]
// Now let's compare timestamps!
println!("\n=== Comparisons ===");
println!("A {:?} vs B {:?}: {:?}",
t_a, t_b, VectorClock::compare(&t_a, &t_b)); // Before
println!("B {:?} vs C {:?}: {:?}",
t_b, t_c, VectorClock::compare(&t_b, &t_c)); // Concurrent!
println!("A {:?} vs C {:?}: {:?}",
t_a, t_c, VectorClock::compare(&t_a, &t_c)); // Concurrent!
// Node 1 sends to Node 2
let msg2 = vc1.send_timestamp();
println!("\nNode 1 sends message: {:?}", msg2); // [2, 2, 0]
let t_d = vc2.receive(&msg2);
println!("Node 2 receives, event D: {:?}", t_d); // [2, 2, 2]
// Now C and D are on the same node, so:
println!("\nC {:?} vs D {:?}: {:?}",
t_c, t_d, VectorClock::compare(&t_c, &t_d)); // Before (C → D)
// And B → D through the message
println!("B {:?} vs D {:?}: {:?}",
t_b, t_d, VectorClock::compare(&t_b, &t_d)); // Before (B → D)
}
Output:
=== Vector Clock Demo ===
Node 0 event A: [1, 0, 0]
Node 0 sends message: [2, 0, 0]
Node 1 receives, event B: [2, 1, 0]
Node 2 event C (independent): [0, 0, 1]
=== Comparisons ===
A [1, 0, 0] vs B [2, 1, 0]: Before
B [2, 1, 0] vs C [0, 0, 1]: Concurrent
A [1, 0, 0] vs C [0, 0, 1]: Concurrent
Node 1 sends message: [2, 2, 0]
Node 2 receives, event D: [2, 2, 2]
C [0, 0, 1] vs D [2, 2, 2]: Before
B [2, 1, 0] vs D [2, 2, 2]: Before
Why Vector Clocks Matter
Vector clocks let us answer the crucial question: Are these events concurrent?
Use case: Distributed databases
Two servers receive writes for the same key:
Server A: SET user.name = "Alice" with VC = [2, 1, 0]
Server B: SET user.name = "Bob" with VC = [1, 3, 0]
Compare: [2,1,0] vs [1,3,0]
- 2 > 1 (first entry)
- 1 < 3 (second entry)
- → CONCURRENT!
We now know there's a conflict that needs resolution (user intervention, merge function, etc.).
If instead:
Server A: SET user.name = "Alice" with VC = [2, 1, 0]
Server B: SET user.name = "Bob" with VC = [3, 2, 0]
Compare: [2,1,0] vs [3,2,0]
- 2 < 3, 1 < 2, 0 = 0
- → [2,1,0] happened-before [3,2,0]
Server B's write is strictly later: no conflict, just use "Bob".
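Using the VectorClock::compare and CausalOrdering types from the sketch above, this decision reduces to a single match (the describe helper is illustrative):
fn describe(v1: &[u64], v2: &[u64]) -> &'static str {
    match VectorClock::compare(v1, v2) {
        CausalOrdering::Before => "no conflict: the second write supersedes the first",
        CausalOrdering::After => "no conflict: the first write supersedes the second",
        CausalOrdering::Concurrent => "CONFLICT: keep both versions and resolve",
        CausalOrdering::Equal => "identical timestamps: same event",
    }
}
fn main() {
    println!("{}", describe(&[2, 1, 0], &[1, 3, 0])); // CONFLICT
    println!("{}", describe(&[2, 1, 0], &[3, 2, 0])); // second supersedes
}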
B.6: Practical Applications
Application 1: Distributed Databases
Vector clocks (or similar mechanisms) are used in real databases:
- Amazon Dynamo (DynamoDB's predecessor)
- Riak
- Voldemort
When conflicts are detected, these systems either:
- Return all conflicting versions to the client (let application resolve)
- Use automatic conflict resolution (CRDTs, last-writer-wins, etc.)
Application 2: Distributed Logging
When debugging distributed systems, you need to correlate logs from multiple services.
Problem: Physical timestamps might be out of sync. Service B's log entry might show an earlier timestamp than Service A, even though A sent a message to B.
Solution: Include Lamport timestamps in logs. Sort by Lamport timestamp for a causally-consistent view.
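As a minimal sketch (the LogEntry fields are illustrative), tag each entry with the service's Lamport timestamp and sort by it:
#[derive(Debug)]
struct LogEntry {
    lamport_ts: u64, // taken from the service's LamportClock
    service: &'static str,
    message: &'static str,
}
fn main() {
    // Collected from two services; physical timestamps may disagree.
    let mut entries = vec![
        LogEntry { lamport_ts: 5, service: "B", message: "handled request" },
        LogEntry { lamport_ts: 2, service: "A", message: "sent request" },
        LogEntry { lamport_ts: 6, service: "A", message: "got response" },
    ];
    // Sorting by Lamport timestamp never places an effect before its cause.
    entries.sort_by_key(|e| e.lamport_ts);
    for e in &entries {
        println!("[{}] {}: {}", e.lamport_ts, e.service, e.message);
    }
}
Note that concurrent entries may still interleave arbitrarily; only causally related entries are guaranteed to appear in order.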
Application 3: CRDTs (Conflict-Free Replicated Data Types)
CRDTs are data structures designed to be replicated across servers and merged without conflicts. They use ideas from vector clocks.
B.6: Practical Applications (continued)
Application 3: CRDTs (Conflict-Free Replicated Data Types) - Implementation
Example: Grow-only Counter (G-Counter)
/// A Grow-only Counter that can be incremented on any node
/// and merged without conflicts.
///
/// Each node only increments its own entry. Merging takes
/// the max of each entry, so no increments are ever lost.
#[derive(Debug, Clone)]
struct GCounter {
node_id: usize,
counts: Vec<u64>,
}
impl GCounter {
fn new(node_id: usize, num_nodes: usize) -> Self {
GCounter {
node_id,
counts: vec![0; num_nodes],
}
}
/// Increment the counter (only our node's entry).
fn increment(&mut self) {
self.counts[self.node_id] += 1;
}
/// Get the total count across all nodes.
fn value(&self) -> u64 {
self.counts.iter().sum()
}
/// Merge with another counter (e.g., received from another node).
/// Takes the max of each entry: no increments are lost!
fn merge(&mut self, other: &GCounter) {
for i in 0..self.counts.len() {
self.counts[i] = self.counts[i].max(other.counts[i]);
}
}
}
fn main() {
println!("=== G-Counter Demo ===\n");
let mut counter_a = GCounter::new(0, 3); // Node 0
let mut counter_b = GCounter::new(1, 3); // Node 1
// Node A increments twice
counter_a.increment();
counter_a.increment();
println!("Node A after 2 increments: counts={:?}, value={}",
counter_a.counts, counter_a.value()); // [2, 0, 0], value=2
// Node B increments three times
counter_b.increment();
counter_b.increment();
counter_b.increment();
println!("Node B after 3 increments: counts={:?}, value={}",
counter_b.counts, counter_b.value()); // [0, 3, 0], value=3
// Now they sync up (merge in both directions)
counter_a.merge(&counter_b);
counter_b.merge(&counter_a);
println!("\nAfter merge:");
println!("Node A: counts={:?}, value={}",
counter_a.counts, counter_a.value()); // [2, 3, 0], value=5
println!("Node B: counts={:?}, value={}",
counter_b.counts, counter_b.value()); // [2, 3, 0], value=5
println!("\nBoth nodes agree: total = 5 (no increments lost!)");
}
Output:
=== G-Counter Demo ===
Node A after 2 increments: counts=[2, 0, 0], value=2
Node B after 3 increments: counts=[0, 3, 0], value=3
After merge:
Node A: counts=[2, 3, 0], value=5
Node B: counts=[2, 3, 0], value=5
Both nodes agree: total = 5 (no increments lost!)
═══════════════════════════════════════════════════════════════
SUMMARY
═══════════════════════════════════════════════════════════════
What You've Learned
Networking Fundamentals
- Data travels as packets through routers
- IP handles addressing and routing (best-effort, no guarantees)
- TCP provides reliability via acknowledgments and retransmission
- Ports identify applications on a machine
- DNS translates names to IP addresses
Socket Programming
- Servers: bind → listen → accept
- Clients: connect → read/write
- All I/O can fail: handle errors properly
Failure Handling
- Timeouts are essential: you can't wait forever
- Retries with exponential backoff prevent thundering herds
- Only retry idempotent operations
- Network failures are ambiguous: you often can't tell if a request succeeded
The Eight Fallacies
- The network is reliable ❌
- Latency is zero ❌
- Bandwidth is infinite ❌
- The network is secure ❌
- Topology doesn't change ❌
- There is one administrator ❌
- Transport cost is zero ❌
- The network is homogeneous ❌
Concurrency
- Race conditions occur when multiple threads access shared data
- Rust prevents most concurrency bugs at compile time
- Mutex provides mutual exclusion
- Arc allows shared ownership across threads
- Deadlock occurs when threads wait for each other
Serialization
- Serde is Rust's serialization framework
- JSON for human-readable, cross-language compatibility
- Binary formats (bincode) for speed and, usually, compactness
Time and Ordering
- Physical clocks can't be trusted across machines
- Lamport clocks: If A → B, then LC(A) < LC(B)
- Vector clocks: Can determine if events are concurrent
- Crucial for conflict detection in distributed databases