Foundations of Distributed Systems

January 7, 2026

Note: All code examples are in Rust. You should be comfortable with basic Rust: ownership, borrowing, structs, enums, Result, and match. If terms like impl or &mut self are unfamiliar, work through the relevant Rust Book chapters first; I already have notes on those.

How This Document Is Organized

This guide has two distinct parts:

Part                                  What It Covers                                                        Why You Need It
Part A: Prerequisites                 Networking fundamentals, how computers communicate                    You can't understand "the network is unreliable" without knowing what a network actually is
Part B: Distributed Systems Concepts  The distributed systems theory: fallacies, failures, time, ordering   This is the core content

Suggested approach:

  1. Read Part A if networking is new to you (or skim if you're familiar)
  2. Read Part B carefully — this is the main part (code examples are embedded with concepts)

PART A: PREREQUISITES

Networking Fundamentals & Rust Basics

Skip this part if: You already understand TCP/IP, can explain what happens when you visit a website, and know what ports are.


A.1: What is a Distributed System?

A distributed system is a collection of independent computers that appears to its users as a single coherent system.

When you use Google Search, you don't think "I'm querying thousands of machines across dozens of data centers." You just see a search box and get results. That's the magic, and the challenge, of distributed systems.

But let's make this concrete. What does "computers talking to each other" actually mean?

What's Actually Happening (The Physical Reality)

When your computer sends data to a server, here's the physical reality:

  1. Your computer converts data into electrical signals
  2. Those signals travel through cables (ethernet, fiber optic) or radio waves (WiFi)
  3. They pass through multiple intermediate devices (routers, switches)
  4. Eventually they reach the destination computer
  5. That computer converts the signals back into data

Key insight: At the physical level, all you can do is send electrical pulses. Everything else (files, messages, web pages) is abstraction built on top of this.

The Communication Problem

Imagine you can only communicate by flashing a light on and off. How do you send the message "Hello"?

You need to agree on:

  1. Encoding: What patterns mean what? (Like Morse code)
  2. Timing: How long is a dot vs. a dash? When does one letter end and the next begin?
  3. Error handling: What if the other person blinks and misses a flash?
  4. Addressing: If multiple people are watching, who is this message for?

Networks solve all these problems through protocols: agreed-upon rules for communication. The internet is built from layers of protocols, each solving different problems.
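
To make "agreed-upon rules" concrete, here is a toy sketch (my own, not any real protocol) of length-prefix framing in Rust: the sender writes a 4-byte length before every message, so the receiver always knows where one message ends and the next begins; that is exactly the timing/framing problem from the list above.

use std::io::{Cursor, Read, Result, Write};

// Write one message: a 4-byte big-endian length, then the payload.
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> Result<()> {
    w.write_all(&(payload.len() as u32).to_be_bytes())?;
    w.write_all(payload)
}

// Read one message back by first reading the 4-byte length.
fn read_frame<R: Read>(r: &mut R) -> Result<Vec<u8>> {
    let mut len_bytes = [0u8; 4];
    r.read_exact(&mut len_bytes)?;
    let len = u32::from_be_bytes(len_bytes) as usize;
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}

fn main() -> Result<()> {
    // Two messages share one byte stream; framing keeps them separate.
    let mut wire = Vec::new();
    write_frame(&mut wire, b"Hello")?;
    write_frame(&mut wire, b"world")?;

    let mut reader = Cursor::new(wire);
    println!("{}", String::from_utf8_lossy(&read_frame(&mut reader)?));
    println!("{}", String::from_utf8_lossy(&read_frame(&mut reader)?));
    Ok(())
}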

Layers of Abstraction

Computer networking is built in layers:

┌─────────────────────────────────────────┐
│  Application Layer (HTTP, SMTP, etc.)   │  ← "I want to load a webpage"
├─────────────────────────────────────────┤
│  Transport Layer (TCP, UDP)             │  ← "Deliver this data reliably"
├─────────────────────────────────────────┤
│  Network Layer (IP)                     │  ← "Route this to the right machine"
├─────────────────────────────────────────┤
│  Link Layer (Ethernet, WiFi)            │  ← "Send bits to the next hop"
├─────────────────────────────────────────┤
│  Physical Layer                         │  ← Actual electrical/light signals
└─────────────────────────────────────────┘

Why layers? Each layer only talks to the layers directly above and below it. This means each layer can evolve or be swapped out without touching the others, and application code never has to think about voltages on a wire.

Analogy: Sending a letter. You write the message (application), put it in an addressed envelope (transport/network), and the postal system worries about trucks and planes (link/physical).

You don't think about the trucks when writing. The layers hide complexity.


A.2: IP Addresses and Packets

IP Addresses: How Machines Are Identified

Every device on the internet has an IP address: a unique identifier, like a phone number for computers.

IPv4 addresses look like: 192.168.1.1 or 142.250.80.46

IPv6 addresses look like: 2001:0db8:85a3:0000:0000:8a2e:0370:7334

Special addresses you'll see often:

  1. 127.0.0.1 (localhost): always this machine; traffic to it never leaves your computer
  2. 0.0.0.0: "all interfaces," used when binding a server that should accept connections from any network
  3. 192.168.x.x and 10.x.x.x: private ranges used on local networks, not routable on the public internet

When you type google.com in your browser, your computer first asks "what's the IP address for google.com?" The answer might be 142.250.80.46. Then your computer talks to that IP address. The translation from name to IP is done by DNS (Domain Name System): itself a massive distributed system.

Packets: Data Travels in Chunks

Here's something that might surprise you: when you download a 1GB file, it doesn't travel as one continuous stream of 1 billion bytes. Instead, it's broken into thousands of small packets, typically around 1,500 bytes each.

Why packets?

  1. Sharing: Imagine a single phone line between two cities. If one person made a call that lasted an hour, no one else could use the line. But if we break conversations into tiny chunks and interleave them, many conversations can share the same line. That's what packet switching does: your Netflix stream, your neighbor's video call, and a business's database queries all share the same wires, packet by packet.

  2. Reliability: If you sent 1GB as one continuous stream and there was an error at byte 999,999,999, you'd have to resend the entire gigabyte. With packets, you only resend the one packet that failed.

  3. Routing: Different packets can take different paths. If one route is congested, packets can go around it.

What's in a packet?

Every packet has two parts: a header (metadata) and a payload (actual data).

┌──────────────────────────────────────────────────────┐
│  HEADER (metadata about this packet)                 │
│  ┌─────────────────┬──────────────────────────────┐  │
│  │ Source IP       │ 192.168.1.100                │  │  ← Who sent this
│  │ Destination IP  │ 142.250.80.46                │  │  ← Where it's going
│  │ Protocol        │ TCP                          │  │  ← How to interpret payload
│  │ Length          │ 1420 bytes                   │  │  ← How big the payload is
│  │ Checksum        │ 0x3a7f                       │  │  ← For error detection
│  │ TTL             │ 64                           │  │  ← Hops before packet dies
│  │ ... more fields │                              │  │
│  └─────────────────┴──────────────────────────────┘  │
├──────────────────────────────────────────────────────┤
│  PAYLOAD (actual data being sent)                    │
│  "GET /index.html HTTP/1.1\r\nHost: google.com..."   │
│  (up to ~1460 bytes for typical internet traffic)    │
└──────────────────────────────────────────────────────┘

Think of it like a postcard: the address side is the header, the message side is the payload.

Routing: How Packets Find Their Way

Your packet doesn't teleport directly to its destination. It hops through multiple routers: devices whose job is to forward packets toward their destination.

Your Computer 
    ↓
Your Home Router (192.168.1.1)
    ↓
ISP's Router
    ↓
Regional Hub
    ↓
... (5-20 more hops typically) ...
    ↓
Google's Edge Router
    ↓
Google's Server (142.250.80.46)

Each router looks at the destination IP address and decides: "based on my routing table, which of my neighbors is closest to this destination?" Then it forwards the packet that direction.

You can see this yourself! In a terminal, run:

# On Mac/Linux:
traceroute google.com

# On Windows:
tracert google.com

You'll see each hop your packets take, with timing for each. This is invaluable for debugging network issues.

What Can Go Wrong at the IP Layer

Here's the crucial thing: IP provides no guarantees. It's "best effort" delivery.

  1. Packets can be lost: A router gets overloaded and drops packets. A cable gets damaged. Radio interference corrupts a WiFi packet. The packet just... disappears.

  2. Packets can arrive out of order: Packet 3 might take a different route than packets 1 and 2, and arrive first.

  3. Packets can be duplicated: A router isn't sure if it forwarded a packet successfully, so it sends it twice.

  4. Packets can be corrupted: A cosmic ray flips a bit. Electrical interference on a wire changes a 1 to a 0. (Checksums catch most of these, but not all.)

  5. Packets can be delayed: Traffic jam at a router. Your packet waits in a queue.

This is why distributed systems are hard. The network will fail. The question is how you handle it.


A.3: TCP: Making Communication Reliable

The Problem

IP gives us the ability to send packets between machines, but with no guarantees. For many applications, that's not good enough: a file transfer can't tolerate missing bytes, a database can't apply writes out of order, and a payment must not be processed twice because a packet was duplicated.

TCP: Transmission Control Protocol

TCP is a protocol built on top of IP. IP handles getting individual packets to the right machine. TCP handles making that communication reliable and ordered.

TCP provides:

  1. Reliable delivery: If a packet is lost, TCP automatically resends it
  2. Ordering: TCP reassembles packets in the correct order, even if they arrive out of order
  3. Flow control: TCP slows down if the receiver can't keep up
  4. Congestion control: TCP slows down if the network is overloaded

How TCP Knows When Packets Are Lost: Acknowledgments

The key mechanism is acknowledgments (ACKs). When the receiver gets data, it sends back a message saying "I got it."

Sender                           Receiver
   │                                │
   │──── Data: bytes 0-999 ────────>│
   │                                │  Receiver got it!
   │<─────────────── ACK 1000 ──────│  "I've received up to byte 999,
   │                                │   send byte 1000 next"
   │                                │
   │──── Data: bytes 1000-1999 ────>│  (this packet gets lost!)
   │                                │
   │     ... sender waits ...       │
   │     ... no ACK comes ...       │
   │     ... timeout! ...           │
   │                                │
   │──── Data: bytes 1000-1999 ────>│  (resend!)
   │                                │
   │<─────────────── ACK 2000 ──────│  "Got it, send byte 2000 next"

Key insight: TCP uses timeouts to detect loss. If no acknowledgment arrives within a certain time, TCP assumes the packet was lost and resends it. This is why "timeout" is such an important concept in distributed systems.

Sequence Numbers: Handling Ordering and Duplicates

Each byte of data in a TCP connection has a sequence number. This solves two problems (a code sketch follows the list):

  1. Ordering: If packet with bytes 2000-2999 arrives before packet with bytes 1000-1999, TCP holds onto the later packet and waits for the earlier one. It delivers data to your application in order.

  2. Duplicates: If the network accidentally duplicates a packet, TCP sees "I already have bytes 1000-1999" and discards the duplicate.
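
A toy sketch of the reassembly idea (illustrative only; real TCP is far more involved): buffer out-of-order segments keyed by sequence number, and hand bytes to the application only once the next expected segment has arrived. It assumes segments never partially overlap, which real TCP must also handle.

use std::collections::BTreeMap;

// A toy in-order reassembly buffer (not real TCP).
struct Reassembler {
    next_seq: u64,                   // next byte we can deliver
    pending: BTreeMap<u64, Vec<u8>>, // out-of-order segments by sequence number
}

impl Reassembler {
    fn new() -> Self {
        Reassembler { next_seq: 0, pending: BTreeMap::new() }
    }

    // Accept a segment; return any bytes that are now deliverable in order.
    fn receive(&mut self, seq: u64, data: Vec<u8>) -> Vec<u8> {
        if seq < self.next_seq {
            return Vec::new(); // duplicate: these bytes were already delivered
        }
        self.pending.insert(seq, data); // an exact duplicate just overwrites
        let mut deliverable = Vec::new();
        // Drain consecutive segments starting at next_seq.
        while let Some(data) = self.pending.remove(&self.next_seq) {
            self.next_seq += data.len() as u64;
            deliverable.extend(data);
        }
        deliverable
    }
}

fn main() {
    let mut r = Reassembler::new();
    // Bytes 1000-1999 arrive before bytes 0-999: held back, nothing delivered.
    assert!(r.receive(1000, vec![b'B'; 1000]).is_empty());
    // The missing earlier segment arrives: both are delivered, in order.
    assert_eq!(r.receive(0, vec![b'A'; 1000]).len(), 2000);
    println!("delivered 2000 bytes in order");
}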

TCP Connections and the Three-Way Handshake

Unlike IP (which just fires packets into the void), TCP establishes a connection before sending data. Both sides agree: "we're now in a conversation."

This happens through the three-way handshake:

Client                              Server
   │                                   │
   │──── SYN ─────────────────────────>│
   │     "I want to connect.           │
   │      My starting sequence         │
   │      number is 1000."             │
   │                                   │
   │<─────────────────────── SYN-ACK ──│
   │     "OK, I accept.                │
   │      My starting sequence         │
   │      number is 5000.              │
   │      I acknowledge your 1000."    │
   │                                   │
   │──── ACK ─────────────────────────>│
   │     "Great, I acknowledge         │
   │      your 5000."                  │
   │                                   │
   │     === Connection Established === │
   │                                   │

Why three messages?

  1. SYN (synchronize): Client says "I want to connect" and shares its initial sequence number
  2. SYN-ACK: Server says "I accept" and shares its sequence number, while acknowledging the client's
  3. ACK: Client confirms it received the server's info

After this, both sides know the connection is established and have agreed on starting sequence numbers.

Important: A "connection" is a logical concept, not physical. There's no dedicated wire between you and Google. A connection is just both computers agreeing "we're having a conversation" and keeping track of state (sequence numbers, what's been acknowledged, etc.).

Ports: Multiple Connections Per Machine

An IP address identifies a machine. But a machine runs many programs: a web server, a database, an SSH daemon. How does the machine know which program should receive which packets?

Ports. A port is a 16-bit number (0-65535) that identifies a specific application or service.

A full network address is IP:port, like 142.250.80.46:443.

Well-known ports (you'll see these constantly):

Port   Service     Description
22     SSH         Secure shell (remote login)
80     HTTP        Web traffic (unencrypted)
443    HTTPS       Web traffic (encrypted)
5432   PostgreSQL  Postgres database
6379   Redis       Redis cache/database
27017  MongoDB     MongoDB database

When you visit https://google.com, your browser connects to Google's IP on port 443.

But what about your side? Your computer also needs a port for the connection. The operating system assigns a random high port (like 54321) for outgoing connections. So a full connection looks like:

Your machine (192.168.1.100:54321) <═══> Google (142.250.80.46:443)

Each endpoint (an IP address + port) is called a socket. The combination of both endpoints, the 4-tuple (source IP, source port, destination IP, destination port), uniquely identifies a connection.

What TCP Still Can't Solve

TCP makes communication reliable, but it can't work miracles:

  1. If the other machine crashes, TCP can't magically resurrect it. Eventually TCP will timeout and report an error.

  2. If the network is partitioned (all routes between two machines are broken), no amount of retrying will help.

  3. TCP doesn't guarantee speed. It guarantees delivery, but if the network is slow, your data is slow.

  4. Connections can be killed. If a connection is idle for too long, intermediate routers or firewalls might close it.


A.4: UDP: When You Don't Want TCP's Guarantees

Sometimes Reliability Isn't Worth It

TCP's reliability comes at a cost: a round-trip handshake before any data flows, retransmission delays when packets are lost (one missing packet holds up everything behind it), and connection state to maintain on both ends.

For some applications, these tradeoffs don't make sense: a video call would rather skip a lost frame than freeze while it's retransmitted, and a game would rather have the newest player position than an old one delivered reliably.

UDP: User Datagram Protocol

UDP is another protocol built on IP, but much simpler than TCP.

UDP provides: port numbers (so datagrams reach the right application) and an optional checksum. That's essentially it.

UDP does NOT provide: reliability, ordering, duplicate suppression, flow control, or congestion control. A datagram is sent once; if it's lost, it's gone.

Mental model: TCP is a phone call (establish a connection, speak in order, hang up). UDP is postcards: each one is independent, and any of them might be lost, duplicated, or arrive out of order.

When to Use Each

Use TCP when...                            Use UDP when...
Data must be complete and correct          Missing data is acceptable
Order matters                              Latest data is more important than complete data
You're building request/response patterns  You need lowest possible latency
Examples: HTTP, databases, file transfer   Examples: video streaming, games, DNS

Most distributed systems you'll build use TCP. But understanding UDP helps you appreciate what TCP does for you.
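
To see the difference in code, here is a minimal UDP sketch using std's UdpSocket: no connection, no handshake, just datagrams aimed at an address. Both sockets live in one process for convenience, and port 34254 is an arbitrary choice.

use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // Two UDP sockets in one process; in reality these would be on
    // different machines. Port 0 asks the OS to pick any free port.
    let server = UdpSocket::bind("127.0.0.1:34254")?;
    let client = UdpSocket::bind("127.0.0.1:0")?;

    // No connect, no handshake: just fire a datagram at an address.
    // If it's lost, nobody retransmits it. That's the UDP deal.
    client.send_to(b"ping", "127.0.0.1:34254")?;

    let mut buf = [0u8; 1500]; // one datagram fits in one typical packet
    let (n, from) = server.recv_from(&mut buf)?;
    println!("server got {:?} from {}", &buf[..n], from);

    // Reply straight back to whoever sent the datagram.
    server.send_to(b"pong", from)?;
    let (n, _) = client.recv_from(&mut buf)?;
    println!("client got {}", String::from_utf8_lossy(&buf[..n]));
    Ok(())
}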


A.5: DNS: Translating Names to Addresses

The Problem

Humans like names: google.com, github.com, api.mycompany.io

Computers need numbers: 142.250.80.46

DNS (Domain Name System) translates names to IP addresses. It's like the internet's phone book.

How DNS Works

When you type github.com in your browser:

1. Browser: "What's the IP for github.com?"
        │
        ▼
2. Check local cache on your machine
   Found and not expired? → Use it
   Not found? → Ask DNS resolver
        │
        ▼
3. DNS Resolver (your ISP, or 8.8.8.8 for Google, 1.1.1.1 for Cloudflare)
   Has it cached? → Return it
   Not cached? → Ask authoritative servers
        │
        ▼
4. Ask Root servers: "Who handles .com?"
        │
        ▼
5. Ask .com TLD servers: "Who handles github.com?"
        │
        ▼
6. Ask GitHub's nameservers: "What's the IP for github.com?"
        │
        ▼
7. Answer: "github.com = 140.82.112.4"
        │
        ▼
8. Response flows back up, cached at each level
        │
        ▼
9. Browser connects to 140.82.112.4
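
You can watch the name-to-IP step from Rust with no extra crates: std's ToSocketAddrs asks the operating system's resolver, which walks exactly the cache-then-DNS path above. (Port 443 is arbitrary here; the API resolves host:port pairs.)

use std::net::ToSocketAddrs;

fn main() -> std::io::Result<()> {
    // "host:port" resolves through the OS resolver (cache, then DNS).
    // One name can map to several addresses (IPv4 and IPv6, replicas).
    for addr in "github.com:443".to_socket_addrs()? {
        println!("github.com resolves to {}", addr);
    }
    Ok(())
}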

DNS Is Itself a Distributed System!

Here's something beautiful: DNS is one of the oldest and most successful distributed systems in existence.

Fun fact: When DNS has problems, the internet appears broken even though all the underlying servers are fine. You can't reach google.com by name, even though 142.250.80.46 is perfectly reachable.

What Can Go Wrong with DNS

  1. DNS server failure: That's why there are many servers with replication
  2. Slow lookups: Usually cached, but first lookup for a domain takes time
  3. DNS spoofing/poisoning: Attacker gives you wrong IP address
  4. Propagation delays: DNS changes can take hours to spread globally
  5. TTL expiration: Cached entries expire; then you need a fresh lookup

A.6: Sockets: How Your Code Talks to the Network

Now we get to code. Everything above is handled by the operating system. Your code interacts with the network through sockets.

What Is a Socket?

A socket is an endpoint for communication. It's the interface between your application and the TCP/IP stack in the operating system.

Think of it like this: a socket is to the network what a file handle is to the filesystem. Your program reads and writes it like a file, and the operating system does the real work (packets, retransmission, routing) underneath.

The Server/Client Pattern

Most network programs follow a pattern:

Server:

  1. Create a socket
  2. Bind it to an address (IP + port): "I will listen on port 8080"
  3. Listen for incoming connections
  4. Accept a connection when a client connects
  5. Read/write data
  6. Close the connection

Client:

  1. Create a socket
  2. Connect to the server's address
  3. Read/write data
  4. Close the connection

Let's implement both in Rust, explaining every line.

A TCP Server in Rust, Explained Line by Line

// === Imports ===

// std::net contains networking primitives
// TcpListener: for accepting incoming connections (server-side)
// TcpStream: a single TCP connection (used by both client and server)
use std::net::{TcpListener, TcpStream};

// std::io contains input/output traits and types
// Read: trait that provides the `read` method
// Write: trait that provides `write` and `write_all` methods
use std::io::{Read, Write};

// === Helper Function to Handle One Connection ===

// This function takes ownership of a TcpStream (one client connection)
// and returns a Result (Ok if everything worked, Err if something failed)
fn handle_client(mut stream: TcpStream) -> std::io::Result<()> {
    // peer_addr() returns the remote (client's) address
    // The ? operator: if this returns an Err, immediately return that Err
    // If it returns Ok(addr), unwrap the addr and continue
    let peer_addr = stream.peer_addr()?;
    println!("Handling connection from {}", peer_addr);
    
    // Create a buffer to store incoming data
    // [0u8; 1024] means: an array of 1024 bytes, all initialized to 0
    // We need `mut` because `read` will modify this buffer
    let mut buffer = [0u8; 1024];
    
    // Read data from the connection into our buffer
    // read() returns how many bytes were actually read
    // This BLOCKS until data arrives or the connection closes
    let bytes_read = stream.read(&mut buffer)?;
    
    // What if the client closed the connection without sending anything?
    if bytes_read == 0 {
        println!("Client closed connection without sending data");
        return Ok(());
    }
    
    // Convert the bytes we received to a string for display
    // &buffer[..bytes_read] is a slice of just the bytes that were filled in
    // from_utf8_lossy: if there are invalid UTF-8 bytes, replace them with �
    // (We use this because network data might not be valid UTF-8)
    let received = String::from_utf8_lossy(&buffer[..bytes_read]);
    println!("Received {} bytes: {}", bytes_read, received);
    
    // Create a response
    let response = format!("Echo: {}", received);
    
    // Send the response back
    // write_all() sends ALL the bytes, retrying if necessary
    // (plain write() might only send some bytes if the buffer is full)
    // as_bytes() converts the String to &[u8] (a byte slice)
    stream.write_all(response.as_bytes())?;
    
    println!("Sent response to {}", peer_addr);
    
    // Connection is automatically closed when `stream` goes out of scope
    // (Rust's Drop trait handles this)
    Ok(())
}

// === Main Function ===

fn main() -> std::io::Result<()> {
    // === Step 1: Create and bind a listener socket ===
    
    // TcpListener::bind() does several things:
    // 1. Creates a socket
    // 2. Binds it to the specified address (IP:port)
    // 3. Puts it in listening mode
    //
    // "127.0.0.1:8888" means:
    // - 127.0.0.1 = localhost (only accept connections from this machine)
    // - 8888 = the port number
    //
    // If you used "0.0.0.0:8888", it would accept connections from any IP
    // (useful for servers that should be reachable from other machines)
    let listener = TcpListener::bind("127.0.0.1:8888")?;
    
    println!("Server listening on 127.0.0.1:8888");
    println!("Connect with: nc localhost 8888");
    println!("Or run the client program");
    println!();
    
    // === Step 2: Accept connections in a loop ===
    
    // listener.incoming() returns an iterator of connection attempts
    // Each item is a Result<TcpStream, Error>
    for stream_result in listener.incoming() {
        // Handle the Result - did the connection attempt succeed?
        match stream_result {
            Ok(stream) => {
                // We got a connection!
                // Note: This is a simple single-threaded server.
                // While we're handling this client, other clients wait.
                // A real server would spawn a thread or use async.
                if let Err(e) = handle_client(stream) {
                    // Log the error but keep serving other clients
                    eprintln!("Error handling client: {}", e);
                }
            }
            Err(e) => {
                // The connection attempt itself failed
                // (Different from errors while handling a connection)
                eprintln!("Failed to accept connection: {}", e);
            }
        }
    }
    
    Ok(())
}

Key Concepts in the Server

Binding: When we call bind("127.0.0.1:8888"), we're telling the OS: "I want to receive TCP connections sent to this IP address on port 8888." If another program is already using that port, binding fails.

Listening: After binding, the socket is in listening mode, ready to receive connections.

Accepting: incoming() blocks until a client connects, then returns a TcpStream representing that connection. The original listener keeps listening for more connections.

Blocking I/O: When we call stream.read(), our thread stops and waits until data arrives. This is called "blocking." It's simple but means we can only handle one client at a time (in this code). Real servers use threads or async to handle many clients concurrently.

A TCP Client in Rust, Explained Line by Line

use std::net::TcpStream;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    // === Step 1: Connect to the server ===
    
    // TcpStream::connect() does the three-way handshake:
    // 1. Sends SYN to server
    // 2. Receives SYN-ACK from server
    // 3. Sends ACK to server
    //
    // If successful, returns a TcpStream representing the connection.
    // If the server isn't running, or the port is wrong, this fails.
    //
    // Common errors:
    // - ConnectionRefused: server isn't listening on that port
    // - TimedOut: server didn't respond (maybe wrong IP, or firewall)
    
    println!("Connecting to server...");
    let mut stream = TcpStream::connect("127.0.0.1:8888")?;
    
    // local_addr() returns OUR side of the connection
    // peer_addr() returns the SERVER's side
    println!("Connected!");
    println!("  Local address: {}", stream.local_addr()?);
    println!("  Server address: {}", stream.peer_addr()?);
    
    // === Step 2: Send data to the server ===
    
    let message = "Hello from Rust client!";
    println!("\nSending: {}", message);
    
    // write_all() ensures all bytes are sent
    stream.write_all(message.as_bytes())?;
    
    // === Step 3: Receive response from server ===
    
    let mut buffer = [0u8; 1024];
    let bytes_read = stream.read(&mut buffer)?;
    
    if bytes_read == 0 {
        println!("Server closed connection without responding");
    } else {
        let response = String::from_utf8_lossy(&buffer[..bytes_read]);
        println!("Received: {}", response);
    }
    
    // Connection is closed when `stream` goes out of scope
    println!("\nConnection closed.");
    
    Ok(())
}

Running the Example

Create a new Rust project:

cargo new tcp_demo
cd tcp_demo

Replace src/main.rs with the server code. Create src/bin/client.rs with the client code.

Update Cargo.toml:

[package]
name = "tcp_demo"
version = "0.1.0"
edition = "2021"

[[bin]]
name = "server"
path = "src/main.rs"

[[bin]]
name = "client"
path = "src/bin/client.rs"

Run the server:

cargo run --bin server

In another terminal, run the client:

cargo run --bin client

You should see the message sent and echoed back!


A.7: Timeouts and Error Handling

The Fundamental Problem with Networks

When you send a request over a network and don't get a response, what happened?

  1. The request never arrived: Network problem on the way there
  2. The request arrived but the server crashed before processing: Server died
  3. The server processed it but crashed before responding: Work was done but you don't know!
  4. The response was sent but got lost: Network problem on the way back
  5. Everything is fine, just slow: Network congestion, server overloaded

Here's the crucial point: From the client's perspective, all of these look the same. You sent a request. You're waiting. No response yet. Is the server dead? Is it slow? Did your message get lost? You can't tell.

This is the core challenge of distributed systems.

Timeouts: Giving Up After a While

A timeout says: "If I don't hear back within X seconds, assume something went wrong and give up."

use std::net::TcpStream;
use std::time::Duration;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    // Connect with a timeout
    // If the server doesn't respond to our SYN within 5 seconds, give up
    let address = "127.0.0.1:8888".parse().unwrap();
    
    println!("Attempting to connect (5 second timeout)...");
    
    let mut stream = TcpStream::connect_timeout(
        &address,
        Duration::from_secs(5)
    )?;
    
    // Set timeouts for reading and writing
    // These apply to EVERY read/write operation on this stream
    stream.set_read_timeout(Some(Duration::from_secs(5)))?;
    stream.set_write_timeout(Some(Duration::from_secs(5)))?;
    
    println!("Connected! Sending request...");
    stream.write_all(b"Hello?")?;
    
    // Now read the response
    // If no data arrives within 5 seconds, read() will return an error
    let mut buffer = [0u8; 1024];
    
    match stream.read(&mut buffer) {
        Ok(0) => {
            println!("Server closed connection without responding");
        }
        Ok(n) => {
            println!("Received: {}", String::from_utf8_lossy(&buffer[..n]));
        }
        Err(e) => {
            // Check what kind of error
            if e.kind() == std::io::ErrorKind::WouldBlock 
               || e.kind() == std::io::ErrorKind::TimedOut {
                println!("TIMEOUT: No response within 5 seconds");
            } else {
                println!("Error: {}", e);
            }
        }
    }
    
    Ok(())
}

The Timeout Dilemma

How long should the timeout be?

Too short: you give up on requests that would have succeeded, and you retry work the server may still be doing.

Too long: users stare at spinners, and threads sit blocked waiting on servers that are already dead.

There's no perfect answer. It depends on the operation's expected latency, how expensive a false alarm is (retrying work that actually succeeded), and whether the operation is safe to retry at all.

Typical timeouts, as rough orders of magnitude: a few seconds to establish a connection, tens to hundreds of milliseconds for a call within a datacenter, and several seconds for a user-facing request that crosses the internet.

Types of Network Errors

Let's understand the different errors you'll encounter:

use std::net::TcpStream;
use std::io::{self, Read, Write};
use std::time::Duration;

fn demonstrate_errors() {
    // === Connection Refused ===
    // Server exists but nothing is listening on that port
    // You'll see this if you run the client without starting the server
    match TcpStream::connect("127.0.0.1:9999") {
        Err(e) if e.kind() == io::ErrorKind::ConnectionRefused => {
            println!("Connection refused: nothing listening on that port");
        }
        _ => {}
    }
    
    // === Connection Timeout ===
    // Can't reach the server at all (wrong IP, firewall, server down)
    // Using a non-routable IP to demonstrate
    let addr = "10.255.255.1:80".parse().unwrap();
    match TcpStream::connect_timeout(&addr, Duration::from_secs(2)) {
        Err(e) if e.kind() == io::ErrorKind::TimedOut => {
            println!("Connection timeout: couldn't reach server");
        }
        _ => {}
    }
    
    // === Read Timeout ===
    // Connected, but server isn't sending data
    // (Need a server that accepts but doesn't respond to demonstrate)
    
    // === Connection Reset ===
    // Server abruptly closed the connection
    // Usually means the server crashed or explicitly killed the connection
}

Common Error Kinds

ErrorKind          What it means                          Common causes
ConnectionRefused  Server rejected connection             Nothing listening on port, firewall
TimedOut           Operation took too long                Server overloaded, network problem
ConnectionReset    Connection forcibly closed             Server crashed, server rejected request
BrokenPipe         Writing to a closed connection         Server closed while you were sending
NotConnected       Socket isn't connected                 Forgot to connect, or already disconnected
WouldBlock         Non-blocking operation can't complete  Used with non-blocking I/O

Retries with Exponential Backoff

When something fails, a natural instinct is to try again. But naive retries can make problems worse.

Scenario: Server is overloaded, responding slowly. 100 clients timeout and all immediately retry. Now the server has 200 requests to handle. More timeouts. More retries. Server dies.

Solution: Exponential backoff. Wait longer between each retry.

use std::net::TcpStream;
use std::io::{Read, Write};
use std::time::Duration;
use std::thread;

/// Attempts an operation with retries and exponential backoff.
/// 
/// # Arguments
/// * `max_retries` - How many times to try before giving up
/// * `operation` - A closure that attempts the operation
/// 
/// # Returns
/// The result of the first successful attempt, or the last error
fn with_retry<T, E, F>(max_retries: u32, mut operation: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    E: std::fmt::Display,
{
    let mut last_error = None;
    
    for attempt in 0..max_retries {
        if attempt > 0 {
            // Exponential backoff: 1s, 2s, 4s, 8s, ...
            // `1u64 << (attempt - 1)` is 2^(attempt - 1) (bit shifting)
            let wait_secs = 1u64 << (attempt - 1);
            println!("  Waiting {} seconds before retry...", wait_secs);
            thread::sleep(Duration::from_secs(wait_secs));
        }
        
        println!("Attempt {} of {}...", attempt + 1, max_retries);
        
        match operation() {
            Ok(result) => {
                println!("  Success!");
                return Ok(result);
            }
            Err(e) => {
                println!("  Failed: {}", e);
                last_error = Some(e);
            }
        }
    }
    
    Err(last_error.expect("Should have at least one error"))
}

fn make_request() -> std::io::Result<String> {
    let mut stream = TcpStream::connect_timeout(
        &"127.0.0.1:8888".parse().unwrap(),
        Duration::from_secs(2),
    )?;
    
    stream.set_read_timeout(Some(Duration::from_secs(2)))?;
    stream.write_all(b"Hello")?;
    
    let mut buffer = [0u8; 1024];
    let n = stream.read(&mut buffer)?;
    
    Ok(String::from_utf8_lossy(&buffer[..n]).to_string())
}

fn main() {
    match with_retry(4, make_request) {
        Ok(response) => println!("\nFinal result: {}", response),
        Err(e) => println!("\nAll retries failed: {}", e),
    }
}

Idempotency: A Critical Concept

Question: If a request times out, should you retry?

It depends on whether the operation is idempotent.

Idempotent means: doing it multiple times has the same effect as doing it once.

Operation                                Idempotent?  Safe to retry?
GET a webpage                            Yes          Yes
Set user's email to "alice@example.com"  Yes          Yes
Delete record with ID 123                Yes          Yes (if it's already gone, that's fine)
Add $100 to account                      No           No (might add $200!)
Create new order                         No           No (might create a duplicate!)
Increment counter                        No           No

Why this matters: Remember, you don't know if your request succeeded. Maybe the server processed it and the response got lost. If you retry a non-idempotent operation, you might do it twice!

Making operations idempotent: two common techniques are sending absolute values instead of deltas ("set balance to 150" rather than "add 100"), and attaching a unique idempotency key to each request so the server can recognize and ignore duplicates, as sketched below.
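
A minimal sketch of the idempotency-key idea (the names and types here are illustrative, not any real API): the server remembers the outcome for every key it has processed, so a retried request gets the original result back instead of repeating the side effect.

use std::collections::HashMap;

// Illustrative server-side state: a balance plus a memo of handled requests.
struct Account {
    balance: i64,
    processed: HashMap<String, i64>, // idempotency key -> resulting balance
}

impl Account {
    // "Add $amount" is not idempotent by itself, but keyed requests are:
    // a duplicate key returns the recorded result without re-applying.
    fn deposit(&mut self, idempotency_key: &str, amount: i64) -> i64 {
        if let Some(&result) = self.processed.get(idempotency_key) {
            return result; // duplicate (e.g. a retry): no double deposit
        }
        self.balance += amount;
        self.processed.insert(idempotency_key.to_string(), self.balance);
        self.balance
    }
}

fn main() {
    let mut acct = Account { balance: 100, processed: HashMap::new() };
    // The client times out and retries, reusing the same key.
    assert_eq!(acct.deposit("req-42", 100), 200);
    assert_eq!(acct.deposit("req-42", 100), 200); // the retry is harmless
    println!("final balance: {}", acct.balance);
}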


A.8: HTTP: The Protocol of the Web

What is HTTP?

HTTP (HyperText Transfer Protocol) is an application-layer protocol built on TCP. It's how web browsers talk to web servers, how APIs communicate, and probably the most common protocol you'll work with.

The Request-Response Pattern

HTTP follows a simple pattern:

  1. Client sends a request (method, path, headers, optional body)
  2. Server sends a response (status code, headers, optional body)
Client                              Server
   │                                   │
   │──── HTTP Request ─────────────────>
   │     GET /users/42 HTTP/1.1        │
   │     Host: api.example.com         │
   │     Accept: application/json      │
   │                                   │
   │<───────────────── HTTP Response ──│
   │     HTTP/1.1 200 OK               │
   │     Content-Type: application/json│
   │     {"id": 42, "name": "Alice"}   │
   │                                   │

Anatomy of an HTTP Request

GET /search?q=rust HTTP/1.1       ← Request line: METHOD PATH VERSION
Host: www.google.com               ← Headers start
User-Agent: Mozilla/5.0
Accept: text/html
Accept-Language: en-US
                                   ← Empty line marks end of headers
                                   ← Body would go here (for POST/PUT)

Request line components: the method (what action to perform), the path (which resource), and the HTTP version.

Common HTTP methods:

Method  Meaning                Has body?  Idempotent?
GET     Retrieve data          No         Yes
POST    Create new data        Yes        No
PUT     Replace data entirely  Yes        Yes
PATCH   Update data partially  Yes        Not always
DELETE  Remove data            No         Yes
HEAD    Like GET but no body   No         Yes

Anatomy of an HTTP Response

HTTP/1.1 200 OK                    ← Status line: VERSION CODE MESSAGE
Content-Type: application/json     ← Headers start
Content-Length: 42
Date: Mon, 01 Jan 2024 12:00:00 GMT
                                   ← Empty line marks end of headers
{"id": 42, "name": "Alice"}        ← Body

Common status codes:

Code  Meaning                When you'll see it
200   OK                     Request succeeded
201   Created                POST succeeded, resource created
204   No Content             Success, but no body to return
301   Moved Permanently      Resource has a new URL (redirect)
302   Found                  Temporary redirect
400   Bad Request            Your request was malformed
401   Unauthorized           Need to authenticate
403   Forbidden              Authenticated but not allowed
404   Not Found              Resource doesn't exist
429   Too Many Requests      Rate limited
500   Internal Server Error  Server crashed
502   Bad Gateway            Proxy/load balancer can't reach backend
503   Service Unavailable    Server overloaded or down
504   Gateway Timeout        Proxy/load balancer timeout
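
Before reaching for a library, it's worth seeing that HTTP/1.1 is just text over TCP. This sketch hand-writes a request using the TcpStream from A.6 (plain HTTP on port 80 against example.com; real traffic is HTTPS, which needs TLS and therefore a library):

use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Plain HTTP on port 80. "Connection: close" tells the server to
    // close the socket after responding, so read_to_end terminates.
    let mut stream = TcpStream::connect("example.com:80")?;
    stream.write_all(
        b"GET / HTTP/1.1\r\n\
          Host: example.com\r\n\
          Connection: close\r\n\
          \r\n",
    )?;

    let mut response = Vec::new();
    stream.read_to_end(&mut response)?;
    let text = String::from_utf8_lossy(&response);

    // The status line is the first line of the response.
    println!("{}", text.lines().next().unwrap_or(""));
    Ok(())
}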

Making HTTP Requests in Rust

For HTTP in Rust, we'll use the reqwest crate. It handles all the complexity (TLS, connection pooling, etc.).

# Cargo.toml
[dependencies]
reqwest = { version = "0.11", features = ["blocking", "json"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

And in src/main.rs:

use serde::Deserialize;

// Define a struct matching the JSON we expect
// `Deserialize` is a serde trait that lets us parse JSON into this struct
#[derive(Debug, Deserialize)]
struct User {
    login: String,
    id: u64,
    name: Option<String>,  // Option because this field might be null
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // === Simple GET request ===
    
    println!("Fetching user from GitHub API...\n");
    
    // The `blocking` module provides synchronous (blocking) HTTP
    // There's also async reqwest for production use
    // Note: GitHub's API rejects requests that carry no User-Agent
    // header, so we build a client that sends one
    let client = reqwest::blocking::Client::builder()
        .user_agent("rust-http-example")
        .build()?;
    let response = client
        .get("https://api.github.com/users/octocat")
        .send()?;
    
    // Check the status
    println!("Status: {}", response.status());
    println!("Headers:");
    for (name, value) in response.headers() {
        println!("  {}: {:?}", name, value);
    }
    
    // Parse JSON body into our struct
    // The `: User` type annotation tells serde what type to deserialize into
    let user: User = response.json()?;
    println!("\nParsed user:");
    println!("  Login: {}", user.login);
    println!("  ID: {}", user.id);
    println!("  Name: {:?}", user.name);
    
    Ok(())
}

Building More Complex Requests

use reqwest::blocking::Client;
use reqwest::header::{AUTHORIZATION, CONTENT_TYPE, USER_AGENT};
use serde::{Deserialize, Serialize};
use std::time::Duration;

// For request body
#[derive(Serialize)]
struct CreateIssue {
    title: String,
    body: String,
}

// For response body
#[derive(Debug, Deserialize)]
struct Issue {
    id: u64,
    number: u64,
    title: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a reusable client with configuration
    // In production, you'd typically create ONE client and reuse it
    // (it maintains a connection pool)
    let client = Client::builder()
        .timeout(Duration::from_secs(10))  // Request timeout
        .user_agent("my-rust-app/1.0")      // Identify ourselves
        .build()?;
    
    // Build a POST request
    let issue = CreateIssue {
        title: "Bug report".to_string(),
        body: "Something is broken!".to_string(),
    };
    
    let response = client
        .post("https://api.github.com/repos/owner/repo/issues")
        .header(AUTHORIZATION, "Bearer your-token-here")
        .header(CONTENT_TYPE, "application/json")
        .json(&issue)  // Serialize `issue` to JSON and set as body
        .send()?;
    
    // Handle different status codes
    match response.status().as_u16() {
        201 => {
            let created: Issue = response.json()?;
            println!("Created issue #{}: {}", created.number, created.title);
        }
        401 => {
            println!("Authentication failed - check your token");
        }
        403 => {
            println!("Forbidden - you don't have permission");
        }
        404 => {
            println!("Repository not found");
        }
        422 => {
            // Unprocessable Entity - validation error
            let error: serde_json::Value = response.json()?;
            println!("Validation error: {}", error);
        }
        code => {
            println!("Unexpected status: {}", code);
            println!("Body: {}", response.text()?);
        }
    }
    
    Ok(())
}

PART B: DISTRIBUTED SYSTEMS CONCEPTS

The Core Theory

This is the main content. Part A was background; this is what you're here to learn.


B.1: Why Distributed Systems Are Hard (The Eight Fallacies)

Now that you understand how networks actually work, let's revisit why distributed systems are challenging.

In 1994, L. Peter Deutsch at Sun Microsystems catalogued the assumptions that developers new to distributed systems incorrectly make; colleagues including James Gosling later rounded the list out to eight. These are the Eight Fallacies of Distributed Computing.

Fallacy 1: The Network is Reliable

The assumption: When I send a message, it will arrive.

Reality: You now understand why this is false. Packets can be lost, reordered, duplicated, corrupted, or delayed: all the failure modes from A.2.

Even with TCP, which handles retransmission, connections can fail entirely if the network is partitioned.

What this means for your code: every remote call needs a timeout, a retry policy, and an answer to the question "what if the request succeeded but I never saw the response?" (idempotency, covered in A.7).

Fallacy 2: Latency is Zero

The assumption: Network communication is instant.

Reality: The speed of light is a real constraint.

Distance         One-way latency (theoretical minimum in fiber)
Same rack        0.0005 ms
Same datacenter  0.5 ms
NYC to London    28 ms
NYC to Tokyo     50 ms

And that's just physics. Real latency includes queuing delays in routers, TLS handshakes, operating-system scheduling, and the server's own processing time.

What this means for your code: minimize round trips. Batch related requests, avoid chatty one-call-per-item patterns, and cache data close to where it's used.

Fallacy 3: Bandwidth is Infinite

The assumption: You can send as much data as you want.

Reality: Links have limited capacity, and you share them with everyone else.

What this means: paginate large results, compress payloads, and don't ship data the caller doesn't need.

Fallacy 4: The Network is Secure

The assumption: Data sent over the network is safe.

Reality: Your packets pass through many routers you don't control. Without encryption, anyone can read or modify them.

What this means: encrypt traffic in transit (TLS), authenticate both ends, and validate anything that arrives from the network.

Fallacy 5: Topology Doesn't Change

The assumption: The network structure stays the same.

Reality: Servers come and go. Routes change. Cloud instances are ephemeral.

What this means: don't hardcode addresses. Use DNS or service discovery, and expect peers to appear, move, and disappear while your system is running.

Fallacy 6: There is One Administrator

The assumption: One person or team controls the entire system.

Reality: You depend on your ISP, cloud provider, third-party APIs, DNS providers, CDNs...

What this means: parts of your system will fail in ways you can't fix or even inspect. Design for graceful degradation when a dependency you don't control misbehaves.

Fallacy 7: Transport Cost is Zero

The assumption: Sending data is free.

Reality: Network calls cost money (cloud egress fees), CPU (serializing and deserializing on both ends), and time.

Example: AWS charges $0.09/GB for data leaving their network. A service sending 100TB/month pays $9,000 just for egress.

Fallacy 8: The Network is Homogeneous

The assumption: All parts of the network are the same.

Reality: Different links have different speeds, reliability, and characteristics. The last mile to a mobile phone is very different from a datacenter interconnect.


A.9: Concurrency: Hard Even on One Machine

Before we get deeper into distributed systems, we need to understand concurrency. This matters because:

  1. Concurrency bugs are hard to find on one machine
  2. In distributed systems, they're even harder
  3. You can't use simple locks across machines

Why Concurrency Matters

A web server needs to handle thousands of connections. If each request takes 100ms, and you handle them one at a time, you can only do 10 requests/second. That's terrible.

You need to handle multiple requests concurrently.
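
A minimal sketch of the standard fix, building on the echo server from A.6: spawn one thread per connection, so a slow client no longer blocks everyone else. Real servers cap the thread count with a pool or use async, but this is the simplest version:

use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn handle_client(mut stream: TcpStream) -> std::io::Result<()> {
    let mut buffer = [0u8; 1024];
    let n = stream.read(&mut buffer)?;
    stream.write_all(&buffer[..n]) // echo back whatever arrived
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8888")?;
    for stream in listener.incoming() {
        match stream {
            // Each connection gets its own thread: while one client is
            // being served (or stalling), others are accepted and served.
            Ok(stream) => {
                thread::spawn(move || {
                    if let Err(e) = handle_client(stream) {
                        eprintln!("client error: {}", e);
                    }
                });
            }
            Err(e) => eprintln!("accept failed: {}", e),
        }
    }
    Ok(())
}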

Processes vs. Threads

Process: A running program with its own isolated memory space. Processes can't accidentally corrupt each other's memory.

Thread: A lightweight unit of execution within a process. All threads in a process share the same memory space.

┌────────────────────────────────────────────────────────┐
│                      Process                            │
│  ┌──────────────────────────────────────────────────┐  │
│  │              Shared Memory Space                  │  │
│  │                                                   │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────┐  │  │
│  │  │  Thread 1   │  │  Thread 2   │  │ Thread 3 │  │  │
│  │  │             │  │             │  │          │  │  │
│  │  │ Can access  │  │ Can access  │  │Can access│  │  │
│  │  │ any variable│  │ any variable│  │   any    │  │  │
│  │  │ in shared   │  │ in shared   │  │ variable │  │  │
│  │  │ memory!     │  │ memory!     │  │          │  │  │
│  │  └─────────────┘  └─────────────┘  └──────────┘  │  │
│  │                                                   │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘

Sharing memory is powerful but dangerous.

Race Conditions

A race condition occurs when the result of a program depends on the timing of uncontrollable events (like thread scheduling).

Let's see a classic example. This code is intentionally broken to demonstrate the problem:

use std::thread;

// This is DELIBERATELY BROKEN CODE to demonstrate race conditions
// DON'T DO THIS - Rust actually prevents this at compile time!

fn main() {
    // Imagine we could share this between threads (we can't without synchronization)
    let mut counter = 0;
    
    // Thread 1 wants to increment counter
    // Internally this is: read counter → add 1 → write counter
    
    // Thread 2 also wants to increment counter
    // Same steps: read counter → add 1 → write counter
    
    // If both threads run truly concurrently:
    //
    //   Thread 1              Thread 2
    //   --------              --------
    //   read counter (0)
    //                         read counter (0)
    //   add 1 (result: 1)
    //                         add 1 (result: 1)
    //   write counter (1)
    //                         write counter (1)
    //
    // Final value: 1 (should be 2!)
    // This is a race condition.
}

Both threads read 0, both add 1, both write 1. One increment is lost!

Rust's protection: Rust won't let you write this code. The compiler enforces that mutable data is either owned and mutated by a single thread at a time, or wrapped in a synchronization type like Mutex. The broken sketch above simply doesn't compile.

Mutex: Mutual Exclusion

A Mutex (mutual exclusion) is a lock that ensures only one thread can access data at a time.

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // === Breaking down the types ===
    //
    // Mutex<i32>: A mutex protecting an i32
    //   - Can only access the i32 by "locking" the mutex
    //   - Lock is released automatically when it goes out of scope
    //
    // Arc<T>: "Atomically Reference Counted" smart pointer
    //   - Allows multiple ownership of T
    //   - When all Arcs are dropped, T is dropped
    //   - "Atomic" means the reference counting is thread-safe
    //
    // Why Arc<Mutex<T>>?
    //   - Mutex: provides safe mutable access from multiple threads
    //   - Arc: allows multiple threads to own the mutex
    //   - Together: multiple threads can safely share and modify data
    
    let counter = Arc::new(Mutex::new(0));
    
    let mut handles = vec![];
    
    for i in 0..10 {
        // Clone the Arc to get a new reference to the same Mutex
        // This DOESN'T clone the underlying data, just increments the ref count
        let counter_clone = Arc::clone(&counter);
        
        // spawn() requires the closure to have 'static lifetime
        // Using `move` transfers ownership of `counter_clone` into the thread
        let handle = thread::spawn(move || {
            // === Locking the mutex ===
            //
            // lock() does two things:
            // 1. Blocks until we acquire the lock (waits if another thread holds it)
            // 2. Returns a MutexGuard<i32> that lets us access the data
            //
            // unwrap() is needed because lock() returns a Result
            // It can fail if a thread panicked while holding the lock ("poisoned")
            let mut num = counter_clone.lock().unwrap();
            
            // `num` is a MutexGuard<i32>, which implements DerefMut
            // So `*num` gives us &mut i32
            *num += 1;
            
            println!("Thread {} incremented counter to {}", i, *num);
            
            // When `num` goes out of scope here, the MutexGuard is dropped
            // This automatically releases the lock
        });
        
        handles.push(handle);
    }
    
    // Wait for all threads to complete
    for handle in handles {
        handle.join().unwrap();
    }
    
    // Get the final value
    // Again, we lock to access the data
    println!("\nFinal counter value: {}", *counter.lock().unwrap());
}

Breaking Down the Types

Let's make sure the types are crystal clear:

// Just the value - no sharing possible
let x: i32 = 0;

// Mutex around the value - one thread can access at a time
// But we can't share this Mutex across threads!
let m: Mutex<i32> = Mutex::new(0);

// Rc (Reference Counted) - multiple ownership, but NOT thread-safe
// let r: Rc<Mutex<i32>> = Rc::new(Mutex::new(0));  // Can't send to threads!

// Arc (Atomic Reference Counted) - multiple ownership, IS thread-safe
let a: Arc<Mutex<i32>> = Arc::new(Mutex::new(0));
// NOW we can clone `a` and send clones to different threads

Deadlock

Deadlock occurs when two or more threads are each waiting for the other to release a resource.

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let lock_a = Arc::new(Mutex::new(0));
    let lock_b = Arc::new(Mutex::new(0));
    
    let lock_a_clone = Arc::clone(&lock_a);
    let lock_b_clone = Arc::clone(&lock_b);
    
    // Thread 1: Locks A, then tries to lock B
    let handle1 = thread::spawn(move || {
        let _a = lock_a.lock().unwrap();
        println!("Thread 1 has lock A, trying to get lock B...");
        thread::sleep(std::time::Duration::from_millis(100));  // Give thread 2 time
        let _b = lock_b.lock().unwrap();  // WAITS FOREVER - thread 2 has B
        println!("Thread 1 has both locks");
    });
    
    // Thread 2: Locks B, then tries to lock A
    let handle2 = thread::spawn(move || {
        let _b = lock_b_clone.lock().unwrap();
        println!("Thread 2 has lock B, trying to get lock A...");
        thread::sleep(std::time::Duration::from_millis(100));
        let _a = lock_a_clone.lock().unwrap();  // WAITS FOREVER - thread 1 has A
        println!("Thread 2 has both locks");
    });
    
    // This program will hang forever!
    // Thread 1: Holds A, waiting for B
    // Thread 2: Holds B, waiting for A
    
    handle1.join().unwrap();
    handle2.join().unwrap();
}

Avoiding deadlock:

  1. Always acquire locks in the same order (see the sketch after this list)
  2. Use timeout-based lock acquisition
  3. Design to avoid needing multiple locks
  4. Use lock-free data structures when possible
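
Here is rule 1 from the list above as code: the deadlocking example rewritten so both threads take lock A first, then lock B. With one global lock order, the circular wait can never form.

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let lock_a = Arc::new(Mutex::new(0));
    let lock_b = Arc::new(Mutex::new(0));

    let (a1, b1) = (Arc::clone(&lock_a), Arc::clone(&lock_b));
    let handle1 = thread::spawn(move || {
        let _a = a1.lock().unwrap(); // A first...
        let _b = b1.lock().unwrap(); // ...then B
        println!("Thread 1 has both locks");
    });

    let (a2, b2) = (Arc::clone(&lock_a), Arc::clone(&lock_b));
    let handle2 = thread::spawn(move || {
        // Same order as thread 1: A first, then B. Even if thread 1
        // holds A right now, thread 2 simply waits; no cycle is possible.
        let _a = a2.lock().unwrap();
        let _b = b2.lock().unwrap();
        println!("Thread 2 has both locks");
    });

    handle1.join().unwrap();
    handle2.join().unwrap();
}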

Why This Matters for Distributed Systems

In a distributed system, you can't use Mutex across machines. Each server has its own memory. There's no shared Arc<Mutex<T>>.

But the same problems exist! Imagine two database servers:

Server A reads balance: 100
Server B reads balance: 100
Server A writes balance: 50
Server B writes balance: 70  ← Server A's write is lost!

This is the same race condition, but across machines. Solving it requires distributed coordination mechanisms: consensus algorithms, distributed transactions, etc. You'll learn these in later phases.


A.10: Serialization: Turning Data into Bytes

When machines communicate, they need to agree on how to represent data as bytes.

The Problem

Your Rust program has types:

struct User {
    name: String,
    age: u32,
    friends: Vec<String>,
}

The network only carries bytes. How do you send this struct to another machine (which might be running Python or Go)?

Serialization and Deserialization

Serialization: Converting data structures to bytes (also called encoding, marshaling)
Deserialization: Converting bytes back to data structures (also called decoding, unmarshaling)

Rust struct                          Bytes                           Struct again (maybe in another language)
User { name: "Alice", ... }  →  [123, 34, 110, 97, ...]  →  User { name: "Alice", ... }
         serialize                                                 deserialize

Serde: Rust's Serialization Framework

Serde is the de facto standard for serialization in Rust. It's a framework that lets you support many formats (JSON, YAML, TOML, MessagePack, etc.) with the same code.

# Cargo.toml
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"  # JSON format
# bincode = "1.3"   # Binary format (faster, smaller)
# toml = "0.8"      # TOML format

Basic Usage

use serde::{Serialize, Deserialize};

// === The derive macros ===
//
// #[derive(Serialize, Deserialize)] automatically implements
// the Serialize and Deserialize traits for your struct.
//
// This generates code that knows how to convert your struct
// to/from various formats (JSON, binary, etc.)
#[derive(Debug, Serialize, Deserialize)]
struct User {
    name: String,
    age: u32,
    email: Option<String>,  // Optional fields become null in JSON
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let user = User {
        name: "Alice".to_string(),
        age: 30,
        email: Some("alice@example.com".to_string()),
    };
    
    // === Serialize to JSON string ===
    let json_string: String = serde_json::to_string(&user)?;
    println!("JSON string: {}", json_string);
    // Output: {"name":"Alice","age":30,"email":"alice@example.com"}
    
    // === Serialize to "pretty" JSON (with indentation) ===
    let pretty_json: String = serde_json::to_string_pretty(&user)?;
    println!("\nPretty JSON:\n{}", pretty_json);
    
    // === Serialize to bytes (Vec<u8>) ===
    let json_bytes: Vec<u8> = serde_json::to_vec(&user)?;
    println!("\nBytes: {:?}", &json_bytes[..20]);  // First 20 bytes
    
    // === Deserialize from JSON string ===
    let json_str = r#"{"name":"Bob","age":25,"email":null}"#;
    let user2: User = serde_json::from_str(json_str)?;
    println!("\nDeserialized: {:?}", user2);
    
    // === Deserialize from bytes ===
    let user3: User = serde_json::from_slice(&json_bytes)?;
    println!("From bytes: {:?}", user3);
    
    Ok(())
}

Serde Attributes for Customization

use serde::{Serialize, Deserialize};

#[derive(Debug, Serialize, Deserialize)]
struct Message {
    // Rename field in JSON
    #[serde(rename = "messageId")]
    id: u64,
    
    // Use different name for serializing vs deserializing
    #[serde(rename(serialize = "msg", deserialize = "message"))]
    content: String,
    
    // Skip this field entirely
    #[serde(skip)]
    internal_state: u32,
    
    // Use default value if missing during deserialization
    #[serde(default)]
    retries: u32,
    
    // Custom default value
    #[serde(default = "default_priority")]
    priority: u32,
}

fn default_priority() -> u32 {
    5
}

Binary vs Text Formats

JSON (serde_json): human-readable text, understood by virtually every language and easy to debug, but verbose and relatively slow to parse.

Bincode: a compact Rust-centric binary format, fast to encode and decode, but not human-readable and not intended for cross-language use.

// With bincode (add to Cargo.toml: bincode = "1.3")

use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize, Debug)]
struct Data {
    values: Vec<u64>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = Data {
        values: (0..1000).collect(),
    };
    
    // JSON serialization
    let json = serde_json::to_vec(&data)?;
    println!("JSON size: {} bytes", json.len());
    
    // Bincode serialization
    let binary = bincode::serialize(&data)?;
    println!("Bincode size: {} bytes", binary.len());
    
    // For this particular data, bincode is actually LARGER: small
    // numbers are short as ASCII digits but always 8 bytes as raw u64s.
    // JSON:    ~3,900 bytes (ASCII digits plus commas)
    // Bincode: ~8,008 bytes (8 bytes per u64 + 8-byte length)
    // Bincode wins on parse speed, and on strings and large values.
    
    Ok(())
}

Why Serialization Matters for Distributed Systems

  1. All network communication involves serialization. Every RPC call, every message to a queue, every HTTP request serializes data.

  2. Format choice affects performance. In high-throughput systems, serialization can be a bottleneck.

  3. Schema evolution is tricky. What happens when you add a field? Remove one? Old code needs to read new data and vice versa.

  4. Cross-language compatibility. If you're talking to a Python service, you need a format both understand. JSON and Protocol Buffers are common choices.


B.2: Latency vs. Throughput

Two critical metrics for distributed systems.

Definitions

Latency: How long it takes to complete one operation.

Throughput: How many operations you complete in a given time.

The Highway Analogy

Imagine a highway between two cities, 100 miles apart. Cars travel at 50 mph.

Latency = Time for one car to travel from City A to City B = 2 hours

Throughput = Cars arriving at City B per hour

Now imagine we go from one lane to four. If one lane delivers 100 cars per hour, four lanes deliver 400.

We quadrupled throughput! But latency is still 2 hours. More lanes don't make individual cars go faster.

Key insight: You can often improve throughput without improving latency, and vice versa.

In Distributed Systems

Strategy                   Affects latency?        Affects throughput?
Put server closer to user  ✓ Reduces               Minimal
Add more servers           Minimal                 ✓ Increases
Add caching                ✓ Reduces (cache hits)  ✓ Increases
Batch operations           Usually increases       ✓ Increases
Use faster hardware        ✓ Reduces               ✓ Increases

Latency-Throughput Tradeoffs

Sometimes you must choose:

Batching: Instead of processing requests one at a time, wait until you have 100 and process them together. Per-request overhead drops, so throughput rises, but the first request in the batch sits waiting, so latency rises.

Replication: Write data to 3 servers before acknowledging. Durability rises, but every write now takes as long as the slowest replica, so latency rises.

Measuring Latency: Percentiles

Don't use averages. They hide problems.

Example: 99 requests take 10ms. 1 request takes 10,000ms. The average is about 110ms, a number that describes no actual request: 99% of users saw 10ms, and one user saw 10 full seconds.

Use percentiles instead: p50 (the median), p95, p99, and p99.9 are the latencies that 50%, 95%, 99%, and 99.9% of requests beat. In the example, p50 = 10ms while the maximum is 10,000ms.

Why high percentiles matter:

If a user's page load requires 10 backend calls, and each has p99 = 1 second: the chance that at least one call lands in the slow tail is 1 - 0.99^10, roughly 10%. About one page load in ten is slow, even though every individual service looks fine 99% of the time.

This is why Amazon famously targets p99.9 latency for critical services.
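
A minimal way to compute percentiles from a batch of recorded latencies, using the nearest-rank method (real systems usually keep histograms, e.g. HDRHistogram, rather than storing every sample):

// Nearest-rank percentile: sort the samples, then index into them.
fn percentile(sorted_ms: &[u64], p: f64) -> u64 {
    let rank = ((p / 100.0) * sorted_ms.len() as f64).ceil() as usize;
    sorted_ms[rank.saturating_sub(1).min(sorted_ms.len() - 1)]
}

fn main() {
    // 99 fast requests and one terrible one, as in the example above.
    let mut latencies_ms: Vec<u64> = vec![10; 99];
    latencies_ms.push(10_000);
    latencies_ms.sort_unstable();

    let n = latencies_ms.len() as u64;
    let avg = latencies_ms.iter().sum::<u64>() / n;
    println!("average = {} ms (describes no real request)", avg);
    println!("p50  = {} ms", percentile(&latencies_ms, 50.0));
    println!("p99  = {} ms", percentile(&latencies_ms, 99.0));
    println!("p100 = {} ms", percentile(&latencies_ms, 100.0));
}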


B.3: Time in Distributed Systems

The Problem with Clocks

On a single computer, time is simple. There's one clock, events happen in sequence.

In a distributed system, each computer has its own clock. These clocks:

  1. Drift: Even the best quartz clocks drift ~1 second per month. Cheap ones drift more.

  2. Jump: When a computer synchronizes with NTP (Network Time Protocol), its clock might jump forward or even backward.

  3. Disagree: Right now, two servers in your datacenter probably show slightly different times.
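
This distrust of wall clocks is baked into Rust's standard library: SystemTime::duration_since returns a Result precisely because the wall clock can jump backward, while Instant is monotonic and safe for measuring elapsed time on one machine (and meaningless across machines):

use std::time::{Duration, Instant, SystemTime};

fn main() {
    // Wall-clock time: can jump backward (NTP sync, manual changes),
    // so comparing two readings can legitimately fail.
    let t1 = SystemTime::now();
    let t2 = SystemTime::now();
    match t2.duration_since(t1) {
        Ok(elapsed) => println!("wall clock advanced {:?}", elapsed),
        Err(e) => println!("wall clock went BACKWARD by {:?}", e.duration()),
    }

    // Monotonic time: guaranteed never to go backward, but only
    // meaningful within this one machine (and this one boot).
    let start = Instant::now();
    std::thread::sleep(Duration::from_millis(10));
    println!("monotonic elapsed: {:?}", start.elapsed());
}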

Why This Matters

Scenario: A distributed database. Two servers receive writes:

Server A (clock: 10:00:00.000): SET user.email = "alice@new.com"
Server B (clock: 10:00:00.001): SET user.email = "alice@old.com"

Which write wins? If we pick "latest timestamp wins," Server B's write wins. But what if Server B's clock is 5ms fast? Server A's write was actually later!

Result: User sets their email to "new.com", but the system stores "old.com". User is very confused.

Physical Time vs. Logical Time

Physical time: What the clock on the wall says.

Logical time: A way to order events based on causality, not wall clocks.

This is what Leslie Lamport figured out in 1978.


B.4: Lamport Clocks

The Insight

Leslie Lamport realized:

We don't need to know when events happened. We just need to know the order in which they could have affected each other.

If I send you a message, my sending definitely happened before your receiving. That's causality, and it doesn't require synchronized clocks.

The "Happened-Before" Relation

Lamport defined a partial ordering called "happened-before" (written →):

Event A happened-before Event B (A → B) if:

  1. Same process, sequential order: A and B are events in the same process, and A occurred before B.

  2. Message send/receive: A is the sending of a message, and B is the receiving of that same message.

  3. Transitivity: If A → B and B → C, then A → C.

If neither A → B nor B → A, the events are "concurrent" (written A || B). Neither could have influenced the other.

The Lamport Clock Algorithm

Each process maintains a logical clock: just a counter.

Rules:

  1. Before any event: Increment your clock.
  2. When sending a message: Include your current clock value.
  3. When receiving a message with timestamp t: Set your clock to max(your_clock, t) + 1.

Let's trace through an example:

Process P1 (clock=0)              Process P2 (clock=0)
        │                                 │
        │ [Internal event]                │
        │ clock = 0 + 1 = 1               │
        │                                 │
        │ [Send message to P2]            │
        │ clock = 1 + 1 = 2               │
        │ msg contains timestamp 2        │
        │─────────────────────────────────>
        │                                 │
        │                           [Receive message with ts=2]
        │                           clock = max(0, 2) + 1 = 3
        │                                 │
        │                           [Internal event]
        │                           clock = 3 + 1 = 4
        │                                 │
        │                           [Send message to P1]
        │                           clock = 4 + 1 = 5
        │<─────────────────────────────────
        │                           msg contains timestamp 5
        │                                 │
[Receive message with ts=5]               │
clock = max(2, 5) + 1 = 6                 │
        │                                 │

The Guarantee

If A → B, then LC(A) < LC(B).

If A happened-before B, then A's Lamport clock is strictly less than B's.

This means: sorting events by Lamport clock gives you an ordering consistent with causality.

The Limitation

The converse is NOT true.

If LC(A) < LC(B), we CANNOT conclude A → B. The events might be concurrent.

Example:

P1: Event X with LC = 5
P2: Event Y with LC = 3

We can rule out one case: X → Y is impossible, because that would require LC(X) < LC(Y), and 5 > 3. But did Y happen-before X, or are X and Y concurrent?

We can't tell from Lamport clocks alone!

B.4: Lamport Clocks (continued)

Implementation in Rust

/// A Lamport clock for logical time ordering.
/// 
/// # Guarantees
/// If event A happened-before event B, then A's timestamp < B's timestamp.
/// 
/// # Non-guarantees
/// If A's timestamp < B's timestamp, we cannot conclude A happened-before B.
/// The events might be concurrent.
#[derive(Debug, Clone)]
pub struct LamportClock {
    // The current logical time
    // Using u64 gives us ~584 years at 1 billion events/second
    time: u64,
}

impl LamportClock {
    /// Create a new clock starting at 0.
    pub fn new() -> Self {
        LamportClock { time: 0 }
    }
    
    /// Increment the clock before any local event.
    /// Returns the new timestamp.
    /// 
    /// Call this before any local operation that you want to timestamp:
    /// - Writing to a database
    /// - Sending a message
    /// - Any state change
    pub fn tick(&mut self) -> u64 {
        self.time += 1;
        self.time
    }
    
    /// Get a timestamp to include in an outgoing message.
    /// This increments the clock (sending is an event).
    pub fn send_timestamp(&mut self) -> u64 {
        self.tick()
    }
    
    /// Update the clock when receiving a message.
    /// 
    /// This merges our local time with the sender's time,
    /// ensuring our clock is at least as high as the sender's.
    /// 
    /// # Arguments
    /// * `received_timestamp` - The timestamp from the received message
    /// 
    /// # Returns
    /// The new local timestamp after receiving
    pub fn receive(&mut self, received_timestamp: u64) -> u64 {
        // Take the max of our time and received time
        // Then increment (receiving is an event)
        self.time = self.time.max(received_timestamp) + 1;
        self.time
    }
    
    /// Get current time without incrementing.
    /// Useful for debugging/logging.
    pub fn current_time(&self) -> u64 {
        self.time
    }
}

// Let's verify the behavior
fn main() {
    println!("=== Lamport Clock Demo ===\n");
    
    let mut clock_p1 = LamportClock::new();
    let mut clock_p2 = LamportClock::new();
    
    // P1 does an internal event
    let t1 = clock_p1.tick();
    println!("P1 internal event: timestamp = {}", t1);  // 1
    
    // P1 sends a message
    let send_ts = clock_p1.send_timestamp();
    println!("P1 sends message with timestamp = {}", send_ts);  // 2
    
    // P2 receives the message
    let recv_ts = clock_p2.receive(send_ts);
    println!("P2 receives message, new timestamp = {}", recv_ts);  // 3
    
    // P2 does an internal event
    let t2 = clock_p2.tick();
    println!("P2 internal event: timestamp = {}", t2);  // 4
    
    // Meanwhile, P1 does another event (hasn't heard from P2 yet)
    let t3 = clock_p1.tick();
    println!("P1 internal event (no sync): timestamp = {}", t3);  // 3
    
    // P2 sends a message
    let send_ts_2 = clock_p2.send_timestamp();
    println!("P2 sends message with timestamp = {}", send_ts_2);  // 5
    
    // P1 receives it
    let recv_ts_2 = clock_p1.receive(send_ts_2);
    println!("P1 receives message, new timestamp = {}", recv_ts_2);  // 6
    
    println!("\n=== Observations ===");
    println!("Notice how P1's clock jumped from 3 to 6 after receiving P2's message.");
    println!("This ensures P1's subsequent events have timestamps > P2's message.");
}

Output:

=== Lamport Clock Demo ===

P1 internal event: timestamp = 1
P1 sends message with timestamp = 2
P2 receives message, new timestamp = 3
P2 internal event: timestamp = 4
P1 internal event (no sync): timestamp = 3
P2 sends message with timestamp = 5
P1 receives message, new timestamp = 6

=== Observations ===
Notice how P1's clock jumped from 3 to 6 after receiving P2's message.
This ensures P1's subsequent events have timestamps > P2's message.

B.5: Vector Clocks

The Problem with Lamport Clocks

Lamport clocks guarantee: If A → B, then LC(A) < LC(B).

But given two timestamps, we can't determine if they're causally related or concurrent.

Example:

Event X has timestamp 5
Event Y has timestamp 7

Possibilities:
- X → Y (X happened before Y)
- Y → X (Y happened before X)... wait, that's impossible if LC(X) < LC(Y)
- X || Y (concurrent)

Actually, we know Y didn't happen before X (since 7 > 5).
But we can't distinguish X → Y from X || Y!

This matters for conflict detection. If two writes are concurrent, we have a conflict. If one happened-before the other, the later one wins.

Vector Clocks: Capturing Full Causality

Idea: Instead of one counter, each process maintains a vector of counters: one for each process in the system.

If there are N processes, each maintains: [clock_for_P0, clock_for_P1, ..., clock_for_P(N-1)]

Rules:

  1. Initialize: All zeros: [0, 0, ..., 0]

  2. Before any event: Increment YOUR entry: VC[my_id] += 1

  3. When sending: Include the entire vector.

  4. When receiving vector V:

    • For each entry, take the max: VC[i] = max(VC[i], V[i])
    • Then increment your entry: VC[my_id] += 1

Let's trace through:

Process P0: VC=[0,0,0]     Process P1: VC=[0,0,0]     Process P2: VC=[0,0,0]
        │                         │                         │
[Event A]                         │                         │
VC=[1,0,0]                        │                         │
        │                         │                         │
[Send to P1]─────────────────────>│                         │
VC=[2,0,0]                        │                         │
        │                   [Receive]                       │
        │                   VC = max([0,0,0], [2,0,0])      │
        │                      = [2,0,0]                    │
        │                   Then VC[1] += 1                 │
        │                   VC = [2,1,0]                    │
        │                         │                         │
        │                   [Event B]                       │
        │                   VC=[2,2,0]                      │
        │                         │                         │
        │                   [Send to P2]───────────────────>│
        │                   VC=[2,3,0]                      │
        │                         │                   [Receive]
        │                         │                   VC=[2,3,1]
        │                         │                         │
[Event C]                         │                   [Event D]
VC=[3,0,0]                        │                   VC=[2,3,2]

Comparing Vector Clocks

Given two vectors V and W, we can determine their causal relationship:

V = W         : All entries are equal → Same logical time

V < W         : All entries of V ≤ corresponding entries of W,
                AND at least one is strictly less
                → V happened-before W

V > W         : All entries of V ≥ corresponding entries of W,
                AND at least one is strictly greater
                → W happened-before V

Otherwise     : Some entries of V < W, some entries of V > W
                → V and W are CONCURRENT (neither happened-before the other)

Examples:

[1,2,0] vs [1,3,1]
  1≤1 ✓  2≤3 ✓  0≤1 ✓
  At least one strictly less? Yes (2<3 and 0<1)
  → [1,2,0] < [1,3,1] (first happened-before second)

[2,0,0] vs [1,2,1]
  2≤1? NO (2>1)
  1≤2? Yes
  → Neither all ≤ nor all ≥
  → CONCURRENT!

[1,2,3] vs [1,2,3]
  All equal
  → EQUAL (same logical time)

B.5: Vector Clocks (continued)

Implementation in Rust

use std::cmp::Ordering;

/// The causal relationship between two vector clock timestamps.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CausalOrdering {
    /// First timestamp happened-before the second
    Before,
    /// Second timestamp happened-before the first
    After,
    /// Neither happened-before the other (concurrent)
    Concurrent,
    /// Timestamps are identical
    Equal,
}

/// A vector clock for capturing causality in distributed systems.
/// 
/// Unlike Lamport clocks, vector clocks can determine whether two events
/// are causally related or concurrent.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VectorClock {
    /// This node's ID (index into the vector)
    node_id: usize,
    /// The vector of logical times, one per node
    vector: Vec<u64>,
}

impl VectorClock {
    /// Create a new vector clock for a node in an N-node system.
    /// 
    /// # Arguments
    /// * `node_id` - This node's ID (0 to num_nodes-1)
    /// * `num_nodes` - Total number of nodes in the system
    pub fn new(node_id: usize, num_nodes: usize) -> Self {
        assert!(node_id < num_nodes, "node_id must be < num_nodes");
        VectorClock {
            node_id,
            vector: vec![0; num_nodes],
        }
    }
    
    /// Increment this node's entry before any local event.
    /// Returns a copy of the new vector.
    pub fn tick(&mut self) -> Vec<u64> {
        self.vector[self.node_id] += 1;
        self.vector.clone()
    }
    
    /// Get a timestamp for an outgoing message.
    /// Increments the clock and returns a copy of the vector.
    pub fn send_timestamp(&mut self) -> Vec<u64> {
        self.tick()
    }
    
    /// Update the clock when receiving a message.
    /// 
    /// Merges the received vector with our local vector (taking max of each entry),
    /// then increments our own entry.
    pub fn receive(&mut self, received: &[u64]) -> Vec<u64> {
        assert_eq!(
            received.len(),
            self.vector.len(),
            "Vector lengths must match"
        );
        
        // Merge: take max of each entry
        for i in 0..self.vector.len() {
            self.vector[i] = self.vector[i].max(received[i]);
        }
        
        // Receiving is an event, so increment our entry
        self.vector[self.node_id] += 1;
        self.vector.clone()
    }
    
    /// Compare two vector clock timestamps to determine causal ordering.
    /// 
    /// # Returns
    /// - `Before` if v1 happened-before v2
    /// - `After` if v2 happened-before v1
    /// - `Concurrent` if neither happened-before the other
    /// - `Equal` if they're the same timestamp
    pub fn compare(v1: &[u64], v2: &[u64]) -> CausalOrdering {
        assert_eq!(v1.len(), v2.len(), "Vectors must have same length");
        
        let mut has_less = false;    // Is any v1[i] < v2[i]?
        let mut has_greater = false; // Is any v1[i] > v2[i]?
        
        for (a, b) in v1.iter().zip(v2.iter()) {
            match a.cmp(b) {
                Ordering::Less => has_less = true,
                Ordering::Greater => has_greater = true,
                Ordering::Equal => {}
            }
        }
        
        match (has_less, has_greater) {
            (false, false) => CausalOrdering::Equal,      // All equal
            (true, false) => CausalOrdering::Before,      // v1 < v2
            (false, true) => CausalOrdering::After,       // v1 > v2
            (true, true) => CausalOrdering::Concurrent,   // Neither
        }
    }
    
    /// Get current vector without incrementing.
    pub fn current(&self) -> &[u64] {
        &self.vector
    }
}

fn main() {
    println!("=== Vector Clock Demo ===\n");
    
    // Three-node system
    let mut vc0 = VectorClock::new(0, 3);
    let mut vc1 = VectorClock::new(1, 3);
    let mut vc2 = VectorClock::new(2, 3);
    
    // Node 0 does an event
    let t_a = vc0.tick();
    println!("Node 0 event A: {:?}", t_a);  // [1, 0, 0]
    
    // Node 0 sends to Node 1
    let msg1 = vc0.send_timestamp();
    println!("Node 0 sends message: {:?}", msg1);  // [2, 0, 0]
    
    // Node 1 receives
    let t_b = vc1.receive(&msg1);
    println!("Node 1 receives, event B: {:?}", t_b);  // [2, 1, 0]
    
    // Node 2 does an independent event
    let t_c = vc2.tick();
    println!("Node 2 event C (independent): {:?}", t_c);  // [0, 0, 1]
    
    // Now let's compare timestamps!
    println!("\n=== Comparisons ===");
    
    println!("A {:?} vs B {:?}: {:?}", 
        t_a, t_b, VectorClock::compare(&t_a, &t_b));  // Before
    
    println!("B {:?} vs C {:?}: {:?}",
        t_b, t_c, VectorClock::compare(&t_b, &t_c));  // Concurrent!
    
    println!("A {:?} vs C {:?}: {:?}",
        t_a, t_c, VectorClock::compare(&t_a, &t_c));  // Concurrent!
    
    // Node 1 sends to Node 2
    let msg2 = vc1.send_timestamp();
    println!("\nNode 1 sends message: {:?}", msg2);  // [2, 2, 0]
    
    let t_d = vc2.receive(&msg2);
    println!("Node 2 receives, event D: {:?}", t_d);  // [2, 2, 2]
    
    // Now C and D are on the same node, so:
    println!("\nC {:?} vs D {:?}: {:?}",
        t_c, t_d, VectorClock::compare(&t_c, &t_d));  // Before (C → D)
    
    // And B → D through the message
    println!("B {:?} vs D {:?}: {:?}",
        t_b, t_d, VectorClock::compare(&t_b, &t_d));  // Before (B → D)
}

Output:

=== Vector Clock Demo ===

Node 0 event A: [1, 0, 0]
Node 0 sends message: [2, 0, 0]
Node 1 receives, event B: [2, 1, 0]
Node 2 event C (independent): [0, 0, 1]

=== Comparisons ===
A [1, 0, 0] vs B [2, 1, 0]: Before
B [2, 1, 0] vs C [0, 0, 1]: Concurrent
A [1, 0, 0] vs C [0, 0, 1]: Concurrent

Node 1 sends message: [2, 2, 0]
Node 2 receives, event D: [2, 2, 2]

C [0, 0, 1] vs D [2, 2, 2]: Before
B [2, 1, 0] vs D [2, 2, 2]: Before

Why Vector Clocks Matter

Vector clocks let us answer the crucial question: Are these events concurrent?

Use case: Distributed databases

Two servers receive writes for the same key:

Server A: SET user.name = "Alice"  with VC = [2, 1, 0]
Server B: SET user.name = "Bob"    with VC = [1, 3, 0]

Compare: [2,1,0] vs [1,3,0]
  First entries: 2 > 1. Second entries: 1 < 3.
  Neither vector dominates the other → CONCURRENT.

We now know there's a conflict that needs resolution (user intervention, merge function, etc.).

If instead:

Server A: SET user.name = "Alice"  with VC = [2, 1, 0]
Server B: SET user.name = "Bob"    with VC = [3, 2, 0]

Compare: [2,1,0] vs [3,2,0]
  2 ≤ 3, 1 ≤ 2, 0 ≤ 0, with at least one strictly less
  → [2,1,0] happened-before [3,2,0].

Server B's write is strictly later: no conflict, just use "Bob".
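
Putting it together: here's a sketch of a resolver built on the VectorClock::compare and CausalOrdering types from the implementation above. The resolve function and its policy are illustrative assumptions, not a real store's API.

// Pick a winner, or flag a conflict, for two competing writes.
fn resolve(value_a: &str, vc_a: &[u64], value_b: &str, vc_b: &[u64]) -> String {
    match VectorClock::compare(vc_a, vc_b) {
        // A happened-before B: B already saw A's write, so B supersedes it.
        CausalOrdering::Before => value_b.to_string(),
        CausalOrdering::After | CausalOrdering::Equal => value_a.to_string(),
        CausalOrdering::Concurrent => {
            // Real systems keep both versions ("siblings") for later
            // resolution; here we just flag the conflict.
            format!("CONFLICT({value_a}, {value_b})")
        }
    }
}

fn main() {
    // The two scenarios above:
    println!("{}", resolve("Alice", &[2, 1, 0], "Bob", &[1, 3, 0])); // CONFLICT(Alice, Bob)
    println!("{}", resolve("Alice", &[2, 1, 0], "Bob", &[3, 2, 0])); // Bob
}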


B.6: Practical Applications

Application 1: Distributed Databases

Vector clocks (or similar mechanisms) are used in real databases:

  • Amazon's Dynamo attached vector clocks to each version of a key to detect conflicting writes
  • Riak (Dynamo-inspired) exposed causal version metadata to clients for the same purpose
  • Cassandra chose last-write-wins on wall-clock timestamps instead, accepting exactly the pitfalls from B.3

When conflicts are detected, these systems either:

  • Return all conflicting versions ("siblings") to the client, which must merge them, or
  • Apply an automatic policy (last-write-wins, or a CRDT-style merge)

Application 2: Distributed Logging

When debugging distributed systems, you need to correlate logs from multiple services.

Problem: Physical timestamps might be out of sync. Service B's log entry might show an earlier timestamp than Service A, even though A sent a message to B.

Solution: Include Lamport timestamps in logs. Sort by Lamport timestamp for a causally-consistent view.
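
A small sketch of that idea. LogEntry and its fields are illustrative assumptions.

// Sort cross-service log entries by Lamport timestamp, not wall-clock time.
struct LogEntry {
    service: &'static str,
    wall_clock_ms: u64, // each service's own (possibly skewed) clock
    lamport: u64,
    message: &'static str,
}

fn main() {
    let mut entries = vec![
        // Service B's clock runs behind, so by wall-clock time the
        // receive appears to happen BEFORE the send that caused it.
        LogEntry { service: "B", wall_clock_ms: 995,   lamport: 3, message: "received request" },
        LogEntry { service: "A", wall_clock_ms: 1_000, lamport: 2, message: "sent request" },
    ];

    entries.sort_by_key(|entry| entry.lamport);
    for entry in &entries {
        println!("[lc={} wall={}ms] {}: {}",
            entry.lamport, entry.wall_clock_ms, entry.service, entry.message);
    }
    // Sorted by Lamport timestamp, "sent request" correctly precedes
    // "received request", matching causality.
}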

Application 3: CRDTs (Conflict-Free Replicated Data Types)

CRDTs are data structures designed to be replicated across servers and merged without conflicts. They use ideas from vector clocks.


B.6: Practical Applications (continued)

Application 3: CRDTs (Conflict-Free Replicated Data Types) - Implementation

Example: Grow-only Counter (G-Counter)

/// A Grow-only Counter that can be incremented on any node
/// and merged without conflicts.
/// 
/// Each node only increments its own entry. Merging takes
/// the max of each entry, so no increments are ever lost.
#[derive(Debug, Clone)]
struct GCounter {
    node_id: usize,
    counts: Vec<u64>,
}

impl GCounter {
    fn new(node_id: usize, num_nodes: usize) -> Self {
        GCounter {
            node_id,
            counts: vec![0; num_nodes],
        }
    }
    
    /// Increment the counter (only our node's entry).
    fn increment(&mut self) {
        self.counts[self.node_id] += 1;
    }
    
    /// Get the total count across all nodes.
    fn value(&self) -> u64 {
        self.counts.iter().sum()
    }
    
    /// Merge with another counter (e.g., received from another node).
    /// Takes the max of each entry: no increments are lost!
    fn merge(&mut self, other: &GCounter) {
        for i in 0..self.counts.len() {
            self.counts[i] = self.counts[i].max(other.counts[i]);
        }
    }
}

fn main() {
    println!("=== G-Counter Demo ===\n");
    
    let mut counter_a = GCounter::new(0, 3);  // Node 0
    let mut counter_b = GCounter::new(1, 3);  // Node 1
    
    // Node A increments twice
    counter_a.increment();
    counter_a.increment();
    println!("Node A after 2 increments: counts={:?}, value={}", 
        counter_a.counts, counter_a.value());  // [2, 0, 0], value=2
    
    // Node B increments three times
    counter_b.increment();
    counter_b.increment();
    counter_b.increment();
    println!("Node B after 3 increments: counts={:?}, value={}", 
        counter_b.counts, counter_b.value());  // [0, 3, 0], value=3
    
    // Now they sync up (merge in both directions)
    counter_a.merge(&counter_b);
    counter_b.merge(&counter_a);
    
    println!("\nAfter merge:");
    println!("Node A: counts={:?}, value={}", 
        counter_a.counts, counter_a.value());  // [2, 3, 0], value=5
    println!("Node B: counts={:?}, value={}", 
        counter_b.counts, counter_b.value());  // [2, 3, 0], value=5
    
    println!("\nBoth nodes agree: total = 5 (no increments lost!)");
}

Output:

=== G-Counter Demo ===

Node A after 2 increments: counts=[2, 0, 0], value=2
Node B after 3 increments: counts=[0, 3, 0], value=3

After merge:
Node A: counts=[2, 3, 0], value=5
Node B: counts=[2, 3, 0], value=5

Both nodes agree: total = 5 (no increments lost!)

═══════════════════════════════════════════════════════════════

SUMMARY

═══════════════════════════════════════════════════════════════

What You've Learned

Networking Fundamentals

Socket Programming

Failure Handling

The Eight Fallacies

  1. The network is reliable ❌
  2. Latency is zero ❌
  3. Bandwidth is infinite ❌
  4. The network is secure ❌
  5. Topology doesn't change ❌
  6. There is one administrator ❌
  7. Transport cost is zero ❌
  8. The network is homogeneous ❌

Concurrency

Serialization

  • Every network message is serialized; the format (JSON, bincode, Protocol Buffers) affects size, speed, and cross-language compatibility
  • Schema evolution needs care: old code must read new data and vice versa

Time and Ordering

  • Physical clocks drift, jump, and disagree; don't order events by wall-clock timestamps
  • Lamport clocks: if A → B then LC(A) < LC(B), but the converse doesn't hold
  • Vector clocks additionally detect concurrency, which is what conflict detection needs
  • CRDTs (like the G-Counter) merge concurrent updates without losing any