Concurrency vs Parallelism

Threading Models: Python GIL vs Goroutines

Concurrency vs Parallelism: The Core Difference

The Famous Explanation

Rob Pike (Go creator): "Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once."

Concurrency

Definition: Multiple tasks making progress by interleaving execution. Tasks may NOT run simultaneously.

Single CPU Core - Concurrency (Time Slicing):
Time →
Task A: ████░░░░░░░░████░░░░░░░░████
Task B: ░░░░████░░░░░░░░████░░░░░░░░
Task C: ░░░░░░░░████░░░░░░░░████░░░░

Tasks take turns, switching rapidly (context switching).
From user's perspective, all tasks appear to run "at the same time."
Reality: Only one task executes at any instant.

Parallelism

Definition: Multiple tasks running simultaneously on different CPU cores. True simultaneous execution.

4 CPU Cores - Parallelism (Simultaneous Execution):
Time →
Core 1: ████████████████████████████
Core 2: ████████████████████████████
Core 3: ████████████████████████████
Core 4: ████████████████████████████

All tasks execute at the exact same instant.
Requires multiple physical CPU cores.

Key Insight

Concurrency is about structure (how you design your program).
Parallelism is about execution (how it runs on hardware).

You can have:

  • Concurrency without parallelism (one core, interleaved tasks)
  • Parallelism without concurrency (one task split across cores, e.g. data parallelism)
  • Both (many tasks on many cores, the typical Go program)
  • Neither (plain sequential code)

Threading Models

Three Types of Threads

1. OS Threads (1:1 Model)

Examples: Java threads, C++ std::thread, Python threads (with limitations)

Application Thread 1 ←→ OS Thread 1 (Kernel scheduled)
Application Thread 2 ←→ OS Thread 2 (Kernel scheduled)
Application Thread 3 ←→ OS Thread 3 (Kernel scheduled)

- Each application thread maps to one OS thread
- OS kernel schedules threads across CPU cores
- Heavy: ~1-2MB stack per thread
- Context switch: ~1-2 microseconds (kernel involvement)
- Limit: ~thousands of threads before performance degrades

2. Green Threads / User-Level Threads (N:1 Model)

Examples: Early Java ("green threads", before Java 1.3), Ruby 1.8 threads, Ruby fibers

App Thread 1 ┐
App Thread 2 ├→ Single OS Thread (Runtime scheduled)
App Thread 3 ┘

- Many application threads, one OS thread
- Runtime/VM schedules threads (not OS kernel)
- Lightweight but can't use multiple cores
- Mostly obsolete (superseded by M:N model)

3. M:N Model (Hybrid Threading)

Examples: Go goroutines, Erlang processes, Tokio (Rust)

Goroutine 1 ┐
Goroutine 2 ├→ OS Thread 1 ┐
Goroutine 3 ┘               ├→ CPU Core 1
Goroutine 4 ┐               │
Goroutine 5 ├→ OS Thread 2 ─┤
Goroutine 6 ┘               ├→ CPU Core 2
Goroutine 7 ┐               │
Goroutine 8 ├→ OS Thread 3 ┘
Goroutine 9 ┘

- M goroutines multiplexed onto N OS threads
- Runtime scheduler manages goroutines
- OS kernel schedules OS threads onto cores
- Best of both worlds: lightweight + multi-core

Python's GIL: The Global Interpreter Lock

What is the GIL?

Critical Understanding

The GIL is a mutex (lock) that allows only ONE thread to execute Python bytecode at a time, even on multi-core systems.

What this means:

  • Python threads CAN'T achieve true parallelism for CPU-bound tasks
  • Only one thread executes Python code at any instant
  • Multi-threading in Python is mostly useful for I/O-bound tasks (network, disk)

Why Does the GIL Exist?

Historical Context (1991):

  • Memory Management: CPython uses reference counting for garbage collection. Without GIL, every reference count modification needs a lock (huge overhead).
  • C Extensions: Many C libraries aren't thread-safe. GIL makes it safe to call C code without worrying about thread safety.
  • Simplicity: GIL makes CPython implementation simpler. Single-threaded code doesn't need locks everywhere.

Attempts to Remove GIL:

  • 1999: Greg Stein's "free threading" patch - removed GIL but made single-threaded code 2x slower (rejected)
  • 2023: PEP 703 accepted - optional free-threaded ("no-GIL") build (Sam Gross), shipped as experimental in Python 3.13

How the GIL Works

sequenceDiagram
    participant T1 as Thread 1
    participant GIL as GIL (Lock)
    participant T2 as Thread 2
    Note over T1,T2: Both threads want to execute Python code
    T1->>GIL: Acquire GIL
    GIL->>T1: Granted (holds lock)
    Note over T1: Execute Python bytecode<br/>(up to the switch interval, 5ms by default)
    T2->>GIL: Request GIL
    Note over T2: Blocked, waiting...
    Note over T1: Switch interval elapsed or blocking I/O
    T1->>GIL: Release GIL
    T2->>GIL: Acquire GIL
    GIL->>T2: Granted (holds lock)
    Note over T2: Execute Python bytecode
    T1->>GIL: Request GIL
    Note over T1: Blocked, waiting...
    Note over T1,T2: Threads take turns holding the GIL

GIL Release Conditions:

  • Time-based: the holder is asked to yield every switch interval (5ms by default, tunable via sys.setswitchinterval)
  • Blocking I/O: socket and file operations, and time.sleep(), release the GIL while waiting
  • C extensions: code can release it explicitly (Py_BEGIN_ALLOW_THREADS), as NumPy does for large computations
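The switch interval that drives the time-based handoff can be inspected and tuned from Python; a small sketch (the 0.01 value is just for illustration):

```python
import sys

# CPython asks the GIL holder to yield every "switch interval".
interval = sys.getswitchinterval()
print(interval)  # 0.005 (5ms) unless it has been changed

# A longer interval means fewer forced GIL handoffs (less overhead for
# CPU-bound threads); a shorter one improves I/O responsiveness.
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())  # 0.01

sys.setswitchinterval(interval)  # restore the previous value
```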

Python Threading Example: CPU-Bound (GIL Problem)

import threading
import time

def cpu_bound_task(n):
    """CPU-intensive: calculate sum (holds GIL entire time)"""
    count = 0
    for i in range(n):
        count += i * i
    return count

# Sequential execution
start = time.time()
cpu_bound_task(10_000_000)
cpu_bound_task(10_000_000)
print(f"Sequential: {time.time() - start:.2f}s")  # ~1.5s

# Multi-threaded (doesn't help due to GIL!)
start = time.time()
t1 = threading.Thread(target=cpu_bound_task, args=(10_000_000,))
t2 = threading.Thread(target=cpu_bound_task, args=(10_000_000,))
t1.start()
t2.start()
t1.join()
t2.join()
print(f"Multi-threaded: {time.time() - start:.2f}s")  # ~1.5s (same or slower!)

# Threads fight for GIL, context switching overhead
# No speedup, possibly slower due to lock contention

Python Threading Example: I/O-Bound (GIL Not a Problem)

import threading
import time
import requests

def fetch_url(url):
    """I/O-bound: network request (releases GIL during I/O)"""
    response = requests.get(url)
    return len(response.content)

urls = [
    "https://www.google.com",
    "https://www.github.com",
    "https://www.stackoverflow.com",
    "https://www.reddit.com"
]

# Sequential execution
start = time.time()
for url in urls:
    fetch_url(url)
print(f"Sequential: {time.time() - start:.2f}s")  # ~4s (1s per request)

# Multi-threaded (big speedup!)
start = time.time()
threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Multi-threaded: {time.time() - start:.2f}s")  # ~1s (parallel I/O)

# While waiting for network I/O, GIL is released
# Other threads can run, achieving concurrency

Workarounds for CPU-Bound Tasks in Python

1. Multiprocessing (Separate Python Interpreters, No Shared GIL)

from multiprocessing import Pool
import time

def cpu_bound_task(n):
    count = 0
    for i in range(n):
        count += i * i
    return count

if __name__ == "__main__":
    start = time.time()

    # Use process pool (each process has own GIL)
    with Pool(processes=2) as pool:
        results = pool.map(cpu_bound_task, [10_000_000, 10_000_000])

    print(f"Multiprocessing: {time.time() - start:.2f}s")  # ~0.8s (2x speedup!)

    # Each process runs on separate CPU core
    # No GIL contention, true parallelism

2. Async/Await (Cooperative Concurrency, Single Thread)

import asyncio
import aiohttp

async def fetch_url(session, url):
    """Async I/O - single thread, many concurrent requests"""
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = [
        "https://www.google.com",
        "https://www.github.com",",
        "https://www.stackoverflow.com",
        "https://www.reddit.com"
    ]

    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    print(f"Fetched {len(results)} pages")

# Single thread, cooperative multitasking
# While waiting for I/O, event loop runs other tasks
# More efficient than threads for I/O (no context switching overhead)
asyncio.run(main())

3. C Extensions / Cython (Release GIL)

# NumPy releases GIL for computations
import numpy as np
import threading

def compute_matrix():
    # GIL released during NumPy operations
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    c = np.dot(a, b)  # Matrix multiplication in C (no GIL)
    return c

# These threads achieve parallelism (NumPy releases GIL)
threads = [threading.Thread(target=compute_matrix) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

When Python Threading IS Useful

| Use Case | Why Threading Works |
|---|---|
| Network I/O | GIL released during socket operations (requests, aiohttp) |
| Disk I/O | GIL released during file read/write |
| Database queries | GIL released while waiting for DB response |
| Sleep/Delays | time.sleep() releases GIL |
| NumPy/SciPy | C extensions release GIL during computation |
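For these I/O-bound cases, the standard library's concurrent.futures pool is usually cleaner than hand-rolled threading.Thread bookkeeping. A minimal sketch with simulated I/O (fake_io is a made-up stand-in for a network request):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(task_id):
    """Stand-in for an I/O-bound call: sleeping releases the GIL."""
    time.sleep(0.2)
    return task_id * 2

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() keeps input order and joins the worker threads on exit
    results = list(pool.map(fake_io, range(4)))
elapsed = time.time() - start

print(results)            # [0, 2, 4, 6]
print(f"{elapsed:.2f}s")  # ~0.2s: the four sleeps overlap instead of summing to 0.8s
```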

Goroutines: Lightweight Concurrency

What are Goroutines?

Goroutine = Lightweight thread managed by Go runtime, not OS.

Key Characteristics

  • Tiny stack: Start with 2KB (vs 1-2MB for OS threads), grows dynamically
  • Cheap creation: Can create millions of goroutines
  • Fast context switch: ~tens of nanoseconds (vs microseconds for OS threads)
  • M:N scheduling: M goroutines on N OS threads
  • No GIL: True parallelism on multiple cores

Go Runtime Scheduler (M:N Model)

Go Scheduler Components:

G = Goroutine (lightweight thread)
M = Machine (OS thread)
P = Processor (logical CPU, schedules goroutines)

Architecture:
    G1  G2  G3       G4  G5  G6       G7  G8  G9
     ↓   ↓   ↓        ↓   ↓   ↓        ↓   ↓   ↓
    [P1 Queue]       [P2 Queue]       [P3 Queue]
         ↓                ↓                ↓
        M1              M2              M3
         ↓                ↓                ↓
    CPU Core 1      CPU Core 2      CPU Core 3

GOMAXPROCS = Number of P's (default: number of CPU cores)

How it works:
1. Each P has a run queue of goroutines
2. Each P is bound to an M (OS thread)
3. M executes goroutines from P's queue
4. When goroutine blocks (I/O, syscall), M detaches from P
5. P finds/creates another M to keep running goroutines
6. Work stealing: Idle P steals goroutines from busy P's queue

Go Scheduler Visualization

graph TB
    subgraph "Runtime Scheduler"
        P1[P1: Processor 1<br/>Run Queue: G1, G2, G3]
        P2[P2: Processor 2<br/>Run Queue: G4, G5]
        P3[P3: Processor 3<br/>Run Queue: G6, G7, G8]
    end
    subgraph "OS Threads"
        M1[M1: OS Thread]
        M2[M2: OS Thread]
        M3[M3: OS Thread]
    end
    subgraph "Hardware"
        C1[CPU Core 1]
        C2[CPU Core 2]
        C3[CPU Core 3]
    end
    P1 --> M1
    P2 --> M2
    P3 --> M3
    M1 --> C1
    M2 --> C2
    M3 --> C3
    style P1 fill:#a3be8c,color:#2e3440
    style P2 fill:#a3be8c,color:#2e3440
    style P3 fill:#a3be8c,color:#2e3440
    style M1 fill:#88c0d0,color:#2e3440
    style M2 fill:#88c0d0,color:#2e3440
    style M3 fill:#88c0d0,color:#2e3440
    style C1 fill:#ebcb8b,color:#2e3440
    style C2 fill:#ebcb8b,color:#2e3440
    style C3 fill:#ebcb8b,color:#2e3440

Creating Goroutines: Trivially Simple

package main

import (
    "fmt"
    "time"
)

func task(id int) {
    for i := 0; i < 5; i++ {
        fmt.Printf("Task %d: %d\n", id, i)
        time.Sleep(100 * time.Millisecond)
    }
}

func main() {
    // Launch 3 goroutines (just add "go" keyword!)
    go task(1)
    go task(2)
    go task(3)

    // Wait for goroutines to finish
    time.Sleep(1 * time.Second)

    fmt.Println("Main done")
}

// Output (interleaved):
// Task 1: 0
// Task 3: 0
// Task 2: 0
// Task 1: 1
// Task 2: 1
// Task 3: 1
// ...

// Creating goroutines is EXTREMELY cheap
// Can create millions without issue

Goroutines: CPU-Bound Parallelism (No GIL!)

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func cpuBoundTask(n int, wg *sync.WaitGroup) {
    defer wg.Done()

    count := 0
    for i := 0; i < n; i++ {
        count += i * i
    }
}

func main() {
    fmt.Printf("CPU cores: %d\n", runtime.NumCPU())
    runtime.GOMAXPROCS(runtime.NumCPU()) // Use all cores (default in modern Go)

    n := 100_000_000

    // Sequential execution
    start := time.Now()
    var wg sync.WaitGroup
    wg.Add(1)
    cpuBoundTask(n, &wg)
    wg.Wait()
    fmt.Printf("Sequential: %v\n", time.Since(start))  // ~100ms

    // Parallel execution with 4 goroutines
    start = time.Now()
    wg.Add(4)
    go cpuBoundTask(n/4, &wg)
    go cpuBoundTask(n/4, &wg)
    go cpuBoundTask(n/4, &wg)
    go cpuBoundTask(n/4, &wg)
    wg.Wait()
    fmt.Printf("Parallel (4 goroutines): %v\n", time.Since(start))  // ~25ms (4x speedup!)

    // TRUE parallelism - no GIL!
    // All 4 goroutines run simultaneously on different CPU cores
}

Channels: Communication Between Goroutines

"Don't communicate by sharing memory; share memory by communicating." - Go proverb

package main

import (
    "fmt"
    "time"
)

func producer(ch chan int) {
    for i := 0; i < 5; i++ {
        fmt.Printf("Producing: %d\n", i)
        ch <- i  // Send to channel (blocks if channel full)
        time.Sleep(100 * time.Millisecond)
    }
    close(ch)  // Signal no more data
}

func consumer(ch chan int) {
    for value := range ch {  // Receive until channel closed
        fmt.Printf("Consumed: %d\n", value)
    }
}

func main() {
    ch := make(chan int, 2)  // Buffered channel (capacity 2)

    go producer(ch)
    go consumer(ch)

    time.Sleep(1 * time.Second)
}

// Channels are typed, thread-safe queues
// Avoid shared memory + locks (common source of bugs)
// Compiler catches many concurrency errors at compile time

Select: Multiplexing Channels

package main

import (
    "fmt"
    "time"
)

func main() {
    ch1 := make(chan string)
    ch2 := make(chan string)

    go func() {
        time.Sleep(100 * time.Millisecond)
        ch1 <- "Message from channel 1"
    }()

    go func() {
        time.Sleep(200 * time.Millisecond)
        ch2 <- "Message from channel 2"
    }()

    // select waits on multiple channels
    for i := 0; i < 2; i++ {
        select {
        case msg1 := <-ch1:
            fmt.Println(msg1)
        case msg2 := <-ch2:
            fmt.Println(msg2)
        case <-time.After(300 * time.Millisecond):
            fmt.Println("Timeout")
        }
    }
}

// select is like switch for channels
// Blocks until one channel is ready
// If multiple ready, picks randomly (fair)

Work Stealing & Preemption

Work Stealing (Load Balancing):

P1 Queue: [G1, G2, G3, G4, G5, G6]  (6 goroutines, busy)
P2 Queue: [G7]                      (1 goroutine, mostly idle)

When P2 finishes G7:
1. P2 checks own queue (empty)
2. P2 "steals" half of P1's queue
3. P2 Queue: [G4, G5, G6]  (stolen from P1)
   P1 Queue: [G1, G2, G3]

Result: Balanced load across all cores

Preemption (No Infinite Loops):
Go 1.14+ has asynchronous preemption
Goroutine running CPU-bound loop can be preempted
Prevents one goroutine from hogging P forever

Python GIL vs Goroutines: Head-to-Head

Comparison Table

| Feature | Python (GIL) | Go (Goroutines) |
|---|---|---|
| Threading Model | 1:1 (OS threads) with GIL lock | M:N (goroutines multiplexed on OS threads) |
| True Parallelism | ❌ No (GIL prevents concurrent Python code) | ✅ Yes (no GIL, uses all CPU cores) |
| CPU-Bound Tasks | Threading doesn't help; use multiprocessing | Goroutines achieve near-linear speedup (N cores ≈ Nx faster) |
| I/O-Bound Tasks | ✅ Threading works (GIL released during I/O) | ✅ Goroutines work great |
| Memory per Thread | ~1-2 MB (OS thread stack) | ~2 KB (goroutine stack, grows dynamically) |
| Context Switch Cost | ~1-2 µs (kernel mode switch) | ~10-100 ns (userspace switch) |
| Max Practical Threads | ~1,000-10,000 threads | Millions of goroutines |
| Creation Syntax | threading.Thread(target=func).start() | go func() (single keyword!) |
| Communication | Shared memory + locks (error-prone) | Channels (type-safe, compiler-checked) |
| Workarounds | Multiprocessing (separate processes, IPC overhead) | Not needed (goroutines just work) |
| Garbage Collection | Reference counting + cycle detector (GIL simplifies this) | Concurrent mark-and-sweep GC (sub-ms pause times) |
| Async Alternative | asyncio (single thread, cooperative) | Goroutines ARE async (runtime-managed) |

Performance Benchmark: CPU-Bound Task

Python Threading (GIL)

Task: Calculate sum of squares (100M iterations)

Sequential: 1.5s
Threading (2 threads): 1.5s  ❌ No speedup
Threading (4 threads): 1.6s  ❌ Slower (contention)

Threads fight for GIL, context switching overhead
Only one thread executes at a time

Go Goroutines (No GIL)

Task: Calculate sum of squares (100M iterations)

Sequential: 1.5s
Goroutines (2): 0.75s  ✅ 2x speedup
Goroutines (4): 0.38s  ✅ 4x speedup

Linear scaling with CPU cores
True parallel execution

When to Use What

| Scenario | Python Approach | Go Approach |
|---|---|---|
| Web scraping (I/O-bound) | ✅ Threading or asyncio | ✅ Goroutines |
| Web server (I/O-bound) | ✅ async (FastAPI, aiohttp) | ✅ Goroutines (net/http) |
| Image processing (CPU-bound) | ⚠️ Multiprocessing (process overhead) | ✅ Goroutines (near-linear scaling) |
| Data pipeline (mixed I/O + CPU) | ⚠️ Mix of threading + multiprocessing (complex) | ✅ Goroutines (handles both naturally) |
| Machine learning (CPU-heavy) | ✅ NumPy/PyTorch (release GIL in C/CUDA) | ⚠️ Use Python ecosystem (more mature) |
| Microservices (many concurrent connections) | ⚠️ asyncio (complex) or gunicorn workers | ✅ Goroutines (designed for this) |

Synchronization Primitives

Why Do We Need Synchronization?

The Problem: Race Conditions

When multiple threads access shared data, the result depends on the timing of thread execution. This is unpredictable and causes bugs.

# Example: Race condition
counter = 0

def increment():
    global counter
    for _ in range(1_000_000):
        counter += 1  # Not atomic! Read, add, write

# Two threads
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join(); t2.join()

print(counter)  # Expected: 2,000,000
                # Actual: ~1,234,567 (varies each run!)

# Why? counter += 1 is three operations:
# 1. Read counter value
# 2. Add 1
# 3. Write back
# Threads interleave these operations!

Thread 1: Read (0) → Add → [INTERRUPTED]
Thread 2: Read (0) → Add → Write (1)
Thread 1: Write (1) ← Lost Thread 2's increment!
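The "three operations" claim can be checked with the standard dis module, which disassembles a function into bytecode (exact opcode names vary across CPython versions):

```python
import dis

counter = 0

def increment():
    global counter
    counter += 1  # looks atomic, but compiles to several instructions

# Prints the bytecode: a load of `counter`, an in-place add, and a
# store back -- a thread switch can land between any two of them.
dis.dis(increment)
```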

1. Locks (Mutexes)

Mutex = Mutual Exclusion. Only one thread can hold the lock at a time.

Python Lock

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(1_000_000):
        with lock:  # Acquire lock
            counter += 1
        # Lock automatically released

# Now safe!
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join(); t2.join()

print(counter)  # Always: 2,000,000 ✓

# Alternative syntax:
# lock.acquire()
# try:
#     counter += 1
# finally:
#     lock.release()

Go Mutex

package main

import (
    "fmt"
    "sync"
)

func main() {
    var counter int
    var mu sync.Mutex
    var wg sync.WaitGroup

    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1_000_000; j++ {
                mu.Lock()
                counter++
                mu.Unlock()
            }
        }()
    }

    wg.Wait()
    fmt.Println(counter)  // Always: 2,000,000
}

Lock Performance Cost

Locks have overhead! Fine-grained locking (lock per operation) can be slower than no concurrency.

# Sequential (no lock): 0.1s
# Two threads with fine-grained locks: 2.5s (slower!)
# Coarse-grained locks (lock larger chunks): 0.8s (better)

# Lesson: Lock granularity matters

2. Reentrant Locks (Recursive Locks)

A lock that can be acquired multiple times by the same thread.

import threading

lock = threading.RLock()  # Reentrant Lock

def outer():
    with lock:
        print("Outer acquired lock")
        inner()  # Can acquire same lock again!

def inner():
    with lock:  # Same thread, same lock - OK!
        print("Inner acquired lock")

# Regular Lock would deadlock here
# RLock allows same thread to re-acquire

3. Semaphores

Like a lock, but allows N threads to access resource simultaneously. Think of it as a counter.

Use Case: Connection Pool

import threading
import time

# Allow max 3 concurrent database connections
db_semaphore = threading.Semaphore(3)

def query_database(query_id):
    print(f"Query {query_id}: Waiting for connection...")
    with db_semaphore:  # Acquire (counter--)
        print(f"Query {query_id}: Connected! Executing...")
        time.sleep(2)  # Simulate query
        print(f"Query {query_id}: Done")
    # Release (counter++)

# Launch 10 queries
threads = [threading.Thread(target=query_database, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Only 3 execute at once (waves of 3, 3, 3, 1)

# Semaphore mechanics:
# - Internal counter starts at N (3 in this case)
# - acquire() decrements counter, blocks if counter = 0
# - release() increments counter, wakes waiting thread

Binary Semaphore vs Lock

# Binary Semaphore (count=1) ≈ Lock
sem = threading.Semaphore(1)

# Difference: Semaphore can be released by different thread
# Lock must be released by thread that acquired it

4. Condition Variables

Allow threads to wait for a condition to become true. Used for thread communication.

Producer-Consumer Problem

import threading
import time
import random

queue = []
MAX_SIZE = 5
condition = threading.Condition()

def producer():
    for i in range(10):
        time.sleep(random.uniform(0.1, 0.5))
        with condition:
            # Wait if queue is full
            while len(queue) >= MAX_SIZE:
                print(f"Producer: Queue full, waiting...")
                condition.wait()  # Release lock, sleep, reacquire when notified

            item = f"Item-{i}"
            queue.append(item)
            print(f"Producer: Produced {item}, queue size: {len(queue)}")

            condition.notify()  # Wake up one waiting consumer

def consumer():
    for i in range(10):
        time.sleep(random.uniform(0.2, 0.8))
        with condition:
            # Wait if queue is empty
            while len(queue) == 0:
                print(f"Consumer: Queue empty, waiting...")
                condition.wait()

            item = queue.pop(0)
            print(f"Consumer: Consumed {item}, queue size: {len(queue)}")

            condition.notify()  # Wake up waiting producer

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()

# Condition variables = Lock + Wait/Notify mechanism
# wait(): Atomically release lock and sleep
# notify(): Wake one waiting thread
# notify_all(): Wake all waiting threads
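In practice, Python's queue.Queue wraps exactly this Lock + Condition machinery, so the producer-consumer above shrinks considerably. A minimal sketch (the None sentinel is just one common shutdown convention):

```python
import threading
import queue

q = queue.Queue(maxsize=5)   # put() blocks when full, get() blocks when empty
results = []

def producer():
    for i in range(10):
        q.put(f"Item-{i}")   # the "queue full" wait/notify happens inside put()
    q.put(None)              # sentinel: tells the consumer to stop

def consumer():
    while True:
        item = q.get()       # the "queue empty" wait/notify happens inside get()
        if item is None:
            break
        results.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()

print(len(results))  # 10
```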

5. Read/Write Locks

Optimize for read-heavy workloads. Multiple readers OR one writer.

import threading

# Python doesn't have a built-in RWLock, but here's the concept.
# A single Condition guards the bookkeeping, so the counters
# themselves can't race:
class ReadWriteLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

# Usage: Many readers can read simultaneously
# Writer gets exclusive access

Go RWMutex

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var rwMu sync.RWMutex
    data := make(map[string]int)

    // Multiple readers (10 concurrent readers)
    for i := 0; i < 10; i++ {
        go func(id int) {
            rwMu.RLock()  // Read lock (shared)
            defer rwMu.RUnlock()
            fmt.Printf("Reader %d: %v\n", id, data)
            time.Sleep(100 * time.Millisecond)
        }(i)
    }

    // Single writer (exclusive)
    go func() {
        rwMu.Lock()  // Write lock (exclusive)
        defer rwMu.Unlock()
        data["key"] = 42
        fmt.Println("Writer: Updated data")
    }()

    time.Sleep(2 * time.Second)
}

6. Atomic Operations

Lock-free operations guaranteed to execute atomically (no interruption).

Python (limited atomic support)

# Python's GIL makes some operations accidentally atomic:
# x = 1  (atomic - single bytecode instruction)
# x += 1 (NOT atomic - multiple bytecode instructions)

# For true atomics, use multiprocessing.Value or ctypes
from multiprocessing import Value
from threading import Thread

counter = Value('i', 0)  # Shared integer with lock

def increment():
    for _ in range(100_000):
        with counter.get_lock():
            counter.value += 1

# More common: Just use threading.Lock

Go Atomics

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var counter int64
    var wg sync.WaitGroup

    // Lock-free atomic increment
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            atomic.AddInt64(&counter, 1)  // Atomic operation
        }()
    }

    wg.Wait()
    fmt.Println(counter)  // Always: 1000

    // Other atomic operations:
    // atomic.LoadInt64(&counter)     // Atomic read
    // atomic.StoreInt64(&counter, 5) // Atomic write
    // atomic.CompareAndSwapInt64(&counter, old, new)  // CAS
}

// Atomics are faster than locks (no kernel involvement)
// But limited to simple operations (add, load, store, CAS)

7. Barriers

Synchronization point where all threads must wait until everyone arrives.

from threading import Thread, Barrier
import time
import random

# 3 threads must reach barrier before any continue
barrier = Barrier(3)

def worker(worker_id):
    print(f"Worker {worker_id}: Starting phase 1")
    time.sleep(random.uniform(1, 3))  # Simulate work
    print(f"Worker {worker_id}: Finished phase 1, waiting at barrier")

    barrier.wait()  # Block until all 3 threads reach here

    print(f"Worker {worker_id}: All workers ready, starting phase 2")
    time.sleep(random.uniform(1, 2))
    print(f"Worker {worker_id}: Finished phase 2")

threads = [Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Output shows all workers wait at barrier before proceeding
# Useful for phased algorithms (e.g., parallel sorting)

Common Concurrency Bugs

Deadlock

Two threads waiting for each other forever.

import threading
import time

lock1 = threading.Lock()
lock2 = threading.Lock()

def thread1():
    with lock1:
        time.sleep(0.1)  # Give thread2 time to acquire lock2
        with lock2:  # Wait forever - thread2 holds lock2
            print("Thread 1")

def thread2():
    with lock2:
        time.sleep(0.1)
        with lock1:  # Wait forever - thread1 holds lock1
            print("Thread 2")

# DEADLOCK! Both threads wait forever

# Prevention:
# 1. Lock ordering: Always acquire locks in same order
# 2. Timeout: lock.acquire(timeout=1)
# 3. Lock hierarchy: Assign levels to locks, acquire lower→higher
# 4. Avoid nested locks when possible
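Prevention technique 1 (lock ordering) applied to the example above: if both threads take lock1 before lock2, the circular wait can never form. A minimal sketch:

```python
import threading
import time

lock1 = threading.Lock()
lock2 = threading.Lock()
done = []

def thread1():
    with lock1:              # lock1 first...
        time.sleep(0.1)
        with lock2:          # ...then lock2
            done.append(1)

def thread2():
    with lock1:              # same order: lock1 first, NOT lock2-then-lock1
        time.sleep(0.1)
        with lock2:
            done.append(2)

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start(); t2.start()
t1.join(); t2.join()
print("No deadlock:", sorted(done))  # No deadlock: [1, 2]
```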

Livelock

Threads keep changing state in response to each other but make no progress.

# Two people in hallway, both stepping same direction
# Both politely step aside... in same direction again!
# Not blocked, but not progressing

# Prevention: Randomized backoff, priority schemes
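One way to sketch the randomized-backoff idea in Python: try the second lock with a timeout, and on failure release everything and sleep a random amount so the two threads stop mirroring each other (worker and the lock names are illustrative):

```python
import threading
import time
import random

lock_a = threading.Lock()
lock_b = threading.Lock()
acquired = []

def worker(name, first, second):
    while True:
        with first:
            # Try the second lock, but don't block forever
            if second.acquire(timeout=0.05):
                acquired.append(name)
                second.release()
                return
        # Both locks released here; back off a *random* amount so the
        # two threads don't retry in lockstep (which would livelock)
        time.sleep(random.uniform(0, 0.05))

t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(acquired))  # both threads eventually succeed
```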

Starvation

A thread never gets access to a resource (unfair scheduling).

# Low-priority thread never runs because high-priority threads
# constantly acquire lock

# Prevention: Fair locks, priority inheritance

Best Practices

| Principle | Explanation |
|---|---|
| Minimize shared state | Less shared data = fewer synchronization points. Use thread-local storage, immutability. |
| Coarse-grained locks | Lock larger chunks of work, not every tiny operation. Reduces lock overhead. |
| Lock ordering | Always acquire multiple locks in the same order to prevent deadlock. |
| Short critical sections | Hold locks for the minimum time needed. Don't do I/O while holding a lock. |
| Use higher-level primitives | Queue, ThreadPoolExecutor (Python), channels (Go) instead of raw locks. |
| Prefer message passing | Go channels, actor model: share memory by communicating rather than communicating by sharing memory. |
| Test with race detectors | ThreadSanitizer (C/C++) and Go's built-in race detector (go test -race) find data races. |
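For the "minimize shared state" principle, threading.local gives each thread its own private attribute namespace, removing the need for a lock entirely. A minimal sketch:

```python
import threading

tls = threading.local()      # each thread sees its own attributes
results = {}

def worker(name):
    tls.value = name         # private to this thread: no lock needed
    # any function called from this thread now sees its own tls.value
    results[name] = tls.value  # distinct key per thread; safe under CPython

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # ['t0', 't1', 't2', 't3']
```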

Other Languages & Models

Quick Comparison Across Languages

Language Model True Parallelism Notes
Java 1:1 OS threads ✅ Yes No GIL, heavy threads, good for CPU-bound
C# (.NET) 1:1 + async/await ✅ Yes No GIL, threadpool + task-based async
Rust 1:1 OS threads + async (Tokio) ✅ Yes No GIL, fearless concurrency (compiler enforces safety)
JavaScript (Node.js) Single-threaded + event loop ❌ No (but Worker threads available) async/await, non-blocking I/O, callbacks
Ruby (MRI) 1:1 OS threads with GIL ❌ No (same as Python) GIL prevents parallel CPU execution
Erlang/Elixir M:N (lightweight processes) ✅ Yes Similar to Go, millions of processes, message passing
C/C++ 1:1 OS threads (pthreads, std::thread) ✅ Yes Manual memory management, race conditions are your problem

Key Takeaways

What You Should Remember

  1. Concurrency ≠ Parallelism: Concurrency is about structure (dealing with many things), parallelism is about execution (doing many things simultaneously).
  2. Python's GIL prevents true parallelism for CPU-bound Python code
    • Threading works for I/O-bound tasks (GIL released during I/O)
    • Use multiprocessing for CPU-bound tasks
    • Use asyncio for efficient I/O concurrency
    • NumPy/C extensions can release GIL
  3. Go goroutines achieve true parallelism with no GIL
    • Extremely lightweight (2KB vs 1-2MB)
    • Can create millions without issue
    • M:N scheduling (goroutines on OS threads)
    • Channels for safe communication
  4. Choose the right tool:
    • Python: Great for I/O, scripting, ML/data science. Use multiprocessing for CPU-bound.
    • Go: Excellent for concurrent systems, microservices, network services.
  5. Interview tip: When asked about Python threading, mention:
    • GIL prevents parallel CPU execution
    • Threading still useful for I/O-bound (network, disk, DB)
    • Multiprocessing for CPU-bound parallelism
    • asyncio for efficient I/O concurrency