Distinguished Engineer Study Guide - From Bits to Files
The Core Problem: A hard drive is just a giant array of blocks (typically 4KB each). How do you store files, directories, permissions, and timestamps in a way that's fast, reliable, and doesn't lose data?
The Filesystem's Job: Translate filenames and byte offsets into block addresses, track which blocks are free, store metadata (permissions, ownership, timestamps), and keep all of it consistent across crashes.
Everything is blocks:
Physical Disk:
┌─────┬─────┬─────┬─────┬─────┬─────┐
│ Blk │ Blk │ Blk │ Blk │ Blk │ Blk │ ... (billions of 4KB blocks)
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │
└─────┴─────┴─────┴─────┴─────┴─────┘
Block = Smallest unit filesystem can address (typically 4KB)
Sector = Smallest unit disk can address (typically 512 bytes or 4KB)
Why 4KB blocks? 4KB matches the CPU page size on x86, so page-cache pages map 1:1 onto blocks. Larger blocks waste space on small files (internal fragmentation); smaller blocks mean more metadata and more pointer overhead per file.
Typical ext4 filesystem layout:
┌─────────────┬──────────┬─────────┬────────────┬──────────────┐
│ Superblock │ Block │ Inodes │ Inode │ Data Blocks │
│ (metadata) │ Bitmap │ Bitmap │ Table │ (files) │
└─────────────┴──────────┴─────────┴────────────┴──────────────┘
Superblock: Filesystem metadata (size, block size, mount count, etc.)
Block Bitmap: Which blocks are free/used (1 bit per block)
Inode Bitmap: Which inodes are free/used
Inode Table: File metadata entries
Data Blocks: Actual file contents
What it stores:
Superblock Contents:
- Magic number (identifies filesystem type)
- Block size (4KB, 8KB, etc.)
- Total blocks / free blocks
- Total inodes / free inodes
- Mount count / max mount count
- Last mount time / last write time
- Filesystem state (clean/dirty)
- UUID (unique identifier)
- Volume name
Example (ext4):
$ sudo dumpe2fs /dev/sda1 | head -20
Filesystem UUID: a1b2c3d4-...
Block count: 26214400
Block size: 4096
Inode count: 6553600
Free blocks: 23451234
Free inodes: 6234567
Why it matters: Corruption of the superblock = entire filesystem unreadable. Modern filesystems keep multiple superblock copies at known locations.
Definition: An inode (index node) is a data structure that stores all metadata about a file except the filename.
Each inode stores:
┌─────────────────────────────────┐
│ File Type (regular, dir, link) │
│ Permissions (rwxrwxrwx) │
│ Owner UID / Group GID │
│ File size (in bytes) │
│ Timestamps: │
│ - atime (last access) │
│ - mtime (last modification) │
│ - ctime (last status change) │
│ Link count (hard links) │
│ Block pointers (where data is) │
│ [Does NOT store filename!] │
└─────────────────────────────────┘
View inode info:
$ ls -i file.txt
12345678 file.txt (12345678 is the inode number)
$ stat file.txt
File: file.txt
Size: 1024 Blocks: 8 IO Block: 4096
Inode: 12345678 Links: 1
Access: (0644/-rw-r--r--) Uid: (1000/user) Gid: (1000/user)
Access: 2025-01-15 10:30:00.123456789 -0500
Modify: 2025-01-15 09:15:00.987654321 -0500
Change: 2025-01-15 09:15:00.987654321 -0500
Inode structure (simplified ext4):
┌────────────────┐
│ Metadata │
├────────────────┤
│ Direct[0] ───┼──> Data Block 1 (4KB)
│ Direct[1] ───┼──> Data Block 2 (4KB)
│ ... │
│ Direct[11] ───┼──> Data Block 12 (4KB)
├────────────────┤
│ Indirect ───┼──> Pointer Block ──> 1024 Data Blocks (4MB)
│ Double Ind. ───┼──> Ptr Block ──> 1024 Ptr Blocks ──> 1M Data Blocks (4GB)
│ Triple Ind. ───┼──> Ptr Block ──> ... (4TB)
└────────────────┘
Small files (< 48KB): Use direct pointers only - fast!
Large files: Use indirect pointers - extra seek to read pointer block
Why this design? Most files are small. Direct pointers provide fast access for common case. Indirect pointers allow huge files without wasting space in every inode.
Key Insight: A directory is just a file that contains a mapping of filenames to inode numbers.
Directory structure:
┌─────────────┬──────────┐
│ Filename │ Inode # │
├─────────────┼──────────┤
│ . │ 234567 │ (self)
│ .. │ 123456 │ (parent)
│ file1.txt │ 345678 │
│ file2.txt │ 456789 │
│ subdir │ 567890 │
└─────────────┴──────────┘
When you run: ls /home/user/documents/file.txt
1. Start at root inode (always #2)
2. Read "/" directory → find "home" → inode 1234
3. Read inode 1234 → get blocks for "home" directory
4. Search "home" directory → find "user" → inode 5678
5. Read inode 5678 → get blocks for "user" directory
6. Search "user" directory → find "documents" → inode 9012
7. Search "documents" directory → find "file.txt" → inode 3456
8. Read inode 3456 → get file metadata and data blocks
Why it matters: Deep directory paths = many inode lookups. This is why path traversal can be slow, and why directory caching (dentry cache in Linux) is critical.
$ ls -li
12345 -rw-r--r-- 2 user group 1024 Jan 15 10:00 original.txt
12345 -rw-r--r-- 2 user group 1024 Jan 15 10:00 hardlink.txt
Both point to same inode (12345)
Same file, two directory entries
Deleting one doesn't affect the other (link count decrements)
File only deleted when link count reaches 0
Limitations:
- Can't cross filesystem boundaries (inode numbers not unique across filesystems)
- Can't link directories (would create cycles)
$ ls -li
12345 -rw-r--r-- 1 user group 1024 Jan 15 10:00 original.txt
67890 lrwxrwxrwx 1 user group 12 Jan 15 10:05 symlink.txt -> original.txt
Different inodes
Symlink is a separate file containing path to target
Can cross filesystems
Can link directories
If target deleted, symlink becomes dangling (broken)
Scenario: Creating a file requires multiple ordered operations:
1. Find a free data block
2. Write the file data to that block
3. Mark the block as used in the block bitmap
4. Allocate an inode and point it at the block
5. Add a directory entry mapping filename → inode
If system crashes after step 2: Block marked free in bitmap but contains data. Block gets reused, data corruption!
If crash after step 4: Inode allocated but not linked to directory. Space leak (orphaned inode).
Journal = Special area on disk that logs intended changes
Write Operation with Journaling:
1. Write planned changes to journal (transaction)
2. Wait for journal write to complete (barrier)
3. Checkpoint: Write actual data/metadata to final locations
4. Mark transaction complete in journal
On crash/reboot:
- Read journal
- If transaction incomplete: Discard (nothing changed)
- If transaction complete but not checkpointed: Replay changes
- Filesystem always consistent!
| Mode | What's Journaled | Performance | Data Safety |
|---|---|---|---|
| writeback | Metadata only | Fastest | Filesystem consistent, but data can be garbage (recently written files) |
| ordered | Metadata only, but data written before metadata | Good | Filesystem consistent, no garbage data (default) |
| journal | Both metadata and data | Slowest (writes twice) | Full protection |
Why not always use full journaling? Write amplification. Every write happens twice (journal + final location). For large files, this is expensive.
| Filesystem | Year | Key Features | Use Case |
|---|---|---|---|
| ext2 | 1993 | No journaling, simple, fast for read-heavy | USB drives, embedded systems |
| ext3 | 2001 | Added journaling to ext2, backward compatible | Legacy Linux systems |
| ext4 | 2008 | Extents (contiguous blocks), delayed allocation, faster fsck | Default Linux filesystem |
| XFS | 1993 | Excellent large file performance, online defrag, delayed allocation | Large files, databases, RHEL default |
| Btrfs | 2009 | Copy-on-write, snapshots, checksums, compression, RAID | Modern Linux, snapshots needed |
| ZFS | 2005 | Volume manager + filesystem, snapshots, dedup, compression, checksums | NAS, data integrity critical (FreeBSD, Solaris) |
ext4:
✓ Mature, stable, well-tested
✓ Fast for general workloads
✓ Good small file performance
✗ No built-in snapshots
✗ No checksums (data corruption undetected)
XFS:
✓ Excellent for large files (streaming, video, databases)
✓ Parallel I/O performance
✓ Online defragmentation
✗ Cannot shrink filesystem (only grow)
✗ Slower for many small files
Btrfs:
✓ Snapshots (instant, space-efficient)
✓ Data checksumming (detects silent corruption)
✓ Transparent compression
✓ Self-healing with RAID
✗ Historically had stability issues (better now)
✗ RAID 5/6 still not considered production-ready
Traditional approach (ext4): Modify data in-place
Write to block 1234:
1. Read block 1234
2. Modify in memory
3. Write back to block 1234
4. Update journal
Problem: If crash during write, block half-updated (torn write)
Copy-on-Write (Btrfs, ZFS): Never modify in-place
Write to file:
1. Allocate NEW block 5678
2. Write modified data to block 5678
3. Update pointer to point to 5678 (atomic)
4. Old block 1234 freed (eventually)
Benefits:
- No torn writes (pointer update is atomic)
- Trivial snapshots (keep old pointers around)
- Easier to implement checksums (immutable blocks)
Trade-offs:
- Write amplification (metadata updates propagate up tree)
- Fragmentation over time (need defrag or GC)
Snapshot process:
1. Take snapshot: Clone root pointer (instant, no data copied)
2. Modify file: Allocate new block, update current root, snapshot root unchanged
3. Delete snapshot: Free blocks only referenced by snapshot
Cost: Essentially free (just pointer manipulation)
Linux uses all free RAM for caching disk blocks
$ free -h
total used free shared buff/cache available
Mem: 15Gi 2.0Gi 8.0Gi 100Mi 5.0Gi 13Gi
"buff/cache" = OS cached disk blocks in memory
When you read file:
1. Check page cache - already in RAM? Return immediately
2. Not cached? Read from disk, store in cache, return to app
3. Future reads = instant (from cache)
When you write file:
1. Write to page cache (write-back cache)
2. Mark page "dirty"
3. Return to app (write appears complete)
4. Background flush daemon writes dirty pages to disk (every 30 sec default)
5. fsync() forces immediate flush
Why it matters: Workloads that fit in cache = RAM speed (~100GB/sec). Workloads that exceed cache = disk speed (~100MB/sec HDD, ~500MB/sec SSD). 1000x difference!
Problem: Path lookups expensive (/a/b/c/d/e/f/file.txt = 7 inode reads)
Solution: Cache directory entry lookups
"dentry cache" maps path → inode number
$ cat /proc/sys/fs/dentry-state
82749 72152 45 0 0 0
Shows cached dentry entries
Speeds up repeated access to same paths
Problem: Applications issue random reads/writes. Disk seeks are slow. How to optimize?
| Scheduler | Strategy | Best For |
|---|---|---|
| noop | FIFO queue, no reordering | SSDs, NVMe (no seek penalty) |
| deadline | Sort by sector, guarantee latency deadline | Databases, real-time |
| cfq | Completely Fair Queuing - per-process queues | General desktop/server (legacy default; removed in Linux 5.0) |
| bfq | Budget Fair Queuing - low latency, proportional I/O | Interactive workloads |
| mq-deadline | Multi-queue deadline (for NVMe) | Modern SSDs with multiple queues |
Check current scheduler:
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
Change scheduler (temporary):
$ echo bfq > /sys/block/sda/queue/scheduler
Ideal: File stored in contiguous blocks
┌────┬────┬────┬────┬────┬────┐
│ F1 │ F1 │ F1 │ F1 │ F1 │ │ Sequential read = fast (one seek)
└────┴────┴────┴────┴────┴────┘
Fragmented: File scattered across disk
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ F1 │ F2 │ F1 │ F2 │ F1 │ │ F1 │ F2 │ Random seeks = slow
└────┴────┴────┴────┴────┴────┴────┴────┘
Causes:
- Repeated create/delete cycles
- File growth (append to file, no contiguous space available)
- Filesystem nearly full
ext4: Delayed allocation
- Don't allocate blocks immediately on write
- Buffer writes in memory, allocate contiguous blocks when flush
- Reduces fragmentation significantly
XFS: Online defragmentation
$ xfs_fsr /mount/point (reorganizes files while filesystem mounted)
Btrfs: Manual defragmentation (or continuous via the autodefrag mount option)
$ btrfs filesystem defragment -r /mount/point
ext4: Online defragmentation
$ e4defrag /mount/point (runs on a mounted filesystem, with some performance impact)
write() → Page Cache → (eventual) Disk
read() → Page Cache (if cached) or Disk → Page Cache
Benefits: Fast (RAM speed for cached data)
Problem: No control over when data hits disk
int fd = open("file", O_DIRECT);
write(fd, buffer, size); // Bypasses page cache, writes directly to disk
Use cases:
- Databases (manage own caching, need guaranteed write-through)
- Avoid double-buffering (app cache + page cache)
Requirements:
- Buffer must be aligned (typically 512-byte or 4KB aligned)
- Size must be multiple of block size
O_SYNC: Every write() blocks until data on disk
int fd = open("file", O_SYNC);
write(fd, buf, 100); // Returns only after disk write complete
fsync(): Force flush for specific file
write(fd, buf, 100); // Returns immediately (buffered)
fsync(fd); // Blocks until all data/metadata on disk
fdatasync(): Like fsync, but doesn't flush metadata (faster)
Use case: Databases, crash consistency critical
Q: Why doesn't df show freed space after deleting a large file?
A: A process still has the file open. Deletion removes the file from the directory, but the inode and blocks aren't freed until all file descriptors are closed.
$ rm /var/log/huge.log (file deleted from directory)
$ df -h (space not freed!)
Reason: Log process still writing to file descriptor
Solution: Restart process or: > /var/log/huge.log (truncate)
Q: Why can't hard links point to directories?
A: They would create cycles in the directory tree, breaking tree invariants. Tools like find assume an acyclic graph. Symlinks are allowed because they can be detected and skipped.
Q: What actually happens during rm -rf /path/to/dir?
1. Recursively traverse directory tree (depth-first)
2. For each file: Decrement inode link count
3. If link count → 0 and no processes have file open: Free inode + data blocks
4. Remove directory entry
5. Remove directory when empty
Atomic? No. If crash mid-delete, partially deleted state.
This is why trash/recycle bins exist (move to .trash instead of delete)
Q: Why is mv sometimes instant and sometimes slow, and is it atomic?
A: A rename within one filesystem just updates a directory entry (a single atomic operation). A move across filesystems is copy + delete: not atomic, and it can fail mid-copy.