Distinguished Engineer Study Guide - From Bits to Files
The Core Problem: A hard drive is just a giant array of blocks (typically 4KB each). How do you store files, directories, permissions, and timestamps in a way that's fast, reliable, and doesn't lose data?
The Filesystem's Job: Translate filenames and byte offsets into block addresses, track which blocks are free, store metadata (permissions, ownership, timestamps), and keep all of it consistent across crashes.
Everything is blocks:
Physical Disk:
┌─────┬─────┬─────┬─────┬─────┬─────┐
│ Blk │ Blk │ Blk │ Blk │ Blk │ Blk │ ... (billions of 4KB blocks)
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │
└─────┴─────┴─────┴─────┴─────┴─────┘
Block = Smallest unit filesystem can address (typically 4KB)
Sector = Smallest unit disk can address (typically 512 bytes or 4KB)
Why 4KB blocks? 4KB matches the CPU page size on x86, so page-cache pages map 1:1 onto blocks. Larger blocks waste space on small files (internal fragmentation); smaller blocks mean more metadata and more pointer overhead per file.
Typical ext4 filesystem layout:
┌─────────────┬──────────┬─────────┬────────────┬──────────────┐
│ Superblock │ Block │ Inodes │ Inode │ Data Blocks │
│ (metadata) │ Bitmap │ Bitmap │ Table │ (files) │
└─────────────┴──────────┴─────────┴────────────┴──────────────┘
Superblock: Filesystem metadata (size, block size, mount count, etc.)
Block Bitmap: Which blocks are free/used (1 bit per block)
Inode Bitmap: Which inodes are free/used
Inode Table: File metadata entries
Data Blocks: Actual file contents
What it stores:
Superblock Contents:
- Magic number (identifies filesystem type)
- Block size (4KB, 8KB, etc.)
- Total blocks / free blocks
- Total inodes / free inodes
- Mount count / max mount count
- Last mount time / last write time
- Filesystem state (clean/dirty)
- UUID (unique identifier)
- Volume name
Example (ext4):
$ sudo dumpe2fs /dev/sda1 | head -20
Filesystem UUID: a1b2c3d4-...
Block count: 26214400
Block size: 4096
Inode count: 6553600
Free blocks: 23451234
Free inodes: 6234567
Why it matters: Corruption of the superblock = entire filesystem unreadable. Modern filesystems keep multiple superblock copies at known locations.
Definition: An inode (index node) is a data structure that stores all metadata about a file except the filename.
Each inode stores:
┌─────────────────────────────────┐
│ File Type (regular, dir, link) │
│ Permissions (rwxrwxrwx) │
│ Owner UID / Group GID │
│ File size (in bytes) │
│ Timestamps: │
│ - atime (last access) │
│ - mtime (last modification) │
│ - ctime (last status change) │
│ Link count (hard links) │
│ Block pointers (where data is) │
│ [Does NOT store filename!] │
└─────────────────────────────────┘
View inode info:
$ ls -i file.txt
12345678 file.txt (12345678 is the inode number)
$ stat file.txt
File: file.txt
Size: 1024 Blocks: 8 IO Block: 4096
Inode: 12345678 Links: 1
Access: (0644/-rw-r--r--) Uid: (1000/user) Gid: (1000/user)
Access: 2025-01-15 10:30:00.123456789 -0500
Modify: 2025-01-15 09:15:00.987654321 -0500
Change: 2025-01-15 09:15:00.987654321 -0500
Inode structure (simplified ext4):
┌────────────────┐
│ Metadata │
├────────────────┤
│ Direct[0] ───┼──> Data Block 1 (4KB)
│ Direct[1] ───┼──> Data Block 2 (4KB)
│ ... │
│ Direct[11] ───┼──> Data Block 12 (4KB)
├────────────────┤
│ Indirect ───┼──> Pointer Block ──> 1024 Data Blocks (4MB)
│ Double Ind. ───┼──> Ptr Block ──> 1024 Ptr Blocks ──> 1M Data Blocks (4GB)
│ Triple Ind. ───┼──> Ptr Block ──> ... (4TB)
└────────────────┘
Small files (< 48KB): Use direct pointers only - fast!
Large files: Use indirect pointers - extra seek to read pointer block
Why this design? Most files are small. Direct pointers provide fast access for common case. Indirect pointers allow huge files without wasting space in every inode.
Key Insight: A directory is just a file that contains a mapping of filenames to inode numbers.
Directory structure:
┌─────────────┬──────────┐
│ Filename │ Inode # │
├─────────────┼──────────┤
│ . │ 234567 │ (self)
│ .. │ 123456 │ (parent)
│ file1.txt │ 345678 │
│ file2.txt │ 456789 │
│ subdir │ 567890 │
└─────────────┴──────────┘
When you run: ls /home/user/documents/file.txt
1. Start at root inode (always #2)
2. Read "/" directory → find "home" → inode 1234
3. Read inode 1234 → get blocks for "home" directory
4. Search "home" directory → find "user" → inode 5678
5. Read inode 5678 → get blocks for "user" directory
6. Search "user" directory → find "documents" → inode 9012
7. Search "documents" directory → find "file.txt" → inode 3456
8. Read inode 3456 → get file metadata and data blocks
Why it matters: Deep directory paths = many inode lookups. This is why path traversal can be slow, and why directory caching (dentry cache in Linux) is critical.
$ ls -li
12345 -rw-r--r-- 2 user group 1024 Jan 15 10:00 original.txt
12345 -rw-r--r-- 2 user group 1024 Jan 15 10:00 hardlink.txt
Both point to same inode (12345)
Same file, two directory entries
Deleting one doesn't affect the other (link count decrements)
File only deleted when link count reaches 0
Limitations:
- Can't cross filesystem boundaries (inode numbers not unique across filesystems)
- Can't link directories (would create cycles)
$ ls -li
12345 -rw-r--r-- 1 user group 1024 Jan 15 10:00 original.txt
67890 lrwxrwxrwx 1 user group 12 Jan 15 10:05 symlink.txt -> original.txt
Different inodes
Symlink is a separate file containing path to target
Can cross filesystems
Can link directories
If target deleted, symlink becomes dangling (broken)
Scenario: Creating a file requires multiple ordered operations:
1. Find a free data block
2. Write the file data to that block
3. Mark the block as used in the block bitmap
4. Allocate an inode and point it at the block
5. Add a directory entry mapping filename → inode
If system crashes after step 2: Block marked free in bitmap but contains data. Block gets reused, data corruption!
If crash after step 4: Inode allocated but not linked to directory. Space leak (orphaned inode).
Journal = Special area on disk that logs intended changes
Write Operation with Journaling:
1. Write planned changes to journal (transaction)
2. Wait for journal write to complete (barrier)
3. Checkpoint: Write actual data/metadata to final locations
4. Mark transaction complete in journal
On crash/reboot:
- Read journal
- If transaction incomplete: Discard (nothing changed)
- If transaction complete but not checkpointed: Replay changes
- Filesystem always consistent!
| Mode | What's Journaled | Performance | Data Safety |
|---|---|---|---|
| writeback | Metadata only | Fastest | Filesystem consistent, but data can be garbage (recently written files) |
| ordered | Metadata only, but data written before metadata | Good | Filesystem consistent, no garbage data (default) |
| journal | Both metadata and data | Slowest (writes twice) | Full protection |
Why not always use full journaling? Write amplification. Every write happens twice (journal + final location). For large files, this is expensive.
| Filesystem | Year | Key Features | Use Case |
|---|---|---|---|
| ext2 | 1993 | No journaling, simple, fast for read-heavy | USB drives, embedded systems |
| ext3 | 2001 | Added journaling to ext2, backward compatible | Legacy Linux systems |
| ext4 | 2008 | Extents (contiguous blocks), delayed allocation, faster fsck | Default Linux filesystem |
| XFS | 1993 | Excellent large file performance, online defrag, delayed allocation | Large files, databases, RHEL default |
| Btrfs | 2009 | Copy-on-write, snapshots, checksums, compression, RAID | Modern Linux, snapshots needed |
| ZFS | 2005 | Volume manager + filesystem, snapshots, dedup, compression, checksums | NAS, data integrity critical (FreeBSD, Solaris) |
ext4:
✓ Mature, stable, well-tested
✓ Fast for general workloads
✓ Good small file performance
✗ No built-in snapshots
✗ No checksums (data corruption undetected)
XFS:
✓ Excellent for large files (streaming, video, databases)
✓ Parallel I/O performance
✓ Online defragmentation
✗ Cannot shrink filesystem (only grow)
✗ Slower for many small files
Btrfs:
✓ Snapshots (instant, space-efficient)
✓ Data checksumming (detects silent corruption)
✓ Transparent compression
✓ Self-healing with RAID
✗ Historically had stability issues (better now)
✗ RAID 5/6 still not considered production-ready
Traditional approach (ext4): Modify data in-place
Write to block 1234:
1. Read block 1234
2. Modify in memory
3. Write back to block 1234
4. Update journal
Problem: If crash during write, block half-updated (torn write)
Copy-on-Write (Btrfs, ZFS): Never modify in-place
Write to file:
1. Allocate NEW block 5678
2. Write modified data to block 5678
3. Update pointer to point to 5678 (atomic)
4. Old block 1234 freed (eventually)
Benefits:
- No torn writes (pointer update is atomic)
- Trivial snapshots (keep old pointers around)
- Easier to implement checksums (immutable blocks)
Trade-offs:
- Write amplification (metadata updates propagate up tree)
- Fragmentation over time (need defrag or GC)
Snapshot process:
1. Take snapshot: Clone root pointer (instant, no data copied)
2. Modify file: Allocate new block, update current root, snapshot root unchanged
3. Delete snapshot: Free blocks only referenced by snapshot
Cost: Essentially free (just pointer manipulation)
Linux uses all free RAM for caching disk blocks
$ free -h
total used free shared buff/cache available
Mem: 15Gi 2.0Gi 8.0Gi 100Mi 5.0Gi 13Gi
"buff/cache" = OS cached disk blocks in memory
When you read file:
1. Check page cache - already in RAM? Return immediately
2. Not cached? Read from disk, store in cache, return to app
3. Future reads = instant (from cache)
When you write file:
1. Write to page cache (write-back cache)
2. Mark page "dirty"
3. Return to app (write appears complete)
4. Background flush daemon writes dirty pages to disk (every 30 sec default)
5. fsync() forces immediate flush
Why it matters: Workloads that fit in cache = RAM speed (~100GB/sec). Workloads that exceed cache = disk speed (~100MB/sec HDD, ~500MB/sec SSD). 1000x difference!
Problem: Path lookups expensive (/a/b/c/d/e/f/file.txt = 7 inode reads)
Solution: Cache directory entry lookups
"dentry cache" maps path → inode number
$ cat /proc/sys/fs/dentry-state
82749 72152 45 0 0 0
Shows cached dentry entries
Speeds up repeated access to same paths
Problem: Applications issue random reads/writes. Disk seeks are slow. How to optimize?
| Scheduler | Strategy | Best For |
|---|---|---|
| noop | FIFO queue, no reordering | SSDs, NVMe (no seek penalty) |
| deadline | Sort by sector, guarantee latency deadline | Databases, real-time |
| cfq | Completely Fair Queuing - per-process queues | General desktop/server (legacy default; removed in Linux 5.0) |
| bfq | Budget Fair Queuing - low latency, proportional I/O | Interactive workloads |
| mq-deadline | Multi-queue deadline (for NVMe) | Modern SSDs with multiple queues |
Check current scheduler:
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
Change scheduler (temporary):
$ echo bfq > /sys/block/sda/queue/scheduler
Ideal: File stored in contiguous blocks
┌────┬────┬────┬────┬────┬────┐
│ F1 │ F1 │ F1 │ F1 │ F1 │ │ Sequential read = fast (one seek)
└────┴────┴────┴────┴────┴────┘
Fragmented: File scattered across disk
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ F1 │ F2 │ F1 │ F2 │ F1 │ │ F1 │ F2 │ Random seeks = slow
└────┴────┴────┴────┴────┴────┴────┴────┘
Causes:
- Repeated create/delete cycles
- File growth (append to file, no contiguous space available)
- Filesystem nearly full
ext4: Delayed allocation
- Don't allocate blocks immediately on write
- Buffer writes in memory, allocate contiguous blocks when flush
- Reduces fragmentation significantly
XFS: Online defragmentation
$ xfs_fsr /mount/point (reorganizes files while filesystem mounted)
Btrfs: Manual defragmentation (or continuous via the autodefrag mount option)
$ btrfs filesystem defragment -r /mount/point
ext4: Online defragmentation
$ e4defrag /mount/point (runs on a mounted filesystem, with some performance impact)
write() → Page Cache → (eventual) Disk
read() → Page Cache (if cached) or Disk → Page Cache
Benefits: Fast (RAM speed for cached data)
Problem: No control over when data hits disk
int fd = open("file", O_DIRECT);
write(fd, buffer, size); // Bypasses page cache, writes directly to disk
Use cases:
- Databases (manage own caching, need guaranteed write-through)
- Avoid double-buffering (app cache + page cache)
Requirements:
- Buffer must be aligned (typically 512-byte or 4KB aligned)
- Size must be multiple of block size
O_SYNC: Every write() blocks until data on disk
int fd = open("file", O_SYNC);
write(fd, buf, 100); // Returns only after disk write complete
fsync(): Force flush for specific file
write(fd, buf, 100); // Returns immediately (buffered)
fsync(fd); // Blocks until all data/metadata on disk
fdatasync(): Like fsync, but doesn't flush metadata (faster)
Use case: Databases, crash consistency critical
Q: Why doesn't df show freed space after deleting a large file?
A: A process still has the file open. Deletion removes the file from the directory, but the inode and blocks aren't freed until all file descriptors are closed.
$ rm /var/log/huge.log (file deleted from directory)
$ df -h (space not freed!)
Reason: Log process still writing to file descriptor
Solution: Restart process or: > /var/log/huge.log (truncate)
Q: Why can't hard links point to directories?
A: They would create cycles in the directory tree, breaking tree invariants. Tools like find assume an acyclic graph. Symlinks are allowed because they can be detected and skipped.
Q: What actually happens during rm -rf /path/to/dir?
1. Recursively traverse directory tree (depth-first)
2. For each file: Decrement inode link count
3. If link count → 0 and no processes have file open: Free inode + data blocks
4. Remove directory entry
5. Remove directory when empty
Atomic? No. If crash mid-delete, partially deleted state.
This is why trash/recycle bins exist (move to .trash instead of delete)
Q: Why is mv sometimes instant and sometimes slow, and is it atomic?
A: A rename within one filesystem just updates a directory entry (a single atomic operation). A move across filesystems is copy + delete: not atomic, and it can fail mid-copy.