⚡ Latency Numbers Every Programmer Should Know


📊 Origin

Originally compiled by Jeff Dean (Google) around 2010, these numbers provide a mental model for understanding computer performance. While hardware evolves, the relative relationships between these operations remain remarkably stable.

Updated for 2024-2025: Numbers reflect modern hardware (DDR5 RAM, NVMe SSDs, 10+ Gbps networks).

The Numbers

Operation	Nanoseconds	Microseconds	Milliseconds	Scale
L1 cache reference	0.5 ns	—	—	⚡ Fastest
Branch mispredict	5 ns	—	—	⚡ Fastest
L2 cache reference	7 ns	—	—	⚡ Fastest
Mutex lock/unlock	25 ns	—	—	⚡ Fastest
Main memory reference	100 ns	—	—	💨 Fast
Compress 1KB with Zippy	3,000 ns	3 µs	—	💨 Fast
Send 1KB over 1 Gbps network	10,000 ns	10 µs	—	💨 Fast
Read 4KB randomly from SSD	150,000 ns	150 µs	—	⚠️ Medium
Read 1MB sequentially from memory	250,000 ns	250 µs	—	⚠️ Medium
Round trip within same datacenter	500,000 ns	500 µs	—	⚠️ Medium
Read 1MB sequentially from SSD	1,000,000 ns	1,000 µs	1 ms	⚠️ Medium
Disk seek (HDD)	10,000,000 ns	10,000 µs	10 ms	🐌 Slow
Read 1MB sequentially from disk	20,000,000 ns	20,000 µs	20 ms	🐌 Slow
Send packet CA→Netherlands→CA	150,000,000 ns	150,000 µs	150 ms	🐌 Slow

⚠️ Unit Conversions (Critical for Understanding Scale)

1 nanosecond (ns) = 10⁻⁹ seconds = 0.000000001 seconds
1 microsecond (µs) = 10⁻⁶ seconds = 1,000 ns
1 millisecond (ms) = 10⁻³ seconds = 1,000 µs = 1,000,000 ns
1 second = 1,000 ms = 1,000,000 µs = 1,000,000,000 ns
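
The conversions above can be wrapped in a small helper. This is a minimal sketch; the function name `format_latency` is an illustrative assumption, not part of the original page.

```python
def format_latency(ns: float) -> str:
    """Render a duration given in nanoseconds using the largest unit that fits."""
    units = [("s", 1_000_000_000), ("ms", 1_000_000), ("µs", 1_000), ("ns", 1)]
    for name, size in units:
        if ns >= size:
            return f"{ns / size:g} {name}"
    return f"{ns:g} ns"  # sub-nanosecond values such as 0.5 ns


for ns in (0.5, 100, 150_000, 10_000_000, 150_000_000):
    print(format_latency(ns))
```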

Visual Comparison

Logarithmic Scale Visualization

Because these operations span more than 8 orders of magnitude (0.5 ns to 150 ms = 300,000,000x difference!), we use a logarithmic scale to visualize them.

Relative Latency Bars

Visual comparison (log scale):

L1 cache reference (0.5 ns)

0.5 ns

Main memory reference (100 ns)

200x slower than L1

SSD random read (150 µs)

300,000x slower than L1

Disk seek (10 ms)

20,000,000x slower than L1

CA→Netherlands→CA (150 ms)

300,000,000x slower than L1!

Human-Scale Analogies

If 1 CPU Cycle = 1 Second

To make these numbers relatable, imagine that one CPU cycle, taken here as the 0.5 ns of an L1 cache reference, lasts 1 second of human time. Here's how long each operation would take:

⚡ L1 Cache

1 second

Reaching into your pocket

💨 RAM Access

3 minutes

Walking to the kitchen

⚠️ SSD Read

3.5 days

A long weekend trip

🐌 Disk Seek

7.6 months

Most of a pregnancy

🌍 Network (CA→EU)

9.5 years

College degree + grad school
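
The scaling behind these analogies is a single division. The sketch below assumes the 0.5 ns L1 cache reference is the 1-second baseline; `BASELINE_NS`, `human_scale_seconds`, and `describe` are illustrative names.

```python
BASELINE_NS = 0.5  # assumption: one L1 cache reference maps to 1 human second


def human_scale_seconds(latency_ns: float) -> float:
    """Scale a real latency so that BASELINE_NS becomes one second."""
    return latency_ns / BASELINE_NS


def describe(seconds: float) -> str:
    """Express a duration in the largest convenient unit."""
    for name, size in (("years", 365 * 86_400), ("days", 86_400), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"


for label, ns in [("RAM", 100), ("SSD read", 150_000),
                  ("Disk seek", 10_000_000), ("Network CA-EU", 150_000_000)]:
    print(label, "->", describe(human_scale_seconds(ns)))
```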

Design Implications

🎯 Key Takeaways for System Design

1. Memory Hierarchy Matters

L1 cache is 200x faster than RAM → Keep hot data in cache
RAM is 1,500x faster than SSD → Cache frequently accessed data in memory
SSD is ~67x faster than HDD for random access (150 µs vs 10 ms seek) → Never use spinning disks for latency-sensitive workloads

2. Sequential > Random

Reading 1MB sequentially from SSD: 1 ms
Reading 1MB randomly (256 × 4KB reads): 38 ms (38x slower!)
Design implication: Batch operations, use sequential I/O patterns

3. Network is SLOW

Same datacenter round trip: 500 µs
Cross-continent round trip: 150 ms (300x slower!)
Design implication: Minimize network round trips, use batching, CDNs, edge caching

4. Lock Contention

Mutex lock/unlock: 25 ns (seems fast)
But 50x slower than L1 cache access!
Design implication: Lock-free data structures, fine-grained locking, avoid hot locks
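
The arithmetic behind takeaways 2 and 3 can be reproduced directly from the table. This is a rough sketch: the constants come from the table, the function names are illustrative assumptions, and real devices will vary.

```python
# Latencies copied from the table, in microseconds.
SSD_RANDOM_4KB_US = 150
SSD_SEQUENTIAL_1MB_US = 1_000
DATACENTER_RTT_US = 500


def random_read_cost_us(total_kb: int, block_kb: int = 4) -> int:
    """Cost of reading total_kb as independent random block reads."""
    reads = total_kb // block_kb
    return reads * SSD_RANDOM_4KB_US


def round_trip_cost_ms(round_trips: int, rtt_us: int = DATACENTER_RTT_US) -> float:
    """Time spent waiting on the network for sequential round trips."""
    return round_trips * rtt_us / 1_000


print(random_read_cost_us(1024) / 1_000, "ms random vs",
      SSD_SEQUENTIAL_1MB_US / 1_000, "ms sequential")
print(round_trip_cost_ms(100), "ms for 100 queries vs",
      round_trip_cost_ms(1), "ms for one batched query")
```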

Real-World Application Examples

✅ Good: Redis Cache

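Memory-based: ~100 ns per memory reference on the server (remote clients still pay a network round trip, ~500 µs within a datacenter)
Roughly 10,000x faster than a 1 ms SSD read, before network overhead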
Perfect for session storage, counters

❌ Bad: N+1 Query Problem

100 separate DB queries over network
100 × 500 µs = 50 ms just in round trips!
Solution: Batch queries, use JOINs, eager loading

✅ Good: CDN for Static Assets

Reduces cross-continent latency (150 ms → 10 ms)
15x latency improvement
Critical for user experience

💡 Optimization: Database Indexes

Table scan: Read entire disk (20+ ms per MB)
Index lookup: Few SSD reads (1-2 ms)
10-100x speedup for large tables

Performance Budget Calculator

Example: API Response Time Budget (100 ms target)

Operation	Count	Each	Total	% of Budget
Network round trip (client ↔ server)	1	30 ms	30 ms	30%
Database queries (w/ network)	3	10 ms	30 ms	30%
Application logic	—	—	20 ms	20%
Redis cache lookups	5	2 ms	10 ms	10%
Response serialization (JSON)	1	10 ms	10 ms	10%
TOTAL	—	—	100 ms	100%

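Key insight: 60% of the budget goes to the client round trip and the database queries (70% if you count the Redis lookups), and most of that is network time! Optimize by reducing round trips, caching aggressively, and using connection pooling.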
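
A budget like this is easy to keep honest in code. The following is a minimal sketch that mirrors the table rows; `ITEMS`, `total_ms`, and `report` are hypothetical names chosen for illustration.

```python
BUDGET_MS = 100

# (operation, count, milliseconds each), mirroring the table rows.
ITEMS = [
    ("Network round trip (client <-> server)", 1, 30),
    ("Database queries (w/ network)", 3, 10),
    ("Application logic", 1, 20),
    ("Redis cache lookups", 5, 2),
    ("Response serialization (JSON)", 1, 10),
]


def total_ms(items) -> int:
    """Sum count x cost over every budget line."""
    return sum(count * each for _, count, each in items)


def report(items, budget_ms: int = BUDGET_MS) -> None:
    """Print each line's cost and its share of the budget."""
    for name, count, each in items:
        cost = count * each
        print(f"{name:40s} {cost:4d} ms  {100 * cost / budget_ms:5.1f}%")
    spent = total_ms(items)
    print(f"{'TOTAL':40s} {spent:4d} ms  ({budget_ms - spent} ms headroom)")


report(ITEMS)
```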

Storage Sizes & Units

📏 Understanding Digital Storage Units

Storage sizes and data transfer rates use different units that are often confused. Understanding the difference between bits vs bytes and decimal (SI) vs binary (IEC) prefixes is fundamental.

Bits vs Bytes

Unit	Symbol	Value	Typical Use Case
Bit	b	Binary digit (0 or 1)	Network speeds (Mbps, Gbps)
Byte	B	8 bits	Storage sizes (MB, GB, TB)
Nibble	—	4 bits (half byte)	Hexadecimal digit (0-F)

Key Point: 1 Byte = 8 bits

Common confusion: A "100 Mbps" connection transfers data at 100 megabits per second, which equals 12.5 MB/s (megabytes per second).

Conversion:
100 Mbps ÷ 8 = 12.5 MB/s

1 Gbps network = 125 MB/s max throughput
10 Gbps network = 1,250 MB/s = 1.25 GB/s
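
The same division by 8 as a one-line helper (a sketch; the function name is an assumption, and decimal prefixes are used throughout):

```python
def mbps_to_megabytes_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


for mbps in (100, 1_000, 10_000):
    print(f"{mbps} Mbps = {mbps_to_megabytes_per_s(mbps)} MB/s")
```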

Decimal (SI) vs Binary (IEC) Prefixes

⚠️ The Ambiguity Problem

Historically, "kilobyte" was used for both 1000 bytes (decimal) and 1024 bytes (binary). This caused confusion and misleading storage sizes.

Solution (IEC standard): Different prefixes for decimal vs binary.

Decimal (SI) - Base 10	Value	Binary (IEC) - Base 2	Value
Kilobyte (KB)	1,000 bytes (10³)	Kibibyte (KiB)	1,024 bytes (2¹⁰)
Megabyte (MB)	1,000,000 bytes (10⁶)	Mebibyte (MiB)	1,048,576 bytes (2²⁰)
Gigabyte (GB)	1,000,000,000 bytes (10⁹)	Gibibyte (GiB)	1,073,741,824 bytes (2³⁰)
Terabyte (TB)	1,000,000,000,000 bytes (10¹²)	Tebibyte (TiB)	1,099,511,627,776 bytes (2⁴⁰)
Petabyte (PB)	10¹⁵ bytes	Pebibyte (PiB)	2⁵⁰ bytes
Exabyte (EB)	10¹⁸ bytes	Exbibyte (EiB)	2⁶⁰ bytes

Why Your 1TB Drive Shows as 931GB

Hard drive manufacturers use decimal (SI) units: 1 TB = 1,000,000,000,000 bytes

Many operating systems (notably Windows) report sizes in binary (IEC) units, often still labeled "GB"/"TB": 1 TiB = 1,099,511,627,776 bytes

Calculation:
1 TB = 1,000,000,000,000 bytes
1 TiB = 1,099,511,627,776 bytes

1,000,000,000,000 ÷ 1,099,511,627,776 = 0.909 TiB
0.909 TiB × 1024 GiB/TiB = 931 GiB

Result: Your "1TB" drive appears as ~931 GiB in your OS
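
The calculation above, written as a small sketch (the function name `advertised_tb_to_gib` is illustrative):

```python
def advertised_tb_to_gib(terabytes: float) -> float:
    """Convert decimal terabytes (10^12 bytes) to binary gibibytes (2^30 bytes)."""
    return terabytes * 10**12 / 2**30


print(f"1 TB = {advertised_tb_to_gib(1):.1f} GiB")
```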

Practical Examples by Size

Size	Examples
1 Byte	Single ASCII character ('A', '7', '$')
2 Bytes	Single BMP character in UTF-16, 16-bit integer (-32,768 to 32,767)
4 Bytes	32-bit integer, IPv4 address, single float
8 Bytes	64-bit long, double-precision float, Unix timestamp
~1 KB	Small text file, short email (plain text)
~100 KB	Low-res photo, small favicon, email with attachments
~1 MB	High-quality photo (JPEG), 1 minute of MP3 audio, short e-book
~100 MB	Movie trailer (720p), mobile app, high-quality album
~1 GB	Standard definition movie, ~1 hour of 1080p video, large video game
~10 GB	HD movie (1080p), operating system install, AAA video game
~100 GB	4K movie, Call of Duty game install, large database backup
~1 TB	Modern laptop SSD, 200+ HD movies, massive game collection
~10 TB	Home NAS storage, professional video editing workstation
~1 PB	Small datacenter storage, major cloud service storage tier, large research dataset
~1 EB	Google/Facebook photo storage, entire internet archive snapshot, global weather data

Network Speeds: Why Bits?

Network speeds are measured in bits per second (bps), not bytes, for historical reasons:

Historical: Early telecom measured signal rate in bits
Physical layer: Networks transmit individual bits as electrical/optical signals
Encoding overhead: Not all bits carry data (framing, error correction)

Connection Type	Speed (bits)	Throughput (bytes)	Download 1GB File
Dial-up modem (ancient)	56 Kbps	7 KB/s	~40 hours
DSL / Cable (basic)	10 Mbps	1.25 MB/s	~13 minutes
Fast broadband	100 Mbps	12.5 MB/s	~80 seconds
Gigabit internet	1 Gbps	125 MB/s	~8 seconds
Datacenter link	10 Gbps	1.25 GB/s	< 1 second
High-speed datacenter	100 Gbps	12.5 GB/s	0.08 seconds

Interview gotcha: "How long to transfer 1TB over a 10 Gbps link?"

Calculation:
1 TB = 1,000 GB (decimal) = 8,000 Gb (gigabits)
10 Gbps link = 10 Gb per second

Time = 8,000 Gb ÷ 10 Gbps = 800 seconds = 13.3 minutes

Real world: Add ~10-20% overhead for TCP/IP headers, retransmissions
Actual time: ~15-16 minutes
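
The estimate generalizes to any size and link speed. This is a sketch under stated assumptions: decimal units, a perfectly utilized link, and protocol overhead modeled as a flat fraction (the `overhead` parameter is a simplification, not a measured value).

```python
def transfer_seconds(size_bytes: float, link_bits_per_s: float,
                     overhead: float = 0.0) -> float:
    """Ideal transfer time in seconds, inflated by a flat overhead fraction."""
    return size_bytes * 8 / link_bits_per_s * (1 + overhead)


one_tb = 10**12
ten_gbps = 10 * 10**9
print(transfer_seconds(one_tb, ten_gbps) / 60, "minutes (ideal)")
print(transfer_seconds(one_tb, ten_gbps, overhead=0.15) / 60, "minutes (15% overhead)")
```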

Number Systems & Character Encoding

🔢 Why Multiple Number Systems?

Computers work in binary (base-2), but humans prefer decimal (base-10). Hexadecimal (base-16) provides a compact way to represent binary. Understanding these systems is fundamental to systems programming, debugging, and understanding how data is stored.

Binary (Base-2)

Fundamentals

Definition: Number system using only two digits: 0 and 1

Why computers use binary: Electronic circuits have two states - on (1) or off (0). Voltage high = 1, voltage low = 0.

Decimal	Binary	Calculation
0	0000	0
1	0001	1
2	0010	2
3	0011	2 + 1 = 3
4	0100	4
5	0101	4 + 1 = 5
10	1010	8 + 2 = 10
15	1111	8 + 4 + 2 + 1 = 15
255	11111111	128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 255

Binary Place Values

8-bit binary number: 1 0 1 1 0 1 0 1

Place values:  128  64  32  16   8   4   2   1
Binary digit:    1   0   1   1   0   1   0   1
               ───────────────────────────────────
Calculation:   128 + 0 + 32 + 16 + 0 + 4 + 0 + 1 = 181

Formula: Each position = 2^n (where n starts at 0 from right)
Position 0 (rightmost): 2^0 = 1
Position 1: 2^1 = 2
Position 2: 2^2 = 4
Position 3: 2^3 = 8
...
Position 7: 2^7 = 128
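
The place-value formula translates directly into code. This sketch is for illustration only (Python's built-in `int(bits, 2)` does the same job); the function name is an assumption.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum 2^position for every '1', counting positions from the right."""
    total = 0
    for position, digit in enumerate(reversed(bits)):
        if digit not in "01":
            raise ValueError(f"not a binary digit: {digit!r}")
        if digit == "1":
            total += 2 ** position
    return total


print(binary_to_decimal("10110101"))  # the 8-bit example above
```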

Common Binary Patterns

Powers of 2 (important for memory sizes):
2^0  = 1
2^1  = 2
2^2  = 4
2^3  = 8
2^4  = 16
2^5  = 32
2^6  = 64
2^7  = 128
2^8  = 256
2^10 = 1,024      (1 KiB)
2^16 = 65,536     (64 KiB)
2^20 = 1,048,576  (1 MiB)
2^30 = 1,073,741,824  (1 GiB)

Maximum values for N bits:
4 bits  = 0000 to 1111 = 0 to 15
8 bits  = 00000000 to 11111111 = 0 to 255
16 bits = 0 to 65,535
32 bits = 0 to 4,294,967,295

Signed integers (two's complement):
8-bit signed:  -128 to +127
16-bit signed: -32,768 to +32,767
32-bit signed: -2,147,483,648 to +2,147,483,647

Bitwise Operations

AND (&): Both bits must be 1
  1010 (10)
& 1100 (12)
  ----
  1000 (8)

OR (|): At least one bit is 1
  1010 (10)
| 1100 (12)
  ----
  1110 (14)

XOR (^): Bits are different
  1010 (10)
^ 1100 (12)
  ----
  0110 (6)

NOT (~): Flip all bits
~ 1010 (10)
  ----
  0101 (5)  [assuming 4-bit]

Left shift (<<): Multiply by 2
5 << 1  →  0101 << 1 = 1010  (10)  [5 × 2]
5 << 2  →  0101 << 2 = 10100 (20)  [5 × 4]

Right shift (>>): Divide by 2
20 >> 1  →  10100 >> 1 = 1010 (10)  [20 ÷ 2]
20 >> 2  →  10100 >> 2 = 0101 (5)   [20 ÷ 4]

Common uses:
- Check if number is even: (n & 1) == 0
- Check if number is power of 2: n > 0 && (n & (n-1)) == 0  (the n > 0 guard excludes 0)
- Set bit: n |= (1 << i)
- Clear bit: n &= ~(1 << i)
- Toggle bit: n ^= (1 << i)
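
The common uses above, sketched in Python. One assumption worth flagging: Python integers have unlimited width, so `~n` yields a negative number, and the 4-bit NOT example needs an explicit mask. The helper names are illustrative.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; the n > 0 guard excludes zero."""
    return n > 0 and (n & (n - 1)) == 0


def set_bit(n: int, i: int) -> int:
    return n | (1 << i)


def clear_bit(n: int, i: int) -> int:
    return n & ~(1 << i)


def toggle_bit(n: int, i: int) -> int:
    return n ^ (1 << i)


def not_4bit(n: int) -> int:
    """Bitwise NOT restricted to 4 bits by masking with 0xF."""
    return ~n & 0xF


print(format(not_4bit(0b1010), "04b"))
print(5 << 2, 20 >> 2)
```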

Hexadecimal (Base-16)

Fundamentals

Definition: Number system using 16 digits: 0-9 and A-F

Why hexadecimal? Compact representation of binary. One hex digit = exactly 4 bits (nibble).

Decimal	Binary	Hexadecimal
0	0000	0
1	0001	1
2	0010	2
3	0011	3
4	0100	4
5	0101	5
6	0110	6
7	0111	7
8	1000	8
9	1001	9
10	1010	A
11	1011	B
12	1100	C
13	1101	D
14	1110	E
15	1111	F
255	11111111	FF

Converting Between Hex and Binary

Hex to Binary (easy - just expand each digit):
0xCAFE
= C    A    F    E
= 1100 1010 1111 1110
= 1100101011111110

Binary to Hex (easy - group by 4 bits from right):
10110101
= 1011 0101  (group by 4)
= B    5
= 0xB5

Hex to Decimal:
0x2F = (2 × 16^1) + (15 × 16^0) = 32 + 15 = 47
0x1A3 = (1 × 16^2) + (10 × 16^1) + (3 × 16^0) = 256 + 160 + 3 = 419

Decimal to Hex (divide by 16, track remainders):
255 ÷ 16 = 15 remainder 15 (F)
15 ÷ 16  = 0  remainder 15 (F)
Result: 0xFF
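
The divide-by-16 procedure can be sketched as follows. In practice Python's built-in `hex()` and `int(text, 16)` cover this; `to_hex` is an illustrative name.

```python
HEX_DIGITS = "0123456789ABCDEF"


def to_hex(n: int) -> str:
    """Convert a non-negative integer to hex by repeated division by 16."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0x0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 16)
        digits.append(HEX_DIGITS[remainder])  # remainders come out lowest digit first
    return "0x" + "".join(reversed(digits))


print(to_hex(255), to_hex(47), to_hex(419))
print(format(0xCAFE, "016b"))  # hex to binary, 4 bits per digit
```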

Common Hex Patterns in Programming

Memory addresses:
0x00007fff5fbff8a0  (64-bit pointer)
0xdeadbeef          (common debug value)
0x00000000          (NULL pointer)

Colors (RGB):
#FFFFFF = white  (255, 255, 255)
#000000 = black  (0, 0, 0)
#FF0000 = red    (255, 0, 0)
#00FF00 = green  (0, 255, 0)
#0000FF = blue   (0, 0, 255)
#CAFE01 = custom (202, 254, 1)

Byte values:
0x00 = 0
0xFF = 255 (max for 1 byte)
0x7F = 127 (max positive for signed byte)
0x80 = 128 or -128 (signed)

Bitmasks:
0x0F = 00001111 (mask lower 4 bits)
0xF0 = 11110000 (mask upper 4 bits)
0xFF = 11111111 (all bits set)
0x00 = 00000000 (all bits clear)

Permissions (Unix, written in octal rather than hex):
0644 = rw-r--r--  (owner: rw, group: r, other: r)
0755 = rwxr-xr-x  (owner: rwx, group: rx, other: rx)
0777 = rwxrwxrwx  (all permissions)

Why Hex is Used

Compact: 2 hex digits = 1 byte (vs 8 binary digits)
Easy conversion: Direct mapping to/from binary (4 bits per hex digit)
Readability: Memory dumps, MAC addresses, colors easier to read than binary
Alignment: Byte boundaries clear (every 2 hex digits = 1 byte)

ASCII (American Standard Code for Information Interchange)

Fundamentals

Definition: 7-bit character encoding standard (0-127), representing English characters, digits, and control codes.

History: Developed in 1963 for teleprinters and early computers. Still the foundation of modern text encoding.

Range	Decimal	Hex	Description	Examples
Control characters	0-31	0x00-0x1F	Non-printable control codes	NUL(0), TAB(9), LF(10), CR(13), ESC(27)
Space & symbols	32-47	0x20-0x2F	Space, punctuation	Space(32), !(33), "(34), #(35)
Digits	48-57	0x30-0x39	0-9	'0'(48), '5'(53), '9'(57)
More symbols	58-64	0x3A-0x40	Punctuation	:(58), @(64)
Uppercase letters	65-90	0x41-0x5A	A-Z	'A'(65), 'M'(77), 'Z'(90)
More symbols	91-96	0x5B-0x60	Brackets, etc.	[(91), \(92), ](93)
Lowercase letters	97-122	0x61-0x7A	a-z	'a'(97), 'm'(109), 'z'(122)
More symbols	123-126	0x7B-0x7E	Braces, etc.	{(123), }(125), ~(126)
Delete	127	0x7F	Delete control character	DEL(127)

Important ASCII Codes to Remember

Control characters:
0x00 (0)   = NUL (null terminator in C strings)
0x09 (9)   = TAB (horizontal tab)
0x0A (10)  = LF  (line feed, '\n' on Unix)
0x0D (13)  = CR  (carriage return, '\r')
0x1B (27)  = ESC (escape, used in terminal codes)
0x20 (32)  = SPACE

Digits '0'-'9':
'0' = 48 (0x30)
'1' = 49 (0x31)
...
'9' = 57 (0x39)

Uppercase 'A'-'Z':
'A' = 65 (0x41)
'B' = 66 (0x42)
...
'Z' = 90 (0x5A)

Lowercase 'a'-'z':
'a' = 97 (0x61)
'b' = 98 (0x62)
...
'z' = 122 (0x7A)

Useful patterns:
Uppercase to lowercase: Add 32  ('A' + 32 = 'a')
Lowercase to uppercase: Subtract 32  ('a' - 32 = 'A')
Difference: 0x20 (32) exactly

Digit to integer: Subtract '0'  ('5' - '0' = 5)
Integer to digit: Add '0'  (5 + '0' = '5')
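
These patterns map onto `ord()` and `chr()` in Python. A small sketch, valid for ASCII only; the helper names are assumptions made for illustration.

```python
def to_lower_ascii(ch: str) -> str:
    """Lowercase an ASCII letter by adding 32 (0x20) to its code."""
    return chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch


def to_upper_ascii(ch: str) -> str:
    """Uppercase an ASCII letter by subtracting 32 (0x20) from its code."""
    return chr(ord(ch) - 32) if "a" <= ch <= "z" else ch


def digit_value(ch: str) -> int:
    """Convert '0'..'9' to 0..9 by subtracting the code of '0'."""
    if not "0" <= ch <= "9":
        raise ValueError(f"not an ASCII digit: {ch!r}")
    return ord(ch) - ord("0")


print(to_lower_ascii("A"), to_upper_ascii("a"), digit_value("5"), chr(5 + ord("0")))
```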

Extended ASCII (8-bit, 0-255)

Problem: Standard ASCII (7-bit) only covers English. Many 8-bit extensions created for other languages.

Latin-1 (ISO-8859-1): Western European languages (128-255: àáâãäå, èéêë, etc.)
Windows-1252: Similar to Latin-1 with some changes (common in Windows)
Problem: Different extensions incompatible → Unicode created to solve this

Line Endings (Newlines)

Different systems use different line ending conventions:

Unix/Linux/macOS:   LF   (0x0A, '\n')
Windows:            CRLF (0x0D 0x0A, '\r\n')
Old Mac (pre-OSX):  CR   (0x0D, '\r')

Why this causes problems:
- Text file created on Windows opened on Linux shows ^M characters
- Git can auto-convert (core.autocrlf setting)
- Always specify line endings in .gitattributes for consistency
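
When text arrives with mixed conventions, normalizing to LF is a common first step. A minimal sketch (the function name is illustrative):

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    # Order matters: handle CRLF first so its CR is not turned into a second LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_newlines("windows\r\nold mac\runix\n")))
```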

Unicode: The Universal Character Set

The Problem ASCII/Extended ASCII Couldn't Solve

Issue: ASCII (128 chars) and Extended ASCII (256 chars) can't represent:

Multiple languages simultaneously (Chinese + Arabic + Cyrillic)
Emoji 😀🎉🚀
Mathematical symbols (∑∫∂√)
Thousands of CJK (Chinese, Japanese, Korean) characters

Solution: Unicode - Universal character set supporting 1,114,112 possible characters (code points U+0000 to U+10FFFF)

Unicode Concepts

Code Point: Abstract number assigned to each character
- Written as U+xxxx (hexadecimal)
- Examples:
  'A' = U+0041
  '€' = U+20AC (Euro sign)
  '中' = U+4E2D (Chinese character)
  '😀' = U+1F600 (Grinning face emoji)

Encoding: How code points are stored as bytes
Unicode defines the characters, encodings define byte representation

UTF-8: The Dominant Encoding

Why UTF-8 won:

Backward compatible with ASCII: ASCII characters (U+0000 to U+007F) = same 1-byte encoding
Variable length: 1-4 bytes per character (efficient for mostly-English text)
Self-synchronizing: Character boundaries can be found from any byte position, because continuation bytes always start with 10
No byte-order issues: Unlike UTF-16

UTF-8 Encoding Rules

1 byte (ASCII): U+0000 to U+007F
Pattern: 0xxxxxxx
Example: 'A' = U+0041 = 01000001 = 0x41 (1 byte)

2 bytes: U+0080 to U+07FF
Pattern: 110xxxxx 10xxxxxx
Example: '©' = U+00A9 = 11000010 10101001 = 0xC2 0xA9 (2 bytes)

3 bytes: U+0800 to U+FFFF
Pattern: 1110xxxx 10xxxxxx 10xxxxxx
Example: '€' = U+20AC = 11100010 10000010 10101100 = 0xE2 0x82 0xAC (3 bytes)

4 bytes: U+10000 to U+10FFFF
Pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Example: '😀' = U+1F600 = 11110000 10011111 10011000 10000000 = 0xF0 0x9F 0x98 0x80 (4 bytes)

Key insight: Leading byte tells you character length
0xxxxxxx   = 1 byte (ASCII)
110xxxxx   = 2 bytes
1110xxxx   = 3 bytes
11110xxx   = 4 bytes
10xxxxxx   = Continuation byte (never starts a character)
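
To make the bit patterns concrete, here is a hand-rolled encoder following the four rules above. It is a teaching sketch only; real code should call `str.encode("utf-8")`, which the last lines use as a cross-check. The function name `utf8_encode` is an assumption.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the four patterns."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point < 0x80:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 11110xxx + three continuation bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ("A", "\u00a9", "\u20ac", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # cross-check against the built-in codec
    print(f"U+{ord(ch):04X} ->", encoded.hex(" "))
```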

UTF-16 and UTF-32

Encoding	Bytes per Char	Pros	Cons	Used By
UTF-8	1-4 bytes (variable)	ASCII compatible, space efficient, no BOM needed	Variable length complicates indexing	Web, Linux, most programming
UTF-16	2 or 4 bytes	Fixed width for most chars (BMP)	Not ASCII compatible, byte order issues (BOM)	Windows, Java, JavaScript internally
UTF-32	4 bytes (fixed)	Fixed width, easy indexing	Space inefficient (4× ASCII size)	Rare, some internal processing

Common Unicode Pitfalls

1. String length is ambiguous:
"café" in UTF-8 = 5 bytes (c=1, a=1, f=1, é=2)
= 4 Unicode code points (when é is the precomposed U+00E9)
= 4 "characters" to humans

JavaScript: "café".length → 4 (counts UTF-16 code units)
Python: len("café") → 4 (counts code points)
Bytes: len("café".encode('utf-8')) → 5 (bytes)

2. Emoji are complex:
"👨‍👩‍👧‍👦" (family) = 7 code points:
  👨 (man) + ZWJ + 👩 (woman) + ZWJ + 👧 (girl) + ZWJ + 👦 (boy)
  (ZWJ = Zero Width Joiner, U+200D)

"👍" (thumbs up) = 1 code point (U+1F44D)
"👍🏿" (dark skin tone) = 2 code points (base + modifier)

3. Visual vs code point length:
é can be:
- 1 code point: U+00E9 (precomposed "é")
- 2 code points: U+0065 U+0301 (e + combining acute accent)
Both look identical, different byte representations!
→ Use Unicode normalization (NFC, NFD) to standardize

4. Case folding isn't simple:
German: "ß".toUpperCase() → "SS" (1 char becomes 2!)
Turkish: "i".toLocaleUpperCase("tr") → "İ" (dotted I)
         "I".toLocaleLowerCase("tr") → "ı" (dotless i)
→ Use locale-aware case conversion

5. Sorting/collation is locale-dependent:
Swedish: ä comes after z
German: ä sorts like a
→ Use ICU library or locale-aware sorting
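
Pitfalls 1 to 3 can be observed with the standard library alone. The sketch below uses escape sequences so the exact code points are unambiguous; the helper names are illustrative assumptions.

```python
import unicodedata


def utf8_bytes(text: str) -> int:
    return len(text.encode("utf-8"))


def utf16_units(text: str) -> int:
    # Matches what JavaScript's .length counts; "-le" avoids adding a BOM.
    return len(text.encode("utf-16-le")) // 2


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"      # precomposed é
decomposed = "cafe\u0301"   # e + combining acute accent
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(composed), len(decomposed))          # code points differ
print(utf8_bytes(composed), utf8_bytes(decomposed))
print(composed == decomposed, same_text(composed, decomposed))
print(len(family), utf16_units(family), utf8_bytes(family))
```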

Best Practices

Always use UTF-8 for text storage and transmission (unless you have a specific reason not to)
Specify encoding explicitly: HTTP headers, file metadata, database columns
Don't assume: 1 character = 1 byte = 1 code point = 1 visual glyph
Normalize: Use NFC normalization for comparison/storage
Test with non-ASCII: "Iñtërnâtiônàlizætiøn" and emoji in your test data
Libraries: Use ICU for proper Unicode handling (collation, normalization, case folding)

Regular Expressions (Regex)

🔍 What Are Regular Expressions?

Definition: Regular expressions are patterns used to match character combinations in strings. They're a powerful tool for text searching, validation, parsing, and manipulation.

Why they matter: Used everywhere - text editors, log parsing, input validation, data extraction, search & replace, URL routing, lexical analysis, and more.

Trade-off: Powerful but can be cryptic. The joke: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

Basic Syntax

Literal Characters

Most characters match themselves:
Pattern: cat
Matches: "cat" in "The cat sat"
Doesn't match: "Cat" (case-sensitive by default)

Pattern: hello world
Matches: "hello world" exactly

Metacharacters (Special Characters)

Character	Meaning	Example
`.`	Any single character (except newline)	`c.t` matches "cat", "cot", "c9t"
`^`	Start of string/line	`^cat` matches "cat" only at start
`$`	End of string/line	`cat$` matches "cat" only at end
`*`	0 or more times	`ca*t` matches "ct", "cat", "caaat"
`+`	1 or more times	`ca+t` matches "cat", "caat" (not "ct")
`?`	0 or 1 time (optional)	`colou?r` matches "color" or "colour"
`|`	OR (alternation)	`cat|dog` matches "cat" or "dog"
`( )`	Grouping and capturing	`(ca)+t` matches "cat", "cacat"
`[ ]`	Character class	`[aeiou]` matches any vowel
`\`	Escape special character	`\.` matches literal "."

To match metacharacters literally, escape them:
Pattern: \$\d+\.\d{2}
Matches: "$19.99" (dollar amount)
         $ escaped with \$
         . escaped with \.

Character Classes

Predefined Character Classes

Class	Equivalent	Matches
`\d`	`[0-9]`	Any digit
`\D`	`[^0-9]`	Any non-digit
`\w`	`[a-zA-Z0-9_]`	Word character (letter, digit, underscore)
`\W`	`[^a-zA-Z0-9_]`	Non-word character
`\s`	`[ \t\n\r\f\v]`	Whitespace (space, tab, newline, etc.)
`\S`	`[^ \t\n\r\f\v]`	Non-whitespace

Custom Character Classes

Square brackets [ ] define character sets:

[aeiou]        - Matches any single vowel
[0-9]          - Matches any digit (same as \d)
[a-z]          - Matches lowercase letter
[A-Z]          - Matches uppercase letter
[a-zA-Z]       - Matches any letter
[a-z0-9]       - Matches letter or digit
[^aeiou]       - Matches any character EXCEPT vowels (^ negates)
[0-9a-fA-F]    - Matches hexadecimal digit

Special characters lose meaning inside [ ]:
[.]            - Matches literal dot (no need to escape)
[*+?]          - Matches literal *, +, or ?
[a-z-]         - Matches a-z or hyphen (hyphen at end)
[-a-z]         - Matches hyphen or a-z (hyphen at start)
[a\-z]         - Matches a, hyphen, or z (hyphen escaped)

Quantifiers

Quantifier	Meaning	Example	Matches
`*`	0 or more	`ab*c`	"ac", "abc", "abbc", "abbbc"
`+`	1 or more	`ab+c`	"abc", "abbc" (not "ac")
`?`	0 or 1	`ab?c`	"ac" or "abc"
`{n}`	Exactly n times	`\d{3}`	"123", "999" (exactly 3 digits)
`{n,}`	n or more times	`\d{3,}`	"123", "1234", "12345"
`{n,m}`	Between n and m times	`\d{3,5}`	"123", "1234", "12345" (not "12" or "123456")

Greedy vs Lazy (Non-Greedy) Matching

Greedy (default): Match as much as possible
Pattern: <.*>
String: <b>bold</b> and <i>italic</i>
Matches: "<b>bold</b> and <i>italic</i>"  (entire string!)

Lazy (non-greedy): Match as little as possible (add ?)
Pattern: <.*?>
String: <b>bold</b> and <i>italic</i>
Matches: "<b>", "</b>", "<i>", "</i>"  (each tag separately)

Lazy quantifiers:
*?    - 0 or more (lazy)
+?    - 1 or more (lazy)
??    - 0 or 1 (lazy)
{n,}? - n or more (lazy)
{n,m}? - between n and m (lazy)

Example: Extract quoted strings
Greedy:  ".*"   on 'He said "hello" and "goodbye"' → '"hello" and "goodbye"'
Lazy:    ".*?"  on 'He said "hello" and "goodbye"' → '"hello"' and '"goodbye"'
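
The difference is easy to see with Python's `re` module. A short sketch; the sample string with `<b>` and `<i>` tags is an assumed example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"
sentence = 'He said "hello" and "goodbye"'

greedy_tags = re.findall(r"<.*>", html)    # one match spanning the whole string
lazy_tags = re.findall(r"<.*?>", html)     # one match per tag

greedy_quotes = re.findall(r'".*"', sentence)
lazy_quotes = re.findall(r'".*?"', sentence)

print(greedy_tags)
print(lazy_tags)
print(greedy_quotes)
print(lazy_quotes)
```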

Anchors & Boundaries

Anchor	Meaning	Example
`^`	Start of string/line	`^Hello` matches "Hello world" (not "Say Hello")
`$`	End of string/line	`world$` matches "Hello world" (not "world peace")
`\b`	Word boundary	`\bcat\b` matches "cat" (not "category" or "scat")
`\B`	Non-word boundary	`\Bcat\B` matches "cat" in "scattered"
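`\A`	Start of string only	Like ^ but never matches after a newline, even in multiline mode
`\Z`	End of string only	Like $ but unaffected by multiline mode (Python: absolute end of string; PCRE/Java/.NET: still allows one final newline, use `\z` for the absolute end)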

Word boundaries (\b) are crucial for exact matches:

Pattern: cat
String: "cat category scat"
Matches: ALL occurrences (cat in all three words)

Pattern: \bcat\b
String: "cat category scat"
Matches: Only "cat" (standalone word)

Validate entire string (common for input validation):
Pattern: ^\d{3}-\d{2}-\d{4}$
Matches: "123-45-6789" (entire string is SSN format)
Rejects: "My SSN is 123-45-6789" (extra text)

Groups & Capturing

Capturing Groups ( )

Parentheses create capturing groups:

Pattern: (\d{3})-(\d{2})-(\d{4})
String: "123-45-6789"
Capture groups:
  Group 0 (entire match): "123-45-6789"
  Group 1: "123"
  Group 2: "45"
  Group 3: "6789"

Use in replacement:
Pattern: (\w+)\s+(\w+)
String: "John Doe"
Replace with: $2, $1  (or \2, \1 in some flavors)
Result: "Doe, John"

Backreferences (refer to captured groups):
Pattern: \b(\w+)\s+\1\b
Matches: Repeated words like "the the" or "is is"
         \1 refers back to whatever Group 1 matched

Pattern: <(\w+)>.*?</\1>
Matches: <b>text</b> or <i>text</i>
         Ensures closing tag matches opening tag
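
In Python, replacement strings use `\1`, `\2` rather than `$1`, `$2`. The sketch below applies the name-swap and repeated-word patterns from this section; the function names are illustrative.

```python
import re


def swap_names(full_name: str) -> str:
    """Turn 'First Last' into 'Last, First' using two capture groups."""
    return re.sub(r"(\w+)\s+(\w+)", r"\2, \1", full_name)


def repeated_words(text: str) -> list:
    """Return each word that is immediately repeated, via the \\1 backreference."""
    return re.findall(r"\b(\w+)\s+\1\b", text)


print(swap_names("John Doe"))
print(repeated_words("This is is the the test"))

match = re.fullmatch(r"(\d{3})-(\d{2})-(\d{4})", "123-45-6789")
print(match.group(0), match.groups())
```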

Non-Capturing Groups (?:...)

Use (?:...) when you need grouping but not capturing:

Pattern: (?:https?|ftp)://(\S+)
Matches: URLs with http, https, or ftp
Only captures the domain/path part (not the protocol)

Why use non-capturing groups?
- Performance: Capturing has overhead
- Clarity: Numbered groups ($1, $2) stay simple
- Necessity: Some regex engines limit number of capture groups

Named Capturing Groups

Python/PCRE syntax: (?P<name>...)
JavaScript/Java/.NET syntax: (?<name>...)

Pattern: (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
String: "2025-01-15"
Captures:
  year: "2025"
  month: "01"
  day: "15"

# Python example:
import re
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', '2025-01-15')
print(match.group('year'))  # "2025"

Lookahead & Lookbehind

Type	Syntax	Meaning	Example
Positive Lookahead	`(?=...)`	Followed by pattern	`\d+(?= dollars)` matches number before " dollars"
Negative Lookahead	`(?!...)`	NOT followed by pattern	`\d+(?! dollars)` matches numbers not before " dollars"
Positive Lookbehind	`(?<=...)`	Preceded by pattern	`(?<=\$)\d+` matches number after "$"
Negative Lookbehind	`(?<!...)`	NOT preceded by pattern	`(?<!\$)\d+` matches numbers not after "$"

Lookaheads/lookbehinds are zero-width (don't consume characters):

Pattern: \w+(?=\.)
String: "Hello. World. Test."
Matches: "Hello", "World", "Test" (dots not included in match)

Complex example: Password validation
Requirement: 8+ chars, at least 1 uppercase, 1 lowercase, 1 digit

Pattern: ^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$
         (?=.*[A-Z])  - Must contain uppercase
         (?=.*[a-z])  - Must contain lowercase
         (?=.*\d)     - Must contain digit
         .{8,}        - At least 8 characters

Matches: "Password1" ✓
Rejects: "password" (no uppercase or digit) ✗
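
The password pattern above can be exercised as follows. This sketch uses `fullmatch` so a trailing newline cannot slip past `$`; the names `PASSWORD_RE` and `is_valid_password` are assumptions.

```python
import re

# Three zero-width lookaheads check the requirements, then .{8,} consumes the text.
PASSWORD_RE = re.compile(r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$")


def is_valid_password(candidate: str) -> bool:
    return PASSWORD_RE.fullmatch(candidate) is not None


for candidate in ("Password1", "password", "PASSWORD1", "Pass1"):
    print(candidate, is_valid_password(candidate))
```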

Common Patterns

Email (basic):
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Note: Email validation is complex; this catches most common formats

URL/URI:
https?://[^\s]+
Or more strict:
^https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b

Phone (US):
^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$
Matches: (123) 456-7890, 123-456-7890, 123.456.7890, 1234567890

IP Address (IPv4):
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
Validates: 0.0.0.0 to 255.255.255.255

Date (YYYY-MM-DD):
^\d{4}-\d{2}-\d{2}$
Basic: Matches format, not validity (allows 2025-99-99)
Strict: ^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$

Credit Card (format only, no Luhn check):
^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$

Hex Color:
^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$
Matches: #FFFFFF, #FFF, FFFFFF, FFF

Username (alphanumeric + underscore, 3-16 chars):
^[a-zA-Z0-9_]{3,16}$

Strong Password (8+ chars, upper, lower, digit, special):
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$

Extract all words:
\b\w+\b

Extract numbers (including decimals):
\d+\.?\d*
Or: -?\d+(?:\.\d+)?  (includes negative numbers)

HTML tags:
<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)
Note: HTML is not regular; use proper HTML parser for production

Remove extra whitespace:
\s+
Replace with single space

Trim whitespace:
^\s+|\s+$
Replace with empty string
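
A few of these patterns, applied with Python's `re` module. This is a sketch rather than a vetted validation library: the helper names are illustrative, and `fullmatch` is used in place of the `^...$` anchors.

```python
import re

IPV4_RE = re.compile(
    r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
)


def is_ipv4(text: str) -> bool:
    return IPV4_RE.fullmatch(text) is not None


def squeeze_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text)


def trim(text: str) -> str:
    """Strip leading and trailing whitespace (same effect as str.strip())."""
    return re.sub(r"^\s+|\s+$", "", text)


def extract_numbers(text: str) -> list:
    """Find integers and decimals, including negative values."""
    return re.findall(r"-?\d+(?:\.\d+)?", text)


print(is_ipv4("192.168.1.1"), is_ipv4("256.1.1.1"))
print(trim(squeeze_whitespace("  too   many \t spaces  ")))
print(extract_numbers("a -3.5 b 42 c 7."))
```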

Regex Flavors & Differences

Feature	PCRE (Perl)	JavaScript	Python	Java
Named groups	`(?<name>...)`	`(?<name>...)` (ES2018+)	`(?P<name>...)`	`(?<name>...)`
Lookbehind	✓ Fixed length per alternative	✓ Variable length (ES2018+)	✓ Fixed length only	✓ Bounded length only
Backreferences	`\1`	`\1` or `$1`	`\1`	`\1` or `$1`
Flags	`/pattern/i`	`/pattern/gi`	`re.I, re.M`	`Pattern.CASE_INSENSITIVE`
Unicode support	✓ Full	✓ Limited (better in ES2015+)	✓ Full	✓ Full

Common Flags/Modifiers

Flags modify how regex behaves:

i - Case insensitive
    /cat/i matches "cat", "Cat", "CAT"

g - Global (find all matches, not just first)
    /cat/g finds all "cat" in string

m - Multiline (^ and $ match line breaks)
    Without: ^cat$ matches only when the entire string is "cat"
    With:    ^cat$ matches any line that is exactly "cat"

s - Dotall (. matches newline)
    Without: . matches everything except \n
    With:    . matches everything including \n

x - Extended/verbose (ignore whitespace, allow comments)
    Useful for complex patterns with documentation

u - Unicode (JavaScript, treat pattern as Unicode)
    /\u{1F600}/u matches 😀

JavaScript example:
const regex = /cat/gi;  // Case insensitive + global
"Cat dog cat CAT".match(regex);  // ["Cat", "cat", "CAT"]

Python example:
import re
re.findall(r'cat', 'Cat dog cat CAT', re.IGNORECASE)  # ['Cat', 'cat', 'CAT']

Best Practices & Performance

Performance Pitfalls

1. Catastrophic Backtracking

Bad: (a+)+b
String: "aaaaaaaaaaaaaaaaaac" (no 'b' at end)
Result: Exponential backtracking, can hang for seconds!

Why: Regex engine tries every combination of how to split 'a's between inner and outer +

Solution: Use possessive/atomic groups or avoid nested quantifiers
Better: a+b (no nesting)

2. Greedy Quantifiers on Large Text

Bad: .*keyword
On large text, .* matches everything, then backtracks slowly

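Better: .*?keyword (lazy; stops at the first occurrence instead of the last)
Or: [^k]*keyword (consumes only non-"k" characters before the keyword; not equivalent if an earlier "k" appears in the text)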

3. Unnecessary Captures

Bad: (https?)://(www\.)?([a-z]+)\.com
Creates 3 capture groups you might not need

Better: (?:https?)://(?:www\.)?([a-z]+)\.com
Use (?:...) for non-capturing groups

Best Practices

Be specific: Use [0-9] instead of . when you know it's a digit
Anchor when possible: ^ and $ prevent unnecessary scanning
Use non-capturing groups: (?:...) when you don't need the capture
Avoid nested quantifiers: (a+)+ can cause exponential backtracking
Test edge cases: Empty strings, very long strings, special characters
Comment complex regex: Use verbose mode or code comments
Don't parse HTML/XML with regex: Use proper parsers (regex can't handle nested structures)
Validate, don't rely solely on regex: Regex can validate format, but not business logic

Testing & Debugging

Online tools:
- regex101.com (debugger with explanation, best for learning)
- regexr.com (visual, interactive)
- regexpal.com (simple tester)

In code:
# Python
import re
pattern = r'\d{3}-\d{2}-\d{4}'
test_cases = ['123-45-6789', 'invalid', '123456789']
for test in test_cases:
    match = re.fullmatch(pattern, test)  # fullmatch: the whole string must match, not just a prefix
    print(f"{test}: {'✓' if match else '✗'}")

// JavaScript
const pattern = /^\d{3}-\d{2}-\d{4}$/;  // anchored so the whole string must match
['123-45-6789', 'invalid'].forEach(test => {
    console.log(`${test}: ${pattern.test(test) ? '✓' : '✗'}`);
});

When NOT to Use Regex

Parsing nested structures: HTML, XML, JSON (use proper parsers)
Complex email validation: RFC 5322 is too complex for regex
Validating credit cards: Format check OK, but use Luhn algorithm for validity
Simple string operations: str.contains() or str.startsWith() are clearer and faster
Performance-critical code: String methods are usually faster for simple tasks
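
As an example of pairing a format check with real validation, here is a sketch of the Luhn checksum mentioned above. It only verifies the check digit; it says nothing about whether a card exists. The function name is illustrative, and the sample value is a widely published test number.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch in "0123456789"]  # ignore spaces/dashes
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```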

Quick Reference Card

Character Classes:
.        Any character (except newline)
\d \D    Digit / non-digit
\w \W    Word char / non-word
\s \S    Whitespace / non-whitespace
[abc]    a, b, or c
[^abc]   Not a, b, or c
[a-z]    Range a to z

Anchors:
^        Start of string/line
$        End of string/line
\b       Word boundary
\B       Non-word boundary

Quantifiers:
*        0 or more (greedy)
+        1 or more (greedy)
?        0 or 1 (greedy)
{n}      Exactly n times
{n,}     n or more times
{n,m}    n to m times
*? +? ?? Lazy versions

Groups:
(...)        Capturing group
(?:...)      Non-capturing group
(?<name>...) Named group ((?P<name>...) in Python)
\1 \2        Backreference
(?=...)      Positive lookahead
(?!...)      Negative lookahead
(?<=...)     Positive lookbehind
(?<!...)     Negative lookbehind

Special:
|        OR (alternation)
\        Escape special char

References & Further Reading

Original source: Jeff Dean, Google (circa 2010)
GitHub Gist by jboner - Updated numbers (5,000+ stars)
Interactive Latency Visualization - See how numbers changed over time
samwho.dev/numbers - Beautiful interactive exploration
Jeff Dean's original talk notes
