⚡ Latency Numbers Every Programmer Should Know

📊 Origin

Originally compiled by Jeff Dean (Google) around 2010, these numbers provide a mental model for understanding computer performance. While hardware evolves, the relative relationships between these operations remain remarkably stable.

Updated for 2024-2025: Numbers reflect modern hardware (DDR5 RAM, NVMe SSDs, 10+ Gbps networks).

The Numbers

Operation Nanoseconds Microseconds Milliseconds Scale
L1 cache reference 0.5 ns ⚡ Fastest
Branch mispredict 5 ns ⚡ Fastest
L2 cache reference 7 ns ⚡ Fastest
Mutex lock/unlock 25 ns ⚡ Fastest
Main memory reference 100 ns 💨 Fast
Compress 1KB with Zippy 3,000 ns 3 µs 💨 Fast
Send 1KB over 1 Gbps network 10,000 ns 10 µs 💨 Fast
Read 4KB randomly from SSD 150,000 ns 150 µs ⚠️ Medium
Read 1MB sequentially from memory 250,000 ns 250 µs ⚠️ Medium
Round trip within same datacenter 500,000 ns 500 µs ⚠️ Medium
Read 1MB sequentially from SSD 1,000,000 ns 1,000 µs 1 ms ⚠️ Medium
Disk seek (HDD) 10,000,000 ns 10,000 µs 10 ms 🐌 Slow
Read 1MB sequentially from disk 20,000,000 ns 20,000 µs 20 ms 🐌 Slow
Send packet CA→Netherlands→CA 150,000,000 ns 150,000 µs 150 ms 🐌 Slow

⚠️ Unit Conversions (Critical for Understanding Scale)

Visual Comparison

Logarithmic Scale Visualization

Because these operations span 9 orders of magnitude (0.5 ns to 150 ms = 300,000,000x difference!), we use a logarithmic scale to visualize them.

Relative Latency Bars

Visual comparison (log scale):

L1 cache reference (0.5 ns)
0.5 ns
Main memory reference (100 ns)
200x slower
SSD random read (150 µs)
300,000x slower
Disk seek (10 ms)
20,000,000x slower
CA→Netherlands→CA (150 ms)
300,000,000x slower!
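The ratios against an L1 cache reference can be recomputed from the table. This is a minimal sketch: the names `LATENCY_NS` and `slowdown_vs_l1` are invented for the example, and the values are copied from the table above.

```python
import math

# Latencies in nanoseconds, copied from the table above.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "SSD random read (4KB)": 150_000,
    "Disk seek (HDD)": 10_000_000,
    "CA->Netherlands->CA": 150_000_000,
}

BASELINE_NS = LATENCY_NS["L1 cache reference"]


def slowdown_vs_l1(latency_ns: float) -> float:
    """How many times slower an operation is than one L1 cache reference."""
    return latency_ns / BASELINE_NS


for name, ns in LATENCY_NS.items():
    ratio = slowdown_vs_l1(ns)
    # log10 of the ratio is where the bar would sit on a logarithmic scale.
    print(f"{name:24s} {ratio:>13,.0f}x  (10^{math.log10(ratio):.1f})")
```

The last column shows why a linear chart is useless here: the slowest entry sits about 8.5 powers of ten above the baseline.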

Human-Scale Analogies

If 1 CPU Cycle = 1 Second

To make these numbers relatable, stretch time so that one 0.5 ns CPU cycle (the L1 cache reference above) lasts 1 second of human time. Here's how long each operation would take:

⚡ L1 Cache

1 second

Reaching into your pocket

💨 RAM Access

3 minutes

Walking to the kitchen

⚠️ SSD Read

3.5 days

A long weekend trip

🐌 Disk Seek

7.6 months

Most of a pregnancy

🌍 Network (CA→EU)

9.5 years

College degree + grad school
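The scaling behind these analogies is a single multiplication. The sketch below assumes the 1-second baseline corresponds to the 0.5 ns L1 cache reference, so each nanosecond becomes 2 human seconds. The helper names are hypothetical.

```python
# Assumption: 0.5 ns of machine time (one L1 reference) maps to 1 human second.
HUMAN_SECONDS_PER_NS = 1.0 / 0.5

# Largest unit first, so describe() picks the biggest unit that fits.
UNITS = [
    ("years", 365 * 24 * 3600),
    ("days", 24 * 3600),
    ("hours", 3600),
    ("minutes", 60),
]


def human_seconds(latency_ns: float) -> float:
    """Scale a latency in nanoseconds to seconds on the human timeline."""
    return latency_ns * HUMAN_SECONDS_PER_NS


def describe(seconds: float) -> str:
    """Render a duration using the largest unit that is at least 1."""
    for unit, size in UNITS:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for name, ns in [
    ("L1 cache", 0.5),
    ("RAM access", 100),
    ("SSD read", 150_000),
    ("Disk seek", 10_000_000),
    ("Network (CA->EU->CA)", 150_000_000),
]:
    print(f"{name:22s} {describe(human_seconds(ns))}")
```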

Design Implications

🎯 Key Takeaways for System Design

1. Memory Hierarchy Matters

2. Sequential > Random

3. Network is SLOW

4. Lock Contention

Real-World Application Examples

✅ Good: Redis Cache

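  • Memory-based: ~100 ns per memory access on the server
  • Avoids disk seeks entirely (100 ns vs 10 ms is a 100,000x gap), though a remote lookup still pays a network round trip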
  • Perfect for session storage, counters

❌ Bad: N+1 Query Problem

  • 100 separate DB queries over network
  • 100 × 500 µs = 50 ms just in round trips!
  • Solution: Batch queries, use JOINs, eager loading
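The round-trip arithmetic for the N+1 case can be sketched as follows. This is a rough model that counts only network round trips and ignores query execution time. The function name is hypothetical, and the 500 µs figure is the same-datacenter round trip from the latency table.

```python
ROUND_TRIP_US = 500  # same-datacenter round trip, from the latency table


def round_trip_cost_ms(num_queries: int, round_trip_us: float = ROUND_TRIP_US) -> float:
    """Time spent purely on network round trips, in milliseconds."""
    return num_queries * round_trip_us / 1000


separate = round_trip_cost_ms(100)  # one round trip per query
batched = round_trip_cost_ms(1)     # one JOIN or batched query

print(f"100 separate queries: {separate} ms")
print(f"1 batched query:      {batched} ms")
print(f"Round-trip time saved: {separate - batched} ms")
```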

✅ Good: CDN for Static Assets

  • Reduces cross-continent latency (150 ms → 10 ms)
  • 15x latency improvement
  • Critical for user experience

💡 Optimization: Database Indexes

  • Table scan: Read the entire table from disk (20+ ms per MB)
  • Index lookup: Few SSD reads (1-2 ms)
  • 10-100x speedup for large tables

Performance Budget Calculator

Example: API Response Time Budget (100 ms target)

Operation Count Each Total % of Budget
Network round trip (client ↔ server) 1 30 ms 30 ms 30%
Database queries (w/ network) 3 10 ms 30 ms 30%
Application logic - - 20 ms 20%
Redis cache lookups 5 2 ms 10 ms 10%
Response serialization (JSON) 1 10 ms 10 ms 10%
TOTAL - - 100 ms 100%

Key insight: 60% of time is network! Optimize by reducing round trips, caching aggressively, and using connection pooling.
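The budget table can be reproduced with a few lines of code, which makes it easy to try other counts and per-call costs. This is a sketch, not a real profiling tool. The `BudgetItem` class is hypothetical, the count of 1 for application logic is an assumption (the table leaves it blank), and the `network_bound` flags mark the two rows that the 60% figure adds up.

```python
from dataclasses import dataclass


@dataclass
class BudgetItem:
    name: str
    count: int
    each_ms: int
    network_bound: bool = False  # rows counted toward the "network" share

    @property
    def total_ms(self) -> int:
        return self.count * self.each_ms


ITEMS = [
    BudgetItem("Network round trip (client <-> server)", 1, 30, network_bound=True),
    BudgetItem("Database queries (w/ network)", 3, 10, network_bound=True),
    BudgetItem("Application logic", 1, 20),  # count of 1 is an assumption
    BudgetItem("Redis cache lookups", 5, 2),
    BudgetItem("Response serialization (JSON)", 1, 10),
]

total_ms = sum(item.total_ms for item in ITEMS)
network_ms = sum(item.total_ms for item in ITEMS if item.network_bound)
network_share = 100 * network_ms / total_ms

for item in ITEMS:
    print(f"{item.name:40s} {item.total_ms:>4} ms  {100 * item.total_ms / total_ms:>5.1f}%")
print(f"{'TOTAL':40s} {total_ms:>4} ms  (network-bound: {network_share:.0f}%)")
```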

Storage Sizes & Units

📏 Understanding Digital Storage Units

Storage sizes and data transfer rates use different units that are often confused. Understanding the difference between bits vs bytes and decimal (SI) vs binary (IEC) prefixes is fundamental.

Bits vs Bytes

Unit Symbol Value Typical Use Case
Bit b Binary digit (0 or 1) Network speeds (Mbps, Gbps)
Byte B 8 bits Storage sizes (MB, GB, TB)
Nibble 4 bits (half byte) Hexadecimal digit (0-F)

Key Point: 1 Byte = 8 bits

Common confusion: A "100 Mbps" connection transfers data at 100 megabits per second, which equals 12.5 MB/s (megabytes per second).

Conversion:
100 Mbps ÷ 8 = 12.5 MB/s

1 Gbps network = 125 MB/s max throughput
10 Gbps network = 1,250 MB/s = 1.25 GB/s

Decimal (SI) vs Binary (IEC) Prefixes

⚠️ The Ambiguity Problem

Historically, "kilobyte" was used for both 1000 bytes (decimal) and 1024 bytes (binary). This caused confusion and misleading storage sizes.

Solution (IEC standard): Different prefixes for decimal vs binary.

Decimal (SI) - Base 10 Value Binary (IEC) - Base 2 Value
Kilobyte (KB) 1,000 bytes (10³) Kibibyte (KiB) 1,024 bytes (2¹⁰)
Megabyte (MB) 1,000,000 bytes (10⁶) Mebibyte (MiB) 1,048,576 bytes (2²⁰)
Gigabyte (GB) 1,000,000,000 bytes (10⁹) Gibibyte (GiB) 1,073,741,824 bytes (2³⁰)
Terabyte (TB) 1,000,000,000,000 bytes (10¹²) Tebibyte (TiB) 1,099,511,627,776 bytes (2⁴⁰)
Petabyte (PB) 10¹⁵ bytes Pebibyte (PiB) 2⁵⁰ bytes
Exabyte (EB) 10¹⁸ bytes Exbibyte (EiB) 2⁶⁰ bytes

Why Your 1TB Drive Shows as 931GB

Hard drive manufacturers use decimal (SI) units: 1 TB = 1,000,000,000,000 bytes

Operating systems use binary (IEC) units: 1 TiB = 1,099,511,627,776 bytes

Calculation:
1 TB = 1,000,000,000,000 bytes
1 TiB = 1,099,511,627,776 bytes

1,000,000,000,000 ÷ 1,099,511,627,776 = 0.909 TiB
0.909 TiB × 1024 GiB/TiB = 931 GiB

Result: Your "1TB" drive appears as ~931 GiB in your OS
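The same conversion in code, as a quick sanity check. The variable names are illustrative only.

```python
ONE_TB = 10**12   # decimal terabyte, as printed on the drive label
GIB = 2**30       # binary gibibyte
TIB = 2**40       # binary tebibyte

size_in_tib = ONE_TB / TIB
size_in_gib = ONE_TB / GIB

print(f"1 TB = {size_in_tib:.3f} TiB = {size_in_gib:.1f} GiB")
```

The gap widens with each prefix: about 2.4% at the kilo level, 7.4% at giga, and 10% at tera.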

Practical Examples by Size

Size Examples
1 Byte Single ASCII character ('A', '7', '$')
2 Bytes Single UTF-16 code unit (covers any BMP character), 16-bit integer (-32,768 to 32,767)
4 Bytes 32-bit integer, IPv4 address, single float
8 Bytes 64-bit long, double-precision float, Unix timestamp
~1 KB Small text file, short email (plain text)
~100 KB Low-res photo, small favicon, email with attachments
~1 MB High-quality photo (JPEG), 1 minute of MP3 audio, short e-book
~100 MB Movie trailer (720p), mobile app, high-quality album
~1 GB Standard definition movie, ~1 hour of 1080p video, large video game
~10 GB HD movie (1080p), operating system install, AAA video game
~100 GB 4K movie, Call of Duty game install, large database backup
~1 TB Modern laptop SSD, 200+ HD movies, massive game collection
~10 TB Home NAS storage, professional video editing workstation
~1 PB Small datacenter storage, major cloud service storage tier, large research dataset
~1 EB Google/Facebook photo storage, entire internet archive snapshot, global weather data

Network Speeds: Why Bits?

Network speeds are measured in bits per second (bps), not bytes, for historical reasons:

Connection Type Speed (bits) Throughput (bytes) Download 1GB File
Dial-up modem (ancient) 56 Kbps 7 KB/s ~40 hours
DSL / Cable (basic) 10 Mbps 1.25 MB/s ~13 minutes
Fast broadband 100 Mbps 12.5 MB/s ~80 seconds
Gigabit internet 1 Gbps 125 MB/s ~8 seconds
Datacenter link 10 Gbps 1.25 GB/s < 1 second
High-speed datacenter 100 Gbps 12.5 GB/s 0.08 seconds

Interview gotcha: "How long to transfer 1TB over a 10 Gbps link?"

Calculation:
1 TB = 1,000 GB (decimal) = 8,000 Gb (gigabits)
10 Gbps link = 10 Gb per second

Time = 8,000 Gb ÷ 10 Gbps = 800 seconds = 13.3 minutes

Real world: Add ~10-20% overhead for TCP/IP headers, retransmissions
Actual time: ~15-16 minutes
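A small helper makes the bits-versus-bytes step explicit. This is a sketch that assumes decimal units throughout. The function name and the `overhead` parameter are made up for illustration, and the 15% overhead is simply a value inside the 10-20% range mentioned above.

```python
def transfer_seconds(size_bytes: int, link_bits_per_sec: int, overhead: float = 0.0) -> float:
    """Ideal transfer time, optionally inflated by a fractional protocol overhead."""
    ideal = size_bytes * 8 / link_bits_per_sec  # bytes -> bits, then divide by link rate
    return ideal * (1 + overhead)


TB = 10**12
GB = 10**9

ideal = transfer_seconds(TB, 10 * 10**9)
with_overhead = transfer_seconds(TB, 10 * 10**9, overhead=0.15)

print(f"1 TB over 10 Gbps (ideal):        {ideal:.0f} s = {ideal / 60:.1f} min")
print(f"1 TB over 10 Gbps (15% overhead): {with_overhead:.0f} s = {with_overhead / 60:.1f} min")
print(f"1 GB over 56 Kbps dial-up:        {transfer_seconds(GB, 56_000) / 3600:.1f} hours")
```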

Number Systems & Character Encoding

🔢 Why Multiple Number Systems?

Computers work in binary (base-2), but humans prefer decimal (base-10). Hexadecimal (base-16) provides a compact way to represent binary. Understanding these systems is fundamental to systems programming, debugging, and understanding how data is stored.

Binary (Base-2)

Fundamentals

Definition: Number system using only two digits: 0 and 1

Why computers use binary: Electronic circuits have two states - on (1) or off (0). Voltage high = 1, voltage low = 0.

Decimal Binary Calculation
0 0000 0
1 0001 1
2 0010 2
3 0011 2 + 1 = 3
4 0100 4
5 0101 4 + 1 = 5
10 1010 8 + 2 = 10
15 1111 8 + 4 + 2 + 1 = 15
255 11111111 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 255

Binary Place Values

8-bit binary number: 1 0 1 1 0 1 0 1

Place values:  128  64  32  16   8   4   2   1
Binary digit:    1   0   1   1   0   1   0   1
               ───────────────────────────────────
Calculation:   128 + 0 + 32 + 16 + 0 + 4 + 0 + 1 = 181

Formula: Each position = 2^n (where n starts at 0 from right)
Position 0 (rightmost): 2^0 = 1
Position 1: 2^1 = 2
Position 2: 2^2 = 4
Position 3: 2^3 = 8
...
Position 7: 2^7 = 128

Common Binary Patterns

Powers of 2 (important for memory sizes):
2^0  = 1
2^1  = 2
2^2  = 4
2^3  = 8
2^4  = 16
2^5  = 32
2^6  = 64
2^7  = 128
2^8  = 256
2^10 = 1,024      (1 KiB)
2^16 = 65,536     (64 KiB)
2^20 = 1,048,576  (1 MiB)
2^30 = 1,073,741,824  (1 GiB)

Maximum values for N bits:
4 bits  = 0000 to 1111 = 0 to 15
8 bits  = 00000000 to 11111111 = 0 to 255
16 bits = 0 to 65,535
32 bits = 0 to 4,294,967,295

Signed integers (two's complement):
8-bit signed:  -128 to +127
16-bit signed: -32,768 to +32,767
32-bit signed: -2,147,483,648 to +2,147,483,647
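The place-value rule and the N-bit ranges above translate directly into code. This is a minimal sketch with hypothetical helper names. Python's built-in `int(text, 2)` and `format(n, "b")` do the same conversions and are used here only as a cross-check.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum 2^position for every 1 bit, counting positions from the right."""
    total = 0
    for position, digit in enumerate(reversed(bits)):
        if digit == "1":
            total += 2 ** position
    return total


def unsigned_range(bits: int) -> tuple[int, int]:
    """Smallest and largest value of an unsigned integer with this many bits."""
    return 0, 2 ** bits - 1


def signed_range(bits: int) -> tuple[int, int]:
    """Smallest and largest value of a two's complement integer with this many bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


print(binary_to_decimal("10110101"))          # place-value sum
print(int("10110101", 2), format(181, "08b"))  # built-in equivalents
print(unsigned_range(8), signed_range(8))
```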

Bitwise Operations

AND (&): Both bits must be 1
  1010 (10)
& 1100 (12)
  ----
  1000 (8)

OR (|): At least one bit is 1
  1010 (10)
| 1100 (12)
  ----
  1110 (14)

XOR (^): Bits are different
  1010 (10)
^ 1100 (12)
  ----
  0110 (6)

NOT (~): Flip all bits
~ 1010 (10)
  ----
  0101 (5)  [assuming 4-bit]

Left shift (<<): Multiply by 2
5 << 1  →  0101 << 1 = 1010  (10)  [5 × 2]
5 << 2  →  0101 << 2 = 10100 (20)  [5 × 4]

Right shift (>>): Divide by 2
20 >> 1  →  10100 >> 1 = 1010 (10)  [20 ÷ 2]
20 >> 2  →  10100 >> 2 = 0101 (5)   [20 ÷ 4]

Common uses:
- Check if number is even: (n & 1) == 0
- Check if number is power of 2: n > 0 && (n & (n-1)) == 0  (the n > 0 guard excludes 0)
- Set bit: n |= (1 << i)
- Clear bit: n &= ~(1 << i)
- Toggle bit: n ^= (1 << i)
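The operations and idioms above, written out as a runnable sketch. One Python-specific caveat: integers are unbounded, so `~` yields a negative number and must be masked to reproduce the 4-bit result shown earlier. The helper names are illustrative.

```python
MASK_4_BITS = 0b1111

a, b = 0b1010, 0b1100  # 10 and 12

and_result = a & b           # 0b1000
or_result = a | b            # 0b1110
xor_result = a ^ b           # 0b0110
not_a = ~a & MASK_4_BITS     # mask keeps only the low 4 bits


def is_even(n: int) -> bool:
    return (n & 1) == 0


def is_power_of_two(n: int) -> bool:
    # The n > 0 guard matters: 0 & -1 is also 0, but 0 is not a power of two.
    return n > 0 and (n & (n - 1)) == 0


def set_bit(n: int, i: int) -> int:
    return n | (1 << i)


def clear_bit(n: int, i: int) -> int:
    return n & ~(1 << i)


def toggle_bit(n: int, i: int) -> int:
    return n ^ (1 << i)


print(and_result, or_result, xor_result, not_a)
print(5 << 2, 20 >> 2)
```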

Hexadecimal (Base-16)

Fundamentals

Definition: Number system using 16 digits: 0-9 and A-F

Why hexadecimal? Compact representation of binary. One hex digit = exactly 4 bits (nibble).

Decimal Binary Hexadecimal
0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
10 1010 A
11 1011 B
12 1100 C
13 1101 D
14 1110 E
15 1111 F
255 11111111 FF

Converting Between Hex and Binary

Hex to Binary (easy - just expand each digit):
0xCAFE
= C    A    F    E
= 1100 1010 1111 1110
= 1100101011111110

Binary to Hex (easy - group by 4 bits from right):
10110101
= 1011 0101  (group by 4)
= B    5
= 0xB5

Hex to Decimal:
0x2F = (2 × 16^1) + (15 × 16^0) = 32 + 15 = 47
0x1A3 = (1 × 16^2) + (10 × 16^1) + (3 × 16^0) = 256 + 160 + 3 = 419

Decimal to Hex (divide by 16, track remainders):
255 ÷ 16 = 15 remainder 15 (F)
15 ÷ 16  = 0  remainder 15 (F)
Result: 0xFF
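The divide-by-16 procedure can be sketched as a short loop. The function `to_hex` is a hypothetical helper written for illustration and handles non-negative integers only. In practice Python's built-in `hex()`, `format()`, and `int(text, 16)` cover these conversions.

```python
HEX_DIGITS = "0123456789ABCDEF"


def to_hex(n: int) -> str:
    """Convert a non-negative integer to hex by repeated division by 16."""
    if n == 0:
        return "0x0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 16)
        digits.append(HEX_DIGITS[remainder])  # remainders come out least significant first
    return "0x" + "".join(reversed(digits))


print(to_hex(255), to_hex(419))
print(int("2F", 16), int("1A3", 16))  # hex to decimal
print(format(0xCAFE, "016b"))         # hex to binary, 4 bits per hex digit
```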

Common Hex Patterns in Programming

Memory addresses:
0x00007fff5fbff8a0  (64-bit pointer)
0xdeadbeef          (common debug value)
0x00000000          (NULL pointer)

Colors (RGB):
#FFFFFF = white  (255, 255, 255)
#000000 = black  (0, 0, 0)
#FF0000 = red    (255, 0, 0)
#00FF00 = green  (0, 255, 0)
#0000FF = blue   (0, 0, 255)
#CAFE01 = custom (202, 254, 1)

Byte values:
0x00 = 0
0xFF = 255 (max for 1 byte)
0x7F = 127 (max positive for signed byte)
0x80 = 128 or -128 (signed)

Bitmasks:
0x0F = 00001111 (mask lower 4 bits)
0xF0 = 11110000 (mask upper 4 bits)
0xFF = 11111111 (all bits set)
0x00 = 00000000 (all bits clear)

Permissions (Unix, written in octal rather than hex; 3 bits per digit):
0644 = rw-r--r--  (owner: rw, group: r, other: r)
0755 = rwxr-xr-x  (owner: rwx, group: rx, other: rx)
0777 = rwxrwxrwx  (all permissions)

Why Hex is Used

ASCII (American Standard Code for Information Interchange)

Fundamentals

Definition: 7-bit character encoding standard (0-127), representing English characters, digits, and control codes.

History: Developed in 1963 for teleprinters and early computers. Still the foundation of modern text encoding.

Range Decimal Hex Description Examples
Control characters 0-31 0x00-0x1F Non-printable control codes NUL(0), TAB(9), LF(10), CR(13), ESC(27)
Space & symbols 32-47 0x20-0x2F Space, punctuation Space(32), !(33), "(34), #(35)
Digits 48-57 0x30-0x39 0-9 '0'(48), '5'(53), '9'(57)
More symbols 58-64 0x3A-0x40 Punctuation :(58), @(64)
Uppercase letters 65-90 0x41-0x5A A-Z 'A'(65), 'M'(77), 'Z'(90)
More symbols 91-96 0x5B-0x60 Brackets, etc. [(91), \(92), ](93)
Lowercase letters 97-122 0x61-0x7A a-z 'a'(97), 'm'(109), 'z'(122)
More symbols 123-126 0x7B-0x7E Braces, etc. {(123), }(125), ~(126)
Delete 127 0x7F Delete control character DEL(127)

Important ASCII Codes to Remember

Control characters:
0x00 (0)   = NUL (null terminator in C strings)
0x09 (9)   = TAB (horizontal tab)
0x0A (10)  = LF  (line feed, '\n' on Unix)
0x0D (13)  = CR  (carriage return, '\r')
0x1B (27)  = ESC (escape, used in terminal codes)
0x20 (32)  = SPACE

Digits '0'-'9':
'0' = 48 (0x30)
'1' = 49 (0x31)
...
'9' = 57 (0x39)

Uppercase 'A'-'Z':
'A' = 65 (0x41)
'B' = 66 (0x42)
...
'Z' = 90 (0x5A)

Lowercase 'a'-'z':
'a' = 97 (0x61)
'b' = 98 (0x62)
...
'z' = 122 (0x7A)

Useful patterns:
Uppercase to lowercase: Add 32  ('A' + 32 = 'a')
Lowercase to uppercase: Subtract 32  ('a' - 32 = 'A')
Difference: 0x20 (32) exactly

Digit to integer: Subtract '0'  ('5' - '0' = 5)
Integer to digit: Add '0'  (5 + '0' = '5')
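These patterns map onto `ord()` and `chr()` in Python. The sketch below is ASCII-only by design. The helper names are hypothetical, and real code should normally use `str.lower()`, `str.upper()`, and `int()` instead.

```python
CASE_OFFSET = ord("a") - ord("A")  # 32, i.e. 0x20: a single bit separates the cases


def to_lower_ascii(ch: str) -> str:
    """Lowercase one ASCII letter; leave every other character unchanged."""
    return chr(ord(ch) + CASE_OFFSET) if "A" <= ch <= "Z" else ch


def to_upper_ascii(ch: str) -> str:
    """Uppercase one ASCII letter; leave every other character unchanged."""
    return chr(ord(ch) - CASE_OFFSET) if "a" <= ch <= "z" else ch


def digit_to_int(ch: str) -> int:
    return ord(ch) - ord("0")


def int_to_digit(n: int) -> str:
    return chr(n + ord("0"))


print(to_lower_ascii("A"), to_upper_ascii("a"), digit_to_int("5"), int_to_digit(5))
print(chr(ord("A") ^ 0x20))  # flipping bit 5 toggles the case of an ASCII letter
```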

Extended ASCII (8-bit, 0-255)

Problem: Standard ASCII (7-bit) only covers English. Many 8-bit extensions were created for other languages.

Line Endings (Newlines)

Different systems use different line ending conventions:

Unix/Linux/macOS:   LF   (0x0A, '\n')
Windows:            CRLF (0x0D 0x0A, '\r\n')
Old Mac (pre-OSX):  CR   (0x0D, '\r')

Why this causes problems:
- Text file created on Windows opened on Linux shows ^M characters
- Git can auto-convert (core.autocrlf setting)
- Always specify line endings in .gitattributes for consistency

Unicode: The Universal Character Set

The Problem ASCII/Extended ASCII Couldn't Solve

Issue: ASCII (128 chars) and Extended ASCII (256 chars) can't represent:

Solution: Unicode - Universal character set supporting 1,114,112 possible characters (code points U+0000 to U+10FFFF)

Unicode Concepts

Code Point: Abstract number assigned to each character
- Written as U+xxxx (hexadecimal)
- Examples:
  'A' = U+0041
  '€' = U+20AC (Euro sign)
  '中' = U+4E2D (Chinese character)
  '😀' = U+1F600 (Grinning face emoji)

Encoding: How code points are stored as bytes
Unicode defines the characters, encodings define byte representation

UTF-8: The Dominant Encoding

Why UTF-8 won:

UTF-8 Encoding Rules

1 byte (ASCII): U+0000 to U+007F
Pattern: 0xxxxxxx
Example: 'A' = U+0041 = 01000001 = 0x41 (1 byte)

2 bytes: U+0080 to U+07FF
Pattern: 110xxxxx 10xxxxxx
Example: '©' = U+00A9 = 11000010 10101001 = 0xC2 0xA9 (2 bytes)

3 bytes: U+0800 to U+FFFF
Pattern: 1110xxxx 10xxxxxx 10xxxxxx
Example: '€' = U+20AC = 11100010 10000010 10101100 = 0xE2 0x82 0xAC (3 bytes)

4 bytes: U+10000 to U+10FFFF
Pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Example: '😀' = U+1F600 = 11110000 10011111 10011000 10000000 = 0xF0 0x9F 0x98 0x80 (4 bytes)

Key insight: Leading byte tells you character length
0xxxxxxx   = 1 byte (ASCII)
110xxxxx   = 2 bytes
1110xxxx   = 3 bytes
11110xxx   = 4 bytes
10xxxxxx   = Continuation byte (never starts a character)
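The encoding rules above fit in a short function. This is an illustrative sketch rather than a production encoder: `utf8_encode` is a hypothetical name, and it does not reject the surrogate range U+D800 to U+DFFF the way a strict encoder would. Python's own `str.encode("utf-8")` serves as the cross-check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the UTF-8 bit patterns."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point <= 0x7F:       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    return bytes([               # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


for cp in (0x41, 0xA9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X} -> {encoded.hex(' ')}  matches built-in: {encoded == chr(cp).encode('utf-8')}")
```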

UTF-16 and UTF-32

Encoding Bytes per Char Pros Cons Used By
UTF-8 1-4 bytes (variable) ASCII compatible, space efficient, no BOM needed Variable length complicates indexing Web, Linux, most programming
UTF-16 2 or 4 bytes Fixed width for most chars (BMP) Not ASCII compatible, byte order issues (BOM) Windows, Java, JavaScript internally
UTF-32 4 bytes (fixed) Fixed width, easy indexing Space inefficient (4× ASCII size) Rare, some internal processing

Common Unicode Pitfalls

1. String length is ambiguous:
"café" in UTF-8 = 5 bytes (c=1, a=1, f=1, é=2)
but only 4 Unicode code points (with precomposed é, U+00E9)
and 4 "characters" to a human reader

JavaScript: "café".length → 4 (counts UTF-16 code units)
Python: len("café") → 4 (counts code points)
Bytes: len("café".encode('utf-8')) → 5 (bytes)

2. Emoji are complex:
"👨‍👩‍👧‍👦" (family) = 7 code points:
  👨 (man) + ZWJ + 👩 (woman) + ZWJ + 👧 (girl) + ZWJ + 👦 (boy)
  (ZWJ = Zero Width Joiner, U+200D)

"👍" (thumbs up) = 1 code point (U+1F44D)
"👍🏿" (dark skin tone) = 2 code points (base + modifier)

3. Visual vs code point length:
é can be:
- 1 code point: U+00E9 (precomposed "é")
- 2 code points: U+0065 U+0301 (e + combining acute accent)
Both look identical, different byte representations!
→ Use Unicode normalization (NFC, NFD) to standardize

4. Case folding isn't simple:
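German: "ß".toUpperCase() → "SS" (1 char becomes 2!)
Turkish: "i".toLocaleUpperCase("tr") → "İ" (dotted capital I)
         "I".toLocaleLowerCase("tr") → "ı" (dotless i)
         (plain toUpperCase()/toLowerCase() ignore the locale and give "I" and "i")
→ Use locale-aware case conversion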

5. Sorting/collation is locale-dependent:
Swedish: ä comes after z
German: ä sorts like a
→ Use ICU library or locale-aware sorting
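Several of these pitfalls can be observed with the standard library alone. The sketch below uses escape sequences instead of literal accented characters so that the code-point sequences are unambiguous. The variable and function names are illustrative.

```python
import unicodedata

precomposed = "caf\u00e9"   # e-acute as one code point (U+00E9)
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent (U+0301)

# Man, woman, girl, boy joined by Zero Width Joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def utf16_code_units(text: str) -> int:
    """What JavaScript's .length would report for the same string."""
    return len(text.encode("utf-16-le")) // 2


print(len(precomposed), len(decomposed))                  # code points
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))  # bytes
print(precomposed == decomposed)                          # raw comparison differs
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(len(family), utf16_code_units(family), len(family.encode("utf-8")))
print("\u00df".upper())                                   # sharp s uppercases to two letters
```

Normalizing both sides to the same form (NFC or NFD) before comparing or hashing is the usual way to make visually identical strings compare equal.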

Best Practices

Regular Expressions (Regex)

🔍 What Are Regular Expressions?

Definition: Regular expressions are patterns used to match character combinations in strings. They're a powerful tool for text searching, validation, parsing, and manipulation.

Why they matter: Used everywhere - text editors, log parsing, input validation, data extraction, search & replace, URL routing, lexical analysis, and more.

Trade-off: Powerful but can be cryptic. The joke: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

Basic Syntax

Literal Characters

Most characters match themselves:
Pattern: cat
Matches: "cat" in "The cat sat"
Doesn't match: "Cat" (case-sensitive by default)

Pattern: hello world
Matches: "hello world" exactly

Metacharacters (Special Characters)

Character Meaning Example
. Any single character (except newline) c.t matches "cat", "cot", "c9t"
^ Start of string/line ^cat matches "cat" only at start
$ End of string/line cat$ matches "cat" only at end
* 0 or more times ca*t matches "ct", "cat", "caaat"
+ 1 or more times ca+t matches "cat", "caat" (not "ct")
? 0 or 1 time (optional) colou?r matches "color" or "colour"
| OR (alternation) cat|dog matches "cat" or "dog"
( ) Grouping and capturing (ca)+t matches "cat", "cacat"
[ ] Character class [aeiou] matches any vowel
\ Escape special character \. matches literal "."

To match metacharacters literally, escape them:
Pattern: \$\d+\.\d{2}
Matches: "$19.99" (dollar amount)
         $ escaped with \$
         . escaped with \.

Character Classes

Predefined Character Classes

Class Equivalent Matches
\d [0-9] Any digit
\D [^0-9] Any non-digit
\w [a-zA-Z0-9_] Word character (letter, digit, underscore)
\W [^a-zA-Z0-9_] Non-word character
\s [ \t\n\r\f\v] Whitespace (space, tab, newline, etc.)
\S [^ \t\n\r\f\v] Non-whitespace

Custom Character Classes

Square brackets [ ] define character sets:

[aeiou]        - Matches any single vowel
[0-9]          - Matches any digit (same as \d)
[a-z]          - Matches lowercase letter
[A-Z]          - Matches uppercase letter
[a-zA-Z]       - Matches any letter
[a-z0-9]       - Matches letter or digit
[^aeiou]       - Matches any character EXCEPT vowels (^ negates)
[0-9a-fA-F]    - Matches hexadecimal digit

Special characters lose meaning inside [ ]:
[.]            - Matches literal dot (no need to escape)
[*+?]          - Matches literal *, +, or ?
[a-z-]         - Matches a-z or hyphen (hyphen at end)
[-a-z]         - Matches hyphen or a-z (hyphen at start)
[a\-z]         - Matches a, hyphen, or z (hyphen escaped)

Quantifiers

Quantifier Meaning Example Matches
* 0 or more ab*c "ac", "abc", "abbc", "abbbc"
+ 1 or more ab+c "abc", "abbc" (not "ac")
? 0 or 1 ab?c "ac" or "abc"
{n} Exactly n times \d{3} "123", "999" (exactly 3 digits)
{n,} n or more times \d{3,} "123", "1234", "12345"
{n,m} Between n and m times \d{3,5} "123", "1234", "12345" (not "12" or "123456")

Greedy vs Lazy (Non-Greedy) Matching

Greedy (default): Match as much as possible
Pattern: <.*>
String: <b>bold</b> and <i>italic</i>
Matches: "<b>bold</b> and <i>italic</i>"  (entire string!)

Lazy (non-greedy): Match as little as possible (add ?)
Pattern: <.*?>
String: <b>bold</b> and <i>italic</i>
Matches: "<b>", "</b>", "<i>", "</i>"  (each tag separately)

Lazy quantifiers:
*?    - 0 or more (lazy)
+?    - 1 or more (lazy)
??    - 0 or 1 (lazy)
{n,}? - n or more (lazy)
{n,m}? - between n and m (lazy)

Example: Extract quoted strings
Greedy:  ".*"   on 'He said "hello" and "goodbye"' → '"hello" and "goodbye"'
Lazy:    ".*?"  on 'He said "hello" and "goodbye"' → '"hello"' and '"goodbye"'
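The quoted-string example runs as-is with Python's `re` module. A third variant, a negated character class, is included as a common alternative to the lazy quantifier. It is an addition for comparison, not something the examples above rely on.

```python
import re

text = 'He said "hello" and "goodbye"'

greedy = re.findall(r'".*"', text)      # .* runs to the last quote in the string
lazy = re.findall(r'".*?"', text)       # .*? stops at the first closing quote
negated = re.findall(r'"[^"]*"', text)  # alternative: forbid quotes inside the match

print(greedy)
print(lazy)
print(negated)
```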

Anchors & Boundaries

Anchor Meaning Example
^ Start of string/line ^Hello matches "Hello world" (not "Say Hello")
$ End of string/line world$ matches "Hello world" (not "world peace")
\b Word boundary \bcat\b matches "cat" (not "category" or "scat")
\B Non-word boundary \Bcat\B matches "cat" in "scattered"
\A Start of string only Like ^ but never matches after newline (multiline mode)
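\Z End of string only Like $ but unaffected by multiline mode (in PCRE, Java, and .NET, \Z still allows one final newline and \z is the strict form; in Python, \Z is strict)

Word boundaries (\b) are crucial for exact matches: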

Pattern: cat
String: "cat category scat"
Matches: ALL occurrences (cat in all three words)

Pattern: \bcat\b
String: "cat category scat"
Matches: Only "cat" (standalone word)

Validate entire string (common for input validation):
Pattern: ^\d{3}-\d{2}-\d{4}$
Matches: "123-45-6789" (entire string is SSN format)
Rejects: "My SSN is 123-45-6789" (extra text)
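The boundary and whole-string examples above, as a Python sketch. One detail specific to Python is worth knowing: `$` also matches just before a trailing newline, so `re.fullmatch` is the stricter choice for input validation. The names `SSN_FORMAT` and `is_ssn_format` are hypothetical.

```python
import re

text = "cat category scat"
print(re.findall(r"cat", text))      # one hit inside each of the three words
print(re.findall(r"\bcat\b", text))  # only the standalone word

SSN_FORMAT = re.compile(r"\d{3}-\d{2}-\d{4}")


def is_ssn_format(candidate: str) -> bool:
    # fullmatch requires the entire string to match the pattern.
    return SSN_FORMAT.fullmatch(candidate) is not None


print(is_ssn_format("123-45-6789"))
print(is_ssn_format("My SSN is 123-45-6789"))

# With ^...$ and search, a trailing newline slips through in Python.
anchored = re.search(r"^\d{3}-\d{2}-\d{4}$", "123-45-6789\n")
print(anchored is not None, is_ssn_format("123-45-6789\n"))
```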

Groups & Capturing

Capturing Groups ( )

Parentheses create capturing groups:

Pattern: (\d{3})-(\d{2})-(\d{4})
String: "123-45-6789"
Capture groups:
  Group 0 (entire match): "123-45-6789"
  Group 1: "123"
  Group 2: "45"
  Group 3: "6789"

Use in replacement:
Pattern: (\w+)\s+(\w+)
String: "John Doe"
Replace with: $2, $1  (or \2, \1 in some flavors)
Result: "Doe, John"

Backreferences (refer to captured groups):
Pattern: \b(\w+)\s+\1\b
Matches: Repeated words like "the the" or "is is"
         \1 refers back to whatever Group 1 matched

Pattern: <(\w+)>.*?</\1>
Matches: <b>text</b> or <i>text</i>
         Ensures closing tag matches opening tag
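Capturing, replacement, and backreferences look like this in Python, where replacement strings use `\1` and `\2` rather than `$1` and `$2`. The sample sentence for the repeated-word search is made up for illustration.

```python
import re

# Capturing groups: group(0) is the whole match, groups() lists groups 1..n.
match = re.fullmatch(r"(\d{3})-(\d{2})-(\d{4})", "123-45-6789")
print(match.group(0), match.groups())

# Replacement: swap the two captured words.
swapped = re.sub(r"(\w+)\s+(\w+)", r"\2, \1", "John Doe")
print(swapped)

# Backreference: \1 must repeat exactly what group 1 captured.
repeated = re.findall(r"\b(\w+)\s+\1\b", "this is is the the end")
print(repeated)
```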

Non-Capturing Groups (?:...)

Use (?:...) when you need grouping but not capturing:

Pattern: (?:https?|ftp)://(\S+)
Matches: URLs with http, https, or ftp
Only captures the domain/path part (not the protocol)

Why use non-capturing groups?
- Performance: Capturing has overhead
- Clarity: Numbered groups ($1, $2) stay simple
- Necessity: Some regex engines limit number of capture groups

Named Capturing Groups

Python/PCRE syntax: (?P<name>...)
JavaScript/Java/.NET syntax: (?<name>...)

Pattern: (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
String: "2025-01-15"
Captures:
  year: "2025"
  month: "01"
  day: "15"

# Python example:
import re
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', '2025-01-15')
print(match.group('year'))  # "2025"

Lookahead & Lookbehind

Type Syntax Meaning Example
Positive Lookahead (?=...) Followed by pattern \d+(?= dollars) matches number before " dollars"
Negative Lookahead (?!...) NOT followed by pattern \d+(?! dollars) matches numbers not before " dollars"
Positive Lookbehind (?<=...) Preceded by pattern (?<=\$)\d+ matches number after "$"
Negative Lookbehind (?<!...) NOT preceded by pattern (?<!\$)\d+ matches numbers not after "$"

Lookaheads/lookbehinds are zero-width (don't consume characters):

Pattern: \w+(?=\.)
String: "Hello. World. Test."
Matches: "Hello", "World", "Test" (dots not included in match)

Complex example: Password validation
Requirement: 8+ chars, at least 1 uppercase, 1 lowercase, 1 digit

Pattern: ^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$
         (?=.*[A-Z])  - Must contain uppercase
         (?=.*[a-z])  - Must contain lowercase
         (?=.*\d)     - Must contain digit
         .{8,}        - At least 8 characters

Matches: "Password1" ✓
Rejects: "password" (no uppercase or digit) ✗
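The lookahead-based password rule can be wrapped in a small function. This is a sketch of the pattern above, not a recommendation for real password policy. The function name is hypothetical and the extra test strings are invented.

```python
import re

# Each lookahead checks one requirement from the start without consuming input;
# .{8,} then enforces the minimum length.
PASSWORD_RULE = re.compile(r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$")


def is_valid_password(candidate: str) -> bool:
    return PASSWORD_RULE.search(candidate) is not None


for candidate in ["Password1", "password", "PASSWORD1", "Pass1"]:
    print(f"{candidate!r}: {is_valid_password(candidate)}")

# Zero-width in action: the dots are required but never part of the match.
print(re.findall(r"\w+(?=\.)", "Hello. World. Test."))
```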

Common Patterns

Email (basic):
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Note: Email validation is complex; this catches most common formats

URL/URI:
https?://[^\s]+
Or more strict:
^https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b

Phone (US):
^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$
Matches: (123) 456-7890, 123-456-7890, 123.456.7890, 1234567890

IP Address (IPv4):
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
Validates: 0.0.0.0 to 255.255.255.255

Date (YYYY-MM-DD):
^\d{4}-\d{2}-\d{2}$
Basic: Matches format, not validity (allows 2025-99-99)
Strict: ^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$

Credit Card (format only, no Luhn check):
^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$

Hex Color:
^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$
Matches: #FFFFFF, #FFF, FFFFFF, FFF

Username (alphanumeric + underscore, 3-16 chars):
^[a-zA-Z0-9_]{3,16}$

Strong Password (8+ chars, upper, lower, digit, special):
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$

Extract all words:
\b\w+\b

Extract numbers (including decimals):
\d+\.?\d*
Or: -?\d+(?:\.\d+)?  (includes negative numbers)

HTML tags:
<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)
Note: HTML is not regular; use proper HTML parser for production

Remove extra whitespace:
\s+
Replace with single space

Trim whitespace:
^\s+|\s+$
Replace with empty string
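A few of the patterns above, exercised against sample inputs. The patterns are copied from this section. The constant names and the test strings are made up for illustration.

```python
import re

IPV4 = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)
DATE_BASIC = re.compile(r"^\d{4}-\d{2}-\d{2}$")
DATE_STRICT = re.compile(r"^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$")
HEX_COLOR = re.compile(r"^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$")


def matches(pattern: re.Pattern, text: str) -> bool:
    return pattern.search(text) is not None


print(matches(IPV4, "192.168.0.1"), matches(IPV4, "256.1.1.1"))
print(matches(DATE_BASIC, "2025-99-99"), matches(DATE_STRICT, "2025-99-99"))
print(matches(HEX_COLOR, "#FFF"), matches(HEX_COLOR, "#FFFF"))

# Whitespace cleanup: collapse runs, then trim both ends.
collapsed = re.sub(r"\s+", " ", "a   b\t\nc")
trimmed = re.sub(r"^\s+|\s+$", "", "  hi  ")
print(repr(collapsed), repr(trimmed))
```

Note that the strict date pattern still accepts impossible dates such as 2025-02-31, since regex checks the format and not the calendar.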

Regex Flavors & Differences

Feature PCRE (Perl) JavaScript Python Java
Named groups (?<name>...) (?<name>...) (ES2018+) (?P<name>...) (?<name>...)
Lookbehind ✓ Fixed length per alternative ✓ Variable length (ES2018+) ✓ Fixed length only ✓ Bounded length only
Backreferences \1 \1 (pattern), $1 (replacement) \1 \1 (pattern), $1 (replacement)
Flags /pattern/i /pattern/gi re.I, re.M Pattern.CASE_INSENSITIVE
Unicode support ✓ Full ✓ Limited (better in ES2015+) ✓ Full ✓ Full

Common Flags/Modifiers

Flags modify how regex behaves:

i - Case insensitive
    /cat/i matches "cat", "Cat", "CAT"

g - Global (find all matches, not just first)
    /cat/g finds all "cat" in string

m - Multiline (^ and $ match line breaks)
    Without: ^cat$ matches entire string
    With:    ^cat$ matches lines starting and ending with "cat"

s - Dotall (. matches newline)
    Without: . matches everything except \n
    With:    . matches everything including \n

x - Extended/verbose (ignore whitespace, allow comments)
    Useful for complex patterns with documentation

u - Unicode (JavaScript, treat pattern as Unicode)
    /\u{1F600}/u matches 😀

JavaScript example:
const regex = /cat/gi;  // Case insensitive + global
"Cat dog cat CAT".match(regex);  // ["Cat", "cat", "CAT"]

Python example:
import re
re.findall(r'cat', 'Cat dog cat CAT', re.IGNORECASE)  # ['Cat', 'cat', 'CAT']

Best Practices & Performance

Performance Pitfalls

1. Catastrophic Backtracking

Bad: (a+)+b
String: "aaaaaaaaaaaaaaaaaac" (no 'b' at end)
Result: Exponential backtracking, can hang for seconds!

Why: Regex engine tries every combination of how to split 'a's between inner and outer +

Solution: Use possessive/atomic groups or avoid nested quantifiers
Better: a+b (no nesting)
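The flattened pattern accepts exactly the same strings as the nested one, which is easy to spot-check. The sketch below also times the failing case on short inputs. The lengths are kept small on purpose, the timings are printed without any expected values because they vary by machine and engine, and the sample strings are invented.

```python
import re
import time

NESTED = re.compile(r"(a+)+b")  # nested quantifiers: many ways to split the run of a's
FLAT = re.compile(r"a+b")       # same set of matching strings, no nesting


def same_verdict(text: str) -> bool:
    """True when both patterns agree on whether the text contains a match."""
    return bool(NESTED.search(text)) == bool(FLAT.search(text))


samples = ["", "b", "ab", "aaab", "aaac", "xaab", "ba"]
print(all(same_verdict(s) for s in samples))

for n in (8, 12, 16):
    text = "a" * n + "c"  # no 'b', so the engine must exhaust its options
    for label, pattern in (("nested", NESTED), ("flat", FLAT)):
        start = time.perf_counter()
        pattern.search(text)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"n={n:2d} {label:6s} {elapsed_ms:8.3f} ms")
```

On a backtracking engine the nested timings typically grow much faster than the flat ones as n increases, which is the effect described above.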

2. Greedy Quantifiers on Large Text

Bad: .*keyword
On large text, .* matches everything, then backtracks slowly

Better: .*?keyword (lazy)
Or: [^k]*keyword (consume anything that cannot start the keyword; only equivalent when no other "k" appears before it)

3. Unnecessary Captures

Bad: (https?)://(www\.)?([a-z]+)\.com
Creates 3 capture groups you might not need

Better: (?:https?)://(?:www\.)?([a-z]+)\.com
Use (?:...) for non-capturing groups

Best Practices

Testing & Debugging

Online tools:
- regex101.com (debugger with explanation, best for learning)
- regexr.com (visual, interactive)
- regexpal.com (simple tester)

In code:
# Python
import re
pattern = r'\d{3}-\d{2}-\d{4}'
test_cases = ['123-45-6789', 'invalid', '123456789']
for test in test_cases:
    match = re.match(pattern, test)
    print(f"{test}: {'✓' if match else '✗'}")

// JavaScript
const pattern = /\d{3}-\d{2}-\d{4}/;
['123-45-6789', 'invalid'].forEach(test => {
    console.log(`${test}: ${pattern.test(test) ? '✓' : '✗'}`);
});

When NOT to Use Regex

Quick Reference Card

Character Classes:
.        Any character (except newline)
\d \D    Digit / non-digit
\w \W    Word char / non-word
\s \S    Whitespace / non-whitespace
[abc]    a, b, or c
[^abc]   Not a, b, or c
[a-z]    Range a to z

Anchors:
^        Start of string/line
$        End of string/line
\b       Word boundary
\B       Non-word boundary

Quantifiers:
*        0 or more (greedy)
+        1 or more (greedy)
?        0 or 1 (greedy)
{n}      Exactly n times
{n,}     n or more times
{n,m}    n to m times
*? +? ?? Lazy versions

Groups:
(...)        Capturing group
(?:...)      Non-capturing group
(?<name>...) Named group
\1 \2        Backreference
(?=...)      Positive lookahead
(?!...)      Negative lookahead
(?<=...)     Positive lookbehind
(?<!...)     Negative lookbehind

Special:
|        OR (alternation)
\        Escape special char
