Originally compiled by Jeff Dean (Google) around 2010, these numbers provide a mental model for understanding computer performance. While hardware evolves, the relative relationships between these operations remain remarkably stable.
Updated for 2024-2025: Numbers reflect modern hardware (DDR5 RAM, NVMe SSDs, 10+ Gbps networks).
| Operation | Nanoseconds | Microseconds | Milliseconds | Scale |
|---|---|---|---|---|
| L1 cache reference | 0.5 ns | — | — | ⚡ Fastest |
| Branch mispredict | 5 ns | — | — | ⚡ Fastest |
| L2 cache reference | 7 ns | — | — | ⚡ Fastest |
| Mutex lock/unlock | 25 ns | — | — | ⚡ Fastest |
| Main memory reference | 100 ns | — | — | 💨 Fast |
| Compress 1KB with Zippy | 3,000 ns | 3 µs | — | 💨 Fast |
| Send 1KB over 1 Gbps network | 10,000 ns | 10 µs | — | 💨 Fast |
| Read 4KB randomly from SSD | 150,000 ns | 150 µs | — | ⚠️ Medium |
| Read 1MB sequentially from memory | 250,000 ns | 250 µs | — | ⚠️ Medium |
| Round trip within same datacenter | 500,000 ns | 500 µs | — | ⚠️ Medium |
| Read 1MB sequentially from SSD | 1,000,000 ns | 1,000 µs | 1 ms | ⚠️ Medium |
| Disk seek (HDD) | 10,000,000 ns | 10,000 µs | 10 ms | 🐌 Slow |
| Read 1MB sequentially from disk | 20,000,000 ns | 20,000 µs | 20 ms | 🐌 Slow |
| Send packet CA→Netherlands→CA | 150,000,000 ns | 150,000 µs | 150 ms | 🐌 Slow |
Because these operations span more than 8 orders of magnitude (0.5 ns to 150 ms is a 300,000,000x difference!), we use a logarithmic scale to visualize them.
Visual comparison (log scale):
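As a rough text-only stand-in for a log-scale chart, the sketch below prints one bar per operation whose length grows with log10 of the latency. The `LATENCIES_NS` dictionary copies a few rows from the table above; the helper names, the 0.1 ns floor, and the bar width are assumptions made for illustration.

```python
import math

# A few rows copied from the latency table (values in nanoseconds).
LATENCIES_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Read 4KB randomly from SSD": 150_000,
    "Disk seek (HDD)": 10_000_000,
    "Send packet CA->Netherlands->CA": 150_000_000,
}


def orders_of_magnitude(fast_ns, slow_ns):
    """Return log10 of the ratio between two latencies."""
    return math.log10(slow_ns / fast_ns)


def log_bar(ns, floor_ns=0.1, chars_per_decade=4):
    """Render a bar whose length grows with log10(ns / floor_ns)."""
    decades = math.log10(ns / floor_ns)
    return "#" * max(1, round(decades * chars_per_decade))


if __name__ == "__main__":
    for name, ns in LATENCIES_NS.items():
        print(f"{name:<34} {log_bar(ns)}")
    spread = orders_of_magnitude(0.5, 150_000_000)
    print(f"Spread: about {spread:.1f} orders of magnitude")
```

Each extra four characters of bar corresponds to a 10x slowdown, which is why the slowest row is only a few times longer than the fastest.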
To make these numbers relatable, imagine if 1 CPU cycle = 1 second of human time. Here's how long each operation would take:
On that scale, the equivalents range from reaching into your pocket and walking to the kitchen, through a long weekend trip and the duration of a pregnancy, up to a college degree plus grad school.
| Operation | Count | Each | Total | % of Budget |
|---|---|---|---|---|
| Network round trip (client ↔ server) | 1 | 30 ms | 30 ms | 30% |
| Database queries (w/ network) | 3 | 10 ms | 30 ms | 30% |
| Application logic | — | — | 20 ms | 20% |
| Redis cache lookups | 5 | 2 ms | 10 ms | 10% |
| Response serialization (JSON) | 1 | 10 ms | 10 ms | 10% |
| TOTAL | — | — | 100 ms | 100% |
Key insight: roughly 60% of the time is spent on the network (the client round trip plus the database calls). Optimize by reducing round trips, caching aggressively, and using connection pooling.
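The budget arithmetic can be checked with a few lines of Python. This is only a sketch: the rows are copied from the table above, "Application logic" (which has no count in the table) is modeled as a single 20 ms step, and the function names are made up for illustration.

```python
# Rows copied from the budget table: (operation, count, milliseconds each).
# "Application logic" has no count in the table; it is modeled as 1 x 20 ms.
BUDGET = [
    ("Network round trip (client <-> server)", 1, 30),
    ("Database queries (w/ network)", 3, 10),
    ("Application logic", 1, 20),
    ("Redis cache lookups", 5, 2),
    ("Response serialization (JSON)", 1, 10),
]


def total_ms(rows):
    """Sum count * per-call cost over all rows."""
    return sum(count * each_ms for _, count, each_ms in rows)


def percent_of_budget(rows):
    """Map each operation to its share of the total, in percent."""
    total = total_ms(rows)
    return {name: 100 * count * each_ms / total for name, count, each_ms in rows}


if __name__ == "__main__":
    for name, pct in percent_of_budget(BUDGET).items():
        print(f"{name:<40} {pct:5.1f}%")
    print(f"Total: {total_ms(BUDGET)} ms")
```

Changing a count (say, dropping the database queries from 3 to 1) shows immediately how much headroom a saved round trip buys.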
Storage sizes and data transfer rates use different units that are often confused. Understanding the difference between bits vs bytes and decimal (SI) vs binary (IEC) prefixes is fundamental.
| Unit | Symbol | Value | Typical Use Case |
|---|---|---|---|
| Bit | b | Binary digit (0 or 1) | Network speeds (Mbps, Gbps) |
| Byte | B | 8 bits | Storage sizes (MB, GB, TB) |
| Nibble | — | 4 bits (half byte) | Hexadecimal digit (0-F) |
Key Point: 1 Byte = 8 bits
Common confusion: A "100 Mbps" connection transfers data at 100 megabits per second,
which equals 12.5 MB/s (megabytes per second).
Conversion:
100 Mbps ÷ 8 = 12.5 MB/s
1 Gbps network = 125 MB/s max throughput
10 Gbps network = 1,250 MB/s = 1.25 GB/s
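The same conversion as a small Python helper (the function name is an illustrative choice, and decimal units are assumed throughout):

```python
BITS_PER_BYTE = 8


def mbps_to_megabytes_per_second(mbps):
    """Convert a link speed in megabits/s to megabytes/s (decimal units)."""
    return mbps / BITS_PER_BYTE


print(mbps_to_megabytes_per_second(100))     # 12.5
print(mbps_to_megabytes_per_second(1_000))   # 125.0
print(mbps_to_megabytes_per_second(10_000))  # 1250.0
```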
Historically, "kilobyte" was used for both 1000 bytes (decimal) and
1024 bytes (binary). This caused confusion and misleading storage sizes.
Solution (IEC standard): Different prefixes for decimal vs binary.
| Decimal (SI) - Base 10 | Value | Binary (IEC) - Base 2 | Value |
|---|---|---|---|
| Kilobyte (KB) | 1,000 bytes (10³) | Kibibyte (KiB) | 1,024 bytes (2¹⁰) |
| Megabyte (MB) | 1,000,000 bytes (10⁶) | Mebibyte (MiB) | 1,048,576 bytes (2²⁰) |
| Gigabyte (GB) | 1,000,000,000 bytes (10⁹) | Gibibyte (GiB) | 1,073,741,824 bytes (2³⁰) |
| Terabyte (TB) | 1,000,000,000,000 bytes (10¹²) | Tebibyte (TiB) | 1,099,511,627,776 bytes (2⁴⁰) |
| Petabyte (PB) | 10¹⁵ bytes | Pebibyte (PiB) | 2⁵⁰ bytes |
| Exabyte (EB) | 10¹⁸ bytes | Exbibyte (EiB) | 2⁶⁰ bytes |
Hard drive manufacturers use decimal (SI) units:
1 TB = 1,000,000,000,000 bytes
Many operating systems (Windows in particular) report sizes in binary (IEC) units, often while still labeling them "GB" or "TB":
1 TiB = 1,099,511,627,776 bytes
Calculation:
1 TB = 1,000,000,000,000 bytes
1 TiB = 1,099,511,627,776 bytes
1,000,000,000,000 ÷ 1,099,511,627,776 = 0.909 TiB
0.909 TiB × 1024 GiB/TiB = 931 GiB
Result: Your "1TB" drive appears as ~931 GiB in your OS
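A minimal sketch of the same calculation in Python, assuming the drive label uses decimal terabytes and the OS reports binary gibibytes (the function name is hypothetical):

```python
def advertised_tb_to_gib(terabytes):
    """Convert decimal terabytes (10**12 bytes) to gibibytes (2**30 bytes)."""
    return terabytes * 10**12 / 2**30


print(f"{advertised_tb_to_gib(1):.1f} GiB")  # 931.3 GiB
```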
| Size | Examples |
|---|---|
| 1 Byte | Single ASCII character ('A', '7', '$') |
| 2 Bytes | Single UTF-16 code unit (any BMP character), 16-bit integer (-32,768 to 32,767) |
| 4 Bytes | 32-bit integer, IPv4 address, single float |
| 8 Bytes | 64-bit long, double-precision float, Unix timestamp |
| ~1 KB | Small text file, short email (plain text) |
| ~100 KB | Low-res photo, small favicon, email with attachments |
| ~1 MB | High-quality photo (JPEG), 1 minute of MP3 audio, short e-book |
| ~100 MB | Movie trailer (720p), mobile app, high-quality album |
| ~1 GB | Standard definition movie, ~1 hour of 1080p video, large video game |
| ~10 GB | HD movie (1080p), operating system install, AAA video game |
| ~100 GB | 4K movie, Call of Duty game install, large database backup |
| ~1 TB | Modern laptop SSD, 200+ HD movies, massive game collection |
| ~10 TB | Home NAS storage, professional video editing workstation |
| ~1 PB | Small datacenter storage, major cloud service storage tier, large research dataset |
| ~1 EB | Google/Facebook photo storage, entire internet archive snapshot, global weather data |
Network speeds are measured in bits per second (bps), not bytes, for historical reasons:
| Connection Type | Speed (bits) | Throughput (bytes) | Download 1GB File |
|---|---|---|---|
| Dial-up modem (ancient) | 56 Kbps | 7 KB/s | ~40 hours |
| DSL / Cable (basic) | 10 Mbps | 1.25 MB/s | ~13 minutes |
| Fast broadband | 100 Mbps | 12.5 MB/s | ~80 seconds |
| Gigabit internet | 1 Gbps | 125 MB/s | ~8 seconds |
| Datacenter link | 10 Gbps | 1.25 GB/s | < 1 second |
| High-speed datacenter | 100 Gbps | 12.5 GB/s | 0.08 seconds |
Interview gotcha: "How long to transfer 1TB over a 10 Gbps link?"
Calculation:
1 TB = 1,000 GB (decimal) = 8,000 Gb (gigabits)
10 Gbps link = 10 Gb per second
Time = 8,000 Gb ÷ 10 Gbps = 800 seconds = 13.3 minutes
Real world: Add ~10-20% overhead for TCP/IP headers, retransmissions
Actual time: ~15-16 minutes
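The estimate above can be reproduced with a short helper. This is a sketch under stated assumptions: decimal units, a perfectly saturated link, and overhead modeled as a flat fraction (15% here, picked as a midpoint of the 10-20% range); the names are illustrative.

```python
def transfer_seconds(size_bytes, link_bits_per_second, overhead=0.0):
    """Ideal transfer time, inflated by a fractional protocol overhead."""
    bits = size_bytes * 8
    return bits / link_bits_per_second * (1 + overhead)


ONE_TB = 10**12        # decimal terabyte, in bytes
TEN_GBPS = 10 * 10**9  # link speed, in bits per second

ideal = transfer_seconds(ONE_TB, TEN_GBPS)
padded = transfer_seconds(ONE_TB, TEN_GBPS, overhead=0.15)  # assumed 15% overhead
print(f"ideal: {ideal / 60:.1f} min, with overhead: {padded / 60:.1f} min")
# ideal: 13.3 min, with overhead: 15.3 min
```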
Computers work in binary (base-2), but humans prefer decimal (base-10). Hexadecimal (base-16) provides a compact way to represent binary. Understanding these systems is fundamental to systems programming, debugging, and understanding how data is stored.
Definition: Number system using only two digits: 0 and 1
Why computers use binary: Electronic circuits have two states - on (1) or off (0). Voltage high = 1, voltage low = 0.
| Decimal | Binary | Calculation |
|---|---|---|
| 0 | 0000 | 0 |
| 1 | 0001 | 1 |
| 2 | 0010 | 2 |
| 3 | 0011 | 2 + 1 = 3 |
| 4 | 0100 | 4 |
| 5 | 0101 | 4 + 1 = 5 |
| 10 | 1010 | 8 + 2 = 10 |
| 15 | 1111 | 8 + 4 + 2 + 1 = 15 |
| 255 | 11111111 | 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 255 |
8-bit binary number: 1 0 1 1 0 1 0 1
Place values: 128 64 32 16 8 4 2 1
Binary digit: 1 0 1 1 0 1 0 1
───────────────────────────────────
Calculation: 128 + 0 + 32 + 16 + 0 + 4 + 0 + 1 = 181
Formula: Each position = 2^n (where n starts at 0 from right)
Position 0 (rightmost): 2^0 = 1
Position 1: 2^1 = 2
Position 2: 2^2 = 4
Position 3: 2^3 = 8
...
Position 7: 2^7 = 128
Powers of 2 (important for memory sizes):
2^0 = 1
2^1 = 2
2^2 = 4
2^3 = 8
2^4 = 16
2^5 = 32
2^6 = 64
2^7 = 128
2^8 = 256
2^10 = 1,024 (1 KiB)
2^16 = 65,536 (64 KiB)
2^20 = 1,048,576 (1 MiB)
2^30 = 1,073,741,824 (1 GiB)
Maximum values for N bits:
4 bits = 0000 to 1111 = 0 to 15
8 bits = 00000000 to 11111111 = 0 to 255
16 bits = 0 to 65,535
32 bits = 0 to 4,294,967,295
Signed integers (two's complement):
8-bit signed: -128 to +127
16-bit signed: -32,768 to +32,767
32-bit signed: -2,147,483,648 to +2,147,483,647
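The place-value rule and the N-bit ranges translate directly into code. The helpers below are an illustrative sketch (the names are not from any library); Python's built-in `int(bits, 2)` does the same conversion.

```python
def binary_to_decimal(bits):
    """Sum 2**position for every '1', counting positions from the right."""
    total = 0
    for position, digit in enumerate(reversed(bits)):
        if digit == "1":
            total += 2 ** position
    return total


def unsigned_range(n_bits):
    """Smallest and largest unsigned value that fits in n_bits."""
    return 0, 2 ** n_bits - 1


def signed_range(n_bits):
    """Smallest and largest two's complement value that fits in n_bits."""
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1


print(binary_to_decimal("10110101"))  # 181
print(unsigned_range(8))              # (0, 255)
print(signed_range(8))                # (-128, 127)
```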
AND (&): Both bits must be 1
  1010 (10)
& 1100 (12)
  ----
  1000 (8)
OR (|): At least one bit is 1
  1010 (10)
| 1100 (12)
  ----
  1110 (14)
XOR (^): Bits are different
  1010 (10)
^ 1100 (12)
  ----
  0110 (6)
NOT (~): Flip all bits
~ 1010 (10)
  ----
  0101 (5) [assuming 4-bit]
Left shift (<<): Multiply by 2
5 << 1 → 0101 << 1 = 1010 (10) [5 × 2]
5 << 2 → 0101 << 2 = 10100 (20) [5 × 4]
Right shift (>>): Divide by 2
20 >> 1 → 10100 >> 1 = 1010 (10) [20 ÷ 2]
20 >> 2 → 10100 >> 2 = 0101 (5) [20 ÷ 4]
Common uses:
- Check if number is even: (n & 1) == 0
- Check if number is power of 2: n > 0 && (n & (n-1)) == 0 (the n > 0 guard excludes zero)
- Set bit: n |= (1 << i)
- Clear bit: n &= ~(1 << i)
- Toggle bit: n ^= (1 << i)
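These operations behave the same way in Python, with one caveat: Python integers are unbounded, so NOT has to be masked to get a fixed-width result. The helper names below are illustrative, and the power-of-two check adds an `n > 0` guard so that zero is not reported as a power of two.

```python
def is_even(n):
    return (n & 1) == 0


def is_power_of_two(n):
    # n > 0 excludes zero, which would otherwise pass the bit test.
    return n > 0 and (n & (n - 1)) == 0


def set_bit(n, i):
    return n | (1 << i)


def clear_bit(n, i):
    return n & ~(1 << i)


def toggle_bit(n, i):
    return n ^ (1 << i)


def not_4bit(n):
    # Python ints have no fixed width, so mask the result to 4 bits.
    return ~n & 0xF


print(10 & 12, 10 | 12, 10 ^ 12)  # 8 14 6
print(not_4bit(10))               # 5
print(5 << 2, 20 >> 2)            # 20 5
print(is_power_of_two(64), is_power_of_two(0))  # True False
```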
Definition: Number system using 16 digits: 0-9 and A-F
Why hexadecimal? Compact representation of binary. One hex digit = exactly 4 bits (nibble).
| Decimal | Binary | Hexadecimal |
|---|---|---|
| 0 | 0000 | 0 |
| 1 | 0001 | 1 |
| 2 | 0010 | 2 |
| 3 | 0011 | 3 |
| 4 | 0100 | 4 |
| 5 | 0101 | 5 |
| 6 | 0110 | 6 |
| 7 | 0111 | 7 |
| 8 | 1000 | 8 |
| 9 | 1001 | 9 |
| 10 | 1010 | A |
| 11 | 1011 | B |
| 12 | 1100 | C |
| 13 | 1101 | D |
| 14 | 1110 | E |
| 15 | 1111 | F |
| 255 | 11111111 | FF |
Hex to Binary (easy - just expand each digit):
0xCAFE = C A F E
       = 1100 1010 1111 1110
       = 1100101011111110
Binary to Hex (easy - group by 4 bits from right):
10110101 = 1011 0101 (group by 4)
         = B 5
         = 0xB5
Hex to Decimal:
0x2F = (2 × 16^1) + (15 × 16^0) = 32 + 15 = 47
0x1A3 = (1 × 16^2) + (10 × 16^1) + (3 × 16^0) = 256 + 160 + 3 = 419
Decimal to Hex (divide by 16, track remainders):
255 ÷ 16 = 15 remainder 15 (F)
15 ÷ 16 = 0 remainder 15 (F)
Result: 0xFF
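The divide-by-16 procedure and the digit-by-digit expansion can be sketched in Python as follows. The function names are illustrative and the helper assumes non-negative input; in practice the built-ins `hex()`, `int(s, 16)`, and `format(n, "b")` cover the same ground.

```python
HEX_DIGITS = "0123456789ABCDEF"


def decimal_to_hex(n):
    """Repeatedly divide by 16 and collect remainders (non-negative n only)."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 16)
        digits.append(HEX_DIGITS[remainder])  # least significant digit first
    return "".join(reversed(digits))


def hex_to_binary(hex_string):
    """Expand each hex digit into its 4-bit group."""
    return " ".join(format(int(digit, 16), "04b") for digit in hex_string)


print(decimal_to_hex(255))    # FF
print(decimal_to_hex(419))    # 1A3
print(hex_to_binary("CAFE"))  # 1100 1010 1111 1110
print(int("2F", 16))          # 47
```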
Memory addresses:
0x00007fff5fbff8a0 (64-bit pointer)
0xdeadbeef (common debug value)
0x00000000 (NULL pointer)
Colors (RGB):
#FFFFFF = white (255, 255, 255)
#000000 = black (0, 0, 0)
#FF0000 = red (255, 0, 0)
#00FF00 = green (0, 255, 0)
#0000FF = blue (0, 0, 255)
#CAFE01 = custom (202, 254, 1)
Byte values:
0x00 = 0
0xFF = 255 (max for 1 byte)
0x7F = 127 (max positive for signed byte)
0x80 = 128 or -128 (signed)
Bitmasks:
0x0F = 00001111 (mask lower 4 bits)
0xF0 = 11110000 (mask upper 4 bits)
0xFF = 11111111 (all bits set)
0x00 = 00000000 (all bits clear)
Permissions (Unix; note these are octal, not hexadecimal, with one digit per 3 bits):
0644 = rw-r--r-- (owner: rw, group: r, other: r)
0755 = rwxr-xr-x (owner: rwx, group: rx, other: rx)
0777 = rwxrwxrwx (all permissions)
Definition: 7-bit character encoding standard (0-127), representing English characters, digits, and control codes.
History: Developed in 1963 for teleprinters and early computers. Still the foundation of modern text encoding.
| Range | Decimal | Hex | Description | Examples |
|---|---|---|---|---|
| Control characters | 0-31 | 0x00-0x1F | Non-printable control codes | NUL(0), TAB(9), LF(10), CR(13), ESC(27) |
| Space & symbols | 32-47 | 0x20-0x2F | Space, punctuation | Space(32), !(33), "(34), #(35) |
| Digits | 48-57 | 0x30-0x39 | 0-9 | '0'(48), '5'(53), '9'(57) |
| More symbols | 58-64 | 0x3A-0x40 | Punctuation | :(58), @(64) |
| Uppercase letters | 65-90 | 0x41-0x5A | A-Z | 'A'(65), 'M'(77), 'Z'(90) |
| More symbols | 91-96 | 0x5B-0x60 | Brackets, etc. | [(91), \(92), ](93) |
| Lowercase letters | 97-122 | 0x61-0x7A | a-z | 'a'(97), 'm'(109), 'z'(122) |
| More symbols | 123-126 | 0x7B-0x7E | Braces, etc. | {(123), }(125), ~(126) |
| Delete | 127 | 0x7F | Delete control character | DEL(127) |
Control characters:
0x00 (0) = NUL (null terminator in C strings)
0x09 (9) = TAB (horizontal tab)
0x0A (10) = LF (line feed, '\n' on Unix)
0x0D (13) = CR (carriage return, '\r')
0x1B (27) = ESC (escape, used in terminal codes)
0x20 (32) = SPACE
Digits '0'-'9':
'0' = 48 (0x30)
'1' = 49 (0x31)
...
'9' = 57 (0x39)
Uppercase 'A'-'Z':
'A' = 65 (0x41)
'B' = 66 (0x42)
...
'Z' = 90 (0x5A)
Lowercase 'a'-'z':
'a' = 97 (0x61)
'b' = 98 (0x62)
...
'z' = 122 (0x7A)
Useful patterns:
Uppercase to lowercase: Add 32 ('A' + 32 = 'a')
Lowercase to uppercase: Subtract 32 ('a' - 32 = 'A')
Difference: 0x20 (32) exactly
Digit to integer: Subtract '0' ('5' - '0' = 5)
Integer to digit: Add '0' (5 + '0' = '5')
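The patterns above map onto `ord()` and `chr()` in Python. The helpers below are a sketch for ASCII input only (the names are illustrative); for real text, `str.lower()`, `str.upper()`, and `int()` are the usual tools.

```python
def to_lower_ascii(ch):
    """Add 32 to an uppercase ASCII letter; leave anything else untouched."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 32)
    return ch


def to_upper_ascii(ch):
    """Subtract 32 from a lowercase ASCII letter; leave anything else untouched."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)
    return ch


def digit_to_int(ch):
    """'5' -> 5, by subtracting the code of '0'."""
    return ord(ch) - ord("0")


def int_to_digit(n):
    """5 -> '5', by adding the code of '0'."""
    return chr(n + ord("0"))


print(ord("A"), ord("a"), ord("a") - ord("A"))   # 65 97 32
print(to_lower_ascii("A"), to_upper_ascii("a"))  # a A
print(digit_to_int("5"), int_to_digit(5))        # 5 5
```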
Problem: Standard ASCII (7-bit) only covers English. Many incompatible 8-bit extensions were created for other languages.
Different systems use different line ending conventions:
Unix/Linux/macOS: LF (0x0A, '\n')
Windows: CRLF (0x0D 0x0A, '\r\n')
Old Mac (pre-OSX): CR (0x0D, '\r')
Why this causes problems:
- Text file created on Windows opened on Linux shows ^M characters
- Git can auto-convert (core.autocrlf setting)
- Always specify line endings in .gitattributes for consistency
Issue: ASCII (128 chars) and Extended ASCII (256 chars) can't represent:
Solution: Unicode - Universal character set supporting 1,114,112 possible characters (code points U+0000 to U+10FFFF)
Code Point: Abstract number assigned to each character
- Written as U+xxxx (hexadecimal)
- Examples:
'A' = U+0041
'€' = U+20AC (Euro sign)
'中' = U+4E2D (Chinese character)
'😀' = U+1F600 (Grinning face emoji)
Encoding: How code points are stored as bytes
Unicode defines the characters, encodings define byte representation
Why UTF-8 won:
1 byte (ASCII): U+0000 to U+007F
Pattern: 0xxxxxxx
Example: 'A' = U+0041 = 01000001 = 0x41 (1 byte)
2 bytes: U+0080 to U+07FF
Pattern: 110xxxxx 10xxxxxx
Example: '©' = U+00A9 = 11000010 10101001 = 0xC2 0xA9 (2 bytes)
3 bytes: U+0800 to U+FFFF
Pattern: 1110xxxx 10xxxxxx 10xxxxxx
Example: '€' = U+20AC = 11100010 10000010 10101100 = 0xE2 0x82 0xAC (3 bytes)
4 bytes: U+10000 to U+10FFFF
Pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Example: '😀' = U+1F600 = 11110000 10011111 10011000 10000000 = 0xF0 0x9F 0x98 0x80 (4 bytes)
Key insight: Leading byte tells you character length
0xxxxxxx = 1 byte (ASCII)
110xxxxx = 2 bytes
1110xxxx = 3 bytes
11110xxx = 4 bytes
10xxxxxx = Continuation byte (never starts a character)
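To make the bit patterns concrete, here is a hand-rolled encoder checked against Python's built-in `str.encode`. It is a teaching sketch, not a replacement for the built-in: the function name is made up, and it does not reject surrogate code points (U+D800 to U+DFFF) the way a strict encoder would.

```python
def utf8_encode(code_point):
    """Encode one code point using the 1- to 4-byte patterns shown above."""
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0x10FFFF:   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    raise ValueError("code point out of Unicode range")


for cp in (0x41, 0xA9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    # Compare against the built-in encoder as a sanity check.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```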
| Encoding | Bytes per Char | Pros | Cons | Used By |
|---|---|---|---|---|
| UTF-8 | 1-4 bytes (variable) | ASCII compatible, space efficient, no BOM needed | Variable length complicates indexing | Web, Linux, most programming |
| UTF-16 | 2 or 4 bytes | Fixed width for most chars (BMP) | Not ASCII compatible, byte order issues (BOM) | Windows, Java, JavaScript internally |
| UTF-32 | 4 bytes (fixed) | Fixed width, easy indexing | Space inefficient (4× ASCII size) | Rare, some internal processing |
1. String length is ambiguous:
"café" in UTF-8 = 5 bytes (c=1, a=1, f=1, é=2)
But 4 Unicode code points
But 4 "characters" to humans
JavaScript: "café".length → 4 (counts UTF-16 code units)
Python: len("café") → 4 (counts code points)
Bytes: len("café".encode('utf-8')) → 5 (bytes)
2. Emoji are complex:
"👨👩👧👦" (family) = 7 code points:
👨 (man) + ZWJ + 👩 (woman) + ZWJ + 👧 (girl) + ZWJ + 👦 (boy)
(ZWJ = Zero Width Joiner, U+200D)
"👍" (thumbs up) = 1 code point (U+1F44D)
"👍🏿" (dark skin tone) = 2 code points (base + modifier)
3. Visual vs code point length:
é can be:
- 1 code point: U+00E9 (precomposed "é")
- 2 code points: U+0065 U+0301 (e + combining acute accent)
Both look identical, different byte representations!
→ Use Unicode normalization (NFC, NFD) to standardize
4. Case folding isn't simple:
German: "ß".toUpperCase() → "SS" (1 char becomes 2!)
Turkish: "i".toLocaleUpperCase("tr") → "İ" (dotted I)
"I".toLocaleLowerCase("tr") → "ı" (dotless i)
(plain toUpperCase()/toLowerCase() ignore the locale and return "I" and "i")
→ Use locale-aware case conversion
5. Sorting/collation is locale-dependent:
Swedish: ä comes after z
German: ä sorts like a
→ Use ICU library or locale-aware sorting
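Several of these pitfalls can be observed directly in Python using only the standard library. The strings are written with escape sequences so the exact code points are unambiguous; the variable names are illustrative.

```python
import unicodedata

cafe = "caf\u00e9"              # precomposed é (U+00E9)
cafe_decomposed = "cafe\u0301"  # e followed by combining acute accent

print(len(cafe))                  # 4 code points
print(len(cafe.encode("utf-8")))  # 5 bytes
print(cafe == cafe_decomposed)    # False, although both render as "café"
print(unicodedata.normalize("NFC", cafe_decomposed) == cafe)  # True

# Family emoji: four people joined by three zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))                # 7 code points

print("\u00df".upper())           # SS (one character becomes two)
```

Normalizing both sides to the same form (NFC here) before comparing or hashing avoids the "looks identical, compares unequal" problem.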
Definition: Regular expressions are patterns used to match character combinations in strings. They're a powerful tool for text searching, validation, parsing, and manipulation.
Why they matter: Used everywhere - text editors, log parsing, input validation, data extraction, search & replace, URL routing, lexical analysis, and more.
Trade-off: Powerful but can be cryptic. The joke: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
Most characters match themselves:
Pattern: cat
Matches: "cat" in "The cat sat"
Doesn't match: "Cat" (case-sensitive by default)
Pattern: hello world
Matches: "hello world" exactly
| Character | Meaning | Example |
|---|---|---|
| `.` | Any single character (except newline) | c.t matches "cat", "cot", "c9t" |
| `^` | Start of string/line | ^cat matches "cat" only at start |
| `$` | End of string/line | cat$ matches "cat" only at end |
| `*` | 0 or more times | ca*t matches "ct", "cat", "caaat" |
| `+` | 1 or more times | ca+t matches "cat", "caat" (not "ct") |
| `?` | 0 or 1 time (optional) | colou?r matches "color" or "colour" |
| `\|` | OR (alternation) | cat\|dog matches "cat" or "dog" |
| `( )` | Grouping and capturing | (ca)+t matches "cat", "cacat" |
| `[ ]` | Character class | [aeiou] matches any vowel |
| `\` | Escape special character | \. matches literal "." |
To match metacharacters literally, escape them:
Pattern: \$\d+\.\d{2}
Matches: "$19.99" (dollar amount)
$ escaped with \$
. escaped with \.
| Class | Equivalent | Matches |
|---|---|---|
| `\d` | `[0-9]` | Any digit |
| `\D` | `[^0-9]` | Any non-digit |
| `\w` | `[a-zA-Z0-9_]` | Word character (letter, digit, underscore) |
| `\W` | `[^a-zA-Z0-9_]` | Non-word character |
| `\s` | `[ \t\n\r\f\v]` | Whitespace (space, tab, newline, etc.) |
| `\S` | `[^ \t\n\r\f\v]` | Non-whitespace |
Square brackets [ ] define character sets:
[aeiou] - Matches any single vowel
[0-9] - Matches any digit (same as \d)
[a-z] - Matches lowercase letter
[A-Z] - Matches uppercase letter
[a-zA-Z] - Matches any letter
[a-z0-9] - Matches letter or digit
[^aeiou] - Matches any character EXCEPT vowels (^ negates)
[0-9a-fA-F] - Matches hexadecimal digit
Special characters lose meaning inside [ ]:
[.] - Matches literal dot (no need to escape)
[*+?] - Matches literal *, +, or ?
[a-z-] - Matches a-z or hyphen (hyphen at end)
[-a-z] - Matches hyphen or a-z (hyphen at start)
[a\-z] - Matches a, hyphen, or z (hyphen escaped)
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| `*` | 0 or more | `ab*c` | "ac", "abc", "abbc", "abbbc" |
| `+` | 1 or more | `ab+c` | "abc", "abbc" (not "ac") |
| `?` | 0 or 1 | `ab?c` | "ac" or "abc" |
| `{n}` | Exactly n times | `\d{3}` | "123", "999" (exactly 3 digits) |
| `{n,}` | n or more times | `\d{3,}` | "123", "1234", "12345" |
| `{n,m}` | Between n and m times | `\d{3,5}` | "123", "1234", "12345" (not "12" or "123456") |
Greedy (default): Match as much as possible
Pattern: <.*>
String: <b>bold</b> and <i>italic</i>
Matches: "<b>bold</b> and <i>italic</i>" (entire string!)
Lazy (non-greedy): Match as little as possible (add ?)
Pattern: <.*?>
String: <b>bold</b> and <i>italic</i>
Matches: "<b>", "</b>", "<i>", "</i>" (each tag separately)
Lazy quantifiers:
*? - 0 or more (lazy)
+? - 1 or more (lazy)
?? - 0 or 1 (lazy)
{n,}? - n or more (lazy)
{n,m}? - between n and m (lazy)
Example: Extract quoted strings
Greedy: ".*" on 'He said "hello" and "goodbye"' → '"hello" and "goodbye"'
Lazy: ".*?" on 'He said "hello" and "goodbye"' → '"hello"' and '"goodbye"'
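The greedy/lazy difference is easy to confirm with Python's `re.findall`. The quoted sentence is the one used above; the `markup` string is a sample chosen for this sketch.

```python
import re

sentence = 'He said "hello" and "goodbye"'
print(re.findall(r'".*"', sentence))   # ['"hello" and "goodbye"']
print(re.findall(r'".*?"', sentence))  # ['"hello"', '"goodbye"']

markup = "<b>bold</b> and <i>italic</i>"  # sample input for illustration
greedy_tags = re.findall(r"<.*>", markup)
lazy_tags = re.findall(r"<.*?>", markup)
print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```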
| Anchor | Meaning | Example |
|---|---|---|
| `^` | Start of string/line | ^Hello matches "Hello world" (not "Say Hello") |
| `$` | End of string/line | world$ matches "Hello world" (not "world peace") |
| `\b` | Word boundary | \bcat\b matches "cat" (not "category" or "scat") |
| `\B` | Non-word boundary | \Bcat\B matches "cat" in "scattered" |
| `\A` | Start of string only | Like ^ but never matches after a newline, even in multiline mode |
| `\Z` | End of string only | Like $ but unaffected by multiline mode; in Python it matches only at the very end, while PCRE and Java still allow one trailing newline (use \z there for the absolute end) |
Word boundaries (\b) are crucial for exact matches:
Pattern: cat
String: "cat category scat"
Matches: ALL occurrences (cat in all three words)
Pattern: \bcat\b
String: "cat category scat"
Matches: Only "cat" (standalone word)
Validate entire string (common for input validation):
Pattern: ^\d{3}-\d{2}-\d{4}$
Matches: "123-45-6789" (entire string is SSN format)
Rejects: "My SSN is 123-45-6789" (extra text)
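Both ideas, word boundaries and whole-string validation, can be tried in Python. This is a minimal sketch with illustrative names; it uses `re.fullmatch` for validation because in Python a bare `$` also matches just before a trailing newline.

```python
import re

text = "cat category scat"
print(re.findall(r"cat", text))      # ['cat', 'cat', 'cat']
print(re.findall(r"\bcat\b", text))  # ['cat']

SSN_FORMAT = re.compile(r"\d{3}-\d{2}-\d{4}")


def looks_like_ssn(value):
    """True only if the whole string has the NNN-NN-NNNN shape."""
    return SSN_FORMAT.fullmatch(value) is not None


print(looks_like_ssn("123-45-6789"))            # True
print(looks_like_ssn("My SSN is 123-45-6789"))  # False
```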
Parentheses create capturing groups:
Pattern: (\d{3})-(\d{2})-(\d{4})
String: "123-45-6789"
Capture groups:
Group 0 (entire match): "123-45-6789"
Group 1: "123"
Group 2: "45"
Group 3: "6789"
Use in replacement:
Pattern: (\w+)\s+(\w+)
String: "John Doe"
Replace with: $2, $1 (or \2, \1 in some flavors)
Result: "Doe, John"
Backreferences (refer to captured groups):
Pattern: \b(\w+)\s+\1\b
Matches: Repeated words like "the the" or "is is"
\1 refers back to whatever Group 1 matched
Pattern: <(\w+)>.*?</\1>
Matches: <b>text</b> or <i>text</i>
Ensures closing tag matches opening tag
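In Python, numbered groups are read with `.groups()` or `.group(n)`, and replacement strings use `\1`, `\2` rather than `$1`, `$2`. The sample sentence for the repeated-word check is made up for this sketch.

```python
import re

# Capturing groups.
match = re.match(r"(\d{3})-(\d{2})-(\d{4})", "123-45-6789")
print(match.group(0))  # 123-45-6789
print(match.groups())  # ('123', '45', '6789')

# Groups in a replacement string (Python uses \2, \1).
swapped = re.sub(r"(\w+)\s+(\w+)", r"\2, \1", "John Doe")
print(swapped)         # Doe, John

# Backreference inside the pattern: find a doubled word.
doubled = re.search(r"\b(\w+)\s+\1\b", "Paris in the the spring")
print(doubled.group(0), "|", doubled.group(1))  # the the | the
```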
Use (?:...) when you need grouping but not capturing:
Pattern: (?:https?|ftp)://(\S+)
Matches: URLs with http, https, or ftp
Only captures the domain/path part (not the protocol)
Why use non-capturing groups?
- Performance: Capturing has overhead
- Clarity: Numbered groups ($1, $2) stay simple
- Necessity: Some regex engines limit number of capture groups
Python/PCRE syntax: (?P<name>...)
JavaScript/Java/.NET syntax: (?<name>...)
Pattern: (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
String: "2025-01-15"
Captures:
year: "2025"
month: "01"
day: "15"
# Python example:
import re
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', '2025-01-15')
print(match.group('year')) # "2025"
| Type | Syntax | Meaning | Example |
|---|---|---|---|
| Positive Lookahead | `(?=...)` | Followed by pattern | `\d+(?= dollars)` matches number before " dollars" |
| Negative Lookahead | `(?!...)` | NOT followed by pattern | `\d+(?! dollars)` matches numbers not before " dollars" |
| Positive Lookbehind | `(?<=...)` | Preceded by pattern | `(?<=\$)\d+` matches number after "$" |
| Negative Lookbehind | `(?<!...)` | NOT preceded by pattern | `(?<!\$)\d+` matches numbers not after "$" |
Lookaheads/lookbehinds are zero-width (don't consume characters):
Pattern: \w+(?=\.)
String: "Hello. World. Test."
Matches: "Hello", "World", "Test" (dots not included in match)
Complex example: Password validation
Requirement: 8+ chars, at least 1 uppercase, 1 lowercase, 1 digit
Pattern: ^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$
(?=.*[A-Z]) - Must contain uppercase
(?=.*[a-z]) - Must contain lowercase
(?=.*\d) - Must contain digit
.{8,} - At least 8 characters
Matches: "Password1" ✓
Rejects: "password" (no uppercase or digit) ✗
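The lookaround examples above can be run as-is in Python. The function name and the sample sentences are illustrative; the password rule is the one stated above, not a recommendation for real password policy.

```python
import re

PASSWORD = re.compile(r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$")


def is_valid_password(candidate):
    """8+ characters with at least one uppercase, one lowercase, one digit."""
    return PASSWORD.match(candidate) is not None


print(is_valid_password("Password1"))  # True
print(is_valid_password("password"))   # False

# Zero-width: the dot is checked but not included in the match.
print(re.findall(r"\w+(?=\.)", "Hello. World. Test."))  # ['Hello', 'World', 'Test']

# Lookahead and lookbehind on sample sentences.
print(re.findall(r"\d+(?= dollars)", "pay 100 dollars or 90 euros"))  # ['100']
print(re.findall(r"(?<=\$)\d+", "cost $19 and 20"))                   # ['19']
```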
Email (basic):
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Note: Email validation is complex; this catches most common formats
URL/URI:
https?://[^\s]+
Or more strict:
^https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b
Phone (US):
^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$
Matches: (123) 456-7890, 123-456-7890, 123.456.7890, 1234567890
IP Address (IPv4):
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
Validates: 0.0.0.0 to 255.255.255.255
Date (YYYY-MM-DD):
^\d{4}-\d{2}-\d{2}$
Basic: Matches format, not validity (allows 2025-99-99)
Strict: ^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$
Credit Card (format only, no Luhn check):
^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$
Hex Color:
^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$
Matches: #FFFFFF, #FFF, FFFFFF, FFF
Username (alphanumeric + underscore, 3-16 chars):
^[a-zA-Z0-9_]{3,16}$
Strong Password (8+ chars, upper, lower, digit, special):
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$
Extract all words:
\b\w+\b
Extract numbers (including decimals):
\d+\.?\d*
Or: -?\d+(?:\.\d+)? (includes negative numbers)
HTML tags:
<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)
Note: HTML is not regular; use proper HTML parser for production
Remove extra whitespace:
\s+
Replace with single space
Trim whitespace:
^\s+|\s+$
Replace with empty string
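A quick way to build confidence in patterns like these is to run them against inputs that should pass and inputs that should fail. The sketch below compiles a few of the patterns listed above; the constant and function names are illustrative. Note that the strict date pattern still accepts impossible dates such as February 31, so it checks shape, not calendar validity.

```python
import re

IPV4 = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)
STRICT_DATE = re.compile(r"^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$")
HEX_COLOR = re.compile(r"^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$")


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)


def trim(text):
    """Strip leading and trailing whitespace."""
    return re.sub(r"^\s+|\s+$", "", text)


print(bool(IPV4.match("192.168.1.1")), bool(IPV4.match("256.1.1.1")))              # True False
print(bool(STRICT_DATE.match("2025-01-15")), bool(STRICT_DATE.match("2025-99-99")))  # True False
print(bool(STRICT_DATE.match("2025-02-31")))  # True: shape is fine, date is not
print(collapse_whitespace("a   b\t\nc"))      # a b c
print(repr(trim("  hi  ")))                   # 'hi'
```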
| Feature | PCRE (Perl) | JavaScript | Python | Java |
|---|---|---|---|---|
| Named groups | `(?<name>...)` | `(?<name>...)` (ES2018+) | `(?P<name>...)` | `(?<name>...)` |
| Lookbehind | ✓ Fixed length per alternative | ✓ Variable length (ES2018+) | ✓ Fixed width only (`re` module) | ✓ Bounded length only |
| Backreferences | `\1` | `\1` (`$1` in replacements) | `\1` | `\1` (`$1` in replacements) |
| Flags | `/pattern/i` | `/pattern/gi` | `re.I`, `re.M` | `Pattern.CASE_INSENSITIVE` |
| Unicode support | ✓ Full | ✓ Limited (better in ES2015+) | ✓ Full | ✓ Full |
Flags modify how regex behaves:
i - Case insensitive
/cat/i matches "cat", "Cat", "CAT"
g - Global (find all matches, not just first)
/cat/g finds all "cat" in string
m - Multiline (^ and $ match line breaks)
Without: ^cat$ matches only if the entire string is "cat"
With: ^cat$ matches any line that is exactly "cat"
s - Dotall (. matches newline)
Without: . matches everything except \n
With: . matches everything including \n
x - Extended/verbose (ignore whitespace, allow comments)
Useful for complex patterns with documentation
u - Unicode (JavaScript, treat pattern as Unicode)
/\u{1F600}/u matches 😀
JavaScript example:
const regex = /cat/gi; // Case insensitive + global
"Cat dog cat CAT".match(regex); // ["Cat", "cat", "CAT"]
Python example:
import re
re.findall(r'cat', 'Cat dog cat CAT', re.IGNORECASE) # ['Cat', 'cat', 'CAT']
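The multiline, dotall, and verbose flags are easiest to understand side by side. The sketch below uses Python's `re.MULTILINE`, `re.DOTALL`, and `re.VERBOSE`; the sample text and names are chosen for illustration.

```python
import re

text = "cat\ndog\ncat"

# Multiline: ^ and $ match at every line break, not just the string ends.
print(re.findall(r"^cat$", text))                # []
print(re.findall(r"^cat$", text, re.MULTILINE))  # ['cat', 'cat']

# Dotall: . also matches the newline character.
print(re.search(r"a.b", "a\nb"))                   # None
print(bool(re.search(r"a.b", "a\nb", re.DOTALL)))  # True

# Verbose: whitespace in the pattern is ignored and comments are allowed.
SSN = re.compile(
    r"""
    ^\d{3}   # three digits
    -\d{2}   # dash, two digits
    -\d{4}$  # dash, four digits
    """,
    re.VERBOSE,
)
print(bool(SSN.match("123-45-6789")))  # True
```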
1. Catastrophic Backtracking
Bad: (a+)+b
String: "aaaaaaaaaaaaaaaaaac" (no 'b' at end)
Result: Exponential backtracking, can hang for seconds!
Why: Regex engine tries every combination of how to split 'a's between inner and outer +
Solution: Use possessive/atomic groups or avoid nested quantifiers
Better: a+b (no nesting)
2. Greedy Quantifiers on Large Text
Bad: .*keyword
On large text, .* matches everything, then backtracks slowly
Better: .*?keyword (lazy)
Or: [^k]*keyword (match anything except the keyword's first letter; only equivalent when no other "k" appears before the keyword)
3. Unnecessary Captures
Bad: (https?)://(www\.)?([a-z]+)\.com
Creates 3 capture groups you might not need
Better: (?:https?)://(?:www\.)?([a-z]+)\.com
Use (?:...) for non-capturing groups
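Two of these points can be checked without any timing code. The sketch below shows how many groups each URL pattern produces, and that the nested and flat quantifier patterns accept the same short inputs (only their cost on long non-matching input differs). The sample URL and strings are chosen for illustration, and the inputs are deliberately short so the nested pattern stays fast.

```python
import re

url = "https://www.example.com"

capturing = re.search(r"(https?)://(www\.)?([a-z]+)\.com", url)
non_capturing = re.search(r"(?:https?)://(?:www\.)?([a-z]+)\.com", url)
print(capturing.groups())      # ('https', 'www.', 'example')
print(non_capturing.groups())  # ('example',)


def same_verdict(sample):
    """True when (a+)+b and a+b agree on whether the sample matches."""
    nested = re.search(r"(a+)+b", sample) is not None
    flat = re.search(r"a+b", sample) is not None
    return nested == flat


print(all(same_verdict(s) for s in ["aaab", "aaac", "b", "ab"]))  # True
```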
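Be specific: use [0-9] instead of . when you know it's a digit
Anchor where you can: ^ and $ prevent unnecessary scanning
Skip captures you don't use: write (?:...) when you don't need the capture
Avoid nested quantifiers: (a+)+ can cause exponential backtracking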
Online tools:
- regex101.com (debugger with explanation, best for learning)
- regexr.com (visual, interactive)
- regexpal.com (simple tester)
In code:
# Python
import re
pattern = r'\d{3}-\d{2}-\d{4}'
test_cases = ['123-45-6789', 'invalid', '123456789']
for test in test_cases:
    match = re.match(pattern, test)  # matches at the start of the string
    print(f"{test}: {'✓' if match else '✗'}")
// JavaScript
const pattern = /\d{3}-\d{2}-\d{4}/;
['123-45-6789', 'invalid'].forEach(test => {
console.log(`${test}: ${pattern.test(test) ? '✓' : '✗'}`);
});
For simple substring or prefix checks, skip the regex: str.contains() or str.startsWith() are clearer and faster.
Character Classes:
. Any character (except newline)
\d \D Digit / non-digit
\w \W Word char / non-word
\s \S Whitespace / non-whitespace
[abc] a, b, or c
[^abc] Not a, b, or c
[a-z] Range a to z
Anchors:
^ Start of string/line
$ End of string/line
\b Word boundary
\B Non-word boundary
Quantifiers:
* 0 or more (greedy)
+ 1 or more (greedy)
? 0 or 1 (greedy)
{n} Exactly n times
{n,} n or more times
{n,m} n to m times
*? +? ?? Lazy versions
Groups:
(...) Capturing group
(?:...) Non-capturing group
(?<name>...) Named group (Python: (?P<name>...))
\1 \2 Backreference
(?=...) Positive lookahead
(?!...) Negative lookahead
(?<=...) Positive lookbehind
(?<!...) Negative lookbehind
Special:
| OR (alternation)
\ Escape special char