Unicode microscope

Why string.length lies: graphemes vs code points vs UTF-8 bytes.

stack

Intl.Segmenter · TextEncoder · NFC/NFD

kind

decoder

status

live

Type a string. See JS .length, code points, grapheme clusters, and UTF-8 bytes side-by-side. The four numbers agree only for plain ASCII text. Emoji ZWJ sequences, regional-indicator flags, and combining marks each break a different abstraction; the breakdown shows where.

input

café 🇨🇦 👨‍👩‍👧

graphemes · 8

what the user sees as one character

code points · 13

Unicode scalar values

UTF-16 units · 18

JS .length: each surrogate pair counts as two units

UTF-8 bytes · 33

wire / disk / database length
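
The four counts can be reproduced with standard APIs. The following is a minimal sketch, not the page's actual source: the function name measure and the shape of its return value are assumptions made for illustration. It relies on Intl.Segmenter, the string iterator (which walks by code point), .length, and TextEncoder.

```javascript
// Hypothetical helper: name and return shape are illustrative only.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    graphemes: [...segmenter.segment(text)].length, // user-perceived characters
    codePoints: [...text].length,                   // string iterator walks by code point
    utf16Units: text.length,                        // what JS .length reports
    utf8Bytes: new TextEncoder().encode(text).length,
  };
}

// Escapes keep the sample unambiguous: "café", the CA flag, and the ZWJ family.
const sample =
  "caf\u00E9 \u{1F1E8}\u{1F1E6} \u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(measure(sample));
// { graphemes: 8, codePoints: 13, utf16Units: 18, utf8Bytes: 33 }
```

The grapheme count depends on the Unicode data shipped with the runtime, so an older engine may split newer emoji sequences differently.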

breakdown · one row per grapheme cluster

#	cluster	codepoints	utf-8	utf-16
00	c	U+0063	63	0063
01	a	U+0061	61	0061
02	f	U+0066	66	0066
03	é	U+00E9	C3 A9	00E9
04	(space)	U+0020	20	0020
05	🇨🇦	U+1F1E8 U+1F1E6	F0 9F 87 A8 F0 9F 87 A6	D83C DDE8 D83C DDE6
06	(space)	U+0020	20	0020
07	👨‍👩‍👧	U+1F468 U+200D U+1F469 U+200D U+1F467	F0 9F 91 A8 E2 80 8D F0 9F 91 A9 E2 80 8D F0 9F 91 A7	D83D DC68 200D D83D DC69 200D D83D DC67
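
A table like the one above can be built by segmenting the string into grapheme clusters and encoding each cluster on its own. This is a sketch under assumptions: breakdown and hex are hypothetical names, and the output format only approximates the table's columns.

```javascript
// Hypothetical helpers: names and row shape are illustrative only.
const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");

function breakdown(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const encoder = new TextEncoder();
  return [...segmenter.segment(text)].map(({ segment }) => ({
    cluster: segment,
    // One entry per code point in the cluster.
    codePoints: [...segment].map((ch) => "U+" + hex(ch.codePointAt(0), 4)),
    // Spread the Uint8Array first so map() can return strings.
    utf8: [...encoder.encode(segment)].map((b) => hex(b, 2)),
    // One entry per UTF-16 code unit, so surrogate pairs show as two values.
    utf16: Array.from({ length: segment.length }, (_, i) =>
      hex(segment.charCodeAt(i), 4)
    ),
  }));
}

for (const row of breakdown("\u00E9 \u{1F1E8}\u{1F1E6}")) {
  console.log(
    row.codePoints.join(" "), "|", row.utf8.join(" "), "|", row.utf16.join(" ")
  );
}
```

Encoding per cluster rather than per string is what lets each row show how many code points, bytes, and code units a single visible character costs.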

normalization · NFC vs NFD

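The same visual character can be encoded as one precomposed code point (NFC) or as a base letter plus a combining mark (NFD). The two forms render identically, but their code point counts and byte lengths differ, and they do not compare equal. Some filesystems normalize file names (HFS+ stores a decomposed form); most filesystems and most APIs do not.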

NFC · 33 bytes

café 🇨🇦 👨‍👩‍👧

precomposed — fewer code points, fewer bytes

NFD · 34 bytes

café 🇨🇦 👨‍👩‍👧

decomposed — base + combining marks

// forms differ · NFD is 1 byte longer: é (C3 A9) becomes e + U+0301 (65 CC 81)
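
The difference between the two forms is easy to check with String.prototype.normalize. The sketch below uses only the word "café"; byteLength and sameText are hypothetical helper names, and comparing after NFC normalization is one common approach rather than the only one.

```javascript
const encoder = new TextEncoder();
const byteLength = (s) => encoder.encode(s).length; // hypothetical helper

const nfc = "caf\u00E9".normalize("NFC"); // é as one code point, U+00E9
const nfd = "caf\u00E9".normalize("NFD"); // e followed by U+0301 combining acute

console.log(nfc === nfd);                      // false
console.log([...nfc].length, [...nfd].length); // 4 5
console.log(byteLength(nfc), byteLength(nfd)); // 5 6

// Hypothetical helper: compare text after bringing both sides to one form.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(sameText(nfc, nfd)); // true
```

Normalizing both sides before comparing, hashing, or using a string as a key avoids treating two renderings of the same text as different values.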

Graphemes are counted with Intl.Segmenter (granularity: "grapheme"), UTF-8 bytes with TextEncoder, and UTF-16 units are read directly from the JS string. NFC composes base + combining-mark sequences into precomposed characters where they exist; NFD decomposes them. Most UI work wants graphemes; most network and storage work wants UTF-8 bytes; almost nothing wants .length.
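
One place where both units matter at once is fitting text into a storage field with a byte limit: the limit is in UTF-8 bytes, but the cut should fall between grapheme clusters. The following is an illustrative sketch only; truncateToBytes is a hypothetical name and is not part of the tool.

```javascript
// Hypothetical helper: keeps whole grapheme clusters while staying within
// maxBytes of UTF-8 output.
function truncateToBytes(text, maxBytes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const encoder = new TextEncoder();
  let out = "";
  let used = 0;
  for (const { segment } of segmenter.segment(text)) {
    const size = encoder.encode(segment).length;
    if (used + size > maxBytes) break; // the next cluster would not fit
    out += segment;
    used += size;
  }
  return out;
}

// "café " is 6 bytes; the flag adds 8 more, for 14 in total.
const flagSample = "caf\u00E9 \u{1F1E8}\u{1F1E6}";

console.log(truncateToBytes(flagSample, 10) === "caf\u00E9 "); // true: flag dropped whole
console.log(truncateToBytes(flagSample, 14) === flagSample);   // true: everything fits
```

Cutting by .length or by raw bytes instead could leave half a surrogate pair or half a flag at the end of the stored value.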

thesis

Type any string and see four lengths side by side: JS .length, code points, grapheme clusters, and UTF-8 bytes. A per-cluster breakdown shows the code points and byte sequences, and an NFC vs NFD diff covers combining characters. Together they show why an emoji and an accented Latin letter each look like one character but can be counted four different ways.
