thomas.caron
04·live·2026.05

Unicode microscope

Why string.length lies: graphemes vs code points vs UTF-8 bytes.
stack: Intl.Segmenter · TextEncoder · NFC/NFD
kind: decoder
status: live

Type a string. See JS .length, code points, grapheme clusters, and UTF-8 bytes side-by-side. Outside plain ASCII, the four numbers rarely agree. Emoji ZWJ sequences, regional-indicator flags, and combining marks each break a different abstraction; the breakdown shows where.

graphemes: 8 (what the user sees as one character)
code points: 13 (Unicode scalar values)
UTF-16 units: 18 (JS .length; each surrogate pair counts as two)
UTF-8 bytes: 33 (wire / disk / database length)
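
A minimal sketch of how the four counts above could be computed, assuming a runtime that ships Intl.Segmenter and TextEncoder (current browsers, recent Node). The helper name measureString is hypothetical and not taken from the page's source.

```javascript
// Hypothetical helper: returns the four lengths shown in the cards above.
function measureString(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    graphemes: [...segmenter.segment(text)].length, // user-perceived characters
    codePoints: [...text].length, // the string iterator walks code points
    utf16Units: text.length, // what JS .length reports
    utf8Bytes: new TextEncoder().encode(text).length,
  };
}

// "café", a space, the CA flag, a space, and a three-person family emoji,
// written with escapes so the exact code points are unambiguous.
const sample =
  "caf\u00E9 \u{1F1E8}\u{1F1E6} \u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(measureString(sample));
// { graphemes: 8, codePoints: 13, utf16Units: 18, utf8Bytes: 33 }
```

The grapheme count depends on the Unicode data bundled with the runtime, so very old engines may split newer emoji sequences differently.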
breakdown · one row per grapheme cluster
# | cluster | code points | utf-8 | utf-16
00 | c | U+0063 | 63 | 0063
01 | a | U+0061 | 61 | 0061
02 | f | U+0066 | 66 | 0066
03 | é | U+00E9 | C3 A9 | 00E9
04 | (space) | U+0020 | 20 | 0020
05 | 🇨🇦 | U+1F1E8 U+1F1E6 | F0 9F 87 A8 F0 9F 87 A6 | D83C DDE8 D83C DDE6
06 | (space) | U+0020 | 20 | 0020
07 | 👨‍👩‍👧 | U+1F468 U+200D U+1F469 U+200D U+1F467 | F0 9F 91 A8 E2 80 8D F0 9F 91 A9 E2 80 8D F0 9F 91 A7 | D83D DC68 200D D83D DC69 200D D83D DC67
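
One way a per-cluster row like those above could be produced, as a sketch rather than the page's actual code. The names hex and breakdown are hypothetical; the only platform APIs used are Intl.Segmenter, TextEncoder, codePointAt and charCodeAt.

```javascript
// Hypothetical helper: uppercase hex, left-padded to a minimum width.
const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");

// Hypothetical helper: one entry per grapheme cluster, with its code points,
// UTF-8 bytes and UTF-16 code units as hex strings.
function breakdown(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const encoder = new TextEncoder();
  return [...segmenter.segment(text)].map(({ segment }) => ({
    cluster: segment,
    codePoints: [...segment].map((ch) => "U+" + hex(ch.codePointAt(0), 4)),
    // Spread into a plain array first: Uint8Array.map cannot hold strings.
    utf8: [...encoder.encode(segment)].map((byte) => hex(byte, 2)),
    utf16: Array.from({ length: segment.length }, (_, i) =>
      hex(segment.charCodeAt(i), 4)
    ),
  }));
}

const [flag] = breakdown("\u{1F1E8}\u{1F1E6}");
console.log(flag.codePoints.join(" ")); // U+1F1E8 U+1F1E6
console.log(flag.utf8.join(" "));       // F0 9F 87 A8 F0 9F 87 A6
console.log(flag.utf16.join(" "));      // D83C DDE8 D83C DDE6
```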
normalization · NFC vs NFD

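The same visual character can be encoded as one precomposed code point (NFC) or as a base letter plus a combining mark (NFD). The two render identically, but their code point counts and byte lengths differ, and a plain === comparison treats them as different strings. Some filesystems normalize names (HFS+ stores a decomposed form); most others, and most APIs, keep whatever bytes they were given.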

NFC · 33 bytes
café 🇨🇦 👨‍👩‍👧
precomposed — fewer code points, fewer bytes
NFD · 34 bytes
café 🇨🇦 👨‍👩‍👧
decomposed — base + combining marks
// forms differ · NFD is 1 byte longer: é (C3 A9) becomes e + U+0301 (65 CC 81)
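
The difference can be reproduced with String.prototype.normalize. This is a small sketch on "café" alone; utf8Length is a hypothetical helper name.

```javascript
// Hypothetical helper: UTF-8 byte length of a string.
const utf8Length = (s) => new TextEncoder().encode(s).length;

const nfc = "caf\u00E9".normalize("NFC"); // é as one code point, U+00E9
const nfd = nfc.normalize("NFD");         // e followed by U+0301

console.log(nfc === nfd);                      // false
console.log(nfc === nfd.normalize("NFC"));     // true
console.log([...nfc].length, [...nfd].length); // 4 5
console.log(utf8Length(nfc), utf8Length(nfd)); // 5 6
```

Normalizing both sides to the same form before comparing or hashing is the usual way to make visually identical strings compare equal.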

Graphemes are counted via Intl.Segmenter (granularity: "grapheme"). UTF-8 bytes come from TextEncoder; UTF-16 units are read directly from the JS string. NFC composes base + combining-mark sequences into precomposed characters where one exists; NFD decomposes them. Most UI work wants graphemes; most network/storage work wants UTF-8 bytes; almost nothing wants .length.
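
The two units often meet in practice: a storage limit is expressed in UTF-8 bytes, but the cut should not land in the middle of what the user sees as one character. A hedged sketch of that combination follows; truncateToBytes is a hypothetical name, not something this page ships.

```javascript
// Hypothetical helper: keep whole grapheme clusters while the running
// UTF-8 size stays within maxBytes; stop at the first cluster that overflows.
function truncateToBytes(text, maxBytes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const encoder = new TextEncoder();
  let result = "";
  let used = 0;
  for (const { segment } of segmenter.segment(text)) {
    const size = encoder.encode(segment).length;
    if (used + size > maxBytes) break;
    result += segment;
    used += size;
  }
  return result;
}

const label = "caf\u00E9 \u{1F1E8}\u{1F1E6}"; // 6 bytes of text + an 8-byte flag

console.log(truncateToBytes(label, 10) === "caf\u00E9 "); // true: the flag does not fit
console.log(truncateToBytes(label, 14) === label);        // true: everything fits
console.log(truncateToBytes(label, 4));                   // caf
```

Slicing by .length or by raw bytes instead could split the flag's surrogate pairs or leave half of a multi-byte sequence behind.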

thesis

Type any string and see four lengths side-by-side: JS .length, code points, grapheme clusters, UTF-8 bytes. Each grapheme cluster gets a breakdown of its code points and byte sequences, and an NFC vs NFD diff shows the effect of combining characters. Together they show why an emoji and an accented Latin letter each look like one character yet can have four different lengths.

thomas.caron — software developer
montréal, qc · UTC−04:00
built with vite + cloudflare workers
© 2026 · last deploy: 2026.05.07