Unicode microscope
Type a string. See JS .length, code points, grapheme clusters, and UTF-8 bytes side-by-side. The four numbers rarely agree. Emoji ZWJ sequences, regional-indicator flags, and combining marks each break a different abstraction; the breakdown shows where.
| # | cluster | codepoints | utf-8 | utf-16 |
|---|---|---|---|---|
| 00 | c | U+0063 | 63 | 0063 |
| 01 | a | U+0061 | 61 | 0061 |
| 02 | f | U+0066 | 66 | 0066 |
| 03 | é | U+00E9 | C3 A9 | 00E9 |
| 04 | (space) | U+0020 | 20 | 0020 |
| 05 | 🇨🇦 | U+1F1E8 U+1F1E6 | F0 9F 87 A8 F0 9F 87 A6 | D83C DDE8 D83C DDE6 |
| 06 | (space) | U+0020 | 20 | 0020 |
| 07 | 👨👩👧 | U+1F468 U+200D U+1F469 U+200D U+1F467 | F0 9F 91 A8 E2 80 8D F0 9F 91 A9 E2 80 8D F0 9F 91 A7 | D83D DC68 200D D83D DC69 200D D83D DC67 |
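
A per-cluster breakdown like the table above could be produced roughly as follows. This is a minimal sketch, not the page's actual source: the names `hex` and `breakdown` are hypothetical, and it assumes a runtime with `Intl.Segmenter` and `TextEncoder` (recent browsers or Node.js).

```javascript
// Hypothetical helper: format a number as zero-padded uppercase hex.
function hex(n, width) {
  return n.toString(16).toUpperCase().padStart(width, "0");
}

// Hypothetical helper: one row per grapheme cluster, mirroring the table columns.
function breakdown(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const encoder = new TextEncoder();
  return [...segmenter.segment(text)].map(({ segment }, index) => ({
    index,
    cluster: segment,
    // Spreading a string iterates by code point, not by UTF-16 code unit.
    codePoints: [...segment].map((ch) => "U+" + hex(ch.codePointAt(0), 4)),
    // TextEncoder always encodes to UTF-8.
    utf8: [...encoder.encode(segment)].map((byte) => hex(byte, 2)),
    // charCodeAt reads the raw UTF-16 code units, surrogates included.
    utf16: Array.from({ length: segment.length }, (_, i) =>
      hex(segment.charCodeAt(i), 4)
    ),
  }));
}

// The flag from row 05, written with escapes so the code points are unambiguous.
for (const row of breakdown("\u{1F1E8}\u{1F1E6}")) {
  console.log(row.codePoints.join(" "), "|", row.utf8.join(" "), "|", row.utf16.join(" "));
}
```

Writing the sample with `\u{...}` escapes rather than a pasted emoji keeps invisible characters such as U+200D from being lost in copy and paste.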
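The same visual character can be encoded as one precomposed code point (its NFC form) or as a base letter plus a combining mark (its NFD form). The two render identically, but their byte lengths differ: é is 2 UTF-8 bytes as U+00E9 and 3 bytes as U+0065 U+0301. Some filesystems normalize file names; most APIs do not.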
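
The difference between the two forms can be checked with the standard `String.prototype.normalize` method. The snippet below is a small illustration; the variable names and the `utf8Length` helper are hypothetical.

```javascript
const composed = "\u00E9";    // é as one precomposed code point
const decomposed = "e\u0301"; // e followed by U+0301 COMBINING ACUTE ACCENT

// Hypothetical helper: number of bytes in the UTF-8 encoding of a string.
function utf8Length(text) {
  return new TextEncoder().encode(text).length;
}

console.log(composed === decomposed);                  // false: different code points
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true
console.log(utf8Length(composed));                     // 2
console.log(utf8Length(decomposed));                   // 3
```

Normalizing both sides to the same form before comparing avoids treating visually identical strings as different.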
Graphemes are counted via Intl.Segmenter (granularity: grapheme). UTF-8 bytes come from TextEncoder; UTF-16 code units are read directly from the JS string. NFC composes characters into their precomposed form where one exists; NFD decomposes them into a base plus combining marks. Most UI work wants graphemes; most network/storage work wants UTF-8 bytes; almost nothing wants .length.
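
The four counts can be computed in a few lines with the APIs named above. This is a sketch under the same assumptions (a runtime with `Intl.Segmenter` and `TextEncoder`); the function name `measure` and its property names are hypothetical, not taken from the page.

```javascript
// Hypothetical helper: the four lengths of one string.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    utf16Units: text.length,                           // what .length reports
    codePoints: [...text].length,                      // string iteration walks code points
    graphemes: [...segmenter.segment(text)].length,    // user-perceived characters
    utf8Bytes: new TextEncoder().encode(text).length,  // size on the wire or on disk
  };
}

// The sample string from the table, written with escapes so the ZWJs survive.
const sample = "caf\u00E9 \u{1F1E8}\u{1F1E6} \u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(measure(sample));
// { utf16Units: 18, codePoints: 13, graphemes: 8, utf8Bytes: 33 }
```

For this sample none of the four numbers agree, and the grapheme count of 8 matches the eight rows of the table.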
Type any string and see four lengths side-by-side: JS .length, code points, grapheme clusters, and UTF-8 bytes. The page gives a per-cluster breakdown of code points and byte sequences, plus an NFC vs NFD diff for combining characters. It shows why an emoji and an accented Latin letter can each look like one character yet count as more than one, depending on the measure.