JavaScript: understanding string normalize

Recently, when I was working on enhancing the search of Bender, I wanted to find words containing characters like "ö" or "é" by typing plain characters like "o" and "e". While looking into handling this case, I came across String.prototype.normalize.

String.prototype.normalize supports 4 Unicode normalization forms:

  • NFC : canonical composed
  • NFD : canonical decomposed
  • NFKC : compatibility composed
  • NFKD : compatibility decomposed

Which form we should use depends on our use case. If we omit the argument, NFC is used as the default normalization form.
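As a quick check of the default, here is a minimal sketch (the variable names are only illustrative) showing that calling normalize() without an argument gives the same result as normalize("NFC"), and that a form name outside the four listed above throws a RangeError:

```javascript
// "o" followed by a combining diaeresis.
const input = "\u006F\u0308";

// With no argument, normalize() uses NFC.
const byDefault = input.normalize();
const explicitNfc = input.normalize("NFC");
console.log(byDefault === explicitNfc); // true

// Any form name other than NFC, NFD, NFKC or NFKD throws a RangeError.
let errorName = null;
try {
  input.normalize("NFX");
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // RangeError
```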

Before getting into each normalization form, let us see what the Unicode composed and decomposed forms are.

Composed & decomposed forms

Some Unicode characters can be represented in both composed and decomposed forms.
For example, "ö" can be represented in composed form by the single code point \u00F6, and in decomposed form by \u006F (o) followed by \u0308 (Combining Diaeresis).

const composed = "\u00F6";
const decomposed = "\u006F\u0308";

console.log(composed); // ö
console.log(composed.length); // 1

console.log(decomposed); // ö
console.log(decomposed.length); // 2
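Since both forms look identical on screen, it can help to print the underlying code points. The following is a small sketch; toCodePoints is a hypothetical helper written for this post, not a built-in:

```javascript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
const toCodePoints = (str) =>
  Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");

console.log(toCodePoints("\u00F6")); // U+00F6
console.log(toCodePoints("\u006F\u0308")); // U+006F U+0308

// The two forms render the same but are not equal as strings.
console.log("\u00F6" === "\u006F\u0308"); // false
```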

In order to show the difference, I will use "Köln ﬀ" as the sample string, where "ﬀ" is the single ligature character U+FB00 and not two separate "f" characters, and apply all 4 normalization forms to it.

NFC

  • default normalization form
  • uses canonical equivalence
  • uses composed form (single code point)
  • won’t change the visual appearance
  • won’t match compatibility characters (for example, searching for "f" won’t find the ﬀ ligature)

const str = 'Köln \uFB00'; // \uFB00 is the ﬀ ligature
const normalizedStr = str.normalize('NFC');

console.log(str); // Köln ﬀ
console.log(str.length); // 6

console.log(normalizedStr); // Köln ﬀ
console.log(normalizedStr.length); // 6

console.log(normalizedStr.includes('o')); // false
console.log(normalizedStr.indexOf('o')); // -1

console.log(normalizedStr.includes('f')); // false
console.log(normalizedStr.indexOf('f')); // -1
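The sample string above already stores "ö" as a single code point, so NFC leaves it unchanged. As a small sketch of the composing step itself (variable names are illustrative), here is what happens when the input arrives in decomposed form:

```javascript
// "Köln" written with "o" + combining diaeresis (U+0308).
const decomposedInput = 'Ko\u0308ln';
const nfc = decomposedInput.normalize('NFC');

console.log(decomposedInput.length); // 5
console.log(nfc.length); // 4
console.log(nfc === 'K\u00F6ln'); // true

// Normalizing an already normalized string changes nothing.
console.log(nfc.normalize('NFC') === nfc); // true
```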

NFD

  • uses canonical equivalence
  • uses decomposed form (multiple code points)
  • won’t change the visual appearance
  • won’t match compatibility characters (for example, searching for "f" won’t find the ﬀ ligature)

const str = 'Köln \uFB00'; // \uFB00 is the ﬀ ligature
const normalizedStr = str.normalize('NFD');

console.log(str); // Köln ﬀ
console.log(str.length); // 6

console.log(normalizedStr); // Köln ﬀ
console.log(normalizedStr.length); // 7 (only ö gets decomposed, into \u006F and \u0308)

console.log(normalizedStr.includes('o')); // true
console.log(normalizedStr.indexOf('o')); // 1

console.log(normalizedStr.includes('f')); // false
console.log(normalizedStr.indexOf('f')); // -1
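To make the decomposition a bit more concrete, this short sketch (names are illustrative) looks at where the base letter and the combining mark end up, and checks that NFC puts them back together:

```javascript
const nfd = 'K\u00F6ln'.normalize('NFD');

// The base letter comes first, the combining mark right after it.
console.log(nfd[1]); // o
console.log(nfd[2] === '\u0308'); // true

// Canonical decomposition is reversible: NFC recomposes the pair.
console.log(nfd.normalize('NFC') === 'K\u00F6ln'); // true
```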

NFKC

  • uses compatibility equivalence
  • uses composed form (single code point)
  • may change the visual appearance
  • matches compatibility characters (for example, searching for "f" finds the converted ﬀ ligature)

const str = 'Köln \uFB00'; // \uFB00 is the ﬀ ligature
const normalizedStr = str.normalize('NFKC');

console.log(str); // Köln ﬀ
console.log(str.length); // 6

console.log(normalizedStr); // Köln ff (the ﬀ ligature got converted into two f characters)
console.log(normalizedStr.length); // 7 (ö stays in composed form, but ﬀ became ff)

console.log(normalizedStr.includes('o')); // false
console.log(normalizedStr.indexOf('o')); // -1

console.log(normalizedStr.includes('f')); // true
console.log(normalizedStr.indexOf('f')); // 5
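The ligature is only one kind of compatibility character. As a rough sketch, here are a few others that NFKC maps to plain characters while NFC leaves them alone (the sample set and the names compatSamples and nfkcResults are my own choices for illustration):

```javascript
const compatSamples = {
  ligature: '\uFB00', // ﬀ
  circledOne: '\u2460', // ①
  superscriptTwo: '\u00B2', // ²
  fullwidthA: '\uFF21', // Ａ
};

const nfkcResults = {};
for (const [name, ch] of Object.entries(compatSamples)) {
  nfkcResults[name] = ch.normalize('NFKC');
  // Second column: NFC leaves the character untouched. Third: NFKC result.
  console.log(name, ch.normalize('NFC') === ch, nfkcResults[name]);
}
// ligature true ff
// circledOne true 1
// superscriptTwo true 2
// fullwidthA true A
```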

NFKD

  • uses compatibility equivalence
  • uses decomposed form (multiple code points)
  • may change the visual appearance
  • matches compatibility characters (for example, searching for "f" finds the converted ﬀ ligature)

const str = 'Köln \uFB00'; // \uFB00 is the ﬀ ligature
const normalizedStr = str.normalize('NFKD');

console.log(str); // Köln ﬀ
console.log(str.length); // 6

console.log(normalizedStr); // Köln ff (the ﬀ ligature got converted into two f characters)
console.log(normalizedStr.length); // 8 (ö is decomposed and ﬀ became ff)

console.log(normalizedStr.includes('o')); // true
console.log(normalizedStr.indexOf('o')); // 1

console.log(normalizedStr.includes('f')); // true
console.log(normalizedStr.indexOf('f')); // 6 (one later than with NFKC, because ö now takes two code points)
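To see the four forms side by side, this short sketch collects the length of the same sample string under each form (the names sample and lengths are illustrative):

```javascript
// "Köln ﬀ" with a composed ö (U+00F6) and the ﬀ ligature (U+FB00).
const sample = 'K\u00F6ln \uFB00';

const lengths = {};
for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
  lengths[form] = sample.normalize(form).length;
}

console.log(JSON.stringify(lengths)); // {"NFC":6,"NFD":7,"NFKC":7,"NFKD":8}
```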

Conclusion

Now that we understand all the normalization forms, we can make an informed decision on which one to use for our use case.

If we want to compare strings or compare the lengths of strings, we can normalize both strings to the same form (NFC or NFD) before comparing.
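A minimal sketch of such a comparison, assuming NFC as the common form (sameText is a hypothetical helper name):

```javascript
// Hypothetical helper: compare two strings after bringing both to NFC.
const sameText = (a, b) => a.normalize('NFC') === b.normalize('NFC');

console.log('\u00F6' === 'o\u0308'); // false
console.log(sameText('\u00F6', 'o\u0308')); // true

// Canonical forms do not unify compatibility characters such as the ﬀ ligature.
console.log(sameText('\uFB00', 'ff')); // false
```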

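If we want to search a string using both compatibility and canonically equivalent characters, the best option is NFKD normalization. Note that NFKD only separates "ö" into "o" plus a combining mark, so to match a whole word like "Koln" the combining marks also have to be removed or ignored. An NFKD normalized string may not be suitable for display, as it can visually change the characters.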
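One possible way to build such a search is sketched below. It assumes that stripping combining marks and lowercasing is acceptable for the search in question. The helpers toSearchKey and matches are hypothetical names, and this is not necessarily how Bender implements it:

```javascript
// Hypothetical helper: reduce text to a plain search key.
const toSearchKey = (text) =>
  text
    .normalize('NFKD') // split letters from their marks, expand ligatures
    .replace(/\p{M}/gu, '') // drop the combining marks
    .toLowerCase();

// Hypothetical helper: accent-insensitive, case-insensitive substring search.
const matches = (haystack, needle) =>
  toSearchKey(haystack).includes(toSearchKey(needle));

console.log(toSearchKey('K\u00F6ln \uFB00')); // koln ff
console.log(matches('K\u00F6ln \uFB00', 'koln')); // true
console.log(matches('K\u00F6ln \uFB00', 'ff')); // true
console.log(matches('Caf\u00E9', 'cafe')); // true
```

The key is only meant for matching. The original string should still be the one that is displayed.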

Hope this is helpful.
If I have made a mistake or missed something, please feel free to email me.
