Proper String Normalization for Comparison Purposes

TL;DR

In Java, do:

String normalizedString = Normalizer.normalize(originalString, Normalizer.Form.NFKD)
        .replaceAll("[^\\p{ASCII}]", "").toLowerCase().replaceAll("\\s{2,}", " ").trim();
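
To try the one-liner end to end, it can be wrapped in a small helper. This is a minimal sketch, not a definitive implementation: the class name `ComparisonKeys` and the method name `normalize` are hypothetical, and `toLowerCase(Locale.ROOT)` is an assumption added here so that the result does not depend on the JVM's default locale (the one-liner above uses the no-argument `toLowerCase()`). The source is kept ASCII-only, so accented characters appear as `\u` escapes.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; the name is illustrative, not from the article.
final class ComparisonKeys {

    private ComparisonKeys() {}

    /** Returns an ASCII-only, lowercase, whitespace-collapsed form of the input. */
    static String normalize(String original) {
        return Normalizer.normalize(original, Normalizer.Form.NFKD) // separate base letters from accents, expand compatibility characters
                .replaceAll("[^\\p{ASCII}]", "")  // drop everything outside ASCII, including the detached accents
                .toLowerCase(Locale.ROOT)         // assumption: fixed locale instead of the JVM default
                .replaceAll("\\s{2,}", " ")       // collapse runs of two or more whitespace characters
                .trim();                          // remove leading and trailing whitespace
    }

    public static void main(String[] args) {
        // Accented, upper-case, irregularly spaced input versus its plain form.
        String accented = "  Cr\u00e8me   BR\u00dbL\u00c9E ";
        String plain = "creme brulee";
        System.out.println(normalize(accented).equals(normalize(plain))); // prints "true"
    }
}
```

One consequence worth keeping in mind: a character that has no Unicode decomposition, such as æ, is not transliterated by this chain. It is simply removed by the ASCII filter.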


Nowadays, most strings are Unicode, and we can work with many different native characters, including letters with diacritical marks/accents (like ö, é, À) and ligatures (like æ or ʥ). Characters can be stored in an encoding such as UTF-8, and the associated glyphs are displayed properly if the font supports them.