Swedish to not-so-Swedish through diacritics removal

Posted on Fri 06 April 2012 in Coding

According to Merriam-Webster, a diacritic is:

“a mark near or through an orthographic or phonetic character or combination of characters indicating a phonetic value different from that given the unmarked or otherwise marked element”

Right! :-) In the Swedish language, there are three common letters with diacritics, namely å, ä and ö. The diacritics here are ˚ (ring above) and ¨ (diaeresis). In a particular situation, I wanted to do a one-way conversion of a name containing Swedish letters with diacritics into a name with only non-diacritical letters. Thus, I wanted to convert å and ä into a and ö into o.

A naive solution to the problem is obviously to create a small mapping table with entries for the relevant lowercase and uppercase letters. But how fun is that? Wouldn’t it be better to have a generic solution that doesn’t require us to maintain a mapping table? Well, of course it would! Let me show you how!
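For reference, the naive mapping-table version might look something like this in Python (a sketch; the table only covers the six Swedish letters, and the function name is mine):

```python
# Naive approach: an explicit mapping table that must list every
# diacritical letter we care about, lowercase and uppercase alike.
TABLE = str.maketrans('åäöÅÄÖ', 'aaoAAO')

def strip_swedish(s):
    return s.translate(TABLE)

print(strip_swedish('fåll fälla föll'))  # fall falla foll
```

The obvious drawback: add another language, or forget a letter, and the table silently stops covering your input.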

The key to solving the problem lies in Unicode equivalence, which in short means that two Unicode strings can be equivalent even if they contain different code points, as long as the code point sequences have the same meaning. A character represented by a single code point can be decomposed into a sequence of code points, and vice versa - such a sequence can be composed back into a single code point. For example, å can be decomposed into a and ˚, as we saw before. This is achieved by normalizing a string according to Normalization Form D (NFD), which splits precomposed characters into base characters followed by combining marks. Let’s see how that is done in a few different languages.
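To make the decomposition concrete, here is a quick check (Python 3 in this snippet, purely for illustration):

```python
import unicodedata

# NFD splits the precomposed å (U+00E5) into the base letter a (U+0061)
# followed by the combining ring above (U+030A).
decomposed = unicodedata.normalize('NFD', '\u00e5')
print([hex(ord(ch)) for ch in decomposed])  # ['0x61', '0x30a']
```

One visible character, two code points after normalization - which is exactly what makes the generic filtering approach possible.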

In the examples below, I use the string “fåll fälla föll” (the words mean hem as in skirt hem, fell as in fell a tree and fell as in I fell down), which I expect to be converted to “fall falla foll”.

First up, Java:

public static void main(String[] args) {
    String str = "fåll fälla föll";

    // import java.text.Normalizer and java.text.Normalizer.Form
    String nstr = Normalizer.normalize(str, Form.NFD);

    // Filter out combining marks in place: non-marks are compacted
    // toward the front of the array, marks are overwritten.
    char[] chars = nstr.toCharArray();
    int j = 0;
    for (char ch : chars) {
        chars[j] = ch;
        j += isMark(ch) ? 0 : 1;
    }
    nstr = new String(chars, 0, j);

    System.out.println(str);
    System.out.println(nstr);
}

private static boolean isMark(char ch) {
    int gc = Character.getType(ch);

    return gc == Character.NON_SPACING_MARK
        || gc == Character.ENCLOSING_MARK
        || gc == Character.COMBINING_SPACING_MARK;
}

Pretty verbose, unfortunately, but much of the code is just there to filter out diacritics. The interesting parts are the call to Normalizer.normalize, which performs the code point decomposition, and the mark-category helper, which identifies diacritics. Note that the Swedish diacritics fall into the NON_SPACING_MARK/Mn category, but I included all the Mark categories for good measure!
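As a quick sanity check on that claim (using Python here for brevity), both Swedish combining marks do indeed report the Mn category:

```python
import unicodedata

# U+030A (combining ring above) and U+0308 (combining diaeresis)
# both belong to the Mn (nonspacing mark) general category.
print(unicodedata.category('\u030a'))  # Mn
print(unicodedata.category('\u0308'))  # Mn
```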

Alright, how about C#?

static void Main(string[] args) {
    var str = "fåll fälla föll";

    var nstr = str.Normalize(NormalizationForm.FormD);
    // using System.Linq
    nstr = new string(nstr.Where(ch => !IsMark(ch)).ToArray());

    Console.WriteLine(str);
    Console.WriteLine(nstr);
}

private static bool IsMark(char ch) {
    // using System.Globalization
    var gc = Char.GetUnicodeCategory(ch);

    return gc == UnicodeCategory.NonSpacingMark 
        || gc == UnicodeCategory.EnclosingMark
        || gc == UnicodeCategory.SpacingCombiningMark;
}

Diacritics filtering is much more concise thanks to LINQ, but decomposition and diacritic identification are virtually identical to how they’re done in Java.

Finally, let’s look at Python:

import unicodedata

s = u'fåll fälla föll'   # avoid shadowing the built-in str
nstr = unicodedata.normalize('NFD', s)

nstr = ''.join([ch for ch in nstr if not unicodedata.category(ch).startswith('M')])

print s
print nstr

Python wins the conciseness competition! Decomposition is similar to how it’s done in Java and C#, but since we get the category as a string rather than a numeric constant, we can filter out all Mark categories in one go. And with list comprehension on top of that, diacritics filtering becomes very succinct!

Not too bad, regardless of language! For each of the languages, the corresponding framework contains what we need, and it’s only a matter of putting together the pieces.

I’ll finish with a small rant: in 2012, you’d expect most web services to handle words that include diacritics without a problem, right? Well, Amazon thinks that my credit card reads ROVEGÃRD instead of ROVEGÅRD, and it insists that my street address begins with ÃÂstra rather than Östra. Yeah, I realize this is an encoding problem rather than a problem with diacritics per se, but still…