Optimizing Unicode's grapheme cluster break algorithm

With grapheme cluster break detection algorithm in Unicode you can detect whether a caret can be placed at given offset so that it won't break text structure. For example in a text editor you don't want caret to be placed between CR LF as an eventual edit there might break line end detection. Similar situation happens when diacritics is stored in decomposed form (letter followed by one or more accents).

To do that you need to grab grapheme break property for the two codepoints left and right from position in question and apply a bunch of rules to tell whether the position is a break or not.

I wanted to dig in and take a look if I make the API a bit better as well as maybe give it some performance boost, because this function is used a lot in text processing.

I measured all functions with two data sets:

Complex - Set of codepoints available in official GraphemeBreakTest.txt.
Simple - A C source file that only uses ASCII characters. Which is a typical use case for a code editor.

I used utf8proc as a reference for the initial implementation of the detection. It has a function utf8proc_grapheme_break that takes two codepoints, does the property lookup for you and with a bunch of ternary operators tells you the result:

    /* return whether there is a grapheme break between boundclasses lbc and tbc */
    static utf8proc_bool grapheme_break(int lbc, int tbc) {
      return
        (lbc == UTF8PROC_BOUNDCLASS_START) ? true :
        (lbc == UTF8PROC_BOUNDCLASS_CR &&
         tbc == UTF8PROC_BOUNDCLASS_LF) ? false :
        (lbc >= UTF8PROC_BOUNDCLASS_CR && lbc <= UTF8PROC_BOUNDCLASS_CONTROL) ? true :
        (tbc >= UTF8PROC_BOUNDCLASS_CR && tbc <= UTF8PROC_BOUNDCLASS_CONTROL) ? true :
        (tbc == UTF8PROC_BOUNDCLASS_EXTEND) ? false :
        (lbc == UTF8PROC_BOUNDCLASS_L &&
         (tbc == UTF8PROC_BOUNDCLASS_L ||
          tbc == UTF8PROC_BOUNDCLASS_V ||
          tbc == UTF8PROC_BOUNDCLASS_LV ||
          tbc == UTF8PROC_BOUNDCLASS_LVT)) ? false :
        ((lbc == UTF8PROC_BOUNDCLASS_LV ||
          lbc == UTF8PROC_BOUNDCLASS_V) &&
         (tbc == UTF8PROC_BOUNDCLASS_V ||
          tbc == UTF8PROC_BOUNDCLASS_T)) ? false :
        ((lbc == UTF8PROC_BOUNDCLASS_LVT ||
          lbc == UTF8PROC_BOUNDCLASS_T) &&
         tbc == UTF8PROC_BOUNDCLASS_T) ? false :
        (lbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR &&
         tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR) ? false :
        (tbc != UTF8PROC_BOUNDCLASS_SPACINGMARK);
    }

However, rather than a function that takes two codepoints (or properties), it would be more useful to have a function that takes one property and keeps state. I ended up having two very rough functions with switch/case. One returns 1 if there’s a grapheme break before the codepoint being processed, and the other one advances the state.

    static inline int
    switch_state_for(uint32_t boundary_class)
    {
        switch (boundary_class)
        {
        case UnicodeBoundaryClass_LF:
        case UnicodeBoundaryClass_CONTROL:
        case UnicodeBoundaryClass_START:
            return State_Out;

        // case UnicodeBoundaryClass_EXTEND:
        // case UnicodeBoundaryClass_OTHER:
        //     return State_Other;

        // CR+LF
        case UnicodeBoundaryClass_CR:
            return State_CR;

        // Hangul

        case UnicodeBoundaryClass_L:
            return State_Hangul_GB6;

        case UnicodeBoundaryClass_V:
            return State_Hangul_GB7;
        case UnicodeBoundaryClass_LV:
            return State_Hangul_GB7;

        case UnicodeBoundaryClass_T:
            return State_Hangul_GB8;
        case UnicodeBoundaryClass_LVT:
            return State_Hangul_GB8;

        case UnicodeBoundaryClass_REGIONAL_INDICATOR:
            return State_RegionalIndicator_GB8a;

        // GB9b - No characters in Prepend category in Unicode 8.0.
        }
        return State_Other;
    }

    inline int
    switch_break_for(uint32_t boundary_class, int state)
    {
        switch (state)
        {
        case State_Out:
            return 1;

        case State_CR:
            switch (boundary_class)
            {
            case UnicodeBoundaryClass_LF:
                return 0;

            default:
                return 1;
            }
            break;

        // GB6  L   ×   ( L | V | LV | LVT )
        case State_Hangul_GB6:
            switch (boundary_class)
            {
            case UnicodeBoundaryClass_L:
            case UnicodeBoundaryClass_V:
            case UnicodeBoundaryClass_LV:
            case UnicodeBoundaryClass_LVT:
                return 0;
            }
            break;

        // GB7  ( LV | V )  ×   ( V | T )
        case State_Hangul_GB7:
            switch (boundary_class)
            {
            case UnicodeBoundaryClass_V:
            case UnicodeBoundaryClass_T:
                return 0;
            }
            break;

        // GB8  ( LVT | T)  ×   T
        case State_Hangul_GB8:
            switch (boundary_class)
            {
            case UnicodeBoundaryClass_T:
                return 0;
            }
            break;

        case State_RegionalIndicator_GB8a:
            switch (boundary_class)
            {
            case UnicodeBoundaryClass_REGIONAL_INDICATOR:
                return 0;
            }
            break;
        }

        switch (boundary_class)
        {
        // GB9 - Don't break before EXTEND
        case UnicodeBoundaryClass_EXTEND:
            return 0;
        // GB9a - Don't break before SPACING_MARK
        case UnicodeBoundaryClass_SPACING_MARK:
            return 0;
        }

        return 1;
    }

For the simple data set, this was 0.86 of utf8proc time.
For the complex data set, this was 1.70 of utf8proc time.

Next I sampled the switch/case in these functions to a table, and added two very simple branchless lookup functions. This helped the performance quite a lot.

    const uint8_t T[] = {
    //  state_for()
        0,   1,   2,   0,   0,   1,   3,   4,   5,   4,   5,   6,   1,
    //  break_for()
        1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
        1,   1,   1,   1,   1,   0,   1,   1,   1,   1,   1,   1,   0,
        1,   1,   1,   0,   1,   1,   1,   1,   1,   1,   1,   1,   1,
        1,   1,   1,   1,   1,   0,   0,   0,   1,   0,   0,   1,   0,
        1,   1,   1,   1,   1,   0,   1,   0,   0,   1,   1,   1,   0,
        1,   1,   1,   1,   1,   0,   1,   1,   0,   1,   1,   1,   0,
        1,   1,   1,   1,   1,   0,   1,   1,   1,   1,   1,   0,   0,
        };

    bool
    table_break_for(int b, int s) {
        return T[((s + 1) * 13) + b];
    }

    int
    table_state_for(int b) {
        return T[b];
    }

For the simple data set, this was 0.34 of utf8proc time.
For the complex data set, this was 0.46 of utf8proc time.

There were some further optimizations thanks to talks with Martin Wardener:

There are two identical states that were merged (already reflected in the codes above), that brought the number of states from 8 down to 7 (plus first row is used for state lookup).
Align to cache lines is not only a good step, but it would allow us to replace 13 multiplication with << 4. The final table still fitts in two cache lines.

So the final code looks like this: