With grapheme cluster break detection algorithm in Unicode you can detect whether a caret can be placed at given offset so that it won't break text structure. For example in a text editor you don't want caret to be placed between CR LF
as an eventual edit there might break line end detection. Similar situation happens when diacritics is stored in decomposed form (letter followed by one or more accents).
To do that you need to grab grapheme break property for the two codepoints left and right from position in question and apply a bunch of rules to tell whether the position is a break or not.
I wanted to dig in and take a look if I make the API a bit better as well as maybe give it some performance boost, because this function is used a lot in text processing.
I measured all functions with two data sets:
I used utf8proc as a reference for the initial implementation of the detection. It has a function utf8proc_grapheme_break that takes two codepoints, does the property lookup for you and with a bunch of ternary operators tells you the result:
/* return whether there is a grapheme break between boundclasses lbc and tbc */
static utf8proc_bool grapheme_break(int lbc, int tbc) {
return
(lbc == UTF8PROC_BOUNDCLASS_START) ? true :
(lbc == UTF8PROC_BOUNDCLASS_CR &&
tbc == UTF8PROC_BOUNDCLASS_LF) ? false :
(lbc >= UTF8PROC_BOUNDCLASS_CR && lbc <= UTF8PROC_BOUNDCLASS_CONTROL) ? true :
(tbc >= UTF8PROC_BOUNDCLASS_CR && tbc <= UTF8PROC_BOUNDCLASS_CONTROL) ? true :
(tbc == UTF8PROC_BOUNDCLASS_EXTEND) ? false :
(lbc == UTF8PROC_BOUNDCLASS_L &&
(tbc == UTF8PROC_BOUNDCLASS_L ||
tbc == UTF8PROC_BOUNDCLASS_V ||
tbc == UTF8PROC_BOUNDCLASS_LV ||
tbc == UTF8PROC_BOUNDCLASS_LVT)) ? false :
((lbc == UTF8PROC_BOUNDCLASS_LV ||
lbc == UTF8PROC_BOUNDCLASS_V) &&
(tbc == UTF8PROC_BOUNDCLASS_V ||
tbc == UTF8PROC_BOUNDCLASS_T)) ? false :
((lbc == UTF8PROC_BOUNDCLASS_LVT ||
lbc == UTF8PROC_BOUNDCLASS_T) &&
tbc == UTF8PROC_BOUNDCLASS_T) ? false :
(lbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR &&
tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR) ? false :
(tbc != UTF8PROC_BOUNDCLASS_SPACINGMARK);
}
However, rather than a function that takes two codepoints (or properties), it would be more useful to have a function that takes one property and keeps state. I ended up having two very rough functions with switch/case. One returns 1
if there’s a grapheme break before the codepoint being processed, and the other one advances the state.
static inline int
switch_state_for(uint32_t boundary_class)
{
switch (boundary_class)
{
case UnicodeBoundaryClass_LF:
case UnicodeBoundaryClass_CONTROL:
case UnicodeBoundaryClass_START:
return State_Out;
// case UnicodeBoundaryClass_EXTEND:
// case UnicodeBoundaryClass_OTHER:
// return State_Other;
// CR+LF
case UnicodeBoundaryClass_CR:
return State_CR;
// Hangul
case UnicodeBoundaryClass_L:
return State_Hangul_GB6;
case UnicodeBoundaryClass_V:
return State_Hangul_GB7;
case UnicodeBoundaryClass_LV:
return State_Hangul_GB7;
case UnicodeBoundaryClass_T:
return State_Hangul_GB8;
case UnicodeBoundaryClass_LVT:
return State_Hangul_GB8;
case UnicodeBoundaryClass_REGIONAL_INDICATOR:
return State_RegionalIndicator_GB8a;
// GB9b - No characters in Prepend category in Unicode 8.0.
}
return State_Other;
}
inline int
switch_break_for(uint32_t boundary_class, int state)
{
switch (state)
{
case State_Out:
return 1;
case State_CR:
switch (boundary_class)
{
case UnicodeBoundaryClass_LF:
return 0;
default:
return 1;
}
break;
// GB6 L × ( L | V | LV | LVT )
case State_Hangul_GB6:
switch (boundary_class)
{
case UnicodeBoundaryClass_L:
case UnicodeBoundaryClass_V:
case UnicodeBoundaryClass_LV:
case UnicodeBoundaryClass_LVT:
return 0;
}
break;
// GB7 ( LV | V ) × ( V | T )
case State_Hangul_GB7:
switch (boundary_class)
{
case UnicodeBoundaryClass_V:
case UnicodeBoundaryClass_T:
return 0;
}
break;
// GB8 ( LVT | T) × T
case State_Hangul_GB8:
switch (boundary_class)
{
case UnicodeBoundaryClass_T:
return 0;
}
break;
case State_RegionalIndicator_GB8a:
switch (boundary_class)
{
case UnicodeBoundaryClass_REGIONAL_INDICATOR:
return 0;
}
break;
}
switch (boundary_class)
{
// GB9 - Don't break before EXTEND
case UnicodeBoundaryClass_EXTEND:
return 0;
// GB9a - Don't break before SPACING_MARK
case UnicodeBoundaryClass_SPACING_MARK:
return 0;
}
return 1;
}
Next I sampled the switch/case in these functions to a table, and added two very simple branchless lookup functions. This helped the performance quite a lot.
const uint8_t T[] = {
// state_for()
0, 1, 2, 0, 0, 1, 3, 4, 5, 4, 5, 6, 1,
// break_for()
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,
1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0,
1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,
};
bool
table_break_for(int b, int s) {
return T[((s + 1) * 13) + b];
}
int
table_state_for(int b) {
return T[b];
}
There were some further optimizations thanks to talks with Martin Wardener:
13
multiplication with << 4
. The final table still fitts in two cache lines.So the final code looks like this: