Optimizing Unicode data for fun and profit

Just the default ICU Unicode data library is some 20MB. We don't need most of those data and since we are anal about executable size and also about dependencies, I implemented all of this shit by hand.

I could have packed the data with some state-of-art compressor, but that is a dependency, and it blows up executable size. Instead I went for some fun and tried to bring the data sizes down with very simple algorithms.

Grapheme clusters boundary data

These data allow Rift to place it's caret correctly and therefore not break Unicode text even if we cannot render it.

Algorithm requires 17811 codepoints in 1,345 ranges (a space of 921,600 codepoints) to be assigned one of 14 categories. You then use these 14 categories to detect whether there is a valid caret offset between a pair of codepoints using algorithm defined in Unicode.

Naive storage with two 32-bit codepoints and an 8-bit category would be some 12105 bytes.

The optimization I made is this:

Expanded the ranges to an array of 921,600 categories, each sitting at index of corresponding codepoint.
RLE encoded them down.

Result is 3,674 bytes in total.

Detection algorithm itself requires a state machine which is encoded into a table of 98 bytes. So it totals at 3,772 bytes. That allows the actual algorithm code to be this tiny:

static inline int
corune_get_cluster_break_property(CoRune rune) {
    return rune < CORUNE_CLUSTER_BREAK_MAX ? CORUNE_CLUSTER_BREAK[rune] : 0;
}

static inline int
corune_is_cluster_break(int state, int category) {
    return CORUNE_CLUSTER_BREAK_STATES[((state + 1) << 4) + category];
}

static inline int
corune_get_cluster_break_state(int category) {
    return CORUNE_CLUSTER_BREAK_STATES[category];
}

I've documented how we went about optimizing the algorithm in a separate article written in 2016.

General category data

These data allow Rift to tell whether a codepoint is a lowercase, uppercase, number or any of the 30 categories defined by Unicode. It is a replacement for <ctype.h>'s isalpha, isdigit etc.

Data is one of 30 categories assigned to one of 3,823 ranges. Again with simple storage of two 32-bit codepoints and 8-bit category we'd end up with 34,407 bytes.

To optimize, similarly to previous data set:

I've expanded the ranges into a flat array of categories each at index of corresponding codepoint.
RLE encoding with pair {category, count}
Encoded both pair values with UTF-8.