Description
Based on the actual utf8proc_NFKC implementation, I successfully wrote an NFKC normalization function in C++ that operates directly on UTF-32 code points:
    #include <cstddef>
    #include <iterator>
    #include <vector>
    #include <utf8proc.h>

    bool tryNormalizeNFKC(const std::vector<char32_t>& codePoints, std::vector<char32_t>& normalized)
    {
        normalized.clear();
        normalized.reserve(codePoints.size());
        char32_t buff[8];
        utf8proc_ssize_t rc;
        int lastBoundClass = 0; // only relevant with UTF8PROC_CHARBOUND, but must not be left uninitialized

        // Step 1: compatibility-decompose each code point and append the result.
        for (size_t i = 0; i < codePoints.size(); i++)
        {
            // NOTE: UTF8PROC_DECOMPOSE is undocumented for utf8proc_decompose_char but it's necessary
            rc = utf8proc_decompose_char(codePoints[i], (utf8proc_int32_t*)buff, std::size(buff),
                (utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT), &lastBoundClass);
            // A negative value is an error; a value above the buffer size means buff was too small.
            if (rc < 0 || (size_t)rc > std::size(buff))
                goto Fail;
            normalized.insert(normalized.end(), buff, buff + rc);
        }

        // Step 2: canonical reordering and composition, done in place.
        rc = utf8proc_normalize_utf32((utf8proc_int32_t*)normalized.data(),
            (utf8proc_ssize_t)normalized.size(), (utf8proc_option_t)(UTF8PROC_COMPOSE | UTF8PROC_STABLE));
        if (rc < 0)
            goto Fail;
        normalized.resize((size_t)rc); // rc is the new length after composition
        return true;

    Fail:
        normalized.clear();
        return false;
    }

This is more convenient for me than utf8proc_NFKC, since I already have the vector of char32_t code points, which I also need to post-process after normalization. The only problem I found is that neither UTF8PROC_DECOMPOSE nor UTF8PROC_COMPOSE is documented as an accepted flag for utf8proc_decompose_char, yet one of the two is necessary to perform the desired transformation. Given that the function has 'decompose' in its name, this is even more confusing (I got it working only through trial and error and a bit of luck).
If you don't mind, could you also clarify a couple of other things:
- What's the maximum size needed for the dst buffer of utf8proc_decompose_char (I guess there is a static maximum value)?
- I noticed that UTF8PROC_STABLE may currently be unused in the utf8proc code, correct?
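
Regarding the buffer size question: until a static maximum is confirmed, one workaround I can think of is to rely on the return value instead. As far as I can tell from the comments in utf8proc.h, when the result does not fit, utf8proc_decompose_char returns the required size rather than an error. Below is a minimal sketch of that retry pattern. The helper name appendDecomposed is hypothetical (it is not part of utf8proc), and it takes any callable with that assumed contract so that it compiles without the library:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, NOT part of utf8proc.
// Assumption: decompose(cp, dst, capacity) follows the contract described in
// utf8proc.h for utf8proc_decompose_char: it returns a negative value on error,
// otherwise the number of code points produced, which is larger than
// `capacity` when the buffer was too small (dst contents are then undefined).
template <typename Decompose>
bool appendDecomposed(char32_t cp, std::vector<char32_t>& out, Decompose decompose)
{
    const std::size_t start = out.size();
    std::size_t capacity = 4; // small initial guess, grown on demand

    for (int attempt = 0; attempt < 2; ++attempt)
    {
        // Make room at the end of the vector and decompose directly into it.
        out.resize(start + capacity);
        const long rc = decompose(cp, out.data() + start, capacity);
        if (rc < 0)
            break; // error reported by the callable

        const std::size_t needed = static_cast<std::size_t>(rc);
        if (needed <= capacity)
        {
            out.resize(start + needed); // keep only what was written
            return true;
        }
        capacity = needed; // too small: retry once with the reported size
    }

    out.resize(start); // leave `out` as it was on failure
    return false;
}
```

In real use the callable would be a small lambda that forwards to utf8proc_decompose_char with the same casts and options as in the function above. A documented static maximum would still be preferable, since it would allow a plain stack buffer.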