These will have the high bit set, which makes them appear non-printable and results in the entire string being discarded. The style information carries the necessary font information, so it's sufficient to just strip the bit.
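For illustration only, here is a minimal sketch of the kind of stripping described above. It is not the code from the patch. It assumes the affected characters arrive as U+F0xx code units (in the Unicode Private Use Area) and that the low byte is the index into the symbol font; the range, the constants and the name `stripSymbolOffset` are all hypothetical.

```cpp
#include <string>

// Assumption: symbol-font characters are stored as U+F000..U+F0FF and the
// low byte is the real index into the font. Both constants are hypothetical.
constexpr char16_t kSymbolBase = 0xF000;
constexpr char16_t kHighByteMask = 0xFF00;

// Returns a copy of `in` with the assumed U+F0xx offset removed.
// Characters outside that range are copied unchanged.
std::u16string stripSymbolOffset(const std::u16string &in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char16_t ch : in) {
        if ((ch & kHighByteMask) == kSymbolBase)
            out.push_back(static_cast<char16_t>(ch & 0x00FF));
        else
            out.push_back(ch);
    }
    return out;
}
```

Under that assumption, U+F04A becomes 0x4A ("J"), and which glyph is actually drawn depends on the font named in the style information.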
Is there something in the spec about this encoding of the Microsoft fonts, or is it just some kind of common practice? It seems a little scary to just strip bits.
If the main issue is the whole string disappearing, those text.clear() calls below look suspicious. A Unicode string could contain all kinds of non-printing characters, like LTR/RTL controls, which would remain broken? The calls seem to have been introduced by commit 4847181d7d5f as some kind of workaround. No idea whether it is still needed or not.
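To make the concern concrete, here is a rough sketch of how an all-or-nothing check behaves on a string containing a direction mark. It is not the Calligra code: `looksPrintable` is a hypothetical stand-in for the real printability test, and `dropNonPrintable` is only one possible alternative to clearing.

```cpp
#include <algorithm>
#include <string>

// Hypothetical stand-in for the real check: treats C0 controls, the
// LRM/RLM direction marks and the Private Use Area as non-printable.
bool looksPrintable(char16_t ch)
{
    if (ch < 0x20)
        return false;
    if (ch == 0x200E || ch == 0x200F)
        return false;
    if (ch >= 0xE000 && ch <= 0xF8FF)
        return false;
    return true;
}

// All-or-nothing: a single non-printable character discards the string.
std::u16string clearIfAnyNonPrintable(const std::u16string &text)
{
    const bool allPrintable =
        std::all_of(text.begin(), text.end(), looksPrintable);
    return allPrintable ? text : std::u16string();
}

// Per-character alternative: only the offending characters are dropped.
std::u16string dropNonPrintable(const std::u16string &text)
{
    std::u16string out;
    for (char16_t ch : text) {
        if (looksPrintable(ch))
            out.push_back(ch);
    }
    return out;
}
```

With this stand-in, a string such as u"abc\u200Fdef" comes back empty from the first function and as "abcdef" from the second.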
There's nothing in the spec that I've found. There's a more detailed explanation in the Word parser, https://cgit.kde.org/calligra.git/tree/filters/words/msword-odf/wv2/src/parser9x.cpp#n513, but it still doesn't cite any sources. Removing the entire string is excessive and may be a problem with some documents, but removing the text.clear() calls without addressing the decoding issue gives you a string with junk or missing characters, whereas addressing the decoding gives the full, correct string.
But would those characters work without encoding adjustments if the MS font in use were present? I'm not sure how it gets rendered now without the font, but if it's anything like the "J" smiley in Exchange emails, I'm not sure which is worse.
Adding Marijn. These are ancient changes, but could those text.clear() parts be removed by now?