-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for multibyte characters #10
base: main
Are you sure you want to change the base?
Conversation
β¦ based on character arrays
@talltyler thanks for picking up my issue! I wish I had the knowledge to fix it myself... Also, the original unicode bidi tests lack tests with multibyte characters which is why this implementation didn't pick it up, we need to add some here if we want @lojjic to approve this quicker, I'll try to add a PR with some cases based on |
I'll spend some time today seeing if I can fix the tests. I don't know much about the details of these cases but hopefully I can find a pattern. |
05D0 0028 0332 05D1 0029 0333;0;0;1 1 1 1 1 1;5 4 3 2 1 0 | ||
0661 0028 0662 0029 0331;0;0;2 1 2 1 1;4 3 2 1 0 | ||
0661 0028 0332 0662 0029 0333;0;0;2 1 1 2 1 1;5 4 3 2 1 0 | ||
0061 0028 0062 0029 0331;1;1;2 2 2 1;3 0 1 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you are not supposed to touch this file it is imported directly from the Unicode bidi spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think BidiCharacterTest.js should also use string.split('')
I've made a few changes and the tests are passing but I think this could still be better. There are multibyte characters in the tests already, the problem is my changes combine things like \u036F with the characters that are before them into a single item in the character array being passed around. The test cases assume these multibyte characters are broken out into their parts so the tests and the values they are testing are off. I've fixed the cases in BidiCharacterTest.txt but their are 49038 cases in BidiTest.txt that are off. To get the tests in BidiCharacterTest.js to pass I'm not using the stringToArray function, it is just |
* @return {string[]} - the string broken down into an array of characters. | ||
*/ | ||
export function stringToArray (string) { | ||
return [...new Intl.Segmenter().segment(string)].map(x => x.segment); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intl.Segmenter
segments by grapheme clusters by default. This is possibly overkill for fixing #9. In some scripts, it will group whole syllables of multiple characters. If the issue is just that surrogate pairs are getting split up, then use the spread operator ([...string]
) or the string[Symbol.iterator]()
iterator directly on the string you want to split, as described in this documentation:
return [...string];
The standard iterator will also get you whole Unicode codepoints:
for (const char of string) {
const codePoint = char.codePointAt(0); // never a lone surrogate
}
The string iterator is much more widely supported than Intl.Segmenter
, which only landed in Firefox a few months ago.
All functions now operate based on character arrays rather than indexing characters from within the string.
The reason why this is important is
'π±π½ββοΈ'.length
is 7 not 1 so accessing string[0] through string[6] would be modifying the bytes in the middle of this single character which also messed with your logic related to how reordering should work.This seems to fix #9
but the problem can be seen more clearly in this modified version of their example where I changed their emoji to the one above. https://codepen.io/talltyler/pen/jOomxwz
To get this working I'm using Intl.Segmenter which is newish but has pretty good support across browsers along with Node 16 and Deno.
https://caniuse.com/mdn-javascript_builtins_intl_segmenter
If you would rather use something that is more compatible with older environments I can make a change but the only other way I know how to do it is with a huge regex that covers all of the ranges that can connect.
I tried to modify the comments and keep the current api working but if you are ok changing this a few of these functions can be removed.