Add support for multibyte characters #10
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,6 @@ | ||
export { getEmbeddingLevels } from './embeddingLevels.js' | ||
export { getReorderSegments, getReorderedIndices, getReorderedString } from './reordering.js' | ||
export { getEmbeddingLevels, getEmbeddingLevelsForCharacters } from './embeddingLevels.js' | ||
export { getReorderSegments, getReorderedCharacters, getReorderedIndices, getReorderedString } from './reordering.js' | ||
export { getBidiCharType, getBidiCharTypeName } from './charTypes.js' | ||
export { getMirroredCharacter, getMirroredCharactersMap } from './mirroring.js' | ||
export { closingToOpeningBracket, openingToClosingBracket, getCanonicalBracket } from './brackets.js' | ||
export { stringToArray } from './util/stringToArray.js' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
/** | ||
* Break a string into rendered characters (graphemes), | ||
* using simpler methods of breaking strings apart doesn't take into account characters with multiple bytes. | ||
* For instance `'👱🏽‍♂️'.length === 7`
* @param {string} string - input string | ||
* @return {string[]} - the string broken down into an array of characters. | ||
*/ | ||
export function stringToArray (string) { | ||
return [...new Intl.Segmenter().segment(string)].map(x => x.segment); | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -80,12 +80,12 @@ | |
202E 202D 05D0 202C 0028 005B 202C 202B 0061 005D 0029 0062;2;1;x x 4 x 3 3 x x 4 4 4 4;8 9 10 11 5 4 2 | ||
|
||
# Nonspacing marks applied to paired brackets | ||
0061 0028 0062 0029 0331;1;1;2 2 2 2 2;0 1 2 3 4 | ||
0061 0028 0332 0062 0029 0333;1;1;2 2 2 2 2 2;0 1 2 3 4 5 | ||
05D0 0028 05D1 0029 0331;0;0;1 1 1 1 1;4 3 2 1 0 | ||
05D0 0028 0332 05D1 0029 0333;0;0;1 1 1 1 1 1;5 4 3 2 1 0 | ||
0661 0028 0662 0029 0331;0;0;2 1 2 1 1;4 3 2 1 0 | ||
0661 0028 0332 0662 0029 0333;0;0;2 1 1 2 1 1;5 4 3 2 1 0 | ||
0061 0028 0062 0029 0331;1;1;2 2 2 1;3 0 1 2 | ||
I believe you are not supposed to touch this file; it is imported directly from the Unicode bidi spec.
Do you think BidiCharacterTest.js should also use |
||
0061 0028 0332 0062 0029 0333;1;1;2 2 2 1;3 0 1 2 | ||
05D0 0028 05D1 0029 0331;0;0;1 1 1 0;2 1 0 3 | ||
05D0 0028 0332 05D1 0029 0333;0;0;1 1 1 0;2 1 0 3 | ||
0661 0028 0662 0029 0331;0;0;2 1 2 0;2 1 0 3 | ||
0661 0028 0332 0662 0029 0333;0;0;2 1 2 0;2 1 0 3 | ||
|
||
# Nested bracket pairs that reach and exceed the fixed capacity of the bracket stack | ||
# a ( ( ... ( b ) ) ... ) with 62, 63, and 64 nested bracket pairs | ||
|
@@ -228,8 +228,8 @@ | |
0661 0009 0028 0662 0029;2;0;2 0 1 2 1;0 1 4 3 2 | ||
0661 0020 0028 0662 0029;2;0;2 1 1 2 1;4 3 2 1 0 | ||
05D0 0029 0020 0028 0661 0029;0;0;1 1 1 1 2 1;5 4 3 2 1 0 | ||
05D0 0029 0028 0301 0031 0029;0;0;1 1 1 1 2 1;5 4 3 2 1 0 | ||
05D0 0029 0028 0301 0661 0029;0;0;1 1 1 1 2 1;5 4 3 2 1 0 | ||
05D0 0029 0028 0301 0031 0029;0;0;1 1 1 2 0;3 2 1 0 4 | ||
05D0 0029 0028 0301 0661 0029;0;0;1 1 1 2 0;3 2 1 0 4 | ||
0627 0028 0661 003F 0020 0029 005D;0;0;1 1 2 1 1 1 0;5 4 3 2 1 0 6 | ||
|
||
# Combinations of paired brackets, numbers, and directional formatting characters | ||
|
`Intl.Segmenter` segments by grapheme clusters by default. This is possibly overkill for fixing #9. In some scripts, it will group whole syllables of multiple characters. If the issue is just that surrogate pairs are getting split up, then use the spread operator (`[...string]`) or the `string[Symbol.iterator]()` iterator directly on the string you want to split, as described in this documentation. The standard iterator will also get you whole Unicode code points.
The string iterator is much more widely supported than `Intl.Segmenter`, which only landed in Firefox a few months ago.
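To make the difference concrete, here is a minimal sketch comparing the two approaches discussed in this review. The helper names `splitByCodePoint` and `splitByGrapheme` are hypothetical and not part of the PR; the sketch assumes a runtime where `Intl.Segmenter` is available.

```javascript
// Hypothetical helpers for illustration only; neither exists in the PR.

// Splits on Unicode code points: surrogate pairs stay together,
// but combining sequences are still broken apart.
function splitByCodePoint(string) {
  return [...string];
}

// Splits on grapheme clusters (what the PR's stringToArray does).
// Assumes Intl.Segmenter is supported by the runtime.
function splitByGrapheme(string) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(string), (part) => part.segment);
}

// The emoji from the JSDoc example, written with escapes so the code points are visible:
// U+1F471, U+1F3FD, U+200D (zero-width joiner), U+2642, U+FE0F.
const emoji = '\u{1F471}\u{1F3FD}\u200D\u2642\uFE0F';

// A Hangul syllable written as three conjoining jamo (lead, vowel, trail).
const syllable = '\u1112\u1161\u11AB';

console.log(emoji.length);                      // 7 UTF-16 code units
console.log(splitByCodePoint(emoji).length);    // 5 code points
console.log(splitByGrapheme(emoji).length);     // 1 grapheme cluster

console.log(splitByCodePoint(syllable).length); // 3 code points
console.log(splitByGrapheme(syllable).length);  // 1 grapheme cluster
```

If the bidi algorithm needs one entry per code point (each code point has its own bidi character type), the code-point split keeps that mapping intact, whereas the grapheme split merges several code points into one array element. Which behavior is wanted here is a design decision for the PR, not something this sketch settles.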