Update Normalization: attempt 2 #394

Merged 62 commits on Feb 3, 2025
Commits
1dd6b66
Add unicode normalization layer tests
apaniukov Nov 26, 2024
afd4a60
WiP
apaniukov Nov 29, 2024
2f24fec
WiP
apaniukov Dec 20, 2024
08052c2
Switch Casefold and UnicodeNormalization to CharsMap
apaniukov Jan 8, 2025
f6c001b
Add unicode normalization layer tests
apaniukov Nov 26, 2024
472b163
WiP
apaniukov Nov 29, 2024
04fb20c
WiP
apaniukov Dec 20, 2024
ed1203f
Switch Casefold and UnicodeNormalization to CharsMap
apaniukov Jan 8, 2025
012fb8e
Update tests and fix custom charsmap support
apaniukov Jan 9, 2025
e3831ec
Merge remote-tracking branch 'origin/update-normalization' into updat…
apaniukov Jan 9, 2025
8092720
Ruff checks
apaniukov Jan 9, 2025
df34dee
Merge branch 'master' into update-normalization
apaniukov Jan 9, 2025
6a611f3
wip
apaniukov Jan 9, 2025
258f0f4
wip
apaniukov Jan 9, 2025
baf0e70
wip
apaniukov Jan 9, 2025
6177b81
Switch Off FastTokenizer
apaniukov Jan 10, 2025
68b7e4e
Delete torch from dependencies
apaniukov Jan 10, 2025
7244191
Delete FastTokenizer from cmake and readme
apaniukov Jan 10, 2025
082064c
Delete FastTokenizer related patches
apaniukov Jan 10, 2025
7380898
Delete FastTokenizer build form CI
apaniukov Jan 10, 2025
68d0300
Delete FastTokenizer build form CI
apaniukov Jan 10, 2025
fc094a0
Delete FastTokenizer from Cmake
apaniukov Jan 10, 2025
72b0646
Delete FastTokenizer from Cmake
apaniukov Jan 10, 2025
e53ded8
Merge CharsMaps
apaniukov Jan 15, 2025
3e57200
Update CaseFold and UnicodeNorm to call_once
apaniukov Jan 16, 2025
369f2de
Merge branch 'master' into update-normalization
mryzhov Jan 28, 2025
1f17935
ICU build
mryzhov Jan 28, 2025
c3a88aa
Fixed release/debug builds
mryzhov Jan 29, 2025
9d0b134
Merge branch 'master' into update-normalization
mryzhov Jan 30, 2025
9d57a48
Add unicode normalization layer tests
apaniukov Nov 26, 2024
a30f012
WiP
apaniukov Nov 29, 2024
b5a10cd
WiP
apaniukov Dec 20, 2024
ad81006
Switch Casefold and UnicodeNormalization to CharsMap
apaniukov Jan 8, 2025
cde08d3
Update tests and fix custom charsmap support
apaniukov Jan 9, 2025
5186a16
Add unicode normalization layer tests
apaniukov Nov 26, 2024
f70269d
WiP
apaniukov Nov 29, 2024
4004293
Switch Casefold and UnicodeNormalization to CharsMap
apaniukov Jan 8, 2025
dfec349
Ruff checks
apaniukov Jan 9, 2025
1d60d43
wip
apaniukov Jan 9, 2025
4d698d5
wip
apaniukov Jan 9, 2025
0881130
wip
apaniukov Jan 9, 2025
1de0a52
Switch Off FastTokenizer
apaniukov Jan 10, 2025
9ab8f71
Delete torch from dependencies
apaniukov Jan 10, 2025
c323c2f
Delete FastTokenizer from cmake and readme
apaniukov Jan 10, 2025
cee621f
Delete FastTokenizer related patches
apaniukov Jan 10, 2025
faa04d2
Delete FastTokenizer build form CI
apaniukov Jan 10, 2025
b9943b1
Delete FastTokenizer build form CI
apaniukov Jan 10, 2025
5ea1a23
Delete FastTokenizer from Cmake
apaniukov Jan 10, 2025
21e3655
Delete FastTokenizer from Cmake
apaniukov Jan 10, 2025
3d62779
Merge CharsMaps
apaniukov Jan 15, 2025
066ddab
Update CaseFold and UnicodeNorm to call_once
apaniukov Jan 16, 2025
30dfdc5
ICU build
mryzhov Jan 28, 2025
c2d821b
Fixed release/debug builds
mryzhov Jan 29, 2025
b19c193
multi configuration support
mryzhov Jan 30, 2025
368b9dd
Merge remote-tracking branch 'origin/update-normalization' into updat…
apaniukov Jan 30, 2025
0a3a6cc
Merge remote-tracking branch 'origin/update-normalization' into updat…
apaniukov Jan 30, 2025
dc87325
Fix CharsMap For Several Input Combinations
apaniukov Jan 31, 2025
13f3bf9
Disable Sentencepiece Builder Info Logging
apaniukov Jan 31, 2025
e15ff72
Merge branch 'master' into update-normalization
apaniukov Jan 31, 2025
7952af9
Fixed cross-compilation
ilya-lavrenov Feb 3, 2025
3438111
Fixed build with Ninja
ilya-lavrenov Feb 3, 2025
de7c080
WA for Apple
ilya-lavrenov Feb 3, 2025
24 changes: 8 additions & 16 deletions .github/workflows/linux.yml
@@ -63,10 +63,9 @@ jobs:


openvino_tokenizers_cpack:
name: OpenVINO tokenizers cpack (BUILD_FAST_TOKENIZERS=${{ matrix.build_fast_tokenizers }}, BUILD_TYPE=${{ matrix.build_type }})
name: OpenVINO tokenizers cpack, BUILD_TYPE=${{ matrix.build_type }})
strategy:
matrix:
build_fast_tokenizers: [ON]
build_type: [Release] # TODO: Add Debug build when OV provider is ready or use OV package
needs: [ openvino_download ]
if: |
@@ -110,8 +109,7 @@ jobs:
- name: CMake configure - tokenizers
run: |
source ${INSTALL_DIR}/setupvars.sh
cmake -DBUILD_FAST_TOKENIZERS="${{ matrix.build_fast_tokenizers }}" \
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }} \
cmake -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} \
-S ${{ env.OPENVINO_TOKENIZERS_REPO }} \
-B ${{ env.BUILD_DIR }}

@@ -138,15 +136,13 @@ jobs:
if: ${{ always() }}
uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
with:
name: openvino_tokenizers_cpack_${{ matrix.build_fast_tokenizers }}_${{ matrix.build_type }}
name: openvino_tokenizers_cpack_${{ matrix.build_type }}
path: ${{ env.BUILD_DIR }}/*.tar.gz
if-no-files-found: 'error'

openvino_tokenizers_wheel:
name: OpenVINO tokenizers extension (BUILD_FAST_TOKENIZERS=${{ matrix.build_fast_tokenizers }})
strategy:
matrix:
build_fast_tokenizers: [ON, OFF]
name: OpenVINO tokenizers extension wheel

needs: [ openvino_download ]
if: |
always() &&
@@ -188,7 +184,6 @@ jobs:
run: |
python -m pip wheel -v --no-deps --wheel-dir ${BUILD_DIR} \
--config-settings=override=cross.arch="manylinux_2_31_x86_64" \
--config-settings=override=cmake.options.BUILD_FAST_TOKENIZERS="${{ matrix.build_fast_tokenizers }}" \
${{ needs.openvino_download.outputs.ov_wheel_source }} \
${OPENVINO_TOKENIZERS_REPO}
env:
@@ -204,15 +199,12 @@ jobs:
if: ${{ always() }}
uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
with:
name: openvino_tokenizers_wheel_${{ matrix.build_fast_tokenizers }}
name: openvino_tokenizers_wheel
path: ${{ env.BUILD_DIR }}/*.whl
if-no-files-found: 'error'

openvino_tokenizers_tests:
name: OpenVINO tokenizers tests (BUILD_FAST_TOKENIZERS=${{ matrix.build_fast_tokenizers }})
strategy:
matrix:
build_fast_tokenizers: [ON, OFF]
name: OpenVINO tokenizers tests
needs: [ openvino_download, openvino_tokenizers_wheel]
if: always() && needs.openvino_tokenizers_wheel.result == 'success'
timeout-minutes: 45
@@ -242,7 +234,7 @@ jobs:
- name: Download tokenizers package
uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8
with:
name: openvino_tokenizers_wheel_${{ matrix.build_fast_tokenizers }}
name: openvino_tokenizers_wheel
path: ${{ env.INSTALL_DIR }}/ov_tokenizers

- name: Download OpenVINO package
8 changes: 3 additions & 5 deletions .github/workflows/mac.yml
@@ -47,10 +47,9 @@ jobs:
revision: latest_nightly

openvino_tokenizers_cpack:
name: OpenVINO tokenizers cpack (BUILD_FAST_TOKENIZERS=${{ matrix.build_fast_tokenizers }}, BUILD_TYPE=${{ matrix.build_type }})
name: OpenVINO tokenizers cpack (BUILD_TYPE=${{ matrix.build_type }})
strategy:
matrix:
build_fast_tokenizers: [ON]
build_type: [Release] # TODO: Add Debug build when OV provider is ready or use OV package
needs: [ openvino_download ]
timeout-minutes: 45
@@ -89,8 +88,7 @@ jobs:
- name: CMake configure - tokenizers
run: |
source ${INSTALL_DIR}/setupvars.sh
cmake -DBUILD_FAST_TOKENIZERS="${{ matrix.build_fast_tokenizers }}" \
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }} \
cmake -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} \
-S ${{ env.OPENVINO_TOKENIZERS_REPO }} \
-B ${{ env.BUILD_DIR }}

@@ -115,7 +113,7 @@ jobs:
if: ${{ always() }}
uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
with:
name: openvino_tokenizers_cpack_${{ matrix.build_fast_tokenizers }}_${{ matrix.build_type }}
name: openvino_tokenizers_cpack_${{ matrix.build_type }}
path: ${{ env.BUILD_DIR }}/*.tar.gz
if-no-files-found: 'error'

8 changes: 3 additions & 5 deletions .github/workflows/windows.yml
@@ -47,10 +47,9 @@ jobs:
revision: 'latest_available_commit'

openvino_tokenizers_cpack:
name: OpenVINO tokenizers cpack (BUILD_FAST_TOKENIZERS=${{ matrix.build_fast_tokenizers }}, BUILD_TYPE=${{ matrix.build_type }})
name: OpenVINO tokenizers cpack (BUILD_TYPE=${{ matrix.build_type }})
strategy:
matrix:
build_fast_tokenizers: [ON]
build_type: [Release] # TODO: Add Debug build when OV provider is ready or use OV package
needs: [ openvino_download ]
if: |
@@ -115,8 +114,7 @@ jobs:
shell: pwsh
run: |
${{ env.OV_INSTALL_DIR }}/setupvars.ps1
cmake -DBUILD_FAST_TOKENIZERS="${{ matrix.build_fast_tokenizers }}" `
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }} `
cmake -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} `
-S ${{ env.OPENVINO_TOKENIZERS_REPO }} `
-B ${{ env.BUILD_DIR }}
env:
@@ -149,7 +147,7 @@ jobs:
if: ${{ always() }}
uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
with:
name: openvino_tokenizers_cpack_${{ matrix.build_fast_tokenizers }}_${{ matrix.build_type }}
name: openvino_tokenizers_cpack_${{ matrix.build_type }}
path: ${{ env.BUILD_DIR }}/*.zip
if-no-files-found: 'error'

71 changes: 0 additions & 71 deletions README.md
@@ -150,77 +150,6 @@ make

After that, you can transfer all binaries from `build/src` to `<openvino_dir>` as described in the C++ installation instruction above.

### Reducing the ICU Data Size

By default, all available ICU locales are supported, which significantly increases the package size. To reduce the size of the ICU libraries included in your final package, follow these steps:

1. **Use the ICU Data Configuration File**:
- This file specifies which features and locales to include in a custom data bundle. You can find more information [here](https://unicode-org.github.io/icu/userguide/icu_data/buildtool.html#icu-data-configuration-file).

2. **Set the ICU Data Filter File as an Environment Variable**:
- **On Unix-like systems (Linux, macOS)**:
Set the `ICU_DATA_FILTER_FILE` environment variable to the path of your configuration file (`filters.json`):

```bash
export ICU_DATA_FILTER_FILE="filters.json"
```

- **On Windows**:
Set the `ICU_DATA_FILTER_FILE` environment variable using the Command Prompt or PowerShell:

**Command Prompt:**
```cmd
set ICU_DATA_FILTER_FILE=filters.json
```

**PowerShell:**
```powershell
$env:ICU_DATA_FILTER_FILE="filters.json"
```

3. **Create a Configuration File**:
- An example configuration file (`filters.json`) might look like this:

```json
{
"localeFilter": {
"filterType": "language",
"includelist": [
"en"
]
}
}
```

4. **Configure OpenVINO Tokenizers**:
- When building OpenVINO tokenizers, set the following CMake option during the project configuration:

```bash
-DBUILD_FAST_TOKENIZERS=ON
```
- Example for a pip installation path:
```bash
ICU_DATA_FILTER_FILE=</path/to/filters.json> pip install git+https://github.com/openvinotoolkit/openvino_tokenizers.git --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly --config-settings=override=cmake.options.BUILD_FAST_TOKENIZERS=ON
```

By following these instructions, you can effectively reduce the size of the ICU libraries in your final package.

### Build OpenVINO Tokenizers without FastTokenizer Library

If a tokenizer doesn't use `CaseFold`, `UnicodeNormalization` or `Wordpiece` operations, you can drastically reduce package binary size by building OpenVINO Tokenizers without FastTokenizer dependency with this flag:

```bash
-DENABLE_FAST_TOKENIZERS=OFF
```

This option can also help when building for a platform that is not supported by FastTokenizer, for example `Android x86_64`.

Example for a pip installation path:
```bash

pip install git+https://github.com/openvinotoolkit/openvino_tokenizers.git --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly --config-settings=override=cmake.options.ENABLE_FAST_TOKENIZERS=OFF
```

## Usage

:warning: OpenVINO Tokenizers can be inferred on a `CPU` device only.