[feature](analysis) add new chinese tokenizer IK #269

Ryan19929 · 2025-01-02T07:29:41Z

Support the IK tokenizer for the inverted index:
Migrate analysis-ik from Java to C++ and implement basic tokenization functionality.
The major differences from the original Java code are as follows:

Encoding Format Difference: IK-C++ uses /jieba/Unicode.hpp to process characters.
Memory Management Optimization: Add a custom allocator to avoid the performance overhead of frequent memory allocation in STL containers.
Remote Dictionary Support: IK-C++ does not currently support remote dictionaries.
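
The PR's custom allocator is not shown in this description. As a rough illustration of the general idea only, the sketch below uses the C++17 standard `std::pmr` facilities rather than the PR's own allocator: containers draw their storage from a caller-supplied memory resource instead of calling the global heap on every growth. The function name `split_fixed` and its parameters are hypothetical and do not appear in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): cuts `text` into byte chunks of at
// most `width` bytes and stores them in containers backed by `arena`.
std::pmr::vector<std::pmr::string> split_fixed(const std::string& text,
                                               std::size_t width,
                                               std::pmr::memory_resource* arena) {
    assert(width > 0);
    std::pmr::vector<std::pmr::string> pieces(arena);
    for (std::size_t pos = 0; pos < text.size(); pos += width) {
        const std::size_t len = std::min(width, text.size() - pos);
        // The vector's allocator is passed on to each new string, so any
        // string storage also comes from the arena.
        pieces.emplace_back(text.data() + pos, len);
    }
    return pieces;
}
```

A caller could construct a `std::pmr::monotonic_buffer_resource` over a stack buffer and pass its address as `arena`; allocations then become pointer bumps inside that buffer, and everything is released at once when the resource goes out of scope.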

Major changes to the original code:

testChinese.cpp: Add a Chinese tokenization speed test. It uses the dataset at /src/test/data/contribs-lib/analysis/chinese/speed-test-text.txt (红楼梦, Dream of the Red Chamber).
LanguageBasedAnalyzer.h/cpp:
Add IK tokenizer configuration, initialization entry, and dictionary loading logic.
Add a temporary IK tokenization mode entry to AnalyzerMode.

zzzxl1993 · 2025-01-06T03:36:28Z

src/contribs-lib/CLucene/analysis/LanguageBasedAnalyzer.h


 CL_NS_DEF(analysis)

 enum class AnalyzerMode {
    Default,
    All,
-    Search
+    Search,
+    IK_Smart,


Suggest separating the IK and Jieba enums for better clarity.
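
One way the suggested separation could look is sketched below. All names here (`JiebaMode`, `IkMode`, `to_string`) are hypothetical and are not taken from the PR, which keeps a single `AnalyzerMode` enum. `IkMode::MaxWord` is an assumption based on the `ik_max_word` analyzer of the upstream Java project; only `IK_Smart` is visible in the diff hunk above.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical split of the combined enum: one enum per tokenizer family,
// so an IK mode can no longer be passed where a Jieba mode is expected.
enum class JiebaMode { Default, All, Search };
enum class IkMode { Smart, MaxWord };  // MaxWord is an assumption, see lead-in

constexpr std::string_view to_string(JiebaMode mode) {
    switch (mode) {
        case JiebaMode::Default: return "default";
        case JiebaMode::All:     return "all";
        case JiebaMode::Search:  return "search";
    }
    return "unknown";
}

constexpr std::string_view to_string(IkMode mode) {
    switch (mode) {
        case IkMode::Smart:   return "ik_smart";
        case IkMode::MaxWord: return "ik_max_word";
    }
    return "unknown";
}
```

With separate types, a call such as `to_string(IkMode::Smart)` resolves by overload, and mixing the two families becomes a compile-time error rather than a runtime check.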

zzzxl1993 · 2025-01-06T03:39:15Z

src/contribs-lib/CLucene/analysis/ik/dic/DictSegment.cpp

+                if (store_size_ >= children_array_.size()) {
+                    children_array_.resize(store_size_ + 1);
+                }
+                // 插入并保持有序


Suggest using English comments (the comment above reads "insert and keep sorted").
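
For readers unfamiliar with the pattern the commented line refers to, here is a minimal sketch of inserting into a child array while keeping it sorted, using `std::lower_bound`. It is not the code from `DictSegment.cpp`; `ChildEntry` and `insert_sorted` are hypothetical names, and the real class manages `store_size_` and `children_array_` differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a trie node's child entry, keyed by code point.
struct ChildEntry {
    std::uint32_t key;
    int payload;
};

// Inserts `entry` into `children`, keeping the vector sorted by key.
// If an entry with the same key already exists, nothing is inserted.
// Returns the index of the entry that holds the key.
std::size_t insert_sorted(std::vector<ChildEntry>& children, ChildEntry entry) {
    // Binary search for the first element whose key is not less than ours.
    auto it = std::lower_bound(
        children.begin(), children.end(), entry.key,
        [](const ChildEntry& child, std::uint32_t key) { return child.key < key; });
    if (it == children.end() || it->key != entry.key) {
        it = children.insert(it, entry);
    }
    return static_cast<std::size_t>(it - children.begin());
}
```

Keeping the array sorted lets later lookups use binary search as well, at the cost of shifting elements on each insertion.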

zzzxl1993

LGTM

zzzxl1993 reviewed Jan 17, 2025

Ryan19929 force-pushed the clucene-ik-20250102 branch 3 times, most recently from f538355 to e45e3f3 on January 20, 2025 05:46

zzzxl1993 approved these changes Jan 20, 2025

Ryan19929 force-pushed the clucene-ik-20250102 branch from e45e3f3 to 35881b0 on January 21, 2025 03:14

airborne12 force-pushed the clucene branch from eebc55f to 4932d23 on January 21, 2025 06:38

[feature](analysis) add new chinese tokenizer IK

eecf99b

Migrate analysis-ik from Java to C++, implement basic tokenization functionality, and integrate it into CLucene.

Ryan19929 force-pushed the clucene-ik-20250102 branch from 35881b0 to eecf99b on January 21, 2025 06:46

zzzxl1993 requested a review from airborne12 January 22, 2025 07:18

[improvement](analysis) IKMemoryPool supports dynamic expansion

5bac34b
