Introduce cppjieba as a submodule for Chinese word segmentation #18548
- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`.
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the project’s existing license, ensuring proper attribution and compliance.
- Update documents.
I downloaded the launcher built by GitHub Actions for testing, and it doesn't seem to work.
IMO, initial development and debugging should be carried out locally, especially while the functionality is not yet operational. If extensive testing by early community users is needed, the new features should at least be functional.
Apologies for the insufficient explanation. This PR is intended as an initial step of the overall work. It does not implement any segmentation functionality yet; it merely introduces the `cppjieba` module, so it has no impact on end users.
As many parts of the code need changes, splitting the work into smaller tasks might be more efficient, since the community can review them at the same time.
@CrazySteve0605 - can you please provide more information on why this library was selected over others? What were the alternative options?
nvdaHelper/local/sconscript
@@ -113,8 +113,18 @@ localLib = env.SharedLibrary(
"Gdiplus",
"Iphlpapi",
"Ws2_32",
"runtimeobject",
"runtimeobject",
@seanbudd I'm now going to migrate these build scripts, which will remove the changes to this file entirely. Thanks for the review.
Co-authored-by: Sean Budd <[email protected]>
- Introduce `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a def file, under `nvdaHelper/cppjieba/`
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method
  - Implements Meyers’ singleton via `getInstance()` with a private constructor
  - Deletes copy constructor, copy assignment, move constructor, and move assignment to enforce single instance
- Add C-style API in the same module:
  - `int initJieba()` to force singleton initialization
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets
  - `void freeOffsets(int* ptr)` to release allocated offset buffer
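The shape described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `FakeSegmenter` is a hypothetical stand-in for `cppjieba::Jieba` so the sketch compiles without the library, `SegmenterSingleton` and `countWords` are illustrative names, and the return codes (`0` for success, `-1` for failure) and the `malloc`/`free` pairing are assumptions, since the commit message only gives the three C signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for cppjieba::Jieba. It does no real segmentation:
// it simply ends a "word" after every second UTF-8 code point.
class FakeSegmenter {
public:
    // Returns the character (code point) offset at which each word ends.
    std::vector<int> cutOffsets(const std::string& utf8) const {
        int chars = 0;
        for (unsigned char c : utf8) {
            if ((c & 0xC0) != 0x80) ++chars;  // count non-continuation bytes
        }
        std::vector<int> ends;
        for (int i = 2; i < chars; i += 2) ends.push_back(i);
        if (chars > 0) ends.push_back(chars);
        return ends;
    }
};

// Meyers' singleton: the function-local static is initialized once, and that
// initialization is thread-safe since C++11.
class SegmenterSingleton : public FakeSegmenter {
public:
    static SegmenterSingleton& getInstance() {
        static SegmenterSingleton instance;
        return instance;
    }

    // Serializes access in case the underlying segmenter is not re-entrant.
    std::vector<int> getOffsets(const std::string& text) {
        std::lock_guard<std::mutex> lock(mutex_);
        return cutOffsets(text);
    }

    // Copying or moving would create a second instance, so forbid both.
    SegmenterSingleton(const SegmenterSingleton&) = delete;
    SegmenterSingleton& operator=(const SegmenterSingleton&) = delete;
    SegmenterSingleton(SegmenterSingleton&&) = delete;
    SegmenterSingleton& operator=(SegmenterSingleton&&) = delete;

private:
    SegmenterSingleton() = default;
    std::mutex mutex_;
};

extern "C" {

// Forces construction of the singleton. Return codes are an assumption.
int initJieba() {
    try {
        SegmenterSingleton::getInstance();
        return 0;
    } catch (...) {
        return -1;  // exceptions must not cross the C boundary
    }
}

// Writes a malloc'ed array of offsets to *charOffsets and its length to *outLen.
int segmentOffsets(const char* text, int** charOffsets, int* outLen) {
    if (!text || !charOffsets || !outLen) return -1;
    try {
        std::vector<int> offsets = SegmenterSingleton::getInstance().getOffsets(text);
        *charOffsets = nullptr;
        *outLen = static_cast<int>(offsets.size());
        if (offsets.empty()) return 0;
        int* buf = static_cast<int*>(std::malloc(offsets.size() * sizeof(int)));
        if (!buf) {
            *outLen = 0;
            return -1;
        }
        std::copy(offsets.begin(), offsets.end(), buf);
        *charOffsets = buf;
        return 0;
    } catch (...) {
        return -1;
    }
}

// Releases a buffer returned by segmentOffsets; a null pointer is accepted.
void freeOffsets(int* ptr) { std::free(ptr); }

}  // extern "C"

// Usage sketch: segment a string, release the buffer, return the word count.
inline int countWords(const char* text) {
    int* offsets = nullptr;
    int n = 0;
    int rc = segmentOffsets(text, &offsets, &n);
    assert(rc == 0);
    (void)rc;
    freeOffsets(offsets);
    return n;
}
```

Keeping allocation and release on the same side of the boundary (`segmentOffsets` allocates, `freeOffsets` releases) means a caller such as Python's ctypes never has to free memory with a different C runtime than the one that allocated it.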
- Change `submodules` in `jobs - buildNVDA - Build NVDA - Checkout NVDA` from `true` to `recursive` to ensure cppjieba's submodule is fetched.
  - This will cause the submodule of sonic to be fetched as well, which currently seems to be unused.
Introduce cppjieba, an NLP-based Chinese tokenizer, for implementing Chinese word navigation and braille output.
Link to issue number:
Related to #4075, and part of the OSPP 2025 work for NVDA.
Summary of the issue:
NVDA’s current word navigation mechanism relies on Unicode boundary rules through the Uniscribe API, which do not work well for languages such as Chinese due to the absence of explicit word delimiters.
Description of user facing changes:
None
Description of developer facing changes:
A tool to implement word navigation and braille output within Chinese content.
Description of development approach:
Testing strategy:
Confirm whether it compiles successfully and whether its segmentation function can be called via ctypes.
Known issues with pull request:
Code Review Checklist:
@coderabbitai summary