Skip to content

Introduce cppjieba as a submodule for Chinese word segmentation #18548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

CrazySteve0605
Copy link
Contributor

@CrazySteve0605 CrazySteve0605 commented Jul 24, 2025

Introduce cppjieba, an NLP-based Chinese tokenizer, for implementing Chinese word navigation and braille output.

Link to issue number:

Related to #4075 and a part of OSPP 2025 of NVDA.

Summary of the issue:

NVDA’s current word navigation mechanism relies on Unicode boundary rules through the Uniscribe API, which do not work well for languages such as Chinese due to the absence of explicit word delimiters.

Description of user facing changes:

None

Description of developer facing changes:

A tool to implement word navigation and braille output within Chinese content.

Description of development approach:

  • Added cppjieba as a submodule.
  • Added its wrapper and building script.

Testing strategy:

Confirm weather it can be successfully compiled and its segmentation function can be called by ctypes.

Known issues with pull request:

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

@coderabbitai summary

- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide
  robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and
  configure it to track the `master` branch.
- Update 'sconscript' to include the paths of 'cppjieba' and its dependency 'limonp'
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the
  project’s existing license, ensuring proper attribution and compliance.
- Update documents
@seanbudd seanbudd requested a review from michaelDCurran July 24, 2025 02:29
@CrazySteve0605 CrazySteve0605 marked this pull request as ready for review July 24, 2025 09:05
@CrazySteve0605 CrazySteve0605 requested a review from a team as a code owner July 24, 2025 09:05
@wmhn1872265132
Copy link
Contributor

I downloaded the launcher built by GitHUB Actions for testing and it doesn't seem to work

@cary-rowen
Copy link
Contributor

Imo, I suggest that initial development and debugging be carried out locally, especially during the stage when functionalities are not yet operational. If extensive testing by community early users is needed, at least new features should be functional.

@CrazySteve0605
Copy link
Contributor Author

CrazySteve0605 commented Jul 24, 2025

I downloaded the launcher built by GitHUB Actions for testing and it doesn't seem to work

Apologies for the insufficient explanation. This PR is intended as an initial step of the overall work. It does not implement any segmentation functionality yet, but merely introduces the ‘cppjieba’ module. Therefore, it has no impact on end users.

@CrazySteve0605
Copy link
Contributor Author

Imo, I suggest that initial development and debugging be carried out locally, especially during the stage when functionalities are not yet operational. If extensive testing by community early users is needed, at least new features should be functional.

As many parts of the code need changes, splitting the work into smaller tasks might be more efficient, as the community can review them at the same time.

@seanbudd
Copy link
Member

@CrazySteve0605 - can you please provide more information on why this library was selected over others? what were alternative options?

@@ -113,8 +113,18 @@ localLib = env.SharedLibrary(
"Gdiplus",
"Iphlpapi",
"Ws2_32",
"runtimeobject",
"runtimeobject",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"runtimeobject",
"runtimeobject",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seanbudd Now I'm going to migrate these building scripts and directly remove changes in the whole file. Thanks for reviews.

@CrazySteve0605 CrazySteve0605 marked this pull request as draft August 4, 2025 00:32
- Introduce `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp` with def file under nvdaHelper/cppjieba/'
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method
  - Implements Meyers’ singleton via `getInstance()` with a private constructor
  - Deletes copy constructor, copy assignment, move constructor, and move assignment to enforce single instance
- Add C-style API in the same module:
  - `int initJieba()` to force singleton initialization
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets
  - `void freeOffsets(int* ptr)` to release allocated offset buffer
- Change 'submodules' in 'jobs - buildNVDA - Build NVDA - Checkout NVDA' from 'true' to 'recursive' to ensure cppjieba's submodule is fetched.
- This will cause the submodule of sonic to be fetched as well, which seems currently unused.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants