Introduce cppjieba as a submodule for Chinese word segmentation #18548

CrazySteve0605 · 2025-07-24T02:27:21Z

Introduce cppjieba, an NLP-based Chinese tokenizer, for implementing Chinese word navigation and braille output.

Link to issue number:

Related to #4075; part of the NVDA OSPP 2025 project.

Summary of the issue:

NVDA’s current word navigation mechanism relies on Unicode boundary rules through the Uniscribe API, which do not work well for languages such as Chinese due to the absence of explicit word delimiters.
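To illustrate the problem, here is a minimal sketch that uses plain whitespace splitting as a stand-in for delimiter-driven word breaking. It is not the Uniscribe API and not NVDA code; `splitOnWhitespace` and `checkExamples` are hypothetical names introduced only for this example. It assumes the source file is compiled as UTF-8.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text on whitespace only.
// This is a simplified stand-in for delimiter-based word breaking,
// not the Uniscribe behaviour itself.
std::vector<std::string> splitOnWhitespace(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}

// English has spaces between words, so each word becomes its own token.
// A Chinese sentence has no spaces, so the whole sentence stays one token.
void checkExamples() {
    assert(splitOnWhitespace("move to next word").size() == 4);
    assert(splitOnWhitespace("我爱北京天安门").size() == 1);
}
```

A dictionary-based segmenter such as cppjieba is meant to recover word boundaries inside that single token, which delimiter-based rules cannot do.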

Description of user facing changes:

None

Description of developer facing changes:

A tool to implement word navigation and braille output within Chinese content.

Description of development approach:

Added cppjieba as a submodule.
Added a wrapper for it and a build script.

Testing strategy:

Confirm whether it compiles successfully and whether its segmentation function can be called through ctypes.

Known issues with pull request:

Code Review Checklist:

Documentation:
- Change log entry
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English
API is compatible with existing add-ons.
Security precautions taken.

@coderabbitai summary

- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`.
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the project’s existing license, ensuring proper attribution and compliance.
- Update documents.

wmhn1872265132 · 2025-07-24T09:16:44Z

I downloaded the launcher built by GitHub Actions for testing, and it doesn't seem to work.

cary-rowen · 2025-07-24T09:23:24Z

Imo, I suggest that initial development and debugging be carried out locally, especially during the stage when functionalities are not yet operational. If extensive testing by community early users is needed, at least new features should be functional.

CrazySteve0605 · 2025-07-24T09:28:34Z

> I downloaded the launcher built by GitHub Actions for testing and it doesn't seem to work

Apologies for the insufficient explanation. This PR is intended as an initial step of the overall work. It does not implement any segmentation functionality yet, but merely introduces the ‘cppjieba’ module. Therefore, it has no impact on end users.

CrazySteve0605 · 2025-07-24T09:39:42Z

> Imo, I suggest that initial development and debugging be carried out locally, especially during the stage when functionalities are not yet operational. If extensive testing by community early users is needed, at least new features should be functional.

As many parts of the code need changes, splitting the work into smaller tasks might be more efficient, as the community can review them at the same time.

seanbudd · 2025-07-31T05:27:42Z

@CrazySteve0605 - can you please provide more information on why this library was selected over others? What were the alternative options?

seanbudd · 2025-07-31T05:26:22Z

nvdaHelper/local/sconscript

@@ -113,8 +113,18 @@ localLib = env.SharedLibrary(
 		"Gdiplus",
 		"Iphlpapi",
 		"Ws2_32",
-		"runtimeobject",
+		"runtimeobject",  


Suggested change: remove the trailing whitespace after `"runtimeobject",`.

@seanbudd I'm going to migrate these build scripts, so I'll remove the changes to this file entirely. Thanks for the review.

- Introduce `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a def file, under `nvdaHelper/cppjieba/`
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method
  - Implements Meyers’ singleton via `getInstance()` with a private constructor
  - Deletes copy constructor, copy assignment, move constructor, and move assignment to enforce a single instance
- Add C-style API in the same module:
  - `int initJieba()` to force singleton initialization
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets
  - `void freeOffsets(int* ptr)` to release the allocated offset buffer
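The shape described in this commit message could look roughly like the sketch below. It is not the code from the PR: `StubSegmenter`, `SegmenterSingleton`, `cutToEndOffsets`, and `countWords` are hypothetical names, and the stub treats every UTF-8 code point as one word so that the example compiles without cppjieba. The return codes (0 for success, -1 for failure), the use of `malloc`/`free`, and the meaning of each offset (the character index at which a word ends) are assumptions, since the commit message does not specify them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for cppjieba::Jieba: every code point is one "word".
class StubSegmenter {
protected:
    // Returns, for each word, the character offset at which it ends.
    std::vector<int> cutToEndOffsets(const std::string& utf8) const {
        std::vector<int> ends;
        int chars = 0;
        for (unsigned char c : utf8) {
            if ((c & 0xC0) != 0x80) {  // not a continuation byte: new code point
                ++chars;
                ends.push_back(chars);
            }
        }
        return ends;
    }
};

class SegmenterSingleton : private StubSegmenter {
public:
    // Meyers' singleton: the function-local static is initialised once,
    // and that initialisation is thread-safe since C++11.
    static SegmenterSingleton& getInstance() {
        static SegmenterSingleton instance;
        return instance;
    }

    // Serialises access so concurrent callers do not share segmenter state.
    std::vector<int> getOffsets(const std::string& text) {
        std::lock_guard<std::mutex> lock(mutex_);
        return cutToEndOffsets(text);
    }

    // No copies or moves: exactly one instance exists.
    SegmenterSingleton(const SegmenterSingleton&) = delete;
    SegmenterSingleton& operator=(const SegmenterSingleton&) = delete;
    SegmenterSingleton(SegmenterSingleton&&) = delete;
    SegmenterSingleton& operator=(SegmenterSingleton&&) = delete;

private:
    SegmenterSingleton() = default;
    std::mutex mutex_;
};

extern "C" {

// Forces construction of the singleton. Return code is an assumption.
int initJieba() {
    SegmenterSingleton::getInstance();
    return 0;
}

// Writes a heap-allocated array of offsets to *charOffsets and its length
// to *outLen. The caller must release the array with freeOffsets().
int segmentOffsets(const char* text, int** charOffsets, int* outLen) {
    if (!text || !charOffsets || !outLen) return -1;
    std::vector<int> offsets = SegmenterSingleton::getInstance().getOffsets(text);
    *charOffsets = nullptr;
    *outLen = static_cast<int>(offsets.size());
    if (offsets.empty()) return 0;
    int* buf = static_cast<int*>(std::malloc(offsets.size() * sizeof(int)));
    if (!buf) {
        *outLen = 0;
        return -1;
    }
    std::copy(offsets.begin(), offsets.end(), buf);
    *charOffsets = buf;
    return 0;
}

void freeOffsets(int* ptr) {
    std::free(ptr);
}

}  // extern "C"

// Example caller: counts the words in text and releases the buffer.
int countWords(const char* text) {
    int* offsets = nullptr;
    int len = 0;
    if (segmentOffsets(text, &offsets, &len) != 0) return -1;
    assert(len == 0 || offsets != nullptr);
    freeOffsets(offsets);
    return len;
}
```

Pairing `segmentOffsets` with `freeOffsets` keeps allocation and release inside the same module, which matters for a caller such as ctypes that cannot safely free memory allocated by a different C runtime.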

- Change 'submodules' in 'jobs - buildNVDA - Build NVDA - Checkout NVDA' from 'true' to 'recursive' to ensure cppjieba's submodule is fetched.
  - This will cause the submodule of sonic to be fetched as well, which currently seems to be unused.

seanbudd requested a review from michaelDCurran July 24, 2025 02:29

CrazySteve0605 added 2 commits July 24, 2025 16:51

Update what's new

fb4efef

Add comments for building script of cppjieba and its dependency

ae58e9b

CrazySteve0605 marked this pull request as ready for review July 24, 2025 09:05

CrazySteve0605 requested a review from a team as a code owner July 24, 2025 09:05

seanbudd reviewed Jul 31, 2025

CrazySteve0605 and others added 3 commits August 1, 2025 13:04

Update projectDocs/dev/createDevEnvironment.md

06070c1

Co-authored-by: Sean Budd <[email protected]>

Update include/readme.md

2273a60

Co-authored-by: Sean Budd <[email protected]>

Remove changes in sconscript for localLIb

3d4d9f1

CrazySteve0605 marked this pull request as draft August 4, 2025 00:32

CrazySteve0605 added 7 commits August 4, 2025 09:59

add building script for cppjieba

1fbf05f

Merge branch 'master' into integrateCPPJieba

f4cab8a

Merge branch 'master' into integrateCPPJieba

d4c3a92

Update .gitignore for cppjieba

da662be

Update building and setup script for cppjieba's dicts installation

38a12dc

gerald-hartig reviewed Aug 8, 2025

nvdaHelper/cppjieba/cppjieba.cpp (outdated, resolved)

gerald-hartig reviewed Aug 8, 2025

nvdaHelper/cppjieba/cppjieba.hpp (outdated, resolved)

seanbudd reviewed Aug 8, 2025

include/readme.md (outdated, resolved)

nvdaHelper/cppjieba/cppjieba.cpp (outdated, resolved)

nvdaHelper/cppjieba/cppjieba.hpp (resolved)

nvdaHelper/cppjieba/sconscript (resolved)

user_docs/en/changes.md (resolved)

CrazySteve0605 and others added 3 commits August 9, 2025 20:30

update copyright headers based on @seanbudd's suggestions

c60c2da

Update include/readme.md

c853b64

Co-authored-by: Sean Budd <[email protected]>

Merge branch 'master' into integrateCPPJieba

53dd3bb
