Introduce cppjieba as a submodule for Chinese word segmentation #18548

Draft: wants to merge 16 commits into base: master. Showing changes from 13 commits.

2 changes: 1 addition & 1 deletion .github/workflows/testAndPublish.yml
@@ -53,7 +53,7 @@ jobs:
       - name: Checkout NVDA
         uses: actions/checkout@v4
         with:
-          submodules: true
+          submodules: recursive
       - name: Install Python
         uses: actions/setup-python@v5
         with:
1 change: 1 addition & 0 deletions .gitignore
@@ -14,6 +14,7 @@ source/lib
source/lib64
source/typelibs
source/louis
source/cppjieba
*.obj
*.exp
*.lib
3 changes: 3 additions & 0 deletions .gitmodules
@@ -42,3 +42,6 @@
[submodule "include/nvda-mathcat"]
path = include/nvda-mathcat
url = https://github.com/nvaccess/nvda-mathcat.git
[submodule "include/cppjieba"]
path = include/cppjieba
url = https://github.com/yanyiwu/cppjieba
1 change: 1 addition & 0 deletions copying.txt
@@ -356,6 +356,7 @@ In addition to these dependencies, the following are also included in NVDA:
- Microsoft Detours: MIT
- Python: PSF
- NSIS: zlib/libpng
- cppjieba: MIT

Furthermore, NVDA also utilises some static/binary dependencies, details of which can be found at the following URL:

1 change: 1 addition & 0 deletions include/cppjieba
Submodule cppjieba added at 9b4090
7 changes: 7 additions & 0 deletions include/readme.md
@@ -49,3 +49,10 @@ Used in chrome system tests.
https://github.com/microsoft/wil/

Fetch latest from master.

### cppjieba

[cppjieba](https://github.com/yanyiwu/cppjieba)

Fetch latest from master.
Used for Chinese text segmentation.
6 changes: 5 additions & 1 deletion nvdaHelper/archBuild_sconscript
@@ -1,5 +1,5 @@
# A part of NonVisual Desktop Access (NVDA)
-# Copyright (C) 2006-2023 NV Access Limited
+# Copyright (C) 2006-2025 NV Access Limited
# This file may be used under the terms of the GNU General Public License, version 2 or later.
# For more details see: https://www.gnu.org/licenses/gpl-2.0.html

@@ -209,6 +209,10 @@ Export("detoursLib")
apiHookObj = env.Object("apiHook", "common/apiHook.cpp")
Export("apiHookObj")

cppjiebaLib = env.SConscript("cppjieba/sconscript")
Export("cppjiebaLib")
env.Install(libInstallDir, cppjiebaLib)

if TARGET_ARCH == "x86":
localLib = env.SConscript("local/sconscript")
Export("localLib")
85 changes: 85 additions & 0 deletions nvdaHelper/cppjieba/cppjieba.cpp
@@ -0,0 +1,85 @@
/*
This file is a part of the NVDA project.
URL: http://www.nvda-project.org/
Copyright 2025 NV Access Limited, Wang Chong.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2.0, as published by
the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This license can be found at:
http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
*/

#include "cppjieba.hpp"

JiebaSingleton& JiebaSingleton::getInstance() {
	// C++11 guarantees thread-safe init of this local static
	static JiebaSingleton instance;
	return instance;
}

JiebaSingleton::JiebaSingleton(): cppjieba::Jieba() { } // call base ctor to load dictionaries, models, etc.

void JiebaSingleton::getOffsets(const std::string& text, std::vector<int>& charOffsets) {
	std::lock_guard<std::mutex> lock(segMutex);
	std::vector<std::string> words;
	this->Cut(text, words, true);

	// Append the cumulative end offset of each word, counted in Unicode code points.
	int cumulative = 0;
	for (auto const& w : words) {
		int wc = 0;
		auto ptr = reinterpret_cast<const unsigned char*>(w.c_str());
		size_t i = 0, len = w.size();
		while (i < len) {
			// The lead byte of a UTF-8 sequence encodes how many bytes the sequence spans.
			unsigned char c = ptr[i];
			if ((c & 0x80) == 0) i += 1;
			else if ((c & 0xE0) == 0xC0) i += 2;
			else if ((c & 0xF0) == 0xE0) i += 3;
			else if ((c & 0xF8) == 0xF0) i += 4;
			else i += 1; // invalid lead byte: skip it and count it as one character
			++wc;
		}
		cumulative += wc;
		charOffsets.push_back(cumulative);
	}
}

extern "C" {

int initJieba() {
	try {
		// simply force the singleton into existence
		(void)JiebaSingleton::getInstance();
		return 0;
	} catch (...) {
		return -1;
	}
}

int segmentOffsets(const char* text, int** charOffsets, int* outLen) {
	if (!text || !charOffsets || !outLen) return -1;
	*charOffsets = nullptr;
	*outLen = 0;

	try {
		// getInstance() constructs the singleton on first use if initJieba() has not been called yet
		std::string input(text);
		std::vector<int> offs;
		JiebaSingleton::getInstance().getOffsets(input, offs);

		int n = static_cast<int>(offs.size());
		if (n == 0) return 0; // no words: malloc(0) may return NULL, which is not a failure
		int* buf = static_cast<int*>(std::malloc(sizeof(int) * n));
		if (!buf) return -1;
		for (int i = 0; i < n; ++i) buf[i] = offs[i];
		*charOffsets = buf;
		*outLen = n;
		return 0;
	} catch (...) {
		// do not let C++ exceptions cross the extern "C" boundary
		return -1;
	}
}

void freeOffsets(int* ptr) {
	std::free(ptr); // free(NULL) is a no-op
}

} // extern "C"
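
Review note: the offset arithmetic in `getOffsets()` can be exercised without loading any dictionaries. The following is a minimal sketch, not part of the PR; `countCodePoints` and `endOffsets` are hypothetical helper names that mirror the lead-byte logic above, and the word list is assumed to be already segmented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Count Unicode code points in a UTF-8 string by stepping over each sequence
// according to its lead byte. Invalid lead bytes count as one code point each.
inline int countCodePoints(const std::string& utf8) {
	int count = 0;
	std::size_t i = 0;
	const std::size_t len = utf8.size();
	while (i < len) {
		const unsigned char c = static_cast<unsigned char>(utf8[i]);
		if ((c & 0x80) == 0x00) i += 1;       // 0xxxxxxx: ASCII
		else if ((c & 0xE0) == 0xC0) i += 2;  // 110xxxxx: 2-byte sequence
		else if ((c & 0xF0) == 0xE0) i += 3;  // 1110xxxx: 3-byte sequence (most CJK)
		else if ((c & 0xF8) == 0xF0) i += 4;  // 11110xxx: 4-byte sequence
		else i += 1;                          // stray continuation or invalid byte
		++count;
	}
	return count;
}

// Turn already-segmented words into cumulative end offsets, in code points.
inline std::vector<int> endOffsets(const std::vector<std::string>& words) {
	std::vector<int> offsets;
	offsets.reserve(words.size());
	int cumulative = 0;
	for (const auto& w : words) {
		cumulative += countCodePoints(w);
		offsets.push_back(cumulative);
	}
	return offsets;
}
```

Note that these offsets count code points, not UTF-16 code units, so a caller that indexes UTF-16 text would need to convert characters outside the Basic Multilingual Plane.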
5 changes: 5 additions & 0 deletions nvdaHelper/cppjieba/cppjieba.def
@@ -0,0 +1,5 @@
LIBRARY cppjieba
EXPORTS
initJieba
segmentOffsets
freeOffsets
71 changes: 71 additions & 0 deletions nvdaHelper/cppjieba/cppjieba.hpp
@@ -0,0 +1,71 @@
/*
This file is a part of the NVDA project.
URL: http://www.nvda-project.org/
Copyright 2025 NV Access Limited, Wang Chong.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2.0, as published by
the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This license can be found at:
http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
*/

#ifndef CPPJIEBA_DLL_H
#define CPPJIEBA_DLL_H
#pragma once

#include <vector>
#include <string>
#include <mutex>
#include <cstdlib>
#include "Jieba.hpp"

#ifdef _WIN32
# define JIEBA_API __declspec(dllexport)
#else
# define JIEBA_API
#endif

/// @brief Singleton wrapper around cppjieba::Jieba.
class JiebaSingleton : public cppjieba::Jieba {
public:
	/// @brief Returns the single instance, constructing on first call.
	static JiebaSingleton& getInstance();

	/// @brief Do thread-safe segmentation and compute character end offsets.
	/// @param text The input text in UTF-8 encoding.
	/// @param charOffsets Output vector; the cumulative end offset of each word, counted in Unicode code points, is appended to it.
	void getOffsets(const std::string& text, std::vector<int>& charOffsets);

private:
	JiebaSingleton();  ///< private ctor initializes base Jieba

	/// Disable copy and move
	JiebaSingleton(const JiebaSingleton&) = delete;
	JiebaSingleton& operator = (const JiebaSingleton&) = delete;
	JiebaSingleton(JiebaSingleton&&) = delete;
	JiebaSingleton& operator = (JiebaSingleton&&) = delete;

	std::mutex segMutex;  ///< guards concurrent Cut() calls
};

extern "C" {

/// @brief Force singleton construction (load dicts, etc.) before any segmentation.
/// @return 0 on success, -1 on failure.
JIEBA_API int initJieba();

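/// @brief Segment UTF-8 text and return the cumulative end offset of each word.
/// @param text NUL-terminated input text in UTF-8 encoding.
/// @param charOffsets Receives a malloc-allocated array of offsets, counted in Unicode code points; release it with freeOffsets().
/// @param outLen Receives the number of entries in the array.
/// @return 0 on success, -1 on failure.
JIEBA_API int segmentOffsets(const char* text, int** charOffsets, int* outLen);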

/// @brief Free memory allocated by segmentOffsets.
JIEBA_API void freeOffsets(int* ptr);

} // extern "C"

#endif // CPPJIEBA_DLL_H
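
Review note: because `segmentOffsets()` hands ownership of a `malloc`'d buffer to the caller, a C++ caller may want to tie `freeOffsets()` to scope exit. The sketch below illustrates that pattern under stated assumptions: `fakeSegmentOffsets` and `fakeFreeOffsets` are hypothetical stand-ins with the same contract as the exported functions (so the example runs without the DLL or dictionaries), and `toSpans` and `segmentSpans` are illustrative helpers, not part of the PR.

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for segmentOffsets(): returns a malloc'd array of
// cumulative end offsets that the caller must release.
inline int fakeSegmentOffsets(int** charOffsets, int* outLen) {
	static const int sample[] = {1, 3, 5, 9};
	const int n = 4;
	int* buf = static_cast<int*>(std::malloc(sizeof(int) * n));
	if (!buf) return -1;
	for (int i = 0; i < n; ++i) buf[i] = sample[i];
	*charOffsets = buf;
	*outLen = n;
	return 0;
}

// Hypothetical stand-in for freeOffsets().
inline void fakeFreeOffsets(int* ptr) { std::free(ptr); }

// Deleter so that std::unique_ptr releases the buffer through the library's own free function.
struct OffsetsDeleter {
	void operator()(int* p) const { fakeFreeOffsets(p); }
};
using OffsetsPtr = std::unique_ptr<int[], OffsetsDeleter>;

// Convert cumulative end offsets into half-open [start, end) word spans.
inline std::vector<std::pair<int, int>> toSpans(const int* ends, int n) {
	std::vector<std::pair<int, int>> spans;
	int start = 0;
	for (int i = 0; i < n; ++i) {
		assert(ends[i] >= start);  // end offsets must be non-decreasing
		spans.emplace_back(start, ends[i]);
		start = ends[i];
	}
	return spans;
}

// Call the C-style API and return word spans; the buffer is freed on every exit path.
inline std::vector<std::pair<int, int>> segmentSpans() {
	int* raw = nullptr;
	int n = 0;
	if (fakeSegmentOffsets(&raw, &n) != 0) return {};
	OffsetsPtr owned(raw);
	return toSpans(owned.get(), n);
}
```

Releasing the buffer through the library's exported free function, rather than calling `free` in the caller, matters when the DLL and its caller link against different C runtimes.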
55 changes: 55 additions & 0 deletions nvdaHelper/cppjieba/sconscript
@@ -0,0 +1,55 @@
# A part of NonVisual Desktop Access (NVDA)
# Copyright (C) 2025 NV Access Limited, Wang Chong.
# This file may be used under the terms of the GNU General Public License, version 2 or later.
# For more details see: https://www.gnu.org/licenses/gpl-2.0.html

import typing # noqa: E402
import os

Import(
[
"thirdPartyEnv",
"sourceDir",
]
)
thirdPartyEnv: Environment = thirdPartyEnv
env: Environment = typing.cast(Environment, thirdPartyEnv.Clone())

cppjiebaPath = Dir("#include/cppjieba")
cppjiebaSrcPath = cppjiebaPath.Dir("include/cppjieba")
cppjiebaDictPath = cppjiebaPath.Dir("dict")
outDir = sourceDir.Dir("cppjieba")
unitTestDictsDir = env.Dir("#tests/unit/cppjiebaDicts")
LimonpPath = cppjiebaPath.Dir("deps/limonp") # cppjieba's dependency
LimonpSrcPath = LimonpPath.Dir("include/limonp")

env.Prepend(
CPPPATH=[
cppjiebaSrcPath,
LimonpSrcPath.Dir(".."),
]
)

sourceFiles = [
"cppjieba.cpp",
"cppjieba.def",
]

cppjiebaLib = env.SharedLibrary(target="cppjieba", source=sourceFiles)

# Ensure the dicts are installed only once, which also avoids a SCons warning.
dictsDir = outDir.Dir("dicts").get_abspath()
if not os.path.exists(dictsDir) or not os.listdir(dictsDir):
	env.Install(
		outDir.Dir("dicts"),
		[
			f
			for f in env.Glob(f"{cppjiebaDictPath}/*")
			if f.name
			not in (
				"README.md",
				"pos_dict",
			)
			and not f.name.endswith(".in")
		],
	)

Return("cppjiebaLib")
2 changes: 1 addition & 1 deletion projectDocs/dev/createDevEnvironment.md
@@ -98,7 +98,7 @@ If you aren't sure, run `git submodule update` after every git pull, merge or ch
* [Nullsoft Install System](https://nsis.sourceforge.io), version 3.11
* [Java Access Bridge 32 bit, from Zulu Community OpenJDK build 17.0.9+8Zulu (17.46.19)](https://github.com/nvaccess/javaAccessBridge32-bin)
* [Windows Implementation Libraries (WIL)](https://github.com/microsoft/wil/)
-* [NVDA DiffMatchPatch](https://github.com/codeofdusk/nvda_dmp)
+* [cppjieba - Chinese word segmentation](https://github.com/yanyiwu/cppjieba), commit `9b40903ed6cbd795367ea64f9a7d3f3bc4aa4714`

#### Build time dependencies

1 change: 1 addition & 0 deletions source/setup.py
@@ -263,6 +263,7 @@ def _genManifestTemplate(shouldHaveUIAccess: bool) -> tuple[int, int, bytes]:
("images", glob("images/*.ico")),
("fonts", glob("fonts/*.ttf")),
("louis/tables", glob("louis/tables/*")),
("cppjieba/dicts", glob("cppjieba/dicts/*")),
("COMRegistrationFixes", glob("COMRegistrationFixes/*.reg")),
("miscDeps/tools", ["../miscDeps/tools/msgfmt.exe"]),
(".", glob("../miscDeps/python/*.dll")),
2 changes: 1 addition & 1 deletion user_docs/en/changes.md
@@ -46,7 +46,7 @@ Please refer to [the developer guide](https://download.nvaccess.org/documentatio
* Updated `include` dependencies:
* detours to `9764cebcb1a75940e68fa83d6730ffaf0f669401`. (#18447, @LeonarddeR)
* The `nvda_dmp` utility has been removed. (#18480, @codeofdusk)
-* `comInterfaces_sconscript` has been updated to make the generated files in `comInterfaces` work better with IDEs. (#17608, @gexgd0419)
+* Added [cppjieba](https://github.com/yanyiwu/cppjieba) as a git submodule for word segmentation. (#18548, @CrazySteve0605)

#### Deprecations
