Introduce cppjieba as a submodule for Chinese word segmentation #18548
Draft: CrazySteve0605 wants to merge 16 commits into nvaccess:master from CrazySteve0605:integrateCPPJieba
+224
−3
Changes from 13 commits
Commits (all by CrazySteve0605):
5cb5189 Introduce cppjieba as a submodule for Chinese word segmentation
fb4efef Update what's new
ae58e9b Add comments for building script of cppjieba and its dependency
06070c1 Update projectDocs/dev/createDevEnvironment.md
2273a60 Update include/readme.md
3d4d9f1 Remove changes in sconscript for localLIb
1fbf05f add building script for cppjieba
7de7464 add JiebaSingleton wrapper and C API for NVDA segmentation
f4cab8a Merge branch 'master' into integrateCPPJieba
d4c3a92 Merge branch 'master' into integrateCPPJieba
0d92c08 Update GitHub action workflow to fetch cppjieba's submodule
da662be Update .gitignore for cppjieba
38a12dc Update building and setup script for cppjieba's dicts installation
c60c2da update copyright headers based on @seanbudd's suggestions
c853b64 Update include/readme.md
53dd3bb Merge branch 'master' into integrateCPPJieba
@@ -14,6 +14,7 @@ source/lib
source/lib64
source/typelibs
source/louis
source/cppjieba
*.obj
*.exp
*.lib
@@ -0,0 +1,85 @@
/*
This file is a part of the NVDA project.
URL: http://www.nvda-project.org/
Copyright 2025 NV Access Limited, Wang Chong.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2.0, as published by
the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This license can be found at:
http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
*/

#include "cppjieba.hpp"

JiebaSingleton& JiebaSingleton::getInstance() {
	// C++11 guarantees thread-safe init of this local static
	static JiebaSingleton instance;
	return instance;
}

JiebaSingleton::JiebaSingleton(): cppjieba::Jieba() { } // call base ctor to load dictionaries, models, etc.

void JiebaSingleton::getOffsets(const std::string& text, std::vector<int>& charOffsets) {
	std::lock_guard<std::mutex> lock(segMutex);
	std::vector<std::string> words;
	this->Cut(text, words, true);

	int cumulative = 0;
	for (auto const& w : words) {
		int wc = 0;
		auto ptr = reinterpret_cast<const unsigned char*>(w.c_str());
		size_t i = 0, len = w.size();
		while (i < len) {
			unsigned char c = ptr[i];
			if ((c & 0x80) == 0) i += 1;
			else if ((c & 0xE0) == 0xC0) i += 2;
			else if ((c & 0xF0) == 0xE0) i += 3;
			else if ((c & 0xF8) == 0xF0) i += 4;
			else i += 1;
			++wc;
		}
		cumulative += wc;
		charOffsets.push_back(cumulative);
	}
}

extern "C" {

int initJieba() {
	try {
		// simply force the singleton into existence
		(void)JiebaSingleton::getInstance();
		return 0;
	} catch (...) {
		return -1;
	}
}

int segmentOffsets(const char* text, int** charOffsets, int* outLen) {
	if (!text || !charOffsets || !outLen) return -1;
	// we assume initJieba() has already been called successfully

	std::string input(text);
	std::vector<int> offs;
	JiebaSingleton::getInstance().getOffsets(input, offs);

	int n = static_cast<int>(offs.size());
	int* buf = static_cast<int*>(std::malloc(sizeof(int) * n));
	if (!buf) {
		*outLen = 0;
		return -1;
	}
	for (int i = 0; i < n; ++i) buf[i] = offs[i];
	*charOffsets = buf;
	*outLen = n;
	return 0;
}

void freeOffsets(int* ptr) {
	if (ptr) free(ptr);
}

} // extern "C"
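Review note: getOffsets() above reports each word's end position in UTF-8 code points, not bytes, which would line up with Python `str` indexing if the caller is NVDA's Python side. The following is a minimal standalone sketch of that counting step, with no cppjieba dependency. The helper names `utf8SequenceLength`, `countCodePoints` and `cumulativeEndOffsets` are hypothetical and do not appear in the PR.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the PR): byte length of the UTF-8 sequence
// introduced by lead byte c. Stray continuation bytes and invalid lead bytes
// count as 1, mirroring the fallback branch in getOffsets().
inline std::size_t utf8SequenceLength(unsigned char c) {
	if ((c & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
	if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
	if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
	if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
	return 1;
}

// Count code points by hopping from lead byte to lead byte.
inline int countCodePoints(const std::string& s) {
	int count = 0;
	std::size_t i = 0;
	while (i < s.size()) {
		i += utf8SequenceLength(static_cast<unsigned char>(s[i]));
		++count;
	}
	return count;
}

// Turn a list of segmented words into cumulative end offsets in code points.
inline std::vector<int> cumulativeEndOffsets(const std::vector<std::string>& words) {
	std::vector<int> offsets;
	offsets.reserve(words.size());
	int cumulative = 0;
	for (const auto& w : words) {
		cumulative += countCodePoints(w);
		offsets.push_back(cumulative);
	}
	return offsets;
}
```

A code point outside the Basic Multilingual Plane counts as one unit here. A consumer that indexes UTF-16 code units instead would need a different count.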
@@ -0,0 +1,5 @@
LIBRARY cppjieba
EXPORTS
	initJieba
	segmentOffsets
	freeOffsets
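Review note: the three exports listed above form a C ABI, and C++ exceptions should not propagate across it. In this revision initJieba() catches everything, while segmentOffsets() constructs a std::string and calls into cppjieba without a try block. One possible way to apply the same guard uniformly is sketched below. `guardedCall` is a hypothetical helper, not part of the PR, and the 0 / -1 convention is taken from the header's doc comments.

```cpp
#include <stdexcept>

// Hypothetical helper (not in the PR): run a callable and translate any
// exception into the 0 / -1 status convention used by the exported functions.
template <typename F>
int guardedCall(F&& body) noexcept {
	try {
		body();
		return 0;
	} catch (...) {
		// Swallow everything: nothing may unwind through an extern "C" frame.
		return -1;
	}
}
```

An exported function could then wrap its body in a lambda passed to `guardedCall` and return the result, keeping its early argument checks outside the lambda.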
@@ -0,0 +1,71 @@
/*
This file is a part of the NVDA project.
URL: http://www.nvda-project.org/
Copyright 2025 NV Access Limited, Wang Chong.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2.0, as published by
the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This license can be found at:
http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
*/

#ifndef CPPJIEBA_DLL_H
#define CPPJIEBA_DLL_H
#pragma once

#include <vector>
#include <string>
#include <mutex>
#include <cstdlib>
#include "Jieba.hpp"

#ifdef _WIN32
# define JIEBA_API __declspec(dllexport)
#else
# define JIEBA_API
#endif

using namespace std;

/// @brief Singleton wrapper around cppjieba::Jieba.
class JiebaSingleton : public cppjieba::Jieba {
public:
	/// @brief Returns the single instance, constructing on first call.
	static JiebaSingleton& getInstance();

	/// @brief Do thread-safe segmentation and compute character end offsets.
	/// @param text The input text in UTF-8 encoding.
	/// @param charOffsets Output vector to hold character offsets.
	void getOffsets(const string& text, vector<int>& charOffsets);

private:
	JiebaSingleton(); ///< private ctor initializes base Jieba

	/// Disable copy and move
	JiebaSingleton(const JiebaSingleton&) = delete;
	JiebaSingleton& operator = (const JiebaSingleton&) = delete;
	JiebaSingleton(JiebaSingleton&&) = delete;
	JiebaSingleton& operator = (JiebaSingleton&&) = delete;

	std::mutex segMutex; ///< guards concurrent Cut() calls
};

extern "C" {

/// @brief Force singleton construction (load dicts, etc.) before any segmentation.
/// @return 0 on success, -1 on failure.
JIEBA_API int initJieba();

/// @brief Segment UTF-8 text into character offsets.
/// @return 0 on success, -1 on failure.
JIEBA_API int segmentOffsets(const char* text, int** charOffsets, int* outLen);

/// @brief Free memory allocated by segmentOffsets.
JIEBA_API void freeOffsets(int* ptr);

} // extern "C"

#endif // CPPJIEBA_DLL_H
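Review note: the C API declared above hands the caller a buffer allocated inside the library, so it has to be released through freeOffsets() rather than the caller's own allocator. The following is a rough sketch of a C++ caller that makes this ownership explicit. `demoSegmentOffsets` and `demoFreeOffsets` are stand-ins written only so the sketch compiles on its own. They mirror the declared signatures but are not the PR's implementation, and `segmentToVector` and `OffsetsDeleter` are likewise hypothetical.

```cpp
#include <cstdlib>
#include <cstring>
#include <memory>
#include <stdexcept>
#include <vector>

// Stand-in for segmentOffsets(): same signature, but it pretends every byte
// is its own word. NOT the PR's implementation.
int demoSegmentOffsets(const char* text, int** charOffsets, int* outLen) {
	if (!text || !charOffsets || !outLen) return -1;
	const int n = static_cast<int>(std::strlen(text));
	// Request at least one element so an empty result is never a null buffer.
	int* buf = static_cast<int*>(std::malloc(sizeof(int) * (n > 0 ? n : 1)));
	if (!buf) {
		*charOffsets = nullptr;
		*outLen = 0;
		return -1;
	}
	for (int i = 0; i < n; ++i) buf[i] = i + 1;
	*charOffsets = buf;
	*outLen = n;
	return 0;
}

// Stand-in for freeOffsets().
void demoFreeOffsets(int* ptr) { std::free(ptr); }

// Deleter that routes release through the library's own free function.
struct OffsetsDeleter {
	void operator()(int* p) const { demoFreeOffsets(p); }
};

// Copy the returned offsets into a vector and release the library buffer,
// including when the copy itself throws.
std::vector<int> segmentToVector(const char* text) {
	int* raw = nullptr;
	int len = 0;
	if (demoSegmentOffsets(text, &raw, &len) != 0) {
		throw std::runtime_error("segmentation failed");
	}
	std::unique_ptr<int, OffsetsDeleter> owner(raw);
	return std::vector<int>(raw, raw + len);
}
```

The stand-in also sets `*charOffsets` to null on failure and never passes a zero size to malloc. Whether the real segmentOffsets() should do the same for empty input is a question for the author.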
@@ -0,0 +1,55 @@
# A part of NonVisual Desktop Access (NVDA)
# Copyright (C) 2025 NV Access Limited, Wang Chong.
# This file may be used under the terms of the GNU General Public License, version 2 or later.
# For more details see: https://www.gnu.org/licenses/gpl-2.0.html

import typing  # noqa: E402
import os

Import(
	[
		"thirdPartyEnv",
		"sourceDir",
	]
)
thirdPartyEnv: Environment = thirdPartyEnv
env: Environment = typing.cast(Environment, thirdPartyEnv.Clone())

cppjiebaPath = Dir("#include/cppjieba")
cppjiebaSrcPath = cppjiebaPath.Dir("include/cppjieba")
cppjiebaDictPath = cppjiebaPath.Dir("dict")
outDir = sourceDir.Dir("cppjieba")
unitTestDictsDir = env.Dir("#tests/unit/cppjiebaDicts")
LimonpPath = cppjiebaPath.Dir("deps/limonp")  # cppjieba's dependency
LimonpSrcPath = LimonpPath.Dir("include/limonp")

env.Prepend(
	CPPPATH=[
		cppjiebaSrcPath,
		LimonpSrcPath.Dir(".."),
	]
)

sourceFiles = [
	"cppjieba.cpp",
	"cppjieba.def",
]

cppjiebaLib = env.SharedLibrary(target="cppjieba", source=sourceFiles)

if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()):  # ensure the dicts are installed only once, which also avoids an SCons warning
	env.Install(
		outDir.Dir("dicts"),
		[
			f
			for f in env.Glob(f"{cppjiebaDictPath}/*")
			if f.name
			not in (
				"README.md",
				"pos_dict",
			)
			and not f.name.endswith(".in")
		],
	)

Return("cppjiebaLib")