ICU-22080 Add plain ASCII as an explicitly detected type #2127
base: main
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
- Introduce class CharsetRecog_ASCII and add it to the default detectors
- Refine TestCharsetDetector.java
  - needed to modify ISO-2022-JP test, because of "conflict" with ASCII: shifts added
- Fix casing of method "mungeInput()"
- Add comment to "& 0x00ff"
- Typo fix

Co-authored-by: Christoph <[email protected]>
Co-authored-by: Carl Christian Snethlage <[email protected]>
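The "conflict" between ISO-2022-JP and ASCII mentioned in the commit message can be illustrated with a small sketch. ISO-2022-JP is a 7-bit encoding that switches character sets with escape sequences (ESC $ B to enter JIS X 0208, ESC ( B to return to ASCII), so a sample without any escape sequence is indistinguishable from plain ASCII. The class and method names below are hypothetical and are not taken from this PR or from ICU.

```java
/** Hypothetical helper for illustration only; not part of ICU or of this PR. */
final class Iso2022EscapeSketch {
    private static final int ESC = 0x1b;

    private Iso2022EscapeSketch() {}

    /** True if every byte, read as unsigned, is in the range 00..7F. */
    static boolean isSevenBit(byte[] input) {
        for (byte b : input) {
            if ((b & 0x00ff) > 0x7f) {
                return false;
            }
        }
        return true;
    }

    /** True if the input contains at least one ESC (0x1B) byte. */
    static boolean containsEscape(byte[] input) {
        for (byte b : input) {
            if ((b & 0x00ff) == ESC) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, the byte sequence `1B 24 42 46 7C 4B 5C 1B 28 42` is both 7-bit and contains escapes, while the bytes of `"abc"` are 7-bit with no escapes. That difference is presumably why the ISO-2022-JP test data needed shifts once an ASCII recognizer was added.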
|
I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++. |
Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR... |
icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java
icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java
icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java
    return null;
} else {
    // ASCII, because ALL bytes in the stream are <= 127.
    // However, there could be some unicode (such as Hebrew) which also has this property.
Hm? It could be some unusual charset like UTF-7 or HZ, but those are "prohibited character encodings" in modern HTML and have generally fallen out of favor.
I also don't think that Hebrew is relevant here.
Seems like the confidence for ASCII should be high if all bytes are 00..7F.
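A minimal sketch of the check described in this review comment, assuming a standalone helper rather than the actual CharsetRecog_ASCII class from the PR. The class name, method name, and the confidence values 100 and 0 are assumptions chosen for illustration.

```java
/** Hypothetical standalone helper; not the CharsetRecog_ASCII class from this PR. */
final class AsciiConfidenceSketch {
    private AsciiConfidenceSketch() {}

    /**
     * Returns an illustrative confidence (0..100) that the input is plain ASCII.
     * The values 100 and 0 are assumptions made for this sketch.
     */
    static int confidence(byte[] input) {
        for (byte b : input) {
            // Java bytes are signed; mask to obtain the unsigned value 0..255.
            int unsigned = b & 0x00ff;
            if (unsigned > 0x7f) {
                return 0; // a byte outside 00..7F rules out ASCII
            }
        }
        return 100; // every byte is in 00..7F
    }
}
```

Note that this sketch also reports high confidence for empty input and for 7-bit encodings such as UTF-7 or HZ. A real recognizer would need to decide how to handle those cases.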
icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java
icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java
|
FYI @srl295 @aheninger Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic? |
|
I think ASCII is sufficiently unambiguous.
(Email reply to Markus Scherer's comment of Fri, Sep 9, 2022, 5:23 PM.)
|
ASCII is the same. US-ASCII is probably slightly more pedantic. The canonical ID is something like ANSI_X3.4-1968. So I think ASCII is fine, and probably more recognizable these days. |
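For comparison, the JDK treats these names as aliases of one charset. The sketch below uses only `java.nio.charset.Charset` from the standard library. It says nothing about which name ICU's detector returns, and the class name is hypothetical.

```java
import java.nio.charset.Charset;

/** Hypothetical helper showing how the JDK resolves ASCII-related labels. */
final class AsciiNameSketch {
    private AsciiNameSketch() {}

    /** Returns the JDK's canonical charset name for the given label or alias. */
    static String canonicalName(String label) {
        return Charset.forName(label).name();
    }
}
```

In the JDK, `canonicalName("ASCII")`, `canonicalName("US-ASCII")`, and `canonicalName("ANSI_X3.4-1968")` all resolve to the same charset, whose canonical name is `US-ASCII`.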
Oh, OK, I misunderstood the "we". I suggest getting the Java code finished first; then I'll look around for a C++ expert who has time to work on this. |
…I.java Co-authored-by: Markus Scherer <[email protected]>
Co-authored-by: Markus Scherer <[email protected]>
…I.java Co-authored-by: Markus Scherer <[email protected]>
…I.java Co-authored-by: Markus Scherer <[email protected]>
…CharsetDetector.java Co-authored-by: Markus Scherer <[email protected]>
…CharsetDetector.java Co-authored-by: Markus Scherer <[email protected]>
…CharsetDetector.java Co-authored-by: Markus Scherer <[email protected]>
…CharsetDetector.java Co-authored-by: Markus Scherer <[email protected]>
|
I committed the suggestions using GitHub's features and made some other tweaks. The tests in "TestCharsetDetector" run successfully locally. Not sure why. I think some other classes will fail, so I'll wait for the CI. After "we" have sorted everything out, I'll squash into one commit and add a |
- Fix NPE
- Add space for a more readable exception output
- 7-bit characters only: ASCII instead of ISO-8859-1
- Exclude ASCII for "exotic" charset tests
|
Note that I needed to replace |
|
The Java code is finished. Next, I will try to port it to the C implementation. |
Checklist