ICU-22080 Add plain ASCII as an explicitly detected type #2127

koppor · 2022-07-04T20:02:27Z

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22080
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

CLAassistant · 2022-07-04T20:02:33Z

All committers have signed the CLA.

jira-pull-request-webhook · 2022-07-04T21:55:07Z

Notice: the branch changed across the force-push!

icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
icu4j/main/tests/charset/src/com/ibm/icu/dev/test/charset/TestCharSetRecognition.java is no longer changed in the branch
icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2022-07-04T23:22:00Z

Notice: the branch changed across the force-push!

icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

- Introduce class CharsetRecog_ASCII and add it to the default detectors
- Refine TestCharsetDetector.java
- needed to modify ISO-2022-JP test, because of "conflict" with ASCII: shifts added
- Fix casing of method "mungeInput()"
- Add comment to "& 0x00ff"
- Typo fix

Co-authored-by: Christoph <[email protected]>
Co-authored-by: Carl Christian Snethlage <[email protected]>
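As background for the "& 0x00ff" item: Java's `byte` is signed, so a raw byte such as 0xE9 reads as a negative number unless it is masked. The following is a minimal sketch of that idiom only; the class and method names are hypothetical and are not taken from the PR.

```java
public final class ByteMaskSketch {
    /** Converts a signed Java byte to its unsigned value in the range 0..255. */
    public static int unsigned(byte b) {
        // Without the mask, (byte) 0xE9 would be sign-extended to -23.
        return b & 0x00ff;
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xE9;
        System.out.println(raw);           // signed view of the byte
        System.out.println(unsigned(raw)); // unsigned view of the same byte
    }
}
```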

jira-pull-request-webhook · 2022-07-06T20:56:43Z

Notice: the branch changed across the force-push!

icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java is different
icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is different
icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

richgillam · 2022-07-14T16:03:07Z

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.

markusicu · 2022-09-09T23:44:51Z

> I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.

Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...

markusicu · 2022-09-10T00:02:10Z

icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java

+            return null;
+        } else {
+            // ASCII, because ALL bytes in the stream are <= 127.
+            // However, there could be some unicode (such as Hebrew) which also has this property.


Hm? It could be some unusual charset like UTF-7 or HZ, but those are "prohibited character encodings" in modern HTML and have generally fallen out of favor.

I also don't think that Hebrew is relevant here.

Seems like the confidence for ASCII should be high if all bytes are 00..7F.
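For illustration, the check suggested here (every byte in 00..7F means high confidence) might look roughly like the sketch below. It is standalone code, not the contents of CharsetRecog_ASCII.java: the class and method names are hypothetical, and 95 is simply the value that the later commits in this PR settled on.

```java
import java.nio.charset.StandardCharsets;

public final class AsciiCheckSketch {
    /** Assumed confidence for pure 7-bit input (the PR moved from 35 to 80 to 95). */
    public static final int HIGH_CONFIDENCE = 95;

    /** Returns HIGH_CONFIDENCE if every byte is in 00..7F, otherwise 0. */
    public static int asciiConfidence(byte[] input) {
        for (byte b : input) {
            // Mask to the unsigned value, because Java bytes are signed.
            if ((b & 0xFF) > 0x7F) {
                return 0;
            }
        }
        // Note: empty input also ends up here; a real recognizer may want to treat it differently.
        return HIGH_CONFIDENCE;
    }

    public static void main(String[] args) {
        byte[] plain = "Hello, world".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(asciiConfidence(plain));
        System.out.println(asciiConfidence(latin1));
    }
}
```

A 7-bit stateful encoding such as UTF-7 or HZ would also pass this check, which is why the result is a confidence value rather than a certainty.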

markusicu · 2022-09-10T00:23:01Z

FYI @srl295 @aheninger

Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?

macchiati · 2022-09-10T01:27:54Z

I think ASCII is sufficiently unambiguous.


srl295 · 2022-09-10T02:28:42Z

> FYI @srl295 @aheninger
>
> Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?

ASCII is the same. US-ASCII is probably slightly more pedantic. The canonical ID is ANSI_X3.4-1968 or something like that.

So I think ASCII is fine, and probably more recognizable these days.
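As a side note on the naming question, the labels discussed here are all aliases of one charset in the JDK. The sketch below resolves a few of them with `java.nio.charset.Charset`; it only shows how the JDK names the charset and says nothing about what ICU's detector should return. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

public final class AsciiNameSketch {
    /** Resolves a charset label to the canonical name used by the JDK. */
    public static String canonicalName(String label) {
        return Charset.forName(label).name();
    }

    public static void main(String[] args) {
        // Each label is looked up independently; all three are ASCII aliases in the JDK.
        for (String label : new String[] {"ASCII", "US-ASCII", "ANSI_X3.4-1968"}) {
            System.out.println(label + " -> " + canonicalName(label));
        }
    }
}
```

In the JDK, each of these labels resolves to the canonical name "US-ASCII".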

koppor · 2022-09-24T07:15:50Z

> I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.

> Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...

Oh, OK, I misunderstood the "we". I suggest finishing the Java code first; then I'll look for a C++ expert who has time to work on this.

koppor · 2022-09-24T07:41:36Z

I committed the suggestions using GitHub's features and made some other tweaks. The tests in "TestCharsetDetector" run successfully locally. Not sure why. I think some other classes will fail, so I'll wait for the CI.

After "we" have sorted everything out, I'll squash everything into one commit and add a Co-authored-by for @markusicu.

koppor · 2022-11-27T21:02:50Z

Note that I needed to replace ISO-2022-JP with ASCII as the expected result in com.ibm.icu.dev.test.charsetdet.TestCharsetDetector#TestBufferOverflow, because the byte sequence (line 283) does not contain any non-7-bit characters. I think this does no harm, because the shift state "at the start" is "a bad one".
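To illustrate why ISO-2022-JP data can "conflict" with an ASCII check: ISO-2022-JP is a 7-bit encoding that switches character sets with escape sequences, so even Japanese text contains no byte above 7F. The sketch below uses a hand-written sample (not the test data from this PR), and the class, constant, and method names are hypothetical.

```java
public final class SevenBitSketch {
    static final byte ESC = 0x1B;

    /**
     * "日本語" in ISO-2022-JP: ESC $ B switches to JIS X 0208,
     * three two-byte characters follow, and ESC ( B switches back to ASCII.
     */
    public static final byte[] ISO_2022_JP_SAMPLE = {
        ESC, '$', 'B', 0x46, 0x7C, 0x4B, 0x5C, 0x38, 0x6C, ESC, '(', 'B'
    };

    /** True if no byte has its high bit set. */
    public static boolean isSevenBit(byte[] input) {
        for (byte b : input) {
            if ((b & 0xFF) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** True if the input contains an ESC byte, which hints at an ISO-2022 family encoding. */
    public static boolean containsEscape(byte[] input) {
        for (byte b : input) {
            if (b == ESC) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSevenBit(ISO_2022_JP_SAMPLE));
        System.out.println(containsEscape(ISO_2022_JP_SAMPLE));
    }
}
```

Both checks succeed for the sample, so a pure "all bytes are 7-bit" test cannot tell the two encodings apart; only the escape sequences (the "shifts") do.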

koppor · 2022-11-27T21:20:00Z

The Java code is finished. Next, I will try to port it to the C implementation.

koppor mentioned this pull request Jul 4, 2022

Fix charset detection with utf16 and others JabRef/jabref#8947

Merged

6 tasks

koppor force-pushed the addASciiTest branch from 8470c4b to d8159ef Compare July 4, 2022 21:55

koppor force-pushed the addASciiTest branch from d8159ef to b1d0fa4 Compare July 4, 2022 23:21

koppor marked this pull request as draft July 5, 2022 06:55

koppor force-pushed the addASciiTest branch from b1d0fa4 to c661193 Compare July 6, 2022 20:56

koppor marked this pull request as ready for review July 6, 2022 22:02

markusicu self-assigned this Jul 14, 2022

markusicu requested a review from richgillam July 14, 2022 16:02

markusicu requested a review from FrankYFTang July 14, 2022 16:03

markusicu reviewed Sep 10, 2022

markusicu added the waiting-on-author label Sep 14, 2022

koppor and others added 8 commits September 24, 2022 09:16

Update icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCI…

9901afd

…I.java Co-authored-by: Markus Scherer <[email protected]>

Update icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java

661e4ac

Co-authored-by: Markus Scherer <[email protected]>

Update icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCI…

7762195

…I.java Co-authored-by: Markus Scherer <[email protected]>

Update icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCI…

9754e59

…I.java Co-authored-by: Markus Scherer <[email protected]>

Update icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/Test…

2f1ae4d

…CharsetDetector.java Co-authored-by: Markus Scherer <[email protected]>

Update icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/Test…

5c47cf2

…CharsetDetector.java Co-authored-by: Markus Scherer <[email protected]>

Update icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/Test…

54d7440

…CharsetDetector.java Co-authored-by: Markus Scherer <[email protected]>

Update icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/Test…

5408a22

…CharsetDetector.java Co-authored-by: Markus Scherer <[email protected]>

koppor added 2 commits September 24, 2022 09:33

Adress comments

c6ecf20

Change confidence of ASCII from 35 to 80

79e7eaa

koppor added 6 commits October 27, 2022 22:56

80 -> 81 to trigger build

8da7490

Set confidence to 95 and fix typo

47b870c

Merge remote-tracking branch 'upstream/main' into addASciiTest

af0ffcc

Refine TestCharsetDetector

0b19cd4

- Fix NPE - Add space for a more readable exception output

Use ISO-8859-1 pattern matching magic for detecing the language of th…

07472c4

…e ASCII text

Adapt tests to prefer ASCII

8bdce8c

- 7-bytes-characters only: ASCII instead of ISO-8859-1 - Exclude ASCII for "exotic" charset tests

Siedlerchr mentioned this pull request Dec 25, 2022

BOM now missing at beginning of bibliography file -- causes JabRef to not recognize existing library JabRef/jabref#9496

Open

2 tasks

koppor mentioned this pull request Jun 15, 2025

Gradle build updates JabRef/jabref#13319

Merged

2 tasks
