Skip to content

Conversation

@koppor
Copy link

@koppor koppor commented Jul 4, 2022

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22080
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@CLAassistant
Copy link

CLAassistant commented Jul 4, 2022

CLA assistant check
All committers have signed the CLA.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
  • icu4j/main/tests/charset/src/com/ibm/icu/dev/test/charset/TestCharSetRecognition.java is no longer changed in the branch
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@koppor koppor marked this pull request as draft July 5, 2022 06:55
- Introduce class CharsetRecog_ASCII and add it to the default detectors
- Refine TestCharsetDetector.java
  - needed to modify ISO-2022-JP test, because of "conflict" with ASCII: shifts added
- Fix casing of method "mungeInput()"
- Add comment to "& 0x00ff"
- Typo fix

Co-authored-by: Christoph <[email protected]>
Co-authored-by: Carl Christian Snethlage <[email protected]>
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java is different
  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@koppor koppor marked this pull request as ready for review July 6, 2022 22:02
@markusicu markusicu self-assigned this Jul 14, 2022
@markusicu markusicu requested a review from richgillam July 14, 2022 16:02
@richgillam
Copy link
Contributor

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.

@markusicu markusicu requested a review from FrankYFTang July 14, 2022 16:03
@markusicu
Copy link
Member

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.

Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...

return null;
} else {
// ASCII, because ALL bytes in the stream are <= 127.
// However, there could be some unicode (such as Hebrew) which also has this property.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm? It could be some unusual charset like UTF-7 or HZ, but those are "prohibited character encodings" in modern HTML and have generally fallen out of favor.

I also don't think that Hebrew is relevant here.

Seems like the confidence for ASCII should be high if all bytes are 00..7F.

@markusicu
Copy link
Member

FYI @srl295 @aheninger

Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?

@macchiati
Copy link
Member

macchiati commented Sep 10, 2022 via email

@srl295
Copy link
Member

srl295 commented Sep 10, 2022

FYI @srl295 @aheninger

Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?

ASCII is the same. US-ASCII is probably slightly more pedantic. The canonical id is ansi xj 1967 something.

So.. I think ASCII is fine probably more recognizable these days.

@koppor
Copy link
Author

koppor commented Sep 24, 2022

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.
Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...

Oh, OK, I understood the "we" in a wrong way. I would suggest to get the Java code finished and then I'll look around for a C++ expert having the time to work on this.

@koppor
Copy link
Author

koppor commented Sep 24, 2022

I committed the suggestions using GitHub's features. I did some other tweaks. The tests in "TestCharsetDector" successfully run locally. Not sure why. I think, there will be some other classes failing, so I'll wait for the CI.

After "we" sorted everything out, I'll squash into one commit and add a Co-authored-by for @markusicu

@koppor
Copy link
Author

koppor commented Nov 27, 2022

Note that I needed to replace ISO-2022-JP in the context of com.ibm.icu.dev.test.charsetdet.TestCharsetDetector#TestBufferOverflow by ASCII as the byte sequence (line 283) does not contain any non-7-bit characters. Think, it is no harm, because the shift state "at the start" is "a bad one".

@koppor
Copy link
Author

koppor commented Nov 27, 2022

Java code finished. Now, I would try to port it to the C implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants