Add support for automatic ICORE conference ranking lookup [#13476] #13699
jablib/src/main/java/org/jabref/logic/util/strings/StringSimilarity.java
| double exactMatch = 1.0; | ||
| double similarity = similarityChecker.similarity(a, b); | ||
|
|
||
| assertTrue(similarity >= EPSILON_SIMILARITY && similarity < exactMatch); |
Using assertTrue with a boolean condition instead of asserting the actual contents. Should compare the actual similarity value with expected bounds using assertEquals.
|
For the last few days, I've just been browsing the code, reading the docs, and interacting with the application on my local machine. Since this is the first time I'm interacting with the JabRef ecosystem, and ICORE by extension, I have some questions regarding the app itself and the feature's use-case. I'll post each one as a separate comment. Apologies if some of these are too obvious. |
|
The issue post mentions: "When a BibTeX entry includes a conference title". What does "conference title" here refer to?

A. Following the Getting Started guide on the app, when you add a new entry via I'm guessing the answer is any entry regardless of type, but I still have to ask to be sure.

B. If I'm querying all entries regardless of type, do I have to search only the Title field? There are entries where the conference name isn't in the title field. Case in point: I used the Web Search feature in the app to lookup the conference mentioned in the issue post: "ACIS Conference on Software Engineering Research, Management and Applications". I selected and imported the following entry: https://ieeexplore.ieee.org/document/9509045. It gets imported as an I'm assuming that I should be looking for the conference title and its acronym in all of the fields. Is there some sort of a standard way here regarding how entries are imported into JabRef so that I only have to look for the title inside a subset of fields rather than all of them?
|
The ICORE ranking data and its presentation.

A. Since each BibTeX entry is annotated with a year, I'm assuming that the user wants to see the ICORE ranking for that particular year. Again, very obvious, but I want to confirm. This would also mean that if a conference was added to ICORE later on, any entries from previous years should have a "Not Available" or "N/A" in the ICORE rank field. Or do we use the mismatched ranking as a fallback?

B. Since ICORE rankings are released roughly every 2-3 years, what about some weird edge cases within that time period? Say a conference was there for 2014, then it wasn't in the 2017 list, but then it was added again in 2018. Do we worry about entries for 2017 and give them an "N/A" rank? What if a conference's rank changed during that time period? Of course, if the answer to the previous question was that we do not use a fallback and stick to the exact date, then none of these questions matter.

C. The exported ICORE ranking data provided from the website (https://portal.core.edu.au/conf-ranks/) does not contain a header row in the CSV file. This isn't a big deal as the important bits can be made out quite clearly (all except one, that is). Each line contains 9 columns: ID, Title, Acronym, Source, Rank, UNKNOWN, FoR-1, FoR-2, FoR-3. The UNKNOWN field in column 6 is a "Yes" or "No" for every line. It is always present, but I can't seem to figure what it corresponds to (it isn't the Note or DBLP column from the website). Consequently, I cannot determine whether it is important. If you know what it is or if it is relevant to the feature, please let me know.
|
@koppor can you please help answer the questions I've posted above? |
Oh, did you ever read about bibtex? - it is |
|
Always use the latest ranking. We are not interested in historic data. - Only one CSV should be used. |
Web site has:
Which is
Example CSV line (NOTE: It would be good if this was included in your question to make it self-contained) I cannot quickly see it, but we need "Title", "Acronym" and "Rank" only. The other columns can be omitted, can't they? |
|
Please make your question numbers unique. "A" is used twice, isn't it?
No, always the latest year. |
Always use the latest CSV - there is one export. This CSV should be used. |
For |
I hope I got all the questions. I am a bit confused since the questions are all labeled with "A", so I could have missed something. |
|
@TheYorouzoya I wonder if you have seen the "Helpful resources" section at the issue description (#13476)
It links to #13512 Did you know that one can click on "Files changed"?
You are routed to https://github.com/JabRef/jabref/pull/13512/files You then might have seen
I know that code reading is not easy; but it is an essential skill to produce maintainable code. |
Starting from the issue post
I wanted to see for myself what the feature might look like inside of JabRef to the user. So I downloaded the current version and searched for the conference mentioned in the issue post. I imported one of the entries and saw that the conference name showed up inside the "Journal" field
Even though the manual says that the optional field for a venue exists
So I, then, booted up the build on my local repo, i.e., the
even though there is a clearly indicated "Venue" field which is empty
Do you see why I would ask such an obvious question after this? |
|
@TheYorouzoya Thank you for your patience. It's all voluntary work here. It needs time to explain the domain of scientific references. Maybe you can be a guest a little longer here and improve our documentation at https://docs.jabref.org. Currently we see guests being here just a short time, doing a task, and then leave. I always hope that a guest will make the place better as a whole; especially because all guests seem to be learning software engineering and not just programming. |
Data sourced from ICORE website here: https://portal.core.edu.au/conf-ranks/ to enable ICORE rank lookups. As discussed here: JabRef#13699 (comment), only the latest data from ICORE is to be used. At this time, it is the ICORE2023 ranking data. Part of JabRef#13476
Hey @TheYorouzoya - That is not needed. You can just use the labels to bundle questions under their respective contexts like you do, just add numbering to them (like A1, A2, etc.) so that they can be specifically and easily referred to when answering. |
|
I am more used to Gitter (Matrix) chat for a bulk of questions 😅. Sorry for that! I have to confess that I did not check the terms properly while writing. I used "venue" as a scientist indicating a conference. And I did not check whether BibLaTeX has some "definition" of venue. A "venue" meant in the issue is some I also meant journal articles, which are defined by I hope, I could answer your question now and you are unblocked to move forward. |
- Append a header row to resources/icore/ICORE2023.csv
- Add ConferenceEntry record to represent ICORE conference data
- Add ConferenceRepository to load conference data and allow conference lookups using an acronym or a bookTitle with fuzzy match as a fallback
- Add utility class to extract an acronym from a bookTitle
- Add tests

Part of JabRef#13476
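The lookup order described in this commit (acronym first, then the book title, then a fuzzy match as a fallback) can be sketched roughly as follows. This is an illustration only, not the code from the PR: the class names `ConferenceEntrySketch` and `ConferenceRepositorySketch`, the record fields, the lowercase lookup keys, and the hand-rolled Levenshtein similarity are all assumptions, and the 0.9 threshold is taken from the PR description. CSV parsing is left out, so the entries are passed in directly.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the ConferenceEntry record; the field set is an assumption.
record ConferenceEntrySketch(String id, String title, String acronym, String rank) {
}

// Hypothetical stand-in for ConferenceRepository; not the actual JabRef implementation.
final class ConferenceRepositorySketch {
    // Threshold taken from the PR description ("fuzzy match fallback of 90% similarity").
    private static final double FUZZY_THRESHOLD = 0.9;

    private final Map<String, ConferenceEntrySketch> byAcronym = new HashMap<>();
    private final Map<String, ConferenceEntrySketch> byTitle = new HashMap<>();

    ConferenceRepositorySketch(List<ConferenceEntrySketch> entries) {
        for (ConferenceEntrySketch entry : entries) {
            byAcronym.put(key(entry.acronym()), entry);
            byTitle.put(key(entry.title()), entry);
        }
    }

    Optional<ConferenceEntrySketch> findByAcronym(String acronym) {
        return Optional.ofNullable(byAcronym.get(key(acronym)));
    }

    Optional<ConferenceEntrySketch> findByBookTitle(String bookTitle) {
        String query = key(bookTitle);
        ConferenceEntrySketch exact = byTitle.get(query);
        if (exact != null) {
            return Optional.of(exact);
        }
        // Fuzzy fallback: keep the best-scoring title strictly above the threshold.
        ConferenceEntrySketch best = null;
        double bestScore = FUZZY_THRESHOLD;
        for (Map.Entry<String, ConferenceEntrySketch> candidate : byTitle.entrySet()) {
            double score = similarity(query, candidate.getKey());
            if (score > bestScore) {
                bestScore = score;
                best = candidate.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    // Assumption: lookups are case-insensitive and ignore surrounding whitespace.
    private static String key(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    // Levenshtein distance scaled to [0, 1], where 1.0 means identical strings.
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / longer;
    }

    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(Math.min(current[j - 1] + 1, previous[j] + 1), previous[j - 1] + cost);
            }
            previous = current;
        }
        return previous[b.length()];
    }
}
```

With a made-up entry such as `new ConferenceEntrySketch("0", "Example Conference on Software Testing", "ECST", "A")`, a query with a single typo in the title still resolves through the fuzzy fallback, while an unrelated title returns an empty `Optional`.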
| // A slight modification of: https://stackoverflow.com/a/17759264 | ||
| private static final Pattern PATTERN = Pattern.compile("\\(([^()]*)\\)"); | ||
|
|
||
| public static Optional<String> extract(String input) { |
Method lacks input validation for null parameter which could lead to NullPointerException. While Optional return is good, the input should be validated.
@NonNull jspecify annotation is OK
Please add JavaDoc.
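For reference, the behaviour under review can be sketched as below. The regex is the one shown in the diff; the class name `AcronymExtractorSketch` is hypothetical, and `Objects.requireNonNull` stands in for the jspecify `@NonNull` annotation only so that the example compiles without extra dependencies. Trimming the match and returning empty for blank content are assumptions based on the test name `extractReturnsEmptyforEmptyParentheses`.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the class from the PR.
final class AcronymExtractorSketch {
    // Pattern from the diff: a parenthesised group that contains no further parentheses.
    private static final Pattern PATTERN = Pattern.compile("\\(([^()]*)\\)");

    private AcronymExtractorSketch() {
    }

    /**
     * Extracts the text inside the first innermost pair of parentheses.
     *
     * @param input the string to search, must not be null
     * @return the trimmed content of the first match, or an empty Optional
     *         if there is no match or the match is blank
     */
    static Optional<String> extract(String input) {
        Objects.requireNonNull(input, "input must not be null");
        Matcher matcher = PATTERN.matcher(input);
        if (matcher.find()) {
            String acronym = matcher.group(1).trim();
            if (!acronym.isEmpty()) {
                return Optional.of(acronym);
            }
        }
        return Optional.empty();
    }
}
```

Because the character class excludes parentheses, an outer group that contains a nested group cannot match, so the first successful match is always an innermost pair.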
| String acronym, | ||
| String rank | ||
| ) { | ||
| private final static String URL_PREFIX = "https://portal.core.edu.au/conf-ranks/"; |
Incorrect order of modifiers. According to Java conventions and effective Java principles, it should be 'private static final' instead of 'private final static'.
Will fix that in the next commit.
| } | ||
|
|
||
| @Test | ||
| void extractReturnsEmptyforEmptyParentheses() { |
Method name contains a typo: 'forEmptyParentheses' should be 'ForEmptyParentheses' to maintain consistent camelCase naming convention in test methods.
Will fix that in the next commit.
|
|
||
| @Test | ||
| void getConferenceFromBookTitleReturnsConferenceForFuzzyMatchAboveThreshold() { | ||
| // String similarity > 0.9 |
Comment merely states what can be derived from the code and test name, not providing additional information about reasoning or implementation details.
Thank you! I'll work on the GUI side next. I do have some questions there, but I'll post those once I'm done looking around the code a bit more. |
Also, here is a link to our gitter chat. |
- Add ICORERankingEditor and ICORERankingEditorViewModel classes (inspired by other editors and their view models) to the GUI package.
- Add ICORERankingEditor.fxml for the editor's layout.
- Add field creation logic to FieldEditors#getForField.
- Update jablib/module-info to export the icore logic and model packages for use on the GUI side.
- Add ICORE rank field to FieldFactory#getDefaultGeneralFields and update preferences migrations as per JabRef#13512 (comment).
- Fix typos in ConferenceAcronymExtractorTest.
- Fix order of modifiers in ConferenceEntry.
- Update CHANGELOG.

Part of JabRef#13476
|
@koppor I was just about to push an updated matching algorithm before heading to sleep. I was doing some finishing touches here. Should I push it now? Or do I do it on another issue/PR later on? |
No rush - then we wait to merge the PR. (The alternative would be to merge this as is and have you base your new commits on the latest "main" - or create a [magic merge commit](https://github.com/koppor/magic-merge-commit/)) |
…ouzoya/jabref into add-ICORE-ranking-support
|
I've improved the search algorithm to include some of the stuff I mentioned in my last comment. This new version finds 23/31 matches from this test. The 8 failing tests all fail for only two reasons: either the data has too much noise with jumbled up conference titles or the conference title/acronym has been changed in the latest data (there is no matching entry for it).

The New Algorithm

1. Generate acronym candidates

Firstly, to deal with the Group A offenders, i.e., acronyms inside parentheses with other text like The idea is to define a set of delimiters, mark all the positions of the delimiters in the string, and generate all substrings between the computed bounds. The above example of Note that simply splitting on delimiters would not work in our favor since the acronyms themselves can contain those delimiters within them. Also, It is important here that a length-based ordering is maintained in the final result so as to avoid looking up composite acronyms like To further trim down the number of generated candidates, we also pass in a Substrings are also delimiter-trimmed since no acronym starts with a delimiter like

Since the number of substrings can still blow up pretty quickly, we have a hard cap of 50 candidates. As soon as we hit that number, we stop the generation and return the candidates collected so far. Overall, the probability of a false positive is quite low since we only split on a set of delimiters rather than in the middle of strings.

2. Normalize input

The main obstacle hindering Group B matches was the abundance of noise in the input. So we need a way to trim as much out as we can without hurting our odds for a good match. Take this for example,

I've classified noise into the following categories:

Note that none of the above can contain things which are found in the ICORE conference title data. I've implemented a normalizer that strips away all the noise defined above and smashes the input into one long string composed only of letters and digits. So our example of While loading the conference data, we also fed all the conference titles through the same normalizer to get a We also do an acronym lookup after normalizing which can also catch cases like

3. Introduce another metric for matching

Normalization is just a preliminary step to improving the odds of matching titles. But even with much of the noise removed, there is still enough clutter left in the resulting string which can throw off our Levenshtein matching. Since it is a reasonable assumption to make that we will often find conference titles as a substring inside the query, matching based on substrings should improve our odds quite a bit.

Following this, I've introduced a Longest Common Substring Similarity rating similar to the Levenshtein Similarity. We compute the length of the longest common substring between the query and a conference title and divide the result by the length of the shorter string. This gives us a value between 0 and 1 which tells us how much of the shorter string exists as-is in the longer one. If a conference title exists as a substring inside the query, we'll get back a value of 1 for an exact match.

To avoid overfitting while matching the query against conference titles, we only compute the similarity values when the normalized query string is either equal to or longer than the conference title in length. This prevents incomplete queries like We also use the LCS similarity to compute a combined metric along with Levenshtein Similarity as follows:

These ratios and the threshold aren't exactly "grounded" in hard data, but they're more of an educated guess based on certain assumptions and fiddling with the various inputs to find a pattern. The core assumption here is that if a user is intending to look up a conference's ICORE rank, the probability that the correct conference title is embedded somewhere in the

We prioritize Levenshtein so that edit distances do matter more than substrings, but this combination should give us a reasonable metric for matching. For example, for our noisy input with a misspelling in the conference title:

normalized down to: Matches with the correct conference title of: which gets normalized to: with Levenshtein similarity of There is still a possibility of false positives here, but I believe the tradeoff is worth it.
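To make the Longest Common Substring Similarity from step 3 concrete, here is a minimal sketch: the length of the longest common substring divided by the length of the shorter string. The class and method names are hypothetical. The `normalize` method covers only the final "letters and digits" squashing from step 2, and lowercasing there is an assumption. The noise categories and the weights of the combined metric are not reproduced, since this comment does not spell them out.

```java
// Hypothetical sketch of the LCS similarity described above; not the code from the PR.
final class LcsSimilaritySketch {

    private LcsSimilaritySketch() {
    }

    // Keeps only letters and digits. Lowercasing is an assumption of this sketch.
    static String normalize(String input) {
        StringBuilder result = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                result.append(Character.toLowerCase(c));
            }
        }
        return result.toString();
    }

    // Classic dynamic programme: current[j] is the length of the common
    // substring ending at a[i - 1] and b[j - 1].
    static int longestCommonSubstringLength(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    current[j] = previous[j - 1] + 1;
                    best = Math.max(best, current[j]);
                }
            }
            previous = current;
        }
        return best;
    }

    // Value in [0, 1]: the share of the shorter string that appears verbatim in the longer one.
    static double lcsSimilarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return 0.0;
        }
        return (double) longestCommonSubstringLength(a, b) / shorter;
    }
}
```

A title that is fully contained in the query scores 1.0 regardless of how much noise surrounds it, which is why the comment pairs this metric with Levenshtein similarity and with the length guard (query at least as long as the title) rather than using it alone.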
|
It seems I've bungled up a test when I changed the expected value to be |
|
Also seems like my JavaDoc is malformed for the |
| static Stream<Arguments> generateAcronymCandidateTestData() { | ||
| return Stream.of( | ||
| // Edge cases | ||
| Arguments.of("", 2, Collections.emptySet()), // Empty string returns empty set |
Uses Collections.emptySet() instead of the modern Java Set.of() for creating empty sets. Modern Java practices prefer Set.of() for better readability and consistency.
@TheYorouzoya please push your changes if you've taken this suggestion ^, as I think your PR is near completion, so two of us can approve and concretely plan follow-ups if any.
P.S. thank you so much for the detailed write-ups!
Oh, I was waiting for any feedback regarding the updated algorithm before pushing in case I needed to make some changes there as well. I'll update the test with Set.of() and push.
| import static org.junit.jupiter.api.Assertions.assertEquals; | ||
|
|
||
| public class ConferenceUtilsTest { | ||
| @ParameterizedTest(name = "Extract from \"{0}\" should return \"{1}\"") |
The test method uses @ParameterizedTest with a display name pattern. According to JabRef practices, method names should be comprehensive enough without @DisplayName or name patterns.
I think it's more than "just" a decision - it's documentation of how things work. Maybe, |
subhramit
left a comment
Minor cosmetic comments
| public static Set<String> generateAcronymCandidates(@NonNull String input, int CUTOFF) { | ||
| if (input.isEmpty() || CUTOFF <= 0) { |
capital letters are used for constants, please use lowercase here (and adjust the javadoc as well)
| */ | ||
| public static Set<String> generateAcronymCandidates(@NonNull String input, int CUTOFF) { | ||
| if (input.isEmpty() || CUTOFF <= 0) { | ||
| return Collections.emptySet(); |
Set.of and remove the collections import
| return Collections.emptySet(); | ||
| } | ||
|
|
||
| final int MAX_CANDIDATES_THRESHOLD = 50; |
move to class level declarations
| // Collect delimiter boundaries: -1 (start), every delimiter index, and input length (end). | ||
| bounds.add(-1); |
Extract as class-level constant DELIMITER_START
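Taken together, the review suggestions on `generateAcronymCandidates` (lowercase parameter name, `Set.of()`, class-level constants) might look roughly like the sketch below. Only the signature, the early return, the 50-candidate cap, and the bounds list come from the diff and the algorithm write-up. The delimiter set, the reading of `cutoff` as a maximum candidate length, and the longest-first ordering are assumptions made for illustration, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the ConferenceUtils code from the PR.
final class AcronymCandidatesSketch {
    private static final int MAX_CANDIDATES_THRESHOLD = 50;
    private static final int DELIMITER_START = -1;
    // Assumption: the actual delimiter set is not shown in this thread.
    private static final String DELIMITERS = " -/_.,:;'";

    private AcronymCandidatesSketch() {
    }

    static Set<String> generateAcronymCandidates(String input, int cutoff) {
        if (input.isEmpty() || cutoff <= 0) {
            return Set.of();
        }

        // Collect delimiter boundaries: -1 (start), every delimiter index, and input length (end).
        List<Integer> bounds = new ArrayList<>();
        bounds.add(DELIMITER_START);
        for (int i = 0; i < input.length(); i++) {
            if (DELIMITERS.indexOf(input.charAt(i)) >= 0) {
                bounds.add(i);
            }
        }
        bounds.add(input.length());

        // Generate every substring between two boundaries, stopping at the hard cap.
        List<String> candidates = new ArrayList<>();
        outer:
        for (int i = 0; i < bounds.size() - 1; i++) {
            for (int j = i + 1; j < bounds.size(); j++) {
                String candidate = trimDelimiters(input.substring(bounds.get(i) + 1, bounds.get(j)));
                // Assumption: cutoff is treated as the maximum candidate length.
                if (candidate.isEmpty() || candidate.length() > cutoff || candidates.contains(candidate)) {
                    continue;
                }
                candidates.add(candidate);
                if (candidates.size() >= MAX_CANDIDATES_THRESHOLD) {
                    break outer;
                }
            }
        }

        // Assumption: longest candidates first; LinkedHashSet preserves that order.
        candidates.sort(Comparator.comparingInt(String::length).reversed());
        return new LinkedHashSet<>(candidates);
    }

    private static String trimDelimiters(String value) {
        int start = 0;
        int end = value.length();
        while (start < end && DELIMITERS.indexOf(value.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && DELIMITERS.indexOf(value.charAt(end - 1)) >= 0) {
            end--;
        }
        return value.substring(start, end);
    }
}
```

Under these assumptions, calling it with `"SIGCOMM 2023"` and a cutoff of 10 yields `SIGCOMM` followed by `2023`, while the full string is dropped for exceeding the cutoff.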
| import org.jspecify.annotations.NonNull; | ||
|
|
||
| public class ConferenceUtils { | ||
| // Regex that'll extract the string within the first deepest set of parentheses |
Comment is stating the obvious which can be derived from the code and regex pattern itself. It doesn't provide additional value or reasoning behind the implementation.
subhramit
left a comment
This was some fine work, lgtm :)
Documentation can be added as a follow-up.
|
@TheYorouzoya You have nice screenshots on the Icore ranking field - and a step-by-step guide. Can you add this to https://github.com/JabRef/user-documentation/tree/main/en/advanced/entryeditor? (It is OK if the screenshots do not match the color theme of the current ones - this is future work to fix it) |
* upstream/main: (54 commits) Split relativizeSymlinks parameterized tests in separate tests (#13782) Update the search syntax highlight for web search (#13801) Chore(deps): Bump ai.djl:bom from 0.33.0 to 0.34.0 in /versions (#13833) Fix typos in CHANGELOG.md (#13826) Chore(deps): Bump com.konghq:unirest-modules-gson in /versions (#13831) Chore(deps): Bump org.gradlex:extra-java-module-info in /build-logic (#13830) Chore(deps): Bump org.apache.logging.log4j:log4j-to-slf4j in /versions (#13832) Chore(deps): Bump io.zonky.test.postgres:embedded-postgres-binaries-bom (#13834) Chore(deps): Bump jablib/src/main/resources/csl-locales (#13829) Chore(deps): Bump jablib/src/main/resources/csl-styles (#13827) Chore(deps): Bump jablib/src/main/abbrv.jabref.org (#13828) add: CAYW endpoint formats (#13785) New Crowdin updates (#13823) chore(deps): update dependency org.kohsuke:github-api to v2.0-rc.5 (#13822) Add support for automatic ICORE conference ranking lookup [#13476] (#13699) New Crowdin updates (#13820) Initialize search bar auto-completion with real database context (no tab switch needed) (#13816) Fixes #13274: Allow cygwin-paths on Windows (#13297) Refine "REDACTED" replacement of API key value in web fetcher search URL (#13814) changed ISSNCleanup into NormalizeIssn, refactored respective tests #13748 (#13767) ...













Closes #13476: Add support for automatic ICORE conference ranking lookup
This PR adds the required feature to enable ICORE conference ranking lookups whenever a BibTeX entry includes a conference title.
Task list mentioned in the original issue:
Steps to test
Icoreranking field shows up in the General Tab under the DOI field.
InProceedings and enter a conference acronym (in parentheses) in the Booktitle field. Then, navigate to the General Tab again and click the lookup rank button to see the ICORE rank for the conference.
Clicking the Open Conference Page button will open your default browser and take you to the ICORE conference page for the conference (for SIGCOMM in the screenshot, it would be here).
In case an acronym isn't present in the title, the tool will then try to lookup the entire Booktitle in the ranking data, with a fuzzy match fallback of 90% similarity.
InProceedings, InCollection, and Article entry types and looks for conference titles in Booktitle, Journaltitle, or Title fields.
Some caveats:
ConferenceAcronymExtractor works (see related tests for details). Some examples to illustrate this:
(This doesn't get pulled (this does)) -> this does
(First) acronym is pulled, not the (second) one. -> First
(This doesn't (I DO)) and (this won't (either)). -> I DO
Mandatory checks
CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
It seems that new rules expect only 6 points here.