Ngrams improvements #47
Merged
Added token frequency to the ngrams array.
Added ngram frequency and placed the ngram string as the key of the array.
Class to calculate statistics over the ngrams.
Created setStatVariables method.
Inserted Log-Likelihood stat and refactored calculate method.
Added comments.
Added Mutual Information coefficient measure.
Added Dice measure.
Added X squared coefficient.
Added T-score measure.
Added Phi Coefficient measure.
Added Odds Ratio measure.
Adjusted the method of inserting frequencies to set the frequency of the token at the referred position.
Added Fisher's exact test (left-sided).
Added Fisher's exact test (right-sided) measure and combined it with the left-sided one.
Implemented calculation of statistical measures for bigrams and trigrams. The implementation is based on the Ngram Statistics Package by Ted Pedersen and Satanjeev Banerjee. The façade pattern was used.
Implemented tests for the statistical measures of the ngrams.
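For readers unfamiliar with the measures named in the commits above, here is a minimal sketch of three of them (Dice, pointwise mutual information, and t-score) computed from bigram counts. It is written in Python for illustration only and is not the code from this pull request. The function and parameter names are hypothetical. The formulas follow the usual contingency-table definitions: `n11` is the bigram frequency, `n1p` the frequency of the first token in the left position, `np1` the frequency of the second token in the right position, and `npp` the total number of bigrams. Note that the "Mutual Information" commit may refer to a different variant than the pointwise form shown here.

```python
import math


def expected_joint(n1p, np1, npp):
    """Expected bigram frequency if the two tokens were independent."""
    return n1p * np1 / npp


def dice(n11, n1p, np1):
    """Dice coefficient: 2 * joint / (left marginal + right marginal)."""
    return 2 * n11 / (n1p + np1)


def pointwise_mutual_information(n11, n1p, np1, npp):
    """log2 of observed over expected joint frequency (assumes n11 > 0)."""
    return math.log2(n11 / expected_joint(n1p, np1, npp))


def t_score(n11, n1p, np1, npp):
    """(observed - expected) / sqrt(observed) (assumes n11 > 0)."""
    return (n11 - expected_joint(n1p, np1, npp)) / math.sqrt(n11)


# Hypothetical counts: the bigram occurs 2 times out of 8 bigrams,
# its first token appears 3 times on the left, its second 2 times on the right.
print(dice(2, 3, 2))
print(pointwise_mutual_information(2, 3, 2, 8))
print(t_score(2, 3, 2, 8))
```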
Owner
Thank you for the contribution. Before I merge, please make sure to camelCase your variables. Several of them currently use underscores.
The variable names were modified to the camelCase convention.
The variable names were modified to the camelCase convention, tests included.
Contributor
Author
Hi, @yooper. Sorry for that mistake. I fixed it.
Hello, @yooper
Based on the Ngram Statistics Package by Ted Pedersen and Satanjeev Banerjee, I implemented some features for the ngram functionality of this library.
For bigrams, it calculates the frequency of the bigram as a whole and the frequencies of the left and right tokens, each counted in the position where it was found.
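To make the bigram counts concrete, here is a minimal sketch in Python (chosen for illustration only; it is not the code from this pull request). The function name `bigram_counts` and the sample sentence are hypothetical. It assumes bigrams are adjacent token pairs and that a token's positional frequency is the number of bigrams in which it occupies that position.

```python
from collections import Counter


def bigram_counts(tokens):
    """Return (joint, left, right) counters over adjacent token pairs.

    joint[(a, b)] is the frequency of the bigram "a b",
    left[a] is how often a occurs as the first token of a bigram,
    right[b] is how often b occurs as the second token of a bigram.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    joint = Counter(bigrams)
    left = Counter(first for first, _ in bigrams)
    right = Counter(second for _, second in bigrams)
    return joint, left, right


tokens = "the cat sat on the mat the cat ran".split()
joint, left, right = bigram_counts(tokens)
print(joint[("the", "cat")], left["the"], right["cat"])
```

The same token can have different left and right frequencies: in this sample, "the" occurs three times on the left but only twice on the right.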
For trigrams, it calculates the frequency of the trigram as a whole, the frequency of each token in the position where it was found, and the frequencies of the three token pairs (first with second, first with third, and second with third), again counted in their respective positions.
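The trigram counts can be sketched the same way. This Python snippet is illustrative only and is not the code from this pull request. The function name `trigram_counts` and the dictionary keys are hypothetical. It assumes trigrams are runs of three adjacent tokens and that every count is taken over the list of trigrams.

```python
from collections import Counter


def trigram_counts(tokens):
    """Count trigrams, their tokens by position, and their positional pairs."""
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    return {
        "joint": Counter(trigrams),
        "first": Counter(a for a, _, _ in trigrams),
        "second": Counter(b for _, b, _ in trigrams),
        "third": Counter(c for _, _, c in trigrams),
        "first_second": Counter((a, b) for a, b, _ in trigrams),
        "first_third": Counter((a, c) for a, _, c in trigrams),
        "second_third": Counter((b, c) for _, b, c in trigrams),
    }


counts = trigram_counts("a b c a b d a b c".split())
print(counts["joint"][("a", "b", "c")])
print(counts["first_second"][("a", "b")], counts["first_third"][("a", "c")])
```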
Pedersen and Banerjee describe the package in much more detail in their paper, available at: http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf
Feel free to contact me in case of questions.