Euak (Contributor) commented Feb 19, 2019

Hello, @yooper
Based on the Ngram Statistics Package by Ted Pedersen and Satanjeev Banerjee, I implemented several features for this library's ngram functionality.

  • I fixed separator insertion when an ngram is created with a separator longer than one character;
  • I implemented a function that calculates the frequency of each ngram in an ngram array, along with the frequencies of its tokens. The frequencies follow Pedersen and Banerjee's package:
    For bigrams, it calculates the frequency of the bigram as a whole and the frequencies of the left and right tokens in the positions where they occur.
    For trigrams, it calculates the frequency of the trigram as a whole, the frequency of each token in its position, and the frequency of each token pair in its positions: first with second, first with third, and second with third.
  • Finally, I implemented the statistical measures that determine the degree of association, also based on Pedersen and Banerjee's package;
  • Tests were also implemented.
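
To make the positional frequencies above concrete, here is a minimal sketch in Python. It is not the code from this pull request: the function names `make_ngrams` and `positional_frequencies` are hypothetical, and the result is keyed by token tuples rather than by the ngram string. It assumes that "frequency in its found positions" means the number of ngrams that carry the same token(s) in the same slot(s), as in Pedersen and Banerjee's package.

```python
from collections import Counter
from itertools import combinations


def make_ngrams(tokens, n=2, separator=" "):
    """Join each run of n consecutive tokens with a separator of any length."""
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def positional_frequencies(tokens, n=2):
    """Count, for every ngram, how often each subset of its tokens
    appears in the same positions across all ngrams.

    For n=2 the position subsets are (0,), (1,), (0, 1).
    For n=3 they are (0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2).
    """
    grams = list(zip(*(tokens[i:] for i in range(n))))

    # One frequency table per subset of positions.
    tables = {}
    for size in range(1, n + 1):
        for positions in combinations(range(n), size):
            tables[positions] = Counter(
                tuple(gram[p] for p in positions) for gram in grams
            )

    # Look up each distinct ngram in every table.
    return {
        gram: {
            positions: table[tuple(gram[p] for p in positions)]
            for positions, table in tables.items()
        }
        for gram in set(grams)
    }


tokens = ["a", "b", "a", "b", "c"]
print(make_ngrams(tokens, 2, "<>"))
print(positional_frequencies(tokens, 2)[("a", "b")])
print(positional_frequencies(tokens, 3)[("a", "b", "c")])
```

In the bigram case, the `(0,)` and `(1,)` entries are the left and right token counts, and `(0, 1)` is the count of the bigram as a whole.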

A much more detailed description of Pedersen and Banerjee's package is given in their paper, available at: http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf
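
As a rough illustration of how association measures can be derived from bigram counts, the Python sketch below computes four of them from a 2x2 contingency table. The function name and parameter names are hypothetical and do not reflect this library's API. The formulas are the standard definitions as I understand them (Dice, pointwise mutual information, t-score, log-likelihood ratio), so check them against the paper before relying on them.

```python
import math


def bigram_measures(n11, n1p, np1, npp):
    """Association measures for one bigram.

    n11: count of the bigram itself
    n1p: count of bigrams with the same left token
    np1: count of bigrams with the same right token
    npp: total number of bigrams
    """
    # Remaining cells and marginals of the 2x2 contingency table.
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n1p - np1 + n11
    n2p = npp - n1p
    np2 = npp - np1

    # Expected counts under independence of the two tokens.
    m11 = n1p * np1 / npp
    m12 = n1p * np2 / npp
    m21 = n2p * np1 / npp
    m22 = n2p * np2 / npp

    # Log-likelihood ratio; empty cells contribute nothing (0 * log 0 = 0).
    log_likelihood = 2 * sum(
        observed * math.log(observed / expected)
        for observed, expected in ((n11, m11), (n12, m12), (n21, m21), (n22, m22))
        if observed > 0
    )

    return {
        "dice": 2 * n11 / (n1p + np1),
        "pmi": math.log2(n11 / m11),
        "tscore": (n11 - m11) / math.sqrt(n11),
        "log_likelihood": log_likelihood,
    }


# A bigram seen twice among four bigrams, with both tokens seen twice in position.
print(bigram_measures(n11=2, n1p=2, np1=2, npp=4))
```

The sketch assumes `n11 > 0`, which holds for any bigram that actually occurs in the text.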

Feel free to contact me in case of questions.

Euak added 18 commits February 5, 2019 16:27
Added token frequency to the ngrams array.
Added ngram frequency and placed the ngram string as the key of the array.
Class to calculate statistics over the ngrams.
Created setStatVariables method.
Inserted Log-Likelihood stat and refactored calculate method.
Added comments.
Added Mutual Information coefficient measure.
Added Dice measure.
Added X squared coefficient.
Added T-score measure.
Added Phi Coefficient measure.
Added Odds Ratio measure.
Adjusted the method of inserting frequencies to set the frequency of the token at the referred position.
Added Fisher's exact test (left-sided).
Fisher's exact test (right-sided) measure and combined with the left-sided.
Implemented calculation of statistic measures for bi and trigrams. The implementation was based on The Ngram Statistics Package by Ted Pedersen and Satanjeev Banerjee. The façade pattern was used.
Implemented tests to the statistic measures of the ngrams

yooper (Owner) commented Feb 21, 2019

Thank you for the contribution. Before I merge, please make sure to camelCase your variables. Several of them use underscores.

Euak added 2 commits February 21, 2019 13:33
The variable names were modified to the camelCase convention.
The variable names were modified to the camelCase convention, tests included.

Euak (Contributor, Author) commented Feb 21, 2019

Hi, @yooper. Sorry for that mistake. I fixed it.
Thanks.

yooper merged commit 1bc8514 into yooper:master Feb 21, 2019