Ngrams improvements #47

Euak · 2019-02-19T19:52:20Z

Hello, @yooper
Based on the Ngram Statistics Package by Ted Pedersen and Satanjeev Banerjee, I implemented several features for this library's ngram functionality.

I fixed the separator insertion for ngrams created with a separator longer than one character.
I implemented a function that calculates the frequency of each ngram in an ngram array, along with the frequencies of its tokens. The frequencies follow Pedersen and Banerjee's package:
For bigrams, it calculates the frequency of the bigram as a whole and the frequencies of the left and right tokens in the positions where they occur.
For trigrams, it calculates the frequency of the trigram as a whole, the frequency of each token in its position, and the frequency of each token pair (first with second, first with third, and second with third), again in the positions where they occur.
Finally, I implemented the statistical measures that determine the degree of association, also based on Pedersen and Banerjee's package.
Tests were also implemented.
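To make the counting scheme above concrete, here is a minimal sketch in Python. It is an illustration only, not this library's PHP API: the function names `ngrams` and `bigram_frequencies` and the `<>` separator are assumptions chosen for the example.

```python
from collections import Counter


def ngrams(tokens, n=2, separator=" "):
    """Join each window of n consecutive tokens with the separator.

    The separator may be any string, including one longer than a
    single character.
    """
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_frequencies(tokens, separator="<>"):
    """Map each bigram to (joint, left, right) counts.

    joint: how often the bigram occurs as a whole
    left:  how often its first token occurs in the first position
    right: how often its second token occurs in the second position
    """
    pairs = list(zip(tokens, tokens[1:]))
    joint = Counter(pairs)
    left = Counter(first for first, _ in pairs)
    right = Counter(second for _, second in pairs)
    return {
        separator.join(pair): (count, left[pair[0]], right[pair[1]])
        for pair, count in joint.items()
    }


if __name__ == "__main__":
    sample = "the cat sat on the cat".split()
    for bigram, counts in bigram_frequencies(sample).items():
        print(bigram, counts)
```

Trigrams would extend the same idea with three positional counters plus one counter per token pair.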

Pedersen and Banerjee's package is described in much more detail in their paper, available at: http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf
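As a rough illustration of how such association measures are derived from the counts, the sketch below builds a 2x2 contingency table from a bigram's joint count, its two positional token counts, and the total number of bigrams, then computes a few of the measures using their textbook definitions. The function names and the `n11`/`n1p`/`np1`/`npp` parameter names are assumptions made for this example; this is not the code in this pull request, and it assumes every marginal total is non-zero.

```python
import math


def contingency(n11, n1p, np1, npp):
    """Return observed and expected cell counts of the 2x2 table.

    n11: joint bigram count, n1p: count of the left token in position 1,
    np1: count of the right token in position 2, npp: total bigrams.
    """
    n2p, np2 = npp - n1p, npp - np1
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [n1p * np1 / npp, n1p * np2 / npp,
                n2p * np1 / npp, n2p * np2 / npp]
    return observed, expected


def dice(n11, n1p, np1):
    return 2 * n11 / (n1p + np1)


def pointwise_mutual_information(n11, n1p, np1, npp):
    return math.log2(n11 / (n1p * np1 / npp))


def t_score(n11, n1p, np1, npp):
    return (n11 - n1p * np1 / npp) / math.sqrt(n11)


def chi_squared(n11, n1p, np1, npp):
    observed, expected = contingency(n11, n1p, np1, npp)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def log_likelihood(n11, n1p, np1, npp):
    observed, expected = contingency(n11, n1p, np1, npp)
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


if __name__ == "__main__":
    # Counts for "the<>cat" in "the cat sat on the cat" (5 bigrams total).
    print(dice(2, 2, 2), chi_squared(2, 2, 2, 5))
```

The remaining measures (phi coefficient, odds ratio, Fisher's exact test) are computed from the same four cells.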

Feel free to contact me in case of questions.

yooper · 2019-02-21T15:30:31Z

Thank you for the contribution. Before I merge, please convert your variable names to camelCase. Several of them still use underscores.

Euak · 2019-02-21T16:48:45Z

Hi, @yooper. Sorry for that mistake. I fixed it.
Thanks.

Euak added 18 commits February 5, 2019 16:27

Added token frequency

c8a8623

Added token frequency to the ngrams array.

Added ngram frequency

fbb7ec5

Added ngram frequency and placed the ngram string as the key of the array.

Statistic class

ec3c4f8

Class to calculate statistics over the ngrams.

Method setStatVariables decoupled

39d203a

Created setStatVariables method.

Inserted Log-Likelihood stat

089c6f2

Inserted Log-Likelihood stat and refactored calculate method.

Added comments

7f48c88

Added comments.

Added Mutual Information coefficient measure

5fcedc5

Added Mutual Information coefficient measure.

Added Dice measure

1a31497

Added Dice measure.

Added X squared coefficient measure

3a062e0

Added X squared coefficient.

Added T-score measure

39f77c3

Added T-score measure.

Added Phi Coefficient measure

47cdf70

Added Phi Coefficient measure.

Added Odds Ratio measure

11b75e5

Added Odds Ratio measure.

Adjusted frequencies

e9bf62e

Adjusted the method of inserting frequencies to set the frequency of the token at the referred position.

Added Fisher's exact test (left-sided) measure

bb9195f

Added Fisher's exact test (left-sided).

Added Fisher's exact test (right-sided)

b313e08

Fisher's exact test (right-sided) measure and combined with the left-sided.

Implemented calculation of statistic measures

756f31b

Implemented calculation of statistic measures for bi and trigrams. The implementation was based on The Ngram Statistics Package by Ted Pedersen and Satanjeev Banerjee. The façade pattern was used.

added comment about the requirements

33b86c9

Implemented tests to Statistic measures ngrams

252f6c8

Implemented tests to the statistic measures of the ngrams

Euak added 2 commits February 21, 2019 13:33

Fixed variable names

d48e31d

The variable names were modified to the camelCase convention.

Fixed variable names

453f3c9

The variable names were modified to the camelCase convention, tests included.

yooper merged commit 1bc8514 into yooper:master Feb 21, 2019
