@@ -320,6 +320,24 @@ Google normalized distance as described in "The Google Similarity Distance", Ci

`gnd` also accepts the `background_is_superset` parameter.


===== Percentage
A simple calculation of the number of documents in the foreground sample with a term divided by the number of documents in the background with the term.
By default this produces a score greater than zero and at most one.

The benefit of this heuristic is that the scoring logic is simple to explain to anyone familiar with a "per capita" statistic. However, for fields with high cardinality there is a tendency for this heuristic to select the rarest terms such as typos that occur only once because they score 1/1 = 100%.

It would be hard for a seasoned boxer to win a championship if the prize was awarded purely on the basis of percentage of fights won - by these rules a newcomer with only one fight under his belt would be impossible to beat.
Multiple observations are typically required to reinforce a view so it is recommended in these cases to set both `min_doc_count` and `shard_min_doc_count` to a higher value such as 10 in order to filter out the low-frequency terms that otherwise take precedence.

[source,js]
--------------------------------------------------

"percentage": {
}
--------------------------------------------------
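The rare-term effect described above can be sketched in a few lines of standalone Java. This is only an illustration, not the Elasticsearch implementation: the class `PercentageRankingSketch`, the `TermStats` holder, the sample terms and their counts are all invented, and `minDocCount` is assumed to filter on the foreground document count in the way `min_doc_count` is described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: the class, the terms and the counts are invented.
class PercentageRankingSketch {

    static final class TermStats {
        final String term;
        final long foregroundDocs; // foreground documents containing the term
        final long backgroundDocs; // background documents containing the term

        TermStats(String term, long foregroundDocs, long backgroundDocs) {
            this.term = term;
            this.foregroundDocs = foregroundDocs;
            this.backgroundDocs = backgroundDocs;
        }

        // Foreground count divided by background count, 0 if the background has none.
        double score() {
            return backgroundDocs == 0 ? 0.0 : (double) foregroundDocs / (double) backgroundDocs;
        }
    }

    static List<TermStats> sample() {
        return Arrays.asList(
                new TermStats("elasticsearch", 90, 100),
                new TermStats("serach", 1, 1), // a typo that occurs only once
                new TermStats("the", 500, 10000));
    }

    // Assumption: minDocCount drops terms with too few foreground documents.
    // The surviving terms are returned with the highest score first.
    static List<TermStats> rank(List<TermStats> terms, long minDocCount) {
        List<TermStats> kept = new ArrayList<>();
        for (TermStats t : terms) {
            if (t.foregroundDocs >= minDocCount) {
                kept.add(t);
            }
        }
        kept.sort((a, b) -> Double.compare(b.score(), a.score()));
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("minDocCount=1  top term: " + rank(sample(), 1).get(0).term);
        System.out.println("minDocCount=10 top term: " + rank(sample(), 10).get(0).term);
    }
}
```

With a threshold of 1 the one-off typo wins with a score of 1/1, while a threshold of 10 removes it and the term scoring 90/100 comes first.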


===== Which one is best?


@@ -0,0 +1,101 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/


package org.elasticsearch.search.aggregations.bucket.significant.heuristics;


import org.elasticsearch.ElasticsearchParseException;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.query.QueryParsingException;

import java.io.IOException;

public class PercentageScore extends SignificanceHeuristic {

    public static final PercentageScore INSTANCE = new PercentageScore();

    protected static final String[] NAMES = {"percentage"};
Contributor

Does this need to be a String[]? We seem to only use it in one place where we just get the first element anyway so would it not be better as a plain String?

Contributor Author

The SignificanceHeuristicParser base class has a getNames() method that requires an array of names (presumably to allow for alternatives) so this is returned there

Contributor

Ah ok, I missed that method below, sorry.


    private PercentageScore() {};

    public static final SignificanceHeuristicStreams.Stream STREAM = new SignificanceHeuristicStreams.Stream() {
        @Override
        public SignificanceHeuristic readResult(StreamInput in) throws IOException {
            return readFrom(in);
        }

        @Override
        public String getName() {
            return NAMES[0];
        }
    };

    public static SignificanceHeuristic readFrom(StreamInput in) throws IOException {
        return INSTANCE;
    }

    /**
     * Indicates the significance of a term in a sample by determining what percentage
     * of all occurrences of a term are found in the sample.
     */
    @Override
    public double getScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
        checkFrequencyValidity(subsetFreq, subsetSize, supersetFreq, supersetSize, "PercentageScore");
        if (supersetFreq == 0) {
Contributor

If subsetFreq is > 0 and supersetFreq = 0, should the user not be warned?

Contributor Author

I've followed the logic we've used elsewhere. I think this allows for a case where a background_filter is not a strict superset. For example, the foreground is a search for all tweets with "JeSuisCharlie" and the background is set to "language:fr" to tune out everyday French like "vous" and, hopefully, focus in on the Charlie Hebdo related keywords. The background stats are used in that example as a rough backdrop, and we choose not to error if the foreground has a term that is not in the background.

            // avoid a divide by zero issue
            return 0;
        }
        return (double) subsetFreq / (double) supersetFreq;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(STREAM.getName());
    }

    public static class PercentageScoreParser implements SignificanceHeuristicParser {

        @Override
        public SignificanceHeuristic parse(XContentParser parser) throws IOException, QueryParsingException {
            // move to the closing bracket
            if (!parser.nextToken().equals(XContentParser.Token.END_OBJECT)) {
                throw new ElasticsearchParseException("expected }, got " + parser.currentName() + " instead in percentage score");
            }
            return new PercentageScore();
        }

        @Override
        public String[] getNames() {
            return NAMES;
        }
    }

    public static class PercentageScoreBuilder implements SignificanceHeuristicBuilder {

        @Override
        public void toXContent(XContentBuilder builder) throws IOException {
            builder.startObject(STREAM.getName()).endObject();
        }
    }
}
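
The zero-background case raised in the review thread is easier to see in isolation. Below is a minimal standalone sketch of the same arithmetic as `getScore`, assuming only the two frequencies matter. The class `PercentageScoreSketch` is hypothetical, and the validity checks and the two size arguments are left out on purpose.

```java
// Standalone sketch of the ratio used by the percentage heuristic.
// Hypothetical class; it omits checkFrequencyValidity and the size arguments.
class PercentageScoreSketch {

    static double getScore(long subsetFreq, long supersetFreq) {
        if (supersetFreq == 0) {
            // The term is missing from the background: score 0 rather than divide by zero.
            return 0;
        }
        return (double) subsetFreq / (double) supersetFreq;
    }

    public static void main(String[] args) {
        System.out.println(getScore(5, 20)); // a quarter of the term's documents are in the sample
        System.out.println(getScore(1, 1));  // a one-off term scores the maximum for a superset background
        System.out.println(getScore(3, 0));  // foreground term absent from a non-superset background
        System.out.println(getScore(4, 2));  // nothing in this sketch caps the ratio at one
    }
}
```

In this sketch a term that appears in the foreground but not in the background quietly scores 0, which matches the "rough backdrop" behaviour described in the reply, and a background that is not a superset can yield a ratio above one.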

@@ -42,10 +42,10 @@ protected void checkFrequencyValidity(long subsetFreq, long subsetSize, long sup
throw new ElasticsearchIllegalArgumentException("Frequencies of subset and superset must be positive in " + scoreFunctionName + ".getScore()");
}
if (subsetFreq > subsetSize) {
-throw new ElasticsearchIllegalArgumentException("subsetFreq > subsetSize, in JLHScore.score(..)");
+throw new ElasticsearchIllegalArgumentException("subsetFreq > subsetSize, in " + scoreFunctionName);
}
if (supersetFreq > supersetSize) {
-throw new ElasticsearchIllegalArgumentException("supersetFreq > supersetSize, in JLHScore.score(..)");
+throw new ElasticsearchIllegalArgumentException("supersetFreq > supersetSize, in " + scoreFunctionName);
}
}
}
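
The effect of the message change in this hunk can be tried outside Elasticsearch with a small sketch. Several things here are assumptions: `FrequencyCheckSketch` and `validationError` are hypothetical names, `IllegalArgumentException` stands in for `ElasticsearchIllegalArgumentException`, and the first condition is a guess because the hunk does not show it.

```java
// Standalone sketch; IllegalArgumentException replaces the Elasticsearch
// exception type so that the example runs without any dependencies.
class FrequencyCheckSketch {

    static void checkFrequencyValidity(long subsetFreq, long subsetSize, long supersetFreq,
                                       long supersetSize, String scoreFunctionName) {
        // Assumption: the hunk does not show this condition, rejecting negatives is a guess.
        if (subsetFreq < 0 || subsetSize < 0 || supersetFreq < 0 || supersetSize < 0) {
            throw new IllegalArgumentException(
                    "Frequencies of subset and superset must be positive in " + scoreFunctionName + ".getScore()");
        }
        if (subsetFreq > subsetSize) {
            throw new IllegalArgumentException("subsetFreq > subsetSize, in " + scoreFunctionName);
        }
        if (supersetFreq > supersetSize) {
            throw new IllegalArgumentException("supersetFreq > supersetSize, in " + scoreFunctionName);
        }
    }

    // Returns the validation error message, or null when the counts are accepted.
    static String validationError(long subsetFreq, long subsetSize, long supersetFreq,
                                  long supersetSize, String scoreFunctionName) {
        try {
            checkFrequencyValidity(subsetFreq, subsetSize, supersetFreq, supersetSize, scoreFunctionName);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The message now names the calling heuristic instead of a hard-coded JLHScore.
        System.out.println(validationError(5, 3, 1, 10, "PercentageScore"));
        System.out.println(validationError(2, 3, 1, 10, "PercentageScore"));
    }
}
```

Passing the heuristic name through means a failure raised from `PercentageScore` is no longer reported as coming from `JLHScore`.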
@@ -21,6 +21,7 @@
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;

import com.google.common.collect.Lists;

import org.elasticsearch.common.inject.AbstractModule;
import org.elasticsearch.common.inject.multibindings.Multibinder;

@@ -33,6 +34,7 @@ public class SignificantTermsHeuristicModule extends AbstractModule {

public SignificantTermsHeuristicModule() {
registerParser(JLHScore.JLHScoreParser.class);
registerParser(PercentageScore.PercentageScoreParser.class);
registerParser(MutualInformation.MutualInformationParser.class);
registerParser(GND.GNDParser.class);
registerParser(ChiSquare.ChiSquareParser.class);
@@ -21,6 +21,7 @@
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;

import com.google.common.collect.Lists;

import org.elasticsearch.common.inject.AbstractModule;

import java.util.List;
@@ -32,6 +33,7 @@ public class TransportSignificantTermsHeuristicModule extends AbstractModule {

public TransportSignificantTermsHeuristicModule() {
registerStream(JLHScore.STREAM);
registerStream(PercentageScore.STREAM);
registerStream(MutualInformation.STREAM);
registerStream(GND.STREAM);
registerStream(ChiSquare.STREAM);
@@ -33,6 +33,7 @@
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.GND;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.JLHScore;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.MutualInformation;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.PercentageScore;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.bucket.terms.TermsBuilder;
import org.elasticsearch.test.ElasticsearchIntegrationTest;
@@ -272,6 +273,23 @@ public void textAnalysisChiSquare() throws Exception {
checkExpectedStringTermsFound(topTerms);
}

    @Test
    public void textAnalysisPercentageScore() throws Exception {
        SearchResponse response = client()
                .prepareSearch("test")
                .setSearchType(SearchType.QUERY_AND_FETCH)
                .setQuery(new TermQueryBuilder("_all", "terje"))
                .setFrom(0)
                .setSize(60)
                .setExplain(true)
                .addAggregation(
                        new SignificantTermsBuilder("mySignificantTerms").field("description").executionHint(randomExecutionHint())
                                .significanceHeuristic(new PercentageScore.PercentageScoreBuilder()).minDocCount(2)).execute().actionGet();
        assertSearchResponse(response);
        SignificantTerms topTerms = response.getAggregations().get("mySignificantTerms");
        checkExpectedStringTermsFound(topTerms);
    }

@Test
public void badFilteredAnalysis() throws Exception {
// Deliberately using a bad choice of filter here for the background context in order
@@ -30,7 +30,16 @@
import org.elasticsearch.common.xcontent.json.JsonXContent;
import org.elasticsearch.search.SearchShardTarget;
import org.elasticsearch.search.aggregations.InternalAggregations;
-import org.elasticsearch.search.aggregations.bucket.significant.heuristics.*;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.ChiSquare;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.GND;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.JLHScore;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.MutualInformation;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.PercentageScore;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristic;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristicBuilder;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristicParser;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristicParserMapper;
+import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristicStreams;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.test.ElasticsearchIntegrationTest;
import org.elasticsearch.test.ElasticsearchTestCase;
@@ -45,7 +54,11 @@
import java.util.List;
import java.util.Set;

-import static org.hamcrest.Matchers.*;
+import static org.hamcrest.Matchers.equalTo;
+import static org.hamcrest.Matchers.greaterThan;
+import static org.hamcrest.Matchers.greaterThanOrEqualTo;
+import static org.hamcrest.Matchers.lessThan;
+import static org.hamcrest.Matchers.lessThanOrEqualTo;

/**
*
@@ -68,6 +81,7 @@ public SearchShardTarget shardTarget() {
public void streamResponse() throws Exception {
SignificanceHeuristicStreams.registerStream(MutualInformation.STREAM, MutualInformation.STREAM.getName());
SignificanceHeuristicStreams.registerStream(JLHScore.STREAM, JLHScore.STREAM.getName());
SignificanceHeuristicStreams.registerStream(PercentageScore.STREAM, PercentageScore.STREAM.getName());
SignificanceHeuristicStreams.registerStream(GND.STREAM, GND.STREAM.getName());
SignificanceHeuristicStreams.registerStream(ChiSquare.STREAM, ChiSquare.STREAM.getName());
Version version = ElasticsearchIntegrationTest.randomVersion();
@@ -304,13 +318,15 @@ public void testAssertions() throws Exception {
testBackgroundAssertions(new MutualInformation(true, true), new MutualInformation(true, false));
testBackgroundAssertions(new ChiSquare(true, true), new ChiSquare(true, false));
testBackgroundAssertions(new GND(true), new GND(false));
testAssertions(PercentageScore.INSTANCE);
testAssertions(JLHScore.INSTANCE);
}

@Test
public void basicScoreProperties() {
basicScoreProperties(JLHScore.INSTANCE, true);
basicScoreProperties(new GND(true), true);
basicScoreProperties(PercentageScore.INSTANCE, true);
basicScoreProperties(new MutualInformation(true, true), false);
basicScoreProperties(new ChiSquare(true, true), false);
}