Merged
Changes from 1 commit
30 commits
c4bfdb7
Add HistogramField.
iverase Oct 28, 2019
550394c
checkStyle
iverase Oct 28, 2019
9d4f9c4
more checkStyle
iverase Oct 28, 2019
4e3eed7
Addressed part of the review
iverase Oct 29, 2019
a168d32
Extract the logic of creating a new histogram to a separate method
iverase Oct 29, 2019
038d429
Addressed more comments.
iverase Oct 29, 2019
edc2faf
formatting
iverase Oct 29, 2019
c527aec
extract logic for getting histogram in TDigest
iverase Oct 29, 2019
bd59238
remove unused imports
iverase Oct 29, 2019
71886a8
rename test class
iverase Oct 29, 2019
793a257
Detect in the constructor if we expect histogram value source
iverase Oct 29, 2019
579c05c
revert last change
iverase Oct 29, 2019
af1249f
Values must be provided in increasing order
iverase Oct 31, 2019
1cb8f53
Handling null value and do not fail if arrays are empty, trate it as a
iverase Oct 31, 2019
93229e5
Handle ignore malformed properly
iverase Oct 31, 2019
996f8fc
Merge branch 'master' into histogramField
iverase Oct 31, 2019
edec448
initial documentation for the new field
iverase Oct 31, 2019
adf12a4
initial documentation for the new field
iverase Oct 31, 2019
3c5892e
Addressed docs review
iverase Nov 1, 2019
19f15a2
Add HistogramFieldTypeTests
iverase Nov 1, 2019
1f6383d
address last review comments
iverase Nov 3, 2019
fe039ee
Merge branch 'master' into histogramField
iverase Nov 3, 2019
40f679d
Merge branch 'master' into histogramField
iverase Nov 15, 2019
79f7fd9
Merge branch 'master' into histogramField
iverase Nov 27, 2019
fbabf1c
Make sure that in ignore malformed we move to the end of the
iverase Nov 27, 2019
f1a1ead
address review comments
iverase Nov 28, 2019
c8a1f12
remove support for parsed fields
iverase Nov 28, 2019
0045a8b
Merge branch 'master' into histogramField
elasticmachine Nov 28, 2019
f8cf1a7
addressed last comments
iverase Nov 28, 2019
2e8649a
Merge branch 'histogramField' of github.com:iverase/elasticsearch int…
iverase Nov 28, 2019
@@ -3,8 +3,8 @@

 A `multi-value` metrics aggregation that calculates one or more percentiles
 over numeric values extracted from the aggregated documents. These values can be
-generated by a provided script or extracted from specific numeric or histogram
-fields in the documents.
+generated by a provided script or extracted from specific numeric or
+<<histogram,histogram fields>> in the documents.

Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than 95%
@@ -3,8 +3,8 @@

 A `multi-value` metrics aggregation that calculates one or more percentile ranks
 over numeric values extracted from the aggregated documents. These values can be
-generated by a provided script or extracted from specific numeric or histogram
-fields in the documents.
+generated by a provided script or extracted from specific numeric or
+<<histogram,histogram fields>> in the documents.

[NOTE]
==================================================
21 changes: 12 additions & 9 deletions docs/reference/mapping/types/histogram.asciidoc
@@ -39,23 +39,26 @@ following aggregations and queries:
* <<search-aggregations-metrics-percentile-rank-aggregation,percentile ranks>> aggregation
* <<query-dsl-exists-query,exists>> query

We recommend you define the buckets in the `values` array based on the type of aggregation you intend to use.

[[mapping-types-histogram-building-histogram]]
==== Building a histogram

When using a histogram as part of an aggregation, the accuracy of the results will depend on how the
histogram was constructed. It is important to consider the percentiles aggregation mode that will be used
to build it. Some possibilities include:

-- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, histograms
-can be built by using the mean value of the centroids and the centroid's count. If the algorithm has already
-started to approximate the percentiles, this inaccuracy is carried over in the histogram.
+- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, the `values` array represents
+the mean centroid positions and the `counts` array represents the number of values that are attributed to each
+centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is
+carried over in the histogram.

+- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, the `values` array represents fixed upper
+limits of each bucket interval, and the `counts` array represents the number of values that are attributed to each
+interval. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits),
+therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.

-- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, histograms
-can be created by using the recorded values and the count at that value. This implementation maintains a fixed worst-case
-percentage error (specified as a number of significant digits), therefore the value used when generating the histogram
-would be the maximum accuracy you can achieve at aggregation time.

+The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this
+means the field can technically be aggregated with either algorithm, in practice the user should choose one algorithm and
+index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure the best accuracy.

Contributor:
Perhaps another sentence/paragraph at the end?

The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should chose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.

Or something similar... trying to convey to the user that how they index the data is important and they should chose upfront.
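
To make the "choose one algorithm upfront" advice concrete, here is a minimal Java sketch — not part of this PR — of how a client might derive the `values`/`counts` arrays from T-Digest centroids before indexing. The field name `my_histo`, the compression setting, and the sample data are illustrative assumptions; the `TDigest`/`Centroid` APIs are the same ones this PR's tests use.

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

import java.util.ArrayList;
import java.util.List;

public class HistogramDocSketch {
    public static void main(String[] args) {
        // Pre-aggregate raw samples client-side with T-Digest (a compression
        // of 100.0 is an illustrative choice, not a value mandated by the PR).
        TDigest digest = TDigest.createMergingDigest(100.0);
        for (double latency : new double[] {0.1, 0.2, 0.3, 0.3, 0.35, 0.4}) {
            digest.add(latency);
        }
        List<Double> values = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        // Centroids come back in ascending order of mean, which satisfies the
        // mapper's rule that `values` must be provided in increasing order.
        for (Centroid centroid : digest.centroids()) {
            values.add(centroid.mean());
            counts.add(centroid.count());
        }
        // The resulting arrays become the histogram field of the document,
        // e.g. {"my_histo": {"values": [...], "counts": [...]}}.
        System.out.println("values=" + values + " counts=" + counts);
    }
}
```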

[[histogram-ex]]
==== Examples
@@ -112,7 +112,7 @@ public void collect(int doc, long bucket) throws IOException {
DoubleHistogram state = getExistingOrNewHistogram(bigArrays, bucket);
if (values.advanceExact(doc)) {
final HistogramValue sketch = values.histogram();
-while(sketch.next()) {
+while (sketch.next()) {
state.recordValueWithCount(sketch.value(), sketch.count());
}
}
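
The `collect()` loop above simply replays the stored (value, count) pairs into the HDR state. As a standalone illustration of why that works, here is a hedged sketch using the HdrHistogram library directly; the sample arrays and the 3-significant-digit setting are assumptions for the example, not values from the PR.

```java
import org.HdrHistogram.DoubleHistogram;

public class ReplaySketch {
    public static void main(String[] args) {
        // Pre-aggregated pairs as they might come out of a histogram field.
        double[] values = {0.1, 0.2, 0.3, 0.4};
        long[] counts = {3, 7, 23, 12};
        // 3 significant digits bounds the worst-case relative error.
        DoubleHistogram state = new DoubleHistogram(3);
        for (int i = 0; i < values.length; i++) {
            state.recordValueWithCount(values[i], counts[i]);
        }
        System.out.println("p95 ~= " + state.getValueAtPercentile(95.0));
    }
}
```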
@@ -26,6 +26,7 @@
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentParser;
+import org.elasticsearch.common.xcontent.XContentSubParser;
import org.elasticsearch.common.xcontent.support.XContentMapValues;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.fielddata.AtomicHistogramFieldData;
@@ -298,8 +299,8 @@ private HistogramValue getHistogramValue(final BytesRef bytesRef) throws IOException {
@Override
public boolean next() throws IOException {
if (streamInput.available() > 0) {
-value = streamInput.readDouble();
 count = streamInput.readVInt();
+value = streamInput.readDouble();
return true;
}
isExhausted = true;
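
After this reordering, each doc-values entry is laid out count-first, then value, and the reader above mirrors that order until the stream is exhausted. Below is a self-contained sketch of the same round trip; it uses plain java.io streams with fixed-width ints in place of Elasticsearch's variable-length writeVInt/readVInt, so treat the byte-level layout as an approximation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LayoutSketch {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        // Two entries, each written count-first, then value.
        out.writeInt(7);
        out.writeDouble(0.35);
        out.writeInt(12);
        out.writeDouble(0.4);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        // Mirrors the next() loop above: read until the stream is exhausted.
        while (in.available() > 0) {
            int count = in.readInt();
            double value = in.readDouble();
            System.out.println(value + " x " + count);
        }
    }
}
```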
Expand Down Expand Up @@ -352,7 +353,7 @@ public void parse(ParseContext context) throws IOException {
}
context.path().add(simpleName());
XContentParser.Token token = null;
-int level = 0;
+XContentSubParser subParser = null;
try {
token = context.parser().currentToken();
if (token == XContentParser.Token.VALUE_NULL) {
@@ -363,22 +364,23 @@ public void parse(ParseContext context) throws IOException {
IntArrayList counts = null;
// should be an object
ensureExpectedToken(XContentParser.Token.START_OBJECT, token, context.parser()::getTokenLocation);
-token = context.parser().nextToken();
+subParser = new XContentSubParser(context.parser());
+token = subParser.nextToken();
while (token != XContentParser.Token.END_OBJECT) {
// should be a field
-ensureExpectedToken(XContentParser.Token.FIELD_NAME, token, context.parser()::getTokenLocation);
-String fieldName = context.parser().currentName();
+ensureExpectedToken(XContentParser.Token.FIELD_NAME, token, subParser::getTokenLocation);
+String fieldName = subParser.currentName();
if (fieldName.equals(VALUES_FIELD.getPreferredName())) {
-token = context.parser().nextToken();
+token = subParser.nextToken();
// should be an array
-ensureExpectedToken(XContentParser.Token.START_ARRAY, token, context.parser()::getTokenLocation);
+ensureExpectedToken(XContentParser.Token.START_ARRAY, token, subParser::getTokenLocation);
values = new DoubleArrayList();
-token = context.parser().nextToken();
+token = subParser.nextToken();
double previousVal = -Double.MAX_VALUE;
while (token != XContentParser.Token.END_ARRAY) {
// should be a number
-ensureExpectedToken(XContentParser.Token.VALUE_NUMBER, token, context.parser()::getTokenLocation);
-double val = context.parser().doubleValue();
+ensureExpectedToken(XContentParser.Token.VALUE_NUMBER, token, subParser::getTokenLocation);
+double val = subParser.doubleValue();
if (val < previousVal) {
// values must be in increasing order
throw new MapperParsingException("error parsing field ["
@@ -387,28 +389,26 @@
}
values.add(val);
previousVal = val;
-token = context.parser().nextToken();
+token = subParser.nextToken();
}
} else if (fieldName.equals(COUNTS_FIELD.getPreferredName())) {
-token = context.parser().nextToken();
+token = subParser.nextToken();
// should be an array
-ensureExpectedToken(XContentParser.Token.START_ARRAY, token, context.parser()::getTokenLocation);
+ensureExpectedToken(XContentParser.Token.START_ARRAY, token, subParser::getTokenLocation);
counts = new IntArrayList();
-token = context.parser().nextToken();
+token = subParser.nextToken();
while (token != XContentParser.Token.END_ARRAY) {
// should be a number
-ensureExpectedToken(XContentParser.Token.VALUE_NUMBER, token, context.parser()::getTokenLocation);
-counts.add(context.parser().intValue());
-token = context.parser().nextToken();
+ensureExpectedToken(XContentParser.Token.VALUE_NUMBER, token, subParser::getTokenLocation);
+counts.add(subParser.intValue());
+token = subParser.nextToken();
}
} else {
throw new MapperParsingException("error parsing field [" +
name() + "], with unknown parameter [" + fieldName + "]");
}
-token = context.parser().nextToken();
-level = maybeAddOrRemoveLevel(token, level);
+token = subParser.nextToken();
}
-level = 0;
if (values == null) {
throw new MapperParsingException("error parsing field ["
+ name() + "], expected field called [" + VALUES_FIELD.getPreferredName() + "]");
@@ -431,8 +431,8 @@ public void parse(ParseContext context) throws IOException {
+ name() + "], ["+ COUNTS_FIELD + "] elements must be >= 0 but got " + counts.get(i));
} else if (count > 0) {
// we do not add elements with count == 0
-streamOutput.writeDouble(values.get(i));
streamOutput.writeVInt(count);
Contributor:
I'd suggest putting the count before the values, it might make it easier to better compress in the future by stealing bits of the count.

+streamOutput.writeDouble(values.get(i));
}
Contributor:
should we skip values that have a count of 0 from doc values?

Contributor Author:
make sense, added code to skip zero counts

}
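
Pulling the parsing and writing rules above together — `values` must be non-decreasing, `counts` must be non-negative, the two arrays should match in length, and zero-count entries are dropped, with each kept entry written count-first — a simplified stand-alone version of the write loop might look like the sketch below. It uses plain java.io with fixed-width ints instead of Elasticsearch's stream classes and VInts, so the class and byte layout are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteLoopSketch {
    static byte[] encode(double[] values, int[] counts) throws IOException {
        if (values.length != counts.length) {
            throw new IllegalArgumentException("expected same length from values and counts");
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        double previousVal = -Double.MAX_VALUE;
        for (int i = 0; i < values.length; i++) {
            if (values[i] < previousVal) {
                throw new IllegalArgumentException("values must be provided in increasing order");
            }
            if (counts[i] < 0) {
                throw new IllegalArgumentException("counts elements must be >= 0");
            }
            if (counts[i] > 0) {
                // entries with count == 0 are skipped entirely
                out.writeInt(counts[i]);      // count first, per the review suggestion
                out.writeDouble(values[i]);
            }
            previousVal = values[i];
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] encoded = encode(new double[] {0.1, 0.2, 0.3}, new int[] {3, 0, 7});
        System.out.println(encoded.length + " bytes"); // 2 entries x (4 + 8) = 24 bytes
    }
}
```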

@@ -451,27 +451,16 @@ public void parse(ParseContext context) throws IOException {
ex, fieldType().name(), fieldType().typeName());
}
Contributor:
When ignoreMalformed is true, I think we should also try to move to the end of the histogram object so that other fields can be parsed successfully. See how the geo_shape field does it for instance.

Contributor Author:
I added my own logic to skip the field to the end. Not sure if there is some other utility to what I am doing here by hand.

Contributor:
This is what XContentSubParser has been designed for. See #35603. Maybe it would be more robust? By the way looking at the latest version of GeoShapeFieldMapper, it looks like it no longer handles ignoreMalformed correctly, or am I misreading it cc @imotov ?

// we need to advance until the end of the field
-if (token != null) {
-while (level > 0 || token != XContentParser.Token.END_OBJECT) {
-level = maybeAddOrRemoveLevel(token, level);
-token = context.parser().nextToken();
+if (subParser != null) {
+while (token != null) {
+token = subParser.nextToken();
}
Contributor:
Do subParser.close() instead?

}
context.addIgnoredField(fieldType().name());
}
context.path().remove();
}

-private int maybeAddOrRemoveLevel(XContentParser.Token token, int level) {
-if (token == XContentParser.Token.START_OBJECT) {
-return ++level;
-}
-if (token == XContentParser.Token.END_OBJECT) {
-return --level;
-}
-return level;
-}
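
The helper removed above tracked nesting depth by hand; XContentSubParser owns that bookkeeping, and per the review thread, closing it (or draining nextToken() until null) advances past the wrapped object so later fields still parse. For readers outside the Elasticsearch codebase, the Jackson parser that backs the XContent layer offers the same idea through skipChildren(); the following is a hedged illustration of skipping a sub-object the mapper would reject, not the PR's actual code.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class SkipSketch {
    public static void main(String[] args) throws Exception {
        // "oops" is valid JSON but malformed for the histogram mapper.
        String json = "{\"pre_aggregated\": {\"values\": [1.0, \"oops\"], \"counts\": [1, 2]},"
            + " \"otherField\": \"value\"}";
        try (JsonParser parser = new JsonFactory().createParser(json)) {
            parser.nextToken(); // START_OBJECT of the document
            parser.nextToken(); // FIELD_NAME "pre_aggregated"
            parser.nextToken(); // START_OBJECT of the histogram
            // Skip the whole sub-object in one call, leaving the parser
            // positioned at its matching END_OBJECT.
            parser.skipChildren();
            while (parser.nextToken() != null) {
                if (parser.getCurrentToken() == JsonToken.FIELD_NAME) {
                    System.out.println("still parseable: " + parser.getCurrentName());
                }
            }
        }
    }
}
```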

@Override
protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, Params params) throws IOException {
super.doXContentBody(builder, includeDefaults, params);
@@ -40,11 +40,10 @@ private BinaryDocValuesField getDocValue(String fieldName, double[] values) throws IOException {
Iterator<DoubleHistogramIterationValue> iterator = recordedValues.iterator();
while (iterator.hasNext()) {
DoubleHistogramIterationValue value = iterator.next();
-double d = value.getValueIteratedTo();
-streamOutput.writeDouble(d);
long count = value.getCountAtValueIteratedTo();
streamOutput.writeVInt(Math.toIntExact(count));

+double d = value.getValueIteratedTo();
+streamOutput.writeDouble(d);
}
return new BinaryDocValuesField(fieldName, streamOutput.bytes().toBytesRef());
}
@@ -47,9 +47,9 @@ private BinaryDocValuesField getDocValue(String fieldName, double[] values) throws IOException {
DoubleHistogramIterationValue value = iterator.next();
long count = value.getCountAtValueIteratedTo();
if (count != 0) {
+streamOutput.writeVInt(Math.toIntExact(count));
double d = value.getValueIteratedTo();
streamOutput.writeDouble(d);
-streamOutput.writeVInt(Math.toIntExact(count));
}

}
@@ -156,6 +156,48 @@ public void testIgnoreMalformed() throws Exception {
assertThat(doc.rootDoc().getField("pre_aggregated"), nullValue());
}

public void testIgnoreMalformedSkipsKeyword() throws Exception {
ensureGreen();
XContentBuilder xContentBuilder = XContentFactory.jsonBuilder().startObject().startObject("_doc")
.startObject("properties").startObject("pre_aggregated").field("type", "histogram")
.field("ignore_malformed", true)
.endObject().startObject("otherField").field("type", "keyword");
String mapping = Strings.toString(xContentBuilder.endObject().endObject().endObject().endObject());
DocumentMapper defaultMapper = createIndex("test").mapperService().documentMapperParser()
.parse("_doc", new CompressedXContent(mapping));

ParsedDocument doc = defaultMapper.parse(new SourceToParse("test", "1",
BytesReference.bytes(XContentFactory.jsonBuilder()
.startObject().field("pre_aggregated", "value")
.field("otherField","value")
.endObject()),
XContentType.JSON));

assertThat(doc.rootDoc().getField("pre_aggregated"), nullValue());
assertThat(doc.rootDoc().getField("otherField"), notNullValue());
}

public void testIgnoreMalformedSkipsArray() throws Exception {
ensureGreen();
XContentBuilder xContentBuilder = XContentFactory.jsonBuilder().startObject().startObject("_doc")
.startObject("properties").startObject("pre_aggregated").field("type", "histogram")
.field("ignore_malformed", true)
.endObject().startObject("otherField").field("type", "keyword");
String mapping = Strings.toString(xContentBuilder.endObject().endObject().endObject().endObject());
DocumentMapper defaultMapper = createIndex("test").mapperService().documentMapperParser()
.parse("_doc", new CompressedXContent(mapping));

ParsedDocument doc = defaultMapper.parse(new SourceToParse("test", "1",
BytesReference.bytes(XContentFactory.jsonBuilder()
.startObject().field("pre_aggregated", new int[] {2, 2, 2})
.field("otherField","value")
.endObject()),
XContentType.JSON));

assertThat(doc.rootDoc().getField("pre_aggregated"), nullValue());
assertThat(doc.rootDoc().getField("otherField"), notNullValue());
}

public void testIgnoreMalformedSkipsField() throws Exception {
ensureGreen();
XContentBuilder xContentBuilder = XContentFactory.jsonBuilder().startObject().startObject("_doc")
@@ -44,8 +44,8 @@ private BinaryDocValuesField getDocValue(String fieldName, double[] values) throws IOException {
Iterator<Centroid> iterator = centroids.iterator();
while ( iterator.hasNext()) {
Centroid centroid = iterator.next();
-streamOutput.writeDouble(centroid.mean());
streamOutput.writeVInt(centroid.count());
+streamOutput.writeDouble(centroid.mean());
}
return new BinaryDocValuesField(fieldName, streamOutput.bytes().toBytesRef());
}
@@ -47,8 +47,8 @@ private BinaryDocValuesField getDocValue(String fieldName, double[] values) throws IOException {
Iterator<Centroid> iterator = centroids.iterator();
while ( iterator.hasNext()) {
Centroid centroid = iterator.next();
-streamOutput.writeDouble(centroid.mean());
streamOutput.writeVInt(centroid.count());
+streamOutput.writeDouble(centroid.mean());
}
return new BinaryDocValuesField(fieldName, streamOutput.bytes().toBytesRef());
}