Merged
Changes from 2 commits
181 changes: 117 additions & 64 deletions docs/reference/ingest/processors/script.asciidoc
@@ -4,99 +4,152 @@
<titleabbrev>Script</titleabbrev>
++++

Allows inline and stored scripts to be executed within ingest pipelines.
Runs an inline or stored <<modules-scripting,script>> on incoming documents. The
script runs in the {painless}/painless-ingest-processor-context.html[`ingest`]
context.

See <<modules-scripting-using, How to use scripts>> to learn more about writing scripts. The Script Processor
leverages caching of compiled scripts for improved performance. Since the
script specified within the processor is potentially re-compiled per document, it is important
to understand how script caching works. To learn more about
caching see <<scripts-and-search-speed, Script Caching>>.
The script processor uses the <<scripts-and-search-speed,script cache>> to avoid
recompiling the script for each incoming document. To improve performance,
ensure the script cache is properly sized before using a script processor in
production.

[[script-options]]
.Script Options
.Script options
[options="header"]
|======
| Name | Required | Default | Description
| `lang` | no | "painless" | The scripting language
| `id` | no | - | The stored script id to refer to
| `source` | no | - | An inline script to be executed
| `params` | no | - | Script Parameters
| Name | Required | Default | Description
| `lang` | no | "painless" | <<scripting-available-languages,Script language>>.
| `id` | no | - | ID of a <<create-stored-script-api,stored script>>.
If no `source` is specified, this parameter is required.
Contributor

I think it's confusing for this table to say 'no' for every parameter in an either/or situation. One of id/source is required, and although the description says so, it's easy to miss at a quick glance. Could we change the 'no' to a '*' or some other indicator that at least one of these is required?

Contributor Author

I agree this isn't ideal, but it's consistent with the other ingest processors. Changing this would break the later include::common-options.asciidoc[] statement, which is used in all our processor reference docs.

I've opened #72717 to address your concerns as a separate effort.

| `source` | no | - | Inline script.
If no `id` is specified, this parameter is required.
| `params` | no | - | Object containing parameters for the script.
include::common-options.asciidoc[]
|======

One of `id` or `source` options must be provided in order to properly reference a script to execute.
[discrete]
[[script-processor-access-source-fields]]
==== Access source fields

You can access the current ingest document from within the script context by using the `ctx` variable.
To access an incoming document's source fields with a Painless script, use the
`ctx.<field>` syntax. The `ctx._source.<field>` and `ctx['_source.<field>']`
Contributor

Some notes here:

  1. I would include a bit more detail about how the document's contents are a set of maps, lists, and primitives based on the parsed JSON from the original source document.
  2. I wonder if the unsupported syntaxes should go in a note section to isolate them. I initially read the sentence as saying I can use ctx._source, because "not supported" comes at the end. Also, ctx['_source.<field>'] should be ctx['_source']['my_field'].
  3. I would prefer to present the appropriate way to access fields as ctx['my_field'] instead of ctx.my_field since the first way can allow any field with special characters, but the second way is just a limited shortcut.
  4. It may be worth having a link to the access operator here. (https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-operators-reference.html#map-access-operator)

Contributor Author

Thanks for this feedback. I pushed 9a1f779 to update this section and use the preferred syntax.

syntaxes are not supported.

The following example sets a new field called `field_a_plus_b_times_c` to be the sum of two existing
numeric fields `field_a` and `field_b` multiplied by the parameter param_c:
The following processor uses a Painless script to extract the `tags` field from
the `env` source field.

[source,js]
--------------------------------------------------
[source,console]
----
POST _ingest/pipeline/_simulate
Contributor

Do we explain what the simulate query is anywhere?

Contributor Author

That's covered in the ingest pipeline docs. I don't feel we need to cover that here.

{
"script": {
"lang": "painless",
"source": "ctx.field_a_plus_b_times_c = (ctx.field_a + ctx.field_b) * params.param_c",
"params": {
"param_c": 10
"pipeline": {
"processors": [
{
"script": {
"description": "Extract 'tags' from 'env' field",
"lang": "painless",
"source": """
String[] envSplit = ctx?.env.splitOnToken(params.delimiter);
Contributor

Is there a case where ctx can actually be null? Also, we still access envSplit anyway, so if ctx were null this is still going to throw an NPE. I would not include the null-safe operator (`?.`) here.
ctx?.env.splitOnToken(params.delimiter); --> ctx.env.splitOnToken(params.delimiter);

Contributor Author

Good point. I've changed this to remove the shorthand syntax anyway.

ArrayList tags = new ArrayList();
tags.add(envSplit[params.position].trim());
ctx.tags = tags;
""",
"params": {
"delimiter": "-",
"position": 1
}
}
}
]
},
"docs": [
{
"_source": {
"env": "es01-prod"
}
}
}
]
}
--------------------------------------------------
// NOTCONSOLE
----

It is possible to use the Script Processor to manipulate document metadata like `_index` during
ingestion. Here is an example of an Ingest Pipeline that renames the index to `my-index` no matter what
was provided in the original index request:
The processor produces:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/my-index
[source,console-result]
----
{
"description": "use index:my-index",
"processors": [
"docs": [
{
"script": {
"source": """
ctx._index = 'my-index';
"""
"doc": {
...
"_source": {
"env": "es01-prod",
"tags": [
"prod"
]
}
}
}
]
}
--------------------------------------------------
----
// TESTRESPONSE[s/\.\.\./"_index":"_index","_id":"_id","_ingest":{"timestamp":$body.docs.0.doc._ingest.timestamp},/]
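
For readers who want to check the expected result without a cluster, the logic of the Painless script in the simulate request above can be modeled in plain Python. This is only a sketch of the string handling, not how Elasticsearch runs the script: the function name `extract_tags` is hypothetical, and `str.split` and `str.strip` are assumed to stand in for Painless `splitOnToken` and `trim` for this simple input.

```python
def extract_tags(ctx: dict, params: dict) -> dict:
    """Model of the example script: split `env` and store one token in `tags`.

    `ctx` stands in for the ingest document and `params` for the script
    parameters. Both are plain dicts in this sketch (an assumption).
    """
    # Split the `env` value on the delimiter, like the splitOnToken call.
    env_split = ctx["env"].split(params["delimiter"])
    # Keep the token at `position`, with surrounding whitespace removed.
    tags = [env_split[params["position"]].strip()]
    # Write the list back to the document as the new `tags` field.
    ctx["tags"] = tags
    return ctx


doc = extract_tags({"env": "es01-prod"}, {"delimiter": "-", "position": 1})
print(doc)  # {'env': 'es01-prod', 'tags': ['prod']}
```

The printed document matches the `_source` shown in the simulate response: `env` is unchanged and `tags` holds the single value `prod`.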


Using the above pipeline, we can attempt to index a document into the `any-index` index.
[discrete]
[[script-processor-access-metadata-fields]]
==== Access metadata fields

You can also use a script processor to access metadata fields. The following
processor uses a Painless script to set an incoming document's `_index`.

[source,console]
--------------------------------------------------
PUT any-index/_doc/1?pipeline=my-index
----
POST _ingest/pipeline/_simulate
{
"message": "text"
"pipeline": {
"processors": [
{
"script": {
"description": "Set index based on `lang` field and `dataset` param",
"lang": "painless",
"source": """
ctx._index = ctx.lang + '-' + params.dataset;
""",
"params": {
"dataset": "catalog"
}
}
}
]
},
"docs": [
{
"_index": "generic-index",
"_source": {
"lang": "fr"
}
}
]
}
--------------------------------------------------
// TEST[continued]
----

The response from the above index request:
The processor changes the document's `_index` to `fr-catalog` from
`generic-index`.

[source,console-result]
--------------------------------------------------
----
{
"_index": "my-index",
"_id": "1",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 89,
"_primary_term": 1,
"docs": [
{
"doc": {
...
"_index": "fr-catalog",
"_source": {
"lang": "fr"
}
}
}
]
}
--------------------------------------------------
// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

In the above response, you can see that our document was actually indexed into `my-index` instead of
`any-index`. This type of manipulation is often convenient in pipelines that have various branches of transformation,
and depending on the progress made, indexed into different indices.
----
// TESTRESPONSE[s/\.\.\./"_id":"_id","_ingest":{"timestamp":$body.docs.0.doc._ingest.timestamp},/]
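
The `_index` rewrite in the metadata example can be modeled the same way. The sketch below is a plain-Python approximation under stated assumptions: the function name `set_index` is hypothetical, and the document is represented as a dict with separate `_index` and `_source` keys, mirroring the `docs` entry in the simulate request rather than the flattened `ctx` map that the script sees.

```python
def set_index(doc: dict, params: dict) -> dict:
    """Model of the example script: build `_index` from `lang` and `dataset`."""
    # Read the `lang` source field from the incoming document.
    lang = doc["_source"]["lang"]
    # Join it with the `dataset` parameter, as the script does with '+'.
    doc["_index"] = lang + "-" + params["dataset"]
    return doc


doc = set_index(
    {"_index": "generic-index", "_source": {"lang": "fr"}},
    {"dataset": "catalog"},
)
print(doc["_index"])  # fr-catalog
```

As in the simulate response, the document's `_index` changes from `generic-index` to `fr-catalog` and the `_source` is left untouched.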