feat: ESQL query validation against Elastic cluster #4955
base: main
@@ -1432,15 +1432,14 @@ def get_packaged_integrations(
# if both exist, rule tags are only used if defined in definitions for non-dataset packages
# of machine learning analytic packages

rule_integrations = meta.get("integration", [])
if rule_integrations:
for integration in rule_integrations:
Simple style fix: replaces the if condition with a more robust default-value expression via rule_integrations = meta.get("integration") or []
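To illustrate why the `or []` form is more robust: `dict.get(key, default)` only falls back to the default when the key is missing, so a key that is present but set to `None` still yields `None`. The sketch below uses a made-up `meta` dictionary, not the project's actual rule metadata.

```python
# Hypothetical metadata where the key exists but holds None.
meta = {"integration": None}

with_default = meta.get("integration", [])  # key exists, so the default is ignored
with_or = meta.get("integration") or []     # any falsy value falls back to []

print(with_default)  # None
print(with_or)       # []

# Iterating the `or []` result is always safe, so no guarding `if` is needed.
for integration in with_or:
    print(integration)
```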
@@ -1754,7 +1753,7 @@ def parse_datasets(datasets: list[str], package_manifest: dict[str, Any]) -> lis
else:
package = value

if package in list(package_manifest):
if package in package_manifest:
small style fix
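For context, a membership test on a dict checks its keys directly, so wrapping it in `list()` only adds a copy and a linear scan. A small sketch with a hypothetical manifest (the package names and versions are invented for illustration):

```python
# Hypothetical manifest keyed by package name.
package_manifest = {"aws": {"version": "1.0.0"}, "okta": {"version": "2.0.0"}}

package = "okta"

# Both expressions give the same answer.
via_list = package in list(package_manifest)  # builds a list of keys, then scans it
via_dict = package in package_manifest        # hash lookup on the keys, no copy

print(via_list, via_dict)  # True True
```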
detection_rules/rule_validators.py
Outdated
log(f"Got query columns: {', '.join(query_column_names)}")

# FIXME: validate the dynamic columns
The columns returned from the cluster must be validated against the input mapping, and the dynamic fields checked for validity.
At the moment (before any field validation), the test marks 33 of 75 rules as invalid. The tests were executed against a vanilla local The many errors are most probably due to bugs in the code, so I expect the number of invalid rules to go down once those are fixed. full log
Updated to include initial dynamic field validation. This parses the schema(s) for dynamic fields and performs an initial formatting check: it verifies that the field has a proper prefix as described in #4909 and that it is based on a field that is present in the schema. However, additional validation will be needed if we want to validate the proper types for ES|QL function and operator return values. https://www.elastic.co/docs/reference/query-languages/esql/esql-functions-operators
Additionally, a number of the errors seen in the above testing are due to schema updates that do not have the required fields. For instance. Next steps are:
Note: after discussion with @Mikaayenson, we determined that the sub-field of the dynamic query does not need ECS enforcement here. E.g. For
While the PR is ready for review from a logic perspective, we also still need to validate that the 48 rules that are currently in error are correctly in error. For instance, we are aware of the use of
I would expect that these errors are ones we want to ignore. With this assumption the current rule stats are:
Enhancement - Guidelines
These guidelines serve as a reminder set of considerations when adding a feature to the code.
Documentation and Context
Code Standards and Practices
Testing
Additional Checks
return True

def create_remote_indices(
should this go in remote_validation.py?
Is there a reason it should be moved out of the ESQLValidator?
In the current code, remote_validation for ESQL simply calls the ESQLValidator's remote functions. The existing purpose of remote_validation is to provide a lightweight means of wrapping the Stack API interfaces to validate a query.
To me, moving it to remote_validation would break form with the other query languages, since handling the lack of indexes, etc. is a hard failure if not done elsewhere.
E.g. for EQL
def validate_eql(self, contents: TOMLRuleContents) -> dict[str, Any]:
    """Validate query for "eql" rule types."""
    query = contents.data.query  # type: ignore[reportAttributeAccessIssue]
    rule_id = contents.data.rule_id
    index = contents.data.index  # type: ignore[reportAttributeAccessIssue]
    time_range = {"range": {"@timestamp": {"gt": "now-1h/h", "lte": "now", "format": "strict_date_optional_time"}}}
    body: dict[str, Any] = {"query": query}
    if not self.es_client:
        raise ValueError("No ES client found")
    if not index:
        raise ValueError("Indices not found")
ESQL is a unique case: because we cannot parse the indexes from the query, we in effect have to fall back on the ESQLValidator class. In the future, remote_validation should not use this to validate the rule and should send the rule directly to the stack, separate from any setup. However, we cannot currently separate query syntax validation, index parsing, etc. from remote query validation (for EQL and KQL these solve different problems). We therefore depend on query validation happening first, which requires all of the index setup, etc. That setup happens in ESQLValidator, just as query syntax validation happens in the respective validator classes for the other languages.
return full_index_str

def execute_query_against_indices(
should this go in remote_validation.py?
No, this logical path is highly dependent on the index modifications that are required as part of the query syntax validation. Remote validation should support running one's query against supplied indexes, but since we cannot parse them, we cannot support this as directly.
Remote validation's purpose is to validate the rule against a stack setup, assuming syntax validation is already done. Given that we cannot separate the two in our case, we depend on stack validation for syntax validation.
return nested_multifields  # type: ignore[reportUnknownVariableType]

def get_ecs_schema_mappings(self, current_version: Version) -> dict[str, Any]:
should this go in ecs.py?
It could, but it would be specific for loading an index mapping for ESQL. This function is primarily taking the ecs mapping and editing it to an index mapping format, with specific handling of scaled floats. I think this boils down to preference, fine with me either way.
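As a rough illustration of the kind of conversion described in this reply, the sketch below turns a flat `field -> type` dictionary into a nested index-mapping `properties` structure and adds a `scaling_factor` to `scaled_float` fields, which Elasticsearch requires for that type. The function name, the input shape, and the factor of 1000 are assumptions for illustration only and are not taken from the PR.

```python
from typing import Any


def flat_schema_to_index_mapping(flat: dict[str, str], scaling_factor: int = 1000) -> dict[str, Any]:
    """Hypothetical helper: convert {"a.b": "keyword"} into a nested index mapping."""
    mapping: dict[str, Any] = {"properties": {}}
    for dotted_name, field_type in flat.items():
        node = mapping["properties"]
        parts = dotted_name.split(".")
        # Walk (or create) the intermediate object levels.
        for part in parts[:-1]:
            node = node.setdefault(part, {}).setdefault("properties", {})
        leaf: dict[str, Any] = {"type": field_type}
        if field_type == "scaled_float":
            # A scaled_float mapping needs a scaling_factor; 1000 is an assumed value.
            leaf["scaling_factor"] = scaling_factor
        # Merge instead of overwrite so existing child properties are kept.
        node.setdefault(parts[-1], {}).update(leaf)
    return mapping


example = flat_schema_to_index_mapping({"host.name": "keyword", "host.cpu.usage": "scaled_float"})
print(example)
```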
def remote_validate_rule_contents(
    self, kibana_client: Kibana, elastic_client: Elasticsearch, contents: TOMLRuleContents, verbosity: int = 0
) -> ObjectApiResponse[Any]:
    """Remote validate a rule's ES|QL query using an Elastic Stack."""
    return self.remote_validate_rule(
        kibana_client=kibana_client,
        elastic_client=elastic_client,
        query=contents.data.query,  # type: ignore[reportUnknownVariableType]
        metadata=contents.metadata,
        rule_id=contents.data.rule_id,
        verbosity=verbosity,
    )

def remote_validate_rule(  # noqa: PLR0913
should these go in remote_validation.py?
No, see https://github.com/elastic/detection-rules/pull/4955/files/9e1150cbd9962a13e9ce11fd564eaea3855030bf#r2334535612 and https://github.com/elastic/detection-rules/pull/4955/files/9e1150cbd9962a13e9ce11fd564eaea3855030bf#r2334527576 for more detail, but in short, this is remote syntax validation. The remote aspect will go away once local ESQL syntax validation is available. The remote_validation.py workflows are not for query language syntax validation, but for running the query against the stack and returning the response (not specifically implying valid or invalid; that is left to the calling function).
return nested_schema  # type: ignore[reportUnknownVariableType]


def combine_dicts(dest: dict[Any, Any], src: dict[Any, Any]) -> None:
Is this supposed to be dict.update(dict)?
No, dict.update is a replace/update function.
This is a recursive merge (in effect combining the dictionaries vs overwriting the keys).
Example:
>>> from typing import Any
>>> def combine_dicts(dest: dict[Any, Any], src: dict[Any, Any]) -> None:
...     """Combine two dictionaries recursively."""
...     for k, v in src.items():
...         if k in dest and isinstance(dest[k], dict) and isinstance(v, dict):
...             combine_dicts(dest[k], v)  # type: ignore[reportUnknownVariableType]
...         else:
...             dest[k] = v
...
>>> dest = {'a': 1, 'b': {'x': 10, 'y': 20}}
>>> src = {'b': {'y': 30, 'z': 40}, 'c': 3}
>>> combine_dicts(dest, src)
>>> dest
{'a': 1, 'b': {'x': 10, 'y': 30, 'z': 40}, 'c': 3}
>>> dest = {'a': 1, 'b': {'x': 10, 'y': 20}}
>>> dest.update(src)
>>> dest
{'a': 1, 'b': {'y': 30, 'z': 40}, 'c': 3}
…tection-rules into esql-field-validation
Considerations from discussion with @Mikaayenson :
Pull Request
Issue link(s):
Summary - What I changed
As a note to reviewers, the entry point when validating a given rule is through remote_validate_rule.
Another note: in some integrations (specifically Okta) there are fields defined in the integration where the mapping is not directly supported in the stack. See details below for an example. Fleet handles these cases by removing the offending fields. As such, this PR proposes a similar process. See find_nested_multifields for the core logic for identifying these offending fields.
Details
When using the Okta mapping as-is, one would receive the following error:
We can see in the integration YAML
(Relevant Snippet)
logOnlySecurityData is a keyword but has fields. behaviors is a field of logOnlySecurityData and is also a keyword, but it also has fields like New_City, which is not allowed according to the error message.
When installing the integration through Fleet, one can see that it strips the sub-fields under behaviors.
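The idea of locating such offending fields can be sketched as follows. This is a minimal illustration, assuming an Elasticsearch-style mapping where object children live under `properties` and multi-fields under `fields`. The helper name `find_nested_multifield_paths` and the sample mapping are hypothetical and do not reproduce the PR's `find_nested_multifields`.

```python
from typing import Any


def find_nested_multifield_paths(
    mapping: dict[str, Any], path: str = "", in_multifield: bool = False
) -> list[str]:
    """Hypothetical helper: list multi-fields that themselves declare multi-fields."""
    offending: list[str] = []
    for name, definition in mapping.items():
        if not isinstance(definition, dict):
            continue
        full_path = f"{path}.{name}" if path else name
        has_subfields = isinstance(definition.get("fields"), dict)
        if in_multifield and has_subfields:
            # A multi-field with its own "fields" block: the nested case.
            offending.append(full_path)
        # Object-style children live under "properties", multi-fields under "fields".
        offending += find_nested_multifield_paths(definition.get("properties", {}), full_path, False)
        if has_subfields:
            offending += find_nested_multifield_paths(definition["fields"], full_path, True)
    return offending


# Assumed mapping shape modelled on the Okta example described above.
okta_like_mapping = {
    "logOnlySecurityData": {
        "type": "keyword",
        "fields": {
            "behaviors": {
                "type": "keyword",
                "fields": {"New_City": {"type": "keyword"}},
            }
        },
    }
}

nested = find_nested_multifield_paths(okta_like_mapping)
print(nested)  # ['logOnlySecurityData.behaviors']
```

A caller could then strip the `fields` block at each reported path before creating the index, similar in spirit to what the description says Fleet does.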
How To Test
.detection-rules-cfg.yml) or from the environment variables
Once you have the environment variables set up and the stack ready, you can test the remote validation with the following command:
python -m pytest tests/test_rules_remote.py::TestRemoteRules::test_esql_rules -s -v
Note, -v is optional but provides useful debugging information.
Also, test remote validation with the rule loader through view-rule via the following:
export DR_REMOTE_ESQL_VALIDATION=True
python -m detection_rules view-rule rules/linux/discovery_port_scanning_activity_from_compromised_host.toml
Checklist
bug, enhancement, schema, maintenance, Rule: New, Rule: Deprecation, Rule: Tuning, Hunt: New, or Hunt: Tuning so guidelines can be generated
meta:rapid-merge label if planning to merge within 24 hours
Contributor checklist