-
Notifications
You must be signed in to change notification settings - Fork 67
Add update method to handle consecutive updates safely #2364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Added an update method in NodeNetworkConfigurationPolicy to wait for NNCP status updates before applying changes. This prevents API failures caused by the admission webhook rejecting updates while NNCP is still in progress. Signed-off-by: Shahaf Bahar <[email protected]>
WalkthroughThe pull request introduces a new method Changes
Possibly related PRs
Suggested labels
Suggested reviewers
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms (3)
🔇 Additional comments (2)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
|
Report bugs in Issues The following are automatically added:
Available user actions:
Supported /retest check runs
Supported labels
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
ocp_resources/node_network_configuration_policy.py (2)
364-369: Check for KeyError exceptions.The current implementation may raise a KeyError if
resource_dictdoesn't contain a "spec" key. While this might be an expected contract from callers, it's safer to handle this case explicitly.- if resource_dict["spec"] and initial_success_status_time: + if "spec" in resource_dict and resource_dict["spec"] and initial_success_status_time:
364-369: Consider adding documentation.This method implements an important safeguard for consecutive updates. Adding a docstring would help developers understand its purpose and behavior.
def update(self, resource_dict: dict[str, Any]) -> None: + """ + Updates the resource with the provided resource dictionary and waits for status updates. + + This method waits for the status of the NodeNetworkConfigurationPolicy to update before + applying changes to prevent API failures when the admission webhook rejects updates + while the NNCP is still processing. + + Args: + resource_dict: Dictionary containing the resource configuration to update + """ initial_success_status_time = self._get_last_successful_transition_time() super().update(resource_dict=resource_dict) if resource_dict["spec"] and initial_success_status_time: self._wait_for_nncp_status_update(initial_transition_time=initial_success_status_time)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ocp_resources/node_network_configuration_policy.py(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: tox
- GitHub Check: python-module-install
- GitHub Check: conventional-title
🔇 Additional comments (3)
ocp_resources/node_network_configuration_policy.py (3)
15-15: Good addition of type hints.Adding the
Anytype import will help with type annotations in the new method.
364-369: Align with the existing pattern in_absent_interface.The method introduces a pattern similar to what already exists in the
_absent_interfacemethod (lines 329-336). This confirms that the approach is consistent with the codebase's existing practices, which is good.
364-369:❓ Verification inconclusive
Consider potential timeouts.
The
_wait_for_nncp_status_updatemethod has a timeout of 1 minute (TIMEOUT_1MINUTE). Consider whether this is sufficient for all environments and use cases. In certain scenarios with high system load, this timeout might be too short.Let's check if there are any contextual clues about appropriate timeout values in the codebase:
🏁 Script executed:
#!/bin/bash # Check for TIMEOUT constants used with NNCP operations rg "TIMEOUT_" --type py "ocp_resources/node_network_configuration"Length of output: 145
Timeout Value Consideration – Please Verify
The current implementation uses a 1-minute timeout in the
_wait_for_nncp_status_updatemethod. When searching for TIMEOUT-related constants, the initial script didn’t return results—likely because it targeted a non-existent directory rather than the actual file. Please manually verify that the 1-minute duration (TIMEOUT_1MINUTE) is defined and appropriate. In environments with high load or complex processing, this value might need to be adjusted.
- Location:
ocp_resources/node_network_configuration_policy.py(lines 364–369)- Next Steps:
- Confirm the definition and usage of TIMEOUT_1MINUTE in the file.
- Assess whether a longer timeout might be more robust across diverse environments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
ocp_resources/node_network_configuration_policy.py (1)
364-369: Great implementation of safe consecutive updates.This new
updatemethod addresses the issue mentioned in the PR objectives where the admission webhook might reject updates while the NNCP is still processing. By waiting for status updates before applying changes, it prevents API failures during consecutive updates.I appreciate the robust check using
resource_dict.get("spec")which handles cases where the key might not exist.Consider adding logging statements for improved visibility:
def update(self, resource_dict: dict[str, Any]) -> None: initial_success_status_time = self._get_last_successful_transition_time() + self.logger.debug(f"Updating {self.name} with resource_dict: {resource_dict}") super().update(resource_dict=resource_dict) if resource_dict.get("spec") and initial_success_status_time: + self.logger.debug(f"Waiting for status update after initial time: {initial_success_status_time}") self._wait_for_nncp_status_update(initial_transition_time=initial_success_status_time)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ocp_resources/node_network_configuration_policy.py(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: tox
- GitHub Check: python-module-install
- GitHub Check: conventional-title
🔇 Additional comments (1)
ocp_resources/node_network_configuration_policy.py (1)
15-15: Appropriate type import for the new method.Good addition of the
Anytype from typing module to properly type-hint the dictionary parameter in the new update method.
| return True | ||
| return False | ||
|
|
||
| def update(self, resource_dict: dict[str, Any]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, the call to update() from _absent_interface() goes through here.
If I am indeed correct, then I think that your PR, which presents a more robust solution because it covers all the cases of calling nncp.update() (and not only the specific case of removing an interface on teardown) should include removal of the time-stamp verification in _absent_interface.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The update call in the _absent_interfaces method currently uses the update method of the ResourceEditor object instead of the NNCP object. I can replace it with the new update method of NNCP and remove the related logic accordingly.
Maybe in a follow up PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I initially thought as well, but then I followed the flow and found that although ResourceEditor does not inherit from Resource, the update() flow does go through update() of Resource class.
IIRC, if you visually trace the code, eventually you can trace the code that leads to that (it's quite cumbersome).
Anyway, to make sure I ran a little check now:
I added this print to Resource.update():
self.logger.info("\n\n******************* going via Resource.update() *******************\n\n")
Then I ran a test that involves NNCP creation and teardown, and indeed found this print in the teardown flow:
-------------------------------------------------------------------------------------------------------------------------- TEARDOWN --------------------------------------------------------------------------------------------------------------------------
2025-03-27T15:49:55.436437 timeout_sampler INFO Elapsed time: 0.00023365020751953125 [0:00:00.000234]
2025-03-27T15:49:55.436839 timeout_sampler INFO Waiting for 60 seconds [0:01:00], retry every 1 seconds. (Function: tests.network.utils.<locals>.lambda: nncp.status)
2025-03-27T15:49:55.646993 timeout_sampler INFO Elapsed time: 0.0002613067626953125 [0:00:00.000261]
2025-03-27T15:49:55.647681 ocp_resources.resource INFO Trying to get client via new_client_from_config
2025-03-27T15:49:55.671476 ocp_resources.resource INFO kind: NodeNetworkState api version: nmstate.io/v1beta1
2025-03-27T15:49:56.875371 ocp_resources.resource INFO ResourceEdits: Updating data for resource NodeNetworkConfigurationPolicy restart-nmstate-net-ys-419-lrlqb-worker-0-qlz29
2025-03-27T15:49:56.875511 ocp_resources NodeNetworkConfigurationPolicy INFO
******************* going via Resource.update() *******************
2025-03-27T15:49:56.875573 ocp_resources NodeNetworkConfigurationPolicy INFO Update NodeNetworkConfigurationPolicy restart-nmstate-net-ys-419-lrlqb-worker-0-qlz29:
{'spec': {'desiredState': {'interfaces': [{'name': 'br-nmstate', 'type': 'linux-bridge', 'state': 'absent', 'bridge': {'options': {'stp': {'enabled': False}}, 'port': [{'name': 'enp5s0'}]}, 'ipv4': {'enabled': False, 'dhcp': False, 'auto-dns': True}, 'ipv6': {'enabled': False, 'dhcp': False, 'auto-dns': True, 'autoconf': False}}, {'name': 'enp5s0', 'type': 'ethernet', 'state': 'up', 'ipv4': {'dhcp': False, 'enabled': False}, 'ipv6': {'autoconf': False, 'dhcp': False, 'enabled': False}}]}}, 'metadata': {'name': 'restart-nmstate-net-ys-419-lrlqb-worker-0-qlz29'}}
2025-03-27T15:49:57.080747 timeout_sampler INFO Waiting for 60 seconds [0:01:00], retry every 5 seconds. (Function: ocp_resources.node_network_configuration_policy._wait_for_nncp_status_update Args: (<utilities.network.LinuxBridgeNodeNetworkConfigurationPolicy object at 0x7f1ce7233fb0>,) Kwargs: {'initial_transition_time': '2025-03-27T13:49:52Z'})
2025-03-27T15:50:02.507453 timeout_sampler INFO Elapsed time: 5.198873996734619 [0:00:05.198874]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done @yossisegev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @sbahar619 .
Please resolve this comment (the author can do that, the reviewer cannot).
|
This solution was not accepted by the maintainers. |
Short description:
Add update method to handle consecutive updates safely
More details:
Add update method to handle consecutive updates safely
What this PR does / why we need it:
Example of failure:
Which issue(s) this PR fixes:
Special notes for reviewer:
Bug:
Summary by CodeRabbit