Fix(container-kill): Adds statusCheckTimeout to container kill recovery #498

Merged
merged 2 commits into from
Apr 14, 2022
uditgaurav
Member

Signed-off-by: uditgaurav [email protected]

What this PR does / why we need it:

  • Adds statusCheckTimeout to container kill recovery

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

Checklist:

  • Fixes #
  • PR message has documentation-related information
  • Labelled this PR & related issue with breaking-changes tag
  • PR message has breaking-changes-related information
  • Labelled this PR & related issue with requires-upgrade tag
  • PR message has upgrade-related information
  • Commit has unit tests
  • Commit has integration tests
  • E2E run Required for the changes

@uditgaurav uditgaurav merged commit 7d7adcb into litmuschaos:master Apr 14, 2022
uditgaurav added a commit that referenced this pull request Jun 13, 2022
* Chore(stress-chaos): Run CPU chaos with percentage of cpu cores (#482)

* Chore(stress-chaos): Run CPU chaos with percentage of cores

Signed-off-by: uditgaurav <[email protected]>

* Fixing alpine CVEs by upgrading the version (#486)

* Chore(vulnerability): Remove openebs retry module and update pkgs (#488)

* Chore(vulnerability): Fix some vulnerabilities by updating the pkgs

Signed-off-by: uditgaurav <[email protected]>

* Chore(vulnerability): Remove openebs retry module and update pkgs

Signed-off-by: udit <[email protected]>

* Chore(cgroup): Add support for cgroup version2 in stress-chaos experiment (#490)

Signed-off-by: uditgaurav <[email protected]>

* Chore(snyk): Fix snyk security scan on litmus-go (#492)

Signed-off-by: uditgaurav <[email protected]>

* Chore(network-chaos): Randomize Chaos Tunables for Network Chaos Experiment (#491)

* Chore(network-chaos):

Signed-off-by: uditgaurav <[email protected]>

* Chore(network-chaos): Randomize Chaos Tunables for Network Chaos Experiment

Signed-off-by: uditgaurav <[email protected]>

Co-authored-by: Karthik Satchitanand <[email protected]>

* Chore(randomize): Randomize stress-chaos tunables (#487)

* Chore(randomize): Randomize stress-chaos tunables

Signed-off-by: uditgaurav <[email protected]>

* Update stress-chaos.go

* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill (#493)

* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill

Signed-off-by: uditgaurav <[email protected]>

* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill

Signed-off-by: uditgaurav <[email protected]>

* (enhancement)experiment: add node label filter for pod network and stress chaos (#494)

Signed-off-by: uditgaurav <[email protected]>

* Fix(targetContainer): Incorrect target container passed in the helper pod for pod level experiments (#496)

* Fix target container issue

Signed-off-by: uditgaurav <[email protected]>

* Fix target container issue

Signed-off-by: uditgaurav <[email protected]>

* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#498)

Signed-off-by: uditgaurav <[email protected]>

* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#499)

Signed-off-by: uditgaurav <[email protected]>

* Chore(warn): Remove warning Neither --kubeconfig nor --master was specified for InClusterConfig (#507)

Signed-off-by: uditgaurav <[email protected]>

* Chore(ssm): Update the ssm file path in the Dockerfile (#508)

Signed-off-by: uditgaurav <[email protected]>

* GCP Experiments Refactor, New Label Selector Experiments and IAM Integration (#495)

* experiment init

Signed-off-by: neelanjan00 <[email protected]>

* updated experiment file

Signed-off-by: neelanjan00 <[email protected]>

* updated experiment lib

Signed-off-by: neelanjan00 <[email protected]>

* updated post chaos validation

Signed-off-by: neelanjan00 <[email protected]>

* updated empty slices to nil, updated experiment name in environment.go

Signed-off-by: neelanjan00 <[email protected]>

* removed experiment charts

Signed-off-by: neelanjan00 <[email protected]>

* bootstrapped gcp-vm-disk-loss-by-label artifacts

Signed-off-by: neelanjan00 <[email protected]>

* removed device-names input for gcp-vm-disk-loss experiment, added API calls to derive device name internally

Signed-off-by: neelanjan00 <[email protected]>

* removed redundant condition check in gcp-vm-disk-loss experiment pre-requisite checks

Signed-off-by: neelanjan00 <[email protected]>

* reformatted error messages

Signed-off-by: neelanjan00 <[email protected]>

* replaced the SetTargetInstances function

Signed-off-by: neelanjan00 <[email protected]>

* added settargetdisk function for getting target disk names using label

Signed-off-by: neelanjan00 <[email protected]>

* refactored Target Disk Attached VM Instance memorisation, updated vm-disk-loss and added lib logic for vm-disk-loss-by-label experiment

Signed-off-by: neelanjan00 <[email protected]>

* added experiment to bin and cleared default experiment name in environment.go

Signed-off-by: neelanjan00 <[email protected]>

* removed charts

Signed-off-by: neelanjan00 <[email protected]>

* updated test.yml

Signed-off-by: neelanjan00 <[email protected]>

* updated AutoScalingGroup to ManagedInstanceGroup; updated logic for checking InstanceStop recovery for ManagedInstanceGroup VMs; Updated log and error messages with VM names

Signed-off-by: neelanjan00 <[email protected]>

* removed redundant computeService code snippets

Signed-off-by: neelanjan00 <[email protected]>

* removed redundant computeService code snippets in gcp-disk-loss experiments

Signed-off-by: neelanjan00 <[email protected]>

* updated logic for deriving default gcp sa credentials for computeService

Signed-off-by: neelanjan00 <[email protected]>

* updated logging for IAM integration

Signed-off-by: neelanjan00 <[email protected]>

* refactored log and error messages and wait for start/stop instances logic

Signed-off-by: neelanjan00 <[email protected]>

* fixed logs, optimised control statements, added comments, corrected experiment names

Signed-off-by: neelanjan00 <[email protected]>

* fixed file exists check logic

Signed-off-by: Neelanjan Manna <[email protected]>

* updated instance and device name fetch logic for disk loss

Signed-off-by: Neelanjan Manna <[email protected]>

* updated logs

Signed-off-by: Neelanjan Manna <[email protected]>

* update(sdk): updating litmus sdk for the defaultAppHealthCheck (#513)

Signed-off-by: shubhamc <[email protected]>

Co-authored-by: shubhamc <[email protected]>

* fix: updated release workflow (#512)

Signed-off-by: Soumya Ghosh Dastidar <[email protected]>

* Added Active Node Count Check using AWS APIs (#500)

* Added node count check using aws apis

Signed-off-by: Akash Shrivastava <[email protected]>

* Added node count check using aws apis to instance terminate by tag experiment

Signed-off-by: Akash Shrivastava <[email protected]>

* Log improvements; Code improvement in findActiveNodeCount function;

Signed-off-by: Akash Shrivastava <[email protected]>

* Added log for instance status check failed in find active node count

Signed-off-by: Akash Shrivastava <[email protected]>

* Added check if active node count is less than provided instance ids

Signed-off-by: Akash Shrivastava <[email protected]>

* updated appns podlist filtering error handling (#515)

Signed-off-by: Neelanjan Manna <[email protected]>

Co-authored-by: Udit Gaurav <[email protected]>
Co-authored-by: Vedant Shrotria <[email protected]>

* return error if node not present (#516)

Signed-off-by: Akash Shrivastava <[email protected]>

* Chore(helper pod): Make setHelper data as tunable (#519)

Signed-off-by: uditgaurav <[email protected]>

Co-authored-by: Udit Gaurav <[email protected]>
Co-authored-by: Raj Babu Das <[email protected]>
Co-authored-by: Karthik Satchitanand <[email protected]>
Co-authored-by: Shubham Chaudhary <[email protected]>
Co-authored-by: shubhamc <[email protected]>
Co-authored-by: Soumya Ghosh Dastidar <[email protected]>
Co-authored-by: Akash Shrivastava <[email protected]>
Co-authored-by: Vedant Shrotria <[email protected]>
uditgaurav added a commit that referenced this pull request Jun 14, 2022
* modified the cmdProbe for inline mode of execution to accommodate litmusd

Signed-off-by: neelanjan00 <[email protected]>

* go mod tidy

Signed-off-by: neelanjan00 <[email protected]>

* bootstrapped process-kill experiment files

Signed-off-by: neelanjan00 <[email protected]>

* updated types.go and environment.go

Signed-off-by: neelanjan00 <[email protected]>

* updated secret envs

Signed-off-by: neelanjan00 <[email protected]>

* updated experiment logic and added steady state validation steps

Signed-off-by: neelanjan00 <[email protected]>

* removed action from probe refactor function parameters

Signed-off-by: neelanjan00 <[email protected]>

* added serial and parallel chaos execution steps

Signed-off-by: neelanjan00 <[email protected]>

* added conn parameter to probe

Signed-off-by: neelanjan00 <[email protected]>

* added logic for closing websocket in the end of the experiment

Signed-off-by: neelanjan00 <[email protected]>

* added experiment to bin

Signed-off-by: neelanjan00 <[email protected]>

* corrected the agent endpoint

Signed-off-by: neelanjan00 <[email protected]>

* corrected environment.go

Signed-off-by: neelanjan00 <[email protected]>

* updated logs, removed close message and added parallel sequence as default

Signed-off-by: neelanjan00 <[email protected]>

* updated experiment charts

Signed-off-by: neelanjan00 <[email protected]>

* updated experiment charts

Signed-off-by: neelanjan00 <[email protected]>

* updated authorization header, replaced Processes struct with int slice of pids

Signed-off-by: neelanjan00 <[email protected]>

* restored experiment image

Signed-off-by: neelanjan00 <[email protected]>

* updated test.yml

Signed-off-by: neelanjan00 <[email protected]>

* added rbac, README, exported charts

Signed-off-by: neelanjan00 <[email protected]>

* added websocket connection to chaos details struct, restored probe functions params

Signed-off-by: neelanjan00 <[email protected]>

* removed websocket connection in chaoslib params

Signed-off-by: neelanjan00 <[email protected]>

* updated code function

Signed-off-by: neelanjan00 <[email protected]>

* updated readme

Signed-off-by: neelanjan00 <[email protected]>

* restructured directories, added m-agent tag

Signed-off-by: neelanjan00 <[email protected]>

* updated workflow branch

Signed-off-by: neelanjan00 <[email protected]>

* removed guest-os pkg

Signed-off-by: neelanjan00 <[email protected]>

* Chore(stress-chaos): Run CPU chaos with percentage of cpu cores (#482)

* Chore(stress-chaos): Run CPU chaos with percentage of cores

Signed-off-by: uditgaurav <[email protected]>

* updated client side m-agent design; added channelised message sending

Signed-off-by: neelanjan00 <[email protected]>

* added liveness check for process kill

Signed-off-by: neelanjan00 <[email protected]>

* updated mutex lock to an RWMutex lock, locked read operations on the map

Signed-off-by: neelanjan00 <[email protected]>

* Fixing alpine CVEs by upgrading the version (#486)

* updated WaitForDurationAndCheckLiveness function

Signed-off-by: neelanjan00 <[email protected]>

* updated cpu-stress experiment and steady-state condition

Signed-off-by: neelanjan00 <[email protected]>

* corrected probe format

Signed-off-by: neelanjan00 <[email protected]>

* added functionality for multiple websocket connections

Signed-off-by: neelanjan00 <[email protected]>

* updated liveness check to test for all the connections and added parallel chaos injection

Signed-off-by: neelanjan00 <[email protected]>

* updated m-agent cmd probe for only one agent endpoint

Signed-off-by: neelanjan00 <[email protected]>

* updated underChaosEndpoints for abort

Signed-off-by: neelanjan00 <[email protected]>

* optimised make connections logic

Signed-off-by: neelanjan00 <[email protected]>

* removed redundant check and comments

Signed-off-by: neelanjan00 <[email protected]>

* updated comments for function

Signed-off-by: neelanjan00 <[email protected]>

* updated chaosInterval timer for fixing infinitely running chaosInterval

Signed-off-by: neelanjan00 <[email protected]>

* added CLOSE_CONNECTION action for closure of websocket connections

Signed-off-by: neelanjan00 <[email protected]>

* Chore(vulnerability): Remove openebs retry module and update pkgs (#488)

* Chore(vulnerability): Fix some vulnerabilities by updating the pkgs

Signed-off-by: uditgaurav <[email protected]>

* Chore(vulnerability): Remove openebs retry module and update pkgs

Signed-off-by: udit <[email protected]>

* added chaos revert logic

Signed-off-by: neelanjan00 <[email protected]>

* updated connection close on ERROR functionality and return on Read error

Signed-off-by: neelanjan00 <[email protected]>

* added log for chaos revert

Signed-off-by: neelanjan00 <[email protected]>

* reverted env params

Signed-off-by: neelanjan00 <[email protected]>

* added abort log info, added defer close statement to message listener, added load percentage validation

Signed-off-by: neelanjan00 <[email protected]>

* updated probe error feedback, removed charts

Signed-off-by: neelanjan00 <[email protected]>

* updated mutex locks for RLock and RUnlock, updated connect agent function parameters

Signed-off-by: neelanjan00 <[email protected]>

* Chore(cgroup): Add support for cgroup version2 in stress-chaos experiment (#490)

Signed-off-by: uditgaurav <[email protected]>

* updated mutex locks

Signed-off-by: neelanjan00 <[email protected]>

* Chore(snyk): Fix snyk security scan on litmus-go (#492)

Signed-off-by: uditgaurav <[email protected]>

* Chore(network-chaos): Randomize Chaos Tunables for Network Chaos Experiment (#491)

* Chore(network-chaos):

Signed-off-by: uditgaurav <[email protected]>

* Chore(network-chaos): Randomize Chaos Tunables for Network Chaos Experiment

Signed-off-by: uditgaurav <[email protected]>

Co-authored-by: Karthik Satchitanand <[email protected]>

* Chore(randomize): Randomize stress-chaos tunables (#487)

* Chore(randomize): Randomize stress-chaos tunables

Signed-off-by: uditgaurav <[email protected]>

* Update stress-chaos.go

* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill (#493)

* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill

Signed-off-by: uditgaurav <[email protected]>

* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill

Signed-off-by: uditgaurav <[email protected]>

* (enhancement)experiment: add node label filter for pod network and stress chaos (#494)

Signed-off-by: uditgaurav <[email protected]>

* Fix(targetContainer): Incorrect target container passed in the helper pod for pod level experiments (#496)

* Fix target container issue

Signed-off-by: uditgaurav <[email protected]>

* Fix target container issue

Signed-off-by: uditgaurav <[email protected]>

* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#498)

Signed-off-by: uditgaurav <[email protected]>

* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#499)

Signed-off-by: uditgaurav <[email protected]>

* Chore(warn): Remove warning Neither --kubeconfig nor --master was specified for InClusterConfig (#507)

Signed-off-by: uditgaurav <[email protected]>

* Chore(ssm): Update the ssm file path in the Dockerfile (#508)

Signed-off-by: uditgaurav <[email protected]>

* GCP Experiments Refactor, New Label Selector Experiments and IAM Integration (#495)

* experiment init

Signed-off-by: neelanjan00 <[email protected]>

* updated experiment file

Signed-off-by: neelanjan00 <[email protected]>

* updated experiment lib

Signed-off-by: neelanjan00 <[email protected]>

* updated post chaos validation

Signed-off-by: neelanjan00 <[email protected]>

* updated empty slices to nil, updated experiment name in environment.go

Signed-off-by: neelanjan00 <[email protected]>

* removed experiment charts

Signed-off-by: neelanjan00 <[email protected]>

* bootstrapped gcp-vm-disk-loss-by-label artifacts

Signed-off-by: neelanjan00 <[email protected]>

* removed device-names input for gcp-vm-disk-loss experiment, added API calls to derive device name internally

Signed-off-by: neelanjan00 <[email protected]>

* removed redundant condition check in gcp-vm-disk-loss experiment pre-requisite checks

Signed-off-by: neelanjan00 <[email protected]>

* reformatted error messages

Signed-off-by: neelanjan00 <[email protected]>

* replaced the SetTargetInstances function

Signed-off-by: neelanjan00 <[email protected]>

* added settargetdisk function for getting target disk names using label

Signed-off-by: neelanjan00 <[email protected]>

* refactored Target Disk Attached VM Instance memorisation, updated vm-disk-loss and added lib logic for vm-disk-loss-by-label experiment

Signed-off-by: neelanjan00 <[email protected]>

* added experiment to bin and cleared default experiment name in environment.go

Signed-off-by: neelanjan00 <[email protected]>

* removed charts

Signed-off-by: neelanjan00 <[email protected]>

* updated test.yml

Signed-off-by: neelanjan00 <[email protected]>

* updated AutoScalingGroup to ManagedInstanceGroup; updated logic for checking InstanceStop recovery for ManagedInstanceGroup VMs; Updated log and error messages with VM names

Signed-off-by: neelanjan00 <[email protected]>

* removed redundant computeService code snippets

Signed-off-by: neelanjan00 <[email protected]>

* removed redundant computeService code snippets in gcp-disk-loss experiments

Signed-off-by: neelanjan00 <[email protected]>

* updated logic for deriving default gcp sa credentials for computeService

Signed-off-by: neelanjan00 <[email protected]>

* updated logging for IAM integration

Signed-off-by: neelanjan00 <[email protected]>

* refactored log and error messages and wait for start/stop instances logic

Signed-off-by: neelanjan00 <[email protected]>

* fixed logs, optimised control statements, added comments, corrected experiment names

Signed-off-by: neelanjan00 <[email protected]>

* fixed file exists check logic

Signed-off-by: Neelanjan Manna <[email protected]>

* updated instance and device name fetch logic for disk loss

Signed-off-by: Neelanjan Manna <[email protected]>

* updated logs

Signed-off-by: Neelanjan Manna <[email protected]>

* update(sdk): updating litmus sdk for the defaultAppHealthCheck (#513)

Signed-off-by: shubhamc <[email protected]>

Co-authored-by: shubhamc <[email protected]>

* fix: updated release workflow (#512)

Signed-off-by: Soumya Ghosh Dastidar <[email protected]>

* Added Active Node Count Check using AWS APIs (#500)

* Added node count check using aws apis

Signed-off-by: Akash Shrivastava <[email protected]>

* Added node count check using aws apis to instance terminate by tag experiment

Signed-off-by: Akash Shrivastava <[email protected]>

* Log improvements; Code improvement in findActiveNodeCount function;

Signed-off-by: Akash Shrivastava <[email protected]>

* Added log for instance status check failed in find active node count

Signed-off-by: Akash Shrivastava <[email protected]>

* Added check if active node count is less than provided instance ids

Signed-off-by: Akash Shrivastava <[email protected]>

* updated appns podlist filtering error handling (#515)

Signed-off-by: Neelanjan Manna <[email protected]>

Co-authored-by: Udit Gaurav <[email protected]>
Co-authored-by: Vedant Shrotria <[email protected]>

* go mod tidy

Signed-off-by: neelanjan00 <[email protected]>

* return error if node not present (#516)

Signed-off-by: Akash Shrivastava <[email protected]>

* Chore(helper pod): Make setHelper data as tunable (#519)

Signed-off-by: uditgaurav <[email protected]>

* added CPUs check in prerequisites check

Signed-off-by: Neelanjan Manna <[email protected]>

* removed .DS_Store

Signed-off-by: Neelanjan Manna <[email protected]>

* removed .DS_Store

Signed-off-by: Neelanjan Manna <[email protected]>

* updated rbac and readme

Signed-off-by: Neelanjan Manna <[email protected]>

* removed .DS_Store

Signed-off-by: Neelanjan Manna <[email protected]>

* updated qemu github action

Signed-off-by: Neelanjan Manna <[email protected]>

* updated qemu action version

Signed-off-by: Neelanjan Manna <[email protected]>

* updated m-agent go-runner tag to 2.10.0-Beta1

Signed-off-by: Neelanjan Manna <[email protected]>

* updated target names

Signed-off-by: Neelanjan Manna <[email protected]>

* updated machine=>Machine targets, removed .DS_Store

Signed-off-by: Neelanjan Manna <[email protected]>

Co-authored-by: Udit Gaurav <[email protected]>
Co-authored-by: Raj Babu Das <[email protected]>
Co-authored-by: Karthik Satchitanand <[email protected]>
Co-authored-by: Shubham Chaudhary <[email protected]>
Co-authored-by: shubhamc <[email protected]>
Co-authored-by: Soumya Ghosh Dastidar <[email protected]>
Co-authored-by: Akash Shrivastava <[email protected]>
Co-authored-by: Vedant Shrotria <[email protected]>
3 participants