Releases: pytorch/test-infra

v20250917-004931

17 Sep 00:51
03bf20c

[autorevert] implement actions layer and logging (#7169)

This pull request:
* Introduces a final "Signal Actions" layer (responsible for executing
side effects of processed Signals, such as restarts and reverts).
* Changes the main entry point for the PyTorch auto-revert Lambda to use
the new signals-based autorevert flow by default.
* Adds two ClickHouse (CH) tables for observability:
  * `autorevert_events_v2`
  * `autorevert_state`


See [the
spec](https://github.com/pytorch/test-infra/blob/ff2645443aafb0209d7f546302a5c09d8243cb31/aws/lambda/pytorch-auto-revert/SIGNAL_ACTIONS.md)
for more details.



### Testing

Tested locally (only restart & state logging):


```
HOURS=18 WORKFLOWS=Lint,trunk,pull,inductor python -m pytorch_auto_revert
INFO:root:[v2] Start: workflows=Lint,trunk,pull,inductor hours=18 repo=pytorch/pytorch dry_run=False
INFO:root:[v2] Run timestamp (CH log ts) = 2025-09-16T15:51:18.656175
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Fetching jobs: repo=pytorch/pytorch workflows=Lint,trunk,pull,inductor lookback=18h
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Jobs fetched: 6738 rows in 45.70s
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Fetching tests for 414 job_ids (20 failed jobs) in batches
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Test batch 1/1 (size=414)
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Tests fetched: 231 rows for 414 job_ids in 1.95s
INFO:root:[v2] Extracted 19 signals
INFO:root:[v2][signal] wf=trunk key=inductor/test_cudagraph_trees.py::test_graph_partition outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=test_transformers.py::test_fused_sdp_priority_order_use_compile_False_cuda outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=export/test_hop.py::test_retrace_export_local_map_hop_simple_cuda_float32 outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=inductor/test_cudagraph_trees_expandable_segments.py::test_forward_backward_not_called_backend_inductor outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=export/test_hop.py::test_pre_dispatch_export_local_map_hop_simple_cuda_float32 outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=export/test_hop.py::test_serialize_export_local_map_hop_simple_cuda_float32 outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=export/test_hop.py::test_aot_export_local_map_hop_simple_cuda_float32 outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=inductor/test_cudagraph_trees_expandable_segments.py::test_graph_partition outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=inductor/test_cudagraph_trees.py::test_forward_backward_not_called_backend_inductor outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=distributed/tensor/debug/test_debug_mode.py::test_debug_mode_backward outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=Lint key=lintrunner-noclang / linux-job outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=pull key=linux-jammy-py3.10-clang12 / test outcome=Ineligible(reason=<IneligibleReason.FLAKY: 'flaky'>, message='signal is flaky (mixed outcomes on same commit)')
INFO:root:[v2][signal] wf=trunk key=win-vs2022-cpu-py3 / test outcome=Ineligible(reason=<IneligibleReason.FLAKY: 'flaky'>, message='signal is flaky (mixed outcomes on same commit)')
INFO:root:[v2][signal] wf=inductor key=unit-test / inductor-test / test outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=win-vs2022-cpu-py3 / build outcome=RestartCommits(commit_shas={'814338826e0b5cd065f8278c4b9487f13e16a5c7'})
INFO:root:[v2][signal] wf=inductor key=inductor-cpu-test / test outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=win-vs2022-cuda12.6-py3 / build outcome=RestartCommits(commit_shas={'814338826e0b5cd065f8278c4b9487f13e16a5c7'})
INFO:root:[v2][signal] wf=inductor key=unit-test / inductor-cpu-build / build outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=pull key=linux-jammy-py3.13-clang12 / test outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2] Candidate action groups: 1
INFO:root:[v2][action] preparing to execute ActionGroup(type='restart', commit_sha='814338826e0b5cd065f8278c4b9487f13e16a5c7', workflow_target='trunk', sources=[SignalMetadata(workflow_name='trunk', key='win-vs2022-cpu-py3 / build'), SignalMetadata(workflow_name='trunk', key='win-vs2022-cuda12.6-py3 / build')])
INFO:root:[v2][action] restart: skipping pacing (delta_sec=-24852)
INFO:root:[v2] Executed action groups: 0
INFO:root:[v2] State logged
```
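
In the log above, two `RestartCommits` outcomes for the same commit and workflow result in a single candidate action group. A minimal sketch of how such grouping might work is shown below. The `RestartOutcome` class, the `group_restarts` function, and the `(commit_sha, workflow)` grouping key are hypothetical illustrations inferred from the log, not the actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class RestartOutcome:
    # Hypothetical, simplified stand-in for a signal that asks for restarts.
    workflow_name: str
    key: str
    commit_shas: FrozenSet[str]


def group_restarts(
    outcomes: Iterable[RestartOutcome],
) -> Dict[Tuple[str, str], List[Tuple[str, str]]]:
    """Group restart requests by (commit_sha, workflow_name).

    Each group lists the (workflow_name, signal key) sources that asked for it,
    so one restart can be issued per commit and workflow (assumed behavior).
    """
    groups: Dict[Tuple[str, str], List[Tuple[str, str]]] = defaultdict(list)
    for outcome in outcomes:
        for sha in sorted(outcome.commit_shas):
            groups[(sha, outcome.workflow_name)].append(
                (outcome.workflow_name, outcome.key)
            )
    return dict(groups)


# The two restart signals from the log collapse into one group.
SHA = "814338826e0b5cd065f8278c4b9487f13e16a5c7"
example_groups = group_restarts(
    [
        RestartOutcome("trunk", "win-vs2022-cpu-py3 / build", frozenset({SHA})),
        RestartOutcome("trunk", "win-vs2022-cuda12.6-py3 / build", frozenset({SHA})),
    ]
)
```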

v20250916-131543

16 Sep 13:17
1ab651d

[autorevert] filter out 'mem_leak_check' and 'rerun_disabled_tests' w…

v20250916-131426

16 Sep 13:16
85948fd

[autorevert] non nullable dates & dedup (#7167)

**Signal event deduplication and timestamp handling:**

* Added a deduplication step in `SignalExtractor` to remove duplicate
signal events within commits, based on identical `(started_at,
wf_run_id)` pairs. This addresses issues with "rerun failed" jobs in
GitHub workflows, which reuse the same underlying job but report it
under different job IDs.

* For test-track signals, extract start_date from the specific job that
hosted the test (when available).

* Changed all job and signal timestamp fields (`started_at`,
`created_at`) to be non-optional and default
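
The deduplication described above can be sketched as follows. This is a minimal illustration, assuming a simplified event shape; the `SignalEvent` fields other than `started_at` and `wf_run_id`, and the `dedup_events` helper, are hypothetical and not taken from `SignalExtractor` itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class SignalEvent:
    # Hypothetical, simplified event; the real class likely carries more fields.
    name: str
    status: str
    started_at: datetime
    wf_run_id: int


def dedup_events(events: Iterable[SignalEvent]) -> List[SignalEvent]:
    """Keep the first event per (started_at, wf_run_id) pair, preserving order."""
    seen: Set[Tuple[datetime, int]] = set()
    unique: List[SignalEvent] = []
    for event in events:
        key = (event.started_at, event.wf_run_id)
        if key in seen:
            # Same underlying job reported again (e.g. under a new job id).
            continue
        seen.add(key)
        unique.append(event)
    return unique
```

Keeping the first occurrence is an arbitrary choice made for this sketch; which duplicate survives in the real code is not stated in the release notes.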

v20250916-092913

16 Sep 09:30
9ae4838

Improve the time series api + add policy for regression (#7156)

For the API:
- Add model filters during query
- Add format options: table and raw
- Add an API hook method for the frontend

Add the listCommits API.

For the regression lambda:
- Add a regression policy for compilation latency (if the new value > 1.05 ×
baseline, it is considered a regression)
- Change the data format to match the API
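
The compilation latency policy above amounts to a simple threshold check. A minimal sketch follows; the function name, the `threshold` parameter, and the handling of non-positive baselines are assumptions made for illustration, not the lambda's actual code.

```python
def is_compilation_latency_regression(
    new_value: float, baseline: float, threshold: float = 1.05
) -> bool:
    """Return True when new_value exceeds threshold times the baseline.

    A non-positive baseline gives no meaningful ratio, so this sketch
    treats it as "no regression" (an assumption, not documented behavior).
    """
    if baseline <= 0:
        return False
    return new_value > threshold * baseline
```

For example, with a baseline of 100 seconds, a new value of 106 seconds would count as a regression, while 104 seconds would not.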

v20250915-175459

15 Sep 17:56
3836ad9

fix makefile lint & typo (#7166)

v20250915-170446

15 Sep 17:06
1f17a03

Update NVIDIA driver to 580.82.07 (#7159)

This updates the NVIDIA driver to `580.82.07` to add support for the CUDA
13.0 runtime.

This is similar to https://github.com/pytorch/pytorch/pull/162531 but
applies to our entire fleet.

v20250915-165633

15 Sep 16:58
41043ed

[autorevert] add signal extraction layer for transforming CI data int…

v20250913-193142

13 Sep 19:33
efacd29

[Benchmark Regression] Add Report Level  (#7138)

Add a configuration that decides what level of data we store in the DB
for the regression benchmark.

lowest from highest

v20250910-211252

10 Sep 21:14
8cbb5f0

[Compiler time series data] Add fetch raw data logics (#7137)

Support fetching raw data as time series for compiler data.

Support arch mapping between the DB and the API.

Make the query logic commit-driven:
1. If commits are provided in the API request, fetch data using that list of
commits.
2. If not, use startTime and stopTime to fetch the list of unique
commits, then fetch the time series data.
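
The two-branch commit resolution above can be sketched as follows. This is an illustration only: `resolve_commits` and the injected `list_commits` callable are hypothetical names, and the real API may be written in a different language and shape.

```python
from typing import Callable, List, Optional, Sequence


def resolve_commits(
    commits: Optional[Sequence[str]],
    start_time: Optional[str],
    stop_time: Optional[str],
    list_commits: Callable[[str, str], Sequence[str]],
) -> List[str]:
    """Return the unique commits that should drive the time series query.

    1. If commits are given explicitly, use them.
    2. Otherwise look up the commits in [start_time, stop_time].
    Order is preserved and duplicates are dropped in both cases.
    """
    if commits:
        return list(dict.fromkeys(commits))
    if start_time is None or stop_time is None:
        raise ValueError("either commits or both startTime and stopTime are required")
    return list(dict.fromkeys(list_commits(start_time, stop_time)))
```

The time series data would then be fetched for the returned commits, so both request styles share one query path.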

v20250910-032549

10 Sep 03:27
73a32b8

add github notification (#7096)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #7112
* __->__ #7096
* #7095


# Add GitHub notification settings
Currently we create a GitHub comment only if a regression status
is detected.
Later we can extend this to suspicious statuses too, if we want.

## Prerequisites
Currently the user must:
1. Create a GitHub issue first and put it in the Policy section to make
this work.
2. Associate each issue with a butterfly rule that links to
internal workplace/oncall emails.


## Later improvement
Currently, if a regression is not resolved, a notification is sent
to GitHub every day.
Since we have those reports in the DB, we can later do exponential
notification based on previous report status.
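
One possible reading of "exponential notification" is to notify on day 0 and then on days 1, 2, 4, 8, and so on after the regression was first reported. The sketch below illustrates that idea only; `should_notify` and the day-based schedule are assumptions, since the notes do not specify the design.

```python
def should_notify(days_since_first_report: int) -> bool:
    """Notify on day 0, then on days that are powers of two (1, 2, 4, 8, ...)."""
    if days_since_first_report < 0:
        raise ValueError("days_since_first_report must be non-negative")
    days = days_since_first_report
    # A positive integer is a power of two when it has exactly one bit set.
    return days == 0 or (days & (days - 1)) == 0
```

With this schedule, a regression that stays open for 30 days would produce 6 notifications (days 0, 1, 2, 4, 8, 16) instead of one per day.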