Releases: pytorch/test-infra
v20250930-222836
[autorevert] exclude unstable jobs (#7260)
v20250930-134331
[AUTOREVERT] [BUGFIX] fixing typo in variable name preventing revert …
v20250930-125800
[autorevert] correctly fetch and build the gaps in the signal (#7248)

1. Fixed commits-without-jobs issue
   - Problem: Commits with no workflow jobs (e.g., periodic workflow) were excluded from signal extraction
   - Solution:
     - Added `fetch_commits_in_time_range()` to query push table directly
     - Modified job query to filter by explicit list of head_shas instead of JOIN
     - Changed ORDER BY to use sha dimension first (preserves grouping, actual order doesn't matter as internally extractors now iterate over the list of commits passed explicitly)
2. Added mandatory timestamp field to SignalCommit
   - Changes:
     - `SignalCommit.__init__(head_sha, timestamp, events)` - timestamp is now mandatory
     - Signal extraction populates timestamps from push table
     - HUD state logger uses commit timestamp instead of computing from event times
     - Updated 36 test constructor calls

### Testing

Before: [2025-09-29T19-29-47.670686-00-00.html](https://github.com/user-attachments/files/22606856/2025-09-29T19-29-47.670686-00-00.html)

After: [2025-09-29T21-38-10.190584-00-00.html](https://github.com/user-attachments/files/22606859/2025-09-29T21-38-10.190584-00-00.html)
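The idea behind both changes can be illustrated with a minimal sketch: drive the commit list from the push table so that commits without any jobs still appear, and carry the push timestamp on every commit. Only the constructor shape `SignalCommit(head_sha, timestamp, events)` comes from the notes above; the `build_commits` helper, the string event type, and the sample data are assumptions made for illustration, not the repository's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple


@dataclass
class SignalCommit:
    head_sha: str
    timestamp: datetime  # mandatory: taken from the push table, not derived from events
    events: List[str]    # simplified event type (assumption for this sketch)


def build_commits(
    pushes: List[Tuple[str, datetime]],
    jobs_by_sha: Dict[str, List[str]],
) -> List[SignalCommit]:
    """Hypothetical helper: one SignalCommit per pushed commit, jobs or not."""
    return [
        # A commit with no jobs gets an empty event list instead of being dropped.
        SignalCommit(sha, ts, list(jobs_by_sha.get(sha, [])))
        for sha, ts in pushes
    ]


# Made-up sample data: the second commit has no workflow jobs at all.
pushes = [
    ("aaa111", datetime(2025, 9, 29, 19, 0, tzinfo=timezone.utc)),
    ("bbb222", datetime(2025, 9, 29, 20, 0, tzinfo=timezone.utc)),
]
jobs_by_sha = {"aaa111": ["periodic / linux-test"]}
commits = build_commits(pushes, jobs_by_sha)
```

Iterating over the explicit commit list, rather than over whatever a JOIN returns, is what keeps job-less commits visible as gaps in the signal.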
v20250929-230908
[autorevert] fix indentation in `fetch_tests_for_job_ids` (#7250)

Accidentally noticed another bug introduced by https://github.com/pytorch/test-infra/pull/7241 when testing locally with large lookback windows:

```
python -m pytorch_auto_revert --dry-run autorevert-checker periodic --hours 256 --bisection-limit 2 --hud-html
2025-09-29 15:56:16,356 INFO [root] [v2] Start: workflows=periodic hours=256 repo=pytorch/pytorch restart_action=log revert_action=log notify_issue_number=163650
2025-09-29 15:56:16,356 INFO [root] [v2] Run timestamp (CH log ts) = 2025-09-29T22:56:16.356213+00:00
2025-09-29 15:56:16,356 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching commits in time range: repo=pytorch/pytorch lookback=256h
2025-09-29 15:56:16,909 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Commits fetched: 419 commits in 0.55s
2025-09-29 15:56:16,909 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching jobs: repo=pytorch/pytorch workflows=periodic commits=419 lookback=256h
2025-09-29 15:56:56,850 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Jobs fetched: 2848 rows in 39.94s
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching tests for 1077 job_ids (453 failed jobs) in batches
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 1/2 (size=1024)
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] existing rows: 0
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 2/2 (size=53)
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] existing rows: 0
2025-09-29 15:56:57,718 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Tests fetched: 265 rows for 1077 job_ids in 0.86s
```

Note that no tests are read in the first batch!

After this fix:

```
python -m pytorch_auto_revert --dry-run autorevert-checker periodic --hours 256 --hud-html
2025-09-29 16:03:06,896 INFO [root] [v2] Start: workflows=periodic hours=256 repo=pytorch/pytorch restart_action=log revert_action=log notify_issue_number=163650
2025-09-29 16:03:06,896 INFO [root] [v2] Run timestamp (CH log ts) = 2025-09-29T23:03:06.896595+00:00
2025-09-29 16:03:06,897 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching jobs: repo=pytorch/pytorch workflows=periodic lookback=256h
2025-09-29 16:03:49,456 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Jobs fetched: 2887 rows in 42.56s
2025-09-29 16:03:49,466 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching tests for 1113 job_ids (454 failed jobs) in batches
2025-09-29 16:03:49,466 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 1/2 (size=1024)
2025-09-29 16:03:51,753 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 2/2 (size=89)
2025-09-29 16:03:53,056 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Tests fetched: 5002 rows for 1113 job_ids in 3.59s
2025-09-29 16:03:53,122 INFO [root] [v2] Extracted 144 signals
```
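The release note does not show the faulty code, but the logs are consistent with the query running outside the batch loop, so that only the last batch was actually fetched. A minimal sketch of correctly indented batching follows. The `fetch_in_batches` function, the `run_query` callable, and the fake query are hypothetical stand-ins rather than the real `fetch_tests_for_job_ids`; the batch size of 1024 is taken from the log above.

```python
from typing import Callable, List, Sequence


def fetch_in_batches(
    job_ids: Sequence[int],
    run_query: Callable[[Sequence[int]], List[dict]],
    batch_size: int = 1024,
) -> List[dict]:
    """Hypothetical sketch: query every batch and accumulate all rows."""
    rows: List[dict] = []
    for start in range(0, len(job_ids), batch_size):
        batch = job_ids[start:start + batch_size]
        # The query belongs INSIDE the loop. If this line is dedented,
        # it runs once after the loop and sees only the last batch.
        rows.extend(run_query(batch))
    return rows


# Usage with a fake query that records the size of each batch it receives.
seen_batch_sizes: List[int] = []


def fake_query(batch: Sequence[int]) -> List[dict]:
    seen_batch_sizes.append(len(batch))
    return [{"job_id": job_id} for job_id in batch]


all_rows = fetch_in_batches(list(range(1077)), fake_query)
```

With 1077 job ids the fake query is called twice, with batches of 1024 and 53, matching the batch sizes in the first log.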
v20250929-192231
[autorevert] fix local cli (#7244)

Before:

```
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert % python -m pytorch_auto_revert hud
2025-09-29 12:12:37,159 WARNING [pytorch_auto_revert.clickhouse_client_helper] Connection test failed: HTTPDriver for https://hyt81izu0c.us-east-1.aws.clickhouse.cloud:8443 received ClickHouse error code 516
 Code: 516. DB::Exception: revert_lambda: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6151 (official build))
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/ivanzaitsev/test-infra/aws/lambda/pytorch-auto-revert/pytorch_auto_revert/__main__.py", line 336, in <module>
    main()
  File "/Users/ivanzaitsev/test-infra/aws/lambda/pytorch-auto-revert/pytorch_auto_revert/__main__.py", line 283, in main
    raise RuntimeError(
RuntimeError: ClickHouse connection test failed. Please check your configuration.
```

After:

```
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert %
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert %
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert %
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert %
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert % python -m pytorch_auto_revert hud
2025-09-29 12:18:23,118 INFO [root] [hud] Fetching run state ts=2025-09-29 19:13:18 repo=<any>
2025-09-29 12:18:23,521 INFO [root] [hud] Loaded state for repo=pytorch/pytorch workflows=Lint,trunk,pull,inductor
2025-09-29 12:18:23,521 INFO [root] [hud] Rendering HTML for repo=pytorch/pytorch workflows=Lint,trunk,pull,inductor lookback=16 → 2025-09-29_19-13-18.html
2025-09-29 12:18:23,523 INFO [root] HUD written to 2025-09-29_19-13-18.html
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert %
```
v20250929-184550
[PYTORCHBOT] adds 'autorevert' classification for reverts (#7242)

Autorevert should issue revert commands in the format `&pytorchbot revert -m "message" -c autorevert`. This change enables pytorchbot to accept this classification.
v20250929-182904
[autorevert] fix RetryWithBackoff, add tests (#7243)

A follow-up to https://github.com/pytorch/test-infra/pull/7241 that fixes the retry logic and adds unit tests.
v20250929-161641
[AUTOREVERT] use secret store over environment variables for password…
v20250929-155929
[AUTOREVERT] Add retry with back-off for GH API and CH (#7241)

This PR goes through the code, finds where we call external APIs, and adds a retry with exponential back-off. Defaults: 5 retries, 0.5s base delay, 10% jitter.

There are NO CODE CHANGES: all relevant parts of the code are simply guarded with:

```
for attempt in RetryWithBackoff():
    with attempt:
        # the code
```

The changes appear large due to:

* Extra indentation and the consequent linter changes
* The lazy nature of the gh and ch libraries, which resolve pagination as the code consumes information
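For readers unfamiliar with the `for attempt in ...: with attempt:` idiom, here is a minimal sketch of how such a helper could work. It is not the implementation from this PR. The parameter names, the injectable `sleep`, and the reading of "5 retries" as five retries after the first attempt are all assumptions; only the defaults (5, 0.5s, 10%) and the usage pattern come from the note above.

```python
import random
import time


class _Attempt:
    """One try. Suppresses the exception while retries remain."""

    def __init__(self, controller, number):
        self._controller = controller
        self.number = number  # 0 for the first try, 1 for the first retry, ...

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._controller._done = True  # success: stop iterating
            return False
        if not issubclass(exc_type, Exception):
            return False  # never swallow KeyboardInterrupt / SystemExit
        if self.number >= self._controller.max_retries:
            return False  # out of retries: let the error propagate
        return True  # swallow the error; the for loop moves to the next attempt


class RetryWithBackoff:
    """Hypothetical sketch of a retry iterator with exponential back-off."""

    def __init__(self, max_retries=5, base_delay=0.5, jitter=0.1, sleep=time.sleep):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.jitter = jitter
        self.sleep = sleep  # injectable so tests do not really wait
        self._done = False

    def __iter__(self):
        self._done = False
        for number in range(self.max_retries + 1):
            if number > 0:
                # Delay doubles on each retry, then varies by +/- jitter.
                delay = self.base_delay * 2 ** (number - 1)
                delay *= 1 + random.uniform(-self.jitter, self.jitter)
                self.sleep(delay)
            yield _Attempt(self, number)
            if self._done:
                return


# Usage: a call that fails twice with a transient error, then succeeds.
recorded_sleeps = []
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = None
for attempt in RetryWithBackoff(sleep=recorded_sleeps.append):
    with attempt:
        result = flaky()
```

Because the retry decision lives in `__exit__`, the guarded body stays unchanged apart from indentation, which is consistent with the note that the diff is mostly whitespace.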
v20250929-124114
[autorevert] fix handling for insufficient successes (#7235)

Previously the code tried to group the branches for restarts resulting from "infra check" and from "insufficient events". This was a mistake that resulted in delayed restarts. Specifically, in this situation:

<img width="999" height="747" alt="image" src="https://github.com/user-attachments/assets/9cd0051e-8d87-4fe2-af90-88a776847c4d" />

a restart on the success side is expected, but the system waits for a pending job on the failure side.

This PR decouples and simplifies the logic. Now all restarts are scheduled independently (relying on set deduplication), and all final checks are performed afterwards. Added a unit test to specifically verify the case above.
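The "schedule independently, deduplicate with a set" idea can be sketched as follows. This is an illustration under assumptions, not the PR's code: the `(workflow, sha)` restart key, the function name, and the sample values are all hypothetical.

```python
from typing import Iterable, Set, Tuple

# Hypothetical restart key: (workflow name, commit sha).
Restart = Tuple[str, str]


def schedule_restarts(
    infra_check: Iterable[Restart],
    insufficient_events: Iterable[Restart],
) -> Set[Restart]:
    """Each source contributes its restarts on its own; the set removes duplicates."""
    restarts: Set[Restart] = set()
    restarts.update(infra_check)          # does not wait on the other source
    restarts.update(insufficient_events)  # a duplicate request collapses into one entry
    return restarts


# Both sources ask to restart the same commit; it is scheduled only once.
scheduled = schedule_restarts(
    infra_check=[("trunk", "abc123")],
    insufficient_events=[("trunk", "abc123"), ("trunk", "def456")],
)
```

Since neither source depends on the other, a restart needed on the success side no longer has to wait for a pending job on the failure side.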