Backfill missing Pypi dependencies #3045
Merged
Rewrote the rake task for the backfill based on learnings from the past couple of days.

This query doesn't find broken projects. It finds projects that are potentially broken and effectively short-lists them to be resynced (by `PackageManagerDownloadWorker`).

It will find affected projects in batches of 120, split each batch into groups of 2, and then run one group every second for 1 minute, then repeat with the next batch for the next minute, and so on. In effect, it will fix 2 projects per second until all projects are fixed.
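The batching described above can be sketched in plain Ruby. This is only an illustration of the timing, not the rake task itself: `BATCH_SIZE`, `GROUP_SIZE`, `backfill_schedule`, and `estimated_hours` are hypothetical names, and how each group is actually handed to `PackageManagerDownloadWorker` (sleeping between groups, scheduled enqueueing, or something else) is not stated here.

```ruby
# Hypothetical constants mirroring the numbers in the description.
BATCH_SIZE = 120 # projects picked up per minute
GROUP_SIZE = 2   # projects resynced per second

# Returns [delay_in_seconds, [project_ids]] pairs covering every project.
# Each batch of 120 fills one minute; each group of 2 fills one second.
def backfill_schedule(project_ids)
  schedule = []
  project_ids.each_slice(BATCH_SIZE).with_index do |batch, minute|
    batch.each_slice(GROUP_SIZE).with_index do |group, second|
      schedule << [minute * 60 + second, group]
    end
  end
  schedule
end

# Rough wall-clock estimate at GROUP_SIZE projects per second.
def estimated_hours(project_count)
  (project_count.to_f / GROUP_SIZE / 3600).round(1)
end
```

Under these assumptions, `estimated_hours(103_999)` comes out at roughly 14.4, which is in line with the ~14.5 hour estimate for the 103999 affected projects.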
There are currently 103999 affected projects, so it is expected to take ~14.5 hours.

The reason we are not running a script which calls `Pypi#save_dependencies` (like in a previous commit) is that it doesn't set an indicator that would allow already-fixed projects to be filtered out in subsequent queries in case the backfill unexpectedly stops or needs to be stopped and restarted for any reason.

The reason we are not querying for versions instead of projects and running `PackageManagerDownloadWorker` with the `version` arg is to avoid a situation where a project has multiple affected versions and one gets completed by the worker while another fails, or the task needs to be stopped and restarted. In that case the query will not pick the project back up, since its `last_resync_at` timestamp got updated, and it wouldn't be possible to query for such an anomaly.

7/6/2022 is the date the PyPI API changed
2/1/2022 is the day after the fix to Libraries (#3040) went live
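
The restart behaviour that motivates relying on `last_resync_at` can be illustrated with a minimal sketch. It does not reproduce the real query: `Project` here is a hypothetical stand-in struct, `needs_resync?`, `pending`, and `mark_resynced` are made-up helpers, the cutoff is an arbitrary parameter, and treating a missing timestamp as "needs resync" is an assumption.

```ruby
# Hypothetical stand-in for a project record; only the field discussed above.
Project = Struct.new(:name, :last_resync_at)

# A project stays on the short-list while its last resync predates the cutoff.
# Assumption: a project that was never resynced (nil) also counts as pending.
def needs_resync?(project, cutoff)
  project.last_resync_at.nil? || project.last_resync_at < cutoff
end

# Projects a (re)started backfill would still pick up.
def pending(projects, cutoff)
  projects.select { |project| needs_resync?(project, cutoff) }
end

# Simulates the worker finishing a project: the timestamp moves forward,
# which is the indicator that filters it out of later runs.
def mark_resynced(project, now)
  project.last_resync_at = now
end
```

If the task stops after some projects have been marked, calling `pending` again with the same cutoff returns only the remaining ones. A version-level run would move the same project-level timestamp after the first version completed, which is why a second failed version could not be found again.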