# Premerge Advisor

## Introduction

While the premerge infrastructure is generally reliable, tests that are broken
at HEAD or that are flaky can significantly degrade the user experience.
Failures unrelated to the PR under review lead to warning fatigue, which
compounds the problem when people land failing tests into `main` that they
believe are unrelated. The premerge advisor is designed to run after premerge
testing and signal to the user whether the failures can be safely ignored
because they are flaky or already broken at HEAD, or whether they should be
investigated further.

## Background/Motivation

The usefulness of the premerge system is significantly affected by how much
people trust it. People trust the system less when the issues reported to them
are unrelated to the changes they are currently working on. False positives
occur regularly, whether due to flaky tests or due to `main` being broken.
Efforts to fix these issues at the source, keeping `main` always green and
deflaking tests, are laudable, but they are not a scalable solution to the
problem of false positive failures. It is also not reasonable to expect PR
authors to spend time familiarizing themselves with all known flaky tests and
digging through postcommit testing logs to see if the failures in their
premerge run also occur on `main`. These alternatives are further explored in
the section on [alternatives considered](#alternatives-considered). Tooling
that automatically runs through the steps one would otherwise need to perform
manually would ensure the analysis of every failed premerge run is thorough
and likely to be correct.

## Design

The premerge advisor will consist of three main parts: jobs uploading failure
information, a web server and database to store and query failure information,
and tooling to write out a verdict about the failures in comments on GitHub.
When a premerge or postcommit job encounters build/test failures, it will
upload information about each failure, such as the test/build action that
failed and the exact log, to the web server. The web server will then store
this information in a format that makes it easy to query later. Every premerge
run that encounters build/test failures will then query the web server to see
if there are any matching build/test failures for the version of `main` that
the PR has been merged into. If there are, the web server can respond with the
failure information and signal that the failures can be safely ignored for
that premerge run. If the failures are novel, the web server can signal that
they should be investigated more thoroughly. If there are failures, the
premerge advisor can then write out a comment on the PR explaining its
findings.

### Processing and Storing Failures

A key part of the premerge advisor is the infrastructure to store and process
build failure information so that it can be queried later. We plan on having
jobs extract failure logs and upload them to a web server. This web server, in
addition to having an endpoint that accepts uploads, will have an endpoint
that accepts test failure information (logs and filenames) and returns whether
each failure is broken at `main`, flaky, or a novel failure caused by the PR.
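
To make the interface concrete, the request and response bodies might look
something like the sketch below. The endpoint names, field names, and verdict
labels are illustrative assumptions, not a settled schema.

```python
# Hypothetical payload shapes for the two endpoints; all field names
# here are illustrative assumptions, not a settled schema.

# POST /upload -- sent by a premerge or postcommit job on failure.
upload_payload = {
    "source": "postcommit",          # or "premerge"
    "base_commit_sha": "abc123...",  # version of main that was tested
    "workflow_run": 12345,           # GitHub Actions run number
    "failures": [
        {
            "name": "LLVM :: CodeGen/X86/example.ll",  # failed test/build action
            "log": "...exact failure log text...",
        }
    ],
}

# POST /explain -- sent by a premerge job asking about its failures.
# The server replies with a verdict per failure.
explain_response = {
    "failures": [
        {
            "name": "LLVM :: CodeGen/X86/example.ll",
            "verdict": "broken-at-head",  # or "flaky" / "novel"
            "evidence": "https://github.com/llvm/llvm-project/actions/runs/...",
        }
    ],
}
```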

For the premerge jobs running through GitHub Actions, we plan on using the
existing `generate_test_report` scripts that are currently used for generating
summaries on job failures. When a job ends with a failure, a script will run
that uses the `generate_test_report` library to extract failure information
and then uploads that information to the web server. Information on how the
premerge jobs will query the server and display results about flaky/already
broken tests is in [the section below](#informing-the-user). We plan on having
both the premerge and postcommit jobs upload failure information to the web
server. Premerge uploads let the web server learn about failures that are not
the result of mid-air collisions before postcommit testing has completed.
Postcommit testing can take upwards of 30 minutes, during which the premerge
advisor would otherwise not be able to advise that these failures are due to a
broken `main`.
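
A rough sketch of that upload step is below. The `/upload` endpoint, the
`PREMERGE_ADVISOR_URL`/`ADVISOR_SOURCE` environment variables, and the
`upload_failures` helper are assumptions; the `(name, log)` pairs it takes
would come from the `generate_test_report` library.

```python
# Sketch of the post-job upload step. Runs after a failed premerge or
# postcommit job and ships extracted failures to the advisor server.
import os

import requests


def upload_failures(failures: list[tuple[str, str]]) -> None:
    """Upload (test/build action name, failure log) pairs to the advisor."""
    if not failures:
        return
    # Assumed configuration: where the cluster-local advisor lives.
    advisor_url = os.environ["PREMERGE_ADVISOR_URL"]
    response = requests.post(
        f"{advisor_url}/upload",
        json={
            "source": os.environ.get("ADVISOR_SOURCE", "premerge"),
            # GITHUB_SHA is the commit that was checked out and tested.
            "base_commit_sha": os.environ["GITHUB_SHA"],
            "workflow_run": int(os.environ["GITHUB_RUN_NUMBER"]),
            "failures": [{"name": name, "log": log} for name, log in failures],
        },
        timeout=30,
    )
    response.raise_for_status()
```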

We plan on implementing the web server in Python with the `flask` library. All
contributors to the premerge infrastructure are already familiar with Python,
building web servers in Python with `flask` is relatively easy, and we do not
need high performance or advanced features. For storage, we plan on using
SQLite, as it has support built into Python, does not add any complexity to
the infrastructure setup, and is reliable.
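
A minimal sketch of what such a server could look like is below, assuming the
illustrative schema and endpoints from above. Flakiness detection (e.g.,
noticing the same test failing across many unrelated PRs) is omitted; this
version only distinguishes failures already seen at a given version of `main`
from novel ones.

```python
# Minimal advisor server sketch using flask and sqlite3, both mentioned
# above. Endpoint paths and schema are illustrative assumptions.
import sqlite3

from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "premerge_advisor.db"


def get_db() -> sqlite3.Connection:
    """Open the database, creating the (illustrative) schema if needed."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS failures ("
        "source TEXT, base_commit_sha TEXT, workflow_run INTEGER, "
        "name TEXT, log TEXT)"
    )
    conn.commit()
    return conn


@app.post("/upload")
def upload():
    """Store every failure reported by a premerge or postcommit job."""
    payload = request.get_json()
    with get_db() as conn:  # the with-block commits the transaction
        conn.executemany(
            "INSERT INTO failures VALUES (?, ?, ?, ?, ?)",
            [
                (payload["source"], payload["base_commit_sha"],
                 payload["workflow_run"], failure["name"], failure["log"])
                for failure in payload["failures"]
            ],
        )
    return jsonify({"status": "ok"})


@app.post("/explain")
def explain():
    """Return a verdict for each failure a premerge run asks about."""
    payload = request.get_json()
    conn = get_db()
    verdicts = []
    for failure in payload["failures"]:
        # A failure already recorded for this version of main can be
        # safely ignored; anything unseen is treated as novel.
        row = conn.execute(
            "SELECT 1 FROM failures WHERE name = ? AND base_commit_sha = ?",
            (failure["name"], payload["base_commit_sha"]),
        ).fetchone()
        verdicts.append({
            "name": failure["name"],
            "verdict": "broken-at-head" if row else "novel",
        })
    return jsonify({"failures": verdicts})
```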

Given that we have two identical clusters that need to be able to function
independently of each other, we also need to duplicate the web server for the
premerge advisor. Queries and failure information uploads will go to the
cluster-local premerge advisor web server. Periodically (e.g., once every
thirty seconds), each web server will query the web server on the other
cluster to see if there is any new data that has not yet been propagated to
its side. It is easy to figure out what one side is missing because GitHub
workflow runs are numbered sequentially and git commits are also explicitly
ordered. One side just needs to send the latest commit SHA and workflow run
for which it has all previous data, and the other side can reply with the data
it has past that point. Explicitly synchronizing every time, without
assumptions about the state of the other side, has benefits over just writing
through, such as ensuring that a cluster that has been down for a significant
amount of time can seamlessly recover. Synchronization does not need to be
perfect: test failures that are flaky or broken at HEAD will almost certainly
show up in both clusters relatively quickly, and minor discrepancies in query
results between the clusters are not a big deal.
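
A sketch of that synchronization loop is below. The `/sync` endpoint and the
`store` object wrapping the local database are hypothetical; the key idea is
sending the newest (commit SHA, workflow run) pair we hold and inserting
whatever the peer replies with past that point.

```python
# Periodic cross-cluster sync sketch. PEER_URL and the /sync endpoint
# are assumptions; `store` is a hypothetical wrapper around the local
# SQLite database exposing latest_commit_sha(), latest_workflow_run(),
# and insert().
import time

import requests

PEER_URL = "https://advisor.other-cluster.example"  # assumed address
SYNC_INTERVAL_SECONDS = 30


def sync_loop(store) -> None:
    while True:
        # Tell the peer the newest data we hold; it answers with
        # everything it has past that point.
        response = requests.post(
            f"{PEER_URL}/sync",
            json={
                "latest_commit_sha": store.latest_commit_sha(),
                "latest_workflow_run": store.latest_workflow_run(),
            },
            timeout=10,
        )
        response.raise_for_status()
        for record in response.json()["new_records"]:
            store.insert(record)
        time.sleep(SYNC_INTERVAL_SECONDS)
```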

### Informing the User

Once a workflow has completed, no premerge advisor actions will be performed
if there are no failures. If there are failures, they will be uploaded to the
web server. Afterwards, the premerge workflow makes a request to the server
asking whether it can explain the failures as either existing in `main` or as
flaky.
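
That workflow-side query might look roughly like this, reusing the
illustrative `/explain` endpoint and verdict labels from earlier.

```python
# Ask the advisor whether every failure can be explained away; if so,
# the comment can tell the user the failures are safe to ignore.
import requests


def all_explained(advisor_url: str, base_sha: str,
                  failures: list[dict]) -> bool:
    """Return True when every failure is flaky or already broken at main."""
    response = requests.post(
        f"{advisor_url}/explain",
        json={"base_commit_sha": base_sha, "failures": failures},
        timeout=30,
    )
    response.raise_for_status()
    verdicts = response.json()["failures"]
    return all(v["verdict"] in ("flaky", "broken-at-head") for v in verdicts)
```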

After the response from the server has been received, the workflow will
construct a comment. It will contain the failure information and, if relevant,
information/links showing that the tests are flaky (and the flakiness rate) or
are broken in `main`. If all of the test failures are due to flakes or
failures in `main`, the comment will indicate to the user that they can safely
merge their PR despite the test failures. We plan to construct the comment in
a manner similar to the code formatting action: we will create one comment on
the first workflow failure and then update that comment every time we get new
data. This avoids creating excessive noise. It does mean that the comment
might get buried in long-running reviews, but those are not the common case,
and in such cases people can learn to look for the comment earlier in the
thread.
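
A sketch of that create-or-update flow is below, using the GitHub REST API for
issue comments. Locating our previous comment via a hidden HTML marker is one
common pattern and an assumption here; pagination of the comment list is
elided for brevity.

```python
# Create-or-update a single advisor comment on the PR, so repeated
# workflow runs edit one comment instead of adding noise.
import os

import requests

MARKER = "<!-- premerge-advisor -->"  # hidden tag identifying our comment
API = "https://api.github.com"


def upsert_comment(repo: str, pr_number: int, body: str) -> None:
    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    body = f"{MARKER}\n{body}"
    # Look for a previous advisor comment to update in place.
    comments = requests.get(
        f"{API}/repos/{repo}/issues/{pr_number}/comments",
        headers=headers, timeout=30,
    ).json()
    for comment in comments:
        if MARKER in comment["body"]:
            requests.patch(
                f"{API}/repos/{repo}/issues/comments/{comment['id']}",
                headers=headers, json={"body": body}, timeout=30,
            ).raise_for_status()
            return
    # First failure on this PR: create the comment.
    requests.post(
        f"{API}/repos/{repo}/issues/{pr_number}/comments",
        headers=headers, json={"body": body}, timeout=30,
    ).raise_for_status()
```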

## Alternatives Considered

There are two main alternatives to this work. One would be to do nothing and
let users figure out these failures on their own, potentially with
documentation to better inform them of the process. The other would be to work
on keeping `main` green all the time and on deflaking tests rather than
building a premerge advisor. Both of these alternatives are considered below.

### Deflaking Tests and Keeping Main Green

Instead of putting effort into building the premerge advisor, we could put
that effort into deflaking tests and making process changes to ensure `main`
is not broken. These fixes have the bonus of being useful for more than just
premerge, also improving reliability for buildbots and any downstream testing.
While we probably could achieve this at least temporarily with process changes
and a concentrated deflaking effort, we do not believe it is feasible or
scalable.

In order to ensure that `main` is not broken by new patches, we would need to
ensure that every commit is tested directly on top of `main` before landing.
This is not feasible given LLVM's current processes, where pushing directly to
`main` is a critical component of several developers' workflows. We would also
need to reduce the risk of "mid-air collisions": patches that pass premerge
testing but fail on `main` when landed, because the patch in its original
state is not compatible with the new state of `main`. This would most likely
involve merge queues, which would introduce new CI load and are also
incompatible with LLVM's existing practices, for the same reason that
requiring premerge checks to be run before landing is.

A one-time effort to deflake tests is also not scalable from an engineering
effort perspective. While we might be able to deflake existing tests,
additional flaky tests will be added in the future, and it is likely not
feasible to dedicate enough resources to deflake them. Policy improvements
around reverting patches that introduce flaky tests might make this more
scalable, but they rely on quick detection of flaky tests, which might be
difficult for tests that fail flakily only very rarely.

### Not Doing Anything

Alternatively, we could not implement this at all. This system adds quite a
bit of complexity and introduces new failure modes, and false positive
failures are not that frequent. However, even a relatively small percentage of
false positives, like we are currently observing, significantly impacts trust
in the premerge system, which compounds the problem: people learn not to trust
the results, ignore true failures caused by their patch, and land it anyway,
causing many downstream failures. The frequency of these incidents (typically
multiple times per week) means that most LLVM developers are likely to run
into this class of issue sooner or later.

The complexity is also well confined to the components specific to this new
infrastructure, and the new failure modes can be mitigated through proper
error handling at the interface between the existing premerge system and the
new premerge advisor infrastructure.