Commit 3195b70

[CI] Add Design Doc for Premerge Advisor

This patch adds a design document for the premerge advisor that we plan on implementing.

premerge/premerge-advisor.md

# Premerge Advisor
## Introduction

While the premerge infrastructure is generally reliable, tests that are broken
at HEAD or that are flaky can significantly degrade the user experience.
Failures unrelated to the PR being worked on lead to warning fatigue, which
compounds the problem when people land failing tests into `main` that they
believe are unrelated to their change. The premerge advisor is designed to run
after premerge testing and signal to the user whether the failures can be
safely ignored because they are flaky or already broken at HEAD, or whether
they should be investigated further.

## Background/Motivation

The usefulness of the premerge system is significantly impacted by how much
people trust it. People trust the system less when the issues reported to them
are unrelated to the changes they are currently working on. False positives
occur regularly, whether due to flaky tests or due to `main` being broken.
Efforts to fix these issues at the source by keeping `main` always green and
deflaking tests are laudable, but they are not a scalable solution to the
problem of false-positive failures. It is also not reasonable to expect PR
authors to spend time familiarizing themselves with all known flaky tests or to
dig through postcommit testing logs to see whether the failures in their
premerge run also occur in `main`. These alternatives are explored further in
the section on [alternatives considered](#alternatives-considered). Tooling
that automatically runs through the steps one would otherwise need to perform
manually would ensure the analysis of every failed premerge run is thorough and
likely to be correct.

## Design

The premerge advisor will consist of three main parts: jobs uploading failure
information, a web server and database to store and query failure information,
and tooling to write a verdict about the failures into comments on Github.
When a premerge or postcommit job encounters build/test failures, it will
upload information about each failure to the web server, such as the
test/build action that failed and the exact log. The web server will store
this information in a format that makes it easy to query later. Every premerge
run that encounters build/test failures will then query the web server to see
if there are any matching build/test failures for the version of `main` that
the PR has been merged into. If there are, the web server can respond with the
failure information and signal that the failures can be safely ignored for
that premerge run. If the failures are novel, the web server can signal that
they should be investigated more thoroughly. Whenever there are failures, the
premerge advisor can then write a comment on the PR explaining its findings.
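
To make the data flow concrete, here is a minimal sketch of what an uploaded
failure record and the advisor's verdict could look like. The field names and
verdict categories shown here are illustrative assumptions, not part of the
final design.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    """One build/test failure observed by a premerge or postcommit job."""

    source: str        # "premerge" or "postcommit"
    base_commit: str   # SHA of the `main` commit the PR was merged into
    workflow_run: int  # Github Actions run number that produced the failure
    name: str          # name of the test or build action that failed
    log: str           # the exact failure log


@dataclass
class Verdict:
    """Advisor response for a single reported failure."""

    name: str
    explanation: str   # "broken-at-head", "flaky", or "novel"
    evidence_url: str  # link to a matching failure in `main`, if one exists
```
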
### Processing and Storing Failures

A key part of the premerge advisor is infrastructure to store and process build
failure information so that it can be queried later. We plan on having jobs
extract failure logs and upload them to a web server. In addition to an
endpoint that accepts uploads, this web server will have an endpoint that
accepts test failure information (logs and filenames) and returns whether each
failure is already broken at `main`, flaky, or a novel failure caused by the
PR.


For the premerge jobs running through Github Actions, we plan on using the
existing `generate_test_report` scripts that currently generate the summaries
shown on job failures. When a job ends with a failure, a script will run that
uses the `generate_test_report` library to extract failure information and then
uploads that information to the web server. How the premerge jobs will query
the server and display results about flaky/already-broken tests is described in
[the section below](#informing-the-user). We plan on having both the premerge
and postcommit jobs upload failure information to the web server. This enables
the web server to learn about failures that are not the result of mid-air
collisions before postcommit testing has completed. Postcommit testing can take
upwards of 30 minutes, during which the premerge advisor would otherwise be
unable to advise that these failures are due to a broken `main`.
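
As a rough sketch of the upload step described above, the snippet below assumes
a hypothetical `extract_failures` helper rather than the actual
`generate_test_report` interface, and uses a placeholder URL for the advisor
web server:

```python
import os

import requests

# Hypothetical import: the real helper lives in the existing
# generate_test_report scripts and may expose a different interface.
from generate_test_report_lib import extract_failures

# BUILD_LOG_PATH is a placeholder for wherever the job stores its logs.
failures = extract_failures(os.environ["BUILD_LOG_PATH"])
payload = {
    "source": "premerge",
    # Assumed to identify the version of `main` the PR was merged into.
    "base_commit": os.environ["GITHUB_SHA"],
    "workflow_run": int(os.environ["GITHUB_RUN_NUMBER"]),
    "failures": [{"name": f.name, "log": f.log} for f in failures],
}
# Placeholder URL for the cluster-local premerge advisor web server.
requests.post("http://premerge-advisor.internal/upload", json=payload,
              timeout=30)
```
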

We plan on implementing the web server in Python with the `flask` library. All
contributors to the premerge infrastructure are already familiar with Python,
building web servers in Python with `flask` is relatively easy, and we do not
need high performance or advanced features. For storage, we plan on using
SQLite, as it is built into Python's standard library, does not add any
complexity in terms of infrastructure setup, and is reliable.
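
A minimal sketch of such a server is shown below. The endpoint names, schema,
and matching logic are assumptions for illustration; in particular, real
matching would also compare logs and consult flakiness history rather than
matching on test name alone.

```python
import sqlite3

from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "failures.sqlite"  # placeholder path


def get_db() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS failures ("
        "source TEXT, base_commit TEXT, workflow_run INTEGER, "
        "name TEXT, log TEXT)"
    )
    conn.commit()
    return conn


@app.route("/upload", methods=["POST"])
def upload():
    report = request.get_json()
    conn = get_db()
    with conn:  # commits the transaction on success
        conn.executemany(
            "INSERT INTO failures VALUES (?, ?, ?, ?, ?)",
            [
                (report["source"], report["base_commit"],
                 report["workflow_run"], f["name"], f["log"])
                for f in report["failures"]
            ],
        )
    conn.close()
    return jsonify({"status": "ok"})


@app.route("/check", methods=["POST"])
def check():
    query = request.get_json()
    conn = get_db()
    verdicts = []
    for failure in query["failures"]:
        # Naive matching: the same test already failed against the same base
        # commit in another run, so it is unlikely to be caused by this PR.
        row = conn.execute(
            "SELECT 1 FROM failures WHERE name = ? AND base_commit = ?",
            (failure["name"], query["base_commit"]),
        ).fetchone()
        verdicts.append({
            "name": failure["name"],
            "explanation": "broken-at-head" if row else "novel",
        })
    conn.close()
    return jsonify({"verdicts": verdicts})
```
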

Given that we have two identical clusters that need to be able to function
independently of each other, we also need to duplicate the web server for the
premerge advisor. Queries and failure information uploads will go to the
cluster-local premerge advisor web server. Periodically (e.g., once every
thirty seconds), the web server will query the web server on the other cluster
to see if there is any new data that has not been propagated to its side yet.
It is easy to figure out what one side is missing because Github workflow runs
are numbered sequentially and git commits are also explicitly ordered. One side
just needs to send the latest commit SHA and workflow run for which it has all
previous data, and the other side can reply with the data it has past that
point. Explicitly synchronizing every time, without assumptions about the state
of the other side, has benefits over simply writing through, such as ensuring
that a cluster that has been down for a significant amount of time can
seamlessly recover. Synchronization does not need to be perfect: test failures
that are flaky or broken at HEAD will almost certainly show up in both clusters
relatively quickly, and minor discrepancies between the clusters' query results
are not a big deal.
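
A sketch of the periodic synchronization loop, assuming a `/sync` endpoint and
using the workflow run number alone as the watermark (a simplification of the
commit SHA plus workflow run scheme described above):

```python
import time

import requests

PEER_URL = "http://premerge-advisor.other-cluster.internal"  # placeholder URL


def sync_once(conn):
    """Pull any failure records from the peer cluster that we do not have."""
    # Workflow runs are numbered sequentially, so the highest run number we
    # hold is a usable watermark for what we might be missing.
    (latest_run,) = conn.execute(
        "SELECT COALESCE(MAX(workflow_run), 0) FROM failures"
    ).fetchone()
    response = requests.post(
        f"{PEER_URL}/sync",
        json={"latest_workflow_run": latest_run},
        timeout=30,
    )
    records = response.json()["records"]
    conn.executemany(
        "INSERT INTO failures VALUES (?, ?, ?, ?, ?)",
        [
            (r["source"], r["base_commit"], r["workflow_run"],
             r["name"], r["log"])
            for r in records
        ],
    )
    conn.commit()


def sync_loop(conn):
    # e.g., once every thirty seconds, as described above.
    while True:
        sync_once(conn)
        time.sleep(30)
```
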
### Informing the User

Once a workflow has completed, no actions related to the premerge advisor are
performed if there are no failures. If there are failures, they will be
uploaded to the web server. Afterwards, the premerge workflow makes a request
to the server asking whether it can explain the failures as either existing in
`main` or being flaky.
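
For illustration, the workflow-side query could look roughly like the
following, reusing the placeholder `/check` endpoint and verdict shape assumed
earlier:

```python
import requests

ADVISOR_URL = "http://premerge-advisor.internal"  # placeholder URL


def explain_failures(base_commit: str,
                     failures: list[dict]) -> tuple[bool, list[dict]]:
    """Ask the advisor whether the failures exist in `main` or are flaky."""
    response = requests.post(
        f"{ADVISOR_URL}/check",
        json={"base_commit": base_commit, "failures": failures},
        timeout=30,
    )
    verdicts = response.json()["verdicts"]
    # The failures can only be waved through if every one of them is explained.
    return all(v["explanation"] != "novel" for v in verdicts), verdicts
```
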

After the response from the server has been received, the workflow will
construct a comment. It will contain the failure information and, if relevant,
information/links showing that the tests are flaky (along with the flakiness
rate) or are broken in `main`. If all of the test failures are due to flakes or
failures in `main`, the comment will indicate to the user that they can safely
merge their PR despite the test failures. We plan to construct the comment in a
manner similar to the code formatting action: we will create one comment on the
first workflow failure and then update that comment every time we get new data.
This keeps the noise down. It does mean that the comment might get buried in
long-running reviews, but those are not the common case, and in such cases
people can learn to look for the comment earlier in the thread.
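
A sketch of that create-or-update logic, using PyGithub purely as an
illustration (the hidden marker string and the way the comment is located are
assumptions, not a decided implementation):

```python
from github import Github

MARKER = "<!-- premerge-advisor -->"  # hidden marker to find our own comment


def post_or_update_comment(token: str, pr_number: int, body: str) -> None:
    """Create the advisor comment on the first failure, update it afterwards."""
    repo = Github(token).get_repo("llvm/llvm-project")
    pr = repo.get_pull(pr_number)
    body = f"{MARKER}\n{body}"
    for comment in pr.get_issue_comments():
        if MARKER in comment.body:
            # Update the existing comment every time we get new data.
            comment.edit(body)
            return
    pr.create_issue_comment(body)
```
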
## Alternatives Considered

There are two main alternatives to this work. One would be to do nothing and
let users figure out these failures on their own, potentially with
documentation to better inform them of the process. The other would be to work
on keeping `main` green all the time and deflaking tests rather than building a
premerge advisor. Both of these alternatives are considered below.

### Deflaking Tests and Keeping Main Green

Instead of putting effort into building the premerge advisor, we could put that
effort into deflaking tests and making process changes to ensure `main` is not
broken. These fixes have the bonus of being useful for more than just premerge,
also improving reliability for buildbots and any downstream testing. While we
could probably achieve this at least temporarily with process changes and a
concentrated deflaking effort, we do not believe this approach is feasible or
scalable in the long term.


In order to ensure that `main` is not broken by new patches, every commit would
need to be tested directly on top of `main` before landing. This is not
feasible given LLVM's current processes, where pushing directly to `main` is a
critical component of several developers' workflows. We would also need to
reduce the risk of "mid-air collisions": patches that pass premerge testing but
fail on `main` when landed because the patch in its original state is not
compatible with the new state of `main`. This would most likely involve merge
queues, which would introduce new CI load and are also not compatible with
LLVM's existing practices, for the same reason that requiring premerge checks
to run before landing is not.


An initial effort to deflake tests is also not scalable from an
engineering-effort perspective. While we might be able to deflake existing
tests, additional flaky tests will be added in the future, and it is likely not
feasible to dedicate enough resources to deflake them all. Policy improvements
around reverting patches that introduce flaky tests might make this more
scalable, but this relies on quick detection of flaky tests, which can be
difficult for tests that fail only very rarely.

### Not Doing Anything

Alternatively, we could choose not to implement this at all. The system adds
quite a bit of complexity and introduces new failure modes, and false-positive
failures are not that frequent. However, even a false-positive rate as small as
the one we are observing significantly impacts trust in the premerge system,
which compounds the problem: people learn not to trust the results, ignore true
failures caused by their patch, and then land it, causing many downstream
failures. The frequency of these incidents (typically multiple times per week)
means that most LLVM developers are likely to run into this class of issue
sooner or later.


The complexity is also well confined to the components specific to this new
infrastructure, and the new failure modes can be mitigated through proper error
handling at the interface between the existing premerge system and the new
premerge advisor infrastructure.
