-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Data flow: No more flow summaries #3110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
10b7ebe
to
3e52aa1
Compare
The performance results for C# and Java are fantastic, but I haven't looked at whether the additional flow is good. Should this PR be aimed at 1.24? |
I had hoped so, but @aschackmull is on vacation this week and next week. |
We'll definitely need @aschackmull to review this one, so 1.24 is probably not an option. What a scary-looking diff! |
I know ;-) I would still argue that the new code is simpler, and in fact the diff removes more code than it adds, even though I have also added more elaborate comments to existing predicates (like for example 334b501#diff-3fe62e36050a4be03a57b4a631bb7052R707-R713). |
Since there has been no review activity on this PR yet, I am going to rebase and rerun the |
5a5d3f1
to
f91af7d
Compare
I have updated the PR description to link to the new jobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main changes in this PR can be summed up in 3 points.
- Summaries (yes, they are still here) in the pruning stages are now bespoke rather than calculated up-front with a universal quantification on the access paths they apply to. This means that we allow more access-path-modification-shapes in summaries (unlimited, rather than restricted to the finite set of value-preserving, getters, setters, getter-setters, and their corresponding version with taint steps). This is a strict addition of more flow.
- The summaries that are used in the pruning stages are made less precise by abstracting away which parameter (resp. return node) that the flow-through originated from. This means more nodes visited in the pruning stages, but potentially better performance as the needed summary context during pruning will be smaller (the latter part about performance is purely speculative, as we haven't measured this).
- The ad-hoc flow-restriction from the
fieldFlowBranchLimit
is now also applied to summarized flow. This is a natural change given the change in the way the code is structured. This reduces the amount of flow we calculate.
I've tried to do some measurements in tuple-counts, number of nodes visited, and number of generated access paths in the individual pruning stages. The first two points above should only increase most of these numbers, but these effects are completely drowned out by the third point as all the numbers show dramatic decreases. This also accounts for the increase in performance. So in a sense, the measured end-to-end performance gains are accidental, and may come with an increased exposure to the annoying fieldFlowBranchLimit
.
Given that the first point above is a very nice feature-addition to field-flow, and that the overall changes simplify the code, I like this PR a lot, even though we unfortunately won't be able to measure the interesting parts of it in isolation. Then we can deal with nitpicking like predicate-naming in follow-up PRs.
I thought about this again, and isn't the |
Hmm, yes, that is a good point. |
C#/Java/C++: Add change note for #3110
This PR rewrites how we analyze flow through method calls. Before, we computed flow summaries up-front for (relevant) methods, and restricted such summaries to alter at most the head of the input/output access paths.
This PR proposes to abandon the flow summary approach*, and instead compute flow-through using the existing logic for calculating flow in and flow out. The motivation for doing this is:
The implementation uses essentially the same technique as in the first pruning steps (
nodeCandFwd1
andnodeCand1
), namely to record whether the data-flow node is from an argumentfromArg
(resp. to be returnedtoReturn
), and if so, restrict flow back out to calls that also have flow in (resp. restrict flow back in to calls that also have flow out). Additionally, we also record the access path of the argument (resp. the returned value), using the same level of abstraction as for the node itself (a Boolean in the first step, thenAccessPathFront
, and finallyAccessPath
). Lastly, we only record these access paths when the value may potentially flow through, in order to avoid unnecessary duplication.*We still need to compute value-based summaries in
DataFlowImplCommon.qll
for extended return nodes and for reverse store steps through getters, but this can now be simplified as we no longer need to take stores into account.Differences jobs
Interestingly, even though we now allow for more flow**, we see a significant performance improvement on Mono and JDK of 17% and 14%, respectively (performance is mostly the same for C++). Results are unchanged for C++, for C# there are few extra results, and for Java the are also extra results, in particular
java/tainted-arithmetic
now has 52 extra results.**With the exception that the summaries computed in
DataFlowImplCommon.qll
no longer track flow through store-read steps.