Skip to content

Conversation

hvitved
Copy link
Contributor

@hvitved hvitved commented Mar 23, 2020

This PR rewrites how we analyze flow through method calls. Before, we computed flow summaries up-front for (relevant) methods, and restricted such summaries to alter at most the head of the input/output access paths.

This PR proposes to abandon the flow summary approach*, and instead compute flow-through using the existing logic for calculating flow in and flow out. The motivation for doing this is:

  1. It simplifies and unifies the implementation.
  2. It allows for general flow-through, not restricted to alter at most the head of the access paths.

The implementation uses essentially the same technique as in the first pruning steps (nodeCandFwd1 and nodeCand1), namely to record whether the data-flow node is from an argument fromArg (resp. to be returned toReturn), and if so, restrict flow back out to calls that also have flow in (resp. restrict flow back in to calls that also have flow out). Additionally, we also record the access path of the argument (resp. the returned value), using the same level of abstraction as for the node itself (a Boolean in the first step, then AccessPathFront, and finally AccessPath). Lastly, we only record these access paths when the value may potentially flow through, in order to avoid unnecessary duplication.

*We still need to compute value-based summaries in DataFlowImplCommon.qll for extended return nodes and for reverse store steps through getters, but this can now be simplified as we no longer need to take stores into account.

Differences jobs

Interestingly, even though we now allow for more flow**, we see a significant performance improvement on Mono and JDK of 17% and 14%, respectively (performance is mostly the same for C++). Results are unchanged for C++, for C# there are few extra results, and for Java the are also extra results, in particular java/tainted-arithmetic now has 52 extra results.

**With the exception that the summaries computed in DataFlowImplCommon.qll no longer track flow through store-read steps.

@hvitved hvitved force-pushed the dataflow/no-more-summaries branch 2 times, most recently from 10b7ebe to 3e52aa1 Compare March 24, 2020 14:41
@hvitved hvitved marked this pull request as ready for review March 25, 2020 17:12
@hvitved hvitved requested review from a team as code owners March 25, 2020 17:12
@jbj
Copy link
Contributor

jbj commented Apr 7, 2020

The performance results for C# and Java are fantastic, but I haven't looked at whether the additional flow is good.

Should this PR be aimed at 1.24?

@hvitved
Copy link
Contributor Author

hvitved commented Apr 7, 2020

Should this PR be aimed at 1.24?

I had hoped so, but @aschackmull is on vacation this week and next week.

@jbj
Copy link
Contributor

jbj commented Apr 7, 2020

We'll definitely need @aschackmull to review this one, so 1.24 is probably not an option. What a scary-looking diff!

@hvitved
Copy link
Contributor Author

hvitved commented Apr 7, 2020

What a scary-looking diff!

I know ;-) I would still argue that the new code is simpler, and in fact the diff removes more code than it adds, even though I have also added more elaborate comments to existing predicates (like for example 334b501#diff-3fe62e36050a4be03a57b4a631bb7052R707-R713).

@hvitved
Copy link
Contributor Author

hvitved commented Apr 17, 2020

Since there has been no review activity on this PR yet, I am going to rebase and rerun the Differences jobs.

@hvitved hvitved force-pushed the dataflow/no-more-summaries branch from 5a5d3f1 to f91af7d Compare April 17, 2020 11:49
@hvitved
Copy link
Contributor Author

hvitved commented Apr 20, 2020

Since there has been no review activity on this PR yet, I am going to rebase and rerun the Differences jobs.

I have updated the PR description to link to the new jobs.

Copy link
Contributor

@aschackmull aschackmull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main changes in this PR can be summed up in 3 points.

  • Summaries (yes, they are still here) in the pruning stages are now bespoke rather than calculated up-front with a universal quantification on the access paths they apply to. This means that we allow more access-path-modification-shapes in summaries (unlimited, rather than restricted to the finite set of value-preserving, getters, setters, getter-setters, and their corresponding version with taint steps). This is a strict addition of more flow.
  • The summaries that are used in the pruning stages are made less precise by abstracting away which parameter (resp. return node) that the flow-through originated from. This means more nodes visited in the pruning stages, but potentially better performance as the needed summary context during pruning will be smaller (the latter part about performance is purely speculative, as we haven't measured this).
  • The ad-hoc flow-restriction from the fieldFlowBranchLimit is now also applied to summarized flow. This is a natural change given the change in the way the code is structured. This reduces the amount of flow we calculate.

I've tried to do some measurements in tuple-counts, number of nodes visited, and number of generated access paths in the individual pruning stages. The first two points above should only increase most of these numbers, but these effects are completely drowned out by the third point as all the numbers show dramatic decreases. This also accounts for the increase in performance. So in a sense, the measured end-to-end performance gains are accidental, and may come with an increased exposure to the annoying fieldFlowBranchLimit.

Given that the first point above is a very nice feature-addition to field-flow, and that the overall changes simplify the code, I like this PR a lot, even though we unfortunately won't be able to measure the interesting parts of it in isolation. Then we can deal with nitpicking like predicate-naming in follow-up PRs.

@aschackmull aschackmull merged commit b745809 into github:master May 5, 2020
@hvitved
Copy link
Contributor Author

hvitved commented May 5, 2020

  • The ad-hoc flow-restriction from the fieldFlowBranchLimit is now also applied to summarized flow. This is a natural change given the change in the way the code is structured. This reduces the amount of flow we calculate.

I thought about this again, and isn't the fieldFlowBranchLimit very likely to be applied in the final pathStep calculation, even before this PR? For example, flow in is restricted to parameters p and access paths ap that can be reached in the previous pruning stage. A similar restriction is made for flow out. This would also explain why we generally don't see a reduction in query results.

@hvitved hvitved deleted the dataflow/no-more-summaries branch May 5, 2020 11:46
@aschackmull
Copy link
Contributor

Hmm, yes, that is a good point.

max-schaefer pushed a commit to max-schaefer/codeql-go that referenced this pull request May 6, 2020
hvitved added a commit to hvitved/codeql that referenced this pull request May 6, 2020
max-schaefer pushed a commit to max-schaefer/codeql-go that referenced this pull request May 6, 2020
MathiasVP added a commit to MathiasVP/ql that referenced this pull request May 7, 2020
rdmarsh2 added a commit that referenced this pull request May 7, 2020
calumgrant added a commit that referenced this pull request May 13, 2020
owen-mc pushed a commit to owen-mc/codeql-go that referenced this pull request Jun 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants