Data flow: No more flow summaries #3110

hvitved · 2020-03-23T14:59:25Z

This PR rewrites how we analyze flow through method calls. Before, we computed flow summaries up-front for (relevant) methods, and restricted such summaries to alter at most the head of the input/output access paths.

This PR proposes to abandon the flow summary approach^*, and instead compute flow-through using the existing logic for calculating flow in and flow out. The motivation for doing this is:

It simplifies and unifies the implementation.
It allows for general flow-through, not restricted to alter at most the head of the access paths.

The implementation uses essentially the same technique as in the first pruning steps (nodeCandFwd1 and nodeCand1), namely to record whether the data-flow node is from an argument fromArg (resp. to be returned toReturn), and if so, restrict flow back out to calls that also have flow in (resp. restrict flow back in to calls that also have flow out). Additionally, we also record the access path of the argument (resp. the returned value), using the same level of abstraction as for the node itself (a Boolean in the first step, then AccessPathFront, and finally AccessPath). Lastly, we only record these access paths when the value may potentially flow through, in order to avoid unnecessary duplication.

^*We still need to compute value-based summaries in DataFlowImplCommon.qll for extended return nodes and for reverse store steps through getters, but this can now be simplified as we no longer need to take stores into account.

Differences jobs

Interestingly, even though we now allow for more flow^**, we see a significant performance improvement on Mono and JDK of 17% and 14%, respectively (performance is mostly the same for C++). Results are unchanged for C++, for C# there are few extra results, and for Java the are also extra results, in particular java/tainted-arithmetic now has 52 extra results.

^**With the exception that the summaries computed in DataFlowImplCommon.qll no longer track flow through store-read steps.

jbj · 2020-04-07T13:27:04Z

The performance results for C# and Java are fantastic, but I haven't looked at whether the additional flow is good.

Should this PR be aimed at 1.24?

hvitved · 2020-04-07T13:30:50Z

Should this PR be aimed at 1.24?

I had hoped so, but @aschackmull is on vacation this week and next week.

jbj · 2020-04-07T14:51:09Z

We'll definitely need @aschackmull to review this one, so 1.24 is probably not an option. What a scary-looking diff!

hvitved · 2020-04-07T19:10:11Z

What a scary-looking diff!

I know ;-) I would still argue that the new code is simpler, and in fact the diff removes more code than it adds, even though I have also added more elaborate comments to existing predicates (like for example 334b501#diff-3fe62e36050a4be03a57b4a631bb7052R707-R713).

hvitved · 2020-04-17T11:46:32Z

Since there has been no review activity on this PR yet, I am going to rebase and rerun the Differences jobs.

hvitved · 2020-04-20T06:36:32Z

Since there has been no review activity on this PR yet, I am going to rebase and rerun the Differences jobs.

I have updated the PR description to link to the new jobs.

aschackmull

The main changes in this PR can be summed up in 3 points.

Summaries (yes, they are still here) in the pruning stages are now bespoke rather than calculated up-front with a universal quantification on the access paths they apply to. This means that we allow more access-path-modification-shapes in summaries (unlimited, rather than restricted to the finite set of value-preserving, getters, setters, getter-setters, and their corresponding version with taint steps). This is a strict addition of more flow.
The summaries that are used in the pruning stages are made less precise by abstracting away which parameter (resp. return node) that the flow-through originated from. This means more nodes visited in the pruning stages, but potentially better performance as the needed summary context during pruning will be smaller (the latter part about performance is purely speculative, as we haven't measured this).
The ad-hoc flow-restriction from the fieldFlowBranchLimit is now also applied to summarized flow. This is a natural change given the change in the way the code is structured. This reduces the amount of flow we calculate.

I've tried to do some measurements in tuple-counts, number of nodes visited, and number of generated access paths in the individual pruning stages. The first two points above should only increase most of these numbers, but these effects are completely drowned out by the third point as all the numbers show dramatic decreases. This also accounts for the increase in performance. So in a sense, the measured end-to-end performance gains are accidental, and may come with an increased exposure to the annoying fieldFlowBranchLimit.

Given that the first point above is a very nice feature-addition to field-flow, and that the overall changes simplify the code, I like this PR a lot, even though we unfortunately won't be able to measure the interesting parts of it in isolation. Then we can deal with nitpicking like predicate-naming in follow-up PRs.

hvitved · 2020-05-05T11:46:11Z

The ad-hoc flow-restriction from the fieldFlowBranchLimit is now also applied to summarized flow. This is a natural change given the change in the way the code is structured. This reduces the amount of flow we calculate.

I thought about this again, and isn't the fieldFlowBranchLimit very likely to be applied in the final pathStep calculation, even before this PR? For example, flow in is restricted to parameters p and access paths ap that can be reached in the previous pruning stage. A similar restriction is made for flow out. This would also explain why we generally don't see a reduction in query results.

aschackmull · 2020-05-05T12:29:29Z

Hmm, yes, that is a good point.

cf github/codeql#3110

C++: Add testcase for #3110

C#/Java/C++: Add change note for #3110

cf github/codeql#3110

hvitved added C# C++ Java labels Mar 23, 2020

hvitved force-pushed the dataflow/no-more-summaries branch 2 times, most recently from 10b7ebe to 3e52aa1 Compare March 24, 2020 14:41

hvitved marked this pull request as ready for review March 25, 2020 17:12

hvitved requested review from a team as code owners March 25, 2020 17:12

hvitved assigned aschackmull Mar 25, 2020

This was referenced Mar 26, 2020

Data flow: Remove uses of Node.getLocation(). #3126

Closed

Data flow: No magic in returnFlowCallableCand #3142

Merged

hvitved added 4 commits April 17, 2020 13:48

Data flow: No more flow summaries

de41e66

Data flow: Sync files

1b6e978

Java: Update test

05ec755

C#: Add more data-flow tests

f91af7d

hvitved force-pushed the dataflow/no-more-summaries branch from 5a5d3f1 to f91af7d Compare April 17, 2020 11:49

This was referenced Apr 22, 2020

Data flow: Support stores into nodes that are not PostUpdateNodes #3312

Merged

C#: Precise data-flow for collections #3366

Merged

kamarcum unassigned aschackmull Apr 28, 2020

hvitved assigned aschackmull Apr 29, 2020

aschackmull approved these changes May 5, 2020

View reviewed changes

aschackmull merged commit b745809 into github:master May 5, 2020

hvitved deleted the dataflow/no-more-summaries branch May 5, 2020 11:46

max-schaefer mentioned this pull request May 6, 2020

Port recent data-flow improvements github/codeql-go#62

Merged

max-schaefer pushed a commit to max-schaefer/codeql-go that referenced this pull request May 6, 2020

Data flow: No more summaries

0bdf502

cf github/codeql#3110

hvitved added a commit to hvitved/codeql that referenced this pull request May 6, 2020

C#: Add change note for github#3110

ddd62a5

max-schaefer pushed a commit to max-schaefer/codeql-go that referenced this pull request May 6, 2020

Data flow: No more summaries

fd2e618

cf github/codeql#3110

MathiasVP added a commit to MathiasVP/ql that referenced this pull request May 7, 2020

C++: Add testcase for github#3110

594f3b1

MathiasVP mentioned this pull request May 7, 2020

C#/Java/C++: Add change note for #3110 #3421

Merged

rdmarsh2 added a commit that referenced this pull request May 7, 2020

Merge pull request #3426 from MathiasVP/test-3110

761e318

C++: Add testcase for #3110

calumgrant added a commit that referenced this pull request May 13, 2020

Merge pull request #3421 from hvitved/csharp/dataflow/change-note

f5daeea

C#/Java/C++: Add change note for #3110

owen-mc pushed a commit to owen-mc/codeql-go that referenced this pull request Jun 10, 2020

Data flow: No more summaries

6cc14ce

cf github/codeql#3110

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Data flow: No more flow summaries #3110

Data flow: No more flow summaries #3110

Uh oh!

hvitved commented Mar 23, 2020 •

edited

Loading

Uh oh!

jbj commented Apr 7, 2020

Uh oh!

hvitved commented Apr 7, 2020

Uh oh!

jbj commented Apr 7, 2020

Uh oh!

hvitved commented Apr 7, 2020

Uh oh!

hvitved commented Apr 17, 2020

Uh oh!

hvitved commented Apr 20, 2020

Uh oh!

aschackmull left a comment

Uh oh!

hvitved commented May 5, 2020

Uh oh!

aschackmull commented May 5, 2020

Uh oh!

Uh oh!

Data flow: No more flow summaries #3110

Data flow: No more flow summaries #3110

Uh oh!

Conversation

hvitved commented Mar 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Differences jobs

Uh oh!

jbj commented Apr 7, 2020

Uh oh!

hvitved commented Apr 7, 2020

Uh oh!

jbj commented Apr 7, 2020

Uh oh!

hvitved commented Apr 7, 2020

Uh oh!

hvitved commented Apr 17, 2020

Uh oh!

hvitved commented Apr 20, 2020

Uh oh!

aschackmull left a comment

Choose a reason for hiding this comment

Uh oh!

hvitved commented May 5, 2020

Uh oh!

aschackmull commented May 5, 2020

Uh oh!

Uh oh!

hvitved commented Mar 23, 2020 •

edited

Loading