Skip to content

C++: IR field flow #3118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Apr 17, 2020
Merged

C++: IR field flow #3118

merged 30 commits into from
Apr 17, 2020

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Mar 24, 2020

This PR takes the the first stab at implementing field flow in the IR.

This is still WIP, but any comments is appreciated. One thing I remember is @jbj mentioning that DefinitionByReferenceNode couldn't be made a PostUpdateNode because there's not a corresponding pre update node to pick, but it seemed like this was the most obvious way to model partial flow to the this parameter in setters. And for now #2921 only reports a couple of problems with nodes missing locations, which shouldn't be a problem to fix.

The failing testcases are from the Semmle/code, where we flag a new bad case for cpp/user-controlled-bypass. When we are happy with the state of this PR I will make an internal PR and update that test.

@MathiasVP MathiasVP added C++ WIP This is a work-in-progress, do not merge yet! labels Mar 24, 2020
…ll arguments need a PostUpdateNode). Also generalized the added flow rule in simpleLocalFlowStep since there isn't always a ChiInstruction - for instance of it's a write to a struct that only has a single field.
@@ -321,6 +354,13 @@ predicate localFlowStep(Node nodeFrom, Node nodeTo) { simpleLocalFlowStep(nodeFr
*/
predicate simpleLocalFlowStep(Node nodeFrom, Node nodeTo) {
simpleInstructionLocalFlowStep(nodeFrom.asInstruction(), nodeTo.asInstruction())
or
exists(LoadInstruction load |
// TODO: These can probably be getSourceValue() after #3112 is merged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's now merged (that's the line I meant to comment on a minute ago).

Copy link
Contributor Author

@MathiasVP MathiasVP Mar 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly my comment turned out to be incorrect. The flow in the following program is not captured by only following exact Chi operands:

void sink(int *o);

struct B
{
  int *c;
  void set(int *c) { this->c = c; }
};

void f7(B *b)
{
  b->set(new int);
}

void f8()
{
  B *b = new B();
  f7(b);
  sink(b->c); // flow
}

since the load on b->c only is a total overlap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you'll want to merge #3097 and use isResultConflated here.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Mar 26, 2020

Here's a quick rundown of the changes in the output:

  • AuthenticationBypass.qlref in Semmle/code has 1 new true positive
  • ql/cpp/ql/test/library-tests/dataflow/dataflow-tests/test_ir.ql has 3 new true positives and 2 new false positives (the 2 false positives were also present in the old field flow library according to the comments)
  • The changes to ql/cpp/ql/test/library-tests/dataflow/taint-tests/test_diff.ql and ql/cpp/ql/test/library-tests/dataflow/taint-tests/test_ir.ql looks like the expected output given that it's tracking taint.

@jbj
Copy link
Contributor

jbj commented Mar 26, 2020

ql/cpp/ql/test/library-tests/dataflow/dataflow-tests/test_ir.ql has 3 new true positives and 2 new false positives (the 2 false positives were also present in the old field flow library according to the comments)

And those two false positives are caused by limitations of the shared field-flow library rather than the C++ instantiation of it. They'll be present in the other languages too.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Mar 27, 2020

Testing locally on Wireshark/Wireshark revealed something that looks like a bad join order in DataFlowImpl::parameterThroughFlowCand:

[2020-03-27 10:13:23] (3432s) Tuple counts for DataFlowImpl::parameterThroughFlowCand#ff#antijoin_rhs:
                      2981323763 ~0%     {3} r1 = JOIN DataFlowImpl::parameterThroughFlowCand#ff#shared AS L WITH DataFlowUtil::InstructionNode::getEnclosingCallable_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT L.<2>, L.<1>, R.<1>
                      2968445848 ~2%     {4} r2 = JOIN r1 WITH DataFlowImplCommon::Cached::TParamUpdate#2#ff_10#join_rhs AS R ON FIRST 1 OUTPUT r1.<2>, R.<1>, r1.<1>, r1.<0>
                      832264     ~0%     {3} r3 = JOIN r2 WITH project#DataFlowUtil::ParameterNode::isParameterOf_dispred#2#fff AS R ON FIRST 2 OUTPUT r2.<2>, r2.<3>, r2.<0>
                                         return r3

I'm looking into this now.

Edit: Adding pragma[nomagic] to returnFlowCallableCand fixes the problem.

@@ -219,7 +219,7 @@ abstract class PostUpdateNode extends InstructionNode {
abstract private class PartialDefinitionNode extends PostUpdateNode, TInstructionNode {
final Instruction getInstructionOrChi() {
exists(ChiInstruction chi |
// TODO: This should be a non-conflated ChiInstruction once #3123 is merged
not chi.isResultConflated() and
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change we lose flow in the following case:

void sink(int *);

struct B
{
  int *i;
};

void f7(B* b)
{
  b->i = source();
}
void f8()
{
  B* b = new B;
  f7(b);
  sink(b->i); // flow [not detected]
}

Because the BufferMayWriteSideEffect for b in f7(b) is melded into all aliased memory:

#   15|     r15_1(glval<unknown>) = FunctionAddress[f7]           :
#   15|     r15_2(glval<B *>)     = VariableAddress[b]            :
#   15|     r15_3(B *)            = Load                          : &:r15_2, m14_8
#   15|     v15_4(void)           = Call                          : func:r15_1, 0:r15_3
#   15|     m15_5(unknown)        = ^CallSideEffect               : ~m14_6
#   15|     m15_6(unknown)        = Chi                           : total:m14_6, partial:m15_5
#   15|     v15_7(void)           = ^BufferReadSideEffect[0]      : &:r15_3, ~m15_6
#   15|     m15_8(unknown)        = ^BufferMayWriteSideEffect[0]  : &:r15_3
#   15|     m15_9(unknown)        = Chi                           : total:m15_6, partial:m15_8
#   16|     r16_1(glval<unknown>) = FunctionAddress[sink]         :
#   16|     r16_2(glval<B *>)     = VariableAddress[b]            :
#   16|     r16_3(B *)            = Load                          : &:r16_2, m14_8
#   16|     r16_4(glval<int *>)   = FieldAddress[i]               : r16_3
#   16|     r16_5(int *)          = Load                          : &:r16_4, ~m15_9
#   16|     v16_6(void)           = Call                          : func:r16_1, 0:r16_5
#   16|     m16_7(unknown)        = ^CallSideEffect               : ~m15_9

I'm not totally sure how to recover this flow without accepting flow through Chi instructions that update all aliased memory.

Interestingly, the flow is reported in the following modified program:

void sink(int *);

struct B
{
  int *i;
};

void f7(B* b)
{
  b->i = source();
}
void f8(B* b)
{
  f7(b);
  sink(b->i); // flow
}

since the Chi following the BufferMayWriteSideEffect does not update all aliased memory:

#   15|     r15_1(glval<unknown>) = FunctionAddress[f7]          :
#   15|     r15_2(glval<B *>)     = VariableAddress[b]           :
#   15|     r15_3(B *)            = Load                         : &:r15_2, m12_7
#   15|     v15_4(void)           = Call                         : func:r15_1, 0:r15_3
#   15|     m15_5(unknown)        = ^CallSideEffect              : ~m12_4
#   15|     m15_6(unknown)        = Chi                          : total:m12_4, partial:m15_5
#   15|     v15_7(void)           = ^BufferReadSideEffect[0]     : &:r15_3, ~m12_9
#   15|     m15_8(unknown)        = ^BufferMayWriteSideEffect[0] : &:r15_3
#   15|     m15_9(int *)          = Chi                          : total:m12_9, partial:m15_8
#   16|     r16_1(glval<unknown>) = FunctionAddress[sink]        :
#   16|     r16_2(glval<B *>)     = VariableAddress[b]           :
#   16|     r16_3(B *)            = Load                         : &:r16_2, m12_7
#   16|     r16_4(glval<int *>)   = FieldAddress[i]              : r16_3
#   16|     r16_5(int *)          = Load                         : &:r16_4, ~m15_9
#   16|     v16_6(void)           = Call                         : func:r16_1, 0:r16_5
#   16|     m16_7(unknown)        = ^CallSideEffect              : ~m15_6

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a flaw in our support for allocations (#2797)? I would think that new B and a B * parameter should be treated the same by the current (unsound) alias analysis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, a fresh allocation certainly shouldn't alias more stuff than a B* parameter. I'll look at this issue later today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging in #3171 fixes the problem.

@MathiasVP
Copy link
Contributor Author

@jbj
Copy link
Contributor

jbj commented Mar 30, 2020

CPP-difference run: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/996

Most of the differences on openjdk come from #3123. The two changes for cpp/signed-overflow-check are also from a different PR, but I forget which.

@MathiasVP
Copy link
Contributor Author

Most of the differences on openjdk come from #3123. The two changes for cpp/signed-overflow-check are also from a different PR, but I forget which.

Oh yeah, good point. I should have chosen my baseline commit more carefully. I'll create a new one that compares against master

@jbj
Copy link
Contributor

jbj commented Mar 30, 2020

I don't think you want to compare against master since some of the unrelated changes come from unmerged PRs. You want the baseline to be a branch with all the PRs that you've included in this PR (on top of the merge-base to master).

@MathiasVP
Copy link
Contributor Author

I don't think you want to compare against master since some of the unrelated changes come from unmerged PRs. You want the baseline to be a branch with all the PRs that you've included in this PR (on top of the merge-base to master).

I don't think I have included any unmerged PRs in here yet? So far I've only merged from master. So #3123 isn't actually included in this PR. Only the part in #3097.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Mar 31, 2020

Here's a more honest CPP-differences run: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1002. I've looked at the results for linux and openjdk, and they all seem like results we want to flag.

@MathiasVP MathiasVP marked this pull request as ready for review April 2, 2020 09:09
@MathiasVP MathiasVP requested a review from a team as a code owner April 2, 2020 09:09
@MathiasVP MathiasVP added the depends on internal PR This PR should only be merged in sync with an internal Semmle PR label Apr 2, 2020
@MathiasVP
Copy link
Contributor Author

@jbj
Copy link
Contributor

jbj commented Apr 8, 2020

A fresh CPP-differences run: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1020/

It's not all that fresh -- it seems to be based on an old master from 2 April. There's clearly some performance work to be done. Be sure to merge from the latest master, including path explanations, before investigating performance.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Apr 8, 2020

It's not all that fresh -- it seems to be based on an old master from 2 April. There's clearly some performance work to be done. Be sure to merge from the latest master, including path explanations, before investigating performance.

Will do. Fresh was the wrong choice of word here. I meant fresh as in "a new one based on the changes made to this PR". It should be fine to use it for investigation of query results at least.

@MathiasVP
Copy link
Contributor Author

The latest CPP-Differences revealed a performance regression due to the following predicate in DataFlowImplCommon.qll:

pragma[nomagic]
private predicate parameterValueFlow0(
  ParameterNode p, Node node, ContentOption contentIn, ContentOption contentOut
) {
  p = node and
  Cand::cand(p, _) and
  contentIn = TContentNone() and
  contentOut = TContentNone()
  or
  // local flow
  exists(Node mid |
    parameterValueFlow(p, mid, contentIn, contentOut) and
    LocalFlowBigStep::localFlowBigStep(mid, node)
  )
  or
  ...
}

with this PR we get the following tuple counts for iteration 5:

(2455s) Tuple counts for DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow0#ffff#cur_delta:
1579748   ~0%         {4} r1 = SCAN DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow#ffff#prev_delta AS I OUTPUT I.<1>, I.<0>, I.<2>, I.<3>
109572057 ~11994%     {4} r2 = JOIN r1 WITH DataFlowImplCommon::Cached::FlowThrough::LocalFlowBigStep::localFlowBigStep#ff AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>, r1.<2>, r1.<3>
...
109577945 ~11984%     {4} r15 = r2 \/ r14
...
109580255 ~11974%     {4} r27 = r15 \/ r26
...
109580255 ~11974%     {4} r30 = r27 \/ r29
...
109580255 ~11974%     {4} r33 = r30 \/ r32
...
109580255 ~11974%     {4} r36 = r33 \/ r35
...
109580255 ~11974%     {4} r39 = r36 \/ r38
...
109580255 ~11974%     {4} r42 = r39 \/ r41
...
109580255 ~11974%     {4} r45 = r42 \/ r44
14493     ~4%         {4} r46 = r45 AND NOT DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow0#ffff#prev AS R(r45.<1>, r45.<0>, r45.<2>, r45.<3>)
14493     ~6%         {4} r47 = SCAN r46 OUTPUT r46.<1>, r46.<0>, r46.<2>, r46.<3>
                    return r47

where's on the current master (6e88b6d) we get:

(61s) Tuple counts for DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow0#ffff#cur_delta:
57061  ~0%     {4} r1 = SELECT DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow#ffff#prev_delta AS I ON I.<2> = I.<3>
57061  ~3%     {3} r2 = SCAN r1 OUTPUT r1.<1>, r1.<0>, r1.<2>
0      ~0%     {3} r3 = JOIN r2 WITH DataFlowImplCommon::Cached::FlowThrough::LocalFlowBigStep::localFlowBigStep#ff AS R ON FIRST 1 OUTPUT r2.<1>, r2.<2>, R.<1>
...
0      ~0%     {3} r9 = r3 \/ r8
...
0      ~0%     {3} r15 = r9 \/ r14
...
0      ~0%     {3} r20 = r15 \/ r19
...
0      ~0%     {3} r25 = r20 \/ r24
...
0      ~0%     {3} r28 = r25 \/ r27
...
0      ~0%     {3} r31 = r28 \/ r30
106958 ~0%     {4} r32 = SELECT DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow0#ffff#prev AS I ON I.<2> = I.<3>
106958 ~5%     {3} r33 = SCAN r32 OUTPUT r32.<0>, r32.<1>, r32.<2>
106958         {3} r34 = MATERIALIZE r33 AS antijoin_rhs
0      ~0%     {3} r35 = r31 AND NOT r34(r31.<0>, r31.<2>, r31.<1>)
0      ~0%     {4} r36 = SCAN r35 OUTPUT r35.<0>, r35.<2>, r35.<1>, r35.<1>
                return r36

I'm not exactly sure what's causing the change in RA. On master we only keep the rows where contentIn and contentOut are equal, where's this is not the case after the changes in this PR.

Here's the relevant predicate differences:

Master:

  exists(/* DataFlowUtil::Node */ TIRDataFlowNode mid |
    exists(/* DataFlowImplCommon::ContentOption */ TContentOption arg3 |
      rec DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow#ffff(p,
                                                                                  mid,
                                                                                  contentOut,
                                                                                  arg3),
      contentOut = arg3
    ),
    DataFlowImplCommon::Cached::FlowThrough::LocalFlowBigStep::localFlowBigStep#ff(mid,
                                                                                   node)
  );

This PR:

exists(/* DataFlowUtil::Node */ TIRDataFlowNode mid |
  DataFlowImplCommon::Cached::FlowThrough::LocalFlowBigStep::localFlowBigStep#ff(mid,
                                                                                 node),
  rec DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow#ffff(p,
                                                                              mid,
                                                                              contentIn,
                                                                              contentOut)
);

@rdmarsh2
Copy link
Contributor

rdmarsh2 commented Apr 10, 2020

I'm not exactly sure what's causing the change in RA. On master we only keep the rows where contentIn and contentOut are equal, where's this is not the case after the changes in this PR.

I think the constraint on master is inferred indirectly from readStep and storeStep being none(), which isn't true after the PR. The real problem is that there are many possible mid nodes that connect a pair of p and node with the same content. That might be coming from chi nodes on parameters, but averaging 100 of those seems extreme. Could it be caused by alias/conflation of parameters?

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Apr 13, 2020

I think the constraint on master is inferred indirectly from readStep and storeStep being none(), which isn't true after the PR. The real problem is that there are many possible mid nodes that connect a pair of p and node with the same content. That might be coming from chi nodes on parameters, but averaging 100 of those seems extreme. Could it be caused by alias/conflation of parameters?

The root cause was a was a stupid mistake on my side. In a previous commit I introduced an abstract class for ParameterNodes (to split explicit and implicit parameters), and then removed the abstract'ness from the class following your comment in #3123. But (unlike Jonas) I forgot to introduce a char pred to the ParameterNode class, so parameterValueFlow0 was not restricted to only parameters.

This fixed the performance problems: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1041/

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Apr 15, 2020

Here's the newest CPP-differences after merging in master after Robert's outparam PR (along with necessary flow from #3220): https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1043/

Performance still looks good. The only major spike in time spend is on microsoft/Chakra, but that is due to increased build time. We have a couple of new results now as well:

=== Changes per project (+new -fixed) ===

g/git/git                                                                   +1 -0
g/php/php-src                                                               +2 -0

=== Changes per query (+new -fixed) ===

cpp/path-injection                                                          +2 -0

cpp/uncontrolled-allocation-size                                            +1 -0

I think the one in git/git is a FP from field conflation, but the other two looks like true positives. I'll take a closer look at these tomorrow.
On further inspection, they all look like true positives.

@jbj
Copy link
Contributor

jbj commented Apr 16, 2020

Here's the newest CPP-differences after merging in master after Robert's outparam PR (along with necessary flow from #3220): https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1043/

That looks great!

Performance still looks good. The only major spike in time spend is on microsoft/Chakra, but that is due to increased build time.

It looks to me like there's a performance regression beyond the noise level. Looking at the analysis time diffs, the first DefaultTaintTracking-based queries in the suite are clearly taking longer than they used to. The biggest relative slowdown is on git/git, but that could be noise.

I do expect a slowdown since we're doing more work than before, and a few percent seems reasonable. But do you see any individual predicates that have slowed down and are now slower than they ought to be?

Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code LGTM apart from these comments.

jbj
jbj previously approved these changes Apr 17, 2020
Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new code changes LGTM. What remains is accepting the four test changes in this repo (assuming they're good) and creating a PR in the internal repo to accept the test change there. To avoid conflicts, I suggest basing the internal PR on the mergeback PR I'm going to open within an hour or so.

@MathiasVP
Copy link
Contributor Author

I do expect a slowdown since we're doing more work than before, and a few percent seems reasonable. But do you see any individual predicates that have slowed down and are now slower than they ought to be?

The query that has the biggest slowdown cpp/cpp/path-injection has has the biggest slowdown in the query phase of the evaluation. Looking at the timing stats for that particular stage locally for git/git and wireshark/wireshark (and a brief look at the timing stats for other stages) there's no predicate that immediately caught my eye as unreasonably slow.

@MathiasVP MathiasVP removed the WIP This is a work-in-progress, do not merge yet! label Apr 17, 2020
@MathiasVP
Copy link
Contributor Author

Here's the relevant internal PR: https://git.semmle.com/Semmle/code/pull/36713

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ depends on internal PR This PR should only be merged in sync with an internal Semmle PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants