C++: IR field flow #3118
…ll arguments need a PostUpdateNode). Also generalized the added flow rule in simpleLocalFlowStep since there isn't always a ChiInstruction, for instance if it's a write to a struct that only has a single field.
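To make the single-field case concrete, here is a minimal sketch of the two source shapes this commit message distinguishes. It assumes that a field write covering the whole object is a total write, and therefore needs no Chi to merge it into the surrounding memory, while a write to one field of a larger struct is a partial write. The struct and function names are hypothetical and do not come from the test suite.

```cpp
#include <cassert>

// One field: a write to p covers the whole object.
struct Single { int *p; };

// Two fields: a write to p covers only part of the object.
struct Pair { int *p; int *q; };

// Hypothetical helper: total write to a single-field struct, then read back.
int *write_single(int *v)
{
  Single s;
  s.p = v;
  return s.p;
}

// Hypothetical helper: partial write to a two-field struct, then read back.
int *write_pair(int *v)
{
  Pair t;
  t.q = nullptr;
  t.p = v;
  assert(t.q == nullptr); // the other field is untouched by the write to p
  return t.p;
}
```

Both helpers return the pointer they were given, so a field-flow rule has to cover both shapes to track the value.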
@@ -321,6 +354,13 @@ predicate localFlowStep(Node nodeFrom, Node nodeTo) { simpleLocalFlowStep(nodeFr
 */
predicate simpleLocalFlowStep(Node nodeFrom, Node nodeTo) {
  simpleInstructionLocalFlowStep(nodeFrom.asInstruction(), nodeTo.asInstruction())
  or
  exists(LoadInstruction load |
    // TODO: These can probably be getSourceValue() after #3112 is merged
It's now merged (that's the line I meant to comment on a minute ago).
Sadly my comment turned out to be incorrect. The flow in the following program is not captured by only following exact Chi operands:
void sink(int *o);
struct B
{
int *c;
void set(int *c) { this->c = c; }
};
void f7(B *b)
{
b->set(new int);
}
void f8()
{
B *b = new B();
f7(b);
sink(b->c); // flow
}
since the load on b->c is only a total overlap.
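For reference, here is a self-contained variant of the program above that can be compiled and run to confirm that the pointer stored by the setter does reach the sink at run time. The recording `sink`, the `last_sunk` variable, the stored value 42, and the `run_f8` wrapper are assumptions added for illustration; they are not part of the test case.

```cpp
#include <cassert>

// Hypothetical recording sink: remembers the last pointer it received.
static int *last_sunk = nullptr;
void sink(int *o) { last_sunk = o; }

struct B
{
  int *c;
  void set(int *c) { this->c = c; } // writes one field of *this
};

// Stores a fresh allocation into b->c through the setter.
void f7(B *b)
{
  b->set(new int(42));
}

// Same shape as f8 above, but returns the value seen by the sink.
int run_f8()
{
  B *b = new B();
  f7(b);
  sink(b->c); // the pointer written in B::set arrives here
  assert(last_sunk == b->c);
  int seen = *last_sunk;
  delete b->c;
  delete b;
  return seen;
}
```

The run-time behaviour is what the data-flow library is expected to model: the value enters through the `this` argument of `set` and is read back through `b->c`.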
Then you'll want to merge #3097 and use isResultConflated here.
Here's a quick rundown of the changes in the output:
And those two false positives are caused by limitations of the shared field-flow library rather than the C++ instantiation of it. They'll be present in the other languages too.
Testing locally on
I'm looking into this now. Edit: Adding
@@ -219,7 +219,7 @@ abstract class PostUpdateNode extends InstructionNode {
abstract private class PartialDefinitionNode extends PostUpdateNode, TInstructionNode {
  final Instruction getInstructionOrChi() {
    exists(ChiInstruction chi |
      // TODO: This should be a non-conflated ChiInstruction once #3123 is merged
      not chi.isResultConflated() and
With this change we lose flow in the following case:
void sink(int *);
struct B
{
int *i;
};
void f7(B* b)
{
b->i = source();
}
void f8()
{
B* b = new B;
f7(b);
sink(b->i); // flow [not detected]
}
Because the BufferMayWriteSideEffect for b in f7(b) is melded into all aliased memory:
# 15| r15_1(glval<unknown>) = FunctionAddress[f7] :
# 15| r15_2(glval<B *>) = VariableAddress[b] :
# 15| r15_3(B *) = Load : &:r15_2, m14_8
# 15| v15_4(void) = Call : func:r15_1, 0:r15_3
# 15| m15_5(unknown) = ^CallSideEffect : ~m14_6
# 15| m15_6(unknown) = Chi : total:m14_6, partial:m15_5
# 15| v15_7(void) = ^BufferReadSideEffect[0] : &:r15_3, ~m15_6
# 15| m15_8(unknown) = ^BufferMayWriteSideEffect[0] : &:r15_3
# 15| m15_9(unknown) = Chi : total:m15_6, partial:m15_8
# 16| r16_1(glval<unknown>) = FunctionAddress[sink] :
# 16| r16_2(glval<B *>) = VariableAddress[b] :
# 16| r16_3(B *) = Load : &:r16_2, m14_8
# 16| r16_4(glval<int *>) = FieldAddress[i] : r16_3
# 16| r16_5(int *) = Load : &:r16_4, ~m15_9
# 16| v16_6(void) = Call : func:r16_1, 0:r16_5
# 16| m16_7(unknown) = ^CallSideEffect : ~m15_9
I'm not totally sure how to recover this flow without accepting flow through Chi instructions that update all aliased memory.
Interestingly, the flow is reported in the following modified program:
void sink(int *);
struct B
{
int *i;
};
void f7(B* b)
{
b->i = source();
}
void f8(B* b)
{
f7(b);
sink(b->i); // flow
}
since the Chi following the BufferMayWriteSideEffect does not update all aliased memory:
# 15| r15_1(glval<unknown>) = FunctionAddress[f7] :
# 15| r15_2(glval<B *>) = VariableAddress[b] :
# 15| r15_3(B *) = Load : &:r15_2, m12_7
# 15| v15_4(void) = Call : func:r15_1, 0:r15_3
# 15| m15_5(unknown) = ^CallSideEffect : ~m12_4
# 15| m15_6(unknown) = Chi : total:m12_4, partial:m15_5
# 15| v15_7(void) = ^BufferReadSideEffect[0] : &:r15_3, ~m12_9
# 15| m15_8(unknown) = ^BufferMayWriteSideEffect[0] : &:r15_3
# 15| m15_9(int *) = Chi : total:m12_9, partial:m15_8
# 16| r16_1(glval<unknown>) = FunctionAddress[sink] :
# 16| r16_2(glval<B *>) = VariableAddress[b] :
# 16| r16_3(B *) = Load : &:r16_2, m12_7
# 16| r16_4(glval<int *>) = FieldAddress[i] : r16_3
# 16| r16_5(int *) = Load : &:r16_4, ~m15_9
# 16| v16_6(void) = Call : func:r16_1, 0:r16_5
# 16| m16_7(unknown) = ^CallSideEffect : ~m15_6
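The two programs differ only in where `b` comes from. The sketch below puts both variants side by side so they can be compiled and run; at run time the pointer returned by `source()` reaches `sink` in both. The `source` definition, the recording `sink`, and the names `f8_alloc` and `f8_param` are assumptions made for illustration.

```cpp
#include <cassert>

static int tainted = 7;            // hypothetical tainted object
int *source() { return &tainted; } // hypothetical source

// Hypothetical recording sink: remembers the last pointer it received.
static int *last_sunk = nullptr;
void sink(int *p) { last_sunk = p; }

struct B
{
  int *i;
};

void f7(B *b)
{
  b->i = source();
}

// Variant 1: b is a fresh allocation (the flow was initially not detected).
int *f8_alloc()
{
  B *b = new B;
  f7(b);
  sink(b->i);
  assert(last_sunk == b->i);
  delete b;
  return last_sunk;
}

// Variant 2: b is a parameter (the flow was reported).
int *f8_param(B *b)
{
  f7(b);
  sink(b->i);
  return last_sunk;
}
```

Since both variants behave identically when executed, any difference in reported flow comes from how the alias analysis models the allocation, not from the program itself.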
Is that a flaw in our support for allocations (#2797)? I would think that new B and a B * parameter should be treated the same by the current (unsound) alias analysis.
Indeed, a fresh allocation certainly shouldn't alias more stuff than a B* parameter. I'll look at this issue later today.
Merging in #3171 fixes the problem.
CPP-difference run: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/996
Most of the differences on openjdk come from #3123. The two changes for
Oh yeah, good point. I should have chosen my baseline commit more carefully. I'll create a new one that compares against master
I don't think you want to compare against master since some of the unrelated changes come from unmerged PRs. You want the baseline to be a branch with all the PRs that you've included in this PR (on top of the merge-base to master).
I don't think I have included any unmerged PRs in here yet? So far I've only merged from master. So #3123 isn't actually included in this PR. Only the part in #3097.
Here's a more honest CPP-differences run: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1002. I've looked at the results for
A fresh CPP-differences run: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1020/
It's not all that fresh -- it seems to be based on an old
Will do. Fresh was the wrong choice of word here. I meant fresh as in "a new one based on the changes made to this PR". It should be fine to use it for investigation of query results at least.
The latest CPP-Differences revealed a performance regression due to the following predicate in
pragma[nomagic]
private predicate parameterValueFlow0(
ParameterNode p, Node node, ContentOption contentIn, ContentOption contentOut
) {
p = node and
Cand::cand(p, _) and
contentIn = TContentNone() and
contentOut = TContentNone()
or
// local flow
exists(Node mid |
parameterValueFlow(p, mid, contentIn, contentOut) and
LocalFlowBigStep::localFlowBigStep(mid, node)
)
or
...
}
With this PR we get the following tuple counts for iteration 5:
whereas on the current
I'm not exactly sure what's causing the change in RA. On
Here are the relevant predicate differences:
Master:
exists(/* DataFlowUtil::Node */ TIRDataFlowNode mid |
exists(/* DataFlowImplCommon::ContentOption */ TContentOption arg3 |
rec DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow#ffff(p,
mid,
contentOut,
arg3),
contentOut = arg3
),
DataFlowImplCommon::Cached::FlowThrough::LocalFlowBigStep::localFlowBigStep#ff(mid,
node)
);
This PR:
exists(/* DataFlowUtil::Node */ TIRDataFlowNode mid |
DataFlowImplCommon::Cached::FlowThrough::LocalFlowBigStep::localFlowBigStep#ff(mid,
node),
rec DataFlowImplCommon::Cached::FlowThrough::Final::parameterValueFlow#ffff(p,
mid,
contentIn,
contentOut)
);
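One visible difference between the two plans is that the master version carries an extra equality between two of the content columns (`contentOut = arg3`) before joining with localFlowBigStep. The toy sketch below shows only why such an equality can shrink a join: it counts the tuples a nested-loop join produces with and without the filter. It is not the QL evaluator, and the `Flow`, `Step`, and `joinCount` names are hypothetical stand-ins, not predicates from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Flow { int p, mid, in, out; }; // toy stand-in for parameterValueFlow tuples
struct Step { int mid, node; };       // toy stand-in for localFlowBigStep tuples

// Counts the (flow, step) pairs produced by joining the two relations on mid.
// When requireEqual is set, only flow tuples with in == out take part.
std::size_t joinCount(const std::vector<Flow> &flows,
                      const std::vector<Step> &steps,
                      bool requireEqual)
{
  std::size_t n = 0;
  for (const Flow &f : flows) {
    if (requireEqual && f.in != f.out)
      continue; // dropped by the equality constraint before the join
    for (const Step &s : steps)
      if (s.mid == f.mid)
        ++n;
  }
  return n;
}
```

With the filter in place, flow tuples whose two content columns differ never reach the inner loop, which is the kind of difference that shows up in tuple counts.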
I think the constraint on master is inferred indirectly from
The root cause was a stupid mistake on my side. In a previous commit I introduced an abstract class for
This fixed the performance problems: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1041/
Here's the newest CPP-differences after merging in master after Robert's outparam PR (along with necessary flow from #3220): https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1043/ Performance still looks good. The only major spike in time spent is on
That looks great!
It looks to me like there's a performance regression beyond the noise level. Looking at the analysis time diffs, the first I do expect a slowdown since we're doing more work than before, and a few percent seems reasonable. But do you see any individual predicates that have slowed down and are now slower than they ought to be?
The code LGTM apart from these comments.
cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
… on review comments
The new code changes LGTM. What remains is accepting the four test changes in this repo (assuming they're good) and creating a PR in the internal repo to accept the test change there. To avoid conflicts, I suggest basing the internal PR on the mergeback PR I'm going to open within an hour or so.
The query that has the biggest slowdown
Here's the relevant internal PR: https://git.semmle.com/Semmle/code/pull/36713
This PR takes the first stab at implementing field flow in the IR.
This is still WIP, but any comments are appreciated. One thing I remember is @jbj mentioning that DefinitionByReferenceNode couldn't be made a PostUpdateNode because there's not a corresponding pre-update node to pick, but it seemed like this was the most obvious way to model partial flow to the this parameter in setters. And for now #2921 only reports a couple of problems with nodes missing locations, which shouldn't be a problem to fix.
The failing test cases are from Semmle/code, where we flag a new bad case for cpp/user-controlled-bypass. When we are happy with the state of this PR I will make an internal PR and update that test.