C++: Redesign IR dataflow using the shared SSA library #6825
…Instead, we introduce a StoreNode IPA branch that does store steps and instead use the shared SSA library to transfer flow into these nodes before a store step, and out of them following a sequence of store steps.
…dges based way of doing read steps. Instead, we use the shared SSA library to transfer flow into a new ReadNode IPA branch, perform the necessary read steps, and then use the shared SSA library to transfer flow out of the ReadNode again.
… Instead, we rely on the shared SSA library's use-use edges.
…oadInstructions, we no longer have flow from PhiInstructions to LoadInstructions. We could allow flow in this particular case, but we might as well use the shared SSA library's phi edges.
SAMATE Juliet test results: https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp-detailed/183/
🎉
…ns that set 'certain = false' in 'explicitWrite'.
I've read all commits now. There's a limit to how deep I've dived into the code, but overall I'm concerned about the many new places where "data flow inside data flow" is used to bridge gaps that are caused by certain syntactic constructs. I'd like to hear your thoughts on whether these sub-data-flow relations can be avoided by exposing more information from the IR.
I'm overall positive about merging this PR because it's shown good results and acceptable performance in testing.
bb.getInstruction(i1) = write and
bb.getInstruction(i2) = op.getUse() and
// Flow to an instruction that occurs later in the block.
valueFlow*(nodeFrom.getInstruction(), op.getDef()) and
All this seems very special-cased and syntactic. Can it deal with `a = (b ? new A : nullptr)`? What syntax can it not deal with?
I agree with it being very special-cased and syntactic, yes :(. I hope we can delete this hack with a more clever instantiation to the shared SSA library in the future (that better accounts for indirections).
Regarding `a = (b ? new A : nullptr)`: it works here because the IR inserts a `StoreInstruction` that stores to a temporary after the call to `new A`:
r11_12(void *) = Call[operator new] : func:r11_10
m11_13(unknown) = ^CallSideEffect : ~m9_4
m11_14(unknown) = Chi : total:m9_4, partial:m11_13
m11_15(unknown) = ^InitializeDynamicAllocation : &:r11_12
r11_16(A *) = Convert : r11_12
r11_17(glval<unknown>) = FunctionAddress[A] :
v11_18(void) = Call[A] : func:r11_17, this:r11_16
m11_19(unknown) = ^CallSideEffect : ~m11_14
m11_20(unknown) = Chi : total:m11_14, partial:m11_19
m11_21(A) = ^IndirectMayWriteSideEffect[-1] : &:r11_16
m11_22(unknown) = Chi : total:m11_15, partial:m11_21
r11_23(glval<A *>) = VariableAddress[#temp11:12] :
m11_24(A *) = Store[#temp11:12] : &:r11_23, r11_16
So in this case we will flow to `nodeTo.asOperand()`, which will be the `StoreValueOperand` `r11_16`.
private predicate valueFlow(Instruction iFrom, Instruction iTo) {
  iTo.(CopyValueInstruction).getSourceValue() = iFrom
  or
  iTo.(ConvertInstruction).getUnary() = iFrom
  or
  iTo.(CheckedConvertOrNullInstruction).getUnary() = iFrom
  or
  iTo.(InheritanceConversionInstruction).getUnary() = iFrom
}
What's the underlying principle here? How can a future maintainer know whether it's appropriate to add a case for pointer arithmetic, for example? An `InheritanceConversionInstruction` will sometimes do pointer arithmetic.
1f89b49 adds a QLDoc and renames the predicate to better fit the purpose. This is one of those places where it'd be nice to have a "get me the result of `iTo.getUnary()`, but skip past all the conversions" operation.
  mi.i = source();

- sink(mi); // $ ir MISSING: ast
- sink(mi.get()); // $ ast,ir
+ sink(mi); // $ MISSING: ast,ir
+ sink(mi.get()); // $ ast MISSING: ir
Is there anything we can do to restore this result? Would it have worked if pointers were used instead of references?
I'll take a look at it. I'm sure we can restore it, but I'll probably create an issue for it to not add even more code that needs to be reviewed.
The underlying issue is that, after doing the store step for `mi.i = source();` and ending up with flow on the node `VariableAddress[mi] [post update]`, we transfer flow to the read side effect on `mi.get()` (since that is the next load of `mi` following the store to `mi`). But in this case, we actually wanted flow to the `this` argument (and not `*this`), since the read inside `get` looks like:
r9_10(glval<unknown>) = VariableAddress[#this] :
r9_11(MyInt *) = Load[#this] : &:r9_10, m9_6
r9_12(glval<int>) = FieldAddress[i] : r9_11
r0_1(int &) = CopyValue : r9_12
m0_2(int &) = Store[#return] : &:r9_9, r0_1
and we infer the read step from `VariableAddress[#this]` (i.e., not from `*this`) that reads `i`.
Now, this should have been handled by 5dbaea8, but for some reason it isn't. I think it's not handled because the code returns a reference, and thus the code is looking for a load that's not there. I'll create an issue that describes this in more detail, and we can revisit it later.
One quick fix for this is to delete the ugly hack in 5dbaea8 and accept object-pointer conflation on call boundaries, by transferring flow to the `this` argument (instead of `*this`) following store steps. All this should be a one-line change to the `CallInstruction` case in `flowOutOfAddressStep`.
The changes to test results look good overall - some regressions, but many improvements as well, and I think quite a lot more of the latter. The SAMATE and the DCA results I've looked at look good. I'm also aware of follow-up plans to make things even better.
👍
@@ -97,7 +97,7 @@ void randomTester() {
   int r = 0;
   int *ptr_r = &r;
   *ptr_r = RAND();
-  r += 100; // BAD
+  r += 100; // BAD [NOT DETECTED]
This seems to be typical of the cases we lose - where a variable is written to via a pointer, we now don't see that it was written to.
Yes. Those are precisely the cases the IR did a good job in helping us detect. I hope we can do https://github.com/github/codeql-c-team/issues/714 as a follow-up to get those results back.
cpp/ql/test/query-tests/Security/CWE/CWE-134/semmle/funcs/funcsLocal.expected
| whilestmt.c:9:7:9:10 | VariableAddress [post update] | PostUpdateNode should not be the target of local flow. |
| whilestmt.c:11:5:11:8 | done [post update] | PostUpdateNode should not be the target of local flow. |
| whilestmt.c:40:7:40:7 | VariableAddress [post update] | PostUpdateNode should not be the target of local flow. |
| whilestmt.c:42:7:42:7 | VariableAddress [post update] | PostUpdateNode should not be the target of local flow. |
There are a lot more consistency check failures here than there were before (though there were a large number in the first place) - can you summarize what's going on?
Yes. All these consistency issues come from the `StoreNodeFlow::flowThrough` predicate, which does `PostUpdateNode -> PostUpdateNode` flow to handle all the different kinds of conversions happening between field lookups. The intention of `PostUpdateNode`s is that they should only arise as the target of a `storeStep`, and should not be produced by `simpleLocalFlowStep`. However, the dataflow nodes that represent conversions are irrelevant for the purposes of dataflow, and in order to skip them we need to step over them.
We could get rid of these at the cost of adding another "data flow in data flow" mechanism, but as @jbj alluded to in https://github.com/github/codeql/pull/6825#pullrequestreview-796179710, this is probably not a solution we want.
As far as I know, no one is aware of any performance issues caused by having `PostUpdate -> PostUpdate` simple flow steps. The consistency check was just put in since such steps aren't present in the other languages' dataflow implementations.
Perhaps we should remove that particular consistency check (for C++)?
I would definitely like to get rid of all the consistency warnings we have no plans to address!
The main problem is that the consistency queries are shared via `identical-files` between all the languages. I can think of a number of ways to handle this:
- Opt out of `identical-files` for this file
- Add a parameterized predicate to modify the behavior of this check on a per-language basis
I'm much more in favor of the second option. We have an instance of this approach already with the `isImmutableOrUnobservable` predicate. I'm not exactly sure what the interface for such a parameterized predicate would look like, though.
Before we jump through such hoops, I think you should talk to the other data flow maintainers about whether flow into post-update nodes can sometimes be a good thing. Last time I checked, Java could not send post-update flow through casts because it was missing flow into post-update nodes.
Indeed. It might be that this consistency check should be relaxed for all languages.
Thanks for going through it all! I share your concern about "data flow inside data flow". I think it would help a lot to make it possible to have conversions "on the side" as we have for AST. The good thing about the IR is that all of these conversions are explicit in the syntax and we're forced to think about them, but the bad thing is that we're forced to handle them everywhere. For instance, with the …
…ains its purpose.
… a 'LoadInstruction' at certain places.
I've suggested a couple of things, but none of them should be considered blocking for this PR.
Do any of @jbj's comments require fixes?
We agreed it's probably worth having a slightly deeper look at the DCA results before we merge this. Here's my bit, by query:
In conclusion, there are certainly differences but nothing surprising given the differences we already expected from looking at the tests (and a few of the lower precision queries could clearly be improved quite easily, but that's nothing to do with this PR).
Tests are still failing. I intend to merge this as soon as they're fixed.
They're fixed in the internal PR.
Ah, sorry, I missed that. I'll review and merge both...
👍
Time to merge.
Fixes https://github.com/github/codeql-c-team/issues/663.