C++: Redesign IR dataflow using the shared SSA library #6825

MathiasVP · 2021-10-06T21:08:16Z

Fixes https://github.com/github/codeql-c-team/issues/663.

…to 'ReturnValueInstruction'.

…Instead, we introduce a StoreNode IPA branch that does store steps and instead use the shared SSA library to transfer flow into these nodes before a store step, and out of them following a sequence of store steps.

…dges based way of doing read steps. Instead, we use the shared SSA library to transfer flow into a new ReadNode IPA branch, perform the necessary read steps, and then use the shared SSA library to transfer flow out of the ReadNode again.

…Nodes.

… Instead, we rely on the shared SSA library's use-use edges.

…oadInstructions, we no longer have flow from PhiInstructions to LoadInstructions. We could allow flow in this particular case, but we might as well use the shared SSA library's phi edges.

…ons.

…t changes.

geoffw0 · 2021-11-02T09:49:29Z

SAMATE Juliet test results: https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp-detailed/183/
Compared to a recent run as baseline: https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp-detailed/181/

headline is that our overall score goes up from 0.083 to 0.088, a decent improvement!
we do considerably better at finding TPs on CWE-134 (68% -> 91%, cpp/tainted-format-string and cpp/tainted-format-string-through-global)
we do marginally better at finding TPs on CWE-78 (4% -> 5%, cpp/command-line-injection)
I don't see any regressions
(note: this analysis was started a few hours after my previous comment)

🎉

…ns that set 'certain = false' in 'explicitWrite'.

jbj

I've read all commits now. There's a limit to how deep I've dived into the code, but overall I'm concerned about the many new places where "data flow inside data flow" is used to bridge gaps that are caused by certain syntactic constructs. I'd like to hear your thoughts on whether these sub-data-flow relations can be avoided by exposing more information from the IR.

I'm overall positive about merging this PR because it's shown good results and acceptable performance in testing.

jbj · 2021-11-03T07:34:26Z

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll

+      bb.getInstruction(i1) = write and
+      bb.getInstruction(i2) = op.getUse() and
+      // Flow to an instruction that occurs later in the block.
+      valueFlow*(nodeFrom.getInstruction(), op.getDef()) and


All this seems very special-cased and syntactic. Can it deal with a = (b ? new A : nullptr)? What syntax can it not deal with?

I agree with it being very special-cased and syntactic, yes :(. I hope we can delete this hack with a more clever instantiation to the shared SSA library in the future (that better accounts for indirections).

Regarding a = (b ? new A : nullptr): it works here because the IR inserts a StoreInstruction that stores to a temporary after the call to new A:

r11_12(void *) = Call[operator new] : func:r11_10 m11_13(unknown) = ^CallSideEffect : ~m9_4 m11_14(unknown) = Chi : total:m9_4, partial:m11_13 m11_15(unknown) = ^InitializeDynamicAllocation : &:r11_12 r11_16(A *) = Convert : r11_12 r11_17(glval<unknown>) = FunctionAddress[A] : v11_18(void) = Call[A] : func:r11_17, this:r11_16 m11_19(unknown) = ^CallSideEffect : ~m11_14 m11_20(unknown) = Chi : total:m11_14, partial:m11_19 m11_21(A) = ^IndirectMayWriteSideEffect[-1] : &:r11_16 m11_22(unknown) = Chi : total:m11_15, partial:m11_21 r11_23(glval<A *>) = VariableAddress[#temp11:12] : m11_24(A *) = Store[#temp11:12] : &:r11_23, r11_16

So in this case we will flow to nodeTo.asOperand() will be the StoreValueOperand r11_16.

jbj · 2021-11-03T07:36:17Z

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll

+  private predicate valueFlow(Instruction iFrom, Instruction iTo) {
+    iTo.(CopyValueInstruction).getSourceValue() = iFrom
+    or
+    iTo.(ConvertInstruction).getUnary() = iFrom
+    or
+    iTo.(CheckedConvertOrNullInstruction).getUnary() = iFrom
+    or
+    iTo.(InheritanceConversionInstruction).getUnary() = iFrom
+  }


What's the underlying principle here? How can a future maintainer know whether it's appropriate to add a case for pointer arithmetic, for example? An InheritanceConversionInstruction will sometimes do pointer arithmetic.

1f89b49 adds a QLDoc and renames the predicate to better fit the purpose. This is one of those places where it'd be nice to have a "get me the result of iTo.getUnary(), but skip past all the conversions.

jbj · 2021-11-03T07:46:53Z

cpp/ql/test/library-tests/dataflow/taint-tests/arrayassignment.cpp

 	mi.i = source();

-	sink(mi); // $ ir MISSING: ast
-	sink(mi.get()); // $ ast,ir
+	sink(mi); // $ MISSING: ast,ir
+	sink(mi.get()); // $ ast MISSING: ir


Is there anything we can do to restore this result? Would it have worked if pointers were used instead of references?

I'll take a look at it. I'm sure we can restore it, but I'll probably create an issue for it to not add even more code that needs to be reviewed.

The underlying issue is that, after doing the store step for mi.i = source(); and we end up with flow on the node VariableAddress[mi] [post update], we transfer flow to the read side effect on mi.get() (since that is the next load of mi following the store to mi). But in this case, we actually wanted flow to the this argument (and not *this) since the read inside get looks like:

r9_10(glval<unknown>) = VariableAddress[#this] : r9_11(MyInt *) = Load[#this] : &:r9_10, m9_6 r9_12(glval<int>) = FieldAddress[i] : r9_11 r0_1(int &) = CopyValue : r9_12 m0_2(int &) = Store[#return] : &:r9_9, r0_1

and we infer the read step from VariableAddress[#this] (i.e., not from *this) that reads i.

Now, this should have been handled by 5dbaea8, but for some reason it isn't. I think it's not handled because the code returns a reference, and thus the code is looking for a load that's not there. I'll create an issue that describes this in more detail, and we can revisit it later.

One quick fix to this is to delete the ugly hack in 5dbaea8 and accept object-pointer conflation on call-boundaries, by transfering flow to the this argument (instead of *this) following store steps. All this should be a one-line change to the CallInstruction case in flowOutOfAddressStep.

geoffw0

The changes to test results look good overall - some regressions, but many improvements as well, and I think quite a lot more of the latter. The SAMATE and the DCA results I've looked at look good. I'm also aware of follow-up plans to make things even better.

👍

geoffw0 · 2021-11-02T12:42:32Z

cpp/ql/test/query-tests/Security/CWE/CWE-190/semmle/ArithmeticUncontrolled/test.c

@@ -97,7 +97,7 @@ void randomTester() {
    int r = 0;
    int *ptr_r = &r;
    *ptr_r = RAND();
-    r += 100; // BAD
+    r += 100; // BAD [NOT DETECTED]


This seems to be typical of the cases we lose - where a variable is written to via a pointer, we now don't see that it was written to.

Yes. Those are precisely the cases the IR did a good job in helping us detect. I hope we can do https://github.com/github/codeql-c-team/issues/714 as a follow-up to get those results back.

cpp/ql/test/query-tests/Security/CWE/CWE-134/semmle/funcs/funcsLocal.expected

geoffw0 · 2021-11-02T12:57:44Z

cpp/ql/test/library-tests/syntax-zoo/dataflow-ir-consistency.expected

+| whilestmt.c:9:7:9:10 | VariableAddress [post update] | PostUpdateNode should not be the target of local flow. |
+| whilestmt.c:11:5:11:8 | done [post update] | PostUpdateNode should not be the target of local flow. |
+| whilestmt.c:40:7:40:7 | VariableAddress [post update] | PostUpdateNode should not be the target of local flow. |
+| whilestmt.c:42:7:42:7 | VariableAddress [post update] | PostUpdateNode should not be the target of local flow. |


There are a lot more consistency check failures here than there were before (though there were a large number in the first place) - can you summarize what's going on?

Yes. All these consistency issues come from the StoreNodeFlow::flowThrough predicate which does PostUpdateNode -> PostUpdateNode flow to handle all the different kinds of conversions happening between fields lookups. The intention of PostUpdateNodes is that they should only arise as the target of a storeStep, and should not be produced by simpleLocalFlowStep. However, in order to skip these (from the purpose of dataflow) irrelevant dataflow nodes that represent conversions, we need to step over them.

We could get rid of these at the cost of adding another "data flow in data flow" mechanism, but as @jbj alluded to https://github.com/github/codeql/pull/6825#pullrequestreview-796179710 this is probably not a solution we want.

As far as I know, no one is aware of any performance issues caused by having PostUpdate -> PostUpdate simple flow steps. The consistency check was just put in since such steps aren't present in the other language's dataflow implementation.

Perhaps we should remove that particular consistency check (for C++)?

I would definitely like to get rid of all the consistency warnings we have no plans to address!

The main problem is that the consistency queries are shared via identical-files between all the languages. I can think of a number of ways to handle this:

Opt-out of identical-files for this file

Add a pyrameterized predicate to modify the behavior of this check on a per-language basis

I'm much more in favor of the second option. We have an instance of this approach already with the isImmutableOrUnobservable predicate. I'm not exactly sure how the interface for such a pyrameterized predicate would look like, though.

Before we jump through such hoops, I think you should talk to the other data flow maintainers about whether flow into post-update nodes can sometimes be a good thing. Last time I checked, Java could not send post-update flow through casts because it was missing flow into post-update nodes.

Indeed. It might be that this consistency check should be relaxed for all languages.

cpp/ql/test/library-tests/dataflow/taint-tests/map.cpp

This reverts commit 092beb8.

MathiasVP · 2021-11-03T10:57:29Z

[...] overall I'm concerned about the many new places where "data flow inside data flow" is used to bridge gaps that are caused by certain syntactic constructs. I'd like to hear your thoughts on whether these sub-data-flow relations can be avoided by exposing more information from the IR.

Thanks for going through it all! I share your concern about "data flow inside data flow". I think it would help a lot to make it possible to have conversions "on the side" as we have for AST. The good thing about the IR is that all of these conversions are explicit in the syntax and we're forced to think about them, but the bad thing is that we're forced to handle them everywhere. For instance, with the valueFlow predicate that you commented on earlier: it's really just doing the work that a getUnconvertedInstruction predicate would give us.

…ains its purpose.

… a 'LoadInstruction' at certain places.

geoffw0

I've suggested a couple of things, but none of them should be considered blocking for this PR.

Do any of @jbj s comments require fixes?

geoffw0 · 2021-11-04T19:46:12Z

We agreed it's probably worth having a slightly deeper look at the DCA results before we merge this. Here's my bit, by query:

cpp/cleartext-storage-buffer (2 new results)

both results are in the BAD cases of SAMATE tests

cpp/command-line-injection (54 new results)

all new results are in SAMATE

cpp/path-injection (734 new results, 176 lost)

720 new results on SAMATE.
6 new TPs on git/git
1 new TP on openjdk/jdk
3 new TPs on vim/vim

cpp/tainted-format-string (432 new results, 24 lost)

All in SAMATE.

cpp/tainted-format-string-through-global (96 new results, 0 lost)

All in SAMATE.

cpp/unbounded-write (13 new results, 26 lost)

5 new results on openjdk__jdk (first result is an FP, buffer allocated using a strlen variant + 1 a few lines above the strcpy. This is something the query doesn't account for)
5 new results on vim__vim (first result looks like a FP somewhat similar to the above, i.e. an existing weakness in the query)
8 lost results on openjdk__jdk (first result looks like a FP, same reason as above)
18 lost results on vim__vim (first result may be a TP, flow is complex)

cpp/uncontrolled-allocation-size (1 new result)

one new result in Vim, looks like it involves multiple possible paths e.g. from diff_read, it looks like a TP if I haven't missed a safe sanitizer somewhere in the (long) flow path.
~~this would be nice as a path query!~~ it is a path query, but DCA isn't showing the path, so I have to view it locally.

cpp/uncontrolled-arithmetic (24 new results, 216 lost)

all new and lost results are in SAMATE

cpp/uncontrolled-process-operation (111 new results, 24 lost)

one new result in openjdk__jdk, looks like a correct result
all other new and lost results are in SAMATE

cpp/user-controlled-null-termination-tainted (53 new results, 60 lost)

53 new results in openjdk__jdk (first two results are plausible but lack flow path needed to verify)
1 new result in vim__vim (as above)
13 lost results in git__git (again difficult to judge)
9 lost results in openjdk__jdk (first one looks like a TP involving a pointer)
38 lost results in vim__vim (again difficult to judge)
this could really do with being a path query!

In conclusion, there are certainly differences but nothing surprising given the differences we already expected from looking at the tests (and a few of the lower precision queries could clearly be improved quite easily, but that's nothing to do with this PR).

geoffw0 · 2021-11-09T09:52:13Z

Tests are still failing. I intend to merge this as soon as they're fixed.

MathiasVP · 2021-11-09T09:52:31Z

Tests are still failing. I intend to merge this as soon as they're fixed.

They're fixed in the internal PR.

geoffw0 · 2021-11-09T09:53:04Z

Ah, sorry, I missed that. I'll review and merge both...

geoffw0

👍

Time to merge.

MathiasVP added the C++ label Oct 6, 2021

github-actions bot added the C# label Oct 6, 2021

MathiasVP force-pushed the use-shared-ssa-in-ir-dataflow branch 2 times, most recently from d06b7b0 to d429f90 Compare October 17, 2021 20:55

MathiasVP force-pushed the use-shared-ssa-in-ir-dataflow branch 4 times, most recently from e04052b to ddaa28f Compare October 25, 2021 11:05

MathiasVP force-pushed the use-shared-ssa-in-ir-dataflow branch from ddaa28f to 8426857 Compare October 26, 2021 15:13

MathiasVP marked this pull request as ready for review October 27, 2021 07:06

MathiasVP requested review from a team as code owners October 27, 2021 07:06

jbj requested review from jbj, andersfugmann and dbartol October 27, 2021 07:48

MathiasVP added 3 commits October 28, 2021 10:51

C++: Add 'getReturnAddress' and 'getReturnAddressOperand' predicates …

bccd4e9

…to 'ReturnValueInstruction'.

C++/C#: Sync identical IR files·

13ce256

C++: Add shared SSA library and instantiate it with the IR.

1842fed

MathiasVP force-pushed the use-shared-ssa-in-ir-dataflow branch from 0783d46 to ba20c79 Compare October 28, 2021 10:08

MathiasVP added 11 commits October 28, 2021 12:35

C++: Rewrite the PartialDefinitionNode classes to match the new Store…

3a48857

…Nodes.

C++: Throw away most of the usage of IR-computed def-use information.…

8caff41

… Instead, we rely on the shared SSA library's use-use edges.

C++: Since we now no longer have flow from exact memory operands to L…

710d0cf

…oadInstructions, we no longer have flow from PhiInstructions to LoadInstructions. We could allow flow in this particular case, but we might as well use the shared SSA library's phi edges.

C++: Remove the taintflow edges that gives performance problems.

b1ea00f

C++: Add a special dataflow step from InitializeIndirection instructi…

5dbaea8

…ons.

C++: Accept test changes.

3efe60f

C++: Add annoying case in SSA.qll related to 'NewExpr' and accept tes…

21a1ee7

…t changes.

C++: Fix join orders in 'DataFlowDispatch.qll' and Ssa.qll.

2547a8d

C++: Autoformat.

521d863

C++: Add a small QLDoc novel above the IPA type for 'TIRDataFlowNode'.

d34e731

MathiasVP added 3 commits November 2, 2021 10:59

C++: Don't count write operations as uses.

092beb8

C++: Add comments to some of the disjuncts in 'addressFlow'.

56cabb8

C++: Add 'InheritanceConversionInstruction' to the list of instructio…

3e6ac74

…ns that set 'certain = false' in 'explicitWrite'.

jbj reviewed Nov 3, 2021

View reviewed changes

geoffw0 previously approved these changes Nov 3, 2021

View reviewed changes

hvitved dismissed geoffw0’s stale review via 3e6ac74 November 3, 2021 09:57

MathiasVP added 2 commits November 3, 2021 10:37

Revert "C++: Don't count write operations as uses."

ad5619f

This reverts commit 092beb8.

Merge branch 'main' into use-shared-ssa-in-ir-dataflow

dfbfbe4

MathiasVP added 3 commits November 3, 2021 12:22

C++: Rename 'valueFlow' to 'conversionFlow' and add a QLDoc that expl…

1f89b49

…ains its purpose.

C++: Remove redundant conjunct.

43a4795

C++: Add comments on why 'ReferenceToInstruction' is interpreted like…

4095c20

… a 'LoadInstruction' at certain places.

MathiasVP mentioned this pull request Nov 3, 2021

C++: Better InlineExpectation tests for path-explanations #7051

Merged

geoffw0 previously approved these changes Nov 4, 2021

View reviewed changes

MathiasVP added 2 commits November 8, 2021 10:44

Merge branch 'main' into use-shared-ssa-in-ir-dataflow

fff5d29

C++: Pull in the latest changes to 'SsaImplCommon'.

8e496f7

MathiasVP dismissed geoffw0’s stale review via 8e496f7 November 8, 2021 10:47

MathiasVP added the depends on internal PR This PR should only be merged in sync with an internal Semmle PR label Nov 9, 2021

geoffw0 approved these changes Nov 9, 2021

View reviewed changes

geoffw0 merged commit d9e02e8 into github:main Nov 9, 2021

MathiasVP mentioned this pull request Nov 10, 2021

C++: Fix a testcase #7102

Merged

This was referenced Nov 22, 2021

C++: Don't interpret 'ReferenceToInstruction' as a load #7109

Merged

C++: Hide some IR dataflow nodes #7226

Merged

MathiasVP mentioned this pull request Jan 14, 2022

C++: Store destinations should not be uses for dataflow SSA #7578

Merged

C++: Redesign IR dataflow using the shared SSA library #6825

C++: Redesign IR dataflow using the shared SSA library #6825

Uh oh!

Conversation

MathiasVP commented Oct 6, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geoffw0 commented Nov 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jbj left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MathiasVP Nov 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

geoffw0 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MathiasVP Nov 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MathiasVP commented Nov 3, 2021

Uh oh!

geoffw0 left a comment

Choose a reason for hiding this comment

Uh oh!

geoffw0 commented Nov 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geoffw0 commented Nov 9, 2021

Uh oh!

MathiasVP commented Nov 9, 2021

Uh oh!

geoffw0 commented Nov 9, 2021

Uh oh!

geoffw0 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MathiasVP commented Oct 6, 2021 •

edited

Loading

geoffw0 commented Nov 2, 2021 •

edited

Loading

MathiasVP Nov 3, 2021 •

edited

Loading

MathiasVP Nov 3, 2021 •

edited

Loading

geoffw0 commented Nov 4, 2021 •

edited

Loading