Skip to content

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Oct 6, 2021

@MathiasVP MathiasVP added the C++ label Oct 6, 2021
@github-actions github-actions bot added the C# label Oct 6, 2021
@MathiasVP MathiasVP force-pushed the use-shared-ssa-in-ir-dataflow branch 2 times, most recently from d06b7b0 to d429f90 Compare October 17, 2021 20:55
@MathiasVP MathiasVP force-pushed the use-shared-ssa-in-ir-dataflow branch 4 times, most recently from e04052b to ddaa28f Compare October 25, 2021 11:05
@MathiasVP MathiasVP force-pushed the use-shared-ssa-in-ir-dataflow branch from ddaa28f to 8426857 Compare October 26, 2021 15:13
@MathiasVP MathiasVP marked this pull request as ready for review October 27, 2021 07:06
@MathiasVP MathiasVP requested review from a team as code owners October 27, 2021 07:06
@jbj jbj requested review from jbj, andersfugmann and dbartol October 27, 2021 07:48
@MathiasVP MathiasVP force-pushed the use-shared-ssa-in-ir-dataflow branch from 0783d46 to ba20c79 Compare October 28, 2021 10:08
…Instead, we introduce a StoreNode IPA branch that does store steps and instead use the shared SSA library to transfer flow into these nodes before a store step, and out of them following a sequence of store steps.
…dges based way of doing read steps. Instead, we use the shared SSA library to transfer flow into a new ReadNode IPA branch, perform the necessary read steps, and then use the shared SSA library to transfer flow out of the ReadNode again.
… Instead, we rely on the shared SSA library's use-use edges.
…oadInstructions, we no longer have flow from PhiInstructions to LoadInstructions. We could allow flow in this particular case, but we might as well use the shared SSA library's phi edges.
@geoffw0
Copy link
Contributor

geoffw0 commented Nov 2, 2021

SAMATE Juliet test results: https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp-detailed/183/
Compared to a recent run as baseline: https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp-detailed/181/

  • headline is that our overall score goes up from 0.083 to 0.088, a decent improvement!
  • we do considerably better at finding TPs on CWE-134 (68% -> 91%, cpp/tainted-format-string and cpp/tainted-format-string-through-global)
  • we do marginally better at finding TPs on CWE-78 (4% -> 5%, cpp/command-line-injection)
  • I don't see any regressions
  • (note: this analysis was started a few hours after my previous comment)

🎉

Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read all commits now. There's a limit to how deep I've dived into the code, but overall I'm concerned about the many new places where "data flow inside data flow" is used to bridge gaps that are caused by certain syntactic constructs. I'd like to hear your thoughts on whether these sub-data-flow relations can be avoided by exposing more information from the IR.

I'm overall positive about merging this PR because it's shown good results and acceptable performance in testing.

Comment on lines 367 to 370
bb.getInstruction(i1) = write and
bb.getInstruction(i2) = op.getUse() and
// Flow to an instruction that occurs later in the block.
valueFlow*(nodeFrom.getInstruction(), op.getDef()) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this seems very special-cased and syntactic. Can it deal with a = (b ? new A : nullptr)? What syntax can it not deal with?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with it being very special-cased and syntactic, yes :(. I hope we can delete this hack with a more clever instantiation to the shared SSA library in the future (that better accounts for indirections).

Regarding a = (b ? new A : nullptr): it works here because the IR inserts a StoreInstruction that stores to a temporary after the call to new A:

r11_12(void *)         = Call[operator new]              : func:r11_10
m11_13(unknown)        = ^CallSideEffect                 : ~m9_4
m11_14(unknown)        = Chi                             : total:m9_4, partial:m11_13
m11_15(unknown)        = ^InitializeDynamicAllocation    : &:r11_12
r11_16(A *)            = Convert                         : r11_12
r11_17(glval<unknown>) = FunctionAddress[A]              :
v11_18(void)           = Call[A]                         : func:r11_17, this:r11_16
m11_19(unknown)        = ^CallSideEffect                 : ~m11_14
m11_20(unknown)        = Chi                             : total:m11_14, partial:m11_19
m11_21(A)              = ^IndirectMayWriteSideEffect[-1] : &:r11_16
m11_22(unknown)        = Chi                             : total:m11_15, partial:m11_21
r11_23(glval<A *>)     = VariableAddress[#temp11:12]     :
m11_24(A *)            = Store[#temp11:12]               : &:r11_23, r11_16

So in this case we will flow to nodeTo.asOperand() will be the StoreValueOperand r11_16.

Comment on lines 459 to 467
private predicate valueFlow(Instruction iFrom, Instruction iTo) {
iTo.(CopyValueInstruction).getSourceValue() = iFrom
or
iTo.(ConvertInstruction).getUnary() = iFrom
or
iTo.(CheckedConvertOrNullInstruction).getUnary() = iFrom
or
iTo.(InheritanceConversionInstruction).getUnary() = iFrom
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the underlying principle here? How can a future maintainer know whether it's appropriate to add a case for pointer arithmetic, for example? An InheritanceConversionInstruction will sometimes do pointer arithmetic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1f89b49 adds a QLDoc and renames the predicate to better fit the purpose. This is one of those places where it'd be nice to have a "get me the result of iTo.getUnary(), but skip past all the conversions.

Comment on lines 54 to +57
mi.i = source();

sink(mi); // $ ir MISSING: ast
sink(mi.get()); // $ ast,ir
sink(mi); // $ MISSING: ast,ir
sink(mi.get()); // $ ast MISSING: ir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything we can do to restore this result? Would it have worked if pointers were used instead of references?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look at it. I'm sure we can restore it, but I'll probably create an issue for it to not add even more code that needs to be reviewed.

Copy link
Contributor Author

@MathiasVP MathiasVP Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying issue is that, after doing the store step for mi.i = source(); and we end up with flow on the node VariableAddress[mi] [post update], we transfer flow to the read side effect on mi.get() (since that is the next load of mi following the store to mi). But in this case, we actually wanted flow to the this argument (and not *this) since the read inside get looks like:

r9_10(glval<unknown>) = VariableAddress[#this]       :
r9_11(MyInt *)        = Load[#this]                  : &:r9_10, m9_6
r9_12(glval<int>)     = FieldAddress[i]              : r9_11
r0_1(int &)           = CopyValue                    : r9_12
m0_2(int &)           = Store[#return]               : &:r9_9, r0_1

and we infer the read step from VariableAddress[#this] (i.e., not from *this) that reads i.

Now, this should have been handled by 5dbaea8, but for some reason it isn't. I think it's not handled because the code returns a reference, and thus the code is looking for a load that's not there. I'll create an issue that describes this in more detail, and we can revisit it later.

One quick fix to this is to delete the ugly hack in 5dbaea8 and accept object-pointer conflation on call-boundaries, by transfering flow to the this argument (instead of *this) following store steps. All this should be a one-line change to the CallInstruction case in flowOutOfAddressStep.

geoffw0
geoffw0 previously approved these changes Nov 3, 2021
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to test results look good overall - some regressions, but many improvements as well, and I think quite a lot more of the latter. The SAMATE and the DCA results I've looked at look good. I'm also aware of follow-up plans to make things even better.

👍

@@ -97,7 +97,7 @@ void randomTester() {
int r = 0;
int *ptr_r = &r;
*ptr_r = RAND();
r += 100; // BAD
r += 100; // BAD [NOT DETECTED]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be typical of the cases we lose - where a variable is written to via a pointer, we now don't see that it was written to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Those are precisely the cases the IR did a good job in helping us detect. I hope we can do https://github.com/github/codeql-c-team/issues/714 as a follow-up to get those results back.

| whilestmt.c:9:7:9:10 | VariableAddress [post update] | PostUpdateNode should not be the target of local flow. |
| whilestmt.c:11:5:11:8 | done [post update] | PostUpdateNode should not be the target of local flow. |
| whilestmt.c:40:7:40:7 | VariableAddress [post update] | PostUpdateNode should not be the target of local flow. |
| whilestmt.c:42:7:42:7 | VariableAddress [post update] | PostUpdateNode should not be the target of local flow. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot more consistency check failures here than there were before (though there were a large number in the first place) - can you summarize what's going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. All these consistency issues come from the StoreNodeFlow::flowThrough predicate which does PostUpdateNode -> PostUpdateNode flow to handle all the different kinds of conversions happening between fields lookups. The intention of PostUpdateNodes is that they should only arise as the target of a storeStep, and should not be produced by simpleLocalFlowStep. However, in order to skip these (from the purpose of dataflow) irrelevant dataflow nodes that represent conversions, we need to step over them.

We could get rid of these at the cost of adding another "data flow in data flow" mechanism, but as @jbj alluded to https://github.com/github/codeql/pull/6825#pullrequestreview-796179710 this is probably not a solution we want.

As far as I know, no one is aware of any performance issues caused by having PostUpdate -> PostUpdate simple flow steps. The consistency check was just put in since such steps aren't present in the other language's dataflow implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should remove that particular consistency check (for C++)?

Copy link
Contributor Author

@MathiasVP MathiasVP Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would definitely like to get rid of all the consistency warnings we have no plans to address!

The main problem is that the consistency queries are shared via identical-files between all the languages. I can think of a number of ways to handle this:

  • Opt-out of identical-files for this file
  • Add a pyrameterized predicate to modify the behavior of this check on a per-language basis

I'm much more in favor of the second option. We have an instance of this approach already with the isImmutableOrUnobservable predicate. I'm not exactly sure how the interface for such a pyrameterized predicate would look like, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we jump through such hoops, I think you should talk to the other data flow maintainers about whether flow into post-update nodes can sometimes be a good thing. Last time I checked, Java could not send post-update flow through casts because it was missing flow into post-update nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. It might be that this consistency check should be relaxed for all languages.

@MathiasVP
Copy link
Contributor Author

[...] overall I'm concerned about the many new places where "data flow inside data flow" is used to bridge gaps that are caused by certain syntactic constructs. I'd like to hear your thoughts on whether these sub-data-flow relations can be avoided by exposing more information from the IR.

Thanks for going through it all! I share your concern about "data flow inside data flow". I think it would help a lot to make it possible to have conversions "on the side" as we have for AST. The good thing about the IR is that all of these conversions are explicit in the syntax and we're forced to think about them, but the bad thing is that we're forced to handle them everywhere. For instance, with the valueFlow predicate that you commented on earlier: it's really just doing the work that a getUnconvertedInstruction predicate would give us.

geoffw0
geoffw0 previously approved these changes Nov 4, 2021
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've suggested a couple of things, but none of them should be considered blocking for this PR.

Do any of @jbj s comments require fixes?

@geoffw0
Copy link
Contributor

geoffw0 commented Nov 4, 2021

We agreed it's probably worth having a slightly deeper look at the DCA results before we merge this. Here's my bit, by query:

cpp/cleartext-storage-buffer (2 new results)

  • both results are in the BAD cases of SAMATE tests

cpp/command-line-injection (54 new results)

  • all new results are in SAMATE

cpp/path-injection (734 new results, 176 lost)

  • 720 new results on SAMATE.
  • 6 new TPs on git/git
  • 1 new TP on openjdk/jdk
  • 3 new TPs on vim/vim

cpp/tainted-format-string (432 new results, 24 lost)

  • All in SAMATE.

cpp/tainted-format-string-through-global (96 new results, 0 lost)

  • All in SAMATE.

cpp/unbounded-write (13 new results, 26 lost)

  • 5 new results on openjdk__jdk (first result is an FP, buffer allocated using a strlen variant + 1 a few lines above the strcpy. This is something the query doesn't account for)
  • 5 new results on vim__vim (first result looks like a FP somewhat similar to the above, i.e. an existing weakness in the query)
  • 8 lost results on openjdk__jdk (first result looks like a FP, same reason as above)
  • 18 lost results on vim__vim (first result may be a TP, flow is complex)

cpp/uncontrolled-allocation-size (1 new result)

  • one new result in Vim, looks like it involves multiple possible paths e.g. from diff_read, it looks like a TP if I haven't missed a safe sanitizer somewhere in the (long) flow path.
  • this would be nice as a path query! it is a path query, but DCA isn't showing the path, so I have to view it locally.

cpp/uncontrolled-arithmetic (24 new results, 216 lost)

  • all new and lost results are in SAMATE

cpp/uncontrolled-process-operation (111 new results, 24 lost)

  • one new result in openjdk__jdk, looks like a correct result
  • all other new and lost results are in SAMATE

cpp/user-controlled-null-termination-tainted (53 new results, 60 lost)

  • 53 new results in openjdk__jdk (first two results are plausible but lack flow path needed to verify)
  • 1 new result in vim__vim (as above)
  • 13 lost results in git__git (again difficult to judge)
  • 9 lost results in openjdk__jdk (first one looks like a TP involving a pointer)
  • 38 lost results in vim__vim (again difficult to judge)
  • this could really do with being a path query!

In conclusion, there are certainly differences but nothing surprising given the differences we already expected from looking at the tests (and a few of the lower precision queries could clearly be improved quite easily, but that's nothing to do with this PR).

@MathiasVP MathiasVP added the depends on internal PR This PR should only be merged in sync with an internal Semmle PR label Nov 9, 2021
@geoffw0
Copy link
Contributor

geoffw0 commented Nov 9, 2021

Tests are still failing. I intend to merge this as soon as they're fixed.

@MathiasVP
Copy link
Contributor Author

Tests are still failing. I intend to merge this as soon as they're fixed.

They're fixed in the internal PR.

@geoffw0
Copy link
Contributor

geoffw0 commented Nov 9, 2021

Ah, sorry, I missed that. I'll review and merge both...

Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Time to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C# C++ depends on internal PR This PR should only be merged in sync with an internal Semmle PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants