Skip to content

C++: Wire up param/arg indirections in data flow #3123

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jun 4, 2020

Conversation

jbj
Copy link
Contributor

@jbj jbj commented Mar 25, 2020

This PR is based on #3097, which should be merged first. Only b622d62 is unique to this PR (for now).

This PR is an attempt to do a principled version of #2737, conservative enough that it applies to data flow. I thought it would just be a matter of wiring up ReadSideEffectInstruction to InitializeParameterInstruction, but it turns out that the involved operands are imprecise. I hope @dbartol or @rdmarsh2 have perspectives on whether they're supposed to be imprecise and whether the solution in this PR could be improved.

@jbj jbj added the C++ label Mar 25, 2020
@jbj jbj requested review from rdmarsh2 and dbartol March 25, 2020 14:38
@jbj
Copy link
Contributor Author

jbj commented Mar 25, 2020

The test failures look similar but not identical to the undesired changes in #2704. I'll investigate, but I'd still like to hear comments on the overall approach of this PR.

Copy link
Contributor

@rdmarsh2 rdmarsh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to expose isParameterOf with negative numbers while hiding ParameterIndirectionNode. I expect that ParameterIndirectionNode would be a fairly popular taint source, so this API would funnel users towards relying on the behavior of isParameterOf.

The new names are chosen to align with Java's `DataFlowUtil.qll`.
jbj added a commit to jbj/ql that referenced this pull request Mar 26, 2020
This change removes some duplicate results that will otherwise appear
due to github#3123 and possibly
github#2704.
jbj added a commit to jbj/ql that referenced this pull request Mar 26, 2020
This change removes some duplicate results that will otherwise appear
due to github#3123 and possibly
github#2704.
@jbj
Copy link
Contributor Author

jbj commented Mar 26, 2020

I've merged #3137 into this PR, hoping it'll fix the test failures.

@rdmarsh2
Copy link
Contributor

re: the operands being imprecise, that was an attempt to represent the uncertainty about what the callee would actually read. I don't think there's a problem with making them precise.

@jbj
Copy link
Contributor Author

jbj commented Mar 27, 2020

@jbj jbj mentioned this pull request Mar 30, 2020
@jbj
Copy link
Contributor Author

jbj commented Apr 6, 2020

The CPP-Differences job showed many new results on openjdk/jdk. Thanks to #3189 I've been able to see that they were FPs, and I was able to find the cause. I'm guessing #2704 is affected by the same problem.

In code like

mystruct s;
s.x = tainted();
// s := chi(total = Uninitialized[s], partial = tainted())
sink(s.y);

there is (and should be) no taint into sink even though there is taint into the chi instruction. That's because the address through which y is loaded is a fresh VariableAddressInstruction rather than anything derived from the chi instruction, and the load is inexact. But if the last line of the example happens in a separate function, then this PR makes it possible (and it's the intended behaviour) for the taint in the chi instruction to end up in the this pointer, and from there it can then spread through a FieldAddressInstruction (which is a UnaryInstruction) to the address of a LoadInstruction. It's all due to the interaction between these three rules:

https://github.com/Semmle/ql/blob/e5d3286ee9dfafa07db488d0b1590d89722f4073/cpp/ql/src/semmle/code/cpp/ir/dataflow/DefaultTaintTracking.qll#L187-L193

jbj added 2 commits April 6, 2020 16:13
@jbj
Copy link
Contributor Author

jbj commented Apr 6, 2020

…-args

Accepted test results that were in semantic merge conflict between
these branches. The changed results are due to a bug that that's part of
github/codeql-c-team#35.
@jbj jbj marked this pull request as ready for review April 7, 2020 12:28
@jbj jbj requested a review from a team as a code owner April 7, 2020 12:28
@jbj
Copy link
Contributor Author

jbj commented Apr 7, 2020

I looked through the CPP-Differences. We lost 21 results, and they all look like false positives to me. They were due to field conflation.

I've marked this PR as ready for review. It started out as a PR to add more flow, but in practice it now causes less taint. The extra flow we're adding was (mostly) already present as taint due to #2737. With the latest change to block FieldAddressInstruction we now have less taint than before. This PR should make a difference for field flow, though, since field flow is carried by data flow but not by taint. I hope that merging this PR will also let us remove the workaround from #2737 soon.

@jbj
Copy link
Contributor Author

jbj commented Apr 7, 2020

I think this PR is essentially ready for merge, but it doesn't provide much value on its own, so we may not even want it in 1.24 unless it's needed for #3118.

The FPs removed with this PR might also get removed by #3219. We're still waiting for the CPP-Differences on that one.

@MathiasVP
Copy link
Contributor

MathiasVP commented Apr 7, 2020

I think this PR is essentially ready for merge, but it doesn't provide much value on its own, so we may not even want it in 1.24 unless it's needed for #3118.

I don't think we'll need it for #3118.

EDIT: It does actually does provide the flow needed to capture field flow of the form:

struct A {
  int x, y;
};

void callSink(A* b) {
  sink(b->x);
}

void foo() {
  A a;
  a.x = user_input();
  callSink(&a);
}

without me doing anything more than merging this PR into #3118.

jbj added 2 commits May 11, 2020 14:44
…-args

Conflicts:
	cpp/ql/src/semmle/code/cpp/ir/dataflow/DefaultTaintTracking.qll
	cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
	cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/defaulttainttracking.cpp
	cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/tainted.expected
	cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/test_diff.expected
	cpp/ql/test/library-tests/dataflow/dataflow-tests/test_ir.expected
This fixes a cosmetic bug in `.../CWE-134/.../examples.c` in the
internal repo.
@jbj jbj requested a review from a team as a code owner May 11, 2020 12:46
@jbj
Copy link
Contributor Author

jbj commented May 11, 2020

Now that IR field flow has been merged, this PR makes a difference. There are three new results (plus two duplicates) in fields/ir-flow.expected. Most other result changes are cosmetic.

I've started https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1114/ to evaluate performance and correctness.

A test is failing in the internal repo because it has some cosmetic differences in the path graph. I think we'll want to revert parts of #2737 at some point to avoid duplicate paths, but it seems to happen only rarely.

@jbj
Copy link
Contributor Author

jbj commented May 13, 2020

I went through all the new CPP-Differences result and found that all were FPs caused by field conflation except two results for cpp/path-injection in php:

  • zend_stream.c:174: new TP
  • main.c:2564: new TP

I've pushed a fix.

…-args

Conflicts:
	cpp/ql/test/library-tests/dataflow/fields/ir-flow.expected
@jbj
Copy link
Contributor Author

jbj commented May 13, 2020

@jbj
Copy link
Contributor Author

jbj commented May 14, 2020

The CPP-Differences shows two FPs that seem to be caused by field conflation. I'll investigate.

This should have been removed in 038bea2.
@jbj
Copy link
Contributor Author

jbj commented May 14, 2020

I'm quite confident that the two new FPs are caused by an existing field conflation that's just been amplified by this PR because there's more flow overall. I've opened #3475 with a test case that's independent of this PR.

With that said, I think this PR is still blocked on fixing that field conflation because it introduces two FPs on git/git.

jbj added 2 commits June 2, 2020 15:35
…-args

The conflicts came from how `this` is now a parameter but not a
`Parameter` on `master`.

Conflicts:
	cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
	cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/defaulttainttracking.cpp
	cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/tainted.expected
	cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/test_diff.expected
	cpp/ql/test/library-tests/dataflow/dataflow-tests/dataflow-ir-consistency.expected
	cpp/ql/test/library-tests/dataflow/fields/ir-flow.expected
	cpp/ql/test/library-tests/syntax-zoo/dataflow-ir-consistency.expected
// A read side effect is almost never exact since we don't know exactly how
// much memory the callee will read.
iTo.(ReadSideEffectInstruction).getSideEffectOperand().getAnyDef() = iFrom and
not iFrom.isResultConflated()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this rule. It sends flow into a ReadSideEffectInstruction, which actually means there is flow to the result value of that instruction. But it's an instruction without a result!

The rule is here because we need a node to be the ArgumentNode in DataFlowPrivate.qll, and I've chosen the node for the ReadSideEffectInstruction. The alternatives I can see are:

  1. Omit this rule and use the most recent definition of the indirection (here named iFrom) as the ArgumentNode. Then a node could be the argument for multiple calls, leading to confusing path explanations.
  2. Create a new synthetic node for this purpose.
  3. Change the data-flow library so that flow alternates between Instruction and Operand nodes. The list of good reasons to do this is starting to get long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see the effect on path explanations in argvLocal.expected, where many new nodes with the awkward name BufferReadSideEffect are now there.

@jbj
Copy link
Contributor Author

jbj commented Jun 2, 2020

@jbj
Copy link
Contributor Author

jbj commented Jun 2, 2020

There will be a semantic merge conflict with #3602. (resolved)

jbj added 2 commits June 2, 2020 18:03
…-args

Fixed a semantic merge conflict by accepting test changes in
`cpp/ql/test/library-tests/dataflow/fields/ir-path-flow.expected`.
Without this override, end users would see the string
`BufferReadSideEffect` in path explanations.
@jbj jbj added the depends on internal PR This PR should only be merged in sync with an internal Semmle PR label Jun 3, 2020
@jbj
Copy link
Contributor Author

jbj commented Jun 3, 2020

I think this PR is ready to go.

I just added 8f702d4, which improves the toString output for the new nodes.

Comment on lines +523 to +527
// Check that the types match. Otherwise we can get flow from an object to
// its fields, which leads to field conflation when there's flow from other
// fields to the object elsewhere.
init.getParameter().getType().getUnspecifiedType().(DerivedType).getBaseType() =
iTo.getResultType().getUnspecifiedType()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we still get flow from object to field when the type of the field is equal to the parameter type? I'm thinking about a situation with a binary tree like this:

struct Tree { Tree *left, *right; };

Tree source();
void sink(Tree*);

void read_left_subtree(Tree* tree) {
  sink(tree->left);
}
...
Tree tree = source();
read_left_subtree(&tree);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll add a test to find out. But even if the answer is that such conflation is possible, does that mean it's undesirable? If we want conflation between array indexes, don't we then also want conflation between entries in a tree or a linked list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It's probably reasonable to have object to field flow in such situations.

@jbj
Copy link
Contributor Author

jbj commented Jun 3, 2020

After a long day where I made probably every mistake that can be made on a pair of synced PRs, the internal PR is finally green.

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ depends on internal PR This PR should only be merged in sync with an internal Semmle PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants