Skip to content

CPP: Model taint through std::string and std::stringstream #1637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

geoffw0
Copy link
Contributor

@geoffw0 geoffw0 commented Jul 25, 2019

Model taint on some very basic std::string and std::stringstream operations. Enhance the AST taint library as required (@jbj please review changes to TaintTracking.qll and DataFlowUtil.qll when you get back; there was a little bit of experiment-until-it-seems-to-work-right involved in some of these changes).

For https://jira.semmle.com/browse/CPP-395 (though further work may be needed to resolve that issue).

In the long run I hope these models will benefit IR taint as well.

@geoffw0 geoffw0 added the C++ label Jul 25, 2019
@geoffw0 geoffw0 requested a review from a team as a code owner July 25, 2019 17:27
@geoffw0
Copy link
Contributor Author

geoffw0 commented Jul 25, 2019

Performance seems fine (tested NonConstantFormat.ql on a snapshot of MongoDB (205s -> 227s) and Qt (493 -> 495s); no new results).

// (this is a bit odd; another way to think of it is the sink effectively
// flowing from the first argument to the return value)
input.isInReturnValue() and
output.isOutParameterPointer(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes sense for taint to come in through the return value. Can we model << in another way? How about saying that in a << b there's taint from b -> a, a -> return, and b -> return.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of:

ss3 << "123" << source();

i.e.

(ss3 << "123") << source();

taint needs to flow from the bracketed expression to ss3 somehow (in actuality I think the thing that's flowing is a reference to the sink, in the other direction, which is why my attempt at solving this looks so backwards).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the problem now.

The ss3 argument should have a post-update node (a DefinitionByReferenceNode) since it's passed by mutable reference into operator<<. Does it work to add flow from (ss3 << "123") to the post-update node of ss3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding correctly this won't actually be a change to the model - you're still asking for flow from the return value to the first parameter. You just want the taint library to wire that up to the post-update node of that parameter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I understand the problem now. It's the third time I think I understand it, so bear with me.

I think there are two things to be modelled about operator<<. The first is that it copies its first argument (a) to its return value (reference to reference). When that's put into the DataFlowFunction model, the taint-tracking library should be extended to add flow from pre-update a to pre-update call and from post-update call to pre-update a. The latter one goes backwards like in https://github.com/Semmle/ql/pull/1766/files#diff-255652465f915d69e8f3adeed3e96f11R238. The second thing to be modelled is the taint. I haven't thought this through completely, and I've got to leave now. I'll revisit this tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more precise, I think these are the steps we want in the model of operator<<:

  • Data flow: isInParameter(0) --> isOutReturnValue()
  • Taint: isInParameterPointer(1) --> isOutReturnPointer()

We should really write some documentation in FunctionInputsAndOutputs.qll. Maybe the documentation is missing because we're still figuring out what each of the input/output kinds ought to mean for taint and data flow. I think that for pointer/reference types, the Value kinds (like isOutReturnValue) refer refer to the pointer itself, and the Pointer kinds (like isOutReturnPointer) refer to the contents that are pointed to. If that's the case, we should rename Pointer to Content or Pointee or something.

When a model specifies "parameter value" to "return value" data flow, I think the data flow library should add both forward flow between the pre-update nodes and backwards flow between the post-update nodes. I had at one point a plan to generalise https://github.com/Semmle/ql/pull/1766/files#diff-255652465f915d69e8f3adeed3e96f11R238 to add such backwards post-update flow everywhere that the right post-update nodes existed, but we decided that it was more sensible to add these cases one at a time. As an example of why we want such flow, consider an identity function on reference types: int& id(int &x) { return x; }. Adding backwards post-update flow would let us model id(id(a)) = source; as an assignment to a (assuming we have flow from pre-update RHS to post-update LHS of =).

When a model specifies flow to a Pointer output (the contents pointed to), I think the data flow (or taint flow) library should interpret that as flow into the appropriate post-update node. In this case, flow to isOutReturnPointer() goes to the post-update node for the operator<< call. Such a post-update node should be made to exist when the call itself is being used as an argument to a function that takes a pointer argument. It doesn't exist today -- that's one of the coverage gaps we talked about in #1900.

With this mapping in place between the models and the data flow library, I think it's going to work. Then we get the following data-flow edges for your example with (ss3 <<₁ "123") <<₂ source();, where ==> is taint, and --> is data flow:

# Edges for <<₁:
pre-ss3 --> pre-<<₁
post-<<₁ --> post-ss3 # backwards post-update flow
pre-"123" ==> post-<<₁

# Edges for <<₂:
pre-<<₁ --> pre-<<₂
post-<<₂ --> post-<<₁ # backwards post-update flow
pre-source() ==> post-<<₂

In that graph, there is a path from pre-source() to post-ss3 as desired.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've discussed this with @aschackmull. The Java library has a StringBuilder class that is more or less equivalent to std::stringstream. Calls like sb.append(s1).append(s2) are modelled by introducing a concept of the "initial qualifier" on a chain of calls -- sb in this case. For C++ we'd prefer to do this modelling only on the Function and not on the Call, and I think that's possible.

When a model specifies flow to a Pointer output (the contents pointed to), I think the data flow (or taint flow) library should interpret that as flow into the appropriate post-update node.

The post-update node I mentioned above may not exist. I'd like to try and investigate whether a Call can be its own post-update node.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the two-rule modelling of operator<< that I proposed in #1637 (comment) was too "clever", relying on flow through post-order nodes of calls that happened to exist in some cases but only by coincidence. Here's how I now think we should model operator<<, using the terminology of #1938. It's the same as what I wrote in #1637 (comment), but this time I'm being precise about what's pre and what's post.

  • Data flow: isParameter(0) --> isReturnValue(). On a << b, this should give data flow from pre-a to pre-<< and from post-<< to post-a.
  • Taint: isParameterDeref(1) --> isReturnValueDeref(). On a << b, this should give taint flow from pre-b to pre-<<. I'd previously claimed this should go to post-<<, but that doesn't exist in general for a call, and the "post-result" of a call is just the result.
  • Taint: isParameterDeref(1) --> isParameterDeref(0). On a << b, this should give taint flow from pre-b to post-a.

That gives us the following edges for (ss3 <<₁ "123") <<₂ source();, where there exists a path from pre-source() to post-ss3.

# Edges for <<₁:
pre-ss3 --> pre-<<₁ # due to isInParameter(0) --> isOutReturnValue()
post-<<₁ --> post-ss3 # backwards post-update flow of the above
pre-"123" ==> pre-<<₁ # due to isParameterDeref(1) --> isReturnValueDeref()
pre-"123" ==> post-ss3 # due to isParameterDeref(1) --> isParameterDeref(0)

# Edges for <<₂:
pre-<<₁ --> pre-<<₂ # due to isInParameter(0) --> isOutReturnValue()
# post-<<₂ --> post-<<₁ # (there is no post-<<₂ node)
pre-source() ==> pre-<<₂ # due to isParameterDeref(1) --> isReturnValueDeref()
pre-source() ==> post-<<₁ # due to isParameterDeref(1) --> isParameterDeref(0)

exprIn = call.getArgument(argInIndex)
) or exists(int argInIndex |
inModel.isInParameterPointer(argInIndex) and
call.passesByReference(argInIndex, exprIn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The passesByReference predicate is intended for out parameters, and it only holds for references to non-const. I don't think this case has the intended effect. If you think it does, please document clearly what effect that should be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same pattern occurs in exprToDefinitionByReferenceStep in master. Is that wrong? Or is there a difference?

(I should probably add test coverage for this specific line if we don't decide to get rid of it...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually wrong what I said: passesByReference holds also for references to non-const.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pattern is also wrong in TaintTrackingUtil.exprToDefinitionByReferenceStep. Since this is taint tracking, data will flow through & already, so that part of passesByReference should not contribute anything. This pattern will allow flow from x to &x.a in the context of f(&x.a), but is that the intended effect? Hopefully that's handled by field flow; if not, please write a test case and let me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a test case and remove passesByReference from both predicates.

exprIn = call.getArgument(argInIndex)
) or exists(int argInIndex |
inModel.isInParameterPointer(argInIndex) and
call.passesByReference(argInIndex, exprIn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like above, I don't think the passesByReference case does what you intend.

exprOut = call
) or exists(int argOutIndex |
outModel.isOutParameterPointer(argOutIndex) and
exprOut = call.getArgument(argOutIndex)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid adding flow from an out-parameter-pointer to an (input) argument? That seems backwards, and I don't think that flow will go anywhere since we have def-use flow and not use-use. What case is it needed for?

@geoffw0
Copy link
Contributor Author

geoffw0 commented Aug 16, 2019

Fixed merge issues (tests and change note).

@geoffw0
Copy link
Contributor Author

geoffw0 commented Aug 16, 2019

I haven't addressed all comments yet though.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Apr 2, 2020

Closed in favour of #3195 (the operator<< stuff will have to be repaired later).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants