-
Notifications
You must be signed in to change notification settings - Fork 1.8k
CPP: Model taint through std::string and std::stringstream #1637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Performance seems fine (tested |
// (this is a bit odd; another way to think of it is the sink effectively | ||
// flowing from the first argument to the return value) | ||
input.isInReturnValue() and | ||
output.isOutParameterPointer(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes sense for taint to come in through the return value. Can we model <<
in another way? How about saying that in a << b
there's taint from b -> a
, a -> return
, and b -> return
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of:
ss3 << "123" << source();
i.e.
(ss3 << "123") << source();
taint needs to flow from the bracketed expression to ss3
somehow (in actuality I think the thing that's flowing is a reference to the sink, in the other direction, which is why my attempt at solving this looks so backwards).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand the problem now.
The ss3
argument should have a post-update node (a DefinitionByReferenceNode
) since it's passed by mutable reference into operator<<
. Does it work to add flow from (ss3 << "123")
to the post-update node of ss3
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding correctly this won't actually be a change to the model - you're still asking for flow from the return value to the first parameter. You just want the taint library to wire that up to the post-update node of that parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think I understand the problem now. It's the third time I think I understand it, so bear with me.
I think there are two things to be modelled about operator<<
. The first is that it copies its first argument (a
) to its return value (reference to reference). When that's put into the DataFlowFunction
model, the taint-tracking library should be extended to add flow from pre-update a
to pre-update call and from post-update call to pre-update a
. The latter one goes backwards like in https://github.com/Semmle/ql/pull/1766/files#diff-255652465f915d69e8f3adeed3e96f11R238. The second thing to be modelled is the taint. I haven't thought this through completely, and I've got to leave now. I'll revisit this tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more precise, I think these are the steps we want in the model of operator<<
:
- Data flow:
isInParameter(0) --> isOutReturnValue()
- Taint:
isInParameterPointer(1) --> isOutReturnPointer()
We should really write some documentation in FunctionInputsAndOutputs.qll
. Maybe the documentation is missing because we're still figuring out what each of the input/output kinds ought to mean for taint and data flow. I think that for pointer/reference types, the Value
kinds (like isOutReturnValue
) refer refer to the pointer itself, and the Pointer
kinds (like isOutReturnPointer
) refer to the contents that are pointed to. If that's the case, we should rename Pointer
to Content
or Pointee
or something.
When a model specifies "parameter value" to "return value" data flow, I think the data flow library should add both forward flow between the pre-update nodes and backwards flow between the post-update nodes. I had at one point a plan to generalise https://github.com/Semmle/ql/pull/1766/files#diff-255652465f915d69e8f3adeed3e96f11R238 to add such backwards post-update flow everywhere that the right post-update nodes existed, but we decided that it was more sensible to add these cases one at a time. As an example of why we want such flow, consider an identity function on reference types: int& id(int &x) { return x; }
. Adding backwards post-update flow would let us model id(id(a)) = source;
as an assignment to a
(assuming we have flow from pre-update RHS to post-update LHS of =
).
When a model specifies flow to a Pointer
output (the contents pointed to), I think the data flow (or taint flow) library should interpret that as flow into the appropriate post-update node. In this case, flow to isOutReturnPointer()
goes to the post-update node for the operator<<
call. Such a post-update node should be made to exist when the call itself is being used as an argument to a function that takes a pointer argument. It doesn't exist today -- that's one of the coverage gaps we talked about in #1900.
With this mapping in place between the models and the data flow library, I think it's going to work. Then we get the following data-flow edges for your example with (ss3 <<₁ "123") <<₂ source();
, where ==>
is taint, and -->
is data flow:
# Edges for <<₁:
pre-ss3 --> pre-<<₁
post-<<₁ --> post-ss3 # backwards post-update flow
pre-"123" ==> post-<<₁
# Edges for <<₂:
pre-<<₁ --> pre-<<₂
post-<<₂ --> post-<<₁ # backwards post-update flow
pre-source() ==> post-<<₂
In that graph, there is a path from pre-source()
to post-ss3
as desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've discussed this with @aschackmull. The Java library has a StringBuilder
class that is more or less equivalent to std::stringstream
. Calls like sb.append(s1).append(s2)
are modelled by introducing a concept of the "initial qualifier" on a chain of calls -- sb
in this case. For C++ we'd prefer to do this modelling only on the Function
and not on the Call
, and I think that's possible.
When a model specifies flow to a
Pointer
output (the contents pointed to), I think the data flow (or taint flow) library should interpret that as flow into the appropriate post-update node.
The post-update node I mentioned above may not exist. I'd like to try and investigate whether a Call
can be its own post-update node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the two-rule modelling of operator<<
that I proposed in #1637 (comment) was too "clever", relying on flow through post-order nodes of calls that happened to exist in some cases but only by coincidence. Here's how I now think we should model operator<<
, using the terminology of #1938. It's the same as what I wrote in #1637 (comment), but this time I'm being precise about what's pre and what's post.
- Data flow:
isParameter(0) --> isReturnValue()
. Ona << b
, this should give data flow frompre-a
topre-<<
and frompost-<<
topost-a
. - Taint:
isParameterDeref(1) --> isReturnValueDeref()
. Ona << b
, this should give taint flow frompre-b
topre-<<
. I'd previously claimed this should go topost-<<
, but that doesn't exist in general for a call, and the "post-result" of a call is just the result. - Taint:
isParameterDeref(1) --> isParameterDeref(0)
. Ona << b
, this should give taint flow frompre-b
topost-a
.
That gives us the following edges for (ss3 <<₁ "123") <<₂ source();
, where there exists a path from pre-source()
to post-ss3
.
# Edges for <<₁:
pre-ss3 --> pre-<<₁ # due to isInParameter(0) --> isOutReturnValue()
post-<<₁ --> post-ss3 # backwards post-update flow of the above
pre-"123" ==> pre-<<₁ # due to isParameterDeref(1) --> isReturnValueDeref()
pre-"123" ==> post-ss3 # due to isParameterDeref(1) --> isParameterDeref(0)
# Edges for <<₂:
pre-<<₁ --> pre-<<₂ # due to isInParameter(0) --> isOutReturnValue()
# post-<<₂ --> post-<<₁ # (there is no post-<<₂ node)
pre-source() ==> pre-<<₂ # due to isParameterDeref(1) --> isReturnValueDeref()
pre-source() ==> post-<<₁ # due to isParameterDeref(1) --> isParameterDeref(0)
exprIn = call.getArgument(argInIndex) | ||
) or exists(int argInIndex | | ||
inModel.isInParameterPointer(argInIndex) and | ||
call.passesByReference(argInIndex, exprIn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The passesByReference
predicate is intended for out parameters, and it only holds for references to non-const. I don't think this case has the intended effect. If you think it does, please document clearly what effect that should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same pattern occurs in exprToDefinitionByReferenceStep
in master. Is that wrong? Or is there a difference?
(I should probably add test coverage for this specific line if we don't decide to get rid of it...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually wrong what I said: passesByReference
holds also for references to non-const.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the pattern is also wrong in TaintTrackingUtil.exprToDefinitionByReferenceStep
. Since this is taint tracking, data will flow through &
already, so that part of passesByReference
should not contribute anything. This pattern will allow flow from x
to &x.a
in the context of f(&x.a)
, but is that the intended effect? Hopefully that's handled by field flow; if not, please write a test case and let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a test case and remove passesByReference from both predicates.
exprIn = call.getArgument(argInIndex) | ||
) or exists(int argInIndex | | ||
inModel.isInParameterPointer(argInIndex) and | ||
call.passesByReference(argInIndex, exprIn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like above, I don't think the passesByReference
case does what you intend.
exprOut = call | ||
) or exists(int argOutIndex | | ||
outModel.isOutParameterPointer(argOutIndex) and | ||
exprOut = call.getArgument(argOutIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid adding flow from an out-parameter-pointer to an (input) argument? That seems backwards, and I don't think that flow will go anywhere since we have def-use flow and not use-use. What case is it needed for?
Fixed merge issues (tests and change note). |
I haven't addressed all comments yet though. |
Closed in favour of #3195 (the |
Model taint on some very basic
std::string
andstd::stringstream
operations. Enhance the AST taint library as required (@jbj please review changes toTaintTracking.qll
andDataFlowUtil.qll
when you get back; there was a little bit of experiment-until-it-seems-to-work-right involved in some of these changes).For https://jira.semmle.com/browse/CPP-395 (though further work may be needed to resolve that issue).
In the long run I hope these models will benefit IR taint as well.