-
Notifications
You must be signed in to change notification settings - Fork 1.8k
C++: DataFlow::Node.toString consistency #1926
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'd appreciate if we could get #1881 merged first (I've just updated it), as it involves a submodule bump and is likely to conflict. |
| taint.cpp:8:8:8:13 | clean1 | taint.cpp:4:27:4:33 | source1 | | ||
| taint.cpp:16:8:16:14 | source1 | taint.cpp:12:22:12:27 | ... = ... | | ||
| taint.cpp:16:8:16:14 | source1 | taint.cpp:12:22:12:27 | call to source | | ||
| taint.cpp:17:8:17:16 | ++ ... | taint.cpp:12:22:12:27 | ... = ... | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two new lines with the source being ... = ...
are new, and I don't think we want them. I also don't think that the toString
changes in this PR are the cause of the problem -- they just make it visible.
The IR DataFlow::Node.asExpr
and asConvertedExpr
map from an Expr
to an Instruction
using https://github.com/Semmle/ql/blob/10076a6b2bd509ca8dc021c05c508c572cf6e957/cpp/ql/src/semmle/code/cpp/ir/implementation/aliased_ssa/Instruction.qll#L437-L440. Notice how the QLDoc on it implies that it won't have multiple results. That predicate a wrapper for https://github.com/Semmle/ql/blob/10076a6b2bd509ca8dc021c05c508c572cf6e957/cpp/ql/src/semmle/code/cpp/ir/implementation/raw/internal/IRConstruction.qll#L49-L54, which has only one result as long as no two expressions claim to have the same Instruction as result. Unfortunately, such expressions exist. For example, C++ Assignment
expressions have their LHS as result: https://github.com/Semmle/ql/blob/10076a6b2bd509ca8dc021c05c508c572cf6e957/cpp/ql/src/semmle/code/cpp/ir/implementation/raw/internal/TranslatedExpr.qll#L1242-L1254.
@dave-bartolomeo, how can we fix this? What uses of getResult
do we have that requires TranslatedExpr
to sometimes specify that its result is an instruction that doesn't belong to itself? I've informally thought of getResult
as getting the Instruction
whose value is the value computed by the expression; for an assignment, that would be the Store
instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do this to handle the following cases (which aren't currently included in the IR tests or, AFAICT, the syntax zoo)
void UseAssignExpr(int x) {
int w, y, z;
y = z = x;
(y = z) = x;
if (w = x) {
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what would be the problem with those test cases if the result of the =
were its Store
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the first and third cases, the consumers of the result of the =
expect a register operand, not a memory operand. In the second case, the result of the first =
is an lvalue for the left operand, that is then stored into by the second =
. We could use CopyValueInstruction
s to preserve the current semantics while avoiding the duplicate sources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I see what you're saying.
The rules for the return value of an assignment are different between C and C++. To keep this simple, let's focus on C++. Let me know if I've understood this correctly. A C++ assignment expression a = b
has a result that's an lvalue for the same address as a
. This is consistent with overloaded assignment operators, which look like C& C::operator=(const C &rhs) { ...; return *this; }
. This makes it possible to implement a class my_int { int content; ... }
for which assignment behaves exactly (okay, exactly is a big word in this context) the same as the built-in assignment on int
.
I'd argue that ideally the IR translation of a built-in =
should work in the same way as an overloaded operator=
. It'll have an lvalue as result, and then that lvalue will be Load
ed if necessary (test case 1 and 3). And that seems to be how it works now.
So your suggestion is to add a CopyValueInstruction
to TranslatedAssignment
. For performance and clarity, we'd probably want to omit that instruction in the common case where =
is in a void context.
I'm still not sure how common it is for TranslatedElement
s to have results that are not their own instructions. If it's common, perhaps it will bite us in DataFlow::asExpr
for other kinds of expression, and we should be looking for a way to fix all the cases in a unified way. Like automatically inserting CopyValueInstruction
before a TranslatedLoad
that would otherwise get its operand from a different AST element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently TranslatedCommaExpr
, TranslatedPrefixCrementOperation
in C++ when the result is not being loaded, TranslatedTransparentExpr
, TranslatedAssignment
, and TranslatedConditionDeclExpr
have a getResult
that is a child's getResult
. TranslatedDestructorFieldDestruction
has no getResult
.
Of those, TranslatedTransparentExpr
and TranslatedAssignment
seem like the only ones that are likely to show up frequently as sources or sinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking. It sounds to me like adding CopyValueInstruction
to all these instructions would be a good solution. But that's not for this PR.
Do you think this PR should be delayed until we can improve the instruction-to-expr mapping? I think this PR is just exposing a problem that was there already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdmarsh2's analysis is spot on. While we've always tried to ensure that there's at least one unique Instruction
for each leaf Expr
, we've been willing to let interior Expr
nodes return the result of a child. We could go back and add CopyValue
in all of those cases that aren't already in a void context, but one consequence of that would be that every use of parentheses in an expression would get another Instruction
, which seems extravagant. I think that targeting only those expressions that are likely to be used as sources or sinks would make more sense. Either that, or only target those expressions that are not off to the side as Conversion
s. That may be the same set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's essentially the same set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've merged back from master
, and this problem has gone away thanks to #1999.
This string looked out of place compared to `ExplicitParameterNode`, whose string is simply the name of the parameter and therefore indistinguishable from an access to the parameter without looking at the location also. This has not been a problem so far, and if we want to distinguish more clearly between initial values and accesses at some point, we should do it for `ExplicitParameterNode` and `UninitializedNode` too.
The `toString` for IR data-flow nodes are now similar to AST data-flow nodes. This should make it easier to use the IR as a drop-in replacement in the future. There are still differences because the IR data flow library takes conversions into account. I did not attempt to align the new nodes we use for field flow. That can come later, when we add field flow to IR data flow.
1899e1f
to
7cfbe88
Compare
Rebased to fix semantic merge conflict with #1881. |
(this PR is waiting for #1999 to get merged first) |
#1999 has been merged. I'll review this PR as soon as the conflicts are resolved. |
Solved conflicts in `*.expected` by re-running the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR is my attempt to address #1888 (comment).