Skip to content

Conversation

jbj
Copy link
Contributor

@jbj jbj commented Sep 6, 2019

This PR is a less ambitious alternative to #1778.

The data flow library conflates pointers and their objects in some places but not others. For example, a member function call x.f() will cause flow from x of type T to this of type T* inside f. It might be ideal to avoid that conflation, but that's not realistic without using the IR.

We've had good experience in the taint tracking library with conflating pointers and objects, and it improves results for field flow, so perhaps it's time to try it out for all data flow.

@jbj jbj added the C++ label Sep 6, 2019
@jbj jbj requested a review from a team as a code owner September 6, 2019 09:34
@jbj jbj force-pushed the dataflow-addressof branch from 526f91e to 3794a0b Compare September 9, 2019 08:41
@jbj jbj force-pushed the dataflow-addressof branch from 3794a0b to 5041f59 Compare September 9, 2019 10:51
@jbj
Copy link
Contributor Author

jbj commented Sep 9, 2019

Updated to fix semantic merge conflict with #1900.

Copy link
Contributor

@rdmarsh2 rdmarsh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of test nits. I'm nervous about the potential false positive impact - have we done any real-world testing?

@@ -245,6 +264,7 @@ edges
| by_reference.cpp:51:10:51:20 | call to getDirectly | by_reference.cpp:50:17:50:26 | call to user_input | by_reference.cpp:51:10:51:20 | call to getDirectly | call to getDirectly flows from $@ | by_reference.cpp:50:17:50:26 | call to user_input | call to user_input |
| by_reference.cpp:57:10:57:22 | call to getIndirectly | by_reference.cpp:56:19:56:28 | call to user_input | by_reference.cpp:57:10:57:22 | call to getIndirectly | call to getIndirectly flows from $@ | by_reference.cpp:56:19:56:28 | call to user_input | call to user_input |
| by_reference.cpp:63:10:63:28 | call to getThroughNonMember | by_reference.cpp:62:25:62:34 | call to user_input | by_reference.cpp:63:10:63:28 | call to getThroughNonMember | call to getThroughNonMember flows from $@ | by_reference.cpp:62:25:62:34 | call to user_input | call to user_input |
| by_reference.cpp:69:8:69:20 | call to nonMemberGetA | by_reference.cpp:68:21:68:30 | call to user_input | by_reference.cpp:69:8:69:20 | call to nonMemberGetA | call to nonMemberGetA flows from $@ | by_reference.cpp:68:21:68:30 | call to user_input | call to user_input |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the comment for this in the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -12,7 +12,7 @@ struct Outer {
};

void absink(struct AB *ab) {
sink(ab->a); // flow x3 [NOT DETECTED]
sink(ab->a); // flow (three paths)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better if it were three separate sources. Losing a path would be easier to miss than losing a source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jbj
Copy link
Contributor Author

jbj commented Sep 11, 2019

I'm nervous about the potential false positive impact - have we done any real-world testing?

I don't think we have too many standard queries that would be affected by this. I just tested four data flow queries on MySQL, and this PR made no difference.

I take comfort in the fact that my original data flow library prototype (from 2017) had flow through both & and *, and that worked well. The old and the new taint tracking libraries have these flows as well, and they don't seem to cause problems.

The IR data flow libraries do not have flow through &, of course, because there is no & in the IR.

@geoffw0
Copy link
Contributor

geoffw0 commented Sep 12, 2019

LGTM.

Just did some real-world testing of some dataflow queries on Qt with and without this change. No differences in results or measurable differences in performance.

@rdmarsh2 how do you feel about this PR now?

@rdmarsh2
Copy link
Contributor

That's reassuring. I'm happy to merge once the test conflict is resolved.

jbj added 4 commits September 17, 2019 13:16
The data flow library conflates pointers and their objects in some
places but not others. For example, a member function call `x.f()` will
cause flow from `x` of type `T` to `this` of type `T*` inside `f`. It
might be ideal to avoid that conflation, but that's not realistic
without using the IR.

We've had good experience in the taint tracking library with conflating
pointers and objects, and it improves results for field flow, so perhaps
it's time to try it out for all data flow.
These new results seem better than the previous ones, but the previous
ones are still there. Perhaps the `Buffer.qll` library could use some
adjustment, but this seems like an improvement in isolation.
This addresses PR comments by @rdmarsh2.
@jbj jbj force-pushed the dataflow-addressof branch from e041052 to b2df18a Compare September 17, 2019 11:20
@jbj
Copy link
Contributor Author

jbj commented Sep 17, 2019

Rebased to fix conflicts with #1881.

@rdmarsh2 rdmarsh2 merged commit fd88f7a into github:master Sep 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants