Skip to content

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented May 5, 2020

This PR adds support for reverse read field flow in the IR instantiation of the shared dataflow library. It does this by adding two new dataflow nodes, which adds "source code"-like structure to field lookups.

The previous iterations of this PR hasn't had any performance problems, but I've started a new CPP-differences anwyay to be safe: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1159/

@MathiasVP MathiasVP added C++ WIP This is a work-in-progress, do not merge yet! labels May 5, 2020
@MathiasVP MathiasVP force-pushed the flat-structs branch 2 times, most recently from edd1f87 to 609b939 Compare May 7, 2020 09:09
@MathiasVP MathiasVP removed the WIP This is a work-in-progress, do not merge yet! label May 29, 2020
@MathiasVP MathiasVP marked this pull request as ready for review May 29, 2020 11:57
@MathiasVP MathiasVP requested a review from a team as a code owner May 29, 2020 11:57
@rdmarsh2
Copy link
Contributor

rdmarsh2 commented Jun 1, 2020

I'm finding the parent/child and beginning/end terminology in the store chain to be unintuitive, and I can't find a written explanation of what exactly a reverse read is anywhere. If I'm understanding the store side correctly, for each instruction which stores to the result of a FieldAddressInstruction, there's a single store step, which goes from the StoreInstruction or SideEffectInstruction that writes to the innermost field to the StoreNode for that field. Then there's a set of read steps going from the StoreNode for each field to the StoreNode for the next field in, as well as one from the total operand of the ChiInstruction to the StoreNode for the outermost field, if a ChiInstruction is present. There's also a flow step from the StoreNode of the outermost field to the ChiInstruction if present, or the StoreInstruction if not. The net effect of all this is that when the data flow library looks for a matching access path for a read step that the ChiInstruction or StoreInstruction flows to, each StoreNode will add its field to the access path that resulted from the instruction which did the store (or remove it from the access path that caused the search?). Is that accurate?

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Jun 2, 2020

I'm finding the parent/child and beginning/end terminology in the store chain to
be unintuitive, and I can't find a written explanation of what exactly a reverse
read is anywhere.

Good point. I'll add some more comments about what a reverse read is. I guess it
makes sense to add it to the QLDoc for the StoreNode class.

If I'm understanding the store side correctly, for each
instruction which stores to the result of a FieldAddressInstruction, there's a
single store step, which goes from the StoreInstruction or SideEffectInstruction
that writes to the innermost field to the StoreNode for that field.

Correct. For a.b.c = x there's a storeStep from the StoreInstruction to
the StoreNode with the store chain c.

Then there's a set of read steps going from the StoreNode for each field to the
StoreNode for the next field in, as well as one from the total operand of the
ChiInstruction to the StoreNode for the outermost field, if a ChiInstruction is
present.

Correct. For a.b.c = x there's a read step from the total operand of a
ChiInstruction to b, and from b to c.

There's also a flow step from the StoreNode of the outermost field to the
ChiInstruction if present, or the StoreInstruction if not.

Correct. In the a.b.c = x example the StoreNode b has dataflow to the
ChiInstruction if present, and to the StoreInstruction if not.

The net effect of all this is that when the data flow library looks for a
matching access path for a read step that the ChiInstruction or StoreInstruction
flows to, each StoreNode will add its field to the access path that resulted
from the instruction which did the store (or remove it from the access path that
caused the search?). Is that accurate?

Yes, that is accurate. In a.b.c = x the pre update node for the StoreNode for
c is b, and the pre update node for the StoreNode b is the total chi
operand, and since there's a read step from the total chi operand to b, the
field b is added to the access path.

…edicates. I also simplified the code a bit by moving common implementations of predicates into shared super classes. Finally, I added a getLocation predicate to StoreNode to match the structure of the LoadNode class.
rdmarsh2
rdmarsh2 previously approved these changes Jun 2, 2020
Copy link
Contributor

@rdmarsh2 rdmarsh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rdmarsh2
Copy link
Contributor

rdmarsh2 commented Jun 2, 2020

Actually, I just went back and looked at the performance numbers in the difference job - the 5% slowdown on Linux looks pretty bad, but I think it's partly noise (42% of the slowdown was from TaintedPath.ql, but most of the rest is from unrelated queries). I've triggered a rebuild: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1169/

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Jun 3, 2020

Actually, I just went back and looked at the performance numbers in the difference job - the 5% slowdown on Linux looks pretty bad, but I think it's partly noise (42% of the slowdown was from TaintedPath.ql, but most of the rest is from unrelated queries). I've triggered a rebuild: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1169/

Thanks for rebuilding! It looks like the Linux slowdown was mostly noise: the slowdown is down to 1.7% now. The slowdown is still mostly in cpp/path-injection, though. I'll investigate this.

Edit: I've pushed a tiny change that should improve performance (but probably not much). For some reason I hadn't restricted the side effect column of StoreChainEndInstructionSideEffect to be write side effects only. I've started a new CPP-differences to check whether this has the intended effect: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1171/

@MathiasVP
Copy link
Contributor Author

Edit: I've pushed a tiny change that should improve performance (but probably not much). For some reason I hadn't restricted the side effect column of StoreChainEndInstructionSideEffect to be write side effects only. I've started a new CPP-differences to check whether this has the intended effect: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1171/

Looks like the performance improvement didn't have any negative consequences (although it didn't have any noticeable positive effects either).

@MathiasVP
Copy link
Contributor Author

Since this PR is still open I decided to fix a low-hanging fruit that was waiting for #3123 to be merged.

@rdmarsh2 rdmarsh2 merged commit 982fb38 into github:master Jun 10, 2020
jbj added a commit to jbj/ql that referenced this pull request Jun 22, 2020
There was unfortunately a semantic merge conflict between github#3419 and
 github#3587 that caused a performance regression on (at least) OpenJDK.

This reverts commit 982fb38, reversing
changes made to b841cac.
MathiasVP added a commit that referenced this pull request Jun 23, 2020
C++: Revert #3419 to fix OpenJDK performance
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants