Skip to content

Conversation

@erik-krogh
Copy link
Contributor

@erik-krogh erik-krogh commented Dec 8, 2021

For this PR I have chosen not to rebase and squash, the commit history are the iterations I went through in making this PR.
(And the commit-messages are accordingly).

I recommend not reviewing each commit as much gets reverted along the way.
But the commits give a good idea about the process that I followed.

nightly/lgtm-full evaluation (~11% average speedup).
nightly/lgtm-full evaluation (single-threaded) (~15% average speedup).

The speedup gets larger the more queries are added in an evaluation, so the above represents a best-case performance gain.
Single-query performance will be slower if you start with an empty cache.

Here is the DB size increase.


Here is the process I followed:

  1. Copy-paste the CachedStages.qll file from JavaScript into Python.
  2. Do a single-threaded DCA run on main, to figure out which queries are slow.
  3. Run a slow and advanced query. Check the output log and look sections starting with Results in:.
    These sections tell you which predicates are the cached predicates in a given stage.
    Use CachedStages.qll to group some of these stages in a way that makes sense.
  4. Look at the evaluation, find some good pairs of slow query and slow source.
  5. Get databases of the slow sources locally, run some slow queries with tuple counting, and fix the performance by doing these:
    5.1. Look at the second to last Clause timing report (the report for the last stage) of a slow query, add cache to expensive non-query specific predicates (and add them to CachedStages.qll).
    5.2. Fix any bad join orders that are introduced.
    5.3. Look at Results in: again, and check if there are any cached predicates that should be grouped.
    5.4. If you've accidentally grouped too many stages into one big group, then revert whatever you did.
    5.5. Revert some caching if the DB size increases too much.
    5.6. Revert some imports if I caused ql/abstract-class-import alerts.
  6. Do a single-threaded DCA run with your new code.
    ( Start an experiment -> Use saved experiment -> From previous experiment is helpful).
  7. If you are happy with the new evaluation go to 8, else go to 4.
  8. Do a validation evaluation using the default codeql-action options.
  9. Open a PR.

@erik-krogh erik-krogh added Python no-change-note-required This PR does not need a change note labels Dec 8, 2021
@erik-krogh erik-krogh force-pushed the pyPerf branch 3 times, most recently from af54bcc to dbfe0c5 Compare December 15, 2021 14:02
@erik-krogh erik-krogh force-pushed the pyPerf branch 4 times, most recently from ade85c4 to 8893268 Compare March 9, 2022 09:41
@erik-krogh erik-krogh force-pushed the pyPerf branch 2 times, most recently from 080bdb0 to 2c4af29 Compare March 10, 2022 08:26
@erik-krogh erik-krogh changed the title Python: Cache more predicates and improve performance. Python: Cache more predicates to improve performance. Mar 24, 2022
@erik-krogh erik-krogh marked this pull request as ready for review March 24, 2022 12:06
@erik-krogh erik-krogh requested a review from a team as a code owner March 24, 2022 12:06
tausbn
tausbn previously approved these changes Mar 30, 2022
Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!
All of the changes look sensible to me (modulo one comment). 👍

Now I guess I need to rejigger my intuitions for what happens in which stages...

/** Gets the function for this statement */
Function getDefinedFunction() {
exists(FunctionExpr func | this.containsInScope(func) and result = func.getInnerScope())
result = f.getInnerScope() // XXX: This behaves very differently. But from inspecting the results of the previous version, that had every function in the same scope as the result.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no. 😲
This looks more correct to me.

Copy link
Contributor Author

@erik-krogh erik-krogh Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
I'll delete the comment.
Edit: It seems I had already deleted it in a later commit.

* Where var may be redefined in call to `foo` if `var` escapes (is global or non-local).
*/
pragma[noinline]
pragma[inline]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was surprising to me, but I assume the performance tests show that it's a net improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a while, so I don't really remember.
But I suppose it was an improvement.

Comment on lines 44 to 47
private predicate isBeforeCode(Comment c, File f) {
f = c.getLocation().getFile() and
minStmtLine(f) < c.getLocation().getStartLine()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that you're just following the naming scheme set out in the file already, but isn't this predicate actually expressing that the comment c appears after code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very confused. I'm sure I saw a commit that fixed this, but it appears to be unfixed. 🤔

Copy link
Contributor Author

@erik-krogh erik-krogh Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You definitely saw that, I'm not sure why that disappeared in the rebase.
(I did the rebase to try to get QL-for-QL to stop complaining).

I reintroduced the fix, and also updated the expected output for the location test (that has an extra result due to the Django models getting imported).

@erik-krogh erik-krogh requested a review from a team as a code owner March 30, 2022 20:59
@erik-krogh
Copy link
Contributor Author

erik-krogh commented Mar 30, 2022

Code-scanning was complaining about an ql/name-casing issue that was not introduced in this PR.

(I did introduce it, but that was in the def-node PR).

It's fixed now.

@erik-krogh erik-krogh requested a review from tausbn March 30, 2022 21:10
Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Let's get it in before it starts conflicting with other stuff. :shipit:

@erik-krogh erik-krogh merged commit 29a5bdb into github:main Apr 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-change-note-required This PR does not need a change note Python QL-for-QL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants