Python: Cache more predicates to improve performance. #7339

erik-krogh · 2021-12-08T19:45:56Z

For this PR I have chosen not to rebase and squash; the commit history reflects the iterations I went through in making this PR.
(And the commit messages are written accordingly.)

I recommend not reviewing each commit as much gets reverted along the way.
But the commits give a good idea about the process that I followed.

nightly/lgtm-full evaluation (~11% average speedup).
nightly/lgtm-full evaluation (single-threaded) (~15% average speedup).

The speedup gets larger the more queries are added in an evaluation, so the above represents a best-case performance gain.
Single-query performance will be slower if you start with an empty cache.

Here is the DB size increase.

Here is the process I followed:

1. Copy-paste the CachedStages.qll file from JavaScript into Python.
2. Do a single-threaded DCA run on main to figure out which queries are slow.
3. Run a slow and advanced query. Check the output log and look for sections starting with Results in:.
   These sections tell you which predicates are the cached predicates in a given stage.
   Use CachedStages.qll to group some of these stages in a way that makes sense.
4. Look at the evaluation and find some good pairs of slow query and slow source.
5. Get databases of the slow sources locally, run some slow queries with tuple counting, and fix the performance by doing these:
   5.1. Look at the second-to-last Clause timing report (the report for the last stage) of a slow query, and add caching to expensive predicates that are not query-specific (and add them to CachedStages.qll).
   5.2. Fix any bad join orders that are introduced.
   5.3. Look at Results in: again, and check if there are any cached predicates that should be grouped.
   5.4. If you've accidentally grouped too many stages into one big group, then revert whatever you did.
   5.5. Revert some caching if the DB size increases too much.
   5.6. Revert some imports if they caused ql/abstract-class-import alerts.
6. Do a single-threaded DCA run with your new code.
   (Start an experiment -> Use saved experiment -> From previous experiment is helpful.)
7. If you are happy with the new evaluation go to 8, else go to 4.
8. Do a validation evaluation using the default codeql-action options.
9. Open a PR.
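
The cached-stages pattern that the process above relies on can be sketched roughly as follows. This is a minimal illustration, not the contents of the actual CachedStages.qll: the module name `ExampleStage` and the predicate `exampleStartLine` are hypothetical, and only the `ref()`/`backref()` shape is meant to carry over.

```ql
import python

/** A hypothetical stage, used only to illustrate the grouping pattern. */
cached
module ExampleStage {
  /**
   * Always holds.
   * Called from every cached predicate that should be evaluated in this stage.
   */
  cached
  predicate ref() { 1 = 1 }

  /**
   * Not meant to be called.
   * Mentions every predicate that calls `ref()` above.
   */
  cached
  predicate backref() {
    1 = 1
    or
    exists(exampleStartLine(_))
  }
}

/** Gets the start line of `f`. A hypothetical stand-in for an expensive predicate. */
cached
int exampleStartLine(Function f) {
  ExampleStage::ref() and
  result = f.getLocation().getStartLine()
}
```

Because `backref()` mentions `exampleStartLine` and `exampleStartLine` calls `ref()`, the cached predicates depend on each other, so the evaluator has to compute them in the same stage. Adding another predicate to the group would mean calling `ref()` in its body and listing it in `backref()`.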

tausbn

Thanks for doing this!
All of the changes look sensible to me (modulo one comment). 👍

Now I guess I need to rejigger my intuitions for what happens in which stages...

tausbn · 2022-03-30T11:07:15Z

python/ql/lib/semmle/python/Function.qll

  /** Gets the function for this statement */
  Function getDefinedFunction() {
-    exists(FunctionExpr func | this.containsInScope(func) and result = func.getInnerScope())
+    result = f.getInnerScope() // XXX: This behaves very differently. But from inspecting the results of the previous version, that had every function in the same scope as the result.


Oh no. 😲
This looks more correct to me.

👍
I'll delete the comment.
Edit: It seems I had already deleted it in a later commit.

tausbn · 2022-03-30T11:10:33Z

python/ql/lib/semmle/python/pointsto/PointsTo.qll

   * Where var may be redefined in call to `foo` if `var` escapes (is global or non-local).
   */
-  pragma[noinline]
+  pragma[inline]


This one was surprising to me, but I assume the performance tests show that it's a net improvement.

It's been a while, so I don't really remember.
But I suppose it was an improvement.

tausbn · 2022-03-30T11:17:54Z

python/ql/lib/semmle/python/filters/GeneratedCode.qll

+private predicate isBeforeCode(Comment c, File f) {
+  f = c.getLocation().getFile() and
+  minStmtLine(f) < c.getLocation().getStartLine()
+}


I know that you're just following the naming scheme set out in the file already, but isn't this predicate actually expressing that the comment c appears after code?

I'm very confused. I'm sure I saw a commit that fixed this, but it appears to be unfixed. 🤔

You definitely saw that, I'm not sure why that disappeared in the rebase.
(I did the rebase to try to get QL-for-QL to stop complaining).

I reintroduced the fix, and also updated the expected output for the location test (that has an extra result due to the Django models getting imported).
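
With the rename from the later commit (`isBeforeCode` to `isCommentAfterCode`), the predicate would read roughly as below. The body is unchanged from the diff above. The `minStmtLine` helper is not shown in this thread, so its definition here is an assumption about what it computes (the first line of `f` that contains a statement).

```ql
import python

/** Assumed helper: gets the smallest start line of any statement in `f`. */
private int minStmtLine(File f) {
  result = min(Stmt s | s.getLocation().getFile() = f | s.getLocation().getStartLine())
}

/** Holds if comment `c` in file `f` starts on a later line than the first statement of `f`. */
private predicate isCommentAfterCode(Comment c, File f) {
  f = c.getLocation().getFile() and
  minStmtLine(f) < c.getLocation().getStartLine()
}
```

The new name matches what the comparison expresses: the comment's start line is greater than the first statement's line, so the comment comes after code.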

erik-krogh · 2022-03-30T21:06:13Z

Code-scanning was complaining about a ql/name-casing issue that was not introduced in this PR.

(I did introduce it, but that was in the def-node PR).

It's fixed now.

tausbn

Nice! Let's get it in before it starts conflicting with other stuff.

erik-krogh added the Python and no-change-note-required (This PR does not need a change note) labels Dec 8, 2021

erik-krogh force-pushed the pyPerf branch from 07804f0 to 2fab44f Compare December 10, 2021 14:21

github-actions bot added JS Ruby labels Dec 10, 2021

erik-krogh force-pushed the pyPerf branch 3 times, most recently from af54bcc to dbfe0c5 Compare December 15, 2021 14:02

erik-krogh removed JS Ruby labels Dec 17, 2021

erik-krogh force-pushed the pyPerf branch 4 times, most recently from ade85c4 to 8893268 Compare March 9, 2022 09:41

erik-krogh force-pushed the pyPerf branch 2 times, most recently from 080bdb0 to 2c4af29 Compare March 10, 2022 08:26

erik-krogh force-pushed the pyPerf branch from f2b0821 to 79de343 Compare March 22, 2022 11:46

erik-krogh changed the title ~~Python: Cache more predicates and improve performance.~~ Python: Cache more predicates to improve performance. Mar 24, 2022

erik-krogh marked this pull request as ready for review March 24, 2022 12:06

erik-krogh requested a review from a team as a code owner March 24, 2022 12:06

tausbn previously approved these changes Mar 30, 2022


erik-krogh dismissed tausbn’s stale review via 1ee4ba2 March 30, 2022 13:08

erik-krogh added 8 commits March 30, 2022 22:53

add the cached stages pattern to Python

71eacea

cached stages iteration 2

60b5af2

cached stages iteration 3

37a9b41

cached stages iteration 3.5

f68357a

cached stages iteration 4

a8f9a91

cached stages iteration 5

c9e3a62

rename the SSA stages to AST

0da80f9

revert caching of some large predicates that caused the DB size to increase too much

4089788

erik-krogh added 17 commits March 30, 2022 22:54

get around identical files by adding the ref() call somewhere else

6eca4ba

few join order fixes

758a5d7

fix bad mistake

3e9ee88

various join order fixes

79da097

cache the remainder of the pointsto layer

88e8969

joiner order fixes

35c7fa5

revert bad nomagic

7643aac

a bit more caching

79713e0

cache more basicblock predicates

040196f

make private predicates private

d9ced55

cache a bit more (again)

b74852f

revert changes in MRO.qll

b959705

import all the frameworks that extend RegexString

5caff81

nomagic on containsInScope

3b9335c

Revert "import all the frameworks that extend RegexString"

7e4ab4c

This reverts commit 84bc904. It caused ql/abstract-class-import alerts

revert the Taint stage, as it caused an alert for ql/abstract-class-import

7ca6426

remove TODO

1847a57

erik-krogh force-pushed the pyPerf branch from 1ee4ba2 to 1847a57 Compare March 30, 2022 20:54

fix ql/name-casing, and drive-by QL-for-QL typo fix

1218c4f

erik-krogh requested a review from a team as a code owner March 30, 2022 20:59

github-actions bot added the QL-for-QL label Mar 30, 2022

erik-krogh requested a review from tausbn March 30, 2022 21:10

erik-krogh added 2 commits April 1, 2022 12:55

rename isBeforeCode to isCommentAfterCode

ed7e120

update expected output for Locations.ql

eae2a6a

tausbn approved these changes Apr 1, 2022


erik-krogh merged commit 29a5bdb into github:main Apr 1, 2022
