Shared: Switch to dot-separated access paths in summary specs #7878

asgerf · 2022-02-07T16:10:40Z

This changes the syntax of input/output specifications to the dot-separated A.B style instead of B of A.

It briefly retains support for both syntaxes so the tests pass at each commit, but support for the of-style is removed entirely at the end.
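
For illustration, the mapping from the legacy form to the new form can be sketched in Python (not QL, so that it runs standalone). The helper name `of_to_dot` is hypothetical, and the sketch assumes that no token contains the substring ` of ` inside its brackets. It is not the rewrite that was actually used for this PR.

```python
def of_to_dot(spec):
    """Rewrite a legacy 'B of A' spec into the dot-separated 'A.B' form."""
    # 'C of B of A' lists the outermost step first, so the parts are reversed
    # before joining them with dots: 'A.B.C'.
    parts = spec.split(" of ")
    return ".".join(reversed(parts))


# Example tokens borrowed from the review comments in this thread.
print(of_to_dot("Field[foo.Bar.x] of Argument[0]"))
```

A spec with a single token, or an empty spec, comes back unchanged.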

AccessPathSyntax.qll is shared between JS and the shared data-flow libraries. Since parsing is no longer entirely trivial, it made sense to share it. I think we should also share other aspects of parsing (such as the n1..n2 argument ranges), but I'd like to separate that from the mega-PR.

One change from the previous parsing is that an empty access path now has zero tokens; previously, the empty path consisted of a single empty token. This mattered for how Annotated summaries were implemented in Java, where the empty input/output spec must be interpreted in a certain way. This was preserved by special-casing the empty access path in interpretInput and interpretOutput. This doesn't feel great, but it seems this special case was already there; it's just more explicit now.
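
The token-count difference can be seen in a small Python sketch. It assumes the earlier parser behaved like a plain separator split, which the description implies but does not state, and it uses a simplified token pattern chosen purely for illustration.

```python
import re

# Simplified, hypothetical token pattern: a word with an optional [..] suffix.
TOKEN = re.compile(r"\w+(?:\[[^\]]*\])?")


def tokens_by_split(path, separator="."):
    # Splitting an empty string still yields one (empty) token.
    return path.split(separator)


def tokens_by_match(path):
    # Collecting pattern matches yields no tokens for an empty string.
    return TOKEN.findall(path)


print(len(tokens_by_split("")))  # 1
print(len(tokens_by_match("")))  # 0
```

A caller that relied on seeing one empty token therefore has to treat the zero-token case separately, which is what the explicit special case in interpretInput and interpretOutput does.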

I've started DCA evaluations for all four affected languages (against Differences code-scanning.qls).

asgerf · 2022-02-09T14:53:47Z

The evaluations aren't done yet, but in the interest of time, it might be worth starting reviews before that.

I'd like to nominate some reviewers, but others are obviously welcome to join in:

@aschackmull to review AccessPathSyntax.qll, FlowSummaryImpl.qll and the Java-specific changes
@erik-krogh to review JS-specific changes
@hvitved to review the Ruby and C#-specific changes

erik-krogh

JS 👍 (But that's the easy review).

For the rest of you: I've added a backref for how the x of y -> y.x rewrite was made.

asgerf · 2022-02-10T09:17:14Z

Evaluations:

Java appears to show a failure and a slowdown, but both of these can be observed in nightly runs as well, so probably not related to this PR. I've manually inspected the tuple counts for the greatest slowdown and could not find anything amiss.
C# looks fine, albeit with a slight bias towards a slowdown. I investigated the tuple counts from the greatest slowdown, and again, could not find anything amiss.
Ruby looks fine, uneventful
JS also fine, uneventful

hvitved

LGTM, thanks for making this change, Asger. Could I please ask you to also update getComponentStackCsv in FlowSummaryImpl.qll (and consequently the expected test output in ql/test/library-tests/dataflow/library and ql/test/library-tests/frameworks/EntityFramework)?

javascript/ql/lib/semmle/javascript/frameworks/data/internal/AccessPathSyntax.qll

javascript/ql/lib/semmle/javascript/frameworks/data/internal/Shared.qll

smowton · 2022-02-14T14:41:05Z

java/ql/lib/semmle/code/java/dataflow/internal/AccessPathSyntax.qll

+private string getRawToken(AccessPath path, int n) {
+  // Avoid splitting by '.' since tokens may contain dots, e.g. `Field[foo.Bar.x]`.
+  // Instead use regexpFind to match valid tokens, and supplement with a final length
+  // check to ensure all characters were included in a token.


Suggested change

-  // check to ensure all characters were included in a token.
+  // check (in `AccessPath.hasSyntaxError`) to ensure all characters were included in a token.
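
The approach described in this code comment can be sketched in Python. The pattern below is an assumption written to fit the comment (a word, an optional bracketed argument list, then a dot or the end of the string), and the names `raw_tokens` and `has_syntax_error` are illustrative stand-ins, not the QL predicates themselves.

```python
import re

# Hypothetical token pattern: word, optional [..] arguments, and a lookahead
# requiring the token to be followed by '.' or the end of the string.
TOKEN = re.compile(r"\w+(?:\[[^\]]*\])?(?=\.|$)")


def raw_tokens(path):
    """Find all tokens without splitting on '.', so dots inside [..] survive."""
    return [match.group(0) for match in TOKEN.finditer(path)]


def has_syntax_error(path):
    """Length check: tokens plus separating dots must cover the whole path."""
    tokens = raw_tokens(path)
    if not tokens:
        return path != ""
    covered = sum(len(token) for token in tokens) + len(tokens) - 1
    return covered != len(path)


print(raw_tokens("Argument[0].Field[foo.Bar.x]"))
print(has_syntax_error("Argument[0"))
```

In this sketch the malformed path `Argument[0` still produces a match (the trailing `0`), which is why the length check is needed: the find step alone does not guarantee that every character belongs to a token.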

smowton · 2022-02-14T14:41:05Z

java/ql/lib/semmle/code/java/dataflow/internal/AccessPathSyntax.qll

+  }
+
+  /** Gets the `n`th-last token, with 0 being the last token. */
+  AccessPathToken getLastToken(int n) { result = getToken(getNumToken() - 1 - n) }


Noting ql-for-ql warnings re: implicit this

smowton · 2022-02-14T14:41:05Z

java/ql/lib/semmle/code/java/dataflow/internal/AccessPathSyntax.qll

+  string getArgumentList() { result = this.getPart(2) }
+
+  /** Gets the `n`th argument to this token, such as `x` or `y` from `Member[x,y]`. */
+  string getArgument(int n) { result = this.getArgumentList().splitAt(",", n) }


Should we be flexible on whitespace while we have the chance? Should Token[arg1, arg2] mean the same as Token[arg1,arg2]?

I'm on the fence here.

On one hand, I don't think the spaces actually improve readability, and I also wouldn't want to need an auto-formatter to resolve arguments about how CSV rows should be formatted.

On the other hand, allowing spaces is probably the least surprising behaviour; could save someone a lot of time if they simply assumed spaces would work. However, this can also be handled by emitting a CSV validation error when an argument has trailing spaces.

Ultimately I'm leaning towards not allowing spaces, but I'm open to changing it if someone feels strongly.

I'd lean towards allowing spaces, because someone writing new rules will likely waste an hour or so fiddling with other things before they think to run the CSV validator. If the validator ran early enough, e.g. as a warning in VSCode, that would help, but I would normally only expect something like that to run at CI time.

Added support for surrounding spaces
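
A minimal Python sketch of whitespace-tolerant argument parsing, assuming a token has the shape `Name[arg1,arg2]`. The regular expression and the function name `get_arguments` are illustrative and are not taken from AccessPathSyntax.qll.

```python
import re

# Hypothetical token shape: Name or Name[arg1,arg2,...].
TOKEN_PARTS = re.compile(r"^(\w+)(?:\[(.*)\])?$")


def get_arguments(token):
    """Return the arguments of a token, ignoring spaces around each one."""
    match = TOKEN_PARTS.match(token)
    if match is None or match.group(2) is None:
        return []
    # Split on commas, then strip the whitespace surrounding each argument.
    return [arg.strip() for arg in match.group(2).split(",")]


print(get_arguments("Member[x, y]") == get_arguments("Member[x,y]"))  # True
```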

smowton · 2022-02-14T14:41:05Z

java/ql/lib/semmle/code/java/dataflow/internal/FlowSummaryImpl.qll

+      exists(ParameterPosition pos |
+        parseArg(token, pos) and result = SummaryComponent::argument(pos)
      )
+      or
+      exists(ArgumentPosition pos |
+        parseParam(token, pos) and result = SummaryComponent::parameter(pos)
+      )


These ParameterPosition / ArgumentPosition uses seem backwards

It's supposed to be backwards (see the diff). An Argument[n] token results in a synthesized parameter, and Parameter[n] results in a synthesized argument (to a callback).

smowton · 2022-02-14T14:41:05Z

java/ql/lib/semmle/code/java/dataflow/internal/FlowSummaryImpl.qll

+      idx = spec.getNumToken() - 1 and
+      stack = SummaryComponentStack::singleton(interpretComponent(spec.getLastToken(idx)))


getNumToken - 1 and a use of getLastToken? Do these cancel out?

I tried to preserve the original indexing scheme, but yeah it ended up looking weird.

I've pushed a commit that replaces the whole thing with indexes that start at the beginning and use getToken instead of getLastToken. Could you take another look?

Thanks, looks good! AFAICT getLastToken is now dead code, and I'd like to see whitespace significance avoided wherever possible; otherwise lgtm.

Removed getLastToken
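
The cancellation raised in this thread is plain index arithmetic: since `getLastToken(n)` is `getToken(getNumToken() - 1 - n)`, passing `getNumToken() - 1` yields `getToken(0)`. A Python sketch with stand-in functions (the snake_case names and the sample tokens are illustrative):

```python
def get_token(tokens, n):
    return tokens[n]


def get_num_token(tokens):
    return len(tokens)


def get_last_token(tokens, n):
    # Mirrors the quoted definition: 0 is the last token.
    return get_token(tokens, get_num_token(tokens) - 1 - n)


tokens = ["Argument[0]", "Field[foo.Bar.x]"]
idx = get_num_token(tokens) - 1

# The two "count from the end" steps cancel out and select the first token.
print(get_last_token(tokens, idx) == get_token(tokens, 0))  # True
```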

smowton · 2022-02-14T14:44:17Z

Also would be interested to hear arguments for a regexpFind (which might bind and find matches other than at the start of an access path specifier) plus a cross-check based on match lengths, vs. a stateful parse starting at the LHS, with a predicate argument tracking the offset where the previous access path element parse finished, and so where the next one should begin. I imagine the combination of greed and lookaheads in the regexp is supposed to produce exactly the same behaviour, but my regex-fu isn't sufficient to know that for sure.

asgerf · 2022-02-15T10:34:04Z

Also would be interested to hear arguments for a regexpFind [...]

The regexpFind call parses the whole string with a single string operation, without producing unneeded intermediate string values.

If you try to parse one token at a time, you can't use regexps efficiently, because you can't tell the engine where to start matching (it will start at the beginning and tell you where each match began, that's different). So to parse the n+1th token you'd have to strip off the first n tokens and parse from the remaining string; that's not as efficient. You could manually parse one character at a time, or rely on indexOf, but that seems quite fiddly to me.

Also, the regexpFind solution can be factored into a predicate with a bindingset (which you can't do with a recursive predicate), though in the current formulation we don't actually need this.
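
For comparison, the stateful left-to-right parse proposed earlier in the thread can be sketched in Python. Python's `Pattern.match` accepts a start offset, which is exactly the capability described above as missing from the QL string operations, so this sketch only illustrates which paths such a parser accepts and says nothing about cost in QL. The pattern and the name `parse_stateful` are assumptions.

```python
import re

# Hypothetical token pattern: a word with an optional [..] argument list.
TOKEN = re.compile(r"\w+(?:\[[^\]]*\])?")


def parse_stateful(path):
    """Parse left to right, tracking where the previous token ended.

    Returns the list of tokens, or None if the path is malformed.
    """
    tokens = []
    pos = 0
    while pos < len(path):
        match = TOKEN.match(path, pos)  # anchored at the current offset
        if match is None:
            return None
        tokens.append(match.group(0))
        pos = match.end()
        if pos == len(path):
            break
        if path[pos] != ".":
            return None  # a token must be followed by '.' or the end
        pos += 1
        if pos == len(path):
            return None  # trailing dot with no token after it
    return tokens


print(parse_stateful("Argument[0].Field[foo.Bar.x]"))
```

Because every match is anchored at the tracked offset, this version needs no separate length check, at the price of one parsing step per token.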

asgerf · 2022-02-15T11:15:52Z

Could I please ask you to also update getComponentStackCsv in FlowSummaryImpl.qll (and consequently the expected test output in ql/test/library-tests/dataflow/library and ql/test/library-tests/frameworks/EntityFramework)?

Argh, I overlooked this request the first time around. Should be fixed in caf5931.

asgerf · 2022-02-21T07:25:23Z

Force-pushed to resolve conflicts in the Ruby libraries. Only the commit "Ruby: update CSV rows to dot-separated syntax" was affected.

Comments addressed

github-actions bot added C# Java JS Ruby labels Feb 7, 2022

asgerf force-pushed the dot-separated-access-paths branch 5 times, most recently from 067404d to 560394f Feb 9, 2022

asgerf added the no-change-note-required label Feb 9, 2022

asgerf changed the title ~~TEST ONLY (Switch to dot-separated access paths)~~ Shared: Switch to dot-separated access paths in summary specs Feb 9, 2022

asgerf marked this pull request as ready for review Feb 9, 2022

asgerf requested review from as code owners Feb 9, 2022

erik-krogh previously approved these changes Feb 9, 2022

hvitved previously requested changes Feb 10, 2022

asgerf mentioned this pull request Feb 10, 2022

Go: sync FlowSummaryImpl.qll github/codeql-go#690

Merged

asgerf dismissed erik-krogh’s stale review via f01781d Feb 10, 2022

smowton reviewed Feb 14, 2022

nickrolfe mentioned this pull request Feb 14, 2022

Ruby: split standard library models into multiple files #7886

Merged

asgerf added 3 commits Feb 21, 2022

JS: Fix accidental recursion

e2cbf47

JS: Factor out AccessPathSyntax.qll

7c2cff3

JS: Move ".."-parsing trick into AccessPathSyntax.qll

3025468

asgerf added 23 commits Feb 21, 2022

Java: update model generator

7f80871

Java: update CSV rows to dot-separated syntax

a121b73

Java: remove support for legacy syntax

affdbe9

C#: use AccessPathSyntax.qll to parse input/output summary specs

dffa1d1

C#: update CSV rows to dot-separated syntax

6bb15dc

C#: remove support for legacy syntax

0af9e8a

Ruby: use AccessPathSyntax.qll to parse input/output summary specs

6dbeb81

Ruby: manually rewrite DigSummary access path

7005d53

Ruby: update CSV rows to dot-separated syntax

e3605ee

Ruby: remove support for legacy syntax

57bf0b1

Revert "JS: Add support for " of " syntax to help during transition"

c189df2

This reverts commit 9bf522b.

Shared: fix qldoc and move getRawToken to top-level

be63cf7

Shared: sync AccessPathSyntax.qll and FlowSummaryImpl.qll

2907d53

Shared: update comment in AccessPathSyntax.qll

dc6a132

Shared: add explicit this

c4304a9

Shared: use getToken instead of getLastToken

d911e0a

Shared: sync AccessPathSyntax.qll and FlowSummaryImpl.qll

7fcbdbe

Shared: auto format

dcc523a

Shared: update getSummaryCsv and related test output

4985fbb

Shared: auto format

55ac5cb

Shared: allow spaces between arguments in a token

2c2a82a

Shared: Remove getLastToken again

d7f0716

Shared: sync AccessPathSyntax.qll

7848fce

asgerf force-pushed the dot-separated-access-paths branch from 558c35c to 7848fce Feb 21, 2022

smowton approved these changes Feb 21, 2022

erik-krogh approved these changes Feb 21, 2022

asgerf merged commit 02c4966 into github:main Feb 21, 2022
47 checks passed

asgerf mentioned this pull request Feb 21, 2022

Go: Switch to dot-separated access paths in summary specs github/codeql-go#696

Merged
