Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shared: Switch to dot-separated access paths in summary specs #7878

Merged
merged 28 commits into from Feb 21, 2022

Conversation

@asgerf
Copy link
Contributor

@asgerf asgerf commented Feb 7, 2022

This changes the syntax of input/output specifications to the dot-separated A.B style instead of B of A.

It briefly retains support for both syntaxes so the tests pass at each commit, but support for the of-style is removed entirely at the end.

AccessPathSyntax.qll is shared between JS and the shared data-flow libraries. Since parsing is no longer entirely trivial, it made sense to share it. I also think we should also share other aspects of parsing (such as the n1..n2 argument ranges), but I'd like to separate that from the mega-PR.

One change from the previous parsing is that an empty access path now has zero tokens; previously, the empty path consisted of a single empty token. This mattered for how Annotated summaries were implemented in Java, where the empty input/output spec must be interpreted in a certain way. This was preserved by special-casing the empty access path in interpretInput and interpretOutput. This doesn't feel great, but it seems like this special case was already there, it's just more explicit now.

I've started DCA evaluations for all four affected languages (against Differences code-scanning.qls).

@asgerf asgerf force-pushed the dot-separated-access-paths branch 5 times, most recently from 067404d to 560394f Feb 9, 2022
@asgerf asgerf changed the title TEST ONLY (Switch to dot-separated access paths) Shared: Switch to dot-separated access paths in summary specs Feb 9, 2022
@asgerf
Copy link
Contributor Author

@asgerf asgerf commented Feb 9, 2022

The evaluations aren't done yet, but in the interest of time, it might be worth starting reviews before that.

I'd like to nominate some reviewers, but others are obviously welcome to join in:

  • @aschackmull to review AccessPathSyntax.qll, FlowSummaryImpl.qll and the Java-specific changes
  • @erik-krogh to review JS-specific changes
  • @hvitved to review the Ruby and C#-specific changes

@asgerf asgerf marked this pull request as ready for review Feb 9, 2022
@asgerf asgerf requested review from as code owners Feb 9, 2022
erik-krogh
erik-krogh previously approved these changes Feb 9, 2022
Copy link
Contributor

@erik-krogh erik-krogh left a comment

JS 👍 (But that's the easy review).

For the rest of you: I've added a backref for how the x of y -> y.x rewrite was made.

@asgerf
Copy link
Contributor Author

@asgerf asgerf commented Feb 10, 2022

Evaluations:

  • Java appears to show a failure and a slowdown, but both of these can be observed in nightly runs as well, so probably not related to this PR. I've manually inspected the tuple counts for the greatest slowdown and could not find anything amiss.
  • C# looks fine, albeit with a slight bias towards a slowdown. I investigated the tuple counts from the greatest slowdown, and again, could not find anything amiss.
  • Ruby looks fine, uneventful
  • JS also fine, uneventful

hvitved
hvitved previously requested changes Feb 10, 2022
Copy link
Contributor

@hvitved hvitved left a comment

LGTM, thanks for making this change Asger. Could I please ask you to also update getComponentStackCsv in FlowSummaryImpl.qll(and consequently the expected test output in ql/test/library-tests/dataflow/library and ql/test/library-tests/frameworks/EntityFramework)?

private string getRawToken(AccessPath path, int n) {
// Avoid splitting by '.' since tokens may contain dots, e.g. `Field[foo.Bar.x]`.
// Instead use regexpFind to match valid tokens, and supplement with a final length
// check to ensure all characters were included in a token.
Copy link
Contributor

@smowton smowton Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// check to ensure all characters were included in a token.
// check (in `AccessPath.hasSyntaxError`) to ensure all characters were included in a token.

}

/** Gets the `n`th-last token, with 0 being the last token. */
AccessPathToken getLastToken(int n) { result = getToken(getNumToken() - 1 - n) }
Copy link
Contributor

@smowton smowton Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting ql-for-ql warnings re: implicit this

string getArgumentList() { result = this.getPart(2) }

/** Gets the `n`th argument to this token, such as `x` or `y` from `Member[x,y]`. */
string getArgument(int n) { result = this.getArgumentList().splitAt(",", n) }
Copy link
Contributor

@smowton smowton Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be flexible on whitespace while we have the chance? Should Token[arg1, arg2] mean the same as Token[arg1,arg2]?

Copy link
Contributor Author

@asgerf asgerf Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence here.

On one hand I don't think the spaces actually improve readability, and I also wouldn't want to need an auto-formatter to resolve arguments about CSV rows should be formatted.

On the other hand, allowing spaces is probably the least surprising behaviour; could save someone a lot of time if they simply assumed spaces would work. However, this can also be handled by emitting a CSV validation error when an argument has trailing spaces.

Ultimately I'm leaning towards not allowing spaces, but I'm open to changing it if someone feels strongly about changing it.

Copy link
Contributor

@smowton smowton Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd lean towards allow because when someone is writing new rules it'll take them a wasted hour or so fiddling with other things before they think to run the CSV validator. If the validator was run early enough that e.g. you got a warning from VSCode that'd help, but I would normally only expect something like that to be run at CI time

Copy link
Contributor Author

@asgerf asgerf Feb 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added support for surrounding spaces

exists(ParameterPosition pos |
parseArg(token, pos) and result = SummaryComponent::argument(pos)
)
or
exists(ArgumentPosition pos |
parseParam(token, pos) and result = SummaryComponent::parameter(pos)
)
Copy link
Contributor

@smowton smowton Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These ParameterPosition / ArgumentPosition uses seem backwards

Copy link
Contributor Author

@asgerf asgerf Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's supposed to be backwards (see the diff). An Argument[n] token results in a synthesized parameter, and Parameter[n] results in a synthesized argument (to a callback).

idx = spec.getNumToken() - 1 and
stack = SummaryComponentStack::singleton(interpretComponent(spec.getLastToken(idx)))
Copy link
Contributor

@smowton smowton Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getNumToken - 1 and a use of getLastToken? Do these cancel out?

Copy link
Contributor Author

@asgerf asgerf Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to preserve the original indexing scheme, but yeah it ended up looking weird.

I've pushed a commit that replaces the whole thing with indexes that start at the beginning and use getToken instead of getLastToken. Could you take another look?

Copy link
Contributor

@smowton smowton Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good! AFAICT getLastToken is now dead code, and I'd like to see whitespace significance avoided wherever possible; otherwise lgtm.

Copy link
Contributor Author

@asgerf asgerf Feb 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed getLastToken

@smowton
Copy link
Contributor

@smowton smowton commented Feb 14, 2022

Also would be interested to hear arguments for a regexpFind (which might bind and find matches other than at the start of an access path specifier) plus a cross-check based on match lengths, vs. a stateful parse starting at the LHS, with a predicate argument tracking the offset where the previous access path element parse finished, and so where the next one should begin. I imagine the combination of greed and lookaheads in the regexp are supposed to produce exactly the same behaviour, but my regex-fu isn't sufficient to know that for sure.

@asgerf
Copy link
Contributor Author

@asgerf asgerf commented Feb 15, 2022

Also would be interested to hear arguments for a regexpFind [...]

The regexpFind call parses the whole string with a single string operation, without producing unneeded intermediate string values.

If you try to parse one token at a time, you can't use regexps efficiently, because you can't tell the engine where to start matching (it will start at the beginning and tell you where each match began, that's different). So to parse the n+1th token you'd have to strip off the first n tokens and parse from the remaining string; that's not as efficient. You could manually parse one character at a time, or rely on indexOf, but that seems quite fiddly to me.

Also, the regexpFind solution can be factored it into a predicate with a bindingset (which you can't do with a recursive predicate), though in the current formulation we don't actually need this.

@asgerf
Copy link
Contributor Author

@asgerf asgerf commented Feb 15, 2022

Could I please ask you to also update getComponentStackCsv in FlowSummaryImpl.qll(and consequently the expected test output in ql/test/library-tests/dataflow/library and ql/test/library-tests/frameworks/EntityFramework)?

Argh, I overlooked this request first time around. Should be fixed in caf5931

@asgerf asgerf force-pushed the dot-separated-access-paths branch from 558c35c to 7848fce Feb 21, 2022
@asgerf
Copy link
Contributor Author

@asgerf asgerf commented Feb 21, 2022

Force-pushed to resolve conflicts in the Ruby libraries. Only the commit Ruby: update CSV rows to dot-separated syntax was affected.

@smowton smowton dismissed hvitved’s stale review Feb 21, 2022

Comments addressed

@asgerf asgerf merged commit 02c4966 into github:main Feb 21, 2022
47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants