Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++: Deduplicate dataflow query results #14151

Merged
merged 15 commits into from Sep 12, 2023

Conversation

MathiasVP
Copy link
Contributor

This PR fixes the problem of duplicated dataflow results we've seen since we switched to IR-based dataflow.

To see why we get such duplicated results, consider the following query:

import cpp
import semmle.code.cpp.dataflow.new.DataFlow

module C implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node node) { node.asExpr().(Call).getTarget().hasName("source") }

  predicate isSink(DataFlow::Node node) {
    node.asExpr() = any(Call call | call.getTarget().hasName("sink")).getAnArgument()
  }
}

module MyDataFlow = TaintTracking::Global<C>;

from DataFlow::Node source, DataFlow::Node sink
where MyDataFlow::flow(source, sink)
select sink, ""

When running on:

int source();
void sink(long);

void test() {
  int x = source();
  sink(x); // Important: There's an int-to-long conversion here
}

we get two identical results. This is because both the dataflow node for the unconverted x and the dataflow node for the converted x map to the x expression.

This PR fixes this issue by ensuring that only the fully converted expression has a result for asExpr(). Calling asExpr() on a dataflow node gives the unconverted expression, and calling asConvertedExpr() gives the converted expression.

Commit-by-commit strongly encouraged (You can see the effects of each commit in the updated .expected files):

  • The first commit introduces a new predicate that we'll use to convert an instruction to an expression.
  • The second predicate naively replaces the use of Instruction::getConvertedResultExpression() with this new predicate.
  • The next couple of commits fixes various predicates to implement the logic around "only the fully converted instruction should have a result for asExpr.
  • Finally, the last two commits fixes a bunch of queries to use asExpr instead of asConvertedExpr, and removes manual deduplication. These workarounds are no longer needed.

DCA shows significant result deduplication 🎉. There are a couple of reported lost results, but the only ones that are genuinely lost are the two cpp/very-likely-overrunning-write on vim. This is a bit unfortunate, but I can't see a good way of bringing them back with the current semantics of unique.

I'll create a follow-up issue for this last part. I don't think it should block the PR, though.

@MathiasVP MathiasVP requested a review from a team as a code owner September 6, 2023 12:03
@github-actions github-actions bot added the C++ label Sep 6, 2023
@MathiasVP MathiasVP added the depends on internal PR This PR should only be merged in sync with an internal Semmle PR label Sep 6, 2023
@MathiasVP
Copy link
Contributor Author

Note: This PR isn't green on CI because there are some internal tests that need to be updated. There's a backlink to the internal PR (for those with access) where the PR is green 🟢

Copy link
Contributor

@alexet alexet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good. There is a weird result for one of the tests that I don't understand. (I have used request changes because of this but if it is fine then I can just approve).

I know most of the n are on internal only predicates but it would be useful to have some doc so it is easier understand what they mean (even if it is just a non doc comment at the start of the predicate).

| test.cpp:30:27:30:31 | ... * ... | test.cpp:30:27:30:31 | ... * ... | test.cpp:30:27:30:31 | ... * ... | Potentially overflowing value from $@ is used in the size of this allocation. | test.cpp:30:27:30:31 | ... * ... | multiplication |
| test.cpp:31:27:31:31 | ... * ... | test.cpp:31:27:31:31 | ... * ... | test.cpp:31:27:31:31 | ... * ... | Potentially overflowing value from $@ is used in the size of this allocation. | test.cpp:31:27:31:31 | ... * ... | multiplication |
| test.cpp:37:46:37:49 | size | test.cpp:45:36:45:40 | ... * ... | test.cpp:37:46:37:49 | size | Potentially overflowing value from $@ is used in the size of this allocation. | test.cpp:45:36:45:40 | ... * ... | multiplication |
| test.cpp:30:18:30:32 | new[] | test.cpp:30:18:30:32 | new[] | test.cpp:30:18:30:32 | new[] | Potentially overflowing value from $@ is used in the size of this allocation. | test.cpp:30:18:30:32 | new[] | multiplication |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These results look a bit weird and aren't on the other side of the diff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted! This is due to the IR inserting an additional conversion on n in an expression like new int[n], and the generated ConvertInstruction doesn't map back to n. I've fixed this in 7d2c12e (and added a few more comments).

@MathiasVP MathiasVP merged commit d6e143a into github:main Sep 12, 2023
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ depends on internal PR This PR should only be merged in sync with an internal Semmle PR documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants