ast.get_source_segment is slower than it needs to be because it reads every line of the source.
#103285
Comments
I won't consider this as a "bug", but it's something that we can definitely improve. Breaking on
I tried to replicate the exact behavior of the current
This is not trivial in pure string manipulation, so I used a regex. Core code is:

    import re

    # Matches one line including its terminator (\r\n, \n or \r), or the tail of the source.
    _line_pattern = re.compile(r"(.*?(?:\r\n|\n|\r|$))")

    def _splitlines_no_ff(source, maxlines=-1):
        lines = []
        for lineno, match in enumerate(_line_pattern.finditer(source), 1):
            # Stop once the requested number of lines has been collected.
            if maxlines > 0 and lineno > maxlines:
                break
            lines.append(match[0].replace("\r\n", "\n"))
        return lines

Pre-compiling the regex helped a bit in benchmarks; we can switch that to an inline compile if that's considered ugly. The regex + replace almost replicated the behavior exactly, except that it ends up with an empty string at the end when the source is not terminated by a newline:

    >>> re.findall(r"(.*?(?:\r\n|\r|\n|$))", "a\f\rb\n\nc\r\ndd")
    ['a\x0c\r', 'b\n', '\n', 'c\r\n', 'dd', '']

Benchmark code as below:

    import timeit
    short_setup = f"""
    import ast
    code = \"\"\"def fib(x):
        if x < 2:
            return 1
        return fib(x - 1) + fib(x - 2)
    \"\"\"
    module_node = ast.parse(code)
    function_node = module_node.body[0]
    """
    long_setup_start = f"""
    import ast
    with open("Lib/inspect.py") as f:
        code = f.read()
    module_node = ast.parse(code)
    function_node = module_node.body[2]
    """
    long_setup_end = f"""
    import ast
    with open("Lib/inspect.py") as f:
        code = f.read()
    module_node = ast.parse(code)
    function_node = module_node.body[-2]
    """
    test = """
    ast.get_source_segment(code, function_node)
    """
    print(f"short: {timeit.timeit(test, setup=short_setup, number=10000)}")
    print(f"long+start: {timeit.timeit(test, setup=long_setup_start, number=10)}")
    print(f"long+end: {timeit.timeit(test, setup=long_setup_end, number=10)}")

We tested on a very short source code and a long one (Lib/inspect.py).
The result of current implementation with no optimization:
The result of improved method using
As we can tell from the result, even on very short source code, the new implementation has a ~3x speedup, which is due to the elimination of the character-level loop. For the long source code, the speedup is even more obvious: ~4x-5x.
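To make the trailing empty string and the maxlines cut-off concrete, here is a small self-contained sketch built on the same pattern as the core code above. The name split_lines_limited is hypothetical, dropping the trailing empty match is my own assumption rather than part of the proposed change, and unlike the core code it leaves "\r\n" untouched. The last line contrasts it with str.splitlines, which also breaks on form feed.

```python
import re

# Same pattern as in the comment above: a lazy run of characters followed by
# "\r\n", "\n", "\r", or the end of the string.
LINE_PATTERN = re.compile(r"(.*?(?:\r\n|\n|\r|$))")


def split_lines_limited(source, maxlines=-1):
    """Hypothetical helper: split on \\r\\n, \\n and \\r only, keeping line ends."""
    lines = []
    for lineno, match in enumerate(LINE_PATTERN.finditer(source), 1):
        # Stop as soon as the requested number of lines has been collected.
        if maxlines > 0 and lineno > maxlines:
            break
        lines.append(match[0])
    # Assumption: drop the empty match that finditer yields at the end of input.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


source = "a\f\rb\n\nc\r\ndd"
print(split_lines_limited(source))              # form feed stays inside the first line
print(split_lines_limited(source, maxlines=2))  # only the first two lines are produced
print("a\f\rb".splitlines(keepends=True))       # str.splitlines also splits at "\f"
```

With maxlines set, the loop never looks at the rest of the source, which is where the saving on long files comes from.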
Overall, I believe this is a promising improvement to

Thanks again!
Bug report
There is a private function _splitlines_no_ff which is only ever called in ast.get_source_segment. This function splits the entire source given to it, but ast.get_source_segment only needs at most node.end_lineno lines to work.
cpython/Lib/ast.py
Lines 308 to 330 in 1acdfec
cpython/Lib/ast.py
Lines 344 to 378 in 1acdfec
If, for example, you want to extract an import line from a very long file, this can seriously degrade performance.
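As a rough illustration of that case (a sketch, not the code from this report): the snippet below builds a long synthetic module whose first statement is an import, then checks that truncating the source to node.end_lineno lines gives the same segment as the full source. The synthetic source and the use of str.splitlines for truncation are my assumptions; splitlines is only safe here because the source contains nothing but "\n" line endings.

```python
import ast

# Synthetic long module: one import followed by many unrelated assignments.
source = "import os\n" + "\n".join(f"x{i} = {i}" for i in range(10_000)) + "\n"
node = ast.parse(source).body[0]

# Extraction from the full source.
full = ast.get_source_segment(source, node)

# Only the first node.end_lineno lines matter for this node, so a source
# truncated to that many lines yields the same segment.
head = "".join(source.splitlines(keepends=True)[: node.end_lineno])
truncated = ast.get_source_segment(head, node)

print(node.end_lineno, repr(full), repr(truncated))
```

Everything after the first line of this source is split today and then thrown away.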
The introduction of a max_lines kwarg in _splitlines_no_ff, which functions like maxsplit in str.split, would minimize unneeded work. An implementation of the proposed fix is below (which makes my use case twice as fast):
Your environment
Linked PRs