Message 412752 - Python tracker


Author	jonathan-lp
Recipients	jonathan-lp, tim.peters
Date	2022-02-07.15:01:52
Message-id	<1644246112.45.0.216314907126.issue46667@roundup.psfhosted.org>
In-reply-to

Content
I still don't get how UNIQUESTRING is the longest even with autojunk=True, but that's an implementation detail and I'll trust you that it's working as expected.

Given this, I'd suggest the following:

* `autojunk=False` should be the default unless there's some reason to believe SequenceMatcher is mostly used for code comparisons.

* If, for whatever reason, the default can't be changed, I'd suggest a prominent "Warning" in the docs (at a minimum a "Note") saying something like "The default autojunk=True is not suitable for normal string comparison. See autojunk for more information".

* Human-friendly doc explanation for autojunk. The current explanation is only going to be helpful to the tiny fraction of users who understand the algorithm. Your explanation is a good start:
	"Autojunk was introduced as a way to greatly speed comparing files of code, viewing them as sequences of lines. But it more often backfires when comparing strings (viewed as sequences of characters)"

Put simply: The current docs aren't helpful to users who don't have text matching expertise, nor do they emphasise the huge caveat that autojunk=True raises.
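
To make the caveat concrete, here is a minimal sketch of how the default can backfire on plain strings. The inputs are synthetic and chosen by me for illustration (they are not the strings from this issue), and `similarity` is a hypothetical helper, not part of difflib. It relies on the documented heuristic: once the second sequence is at least 200 items long, items that make up more than about 1% of it are treated as "popular" and dropped from matching.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str, autojunk: bool) -> float:
    """Hypothetical helper: similarity ratio of two strings for a given autojunk setting."""
    return SequenceMatcher(None, a, b, autojunk=autojunk).ratio()


# Two 301-character strings that differ only in their first character.
# Both "a" and "b" occur 150 times in the second string, far above the
# roughly 1% threshold, so the heuristic treats them as popular.
first = "X" + "ab" * 150
second = "Y" + "ab" * 150

ratio_default = similarity(first, second, autojunk=True)
ratio_no_autojunk = similarity(first, second, autojunk=False)

print(f"autojunk=True:  {ratio_default:.3f}")
print(f"autojunk=False: {ratio_no_autojunk:.3f}")
```

With these particular inputs, the default setting reports no similarity at all, while `autojunk=False` reports a near-perfect match. Real text is messier than this, so the effect will vary, but the example shows why a user doing ordinary string comparison could be surprised.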
