chore: improve commit finder accuracy #862
Open
+2,534
−166
This PR makes a number of improvements to the Commit Finder based on results from a much larger dataset than has been previously used for testing.
Changes
Prefixes
The base prefix case pattern has been made more accepting:
A new prefix case, "prefix_2", has been added that accepts a string of any alphabetic characters with no prefix separator.
Suffixes
Suffixes can now begin with numbers as well as alphabetic characters.
The logic for determining when suffix parts should be made optional has been improved. Requirements are:
The separator pattern between version parts now allows differing separators only when a version contains more than one separator, e.g. v1.2-3 vs. v1.2.3. A specific separator also becomes optional where a version part was split out of a single part, e.g. 1.3rc5 -> [1, 3, rc5]
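The splitting case can be sketched as follows. This is a minimal illustration, not the Commit Finder's code: the separator set, the helper names `split_version` and `build_pattern`, and the rule for detecting a split part are all assumptions. It covers only the optional separator before a split-off part, not the rule about versions with more than one separator.

```python
import re

SEPARATOR = r"[.\-_+]"  # assumed set of separators between version parts


def split_version(version: str) -> list[tuple[str, bool]]:
    """Return (part, was_split) pairs; was_split marks a part carved out of a larger one."""
    parts: list[tuple[str, bool]] = []
    for chunk in re.split(SEPARATOR, version):
        if not chunk:
            continue
        # A chunk such as "3rc5" is split where the leading digits give way to letters.
        match = re.fullmatch(r"(\d+)([a-zA-Z].*)", chunk)
        if match:
            parts.append((match.group(1), False))
            parts.append((match.group(2), True))
        else:
            parts.append((chunk, False))
    return parts


def build_pattern(version: str) -> re.Pattern[str]:
    """Join the parts with separators, making the separator optional before split parts."""
    pieces: list[str] = []
    for index, (part, was_split) in enumerate(split_version(version)):
        if index > 0:
            pieces.append(SEPARATOR + ("?" if was_split else ""))
        pieces.append(re.escape(part))
    return re.compile("".join(pieces))


pattern = build_pattern("1.3rc5")
print(bool(pattern.fullmatch("1.3rc5")), bool(pattern.fullmatch("1.3-rc5")))
```

With this sketch, a tag may write the version as either `1.3rc5` or `1.3-rc5`, while `13rc5` is still rejected because the separator between the original parts remains required.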
Preventing Misaligned Matches
The old prefix patterns were designed to help prevent versions from matching incorrectly within a tag, such as partway through a number or a version part, e.g. the tag 11.33 should not match the version 1.33, or 33. This is now handled by a negative lookbehind on the first version part instead, together with a new function that performs realignment when a version part has been marked as a prefix by mistake.
Pre-step Evaluation
As a pre-step to the full evaluation, a vastly simplified regex has been added that checks if the tag matches the version closely enough, using: <release_prefix>/<artifact_name>-, with only the version part being required.
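A pre-step of this kind might look like the sketch below, in which the release prefix and the artifact name are both optional and only the version is required. The function name `quick_match`, the anchoring, and the case-insensitive flag are assumptions for illustration; the actual simplified regex may differ.

```python
import re


def quick_match(tag: str, artifact_name: str, version: str) -> bool:
    """Cheap check that a tag looks like [<prefix>/][<artifact_name>-]<version>."""
    pattern = re.compile(
        r"^(?:[^/]+/)?"                                # optional release prefix and "/"
        + r"(?:" + re.escape(artifact_name) + r"-)?"   # optional artifact name and "-"
        + re.escape(version)                           # the version itself is required
        + r"$",
        re.IGNORECASE,
    )
    return pattern.match(tag) is not None


print(quick_match("release/my-lib-1.2.3", "my-lib", "1.2.3"))
print(quick_match("release/other-1.2.3", "my-lib", "1.2.3"))
```

Because the prefix portion accepts any characters other than `/`, a check shaped like this does not depend on the prefix being ASCII.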
Sorting Function
The compute similarity function has been extended to consider more information from the tag matches.
This includes preferring shorter prefixes and prefix separators; prefixes that are a superstring of the artifact name; and prefixes and suffixes that have release keywords in them, or use separators that match the version.
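One way to express preferences like these is a tuple sort key, as in the sketch below. The `TagMatch` fields, the keyword list, and above all the priority order of the criteria are assumptions for illustration; the description does not state how the criteria are weighted, and the separator-matches-version criterion is left out here.

```python
from dataclasses import dataclass

RELEASE_KEYWORDS = ("release", "final")  # hypothetical keyword list


@dataclass
class TagMatch:
    tag: str
    prefix: str
    prefix_sep: str
    suffix: str


def sort_key(match: TagMatch, artifact_name: str) -> tuple[bool, bool, int, int]:
    """Smaller keys sort first; False sorts before True."""
    prefix = match.prefix.lower()
    suffix = match.suffix.lower()
    has_name = artifact_name.lower() in prefix  # prefix is a superstring of the name
    has_keyword = any(word in prefix or word in suffix for word in RELEASE_KEYWORDS)
    return (not has_name, not has_keyword, len(match.prefix), len(match.prefix_sep))


matches = [
    TagMatch("other-1.2.3", "other", "-", ""),
    TagMatch("my-lib-1.2.3", "my-lib", "-", ""),
    TagMatch("1.2.3", "", "", ""),
]
ranked = sorted(matches, key=lambda m: sort_key(m, "my-lib"))
print([m.tag for m in ranked])
```

Under this assumed ordering, the tag whose prefix contains the artifact name ranks first, and the remaining tags are ordered by prefix length.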
Tests
More regression cases have been added for new tags of interest. Unusual Unicode tags that were previously rejected despite being valid Git tags are now accepted using the pre-step evaluation.
The hypothesis test for pattern creation and evaluation has been removed. Creating a slightly less restrictive pattern that is still mostly correct has become far more difficult as the Commit Finder has evolved, and is likely not worth the extra effort.
(More unit tests to come in a future PR.)