chore: improve commit finder accuracy #862

Open
wants to merge 2 commits into
base: staging

@benmss benmss commented Sep 17, 2024

This PR makes a number of improvements to the Commit Finder based on results from a much larger dataset than has been previously used for testing.

Changes

Prefixes

The base prefix case pattern has been made more accepting:

  • Can start with any character instead of requiring an alphabetic character
  • Can end with one or two numbers

A new prefix case, "prefix_2", has been added that accepts a string of any alphabetic characters with no prefix separator.

Suffixes

Suffixes can now begin with numbers as well as alphabetic characters.

The logic for determining when suffix parts should be made optional has been improved. A part is now made optional when it is either:

  • Alphanumeric AND does not come before a purely numeric part
  • OR positioned after a change in separator
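
One possible reading of the rule above, as a minimal sketch. The function name and its inputs are hypothetical and are not taken from the Commit Finder. It assumes "alphanumeric" means "contains at least one non-digit character", and that a change in separator is detected by comparing against the first separator seen.

```python
def optional_suffix_parts(parts: list[str], separators: list[str]) -> list[bool]:
    """Return one flag per part; True means the part would be made optional.

    separators[i] is the separator found between parts[i] and parts[i + 1].
    """
    flags: list[bool] = []
    separator_changed = False
    for index, part in enumerate(parts):
        # Assumption: a change is any separator that differs from the first one.
        if index > 0 and separators[index - 1] != separators[0]:
            separator_changed = True
        # Assumption: "alphanumeric" means the part is not purely numeric.
        has_non_digit = not part.isdigit()
        numeric_part_follows = any(later.isdigit() for later in parts[index + 1:])
        flags.append((has_non_digit and not numeric_part_follows) or separator_changed)
    return flags
```

Under these assumptions, the trailing `rc1` in `1.2.rc1` is optional, while `beta` in `1.beta.2` is not, because a purely numeric part follows it.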

Separator pattern logic between version parts has been changed to only allow differences when a version contains more than one separator, e.g. v1.2-3 vs. v1.2.3. Specific separators also become optional where a version part was split out of a single part, e.g. 1.3rc5 -> [1, 3, rc5].
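
To illustrate the split mentioned above, here is a small sketch. `split_version` is a hypothetical helper and not the Commit Finder's own code. It assumes the separators are `.`, `-`, `_` and `+`, and that a part is split where a leading run of digits meets a letter.

```python
import re


def split_version(version: str) -> list[str]:
    """Split a version into parts on separators and on digit-to-letter boundaries."""
    parts: list[str] = []
    # Assumption: these four characters are the separators of interest.
    for chunk in re.split(r"[.\-_+]", version):
        if not chunk:
            continue
        # Split "3rc5" into "3" and "rc5"; no separator existed between them.
        match = re.fullmatch(r"(\d+)([A-Za-z].*)", chunk)
        if match:
            parts.extend(match.groups())
        else:
            parts.append(chunk)
    return parts
```

Because `3` and `rc5` came from a single part, a tag pattern built from these parts would treat the separator between them as optional, so that `1.3rc5` and `1.3-rc5` can both be considered.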

Preventing Misaligned Matches

The old prefix patterns were designed to help prevent versions from matching incorrectly within a tag, such as partway through numbers or version parts, e.g. 11.33 should not match 1.33 or 33. To address this, a negative lookbehind has been added for the first version part instead, along with a new function that performs realignment when a version part has been marked as a prefix by mistake.
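
The effect of a negative lookbehind on the first version part can be sketched as follows. This is an illustration only: `tag_matches_version` is a hypothetical name, and the real pattern is more involved. The sketch covers just the case of a version starting partway through a number, and does not attempt the realignment step.

```python
import re


def tag_matches_version(tag: str, version: str) -> bool:
    """Return True if the version occurs in the tag without a digit directly before it."""
    # (?<![0-9]) rejects a match whose first character follows another digit.
    pattern = r"(?<![0-9])" + re.escape(version)
    return re.search(pattern, tag) is not None
```

With this pattern, version `1.33` is found in `v1.33` but not in `11.33`, where the only occurrence starts in the middle of `11`.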

Pre-step Evaluation

As a pre-step to the full evaluation, a vastly simplified regex has been added that checks if the tag matches the version closely enough, using: <release_prefix>/<artifact_name>-, with only the version part being required.
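
A rough sketch of what such a pre-step check might look like. The exact pattern is not fully shown above, so the shape here is an assumption: an optional release prefix ending in `/`, an optional artifact name followed by `-`, and a required version. `prestep_match` is a hypothetical name.

```python
import re


def prestep_match(tag: str, version: str, artifact_name: str) -> bool:
    """Cheap check run before the full evaluation; only the version is required."""
    pattern = (
        r"(?:[^/]+/)?"  # Assumed form of an optional release prefix, e.g. "release/".
        + r"(?:" + re.escape(artifact_name) + r"-)?"  # Optional "<artifact_name>-".
        + re.escape(version)  # The version itself must be present.
    )
    return re.fullmatch(pattern, tag) is not None
```

For example, `release/mylib-1.2.3`, `mylib-1.2.3` and `1.2.3` would all pass for version `1.2.3` and artifact `mylib`, while `release/mylib-1.2.4` would not.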

Sorting Function

The compute similarity function has been extended to consider more information from the tag matches. This includes preferring shorter prefixes and prefix separators; prefixes that are a superstring of the artifact name; and prefixes and suffixes that contain release keywords or use separators that match the version.
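
The preferences above could be expressed as a sort key, sketched below. This swaps in a plain tuple key in place of the actual similarity computation. `TagMatch`, `RELEASE_KEYWORDS`, `sort_key` and the order of the criteria are all assumptions made for illustration, and the separator-matching preference is left out.

```python
from dataclasses import dataclass

# Hypothetical keyword list; the real keywords are not given here.
RELEASE_KEYWORDS = ("release", "rel")


@dataclass
class TagMatch:
    tag: str
    prefix: str = ""
    prefix_sep: str = ""
    suffix: str = ""


def sort_key(match: TagMatch, artifact_name: str) -> tuple[int, int, int, int]:
    """Smaller tuples sort first, so 0 marks a preferred property."""
    prefix = match.prefix.lower()
    suffix = match.suffix.lower()
    # A prefix that contains the artifact name is a superstring of it.
    contains_name = bool(prefix) and artifact_name.lower() in prefix
    has_keyword = any(k in prefix or k in suffix for k in RELEASE_KEYWORDS)
    return (
        0 if contains_name else 1,
        0 if has_keyword else 1,
        len(match.prefix),      # Shorter prefixes are preferred.
        len(match.prefix_sep),  # Shorter prefix separators are preferred.
    )
```

Candidates would then be ordered with `sorted(matches, key=lambda m: sort_key(m, artifact_name))`, with the best candidate first.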

Tests

More regression cases have been added for new tags of interest. Strange Unicode tags that were previously rejected despite being valid Git tags are now accepted using the pre-step evaluation.

The hypothesis test for pattern creation and evaluation has been removed. Creating a slightly less restrictive pattern that is still mostly correct has become far more difficult as the Commit Finder has evolved, and is likely not worth the extra effort.

(More unit tests to come in a future PR.)

@oracle-contributor-agreement bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) Sep 17, 2024
Signed-off-by: Ben Selwyn-Smith <benselwynsmith@googlemail.com>
@benmss benmss marked this pull request as ready for review September 18, 2024 01:01