Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement phone number analyzer #15915

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rursprung
Copy link
Contributor

@rursprung rursprung commented Sep 12, 2024

Description

this is largely based on elasticsearch-phone and internally uses
libphonenumber.
this intentionally only ports a subset of the features: only phone and
phone-search are supported right now, phone-email can be added
if/when there's a clear need for it.

using libphonenumber is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list falsehoods programmers believe about phone
numbers
.

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (ZZ) had been used
which worked as long as international numbers were prefixed with + but
did not work when using local numbers (e.g. a number stored as
+4158... was not matched against a number entered as 004158... or
058...).

example configuration for an index:

{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}

this creates four analyzers: phone and phone-search which do not
explicitly specify a region and thus fall back to ZZ (unknown region,
regional version of international dialing prefix (e.g. 00 instead of
+ in most of europe) will not be recognised) and phone-ch and
phone-search-ch which will try to parse the phone number as a swiss
phone number (thus e.g. 00 as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a String in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in analysis-common (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes #11326

Signed-off-by: Ralph Ursprung Ralph.Ursprung@avaloq.com

Related Issues

Resolves #11326

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Relevance labels Sep 12, 2024
@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from 74429fe to d844ea9 Compare September 12, 2024 16:56
Copy link
Contributor

❌ Gradle check result for 74429fe: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for d844ea9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from d844ea9 to 24e60a5 Compare September 13, 2024 13:06
Copy link
Contributor

❌ Gradle check result for 24e60a5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from 24e60a5 to f7669e2 Compare September 13, 2024 13:30
Copy link
Contributor

❕ Gradle check result for f7669e2: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 97.36842% with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.03%. Comparing base (4e3a6d0) to head (f7e36c2).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...earch/analysis/phone/PhoneNumberTermTokenizer.java 97.87% 0 Missing and 1 partial ⚠️
...nalysis/phone/PhoneNumberTermTokenizerFactory.java 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15915      +/-   ##
============================================
+ Coverage     71.91%   72.03%   +0.12%     
- Complexity    64491    64602     +111     
============================================
  Files          5289     5294       +5     
  Lines        301509   301585      +76     
  Branches      43557    43566       +9     
============================================
+ Hits         216818   217260     +442     
+ Misses        66892    66482     -410     
- Partials      17799    17843      +44     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rursprung
Copy link
Contributor Author

rursprung commented Sep 13, 2024

testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

❕ Gradle check result for f7669e2: UNSTABLE

* **TEST FAILURES:**
      1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

this is a flaky test: #14304

and the failure of the "mend security check" also seems to be random (but i don't have the rights to re-trigger it)

Copy link
Contributor

❌ Gradle check result for d76826c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from d76826c to f1dffe9 Compare September 27, 2024 15:43
@rursprung
Copy link
Contributor Author

namely the javadoc one; it seems to ignore some things in common but not here

turns out that check is excluded for most existing plugins, so that's why it didn't fail there 😅:

project(":modules:analysis-common"),

as discussed on slack i've now added the new plugin to that list as well as the check is overzealous (details also in the commit-msg)

Thanks @rursprung , the plugin should be under plugins folder, the modules are plugins that are installed by default, As far as I remember, we concluded that this plugin is optional (opt-in).

ah, i didn't spot that yet. moved it now.

should now be ready for review! (the previous failure seems like a false reject; i just rebased now, hopefully it goes away by itself)

@rursprung
Copy link
Contributor Author

sorry, missed your comment before :(

  • could you please add YAML tests for analyzer itself (fe analysis-icu/src/yamlRestTest/resources/rest-api-spec/test/analysis_icu/20_search.yml)

does this add any benefit over the java-based test? or should i even replace the java-based test with this one?

if everything (names of analyzers, index setting, etc.) is ok for you then i can do this (just wanted to avoid having to get it updated several times)

Copy link
Contributor

❌ Gradle check result for f1dffe9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@reta
Copy link
Collaborator

reta commented Sep 27, 2024

does this add any benefit over the java-based test? or should i even replace the java-based test with this one?

it gives a good example of how the analyzer could be used (+ very consistent with the test anaylsis-* suites)

if everything (names of analyzers, index setting, etc.) is ok for you then i can do this (just wanted to avoid having to get it updated several times)

Sure, this is just an issue (not a documentation update), so docs team knows that new features will be added to the release, there is no need to create a pull request now

Copy link
Contributor

❌ Gradle check result for f1dffe9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rursprung
Copy link
Contributor Author

❌ Gradle check result for f1dffe9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

flaky test: #16109

does this add any benefit over the java-based test? or should i even replace the java-based test with this one?

it gives a good example of how the analyzer could be used (+ very consistent with the test anaylsis-* suites)

i'll update it next week

if everything (names of analyzers, index setting, etc.) is ok for you then i can do this (just wanted to avoid having to get it updated several times)

Sure, this is just an issue (not a documentation update), so docs team knows that new features will be added to the release, there is no need to create a pull request now

done: opensearch-project/documentation-website#8389

@rursprung
Copy link
Contributor Author

does this add any benefit over the java-based test? or should i even replace the java-based test with this one?

it gives a good example of how the analyzer could be used (+ very consistent with the test anaylsis-* suites)

i'll update it next week

i've now added the test.

note that the analysis-icu example is probably not particularly good since it's testing with a french text, thus not testing the ICU part at all 😅

Copy link
Contributor

github-actions bot commented Oct 3, 2024

✅ Gradle check result for f7e36c2: SUCCESS

@reta
Copy link
Collaborator

reta commented Oct 3, 2024

@rursprung thank you, could you please resolve the conflicts?

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this
as the method is intended to be called primarily with shorter strings
where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
[libphonenumber]: https://github.com/google/libphonenumber
[falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from f7e36c2 to a3ac6dc Compare October 3, 2024 15:09
@rursprung
Copy link
Contributor Author

@rursprung thank you, could you please resolve the conflicts?

done (let's hope it stays mergeable for longer this time!)

@rursprung rursprung requested a review from reta October 3, 2024 15:11
Copy link
Collaborator

@reta reta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @rursprung ! @msfroh anything left on your side?

Copy link
Contributor

github-actions bot commented Oct 3, 2024

❌ Gradle check result for a3ac6dc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Oct 3, 2024

✅ Gradle check result for a3ac6dc: SUCCESS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch enhancement Enhancement or improvement to existing feature or request Search:Relevance v2.18.0 Issues and PRs related to version 2.18.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] new plugin with normalizer & analyzer for phone numbers
4 participants