-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement phone number analyzer #15915
base: main
Are you sure you want to change the base?
Implement phone number analyzer #15915
Conversation
74429fe
to
d844ea9
Compare
❌ Gradle check result for 74429fe: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for d844ea9: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
d844ea9
to
24e60a5
Compare
❌ Gradle check result for 24e60a5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
24e60a5
to
f7669e2
Compare
❕ Gradle check result for f7669e2: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15915 +/- ##
============================================
+ Coverage 71.91% 72.03% +0.12%
- Complexity 64491 64602 +111
============================================
Files 5289 5294 +5
Lines 301509 301585 +76
Branches 43557 43566 +9
============================================
+ Hits 216818 217260 +442
+ Misses 66892 66482 -410
- Partials 17799 17843 +44 ☔ View full report in Codecov by Sentry. |
this is a flaky test: #14304 and the failure of the "mend security check" also seems to be random (but i don't have the rights to re-trigger it) |
❌ Gradle check result for d76826c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
d76826c
to
f1dffe9
Compare
turns out that check is excluded for most existing plugins, so that's why it didn't fail there 😅: OpenSearch/gradle/missing-javadoc.gradle Line 109 in e8b02c9
as discussed on slack i've now added the new plugin to that list as well as the check is overzealous (details also in the commit-msg)
ah, i didn't spot that yet. moved it now. should now be ready for review! (the previous failure seems like a false reject; i just rebased now, hopefully it goes away by itself) |
sorry, missed your comment before :(
does this add any benefit over the java-based test? or should i even replace the java-based test with this one?
if everything (names of analyzers, index setting, etc.) is ok for you then i can do this (just wanted to avoid having to get it updated several times) |
❌ Gradle check result for f1dffe9: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
it gives a good example of how the analyzer could be used (+ very consistent with the test anaylsis-* suites)
Sure, this is just an issue (not a documentation update), so docs team knows that new features will be added to the release, there is no need to create a pull request now |
❌ Gradle check result for f1dffe9: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
flaky test: #16109
i'll update it next week
|
f1dffe9
to
f7e36c2
Compare
i've now added the test. note that the |
@rursprung thank you, could you please resolve the conflicts? |
...lysis-phonenumber/src/main/java/org/opensearch/analysis/phone/PhoneNumberAnalysisPlugin.java
Show resolved
Hide resolved
inspiration taken from [this SO answer][SO]. note that the stream is not parallelised to avoid the overhead of this as the method is intended to be called primarily with shorter strings where the time to set up would take longer than the actual check. [SO]: https://stackoverflow.com/a/35150400 Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
f7e36c2
to
a3ac6dc
Compare
done (let's hope it stays mergeable for longer this time!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @rursprung ! @msfroh anything left on your side?
❌ Gradle check result for a3ac6dc: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
this is largely based on elasticsearch-phone and internally uses
libphonenumber.
this intentionally only ports a subset of the features: only
phone
andphone-search
are supported right now,phone-email
can be addedif/when there's a clear need for it.
using
libphonenumber
is required since parsing phone numbers is anon-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list falsehoods programmers believe about phone
numbers.
this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (
ZZ
) had been usedwhich worked as long as international numbers were prefixed with
+
butdid not work when using local numbers (e.g. a number stored as
+4158...
was not matched against a number entered as004158...
or058...
).example configuration for an index:
this creates four analyzers:
phone
andphone-search
which do notexplicitly specify a region and thus fall back to
ZZ
(unknown region,regional version of international dialing prefix (e.g.
00
instead of+
in most of europe) will not be recognised) andphone-ch
andphone-search-ch
which will try to parse the phone number as a swissphone number (thus e.g.
00
as a prefix is recognised as theinternational dialing prefix).
note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a
String
in memory,making it unsuitable for large field values.
this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in
analysis-common
(see the discussion on the issue and thePR for further details).
note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.
closes #11326
Signed-off-by: Ralph Ursprung Ralph.Ursprung@avaloq.com
Related Issues
Resolves #11326
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.