Indexing / Multilingual support / Add per field Analyzer #5911

Closed

@fxprunayre fxprunayre commented Aug 12, 2021

The main goals are:

  • to be able to define analyzers per language. Analyzers take care of tokenizing and filtering fields based on language characteristics.
  • to be able to configure, per portal, the strategy for searching in one or more languages

Search & indexing benefits

Index changes

The full text field any is now an object:

{
  "any": {
    "common": "", < contains non language specific things eg. UUID, resource identifier
    "langfre": "", < contains French
    "langeng": "" < contains English
    ...
  }
}
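
As a rough illustration of the structure above, here is a minimal Python sketch of how per-language text could be grouped into the `any` object. The function name `build_any_field` and its inputs are hypothetical and are not taken from the PR; only the `common` and `lang<code>` keys come from the description.

```python
def build_any_field(common_values, texts_by_language):
    """Group full text into any.common and any.lang<code> sub-fields.

    Hypothetical helper: it only mirrors the shape of the index document
    shown above, not the actual indexing code of this PR.
    """
    # Non language-specific values (UUID, resource identifier, ...)
    any_field = {"common": " ".join(common_values)}
    # One sub-field per 3-letter language code, e.g. "fre" -> "langfre"
    for lang_code, texts in texts_by_language.items():
        any_field["lang" + lang_code] = " ".join(texts)
    return {"any": any_field}


doc = build_any_field(
    ["uuid-123"],
    {"fre": ["Carte des sols"], "eng": ["Soil map"]},
)
```

Each `lang<code>` sub-field can then be mapped to its own analyzer in the index configuration.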

Stop words

E.g. in French, des is a stop word. Searching for it currently returns a lot of results, because the standard analyzer does not filter it out.

  • Search using standard analyzer (current default)

image

  • Same search after indexing French using a French analyzer (a record containing only des no longer matches, and using des in a search does not affect the results)

image

Ignore plurals

image

Ignore accents and elision

image
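
To make the effect of stop words, plurals, accents and elision concrete, here is a toy Python sketch of what a French analysis chain does to a text. It is not the Elasticsearch analyzer configured in this PR: the stop word list is a tiny illustrative subset, the plural handling is a naive "strip trailing s" rule, and all names (`analyze_french`, `fold_accents`, ...) are assumptions made for the example.

```python
import re
import unicodedata

# Tiny illustrative subset; a real French stop word list is much longer.
FRENCH_STOP_WORDS = {"de", "des", "du", "la", "le", "les", "et", "un", "une"}
# Elided articles/pronouns such as l', d', qu' at the start of a token.
ELISION = re.compile(r"^(?:l|d|j|m|t|s|n|c|qu)['’]", re.IGNORECASE)
# A token is a run of letters, optionally joined by apostrophes.
TOKEN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def fold_accents(token):
    """Remove diacritics: 'écosystème' -> 'ecosysteme'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def analyze_french(text):
    """Lowercase, strip elision, fold accents, drop stop words, strip plural 's'."""
    tokens = []
    for raw in TOKEN.findall(text.lower()):
        token = fold_accents(ELISION.sub("", raw))
        if token in FRENCH_STOP_WORDS:
            continue
        # Naive plural handling (assumption): real analyzers use a stemmer.
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        tokens.append(token)
    return tokens


tokens = analyze_french("Des cartes de l'écosystème")
```

With this sketch, a text containing only des produces no tokens at all, and carte and cartes are reduced to the same token, which is the behaviour the screenshots above illustrate.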

Indexing / Languages configured

Analyzers added for:

  • French
  • German
  • Italian
  • English

Add synonyms

eg. bathy = bathymétrie, sig = ids, ...

image

User interface

  • If records are in one language, the record language is used by default (and no options are offered to users)
  • If records are in various languages, the following options are available:

image

Languages are ordered by frequency in records. The proposed languages do not take templates into account.
Language detection is restricted to the list of languages found in records. The more languages you have, the lower the precision of language detection.

The catalogue administrator can define the default language strategy in the UI configuration:

  • queryBase: to adjust query
          // * Search in languages depending on the strategy selected
          'queryBase': 'any.${searchLang}:(${any}) any.common:(${any}) resourceTitleObject.${searchLang}:(${any})^2',
          // * Force UI language - in this case set languageStrategy to searchInUILanguage
          // and disable language options in searchOptions
          // 'queryBase': 'any.${uiLang}:(${any}) any.common:(${any}) resourceTitleObject.${uiLang}:(${any})^2',
          // * Search in French fields (with french analysis)
          // 'queryBase': 'any.langfre:(${any}) any.common:(${any}) resourceTitleObject.langfre:(${any})^2',
  • searchOptions.language: to display language options
  • languageStrategy

Language strategies:

  • searchInUILanguage: search in the UI language, e.g. the full text field is any.langfre if the UI is in French
  • searchInAllLanguages: search using any.* fields (no analysis is done, so more records are returned)
  • searchInDetectedLanguage: restrict the search to the language detected from the user's search text. If language detection fails, search in all languages.
  • forceALanguage: force the language selected by the user.
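
The interplay between queryBase and the language strategies can be sketched as follows in Python. This is only an illustration under assumptions: the helper names `resolve_search_lang` and `build_query` are hypothetical, the real substitution happens in the UI code, and using `*` as the value of `${searchLang}` for the "all languages" case is inferred from the "any.* fields" wording above.

```python
from string import Template

# queryBase as shown in the UI configuration above.
QUERY_BASE = (
    "any.${searchLang}:(${any}) any.common:(${any}) "
    "resourceTitleObject.${searchLang}:(${any})^2"
)


def resolve_search_lang(strategy, ui_lang, detected_lang=None, forced_lang=None):
    """Map a language strategy to the value substituted for ${searchLang}."""
    if strategy == "searchInUILanguage":
        return "lang" + ui_lang
    if strategy == "searchInDetectedLanguage" and detected_lang:
        return "lang" + detected_lang
    if strategy == "forceALanguage" and forced_lang:
        return "lang" + forced_lang
    # searchInAllLanguages, or detection failed: search in all languages.
    return "*"


def build_query(strategy, text, ui_lang, detected_lang=None, forced_lang=None):
    """Fill the queryBase template for the given strategy and search text."""
    search_lang = resolve_search_lang(strategy, ui_lang, detected_lang, forced_lang)
    return Template(QUERY_BASE).substitute({"searchLang": search_lang, "any": text})


query = build_query("searchInUILanguage", "eau", ui_lang="fre")
```

For a French UI, `query` targets any.langfre and resourceTitleObject.langfre; with searchInDetectedLanguage and no detected language, the same call falls back to the `*` wildcard fields.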

To improve language detection, a whitelist of languages is computed from the records. It can also be defined explicitly to restrict detection to some languages only:

          // Limit language detection to some languages only.
          // If empty, the list of languages in catalogue records is used
          // and if none found, mods.header.languages is used.
          'languageWhitelist': [],

This is useful when you harvest records in various languages that you don't want to display, or when a single record uses a language that is rarely used in the other records.
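
The fallback order described in the configuration comment can be sketched in Python as below. The function name `detection_languages` and its parameters are hypothetical; only the order (whitelist, then languages found in records, then mods.header.languages) comes from the comment above.

```python
def detection_languages(language_whitelist, record_languages, header_languages):
    """Return the languages that language detection is limited to.

    Order taken from the config comment: explicit whitelist first, then the
    languages found in catalogue records, then the header languages.
    """
    if language_whitelist:
        return list(language_whitelist)
    if record_languages:
        return list(record_languages)
    return list(header_languages)


# Empty whitelist: fall back to the languages found in records.
candidates = detection_languages([], ["fre", "eng"], ["eng", "fre", "ger"])
```

Keeping this list short matters because, as noted above, detection precision decreases as the number of candidate languages grows.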

Admin / Search option configuration

image

Other fixes

  • Search on title only was ignoring facets

Annex

Version 3 language configurations (for reference):

image

Björn Höfling and others added 19 commits August 13, 2021 14:54
* web/src/main/webResources/WEB-INF/data/config/index/records.json: Add fields langfre and anyfre with french analyzer.
records.json: Add properties lang... for different languages.
copy_to to these properties.
* Remove schema for french only.
* Add asciifolding in french for better support of accents
* Update doc
* Dispatch keywords in proper full text lang field.

@julsbreakdown julsbreakdown left a comment


Minor comments.
Thanks for this contribution!

<div class="col-md-2">
<span data-translate="">searchWithLang</span>
</div>
<div class="col-md-10">

in case of large strings :
Maybe add a margin move

<div class="col-md-2">
  <span data-translate="">searchWithLang</span>
</div>

=>

<div class="col-md-2">
  <span data-translate="">searchWithLang</span>
</div>
<div title="{{'searchAllLanguages-help' | translate}}"
     data-ng-show="optionsConfig.language"
     class="row">
  <div class="col-md-10">...</div>
</div>

@jahow

jahow commented Jan 17, 2022

@fxprunayre @josegar74 hi, any idea why this PR is still open? Could we merge this into main and propagate this improvement to customer projects? Thanks!

@fxprunayre
Member Author

any idea why this PR is still opened?

Probably because you told us that "no experimentation at all" is acceptable, and this is clearly an experimentation.

Also maybe because we heard some comments while working on this implementation, and some of us are probably not in line with the approach of this work?

And unfortunately, like many other PRs in GeoNetwork, a PR gets some care only:

  • when a "FOAF" merge strategy is adopted (without always being tested much)
  • when a company project suddenly needs some work which may have been forgotten for weeks/months/years
  • when a self-merge strategy finally takes responsibility for making things move forward
    ... it rarely gets care, (constructive) care in itself. Something we have always had difficulty improving.

Closing experimentation.

@fxprunayre fxprunayre closed this Jan 18, 2022
jahow pushed a commit to georchestra/geonetwork that referenced this pull request Feb 24, 2022
This commit is a squashed version of the #5911 PR
jahow added a commit to georchestra/geonetwork that referenced this pull request Sep 10, 2022
Handpicked some changes from geonetwork/core-geonetwork#5911
to use analyzers in main search.

This requires regenerating the UI config.
landryb pushed a commit to landryb/geonetwork that referenced this pull request Jun 2, 2023
Handpicked some changes from geonetwork/core-geonetwork#5911
to use analyzers in main search.

This requires regenerating the UI config.