TextAnalysisConfiguration

Starting from version 1.4, Skosmos relies exclusively on the jena-text index for text searches (as long as JenaText is set as the SPARQL dialect). This means that the jena-text analyzer configuration can be adjusted to enable different kinds of matching strategies.

NOTE Using alternative analyzers is an experimental feature that had not been tested much at the time of the 1.4 release. Please try it and report your experiences in the skosmos-users group or as issues here on GitHub!

The analyzer is set in the Fuseki configuration file. Note that the analyzer must be set in three places, separately for each SKOS label property (prefLabel, altLabel and hiddenLabel). Always set the same analyzer for all three properties!

The default analyzer is LowerCaseKeywordAnalyzer and it is configured like this:

           text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
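
In context, these analyzer settings appear inside the text:map section of the Fuseki assembler file, once per label property. A minimal sketch (the field names and the skos: prefix are assumptions; adapt them to your own entity map):

           text:map (
             # the same analyzer is set for each of the three label properties
             [ text:field "prefLabel" ;
               text:predicate skos:prefLabel ;
               text:analyzer [ a text:LowerCaseKeywordAnalyzer ] ]
             [ text:field "altLabel" ;
               text:predicate skos:altLabel ;
               text:analyzer [ a text:LowerCaseKeywordAnalyzer ] ]
             [ text:field "hiddenLabel" ;
               text:predicate skos:hiddenLabel ;
               text:analyzer [ a text:LowerCaseKeywordAnalyzer ] ]
           )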

Note that you will need to rebuild the text index if you change the analyzer configuration.
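
With Fuseki, the index can be rebuilt offline using the jena.textindexer tool shipped in the Fuseki server jar, for example like this (the jar and assembler file names are assumptions; use the ones from your installation):

           # re-index all existing data according to the assembler description
           java -cp fuseki-server.jar jena.textindexer --desc=config.ttl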

Matching individual words

The default configuration of Skosmos treats each label as a single token and doesn't distinguish individual words within labels. This means that e.g. fra* doesn't match the label academic fraud (you need to use *fra* instead).
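
For illustration, jena-text searches are expressed in SPARQL using the standard text:query property function; Skosmos generates queries roughly like this sketch:

           PREFIX text: <http://jena.apache.org/text#>
           PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

           # with the default analyzer each label is indexed as a single
           # token, so fra* only matches labels that begin with "fra"
           SELECT ?concept WHERE {
             ?concept text:query (skos:prefLabel "fra*") .
           }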

This can be changed by setting the jena-text analyzer configuration to use SimpleAnalyzer, which splits labels into words at non-letter characters (whitespace, commas etc.) and then matches individual words.

           text:analyzer [ a text:SimpleAnalyzer ]

Another alternative is StandardAnalyzer, which does more intelligent tokenization, including removal of (English language) stop words and heuristics for acronyms, numbers, words with apostrophes etc.

           text:analyzer [ a text:StandardAnalyzer ]

Language-specific analyzers

Jena-text can intelligently choose an analyzer based on the language tag of labels using a MultilingualAnalyzer. Some of these language-specific analyzers perform stemming and/or use language-specific stop word lists. See the jena-text documentation for details on how to configure this.
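
Very roughly, this involves enabling multilingual support on the Lucene index and using a MultilingualAnalyzer with a default fallback, along these lines (a sketch based on the jena-text documentation; verify the exact form there):

           # on the text:TextIndexLucene resource:
           text:multilingualSupport true ;

           # and as the analyzer of each label property:
           text:analyzer [
             a text:MultilingualAnalyzer ;
             text:defaultAnalyzer [ a text:StandardAnalyzer ]
           ]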

Accent folding, i.e. matching regardless of diacritics

Searches can be made diacritic-insensitive (e.g. a search for deja vu will match déjà vu) by using a ConfigurableAnalyzer that is configured with an ASCIIFoldingFilter. This filter folds non-ASCII characters into their nearest ASCII equivalents; for example, éïèåäö becomes eieaao. The downside of this simple algorithm is that in many languages some diacritics are more significant than others: in Finnish, for example, "paatos" and "päätös" are completely different words, but with this algorithm a search for either one will also match the other. A more sophisticated, language-aware analyzer would be needed to avoid this kind of incorrect match.

This requires support for ConfigurableAnalyzer, which was added in Fuseki 1.3.1/2.3.1. Unfortunately, those versions of Fuseki contain a bug that affects searching, so you should use a recent (2015-12-17 or later) 1.4.0-SNAPSHOT or 2.4.0-SNAPSHOT version instead (see the download directory for Fuseki 1.4.0-SNAPSHOT and 2.4.0-SNAPSHOT).

Configuration:

           text:analyzer [
             a text:ConfigurableAnalyzer ;
             text:tokenizer text:KeywordTokenizer ;
             text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ] 

Accent folding plus individual words

This is the same as above, but using LetterTokenizer to split the label into individual words.

           text:analyzer [
             a text:ConfigurableAnalyzer ;
             text:tokenizer text:LetterTokenizer ;
             text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ]