Going Dutch: stemming in Apache Solr

Windmills, what else...
So, stemming: what is it? Generally speaking, stemming is finding the basic form of a word. For example, in the sentence "he walks" the verb is inflected by adding an "s" to it. In this case the stem is "walk", which in English also happens to be the infinitive of the verb. We will first present a few examples of stemming in natural language, and since Dutch is my native language I will concentrate on Dutch examples. After that we will show the results of a number of stemmers present in Solr and give a few pointers about what to do if the results of these stemmers are not good enough for your application.

Plurals

One of the things you absolutely want your users to be able to do is find results that contain the singular form of a word while searching for the plural and vice versa, e.g. finding "cat" when looking for "cats" and finding "cats" when searching for "cat". Although English has well-defined rules for creating the plural form (add "s" or "es", or change "y" to "ie" and add "s"), there are also a number of irregular nouns ("woman" -> "women") and nouns for which the singular and plural forms are the same ("sheep", "fish"). In Dutch more or less the same situation exists, albeit with different suffixes ("s", "'s", "en") and, of course, other exceptions. Furthermore, in Dutch, if the stem ends in a consonant directly preceded by a short vowel, this consonant is doubled (otherwise the vowel in the plural form would sound long instead of short), e.g.:

kat (cat) -> katten (cats)

But there are exceptions to this rule as well, like

monnik (monk) -> monniken (monks)

in contrast with:

krik (car jack) -> krikken (car jacks)
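To make the doubling rule concrete, here is a minimal Python sketch of a naive singularizer. The name `naive_singular` and the vowel set are illustrative assumptions, not anything Solr provides; the function handles only the "en" suffix and deliberately ignores all exceptions.

```python
VOWELS = set("aeiou")


def naive_singular(plural: str) -> str:
    """Strip the plural suffix "en" and collapse a doubled final consonant.

    Hypothetical helper for illustration only; real Dutch needs many more rules.
    """
    if not plural.endswith("en"):
        return plural
    stem = plural[:-2]
    # "katt" -> "kat": the consonant was only doubled to keep the vowel short.
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        stem = stem[:-1]
    return stem


if __name__ == "__main__":
    for word in ("katten", "krikken", "monniken", "huizen"):
        print(word, "->", naive_singular(word))
```

The first three words come out as "kat", "krik" and "monnik". For "huizen" the sketch returns "huiz" instead of "huis", because it knows nothing about the z/s alternation; every such alternation needs a rule of its own.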

Verb conjugation

Conjugation of verbs in Dutch is, to be blunt, a bit of a mess. For forming the past tense of a verb, two types of conjugation coexist: the (pre-)medieval system, now called strong, and the more recent system, called weak. When I say the systems coexist, one should note that most (native) Dutch speakers are not aware that the strong system is a system at all: they consider the strong verbs to be exceptions, best learned by heart. An example of a strong verb is "lopen" (to walk):

hij loopt (he walks) -> hij liep (he walked)

while an example of a weak verb is "rennen" (to run):

hij rent (he runs) -> hij rende (he ran)

These examples make clear that determining which verb is strong and which is weak is indeed a matter of learning by heart. Furthermore, the change from strong to weak verbs is an ongoing process, and is (and always has been) also influenced by immigrants who import parts of their native language system into Dutch. One example of a verb currently in transition from strong to weak is "graven" (to dig), for which both "hij groef" (he dug) and "hij graafde" can be found, although most language purists would consider the latter form "wrong".

NB: if you are interested in this kind of thing, a classic book about language change is Jean Aitchison's Language Change: Progress or Decay? (1981; yes, it is a bit pre-internet...)
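For a stemmer the practical consequence is that weak past tenses can be handled with a suffix rule, while strong ones need a lookup table. A minimal Python sketch follows; the names `STRONG_PAST` and `past_to_stem` are hypothetical, and the table holds only the verbs mentioned above.

```python
# Strong past-tense forms cannot be derived by rule, so they are listed explicitly.
# Hypothetical table containing only the examples from the text.
STRONG_PAST = {"liep": "loop", "groef": "graaf"}


def past_to_stem(form: str) -> str:
    """Map a singular past-tense form to its stem (illustrative only)."""
    if form in STRONG_PAST:
        return STRONG_PAST[form]
    # Weak verbs form the singular past tense with "de" or "te" after the stem.
    for suffix in ("de", "te"):
        if form.endswith(suffix):
            return form[: -len(suffix)]
    return form


if __name__ == "__main__":
    for form in ("liep", "rende", "groef", "graafde"):
        print(form, "->", past_to_stem(form))
```

Note that the suffix rule is far too eager for real text, since plenty of Dutch words that are not verbs end in "de" or "te"; the point is only that the strong forms have to come from a list.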

Diminutives

In a number of languages, like Dutch, German, Polish and many more, diminutives are created by inflecting the word. In English you form the diminutive by adding an adjective like 'little', but in Dutch the general rule is to add the suffix "je" to the word, e.g.:

huis (house) -> huisje (little house)

This is only the general rule, because the suffix itself can also be inflected, as in

bloem (flower) -> bloempje (little flower)

And in some words the final consonant is changed to keep the word pronounceable:

hemd (shirt) -> hempje (little shirt)

It is, however, also possible in Dutch to use an adjective like 'klein' (little), and even to combine both:

kleine bloem (little flower) -> klein bloempje (small little flower)

A last peculiarity I should mention is that in Dutch (as in many other languages) there are words that only have a diminutive form, like 'meisje' (girl).
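The following Python sketch shows what happens when these diminutive suffixes are stripped by a simple longest-suffix-first rule. The names `DIMINUTIVE_SUFFIXES` and `naive_undiminish` are hypothetical, and only the two suffixes from the examples above are covered.

```python
# Longest suffix first, so that "pje" wins over "je". Illustrative rule set only.
DIMINUTIVE_SUFFIXES = ("pje", "je")


def naive_undiminish(word: str) -> str:
    """Strip a diminutive suffix without looking at the rest of the word."""
    for suffix in DIMINUTIVE_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for word in ("huisje", "bloempje", "hempje", "meisje"):
        print(word, "->", naive_undiminish(word))
```

The rule gets "huisje" and "bloempje" right ("huis", "bloem"), but it turns "hempje" into "hem" instead of "hemd" and reduces "meisje" to "meis". A purely suffix-based approach cannot tell these cases apart.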

Homographs and homonyms

For some words it is not possible to find the correct stem without knowing the semantics or context, e.g. "kantelen", which means "battlements" when pronounced kantélen but "tip over" when pronounced kántelen. Or zij kust ("she kisses") versus kust as in de Noordzeekust ("the North Sea coast").

Why bother?

So maybe by now you are asking yourself: why bother? Well, you should bother because stemming will make it easier for the visitors of your site to find what they are looking for. For example, you can be almost sure that when a visitor is interested in articles about houses (in Dutch 'huizen'), he will also be interested in articles which mention a house ('huis'). So when using the search term 'huizen' it would be nice if results which contain 'huis' were shown automatically. Of course searching for a verb is much less common, and the chance that a visitor will use the diminutive is also not very great, but still it happens, and if it takes only a minimal effort to make sure the visitor finds what he is searching for, then why not?

Solr

Starting from Solr version 3.1, English (like a number of other languages) has a standard filter, "EnglishMinimalStemFilterFactory", that can stem English words. For Dutch, however, no such simple filter factory is available. There are, however, a number of languages that can be used with the SnowballPorterFilterFactory, and the default schema included in Solr 5 predefines field types for several of them.

Solr 5 default schema

In Solr 5 the default schema defines a list of language-specific field types. For Dutch the field type 'text_nl' is defined as follows:

<dynamicField name="*_txt_nl" type="text_nl" indexed="true" stored="true"/>
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="lang/stemdict_nl.txt" ignoreCase="false"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
  </analyzer>
</fieldType>

So in short, in the SnowballPorterFilterFactory the language is set to Dutch. There is, however, an alternative stemming algorithm available, the Kraaij-Pohlmann algorithm (see Porter’s stemming algorithm for Dutch), known in Solr as Kp. To compare both algorithms, we define a new Dutch field type as follows:

<dynamicField name="*_txt_nlkp" type="text_nlkp" indexed="true" stored="true"/>
<fieldType name="text_nlkp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="lang/stemdict_nl.txt" ignoreCase="false"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp"/>
  </analyzer>
</fieldType>

To complete our analysis we will also use the default English language field, defined as:

<dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
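The field types above can be exercised without indexing anything by using Solr's field analysis request handler, which takes the parameters `analysis.fieldtype` and `analysis.fieldvalue`. The Python sketch below only builds the request URLs. The base URL and the core name `mycore` are assumptions, and whether the handler is reachable at `/analysis/field` depends on your Solr version and solrconfig.xml.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: default local install
CORE = "mycore"                           # assumption: hypothetical core name


def analysis_url(field_type: str, value: str) -> str:
    """Build a field-analysis request for one field type and one input word."""
    params = urlencode({
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": value,
        "wt": "json",
    })
    return f"{SOLR_BASE}/{CORE}/analysis/field?{params}"


if __name__ == "__main__":
    for field_type in ("text_en", "text_nl", "text_nlkp"):
        print(analysis_url(field_type, "katten"))
```

Opening one of the printed URLs in a browser (or the Analysis screen of the Solr admin UI) shows the token after each filter in the chain, which makes it easy to see where a stem goes wrong.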

Comparison

In the data shown next we compare the output of the three field types defined above with the correct values.

Fieldtype text_en

Input -> Output
  • katten -> katten
  • monniken -> monniken
  • meisje -> meisj
  • hempje -> hempj
  • krikken -> krikken
  • huizen -> huizen
  • huisje -> huisj
  • bloempje -> bloempj
  • loopt -> loopt
  • liep -> liep
  • lopen -> lopen
  • rent -> rent
  • rende -> rend
  • rennen -> rennen
  • kust -> kust
  • kussen -> kussen
  • kantelen -> kantel

Fieldtype text_nl (language = dutch)

Input -> Output
  • katten -> kat
  • monniken -> monnik
  • meisje -> meisj
  • hempje -> hemp
  • krikken -> krik
  • huizen -> huiz
  • huisje -> huisj
  • bloempje -> bloempj
  • loopt -> loopt
  • liep -> liep
  • lopen -> lop
  • rent -> rent
  • rende -> rend
  • rennen -> renn
  • kust -> kust
  • kussen -> kuss
  • kantelen -> kantel

Fieldtype text_nlkp (language = kp)

Input -> Output
  • katten -> kat
  • monniken -> monnik
  • meisje -> meis
  • hempje -> hem
  • krikken -> krik
  • huizen -> huis
  • huisje -> huis
  • bloempje -> bloem
  • loopt -> loop
  • liep -> liep
  • lopen -> loop
  • rent -> rent
  • rende -> rend
  • rennen -> ren
  • kust -> kust
  • kussen -> kus
  • kantelen -> kantel

Correct values

Input -> Output
  • katten -> kat
  • monniken -> monnik
  • meisje -> meisje
  • hempje -> hemd
  • krikken -> krik
  • huizen -> huis
  • huisje -> huis
  • bloempje -> bloem
  • loopt -> loop
  • liep -> loop
  • lopen -> loop
  • rent -> ren
  • rende -> ren
  • rennen -> ren
  • kust -> kus (verb) or kust (noun)
  • kussen -> kus
  • kantelen -> kantel (verb) or kanteel (noun)

The most noticeable conclusion of the above comparison is that, apart from regular plurals such as "katten" and "krikken", the output of the text_nl field does not differ much from that of the text_en field. It seems that the 'Dutch' language implementation of the SnowballPorterFilter has no way of stemming diminutives: results like "huisj" and "bloempj" are just plain wrong, while Kraaij-Pohlmann correctly returns "huis" and "bloem". The same holds for the plural "huizen", which Kraaij-Pohlmann correctly stems to "huis".

The diminutive "meisje" is stemmed by Kraaij-Pohlmann to "meis", which in some dialects of Dutch, like the dialect spoken in De Zaanstreek, is actually correct. There is, however, a way to correct this; see the section about KeywordMarkerFilterFactory under "A bit disappointed?". And "hempje" is wrongly stemmed to "hem", which seems to be too general an application of the rule that correctly stems "bloempje" to "bloem".

None of the algorithms knows how to handle homographs like "kust" and "kantelen", but this was to be expected.
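To put a rough number on the comparison, the lists above can be scored mechanically. The Python sketch below copies the outputs and correct values from this article and counts exact matches per field type. The scoring function `score` is my own illustrative choice: exact stem equality is a strict measure, and for search what ultimately matters is that related forms end up with the same stem.

```python
# Outputs per field type, in the same order as the input words in the lists above.
OUTPUTS = {
    "text_en": ["katten", "monniken", "meisj", "hempj", "krikken", "huizen",
                "huisj", "bloempj", "loopt", "liep", "lopen", "rent", "rend",
                "rennen", "kust", "kussen", "kantel"],
    "text_nl": ["kat", "monnik", "meisj", "hemp", "krik", "huiz", "huisj",
                "bloempj", "loopt", "liep", "lop", "rent", "rend", "renn",
                "kust", "kuss", "kantel"],
    "text_nlkp": ["kat", "monnik", "meis", "hem", "krik", "huis", "huis",
                  "bloem", "loop", "liep", "loop", "rent", "rend", "ren",
                  "kust", "kus", "kantel"],
}

# Correct stems; homographs accept either reading.
CORRECT = [{"kat"}, {"monnik"}, {"meisje"}, {"hemd"}, {"krik"}, {"huis"},
           {"huis"}, {"bloem"}, {"loop"}, {"loop"}, {"loop"}, {"ren"},
           {"ren"}, {"ren"}, {"kus", "kust"}, {"kus"}, {"kantel", "kanteel"}]


def score(outputs):
    """Count how many outputs equal one of the accepted correct stems."""
    return sum(out in accepted for out, accepted in zip(outputs, CORRECT))


if __name__ == "__main__":
    for field_type, outputs in OUTPUTS.items():
        print(f"{field_type}: {score(outputs)} of {len(CORRECT)} correct")
```

On these 17 words the counts are 2 for text_en, 5 for text_nl and 12 for text_nlkp.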

A bit disappointed?

Well, maybe your expectations were a bit high then... Natural language processing is notoriously hard and, for the part that requires background knowledge, as good as impossible when working on single words or short phrases. But in general the Kraaij-Pohlmann algorithm does a rather good job of stemming Dutch words. Sometimes, however, as with the word "meisje", it is a bit over-enthusiastic. But there are a number of ways to improve stemming if, for some reason, the results of the Kraaij-Pohlmann algorithm are not good enough.

KeywordMarkerFilterFactory

The KeywordMarkerFilter makes it possible to exclude words listed in a (UTF-8) text file from stemming. A word like "meisje" would be a good candidate for this. To use it, add a filter to your field type, before the stemming filter, like this:

<filter class="solr.KeywordMarkerFilterFactory" protected="notStemmed.txt" />

The file "notStemmed.txt" should be in the same directory as the schema.xml.

StemmerOverrideFilterFactory

The StemmerOverrideFilterFactory is a variation on the KeywordMarkerFilterFactory, but instead of only saying "do not stem these words" you provide a file that defines the stem for given words. To use it, add a filter to the field type like this:

<filter class="solr.StemmerOverrideFilterFactory" dictionary="dictionary.txt" ignoreCase="true"/>

and make sure the file "dictionary.txt" is present in the conf directory. In this file (which, like all others, has to be encoded as UTF-8) you add one line per word, each line consisting of the word and its stem separated by a single tab, like this:

hempje	hemd

Both KeywordMarkerFilterFactory and StemmerOverrideFilterFactory should be used in addition to the default stemming.
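As a conceptual model of how these two filters interact with the stemmer, consider the Python sketch below. It is not Solr's implementation: `analyze` and `dictionary_file_text` are hypothetical names, and the stemmer is replaced by a small table of Kraaij-Pohlmann outputs copied from the comparison above.

```python
# Stand-in for the Kp stemmer: outputs copied from the comparison in this article.
KP_OUTPUT = {"meisje": "meis", "hempje": "hem", "bloempje": "bloem"}

PROTECTED = {"meisje"}          # would be the contents of notStemmed.txt
OVERRIDES = {"hempje": "hemd"}  # would be the contents of dictionary.txt


def analyze(token: str) -> str:
    """Protected words and overrides are resolved before the stemmer runs."""
    token = token.lower()
    if token in PROTECTED:
        return token             # marked as keyword: the stemmer skips it
    if token in OVERRIDES:
        return OVERRIDES[token]  # fixed stem: the stemmer skips it as well
    return KP_OUTPUT.get(token, token)


def dictionary_file_text(overrides: dict) -> str:
    """Render overrides as one 'word<TAB>stem' line per entry."""
    return "".join(f"{word}\t{stem}\n" for word, stem in sorted(overrides.items()))


if __name__ == "__main__":
    for word in ("meisje", "hempje", "bloempje"):
        print(word, "->", analyze(word))
    print(repr(dictionary_file_text(OVERRIDES)))
```

With these two files "meisje" stays "meisje", "hempje" becomes "hemd" and "bloempje" is still stemmed to "bloem". The order in the model mirrors the order in the filter chain: both filters only have an effect when they come before the stemming filter.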

HunspellStemFilterFactory

Hunspell is the open source spellchecker used in a number of open source projects like LibreOffice, Mozilla Thunderbird etc. It is possible to use Hunspell if it supports your language. To do so, add a filter to the field type like this:

<filter class="solr.HunspellStemFilterFactory" dictionary="nl_NL.dic" affix="nl_NL.aff" ignoreCase="true" />

and make sure the files "nl_NL.dic" and "nl_NL.aff" are present in the conf directory.

Creating your own stemming algorithm

Of course, if you are really ambitious you can start from scratch and write your own Snowball implementation. From the Snowball website:
Snowball is a language in which stemming algorithms can be easily represented. The Snowball compiler translates a Snowball script (a .sbl file) into either a thread-safe ANSI C program or a Java program. For ANSI C, each Snowball script produces a program file and corresponding header file (with .c and .h extensions). The language has a full manual, and the various stemming scripts act as example programs.
But be warned: no natural language or natural language phenomenon is easy to fit into an algorithm, and you have to be sure to have all quirks and exceptions absolutely clear before you start.
