
I am trying to use the German stemmer that comes with RTextTools but the results I get are quite off the mark.

Say, I have the following vector:

v <- c("groß", "größer", "am", "größten", "ähnlicher")

Using

library(RTextTools)
wordStem(v, "german")

I get

[1] "groß"    "größer"  "am"      "größten" "ähnlich"

What am I missing?

Dominic
Well, apparently it should be using the Snowball algorithm (I checked it out here: [link](http://snowball.tartarus.org/algorithms/german/stemmer.html)). The first thing it should do is replace "ß" with "ss", and then "groß", "größer", and "größten" should clearly all result in "gross". – Dominic Jun 08 '12 at 21:24
It looks as if the algorithm is not handling the "ß" properly. "ähnlich" is correct. Try it with "grösser" and "grössten". – moskito-x Jun 08 '12 at 21:59

1 Answer


The algorithm in Snowball:

/*
    Extra rule for -nisse ending added 11 Dec 2009
*/

routines (
           prelude postlude
           mark_regions
           R1 R2
           standard_suffix
)

externals ( stem )

integers ( p1 p2 x )

groupings ( v s_ending st_ending )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a"   hex 'E4'
stringdef o"   hex 'F6'
stringdef u"   hex 'FC'
stringdef ss   hex 'DF'
......

It looks like "ss" is translated back to hex 'DF', i.e. "ß".
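As a possible workaround, in line with the comment suggesting to try "grösser" and "grössten", the "ß" can be expanded by hand before stemming. This is only a sketch: `expand_eszett` is a hypothetical helper name, and the stemming call assumes the `wordStem` function from the SnowballC package, which may not be the same implementation that RTextTools loads. I have not verified what stems it returns.

```r
# Hypothetical helper: expand "ß" (U+00DF) to "ss" before stemming.
expand_eszett <- function(words) {
  gsub("\u00df", "ss", words, fixed = TRUE)
}

v <- c("groß", "größer", "am", "größten", "ähnlicher")
v_ss <- expand_eszett(v)
print(v_ss)

# Assumption: SnowballC is installed; skip the stemming step otherwise.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  print(SnowballC::wordStem(v_ss, language = "german"))
}
```

If the stems still look wrong after this step, the problem is probably not limited to the "ß" handling.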

Representation of umlaut by following e

The German letters ä, ö and ü are occasionally represented by ae, oe and ue respectively. The stemmer here is a variant of the main German stemmer to take this into account.

The main German stemmer begins with the rule,

First, replace ß by ss, and put u and y between vowels into upper case. 

This is replaced with the rule,

Put u and y between vowels into upper case, and then do the following mappings,

    (a) replace ß with ss, **"MAYBE WRONG ORDER"**
    (b) replace ae with ä,
    (c) replace oe with ö,
    (d) replace ue with ü unless preceded by q. 



So in quelle, ue is not mapped to ü because it follows q, and in feuer it is not mapped because the first part of the rule changes it to feUer, so the u is not found. 
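To make the order of these mappings easier to follow, here is a rough sketch of the quoted prelude rule in base R. It is my own reading of the text above, not the Snowball source: `german2_prelude` is a hypothetical name, the vowel set (a, e, i, o, u, y, ä, ö, ü) is the one given in the Snowball description of the German stemmer, and the mappings are applied one after another rather than in a single left-to-right pass, so edge cases may differ from the real stemmer.

```r
# Sketch of the prelude mappings quoted above (not the Snowball source).
# german2_prelude is a hypothetical helper name.
german2_prelude <- function(words) {
  # Vowels: a e i o u y plus the umlauts (written as Unicode escapes).
  vowels <- "aeiouy\u00e4\u00f6\u00fc"
  # Pattern for a single letter with a vowel on both sides.
  between_vowels <- function(ch) {
    paste0("(?<=[", vowels, "])", ch, "(?=[", vowels, "])")
  }

  # Step 1: put u and y between vowels into upper case.
  out <- gsub(between_vowels("u"), "U", words, perl = TRUE)
  out <- gsub(between_vowels("y"), "Y", out, perl = TRUE)

  # Step 2: the four mappings, applied in the listed order.
  out <- gsub("\u00df", "ss", out, fixed = TRUE)       # ß  -> ss
  out <- gsub("ae", "\u00e4", out, fixed = TRUE)       # ae -> ä
  out <- gsub("oe", "\u00f6", out, fixed = TRUE)       # oe -> ö
  out <- gsub("(?<!q)ue", "\u00fc", out, perl = TRUE)  # ue -> ü unless after q
  out
}

print(german2_prelude(c("feuer", "quelle", "groesser", "gr\u00f6\u00dfer")))
```

Under these assumptions "feuer" becomes "feUer" and "quelle" is left unchanged, matching the description above, while "groß" comes out as "gross".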
moskito-x