Posts about NLTK

Stemming Words and Sentences

Goal

This post introduces stemming of words and sentences using nltk (Natural Language Toolkit).

Reference: Wikipedia - Stemming (Examples section)

Library

In [1]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
porter = PorterStemmer()
lancaster = LancasterStemmer()
[nltk_data] Downloading package punkt to /Users/hiro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Words & Sentences to be stemmed

In [2]:
l_words1 = ['cats', 'trouble', 'troubling', 'troubled']
l_words2 = ['dogs', 'programming', 'programs', 'programmed', 'cakes', 'indices', 'matrices']

print(l_words1)
print(l_words2)
['cats', 'trouble', 'troubling', 'troubled']
['dogs', 'programming', 'programs', 'programmed', 'cakes', 'indices', 'matrices']

The example sentence comes from Wikipedia - Stemming (Examples section).

In [3]:
sentence = 'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces, argue, argued, argues, arguing, and argus to the stem argu.'
sentence
Out[3]:
'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces, argue, argued, argues, arguing, and argus to the stem argu.'

Stemming words

Porter Stemming

Porter stemming strips suffixes and keeps only the stem of each word, which can leave non-English tokens such as troubl. Such stems may be awkward to read in further analysis, but the algorithm is simple and efficient.

In [4]:
for word in l_words1:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
cats            -> cat
trouble         -> troubl
troubling       -> troubl
troubled        -> troubl
In [5]:
for word in l_words2:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
dogs            -> dog
programming     -> program
programs        -> program
programmed      -> program
cakes           -> cake
indices         -> indic
matrices        -> matric

Lancaster Stemming

Lancaster stemming is an iterative, rule-based stemmer whose rules match the endings of words. It is more aggressive than Porter stemming and often produces even shorter, non-English stems.

In [6]:
for word in l_words1:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
cats            -> cat
trouble         -> troubl
troubling       -> troubl
troubled        -> troubl
In [7]:
for word in l_words2:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
dogs            -> dog
programming     -> program
programs        -> program
programmed      -> program
cakes           -> cak
indices         -> ind
matrices        -> mat
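To see how the two stemmers differ, the word lists above can be run through both and printed side by side (a minimal sketch reusing the same words and stemmers as the cells above):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ['cats', 'trouble', 'troubling', 'troubled',
         'dogs', 'programming', 'programs', 'programmed',
         'cakes', 'indices', 'matrices']

# Print each word with its Porter and Lancaster stems in aligned columns
print(f'{"word":<15}{"porter":<15}{"lancaster":<15}')
for word in words:
    print(f'{word:<15}{porter.stem(word):<15}{lancaster.stem(word):<15}')
```

The table makes the difference in aggressiveness easy to spot: both agree on cat and dog, but Lancaster cuts cakes, indices, and matrices down to cak, ind, and mat.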

Stemming sentences

Tokenize

In [8]:
tokenized_words=word_tokenize(sentence)
print(tokenized_words)
['A', 'stemmer', 'for', 'English', 'operating', 'on', 'the', 'stem', 'cat', 'should', 'identify', 'such', 'strings', 'as', 'cats', ',', 'catlike', ',', 'and', 'catty', '.', 'A', 'stemming', 'algorithm', 'might', 'also', 'reduce', 'the', 'words', 'fishing', ',', 'fished', ',', 'and', 'fisher', 'to', 'the', 'stem', 'fish', '.', 'The', 'stem', 'need', 'not', 'be', 'a', 'word', ',', 'for', 'example', 'the', 'Porter', 'algorithm', 'reduces', ',', 'argue', ',', 'argued', ',', 'argues', ',', 'arguing', ',', 'and', 'argus', 'to', 'the', 'stem', 'argu', '.']

Stemming by Porter stemming

In [9]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(porter.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[9]:
'A stemmer for english oper on the stem cat should identifi such string as cat , catlik , and catti . A stem algorithm might also reduc the word fish , fish , and fisher to the stem fish . the stem need not be a word , for exampl the porter algorithm reduc , argu , argu , argu , argu , and argu to the stem argu .'

Stemming by Lancaster stemming

In [10]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(lancaster.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[10]:
'a stem for engl op on the stem cat should ident such strings as cat , catlik , and catty . a stem algorithm might also reduc the word fish , fish , and fish to the stem fish . the stem nee not be a word , for exampl the port algorithm reduc , argu , argu , argu , argu , and arg to the stem argu .'