Stemming Words and Sentences
Goal¶
This post introduces stemming of words and sentences using nltk (Natural Language Toolkit).
References:
- Stemming and Lemmatization in Python
- Chris Albon's blog (I borrowed the post's title and wrote my own content to deepen my understanding of the topic.)
Library¶
In [1]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# download the tokenizer models used by word_tokenize (needed once per environment)
nltk.download('punkt')

porter = PorterStemmer()
lancaster = LancasterStemmer()
Words & Sentences to be stemmed¶
In [2]:
l_words1 = ['cats', 'trouble', 'troubling', 'troubled']
l_words2 = ['dogs', 'programming', 'programs', 'programmed', 'cakes', 'indices', 'matrices']
print(l_words1)
print(l_words2)
The example sentences are taken from Wiki - Stemming #Examples.
In [3]:
sentence = 'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.'
sentence
Out[3]:
'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.'
Stemming words¶
Porter Stemming¶
Porter stemming strips suffixes and keeps only the stem of each word, so it can produce non-English stems such as troubl. Such stems can be inconvenient for further analysis, but the algorithm is simple and efficient.
In [4]:
for word in l_words1:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
In [5]:
for word in l_words2:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
Lancaster Stemming¶
Lancaster stemming (the Paice/Husk stemmer) is a rule-based stemmer whose rules are indexed by the last letter of each word and applied iteratively. It is more aggressive than Porter stemming and can over-stem, producing shorter and less recognizable stems.
In [6]:
for word in l_words1:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
In [7]:
for word in l_words2:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
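To make the contrast between the two stemmers explicit, here is a minimal side-by-side sketch, reusing porter, lancaster, and the word lists defined above (the output layout is my own choice, not part of the original post):
In [ ]:
# compare Porter and Lancaster stems on the same words
for word in l_words1 + l_words2:
    print(f'{word} \t porter: {porter.stem(word)} \t lancaster: {lancaster.stem(word)}'.expandtabs(15))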
Stemming sentences¶
Tokenize¶
In [8]:
tokenized_words = word_tokenize(sentence)
print(tokenized_words)
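sent_tokenize was imported above but not used yet; as a minimal sketch, it splits the paragraph into sentences before any word-level processing (not required for the stemming below, just to show the other tokenizer):
In [ ]:
# split the paragraph into individual sentences
for sent in sent_tokenize(sentence):
    print(sent)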
Stemming with the Porter stemmer¶
In [9]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(porter.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[9]:
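The loop-and-join pattern above can also be written as a single generator expression; a minimal equivalent sketch (the variable name stemmed_sentence is my own):
In [ ]:
# stem each token and rejoin into one string in a single pass
stemmed_sentence = " ".join(porter.stem(word) for word in tokenized_words)
stemmed_sentence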
Stemming with the Lancaster stemmer¶
In [10]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(lancaster.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[10]: