Stemming Words and Sentences
Goal¶
This post introduces stemming of words and sentences using nltk (Natural Language Toolkit).
References:
- Stemming and Lemmatization in Python
- Chris Albon's blog (I borrowed the post's title and wrote my own content to deepen my understanding of the topic.)
Library¶
In [1]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# download the tokenizer models used by word_tokenize (needed once per environment)
nltk.download('punkt')

porter = PorterStemmer()
lancaster = LancasterStemmer()
Words & Sentences to be stemmed¶
In [2]:
l_words1 = ['cats', 'trouble', 'troubling', 'troubled']
l_words2 = ['dogs', 'programming', 'programs', 'programmed', 'cakes', 'indices', 'matrices']
print(l_words1)
print(l_words2)
The example sentences are taken from Wiki - Stemming #Examples.
In [3]:
sentence = 'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.'
sentence
Out[3]:
'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.'
Stemming words¶
Porter Stemming¶
Porter stemming strips suffixes and keeps only the stem of each word, so it can produce non-English stems such as troubl. Such stems can be inconvenient for further analysis, but the algorithm is simple and efficient.
In [4]:
for word in l_words1:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
In [5]:
for word in l_words2:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
Lancaster Stemming¶
Lancaster stemming (the Paice/Husk stemmer) is a rule-based stemmer whose rules are indexed by the last letter of each word and applied iteratively. It is more aggressive than Porter stemming and can over-stem, producing shorter and less recognizable stems.
In [6]:
for word in l_words1:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
In [7]:
for word in l_words2:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
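To make the contrast between the two stemmers explicit, here is a minimal side-by-side sketch, reusing porter, lancaster, and the word lists defined above (the output layout is my own choice, not part of the original post):
In [ ]:
# compare Porter and Lancaster stems on the same words
for word in l_words1 + l_words2:
    print(f'{word} \t porter: {porter.stem(word)} \t lancaster: {lancaster.stem(word)}'.expandtabs(15))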
Stemming sentences¶
Tokenize¶
In [8]:
tokenized_words = word_tokenize(sentence)
print(tokenized_words)
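sent_tokenize was imported above but not used yet; as a minimal sketch, it splits the paragraph into sentences before any word-level processing (not required for the stemming below, just to show the other tokenizer):
In [ ]:
# split the paragraph into individual sentences
for sent in sent_tokenize(sentence):
    print(sent)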
Stemming with the Porter stemmer¶
In [9]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(porter.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[9]:
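The loop-and-join pattern above can also be written as a single generator expression; a minimal equivalent sketch (the variable name stemmed_sentence is my own):
In [ ]:
# stem each token and rejoin into one string in a single pass
stemmed_sentence = " ".join(porter.stem(word) for word in tokenized_words)
stemmed_sentence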
Stemming with the Lancaster stemmer¶
In [10]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(lancaster.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[10]: