Stemming Words and Sentences
Goal¶
This post introduces stemming words and sentences using nltk (Natural Language Toolkit).
Reference:
- Stemming and Lemmatization in Python
- Chris Albon's blog (I looked at his post's title and wrote my own content to deepen my understanding of the topic.)
Library¶
In [1]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# punkt supplies the pre-trained tokenizer models used by word_tokenize and sent_tokenize
nltk.download('punkt')

porter = PorterStemmer()
lancaster = LancasterStemmer()
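As a quick, optional sanity check (not one of the original cells), each stemmer can be called directly on a single inflected word:

# Optional sanity check: both stemmers should reduce an inflected form to its stem
print(porter.stem('running'))     # expected: 'run'
print(lancaster.stem('running'))  # expected: 'run'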
Words & Sentences to be stemmed¶
In [2]:
l_words1 = ['cats', 'trouble', 'troubling', 'troubled']
l_words2 = ['dogs', 'programming', 'programs', 'programmed', 'cakes', 'indices', 'matrices']
print(l_words1)
print(l_words2)
The example sentences come from Wiki - Stemming #Examples.
In [3]:
sentence = 'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.'
sentence
Out[3]:
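'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.'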
Stemming words¶
Porter Stemming¶
Porter stemming strips suffixes and keeps only the stem of each word, which is often not a valid English word (e.g. troubl). Such non-words can be inconvenient for later analysis, but the algorithm is simple and efficient.
In [4]:
for word in l_words1:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
In [5]:
for word in l_words2:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
Lancaster Stemming¶
Lancaster stemming (the Paice/Husk stemmer) is an iterative, rule-based stemmer whose rules are indexed by the last letter of each word. It is more aggressive than Porter stemming and can over-stem, producing very short stems.
In [6]:
for word in l_words1:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
In [7]:
for word in l_words2:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
Stemming sentences¶
Tokenize¶
In [8]:
tokenized_words = word_tokenize(sentence)
print(tokenized_words)
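sent_tokenize was imported alongside word_tokenize; although the rest of this post works token by token, it can first split the paragraph into its individual sentences. A small optional aside:

# Optional: split the paragraph into sentences before word-level work
for s in sent_tokenize(sentence):
    print(s)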
Stemming with the Porter stemmer¶
In [9]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(porter.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[9]:
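The append-then-join loop above can also be written as a single join over a generator expression; a minimal equivalent sketch (porter_sentence is just an illustrative name):

# Equivalent one-liner: stem each token and rejoin with spaces
porter_sentence = " ".join(porter.stem(word) for word in tokenized_words)
porter_sentence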
Stemming with the Lancaster stemmer¶
In [10]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(lancaster.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[10]:
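To see where the two stemmers disagree on the same input, a short side-by-side comparison sketch, reusing the tokens from above:

# Compare Porter and Lancaster stems token by token
print('word \t porter \t lancaster'.expandtabs(15))
for word in tokenized_words:
    print(f'{word} \t {porter.stem(word)} \t {lancaster.stem(word)}'.expandtabs(15))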