Tokenize Text

Goal

This post shows how to split text into sentences and words using NLTK's sent_tokenize and word_tokenize functions.

Libraries

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize

Create a paragraph

In [8]:
paragraph = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"
paragraph
Out[8]:
"Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"

Tokenize a paragraph into sentences

In [9]:
sent_tokenize(paragraph)
Out[9]:
['Python is an interpreted, high-level, general-purpose programming language.',
 "Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
 'Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects']

Tokenize a paragraph into words

In [10]:
word_tokenize(paragraph)
Out[10]:
['Python',
 'is',
 'an',
 'interpreted',
 ',',
 'high-level',
 ',',
 'general-purpose',
 'programming',
 'language',
 '.',
 'Created',
 'by',
 'Guido',
 'van',
 'Rossum',
 'and',
 'first',
 'released',
 'in',
 '1991',
 ',',
 'Python',
 "'s",
 'design',
 'philosophy',
 'emphasizes',
 'code',
 'readability',
 'with',
 'its',
 'notable',
 'use',
 'of',
 'significant',
 'whitespace',
 '.',
 'Its',
 'language',
 'constructs',
 'and',
 'object-oriented',
 'approach',
 'aims',
 'to',
 'help',
 'programmers',
 'write',
 'clear',
 ',',
 'logical',
 'code',
 'for',
 'small',
 'and',
 'large-scale',
 'projects']
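
The word tokens above keep punctuation marks such as ',' and '.' as separate items. If only the words are needed, one possible follow-up step is to drop tokens made up entirely of punctuation and then count what remains. The sketch below is an illustration, not part of NLTK: the helper name is_punctuation is hypothetical, and the token list is a hardcoded excerpt of the output above (the last sentence) so that the example runs without NLTK installed.

```python
import string
from collections import Counter

# Excerpt of the word_tokenize output shown above (tokens of the last sentence),
# hardcoded here so the sketch does not depend on NLTK or its data files.
tokens = ['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach',
          'aims', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical',
          'code', 'for', 'small', 'and', 'large-scale', 'projects']


def is_punctuation(token):
    """Return True if every character in the token is ASCII punctuation."""
    # Hypothetical helper for this sketch; 'object-oriented' is kept because
    # only some of its characters are punctuation.
    return all(ch in string.punctuation for ch in token)


# Keep word tokens only, lowercased so that 'Its' and 'its' would count together.
words = [t.lower() for t in tokens if not is_punctuation(t)]

# Count how often each word occurs.
counts = Counter(words)

print(words)
print(counts.most_common(3))
```

Checking every character, rather than testing whether the token is a substring of string.punctuation, keeps multi-character tokens such as "'s" and hyphenated words like 'large-scale' in the result.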
