# BERT Word Embeddings

## Goal

This post aims to introduce how to use BERT word embeddings.

## Libraries

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import matplotlib.pyplot as plt
%matplotlib inline


## Load a pre-trained tokenizer model

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

## Create a sample text

# text = "This is a sample text"
text = "This is the sample sentence for BERT word embeddings"
marked_text = "[CLS] " + text + " [SEP]"

print(marked_text)

[CLS] This is the sample sentence for BERT word embeddings [SEP]


## Tokenization

tokenized_text = tokenizer.tokenize(marked_text)
print(tokenized_text)

['[CLS]', 'this', 'is', 'the', 'sample', 'sentence', 'for', 'bert', 'word', 'em', '##bed', '##ding', '##s', '[SEP]']
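
The `##` pieces in the output come from WordPiece, which breaks a word that is not in the vocabulary into the longest sub-word pieces it can find. The following is a minimal sketch of that greedy longest-match idea, not the library's actual implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the toy vocabulary is made up for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into sub-word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a '##' prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary is far larger.
toy_vocab = {"word", "em", "##bed", "##ding", "##s"}

embeddings_pieces = wordpiece_tokenize("embeddings", toy_vocab)
word_pieces = wordpiece_tokenize("word", toy_vocab)
unknown_pieces = wordpiece_tokenize("xyz", toy_vocab)

print(embeddings_pieces)
print(word_pieces)
print(unknown_pieces)
```

With this toy vocabulary, "embeddings" is split the same way as in the tokenizer output above, while a word with no matching pieces falls back to `[UNK]`.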


## Convert tokens to IDs

indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

for tup in zip(tokenized_text, indexed_tokens):
    print(tup)

('[CLS]', 101)
('this', 2023)
('is', 2003)
('the', 1996)
('sample', 7099)
('sentence', 6251)
('for', 2005)
('bert', 14324)
('word', 2773)
('em', 7861)
('##bed', 8270)
('##ding', 4667)
('##s', 2015)
('[SEP]', 102)


# Tokenize Text

## Goal

This post aims to introduce how to tokenize text using nltk.

## Libraries

from nltk.tokenize import sent_tokenize, word_tokenize


## Create a paragraph

paragraph = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"
paragraph

"Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"

## Tokenize a paragraph into sentences

sent_tokenize(paragraph)

['Python is an interpreted, high-level, general-purpose programming language.',
"Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
'Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects']
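
To see why a dedicated sentence tokenizer is useful, here is a minimal sketch of a naive regex-based splitter. It is only an illustration and not how `sent_tokenize` works internally; `naive_sent_split` and the sample text are hypothetical.

```python
import re


def naive_sent_split(text):
    # Split after '.', '!' or '?' when followed by whitespace and an uppercase letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


sample = "Python is a language. It was first released in 1991. Dr. Smith uses it."
naive_sentences = naive_sent_split(sample)
print(naive_sentences)
# ['Python is a language.', 'It was first released in 1991.', 'Dr.', 'Smith uses it.']
```

The rule wrongly breaks after the abbreviation "Dr.", which is the kind of case the trained Punkt model behind `sent_tokenize` is designed to handle.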

## Tokenize a paragraph into words

word_tokenize(paragraph)

['Python',
'is',
'an',
'interpreted',
',',
'high-level',
',',
'general-purpose',
'programming',
'language',
'.',
'Created',
'by',
'Guido',
'van',
'Rossum',
'and',
'first',
'released',
'in',
'1991',
',',
'Python',
"'s",
'design',
'philosophy',
'emphasizes',
'code',
'readability',
'with',
'its',
'notable',
'use',
'of',
'significant',
'whitespace',
'.',
'Its',
'language',
'constructs',
'and',
'object-oriented',
'approach',
'aims',
'to',
'help',
'programmers',
'write',
'clear',
',',
'logical',
'code',
'for',
'small',
'and',
'large-scale',
'projects']
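
The output above keeps hyphenated words together, splits the possessive `'s` into its own token, and treats punctuation marks as separate tokens. A rough approximation of that behavior can be sketched with a single regular expression; this is an assumption-laden illustration, not nltk's actual rule set, and `simple_word_tokenize` is a hypothetical helper.

```python
import re

# Alternatives, tried in order:
#   1. a word, optionally with hyphenated parts (e.g. "high-level")
#   2. an apostrophe followed by letters (e.g. "'s")
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|'\w+|[^\w\s]")


def simple_word_tokenize(text):
    return TOKEN_PATTERN.findall(text)


simple_tokens = simple_word_tokenize("Python's design is high-level, clear.")
print(simple_tokens)
# ['Python', "'s", 'design', 'is', 'high-level', ',', 'clear', '.']
```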

# Remove Punctuation

## Goal

This post aims to introduce how to remove punctuation using Python's built-in string module.

## Libraries

import string


## Create a document

documents = ["this isn't a sample.",
             'this is another example.',
             'this" also appears in the second example.',
             'Is this an example?']

documents

["this isn't a sample.",
'this is another example.',
'this" also appears in the second example.',
'Is this an example?']

## Remove Punctuation

table = str.maketrans('', '', string.punctuation)
doc_removed_punctuation = [w.translate(table) for w in documents]
doc_removed_punctuation

['this isnt a sample',
'this is another example',
'this also appears in the second example',
'Is this an example']
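
A short sketch of what the translation table does may help. The third argument of `str.maketrans` lists characters to delete, so each of them maps to `None` in the table. The sample strings below are made up for illustration.

```python
import string

table = str.maketrans("", "", string.punctuation)

# Each punctuation character's code point maps to None, which translate() treats as "delete".
period_mapping = table[ord(".")]
print(period_mapping)

ascii_cleaned = "Hello, world!".translate(table)
print(ascii_cleaned)

# string.punctuation only covers ASCII punctuation, so curly quotes pass through untouched.
unicode_cleaned = "\u201cquoted\u201d text.".translate(table)
print(unicode_cleaned)
```

Text that contains non-ASCII punctuation therefore needs extra handling beyond `string.punctuation`.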

# Bag Of Words

## Goal

This post aims to introduce the bag-of-words model, which can be used to build features for documents or images.

Put simply, a bag of words records how often each word occurs in a document, with each word assigned a fixed index.

## Libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


## Create a document

documents = ['this is a sample.',
'this is another example. "this" also appears in the second example.']

documents

['this is a sample.',
'this is another example. "this" also appears in the second example.']

## Create a count vector

count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(documents)

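# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
df_word_counts = pd.DataFrame(word_counts.todense(), columns=count_vect.get_feature_names_out())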
df_word_counts

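|   | also | another | appears | example | in | is | sample | second | the | this |
|---|------|---------|---------|---------|----|----|--------|--------|-----|------|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 2 | 1 | 1 | 0 | 1 | 1 | 2 |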
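
The same counts can be reproduced with the standard library, which also shows why the word "a" has no column: by default `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters. The sketch below mirrors that default token pattern; `tokenize`, `vocabulary`, and `matrix` are illustrative names.

```python
import re
from collections import Counter

documents = ['this is a sample.',
             'this is another example. "this" also appears in the second example.']


def tokenize(text):
    # Lowercase, then keep runs of at least two word characters (drops "a").
    return re.findall(r"\b\w\w+\b", text.lower())


counts = [Counter(tokenize(doc)) for doc in documents]

# The vocabulary is the alphabetically sorted set of all tokens; a word's position is its index.
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```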

## Create a frequency vector

tf_transformer = TfidfTransformer(use_idf=False).fit(df_word_counts)
word_freq = tf_transformer.transform(df_word_counts)
df_word_freq = pd.DataFrame(word_freq.todense(), columns=count_vect.get_feature_names_out())
df_word_freq

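|   | also | another | appears | example | in | is | sample | second | the | this |
|---|------|---------|---------|---------|----|----|--------|--------|-----|------|
| 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.577350 | 0.57735 | 0.000000 | 0.000000 | 0.577350 |
| 1 | 0.258199 | 0.258199 | 0.258199 | 0.516398 | 0.258199 | 0.258199 | 0.00000 | 0.258199 | 0.258199 | 0.516398 |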
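
Note that these values do not sum to 1 per row. With `use_idf=False`, `TfidfTransformer` still applies its default `norm='l2'`, so each row of counts is divided by its Euclidean length rather than by the total number of words. The following sketch reproduces the numbers by hand and contrasts them with plain relative frequencies; the helper names are hypothetical.

```python
import math

# Count rows copied from the count-vector table above.
count_rows = [
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 2, 1, 1, 0, 1, 1, 2],
]


def l2_normalize(row):
    # Divide by the Euclidean length, so the squared values sum to 1.
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


def relative_frequency(row):
    # Divide by the total count, so the values sum to 1.
    total = sum(row)
    return [v / total for v in row]


l2_rows = [l2_normalize(row) for row in count_rows]
freq_rows = [relative_frequency(row) for row in count_rows]

print([round(v, 6) for v in l2_rows[0]])
print([round(v, 6) for v in freq_rows[0]])
```

Passing `norm='l1'` to `TfidfTransformer` would give the relative frequencies instead.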

# Term Frequency Inverse Document Frequency

## Goal

This post aims to introduce term frequency-inverse document frequency, also known as TF-IDF. It measures how important a word is to a document by weighing the word's frequency in that document against the number of documents that contain it, and it is commonly used for feature creation.

Term Frequency (TF)

Term Frequency can be computed as the number of occurrences of the word, $n_{word}$, divided by the total number of words in the document, $N_{word}$.

\begin{equation*} TF = \frac{n_{word}}{N_{word}} \end{equation*}

Document Frequency (DF)

Document Frequency can be computed as the number of documents containing the word, $n_{doc}$, divided by the total number of documents, $N_{doc}$.

\begin{equation*} DF = \frac{n_{doc}}{N_{doc}} \end{equation*}

Inverse Document Frequency (IDF)

The inverse document frequency is the inverse of DF.

\begin{equation*} IDF = \frac{N_{doc}}{n_{doc}} \end{equation*}

In practice, the logarithm is taken to keep IDF from growing too large, and 1 is added to the denominator to avoid division by zero:

\begin{equation*} IDF = \log\left(\frac{N_{doc}}{n_{doc}+1}\right) \end{equation*}
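
The formulas above can be worked through on a tiny corpus. This is a minimal sketch that assumes a simple lowercase word tokenizer and the natural logarithm; the function names are illustrative, and the two sample sentences are the ones used later in this post.

```python
import math
import re

documents = ['this is a sample.',
             'this is another example. "this" also appears in the second example.']

# Assumption: words are lowercase runs of word characters.
tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]


def tf(word, tokens):
    # n_word / N_word
    return tokens.count(word) / len(tokens)


def df(word, docs):
    # n_doc / N_doc
    return sum(word in tokens for tokens in docs) / len(docs)


def idf_smoothed(word, docs):
    # log(N_doc / (n_doc + 1))
    n_doc = sum(word in tokens for tokens in docs)
    return math.log(len(docs) / (n_doc + 1))


tf_sample = tf("sample", tokenized[0])          # 1 of 4 words
df_sample = df("sample", tokenized)             # 1 of 2 documents
idf_sample = idf_smoothed("sample", tokenized)  # log(2 / 2)
idf_this = idf_smoothed("this", tokenized)      # log(2 / 3)

print(tf_sample, df_sample, idf_sample, idf_this)
```

With only two documents, this smoothed IDF is zero for a word found in one document and negative for a word found in both, which is one reason libraries use slightly different smoothing variants.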

## Libraries

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


## Create a document

documents = ['this is a sample.',
'this is another example. "this" also appears in the second example.']

documents

['this is a sample.',
'this is another example. "this" also appears in the second example.']

## Apply TF-IDF vectorizer

Applying TF-IDF to the documents yields one feature vector per document.

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
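# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
df_tfidf = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out())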
df_tfidf

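|   | also | another | appears | example | in | is | sample | second | the | this |
|---|------|---------|---------|---------|----|----|--------|--------|-----|------|
| 0 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.501549 | 0.704909 | 0.00000 | 0.00000 | 0.501549 |
| 1 | 0.28249 | 0.28249 | 0.28249 | 0.56498 | 0.28249 | 0.200994 | 0.000000 | 0.28249 | 0.28249 | 0.401988 |

# The highest TF-IDF for each document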
df_tfidf.idxmax(axis=1)

0     sample
1    example
dtype: object

# TF-IDF is zero if the word does not appear in a document
df_tfidf==0

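|   | also | another | appears | example | in | is | sample | second | the | this |
|---|------|---------|---------|---------|----|----|--------|--------|-----|------|
| 0 | True | True | True | True | True | False | False | True | True | False |
| 1 | False | False | False | False | False | False | True | False | False | False |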
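
The values in the TF-IDF table above differ from what the textbook formulas give, because `TfidfVectorizer` by default uses raw counts as TF, a smoothed IDF of $\ln\frac{1 + N_{doc}}{1 + n_{doc}} + 1$, and L2 normalization of each row. The sketch below reproduces the first row by hand under those defaults; the variable names are illustrative, and the counts are taken from the two sample documents.

```python
import math

N_DOCS = 2

# Raw counts for document 0 and the number of documents containing each word.
counts_doc0 = {"is": 1, "sample": 1, "this": 1}
doc_freq = {"is": 2, "sample": 1, "this": 2}


def sklearn_style_idf(n_doc, n_docs_total):
    # Smoothed IDF as used when smooth_idf=True (the default).
    return math.log((1 + n_docs_total) / (1 + n_doc)) + 1


# TF (raw count) times IDF, before normalization.
weights = {w: c * sklearn_style_idf(doc_freq[w], N_DOCS) for w, c in counts_doc0.items()}

# L2-normalize the row.
norm = math.sqrt(sum(v * v for v in weights.values()))
tfidf_doc0 = {w: v / norm for w, v in weights.items()}

print({w: round(v, 6) for w, v in tfidf_doc0.items()})
# {'is': 0.501549, 'sample': 0.704909, 'this': 0.501549}
```

Because "sample" appears in only one document, its IDF is higher than that of "is" and "this", which is why it gets the highest weight in the first row.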

# Parse HTML

## Goal

This post aims to introduce how to parse HTML data using BeautifulSoup.

## Libraries

from bs4 import BeautifulSoup
import requests


## Simple HTML from string

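html_simple = '<h1>This is Title</h1>'
html_simple

'<h1>This is Title</h1>'

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_simple, 'html.parser')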

soup.text

'This is Title'
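
To illustrate what extracting the text amounts to, here is a minimal standard-library sketch that collects the text nodes of a document. It is not BeautifulSoup's implementation, although BeautifulSoup's `html.parser` backend builds on the same `HTMLParser` class; `TextExtractor` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered input
    return "".join(parser.chunks)


extracted = extract_text("<h1>This is Title</h1><p>Some body text.</p>")
print(extracted)
# This is TitleSome body text.
```

BeautifulSoup adds a navigable tree on top of this kind of parsing, which is what makes lookups by tag or attribute convenient.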

# Create a word cloud

## Goal

This post aims to introduce how to create a word cloud using the wordcloud library.

As the source of words, I use one of my posts on 200Wordsaday (a.k.a. 200WaD), a community for people who want to build a writing habit.
