BERT Word Embeddings

Goal

This post introduces how to use BERT word embeddings: loading a pre-trained tokenizer, preparing a sample sentence, and converting its tokens into the IDs that BERT expects as input.

Libraries

In [2]:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import matplotlib.pyplot as plt
%matplotlib inline

Load a pre-trained tokenizer model

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
100%|██████████| 231508/231508 [00:00<00:00, 426744.34B/s]
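As a quick sanity check, we can look at the WordPiece vocabulary that was just downloaded. This sketch assumes the pytorch_pretrained_bert BertTokenizer, which exposes its vocabulary as a dict-like vocab attribute; bert-base-uncased ships with roughly 30,000 entries.

# Inspect the WordPiece vocabulary loaded above
# (assumes pytorch_pretrained_bert's BertTokenizer, which stores it in `tokenizer.vocab`)
print(len(tokenizer.vocab))                # vocabulary size (~30k for bert-base-uncased)
print(list(tokenizer.vocab.items())[:5])   # a few (token, id) pairs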

Create a sample text

In [10]:
# text = "This is a sample text"
text = "This is the sample sentence for BERT word embeddings"
marked_text = "[CLS] " + text + " [SEP]"

print (marked_text)
[CLS] This is the sample sentence for BERT word embeddings [SEP]
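BERT expects every input sequence to start with the special [CLS] token and every sentence to end with [SEP], so these markers are added to the raw text before tokenization.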

Tokenization

In [11]:
tokenized_text = tokenizer.tokenize(marked_text)
print (tokenized_text)
['[CLS]', 'this', 'is', 'the', 'sample', 'sentence', 'for', 'bert', 'word', 'em', '##bed', '##ding', '##s', '[SEP]']
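Note that the tokenizer splits words it does not have in its vocabulary into WordPiece subwords; the '##' prefix marks a piece that continues the previous token, which is why "embeddings" becomes 'em', '##bed', '##ding', '##s'.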

Convert tokens to IDs

In [12]:
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

for tup in zip(tokenized_text, indexed_tokens):
    print(tup)
('[CLS]', 101)
('this', 2023)
('is', 2003)
('the', 1996)
('sample', 7099)
('sentence', 6251)
('for', 2005)
('bert', 14324)
('word', 2773)
('em', 7861)
('##bed', 8270)
('##ding', 4667)
('##s', 2015)
('[SEP]', 102)
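
With the token IDs in hand, the remaining step toward actual word embeddings is to wrap the inputs in tensors and run them through the pre-trained BertModel imported above. The sketch below assumes the pytorch_pretrained_bert API, where the model returns one hidden-state tensor per encoder layer; treat it as an outline rather than a definitive recipe.

# Mark every token as belonging to the same (single) sentence for BERT's segment embeddings
segments_ids = [1] * len(tokenized_text)

# Convert inputs to PyTorch tensors with a batch dimension of 1
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load the pre-trained model and put it in evaluation mode (disables dropout)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Run the text through BERT; encoded_layers holds the hidden states of each of the
# 12 encoder layers, each with shape [1, number of tokens, 768]
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)

print(len(encoded_layers), encoded_layers[0].shape)

These per-layer hidden states are the word embeddings; a common choice is to use the last layer or to combine several of the top layers per token.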
