BERT Word Embeddings

Goal

This post introduces how to use BERT word embeddings: loading a pre-trained tokenizer, preparing a sample sentence, and converting its tokens into the IDs that BERT expects as input.

Libraries

In [2]:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import matplotlib.pyplot as plt
%matplotlib inline

Load a pre-trained tokenizer model

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
100%|██████████| 231508/231508 [00:00<00:00, 426744.34B/s]
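As a quick sanity check, we can look at the WordPiece vocabulary that was just downloaded. This sketch assumes the pytorch_pretrained_bert BertTokenizer, which exposes its vocabulary as a dict-like vocab attribute; bert-base-uncased ships with roughly 30,000 entries.

# Inspect the WordPiece vocabulary loaded above
# (assumes pytorch_pretrained_bert's BertTokenizer, which stores it in `tokenizer.vocab`)
print(len(tokenizer.vocab))                # vocabulary size (~30k for bert-base-uncased)
print(list(tokenizer.vocab.items())[:5])   # a few (token, id) pairs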

Create a sample text

In [10]:
# text = "This is a sample text"
text = "This is the sample sentence for BERT word embeddings"
marked_text = "[CLS] " + text + " [SEP]"

print (marked_text)
[CLS] This is the sample sentence for BERT word embeddings [SEP]
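BERT expects every input sequence to start with the special [CLS] token and every sentence to end with [SEP], so these markers are added to the raw text before tokenization.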

Tokenization

In [11]:
tokenized_text = tokenizer.tokenize(marked_text)
print (tokenized_text)
['[CLS]', 'this', 'is', 'the', 'sample', 'sentence', 'for', 'bert', 'word', 'em', '##bed', '##ding', '##s', '[SEP]']
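Note that the tokenizer splits words it does not have in its vocabulary into WordPiece subwords; the '##' prefix marks a piece that continues the previous token, which is why "embeddings" becomes 'em', '##bed', '##ding', '##s'.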

Convert tokens to IDs

In [12]:
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

for tup in zip(tokenized_text, indexed_tokens):
    print(tup)
('[CLS]', 101)
('this', 2023)
('is', 2003)
('the', 1996)
('sample', 7099)
('sentence', 6251)
('for', 2005)
('bert', 14324)
('word', 2773)
('em', 7861)
('##bed', 8270)
('##ding', 4667)
('##s', 2015)
('[SEP]', 102)
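
With the token IDs in hand, the remaining step toward actual word embeddings is to wrap the inputs in tensors and run them through the pre-trained BertModel imported above. The sketch below assumes the pytorch_pretrained_bert API, where the model returns one hidden-state tensor per encoder layer; treat it as an outline rather than a definitive recipe.

# Mark every token as belonging to the same (single) sentence for BERT's segment embeddings
segments_ids = [1] * len(tokenized_text)

# Convert inputs to PyTorch tensors with a batch dimension of 1
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load the pre-trained model and put it in evaluation mode (disables dropout)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Run the text through BERT; encoded_layers holds the hidden states of each of the
# 12 encoder layers, each with shape [1, number of tokens, 768]
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)

print(len(encoded_layers), encoded_layers[0].shape)

These per-layer hidden states are the word embeddings; a common choice is to use the last layer or to combine several of the top layers per token.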
