Replace Characters

Goal

This post aims to introduce how to replace characters in a Python string.

Create strings

In [2]:
# Create strings
strings = 'String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects'
strings
Out[2]:
'String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects'

Replace characters

.replace('{old}', '{new}')

In [16]:
# .replace('{old}', '{new}')
strings.replace('S', 'a')
Out[16]:
'atring (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects'

.replace can be chained

In [5]:
# .replace can be chained
strings.replace('(', '').replace(' ', '_')
Out[5]:
'String_structure),_a_long_flexible_structure_made_from_threads_twisted_together,_which_is_used_to_tie,_bind,_or_hang_other_objects'

Replace multiple characters using a dictionary

In [15]:
d_replace = {'(': '',
             ')': '',
             ' ': '',
             ',': ''}

for old, new in d_replace.items():
    print(f'replace {old} with {new} ')
    strings = strings.replace(old, new)
strings
replace ( with  
replace ) with  
replace   with  
replace , with  
Out[15]:
'Stringstructurealongflexiblestructuremadefromthreadstwistedtogetherwhichisusedtotiebindorhangotherobjects'
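
As an alternative to looping over a dictionary, the same kind of multi-character removal can be sketched with str.translate. This is a minimal example, assuming a shortened sample text (the name text is made up for illustration); unlike the loop above, it does not overwrite the original variable.

```python
# Shortened sample text (hypothetical, for illustration only)
text = 'String (structure), a long flexible structure'

# The third argument of str.maketrans lists characters to delete
table = str.maketrans('', '', '(), ')

# translate returns a new string; text itself is unchanged
cleaned = text.translate(table)
print(cleaned)
```

Because strings are immutable, keeping the result in a separate name means earlier cells still see the original text when they are re-run.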

Add Padding Around String

Goal

This post aims to introduce how to add padding around a string.

Create a string and number

In [6]:
string = 'abc_def'
num = 10

Add padding with " " (space) or another character

Strings have a method called ljust, which left-justifies the string and pads it on the right up to the given width.

S.ljust(width[, fillchar]) -> str
In [3]:
string.ljust(10)
Out[3]:
'abc_def   '
In [5]:
string.ljust(10, 'a')
Out[5]:
'abc_defaaa'

Add zero padding to numbers

In [10]:
# the 0 after ":" requests zero padding and 10 is the total width
'{:010}'.format(num)
Out[10]:
'0000000010'
In [13]:
# python >= 3.6 
# the 0 after ":" requests zero padding and 10 is the total width
f'{num:010}'
Out[13]:
'0000000010'
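
ljust pads on the right only. A short sketch of the related built-in string methods (rjust, center, zfill) and the equivalent format specification, reusing the same string and num values as above:

```python
string = 'abc_def'
num = 10

# Pad on the left so the text is right-justified
right = string.rjust(10, '*')

# Pad on both sides so the text is centered
centered = string.center(11, '*')

# zfill pads a numeric string with leading zeros
zero_padded = str(num).zfill(10)

# Format spec: fill character, alignment (> means right-align), width
formatted = f'{string:*>10}'

print(right, centered, zero_padded, formatted)
```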

Invert A Matrix

Goal

This post aims to show how to invert a matrix using numpy, i.e., calculating an inverse matrix $A^{-1}$ from $A$.

For example, if we have

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} $$

Then, $A^{-1}$ should satisfy

$$A A^{-1} = I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $$

Library

In [1]:
import numpy as np
from numpy.linalg import inv

Create a matrix

In [2]:
arr = np.array([[1, 2], [3, 4]])
arr
Out[2]:
array([[1, 2],
       [3, 4]])

Invert a matrix

In [3]:
arr_inv = inv(arr)
arr_inv
Out[3]:
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

Check $A A^{-1} = I$

In [4]:
np.dot(arr_inv, arr)
Out[4]:
array([[1.00000000e+00, 0.00000000e+00],
       [2.22044605e-16, 1.00000000e+00]])
In [5]:
np.dot(arr, arr_inv)
Out[5]:
array([[1.0000000e+00, 0.0000000e+00],
       [8.8817842e-16, 1.0000000e+00]])
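
The off-diagonal values such as 2.22044605e-16 are floating-point rounding error rather than exact zeros. A minimal sketch of a tolerance-based check, assuming the default tolerances of np.allclose are acceptable for this matrix:

```python
import numpy as np
from numpy.linalg import inv

arr = np.array([[1, 2], [3, 4]])
arr_inv = inv(arr)

# Compare both products against the identity within a small tolerance
left_ok = np.allclose(arr_inv @ arr, np.eye(2))
right_ok = np.allclose(arr @ arr_inv, np.eye(2))
print(left_ok, right_ok)
```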

Reshape An Array

Goal

This post aims to describe how to reshape an array from 1D to 2D or 2D to 1D using numpy.

Library

In [1]:
import numpy as np

Create a 1D and 2D array

In [6]:
# 1D array
arr_1d = np.array(np.arange(0, 10))
arr_1d
Out[6]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [7]:
arr_1d.shape
Out[7]:
(10,)
In [12]:
# 2D array
arr_2d = np.array([np.arange(1, 20, 2), np.arange(100, 80, -2)]).T
arr_2d
Out[12]:
array([[  1, 100],
       [  3,  98],
       [  5,  96],
       [  7,  94],
       [  9,  92],
       [ 11,  90],
       [ 13,  88],
       [ 15,  86],
       [ 17,  84],
       [ 19,  82]])
In [13]:
arr_2d.shape
Out[13]:
(10, 2)

Reshape

reshape from 1D to 2D

In [9]:
# 1D with shape (10, ) to 2D with shape (2, 5)
np.reshape(arr_1d, [2, 5])
Out[9]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
In [18]:
np.reshape(arr_1d, [2, 5]).shape
Out[18]:
(2, 5)

reshape from 2D to 1D

In [16]:
# 2D with shape (10, 2) to 1D with shape (20, )
np.reshape(arr_2d, arr_2d.size)
Out[16]:
array([  1, 100,   3,  98,   5,  96,   7,  94,   9,  92,  11,  90,  13,
        88,  15,  86,  17,  84,  19,  82])
In [17]:
np.reshape(arr_2d, arr_2d.size).shape
Out[17]:
(20,)
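
The same reshapes can also be written with the array methods reshape and ravel. In this sketch, -1 asks numpy to infer that dimension from the array size; the variable names are chosen only for illustration.

```python
import numpy as np

arr_1d = np.arange(10)

# -1 lets numpy infer the second dimension (10 / 2 = 5)
arr_auto = arr_1d.reshape(2, -1)

# ravel flattens back to 1D in row-major order
arr_flat = arr_auto.ravel()

print(arr_auto.shape, arr_flat.shape)
```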

Stemming Words and Sentences

Goal

This post aims to introduce stemming words and sentences using NLTK (Natural Language Toolkit).

Library

In [1]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
porter = PorterStemmer()
lancaster = LancasterStemmer()
[nltk_data] Downloading package punkt to /Users/hiro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Words & Sentences to be stemmed

In [2]:
l_words1 = ['cats', 'trouble', 'troubling', 'troubled']
l_words2 = ['dogs', 'programming', 'programs', 'programmed', 'cakes', 'indices', 'matrices']

print(l_words1)
print(l_words2)
['cats', 'trouble', 'troubling', 'troubled']
['dogs', 'programming', 'programs', 'programmed', 'cakes', 'indices', 'matrices']

The example sentences are taken from the Examples section of the Wikipedia article on stemming.

In [3]:
sentence = 'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces, argue, argued, argues, arguing, and argus to the stem argu.'
sentence
Out[3]:
'A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces, argue, argued, argues, arguing, and argus to the stem argu.'

Stemming words

Porter Stemming

Porter stemming strips common suffixes and keeps only the leading part of each word, so the result may not be an English word (e.g. troubl). Such stems can be awkward to read in later analysis, but the algorithm is simple and efficient.

In [4]:
for word in l_words1:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
cats            -> cat
trouble         -> troubl
troubling       -> troubl
troubled        -> troubl
In [5]:
for word in l_words2:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
dogs            -> dog
programming     -> program
programs        -> program
programmed      -> program
cakes           -> cake
indices         -> indic
matrices        -> matric

Lancaster Stemming

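Lancaster stemming is a rule-based stemmer that picks its rules by the last letter of the word. It is computationally heavier than Porter stemming and, judging by the results below, cuts words more aggressively (e.g. cakes -> cak, matrices -> mat).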

In [6]:
for word in l_words1:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
cats            -> cat
trouble         -> troubl
troubling       -> troubl
troubled        -> troubl
In [7]:
for word in l_words2:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
dogs            -> dog
programming     -> program
programs        -> program
programmed      -> program
cakes           -> cak
indices         -> ind
matrices        -> mat

Stemming sentences

Tokenize

In [8]:
tokenized_words = word_tokenize(sentence)
print(tokenized_words)
['A', 'stemmer', 'for', 'English', 'operating', 'on', 'the', 'stem', 'cat', 'should', 'identify', 'such', 'strings', 'as', 'cats', ',', 'catlike', ',', 'and', 'catty', '.', 'A', 'stemming', 'algorithm', 'might', 'also', 'reduce', 'the', 'words', 'fishing', ',', 'fished', ',', 'and', 'fisher', 'to', 'the', 'stem', 'fish', '.', 'The', 'stem', 'need', 'not', 'be', 'a', 'word', ',', 'for', 'example', 'the', 'Porter', 'algorithm', 'reduces', ',', 'argue', ',', 'argued', ',', 'argues', ',', 'arguing', ',', 'and', 'argus', 'to', 'the', 'stem', 'argu', '.']

Stemming with the Porter stemmer

In [9]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(porter.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[9]:
'A stemmer for english oper on the stem cat should identifi such string as cat , catlik , and catti . A stem algorithm might also reduc the word fish , fish , and fisher to the stem fish . the stem need not be a word , for exampl the porter algorithm reduc , argu , argu , argu , argu , and argu to the stem argu .'

Stemming with the Lancaster stemmer

In [10]:
tokenized_sentence = []
for word in tokenized_words:
    tokenized_sentence.append(lancaster.stem(word))
tokenized_sentence = " ".join(tokenized_sentence)
tokenized_sentence
Out[10]:
'a stem for engl op on the stem cat should ident such strings as cat , catlik , and catty . a stem algorithm might also reduc the word fish , fish , and fish to the stem fish . the stem nee not be a word , for exampl the port algorithm reduc , argu , argu , argu , argu , and arg to the stem argu .'
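
The two loops above differ only in the stemmer they call, so they can be folded into one helper. This is a rough sketch: stem_text is a hypothetical name, and it splits on whitespace with str.split instead of word_tokenize, so punctuation would stay attached to words in real sentences.

```python
from nltk.stem import PorterStemmer

def stem_text(text, stemmer):
    """Stem every whitespace-separated token and join the stems back together."""
    return ' '.join(stemmer.stem(token) for token in text.split())

porter = PorterStemmer()
stemmed = stem_text('cats programming programs', porter)
print(stemmed)
```

Passing the stemmer as an argument makes it easy to compare the Porter and Lancaster results on the same input.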

Adding Or Subtracting Time

Goal

This post aims to add or subtract time from a date column using pandas:

  • Pandas

Reference:

  • Chris Albon's blog (I looked at the title of his post and wrote my own content to deepen my understanding of the topic.)

Library

In [1]:
import pandas as pd

Create date columns using date_range

In [2]:
date_rng = pd.date_range(start='20160101', end='20190101', freq='m', closed='left')
date_rng
Out[2]:
DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
               '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31',
               '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31',
               '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31',
               '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')
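
A minimal sketch of the add and subtract step named in the goal, assuming a small hypothetical DataFrame with a date column (the column names are made up for illustration). pd.Timedelta covers fixed durations such as days and weeks, while pd.DateOffset handles calendar units such as months.

```python
import pandas as pd

# Hypothetical date column with three month-end dates
df = pd.DataFrame({'date': pd.to_datetime(['2016-01-31', '2016-02-29', '2016-03-31'])})

# Fixed durations: add one day, subtract one week
df['plus_1_day'] = df['date'] + pd.Timedelta(days=1)
df['minus_1_week'] = df['date'] - pd.Timedelta(weeks=1)

# Calendar-aware shift: the day is clipped when the target month is shorter
df['plus_1_month'] = df['date'] + pd.DateOffset(months=1)

print(df)
```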

1036. Escape a Large Maze

Problem Setting

In a 1 million by 1 million grid, the coordinates of each grid square are (x, y) with 0 <= x, y < 10^6.

We start at the source square and want to reach the target square. Each move, we can walk to a 4-directionally adjacent square in the grid that isn't in the given list of blocked squares.

Return true if and only if it is possible to reach the target square through a sequence of moves.

Link for Problem: leetcode

Example 1:

Input: blocked = [[0,1],[1,0]], source = [0,0], target = [0,2]

Output: false

Explanation: The target square is inaccessible starting from the source square, because we can't walk outside the grid.

Example 2:

Input: blocked = [], source = [0,0], target = [999999,999999]

Output: true

Explanation:

Because there are no blocked cells, it's possible to reach the target square.
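
One possible approach, sketched here as an assumption rather than as the intended solution, is a bounded breadth-first search run from both ends. With n blocked squares, the largest region they can wall off (using a grid corner) holds n(n-1)/2 squares, so a search that visits more squares than that cannot be enclosed. The function name is_escape_possible is chosen only for illustration.

```python
from collections import deque

GRID_SIZE = 10**6

def is_escape_possible(blocked, source, target):
    blocked_set = {tuple(cell) for cell in blocked}
    n = len(blocked_set)
    # n blocked squares can enclose at most n*(n-1)/2 squares
    limit = n * (n - 1) // 2

    def search(start, goal):
        start, goal = tuple(start), tuple(goal)
        queue = deque([start])
        seen = {start}
        while queue:
            if len(seen) > limit:
                return True  # visited more squares than any enclosure can hold
            x, y = queue.popleft()
            if (x, y) == goal:
                return True
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nxt = (x + dx, y + dy)
                inside = 0 <= nxt[0] < GRID_SIZE and 0 <= nxt[1] < GRID_SIZE
                if inside and nxt not in blocked_set and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False  # search area exhausted without reaching the goal

    # Both ends must be free: either one could be the walled-in square
    return search(source, target) and search(target, source)

example_1 = is_escape_possible([[0, 1], [1, 0]], [0, 0], [0, 2])
example_2 = is_escape_possible([], [0, 0], [999999, 999999])
print(example_1, example_2)
```

Searching from both ends matters because the source may be free while the target is walled in, or the other way around.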

Ordinal Encoding using Scikit-learn

Goal

This post aims to convert a categorical column into numbers for further processing using scikit-learn.

Library

In [1]:
import pandas as pd
import sklearn.preprocessing

Create categorical data

In [2]:
df = pd.DataFrame(data={'type': ['cat', 'dog', 'sheep'], 
                       'weight': [10, 15, 50]})
df
Out[2]:
    type  weight
0    cat      10
1    dog      15
2  sheep      50

Ordinal Encoding

Ordinal encoding replaces each category with a number.

In [3]:
# Instantiate the ordinal encoder class
oe = sklearn.preprocessing.OrdinalEncoder()

# Learn the mapping from categories to the numbers
oe.fit(df.loc[:, ['type']])
Out[3]:
OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
In [4]:
# Apply this ordinal encoder to new data 
oe.transform(pd.DataFrame(['cat'] * 3 + 
                          ['dog'] * 2 + 
                          ['sheep'] * 5))
Out[4]:
array([[0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.]])
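
To see which number stands for which category, the fitted encoder can be inspected. A short sketch, assuming the same three categories as above: categories_ holds the learned categories in encoded order, and inverse_transform maps numbers back to labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'type': ['cat', 'dog', 'sheep']})

oe = OrdinalEncoder()
encoded = oe.fit_transform(df[['type']])

# Position in this list is the number assigned to each category
learned = oe.categories_[0].tolist()

# Map the encoded numbers back to the original labels
decoded = oe.inverse_transform(encoded).ravel().tolist()

print(learned, encoded.ravel().tolist(), decoded)
```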

Create A Sparse Matrix

Goal

This post aims to create a sparse matrix in Python using the following modules:

  • Numpy
  • Scipy

Reference:

  • Scipy Document
  • Chris Albon's blog (I looked at the title of his post and wrote my own content to deepen my understanding of the topic.)

Library

In [8]:
import numpy as np
import scipy.sparse

Create a sparse matrix using csr_matrix

CSR stands for "Compressed Sparse Row" matrix

In [9]:
nrow = 10000
ncol = 10000

# CSR stands for "Compressed Sparse Row" matrix
arr_sparse = scipy.sparse.csr_matrix((nrow, ncol))
arr_sparse
Out[9]:
<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>
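
The matrix above is empty. As a sketch of how stored elements get in, csr_matrix also accepts a (data, (row, col)) tuple; the small 3x3 example and its variable names below are made up for illustration.

```python
import numpy as np
import scipy.sparse

# Coordinates and values of the three non-zero entries
rows = np.array([0, 1, 2])
cols = np.array([0, 2, 1])
vals = np.array([1.0, 2.0, 3.0])

small_sparse = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

# Convert to a dense array only for inspection; avoid this for large matrices
dense = small_sparse.toarray()
print(small_sparse.nnz)
print(dense)
```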