Select Date And Time Ranges

Goal

This post aims to introduce how to select a subset of a pandas DataFrame by specifying a date range.

Libraries

In [2]:
import pandas as pd
import numpy as np

Create a dataframe

In [18]:
date_ranges = pd.date_range('20190101', '20191231', freq='d')
df_rand = pd.DataFrame({'date': date_ranges, 
                       'value': np.random.random(date_ranges.shape[0])})
df_rand.head()
Out[18]:
date value
0 2019-01-01 0.332090
1 2019-01-02 0.690167
2 2019-01-03 0.237744
3 2019-01-04 0.060678
4 2019-01-05 0.572691

Select a range using .between

In [19]:
df_rand.loc[df_rand.date.between('20190201', '20190211'), :]
Out[19]:
date value
31 2019-02-01 0.449901
32 2019-02-02 0.803429
33 2019-02-03 0.299074
34 2019-02-04 0.630970
35 2019-02-05 0.294973
36 2019-02-06 0.510857
37 2019-02-07 0.345567
38 2019-02-08 0.877957
39 2019-02-09 0.990186
40 2019-02-10 0.000186
41 2019-02-11 0.378379
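
Alternatively, if the date column is set as the index, the same range can be selected by slicing with .loc; a minimal sketch:

# Set the date column as a DatetimeIndex, then slice by date strings (inclusive on both ends)
df_indexed = df_rand.set_index('date')
df_indexed.loc['2019-02-01':'2019-02-11']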

Convert Strings To Dates

Goal

This post aims to introduce how to convert strings to dates using pandas.

Libraries

In [1]:
import pandas as pd

Date in string

In [3]:
df_date = pd.DataFrame({'date':['20190101', '20190102', '20190105'], 
                       'temperature': [23.5, 32, 25]})
df_date
Out[3]:
date temperature
0 20190101 23.5
1 20190102 32.0
2 20190105 25.0

Convert strings to date format

In [4]:
pd.to_datetime(df_date['date'])
Out[4]:
0   2019-01-01
1   2019-01-02
2   2019-01-05
Name: date, dtype: datetime64[ns]
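
If the string format is known in advance, passing it explicitly avoids format inference and is usually faster and safer:

# Specify the format explicitly instead of letting pandas infer it
pd.to_datetime(df_date['date'], format='%Y%m%d')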

Tokenize Text

Goal

This post aims to introduce how to tokenize text using nltk.

Libraries

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize
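
Note that the NLTK tokenizers depend on the punkt sentence tokenizer model; if it has not been installed yet, download it once:

import nltk
nltk.download('punkt')  # one-time download of the tokenizer model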

Create a paragraph

In [8]:
paragraph = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"
paragraph
Out[8]:
"Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"

Tokenize a paragraph into sentences

In [9]:
sent_tokenize(paragraph)
Out[9]:
['Python is an interpreted, high-level, general-purpose programming language.',
 "Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
 'Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects']

Tokenize a paragraph into words

In [10]:
word_tokenize(paragraph)
Out[10]:
['Python',
 'is',
 'an',
 'interpreted',
 ',',
 'high-level',
 ',',
 'general-purpose',
 'programming',
 'language',
 '.',
 'Created',
 'by',
 'Guido',
 'van',
 'Rossum',
 'and',
 'first',
 'released',
 'in',
 '1991',
 ',',
 'Python',
 "'s",
 'design',
 'philosophy',
 'emphasizes',
 'code',
 'readability',
 'with',
 'its',
 'notable',
 'use',
 'of',
 'significant',
 'whitespace',
 '.',
 'Its',
 'language',
 'constructs',
 'and',
 'object-oriented',
 'approach',
 'aims',
 'to',
 'help',
 'programmers',
 'write',
 'clear',
 ',',
 'logical',
 'code',
 'for',
 'small',
 'and',
 'large-scale',
 'projects']

Remove Punctuation

Goal

This post aims to introduce how to remove punctuation using the string module.

Libraries

In [9]:
import string

Create a document

In [10]:
documents = ["this isn't a sample.", 
            'this is another example.' ,
            'this" also appears in the second example.'
            'Is this an example?']

documents
Out[10]:
["this isn't a sample.",
 'this is another example.',
 'this" also appears in the second example.Is this an example?']

Remove Punctuation

In [11]:
# build a translation table that maps each punctuation character to None
table = str.maketrans('', '', string.punctuation)
# apply the table to remove punctuation from each document
doc_removed_punctuation = [w.translate(table) for w in documents]
doc_removed_punctuation
Out[11]:
['this isnt a sample',
 'this is another example',
 'this also appears in the second exampleIs this an example']

Bag Of Words

Goal

This post aims to introduce bag of words, which can be used as features for each document or image.

Simply put, a bag of words is the frequency of each word, with an index assigned to each word.

Libraries

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Create a document

In [18]:
documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents
Out[18]:
['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Create a count vector

In [19]:
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(documents)

df_word_counts = pd.DataFrame(word_counts.todense(), columns=count_vect.get_feature_names())
df_word_counts
Out[19]:
also another appears example in is sample second the this
0 0 0 0 0 0 1 1 0 0 1
1 1 1 1 2 1 1 0 1 1 2
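
The index assigned to each word can be inspected through the fitted vectorizer:

# Mapping from each word to its column index in the count matrix
count_vect.vocabulary_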

Create a frequency vector

In [20]:
tf_transformer = TfidfTransformer(use_idf=False).fit(df_word_counts)
word_freq = tf_transformer.transform(df_word_counts)
df_word_freq = pd.DataFrame(word_freq.todense(), columns=count_vect.get_feature_names())
df_word_freq
Out[20]:
also another appears example in is sample second the this
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.577350 0.57735 0.000000 0.000000 0.577350
1 0.258199 0.258199 0.258199 0.516398 0.258199 0.258199 0.00000 0.258199 0.258199 0.516398
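
With use_idf=False, the transformer only normalizes each row of counts (by default with the L2 norm), which can be verified by hand; a minimal check for the first document:

import numpy as np

# Raw counts of the first document divided by their L2 norm
counts = df_word_counts.values.astype(float)
counts[0] / np.linalg.norm(counts[0])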

Term Frequency Inverse Document Frequency

Goal

This post aims to introduce term frequency-inverse document frequency, also known as TF-IDF, which indicates the importance of words in a document while accounting for their frequency across multiple documents, and which is used for feature creation.

Term Frequency (TF)

Term Frequency can be computed as the number of occurrences of the word, $n_{word}$, divided by the total number of words in a document, $N_{word}$.

\begin{equation*} TF = \frac{n_{word}}{N_{word}} \end{equation*}

Document Frequency (DF)

Document Frequency can be computed as the number of documents containing the word, $n_{doc}$, divided by the total number of documents, $N_{doc}$.

\begin{equation*} DF = \frac{n_{doc}}{N_{doc}} \end{equation*}

Inverse Document Frequency (IDF)

The inverse document frequency is the inverse of DF.

\begin{equation*} IDF = \frac{N_{doc}}{n_{doc}} \end{equation*}

In practice, to avoid the explosion of IDF and division by zero, IDF can be computed in log form with 1 added to the denominator, as below.

\begin{equation*} IDF = \log \left( \frac{N_{doc}}{n_{doc}+1} \right) \end{equation*}
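
To make the formulas concrete, here is a minimal sketch computing TF and IDF by hand for the word 'this' in two toy documents; note that scikit-learn's TfidfVectorizer, used below, applies a slightly different smoothed formula, so its values will not match these exactly:

import numpy as np

docs = [['this', 'is', 'a', 'sample'],
        ['this', 'is', 'another', 'example', 'this']]

word = 'this'
tf = docs[1].count(word) / len(docs[1])        # TF in the second document: 2 / 5 = 0.4
n_containing = sum(word in d for d in docs)    # 2 documents contain 'this'
idf = np.log(len(docs) / (n_containing + 1))   # IDF in log form: log(2 / 3)
tfidf = tf * idf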

Libraries

In [18]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Create a document

In [36]:
documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents
Out[36]:
['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Apply TF-IDF vectorizer

Applying TF-IDF to each document, we obtain a feature vector for each one.

In [37]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
df_tfidf = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names())
df_tfidf
Out[37]:
also another appears example in is sample second the this
0 0.00000 0.00000 0.00000 0.00000 0.00000 0.501549 0.704909 0.00000 0.00000 0.501549
1 0.28249 0.28249 0.28249 0.56498 0.28249 0.200994 0.000000 0.28249 0.28249 0.401988
In [38]:
# The highest TF-IDF for each document
df_tfidf.idxmax(axis=1)
Out[38]:
0     sample
1    example
dtype: object
In [39]:
# TF-IDF is zero if the word does not appear in a document
df_tfidf==0
Out[39]:
also another appears example in is sample second the this
0 True True True True True False False True True False
1 False False False False False False True False False False

Survival Analysis

Goal

This post aims to introduce how to do survival analysis using lifelines. In this post, I use fellowship information from 200 Words a day to see what the survival curve looks like, which might be useful for user retention.

200 Words a day is a platform where those who want to build a writing habit write posts of more than 200 words. It has a feature that shows how many consecutive days each user has been posting, as an X-day streak.
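
Since the fellowship data itself is not included in this excerpt, the sketch below fits a Kaplan-Meier survival curve with lifelines on made-up streak data; the durations and churn flags are illustrative assumptions only:

import pandas as pd
from lifelines import KaplanMeierFitter

# Hypothetical data: streak length in days and whether the streak ended (1) or is ongoing (0)
df_streak = pd.DataFrame({'duration': [3, 10, 24, 7, 1, 15],
                          'ended': [1, 1, 0, 1, 1, 0]})

kmf = KaplanMeierFitter()
kmf.fit(df_streak['duration'], event_observed=df_streak['ended'])
kmf.survival_function_.head()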

Deleting Missing Values

Goal

This post aims to introduce how to delete missing values using pandas in Python.

Libraries

In [3]:
import pandas as pd
import numpy as np

Create DataFrame

In [13]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df
Out[13]:
A B C D
0 -1.111902 1.095301 0.140572 0.541279
1 1.197394 0.173438 -0.369171 0.861130
2 1.472260 2.063012 -1.214586 -1.709280
3 -2.990860 -0.315950 -0.521123 -0.889226
4 -0.148088 0.891630 -0.422730 -0.095359
5 0.297797 -0.617062 -0.144902 -1.628348
In [14]:
# create missing values
df.loc[3, 'B'] = None
df.loc[4, 'D'] = None
df
Out[14]:
A B C D
0 -1.111902 1.095301 0.140572 0.541279
1 1.197394 0.173438 -0.369171 0.861130
2 1.472260 2.063012 -1.214586 -1.709280
3 -2.990860 NaN -0.521123 -0.889226
4 -0.148088 0.891630 -0.422730 NaN
5 0.297797 -0.617062 -0.144902 -1.628348

Deleting Missing Values

In [15]:
# identify missing values with isna
df.isna()
Out[15]:
A B C D
0 False False False False
1 False False False False
2 False False False False
3 False True False False
4 False False False True
5 False False False False
In [21]:
df.isna().any(axis=1)
Out[21]:
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool
In [23]:
# Deleting the rows containing NaN
df.loc[~df.isna().any(axis=1), :]
Out[23]:
A B C D
0 -1.111902 1.095301 0.140572 0.541279
1 1.197394 0.173438 -0.369171 0.861130
2 1.472260 2.063012 -1.214586 -1.709280
5 0.297797 -0.617062 -0.144902 -1.628348
In [24]:
# Deleting the columns containing NaN
df.loc[:, ~df.isna().any(axis=0)]
Out[24]:
A C
0 -1.111902 0.140572
1 1.197394 -0.369171
2 1.472260 -1.214586
3 -2.990860 -0.521123
4 -0.148088 -0.422730
5 0.297797 -0.144902
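
The same results can be obtained more directly with dropna:

# Drop rows containing NaN (same as the row mask above)
df.dropna(axis=0)

# Drop columns containing NaN (same as the column mask above)
df.dropna(axis=1)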

Parse HTML

Goal

This post aims to introduce how to parse HTML data with BeautifulSoup.

Library

In [12]:
from bs4 import BeautifulSoup
import requests

Simple HTML from string

In [24]:
html_simple = '<h1>This is Title</h1>'
html_simple
Out[24]:
'<h1>This is Title</h1>'
In [25]:
soup = BeautifulSoup(html_simple, 'html.parser')  # specify a parser to avoid a warning
In [26]:
soup.text
Out[26]:
'This is Title'
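
Since requests is imported above, a page can also be fetched and parsed the same way; a minimal sketch, with a placeholder URL:

# Fetch a page and extract the text of its first <h1> (URL is a placeholder)
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
soup.find('h1').text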

Calculate The Trace Of A Matrix

Goal

This post aims to show how to calculate the trace of a matrix using numpy, i.e., $tr(A)$.

$tr(A)$ is defined as

$$ tr(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn} $$

Libraries

In [1]:
import numpy as np

Create a matrix

In [2]:
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr
Out[2]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Calculate the trace

In [4]:
arr.trace()
Out[4]:
15
In [7]:
sum([arr[i, i] for i in range(len(arr))])
Out[7]:
15
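
Equivalently, numpy offers a function form, and the sum over the diagonal makes the definition above explicit:

# Equivalent ways to compute the trace
np.trace(arr)           # function form
arr.diagonal().sum()    # sum of a_ii, matching the definition above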