Select Date And Time Ranges

Goal

This post aims to introduce how to select a subset of a pandas DataFrame by specifying a date range.

Libraries

In [2]:
import pandas as pd
import numpy as np

Create a dataframe

In [18]:
date_ranges = pd.date_range('20190101', '20191231', freq='d')
df_rand = pd.DataFrame({'date': date_ranges, 
                       'value': np.random.random(date_ranges.shape[0])})
df_rand.head()
Out[18]:
date value
0 2019-01-01 0.332090
1 2019-01-02 0.690167
2 2019-01-03 0.237744
3 2019-01-04 0.060678
4 2019-01-05 0.572691

Select a range using .between

In [19]:
df_rand.loc[df_rand.date.between('20190201', '20190211'), :]
Out[19]:
date value
31 2019-02-01 0.449901
32 2019-02-02 0.803429
33 2019-02-03 0.299074
34 2019-02-04 0.630970
35 2019-02-05 0.294973
36 2019-02-06 0.510857
37 2019-02-07 0.345567
38 2019-02-08 0.877957
39 2019-02-09 0.990186
40 2019-02-10 0.000186
41 2019-02-11 0.378379
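
Alternatively, if the date column is set as the index, the same range can be selected by slicing with .loc; a minimal sketch:

# Set the date column as a DatetimeIndex, then slice by date strings (inclusive on both ends)
df_indexed = df_rand.set_index('date')
df_indexed.loc['2019-02-01':'2019-02-11']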

Convert Strings To Dates

Goal

This post aims to introduce how to convert strings to dates using pandas.

Libraries

In [1]:
import pandas as pd

Date in string

In [3]:
df_date = pd.DataFrame({'date':['20190101', '20190102', '20190105'], 
                       'temperature': [23.5, 32, 25]})
df_date
Out[3]:
date temperature
0 20190101 23.5
1 20190102 32.0
2 20190105 25.0

Convert strings to date format

In [4]:
pd.to_datetime(df_date['date'])
Out[4]:
0   2019-01-01
1   2019-01-02
2   2019-01-05
Name: date, dtype: datetime64[ns]
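
If the string format is known in advance, passing it explicitly avoids format inference and is usually faster and safer:

# Specify the format explicitly instead of letting pandas infer it
pd.to_datetime(df_date['date'], format='%Y%m%d')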

Tokenize Text

Goal

This post aims to introduce how to tokenize text using nltk.

Libraries

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize
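
Note that the NLTK tokenizers depend on the punkt sentence tokenizer model; if it has not been installed yet, download it once:

import nltk
nltk.download('punkt')  # one-time download of the tokenizer model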

Create a paragraph

In [8]:
paragraph = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"
paragraph
Out[8]:
"Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"

Tokenize a paragraph into sentences

In [9]:
sent_tokenize(paragraph)
Out[9]:
['Python is an interpreted, high-level, general-purpose programming language.',
 "Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
 'Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects']

Tokenize a paragraph into words

In [10]:
word_tokenize(paragraph)
Out[10]:
['Python',
 'is',
 'an',
 'interpreted',
 ',',
 'high-level',
 ',',
 'general-purpose',
 'programming',
 'language',
 '.',
 'Created',
 'by',
 'Guido',
 'van',
 'Rossum',
 'and',
 'first',
 'released',
 'in',
 '1991',
 ',',
 'Python',
 "'s",
 'design',
 'philosophy',
 'emphasizes',
 'code',
 'readability',
 'with',
 'its',
 'notable',
 'use',
 'of',
 'significant',
 'whitespace',
 '.',
 'Its',
 'language',
 'constructs',
 'and',
 'object-oriented',
 'approach',
 'aims',
 'to',
 'help',
 'programmers',
 'write',
 'clear',
 ',',
 'logical',
 'code',
 'for',
 'small',
 'and',
 'large-scale',
 'projects']

Remove Punctuation

Goal

This post aims to introduce how to remove punctuation using the string module.

Libraries

In [9]:
import string

Create a document

In [10]:
documents = ["this isn't a sample.", 
            'this is another example.' ,
            'this" also appears in the second example.'
            'Is this an example?']

documents
Out[10]:
["this isn't a sample.",
 'this is another example.',
 'this" also appears in the second example.Is this an example?']

Remove Punctuation

In [11]:
# build a translation table that maps each punctuation character to None
table = str.maketrans('', '', string.punctuation)
# apply the table to remove punctuation from each document
doc_removed_punctuation = [w.translate(table) for w in documents]
doc_removed_punctuation
Out[11]:
['this isnt a sample',
 'this is another example',
 'this also appears in the second exampleIs this an example']

Bag Of Words

Goal

This post aims to introduce bag of words, which can be used as features for each document or image.

Simply put, a bag of words is the frequency of each word, with an index assigned to each word.

Libraries

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Create a document

In [18]:
documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents
Out[18]:
['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Create a count vector

In [19]:
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(documents)

df_word_counts = pd.DataFrame(word_counts.todense(), columns=count_vect.get_feature_names())
df_word_counts
Out[19]:
also another appears example in is sample second the this
0 0 0 0 0 0 1 1 0 0 1
1 1 1 1 2 1 1 0 1 1 2
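
The index assigned to each word can be inspected through the fitted vectorizer:

# Mapping from each word to its column index in the count matrix
count_vect.vocabulary_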

Create a frequency vector

In [20]:
tf_transformer = TfidfTransformer(use_idf=False).fit(df_word_counts)
word_freq = tf_transformer.transform(df_word_counts)
df_word_freq = pd.DataFrame(word_freq.todense(), columns=count_vect.get_feature_names())
df_word_freq
Out[20]:
also another appears example in is sample second the this
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.577350 0.57735 0.000000 0.000000 0.577350
1 0.258199 0.258199 0.258199 0.516398 0.258199 0.258199 0.00000 0.258199 0.258199 0.516398
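
With use_idf=False, the transformer only normalizes each row of counts (by default with the L2 norm), which can be verified by hand; a minimal check for the first document:

import numpy as np

# Raw counts of the first document divided by their L2 norm
counts = df_word_counts.values.astype(float)
counts[0] / np.linalg.norm(counts[0])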

Term Frequency Inverse Document Frequency

Goal

This post aims to introduce term frequency-inverse document frequency, also known as TF-IDF, which indicates the importance of words in a document while accounting for their frequency across multiple documents, and which is used for feature creation.

Term Frequency (TF)

Term Frequency can be computed as the number of occurrences of the word, $n_{word}$, divided by the total number of words in a document, $N_{word}$.

\begin{equation*} TF = \frac{n_{word}}{N_{word}} \end{equation*}

Document Frequency (DF)

Document Frequency can be computed as the number of documents containing the word, $n_{doc}$, divided by the total number of documents, $N_{doc}$.

\begin{equation*} DF = \frac{n_{doc}}{N_{doc}} \end{equation*}

Inverse Document Frequency (IDF)

The inverse document frequency is the inverse of DF.

\begin{equation*} IDF = \frac{N_{doc}}{n_{doc}} \end{equation*}

In practice, to avoid the explosion of IDF and division by zero, IDF can be computed in log form with 1 added to the denominator, as below.

\begin{equation*} IDF = \log \left( \frac{N_{doc}}{n_{doc}+1} \right) \end{equation*}
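
To make the formulas concrete, here is a minimal sketch computing TF and IDF by hand for the word 'this' in two toy documents; note that scikit-learn's TfidfVectorizer, used below, applies a slightly different smoothed formula, so its values will not match these exactly:

import numpy as np

docs = [['this', 'is', 'a', 'sample'],
        ['this', 'is', 'another', 'example', 'this']]

word = 'this'
tf = docs[1].count(word) / len(docs[1])        # TF in the second document: 2 / 5 = 0.4
n_containing = sum(word in d for d in docs)    # 2 documents contain 'this'
idf = np.log(len(docs) / (n_containing + 1))   # IDF in log form: log(2 / 3)
tfidf = tf * idf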

Libraries

In [18]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Create a document

In [36]:
documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents
Out[36]:
['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Apply TF-IDF vectorizer

Applying TF-IDF to each document, we obtain a feature vector for each one.

In [37]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
df_tfidf = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names())
df_tfidf
Out[37]:
also another appears example in is sample second the this
0 0.00000 0.00000 0.00000 0.00000 0.00000 0.501549 0.704909 0.00000 0.00000 0.501549
1 0.28249 0.28249 0.28249 0.56498 0.28249 0.200994 0.000000 0.28249 0.28249 0.401988
In [38]:
# The highest TF-IDF for each document
df_tfidf.idxmax(axis=1)
Out[38]:
0     sample
1    example
dtype: object
In [39]:
# TF-IDF is zero if the word does not appear in a document
df_tfidf==0
Out[39]:
also another appears example in is sample second the this
0 True True True True True False False True True False
1 False False False False False False True False False False

Survival Analysis

Goal

This post aims to introduce how to do survival analysis using lifelines. In this post, I use fellowship information from 200 Words a day to see what the survival curve looks like, which might be useful for user retention.

200 Words a day is a platform where those who want to build a writing habit write posts of more than 200 words. It has a feature that shows how many consecutive days each user has been posting, as an X-day streak.
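
Since the fellowship data itself is not included in this excerpt, the sketch below fits a Kaplan-Meier survival curve with lifelines on made-up streak data; the durations and churn flags are illustrative assumptions only:

import pandas as pd
from lifelines import KaplanMeierFitter

# Hypothetical data: streak length in days and whether the streak ended (1) or is ongoing (0)
df_streak = pd.DataFrame({'duration': [3, 10, 24, 7, 1, 15],
                          'ended': [1, 1, 0, 1, 1, 0]})

kmf = KaplanMeierFitter()
kmf.fit(df_streak['duration'], event_observed=df_streak['ended'])
kmf.survival_function_.head()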

Deleting Missing Values

Goal

This post aims to introduce how to delete missing values using pandas in Python.

Libraries

In [3]:
import pandas as pd
import numpy as np

Create DataFrame

In [13]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df
Out[13]:
A B C D
0 -1.111902 1.095301 0.140572 0.541279
1 1.197394 0.173438 -0.369171 0.861130
2 1.472260 2.063012 -1.214586 -1.709280
3 -2.990860 -0.315950 -0.521123 -0.889226
4 -0.148088 0.891630 -0.422730 -0.095359
5 0.297797 -0.617062 -0.144902 -1.628348
In [14]:
# create missing values
df.loc[3, 'B'] = None
df.loc[4, 'D'] = None
df
Out[14]:
A B C D
0 -1.111902 1.095301 0.140572 0.541279
1 1.197394 0.173438 -0.369171 0.861130
2 1.472260 2.063012 -1.214586 -1.709280
3 -2.990860 NaN -0.521123 -0.889226
4 -0.148088 0.891630 -0.422730 NaN
5 0.297797 -0.617062 -0.144902 -1.628348

Deleting Missing Values

In [15]:
# identify missing values with isna
df.isna()
Out[15]:
A B C D
0 False False False False
1 False False False False
2 False False False False
3 False True False False
4 False False False True
5 False False False False
In [21]:
df.isna().any(axis=1)
Out[21]:
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool
In [23]:
# Deleting the rows containing NaN
df.loc[~df.isna().any(axis=1), :]
Out[23]:
A B C D
0 -1.111902 1.095301 0.140572 0.541279
1 1.197394 0.173438 -0.369171 0.861130
2 1.472260 2.063012 -1.214586 -1.709280
5 0.297797 -0.617062 -0.144902 -1.628348
In [24]:
# Deleting the columns containing NaN
df.loc[:, ~df.isna().any(axis=0)]
Out[24]:
A C
0 -1.111902 0.140572
1 1.197394 -0.369171
2 1.472260 -1.214586
3 -2.990860 -0.521123
4 -0.148088 -0.422730
5 0.297797 -0.144902
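
The same results can be obtained more directly with dropna:

# Drop rows containing NaN (same as the row mask above)
df.dropna(axis=0)

# Drop columns containing NaN (same as the column mask above)
df.dropna(axis=1)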

Parse HTML

Goal

This post aims to introduce how to parse HTML data with BeautifulSoup.

Library

In [12]:
from bs4 import BeautifulSoup
import requests

Simple HTML from string

In [24]:
html_simple = '<h1>This is Title</h1>'
html_simple
Out[24]:
'<h1>This is Title</h1>'
In [25]:
soup = BeautifulSoup(html_simple, 'html.parser')  # specify a parser to avoid a warning
In [26]:
soup.text
Out[26]:
'This is Title'
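
Since requests is imported above, a page can also be fetched and parsed the same way; a minimal sketch, with a placeholder URL:

# Fetch a page and extract the text of its first <h1> (URL is a placeholder)
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
soup.find('h1').text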

Calculate The Trace Of A Matrix

Goal

This post aims to show how to calculate the trace of a matrix using numpy, i.e., $tr(A)$.

$tr(A)$ is defined as

$$ tr(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn} $$

Libraries

In [1]:
import numpy as np

Create a matrix

In [2]:
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr
Out[2]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Calculate the trace

In [4]:
arr.trace()
Out[4]:
15
In [7]:
sum([arr[i, i] for i in range(len(arr))])
Out[7]:
15
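
Equivalently, numpy offers a function form, and the sum over the diagonal makes the definition above explicit:

# Equivalent ways to compute the trace
np.trace(arr)           # function form
arr.diagonal().sum()    # sum of a_ii, matching the definition above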