Survival Analysis
Goal
This post introduces survival analysis with lifelines. I use the fellowship data from 200 Words a day to see what the survival curve of posting streaks looks like, which might be useful for analyzing user retention.
200 Words a day is a platform where people who want to build a writing habit publish a post of more than 200 words. It shows how many consecutive days each user has posted as an X-day streak.
Libraries
In [19]:
from lifelines import KaplanMeierFitter, LogNormalFitter
from bs4 import BeautifulSoup
import requests
import pandas as pd
%matplotlib inline
Fetch "fellowship" and "streak" data
See the post on parsing HTML if you are not familiar with it.
Note that this code fetches only the first page (the first 50 users), so the sample is biased.
In [4]:
url_fellowship = 'https://200wordsaday.com/fellowship'
response = requests.get(url_fellowship)
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
In [45]:
# List of users
users = []
badges = []
for h2 in soup.find_all('h2'):
    users.append(h2.text.strip().replace('@', '').replace('PATRON', ''))
    if h2.find('span') is not None:
        badges.append(h2.find('span').text)
    else:
        badges.append(None)
df_user = pd.DataFrame({'username': users, 'badge': badges})

# List of streak/posts/pages
attributes = []
for att in soup.find_all('p')[1:-11]:
    att_row = []
    for e in att.find_all('small'):
        # First character is the icon, the rest is the number
        att_row.extend([e.text[0], e.text[1:]])
    attributes.append(att_row)
df_attributes = pd.DataFrame(attributes).drop([2, 4], axis=1)
df_attributes.columns = ['streak', 'nstreak', 'posts', 'pages']
In [46]:
# Combine usernames and badges (PATRON or not) with the streak/posts/pages attributes
df_user = pd.concat([df_user, df_attributes], axis=1, ignore_index=False)
df_user.head()
Out[46]:
In [48]:
# Data cleansing: replace '' with 1 so the column can be cast to int
df_user['nstreak'] = df_user['nstreak'].replace('', 1).astype(int)
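To make the parsing and cleansing steps concrete, here is a minimal sketch of how one label is split into a marker and a count. The helper name `split_streak_label` and the sample labels are my own assumptions for illustration; the only parts taken from the notebook are the first-character split and the rule that an empty count becomes 1.

```python
def split_streak_label(label):
    """Split a label such as '🔥12' into (marker, count).

    Assumptions: the first character is the marker and the rest is the
    count, mirroring e.text[0] / e.text[1:] in the scraping loop.
    An empty count is treated as 1, as in the cleansing step.
    """
    if not label:
        return None, 1
    marker, digits = label[0], label[1:].strip()
    count = int(digits) if digits else 1
    return marker, count


# Hypothetical labels, not real scraped values
print(split_streak_label('🔥12'))
print(split_streak_label('🔥'))
```

Doing the conversion per label makes the "empty means 1" rule explicit instead of hiding it in a column-wide replace.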
Descriptive Analysis for Users
In [44]:
# Distribution of streak
df_user['nstreak'].plot.hist(bins=20);
In [43]:
# Distribution of Patron
df_user['badge'].fillna('Non-PATRON').value_counts().plot(kind='bar');
Simple Survival Curve - Kaplan-Meier Curve
In [50]:
kmf = KaplanMeierFitter()
kmf.fit(df_user['nstreak'], event_observed=1*(df_user['streak']=='🔥'))  # durations = streak length; event flag = 1 where the marker is '🔥'
kmf.plot(title='Survival Curve for 200WaD Streak based on top 50 users');
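To show what the Kaplan-Meier curve actually computes, here is a minimal pure-Python sketch of the estimator. It is not how lifelines implements it, and the function name `kaplan_meier` and the toy data are assumptions for illustration only. At each time with at least one recorded event, the running survival probability is multiplied by (1 - events / users still at risk); censored users only shrink the at-risk count.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] for each distinct time with at least one event."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Events recorded exactly at time t
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        # Everyone whose duration ends at t (event or censored) leaves the risk set
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical toy data: streak lengths in days; 1 = event recorded, 0 = censored
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(durations, events):
    print(t, round(s, 3))
```

The `event_observed` argument in the notebook plays the role of `events` here, so which users count as censored directly changes the shape of the curve.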
In [14]:
lnf = LogNormalFitter()
lnf.fit(df_user['nstreak'], event_observed=1*(df_user['streak']=='🔥'))  # same durations and event flags as the Kaplan-Meier fit
lnf.plot_survival_function();
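Unlike Kaplan-Meier, the log-normal fit is parametric: it assumes the log of the streak length is normally distributed, so the survival function is S(t) = 1 - Φ((ln t - μ) / σ). The sketch below evaluates that formula with the standard library only. The values of `mu` and `sigma` are made up for illustration; as far as I know, lifelines exposes the fitted values as `lnf.mu_` and `lnf.sigma_`, but treat that as an assumption to check against your installed version.

```python
import math


def lognormal_survival(t, mu, sigma):
    """S(t) = 1 - Phi((ln t - mu) / sigma) for t > 0."""
    z = (math.log(t) - mu) / sigma
    # 1 - Phi(z) written with the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical parameters: median streak of 30 days, sigma of 1 on the log scale
mu, sigma = math.log(30), 1.0
for days in (7, 30, 100):
    print(days, round(lognormal_survival(days, mu, sigma), 3))
```

The median of the fitted distribution is exp(μ), which is why the survival probability at 30 days is one half in this toy setup.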
Survival curve by group
Team Streak (streak of 30 days or more)
In [15]:
idx_teamstreak = df_user['nstreak'] >= 30
# Team Streak
kmf = KaplanMeierFitter()
kmf.fit(df_user.loc[idx_teamstreak, 'nstreak'],
        event_observed=1*(df_user.loc[idx_teamstreak, 'streak']=='🔥'), label='team streak')
ax = kmf.plot();
# Non team streak
kmf.fit(df_user.loc[~idx_teamstreak, 'nstreak'],
        event_observed=1*(df_user.loc[~idx_teamstreak, 'streak']=='🔥'), label='non team streak')
kmf.plot(ax=ax);
Patron
In [16]:
idx_patron = df_user['badge'] == 'PATRON'
# Patron
kmf = KaplanMeierFitter()
kmf.fit(df_user.loc[idx_patron, 'nstreak'],
        event_observed=1*(df_user.loc[idx_patron, 'streak']=='🔥'), label='PATRON')
ax = kmf.plot();
# Non-patron
kmf.fit(df_user.loc[~idx_patron, 'nstreak'],
        event_observed=1*(df_user.loc[~idx_patron, 'streak']=='🔥'), label='Non-PATRON')
kmf.plot(ax=ax);
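Plotting two curves on the same axes only gives a visual comparison. One common way to check whether two groups differ is the log-rank test, which I did not run in this notebook. The sketch below is a hand-rolled two-group version under my own assumptions: the function name `logrank_two_groups` and the toy data are hypothetical, and the inputs are plain Python lists. lifelines also ships `lifelines.statistics.logrank_test`, which is what I would reach for in practice.

```python
import math


def logrank_two_groups(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    durations_a, durations_b = list(durations_a), list(durations_b)
    events_a, events_b = list(events_a), list(events_b)
    pooled = zip(durations_a + durations_b, events_a + events_b)
    event_times = sorted({t for t, e in pooled if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Users still at risk just before time t in each group
        n_a = sum(1 for d in durations_a if d >= t)
        n_b = sum(1 for d in durations_b if d >= t)
        # Events recorded exactly at time t in each group
        d_a = sum(1 for d, e in zip(durations_a, events_a) if d == t and e)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if d == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n ** 2 * (n - 1))

    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical toy groups, all events recorded
short = [1, 2, 3, 4, 5]
long_ = [10, 11, 12, 13, 14]
chi2, p = logrank_two_groups(short, [1] * 5, long_, [1] * 5)
print(round(chi2, 2), p < 0.05)
```

With only 50 users split into two groups, any such test has little power, so a non-significant result here would not show that the groups behave the same.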