Survival Analysis

Goal

This post introduces survival analysis with lifelines. I use the fellowship data from 200 Words a day to see what the survival curve of posting streaks looks like, which might be useful for analyzing user retention.

200 Words a day is a platform where people who want to build a writing habit publish posts of more than 200 words. It shows how many consecutive days each user has posted as an X-day streak.

Libraries

In [19]:
from lifelines import KaplanMeierFitter, LogNormalFitter
from bs4 import BeautifulSoup
import requests
import pandas as pd
%matplotlib inline

Fetch "fellowship" and "streak" data

If you are not familiar with parsing HTML, please have a look at "parse HTML" first.

Note that this code fetches only the first page (the first 50 users), so the sample is biased toward the top users.

In [4]:
url_fellowship = 'https://200wordsaday.com/fellowship'
response = requests.get(url_fellowship)
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning
In [45]:
# List of users
users = []
badges = []
for h2 in soup.find_all('h2'):
    users.append(h2.text.strip().replace('@', '').replace('PATRON', ''))
    if h2.find('span') is not None:
        badges.append(h2.find('span').text)
    else:
        badges.append(None)
    
df_user = pd.DataFrame({'username': users, 'badge': badges})

# List of streak/posts/pages
attributes = []

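# Drop the first and the last 11 <p> elements; the rest are assumed to hold one user's stats each
for att in soup.find_all('p')[1:-11]:
    att_row = []
    for e in att.find_all('small'):
        # Each <small> text is an icon (first character) followed by a count
        att_row.extend([e.text[0], e.text[1:]])
    
    attributes.append(att_row)    

# Columns 2 and 4 hold the icons for posts and pages; keep only their counts
df_attributes = pd.DataFrame(attributes).drop([2, 4], axis=1)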
df_attributes.columns = ['streak', 'nstreak', 'posts', 'pages']
In [46]:
# Combine usernames, badges (PATRON or not), and streak attributes
df_user = pd.concat([df_user, df_attributes], axis=1, ignore_index=False)
df_user.head()
Out[46]:
username badge streak nstreak posts pages
0 jasonleow None 🔥 170 207 327
1 basilesamel PATRON 🔥 185 202 335
2 brianball PATRON 🔥 49 194 213
3 valentino None 🔥 185 185 376
4 vickenstein None 🔥 175 177 183
In [48]:
# Data cleansing: replace an empty streak count '' with 1
df_user['nstreak'] = df_user['nstreak'].replace('', 1).astype(int)
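
The parsing and cleansing steps above assume that each scraped text starts with a one-character icon followed by an optional count. A minimal sketch of that logic on made-up strings (the helper `split_icon_and_count` and the sample inputs are hypothetical and not part of the original notebook):

```python
def split_icon_and_count(text, default=1):
    """Split a string such as '🔥170' into its leading icon and an integer count.

    An empty count falls back to `default`, mirroring replace('', 1) above.
    """
    icon, digits = text[0], text[1:].strip()
    return icon, int(digits) if digits else default


# Made-up inputs for illustration only; not fetched from the site
samples = ["🔥170", "🔥"]
parsed = [split_icon_and_count(s) for s in samples]
print(parsed)
```

Indexing with `text[0]` works here because 🔥 is a single Unicode code point; an icon made of several code points would need a different split.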

Descriptive Analysis for Users

In [44]:
# Distribution of streak
df_user['nstreak'].plot.hist(bins=20);
In [43]:
# Distribution of Patron
df_user['badge'].fillna('Non-PATRON').value_counts().plot(kind='bar');

Simple Survival Curve - Kaplan Meier Curve

In [50]:
kmf = KaplanMeierFitter()
kmf.fit(df_user['nstreak'], event_observed=1*(df_user['streak']=='🔥'))  # durations T, event indicator E
kmf.plot(title='Survival Curve for 200WaD Streak based on top 50 users');
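
For intuition about what `KaplanMeierFitter` estimates: at each time where an event is observed, the survival probability is multiplied by (1 - events / number still at risk). A minimal pure-Python sketch of this product-limit estimator follows. The function name `kaplan_meier` and the toy data are assumptions for illustration, and the sketch omits the confidence intervals that lifelines draws. An event flag of 1 means the end of the streak was observed and 0 means the observation is censored; if 🔥 marks a streak that is still running, such users would normally be the censored ones.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    # Count observed events per time; censored rows (event == 0) add no deaths
    deaths = Counter(t for t, e in zip(durations, events) if e)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Everyone whose duration is at least t is still at risk just before t
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Toy data: five streak lengths in days, two of them censored
durations = [5, 10, 10, 20, 30]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
print([(t, round(s, 3)) for t, s in curve])
# → [(5, 0.8), (10, 0.6), (20, 0.3)]
```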
In [14]:
lnf = LogNormalFitter()
lnf.fit(df_user['nstreak'], event_observed=1*(df_user['streak']=='🔥'))  # same T and E, parametric log-normal fit
lnf.plot_survival_function();
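
Unlike the Kaplan-Meier step function, a log-normal fit gives a smooth parametric curve, S(t) = 1 - Φ((ln t - μ) / σ), where Φ is the standard normal CDF. The sketch below evaluates that formula directly. The values of `mu` and `sigma` are made up for illustration and are not fitted to the 200WaD data.

```python
import math


def lognormal_survival(t, mu, sigma):
    """Survival function of a log-normal model: 1 - Phi((ln t - mu) / sigma), t > 0."""
    z = (math.log(t) - mu) / sigma
    # 1 - Phi(z) equals 0.5 * erfc(z / sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical parameters, chosen only to show the shape of the curve
mu, sigma = 4.0, 1.2
for days in (10, 30, 100, 200):
    print(days, round(lognormal_survival(days, mu, sigma), 3))
```

The median survival time under this model is exp(μ), the point where S(t) crosses 0.5.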

Survival curve by group

Team Streak (streak of 30 days or more)

In [15]:
idx_teamstreak = df_user['nstreak'] >= 30

# Team Streak
kmf = KaplanMeierFitter()
kmf.fit(df_user.loc[idx_teamstreak, 'nstreak'], 
        event_observed=1*(df_user.loc[idx_teamstreak, 'streak']=='🔥'), label='team streak') 
ax = kmf.plot();

# Non team streak
kmf.fit(df_user.loc[~idx_teamstreak, 'nstreak'], 
        event_observed=1*(df_user.loc[~idx_teamstreak, 'streak']=='🔥'), label='non team streak')  
kmf.plot(ax=ax);

Patron

In [16]:
idx_patron = df_user['badge'] == 'PATRON'


# PATRON
kmf = KaplanMeierFitter()
kmf.fit(df_user.loc[idx_patron, 'nstreak'], 
        event_observed=1*(df_user.loc[idx_patron, 'streak']=='🔥'), label='PATRON') 
ax = kmf.plot();

# Non-PATRON
kmf.fit(df_user.loc[~idx_patron, 'nstreak'], 
        event_observed=1*(df_user.loc[~idx_patron, 'streak']=='🔥'), label='Non-PATRON')  
kmf.plot(ax=ax);
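
Comparing two curves by eye can be misleading with only 50 users, so a log-rank test is a common follow-up. lifelines provides one as `lifelines.statistics.logrank_test`; the pure-Python sketch below shows the underlying calculation for two groups on made-up data. The function name `logrank_chi2` and the toy durations are assumptions, and the sketch does not guard against a zero variance (for example, when one group is empty).

```python
import math


def logrank_chi2(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value), 1 degree of freedom."""
    pooled = ([(t, e, 0) for t, e in zip(durations_a, events_a)]
              + [(t, e, 1) for t, e in zip(durations_b, events_b)])
    event_times = sorted({t for t, e, _ in pooled if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for d, _, _ in pooled if d >= t)                 # at risk, both groups
        n_a = sum(1 for d, _, g in pooled if g == 0 and d >= t)    # at risk, group A
        d_all = sum(1 for d, e, _ in pooled if d == t and e)       # events at t, both groups
        d_a = sum(1 for d, e, g in pooled if g == 0 and d == t and e)
        observed_a += d_a
        expected_a += d_all * n_a / n
        if n > 1:
            variance += d_all * (n_a / n) * (1 - n_a / n) * (n - d_all) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data: group A's streaks end early, group B's end late (all events observed)
chi2, p_value = logrank_chi2([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)
print(round(chi2, 2), round(p_value, 4))
```

A small p-value suggests the two survival curves differ; with identical groups the statistic is 0 and the p-value is 1.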
