In the previous post, I discussed how we can use a Survival Analysis technique called the ‘Survival Curve’ to analyze how customer retention rates change over time across different cohorts.

Introduction to Survival Analysis Part 1: Survival Curve (blog.exploratory.io)

To analyze customer retention or churn even further, the next question would be:

“What makes customers stop using the service (churn) or stay (retain)?”

For this, we can build a ‘Survival Model’ by using an algorithm called Cox Regression, also known as the Proportional Hazards Model.

The previous Retention Analysis with Survival Curve focused on the time to the event (churn). The analysis with a Survival Model focuses on the relationship between the time to the event and the predictor variables (e.g. age, country, operating system).
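For reference, here is how the Cox model is usually written in textbooks. The notation is my own addition and not something the tool displays: h_0(t) is the baseline hazard shared by everyone, x_1 ... x_p are the predictor values for one customer, and the betas are the coefficients the model estimates.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right)
```

The ‘hazard’ here is the instantaneous risk of churning at time t for a customer who has not churned yet. The predictors only scale the baseline hazard up or down, which is where the name ‘proportional hazards’ comes from.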

Let’s take a look step by step.

Build Survival Model

Since we want to keep the previous Retention Cohort Analysis result, instead of starting from scratch we can create a branch at the step before ‘Group By’, so that we start from the step where we already have the data we need.

In the newly created branch, you can select ‘Build Survival Analysis Model (Cox Regression)’ from the Add button menu.

Similar to what we did when calculating the survival curves, we can set the ‘weeks_on_service’ column as Survival Time and the ‘is_churned’ column as Survival Status. For Predictors, we select the columns whose impact on the ‘churn’ event we are interested in.

In this case, we are interested in how the operating system (OS) the customers use and the countries they live in impact the ‘churn’ event, so we can select the ‘country’ and ‘os’ columns here.

Once you hit the Run button, you will see the model being built and then get the summary result.
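If you want to try the same kind of model outside the UI, here is a minimal sketch in plain R using coxph from the survival package. I am assuming this is roughly what the menu step does under the hood; I have not confirmed it. The data is simulated and the effect sizes are made up, and only the column names (weeks_on_service, is_churned, country, os) come from this post.

```r
library(survival)  # ships with standard R installations

set.seed(42)
n <- 500

# Simulated stand-in for the customer data (all values are made up).
# The first factor level acts as the baseline the others are compared to.
customers <- data.frame(
  country = factor(sample(c("United States", "Japan", "Germany"), n, replace = TRUE),
                   levels = c("United States", "Japan", "Germany")),
  os      = factor(sample(c("windows", "mac", "linux"), n, replace = TRUE),
                   levels = c("windows", "mac", "linux"))
)

# Made-up effects on the log hazard so the model has something to find
log_hazard <- ifelse(customers$country == "Germany", 0.5, 0) +
  ifelse(customers$os == "mac", -0.4, 0)

time_to_churn <- rexp(n, rate = 0.05 * exp(log_hazard))  # weeks until churn
observed_for  <- runif(n, min = 1, max = 52)             # weeks each customer was observed

customers$weeks_on_service <- pmin(time_to_churn, observed_for)
customers$is_churned       <- time_to_churn <= observed_for  # FALSE = still a customer (censored)

# Survival Time, Survival Status, and Predictors map onto the formula like this
fit <- coxph(Surv(weeks_on_service, is_churned) ~ country + os, data = customers)
summary(fit)
```

Customers who had not churned by the end of their observation window stay in the data with is_churned set to FALSE, which is how survival models make use of the customers who are still around.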

Understand Model Summary Information

Summary of Fit

Under the Summary of Fit section, there are several metrics for evaluating the model.

‘Number of Events’ is the number of customers who have churned in this scenario.

There are also metrics for three statistical tests: the Likelihood Ratio test, the Score (Logrank) test, and the Wald test.

The P value for the Likelihood Ratio test (Likelihood Ratio Test P Value) is less than 0.05, so we can reject the ‘null hypothesis’ that none of the parameters listed below has any impact on the outcome, at the 5% significance level (95% confidence level).
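The same three tests are available in plain R from the summary of a coxph model, so here is a small sketch of where those numbers come from. The data is simulated with a made-up effect, and the variable names d, fit, and tests are just for this example.

```r
library(survival)

set.seed(1)
n <- 400

# Simulated data: mac users churn at 60% of the windows rate (made-up effect)
os <- factor(sample(c("windows", "mac"), n, replace = TRUE),
             levels = c("windows", "mac"))
time_to_churn <- rexp(n, rate = 0.05 * ifelse(os == "mac", 0.6, 1))
observed_for  <- runif(n, min = 1, max = 52)

d <- data.frame(
  os = os,
  weeks_on_service = pmin(time_to_churn, observed_for),
  is_churned = time_to_churn <= observed_for
)

fit <- coxph(Surv(weeks_on_service, is_churned) ~ os, data = d)
s <- summary(fit)

s$nevent  # number of events, i.e. customers who churned

# Each test comes back as a named vector: statistic, degrees of freedom, p-value
tests <- rbind(
  "Likelihood Ratio" = s$logtest,
  "Score (Logrank)"  = s$sctest,
  "Wald"             = s$waldtest
)
tests
```

All three test the same overall null hypothesis (every coefficient is zero), and on reasonably sized data they usually agree closely.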

Parameter Estimates

This is where we can find the answer to “what makes customers leave or stay?”

Under the Parameter Estimates section, positive Estimate values mean that the parameter increases the risk of the event, which in this case is ‘churn’. So these are the parameters that could potentially make customers churn more. Negative values mean the opposite, that is, ‘not churn’. This means that countries like ‘Iceland’, ‘Georgia’, and ‘Croatia’ could make customers churn more.
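To see why the sign works this way, compare two customers who are identical except for one 0/1 predictor x_j (for example, living in a given country or not). This is standard Cox model algebra that I am adding for illustration, with h_0(t) as the baseline hazard and beta_j as the Estimate for that predictor:

```latex
\frac{h(t \mid x_j = 1)}{h(t \mid x_j = 0)}
  = \frac{h_0(t)\, e^{\beta_j}}{h_0(t)}
  = e^{\beta_j}
```

A positive estimate gives a ratio above 1, meaning a higher risk of churn, and a negative estimate gives a ratio below 1. For example, an estimate of 0.5 corresponds to about 1.65 times the risk, while -0.5 corresponds to about 0.61 times the risk.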

When we click on the ‘Estimate’ column header to sort the table by estimate, we can see that countries like ‘El Salvador’, ‘Uruguay’, and ‘Nepal’ have the most negative values, so they would make customers churn less.

However, we need to be careful and think about how confident we can be in these numbers. For that, we can check their P values to see whether those impacts are ‘statistically significant’ or not. Typically, we need a P value below 0.05 (that is, a 95% confidence level) to call an impact ‘statistically significant’ and reject the null hypothesis that a given parameter has no impact on the outcome. We can click on the P Value column header to sort from the lowest.

With this information, we can now say that countries like ‘United Kingdom’ and ‘Japan’ are making customers churn less, while countries like ‘Germany’ and ‘Croatia’ are making them churn more.

We can also take a look at the ‘Confidence Interval’ columns, Conf Low and Conf High. If the range contains 0 (or crosses 0), the impact could be either positive or negative. Basically, the model is saying that it can’t make up its mind given the data it has! ;)

To understand this better, we can extract the Parameter Estimates information into a data frame by selecting ‘Extract Parameter Estimates’ from the ‘Add’ button menu.

This will get you a data frame with Parameter Estimates information.
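For readers working in plain R, a comparable data frame can be assembled from a coxph model by hand. This is only a sketch on simulated data with made-up effects. The column names conf_low and conf_high are my guess at how ‘Conf Low’ and ‘Conf High’ would be spelled as columns, and crosses_zero is a helper column added just for this example.

```r
library(survival)

set.seed(7)
n <- 600

# Simulated data with made-up effects for Germany and mac
country <- factor(sample(c("United States", "Japan", "Germany"), n, replace = TRUE),
                  levels = c("United States", "Japan", "Germany"))
os <- factor(sample(c("windows", "mac"), n, replace = TRUE),
             levels = c("windows", "mac"))
rate <- 0.05 * exp(ifelse(country == "Germany", 0.5, 0) + ifelse(os == "mac", -0.4, 0))
time_to_churn <- rexp(n, rate = rate)
observed_for  <- runif(n, min = 1, max = 52)

d <- data.frame(
  country, os,
  weeks_on_service = pmin(time_to_churn, observed_for),
  is_churned = time_to_churn <= observed_for
)
fit <- coxph(Surv(weeks_on_service, is_churned) ~ country + os, data = d)

est <- summary(fit)$coefficients  # one row per parameter
ci  <- confint(fit)               # 95% confidence intervals by default

params <- data.frame(
  term      = rownames(est),
  estimate  = est[, "coef"],
  p_value   = est[, "Pr(>|z|)"],
  conf_low  = ci[, 1],
  conf_high = ci[, 2],
  row.names = NULL
)

# TRUE when the interval contains 0, i.e. the direction of the effect is unclear
params$crosses_zero <- params$conf_low < 0 & params$conf_high > 0
params
```

For these intervals, a range that crosses 0 and a P value of 0.05 or more are two views of the same thing, so crosses_zero lines up with the significance check described above.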

Then we can quickly go to the Viz view and visualize this data.

First, we can use a Scatter chart and assign ‘Term’ to the X-Axis and ‘Estimate’ to the Y-Axis.

Next, we can assign the confidence interval columns by checking the ‘Show Range’ check box inside the Range menu dialog.

The labels on the X-Axis are hard to read because every value has ‘country’ or ‘os’ at the beginning. We can remove these prefixes by using the ‘str_replace’ function from the ‘stringr’ package inside a ‘mutate’ command like below. The pattern is anchored with ‘^’ so that only a prefix at the start of the value is removed.

mutate(term = str_replace(term, "^(country|os)", ""))

As you can see, some countries have wide confidence intervals. As we saw before, some of the variables (countries and OS) have large P values, so we can filter those out by using a filter command like below.

filter(p_value < 0.05)

As you can see, the widths of these variables’ confidence intervals vary, but each range stays on one side, either positive or negative. With this view, we can say that countries like India and Japan, or an operating system like Mac, decrease the risk of churn, while countries like Germany, Croatia, and the Philippines increase the risk.
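To make the effect of these two steps concrete, here is a tiny stand-alone sketch using dplyr and stringr. The numbers are invented purely for illustration (they are not the results from this analysis), ‘Brazil’ is a hypothetical extra row, and the direction column is something I added for this example.

```r
library(dplyr)
library(stringr)

# Made-up parameter estimates, for illustration only
params <- tibble(
  term      = c("countryJapan", "countryGermany", "countryBrazil", "osmac"),
  estimate  = c(-0.40, 0.55, 0.10, -0.30),
  conf_low  = c(-0.70, 0.20, -0.35, -0.55),
  conf_high = c(-0.10, 0.90, 0.55, -0.05),
  p_value   = c(0.009, 0.002, 0.660, 0.019)
)

significant <- params %>%
  # Strip the 'country' / 'os' prefix from the start of each label
  mutate(term = str_replace(term, "^(country|os)", "")) %>%
  # Keep only the statistically significant parameters
  filter(p_value < 0.05) %>%
  # Label the direction of the effect based on the sign of the estimate
  mutate(direction = if_else(estimate > 0, "more churn", "less churn"))

significant
```

The Brazil row drops out because its P value is large, and its interval is also the only one that crosses 0. Every remaining row has an interval that sits entirely on one side of 0.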

But how should we interpret the numbers in the ‘estimate’ column? Also, how come we are not seeing ‘windows’ in this output? For this, we need to take a look at something called the Hazard Ratio.

Hazard Ratio

You can find the ‘Hazard Ratio’ column at the end of the Parameter Estimates table in the summary view. It is simply the result of exponentiating the estimate values, which are on the log hazard ratio scale. Hazard Ratio values should b
