Analyze A/B Test Results

This project will ensure you have mastered the subjects covered in the statistics lessons. The aim is for the project to cover these topics as comprehensively as possible. Good luck!

Table of Contents

Introduction

A/B tests are very commonly performed by data analysts and data scientists. It is important that you get some practice working with the difficulties of these tests.

For this project, you will be working to understand the results of an A/B test run by an e-commerce website. Your goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.

As you work through this notebook, follow along in the classroom and answer the quiz questions associated with each question. The labels for each classroom concept are provided for each question. This will ensure you are on the right track as you work through the project, and you can feel more confident that your final submission meets the criteria. As a final check, ensure you meet all the criteria on the RUBRIC.

Part I - Probability

To get started, let's import our libraries.

1. Now, read in the ab_data.csv data. Store it in df. Use your dataframe to answer the questions in Quiz 1 of the classroom.

a. Read in the dataset and take a look at the top few rows here:

b. Use the below cell to find the number of rows in the dataset.

c. The number of unique users in the dataset.

d. The proportion of users converted.

e. The number of times the new_page and treatment don't line up.

f. Do any of the rows have missing values?
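
The questions in part 1 map onto a handful of pandas calls. The following is a minimal sketch rather than the notebook's actual cells: it uses a tiny hand-made DataFrame in place of ab_data.csv, and the column names `group` and `landing_page` are assumptions (the text above only names user_id, converted, treatment/control and new_page/old_page).

```python
import pandas as pd

# Tiny stand-in for ab_data.csv; the real file would be loaded with
# pd.read_csv("ab_data.csv"). The column names `group` and `landing_page`
# are assumed for illustration.
df = pd.DataFrame({
    "user_id":      [1, 2, 3, 4, 5, 5],
    "group":        ["control", "treatment", "treatment",
                     "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "old_page",
                     "old_page", "new_page", "new_page"],
    "converted":    [0, 1, 0, 0, 1, 0],
})

n_rows = df.shape[0]                       # b. number of rows
n_unique_users = df["user_id"].nunique()   # c. number of unique users
prop_converted = df["converted"].mean()    # d. mean of 0/1 values = proportion

# e. rows where "treatment" and "new_page" do not line up
is_treatment = df["group"] == "treatment"
is_new_page = df["landing_page"] == "new_page"
n_mismatch = int((is_treatment != is_new_page).sum())

# f. True if any cell in the DataFrame is missing
has_missing = bool(df.isnull().values.any())

print(n_rows, n_unique_users, prop_converted, n_mismatch, has_missing)
```

Comparing the two boolean Series with `!=` counts both kinds of mismatch at once (treatment with old_page, and control with new_page).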

2. For the rows where treatment is not aligned with new_page or control is not aligned with old_page, we cannot be sure if this row truly received the new or old page. Use Quiz 2 in the classroom to provide how we should handle these rows.

a. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz. Store your new dataframe in df2.
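
The quiz answer is not reproduced in this notebook, so the sketch below assumes the specification is to keep only the rows where group and page agree. The toy data and the column names `group` and `landing_page` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df; `group` and `landing_page` are assumed names.
df = pd.DataFrame({
    "group":        ["control", "treatment", "treatment", "control", "treatment"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page", "new_page"],
    "converted":    [0, 1, 0, 1, 0],
})

# A row is aligned when "is treatment" and "is new_page" are both True
# or both False (the latter means control with old_page).
aligned = (df["group"] == "treatment") == (df["landing_page"] == "new_page")

# .copy() gives df2 its own data so later edits do not touch df.
df2 = df[aligned].copy()
print(df2)
```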

3. Use df2 and the cells below to answer questions for Quiz3 in the classroom.

a. How many unique user_ids are in df2?

b. There is one user_id repeated in df2. What is it?

c. What is the row information for the repeat user_id?

d. Remove one of the rows with a duplicate user_id, but keep your dataframe as df2.
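
One possible way to answer parts b. through d. is with `duplicated` and `drop_duplicates`. This is a sketch on made-up rows, not the real df2; which of the two duplicate rows to keep is a choice, and keeping the first is assumed here.

```python
import pandas as pd

# Made-up rows standing in for df2; user_id 11 appears twice.
df2 = pd.DataFrame({
    "user_id":   [10, 11, 12, 11],
    "converted": [0, 1, 0, 0],
})

# b. ids that occur more than once
repeated_ids = df2.loc[df2["user_id"].duplicated(), "user_id"].unique()

# c. every row belonging to a repeated id (keep=False marks all copies)
repeat_rows = df2[df2["user_id"].duplicated(keep=False)]

# d. drop the later copy, keeping the name df2
df2 = df2.drop_duplicates(subset="user_id", keep="first")

print(repeated_ids)
print(repeat_rows)
print(df2)
```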

4. Use df2 in the below cells to answer the quiz questions related to Quiz 4 in the classroom.

a. What is the probability of an individual converting regardless of the page they receive?

b. Given that an individual was in the control group, what is the probability they converted?

c. Given that an individual was in the treatment group, what is the probability they converted?

d. What is the probability that an individual received the new page?

e. Consider your results from a. through d. above, and explain below whether you think there is sufficient evidence to say that the new treatment page leads to more conversions.

Answer
There is an equal chance of receiving either the new page or the old page: P(old) = P(new) = 0.5 = 50%. The probability of converting given the old page or the new page is essentially the same, about 0.12 or 12% (this probability is calculated from the data we have). Applying Bayes' rule gives a posterior probability of 0.5 or 50% for both pages (P(New_Page|CON) = 0.06/0.12 = 0.5 & P(Old_Page|CON) = 0.06/0.12 = 0.5), so knowing that a user converted tells us nothing about which page they saw. Based on these calculations we cannot say that there is sufficient evidence that the new treatment page leads to more conversions.
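
The Bayes' rule arithmetic can be checked in a few lines. This sketch uses the rounded values quoted in the answer (0.5 and 0.12), not values recomputed from the data.

```python
# Rounded values taken from the answer above.
p_new = 0.5               # P(new page)
p_conv_given_new = 0.12   # P(converted | new page)
p_conv = 0.12             # P(converted), regardless of page

# Bayes' rule: P(new | converted) = P(converted | new) * P(new) / P(converted)
p_new_given_conv = p_conv_given_new * p_new / p_conv

print(p_new_given_conv)
```

A posterior equal to the prior (0.5) is what we expect when conversion is independent of the page shown.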

Part II - A/B Test

Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each observation was observed.

However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time? How long do you run to render a decision that neither page is better than another?

These questions are the difficult parts associated with A/B tests in general.

1. For now, consider you need to make the decision just based on all the data provided. If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be? You can state your hypothesis in terms of words or in terms of $p_{old}$ and $p_{new}$, which are the converted rates for the old and new pages.

RESEARCH QUESTION: Does the experiment page drive more conversions than the control page?

$H_{0}$: The new version of the page converts at the same rate as, or a lower rate than, the old version (the new version is equal to or worse than the old).
$H_{1}$: The new version of the page converts at a higher rate than the old version (the new version is better than the old version).

$$H_0: p_{new} - p_{old} \le 0$$
$$H_1: p_{new} - p_{old} > 0$$

$p_{new}$ and $p_{old}$ are the conversion rates for the new page and the old page, respectively.

2. Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the converted success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the converted rate in ab_data.csv regardless of the page.

Use a sample size for each page equal to the ones in ab_data.csv. No sample size needed, we are using the whole ab_data.csv dataset.

Perform the sampling distribution for the difference in converted between the two pages over 10,000 iterations of calculating an estimate from the null.

Use the cells below to provide the necessary parts of this simulation. If this doesn't make complete sense right now, don't worry - you are going to work through the problems below to complete this problem. You can use Quiz 5 in the classroom to make sure you are on the right track.

a. What is the convert rate for $p_{new}$ under the null?

b. What is the convert rate for $p_{old}$ under the null?

Note
We assume that under the null hypothesis, p_new and p_old both have "true" success rates and therefore are equal to the converted success rate regardless of page - that is p_new and p_old are equal. Since they are both equal, we don't need to split into treatment types and consider all conversions together. Because we are using 0s and 1s to confirm conversion, it's possible to take the mean of this to find the rate (source: Udacity Knowledge FAQ).

c. What is $n_{new}$?

d. What is $n_{old}$?

e. Simulate $n_{new}$ transactions with a convert rate of $p_{new}$ under the null. Store these $n_{new}$ 1's and 0's in new_page_converted.

f. Simulate $n_{old}$ transactions with a convert rate of $p_{old}$ under the null. Store these $n_{old}$ 1's and 0's in old_page_converted.

Note
Here we simulate the sample with the np.random.binomial method.
WHY: We simulate under the null hypothesis to see what the distribution of the statistic looks like if the null hypothesis is true. Then we calculate the p-value (using the actual observed difference) in order to reject or fail to reject the null hypothesis.
This is a single example; in cell h we repeat the simulation for 10,000 samples.

1 = number of trials per draw (each draw is a single 0 or 1)
p_new = probability of success on each trial (calculated)
n_new = number of draws to generate

Because we store each outcome as a 0 or a 1, we use n=1 in np.random.binomial
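
A minimal sketch of this single simulation is shown below. The values of `p_new` and `n_new` are placeholders chosen for illustration, not the values computed from ab_data.csv, and the seed is only there to make the run repeatable.

```python
import numpy as np

np.random.seed(42)  # for repeatability of this illustration only

p_new = 0.12   # placeholder conversion rate under the null (assumed value)
n_new = 1000   # placeholder number of users on the new page (assumed value)

# n=1 means each draw is one Bernoulli trial, so the result is an
# array of n_new zeros and ones.
new_page_converted = np.random.binomial(1, p_new, n_new)

# The mean of the 0/1 array is the simulated conversion rate.
simulated_rate = new_page_converted.mean()
print(new_page_converted[:10], simulated_rate)
```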

g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f).

Note
Now that we know the observed difference in this sample (dataset in our case), we have to see if this difference is significant and not just due to chance. Therefore, we will simulate 10,000 values and calculate the differences in proportions ($p_{new}$ - $p_{old}$).

h. Simulate 10,000 $p_{new}$ - $p_{old}$ values using this same process similarly to the one you calculated in parts a. through g. above. Store all 10,000 values in a numpy array called p_diffs.

Note: Formula explained
if: np.random.binomial(n_new, p_new, 10000) -> we get the number of 1s in each of 10,000 simulated experiments.
if: np.random.binomial(n_new, p_new, 10000)/n_new -> we get the proportion of 1s in each experiment.

n_new = number of trials in each experiment (0s and 1s)
p_new = probability that the event of interest occurs on any one trial (calculated)
10000 = number of times to run this experiment

Because we are counting how many 1s appear in each experiment, we use n=n_new and divide by n_new to get the proportion: np.random.binomial

p_diffs = the difference in simulated conversion rates between the new and old page. Its mean should be close to 0, since we simulate this distribution under the null hypothesis $H_0: p_{new} - p_{old} \le 0$ at its boundary, where $p_{new} = p_{old}$.
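
The vectorized form described in this note can be sketched as follows. The rate and sample sizes are placeholder values (assumptions), standing in for the ones computed from the dataset.

```python
import numpy as np

np.random.seed(0)  # for repeatability of this illustration only

p_null = 0.12    # placeholder: common conversion rate under the null
n_new = 5000     # placeholder sample size for the new page
n_old = 5000     # placeholder sample size for the old page

# Each call returns 10,000 counts of conversions, one per simulated
# experiment; dividing by the sample size turns counts into rates.
new_rates = np.random.binomial(n_new, p_null, 10000) / n_new
old_rates = np.random.binomial(n_old, p_null, 10000) / n_old

# 10,000 simulated differences in conversion rate under the null.
p_diffs = new_rates - old_rates

print(p_diffs.mean(), p_diffs.std())
```

Drawing the counts directly avoids a Python loop over 10,000 iterations and gives the same sampling distribution as summing n_new individual 0/1 draws each time.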

Note: Further understanding of distribution and number of trials
Below is a graphical visualization of the distribution of trials and the difference between 10,000 trials and 50 trials. Source: Stack Overflow

Note
When conducting hypothesis testing, we always simulate the null population and then compare to the observed statistic.

i. Plot a histogram of the p_diffs. Does this plot look like what you expected? Use the matching problem in the classroom to assure you fully understand what was computed here.

Note
This plot is as expected: it follows a normal distribution (law of large numbers and central limit theorem). Above are two plots that show what happens if the number of trials is low in comparison with a large number of trials.

blue area: the distribution from the null (assuming the null is true)
dark blue dashed line: the observed mean - actual mean (not from the null)
gray dashed line: the null mean
red lines: 95% confidence interval

Now we need to calculate the area - our alternative hypothesis is: $H_1: p_{new} - p_{old} > 0$

j. What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?

k. In words, explain what you just computed in part j. What is this value called in scientific studies? What does this value mean in terms of whether or not there is a difference between the new and old pages?

Answer
First, we ran 10,000 trials assuming p_old and p_new are equal (that is, under the null hypothesis) and created a normal distribution of differences under this assumption. Next, we compared this distribution with the actual difference in our dataset to see how likely a difference at least this large would be if the null hypothesis were true - this is the p-value. We use the p-value to determine the statistical significance of our observed difference.

In cell j we computed the p-value for our statistic, which is the observed difference in proportions.
First, we simulated the distribution of the statistic under the null hypothesis and then found how often that distribution produces a value larger than our observed statistic. To simulate from the null we created a normal distribution centered at zero with the same standard deviation and size as the sampling distribution. Next, we computed the p-value by finding the proportion of values in the null distribution that were greater than our observed difference.

Formula explained:
np.random.normal -> Draw random samples from a normal (Gaussian) distribution
loc = 0 -> Mean (“centre”) of the distribution (p_diffs = 0)
scale = p_diffs.std() -> Standard deviation (spread or “width”) of the distribution
size = p_diffs.size -> Size of distribution.

A p-value of 0.9009 means that about 90% of the differences simulated under the null hypothesis were greater than the difference we actually observed. This is far above the 0.05 threshold; therefore, we fail to reject the null hypothesis, meaning we have no evidence for the alternative hypothesis (that the new page is better than the old page).
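
The np.random.normal step explained above can be sketched like this. Both `p_diffs` and `obs_diff` are hypothetical stand-ins here (assumed values), since the real ones come from the simulation and from ab_data.csv.

```python
import numpy as np

np.random.seed(1)  # for repeatability of this illustration only

# Hypothetical stand-ins: in the notebook, p_diffs comes from the
# 10,000 simulations and obs_diff from the actual data.
p_diffs = np.random.normal(0, 0.0012, 10000)
obs_diff = -0.0016

# Null distribution: centered at 0, same spread and size as p_diffs.
null_vals = np.random.normal(loc=0, scale=p_diffs.std(), size=p_diffs.size)

# One-sided p-value for H1: p_new - p_old > 0 is the share of null
# values that are greater than the observed difference.
p_value = (null_vals > obs_diff).mean()
print(p_value)
```

Because the alternative is "greater than", only the right tail is counted; a negative observed difference therefore yields a large p-value.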

Note and additional resources
The p-value helps us make a decision. It is the probability of observing a statistic at least as extreme as ours (in the direction of the alternative) if the null hypothesis is true. We compare it with the Type I error rate we are willing to accept. (A Type I error is when you incorrectly reject the null hypothesis - usually we would consider making Type I errors to be 'bad,' so we want to make as few of them as possible, and make this chance quite low)
A low p-value is often considered to be less than 0.05 in business and research, and 0.01 in medicine, but it could be any value appropriate to the situation. That is, if you get a p-value of 0.05, this means that, if the null hypothesis were true, there would be a 5% chance of observing a statistic at least as extreme as the one you observed. With this reasoning, at low p-values we typically reject the null hypothesis. That is, we act on the assumption that the observed statistic came from a population where the alternative hypothesis is true. Source: p-value

l. We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Fill in the below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let n_old and n_new refer to the number of rows associated with the old page and new pages, respectively.

m. Now use stats.proportions_ztest to compute your test statistic and p-value. Here is a helpful link on using the built in.

Source: statisticshowto.com
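
To see what a two-proportion z-test computes, here is a hand-written sketch using only the standard library. It is not the statsmodels implementation; it is the textbook pooled-variance formula with a one-sided ("new is larger") alternative, and the function name and the counts passed to it are hypothetical.

```python
import math

def two_proportion_ztest_larger(count_new, n_new, count_old, n_old):
    """Pooled two-proportion z-test for H1: p_new > p_old (illustrative helper)."""
    p_new = count_new / n_new
    p_old = count_old / n_old
    # Under the null both pages share one rate, so pool the data.
    p_pool = (count_new + count_old) / (n_new + n_old)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_old))
    z = (p_new - p_old) / se
    # Upper-tail probability of the standard normal: P(Z > z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts, chosen only to exercise the function.
z_score, p_value = two_proportion_ztest_larger(115, 1000, 125, 1000)
print(z_score, p_value)
```

When the new page converts slightly worse than the old one, the z-score is negative and the one-sided p-value is above 0.5.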

n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts j. and k.?

Answer
The built-in proportions z-test function did all the computation in a few lines of code that reflect what we did in Part II. The p-value and z-score computed in cell m agree with the p-value computed in cell j: the p-value is 0.905, meaning we fail to reject the null hypothesis, and based on these computations we have no evidence that the new page leads to more conversions.

Interpretation of p-value and z-value
The z-value is a test statistic that measures the difference between an observed statistic and its hypothesized population parameter in units of standard error. We can compare the z-value to critical values of the standard normal distribution to determine whether to reject the null hypothesis. In our case, the z-score shows how many standard errors the observed (actual) difference pdiff_actual lies from the center of the null distribution. To interpret the z-score we look at the critical value. For a one-sided test at an alpha level of 0.05 (5%), the critical value is about 1.645. Our z-score is -1.31, which does not exceed this critical value, so we fail to reject the null hypothesis.

The p-value is a probability that measures the evidence against the null hypothesis. A smaller p-value provides stronger evidence against the null hypothesis.

Source: minitab.com

z-score of -1.31 falls between one and two standard deviations below the mean - shaded gray area
critical value of $\alpha$ = 0.05 (95% confidence interval) - shaded red area

Source: Udacity Knowledge FAQ & z-score & Stack Overflow

Part III - A regression approach

1. In this final part, you will see that the result you achieved in the previous A/B test can also be achieved by performing regression.

a. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?

Answer
Unlike linear regression (used for predicting a quantitative response, i.e. a continuous numerical variable), logistic regression is used to predict a categorical response, here a binary response with only two possible outcomes: conversion vs. no conversion.

b. The goal is to use statsmodels to fit the regression model you specified in part a. to see if there is a significant difference in conversion based on which page a customer receives. However, you first need to create a column for the intercept, and create a dummy variable column for which page each user received. Add an intercept column, as well as an ab_page column, which is 1 when an individual receives the treatment and 0 if control.

Note
Intercept == 1: initialize the value of the bias to 1 because it will be multiplied by the bias weight to produce the final bias value. If it was set to 0, it would always produce 0. If it was set to 5, it would scale the weight too much.
Source: Udacity Knowledge.
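
Creating the two columns might look like the sketch below. The toy rows and the column name `group` are assumptions; `intercept` and `ab_page` are the names requested in part b.

```python
import pandas as pd

# Toy rows standing in for df2; `group` is an assumed column name.
df2 = pd.DataFrame({
    "group":     ["control", "treatment", "treatment", "control"],
    "converted": [0, 1, 0, 0],
})

# Constant column so the model can estimate an intercept term.
df2["intercept"] = 1

# Dummy variable: 1 for treatment, 0 for control.
df2["ab_page"] = (df2["group"] == "treatment").astype(int)

print(df2)
```

statsmodels does not add an intercept automatically when a model is built from arrays or DataFrame columns, which is why the constant column is created by hand.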

c. Use statsmodels to import your regression model. Instantiate the model, and fit the model using the two columns you created in part b. to predict whether or not an individual converts.

d. Provide the summary of your model below, and use it as necessary to answer the following questions.

e. What is the p-value associated with ab_page? Why does it differ from the value you found in Part II?

Hint: What are the null and alternative hypotheses associated with your regression model, and how do they compare to the null and alternative hypotheses in the Part II?

Answer
The p-value for ab_page is 0.190. This p-value leads to the same conclusion as the p-value in Part II: we fail to reject the null hypothesis, so based on these computations we have no evidence that the new page leads to more conversions. The p-value differs because in Part II we ran a one-sided test (null hypothesis "p_old - p_new >= 0"), whereas in Part III the regression runs a two-sided test (null hypothesis "p_old = p_new"). The two are consistent: 2 x (1 - 0.905) = 0.19. For logistic regression we have two possible outputs, "converted" or "not converted."

f. Now, you are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add into your regression model. Are there any disadvantages to adding additional terms into your regression model?

Answer
Adding other features to the model can improve the model's performance; however, we need to be careful with the interpretation. One potential side effect of multicollinearity in the model is that the coefficients can be counter-intuitive. This happens if predictors are strongly correlated with one another. We can check for this correlation either with scatter plots or with VIFs (variance inflation factors). In order to interpret the model more accurately, we could remove at least one of the highly correlated variables, choosing the one of least interest.
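
As an illustration of the VIF check mentioned above, here is a small NumPy sketch. It computes VIF from its definition, 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others; the helper name `vif` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus a constant.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        factors.append(1 / (1 - r2))
    return np.array(factors)

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

factors = vif(np.column_stack([x1, x2, x3]))
print(factors)
```

A common rule of thumb treats a VIF above 10 as a sign of problematic multicollinearity; here the two nearly identical predictors should show very large values while the unrelated one stays near 1.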

g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. You will need to read in the countries.csv dataset and merge together your datasets on the appropriate rows. Here are the docs for joining tables.

Does it appear that country had an impact on conversion? Don't forget to create dummy variables for these country columns - Hint: You will need two columns for the three dummy variables. Provide the statistical output as well as a written response to answer this question.
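
A possible shape for the merge and the dummy columns is sketched below on toy data. The column name `country`, the country codes in the toy rows, and the choice of US as the baseline (dropped) category are assumptions.

```python
import pandas as pd

# Toy stand-ins for df2 and countries.csv; `country` is an assumed column name.
df2 = pd.DataFrame({
    "user_id":   [1, 2, 3, 4],
    "converted": [0, 1, 0, 0],
    "ab_page":   [0, 1, 1, 0],
})
countries = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "UK", "CA", "US"],
})

# Inner join keeps only users present in both tables.
df_new = df2.merge(countries, on="user_id", how="inner")

# One 0/1 column per country; keep two of the three so the omitted
# one (US here) acts as the baseline.
dummies = pd.get_dummies(df_new["country"]).astype(int)
df_new = df_new.join(dummies[["CA", "UK"]])

print(df_new)
```

Keeping all three dummy columns alongside an intercept would make the design matrix perfectly collinear, which is why only two are needed.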

Interpretation of the results - p-value
Based on the p-values, country doesn't appear to have an impact on conversion. None of the explanatory variables is statistically significant (no p-value is below 0.05). In a logistic regression model summary we can use p-values to judge whether a particular variable is significant; they are a quick check of which relationships appear to be important. Furthermore, we can interpret the coefficients to help us understand the direction and size of each relationship.

In order to interpret coeficient we need to exponentiate each:
The math.exp() method returns e raised to the power of x ($e^x$), where e is the base of the natural logarithm (approximately 2.718282) and x is the number passed to it. Source: www.w3schools

Interpretation of the results - coeficient
Each exponentiated coefficient is a multiplicative change in the odds of converting for that category relative to the baseline, holding all other variables constant.
From the results above, the odds of converting for an individual from the US are 1.01 times the odds for an individual from the UK, and the corresponding multiplier for CA is 1.04.
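
The exponentiation step can be sketched as follows. The coefficient values are made-up placeholders, not the fitted values from the model summary.

```python
import math

# Hypothetical logistic regression coefficients (log-odds scale).
coefs = {"CA": -0.04, "UK": 0.01}

# exp(coefficient) is the multiplicative change in odds relative to
# the baseline category.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

# For a negative coefficient the reciprocal is often easier to read:
# it states how many times higher the baseline odds are.
reciprocals = {name: 1 / ratio for name, ratio in odds_ratios.items()}

print(odds_ratios)
print(reciprocals)
```

An odds ratio close to 1 means the category barely changes the odds of converting relative to the baseline.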

Interpretation of the results
Adding more terms in this case did not change the model. None of the explanatory variables is statistically significant (no p-value is below 0.05), and the coefficients stayed similar to those in the model above. We can conclude that there is no significant p-value (all are higher than 0.05) even after adding country to the model, and therefore we fail to reject the null. The company should stay with the old page, as there is not enough evidence that the new page is doing better.

h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model.

Provide the summary results, and your conclusions based on the results.

When we include higher-order terms in our model, we also need to include the lower-order terms. Mathematically, an interaction is created by multiplying two variables by one another and adding this term to our linear regression model. If the slope of the response with respect to one variable differs across the levels of the other variable, we consider adding an interaction to our model; if the slopes are the same, we do not add an interaction.

Interpretation of results
When the slopes for two variables no longer match, we would want to add an interaction term between the two variables (page and country) to our model. In this case, that would mean the way ab_page is related to converted depends on which country the individual comes from.

Adding higher-order terms did not improve the model. The p-values for CA_abpage and UK_abpage are 0.383 and 0.238, respectively, indicating that the interactions are not significant, so we would consider removing them from the model. However, it is essential to be aware of interactions, since including them can improve our models, and omitting them when they are significant can hurt.
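
The interaction columns named above are simply products of the existing dummy columns. The sketch below uses toy rows; only the column names CA_abpage and UK_abpage come from the analysis, and the dummy columns `CA` and `UK` are assumed to exist already.

```python
import pandas as pd

# Toy rows; `ab_page`, `CA` and `UK` are 0/1 dummy columns.
df_new = pd.DataFrame({
    "ab_page": [0, 1, 1, 0, 1],
    "CA":      [1, 1, 0, 0, 0],
    "UK":      [0, 0, 1, 1, 0],
})

# An interaction term is the elementwise product of two columns:
# it is 1 only when the user is in that country AND saw the new page.
df_new["CA_abpage"] = df_new["CA"] * df_new["ab_page"]
df_new["UK_abpage"] = df_new["UK"] * df_new["ab_page"]

print(df_new)
```

The lower-order columns (ab_page, CA, UK) stay in the model alongside the two products, in line with the rule about keeping lower-order terms.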

Higher order terms - notes
Sometimes we would like to fit models where the response is not linearly related to the explanatory variable. We can do this with what are known as higher-order terms. Higher-order terms include quadratics, cubics and many other relationships. In order to add these terms to our linear models, we can simply multiply our columns by one another.

Additional notes - not part of the analysis
🎈logistic regression: confusion matrix, exponentiate each variable, VIF & multicollinearity
🎈 multiple linear regression: VIF, scatter plots in multicollinearity
VIF&multicollinearity

Conclusions

Congratulations on completing the project!

Gather Submission Materials

Once you are satisfied with the status of your Notebook, you should save it in a format that will make it easy for others to read. You can use the File -> Download as -> HTML (.html) menu to save your notebook as an .html file. If you are working locally and get an error about "No module name", then open a terminal and try installing the missing module using pip install <module_name> (don't include the "<" or ">" or any words following a period in the module name).

You will submit both your original Notebook and an HTML or PDF copy of the Notebook for review. There is no need for you to include any data files with your submission. If you made reference to other websites, books, and other resources to help you in solving tasks in the project, make sure that you document them. It is recommended that you either add a "Resources" section in a Markdown cell at the end of the Notebook report, or you can include a readme.txt file documenting your sources.

Submit the Project

When you're ready, click on the "Submit Project" button to go to the project submission page. You can submit your files as a .zip archive or you can link to a GitHub repository containing your project files. If you go with GitHub, note that your submission will be a snapshot of the linked repository at time of submission. It is recommended that you keep each project in a separate repository to avoid any potential confusion: if a reviewer gets multiple folders representing multiple projects, there might be confusion regarding what project is to be evaluated.

It can take us up to a week to grade the project, but in most cases it is much faster. You will get an email once your submission has been reviewed. If you are having any problems submitting your project or wish to check on the status of your submission, please email me at hi@priyanshuraj.online. In the meantime, you should feel free to continue on with your learning journey by beginning the next module in the program.