Data Analysis of eCommerce User Activity - by Ian Hudson

Using 5 months of user activity, from first view through to purchase, to see whether the dependent variable Y (user likely to purchase or not) can be predicted.

The data covers 20.6 million user events across 53,904 products from a Multi-category Online Store (MOS) eCommerce website.

Let's get some useful libraries.

Creating the DataFrame

Convert the event_time column to datetime type
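A minimal sketch of the conversion, assuming the timestamps arrive as plain strings. The toy rows below are invented for illustration; only the `event_time` and `event_type` column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for the real event log (values are made up).
df = pd.DataFrame({
    "event_time": ["2019-10-01 00:00:00", "2019-10-01 00:00:07"],
    "event_type": ["view", "cart"],
})

# Parse the string column into pandas datetime values.
df["event_time"] = pd.to_datetime(df["event_time"])

# The dtype should now be a datetime64 variant instead of object.
print(df["event_time"].dtype)
```

Once the column is a datetime type, the `.dt` accessor (day of week, differences in minutes, and so on) becomes available for the later feature steps.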

Checking the event_time data type

Data Cleansing

What does the dataframe look like?

Let's remove that `Unnamed: 0` column; it's a duplicate of the index.
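A short sketch of the drop, assuming the column is literally named `Unnamed: 0`, which is the name pandas gives an unlabeled index column when reading a CSV. The sample rows are invented.

```python
import pandas as pd

# Toy frame that mimics a CSV re-read with a saved index column.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "event_type": ["view", "cart", "purchase"],
    "price": [5.0, 5.0, 5.0],
})

# Drop the redundant copy of the index.
df = df.drop(columns=["Unnamed: 0"])

print(df.columns.tolist())
```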

That column should now be gone, let's confirm

Let's see if there are any missing values in columns

There are some missing values in category_code, brand and user_session.

Let's check out what percentage of data is missing.
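One way to get the share of missing values per column is to average the boolean mask from `isna()`. This is a sketch on invented rows, not the real data, so the percentages it prints are not the ones reported in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of the three columns (values are made up).
df = pd.DataFrame({
    "category_code": [np.nan, np.nan, np.nan, "a.b"],
    "brand": ["x", np.nan, "y", "z"],
    "user_session": ["s1", "s2", "s3", "s4"],
})

# isna() gives True/False per cell; the column mean is the missing fraction.
missing_pct = df.isna().mean() * 100

print(missing_pct.round(2))
```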

98% of category_code is missing, 42% of brand is missing, 0.02% of user_session is missing.

There is a huge amount of missing data for category_code and brand so I will drop these columns from the dataframe.

User_session will remain in the dataframe.

Let's remove the category_code column

Let's also remove the brand column

Let's remove any duplicate row data.

Let's remove any leading or trailing spaces from string columns event_type and user_session
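The two cleanup steps above (dropping duplicate rows, then trimming whitespace) could look roughly like this. The sample rows are invented; the column names `event_type` and `user_session` are the ones named in this notebook.

```python
import pandas as pd

# Toy frame with one exact duplicate row and stray spaces (values are made up).
df = pd.DataFrame({
    "event_type": [" view", "cart ", "cart "],
    "user_session": ["s1 ", " s2", " s2"],
    "price": [5.0, 3.0, 3.0],
})

# Remove exact duplicate rows and rebuild a clean index.
df = df.drop_duplicates().reset_index(drop=True)

# Trim leading and trailing whitespace from the string columns.
for col in ["event_type", "user_session"]:
    df[col] = df[col].str.strip()

print(df)
```

Trimming can turn near-duplicates into exact duplicates, so a second `drop_duplicates()` pass after stripping may be worth considering.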

Now that the data is clean let's do some exploration

To see the distribution of data, let's create some Histograms

The most common prices fall in the 1 - 15 Euro range. Hmm, there seem to be some non-positive values in the price series; I'll have to check on that later.

Bar chart of event totals

The most frequent event is viewing a product, followed by putting a product into the cart, removing a product from the cart and, lastly, making a purchase. Each event type occurs roughly half as often as the one before it.

Let's see what the percentages are
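A sketch of how the shares can be computed with `value_counts(normalize=True)`. The event labels match the ones used in this notebook, but the counts below are invented and do not reproduce the real proportions.

```python
import pandas as pd

# Toy event column: 5 views, 3 carts, 1 removal, 1 purchase (made-up counts).
events = pd.Series(
    ["view"] * 5 + ["cart"] * 3 + ["remove_from_cart"] + ["purchase"],
    name="event_type",
)

# normalize=True returns fractions that sum to 1 instead of raw counts.
event_share = events.value_counts(normalize=True)

print(event_share)
```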

What are the 'Event type' totals?

Why are there non-positive values in the 'price' column?

We will remove 'product refunds' (the non-positive price values) since they are beyond the scope of this analysis.
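Filtering out the non-positive prices can be done with a boolean mask. This sketch uses invented rows, and `product_id` is an assumed column name.

```python
import pandas as pd

# Toy frame with a zero and a negative price (values are made up).
df = pd.DataFrame({
    "product_id": [1, 2, 3, 4],   # assumed column name
    "price": [4.5, 0.0, -7.94, 12.0],
})

# Keep only rows with a strictly positive price.
df = df[df["price"] > 0].reset_index(drop=True)

# Count how many non-positive prices are left (should be zero).
remaining = int((df["price"] <= 0).sum())
print(remaining)
```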

Checking to see if any still exist ...

Looks like we are ok now.

What's the average price for a product?

Let's see the unique value counts from select columns

What are the unique values for event_type?

How many unique Users are there?

Creating new variables

Let's create our Y dependent column

1 for 'purchased' and 0 for everything else
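The 1/0 target described above can be built by comparing `event_type` to the purchase label. The column name `purchased` is a hypothetical choice for this sketch, and the rows are invented.

```python
import pandas as pd

# Toy events using the four event types seen in this dataset.
df = pd.DataFrame({
    "event_type": ["view", "cart", "remove_from_cart", "purchase"],
})

# True where the event is a purchase, then cast True/False to 1/0.
df["purchased"] = (df["event_type"] == "purchase").astype(int)

print(df)
```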

Let's get the number of total purchases

Let's get the number of purchases per user
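A sketch of both counts, total purchases and purchases per user, assuming a `user_id` column and the hypothetical 1/0 column `purchased`. Using `transform` keeps one value per event row, so the result can be stored as a new column. The rows are invented.

```python
import pandas as pd

# Toy events for three users (values are made up).
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],   # assumed column name
    "event_type": ["view", "purchase", "purchase", "view", "purchase", "view"],
})
df["purchased"] = (df["event_type"] == "purchase").astype(int)

# Total number of purchase events in the frame.
total_purchases = int(df["purchased"].sum())

# Per-user purchase count, broadcast back onto every row of that user.
df["purchases_per_user"] = df.groupby("user_id")["purchased"].transform("sum")

print(total_purchases)
print(df)
```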

Let's create a number of sessions per user column
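One possible way to count sessions per user is to count distinct `user_session` values within each user. `user_id` and the new column name `sessions_per_user` are assumptions for this sketch; the rows are invented.

```python
import pandas as pd

# Toy events: user 1 has two sessions, user 2 has one (values are made up).
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2],          # assumed column name
    "user_session": ["a", "a", "b", "c"],
})

# Number of distinct sessions per user, repeated on each of the user's rows.
df["sessions_per_user"] = (
    df.groupby("user_id")["user_session"].transform("nunique")
)

print(df)
```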

Let's get the event levels per user

Let's get the average event duration per user in minutes
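The notebook does not spell out how "event duration" is defined, so the following is only one plausible reading: take the time between the first and last event of each session, then average those spans per user. The column name `user_id` and all sample values are assumptions.

```python
import pandas as pd

# Toy events: user 1 has sessions of 10 and 30 minutes, user 2 one of 5 minutes.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],    # assumed column name
    "user_session": ["a", "a", "b", "b", "c", "c"],
    "event_time": pd.to_datetime([
        "2020-01-01 10:00", "2020-01-01 10:10",
        "2020-01-02 09:00", "2020-01-02 09:30",
        "2020-01-03 08:00", "2020-01-03 08:05",
    ]),
})

# First and last timestamp of every (user, session) pair.
bounds = df.groupby(["user_id", "user_session"])["event_time"].agg(["min", "max"])

# Session length in minutes.
session_minutes = (bounds["max"] - bounds["min"]).dt.total_seconds() / 60

# Average session length per user.
avg_minutes_per_user = session_minutes.groupby(level="user_id").mean()

print(avg_minutes_per_user)
```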

Day of the week with the most purchases, per user

The day that most purchases occur is day 6, Sunday
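For reference, pandas numbers weekdays from Monday = 0 to Sunday = 6, which is why day 6 reads as Sunday. The sketch below finds the busiest purchase day overall (not per user) on invented rows.

```python
import pandas as pd

# Toy events: 2020-01-05 was a Sunday, 2020-01-06 a Monday.
df = pd.DataFrame({
    "event_time": pd.to_datetime(
        ["2020-01-05", "2020-01-05", "2020-01-06", "2020-01-05"]
    ),
    "event_type": ["purchase", "purchase", "purchase", "view"],
})

# Keep purchase events only, then count them by weekday number.
purchases = df[df["event_type"] == "purchase"]
purchases_by_day = purchases["event_time"].dt.dayofweek.value_counts()

# Weekday number with the highest purchase count.
busiest_day = purchases_by_day.idxmax()
print(busiest_day)
```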

Let's investigate outliers with box plots

There are many outliers and they will be removed so as not to skew the data.

Let's identify our outliers
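The notebook does not state which outlier rule was applied. A common choice that matches what a box plot draws is the 1.5 x IQR rule, sketched here on invented prices; treat it as an assumption rather than the method actually used.

```python
import pandas as pd

# Toy prices with one extreme value (values are made up).
prices = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 250.0], name="price")

# Quartiles and interquartile range.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Box-plot whisker limits: anything outside them counts as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
trimmed = prices[(prices >= lower) & (prices <= upper)]

print(outliers.tolist())
```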

Adjusted - Price Box plot

Investigate Correlations
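A sketch of ranking features by their Pearson correlation with the target. `event_type_level` is a feature name used in this notebook; `purchased` is a hypothetical name for the Y column, and every value below is invented.

```python
import pandas as pd

# Toy feature table (values are made up).
df = pd.DataFrame({
    "purchased": [0, 0, 1, 1, 0, 1],
    "event_type_level": [1, 1, 4, 4, 2, 4],
    "price": [5.0, 3.0, 4.0, 6.0, 2.0, 5.0],
})

# Correlation of every feature with the target, strongest first.
correlations = (
    df.corr()["purchased"]
    .drop("purchased")
    .sort_values(ascending=False)
)

print(correlations)
```

The resulting series can be passed straight to a bar plot, for example with `correlations.plot.bar()` when matplotlib is installed.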

Bar plot of Correlated Features

Notes:

Achieving accuracy of more than 50 percent is improbable when a prediction depends on psychology, since we don't know the user's mindset at the time of purchase. In some fields, such as any field that attempts to predict human behavior, a low R-squared value is entirely expected. That is precisely what we are doing in this analysis. Considering this, an R-squared of 0.499 is a good fit for the model.

The coefficient for event_type_level shows that for every one-unit increase in event_type_level, the dependent variable Y (purchase or no purchase) increases by 22 units.
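To make the two quantities in these notes concrete, here is a sketch of how a slope coefficient and R-squared come out of an ordinary least squares fit. It assumes a plain linear model with a single feature, which may not match the model actually fitted in this analysis, and the data points are invented.

```python
import numpy as np

# Toy data: x plays the role of event_type_level, y is the 1/0 target.
x = np.array([1, 1, 2, 2, 3, 4, 4, 4], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)

# Design matrix with an intercept column, then a least squares solve.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# R-squared: share of the variance in y explained by the fitted line.
y_hat = X @ beta
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(slope, r_squared)
```

In a fit like this the slope is read as the change in the predicted value of Y for a one-unit change in the feature.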

Summary

Event Type percentages:
| Event type | Share |
| --- | --- |
| view | 0.49 |
| cart | 0.29 |
| remove_from_cart | 0.15 |
| purchase | 0.07 |

Insights

Expensive shipping may be a significant reason why users remove products from the cart.

A key next step would be to analyze why users remove nearly half of their cart contents before making a purchase. This may indicate that shipping prices are too high relative to the total amount of the purchase. For example, a user may find it difficult to justify an 8 Euro purchase while having to pay 10 additional Euros for shipping.

Recommendations

  1. Get shipping prices. Expensive shipping is a likely reason users remove products from the cart. Also get variables for multiple shipping options, if available.
  2. Get a complete list of product names and categories. This dataset identifies most products only by a reference number.
  3. Get 1 or more years of data. Purchase events make up a very small percentage of this dataset.
  4. Online user polling may help to reveal any user/usage 'pain points'.

Exporting analyzed dataset to CSV file
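A sketch of the export using `DataFrame.to_csv`. The file name and the toy columns are hypothetical. Passing `index=False` keeps the row index out of the file, so re-reading it does not produce another `Unnamed: 0` column.

```python
import os
import tempfile

import pandas as pd

# Toy result frame (values and column names are made up).
df = pd.DataFrame({"user_id": [1, 2], "purchased": [0, 1]})

# Hypothetical output location inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "ecommerce_analyzed.csv")

# Write without the index so no extra column appears on re-read.
df.to_csv(out_path, index=False)

# Read the file back to confirm the columns round-trip cleanly.
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```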