Linus Larsson

Google Analytics Related Products with R

I finally managed to create something really beautiful! With the following script in R you will be able to find out what products your visitors are buying in the same transaction. I will explain how it works one piece at a time. It's never a good idea to simply copy paste a script without understanding it, but if you think you can do no wrong then you can find the script in it's full version on GitHub

NOTE: If you have a lot of products and you find yourself receiving sampled data you should consider collecting the data from BigQuery instead (given that you are using GA360). That solution can be found commented in the script on GitHub. Make sure not to run ga_auth() if you plan on using BigQuery!


First off, make sure to install the packages that you need to run the functions. Then import the libraries and authorize a connection to Google Analytics. 

The next part of the script is a function to collect all Google Analytics data in a data frame structured with product SKUs in pairs. I'll split it up in order for you to better understand what's happening.

alsoBoughtTable <- function(id, start, end){
  ga <- google_analytics_3(id = id,
                           start = start,
                           end = end,
                           dimensions = c("date","transactionId","productSku"), 
                           metrics = c("itemQuantity"),
                           samplingLevel = "WALK",
                           max_results = 999999999)
  ga <- ga[1:3]
  ga <- subset(ga, !(duplicated(ga[2:3])))

The first part of the function imports all product SKUs and transaction IDs together with item quantity by date. For our use we don't need the metrics (but we have to use something to be allowed to gather data) so the next line removes that column. Then we also delete any possible duplicated transactions.

cross <- matrix(nrow=0, ncol=4)
  colnames(cross) <- c("date","productSku","alsoBought","transactionId")
  dates <- unique(ga$date)
  for(d in 1:(length(dates))){

As you might have guessed we will have to run loops to combine the related products together. First off we create an empty matrix with the correct column names. Then we create a vector of unique dates that we can use to create a loop that will run once for each unique date. 

products <- unique(ga[which(ga$date == dates[d]),'productSku'])
    cr <- matrix(nrow=0, ncol=3)
    colnames(cr) <- c("productSku","alsoBought","transactionId")
    for (i in 1:(length(products))){

So what are we gonna put in the loop? Well of course another loop! :) First we create a new vector containing all unique product SKUs for that given day.  This will be used for running the loop. It should run once for every single product SKU (yeah it's gonna take some time to run this shit). Then we once again create an empty matrix that we're going to bind with our daily data. 

receipts <- ga[which(ga$productSku == products[i] & ga$date == dates[d]),'transactionId']
      bp <- subset(ga, transactionId %in% receipts)
      bp <- data.frame(productSku = products[i],
                       alsoBought = bp$productSku,
                       transactionId = bp$transactionId)
      cr <- rbind(cr, bp)

In this loop we will start by creating a vector containing all transaction IDs  where the specific product SKU was included for that specific date. This is then used to create a subset from the entire Google Analytics data frame where the transaction ID is included in our receipts vector.

if(i == 1 | i%%10 == 0 | i == length(products)){
            d, " of ", length(dates)," dates running: ",round(i*100/(length(products)),1), "% computed"

The last part of this inner loop is a simple but very handy printer. It displays in the console how many days have been completed and the percentage until the current day processed will be done. 

    cross <- rbind(cross, cbind(data.frame(date = dates[d]),cr))
  cross$productSku <- as.character(cross$productSku)
  cross$alsoBought <- as.character(cross$alsoBought)
  return (cross)

Then we finish the outer loop by simply combining the daily data with the entire data. In order to make the data frame a bit more "clean" I change the columns from factor to character. You can skip these lines if you prefer factors (but I really don't see it). 

Now you have the function that will allow you to gather the data in a table format that can be used to make calculations. You now have two options; do you want to see the results in R or would you rather export the data in order to use it in a BI tool and do the calculations there? I prefer to use it as a data source in a BI tool since that allows me to explore the data in a better way. I can relate the product SKU with other variables easily and pick different dates faster. You can of course export the data to a csv file and import it to a BI tool but I suggest you look into the way I do it, which is to import the data directly to BigQuery from R. This makes it possible to append new dates easily and you will not have to run the script for a lot of days at once if you make sure to run it once a week or even more often.

If, however, you decide to run the calculations in R then you can use the function below. Which is going to group and summarize all combined values based on unique transaction IDs and merge it with the data for all transactions where the product SKU was included. Then there's a simple line for calculating the share of all receipts that contained both products.

calculateReceiptShare <- function(productDf){
  share <- group_by(productDf, productSku, alsoBought) %>%
    summarise(uniquePurchases = n_distinct(transactionId)) %>%
  receipts <- group_by(productDf, productSku) %>%
    summarise(allReceipts = n_distinct(transactionId))
  share <- merge(share,receipts, by = "productSku", all.x = TRUE)
  share$shareOfAllReceipts <- round(share$uniquePurchases / share$allReceipts, 2)
  return (share)

If you are only interested in seeing the statistics and you don't want a table to save for later you could run the function below. This will collect the table but only use it to gather the statistics and return them for you. 

alsoBought <- function(id, start, end){
  ga <- alsoBoughtTable(id, start, end)
  ga <- calculateReceiptShare(ga)
  ga <- ga[order(-ga$uniquePurchases),]
  return (ga)

To actually run the functions you simply run the following lines. I think you're smart enough to understand how they work :)

ga_id <- XXXXXXX    
df <- alsoBoughtTable(ga_id, Sys.Date()-30, Sys.Date()-1)
df <- alsoBought(ga_id, Sys.Date()-30, Sys.Date()-1)

I hope this was helpful and please comment if you have any questions and you can find the script in full version on GitHub.


Leave a Reply

Your email address will not be published. Required fields are marked *


Cookie Settings

© Copyright - - 2018-2024