Linus Larsson

Automatically store screenshots of your website with R

How cool would it be to see how your website, or anyone else's, looked on a specific day? In this post I will show you an R script that saves screenshots of any website you want and stores the images in Google Cloud Storage, although you can easily store them anywhere else.

Installing the packages

If you are going to run the script on a local machine, you can install the relevant packages and load them as shown below.

install.packages("magick")
install.packages("webshot")
install.packages("googleAuthR")
install.packages("googleCloudStorageR")
library(magick)
library(webshot)
library(googleAuthR)
library(googleCloudStorageR)
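
Calling install.packages() every time the script runs downloads packages that are already installed. A minimal sketch of a guard is shown below. The helper name missing_packages is my own and not part of any package; it simply compares the required names against what is already installed.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

required <- c("magick", "webshot", "googleAuthR", "googleCloudStorageR")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that the check itself has no side effects.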

If, however, you want to schedule the screenshots in Google Cloud, installing the packages is a bit trickier. First of all, you have to install PhantomJS for the webshot package to work. On a local machine you can usually install it from R with webshot::install_phantomjs(), but on my Google Cloud virtual machine it refused to install that way, so we have to install it manually from the terminal instead. Run the code below to install PhantomJS on your virtual machine. Notice that you have to run it in the terminal, not the console! I found the code on Stack Overflow, and you can check out the original post here.

sudo apt-get update
sudo apt-get install build-essential chrpath libssl-dev libxft-dev -y
sudo apt-get install libfreetype6 libfreetype6-dev -y
sudo apt-get install libfontconfig1 libfontconfig1-dev -y
cd ~
export PHANTOM_JS="phantomjs-2.1.1-linux-x86_64"
wget https://github.com/Medium/phantomjs/releases/download/v2.1.1/$PHANTOM_JS.tar.bz2
sudo tar xvjf $PHANTOM_JS.tar.bz2
sudo mv $PHANTOM_JS /usr/local/share
sudo ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin
phantomjs --version
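
Once the commands above have finished, you can also confirm from the R console that the binary is reachable. This is only a sketch: the helper name is_on_path is my own, and it only looks at the PATH, which is where the symlink above puts phantomjs.

```r
# Hypothetical helper: TRUE if an executable with this name is found on the PATH.
is_on_path <- function(binary) {
  nzchar(unname(Sys.which(binary)))
}

if (is_on_path("phantomjs")) {
  message("phantomjs found at ", Sys.which("phantomjs"))
} else {
  message("phantomjs not found on the PATH; webshot will not be able to capture pages")
}
```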

Next up, the magick package needs some help from the terminal as well, since it has to be built from source in Google Cloud and the build fails without the ImageMagick development library. This one is a bit easier. Simply install the library by running the following line in the terminal; after that, install.packages("magick") works as usual. You can find more info about the installation and the features of the magick package here.

sudo apt-get install libmagick++-dev

Capturing the screenshots

To capture the screenshots we first have to choose what pages will be captured. I recommend creating a data frame that includes both the URL and the desired filename for the screenshot.

pages <- data.frame(
  url = c(
    "https://lynuhs.com/",
    "https://www.theverge.com/"
  ),
  fileName = c(
    "lynuhs_homepage",
    "theverge_homepage"
  ),
  fileFormat = "jpg",
  stringsAsFactors = FALSE
)

# This line will make the files searchable by dates
pages$fileName <- paste0(pages$fileName, "_", format(Sys.Date(),"%Y%m%d"), ".", pages$fileFormat)
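
To make the naming scheme explicit, here is a small sketch with a fixed date instead of Sys.Date(). The helper name dated_file_name is hypothetical and only wraps the paste0 call from above.

```r
# Hypothetical helper: builds "<name>_<YYYYMMDD>.<ext>" for a given date.
dated_file_name <- function(name, ext, date = Sys.Date()) {
  paste0(name, "_", format(date, "%Y%m%d"), ".", ext)
}

example <- dated_file_name("lynuhs_homepage", "jpg", as.Date("2024-03-01"))
print(example)
# [1] "lynuhs_homepage_20240301.jpg"
```

Because the date is written as year, month, day, sorting the files by name also sorts them chronologically.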

Now that we know what we want to capture, we can run the loop that does the actual capturing. For each page we capture the screenshot with webshot, read it as an image with magick and convert it from png to jpg to make the file smaller. We then write the image to the project folder and upload it to Google Cloud Storage, although you can change the code to store it anywhere you want. Afterwards, we delete the image from the project folder so that it does not take up unnecessary disk space.

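for (n in seq_len(nrow(pages))) {
  # Capture to a temporary png first; the 6 second delay gives the page time to load fully
  png_file <- tempfile(fileext = ".png")
  webshot(pages[n, "url"], file = png_file, delay = 6)

  # Convert to the target format (jpg) to reduce the file size, then write it to the project folder
  img <- image_read(png_file)
  img <- image_convert(img, pages[n, "fileFormat"])
  img_path <- image_write(img, pages[n, "fileName"])

  # Replace the values below with your own bucket and folder
  gcs_upload(file = img_path,
             name = paste0("screenshots/dummy/", pages[n, "fileName"]),
             bucket = "lynuhs-screenshots_dummy",
             predefinedAcl = "publicRead")

  # Delete the local copies so they don't take up disk space
  file.remove(png_file, img_path)
}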
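
The loop above stops at the first error, so on a scheduled run a single page that fails to load would prevent the remaining pages from being captured. A minimal sketch of one way around that is to wrap each iteration in tryCatch. The names capture_all and fake_capture are hypothetical, and fake_capture is only a stand-in for the real capture and upload step.

```r
# Hypothetical helper: calls `capture_fun` on each URL and records whether it succeeded.
capture_all <- function(urls, capture_fun) {
  ok <- vapply(urls, function(u) {
    tryCatch({
      capture_fun(u)
      TRUE
    }, error = function(e) {
      message("Failed for ", u, ": ", conditionMessage(e))
      FALSE
    })
  }, logical(1))
  data.frame(url = urls, ok = unname(ok), stringsAsFactors = FALSE)
}

# Stand-in for the real capture and upload step; it fails for one URL on purpose.
fake_capture <- function(u) {
  if (grepl("broken", u)) stop("page did not load")
  invisible(u)
}

result <- capture_all(c("https://lynuhs.com/", "https://broken.example/"), fake_capture)
print(result)
```

The returned data frame makes it easy to log or retry only the pages that failed.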

And that's it! Not a super advanced script, but extremely useful nonetheless! Make sure you replace the Google Cloud bucket information with your own. In my script I use folders within a bucket to organize the images, hence the paste0 function instead of just the file name. Also make sure the uploaded file is publicly readable, otherwise the image URL will only work for people who have access to your Google Cloud Storage, and that would be a problem if you want to use the image in a report or dashboard.
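
For reference, a publicly readable object in Google Cloud Storage can be reached at https://storage.googleapis.com/ followed by the bucket name and the object name. The sketch below builds that URL for one screenshot. The helper name public_url and the example file name are my own, and the sketch assumes the object name contains no characters that need URL encoding.

```r
# Hypothetical helper: builds the public URL for an object in a bucket.
public_url <- function(bucket, object) {
  paste0("https://storage.googleapis.com/", bucket, "/", object)
}

url <- public_url("lynuhs-screenshots_dummy",
                  "screenshots/dummy/lynuhs_homepage_20240301.jpg")
print(url)
```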

© Copyright - Lynuhs.com - 2018-2024