How to create your own cloud crawler with Screaming Frog on Google Cloud

5 min read · Dec 20, 2020

While setting up my own Screaming Frog instance on Google Cloud last month, I realized that Screaming Frog has a bug that causes scheduled crawls to fail, and that the guides online are outdated or hard to follow.
So I decided to write my own guide, not only to teach others but also to remember the steps for my next setup :).

With this article, you will be able to crawl your sites automatically and create your own dashboard in Data Studio. Running SF in the cloud also lets me check multiple sites at once or crawl big sites without interrupting my workflow.

Two important things you need to have:

A Google account with billing enabled.

A Screaming Frog licence.

Let’s start by creating the VM instance.

Go to https://console.cloud.google.com/projectselector2/home/dashboard?supportedpurview=project
Create a new project

Activate your free trial :) It will take some time to get approved (around 10 to 15 minutes). I also suggest you create a budget to limit your spending in case something goes wrong.

From the menu select VM instances

Enable billing

Configure your virtual machine’s system resources. Depending on how big your site is, you can increase or decrease the RAM and CPU.

Select the operating system and the disk type and size. I prefer to use Ubuntu as it’s the best option for the free tier :D. It’s important to use an SSD disk so that you can run Screaming Frog in database storage mode. The limit is 250GB for some regions.

Make sure that you allow full access to all Cloud APIs and allow HTTP/HTTPS traffic. Then click the Create button.
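If you prefer the command line to the console, the same settings can be expressed roughly as a gcloud command. This is only a sketch, not the method used in this guide: the instance name, zone, machine type, disk size, and Ubuntu image family are placeholder assumptions, and the network tags only open HTTP/HTTPS if the matching firewall rules already exist in your project. The function prints the command so you can review it before running it yourself.

```shell
# Hypothetical values: change these to match your project.
VM_NAME="sf-crawler"
ZONE="europe-west1-b"
MACHINE_TYPE="e2-standard-4"   # pick more vCPUs/RAM for bigger sites
DISK_SIZE="100GB"              # SSD size; check your regional quota

# Print (do not run) a gcloud command that mirrors the console settings.
build_vm_command() {
  printf '%s ' \
    gcloud compute instances create "$VM_NAME" \
    --zone="$ZONE" \
    --machine-type="$MACHINE_TYPE" \
    --image-family=ubuntu-2004-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-type=pd-ssd \
    --boot-disk-size="$DISK_SIZE" \
    --scopes=cloud-platform \
    --tags=http-server,https-server
  printf '\n'
}

build_vm_command
```

Here --scopes=cloud-platform corresponds to allowing full access to all Cloud APIs, and --boot-disk-type=pd-ssd selects the SSD disk.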

After about a minute your virtual machine will be ready to use.
First we will install Chrome Remote Desktop, so open the SSH terminal in the browser window.

Copy and paste the following code.

wget https://dl.google.com/linux/direct/chrome-remote-desktop_current_amd64.deb

You should see the download progress, ending with a message that the .deb file was saved.

Update the package lists

sudo apt-get update -y

Then install dependencies of remote desktop

sudo apt-get install -y xvfb

Install xbase-clients

sudo apt-get install -y xbase-clients

Finally, install the remote desktop, so again copy-paste the following code.

sudo apt install ./chrome-remote-desktop_current_amd64.deb

Then check the status of the service

systemctl status chrome-remote-desktop
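Before moving on, it can help to confirm that the packages from the steps above really are installed. The following is a small sketch that asks dpkg for each package's status; the helper name is_installed is just an illustration, and the package list is taken from the commands in this guide.

```shell
# Return 0 if the named Debian package is fully installed, non-zero otherwise.
is_installed() {
  dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q "install ok installed"
}

# Report the state of every package installed so far in this guide.
for pkg in xvfb xbase-clients chrome-remote-desktop; do
  if is_installed "$pkg"; then
    echo "$pkg: installed"
  else
    echo "$pkg: MISSING"
  fi
done
```

If anything is reported as missing, re-run the matching install command before continuing.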

The next step is to install the desktop environment (the user interface)

sudo apt-get update && sudo apt-get upgrade

Next install tasksel

sudo apt-get install tasksel

Next, install slim and select slim when the prompt appears

sudo apt-get install slim

Install the GNOME desktop with tasksel

sudo tasksel

Make sure only the Ubuntu Desktop is selected and press ok.

Wait for the installation to finish.

Next, go to https://remotedesktop.google.com/headless

Click Next (we have already installed it).

Click Authorise.

Copy the Linux command, paste it into your SSH terminal, press Enter, and create your PIN.

Then open https://remotedesktop.google.com/access and your VM will appear under remote devices. Use the same PIN you entered during the authentication to log in.

Use the browser inside the remote desktop to download Screaming Frog.

Install Screaming Frog. Change the screamingfrogseospider_14.1_all.deb part to match the version you downloaded.

sudo apt-get install ~/Downloads/screamingfrogseospider_14.1_all.deb
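Since the file name changes with every release, one option is to keep the version in a variable and check that the installer exists before installing. This is a minimal sketch: sf_deb_path is a made-up helper name, and it assumes the browser saved the file to ~/Downloads as in the command above.

```shell
# Version of the .deb you downloaded (14.1 is the version used in this guide).
SF_VERSION="14.1"

# Build the expected path of the downloaded installer for a given version.
sf_deb_path() {
  echo "$HOME/Downloads/screamingfrogseospider_${1}_all.deb"
}

DEB="$(sf_deb_path "$SF_VERSION")"
if [ -f "$DEB" ]; then
  echo "Found $DEB; install it with: sudo apt-get install \"$DEB\""
else
  echo "Installer not found at $DEB; check the version or the download folder."
fi
```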

Accept the agreement

Now you have installed Screaming Frog.

Enter your licence, set your config, and restart.

Schedule your crawl and export your reports to Google Sheets (this is a cool new feature that uploads the files to Google Sheets without any coding).

The next time you open Screaming Frog, you will see an error. Inside the log, it shows up as a JavaFX error like "Caused by: java.lang.UnsupportedOperationException: Unable to open DISPLAY".

I contacted the Screaming Frog dev team, and here is the solution they came up with, which works well: https://www.screamingfrog.co.uk/configure-x-virtual-framebuffer/

We already installed xvfb, but just in case, install it again

sudo apt-get install -y xvfb

Configuration

Copy and paste the following on the command line to configure the service:

sudo tee /etc/systemd/system/xvfb.service <<HERE > /dev/null
[Unit]
Description=X Virtual Frame Buffer Service
After=network.target
[Service]
ExecStart=/usr/bin/Xvfb :0 -screen 0 1024x768x24
[Install]
WantedBy=multi-user.target
HERE

Service Registration

Now register and start the service.

sudo systemctl daemon-reload
sudo systemctl enable xvfb.service
sudo systemctl start xvfb.service

Then set this as the display to use.

echo "export DISPLAY=:0" >> ~/.bashrc
source ~/.bashrc
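To confirm the change took effect, you can check that DISPLAY is set in your shell and ask systemd whether the service is running. This is a quick sanity check rather than part of the official fix; display_is_set is a made-up helper name.

```shell
# Return 0 if a DISPLAY value is set for the current shell.
display_is_set() {
  [ -n "${DISPLAY:-}" ]
}

if display_is_set; then
  echo "DISPLAY is set to $DISPLAY"
else
  echo "DISPLAY is not set; re-run: source ~/.bashrc"
fi

# Ask systemd for the service state (skipped if systemctl is not available).
if command -v systemctl >/dev/null 2>&1; then
  systemctl is-active xvfb.service || echo "xvfb.service is not active"
fi
```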

Now you can schedule your crawls.
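This guide schedules crawls from the Screaming Frog interface, but the same kind of crawl can also be started from the terminal, which is handy for testing that the DISPLAY fix works. The sketch below only prints a command for you to review. The URL and output folder are placeholders, and the flags are my understanding of the Screaming Frog command line, so check them against screamingfrogseospider --help for your version before relying on them.

```shell
# Hypothetical values for illustration.
SITE_URL="https://example.com"
OUTPUT_DIR="$HOME/crawls"

# Print (do not run) a headless crawl command for a URL ($1) and output folder ($2).
build_crawl_command() {
  echo "screamingfrogseospider --crawl $1 --headless --save-crawl --output-folder $2 --timestamped-output --export-tabs Internal:All"
}

build_crawl_command "$SITE_URL" "$OUTPUT_DIR"
```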

When the crawl ends, you will have a date-stamped folder in your Google Drive.

Just to give you an example, I will create a very simple dashboard in Data Studio.

Go to https://datastudio.google.com/navigation/reporting and click on Create report.

Connect Google Sheets to Data Studio.

You can connect your reports and create custom dashboards as you wish :)

Feel free to contact me if you need help with the setup.
https://www.linkedin.com/in/emrecansanli/

Emrecan Sanli
Written by Emrecan Sanli

With 10+ years of experience in technical SEO and web development, I help businesses boost their online visibility and performance.
