How to create your own cloud crawler with ScreamingFrog in Google Cloud.
While I was trying to set up my own ScreamingFrog instance on Google Cloud last month, I realized that ScreamingFrog has a bug that causes scheduled crawls to fail, and that the guides online are outdated or hard to understand.
Therefore, I decided to write my own guide, not only to teach others but also to remember the steps for my next setup :).
With this article, you will be able to crawl your sites automatically and create your own dashboard in Data Studio. Using SF in the cloud also helped me check multiple sites at once and crawl big sites without interrupting my workflow.
Two important things you need to have:
A Google account with billing enabled.
A ScreamingFrog licence.
Let’s start with creating the VM instance.
Go to https://console.cloud.google.com/projectselector2/home/dashboard?supportedpurview=project
Create a new project
Activate your free trial :) It will take some time to get approved (around 10–15 minutes). Also, I suggest you create a budget to limit your spending in case something goes wrong.
From the menu select VM instances
Enable billing
Configure your virtual machine's system resources. Depending on how big your site is, you can increase or decrease the RAM and CPU size.
Select the operating system and the disk type and size. I prefer to use Ubuntu as it's the best option for the free tier :D. It's important to use an SSD disk in order to use ScreamingFrog in database storage mode. In some regions the disk limit is 250 GB.
Make sure that you allow full access to all Cloud APIs and allow HTTP/HTTPS traffic. Then click the Create button.
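If you prefer the command line over the console, the same VM can be sketched with the gcloud CLI. This is only a rough equivalent of the console steps, not what I clicked through: the instance name, zone, machine type, disk size and Ubuntu image family are placeholder assumptions, so swap in your own values. By default the script only prints the command; set DRY_RUN=0 to really create the VM.

```shell
# Placeholder values (assumptions, not taken from the console walkthrough).
INSTANCE_NAME="${INSTANCE_NAME:-sf-crawler}"
ZONE="${ZONE:-europe-west1-b}"
MACHINE_TYPE="${MACHINE_TYPE:-e2-standard-4}"   # 4 vCPUs, 16 GB RAM
DISK_SIZE="${DISK_SIZE:-200GB}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Ubuntu image, SSD boot disk, full Cloud API access, HTTP/HTTPS network tags.
create_vm() {
  run gcloud compute instances create "$INSTANCE_NAME" \
    --zone="$ZONE" \
    --machine-type="$MACHINE_TYPE" \
    --image-family=ubuntu-2004-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-type=pd-ssd \
    --boot-disk-size="$DISK_SIZE" \
    --scopes=cloud-platform \
    --tags=http-server,https-server
}

create_vm
```

Keep in mind that the tags only label the VM. The console checkboxes also create the matching firewall rules, which this sketch does not do.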
After about a minute your virtual machine will be ready to use.
First, we will install Chrome Remote Desktop, so open the SSH terminal in the browser window.
Copy and paste the following code.
wget https://dl.google.com/linux/direct/chrome-remote-desktop_current_amd64.deb
wget should report that the file was saved.
Update the package list
sudo apt-get update -y
Then install the dependencies of Chrome Remote Desktop
sudo apt-get install -y xvfb
Install xbase-clients
sudo apt-get install -y xbase-clients
Finally, install Chrome Remote Desktop itself, so again copy and paste the following code.
sudo apt install ./chrome-remote-desktop_current_amd64.deb
Then check the status of the service
systemctl status chrome-remote-desktop
The next step is to install the user interface
sudo apt-get update && sudo apt-get upgrade
Next install tasksel
sudo apt-get install tasksel
Next, install slim and select slim when the prompt appears
sudo apt-get install slim
Install GNOME
sudo tasksel
Make sure only Ubuntu Desktop is selected and press OK.
Wait for the installation to finish.
Next, go to https://remotedesktop.google.com/headless
Click Next (we have already installed it).
Click Authorize.
Copy the Linux command, paste it into your SSH terminal, press Enter, and create your password.
Then go to https://remotedesktop.google.com/access and your VM will appear under remote devices. Use the same password you entered during the authorization to log in.
Use the browser inside the VM to download ScreamingFrog.
Install ScreamingFrog. Change the screamingfrogseospider_14.1_all.deb part depending on your version.
sudo apt-get install ~/Downloads/screamingfrogseospider_14.1_all.deb
Accept the agreement
Now you have installed ScreamingFrog.
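If you do not want to edit the version number by hand, a small helper can pick the newest installer from your Downloads folder. This is just a sketch: latest_sf_deb and install_sf are names I made up, and it assumes the file keeps the screamingfrogseospider_<version>_all.deb naming and that GNU sort (for sort -V) is available, which is the case on Ubuntu.

```shell
# Print the path of the newest Screaming Frog .deb in a directory
# (default: ~/Downloads). Prints nothing if no installer is found.
latest_sf_deb() {
  dir="${1:-$HOME/Downloads}"
  # sort -V orders version numbers naturally, so 14.10 sorts after 14.2.
  ls "$dir"/screamingfrogseospider_*_all.deb 2>/dev/null | sort -V | tail -n 1
}

# Install the newest .deb found by latest_sf_deb.
install_sf() {
  deb="$(latest_sf_deb "$@")"
  if [ -z "$deb" ]; then
    echo "No screamingfrogseospider_*_all.deb found" >&2
    return 1
  fi
  sudo apt-get install "$deb"
}

# Usage on the VM:
#   install_sf
```

You still have to accept the agreement during the installation, exactly as with the manual command.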
Enter your license, set your config, and restart.
Schedule your crawl and export your reports to Google Sheets. (This is the cool new feature that lets us upload the files to Google Sheets without coding.)
The next time you open ScreamingFrog, you will see an error. Inside the log, the error looks like this: JavaFX - Caused by: java.lang.UnsupportedOperationException: Unable to open DISPLAY.
I contacted the ScreamingFrog dev team, and here is the solution they came up with, which works well: https://www.screamingfrog.co.uk/configure-x-virtual-framebuffer/
We already installed Xvfb, but just in case, install it again
sudo apt-get install -y xvfb
Configuration
Copy and paste the following on the command line to configure the service:
sudo tee /etc/systemd/system/xvfb.service <<HERE > /dev/null
[Unit]
Description=X Virtual Frame Buffer Service
After=network.target
[Service]
ExecStart=/usr/bin/Xvfb :0 -screen 0 1024x768x24
[Install]
WantedBy=multi-user.target
HERE
Service Registration
Now register and start the service.
sudo systemctl enable /etc/systemd/system/xvfb.service
sudo systemctl start xvfb.service
sudo systemctl enable xvfb
Then set this as the display to use.
echo "export DISPLAY=:0" >> ~/.bashrc
source ~/.bashrc
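One small catch: running the echo line twice adds the export to ~/.bashrc twice. A minimal sketch of a repeat-safe variant, plus a quick check that the service is up, could look like this. ensure_line and xvfb_ready are helper names I made up for this example, not part of ScreamingFrog or systemd.

```shell
# Append a line to a file only if that exact line is not already in it.
ensure_line() {
  line="$1"
  file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || echo "$line" >> "$file"
}

# Succeeds only if the xvfb service is active and DISPLAY points at :0.
xvfb_ready() {
  systemctl is-active --quiet xvfb.service && [ "${DISPLAY:-}" = ":0" ]
}

# Usage on the VM:
#   ensure_line 'export DISPLAY=:0' ~/.bashrc
#   . ~/.bashrc
#   xvfb_ready && echo "Xvfb is running on :0"
```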
Now you can schedule your crawls.
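Before you rely on the scheduler, it can be worth running one headless crawl from the SSH terminal to see whether the DISPLAY error is gone. The sketch below uses the ScreamingFrog command line options as I understand them from its CLI documentation, so please double-check them with screamingfrogseospider --help for your version. smoke_crawl is a name I made up, and the URL and output folder are placeholders.

```shell
# Run one headless crawl and save it into a timestamped folder.
# Arguments: URL to crawl (placeholder default), output folder (placeholder default).
smoke_crawl() {
  url="${1:-https://example.com}"
  out_dir="${2:-$HOME/sf-crawls}"
  # Create the output folder first so the crawler has somewhere to write.
  mkdir -p "$out_dir" || return 1
  screamingfrogseospider \
    --crawl "$url" \
    --headless \
    --save-crawl \
    --output-folder "$out_dir" \
    --timestamped-output
}

# Usage on the VM:
#   smoke_crawl https://example.com ~/sf-crawls
```

If this finishes without the Unable to open DISPLAY error in the log, the scheduled crawls should be able to start as well.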
When the crawl ends, you will have a date-stamped folder in Google Drive.
Just to give you an example, I will create a very simple dashboard in Data Studio.
Go to https://datastudio.google.com/navigation/reporting and click Create report.
Connect Google Sheets to Data Studio.
You can connect your reports and create custom dashboards as you wish :)
Feel free to contact me if you need help with the setup.
https://www.linkedin.com/in/emrecansanli/