How to Crawl Enterprise Sites in the Cloud with Screaming Frog

Congratulations! You just landed a new client with a large site, and they want an audit.

However, you find that you’re struggling to crawl the site, and you don’t want to get locked into expensive software just to handle the occasional large site.

Sound familiar? So what do you do in this situation?

Well, the answer is easy and accessible to everyone: deploy Screaming Frog to the cloud using Google’s own cloud infrastructure. This will let you crawl larger sites than your computer could normally handle.

Before you’re up and crawling massive websites in the cloud, there are some preliminary steps to cover. So let’s jump into it!

Before Beginning

You’ll be setting up a Linux operating system and working within the command line. It may be scary at first, but once you get going you’ll find plenty of uses for it. Linux is versatile, open source, and secure, and it’s a good fit for programmers regardless of the language they work in.

  1. Configure Compute Engine API

First, you need to set up your own Google Cloud Compute instance on Google’s cloud platform. 

Navigating the backend can be tricky. The initial dashboard screen can be pretty confusing as there are a lot of options. For this guide, everything you’ll need is in the Compute Engine section.

The Compute Engine API is the service that provides virtual machines that run on Google’s infrastructure. The API provides an interface for interacting with projects, instances, networks, firewalls, etc.

In order to activate and use the API, you’ll need to input billing information. Fortunately, Google provides a $300 credit so you can play around with running crawls before you commit.

After you enable billing, you’ll be taken to another screen. From there, click Enable.

Once you navigate back to the VM instances section (if you get lost you can always use the left-hand navigation to find the Compute Engine) you’ll see a toaster icon in the header that says activate your Free Trial.

Once you activate it you’re good to go!

  2. Create the Virtual Machine

In the top navigation, you’ll see an option to Create Instance. An instance is a virtual machine (VM) hosted on Google’s cloud infrastructure.

Press “Create Instance”, and now you’ll need to configure your VM.

Here’s a sample Virtual Machine setup to get you started:

Machine Configuration

Region & zone: Leave these as the default value

Series: E2

Machine type: e2-standard-8 (8 vCPU, 32 GB memory)

Boot Disk

Boot disk: OS Ubuntu, Version Ubuntu 18.04 LTS, Size 10GB
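
If you’d rather script this than click through the console, the same setup can be sketched with the gcloud CLI. This is a rough sketch under stated assumptions, not a required step: it assumes gcloud is installed and signed in (for example in Cloud Shell), that the instance name sf-crawler and the zone us-central1-a are placeholders you would replace, that the E2 machine type is named e2-standard-8, and that Ubuntu 18.04 is still published as the ubuntu-1804-lts image family in the ubuntu-os-cloud project. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: enable the API and create the sample VM from the command line.
# Hypothetical placeholders: change these to suit your project.
INSTANCE_NAME="sf-crawler"
ZONE="us-central1-a"
MACHINE_TYPE="e2-standard-8"   # assumed E2 series name for 8 vCPU, 32 GB
BOOT_DISK_SIZE="10GB"

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: enable the Compute Engine API for the current project.
run gcloud services enable compute.googleapis.com

# Step 2: create the instance with an Ubuntu 18.04 LTS boot disk.
# The image family and project are assumptions; check they still exist.
run gcloud compute instances create "$INSTANCE_NAME" \
  --zone="$ZONE" \
  --machine-type="$MACHINE_TYPE" \
  --image-family="ubuntu-1804-lts" \
  --image-project="ubuntu-os-cloud" \
  --boot-disk-size="$BOOT_DISK_SIZE"
```

Keeping the values in variables at the top makes it easy to reuse the script for a bigger machine on the next large crawl.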

Helpful Definitions

A machine family is a curated set of processor and hardware configurations optimized for specific workloads.

The machine type specifies a collection of virtualized hardware resources available to a virtual machine (VM) instance. This includes items like memory size, virtual CPU count, and disk.

Each virtual machine instance requires a disk to boot from. The boot disk settings cover the operating system and its version, plus the size and type of the disk.

Some of these settings are awkward to change once the virtual machine is set up: you have to stop the VM before you can change its machine type, and a boot disk can be made larger but never smaller. So, take your time on this step and triple-check everything.

Now that you have your machine configured to your liking, hit Create. From here you’ll get an estimated cost breakdown. It’s critical to remember that you’re charged for as long as the virtual machine is running, and you’re also charged for the data it stores.

So, every time you’re done crawling and auditing sites, you’ll need to shut down the VM. To turn off the machine, click the three dots on the right side of the screen. A drop-down will open with options to start and stop the VM. Choose Stop to shut it down.
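
Because forgetting to stop the VM costs money, it can help to keep the stop and start commands handy. The sketch below assumes the gcloud CLI is available and reuses the hypothetical instance name sf-crawler and zone us-central1-a; the vm_power helper is made up for this example and only prints the command so you can check it first.

```shell
#!/bin/sh
# Hypothetical placeholders: use your own instance name and zone.
INSTANCE_NAME="sf-crawler"
ZONE="us-central1-a"

# Print the gcloud command that stops or starts the VM.
vm_power() {
  case "$1" in
    stop|start)
      echo "gcloud compute instances $1 $INSTANCE_NAME --zone=$ZONE"
      ;;
    *)
      echo "usage: vm_power stop|start" >&2
      return 1
      ;;
  esac
}

vm_power stop    # prints the command to shut the VM down after a crawl
vm_power start   # prints the command to boot it again for the next crawl
```

To execute a command rather than print it, pipe the output to a shell, for example vm_power stop | sh.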

  3. Download and Set Up Screaming Frog

Now that you’ve set up the VM and know how to shut it down, let’s talk about SSH (Secure Shell). SSH is a network protocol that lets one computer securely log in to another and run commands on it. On the screen where you started and stopped your VM, you’ll see another drop-down option labeled SSH.

  • Click the drop-down icon next to SSH
  • Select Open in browser window (a small terminal window should open; if it does, you’re in the right place)

SSH will launch a command-line terminal. Before moving forward, I recommend playing around and getting comfortable with the terminal.

Here are some basic Linux commands to get you started:

pwd      prints the working directory

ls       lists the contents of the current directory

cd       navigates between directories and folders

cd ..    takes you up one level, to the parent directory

cd /     takes you to the root directory

cp       copies files

rm       removes (deletes) unneeded files
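
A safe way to practice these is inside a throwaway directory. The short session below is only a sketch: it uses two extra commands that aren’t in the list above (mktemp to create the temporary directory and mkdir to create a folder), and the file and folder names are made up.

```shell
#!/bin/sh
# Practice run: everything happens inside a throwaway directory.
PRACTICE_DIR=$(mktemp -d)      # create a temporary directory
cd "$PRACTICE_DIR"             # move into it
pwd                            # print where we are now

echo "hello" > notes.txt       # create a small file to play with
cp notes.txt notes-copy.txt    # copy it
ls                             # lists notes-copy.txt and notes.txt

rm notes-copy.txt              # delete the copy
mkdir crawls                   # make a sub-directory
cd crawls                      # step into it
cd ..                          # step back up to the parent directory
ls                             # lists crawls and notes.txt

cd /                           # jump to the root directory
rm -r "$PRACTICE_DIR"          # clean up the practice directory
```

Note the space in cd .. and that rm needs -r to delete a directory and everything in it.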

The first thing you’ll want to do each time you launch a new session in your terminal is check for updates. This is extremely easy with Linux: all you have to do is run the two commands below:

  • sudo apt update
  • sudo apt upgrade

The sudo apt update command downloads package information from all configured sources. Every time you run it, it fetches the latest information on available package versions and their dependencies.

The sudo apt upgrade command installs the available upgrades for every package currently installed on the system, using those same sources. Plus, it looks pretty cool!

sudo temporarily elevates your current user account to root privileges. This comes in handy if you run a command and get the dreaded Permission denied warning.

Helpful hint: If you ever run into an issue downloading a package, you’ll usually get an error message that you can search for on Google. Chances are very high that someone has already encountered that same issue and solved it.

After you update and upgrade the system, run the below commands:

  • sudo apt-get install tasksel
  • sudo apt-get install lightdm
  • sudo apt update

Tasksel will help install multiple related packages at once. 

Lightdm is a lightweight display manager and will allow you to set up a GUI (graphical user interface).

After you’ve installed the packages above, launch tasksel (sudo tasksel) and pick a desktop environment to install.

Here’s a cheat sheet to navigate the tasksel screen:

  • hit Space to make your selection
  • hit Tab to move to the OK button
  • hit Enter to confirm

If everything checks out OK, you will see a package configuration screen.

By default, a newly created Virtual Machine has no GUI associated with it. You will need one to interact with Screaming Frog, at least initially (you can also do a lot of work from the command line).

Next, download the Screaming Frog package (screamingfrogseospider_16.0_all.deb in this guide) from the command line, for example with wget.

Helpful hint: You can swap out 16.0 in the file name for whichever version of Screaming Frog is the most up to date.
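
As a rough sketch of that download step: the file name comes from the install command used in this guide, but the base URL below is an assumption about where Screaming Frog hosts its Linux packages, so check it against Screaming Frog’s download page before relying on it. The script only prints the URL unless you set DOWNLOAD=1.

```shell
#!/bin/sh
# Version to download; swap in the current release number.
SF_VERSION="16.0"
SF_DEB="screamingfrogseospider_${SF_VERSION}_all.deb"

# Assumption: base URL of Screaming Frog's download server. Verify it first.
SF_BASE_URL="https://download.screamingfrog.co.uk/products/seo-spider"
SF_URL="${SF_BASE_URL}/${SF_DEB}"

echo "Package file: ${SF_DEB}"
echo "Download URL: ${SF_URL}"

# Set DOWNLOAD=1 to fetch the file into the current directory with wget.
if [ "${DOWNLOAD:-0}" = "1" ]; then
  wget "$SF_URL"
fi
```

Building the file name from SF_VERSION means a version bump is a one-line change.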

Now, you just need to install Screaming Frog by running the command below:

  • sudo apt-get install /home/username/screamingfrogseospider_16.0_all.deb

You’ll need to replace /home/username with the directory you downloaded the Screaming Frog package to.

Helpful hint: If you can’t remember which directory you downloaded Screaming Frog to, run:

  • find /home/ -name "*screamingfrogseospider_16.0_all.deb"

The find command locates files by name or extension. The command above searches the /home/ directory and all of its subfolders for a file whose name ends in screamingfrogseospider_16.0_all.deb, and prints where it is.

Alternatively, if you’re already in the download directory, type pwd to get the path to put in front of the file name.

Now you have the complete file path.
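
The lookup can also be wrapped in a few lines of shell so the install command is printed with the path already filled in. This is only a sketch: find_deb is a made-up helper name, and it assumes the package sits somewhere under /home. Because find prints the path starting from the directory you search, the result here is already a complete path.

```shell
#!/bin/sh
SF_DEB="screamingfrogseospider_16.0_all.deb"

# Print the path of the first matching package under the given directory.
find_deb() {
  find "$1" -name "$SF_DEB" 2>/dev/null | head -n 1
}

DEB_PATH=$(find_deb /home)

if [ -n "$DEB_PATH" ]; then
  echo "Found: $DEB_PATH"
  echo "Install with: sudo apt-get install $DEB_PATH"
else
  echo "No $SF_DEB found under /home; download it first."
fi
```

The straight double quotes matter: curly quotes pasted from a web page are treated as part of the file name, so the search finds nothing.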

If the process was done correctly, a user agreement will pop up. Accept it by pressing Tab and then Enter.

Voila! You’ve now installed Screaming Frog from within the command line!

  4. Configuring the Remote Desktop

The last step is to configure Chrome Remote Desktop, a software tool developed by Google that allows users to remotely control another computer.

You can use this anytime you want to log into your VM and run Screaming Frog.

Since the VM runs a Linux distribution, you will need to follow Google’s instructions for setting up Chrome Remote Desktop on Linux:

  • Install the Chrome browser
  • Turn on your VM
  • In the SSH window, install the Debian Linux Chrome Remote Desktop installation package

Helpful hint: You can install and configure the Google Chrome browser on your virtual machine from the same SSH window.
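
For reference, here is a sketch of what those installs can look like from the SSH window. The two download URLs are assumptions based on Google’s public Linux package links, and install_deb is a made-up helper, so compare against Google’s current Chrome Remote Desktop setup guide before running anything. The script prints the commands rather than executing them.

```shell
#!/bin/sh
# Assumed download locations for Google's Debian packages; verify before use.
CRD_DEB_URL="https://dl.google.com/linux/direct/chrome-remote-desktop_current_amd64.deb"
CHROME_DEB_URL="https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb"

# Print the two commands that download a .deb and install it with apt,
# which also pulls in the package's dependencies.
install_deb() {
  deb_file=$(basename "$1")
  echo "wget $1"
  echo "sudo apt-get install --assume-yes ./$deb_file"
}

install_deb "$CRD_DEB_URL"      # Chrome Remote Desktop package
install_deb "$CHROME_DEB_URL"   # Google Chrome browser
```

Once the printed commands look right, you can paste them into the terminal one at a time.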

Next, you need to configure and set up the Chrome Remote Desktop service.

On your local computer, not your virtual machine, go to the Remote Desktop page.

You will need to make sure you’re signed in to the same email address you used to set up the VM. Hit Begin to get started.

You will choose the Windows operating system to configure the software locally. Once you click the link, an automated download will start.

Follow the instructions to install the software on your local computer. Once it’s done, hit Next, then click Authorize on the following page.

You should then be given a command to copy and run in your VM’s SSH window. This will complete the setup process.

When you’re prompted, enter a 6-digit PIN. You’ll use this PIN as an extra authorization step each time you connect later.

You might see errors like No net_fetcher or Failed to read. You can ignore these errors.

To verify the service is running, enter the following command:

  • sudo systemctl status chrome-remote-desktop@$USER

If everything is configured correctly, the output will show the service as active (running). You can exit this screen by pressing Ctrl + C.

Now, all you have to do is log in and run Screaming Frog.

  5. Logging in and Running Screaming Frog

Woohoo, you’re almost done!

Go to the Remote Desktop login page

Helpful hint: Do yourself a favor, and save this URL for future reference.

You may see an option to Access My Computer. Click it and select your VM.

Enter your secure PIN, and you’re in!

Before you can start to use your newly configured Virtual Machine, you’ll need to set a password for yourself.

Go back to the SSH terminal in the VM instances section.

Enter the following command:

  • sudo passwd username

Remember to replace username with whatever your username is.

If done correctly, you’ll see a message confirming that the password was updated successfully.

You’ve now configured the Compute Engine API, created a Virtual Machine with a Linux operating system and set up a cloud-based version of Screaming Frog.

Now you can begin creating automated cloud-based workflows or even run one-off audits for Enterprise clients without purchasing expensive software subscriptions!
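
As a starting point for automation, Screaming Frog can also be driven from the command line without the GUI. The sketch below is an illustration under assumptions, not a finished workflow: the site URL and output folder are placeholders, it assumes the screamingfrogseospider command is on your PATH, and headless crawling may depend on your Screaming Frog licence. It prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical placeholders: the site to crawl and where to save results.
SITE_URL="https://www.example.com"
OUTPUT_DIR="${HOME:-/tmp}/crawls"

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Make sure the output folder exists, then crawl without opening the GUI
# and save the crawl into a timestamped sub-folder.
run mkdir -p "$OUTPUT_DIR"
run screamingfrogseospider \
  --crawl "$SITE_URL" \
  --headless \
  --save-crawl \
  --output-folder "$OUTPUT_DIR" \
  --timestamped-output
```

A script like this could be run over SSH or on a schedule, with the VM stopped again once the crawl finishes.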
