HELP! My sitemap contains URLs which are blocked by robots.txt

Oh no! You open Search Console only to find out that the sitemap you submitted contains pages Google can’t crawl.

That’s a bummer. But what does it even mean that there are URLs blocked by the robots file?

What does ‘blocked by robots.txt’ even mean?

According to Google, this error means that a page you submitted for indexing is blocked by your site’s robots.txt file, so Googlebot won’t crawl it.

The robots.txt file is a useful tool for keeping web crawlers from overloading your site with requests. More often than not, the robots.txt file is the first thing a crawler requests from a website.

A few things to keep in mind when working with robots files:

  • Bad crawlers can and often will ignore your robots file, especially malware bots, email harvesters and other general spammers.
  • The robots.txt is a publicly available file, always located at the site root: /robots.txt
  • It’s not a reliable way of ‘removing’ content from the index; a blocked page can still get indexed if other pages link to it
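
To make that concrete, here’s a minimal robots.txt sketch. The /admin/ path is just a placeholder, not a recommendation for your site:

    # Applies to every crawler that honors robots.txt
    User-agent: *
    # Ask crawlers to stay out of this directory
    Disallow: /admin/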

Why you need (and need to maintain) a robots.txt file

Mistakes in robots.txt files can slip through fairly easily, so it’s very important to routinely audit your configuration.

Here are a few reasons to utilize your robots.txt file (example below):

  • Keep resources under control
  • Specify the location of your sitemaps
  • Block internal search results
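
Here’s a hedged sketch of what those three uses can look like in one file. The paths and sitemap URL are placeholders you’d swap for your own setup:

    User-agent: *
    # Keep crawl resources under control by blocking low-value areas
    Disallow: /cgi-bin/
    # Block internal search result pages
    Disallow: /search
    Disallow: /*?s=

    # Specify the location of your sitemap(s)
    Sitemap: https://example.com/sitemap.xml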

Identify where these errors are occurring

Crawl your XML sitemap and filter for all pages blocked by the robots file

The reason: Search Console will only show a sample set of URLs. That’s a good starting point, but if you have a larger site it isn’t enough on its own.

Before you initiate the crawl, you’ll need to adjust Screaming Frog’s default settings, since out of the box it’s set up to obey robots.txt files.

Go to Configuration > robots.txt > Settings and switch the setting to ‘Ignore robots.txt’. But bear this in mind:

“With great power comes great responsibility” – Uncle Ben.

Or something like that.

Sort by indexability status

Once you sort your URLs by indexability status, you should get a picture of which sections of the site are currently being blocked; filtering for the ‘Blocked by robots.txt’ status isolates the offending URLs.
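
If you’d rather script this check than run a desktop crawler, here’s a minimal Python sketch using only the standard library. It assumes your sitemap lives at /sitemap.xml and uses example.com as a placeholder domain:

    import urllib.request
    import urllib.robotparser
    import xml.etree.ElementTree as ET

    SITE = "https://example.com"  # placeholder: swap in your own domain

    # Load and parse the site's robots.txt
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{SITE}/robots.txt")
    rp.read()

    # Fetch the XML sitemap and pull out every <loc> entry
    with urllib.request.urlopen(f"{SITE}/sitemap.xml") as resp:
        tree = ET.parse(resp)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = [loc.text for loc in tree.findall(".//sm:loc", ns)]

    # Flag every sitemap URL that Googlebot is not allowed to fetch
    for url in urls:
        if not rp.can_fetch("Googlebot", url):
            print("Blocked by robots.txt:", url)

One caveat: Python’s robotparser follows the original robots.txt standard, so its matching may not agree with Google’s wildcard handling in every edge case. Treat the output as a starting list to verify, not a final verdict.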

How do I fix URLs blocked by robots.txt?

Blocked URLs can crop up in many different situations, whether introduced by developers or by overly eager analysts. There are a few different ways to approach the issue, depending on your goals.

Here are a few things to check when fixing “sitemap contains URLs which are blocked by robots.txt” errors.

If you need the page(s) crawled then…

remove the blocking rule from the robots file

  • Review the Disallow rules in your robots.txt file and compare them against your crawl from earlier. Use Google’s robots.txt Tester to double-check that Google can crawl the pages you need to keep (see the example below).
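
For example, if your crawl shows blog posts being blocked, the culprit is often a single overly broad rule. The /blog/ paths here are hypothetical:

    # Before: blocks every blog URL, including the ones in your sitemap
    User-agent: *
    Disallow: /blog/

    # After: only the drafts subfolder stays blocked
    User-agent: *
    Disallow: /blog/drafts/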

If you need the page(s) blocked then…

remove the URL from your sitemap

  • If you don’t want Google to attempt to crawl these pages, remove the URLs from your sitemap. To speed up the process and verify the change, you can inspect a URL in Search Console and resubmit the updated XML sitemap (see the sample entry below).
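
For reference, each page in an XML sitemap is a self-contained <url> block like the sketch below (placeholder URL and date). Deleting the whole block removes that page from the sitemap:

    <url>
      <loc>https://example.com/blocked-page/</loc>
      <lastmod>2023-06-01</lastmod>
    </url>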

delete unneeded XML sitemaps

  • Open the Sitemaps report within Google Search Console and find which sitemap has an error status. If you don’t need that XML sitemap, you can delete it there.
  • From there, access your file manager through FTP or your hosting provider and remove the XML sitemap file from the server as well.

Validate fixes in Search Console

Test a sample of your pages with Google’s URL Inspection tool. It gives you quick, high-level insight into the indexed version of a page, along with any crawlability issues.

Next, take it a step further and test the live URL. This tells you whether Google can now crawl the page. If everything looks good, you can move on.

When you feel confident in your changes, go into the Coverage report and select Validate Fix. This prompts Google to recrawl your pages; once validation starts, Google will notify you about its progress.

Site crawlability and indexability are foundational for any successful SEO campaign.

Need help? Drop me a line; I’m a Technical SEO consultant!
