
Understanding how search engines crawl and index content in 2021

What is search engine crawling and why is it important?

‘Crawling’ refers to the way search engine bots (known as ‘spiders’) discover content on the web. In its most basic form, a spider – such as Google’s Googlebot – starts from a set of known pages and scours them for links to new content. By following these links, the spider builds up a network of interlinking pages, which it saves to a database. This database forms the foundation for the content search engines show in their results pages.

If a site doesn’t have a solid foundation, it may hinder a search engine spider’s ability to discover new pages. This will prevent the pages from being added to or updated in a search engine’s database. 

How do search engines index crawled content?

Once a search engine has compiled a database of websites and pages, it will process them with the intention of delivering them to users.

When an individual searches a query, the search engine will aim to deliver the most relevant content from its index to the user. This is evaluated on several measures, known as ranking factors.

Ensuring the content has been crawled and can be indexed is the first step to delivering appropriate content that’s valuable to the user.

Understanding crawl budget in 2021

Crawl budget is the term used for the allocation of resources a search engine spider, namely Googlebot, gives to a website. In theory, there’s a limited amount of time a spider will spend crawling pages, so on a large site spiders may prioritise important pages or limit how much they crawl. The amount of resources allocated to your site depends on a variety of factors, and it is difficult to determine the priority Google will give you. However, for the majority of websites, crawl budget won’t be a limiting factor, as the allocated resources will outweigh the size of the site.

There are two main scenarios where crawl budget should be considered.

  • Your site is significant in size.

Some sites require a large volume of pages to be crawled, such as ecommerce sites with thousands of products and unique variations. These sites should place greater emphasis on optimising crawl efficiency.

  • Your site is placing unnecessary demands on crawlers.

In some cases, the way a website is structured may lead to duplicate pages being generated endlessly, a significant number of unnecessary redirects, or a high volume of slow-loading pages. All of these issues increase the resources required to crawl the site, and if they occur on a large enough scale, they can cause problems with crawl budget.

Methods to influence how search engines crawl and index content

Luckily there are a few ways to control how search engines crawl and index your site:

Robots.txt File

The robots.txt file is located in the root directory of your site and provides guidance on how search engines should crawl it. Basic robots directives allow you to prevent search engines from crawling certain sections of your site, as well as influencing how fast some of them crawl it. More specific rules can be created to apply to individual web crawlers.
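As an illustration, here is a minimal robots.txt sketch; the directory paths are hypothetical placeholders for sections you might not want crawled:

```txt
# Rules for all crawlers
User-agent: *
Disallow: /checkout/
Disallow: /internal-search/

# A rule for one specific crawler. Note that Crawl-delay is honoured
# by some search engines (e.g. Bing) but ignored by Google.
User-agent: Bingbot
Crawl-delay: 10
```

Each `User-agent` group applies to the named crawler, and `Disallow` lines list the paths that crawler should not request.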

Different search engines treat robots rules in different ways, and some may choose to ignore them. To see how Google treats robots.txt files, you can refer to their guidelines.

Robots Directives

Robots directives are a signal instructing search engines how a page should be indexed, without preventing it from being crawled. They are most commonly used to tell a search engine not to index a particular page, in the form of a noindex tag.

This can be implemented in two different ways: in the page’s source code using a robots meta tag, or at the server level using the X-Robots-Tag HTTP header.
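The meta tag version sits in the page’s `<head>`:

```html
<!-- Ask all crawlers not to index this page -->
<meta name="robots" content="noindex">
```

The HTTP header version achieves the same thing without touching the HTML, which is useful for non-HTML resources such as PDFs:

```txt
X-Robots-Tag: noindex
```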

Canonical Tags

A canonical tag is an HTML tag that defines the ‘true’ version of a page, helping search engines understand which version to show in their index. This is only a signal, however, and search engines like Google may choose to ignore it. They will still crawl the canonicalised page; they will just be less likely to index it.
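For example, a product page reached through a filter parameter might point back to the clean URL (the domain and parameter here are hypothetical):

```html
<!-- Placed in the <head> of https://example.com/product?colour=blue -->
<link rel="canonical" href="https://example.com/product">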

Hreflang Tags

The hreflang tag is used to inform search engines of the country and language your site targets. A search engine won’t inherently determine whether a site written in English is intended for the US market, the UK market, or another English-speaking audience. Where your site exists in multiple countries with similar content, hreflang tags help define which users should be sent to which variation and prevent duplicate content issues.
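A sketch of what this looks like for a site with US and UK variations (URLs are hypothetical; each page should list all variations, including itself):

```html
<link rel="alternate" hreflang="en-us" href="https://example.com/us/">
<link rel="alternate" hreflang="en-gb" href="https://example.com/uk/">
<link rel="alternate" hreflang="x-default" href="https://example.com/">
```

The `x-default` entry tells search engines which version to serve users who don’t match any listed locale.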

Using an XML Sitemap

An XML sitemap is essentially a document listing all the important pages of your site that you would like search engines to crawl. This helps search engines find your pages faster and means your site doesn’t rely as heavily on a strong internal linking structure. 

A single sitemap is limited to 50,000 URLs, so larger sites often use a sitemap index file together with separate sitemaps for blog content, product pages and support content. It is recommended to reference all XML sitemaps in the robots.txt file and to submit them via Google Search Console.
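A minimal sitemap following the sitemaps.org protocol looks like this (the URL and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/blue-widget</loc>
    <lastmod>2021-03-01</lastmod>
  </url>
</urlset>
```

Referencing it from robots.txt is a single line:

```txt
Sitemap: https://example.com/sitemap.xml
```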

URL Inspection Tool

Google provides tools to help their spiders understand your website. One of these tools is the ‘URL inspection tool’ in Google Search Console, which shows the status of the page and whether it has been indexed. If it’s not indexed, you can request Google to do so. This gives Google a clear signal about your desire to have the page appear in results and will prompt them to recrawl it.

It is recommended to use this sparingly; it is designed for cases where a problem has removed a page from the index and the issue needs to be resolved quickly.

Defining URL parameters in Google Search Console

Many sites will serve slightly different content by appending different parameters to a URL. This is commonly used with faceted navigation and search filters. Google will aim to identify the representative URL on its own, but if there are specific parameters that need to be excluded, you can block them directly in Google Search Console. This prevents Googlebot from crawling them altogether.

HTTP Authentication

HTTP authentication blocks content behind a log-in portal, which will stop both users and search engines from being able to access it. This is useful in a number of scenarios, including building a staging website to test what a page will look like once live, or restricting sensitive information that you only want approved users to see.
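On an Apache server, for instance, HTTP Basic authentication can be enabled with a few lines of configuration in an .htaccess file; this is a sketch for one server type, and the password-file path is a hypothetical placeholder:

```txt
# Require a username and password before serving any content
AuthType Basic
AuthName "Staging site"
AuthUserFile /path/to/.htpasswd
Require valid-user
```

Because crawlers cannot supply credentials, pages behind this prompt will not be crawled or indexed.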

Now we’ve covered crawling and indexing, why not explore the rest of our technical SEO hub?

 
