How to Discourage Archive Crawlers in WordPress
Avoid search result and site performance issues with these tips.
Have you ever noticed crawler hits on archive pages with exceedingly high pagination numbers? Numbers so high that it’s unlikely any human would ever navigate to them? It happens all the time, unfortunately.
Such behavior can be both wasteful and problematic for website performance and search result accuracy. So let’s talk about how to mitigate these issues.
What’s with this web crawler behavior?
Web crawlers are designed to follow links in search of new content to index. However, they don’t always discern the context or relevance of the links they follow, a problem we call “context blindness.”
A WordPress site with a sitemap already provides crawlers with direct paths to all its content. Yet, some crawlers still follow links to archive pages with high pagination numbers, which is redundant and inefficient.
Consider this URL example: /section/sports/hockey/entertainment/page/1466/.
URLs with high pagination numbers like this are often uncached, as human users rarely request them within the cache’s time-to-live (TTL) period. Every crawler hit on such a URL therefore misses the cache and triggers a full page render at the origin.
Even worse: due to the sequential nature of pagination, the content indexed by crawlers on a numbered archive page becomes outdated as soon as a new post is published. This can lead to search result inaccuracies.
A user might search for a specific post on Google, only to be directed to an archive page where the post no longer exists because it has moved due to newer content being published. This misdirection is not only frustrating for users but also dilutes the quality of search results.
Fortunately, many WordPress theme developers have recognized this issue and implemented measures to discourage bots from unnecessary crawling. However, plenty of themes haven’t addressed the problem, leading to wasted resources and user dissatisfaction.
And sometimes a site simply leaves these high-pagination archive pages open to indiscriminate crawlers, causing performance issues.
How to discourage archive crawlers in WordPress
1. Robots.txt
One of the simplest ways to discourage crawlers from visiting specific parts of your website is the robots.txt file. By adding Disallow directives, you can ask well-behaved crawlers to stay away from high-pagination archive pages.
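For example, assuming archive pagination lives under /page/ path segments, as in the URL above, a minimal robots.txt sketch could look like this:

# Ask crawlers to skip paginated archive URLs.
User-agent: *
Disallow: /*/page/

Keep in mind that robots.txt is advisory, that the * wildcard is honored by major crawlers such as Googlebot and Bingbot but not by all, and that there is no way to express “only pages beyond N,” so this blocks every paginated archive page. The meta tag approach below offers finer-grained control.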
2. Meta tags
Add a <meta name="robots" content="noindex, nofollow"> tag to archive pages with high pagination numbers. This tells search engines not to index those pages or follow the links on them. Here’s an example of a snippet we’ve tested and implemented for a customer’s site:
function custom_noindex_nofollow_archive_pages() {
    if ( is_paged() ) {
        $current_page  = get_query_var( 'paged' );
        $bot_threshold = 1; // Tag every paginated page from this number onward.
        if ( $current_page >= $bot_threshold ) {
            echo '<meta name="robots" content="noindex, nofollow" />';
        }
    }
}
// Hook into wp_head so the tag is printed inside <head>; echoing it any
// earlier would place the tag outside the document markup.
add_action( 'wp_head', 'custom_noindex_nofollow_archive_pages' );
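With $bot_threshold set to 1, every paginated archive page gets the tag; is_paged() is false on an archive’s first page, so that page stays indexable. If you want crawlers to index the first few pages of each archive as well, raise the threshold accordingly.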
3. Pagination plugins
Some WordPress plugins allow you to customize pagination behavior, making it easier to set limits or adjust how archive pages are presented to both users and crawlers.
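If you’d rather not add a plugin dependency, the same kind of limit can be sketched in a few lines of PHP. Treat this as a minimal illustration rather than a drop-in solution; the function name and the $max_archive_page cap of 50 are assumptions to tune for your own content volume:

// Return a 404 for archive pages beyond a chosen depth.
function vip_cap_archive_pagination() {
    $max_archive_page = 50; // Hypothetical cap; adjust for your site.
    if ( is_paged() && get_query_var( 'paged' ) > $max_archive_page ) {
        global $wp_query;
        $wp_query->set_404();  // Tell WordPress this request has no content.
        status_header( 404 );  // Send a matching HTTP status to crawlers.
        nocache_headers();     // Keep the 404 response out of page caches.
    }
}
add_action( 'template_redirect', 'vip_cap_archive_pagination' );

Serving an explicit 404, rather than an empty archive page, gives crawlers an unambiguous signal to drop those deep URLs from their queues.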
4. Monitor crawler behavior
Use tools like Google Search Console, or your server’s access logs, to monitor how crawlers interact with your site. If you spot hits on deep pagination URLs, apply the measures above.
5. Educate and collaborate
If you’re working with a specific theme or plugin developer, flag the crawler issues you’ve observed. Collaboration leads to solutions that benefit the broader WordPress community.
Run, don’t crawl
While web crawlers play a crucial role in indexing and ranking content, it’s essential to ensure they operate efficiently and accurately.
By understanding and addressing the challenges posed by archive crawlers in WordPress, website owners can improve user experience, search result accuracy, and overall site performance.
Author
Tallulah Ker-Oldfield, VIP Engineer