Load-Balanced Bot-Split Approach to Counter Excessive Bot Traffic: When Search Engines Work Against You!

Websites with a large number of links tend to get hammered by crawlers like Googlebot. This crawler traffic often causes major slowdowns and can even knock your server offline, and there isn’t much you can do about it besides upgrading your hardware. You could block all search engine traffic and be done with it, but that would be a death knell for your online business, since you depend on indexing and ranking to get noticed online. It’s a lose-lose situation. There is, however, a workaround to this dilemma, and we’re happy to share it with you here.

 


Instead of growing your hardware fleet to several times your actual needed capacity for the sole purpose of absorbing search engine traffic, you can use a dirt-cheap, high-density RAM server as a separate box that caches as much of your content as possible and “traps” all bot traffic. We already know how to identify most search engine traffic, and this triage is key to routing bots and visitors, logically and physically, exactly where we want them to go.

 

 


More concretely, we run Varnish on the load balancer node to programmatically split traffic based on the visitor’s user agent string (Nginx is a possible option, but some load-balancing features are only available in the commercial version, Nginx Plus). Bots have well-known and documented user-agent strings, and it takes only a few lines of VCL to route each kind of traffic to its respective backend (read: server).

# Inside vcl_recv: known crawlers go to the bot backend, everyone else stays on the realtime one.
if (req.http.user-agent ~ "(?i)(bing|googlebot)") {
    set req.backend = bot;
} else {
    set req.backend = default;
}
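This assumes the two backends are declared earlier in the load balancer’s VCL. A minimal sketch of those declarations, using pre-4.0 Varnish syntax to match the req.backend usage above (the hostnames and ports are placeholders, not anything prescribed by this setup):

# Declared near the top of the load balancer's VCL.
# "realtime.internal" and "botcache.internal" are placeholder hostnames.
backend default {
    .host = "realtime.internal";
    .port = "80";
}

backend bot {
    .host = "botcache.internal";
    .port = "80";
}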

Now that we have bot traffic routed to where we want it to go, let’s go over what the “cache box” should be and do.

First of all, it’s a high-RAM node (commensurate with the size of your content) with a one- or two-core CPU running Varnish Cache. In terms of software configuration, it needs to be near-identical to the “realtime” box, since it has to run the very same website (same software requirements). The “realtime” box regularly pushes the latest copy of its DB and synchronizes files. “Regularly” can be whatever interval you find reasonable, but one hour works for most implementations. A couple of shell scripts along the lines of the sketch below should do it for both the DB dump/load and the file sync. Be sure to avoid table locks if you’re not running InnoDB.
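A minimal sketch of such a sync script, assuming MySQL on both boxes and SSH access from the “realtime” box to the bot box (hostnames, credentials, and paths below are placeholders):

#!/bin/sh
# Hourly sync from the "realtime" box to the bot/cache box (run from cron).

BOT_BOX="botcache.internal"     # placeholder hostname
SITE_DIR="/var/www/site"        # placeholder web root
DB="mydb"                       # placeholder database name

# Dump the database; --single-transaction avoids table locks on InnoDB.
mysqldump --single-transaction -u dbuser -p'secret' "$DB" | gzip > /tmp/"$DB".sql.gz

# Push the dump to the bot box and load it there.
scp /tmp/"$DB".sql.gz "$BOT_BOX:/tmp/"
ssh "$BOT_BOX" "gunzip -c /tmp/$DB.sql.gz | mysql -u dbuser -p'secret' $DB"

# Mirror the website files (code, uploads, assets).
rsync -az --delete "$SITE_DIR/" "$BOT_BOX:$SITE_DIR/"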

It’s important to set a reasonably high TTL if you’re going to deploy a low-powered CPU on the bot box; otherwise your cache hit ratio will suffer and defeat the purpose of the whole setup. A 2-day TTL for both static assets and pages wouldn’t be unusual. A simple cache warmer (e.g. one that runs on sitemap changes) can make the whole setup super efficient.
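On the bot box’s Varnish, forcing that TTL can be as simple as a few lines in vcl_fetch; a minimal sketch, again assuming the pre-4.0 VCL syntax used above:

sub vcl_fetch {
    # Cache everything the crawlers request for two days, regardless of
    # what Cache-Control the backend sends.
    set beresp.ttl = 2d;
    return (deliver);
}

The cache warmer can then be a small script that requests every URL in the sitemap against the bot box (or through the load balancer with a crawler-like user agent), so pages are already hot before the real crawlers show up.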

That’s all!
