How can I control bots, spiders and crawlers?

Overview

Bots, spiders, and other crawlers can consume a lot of resources (memory and CPU) when they hit your dynamic sites. This can put a lot of strain on the server and cause it to slow down.

Create a robots.txt file at the root of your website to limit the server demand from bots, spiders, and other crawlers. This file tells search engines which content they should and shouldn’t index on your site. It is useful, for example, if you wish to keep a section of your site out of the Google search engine index.
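For example, a minimal robots.txt that asks compliant crawlers to stay out of a single directory might look like the following (the /private/ directory name is only a placeholder; use the path you actually want to hide):

# keep this directory out of search indexes
User-agent: *
Disallow: /private/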

If you’d rather not create this file yourself, you can have GreggHost do it for you on the Block Spiders page (on a per-domain basis).

While most major search engines honor robots.txt directives, the file is only a suggestion to compliant crawlers; it does not prevent search engines (or similar tools such as email and content scrapers) from accessing the content or making it available.

Blocking robots

It’s possible that your site is being crawled too aggressively by Google, Yahoo, or another search engine bot. (This is the type of problem that feeds on itself: if the bot cannot complete its crawl because the server is short on resources, it may retry the same requests over and over.)
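To see which crawlers are hitting your site hardest, you can tally User-agents from your access log. The command below is a rough sketch that assumes an Apache-style combined log format; the log path is only an example and will differ depending on your server:

[server]$ awk -F'"' '{print $6}' ~/logs/example.com/https/access.log | sort | uniq -c | sort -rn | head

This prints the most frequent User-agent strings along with their request counts.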

Blocking Googlebots

In the example below, the IP 66.249.66.167 was found in your access.log. Over SSH, use the ‘host’ command to see which company this IP belongs to:

[server]$ host 66.249.66.167
167.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-167.googlebot.com.
Use the following in your robots.txt file to stop this Googlebot:

# go away Googlebot
User-agent: Googlebot
Disallow: /

Explanation of the above fields:

# go away Googlebot
This is a comment that simply serves to remind you why you made this rule.
User-agent
The name of the bot to which the following rule applies.
Disallow
The URL path you want to block. A single forward slash blocks access to the entire site.
For more information on Google robots, see Google’s documentation on its crawlers.

Blocking Yahoo

Yahoo’s crawling bots honor the crawl-delay rule in robots.txt, which limits their fetching activity. For example, to tell Yahoo not to fetch a page more than once every 10 seconds, you would add the following:

# slow down Yahoo
User-agent: Slurp
Crawl-delay: 10

Explanation of the above fields:

# slow down Yahoo
This is a comment that simply serves to remind you why you made this rule.
User-agent: Slurp
Slurp is Yahoo’s user-agent name; you must use it to target Yahoo’s crawler.
Crawl-delay
Tells the User-agent to wait 10 seconds between requests to the server.
For further information on Yahoo robots, see Yahoo’s robots.txt documentation.

Slowing good bots

To delay some, but not all, good bots, use the following:

User-agent: *
Crawl-delay: 10

Explanation of the above fields:

User-agent: *
All User-agents are affected.
Crawl-delay
The User-agent is instructed to wait 10 seconds between requests to the server.
Bots from Google

Googlebot ignores the crawl-delay directive.
To slow down Googlebot, you’ll need to sign up for Google Search Console.
Once your account is set up, you can set the crawl rate in its panel.
Blocking all bots
To block all bots, use the following:

User-agent: *
Disallow: /

To block them from a specific folder:

User-agent: *
Disallow: /yourfolder/

Keep in mind that bad bots may use this content as a target list.

Explanation of the above fields:

User-agent: *
All User-agents are affected.
Disallow: /
Prevents everything from being indexed.
Disallow: /yourfolder/
Prevents this single folder from being indexed.
Use caution
Your site will be de-indexed by legitimate search engines if you block all bots (User-agent: *) from your entire site (Disallow: /). Also, because harmful bots are likely to ignore your robots.txt file, you may wish to use an .htaccess file to block their User-agents, as sketched below.

You may also wish to avoid listing directories in the robots.txt file, since bad bots may use it as a target list. And because bad bots may use fake or misleading User-agents, blocking User-agents with .htaccess may not work as well as expected.
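If you do want to try the .htaccess approach, a minimal sketch using Apache’s mod_rewrite is shown below; the bot names in the pattern (BadBot, EvilScraper) are placeholders, so substitute the User-agents you actually see in your logs:

# Block requests whose User-agent matches known bad bots.
# BadBot and EvilScraper are placeholder names.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
</IfModule>

Matching requests receive a 403 Forbidden response instead of reaching your site.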

This is an excellent default robots.txt file if you don’t want to block anyone:

User-agent: *
Disallow:

That said, if you don’t mind 404 requests for robots.txt in your logs, you can simply delete the file in this case.

Unless you’re certain that’s what you want, GreggHost recommends that you only block specific User-agents and files/directories, rather than using * for everything.

Blocking bad referrers
Please see the page on how to block referrers for more information.