Crawlers, Agents, Bots, Robots and Spiders
If you've been surfing search engine optimization web sites, you've no doubt come across these terms on many occasions.
Five terms that all describe basically the same thing; in this article they'll be referred to collectively as spiders or "agents". A search engine spider is an automated software program that locates and collects data from web pages for inclusion in a search engine's database, following links to find new pages on the World Wide Web. The term "agent" is more commonly applied to web browsers and to mirroring software.
Not all spiders are good
Who actually owns these spiders? It's good to be able to tell the beneficial from the bad. Some agents are generated by software such as Teleport Pro, an application that lets people download a full "mirror" of your site onto their hard drives for viewing later on, or sometimes for more insidious purposes such as plagiarism. If you have a large or image-heavy site, web site stripping can also have a serious impact on your bandwidth usage each month.
Banning spiders and agents
If you notice entries like Teleport Pro and WebStripper in your traffic reports, someone has been busy attempting to download your web site. You don't have to just sit back and let this happen. If you are commercially hosted, you can add a couple of lines to your robots.txt file telling repeat offenders to stay away from your site.
If you don't have a robots.txt file, create one in a plain text editor such as Notepad and upload it to the root directory of your web site (spiders look for the file only at the site root, e.g. www.example.com/robots.txt). Never use a blank robots.txt file, as some search engines may see this as an indication that you don't want your site spidered at all! Have at least one entry in the file.
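As a sketch, a robots.txt that asks the Teleport Pro and WebStripper agents to stay away while leaving the site open to all other spiders might look like this (the user-agent strings shown are the commonly reported ones; check your own traffic reports for the exact names):

```
# Ask known site-stripping agents to stay away
User-agent: Teleport Pro
Disallow: /

User-agent: WebStripper
Disallow: /

# All other spiders may crawl everything
User-agent: *
Disallow:
```

An empty Disallow: line means "nothing is disallowed", and including it guarantees the file is never blank.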
Unfortunately, naming web stripper agents and spiders in your robots.txt file won't work in all cases, as some mirroring applications can mimic the identifier (user-agent string) of a normal web browser; but it's at least some protection that may save you some valuable bandwidth.
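Where robots.txt is ignored, a server-level block is more forceful because it refuses the request outright rather than relying on the agent's good manners. As a minimal sketch, assuming an Apache host with mod_rewrite enabled and .htaccess overrides permitted (check with your host), you could return a 403 Forbidden to matching user-agents:

```apache
# Refuse requests whose user-agent matches known stripper software
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Teleport|WebStripper) [NC]
RewriteRule .* - [F]
```

This still won't stop software that impersonates a normal browser, but it blocks the honestly-identified offenders.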
Search engine spider identification
The following is a basic listing of search engine spider names and their "owners". It is by no means complete, as there are many thousands of search engines on the Internet, but it covers the more common beneficial spiders. Look for these in your traffic reports, or search for the names through your server logs to discover which pages they have been spidering. You'll find that many of the entries also have accompanying numbers or letters, e.g. Googlebot/2.1 or Slurp.so/1.0.
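Searching your server logs for those names can be scripted. Below is a minimal sketch in Python; the sample log lines and the spider names are illustrative only, and in practice you would read your own server's access log file:

```python
from collections import Counter

# Hypothetical sample lines with a user-agent field at the end;
# in practice, read these from your real access log file.
log_lines = [
    '66.249.66.1 - - "GET / HTTP/1.1" 200 1024 "-" "Googlebot/2.1"',
    '66.196.90.2 - - "GET /about.html HTTP/1.1" 200 2048 "-" "Slurp.so/1.0"',
    '192.0.2.5 - - "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/4.0"',
]

# Spider names taken from your traffic reports
spider_names = ["Googlebot", "Slurp"]

# Count how many requests each spider made
hits = Counter()
for line in log_lines:
    for name in spider_names:
        if name in line:
            hits[name] += 1

print(dict(hits))  # prints {'Googlebot': 1, 'Slurp': 1}
```

A count per spider, broken down further by requested page if you wish, quickly shows which engines are visiting and what they are reading.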
If you have spotted any significant activity from these spiders in your reports or logs, there's a good chance that you'll be listed on that particular search engine. But you'll need to be patient; some search engines take far longer than others to refresh their databases!
Further learning resources:
Learn more about positioning in our SE optimization tutorials section.
Studying Web Traffic and Server Logs. What is a hit? What is a visitor? What is a page view? Traffic statistics terminology and methods of web site traffic reporting.
A basic tutorial on the use of Meta Tags in improving search engine rankings. A solid set of meta-tags is an important component of any overall promotion strategy.
What do all those browser error codes and server response codes mean? Try our server response code reference.
Copyright (c) 1999-2011 Taming the Beast, Adelaide - South Australia