There is a hidden, relentless force that permeates the web and its billions of web pages and files, unbeknownst to the
majority of us sentient beings. I'm talking about search engine crawlers and robots here. Every day hundreds of them go
out and scour the web, whether it's Google trying to index the entire web, or a spam bot collecting any email address it
can find for less than honorable purposes. As site owners, what little control we have over what robots are allowed
to do when they visit our sites exists in a magical little file called "robots.txt."
1) Here's a basic "robots.txt":
User-agent: *
Disallow: /
With the above declared, all robots (indicated by "*") are instructed not to crawl any of your pages (indicated by "/",
which matches every path on the site). Most likely not what you want, but you get the idea.
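To see how a well-behaved crawler would read the file above, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `is_allowed`, the bot name "ExampleBot" and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The "block everything" file from tip 1.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# "ExampleBot" is a hypothetical crawler name.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/index.html"))  # False
```

Keep in mind this only models crawlers that choose to obey "robots.txt"; a spam bot is free to ignore the file entirely.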
2) Let's get a little more discriminating now. While every webmaster loves Google, you may not want Google's image bot
crawling your site's images and making them searchable online, if only to save bandwidth. The declaration below will
do the trick:
User-agent: Googlebot-Image
Disallow: /
3) The following disallows all search engines and robots from crawling select directories and pages:
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm
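Each "Disallow:" value is matched as a prefix of the requested path, so a directory rule covers everything beneath it while a file rule covers just that file. A rough check with Python's `urllib.robotparser` (the bot name and the example.com URLs are placeholders, not part of the original file):

```python
from urllib.robotparser import RobotFileParser

# The directory and page rules from tip 3.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs; only the path portion is compared against the rules.
BASE = "http://www.example.com"
results = {
    path: parser.can_fetch("ExampleBot", BASE + path)
    for path in (
        "/cgi-bin/search.pl",     # inside a disallowed directory
        "/privatedir/notes.txt",  # inside a disallowed directory
        "/tutorials/blank.htm",   # the single disallowed page
        "/tutorials/index.htm",   # same directory, but not listed
    )
}

for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Only the last path comes back as allowed, since nothing in the file is a prefix of /tutorials/index.htm.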
4) You can conditionally target multiple robots in "robots.txt." Take a look at the below:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
This is interesting: here we declare that crawlers in general should not crawl any part of our site, EXCEPT for Google,
which is allowed to crawl the entire site apart from /cgi-bin/ and /privatedir/. A crawler obeys only the most specific
group that names it and ignores the rest, so the rule of specificity applies, not inheritance.
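The specificity rule can be checked with Python's `urllib.robotparser`, which consults the groups that name a crawler before falling back to the "*" group. This is a sketch of one parser's behavior, not a guarantee about every crawler; "OtherBot" and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The two-group file from tip 4.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://www.example.com"  # placeholder site

# Googlebot uses only its own group: the blanket "Disallow: /" is not inherited.
googlebot_home = parser.can_fetch("Googlebot", BASE + "/index.html")
googlebot_cgi = parser.can_fetch("Googlebot", BASE + "/cgi-bin/search.pl")

# Any crawler without a group of its own falls back to the "*" group.
otherbot_home = parser.can_fetch("OtherBot", BASE + "/index.html")

print(googlebot_home, googlebot_cgi, otherbot_home)  # True False False
```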
5) There is a way to use Disallow: to essentially turn it into "Allow all", and that is by not entering a value after the
colon:
User-agent: *
Disallow: /
User-agent: ia_archiver
Disallow:
Here I'm saying all crawlers should be prohibited from crawling our site, except for Alexa's crawler (ia_archiver), which
is allowed to crawl everything.
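A quick way to confirm that an empty "Disallow:" really means "nothing is off limits" is to feed the file to Python's `urllib.robotparser`. As before, "OtherBot" and the example.com URL are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# The file from tip 5: everyone blocked, ia_archiver given an empty Disallow.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "http://www.example.com/tutorials/index.htm"  # placeholder URL

alexa_allowed = parser.can_fetch("ia_archiver", URL)  # empty Disallow blocks nothing
other_allowed = parser.can_fetch("OtherBot", URL)     # falls back to "Disallow: /"

print(alexa_allowed, other_allowed)  # True False
```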
6) Finally, some crawlers support an additional field called "Allow:", most notably Google's. As its name implies,
"Allow:" lets you explicitly dictate what files/folders can be crawled. This field was not part of the original
"robots.txt" protocol (it has since been standardized in RFC 9309), so my recommendation is to use it only when needed,
as older or less intelligent crawlers may not understand it.
Per Google's FAQs for webmasters, the below is the preferred way to disallow all crawlers from your site EXCEPT Google:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
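Python's `urllib.robotparser` is one parser that understands "Allow:", so it can be used to sanity-check the file above. This shows how that particular parser reads it, under the assumption that Googlebot behaves the same way; "OtherBot" and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The "everyone out except Google" file from tip 6.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "http://www.example.com/index.html"  # placeholder URL

google_allowed = parser.can_fetch("Googlebot", URL)  # matched by "Allow: /"
other_allowed = parser.can_fetch("OtherBot", URL)    # matched by "Disallow: /"

print(google_allowed, other_allowed)  # True False
```

A crawler that does not recognize "Allow:" would see an empty Googlebot group here, which is one reason the empty "Disallow:" form from tip 5 is the more conservative way to say the same thing.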