
robots.txt, nofollow, noindex, and Search Engine Behavior

Tue, 12/30/2008 - 23:22 - peter

Search Engine Behavior

Take a quick look at the second page of search results for this site, about a week after Google began indexing it. Some results carry far more information than others!

[Image: Google Search Results Example]

Clearly, the top two results differ from the rest. The difference is that Googlebot never visited the bottom eight links. It knows those URLs exist, but it never crawled them because they are disallowed by the robots.txt file. So Google officially doesn't know what is on those pages, and hence displays only the bare link, which it found on a separate page.
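
The crawl decision described above can be sketched with Python's standard-library robots.txt parser. This is only an illustration: the two rules, the example.com host, and the helper name is_crawlable are assumptions made for the example, not part of this site's setup. Note that urllib.robotparser does plain prefix matching and does not implement the Google-style * and $ wildcards, so the sketch uses literal paths only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration (literal prefixes only).
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
"""


def is_crawlable(path, rules=RULES, agent="Googlebot"):
    """Return True if a well-behaved bot may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # example.com is a placeholder host; only the path is matched.
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for path in ("/about", "/admin/settings", "/search/node"):
        print(path, is_crawlable(path))
```

A bot that gets False here never reads the page, which is why such a page can show up in results only as a bare link.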

nofollow

The same behavior would have been found if the links to those pages had been given the rel="nofollow" attribute: Googlebot would not have officially visited them, and although the result could appear in the Search Engine Result Page (SERP), it would be without any description or title. At the bottom of this page, by the way, is the full text of my Drupal 6 robots.txt file. Remember that one of my main goals is to eliminate duplicate content. If you don't want even the links to certain pages to appear in Google at all, unfortunately you can't achieve that simply by using nofollow or robots.txt. You need noindex.

noindex

Ironically, to keep your pages out of Google, you need to let Google read them: do not block them in your robots.txt file, and leave the links to them followed (without the rel="nofollow" attribute). Then add the "noindex" robots meta tag to each page. Google has to crawl a page to see that tag, and once it does, the page will not appear in the SERPs.
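
The tag in question is `<meta name="robots" content="noindex">`, placed in the page's head. As a rough way to check that a page actually carries it, here is a minimal sketch using Python's built-in HTML parser. The class and function names (RobotsMetaParser, has_noindex) are made up for this example, and it only looks at the robots meta tag, not at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(page))
```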

PageRank

The various search engines appear to have their own separate algorithms for determining the value of pages on the internet, and most make this rank depend in some way on quality links. So how do these issues affect PageRank? From my research, I believe that nofollow links do not affect the receiving page's PageRank, but followed links do, even if the receiving page is blocked by robots.txt. Hence the links in the screenshot at the beginning of this article do have PageRank, even though Google doesn't know what they contain, and that is why Google indexed them. If those links had also been nofollowed, Google probably would not have indexed them, because they would not have had any PageRank.

Drupal, nofollow, robots.txt, and PageRank

I think it would be handy for Drupal to have an option to nofollow links to any pages that are disallowed in the robots.txt file. So far, the nofollow options for Drupal (nofollowlist, or nofollowing based on input format or user class) do not behave this way. This would be desirable because it would help Googlebot (and presumably the other bots) focus their attention (whether it's called PageRank or not) on the areas of one's site that one wants public: the areas not disallowed in the robots.txt file!
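
To make the idea concrete, here is a small sketch of that behavior in Python rather than as a Drupal module. It is not an existing Drupal feature or API: the functions build_parser and render_link, the sample rules, and the example.com host are all assumptions for illustration. It renders a link with rel="nofollow" whenever the target path is disallowed by the given robots.txt rules (literal prefixes only, since Python's parser ignores * and $ wildcards).

```python
from html import escape
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text into a reusable parser object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def render_link(parser, href, text, site="http://example.com"):
    """Render an <a> tag, adding rel="nofollow" if robots.txt disallows href."""
    allowed = parser.can_fetch("*", site + href)
    rel = "" if allowed else ' rel="nofollow"'
    return f'<a href="{escape(href, quote=True)}"{rel}>{escape(text)}</a>'


if __name__ == "__main__":
    # Hypothetical rules; a real site would load its own robots.txt.
    rules = "User-agent: *\nDisallow: /user/\nDisallow: /search/\n"
    parser = build_parser(rules)
    print(render_link(parser, "/user/login", "Log in"))
    print(render_link(parser, "/about", "About"))
```

Keeping the robots.txt rules as the single source of truth means the nofollow decisions cannot drift out of sync with what the bots are told.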

Robots.txt

(my robots.txt file: http://palma-seo.com/robots.txt )

# $Id: robots.txt,v 1.9 2007/06/27 22:37:44 goba Exp $
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# For syntax checking, see:
# http://www.sxw.org.uk/computing/robots/check.html

# Directories
User-agent: *
Disallow: /userlist/content
Disallow: /userlist/content/
Disallow: /s/
Disallow: /*/book/*
Disallow: /*/book*
Disallow: /*/book
Disallow: /*/export*
Disallow: /*/export/*
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
Disallow: /profile
Disallow: /profile/
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/password
Disallow: /user/login/
Disallow: /user/
# Paths (no clean URLs)
Disallow: /?q=es/
Disallow: /?q=es
Disallow: /?q
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /user/
Disallow: /user
Disallow: /admin
Disallow: /admin/
Disallow: /node/add
Disallow: /node/add/
Disallow: /aggregator/
Disallow: /aggregator
Disallow: /comment/
Disallow: /comment
Disallow: /contact
Disallow: /contact/
Disallow: /logout
Disallow: /logout/
Disallow: /search/
Disallow: /search
Disallow: /tribune
Disallow: /tribune/
Disallow: /calendar
Disallow: /calendar/
Disallow: /Calendar
Disallow: /Calendar/
#Disallow: /tracker
Disallow: /tracker/
Disallow: /*/track/
Disallow: /tracker?
Disallow: /*/feed$
Disallow: /*/feed*
Disallow: /*/feed/
Disallow: /blog/
Disallow: /*/track$
Disallow: /*/subscribe
Disallow: /*/subscribe/
Disallow: /*/subscribe*
# Views and Forum module problem:
Disallow: /*sort=
# Image module problem
Disallow: /*size=
#This avoids the creation of a duplicate home-page.
#    The URL http://example.com/node is a duplicate of http://example.com/.
Disallow: /node$
Disallow: /print/
Disallow: /es
Disallow: /es/
Disallow: /category
Disallow: /category/
Disallow: /messages
Disallow: /messages/
Disallow: /taxonomy
Disallow: /taxonomy/
Disallow: /taxonomy_vtn
Disallow: /taxonomy_vtn/
Disallow: /aggregator
Disallow: /aggregator/
Disallow: /*/guestbook
Disallow: /node
Disallow: /node/
#This disallows the numerical forum urls (can still access at
# /forums/nicaragua etc).
Disallow: /forum/
Disallow: /image_captcha
Disallow: /image_captcha/
Disallow: /?
Disallow: /?page=*
Disallow: /?page=
Disallow: /?page=1
Disallow: /?page=2
Disallow: /?page=4
Disallow: /?page=3
Disallow: /?page=5
Disallow: /?page=6
Disallow: /?page=7
Disallow: /?page=8
Disallow: /?page=9
Disallow: /?page=10
Disallow: /?page=11
Disallow: /?page=12
Disallow: /?page=13
Disallow: /?page=14
Disallow: /?page=15
Disallow: /?page=16
Disallow: /?page=17
Disallow: /popular
Disallow: /popular/
Disallow: /node/
Disallow: /search
Disallow: /piwik
Disallow: /piwik.php
Disallow: /piwik/
Disallow: /search$
Disallow: /*?page=0,0$
Disallow: /*?page=0,1000$
Disallow: /central-america-latest-blogs
Disallow: /central-america-news
Disallow: /central-america-latest-blogs?page=*
Disallow: /central-america-news?page=*
Allow: /
Allow: /sites/*/files/
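
Many paths in the file above appear both with and without a trailing slash (for example /user and /user/). Disallow values are matched as path prefixes, so the shorter rule already covers the longer one. The sketch below is one rough way to spot such overlaps; the function name redundant_rules is made up for this example, and it deliberately skips any rule containing * or $ because wildcard matching needs more care. A flagged rule is only safe to drop when no Allow rule targets a path between the two prefixes.

```python
def redundant_rules(paths):
    """Return literal Disallow paths already covered by a shorter prefix.

    Rules containing the wildcard characters * or $ are ignored, and
    exact duplicates are collapsed before comparing.
    """
    literal = [p for p in dict.fromkeys(paths) if "*" not in p and "$" not in p]
    redundant = []
    for path in literal:
        # A rule is redundant if some other rule is a prefix of it.
        if any(other != path and path.startswith(other) for other in literal):
            redundant.append(path)
    return redundant


if __name__ == "__main__":
    sample = ["/user", "/user/", "/user/login/", "/admin/", "/*/feed$"]
    for path in redundant_rules(sample):
        print("covered by a shorter rule:", path)
```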
