robots.txt, nofollow, noindex, and Search Engine Behavior

Search Engine Behavior

Take a quick look at the second page of search results for this site, about a week after Google began indexing it. Some results carry far more information than others!

[Screenshot: Google search results example (nofollow, noindex, index, follow)]

Clearly, something is different about the top two results compared with the rest: Googlebot never visited the bottom eight links. It knows those URLs exist, but it never officially fetched them because the robots.txt file disallows them. Google therefore officially doesn't know what is on those pages, and it displays only the bare link, which it found on a separate page.
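The crawl decision described above can be checked mechanically. Here is a minimal sketch using Python's standard-library urllib.robotparser. The two Disallow rules are copied from the robots.txt file on this page; the function name is_crawlable and the sample paths are hypothetical. Note that the standard-library parser does plain prefix matching and does not implement the * and $ wildcards that Googlebot honors, so it only approximates Google's behavior.

```python
from urllib.robotparser import RobotFileParser

# Two rules taken from the robots.txt file on this page.
RULES = [
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /search/",
]


def is_crawlable(path, rules=RULES, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse in-memory lines instead of fetching a URL
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical sample paths, for illustration only.
    for path in ("/admin/settings", "/about"):
        print(path, is_crawlable(path))
```

A path that is not crawlable can still show up as a bare link in the results, which is exactly the effect in the screenshot.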

nofollow

The same behavior would have appeared if the links to those pages had carried the rel="nofollow" attribute: Googlebot would not have officially visited them, and although a result could still appear in the Search Engine Result Page (SERP), it would have no title or description. The full text of my Drupal 6 robots.txt file is further down this page, by the way. Remember, one of my main goals is to eliminate duplicate content.

If you are serious about keeping even the bare links to certain pages out of Google, unfortunately you can't achieve that with nofollow or robots.txt alone. You need noindex.

noindex

Ironically, to keep your pages out of Google, you must let Google read them: robots.txt must not disallow them, and the links to them must be followed (that is, they must not carry the rel="nofollow" attribute). Then add the "noindex" robots meta tag to each page. Googlebot can only obey a tag it has actually read, so once it crawls the page and sees noindex, the page stays out of the SERPs.
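For reference, the tag in question normally looks like <meta name="robots" content="noindex"> inside the page's head. Here is a small sketch, using only Python's standard-library html.parser, that extracts the robots directives from a page so you can confirm what a crawler would see. The class and function names are hypothetical, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # attribute names arrive lowercased
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print("noindex" in robots_directives(sample))
```

This only tells you what the page says. Whether the crawler ever reads it still depends on robots.txt allowing the fetch.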

PageRank

The various search engines appear to have their own separate algorithms for determining the value of pages on the internet, and most make this rank depend in some way on quality links. So how do these issues affect PageRank? From my research, I believe that nofollow links do not affect the receiving page's PageRank, while followed links do, even if robots.txt makes the receiving page invisible to the crawler. Hence the links in the screenshot at the beginning of this article do have PageRank even though Google doesn't know what the pages contain, and that is why Google indexed them. If those links had also been nofollowed, Google probably would not have indexed them, because they would not have had any PageRank.

Drupal, nofollow, robots.txt, and PageRank

It would probably be handy for Drupal to have an option that automatically nofollows links to any page disallowed in the robots.txt file. So far, the nofollow options for Drupal (nofollowlist, or nofollowing based on input format or user class) do not behave this way. The behavior would be desirable because it would help Googlebot (and presumably the other bots) focus their attention, whether or not it's called PageRank, on the areas of one's site that one wants public: the areas not disallowed in the robots.txt file!
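To make the idea concrete, here is a rough sketch of the logic such an option might apply when rendering a link. It is written in Python rather than as a Drupal module, it is not an existing Drupal feature, and the names build_parser and render_link are hypothetical. It relies on the standard-library urllib.robotparser, which does prefix matching only and ignores the * and $ wildcard extensions.

```python
from html import escape
from urllib.robotparser import RobotFileParser


def build_parser(robots_lines):
    """Build a robots.txt parser from in-memory lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def render_link(parser, href, text, agent="*"):
    """Render an anchor tag, adding rel="nofollow" when robots.txt disallows href."""
    rel = "" if parser.can_fetch(agent, href) else ' rel="nofollow"'
    return '<a href="{}"{}>{}</a>'.format(escape(href, quote=True), rel, escape(text))


if __name__ == "__main__":
    # One rule from the robots.txt file on this page; the links are made up.
    parser = build_parser(["User-agent: *", "Disallow: /user/"])
    print(render_link(parser, "/user/login", "Log in"))
    print(render_link(parser, "/about", "About"))
```

With a helper like this, links into the disallowed areas would stop passing whatever value followed links pass, leaving the public areas as the only followed targets.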

Robots.txt

(My robots.txt file: http://palma-seo.com/robots.txt)

# $Id: robots.txt,v 1.9 2007/06/27 22:37:44 goba Exp $
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# For syntax checking, see:
# http://www.sxw.org.uk/computing/robots/check.html

# Directories
User-agent: *
Disallow: /userlist/content
Disallow: /userlist/content/
Disallow: /s/
Disallow: /*/book/*
Disallow: /*/book*
Disallow: /*/book
Disallow: /*/export*
Disallow: /*/export/*
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
Disallow: /profile
Disallow: /profile/
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/password
Disallow: /user/login/
Disallow: /user/
# Paths (no clean URLs)
Disallow: /?q=es/
Disallow: /?q=es
Disallow: /?q
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /user/
Disallow: /user
Disallow: /admin
Disallow: /admin/
Disallow: /node/add
Disallow: /node/add/
Disallow: /aggregator/
Disallow: /aggregator
Disallow: /comment/
Disallow: /comment
Disallow: /contact
Disallow: /contact/
Disallow: /logout
Disallow: /logout/
Disallow: /search/
Disallow: /search
Disallow: /tribune
Disallow: /tribune/
Disallow: /calendar
Disallow: /calendar/
Disallow: /Calendar
Disallow: /Calendar/
#Disallow: /tracker
Disallow: /tracker/
Disallow: /*/track/
Disallow: /tracker?
Disallow: /*/feed$
Disallow: /*/feed*
Disallow: /*/feed/
Disallow: /blog/
Disallow: /*/track$

Disallow: /*/subscribe
Disallow: /*/subscribe/
Disallow: /*/subscribe*
# Views and Forum module problem:
Disallow: /*sort=
# Image module problem
Disallow: /*size=
#This avoids the creation of a duplicate home-page.
#    The URL http://example.com/node is a duplicate of http://example.com/.
Disallow: /node$
Disallow: /print/
Disallow: /es
Disallow: /es/
Disallow: /category
Disallow: /category/
Disallow: /messages
Disallow: /messages/
Disallow: /taxonomy
Disallow: /taxonomy/
Disallow: /taxonomy_vtn
Disallow: /taxonomy_vtn/
Disallow: /aggregator
Disallow: /aggregator/
Disallow: /*/guestbook
Disallow: /node
Disallow: /node/
#This disallows the numerical forum urls (can still access at
# /forums/nicaragua etc).
Disallow: /forum/
Disallow: /image_captcha
Disallow: /image_captcha/
Disallow: /?
Disallow: /?page=*
Disallow: /?page=
Disallow: /?page=1
Disallow: /?page=2
Disallow: /?page=4
Disallow: /?page=3
Disallow: /?page=5
Disallow: /?page=6
Disallow: /?page=7
Disallow: /?page=8
Disallow: /?page=9
Disallow: /?page=10
Disallow: /?page=11
Disallow: /?page=12
Disallow: /?page=13
Disallow: /?page=14
Disallow: /?page=15
Disallow: /?page=16
Disallow: /?page=17
Disallow: /popular
Disallow: /popular/
Disallow: /node/
Disallow: /search
Disallow: /piwik
Disallow: /piwik.php
Disallow: /piwik/
Disallow: /search$
Disallow: /*?page=0,0$
Disallow: /*?page=0,1000$
Disallow: /central-america-latest-blogs
Disallow: /central-america-news
Disallow: /central-america-latest-blogs?page=*
Disallow: /central-america-news?page=*
Allow: /
Allow: /sites/*/files/

Comments

postscript - follow, noindex

As an experiment, I did something different when I launched another site (Philippines Living). In this case, I set the default robots meta tag to "follow,noindex". I expected that nothing would show up in the search results: every page would be either disallowed by robots.txt or kept out of the index by noindex.

I was wrong. Here are the results a few weeks after launch, while the site was still set to "follow,noindex" (right before I switched to the normal setting of "follow,index"):

[Screenshot: Google search results with noindex,follow]

So, what happened?

Google crawled all the pages not disallowed by robots.txt, and it saw the links to the disallowed URLs but didn't officially crawl them. Google then did not index any of the crawled pages, because their meta tags said "noindex". But it did index the links to the robots.txt-disallowed URLs! Of course, they have no titles or descriptions, because Googlebot doesn't officially crawl them. So yes, the only pages Google indexed were the ones barred by robots.txt. Pretty amusing. I'll make another entry when I see that Google has figured out that the pages are now indexable and has put them in the online index. We'll see how long past January 4th that takes.
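The outcome above, together with the earlier sections, can be summarized as a small decision function. This is only a sketch of the behavior observed in this article, not Google's documented algorithm; the function name expected_listing and its three boolean parameters are hypothetical.

```python
def expected_listing(blocked_by_robots, noindex_meta, has_followed_link):
    """Model the SERP outcome observed in this article for a single URL.

    blocked_by_robots: robots.txt disallows the URL
    noindex_meta:      the page carries a "noindex" robots meta tag
    has_followed_link: at least one followed (not nofollow) link points to it
    """
    if blocked_by_robots:
        # The crawler never reads the page, so a noindex tag on it is never seen.
        # A followed link is enough for the bare URL to be listed.
        return "bare URL" if has_followed_link else "not listed"
    if noindex_meta:
        # The page was crawled and the tag was read, so it is kept out.
        return "not listed"
    return "full listing"


if __name__ == "__main__":
    # The "follow,noindex" experiment: crawlable pages vanish, blocked ones appear.
    print(expected_listing(False, True, True))
    print(expected_listing(True, True, True))
```

The second call is the amusing case: the noindex tag on a blocked page has no effect, because the page is never fetched.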

First page to appear: Here's the Story

The first page to appear was not the site's home page. Two days after I made all pages "index,follow", one page was mentioned on ask.metafilter; this brought about 100 hits, and within 3 hours Google had properly indexed it and included it in search results. All the rest of the pages remain unindexed.

[Screenshot: Google results for index,follow, first page to appear in SERP]

And although there is a Cache button, there's nothing in the cache server:

[Screenshot: Google result, freshly indexed page not in cache]

I believe this indicates that Google pays quite a lot of attention to traffic, because I think the site's internal link structure would also have supported treating the home page as important. It will be interesting to see when the home page appears and replaces that sub-page in the SERP for the site query.

SERP Changes the next day... Home Page Takes Over

The next morning, the home page appeared in the SERP at the top, usurping the sub-page. Interestingly, by the evening, our page from the day before had disappeared!! Explanation below the pictures.

[Screenshot: 6 AM Google SERP, home page present]

[Screenshot: 4 PM Google search results, home page present, subsequent page gone]

The disappearance of the "Some Musings" sub-page shows that Google had in fact briefly considered the page important enough to store in some "new" section of the data centers that build its SERPs. But whether because of the page's one-day age or its reduced number of hits, the page's importance was lowered so far that it was flushed from the "new and important" section and disappeared from the SERPs entirely.

Will the home page stay there for good? My bet is yes: because it is the site's home page, it won't disappear... we'll keep watching!

Thanks Peter for all this

Thanks Peter for all this insightful information; you couldn't have been more specific than that. Believe it or not, it was entertaining to read your post. Google does have "funny" ways of indexing pages, but knowing these ways is a great advantage for web developers: they can preview what's gonna happen with a new web page and apply strategies to avoid unwanted results. Do you also have more relevant information on internet reputation management? I'd love to see that. I am also curious to know what happened next to your new website.