Avoiding Duplicate Content with Robots.txt
Wednesday, July 30, 2008
Anyone who has spent time researching organic search engine optimization knows that avoiding duplicate content is both a must and a struggle. To provide the best search results possible, Google tries to show only the original version of a document.
This is carried out algorithmically: Google's crawler scours the web for duplicate content and then tries to figure out which copy is the original. Documents deemed to be unoriginal duplicates are dropped from the Google index.
As the SEO business is especially prone to rumors, a number of people will tell you that having duplicate content on your site will cause Google to penalize you. This is not technically true: when Google finds duplicate content, it removes the duplicate from the Google index. Although some may argue that this is a penalty in itself, it does not directly harm search engine rankings the way a real Google penalty does.
Although this is an effective way to fight webspam artists who scrape Wikipedia and other sites for content, it can often cause problems for legitimate webmasters. Here's how:
Assume you write an awesome blog post on your site www.example.com and it gets a lot of attention. Great job! Now lots of people are noticing example.com and even linking to you in their blogs. But there is trouble brewing… your blog post is located at http://www.example.com/blog/greatpost/ but your blogging system keeps another copy at http://www.example.com/blog/archive/greatpost/.
Suppose Google's duplicate content detection decides that the copy at http://www.example.com/blog/archive/greatpost/ is the original and indexes that one, nixing the main version from the Google index. This is a BIG PROBLEM: nobody has linked to the archive version, so it has no PageRank and can't rank well in the search engine results pages, despite being the only version of your post visible in Google search results.
So here you have created a great blog post, earned a lot of attention, and now you get no SEO benefit because of the duplicate content rules. Talk about frustrating!
How could this have been prevented?
Perhaps the quickest and easiest way to avoid duplicate content issues is with the robots.txt file.
By adding the lines:
User-agent: *
Disallow: /blog/archive/
to the site's robots.txt file, you tell compliant search engine crawlers not to fetch anything under the /blog/archive/ directory. (A Disallow rule only takes effect inside a User-agent group; "User-agent: *" applies it to all crawlers.) Not only does this solve the issue above, it also keeps the same problem from recurring with future posts.
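Before relying on a rule like this, it is worth checking that it blocks the archive copy and nothing else. Here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt text is an assumption for illustration: it wraps the Disallow rule in a "User-agent: *" group so that it applies to every crawler. The is_crawlable helper is a made-up name, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents for the example.com scenario above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/archive/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/blog/greatpost/",
        "http://www.example.com/blog/archive/greatpost/",
    ):
        status = "allowed" if is_crawlable(ROBOTS_TXT, url) else "blocked"
        print(f"{url} -> {status}")
```

With this rule the main post should stay crawlable while the archive copy is blocked. Real crawlers may interpret edge cases differently from Python's parser, so treat this as a sanity check rather than a guarantee.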
When building a site, especially one that includes a CMS such as a blog, it is important to consider every place where content may be duplicated. That is the point at which to decide whether to block those pages from search engines in the robots.txt file.
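One rough way to find those places is to compare the text served at each path and group the paths that return the same content. The sketch below assumes you have already collected each page's text into a dictionary keyed by path. The function name and the sample pages are hypothetical, and exact hashing after whitespace normalization will only catch identical copies, not near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_duplicate_groups(pages: dict[str, str]) -> list[list[str]]:
    """Group paths whose page text is identical after simple normalization."""
    by_digest: defaultdict[str, list[str]] = defaultdict(list)
    for path, body in pages.items():
        # Collapse whitespace and ignore case so trivial differences don't hide a copy.
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        by_digest[digest].append(path)
    # Keep only the groups that contain more than one path.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    # Hypothetical sample data mirroring the example.com scenario.
    sample_pages = {
        "/blog/greatpost/": "My great post.",
        "/blog/archive/greatpost/": "My  great post.\n",
        "/about/": "About us",
    }
    for group in find_duplicate_groups(sample_pages):
        print("Duplicate content at:", ", ".join(group))
```

Each group it reports is a candidate for a Disallow rule covering all but the copy you want indexed.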
Posted by Quentin Muhlert on 07/30 at 09:44 AM

