I recently re-released my website as a minimal usable design, meaning that it was initially released with only the minimal feature set needed to be usable. However, every website has some nagging technical concerns in its underbelly which all web developers would like to ignore. If I'm going to call myself a professional web developer, I reckon I should be tending to them.
One of these technical concerns is the set of meta files which help babysit the automated web crawling robots (or simply "bots") that collect web pages to be indexed by the search engines. All search engines use crawler bots to find websites and build their databases, so if you want your site to be found, this stuff is important.
Two specific files come to mind first: robots.txt and the Sitemap. I'll talk about Sitemaps in this post.
A Sitemap is a file listing the page URLs of your website which you want indexed by the search engines. It is typically an XML file, although there are other text-based formats which will work too. To see what one looks like, you can simply point your web browser to http://www.kixx.name/sitemap.xml, which is the Sitemap for this very website.
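To make the XML format a little more concrete, here is a minimal sketch in Python of how such a file could be put together. The `build_sitemap` function and the example.com page URLs are my own illustration, not how this site's Sitemap is actually generated, and a real Sitemap may carry optional tags (such as a last-modified date) that I leave out here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a Sitemap XML document (as a string) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical page URLs, for illustration only.
pages = ["http://www.example.com/", "http://www.example.com/about"]
print(build_sitemap(pages))
```

Each page gets one `<url>` element with a `<loc>` child holding its address, and that is about all a bare-bones Sitemap needs.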
Although a Sitemap is not necessary to get the search engines to find and index your site, it appears that it can help out, especially in cases where you have dynamically generated pages. Popular ecommerce and blogging platforms come to mind.
The easiest way to submit your Sitemap to all of the search engines at once is to host it on your server at the root URL of your hostname, like this: http://www.example.com/sitemap.xml. But there is more to the story.
Where Are the Sitemaps for Other Sites?
I was curious to see how other sites publish their Sitemaps, so I poked around a few of the big online content publishers. This is what I found:
The Huff' Post simply lists all of their recent top articles in the Sitemap. And yes, it's a big file.
BuzzFeed uses their /sitemap.xml file as an index to other Sitemap files which organize content by tag, category, author, etc. One of these is their category Sitemap file.
The Daily Beast also uses their sitemap.xml as an index file, but they simply segment their entire catalog into numbered groups, each referencing another Sitemap file (be careful, those files are really big).
The Verge does not have a Sitemap at all.
Boing Boing served an empty Sitemap, which is probably not good practice.
Where Is the Right Place to Host Your Sitemap?
Aside from the empty file served by Boing Boing, they are all right.
Typically it's a good idea to host your Sitemap at the root of the website like
http://www.example.com/sitemap.xml. But for big sites you should consider
using the root sitemap.xml file as an index file which then points the crawlers
to other Sitemap files, similar to the techniques used by BuzzFeed and The Daily
Beast. Google Webmaster documentation has some great
guidelines for Sitemaps
which will give you some more detailed and authoritative information. Google's
Webmaster docs say a Sitemap that reaches 50MB is too big and should be split
up... You don't say?!
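Here is a rough sketch in Python of the index technique described above: split a large list of page URLs into groups, then publish a root index that points at one Sitemap file per group. The function names, the example.com URLs, and the `sitemap-N.xml` naming scheme are all assumptions of mine for illustration. The 50,000 figure is the per-file URL limit from the Sitemap protocol, which applies alongside the file size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file URL limit in the Sitemap protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Return a Sitemap index document pointing at the given Sitemap files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical catalog, split into tiny groups so the output stays readable.
pages = [f"http://www.example.com/articles/{n}" for n in range(5)]
groups = chunk_urls(pages, size=2)

# Hypothetical naming scheme for the child Sitemap files.
child_files = [
    f"http://www.example.com/sitemap-{n}.xml" for n in range(1, len(groups) + 1)
]
print(build_sitemap_index(child_files))
```

Each group would then be written out as its own regular Sitemap file, while the index is what gets served from /sitemap.xml.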
According to the Google Webmaster docs, you can also list your Sitemaps in your robots.txt file. If you have a Google Webmaster account you can even upload your Sitemap directly to Google through their web portal, but I don't see any advantages to doing it that way. I'd rather host it on my web domain for all search crawlers to find.
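Listing a Sitemap in robots.txt comes down to a single `Sitemap:` line holding the full URL. As a small sketch, the Python below parses a made-up robots.txt with the standard library's `urllib.robotparser` and reads the Sitemap entries back out. The file contents are hypothetical, and `site_maps()` needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: allow everything, and point to the Sitemap.
robots_txt = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns the list of Sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

The `Sitemap:` line is independent of the `User-agent` sections, so it can go anywhere in the file.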
How to Create a Sitemap?
Chances are that your site already has one which is automatically generated by your shiny modern web framework!
WordPress generates a Sitemap right out of the box, but there are also some good plugins to help you make it even better.
There is a Sitemap generation module for Drupal too.
So before you fret over it, check out the
/sitemap.xml URL on your root
domain. The Sitemap might already be there. If so, it's worth looking it over
and checking it against the
docs to make sure it contains what you think it should. Then maybe review the
docs on your web framework to find out what plugins and modules are available
to modify it if you think it could be improved.
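If the Sitemap is too long to review by eye, a few lines of Python can list what it contains. This is only a sketch under my own assumptions: the `list_locations` helper is hypothetical, and the sample document stands in for whatever your framework really generates.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the namespace defined by the Sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_locations(sitemap_xml):
    """Return every <loc> value found in a Sitemap or Sitemap index document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS) if loc.text]


# Sample document standing in for a real, generated Sitemap.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>"""

for url in list_locations(sample):
    print(url)
```

To run it against a live site, the bytes returned by `urllib.request.urlopen(...).read()` can be passed to `list_locations` in place of `sample`. Because the helper looks for any `<loc>` element, it also works on an index file, where it lists the child Sitemap files instead of pages.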
It's a boring thing to think about and certainly no fun to create, but a Sitemap seems to provide a benefit to search engines. So, if SEO is important to your website, it's worth taking the time to get the Sitemap right.
Leaf image by London Permaculture