Understanding Robots.txt and Sitemap.xml

For those unaware, there are two special files that almost all websites use to influence how visitors see them. Not human visitors, but bots: web crawlers, search engines, any of the various “internet archival” sites, you name it. Maybe you’d like to know this because you’re building a website or service, or maybe you’re just curious about how to read these files. In either case, let’s get to explaining.

Robots.txt

A robots.txt file placed at the root of a website (that is, at /robots.txt, directly after the domain name in the URL) defines that site’s use of the Robots Exclusion Standard, a way of informing web crawlers what should and should not be scanned. The file itself is a series of lines of text, each line being one “command.”

Note that the system is purely advisory: nothing actually enforces it, and it is up to the bot in question to decide whether it will obey the directives (or even fetch the file in the first place). Most well-established search engines will respect a robots.txt, but less-than-scrupulous crawlers and more malicious agents may disregard it entirely; more on that later.

Example

YouTube:

# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /channel/*/community
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /user/*/community
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml

TD-StorageBay:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-Agent: *
# Disallow: /

# Add a 1 second delay between successive requests to the same server, limits resources used by crawler
# Only some crawlers respect this setting, e.g. Googlebot does not
Crawl-delay: 1

# Based on details in https://gitlab.com/gitlab-org/gitlab/blob/master/config/routes.rb, https://gitlab.com/gitlab-org/gitlab/blob/master/spec/routing, and using application
User-Agent: *
Disallow: /autocomplete/users
Disallow: /search
Disallow: /api
Disallow: /admin
Disallow: /profile
Disallow: /dashboard
Disallow: /projects/new
Disallow: /groups/new
Disallow: /groups/*/edit
Disallow: /users
Disallow: /help
# Only specifically allow the Sign In page to avoid very ugly search results
Allow: /users/sign_in

# Global snippets
User-Agent: *
Disallow: /s/
Disallow: /snippets/new
Disallow: /snippets/*/edit
Disallow: /snippets/*/raw

# Project details
User-Agent: *
Disallow: /*/*.git
Disallow: /*/*/fork/new
Disallow: /*/*/repository/archive*
Disallow: /*/*/activity
Disallow: /*/*/new
Disallow: /*/*/edit
Disallow: /*/*/raw
Disallow: /*/*/blame
Disallow: /*/*/commits/*/*
Disallow: /*/*/commit/*.patch
Disallow: /*/*/commit/*.diff
Disallow: /*/*/compare
Disallow: /*/*/branches/new
Disallow: /*/*/tags/new
Disallow: /*/*/network
Disallow: /*/*/graphs
Disallow: /*/*/milestones/new
Disallow: /*/*/milestones/*/edit
Disallow: /*/*/issues/new
Disallow: /*/*/issues/*/edit
Disallow: /*/*/merge_requests/new
Disallow: /*/*/merge_requests/*.patch
Disallow: /*/*/merge_requests/*.diff
Disallow: /*/*/merge_requests/*/edit
Disallow: /*/*/merge_requests/*/diffs
Disallow: /*/*/project_members/import
Disallow: /*/*/labels/new
Disallow: /*/*/labels/*/edit
Disallow: /*/*/wikis/*/edit
Disallow: /*/*/snippets/new
Disallow: /*/*/snippets/*/edit
Disallow: /*/*/snippets/*/raw
Disallow: /*/*/deploy_keys
Disallow: /*/*/hooks
Disallow: /*/*/services
Disallow: /*/*/protected_branches
Disallow: /*/*/uploads/
Disallow: /*/-/group_members
Disallow: /*/project_members

Note that, as in many programming languages, lines beginning with a # are comments and are ignored.

This is the general format of a robots.txt: you specify a User-agent to match on, then a list of path patterns to either Allow or Disallow. Paths are matched as prefixes, and the * is a “match any sequence of characters” wildcard that can be used in paths or agent names, so User-agent: * matches every bot.
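If you’d rather not eyeball these rules yourself, Python’s standard library ships a parser for exactly this format. Here’s a minimal sketch (nothing YouTube-specific, just the stock urllib.robotparser module) that fetches the file shown above and asks whether a couple of URLs are crawlable:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.youtube.com/robots.txt")
rp.read()  # fetch and parse the file

# /results is explicitly disallowed for every agent; the root page is not
print(rp.can_fetch("*", "https://www.youtube.com/results"))  # expect False
print(rp.can_fetch("*", "https://www.youtube.com/"))         # expect True

Note that the standard-library parser is fairly basic; if you rely on mid-path * wildcards, it’s worth double-checking its behavior or reaching for a dedicated library.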

There is one bonus feature: the Sitemap directive, which points crawlers at a sitemap to follow; sitemaps are discussed further down.
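Continuing the earlier sketch (again assuming Python’s urllib.robotparser, version 3.8 or newer), you can read that directive back out of a parsed file:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.youtube.com/robots.txt")
rp.read()

# site_maps() returns the Sitemap entries as a list, or None if there are none
print(rp.site_maps())  # expect ['https://www.youtube.com/sitemaps/sitemap.xml']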

Controversy

In theory, the ability to block results from being crawled is useful for blocking things like internal use only pages, backend resources that do nothing by themselves when visited, and just stuff that you may not want being randomly accessed and spread around.

However, some groups like Archive Team and the Internet Archive explicitly disregard them. Archive Team calls the standard “obsolete” and a “suicide note”, deeming it an “understandable stop-gap fix for a temporary problem, a problem that has long, long since been solved.”1

The Internet Archive announced in April 2017 that, going forward, it will disregard robots.txt, reasoning that because these files are primarily geared toward search engines, they don’t play well with the main goal of an archival project.2

My Take

(Warning: opinions)

Here’s my stance on this, in two words: you’re wrong. What they are both saying, in a sense, is that the webmaster / owner of a particular site is not allowed to dictate what someone sees on their own site. As my site, it is my decision what does and does not get shown to the public, searched by crawlers, or made available for all of eternity, because it’s my property, and mine to do with as I please. If I need to resort to unilaterally excluding certain IPs, user agents, or any other form of identifying information to shape the site in my image, then I am 100% within my rights to do so. (Yes, this is from the perspective of a private owner, not a government institution or public corporation.) By disregarding robots.txt, you’re indirectly telling us that what we think about our own property does not matter: because you consider the method obsolete, or because your intent is to preserve every single aspect of a website regardless of whether it’s actually useful content, you feel free to do as you please.

I do not add disallow directives to stop useful content from being shared; I add them because visiting that page will likely just give you an error (internal usage), or produce bogus results (processing / redirect pages), or because it only serves to deliver the same content five different ways and I don’t want to clutter anyone’s results with that.

Ahem, I may have gotten a little carried away.

Sitemap.xml

A sitemap, on the other hand, is the opposite. Instead of specifying what cannot be reached, it’s a pre-generated list of pages that can be reached.

A sitemap is an XML file with a list of locations that web crawlers may use as starting points for expanding their search, instead of starting at the web root and blindly following every link they see. It may also contain extra information like priorities (if one page is more important than others), how often a page changes and should be re-scanned, and the last time it was updated, allowing crawlers to skip pages they know have not changed since the last time they checked. Sitemaps can also reference other sitemaps, creating a tree of XML files; since the official specification limits a single file to 50 MiB or 50,000 URLs, this may be necessary for larger sites.

Example

This site, truncated:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <url>
        <loc>https://teknikaldomain.me/gallery/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/author/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/categories/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/tags/cubing/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>
            https://teknikaldomain.me/gallery/endless-cubing-fun/
        </loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/tags/gan356-x/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/tags/giiker-i3s/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/categories/real-life/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/tags/rubiks-cube/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/tags/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/categories/teks-toys/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/author/teknikal_domain/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/</loc>
        <lastmod>2020-02-18T11:08:35-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/tags/cards/</loc>
        <lastmod>2020-02-18T10:39:27-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/gallery/me-and-my-cards/</loc>
        <lastmod>2020-02-18T10:39:27-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/tags/playing-cards/</loc>
        <lastmod>2020-02-18T10:39:27-05:00</lastmod>
    </url>
    <url>
        <loc>https://teknikaldomain.me/tags/hugo/</loc>
        <lastmod>2020-02-18T09:44:48-05:00</lastmod>
    </url>
</urlset>

These can be huge since, ideally, they’re a list of every single reachable page on your site; the one for this site is 750 lines long and about 20 kB as of the time of writing.
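If your site generator doesn’t already build one for you, producing a sitemap is easy to script. Here’s a minimal sketch (not how this site actually does it) using Python’s standard xml.etree.ElementTree, with placeholder URLs and timestamps:

import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# (URL, last-modified) pairs: placeholders, substitute your own pages
pages = [
    ("https://example.com/", "2020-02-18T11:08:35-05:00"),
    ("https://example.com/tags/", "2020-02-18T10:39:27-05:00"),
]

urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
for loc, lastmod in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod

# writes the <?xml ...?> declaration plus the <urlset> tree to sitemap.xml
ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)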

The benefits of a sitemap depend on your search engine of choice, ranging from nothing to potentially better results, though you’ll rarely, if ever, get worse results for adding one. You can also ping a search engine by providing them a map, essentially informing the engine that “this has changed, and here’s everything that you can access now”. This usually works by submitting the sitemap URL in a request to a designated endpoint. For example, the URL http://www.google.com/ping?sitemap=MAP, where MAP itself is a valid URL like https://teknikaldomain.me/sitemap.xml, will tell Google that your sitemap was just updated and that the existing index data it has should be refreshed.
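For the curious, that ping is nothing more than an HTTP GET with your sitemap’s URL as a query parameter. A quick sketch using Python’s urllib and this site’s sitemap URL from above (swap in your own):

from urllib.parse import urlencode
from urllib.request import urlopen

sitemap = "https://teknikaldomain.me/sitemap.xml"
ping_url = "http://www.google.com/ping?" + urlencode({"sitemap": sitemap})

with urlopen(ping_url) as response:
    # a 200 status means the ping was accepted
    print(response.status)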

With all that covered, though, it’s a really simple standard, and there’s not much more to say. Sitemaps are an inclusive measure, not an exclusive one, meaning they likely won’t see the kind of treatment that robots.txt files are getting anytime soon.