Almost every site on the web has these two files at the root: /robots.txt and /sitemap.xml. Almost every site also gets at least one of them wrong. They look similar (small text files, search engines read them, somebody told you to set them up) but they do opposite things, and confusing them produces real damage.

This article goes through what each one is for, what each one is not for, and what a sensible default looks like.

robots.txt: a request, not a rule

robots.txt lives at https://example.com/robots.txt. It is the first thing a respectful crawler asks for when it visits your site. The file lists which paths the crawler should and should not request.

A minimal example, where the empty Disallow rule means "everything is fair game":

User-agent: *
Disallow:

A more typical WordPress site:

User-agent: *
Disallow: /wp-admin/
Disallow: /?s=
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml

What that says: any crawler is welcome, do not crawl /wp-admin/ or internal search-result pages, but admin-ajax.php is fine because front-end features rely on it. The last line tells crawlers where the sitemap lives.

Important: robots.txt is a polite request, not enforcement. Google, Bing, and most search engines respect it. Aggressive scrapers, malicious bots, and AI training crawlers respect it less, or not at all. If you want to actually block a path, you do it at the server level (HTTP basic auth, IP allowlist, firewall rule). robots.txt does not protect anything.
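
For illustration, here is a minimal sketch of what "block it at the server level" can look like in PHP, using HTTP basic auth. The username and hash below are placeholders; in practice you would usually configure this in the web server itself rather than in application code:

<?php
// protect.php: a sketch only. USER and HASH are placeholders; real
// deployments normally configure basic auth in Apache or nginx and
// never hardcode credentials in source.
const USER = 'admin';
const HASH = '$2y$10$replace-with-a-real-hash'; // output of password_hash()

$user = $_SERVER['PHP_AUTH_USER'] ?? '';
$pass = $_SERVER['PHP_AUTH_PW'] ?? '';

if (!hash_equals(USER, $user) || !password_verify($pass, HASH)) {
    header('WWW-Authenticate: Basic realm="Private"');
    http_response_code(401);
    exit('Authentication required.');
}
// Protected content is served below this point.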

What robots.txt is not for

The most common misuse: people put a page in robots.txt to "hide it from Google". That does the opposite of what they want. If a page is Disallow-ed in robots.txt, Google does not crawl it, but if some other site links to it, Google will still index the URL, sometimes with the snippet "no description available because of robots.txt". The URL ends up in search results without you being able to control the snippet.

To actually keep a page out of Google's index, you have two real options:

  • <meta name="robots" content="noindex"> in the HTML head. Google must crawl the page to see the meta tag, so this only works if you do not Disallow the page in robots.txt. Allow the crawl, then noindex the result. Pages with this meta will be removed from the index after Google next crawls them.
  • HTTP X-Robots-Tag: noindex header. Same effect, but it works for non-HTML files (PDFs, images). See the sketch after this list.
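
As an illustration of the header option, a PHP sketch that serves a PDF with the noindex header. The file path is a placeholder for this sketch; in practice the header is often set in the server config instead:

<?php
// serve-report.php: sends X-Robots-Tag from PHP for a non-HTML file.
// The path below is a placeholder.
header('Content-Type: application/pdf');
header('X-Robots-Tag: noindex');
readfile(__DIR__ . '/private/report.pdf');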

The takeaway: robots.txt controls crawling, the meta tag controls indexing. They are not interchangeable.

sitemap.xml: an invitation list

sitemap.xml does the opposite job. It tells crawlers "here are the URLs I want you to know about". The file is XML and lists each URL with optional metadata:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-04-26</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2026-03-01</lastmod>
  </url>
</urlset>

The fields that actually matter today are <loc> (the URL) and <lastmod> (when you last meaningfully updated the page). Older fields like <priority> and <changefreq> have been ignored by Google since 2017.

What a sitemap gets you:

  • Faster discovery of new pages, especially for sites without many incoming links.
  • Discovery of orphan pages (URLs not linked from any other page on your site).
  • A clean record for Google Search Console of every URL you consider canonical.

What a sitemap does not get you:

  • Higher ranking. Inclusion in a sitemap does not boost a page.
  • Forced indexing. Google can still decide a URL is not worth indexing even if you list it.
  • Protection from bad URLs. If your sitemap lists URLs that 404 or redirect, Search Console flags them. The quality of the sitemap matters.

Limits and structure

A single sitemap can list up to 50,000 URLs and weigh up to 50 MB uncompressed. Sites bigger than that split into multiple sitemaps and link them from a sitemap index file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-04-26</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-04-26</lastmod>
  </sitemap>
</sitemapindex>

WordPress with Yoast or Rank Math generates these automatically and groups them by post type. For a custom site, generating one from a build script or a small PHP loop takes about 30 lines of code.
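
A sketch of that loop, assuming a hardcoded page list (a real script would pull URLs and dates from a database or the filesystem):

<?php
// sitemap.php: a sketch of the "small PHP loop". The $pages array is a
// placeholder for whatever data source your site actually has.
header('Content-Type: application/xml; charset=UTF-8');

$pages = [
    ['loc' => 'https://example.com/',      'lastmod' => '2026-04-26'],
    ['loc' => 'https://example.com/about', 'lastmod' => '2026-03-01'],
];

echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($pages as $p) {
    echo "  <url>\n";
    echo '    <loc>' . htmlspecialchars($p['loc'], ENT_XML1) . "</loc>\n";
    echo '    <lastmod>' . $p['lastmod'] . "</lastmod>\n";
    echo "  </url>\n";
}
echo "</urlset>\n";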

Common mistakes worth avoiding

  • Listing the same URL with and without a trailing slash. Pick one form, use it everywhere, and 301-redirect the other (see the sketch after this list).
  • Listing redirected URLs. A sitemap should contain only canonical 200-OK URLs. If /old-page 301s to /new-page, list only /new-page.
  • Listing pages that have a noindex meta. Google wonders why you are advertising URLs you also tell it to ignore.
  • Forgetting to update <lastmod>. A <lastmod> that never changes is like no <lastmod> at all. Update it when the content changes meaningfully, not on every cron run.
  • Forgetting to declare the sitemap in robots.txt. The Sitemap: https://example.com/sitemap.xml line lets crawlers find it without you having to submit it manually.
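
For the trailing-slash bullet above, a PHP sketch of the 301 redirect, assuming the canonical form has no trailing slash. Most sites handle this in server config; this is the application-level equivalent:

<?php
// canonicalize.php: redirect trailing-slash URLs to the slash-less form.
// Assumption for this sketch: the canonical form has NO trailing slash.
// A full implementation would also preserve the query string.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
if (is_string($path) && $path !== '/' && substr($path, -1) === '/') {
    header('Location: ' . rtrim($path, '/'), true, 301);
    exit;
}
// Normal page handling continues below.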

A sensible default for a small site

Drop these two files at the root of your site, then forget about them.

/robots.txt:

User-agent: *
Disallow: /wp-admin/
Disallow: /?s=
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml

/sitemap.xml: generated by your CMS (WordPress, Drupal, Ghost, Hugo, Astro all have native or plugin support) or by a 30-line PHP script that lists your fixed pages plus your blog posts. Submit it once in Google Search Console, then leave it alone.

Verify in Search Console that Google can read both files (Settings, Crawling). If Search Console says "Couldn't fetch sitemap", investigate before assuming Google has discovered everything. The number of "Discovered - currently not indexed" URLs in Search Console is also worth a periodic look: if it grows, the issue is your content quality or internal linking, not your sitemap.