
Add support for subproject and per-version sitemaps, and styled main sitemap #12249


Draft: wants to merge 1 commit into base: main

agjohnson
Contributor

@agjohnson agjohnson commented Jun 13, 2025

This was a little morning project to expand our default sitemap to
allow projects to create more granular sitemaps. The idea here is that
the main sitemap becomes a sitemapindex sitemap that points to
multiple individual sitemap files. These include:

  • A sitemap of version URLs (this is the current prod sitemap)
  • A sitemap for each version (new)
  • A sitemap for each subproject version (new)

The new sitemaps will 404 if the user doesn't output them.
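
As a rough illustration of the structure described above (not the code in this PR), a sitemapindex document can be built with Python's standard library. The function name and all URLs here are hypothetical; the actual paths Read the Docs serves may differ.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a <sitemapindex> document (as a string) listing the given sitemap URLs."""
    # Serialize the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for url in sitemap_urls:
        entry = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
        loc = ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc")
        loc.text = url
    return ET.tostring(root, encoding="unicode")


# Hypothetical URLs, for illustration only.
index_xml = build_sitemap_index([
    "https://docs.example.com/sitemap.xml?versions",
    "https://docs.example.com/en/latest/sitemap.xml",
    "https://docs.example.com/projects/sub/en/latest/sitemap.xml",
])
print(index_xml)
```

Each `<sitemap>` entry only carries a `<loc>`; a crawler that follows an entry the project did not output would receive the 404 mentioned above.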

And in addition, this replaces the XML and XML comment approach, which
doesn't render in many browsers, with a super basic XSLT template. The
template isn't necessarily needed but it's a nicer experience for
browsers.
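
For context, a stylesheet is attached to an XML document with an `xml-stylesheet` processing instruction placed before the root element. The sketch below shows that mechanism only; the helper name and the `/sitemap.xsl` path are assumptions, not the names used in this PR.

```python
from xml.sax.saxutils import quoteattr


def with_stylesheet(xml_body, xsl_href):
    """Prefix an XML body with a declaration and an xml-stylesheet instruction."""
    # quoteattr() escapes the href and wraps it in quotes.
    pi = f'<?xml-stylesheet type="text/xsl" href={quoteattr(xsl_href)}?>'
    return f'<?xml version="1.0" encoding="UTF-8"?>\n{pi}\n{xml_body}'


# Hypothetical sitemap body and stylesheet path, for illustration only.
SITEMAP_BODY = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://docs.example.com/en/latest/</loc></url>"
    "</urlset>"
)
document = with_stylesheet(SITEMAP_BODY, "/sitemap.xsl")
print(document)
```

Browsers that support XSLT apply the stylesheet when rendering the file, while the underlying XML stays unchanged.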


@humitos
Member

humitos commented Jun 17, 2025

  • A sitemap of version URLs (this is the current prod sitemap)
  • A sitemap for each version (new)
  • A sitemap for each subproject version (new)

The new sitemaps will 404 if the user doesn't output them.

Are these new sitemaps auto-generated by Read the Docs if the user doesn't output them as we currently do?

@humitos
Member

humitos commented Jun 17, 2025

Does it make sense to return different content under /sitemap.xml when adding the ?version argument? Is this something standard for sitemaps?

@humitos
Member

humitos commented Jun 17, 2025

And in addition, this replaces the XML and XML comment approach, which
doesn't render in many browsers, with a super basic XSLT template.

Is XSLT supported for sitemaps? I've never heard of it. I checked the official sitemaps documentation and Google's documentation and didn't find any mention of it. Do we have documentation around this?

@agjohnson
Contributor Author

agjohnson commented Jun 18, 2025

Is XSLT supported for sitemaps?

XSLT is a feature from the late 90s for XML, not sitemaps specifically. Sitemaps are just XML so XSLT is supported through that. Crawlers don't interact with the template, they just use the XML.

There are many examples of this in the wild, like: https://www.sitemap.style/

Is this something standard for sitemaps?

The top-level sitemap request is convention (/sitemap.xml), but linked sitemaps don't need to follow any specific convention. It's common to have sitemaps that link to sub-sitemaps like this. For example, sitemaps have a limit of 50k URLs/50 MB, so if you exceed either you have to paginate the sitemap and have a sitemapindex pointing to each page. There's no specification for how to split these up, and https://www.sitemaps.org/protocol.html#index doesn't touch on this either way.
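
To make the pagination case concrete, here is a minimal sketch that splits a URL list into pages of at most 50,000 entries, each of which would become its own sitemap file listed in the sitemapindex. It only handles the URL-count limit, not the file-size limit, and the function name and URLs are hypothetical.

```python
MAX_URLS_PER_SITEMAP = 50_000


def paginate(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive pages of at most `limit` entries."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


# Hypothetical site with 120,000 pages, for illustration only.
urls = [f"https://docs.example.com/page-{n}/" for n in range(120_000)]
pages = paginate(urls)
print([len(page) for page in pages])
```

How the pages are named and grouped is up to the site, which is the point made above: the protocol only requires that the index lists each sitemap file.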

I mostly just didn't want to get into a dedicated view/URL just yet. We could make a separate view and URL for the list of version URLs if there is a strong reason though. We just can't mix a list of sitemaps with a list of URLs in a single sitemap.

Are these new sitemaps auto-generated by Read the Docs if the user doesn't output them as we currently do?

Not at the moment. Currently, users can output these sitemap files but there are some hacks required to be able to use them instead of our sitemap with a single link to the active version base URL.

We could generate these files, but I think it's also fine to require the user to use their tooling to generate these. These per-version and per-subproject-version sitemaps will 404 if they aren't output by the project.

@humitos
Member

humitos commented Jun 23, 2025

The top-level sitemap request is convention (/sitemap.xml), but linked sitemaps don't need to follow any specific convention. It's common to have sitemaps that link to sub-sitemaps like this. For example, sitemaps have a limit of 50k URLs/50 MB, so if you exceed either you have to paginate the sitemap and have a sitemapindex pointing to each page. There's no specification for how to split these up, and sitemaps.org/protocol.html#index doesn't touch on this either way.

Yeah, I was asking about this in particular because of the ?version argument that returns completely different content. I'm not sure whether that will affect how crawlers parse the content. I think these URLs are fixed names and I'm not sure we can change them. I mean, /sitemap.xml is valid and for indexes we should probably use /sitemap_index.xml. Otherwise, how will crawlers find these non-standard URLs? [1]

We could generate these files, but I think it's also fine to require the user to use their tooling to generate these.

I think it's fine to require the user to generate these files for now. We can auto-generate them later if we want to.

These per-version and per-subproject-version sitemaps will 404 if they aren't output by the project.

However, I don't think we should link to them if they are dead links. I suppose this will have a negative impact on SEO.

Footnotes

  1. ... it seems it's fine as long as we declare them in robots.txt

@agjohnson
Contributor Author

agjohnson commented Jun 23, 2025

I mean, /sitemap.xml is valid and for indexes we should probably use /sitemap_index.xml. Otherwise, how will crawlers find these non-standard URLs?

Because sitemap.xml?versions is always included from sitemap.xml. The concept is that the standard sitemap.xml file becomes a pointer to all of the individual sitemap files. Engines will crawl this first and then crawl each sitemap in the sitemap index file, like sitemap.xml?versions. This is all standard behavior, especially because of the limitations in sitemaps that require sitemap indexes.
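
A simplified sketch of that traversal, from the crawler's side: parse the index and collect each child sitemap URL to fetch next. This illustrates the general pattern, not how any particular search engine is implemented, and the sample index and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical index content, for illustration only.
INDEX_XML = """\
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://docs.example.com/sitemap.xml?versions</loc></sitemap>
  <sitemap><loc>https://docs.example.com/en/latest/sitemap.xml</loc></sitemap>
</sitemapindex>
"""


def child_sitemaps(index_xml):
    """Return the <loc> of every <sitemap> entry in a sitemapindex document."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]


for url in child_sitemaps(INDEX_XML):
    print(url)
```

Because the child URLs are discovered from the index itself, they do not need to follow a fixed naming convention.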

However, I don't think we should link to them if they are dead links. I suppose this will have a negative impact on SEO.

This could be an improvement for later, but I wouldn't block this incremental change on an end goal change like that.

Our current sitemap already doesn't provide projects with good SEO, so any improvement to sitemap handling would be better for SEO. We can only guess whether returning a 404 on sitemaps is even potentially a negative for SEO in comparison.

it seems it's fine as long as we declare them in robots.txt

Sitemaps can be linked from robots.txt, but I started with this as a sitemap feature because it's more common for users to override robots.txt for various reasons. Overriding robots.txt will break any features we add to our default, so the sitemap seemed like a better place to isolate features.

I don't feel robots.txt is our best option, but it's a direction we could go.
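
For reference, the robots.txt alternative would amount to listing each sitemap on its own `Sitemap:` line. A minimal sketch of that, with a hypothetical helper name and URLs:

```python
def render_robots_txt(sitemap_urls):
    """Render a permissive robots.txt that declares each sitemap URL."""
    lines = ["User-agent: *", "Allow: /", ""]
    lines += [f"Sitemap: {url}" for url in sitemap_urls]
    return "\n".join(lines) + "\n"


# Hypothetical URLs, for illustration only.
robots_txt = render_robots_txt([
    "https://docs.example.com/sitemap.xml",
    "https://docs.example.com/en/latest/sitemap.xml",
])
print(robots_txt)
```

A project that overrides robots.txt with its own file would replace this output entirely, which is the drawback described above.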

Development

Successfully merging this pull request may close these issues.

Link to subprojects from sitemap.xml