Clarify robots.txt generation #620

Closed
adam-sroka opened this issue Jul 30, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@adam-sroka
Contributor

Issue description

Hi and thanks a lot for all the work on Congo 👏 !

I've been a Congo user for quite some time, and one thing that bugged me for a while is that my website's robots.txt kept containing Disallow: /, rendering the site pretty much useless for search engine indexing, despite me not explicitly disallowing this anywhere.

I naturally had a read through the docs and existing issues, but apart from the robots and enableRobotsTXT params nothing seemed relevant, and only by looking into the code did I find the culprit:

```go-html-template
{{- if hugo.IsProduction | or (eq .Site.Params.env "production") }}
Allow: /
{{- else }}
Disallow: /
{{- end }}
```

The Allow all vs. Disallow all decision is made based on hugo.IsProduction and .Site.Params.env == "production", which is what allowed me to track down the bug in my case. When running hugo server locally, Hugo builds in the development environment, so the template outputs Disallow: /. After reading about it in this article, I found out that hugo --minify should build in the production environment by default, which was indeed the case: in the SSG output on my local machine, my robots.txt did contain Allow: /. The real bug was that, for some reason, in the CI pipeline (woodpecker-ci) which deploys my site to Codeberg Pages, hugo --minify ran in the development environment by default, so I had to change the command to hugo --minify --environment production to finally fix it.
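For anyone tracing a similar problem, a quick way to confirm which environment a given build actually ran in is to drop a throwaway debug line into any template. This is only a sketch using the standard hugo.Environment and hugo.IsProduction template functions, not something the theme provides:

```go-html-template
{{/* Debugging sketch only: emits the active build environment into the rendered
     output so you can see what the CI runner actually used. Remove once checked. */}}
<!-- hugo environment: {{ hugo.Environment }} (production: {{ hugo.IsProduction }}) -->
```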

tl;dr

I'd like to ask two things:

  1. What is the reason for allowing all robots in production but not in development? Is this behaviour beneficial enough to keep?
  2. If yes, could we at least document it a bit better, so that it's more obvious and other people don't run into issues like this?

I'm happy to help with changing this/improving the docs on it :).

Thanks a lot 💙 !

Theme version

v2.6.1

Hugo version

hugo v0.110.0

Which browser rendering engines are you seeing the problem on?

Chromium (Google Chrome, Microsoft Edge, Brave, Vivaldi, Opera, etc.)

URL to sample repository or website

https://adam.sr

Hugo output or build error messages

No response

@adam-sroka added the bug label on Jul 30, 2023
@jpanther
Owner

Thanks for the report. It's been so long since I added this to the project that I can't remember the reasoning behind doing it this way. I think the logic was simply that if you're working in development, you wouldn't want that project being scraped by search engines. I'd imagine that most of the time that would never happen anyway, so perhaps it's a pointless "feature" that could be removed.

@adam-sroka
Contributor Author

Hi and thanks for the reply :)!

In my opinion, behaviour that depends on the environment in a way the user hasn't explicitly asked for isn't optimal, even if it's only trying to set sensible defaults. It can introduce unwanted behaviour that is sometimes hard to trace, as in my case. I'd also presume that most users who run Hugo in development mode do so locally and aren't concerned about their site being scraped by search engines.

What I'd imagine could be a better way of doing this is to have a single parameter somewhere that dictates what goes into the robots.txt file. We already have the parameters enableRobotsTXT (set to true, which creates robots.txt by default) and robots (unset). I haven't yet delved into how robots is handled, or whether it works on a per-page basis, but setting it to allow or allow_all (or something like that) by default, which would write Allow: / to robots.txt, could be a way to go; see the sketch below.
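To make that concrete, a robots.txt template driven by a single parameter could look roughly like the following. The parameter name and default value here are purely hypothetical; this is not how the theme currently behaves:

```go-html-template
{{/* Hypothetical sketch, not the theme's actual template: derive robots.txt from
     a single site parameter, defaulting to allowing all crawlers. */}}
User-agent: *
{{- if eq (.Site.Params.robots | default "allow") "allow" }}
Allow: /
{{- else }}
Disallow: /
{{- end }}
```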

I'm happy to have a go at a PR over the weekend that you can then review. Please let me know if you prefer any specific approach 😊.

Thank you!

@jpanther
Owner

The two config parameters serve different purposes. enableRobotsTXT does as you say: it specifies whether the robots.txt file should be created at all. That's a Hugo-specific flag and part of the standard config. The robots parameter in front matter is part of the theme, and it determines what information is placed into the robots meta tag for that particular content page. The use case there is that you might have specific requirements for a single page, like disabling caching, and it allows that to be set at a granular level.
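For readers following along, the per-page behaviour described above generally follows the pattern sketched below in a head partial; this is the common approach, not a verbatim copy of the theme's code:

```go-html-template
{{/* Sketch of the usual pattern: whatever a page sets in its `robots` front
     matter (e.g. robots: "noindex, nofollow") ends up in the robots meta tag. */}}
{{ with .Params.robots }}
<meta name="robots" content="{{ . }}" />
{{ end }}
```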

I think 99% of the time people are just going to allow all in robots.txt, so that's probably the sensible default. If you understand what the file is and how it works, you're going to tweak it yourself anyway, so trying to get smart about it with parameters would probably only confuse people and still not be flexible enough to meet individual requirements.

I'm going to change the default template to allow all and remove the test for whether the site is in production or development.
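In other words, the default template would reduce to something like the following sketch; the theme's actual file may include additional lines (for example a Sitemap reference), so treat this as illustrative only:

```go-html-template
{{/* Sketch of the simplified default: always allow all crawlers, with no
     production/development check. */}}
User-agent: *
Allow: /
```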
