From 7cd778983d87356a4d2698470ba5c7b422744194 Mon Sep 17 00:00:00 2001
From: Philip Durbin
Date: Thu, 18 Apr 2024 14:53:20 -0400
Subject: [PATCH 1/4] rename release note snippet with "8936" #8936

---
 ...s-in-sitemap.md => 8936-more-than-50000-entries-in-sitemap.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename doc/release-notes/{8983-more-than-50000-entries-in-sitemap.md => 8936-more-than-50000-entries-in-sitemap.md} (100%)

diff --git a/doc/release-notes/8983-more-than-50000-entries-in-sitemap.md b/doc/release-notes/8936-more-than-50000-entries-in-sitemap.md
similarity index 100%
rename from doc/release-notes/8983-more-than-50000-entries-in-sitemap.md
rename to doc/release-notes/8936-more-than-50000-entries-in-sitemap.md

From ceb8c0f89009a6d6da147ae9553f37f4f47be081 Mon Sep 17 00:00:00 2001
From: Philip Durbin
Date: Thu, 18 Apr 2024 15:01:12 -0400
Subject: [PATCH 2/4] simplify release note, add upgrade section #8936

---
 .../8936-more-than-50000-entries-in-sitemap.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/doc/release-notes/8936-more-than-50000-entries-in-sitemap.md b/doc/release-notes/8936-more-than-50000-entries-in-sitemap.md
index 6fcf3180283..7b367e328c1 100644
--- a/doc/release-notes/8936-more-than-50000-entries-in-sitemap.md
+++ b/doc/release-notes/8936-more-than-50000-entries-in-sitemap.md
@@ -1,7 +1,11 @@
-The sitemap file generation can handle more than 50,000 entries if needed with the [sitemapgen4j](https://github.com/gdcc/sitemapgen4j) library, maintained by the Global Dataverse Community Consortium.
+Dataverse can now handle more than 50,000 items when generating sitemap files, splitting the content across multiple files to comply with the Sitemap protocol.
 
-In this case, the Dataverse Admin API `api/admin/sitemap` create a sitemap index file, called `sitemap_index.xml`, in place of the single `sitemap.xml` file. This created file reference multiples simple sitemap file, named ``sitemap1.xml``, ``sitemap2.xml``, etc. This referenced files will be as many files as necessary to contain the URLs of dataverses and datasets presents your installation, while respecting the limit of 50,000 URLs per file. See the [config section of the Installation Guide](https://guides.dataverse.org/en/latest/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines) for details.
+For details see https://dataverse-guide--10321.org.readthedocs.build/en/10321/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines #8936 and #10321.
 
-A HTML preview can be found [here](https://dataverse-guide--10321.org.readthedocs.build/en/10321/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines).
+## Upgrade instructions
 
-For more information, see [#8936](https://github.com/IQSS/dataverse/issues/8936).
+If your installation has more than 50,000 entries, you should re-submit your sitemap URL to Google or other search engines. The file in the URL will change from ``sitemap.xml`` to ``sitemap_index.xml``.
+
+As explained at https://dataverse-guide--10321.org.readthedocs.build/en/10321/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines this is the command for regenerating your sitemap:
+
+`curl -X POST http://localhost:8080/api/admin/sitemap`

From b228fe7ee5572fca6ada2c646b2f81a696f62170 Mon Sep 17 00:00:00 2001
From: Philip Durbin
Date: Thu, 18 Apr 2024 15:01:36 -0400
Subject: [PATCH 3/4] rewrite sitemap docs (50,000 items now supported) #8936

---
 .../source/installation/config.rst | 38 +++++++++++--------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/doc/sphinx-guides/source/installation/config.rst b/doc/sphinx-guides/source/installation/config.rst
index bd185e2d008..e4ff65f059e 100644
--- a/doc/sphinx-guides/source/installation/config.rst
+++ b/doc/sphinx-guides/source/installation/config.rst
@@ -2052,39 +2052,47 @@ If you are not fronting Payara with Apache you'll need to prevent Payara from se
 Creating a Sitemap and Submitting it to Search Engines
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Sitemap file
-############
+Creating a Sitemap
+##################
 
-Search engines have an easier time indexing content when you provide them a sitemap. The Dataverse Software sitemap includes URLs to all published Dataverse collections and all published datasets that are not harvested or deaccessioned.
+Search engines have an easier time indexing content when you provide them a sitemap. Dataverse can generate a sitemap that includes URLs to all published collections and all published datasets that are not harvested or deaccessioned.
 
 Create or update your sitemap by adding the following curl command to cron to run nightly or as you see fit:
 
 ``curl -X POST http://localhost:8080/api/admin/sitemap``
 
-This will create or update a file in the following location unless you have customized your installation directory for Payara:
+On a Dataverse installation with many datasets, the creation or updating of the sitemap can take a while. You can check Payara's server.log file for "BEGIN updateSiteMap" and "END updateSiteMap" lines to know when the process started and stopped and any errors in between.
+
+For compliance with the `Sitemap protocol `_, the generated sitemap will be a single file with 50,000 items or fewer or it will be split into multiple files.
+
+Single Sitemap File
+###################
+
+If you have 50,000 items or fewer, a single sitemap will be generated in the following location (unless you have customized your installation directory for Payara):
 
 ``/usr/local/payara6/glassfish/domains/domain1/docroot/sitemap/sitemap.xml``
 
-On Dataverse installation with many datasets, the creation or updating of the sitemap can take a while. You can check Payara's server.log file for "BEGIN updateSiteMap" and "END updateSiteMap" lines to know when the process started and stopped and any errors in between.
+Once the sitemap has been generated in the location above, it will be served at ``/sitemap.xml`` like this: https://demo.dataverse.org/sitemap.xml
 
-https://demo.dataverse.org/sitemap.xml is the sitemap URL for the Dataverse Project Demo site and yours should be similar.
+Multiple Sitemap Files (Sitemap Index File)
+###########################################
 
-Once the sitemap has been generated and placed in the domain docroot directory, it will become available to the outside callers at /sitemap/sitemap.xml; it will also be accessible at /sitemap.xml (via a *pretty-faces* rewrite rule). Some search engines will be able to find it at this default location. Some, **including Google**, need to be **specifically instructed** to retrieve it.
+According to the `Sitemaps.org protocol `_, a sitemap file must have no more than 50,000 URLs and must be no larger than 50MiB. In this case, the protocol instructs you to create a sitemap index file called ``sitemap_index.xml`` (instead of ``sitemap.xml``), which references multiple sitemap files. In this case, the created files containing the URLs will be named ``sitemap1.xml``, ``sitemap2.xml``, etc. The referenced files are also generated in the same place as other sitemap files and there will be as many files as necessary to contain the URLs of collections and datasets present in your installation, while respecting the limit of 50,000 URLs per file. Dataverse will automatically detect whether you need to create a single ``sitemap.xml`` file or several files and generate them for you. However, when submitting your sitemap file to Google or other search engines as described below, you must be careful to use the correct file name corresponding to your situation.
 
-One way to submit your sitemap URL to Google is by using their "Search Console" (https://search.google.com/search-console). In order to use the console, you will need to authenticate yourself as the owner of your Dataverse site. Various authentication methods are provided; but if you are already using Google Analytics, the easiest way is to use that account. Make sure you are logged in on Google with the account that has the edit permission on your Google Analytics property; go to the search console and enter the root URL of your Dataverse installation, then choose Google Analytics as the authentication method. Once logged in, click on "Sitemaps" in the menu on the left. (todo: add a screenshot?) Consult `Google's "submit a sitemap" instructions`_ for more information; and/or similar instructions for other search engines.
+If you have over 50,000 items, a sitemap index file will be generated in the following location (unless you have customized your installation directory for Payara):
 
-.. _Google's "submit a sitemap" instructions: https://support.google.com/webmasters/answer/183668
+``/usr/local/payara6/glassfish/domains/domain1/docroot/sitemap/sitemap_index.xml``
 
-Sitemap index file
-##################
+Once the sitemap has been generated in the location above, it will be served at ``/sitemap_index.xml`` like this: https://demo.dataverse.org/sitemap_index.xml
 
-According to `Sitemaps.org protocol `_, a sitemap file must have no more than 50,000 URLs and must be no larger than 50MiB. In this case, the protocol instructs you to create a sitemap index file called ``sitemap_index.xml`` (instead of ``sitemap.xml``), which references multiples sitemap files. In this case, the created files containing the URLs will be named ``sitemap1.xml``, ``sitemap2.xml``, etc. This referenced files are also generated in the same place as other sitemap files and there will be as many files as necessary to contain the URLs of dataverses and datasets presents your installation, while respecting the limit of 50,000 URLs per file. Dataverse will automatically detect whether you need to create a single ``sitemap.xml`` file, or several files. However, you must be careful to use the correct file name corresponding on your situation.
+Submitting Your Sitemap to Search Engines
+#########################################
 
-If there are more than 50,000 dataverses and datasets, the sitemap file created or updated will default to the location:
+Some search engines will be able to find your sitemap file at ``/sitemap.xml`` or ``sitemap_index.xml``, but others, **including Google**, need to be **specifically instructed** to retrieve it.
 
-``/usr/local/payara6/glassfish/domains/domain1/docroot/sitemap/sitemap_index.xml``
+One way to submit your sitemap URL to Google is by using their "Search Console" (https://search.google.com/search-console). In order to use the console, you will need to authenticate yourself as the owner of your Dataverse site. Various authentication methods are provided; but if you are already using Google Analytics, the easiest way is to use that account. Make sure you are logged in on Google with the account that has the edit permission on your Google Analytics property; go to the Search Console and enter the root URL of your Dataverse installation, then choose Google Analytics as the authentication method. Once logged in, click on "Sitemaps" in the menu on the left. Consult `Google's "submit a sitemap" instructions`_ for more information.
 
-Moreover, it can also be accessed at ``/sitemap/sitemap_index.xml`` or ``/sitemap_index.xml``. In case of "Google Search Console" is used to submit the sitemap file, one of the previous URLs have to be used with the ``sitemap_index.xml`` file name.
+.. _Google's "submit a sitemap" instructions: https://support.google.com/webmasters/answer/183668
 
 Putting Your Dataverse Installation on the Map at dataverse.org
 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

From 7c6d101c8bcbdf34fda45514491cdd7cc010441f Mon Sep 17 00:00:00 2001
From: Philip Durbin
Date: Wed, 24 Apr 2024 11:25:26 -0400
Subject: [PATCH 4/4] various sitemap doc tweaks #8936

---
 doc/sphinx-guides/source/installation/config.rst | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/doc/sphinx-guides/source/installation/config.rst b/doc/sphinx-guides/source/installation/config.rst
index e4ff65f059e..73d9afb0141 100644
--- a/doc/sphinx-guides/source/installation/config.rst
+++ b/doc/sphinx-guides/source/installation/config.rst
@@ -2077,7 +2077,7 @@ Once the sitemap has been generated in the location above, it will be served at
 Multiple Sitemap Files (Sitemap Index File)
 ###########################################
 
-According to the `Sitemaps.org protocol `_, a sitemap file must have no more than 50,000 URLs and must be no larger than 50MiB. In this case, the protocol instructs you to create a sitemap index file called ``sitemap_index.xml`` (instead of ``sitemap.xml``), which references multiple sitemap files. In this case, the created files containing the URLs will be named ``sitemap1.xml``, ``sitemap2.xml``, etc. The referenced files are also generated in the same place as other sitemap files and there will be as many files as necessary to contain the URLs of collections and datasets present in your installation, while respecting the limit of 50,000 URLs per file. Dataverse will automatically detect whether you need to create a single ``sitemap.xml`` file or several files and generate them for you. However, when submitting your sitemap file to Google or other search engines as described below, you must be careful to use the correct file name corresponding to your situation.
+According to the `Sitemaps.org protocol `_, a sitemap file must have no more than 50,000 URLs and must be no larger than 50MiB. In this case, the protocol instructs you to create a sitemap index file called ``sitemap_index.xml`` (instead of ``sitemap.xml``), which references multiple sitemap files named ``sitemap1.xml``, ``sitemap2.xml``, etc. These referenced files are also generated in the same place as other sitemap files (``domain1/docroot/sitemap``) and there will be as many files as necessary to contain the URLs of collections and datasets present in your installation, while respecting the limit of 50,000 URLs per file.
 
 If you have over 50,000 items, a sitemap index file will be generated in the following location (unless you have customized your installation directory for Payara):
 
@@ -2085,10 +2085,14 @@ If you have over 50,000 items, a sitemap index file will be generated in the fol
 Once the sitemap has been generated in the location above, it will be served at ``/sitemap_index.xml`` like this: https://demo.dataverse.org/sitemap_index.xml
 
+Note that the sitemap is also available at (for example) https://demo.dataverse.org/sitemap/sitemap_index.xml and in that ``sitemap`` directory you will find the files it references such as ``sitemap1.xml``, ``sitemap2.xml``, etc.
+
 Submitting Your Sitemap to Search Engines
 #########################################
 
-Some search engines will be able to find your sitemap file at ``/sitemap.xml`` or ``sitemap_index.xml``, but others, **including Google**, need to be **specifically instructed** to retrieve it.
+Some search engines will be able to find your sitemap file at ``/sitemap.xml`` or ``/sitemap_index.xml``, but others, **including Google**, need to be **specifically instructed** to retrieve it.
+
+As described above, Dataverse will automatically detect whether you need to create a single sitemap file or several files and generate them for you. However, when submitting your sitemap file to Google or other search engines, you must be careful to supply the correct file name (``sitemap.xml`` or ``sitemap_index.xml``) depending on your situation.
 
 One way to submit your sitemap URL to Google is by using their "Search Console" (https://search.google.com/search-console). In order to use the console, you will need to authenticate yourself as the owner of your Dataverse site. Various authentication methods are provided; but if you are already using Google Analytics, the easiest way is to use that account. Make sure you are logged in on Google with the account that has the edit permission on your Google Analytics property; go to the Search Console and enter the root URL of your Dataverse installation, then choose Google Analytics as the authentication method. Once logged in, click on "Sitemaps" in the menu on the left. Consult `Google's "submit a sitemap" instructions`_ for more information.