phpBB

Development Discussion Board

phpBB's testing ground of bleeding edge code
Sitemap, based on 3.0.6-RC3 feed.php

Discussion of general topics related to the new version and its place in the world. Don't discuss new features, report bugs, ask for support, et cetera. Don't use this to spam for other boards or attack those boards!

Postby Paul.J.Murphy » Thu Oct 29, 2009 5:49 pm

I had a thought - how difficult would it be to adapt the new feed.php from 3.0.6-RC3 to generate Google/Yahoo/M$ Sitemaps in a fairly scalable manner? The answer is: not that difficult, tbh ;-)

It wants to be a separate script from feed.php - the similarities are there at a high level, but the protocols are sufficiently different that having one script do both would gain nothing and cause a lot of debugging headaches. So, here are my thoughts for a complete /sitemap.php, with a relatively untested ACP component. I'm fairly confident that the basic generator works well and should scale up to fairly large sites. It splits the output over many individual sitemaps, plus a sitemap index, so that only 1 URL needs to be handed to the crawler. For small sites, it's possible to generate just a single topics sitemap.
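To illustrate the index-plus-many-sitemaps idea, here is a minimal sketch of how a sitemap index could be assembled. It is not taken from the attached script: the function name build_sitemap_index(), the shape of the $forums array (forum id => last post time) and the example board URL are all assumptions made for the illustration.

```php
<?php
/**
 * Build a sitemap index listing the forums sitemap plus one sitemap per forum.
 * Hypothetical helper: $forums maps forum_id => unix time of the last post.
 */
function build_sitemap_index($board_url, array $forums)
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

    // One entry for the forum list, then one entry per forum.
    $entries = array(array('loc' => $board_url . '/sitemap.php?mode=forums', 'time' => 0));
    foreach ($forums as $forum_id => $last_post_time) {
        $entries[] = array(
            'loc'  => $board_url . '/sitemap.php?f=' . (int) $forum_id,
            'time' => (int) $last_post_time,
        );
    }

    foreach ($entries as $entry) {
        $xml .= "  <sitemap>\n";
        // Escape the URL so that characters such as & are valid inside XML.
        $xml .= '    <loc>' . htmlspecialchars($entry['loc'], ENT_QUOTES, 'UTF-8') . "</loc>\n";
        if ($entry['time'] > 0) {
            // <lastmod> uses W3C datetime format, here always in UTC.
            $xml .= '    <lastmod>' . gmdate('Y-m-d\TH:i:s\Z', $entry['time']) . "</lastmod>\n";
        }
        $xml .= "  </sitemap>\n";
    }

    return $xml . "</sitemapindex>\n";
}

// Example: a board with two forums.
echo build_sitemap_index('http://example.com', array(2 => 1256838540, 5 => 1256774400));
```

The crawler only ever needs the index URL; each <loc> in it points back at /sitemap.php with a different query string.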

Some tuning of the _limit parameters will be needed on large sites to find the optimal level which keeps each sitemap below the 50,000 URL and 10MB limits. The right value is not really predictable, as the number of URLs per forum/topic varies with the pagination configured on the board: the script generates the full series of &start=n URLs. The aim is to reliably produce a full dump of all the canonical forum & topic URLs on the board, with high accuracy on the <lastmod> tags, so that bots can rapidly pick up new content while still having a fully comprehensive map of the board.
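To show why the URL count depends on pagination, here is a rough sketch of expanding one topic into its &start=n pages and then splitting a URL list into files that stay under a limit. The helper names topic_page_urls() and split_for_sitemaps() are made up for this example and are not from the attached script; when written into the XML, the & in each URL would still need escaping as &amp;.

```php
<?php
/**
 * List every page URL of one topic.
 * Hypothetical helper: $replies is the reply count, $per_page the board's posts-per-page setting.
 */
function topic_page_urls($board_url, $forum_id, $topic_id, $replies, $per_page)
{
    $per_page = max(1, (int) $per_page); // guard against a zero step
    $base = $board_url . '/viewtopic.php?f=' . (int) $forum_id . '&t=' . (int) $topic_id;
    $urls = array($base);

    // The first post plus the replies make up the topic's post count.
    $total_posts = (int) $replies + 1;
    for ($start = $per_page; $start < $total_posts; $start += $per_page) {
        $urls[] = $base . '&start=' . $start;
    }

    return $urls;
}

/**
 * Split a URL list into chunks of at most $limit entries, one chunk per sitemap file.
 */
function split_for_sitemaps(array $urls, $limit = 50000)
{
    return array_chunk($urls, max(1, (int) $limit));
}

// A topic with 25 replies at 10 posts per page spans three pages.
print_r(topic_page_urls('http://example.com', 3, 42, 25, 10));
```

The same topic produces a different number of URLs at 10, 15 or 25 posts per page, which is why a limit that works on one board may overflow on another.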

One constant to add in includes/constants.php:
Code:
define('FORUM_OPTION_SITEMAP_EXCLUDE', 10);
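The post doesn't show how the constant is consumed. Going by the FORUM_OPTION_ prefix, one plausible reading is that 10 is a bit position within a forum's forum_options bitfield, so that individual forums can be flagged as excluded from the sitemap. The sketch below tests such a bit; forum_option_is_set() is a hypothetical stand-in, not a function from the attached script.

```php
<?php
// Same value as in the post; guarded so the sketch can run on its own.
if (!defined('FORUM_OPTION_SITEMAP_EXCLUDE')) {
    define('FORUM_OPTION_SITEMAP_EXCLUDE', 10);
}

/**
 * Return true if the given bit position is set in a forum_options bitfield.
 * Hypothetical helper, assuming the constant is a bit position.
 */
function forum_option_is_set($bit, $forum_options)
{
    return (((int) $forum_options) & (1 << (int) $bit)) !== 0;
}

// A forum whose options have bit 10 set (value 1024) would be skipped.
$forum_options = 1 << FORUM_OPTION_SITEMAP_EXCLUDE;
var_dump(forum_option_is_set(FORUM_OPTION_SITEMAP_EXCLUDE, $forum_options));
```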


A reasonable set of defaults:
Code:
INSERT INTO phpbb_config (config_name, config_value, is_dynamic) VALUES
    ('sitemap_cache_time', '3600', 0),
    ('sitemap_enable', '1', 0),
    ('sitemap_feeds', '0', 0),
    ('sitemap_forum', '1', 0),
    ('sitemap_limit', '10000', 0),
    ('sitemap_overall_forums', '1', 0),
    ('sitemap_overall_forums_limit', '10000', 0),
    ('sitemap_overall_topics', '0', 0),
    ('sitemap_overall_topics_limit', '10000', 0),
    ('sitemap_sort_days', '0', 0);


Once it's dropped in place (ACP can be skipped, if desired - minimum is /sitemap.php, the constant, and the SQL), you've got the following URLs:
  • /sitemap.php - Sitemap index - give this to search engines for full auto mode
  • /sitemap.php?mode=index - alias for /sitemap.php
  • /sitemap.php?mode=forums - List of /index.php and all the /viewforum.php?f=n URLs
  • /sitemap.php?mode=topics - List of all the /viewtopic.php?t=n URLs
  • /sitemap.php?f=n - List of all the /viewtopic.php?f=n&t=m URLs

One cache file is generated per URL, so there's a maximum of 3+n cache files, where n is the number of forums.
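The URL scheme and the 3+n cache file count can be sketched as a small dispatcher that maps the query string to a mode and a cache key. This is only an illustration: resolve_sitemap_request(), the cache key names, and the fallback to the index for an unknown mode are assumptions, not the attached script's behaviour.

```php
<?php
/**
 * Map query parameters to a sitemap mode and a cache key.
 * Hypothetical helper; the cache key names are assumptions.
 */
function resolve_sitemap_request(array $query)
{
    // ?f=n selects the per-forum topics sitemap.
    $forum_id = isset($query['f']) ? (int) $query['f'] : 0;
    if ($forum_id > 0) {
        return array('mode' => 'forum', 'forum_id' => $forum_id, 'cache_key' => '_sitemap_forum_' . $forum_id);
    }

    // No mode, or mode=index, both give the sitemap index.
    $mode = isset($query['mode']) ? (string) $query['mode'] : 'index';
    if (!in_array($mode, array('index', 'forums', 'topics'), true)) {
        $mode = 'index'; // assumed fallback for unknown modes
    }

    return array('mode' => $mode, 'forum_id' => 0, 'cache_key' => '_sitemap_' . $mode);
}

// /sitemap.php and /sitemap.php?mode=index share one cache entry.
print_r(resolve_sitemap_request(array()));
print_r(resolve_sitemap_request(array('f' => '7')));
```

Because the alias shares a key with the bare URL, the distinct keys are index, forums, topics, and one per forum, which gives the 3+n total.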

I'm posting this here, rather than the mods development forum, as I think it's the sort of thing which deserves to live alongside /feed.php in the main distribution.
Attachments
sitemap.tar.gz
(9.13 KiB)