link rel="canonical"

Discussion of general topics related to the new version and its place in the world. Don't discuss new features, report bugs, ask for support, et cetera. Don't use this to spam for other boards or attack those boards!
Paul.J.Murphy
Registered User
Posts: 9
Joined: Mon Jun 16, 2008 4:49 am
Location: Edinburgh, Scotland, UK

link rel="canonical"

Post by Paul.J.Murphy »

A suggestion for a simple enhancement to improve crawler/bot friendliness, based on:
http://googlewebmastercentral.blogspot. ... nical.html

Add just after the <title>...</title> in styles/prosilver/template/overall_header.html:

Code:

<!-- IF U_CANONICAL --><link rel="canonical" href="{U_CANONICAL}" /><!-- ENDIF -->

Add to assign_vars() in index.php:

Code:

'U_CANONICAL' => "{$phpbb_root_path}index.$phpEx",

Add to assign_vars() in viewforum.php:

Code:

'U_CANONICAL' => "{$phpbb_root_path}viewforum.$phpEx?f=$forum_id" . (($start) ? "&start=$start" : ''),

Add to assign_vars() in viewtopic.php:

Code:

'U_CANONICAL' => "{$phpbb_root_path}viewtopic.$phpEx?" . (($topic_data['topic_type'] == POST_GLOBAL) ? '' : "f=$forum_id&") . "t=$topic_id" . (($start) ? "&start=$start" : ''),

I'd also suggest changing the construction of U_VIEW_FORUM and U_VIEW_TOPIC to similarly only include non-zero start parameters, i.e.:

Code:

(($start) ? "&start=$start" : ''))

This should help to eliminate duplicates in search results, improve bot efficiency, and generally have no downsides that I can see. It should have zero impact on board users.
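
For illustration, the start handling described above could be wrapped in a small helper. This is only a sketch: build_canonical_url() is a made-up name, not a phpBB function, and it assumes the result is printed into an HTML href attribute, hence the &amp; separator.

```php
<?php
// Hypothetical helper (not part of phpBB): builds a canonical URL for
// viewforum/viewtopic and leaves out "start" when it is zero.
function build_canonical_url($script, array $params, $start = 0)
{
    // Drop empty values so optional parameters (e.g. f for global topics) vanish.
    $params = array_filter($params, function ($v) {
        return $v !== null && $v !== '';
    });

    if ((int) $start > 0) {
        $params['start'] = (int) $start;
    }

    // '&amp;' because the result is assumed to land inside an HTML attribute.
    $query = http_build_query($params, '', '&amp;');

    return $script . ($query !== '' ? '?' . $query : '');
}

echo build_canonical_url('./viewforum.php', array('f' => 1), 0), "\n";
echo build_canonical_url('./viewtopic.php', array('f' => 1, 't' => 42), 25), "\n";
```

With this shape, viewforum.php?f=1 and viewforum.php?f=1&start=0 both map to the same canonical value.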

Edit 29 Oct: Updated viewtopic URL to exclude forum_id for globals.
Last edited by Paul.J.Murphy on Thu Oct 29, 2009 10:55 pm, edited 2 times in total.

Roberdin
Registered User
Posts: 1546
Joined: Wed Apr 09, 2003 8:44 pm
Location: London, United Kingdom

Re: link rel="canonical"

Post by Roberdin »

Bots should only see one URI to each page already, right?
Rob

Paul.J.Murphy
Registered User
Posts: 9
Joined: Mon Jun 16, 2008 4:49 am
Location: Edinburgh, Scotland, UK

Re: link rel="canonical"

Post by Paul.J.Murphy »

Roberdin wrote: Bots should only see one URI to each page already, right?
That's true up to a point, but there are still a few cases, such as viewforum.php?f=1 vs. viewforum.php?f=1&start=0, which Googlebot seems to be able to find. There's also the case of external links, where a person has taken a non-bot URL from within phpBB and linked to it somewhere.

So, as it stands today (3.0.6-RC3), a correctly recognised bot crawling the site in the absence of external information will see relatively few duplicates (though still quite a few predictable cases), but one crawling with knowledge of the web as a whole can potentially find an infinite number of URLs.

Paul.J.Murphy
Registered User
Posts: 9
Joined: Mon Jun 16, 2008 4:49 am
Location: Edinburgh, Scotland, UK

Re: link rel="canonical"

Post by Paul.J.Murphy »

I've been mulling this over a little more, and I'm even more convinced that there's essentially no downside and it's a Good_Idea™. :geek:

Looking at this has me wondering whether it would be better to eliminate the forum_id param (f) for viewtopic.php completely. It does seem rather redundant, although I could be missing some killer feature that it enables (everything seems basically OK when you call viewtopic.php with just a combination of t and p). Edit: Having thought about this point a bit more, I withdraw it: the forum ID does have some use in terms of logfile/traffic analysis, targeting adverts based on the URL, etc.

I'm also in two minds about whether the canonical URLs should be presented fully qualified (http://…) or relative (./…). I think either is valid under the relevant RFCs and according to what Google have posted about it.
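
As a rough sketch of the fully qualified option: absolute_canonical() below is a hypothetical helper, and http://example.com/forum/ is a placeholder base rather than anything phpBB provides. It only strips a leading "./" or "/" and does not resolve "../" segments.

```php
<?php
// Hypothetical helper: turns a board-relative path such as
// "./viewtopic.php?t=42" into a fully qualified URL under one fixed base.
function absolute_canonical($base_url, $relative)
{
    // Strip any leading "./" or "/" from the relative part.
    $relative = preg_replace('#^(\./|/)+#', '', $relative);

    return rtrim($base_url, '/') . '/' . $relative;
}

// Placeholder base URL, assumed for the example only.
echo absolute_canonical('http://example.com/forum/', './viewtopic.php?f=1&amp;t=42'), "\n";
```

A fixed base also means the canonical link names a single hostname no matter which hostname served the page.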

It would, of course, be better to simply eliminate duplicates entirely, to the extent of issuing 301 redirects for any non-canonical URLs which are not sort results, but that would be much harder than an interim quick enhancement of adding canonical links. As long as there are situations where the sid can end up in the URL, the duplicate space remains infinite, so all that can really be done is to minimise it and mitigate the remaining cases with canonical links, I guess.
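
To show roughly what the redirect route would involve, here is a sketch of the normalisation step only. canonical_query_or_null() is a hypothetical name, and the sketch assumes sid and a zero start are the only things to strip. A real implementation would presumably also have to leave alone any user whose session depends on the sid parameter.

```php
<?php
// Hypothetical helper: returns the normalised query string for a request,
// or null when the request is already in canonical form.
function canonical_query_or_null(array $query)
{
    $clean = $query;

    // The session id never belongs in a canonical URL.
    unset($clean['sid']);

    // start=0 is the same page as no start parameter at all.
    if (isset($clean['start']) && (int) $clean['start'] === 0) {
        unset($clean['start']);
    }

    return ($clean == $query) ? null : http_build_query($clean);
}

$target = canonical_query_or_null(array('f' => '1', 'start' => '0', 'sid' => 'abc'));
if ($target !== null) {
    // In a real page this would be sent before any output, e.g.:
    // header('Location: ./viewforum.php?' . $target, true, 301);
    echo $target, "\n";
}
```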

It also seems that, as per usual, there's another school of thought on the "correct" way to present the canonical URL - the Content-Location HTTP response header. Google, Yahoo, and MS seem to be backing rel="canonical" for now, so it's clearly the more important one at present, but it probably couldn't hurt to generate both.
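
Generating the header as well would be a small addition. The sketch below only formats the header line. content_location_header() is a hypothetical helper, and it assumes the URL was already built with &amp; for the template, so it undoes that escaping.

```php
<?php
// Hypothetical helper: formats the header line; it does not send anything.
function content_location_header($url)
{
    // Remove CR/LF so the value cannot inject extra headers.
    $url = str_replace(array("\r", "\n"), '', $url);

    // Header values are not HTML, so undo the &amp; escaping used in the template.
    return 'Content-Location: ' . str_replace('&amp;', '&', $url);
}

// A page would pass the returned string to header() before any output.
echo content_location_header('./viewtopic.php?f=1&amp;t=42'), "\n";
```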

Just a bunch of thinking out loud on it, tbh. I'd welcome any counter viewpoints for why this is either not needed, not feasible, or causes some other issue.


Here's a nice example of duplicates making it into Google:
http://www.google.com/search?q=site:are ... K&filter=0

I deliberately chose an older topic to ensure that Googlebot has had plenty of time to digest it properly. I see 4 viewtopic URLs and 2 viewforum URLs there, where there ideally would be only one of each.

This topic has already been picked up twice (after only 4 days), once with "start=0", and once without:
http://www.google.com/search?q=site:are ... K&filter=0

Ellimist
Registered User
Posts: 3
Joined: Sun Jun 29, 2008 9:39 am

Re: link rel="canonical"

Post by Ellimist »

I have created a mod that does exactly this: Canonical URL for phpBB

Looking at your code, it seems that my mod is almost identical. Coincidentally, the variable name is the same, though it was changed to be in accordance with phpBB nomenclature.

whaturmuva
Registered User
Posts: 9
Joined: Fri Feb 06, 2009 12:39 pm

Re: link rel="canonical"

Post by whaturmuva »

I'm sorry for the horrific necrobump. But I had a question regarding this code.

I currently have Google indexing my site in two ways: using my host's URL (mysite.hosturl.com) and my own URL (mysite.com).

I was wondering if the {$phpbb_root_path} part of this code could be set to mysite.com, so that when the bots are indexing it as mysite.hosturl.com, the tag would encourage them to come in on mysite.com instead of using the host's domain.

I've been looking for something to do this for a while (as due to server limitations I can't do a proper 301 redirect)... and this might just be perfect.

So, for instance, instead of using

Code:

'U_CANONICAL' => "{$phpbb_root_path}index.$phpEx",

could I put the code in as

Code:

'U_CANONICAL' => "http://mysite.com/index.$phpEx",

?

Thanks for the help :)

spello
Registered User
Posts: 26
Joined: Fri Aug 31, 2012 12:13 pm

Re: link rel="canonical"

Post by spello »

You should also add NEXT and PREV tags ({PREVIOUS_PAGE} and {NEXT_PAGE} already exist ;))

template/overall_header.html

Code:

<!-- IF U_CANONICAL --><link rel="canonical" href="{U_CANONICAL}" /><!-- ENDIF -->

<!-- IF PREVIOUS_PAGE --><link rel="prev" href="{PREVIOUS_PAGE}" /><!-- ENDIF -->
<!-- IF NEXT_PAGE --><link rel="next" href="{NEXT_PAGE}" /><!-- ENDIF -->

Moreover, you can add rel= to anchors in "bodies" (viewtopic_body, etc):

Code:

<!-- IF PREVIOUS_PAGE --><a href="{PREVIOUS_PAGE}" rel="prev" class="left-box {S_CONTENT_FLOW_BEGIN}">{L_PREVIOUS}</a><!-- ENDIF -->
<!-- IF NEXT_PAGE --><a href="{NEXT_PAGE}" rel="next" class="right-box {S_CONTENT_FLOW_END}">{L_NEXT}</a><!-- ENDIF -->
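
For anyone wondering which offsets the prev/next links should point at, here is a rough sketch of the arithmetic. prev_next_offsets() is a hypothetical helper, not how phpBB fills {PREVIOUS_PAGE} and {NEXT_PAGE}. It assumes start, the page size and the total item count are already known.

```php
<?php
// Hypothetical helper: computes the start offsets for rel="prev" and
// rel="next". A null value means that link should not be emitted.
function prev_next_offsets($start, $per_page, $total)
{
    // No previous page on the first page; never go below offset 0.
    $prev = ($start > 0) ? max(0, $start - $per_page) : null;

    // No next page once the current page reaches the end of the list.
    $next = ($start + $per_page < $total) ? $start + $per_page : null;

    return array('prev' => $prev, 'next' => $next);
}

print_r(prev_next_offsets(25, 25, 60));
```

A prev offset of 0 would normally be written without any start parameter, so it matches the canonical form of the first page.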
