getting there from here

Forum rules
Discuss features as they are added to the new version. Give us your feedback. Don't post bug reports, feature requests, support questions or suggestions here. Feature requests are closed.
alreadyinuse
Registered User
Posts: 6
Joined: Sat May 03, 2008 10:12 am

getting there from here

Post by alreadyinuse »

I tried to search this forum for the string "all forums", but the search function declined my request on the grounds that these words are too common. That seems like a limitation that shouldn't exist, since I can't imagine going through life unable to search databases for strings composed of common words. Basically, this is a catch-22: the more people add value to a database by discussing a given topic, the more the value of the database is diminished, because the database becomes more likely to refuse to return hits on that topic.

Eelke
Registered User
Posts: 606
Joined: Thu Dec 20, 2001 8:00 am
Location: Bussum, NL

Re: getting there from here

Post by Eelke »

The common word filter is designed to filter out words that do not add any value, such as "the", "it", etc. If you find the filter is filtering out terms that it should not, one could argue your threshold value is too low. One could also argue that the threshold of common words should not be implemented as an absolute number, but as a percentage of all words on the board (actually, I'm not sure how it is implemented, exactly).

With that said, the filtering is not ideal, because even if a word is common and by itself would not hold any information, a combination of otherwise common terms might carry information after all (as you demonstrated). This is a technical limitation and I don't see anything that can easily be done about it, short of asking Google to rewrite the search routines in phpBB :) One way would be to not filter at all, but then your search indexes would soon grind to a halt under all the "the"s, "it"s and "and"s on your board. Not a good idea.
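
To make the difference between an absolute and a percentage threshold concrete, here is a minimal Python sketch. It is not phpBB's actual search code: the function name `common_words`, the four sample posts and both threshold values are made up for illustration.

```python
from collections import Counter

def common_words(posts, max_count=None, max_fraction=None):
    """Return the words treated as 'too common' (hypothetical helper).

    max_count:    absolute threshold, a word is common if it appears
                  in more than this many posts.
    max_fraction: relative threshold, a word is common if it appears
                  in more than this fraction of all posts.
    """
    # Count the number of posts each word occurs in, not total occurrences.
    doc_freq = Counter()
    for post in posts:
        doc_freq.update(set(post.lower().split()))

    common = set()
    for word, n in doc_freq.items():
        if max_count is not None and n > max_count:
            common.add(word)
        if max_fraction is not None and n / len(posts) > max_fraction:
            common.add(word)
    return common

posts = [
    "search all forums for the answer",
    "all the forums are down",
    "the search box is in all forums",
    "where is the faq",
]

absolute = common_words(posts, max_count=2)       # 'all', 'forums', 'the'
relative = common_words(posts, max_fraction=0.8)  # only 'the'

# With the absolute threshold, the query "all forums" loses every term.
remaining = [w for w in "all forums".split() if w not in absolute]
```

On this toy board the absolute threshold throws away both words of "all forums", while the percentage threshold only drops "the". Either way the filter looks at single words, so it cannot tell that a combination of common words is meaningful.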

code reader
Registered User
Posts: 653
Joined: Wed Sep 21, 2005 3:01 pm

Re: getting there from here

Post by code reader »

with mysql fulltext, the threshold is 50% of the records (in this case, record=post), so, if a word appears in less than 50% of the posts, it *will* be indexed. i guess that a word that appears in more than 50% of the posts is not really useful when searching.
there are 2 other limiting factors with mysql fulltext:
  • it only indexes words at least N characters long, where N is by default 4 (i.e. words of 3 or fewer characters are not indexed). IMO, this is a bad choice for most boards. luckily, this value is configurable. unfortunately, it needs to be configured for the whole machine, i.e., not per table/database. in a shared hosting environment this is very bad indeed, but if you control your server you are good.
    (i may be wrong about the last part, but i couldn't figure how to set this value per database or per table. if anyone knows better, i'll be happy to learn)
    an amusing side effect of the 50% threshold is that when you first install your board, and you have fewer than three posts, every word appears in at least 50% of the posts, and the search doesn't find anything. as soon as you add the third post, you can find words which appear in one (but not two) of them.
  • in addition, mysql supports a "stopwords" table that allows one to manually tweak the ignore list. the one that comes with mysql by default (eg, for mysql 6 see : http://dev.mysql.com/doc/refman/6.0/en/ ... words.html ) is, imo, useless and silly, and contains words like "beforehand" and "howbeit" -- as you can see, extremely common words. again, luckily, this table can be overridden, either by an empty list or by a list of your choosing.
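
the three filters described above can be modelled with a small Python sketch. this is a toy model of the behaviour as described in this post, not mysql's implementation: `searchable_words` and its parameters are hypothetical names, the tokenizer is a plain whitespace split, and "too common" is assumed to mean "appears in 50% or more of the posts".

```python
def searchable_words(posts, min_word_len=4, stopwords=frozenset(), threshold=0.5):
    """Return the words a search could still match (toy model, not mysql)."""
    # Number of posts each word appears in.
    doc_freq = {}
    for post in posts:
        for word in set(post.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    kept = set()
    for word, n in doc_freq.items():
        if len(word) < min_word_len:
            continue  # shorter than the minimum word length
        if word in stopwords:
            continue  # on the manual ignore list
        if n / len(posts) >= threshold:
            continue  # assumed rule: in half or more of the posts
        kept.add(word)
    return kept

# Fresh board with two posts: every word is in at least 50% of them.
two_posts = ["search finds nothing", "board feels empty"]
found_with_two = searchable_words(two_posts)

# Third post added: words in one post are found, 'search' (in two) is not.
three_posts = two_posts + ["search works again"]
found_with_three = searchable_words(three_posts)

# Effect of lowering the minimum word length from 4 to 2.
short_posts = ["php faq here", "mod list", "acp help"]
default_len = searchable_words(short_posts)
short_len = searchable_words(short_posts, min_word_len=2)
```

with two posts nothing is searchable, and with three posts everything except "search" is. in the last example the default minimum length hides "php", "faq", "mod" and "acp", which is exactly the kind of word people look for on a board.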
when i operated a BBS, i used mysql fulltext (there is a phpbb 2.0 MOD), with ft_min_word_len set to 2, and an empty stopwords list.
hope it helps.
