Search Architecture

Dog Cow
Registered User
Posts: 271
Joined: Wed May 25, 2005 2:14 pm

Search Architecture

Post by Dog Cow »

We need exact phrase searching, which means storing word positions in the index, and relevance ranking too. All of it should be built into the phpBB Native search index so it will work on any DBMS.
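
To make the word-position idea concrete, here is a minimal in-memory sketch of a positional inverted index with exact phrase matching. It is not phpBB code: the class name, method names, and tokenizer are all hypothetical, and a real native index would keep the (word, post, position) triples in database tables rather than in a Python dict.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PositionalIndex:
    """Toy positional inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        # word -> {post_id -> [positions of the word within that post]}
        self.postings = defaultdict(dict)

    def add(self, post_id, text):
        for position, word in enumerate(tokenize(text)):
            self.postings[word].setdefault(post_id, []).append(position)

    def phrase_search(self, phrase):
        """Return sorted IDs of posts that contain the phrase's words consecutively."""
        words = tokenize(phrase)
        if not words:
            return []
        # A candidate post must contain every word of the phrase somewhere.
        candidates = set(self.postings.get(words[0], {}))
        for word in words[1:]:
            candidates &= set(self.postings.get(word, {}))
        hits = []
        for post_id in candidates:
            later = [set(self.postings[w][post_id]) for w in words[1:]]
            # The phrase matches if, for some start position of the first word,
            # each following word sits exactly one position further along.
            for start in self.postings[words[0]][post_id]:
                if all(start + offset + 1 in positions
                       for offset, positions in enumerate(later)):
                    hits.append(post_id)
                    break
        return sorted(hits)


index = PositionalIndex()
index.add(1, "Exact phrase search is good enough for me")
index.add(2, "Search for the exact wording of a phrase")
print(index.phrase_search("exact phrase"))  # only post 1 has the words adjacent
```

Both posts contain "exact" and "phrase", but only the stored positions show that the words are adjacent in post 1. That is the information a plain word-to-post index lacks.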

More and more people expect the search on a site to function like Yahoo or Google and get annoyed when it doesn't.

naderman
Consultant
Posts: 1727
Joined: Sun Jan 11, 2004 2:11 am
Location: Berlin, Germany

Re: Search Architecture

Post by naderman »

We can't quite reach the same quality of results as yahoo or google. Using something like Zend Search Lucene as the default search engine might be a viable option. We would have to benchmark what works better.

Dog Cow
Registered User
Posts: 271
Joined: Wed May 25, 2005 2:14 pm

Re: Search Architecture

Post by Dog Cow »

naderman wrote:We can't quite reach the same quality of results as yahoo or google
Yes we can! Don't limit yourself just because Yahoo or Google won't tell us how they work!

naderman
Consultant
Posts: 1727
Joined: Sun Jan 11, 2004 2:11 am
Location: Berlin, Germany

Re: Search Architecture

Post by naderman »

They work on massive amounts of data that allow them to learn language and what people want/think. A forum just doesn't quite contain that much information, that's really the only limitation I'm talking about ;-)

Dog Cow
Registered User
Posts: 271
Joined: Wed May 25, 2005 2:14 pm

Re: Search Architecture

Post by Dog Cow »

Exact phrase search is good enough for me. I'm trying to figure out how to build an index for 7 million documents. I'm not having much luck on comp.databases.mysql, but did find internals-index-format.txt in the Sphinx source doc/ directory. Problem is, not too many people want (or are able) to tell how it's done.

But I'll figure it out in the end. :P

code reader
Registered User
Posts: 653
Joined: Wed Sep 21, 2005 3:01 pm

Re: Search Architecture

Post by code reader »

some thoughts about search:

currently phpBB supports a pluggable search engine.
however, there is a separation between the search front-end (aka UI) and the search backend, which really hurts the plugin "paradigm".
for instance, if a certain backend wants to support additional features, it can't, at least not without also becoming a MOD.
OTOH, if a search backend wants a waiver from supporting some capability which is published by the UI, it can't, at least not without it being a bona fide bug.

if a certain backend wants to use a different syntax for the search input string, it can't (again, without becoming a MOD. the reason is that the syntax is explained to the user outside the scope of the plugin)
and, i believe sorting is handled outside of the plugin, so even if the plugin supplies the results in a certain order (say, relevance), this gets wiped by the envelope (i may be wrong here: long time since i visited this code, and i don't have the time now to check).

one way to go about it is to create a "feature-list" of supported search features, and allow the backend to declare which features are supported.
this way, a single, very feature-rich UI can serve them all, while suppressing display of any feature not supported by the currently running backend.
this will not solve the "please allow me a private syntax" part, but would still be a significant improvement over today's state.
another option is to allow every backend to define its own UI: so there will be a default UI that would work for any backend that does not wish to bring in its own UI, but would still *allow* the backend to define its UI.
the last approach can be taken to the extreme and not even bother to create this default UI, so *each and every* backend will be required to roll its own.
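
the "feature-list" option could look roughly like the sketch below. everything in it is made up for illustration (the flag names, the two backend classes, the control labels); it is not how phpBB's search plugins are actually declared, just one way a backend could advertise capabilities and a single UI could hide whatever is unsupported.

```python
from enum import Flag, auto


class SearchFeature(Flag):
    """Hypothetical feature flags a search backend could advertise."""
    KEYWORDS = auto()
    PHRASE = auto()
    TITLE_ONLY = auto()
    RELEVANCE_SORT = auto()


class NativeBackend:
    # Assumed capabilities, chosen only for the sake of the example.
    features = SearchFeature.KEYWORDS | SearchFeature.TITLE_ONLY


class FulltextBackend:
    # Assumed capabilities, chosen only for the sake of the example.
    features = (SearchFeature.KEYWORDS
                | SearchFeature.PHRASE
                | SearchFeature.RELEVANCE_SORT)


# Every control the feature-rich UI knows how to render,
# keyed by the backend feature it depends on.
UI_CONTROLS = {
    "keyword box": SearchFeature.KEYWORDS,
    "exact phrase box": SearchFeature.PHRASE,
    "titles only checkbox": SearchFeature.TITLE_ONLY,
    "sort by relevance option": SearchFeature.RELEVANCE_SORT,
}


def visible_controls(backend):
    """Return the controls whose required feature the backend declares."""
    return [name for name, needed in UI_CONTROLS.items()
            if needed in backend.features]


print(visible_controls(NativeBackend))
print(visible_controls(FulltextBackend))
```

with this shape, adding a backend-only feature means adding one flag and one control, and a backend that cannot do "titles only" just leaves that flag out instead of shipping a bug.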

long long time ago i created a mysql-fulltext thingy for phpbb2, and i found that if i do not want to support the things which were difficult (e.g. allow searches on "message body only" or "body+title", which i found mostly useless but was not trivial with mysql fulltext), and at the same time wanted to support features like phrase search which were not supported in the vanilla package, i had to change the UI.
btw: not only because of capabilities, but also because i needed to change the syntax of the search string, so i needed to change the explanations.

if it was up to me, i would completely eliminate the "native phpbb" search altogether. i would build a thin layer that links to google using the site:yoursite searchphrase syntax, and *maybe* use google's API to present the results internally.
(this approach has a major disadvantage that it can't index forums with restricted visibility, and it has the advantage that it builds on top of all of google's intelligence).
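
as a rough sketch of that thin layer, the function below only builds the search URL with the site: operator. the function name is hypothetical, the host name in the example is a placeholder, and presenting results internally through an API is not shown.

```python
from urllib.parse import urlencode


def google_site_search_url(site, phrase):
    """Build a Google search URL restricted to one site via the site: operator."""
    query = f"site:{site} {phrase}"
    # urlencode escapes the colon and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(google_site_search_url("example.com", "exact phrase"))
# https://www.google.com/search?q=site%3Aexample.com+exact+phrase
```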

at the same time i would make sure that the door is wide open for add-on writers to supply all kinds of search.


peace.

jwxie
Registered User
Posts: 57
Joined: Mon Jan 23, 2006 3:38 am

Re: Search Architecture

Post by jwxie »

Why don't we first add tags to posts? Wouldn't that make it easier and faster for search to sort out common keywords?
Yahoo / Google learn them in a somewhat similar way, though not exactly, of course.
Most importantly, once we have tags in phpBB 4 by default, we must let users edit the OP's tags!!! Why? Simple. The OP doesn't always have the best tags. I go to Stack Overflow a lot, and people edit my tags when they don't think they fit, and it helps a lot.

If I want to answer python beginner questions, all i need to do is just search python beginner and any post with those two tags will be sorted out.
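
A small sketch of what that lookup could look like, assuming topics carry a set of tags that any user may replace. The class and method names are hypothetical, and nothing like this exists in phpBB today.

```python
from collections import defaultdict


class TagIndex:
    """Toy tag index: maps tags to topics and supports re-tagging (hypothetical)."""

    def __init__(self):
        self.topics_by_tag = defaultdict(set)
        self.tags_by_topic = defaultdict(set)

    def set_tags(self, topic_id, tags):
        """Replace a topic's tags; this also covers edits made by other users."""
        for old_tag in self.tags_by_topic[topic_id]:
            self.topics_by_tag[old_tag].discard(topic_id)
        new_tags = {tag.strip().lower() for tag in tags if tag.strip()}
        self.tags_by_topic[topic_id] = new_tags
        for tag in new_tags:
            self.topics_by_tag[tag].add(topic_id)

    def find(self, *tags):
        """Return sorted IDs of topics that carry every one of the given tags."""
        wanted = [tag.strip().lower() for tag in tags]
        if not wanted:
            return []
        result = set(self.topics_by_tag.get(wanted[0], set()))
        for tag in wanted[1:]:
            result &= self.topics_by_tag.get(tag, set())
        return sorted(result)


tags = TagIndex()
tags.set_tags(1, ["python", "beginner"])
tags.set_tags(2, ["python", "django"])
print(tags.find("python", "beginner"))  # only topic 1 has both tags
```

Finding topics with both tags is a set intersection, so it never touches the post text, and a later call to set_tags on the same topic models another user correcting the OP's tags.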

Phil
Registered User
Posts: 185
Joined: Sun Mar 11, 2007 3:20 am

Re: Search Architecture

Post by Phil »

Tagging assumes that (1) the OP will take the time to tag their post and (2) the tags will be accurate -- both of these are uncertain enough that IMHO they should not be relied on for searching (or anything else honestly).
My phpbb.com account
Note that any of my opinions expressed in RFC topics are my own and not necessarily representative of the opinion of the phpBB Team.

jwxie
Registered User
Posts: 57
Joined: Mon Jan 23, 2006 3:38 am

Re: Search Architecture

Post by jwxie »

iWisdom wrote:Tagging assumes that (1) the OP will take the time to tag their post and (2) the tags will be accurate -- both of these are uncertain enough that IMHO they should not be relied on for searching (or anything else honestly).
Yes, you are right about those issues. However, (1) tagging doesn't take that long, unless 10 seconds means a lot to someone, and (2) as I suggested, other users can help edit the tags. I keep bringing up Stack Overflow because it's a great website and its tagging system is really cool.

I am not trying to sell Stack Overflow to phpBB; I am just giving an example (in a discussion, we always need some examples).

Why is tagging not accurate? Is search ever accurate? I have spent two months trying to find a room near my school. Guess what? Only a month ago I found this Google group that has the information I want. It took me a month to sort it out from Google. A good search is based on two things: (1) keywords and (2) experience.

If we have tags, we can minimize the time to go through all the data (remember a lot of people search the entire board, and of course our algorithm is very powerful). Tags can help us sort out the common ones. I know what your concern is: those useless and bad tags. But if we consider that tags do help with search and if we also allow users to help edit tags, things can work out better.

Phil
Registered User
Posts: 185
Joined: Sun Mar 11, 2007 3:20 am

Re: Search Architecture

Post by Phil »

The accuracy of search is based on content. The accuracy of tagging is based on arbitrary information entered by a user. Ever seen YouTube's tags?
