phpBB

Development Discussion Board

phpBB's testing ground of bleeding edge code

[RFC] Upgrading phpBB Search to State of The Art

Publish your own request for comments or patches for the next version of phpBB. Discuss the contributions and proposals of others. Upcoming releases are 3.1/Ascraeus and 3.2/Arsia.

Post by ameertawfik » Thu Apr 26, 2012 1:03 pm

Motivation:

The tools forums currently offer for finding information are rather poor. Despite active academic work on post and thread search, forum search tools rely on the back-end database's full-text search engine; hence, the structure and nature of forums are ignored completely. In this RFC, I propose that we implement state-of-the-art methods for two search tasks: post search and thread search. The need for this project is evident from the amount of duplicated content in forums and from forums' emphasis on searching before asking.

I submitted this feature as a project idea for Google Summer of Code 2012, but it was not selected. Nevertheless, naderman suggested posting the idea as an RFC, and here it is.

The Google Summer of Code idea of search backend refactoring proposed by phpBB is complementary to this request. This request deals with implementing the algorithms, while the GSoC project aims to improve code extensibility and efficiency.

My aim is not only to implement these two tasks but also to implement other search tasks such as expert finding[7] and thread recommendation[8]. However, for the time being, we should focus on the thread and post search tasks.

For thread search, I am considering the following methods: the pseudo cluster selection model of [1], the weighted product method of [2], and the title + initial post + reply post representation of [3].
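To make the weighted product idea a little more concrete, here is a minimal sketch in Python. It assumes each thread already has a positive score from several contexts (for example, the thread as a whole and its individual posts) and combines them by multiplying the scores raised to per-context weights. The context names, weights, and function name are illustrative assumptions and are not taken from [2], whose exact formulation may differ.

```python
import math

def weighted_product(scores, weights):
    """Combine positive per-context scores as prod(score ** weight).

    Computed in log space so that very small scores do not underflow.
    """
    log_total = sum(weights[name] * math.log(scores[name]) for name in weights)
    return math.exp(log_total)

# Hypothetical scores for one thread from two contexts.
thread_scores = {"thread": 0.04, "post": 0.25}
context_weights = {"thread": 0.5, "post": 0.5}
combined = weighted_product(thread_scores, context_weights)
```

With equal weights that sum to one, this reduces to a geometric mean, so a thread has to score reasonably in every context to rank well.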

For post search, we can use the best-performing method found in [4].

Note that much of the work above requires some academic background in information retrieval and search engines. However, most of the data needed to implement the state of the art is available within any forum's database.

Please feel free to ask questions and discuss, and pardon my academic writing style.


References

1. Elsas, J.L., Ancestry.com Online Forum Test Collection, in Technical Report CMU-LTI-017. 2011, Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
2. Seo, J., W.B. Croft, and D. Smith, Online community search using conversational structures. Information Retrieval, 2011: p. 1-25.
3. Bhatia, S. and P. Mitra, Adopting Inference Networks for Online Thread Retrieval, in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence. 2010: Atlanta, Georgia, USA.
4. Duan, H. and C. Zhai, Exploiting Thread Structure to Improve Smoothing of Language Models for Forum Post Retrieval, to appear in the 33rd European Conference on Information Retrieval (ECIR 2011). 2011: Dublin, Ireland.
5. Wang, H., et al., Learning Online Discussion Structures by Conditional Random Fields, in The 34th Annual International ACM SIGIR Conference (SIGIR'2011). 2011.
6. Xi, W., J. Lind, and E. Brill, Learning effective ranking functions for newsgroup search, in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. 2004, ACM: Sheffield, United Kingdom. p. 394-401.
7. Seo, J. and W.B. Croft, Thread-based Expert Finding, in SIGIR’09 SSM Workshop. 2009, ACM: Boston, Massachusetts.
8. Zhao, J., et al., Learning a user-thread alignment manifold for thread recommendation in online forum, in Proceedings of the 19th ACM international conference on Information and knowledge management. 2010, ACM: Toronto, ON, Canada. p. 559-568.
Last edited by imkingdavid on Thu Apr 26, 2012 7:02 pm, edited 1 time in total.
Reason: Added [RFC] prefix to title
ameertawfik
Registered User
 
Posts: 4
Joined: Tue Aug 31, 2010 10:41 am

Re: [RFC] Upgrading phpBB Search to State of The Art

Post by naderman » Fri Apr 27, 2012 2:44 pm

Well, this certainly sounds interesting. When are you going to start working on this? Are you going to wait for Dhruv to work on the refactoring of search backends first? I can't think of anything I would discuss up front. However, demos and proposals for the user interface (both search mask and search results) would probably help get a better idea of what exactly your goal is.
naderman
Development Team Leader
 
Posts: 1649
Joined: Sun Jan 11, 2004 2:11 am
Location: Karlsruhe, Germany

Re: [RFC] Upgrading phpBB Search to State of The Art

Post by ameertawfik » Fri Apr 27, 2012 3:58 pm

naderman wrote:When are you going to start working on this? Are you going to wait for Dhruv to work on the refactoring of search back ends first?

Yes. I think refactoring at this stage is more important than implementing the state of the art. Dhruv's planned work is essential.
As long as the search module is extensible and focused, implementing good thread and post ranking functions is trivial.
We know the algorithms; what remains is the coding part.

naderman wrote:I can't think of anything I would discuss up front. However, demos and proposals for the user interface (both search mask and search results) would probably help get a better idea of what exactly your goal is.


I think the search interface will stay the same; there is not much to change. The issue, however, is indexing.
If I understand correctly, the search module currently indexes two fields: post text and topic title.
However, many of the studies mentioned above built indexes for threads and replies as well.
For topic (thread) search, I think implementing Elsas's Pseudo Cluster Selection[1] is very appealing. It does not require modifying the current indexing process: it ranks posts and then aggregates their scores to obtain thread scores. It also gives good results.
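The rank-posts-then-aggregate idea can be sketched as follows. This is a minimal illustration, not the method exactly as specified in [1]: it assumes the post scores come from any existing post-ranking function and are strictly positive, and the aggregation used here (geometric mean of the top-k post scores per thread, padded with the lowest observed score when a thread has fewer than k matching posts) is my assumption. The function name and the value of k are hypothetical.

```python
import math
from collections import defaultdict

def rank_threads(post_scores, k=3):
    """Rank threads from scored posts.

    post_scores: list of (thread_id, score) pairs with score > 0,
    produced by whatever post-ranking function is already in place.
    """
    if not post_scores:
        return []

    # Group the post scores by their parent thread.
    by_thread = defaultdict(list)
    for thread_id, score in post_scores:
        by_thread[thread_id].append(score)

    # Threads with fewer than k scored posts are padded with the lowest score seen.
    floor = min(score for _, score in post_scores)

    thread_scores = {}
    for thread_id, scores in by_thread.items():
        top = sorted(scores, reverse=True)[:k]
        top += [floor] * (k - len(top))
        # Geometric mean of the k selected scores, computed in log space.
        thread_scores[thread_id] = math.exp(sum(math.log(s) for s in top) / k)

    return sorted(thread_scores.items(), key=lambda item: item[1], reverse=True)
```

Because only post-level scores are consumed, nothing about the existing index has to change; the thread ranking is purely a post-processing step over the post results.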

Another interesting question is which to implement first: post search or topic search.
We might create a poll to see which one is more important to users.

Re: [RFC] Upgrading phpBB Search to State of The Art

Post by drathbun » Mon Apr 30, 2012 2:40 pm

ameertawfik wrote:Another interesting question is which to implement first: post search or topic search.
We might create a poll to see which one is more important to users.

How do you define post or topic search?

In the standard phpBB search screen the option is to return by posts or topics, and the default is posts. I very much prefer to return results as topics to get a higher level overview of the search match possibilities, and then I drill into the topic. If I want to search further, then I use the "search in this topic" option instead. Is that what you're talking about?
Sometimes you're the windshield, sometimes you're the bug.
drathbun
Registered User
 
Posts: 72
Joined: Wed Feb 15, 2006 6:40 pm
Location: Texas

Re: [RFC] Upgrading phpBB Search to State of The Art

Post by ameertawfik » Mon Apr 30, 2012 6:04 pm

drathbun wrote:How do you define post or topic search?

In the standard phpBB search screen the option is to return by posts or topics, and the default is posts. I very much prefer to return results as topics to get a higher level overview of the search match possibilities, and then I drill into the topic. If I want to search further, then I use the "search in this topic" option instead. Is that what you're talking about?


Thanks, drathbun. You made a good point, which leads us to an interesting observation about how post and topic search are used.

Here is the thing:
Most of the academic works cited in the first post focus on returning a list of topics in response to a user's query.
One reason for that is the one you gave. Another is that a question or query is sometimes best addressed by several posts from the same topic, for instance when there are multiple solutions to the same problem.

Once the context of the search task changes, the definition of a relevant result changes as well.
On the standard search screen, you are searching a very large list of posts or topics, so the search has to account for that.
An algorithm needs to cope with the diversity of content in these lists.
On the other hand, searching for a post within a given topic is much easier. You have only a few posts to search, and these posts are, to some extent, about the same topic. That is why I said your point raised an interesting observation about post and topic search.

Therefore, the answer to your question is no. What I meant by post search is finding relevant posts regardless of their parent topics.

Re: [RFC] Upgrading phpBB Search to State of The Art

Post by drathbun » Mon Apr 30, 2012 8:21 pm

What I would like to see in a search system is inclusion of some additional factors that are human-driven rather than algorithms. I have tried some of these out with various degrees of success, mostly because of lack of the required human participation. To put my comments into context: I run a large board that offers self-service support for a large enterprise software vendor. By "self-service" I mean that the community as a whole helps each other, with very little input from the vendor. Basically it's a support board. :) In that context, I would place a premium on "solved" topics, and with those "solved" topics an additional premium would be placed on the most helpful posts. In order to do that, we need to know if a topic is solved, open, closed, or some other status that is relevant. And we also need to know if a particular post is helpful in solving the topic, and perhaps even note the "most helpful" post within the topic if such a thing can be identified.

That being said, I would require a moderator or topic originator to be able to mark a topic as solved, or to say which posts are helpful. I would not want to try to determine this via some algorithm.
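A minimal sketch of how such human-provided flags might feed into ranking, assuming a text-relevance score is already available for each post. The field names (`topic_solved`, `marked_helpful`) and the boost values are hypothetical; they are not existing phpBB columns or settings.

```python
from dataclasses import dataclass

@dataclass
class PostHit:
    post_id: int
    relevance: float      # score from the ordinary text search
    topic_solved: bool    # set by a moderator or the topic originator
    marked_helpful: bool  # set by a moderator or the topic originator

def boosted_score(hit, solved_boost=1.5, helpful_boost=2.0):
    """Premium for solved topics, plus an extra premium for helpful posts in them."""
    score = hit.relevance
    if hit.topic_solved:
        score *= solved_boost
        if hit.marked_helpful:
            score *= helpful_boost
    return score

def rerank(hits):
    """Order hits by boosted score, highest first."""
    return sorted(hits, key=boosted_score, reverse=True)
```

The flags only adjust scores that the text search has already produced, so deciding what is solved or helpful stays entirely with people rather than with an algorithm.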

Then we get into the human factor of people that try to game the system, marking their own posts as most helpful, in an effort to gain stature and recognition within the community, but that's an entirely different discussion altogether.

As I re-read your initial post...
ameertawfik wrote:Despite active academic work on post and thread search, forum search tools rely on the back-end database's full-text search engine; hence, the structure and nature of forums are ignored completely.

I think one could argue that this isn't completely true, yes? For example, my support board has forums based largely on individual products. When I fill out my search form, I have the option to limit my search activity to a particular forum, or category of forums, or even a parent and related sub-forums. That does allow the user to provide some input to the search algorithm as to the structure of the board, and where to search. Granted it's an "opt-in" feature, but did you have something different in mind?

I guess I should really read some of the reference links you posted before asking too many more questions... ;)

Re: [RFC] Upgrading phpBB Search to State of The Art

Post by ameertawfik » Mon Nov 19, 2012 4:37 pm

First of all, I am not sure whether I should continue this discussion here or in a new thread.
Nevertheless, I will keep the discussion here. Also, I have only just finished writing my thesis, so please accept my apologies for replying so late.

drathbun wrote:What I would like to see in a search system is inclusion of some additional factors that are human-driven rather than algorithms. I have tried some of these out with various degrees of success, mostly because of lack of the required human participation. To put my comments into context: I run a large board that offers self-service support for a large enterprise software vendor. By "self-service" I mean that the community as a whole helps each other, with very little input from the vendor. Basically it's a support board. :) In that context, I would place a premium on "solved" topics, and with those "solved" topics an additional premium would be placed on the most helpful posts. In order to do that, we need to know if a topic is solved, open, closed, or some other status that is relevant. And we also need to know if a particular post is helpful in solving the topic, and perhaps even note the "most helpful" post within the topic if such a thing can be identified.

First, I want to highlight that when I said thread or post search, I was actually aiming at search solutions that do not take the forum type into account. As you said, your forum is a support community, but phpBB is used by other types of forums as well. In any case, I see that kind of flexibility as a filtering function rather than a ranking function.
My primary focus is how to capture the relevance between a post or thread and the user's query.
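The distinction can be sketched as two separate steps: a filter decides which posts are eligible at all, and a ranking function orders whatever is left by relevance to the query. The term-overlap scorer below is only a stand-in for a real ranking function, and all names and the post structure are illustrative assumptions.

```python
def filter_posts(posts, allowed_forums):
    """Filtering: keep or drop a post; the query plays no role here."""
    return [p for p in posts if p["forum_id"] in allowed_forums]

def relevance(post, query):
    """Ranking: a toy score, the number of distinct query terms found in the post."""
    post_terms = set(post["text"].lower().split())
    return len(set(query.lower().split()) & post_terms)

def search(posts, query, allowed_forums):
    """Filter first, then rank the remaining posts by relevance to the query."""
    candidates = filter_posts(posts, allowed_forums)
    scored = [(relevance(p, query), p) for p in candidates]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored]
```

Swapping in a better `relevance` function changes the order of results without touching the filter, which is why I treat the two as separate concerns.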

drathbun wrote:That being said, I would require a moderator or topic originator to be able to mark a topic as solved, or to say which posts are helpful. I would not want to try to determine this via some algorithm.
Then we get into the human factor of people that try to game the system, marking their own posts as most helpful, in an effort to gain stature and recognition within the community, but that's an entirely different discussion altogether.


That is exactly what happens in question-and-answer forums such as Yahoo! Answers. There is, however, a solution that has been proposed for it. I think a priority list is needed as well.

drathbun wrote:As I re-read your initial post...
ameertawfik wrote:Despite active academic work on post and thread search, forum search tools rely on the back-end database's full-text search engine; hence, the structure and nature of forums are ignored completely.

I think one could argue that this isn't completely true, yes? For example, my support board has forums based largely on individual products. When I fill out my search form, I have the option to limit my search activity to a particular forum, or category of forums, or even a parent and related sub-forums. That does allow the user to provide some input to the search algorithm as to the structure of the board, and where to search. Granted it's an "opt-in" feature, but did you have something different in mind?

I guess I should really read some of the reference links you posted before asking too many more questions... ;)


Well, I think this is also related to filtering search results. In fact, I believe a faceted search interface for phpBB would be an awesome feature.
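As a rough illustration of what a faceted interface would need from the backend, the sketch below counts how many search hits fall under each value of a few facets, so the interface could show something like "Support (2)" next to each filter. The facet names and the hit structure are assumptions for the example, not phpBB's schema.

```python
from collections import Counter

def facet_counts(hits, facets):
    """Count hits per value for each requested facet."""
    counts = {facet: Counter() for facet in facets}
    for hit in hits:
        for facet in facets:
            if facet in hit:
                counts[facet][hit[facet]] += 1
    return counts

def apply_facet(hits, facet, value):
    """Narrow the result list to hits matching one selected facet value."""
    return [hit for hit in hits if hit.get(facet) == value]
```

Selecting a facet value is then just another filter applied to the ranked results, so it composes with whatever ranking function is in use.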

