Future support for Sphinx Search
- 3Di
- Registered User
- Posts: 951
- Joined: Tue Nov 01, 2005 9:50 pm
- Location: Milano 🇮🇹 Frankfurt 🇩🇪
Re: Future support for Sphinx Search
I created that PR based on trusting his report.
Re: Future support for Sphinx Search
Hunchman801 wrote: ↑Mon Dec 23, 2019 2:34 pm
Sounds great, I can't wait to test your implementation. Please let us know when you've released the code.
On the subject of Sphinx, there's another bug (can't view the last results of a query when there's more than 20,000 of them) that has a very simple fix, hope this gets included too one day.
Good work!

I was unaware of this bug, but I was able to reproduce it on search results over 20,000. I hadn't noticed before because I hadn't bothered to go to the last search results page for large search queries. I used the amended code proposed by 3Di and it fixed it. According to search-query.log, there is no real performance hit, as it just re-performs the search with the new offset if you try to go beyond the first 20,000 results.
I will focus my PR on the many other issues I identified.
Re: Future support for Sphinx Search
I am almost done with my changes to fix and enhance Sphinx Search.
I have just been taking care of a few minor details, such as:
- how search should handle queries containing an ampersand (&). For example, how should words such as "D&D" or "P&L" be indexed and queried? Based on reviewing search logs on my board and the equivalent keywords in the database, a search for "D&D" should return "D&D" or "D & D". Similarly, a search for "dungeons & dragons" should return both "dungeons & dragons" and "dungeons and dragons".
- apostrophes. Words with contractions and possessive apostrophes are now stripped of the apostrophe at both the indexing and querying stage so that a search for the word with or without apostrophe yields identical results. "women's league" and "womens league" would have the exact same search results, but with morphology enabled (word stemming or lemmatization), these search terms would also find "woman league" and "women league".
- @ character. I've now configured Sphinx so that if, for example, the Twitter handle @username is mentioned in a post, it is indexed as both "@username" and "username". A search for "username" finds both "username" and "@username", while a search for "@username" finds only "@username".
- I've configured Sphinx to handle dollar signs ($) and hashes (#) in the same way as the @ character. This way, the word "$100" will be found when searching for either "$100" or "100", but a search for "$100" will only find "$100" and not "100".
- thousands separators. Commas are stripped from the search index so that 1,000,000 is indexed as 1000000. A search for either 1000000 or 1,000,000 will therefore find all instances of the number, whether or not the relevant post used thousands separators. Since thousands separators vary by country (e.g. in Germany, a dot is used as a thousands separator), I have left the search keyword transformation optional and commented it out.
I don't think these kinds of issues have been thought about in much detail previously, but I'm seeking to address them. Any thoughts on this behaviour or any other issues are welcome.
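To make the behaviour above easier to discuss, here is a rough Python sketch of the matching rules for apostrophes, thousands separators and the @ / $ / # characters. It is only a simulation of the intended outcome, not the code from my changes and not how Sphinx works internally. The function names and the BLEND_CHARS constant are made up for illustration.

```python
import re

# Characters that should be indexed both with and without the prefix,
# as described for @, $ and #. Hypothetical constant for this sketch.
BLEND_CHARS = "@$#"


def normalize_keyword(word, strip_thousands=True):
    """Clean up a single word the same way at index time and query time."""
    # Drop straight and curly apostrophes: "women's" -> "womens".
    word = re.sub("['\u2019]", "", word)
    if strip_thousands:
        # Remove commas that sit between two digits: "1,000,000" -> "1000000".
        word = re.sub(r"(?<=\d),(?=\d)", "", word)
    return word.lower()


def index_terms(post_word):
    """Return the set of terms a word from a post would be indexed under."""
    word = normalize_keyword(post_word)
    terms = {word}
    stripped = word.lstrip(BLEND_CHARS)
    if stripped and stripped != word:
        # "@username" is indexed as both "@username" and "username".
        terms.add(stripped)
    return terms


def matches(query_word, post_word):
    """True if a single-word query would find the given post word."""
    return normalize_keyword(query_word) in index_terms(post_word)
```

With this, matches("username", "@username") is true but matches("@username", "username") is false, which is the asymmetry described in the list. Passing strip_thousands=False mirrors leaving the thousands separator transformation switched off.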
Turning back to the OP and the deprecation of the SphinxAPI in favour of SphinxQL, I offer the following comments:
I do not think the current Sphinx phpBB plugin should be rewritten in favour of SphinxQL. While benchmarking of more recent versions of Sphinx (3.2 onwards) which solely use SphinxQL shows increased performance, this looks to be mostly confined to use of RT indexes, while base search performance in the old version of Sphinx 2.2 is just as good if not better when using regular indexing.
SphinxSearch is not actively supporting the 2.2 branch. However, the latest full release which supported SphinxAPI (v.2.2.11) is very stable and so highly featured that I do not think there will be any issues for a while to come. Even the SphinxSearch developers have stated that v.2.2.11 is unlikely to break.
The only issue is that the PHP SphinxAPI might face compatibility issues as PHP continues to evolve and SphinxAPI is no longer actively supported. That can be solved by the phpBB team actively updating the sphinxapi.php file that ships with phpBB. This is what has been done to date anyway - phpBB is still shipping with the 2012 version of sphinxapi.php, which has been modified and updated by phpBB as a separate branch from Sphinx.
Alternatively, we could easily move the whole Sphinx backend to Manticore. Manticore is an open source fork of Sphinx that appears to be more actively updated, better featured and, importantly, continues to support and develop the PHP API. If we move phpBB from Sphinx to Manticore, or at the least introduce Manticore support as a 5th search backend option, then this will ensure we can continue to use the PHP API without requiring much additional work.
A nice summary of Manticore vs Sphinx can be seen here: https://manticoresearch.com/manticore-vs-sphinx/
There is an open issue in the Tracker on migrating from SphinxAPI to SphinxQL: https://tracker.phpbb.com/browse/PHPBB3-14398
I think my comments above address this issue. That is, we either actively maintain sphinxapi.php or enable support for Manticore, which now appears to be the better maintained search engine.
Re: Future support for Sphinx Search
Another conundrum is how to handle abbreviations/acronyms/initialisms.
For example, if the user searches for "USA", you would expect to return results for "USA", "U.S.A" and "U.S.A." Similarly, a search for "U.S.A" should find "USA" as well as "U.S.A". However, by default "U.S.A." will be indexed as "U S A". How then should a search for "Marvel agents of S.H.I.E.L.D." be handled?
The easy solution is to strip full stops in both indexing and search queries so that USA / U.S.A. and S.H.I.E.L.D / SHIELD are equivalent.
However, what if you need to search for things like filenames? On the phpBB community forum for example, if a user searches "composer.json", you would expect to return the literal result back. However, a search for just "composer" or "composer json" should also find posts with composer.json. If the dots are just stripped, then a search for "composer" will not find "composer.json". You would need to search for "composer*" to find it.
One potential solution is to use the expand_keywords configuration option. This internally expands keyword queries as follows:
running -> ( running | *running* | =running )
However this would have a global effect on all search queries and probably result in an unacceptable number of irrelevant search results (since a wildcard is effectively added to EVERY search term).
A possible compromise I have found for this is to add the full stop to the blend_chars variable in sphinx.conf. This results in composer.json in post text being indexed as both "composer.json" and "composer json", and U.S.A. being indexed as both "U.S.A." and "U S A".
Then, search keywords containing full stops are transformed into alternative search queries so that "composer.json" becomes (composer.json|"composer json"|composerjson), U.S.A. becomes (U.S.A.|"U S A"|USA) etc.
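As a rough illustration of that keyword transformation, here is a small Python sketch. It is not the phpBB code (which would be PHP), and the function name is invented for this example. It only shows the shape of the rewrite: the literal keyword, the quoted phrase with dots replaced by spaces, and the keyword with dots removed.

```python
def expand_dotted_keyword(keyword):
    """Rewrite a keyword containing full stops as an OR-group of alternatives."""
    if "." not in keyword:
        return keyword
    # Split on dots and discard empty pieces (e.g. from a trailing dot).
    parts = [part for part in keyword.split(".") if part]
    if not parts:
        return keyword
    alternatives = [
        keyword,                      # literal form, e.g. composer.json
        '"' + " ".join(parts) + '"',  # phrase form, e.g. "composer json"
        "".join(parts),               # joined form, e.g. composerjson
    ]
    # Remove duplicates while keeping the original order.
    unique = list(dict.fromkeys(alternatives))
    return "(" + "|".join(unique) + ")"


print(expand_dotted_keyword("composer.json"))
print(expand_dotted_keyword("U.S.A."))
```

Keywords without a full stop are returned unchanged, so ordinary searches are not affected.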
Then you would need to create an exceptions.txt file to map specific abbreviations you want to be treated as identical. Sphinx allows you to map these specific examples in the index, e.g. U S A => USA. This would ensure both that a search for U.S.A. finds USA and that a search for USA finds U.S.A. You would think that someone had created a list of common acronyms/initialisms for this purpose, but my Google searching has not found anything useful. Unfortunately, I would need to create something bespoke, which would be extremely time-consuming.
A similar problem exists for underscores and hyphens (although I have dealt with hyphens satisfactorily).
Creating a one-size-fits-all approach is not easy when you descend into this level of detail! I'm almost tempted to leave this all well alone.
Re: Future support for Sphinx Search
After further extensive testing today, I am of the view that it is best to deal with customisation of search queries for abbreviations (and to a certain degree, hyphens) by way of a custom exceptions.txt file.
Using the USA example, I created the following rules in an exceptions.txt:
Code:
U.S.A. => usa
u.s.a. => usa
U.S.A => usa
u.s.a => usa
USA => usa
U.S. => usa
u.s. => usa
U.S => usa
u.s => usa
US => usa
Every variation in the forum database was indexed to a single token "usa". Searching for any one of those variations then produced the same result. This is probably the best outcome and requires no string manipulation of the search keywords in phpBB code.
The other upside is that, for this to work, the dot does not need to be part of the search index character set table. This means the dot will ordinarily be ignored and treated as whitespace, allowing the filename search examples described to work. Sphinx will simply ignore the full stop unless it is part of one of the variations specified in exceptions.txt.
Similar rules might be created for specific ampersand terms, e.g. D & D => d&d, P & O => p&o etc., and then simply & => and.
Subject to input from others, I propose to deal with these issues as outlined above, then perhaps add some instructions in the development wiki.
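Since writing these rules by hand for every abbreviation is tedious, something like the following Python sketch could generate the dotted and undotted variants for a given acronym. This is just an idea for a helper script, not part of my changes, and the function name is made up. Shortened aliases such as U.S. => usa would still need to be added manually.

```python
def exception_rules(acronym):
    """Build exceptions.txt style rules mapping variants of an acronym to one token."""
    # Keep only letters and digits, so "U.S.A." and "USA" give the same result.
    letters = [char for char in acronym if char.isalnum()]
    target = "".join(letters).lower()
    dotted = ".".join(letters)

    variants = []
    # Dotted forms with and without the trailing dot, in upper and lower case.
    for form in (dotted + ".", dotted):
        variants.append(form.upper())
        variants.append(form.lower())
    # Plain upper-case form without dots.
    variants.append("".join(letters).upper())

    # Remove duplicates (e.g. for one-letter input) while keeping the order.
    return [f"{variant} => {target}" for variant in dict.fromkeys(variants)]


for rule in exception_rules("USA"):
    print(rule)
```

For "USA" this produces the first five rules from my list above, so a board administrator would only need to supply a list of acronyms and add any shortened aliases by hand.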
Re: Future support for Sphinx Search
I have submitted a pull request for proposed changes here: https://github.com/phpbb/phpbb/pull/5815
As part of this I have submitted sample wordform and exception rule documents to guide board administrators on configuring custom search term mapping for their own boards. Included in this is a complete list of British vs American spelling variations I compiled so that a search using American spelling will find the British spelling equivalent and vice versa.