PHPBB3-13803 / PHPBB3-13891 - Reparsing text

Discuss requests for comments/changes posted in the Issue Tracker for the development of phpBB. Current releases are 3.2/Rhea and 3.3/Proteus.
Post Reply
User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

PHPBB3-13803 / PHPBB3-13891 - Reparsing text

Post by JoshyPHP »

https://tracker.phpbb.com/browse/PHPBB3-13803 - https://github.com/phpbb/phpbb/pull/3579 (merged)
https://tracker.phpbb.com/browse/PHPBB3-13891 - https://github.com/phpbb/phpbb/pull/3662

With the s9e\TextFormatter merged, I've been thinking about the right way to reparse rich text for future versions of phpBB. This is a quick dump of my observations. No PR and I don't think I can take over the task for the moment.

First, reparsing should be done through a service. It should be possible to schedule it via cron but it should be implemented as a service.

Last I checked there were at least 9 different places where rich text is used: posts, private messages, user signatures, poll titles, poll options, forum descriptions, forum rules, group descriptions and admin's contact info. Column names vary, storage format varies. So instead of a service, it should be a set of services. Those services should be tagged to make it possible to trigger a reparsing of everything that's rich text. Using a tag makes it easier for extensions to have their content reparsed.

The Support Toolkit can reparse posts in 3.0, does it also work on 3.1?

There's a potentially important trap lying in wait. phpBB's message_parser uses the global $user to determine which features are enabled. Some extensions such as ABBC3 police BBCode usage according to the current user. The whole system assumes that the current user is the author of the post or text being processed. That means that if an admin triggers a reparsing, it will be made according to their credential and not the original author's.

I have written a working example of a reparser service in this feature branch: https://github.com/phpbb/phpbb/compare/ ... r?expand=1 (only the post_text service is included)
The unit of work is called a record, a set of values. Each service is responsible for retrieving and updating records in the database. For post texts, it's a set of columns from the same row. For text stored in the config table, it's a set of values stored in multiple rows. Forum descriptions and forum rules are two records stored in the same row. Pseudo-code for reparse_range():

Code: Select all

reparse_range($min_id, $max_id)
	get_records($min_id, $max_id)
	For each record $record
		reparse_record($record)
			If reparsed text differs: save_record($record)
Class overview
phpbb\textreparser\reparser_interface
public function reparse_range($min_id, $max_id)
public function get_max_id()

phpbb\textreparser\base implements reparser_interface
public function reparse_range($min_id, $max_id)
abstract public function get_max_id()
abstract protected function get_records($min_id, $max_id)
abstract protected function save_record(array $record)
protected function reparse_record(array $record)

phpbb\textreparser\posts extends base
public function get_max_id()
protected function get_records($min_id, $max_id)
protected function save_record(array $record)


Talking about cron, here's how a global reparsing should work:
  1. Get all tagged services.
  2. Retrieve the highest ID for each type of content. Store in a config value, e.g.

    Code: Select all

    post_text: 123456
    user_sig: 4321
  3. Every execution should reparse a block of N records of one content type (e.g. post_text WHERE post_id BETWEEN 123406 AND 123456), starting from the highest ID and working backwards.
    • Working backwards means that content posted after the reparsing has started will not be reparsed needlessly.
    • It also means that the most recent content gets reparsed first. The most recent content is also the most likely to be read by users first.
  4. There could be one single cron job that reparses everything by changing the type of content being reparsed at each execution. That would guarantee that a global reparsing would not trigger a dozen cron jobs rewriting ten different tables at the same time with the database performance going down the drain.
Last edited by JoshyPHP on Sun Apr 12, 2015 10:13 pm, edited 5 times in total.

User avatar
Elsensee
Former Team Member
Posts: 42
Joined: Sun Mar 16, 2014 1:08 pm
Location: Hamburg, Germany
Contact:

Re: [RFC] Reparsing text

Post by Elsensee »

JoshyPHP wrote:The Support Toolkit can reparse posts in 3.0, does it also work on 3.1?
As far as I know, the STK needs a rewrite to work on 3.1.
The script which is responsible for reparsing posts, however, does only need some fixes to work but I would not prefer having this as a "standalone script".
JoshyPHP wrote:Talking about cron, here's how a global reparsing should work:
  1. Get all tagged services.
  2. Retrieve the highest ID for each type of content. Store in a config value, e.g.

    Code: Select all

    post_text: 123456
    user_sig: 4321
  3. Every execution should reparse a block of N records of one content type (e.g. post_text WHERE post_id BETWEEN 123406 AND 123456), starting from the highest ID and working backwards.
    • Working backwards means that content posted after the reparsing has started will not be reparsed needlessly.
    • It also means that the most recent content gets reparsed first. The most recent content is also the most likely to be read by users first.
  4. There could be one single cron job that reparses everything by changing the type of content being reparsed at each execution. That would guarantee that a global reparsing would not trigger a dozen cron jobs rewriting ten different tables at the same time with the database performance going down the drain.
There won't be two simultaneously cron jobs running... ever. cron.php is acquiring a lock on.. some file. ;)
So having multiple services for every content type would be fine for me.

I'm just wondering on how much time it will take - that global reparsing.
On small boards it may just take some hours I guess, but what about the big boards like phpBB.com? Will it take months there? Or even years? :shock:
Will there be any way for the administrators of a board to check the current status of the global reparsing? Or controlling it? (Like stopping it, pausing it, setting the rate/size of blocks which are parsed per job)

Nicofuma
3.2 Release Manager
3.2 Release Manager
Posts: 299
Joined: Sun Apr 13, 2014 1:40 am
Location: Paris

Re: [RFC] Reparsing text

Post by Nicofuma »

please create and link to the ticket... as it was decided earlier now the entry point for every ticket is the tracker (the notion of RFC is gone)
Member of the phpBB Development-Team
No Support via PM

User avatar
RMcGirr83
Registered User
Posts: 360
Joined: Fri Mar 09, 2007 1:51 am
Contact:

Re: [RFC] Reparsing text

Post by RMcGirr83 »

Nicofuma wrote:please create and link to the ticket... as it was decided earlier now the entry point for every ticket is the tracker (the notion of RFC is gone)
Then this forum should be locked to new posts IMHO as it states very clearly [3.X]RFCs Discussion which I'm not entirely sure what that even means anymore if the "notion of RFC is gone". IF RFC is gone what are we supposed to be discussing?
Do not hire Christian Bullock he won't finish the job and will keep your money

User avatar
DarkBeing
Registered User
Posts: 83
Joined: Sun Jul 19, 2009 2:32 pm
Location: Currently Estonia
Contact:

Re: [RFC] Reparsing text

Post by DarkBeing »

I have to agree with RMcGirr83. Why discuss things here if the only way they get noted is to create a ticket? Why not move the whole discussion to the tracker then and delete this forum. The area51 forum seems to get more and more obsolete. At least that is the impression I get.

Nicofuma
3.2 Release Manager
3.2 Release Manager
Posts: 299
Joined: Sun Apr 13, 2014 1:40 am
Location: Paris

Re: Discussion about reparsing text

Post by Nicofuma »

It's out of the scope of this topic but I will answer and then please start a new one if you want to talk about it.

As it was proposed and decided here, the idea is to get ride of the RFCs (I fixed the wording of the forum) : every change proposal is posted on JIRA and the default state is accepted. Also, JIRA is the best place to discuss about it if the proposal isn't very huge or if it doesn't lead to a big debate. Otherwise the forum is the place to go. Why? Because a huge discussion is unreadable on JIRA, nothing else.
Member of the phpBB Development-Team
No Support via PM

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Reparsing text

Post by JoshyPHP »

Elsensee wrote:There won't be two simultaneously cron jobs running... ever.
Oh that's cool. Although on the other hand I guess it also means the cron task has to be short in duration and more frequently executed.
Elsensee wrote:I'm just wondering on how much time it will take - that global reparsing.
It strongly depends on the database server. The PHP execution is quite cheap (especially when working on batches) but reparsing all posts means rewriting the entire posts table with as many UPDATE queries as there are posts. I don't know how good shared servers are nowadays but if you do a thousand posts per hour, it'll take 6 weeks per million posts.
Nicofuma wrote:please create and link to the ticket... as it was decided earlier now the entry point for every ticket is the tracker (the notion of RFC is gone)
I don't know how closely I'll follow the discussion, that's what I didn't create a ticket. I don't want to create a ticket that I won't properly follow up. Feel free to move this topic to wherever such discussion may take place.

User avatar
Pony99CA
Registered User
Posts: 986
Joined: Sun Feb 08, 2009 2:35 am
Location: Hollister, CA
Contact:

Re: Discussion about reparsing text

Post by Pony99CA »

Nicofuma wrote:It's out of the scope of this topic but I will answer and then please start a new one if you want to talk about it.
There already is such a topic -- the one that you started.
Nicofuma wrote:As it was proposed and decided here, the idea is to get ride of the RFCs (I fixed the wording of the forum) : every change proposal is posted on JIRA and the default state is accepted.
When was it "proposed and decided"? The only mention that I saw was the linked topic and the only reply was mine disagreeing with part of the proposal.

Sorry to discuss that here, but if you had answered my objection there, I would have to. ;)

Steve
Silicon Valley Pocket PC (http://www.svpocketpc.com)
Creator of manage_bots and spoof_user (ask me)
Need hosting for a small forum with full cPanel & MySQL access? Contact me or PM me.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: PHPBB3-13803 - Implement a generic and scalable way to reparse rich text

Post by JoshyPHP »

I've published a WIP PR at https://github.com/phpbb/phpbb/pull/3579.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: PHPBB3-13803 - Implement a generic and scalable way to reparse rich text

Post by JoshyPHP »

For future reference: the signatures of generate_text_for_storage() and message_parser::parse() don't support the same feature set. Extra parameters could be added to generate_text_for_storage() to support toggling $allow_img_bbcode, $allow_flash_bbcode, $allow_quote_bbcode and $allow_url_bbcode but those values aren't stored in the database. That means it will require tracking each call to message_parser::parse() to determine how those values were computed at the call site. It will probably involve checking the text's author's permissions.

Code: Select all

generate_text_for_storage(&$text, &$uid, &$bitfield, &$flags, $allow_bbcode = false, $allow_urls = false, $allow_smilies = false)
parse($allow_bbcode, $allow_magic_url, $allow_smilies, $allow_img_bbcode = true, $allow_flash_bbcode = true, $allow_quote_bbcode = true, $allow_url_bbcode = true, $update_this_message = true, $mode = 'post')
 

Post Reply