[RFC] Integrate s9e\TextFormatter

These requests for comments/change have lead to an implemented feature that has been successfully merged into the 3.2/Rhea branch. Everything listed in this forum will be available in phpBB 3.2.
Post Reply
User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

You get the same features as current phpBB with just ext/dom and ext/filter. If you have ext/intl you can use IDNs, otherwise it's like current phpBB: no IDNs but everything else stays the same.

The optional extensions json, xsl and zlib unlock features that are not used in this RFC. The other optional extensions are only used to make some code a fraction of a percent faster.

User avatar
naderman
Consultant
Posts: 1727
Joined: Sun Jan 11, 2004 2:11 am
Location: Berlin, Germany
Contact:

Re: [RFC] Integrate s9e\TextFormatter

Post by naderman »

Ok great, thanks. So we should ensure ext/dom & ext/filter are sufficiently wide spread, but I imagine they are.

User avatar
EXreaction
Registered User
Posts: 1555
Joined: Sat Sep 10, 2005 2:15 am

Re: [RFC] Integrate s9e\TextFormatter

Post by EXreaction »

They are enabled by default after PHP 5.2, so I don't think there is any issue there.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

Hey. This is the one-year anniversary (to the day!) of this thread so I thought I'd celebrate by bumping it. I've kept the branch up-to-date with develop, it's still fresh. That is all.

User avatar
Meis2M
Registered User
Posts: 448
Joined: Fri Apr 23, 2010 10:18 am
Contact:

Re: [RFC] Integrate s9e\TextFormatter

Post by Meis2M »

any hope to see in phpbb core?

aleha
Registered User
Posts: 143
Joined: Tue Mar 26, 2013 2:19 am

Re: [RFC] Integrate s9e\TextFormatter

Post by aleha »

Yes, in 3.2.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

phpBB 3.1 has been released a couple of weeks ago so it feels like now is the best time to restart the discussion about this RFC.

For the past 14 months I have diligently kept my fork in sync with the develop branch. What it needs now is for a phpBB developer to take point and outline a path towards merging it. @naderman: I don't know how you guys usually operate, so I'm waiting for you to tell me what you need from me to accept and merge this PR. Perhaps make a checklist of questions left open and decisions that lack a consensus. Anything that will help us progress and move forward.
naderman wrote:Is Nutella really as good as people say?
Why yes it is, thanks for asking. (sorry, this board doesn't have mentions yet so instead I have to abuse notifications)

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

In these comments on GitHub, naderman and nickvergessen discuss the pros and cons of storing parsed texts separate from the original text. naderman's concerns are with preserving the original text, having the possibility of using multiple text formatters and easily mass-invalidating old posts. nickvergessen is concerned with the extra (double) storage space it would require. naderman asks about storing the metadata separately from the text to reduce the storage requirements.

My opinion is that it's globally easier and more beneficial to keep the single column approach. Here's a point-by-point explanation.
  • Preserving the original text: from day one, the format I used for s9e\TextFormatter was designed to preserve the original text and make it possible to retrieve it in any programming language without the use of the library. With the exception of control codes, every byte of the original text is preserved. Most of the control codes are removed, CR (\r), LF (\n) and CR+LF (\r\n) combos are normalized to LF, HT (\t) as well as everything else is preserved. One way to ensure that there is no loss of content would be to immediately test that the text can be reversed when a post is submitted and if the result doesn't match, disable any formatting and treat it as an error.
  • Multiple text formatters: this possibility exists in current PR, but only as a global setting. s9e\TextFormatter is implemented as a formatter implementing a set of interfaces providing a set of services. Realistically, I don't foresee any other formatter being created.
  • Mass-invalidating posts: I believe it would be more efficient to use a flag to signal that a post must be reparsed. The example given was UPDATE posts SET post_formatter_meta = ''. Depending on the DBMS, that query would generate a lot of writes and even more as the new metadata is written. In some cases, the entire row may get rewritten twice. With a single flag, the value may be updated in place. If the new parsed text is the same as the old parsed text, the flag only needs to be reset.
  • Extra storage: it would be possible to extract the metadata from the XML format used by s9e\TextFormatter and store it separately. It wouldn't require twice the storage, my estimate is about 6 bytes per "rich text" feature used plus a base cost of 4 bytes, compared to storing the XML alone and excluding the extra column metadata required by the DBMS. For example, for a text with one link and one smiley it would cost less than 10 bytes more than storing the XML as-is. But more than the cost of storage, I'm concerned with the added complexity and CPU cost to unpack/reinject the metadata for each post.
There's an additional issue that hasn't been mentionned: migration. One big benefit of having a single column is that the schema does not have to change. Adding a column means that the entire table has to be rewritten, at least on MySQL.

About migration. I designed this PR to be free any form of migration. You can take any test board based on develop, switch your local branch to mine and everything should keep working as expected. New posts use the new formatter, old posts use the old routine. The posts table is the biggest and perhaps the busiest of the database. Big boards could incur a lot of downtime if all the posts had to be reparsed between versions. The strategy I'd recommend is to keep both the old and new for the entirety of the 3.2 line. At the same time, provide a way to incrementally reparse old posts as a cron job. By the time 3.3 comes out, all of the posts should be converted. Then in 3.3, ensure that all the posts are current, do remove the old routines and remove the bbcode_uid and bbcode_bitfield columns with minimal downtime.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

JoshyPHP wrote:For the past 14 months I have diligently kept my fork in sync with the develop branch. What it needs now is for a phpBB developer to take point and outline a path towards merging it. @naderman: I don't know how you guys usually operate, so I'm waiting for you to tell me what you need from me to accept and merge this PR. Perhaps make a checklist of questions left open and decisions that lack a consensus. Anything that will help us progress and move forward.
Situation unchanged here. I'm still on standby. As far as I'm concerned, the branch is feature-complete and ready to be merged. Probably.

Nicofuma
3.2 Release Manager
3.2 Release Manager
Posts: 299
Joined: Sun Apr 13, 2014 1:40 am
Location: Paris

Re: [RFC] Integrate s9e\TextFormatter

Post by Nicofuma »

I would like to finish that as soon as possible, but I also see that there is still an open discussion with no agreement nor consensus about the storage, and I think that this point is crucial and need to be decided.

So to sum-up: (currently means in the current state of the PR)
  • s9e\TextFormatter use an XML representation of the post
  • Currently the XML representation is stored in the post_text column
  • The XML representation add some tags and formalize the text (new line, spaces and tabs) and doesn't remove any part of the text
Issue: Naderman would store the original text in the post_text column and the metadata (i.e. the XML representation) in a new column
Pro:
  • we can allow using different formatters depending on an option on the post
  • no bug in any formatter can ever result in the source of the post being altered
  • reformatting all posts is as easy as UPDATE posts SET post_formatter_meta = ''; as the meta data should be regenerated on viewing posts if necessary.
Cons:
  • Increase the size of the database (even if the metadata doesn't contain any text we need to be able to inject them in the right place)
  • Extra time when rendering the posts
I would like us to take a decision about that as soon as possible. And to help us I would like to have some benchmarks (If you need it, we can give you a sample from phpbb.com):
  • What is the cost (in terms of storage) of storing the metadata separately for 1000 posts
  • What is the cost (in terms of loading time) of rendering 1000 posts (so what is the rendering time for 1000 posts stored with their metadata and what is the rendering time for 1000 posts stored with the metadata separately)
  • The same questions with 1000 very rich posts (so with a lot of BBCode and Smileys)
JoshyPHP wrote:poke
naderman wrote:poke
nickvergessen wrote:poke
Member of the phpBB Development-Team
No Support via PM

Post Reply