[RFC] Integrate s9e\TextFormatter

These requests for comments/change have lead to an implemented feature that has been successfully merged into the 3.2/Rhea branch. Everything listed in this forum will be available in phpBB 3.2.
Post Reply
User avatar
brunoais
Registered User
Posts: 964
Joined: Fri Dec 18, 2009 3:55 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by brunoais »

A small typo fix here just to make it clear:
Nicofuma wrote: But I also see that there is still an open discussion with no agreement nor consensus about the storage, and I think that this point is crucial and need to be decided. So to sum-up: (currently means in the current state of the PR)
  • s9e\TextFormatter use an XML representation of the post
  • Currently the XML representation is stored in the post_text column
  • The XML representation add some tags and formalize the text (new line, spaces and tabs) and doesn't remove any part of the text
If it didn't change since I last saw it, it is not XML, it is XSL. They are different although they share syntax.

I also leave 1 important question:
@JoshyPHP How long are you able to provide support for it? As far as I know, no one understood your code except for you. If something breaks, or is broken, it is for good until you arrive to fix it. There is the possible event of someone actually trying in their time to fix it.
If you really think you can give the support it might need, then I will support this addition to phpBB

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

Nicofuma wrote:The XML representation add some tags and formalize the text (new line, spaces and tabs) and doesn't remove any part of the text
For precision's sake, it doesn't remove text but it does remove control codes that are not whitespace. That's bytes 0x00..0x08, 0x0B..0x0C and 0x0E..0x1F. It doesn't touch spaces or tabs but it normalizes EOL to \n. That's actually very close to what phpBB does by default.

About storing parsed text in a separate column, in my opinion the biggest problem is not storage size, it's complexity. The issue is twofold: API and DB structure.

API. In phpBB's current codebase, when you want to display a text you send the post_text string through generate_text_for_display() and you receive HTML in return. If that were to change, each call site would need to be changed (manually?) to add the metadata as a sixth argument and then you still have to track the source of the data (usually a SQL request) and add the metadata column to the query. I see at least 38 calls to generate_text_for_display() in the current codebase. On top of that there are about 10 calls to generate_text_for_storage() that would need to be modified as well as the corresponding SQL queries.

DB structure. I grep'ed through the phpBB's sources and I see at least 9 different kinds of rich text: posts, private messages, user signatures, poll titles, poll options, forum descriptions, forum rules, group descriptions and admin's contact info. That's nearly as many tables that need to be changed, and columns to be added.

In addition, a number of tests and test fixtures would need to be updated. Such an extensive change would need to take place in a separate PR. I absolutely see the value of storing a copy of the original text, but the development cost is big enough to outweigh it.
Nicofuma wrote:And to help us I would like to have some benchmarks (If you need it, we can give you a sample from phpbb.com):
I don't have any real-life forums so yes, if you want me to run benchmarks on real-world data I'll need you to provide it.

I don't think that benchmarking the performance of storing metadata separately will tell you much though. I would expect it to cost less than a millisecond per page.
Last edited by JoshyPHP on Sat Feb 14, 2015 2:50 pm, edited 4 times in total.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

brunoais wrote:How long are you able to provide support for it?
My plan is to support the library for as long as it's in use. As far as commitment goes, I've maintained this branch that literally no one uses since August 2013.

If the library was unmaintained and somehow unusable, the backup plan would be to unparse every post and replace them with whatever the library gets replaced with. The services implement those interfaces.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

I have posted a list of differences between 3.1 and this branch in the first post for easier access. This list is not exhaustive, it doesn't include bugfixes or features that are stictly better. (e.g. nesting custom BBCodes)

User avatar
naderman
Consultant
Posts: 1727
Joined: Sun Jan 11, 2004 2:11 am
Location: Berlin, Germany
Contact:

Re: [RFC] Integrate s9e\TextFormatter

Post by naderman »

JoshyPHP wrote:Such an extensive change would need to take place in a separate PR. I absolutely see the value of storing a copy of the original text, but the development cost is big enough to outweigh it.
I think you have a point here. This doesn't make it worse than the situation we already have, so we can simply consider that secondary RFC/change if we feel it's still necessary, and we can go ahead with this as-is.

User avatar
naderman
Consultant
Posts: 1727
Joined: Sun Jan 11, 2004 2:11 am
Location: Berlin, Germany
Contact:

Re: [RFC] Integrate s9e\TextFormatter

Post by naderman »

Please note I did not read or remember the entirety of all previous posts, but went off of your current summary, so if there's already explanations I apologize and the post should be updated to include the reasoning or a link to it.

I think this is looking pretty good overall now and I would like to see this merged soon, so we can upgrade area51 and start testing this ASAP.
JoshyPHP wrote:
  • The markup inside quote's author is not parsed, e.g. quote="[b]author[/b]"
We need a solution that allows us to continue to use at least [url] there, so we can link to external sources, potentially we could add a new bbcode with a url parameter for this and then add a migration that transforms these cases, but this is very common and we need a solution for it.
JoshyPHP wrote:
  • Censored words are marked at posting time. Changing the words list does not affect old posts. The censor can still be turned off at rendering time. Censoring does not apply to BBCode attributes. That means you can't prevent links to a given site by censoring its domain name.
Censoring words must trigger a reparse on view then, as it needs to be possible to alter censors after the fact. In general I would prefer censors were applied on the resulting text nodes to be displayed, rather than on input stored in the database. Censoring domains or URLs is a very common practice so we need a solution, either a new feature to blacklist URLs or a modification of how the censors work.
JoshyPHP wrote:
  • No server-side syntax highlighting in code blocks.
Why not, where's the problem in doing this on view?
JoshyPHP wrote:
  • Default limits are set on the number of occurence of individual BBCodes in text (1000 of each), global number of BBCodes in text (10000) and nesting level (10.) Same thing for smilies, the default is 1000 smilies per post. Past this limit, the markup is ignored.
Why?
JoshyPHP wrote:
  • The text between list and the first * is ignored.
How much work would it be to automatically turn this into an item, or at least display it above or below the list?

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

naderman wrote:We need a solution that allows us to continue to use at least [url] there, so we can link to external sources, potentially we could add a new bbcode with a url parameter for this
I can change the default BBCode to allow something like [quοte="author" url="..."]. The url attribute would be optional.
naderman wrote:I would prefer censors were applied on the resulting text nodes to be displayed
I have an helper class that works with the library that could be used to censor text in HTML, inside of text nodes. There's a performance-simplicity trade-off comparable with phpBB's censor_text().
naderman wrote:Censoring domains or URLs is a very common practice so we need a solution, either a new feature to blacklist URLs or a modification of how the censors work.
From my point of view, the easiest way is to blacklist URLs using the censor list. The library can already blacklist/whitelist URLs by domain. I can extend it to blacklist URLs using any part of the URL.
naderman wrote:
JoshyPHP wrote:
  • No server-side syntax highlighting in code blocks.
Why not, where's the problem in doing this on view?
It's simply not a supported feature. I could hack the renderer to replace the content of [cοde=php] blocks on view. Obviously it would have a runtime cost. Personally I'd recommend using a JavaScript highlighter instead.
naderman wrote:Why?
I don't know which part you're referring to. If you don't want to limit the amount of markup processed, I can set every limit to PHP_INT_MAX.
naderman wrote:
JoshyPHP wrote:The text between list and the first * is ignored.
How much work would it be to automatically turn this into an item, or at least display it above or below the list?
To do it properly, a suprisingly large amount. The last time I looked into it I had to shelve the idea as the implementation details got too complex. Now to do it in a very limited way, it's only a matter of minutes. Once the text is parsed it's easy to select the text nodes between list and *. It would be hardcoded for those BBCodes specifically. Alternatively, I can also configure the library to not ignore that text.

Reasonably, you have 3 options:
  1. Leave it as-is. Text or markup in places where text or markup is not valid HTML (in this case, outside of list items) is ignored.
  2. Allow plain text outside of list items and in places where text is not supposed to be. No dev cost. Results in invalid HTML where used.
  3. Hardcode a way to wrap the text between list items in a list item. The text would remain plain text, e.g. BBCodes would not be parsed. Takes an hour with tests included, a bit hackish.
Personally I favor #1 but you may want to go with #2. I'd advise against #3 because it has a bad cost/benefit ratio.

User avatar
brunoais
Registered User
Posts: 964
Joined: Fri Dec 18, 2009 3:55 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by brunoais »

naderman wrote:
JoshyPHP wrote:
  • The markup inside quote's author is not parsed, e.g. quote="[b]author[/b]"
We need a solution that allows us to continue to use at least [url] there, so we can link to external sources, potentially we could add a new bbcode with a url parameter for this and then add a migration that transforms these cases, but this is very common and we need a solution for it.
JoshyPHP wrote:I can change the default BBCode to allow something like [quοte="author" url="..."]. The url attribute would be optional.
Please read:
https://area51.phpbb.com/phpBB/viewtopi ... 08&t=42481
See if you can apply it that way. I believe it would be the best way to do it. Translate the contents inside the quote such that it presents as discussed in that topic. Feel free to participate and discuss with me.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

It's possible to create the BBCode that MichaelC posted. The template would be different because it has to handle cases where there's not author or post specified, but it can be implemented.

That reminds me that in the default quote BBCode in this PR, the attribute for the quote's author is called author. When a post contains quote="..." it is interpreted as quote author="...". If the attribute needs to be changed to user it should be done early otherwise posts would need to be reparsed. I think that author is semantically correct though, because a quote's author is not always a user.

User avatar
brunoais
Registered User
Posts: 964
Joined: Fri Dec 18, 2009 3:55 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by brunoais »

"author" or "user", that's semantics and I'm perfectly OK with it.
It is more on using both as named parameters instead of having just a default and a named parameter. There's more but, for me, that's the most explicit and what I think that should be taken into account.

Post Reply