[RFC] Integrate s9e\TextFormatter

These requests for comments/change have lead to an implemented feature that has been successfully merged into the 3.2/Rhea branch. Everything listed in this forum will be available in phpBB 3.2.
Post Reply

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

I'm looking into the regexp used for smilies. The legacy regexp looks like this:

Code: Select all

#(?<=^|[\n .])(?:smilies go there)(?![^<>]*>)#u
Obviously, the second assertion (negative lookahead) is there to exclude smilies found inside of an HTML tag. That doesn't apply to s9e\TextFormatter so it can be ignored.

The first assertion (lookbehind) ensures that each smiley is either at the start of the text or preceded by either one of: a new line, a space, or a literal dot. Is this exactly the intended behaviour or is it there for other technical reasons? By default, the s9e\TextFormatter's Emoticons plugin replaces smilies wherever they are in plain text but it can be configured to include assertions before/after the smilies.

User avatar
EXreaction
Registered User
Posts: 1555
Joined: Sat Sep 10, 2005 2:15 am

Re: [RFC] Integrate s9e\TextFormatter

Post by EXreaction »

That is the intended behavior.

Custom smilies can be added to be anything, e.g. "o)" which shouldn't match "(to)"

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

Sure. The literal dot looks fishy though, I can't figure out what use case it's supposed to cover. FWIW I've tracked it down to a commit from 9 years ago but that didn't help.

I have added the negative lookbehind assertion (?<![^\n .\]]) which covers those same characters plus the right square bracket because of PHPBB3-7187 and PHPBB3-7275.
Last edited by JoshyPHP on Mon Sep 23, 2013 12:50 am, edited 1 time in total.

User avatar
imkingdavid
Registered User
Posts: 1050
Joined: Thu Jul 30, 2009 12:06 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by imkingdavid »

I'd guess the literal dot is there to let you put a smilie directly after a sentence.:)
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

I haven't talked about the word censor in this thread yet and it's a bit overdue. There is a number of differences between the way the Censor plugin works in s9e\TextFormatter, and how the current phpBB censors words. I'm going to describe those differences and outline some possible changes.

First I'll preface it by saying that in phpBB, the censor is used a lot. On this very page, censor_text() has probably been called 20 times. At the time of this writing, in current PR the Censor plugin only applies to messages/post bodies. Everywhere else is still handled by the old censor_text(), which I have left untouched.

The Censor plugin accepts jokers in censored words. * becomes the PCRE expression [\pL\pN]*, ? becomes . and a space becomes \s*. The rules for word boundaries are about the same as phpBB's. Words are case-insensitive. The regexp uses the Unicode mode unconditionally. As mentionned on the first page, the viewcensors setting is respected, as are the allow_nocensors setting and u_chgcensors permission.

The Censor plugin, like any other plugin, is run at parsing time and uses tags to mark the text it replaces. That last part is just a technical detail, but it means that censoring is treated like any other markup/feature. For instance, it will not apply to text that is consumed by another tag, meaning it will not apply to an attribute value such as an URL. If you censor "example" with "banana", http://example.org will display http://banana.org but it will still link to example.org because the replacement only applies to the text, not the attribute. Note that you can ban hosts using the URL features. Censor's tags obey rules like any other tags. The practical implication is that Censor only applies where tags are allowed. It won't, for example, apply inside of a code block.

The Censor plugin is executed at parsing time. That means the size of the list of censored words does not affect the performance in viewtopic (and admins who use the censor tend to have a pretty big list because of users evading it) but that also means that the list of censored words is "locked in" when the message is posted. Changes in the list of censored words are not reflected in old messages unless they are edited/reparsed. I don't know how much of a concern it is for phpBB. Censor lists are almost never updated so most of the time it makes no difference, but on the other hand it sucks if you add a word to the list, go back to a topic and see the word still uncensored. There's a bit of "WTF" factor in this case.

The Censor plugin comes with an helper class. It has 3 public methods: censorText(), censorHtml() and reparse(). The last one can be used on a parsed text to update it to a new list of censored words. It's meant to produce the same result as unparsing/reparsing the text but it's much faster. As a point of reference, properly parsing this post on my computer (using that list) takes ~3ms while reparsing it with the Censor helper only takes ~0.7ms. The other two methods are similar to phpBB's own censor_text(). They take either plain text or HTML and perform word replacements. censorHtml() only performs replacements in text nodes, preserving HTML elements. They could be used to replace censor_text(). They're generally faster than censor_text() and scale better with the number of censored words because censor_text() uses one preg_replace() with an array of regexps (one per word) whereas the Censor plugin uses one preg_match_all() then reparses each hit individually. The first approach scales better with the number of hits, the second approach scales better with the number of words. Since the vast majority of posts do not use censored words, the Censor helper is generally more performant than censor_text(). With that said, I'm not sure replacing censor_text() is worth the trouble.

User avatar
brunoais
Registered User
Posts: 964
Joined: Fri Dec 18, 2009 3:55 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by brunoais »

I think we can work with this using a "config" variable (or equivalent) that is global that contains the last time when censored words were written and a value in the posts that tells when they were last censored. With a simple "if" and those two things it should be enough to make sure that the post appears as up-to-date according to censoring.
Also, if the posts are censored at "compile" time how does it show the post uncensored for the moderators if they want to see it uncensored?

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

It is possible to implement something like that, where posts are censored dynamically if they're older than the word list, but I'm not sure the better performance justifies the increased complexity. generate_text_for_display() is where that logic should be but it doesn't know when the message was parsed so either its signature would need to be changed to allow passing the message's timestamp, or the timestamp should be included in the XML representation. generate_text_for_display() already has 5 arguments and I think phpBB should go towards less complexity when possible so in my opinion the choice is really between using the censor completely live when the message is displayed or when the message is posted.
brunoais wrote:Also, if the posts are censored at "compile" time how does it show the post uncensored for the moderators if they want to see it uncensored?
The template displays the original word if $S_VIEWCENSORS is falsy and the censored replacement otherwise. Parsing is never destructive, the XML representation of the parsed text contains both the original content and data from plugins. A censored word looks like this: <CENSOR with="replacement">bad word</CENSOR>.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

There's been some activity in the PR discussion tab. GitHub's great for discussing line changes, but pretty bad for discussing design. Here are some points that deserve more attention.

The first is unsafe BBCodes. In phpBB it's possible create unsafe BBCodes if the admin chooses to bypass the red warning. I haven't touched the ACP (I've added a line to invalidate the cache but that's all) so that part remains the same: click red button :arrow: unsafe BBCode gets stored in the database. The difference here is that BBCodes' safety is tested everytime the parser is configured, unsafe BBCodes cause a s9e\TextFormatter\Configurator\Exceptions\UnsafeTemplateException to be thrown and there's currently no provision to prevent that from happening. At some point in the future, the ACP should be modified to check BBCodes' safety using s9e\TextFormatter's template checker to properly deal with the issue BUT doing so may also means losing the ability to render old posts. For the time being, it's best to leave it alone and revisit the issue later on.

Second point: phpbb\path_helper. Since I wrote the code in this PR a number of changes have been made to phpBB (which, btw, is the reason I want to modify as few files as possible: to minimize conflicts maintenance.) One of them saw the creation of a phpbb\path_helper object. I should be using in the code that deals with smilies but I can't do that until I can properly mock it in tests. I could copy/paste the code from phpbb_path_helper_web_root_path_test but that's something that's going to be needed in more than a couple of tests. Wouldn't it make sense to create a test helper that mocks path_helper globally and/or in a container instead?

Finally, there's a particularity that I need to document. To keep it simple I won't explain the technical details unless somebody asks for them and give you the short version: internally, smilies use a tag named "E" and censored words use a tag named "CENSOR". The practical implication is that you can't straight-up create a BBCode like [E] or [CENSOR]. If you want a BBCode named E you must either create it with a different tag name, e.g. [E $tagName=foo]{TEXT}[/E] or use a different tag name for smilies such as XYZFOOBAR. Just know that in the latter case, the longer name costs an additional 16 more bytes of storage per smiley per post. IMO, it's a situation rare enough to simply exist as a KB article.

User avatar
brunoais
Registered User
Posts: 964
Joined: Fri Dec 18, 2009 3:55 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by brunoais »

About the [E] tag, does it also affect if the admin wants to use an [e] tag?
Why not use an "invalid" tag value for example:
[_]
and
[__]
?
Would TextFormatter be unable to coop with that?

Post Reply