[RFC] Integrate s9e\TextFormatter

These requests for comments/change have lead to an implemented feature that has been successfully merged into the 3.2/Rhea branch. Everything listed in this forum will be available in phpBB 3.2.
Post Reply
User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

[RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

Ticket: PHPBB3-11768
PR: #3461
--------------------

ImageBranch @ GitHubPull requestOld pull request

Synopsis

s9e\TextFormatter, a text formatting library that supports BBCodes and most other features found in forums, implemented as plugins. I'm its author. This RFC is about replacing the current BBCode/smilies/autolinking/censor routines (hereinafter referred to as legacy) with s9e\TextFormatter. I originally wanted to focus on providing support from the s9e\TextFormatter side of things while leaving the phpBB-centric code to somebody more proficient in it but I've recently come to the conlusion I should work on integration directly. This is why I'm currently implementing this RFC.

The goal is to close as many BBCodes-related bugs as possible and get to an equivalent level of functionality, backed with decent testing and within a reasonable time frame. As far as schedule goes, my goal is to be done before September 2013. I intentionally leave out other improvements (other than the more robust BBCode parser) for future RFCs because I don't want this RFC to die of inactivity or lack of consensus.

This RFC does not replace the legacy routines, it complements them. Old messages are still handled by the old code, but new messages are parsed and displayed by the new text formatter. Editing an old message will transparently convert it to the new format.

Altered functions/methods
  • acp_bbcodes::main()
  • acp_icons::main()
  • acp_words::main()
  • decode_message()
  • generate_text_for_display()
  • generate_text_for_storage()
  • strip_bbcode()
  • message_parser::parse()
Differences from 3.1
  • The markup inside quote's author is not parsed, e.g. [quote="[b]author[/b]"] -- One exception: up to 1 url tag is supported
  • Up to 1 blank line around block-level elements (quotes, lists, etc...) is ignored. In 3.1, if there's a blank line after a quote it is rendered as two <br>.
  • Censored words are marked at posting time. Changing the words list does not affect old posts. The censor can still be turned off at rendering time. Censoring does not apply to BBCode attributes. That means you can't prevent links to a given site by censoring its domain name.
  • No server-side syntax highlighting in code blocks.
  • Default limits are set on the number of occurence of individual BBCodes in text (1000 of each), global number of BBCodes in text (10000) and nesting level (10.) Same thing for smilies, the default is 1000 smilies per post. Past this limit, the markup is ignored.
  • The text between list and the first * is ignored.
  • In the following text, the [quote] BBCode is ignored:

    Code: Select all

    [b]...[quote]...[/quote]...[/b]
    This can be changed to be automatically interpreted as either of the following:

    Code: Select all

    [b]...[/b][quote]...[/quote]...[/b]
    [b]...[/b][quote][b]...[/b][/quote][b]...[/b]
  • There can't be BBCodes named E or CENSOR because those names are used for smilies and censored words respectively. This can be changed to something else, including namespaced names such as e:e or foo:bar. See the discussion.
Bugs ( fixed, pending) [ regression tests ]

PHPBB3-3981 - URL BBCode does not support IDN domains (only if idn_to_ascii() is available :!:)
PHPBB3-7187 - Quote smilies error
PHPBB3-7275 - Custom bbodes trim('${1}')
PHPBB3-8419 - custom tag eats up space character
PHPBB3-8420 - emoticon removes space before itself when using preview (see decode_message_test.php)
PHPBB3-9377 - Custom BB Code Nesting
PHPBB3-10002 - (incomplete) BBCode usage of [quоte] and [lіst] forces closing [/lіst] and [/quоte]s, ultimately breaking HTML/Design
PHPBB3-10122 - [list=] should support "none", along with CSS2 types
PHPBB3-10425 - URLs contains non ANSII characters are rejected and not recognize as URLs
PHPBB3-10587 - URL Bug in phpBB 3
PHPBB3-10922 - Allow parameters for [email] BBCode content instead of addresses only (Edit: not currently fixed, must have misread the description)
PHPBB3-10989 - Bug in BBCode
PHPBB3-11153 - Custom BBCode token {EMAIL} subpattern are captured - token can never be used as parameter
PHPBB3-12195 - Double-slash URLs not supported
PHPBB3-13555 - Poll options preview rendered incorrectly on <br /> collision

TODO

Legacy messages are preserved and displayed with the legacy routines
Legacy messages can be edited (but then they're parsed with s9e\TextFormatter)
New messages are parsed with s9e\TextFormatter
New messages are properly displayed in viewtopic
Editing BBCodes, smilies and censored words in the ACP regenerates the new parser
Default BBCodes are supported: B, CODE, COLOR, EMAIL, FLASH, I, IMG, LIST, *, QUOTE, SIZE, U, URL
Per-style BBCode templates
Custom BBCodes are supported, provided they are safe to use (e.g. no XSS)
[attachment] BBCode
Posting limits
  • max_*_font_size
  • max_*_img_height / max_*_img_width applied to [img]
  • max_*_img_height / max_*_img_width applied to [flash]
  • max_*_smilies
  • max_*_urls
{LOCAL_URL} doesn't prepend the board's URL
Tabs in code blocks are preserved, legacy routines replaces them. This might be the occasion to change this behaviour, preserve tabs and use a JavaScript syntax highlighter
Handles viewcensors, viewflash, viewimg and viewsmilies
Takes care of toggling BBCodes, magic URLs, smilies, IMG BBCode, FLASH BBCode, QUOTE BBCode, and URLs in parse_message::parse()
Cron job that removes old renderers from the cache Nope, piggyback on the "Purge all cache" button in the ACP instead.
Error message when using an unauthorized BBCode in poll options
strip_bbcode()
Polls
Look into how text is cleaned up for the search engine
Copy the tests from tests/bbcode/parser_test.php and instead of testing the parsed text, test the rendered test. Mostly done
That path_in_domain() thing
:?: Many other things I forgot to list and that you should remind me
Last edited by JoshyPHP on Mon Mar 02, 2015 11:08 pm, edited 47 times in total.

User avatar
DavidIQ
Customisations Team Leader
Customisations Team Leader
Posts: 1904
Joined: Thu Mar 02, 2006 4:29 pm
Location: Earth
Contact:

Re: [RFC] Integrate s9e\TextFormatter

Post by DavidIQ »

This seems very similar to the work discussed and done here:
viewtopic.php?f=108&t=33021

Will this effectively replace whatever was attempted there?
Image

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

You could consider it a superset in a way. It would supersede it.

Also, I omitted it but while this RFC is limited to just fixing bugs and having a more robust parser, the long term benefit is that you get new features for almost free. Extended custom BBCodes syntax, better XSS detection, extended templating, an HTML whitelist without the fear of XSS, a better way to embed media and others. All of that is code that already exists and that's already under continuous testing. (in my sig)
Last edited by JoshyPHP on Tue Jan 14, 2014 9:19 pm, edited 1 time in total.

User avatar
EXreaction
Registered User
Posts: 1555
Joined: Sat Sep 10, 2005 2:15 am

Re: [RFC] Integrate s9e\TextFormatter

Post by EXreaction »

Great to see this is being worked on. I haven't seen a pull request for the work so far. Would you mind creating a pull request (labeled WIP if it's not yet complete) against develop? It would be nice to watch the progress on this. :)

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

I've created a WIP pull request. It might be noisy because I try to commit regularly and I often push so that Travis can keep an eye on it.

I'm currently looking into how phpBB's tests work, especially wrt database. Also, I'm not sure where which tests should go so for the time being I put them where it makes sense to me and I'll set aside some time to organize them properly.

User avatar
EXreaction
Registered User
Posts: 1555
Joined: Sat Sep 10, 2005 2:15 am

Re: [RFC] Integrate s9e\TextFormatter

Post by EXreaction »

I've looked over it briefly and left some comments. The location of the tests looks fine as the PR is now.

User avatar
MattF
Extension Customisations
Extension Customisations
Posts: 675
Joined: Mon Mar 08, 2010 9:18 am

Re: [RFC] Integrate s9e\TextFormatter

Post by MattF »

This seems very exciting!
Has an irascible disposition.

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

Alright, I have a few questions for which I couldn't find a definite answer. It's about tests:
  • I'm dumping compiled templates to cache/ -- can I hardcode the .php extension or do I have to use %core.php_ext%/$phpEx ?
  • the normal code saves files to cache/ -- where should those files be saved during testing?
    • same as production
    • in the normal cache/ dir but with a different extension
    • sys_get_temp_dir()
    • use vfsStream instead
    • something else?
  • what's the proper type hint for $cache? Or rather, is there a proper type hint for the cache service that also work with the mock cache?
Also, I noticed that most files use dirname(__FILE__) instead of __DIR__. I assume this is legacy code predating PHP 5.3 and I use __DIR__ in my code.

Finally, status update: I'm currently working on refactoring the new code as a proper service. I don't intend to refactor the legacy code as a service, but I'm trying to account for that possibility so that it can be done later on.

User avatar
EXreaction
Registered User
Posts: 1555
Joined: Sat Sep 10, 2005 2:15 am

Re: [RFC] Integrate s9e\TextFormatter

Post by EXreaction »

If the library dumps the files itself and you only specify the directory, the extension is not important.

For testing, you can either store them in the cache directory or in tests/tmp/.

The mock cache is a driver, not the cache service. You can either setup the service with that as the driver or just pass the driver if that's all you need and use the driver interface or the service to typehint.

The old BBCode system can be removed from the system at a later time in a different PR. I wouldn't refactor any of that code. Most of the phpBB code that uses the BBCode parser should be using the generate_text_for_ functions (brunoais is working on this). I would modify those functions to parse with your system instead and not modify any areas that parse BBCode directly as those may be transitioned to the generate_text_for functions before your PR is merged (it would just cause conflicts).

User avatar
JoshyPHP
Registered User
Posts: 381
Joined: Fri Jul 08, 2011 9:43 pm

Re: [RFC] Integrate s9e\TextFormatter

Post by JoshyPHP »

Alright, I've changed the cache dir to point to tests/tmp/ and I'm typehinting the cache driver interface, as that's what other services seem to do. I don't intend to touch the legacy code, so no worries there.

I've just pushed a big update that implements the new text formatting as a service. Or rather, as services. When formatting text, the three big operations are parsing, unparsing, and rendering. In the legacy code, they're implemented as the global functions generate_text_for_(storage|edit|display). Here, they are implemented as 3 services: text_formatter.parser, text_formatter.unparser, text_formatter.renderer. Those 3 services correspond to 3 abstract classes detailed below, implemented by 3 concrete classes that act as adapters for the s9e\TextFormatter library. I used abstract classes rather than pure interfaces because I wanted to keep as much of the logic (e.g. what the default options are based on the user/config/auth) out of the adapters as possible.

text_formatter.parser
  • parse($text)
  • disable_bbcode($name) / enable_bbcode($name)
  • disable_bbcodes() / enable_bbcodes()
  • disable_censor() / enable_censor()
  • disable_magic_url() / enable_magic_url()
  • disable_smilies() / enable_smilies()
  • get_errors()
  • set_var($name, $value) / set_vars(array $vars)*
text_formatter.renderer
  • render($text)
  • configure_smilies_path($phpbb_root_path, phpbb_config $config)*
  • configure_user(phpbb_user $user, phpbb_config $config, phpbb_auth $auth)*
  • set_smilies_path($path)
  • set_viewcensors($value)
  • set_viewflash($value)
  • set_viewimg($value)
  • set_viewsmilies($value)
text_formatter.unparser
  • unparse($text)
I'm trying to keep any code specific to s9e\TextFormatter in the adapters, although there's a couple of operations that I haven't standardized: cache invalidation (e.g. whenever BBCodes change) and detecting whether a message should be displayed by the legacy code or the new one. (for the time being, I've hardcoded a simple regexp)

I've also created a new test case helper, to create those services in a couple of lines.

Code: Select all

$phpbb_container = new phpbb_mock_container_builder;
// ...here you can optionally add other services such as user or auth...
$this->get_test_case_helpers()->set_s9e_services($phpbb_container);
$phpbb_container->get('text_formatter.parser')->parse($text) 
It may not look like much, but I can understand why there were almost no existing tests covering text formatting, because setting up a working parser is actually quite tedious. Now that I can do that more easily I'll be able to move forward with my TODO list without spending as much time figuring out tests.

My next item is testing whether the posting limits (e.g. image dimensions, number of URLs and so on... there's a dozen of them) are enforced. And while I'm doing that, I'd like to discuss per-style templates. First off, does anybody even implement custom BBCode templates? The BBCodes admin panel doesn't offer a way to create multiple templates, so I guess that only the base BBCodes are styled with custom templates? I don't know much about phpBB styles, so I'd need more info about that.
Last edited by JoshyPHP on Wed Aug 14, 2013 2:36 am, edited 1 time in total.

Post Reply