Consider Patchwork\Utf8 for unicode handling

Okin7
Registered User
Posts: 7
Joined: Thu Oct 09, 2003 10:57 am
Location: Paris, France

Consider Patchwork\Utf8 for unicode handling

Post by Okin7 »

The code is here: https://github.com/nicolas-grekas/Patchwork-UTF8

And the quick feature list:
- UTF-8 handling is extended to grapheme clusters,
- any PHP function that needs UTF-8 (grapheme cluster) awareness is carefully replicated in my lib with the very same signature (this helps a lot with documentation: "see the official PHP doc" + UTF-8),
- the implementation relies on the pcre, mbstring, iconv and intl extensions for performance,
- pure PHP implementations of the last three extensions are included as a fallback when the native ones aren't available,
- it includes Unicode normalization functions,
- the code is unit tested (not 100% yet, but that's ongoing).

I started developing this code in 2007, taking some lessons from phpBB, then evolved it to where it is today. So this request is a way for me to give something back to phpBB :)

Oleg
Posts: 1150
Joined: Tue Feb 23, 2010 2:38 am

Re: Consider Patchwork\Utf8 for unicode handling

Post by Oleg »

How is your implementation better than what phpbb currently has?

Also, the license compatibility should be carefully considered (lgpl3).

Okin7
Registered User
Posts: 7
Joined: Thu Oct 09, 2003 10:57 am
Location: Paris, France

Re: Consider Patchwork\Utf8 for unicode handling

Post by Okin7 »

For the licence, I'm thinking about changing it to APLv2+GPLv2, so if LGPLv3 is a problem, it might not be one for much longer.

For the technical merit, I don't think phpBB currently handles grapheme clusters, although that seems important for Unicode string handling. My lib does this in a portable way, relying on the intl extension when available for performance.

My implementation addresses much the same problems as the functions in the files includes/utf/utf_normalizer.php, includes/utf/utf_tools.php and includes/utf/data/recode_*.php, but it takes a different approach: instead of creating new functions/interfaces for something that already exists in a native extension (iconv, mbstring, intl), it reimplements these extensions in PHP, so that application code can use the extension's interface directly, working at full speed when the extension is available and remaining portable when it is not. This also helps a lot with documentation for these functions: instead of being forced to learn something new, a phpBB dev just has to read the official PHP doc.
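
To illustrate the approach, here is a minimal sketch of a pure-PHP fallback that is defined only when the native extension is missing. It is not the actual Patchwork\Utf8 code: the fallback body is a simplification that assumes UTF-8 input and ignores the encoding argument, and sketch_title_length() is a hypothetical caller added for illustration.

```php
<?php
// Define a fallback only when the mbstring extension is not loaded.
// Simplified sketch: assumes UTF-8 input and ignores $encoding.
if (!function_exists('mb_strlen')) {
    function mb_strlen($string, $encoding = null)
    {
        // With /u, "." matches one code point; /s lets it match newlines too.
        return (int) preg_match_all('/./us', $string);
    }
}

// Hypothetical application code: it calls the documented native interface
// directly, with no "if mbstring available ... else ..." branch of its own.
function sketch_title_length($title)
{
    return mb_strlen($title, 'UTF-8');
}

// "h", U+00E9 (two bytes in UTF-8), "l", "l", "o": five code points.
echo sketch_title_length("h\xC3\xA9llo"), "\n"; // prints 5
```

The calling code stays the same whether the native function or the fallback ends up being used, which is the point of keeping the extension's own signature.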

Last but not least, I wonder whether Patchwork\Utf8 could become shared code for many open source projects. That's at least what I hope for with this RFC for phpBB... I think the subject is quite universal, and that it requires tedious work to implement and maintain in the long run.

Okin7
Registered User
Posts: 7
Joined: Thu Oct 09, 2003 10:57 am
Location: Paris, France

Re: Consider Patchwork\Utf8 for unicode handling

Post by Okin7 »

I should add that on top of the portability given by the PHP implementation, we are of course free to create any abstraction. So if wanted, the current utf8_* functions in phpBB can be kept, but implemented without the current "if mbstring available... else..." branches that reduce the testability of your code. This is also what I do with my Patchwork\Utf8 class, which replicates a nearly complete set of the native string functions in UTF-8 grapheme units: nobody has to use it to benefit from the lib's portability feature. The only requirement is Unicode support in PCRE.

Oleg
Posts: 1150
Joined: Tue Feb 23, 2010 2:38 am

Re: Consider Patchwork\Utf8 for unicode handling

Post by Oleg »

Okin7 wrote: For the technical merit, I don't think phpBB currently handles grapheme clusters, although that seems important for Unicode string handling. My lib does this in a portable way, relying on the intl extension when available for performance.
Could you explain the practical consequences of this to someone not familiar with the details of unicode?

Okin7
Registered User
Posts: 7
Joined: Thu Oct 09, 2003 10:57 am
Location: Paris, France

Re: Consider Patchwork\Utf8 for unicode handling

Post by Okin7 »

The subject is introduced and detailed here: http://unicode.org/reports/tr29/

In Unicode, a g̈ can be encoded as two code points (two UTF-8 characters): U+0067 LATIN SMALL LETTER G followed by U+0308 COMBINING DIAERESIS.
With mbstring, if you call mb_substr("ag̈u", 1, 2, 'UTF-8'), you'll get "g̈", but everyone would expect to get "g̈u".
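
The difference can be reproduced with a short sketch that only needs PCRE with Unicode support. The helper sketch_grapheme_substr() is hypothetical and written for this illustration; it is not the library's implementation, and it only handles non-negative offsets. When the intl extension is loaded, the native grapheme_substr() covers the same case.

```php
<?php
// "a", "g" + U+0308 COMBINING DIAERESIS, "u":
// four code points, but three grapheme clusters.
$text = "ag\xCC\x88u";

// Hypothetical helper: substring counted in grapheme clusters
// instead of code points (non-negative $start only).
function sketch_grapheme_substr($string, $start, $length)
{
    // \X matches one extended grapheme cluster; /u treats the subject as UTF-8.
    preg_match_all('/\X/u', $string, $matches);
    return implode('', array_slice($matches[0], $start, $length));
}

// Code point based: returns "g" + U+0308 only, the "u" is cut off.
if (function_exists('mb_substr')) {
    echo mb_substr($text, 1, 2, 'UTF-8'), "\n";
}

// Grapheme cluster based: returns "g" + U+0308 followed by "u".
echo sketch_grapheme_substr($text, 1, 2), "\n";
```

Splitting the whole string on every call is wasteful for long inputs; the sketch favours clarity over speed.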

Okin7
Registered User
Posts: 7
Joined: Thu Oct 09, 2003 10:57 am
Location: Paris, France

Re: Consider Patchwork\Utf8 for unicode handling

Post by Okin7 »

Patchwork\Utf8 has reached version 1.1.5, with better packaging, especially now that the Laravel 4 framework has made it one of its dependencies.
