The code is here: https://github.com/nicolas-grekas/Patchwork-UTF8
And the quick feature list:
- UTF-8 handling is extended to grapheme clusters,
- any PHP function that needs UTF-8 (grapheme cluster) awareness is carefully replicated in my lib with the very same signature (this helps a lot with documentation: "see the official PHP doc" + UTF-8),
- the implementation relies on the pcre, mbstring, iconv and intl extensions for performance,
- pure PHP implementations of these last 3 extensions are included as a fallback when the native ones aren't available,
- Unicode Normalization functions are included,
- the code is unit tested (not 100% yet, but that's ongoing).
I started developing this code in 2007, taking some lessons from phpbb, then evolved it to where it is today. So this request is a way for me to give something back to phpbb.
Consider Patchwork\Utf8 for unicode handling
Re: Consider Patchwork\Utf8 for unicode handling
How is your implementation better than what phpbb currently has?
Also, the license compatibility should be carefully considered (lgpl3).
Re: Consider Patchwork\Utf8 for unicode handling
As for the licence, I'm thinking about changing it to APLv2+GPLv2, so if LGPLv3 is a problem, it might not be one for much longer.
As for the technical merit, I don't think phpbb currently handles grapheme clusters, although that seems important for unicode string handling. My lib does this in a portable way, relying on the intl extension when available for performance.
My implementation addresses much the same problems as the functions in the files includes/utf/utf_normalizer.php, includes/utf/utf_tools.php, includes/utf/data/recode_*.php
but it takes a different approach: instead of creating new functions/interfaces for something that already exists in a native extension (iconv, mbstring, intl), it reimplements these extensions in PHP, so that application code can directly use the extension's interface, working at full speed when the extension is available and still being portable when it is not. This also helps a lot with documentation for these functions: instead of being forced to learn something new, the phpbb dev just has to read the official PHP doc.
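To make the approach concrete, here is a minimal sketch of the fallback idea. It is only an illustration, not the code that ships in the lib: the helper name sketch_utf8_strlen is hypothetical, and only mb_strlen is covered.

```php
<?php
// Hypothetical pure PHP helper (not the library's actual implementation).
// It counts code points: every byte that is not a UTF-8 continuation
// byte (0x80-0xBF) starts a new code point. Assumes valid UTF-8 input.
function sketch_utf8_strlen($s)
{
    return preg_match_all('/[^\x80-\xBF]/', $s);
}

// Define the extension's own function name only when mbstring is missing,
// so application code calls mb_strlen() either way.
if (!function_exists('mb_strlen')) {
    function mb_strlen($s, $encoding = 'UTF-8')
    {
        return sketch_utf8_strlen($s);
    }
}

// "a", "g", U+0308 COMBINING DIAERESIS, "u": four code points.
echo mb_strlen("ag\u{0308}u", 'UTF-8'), "\n";
```

Application code then only ever calls mb_strlen(), whether the native extension is loaded or not.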
Last but not least, I wonder if Patchwork\Utf8 could become shared code for many open source projects. That's at least what I hope for with this rfc for phpbb... I think the subject is quite universal, and that it requires tedious work to implement and maintain in the long run.
Re: Consider Patchwork\Utf8 for unicode handling
I should add that on top of the portability given by the PHP implementation, we are of course free to create any abstraction. So if wanted, the current utf8_* functions in phpbb can be kept, but implemented without the current "if mbstring available... else..." branches that reduce the testability of your code. This is also what I do with my Patchwork\Utf8 class, which replicates an almost complete set of the native string functions in UTF-8 grapheme units: nobody has to use it to benefit from the lib's portability feature. The only requirement is PCRE compiled with unicode support.
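As a rough sketch of what such a kept helper could look like, assuming mb_substr is always callable (natively or through a PHP fallback). The name sketch_utf8_substr is hypothetical and is neither phpbb's nor the lib's actual function.

```php
<?php
// Hypothetical wrapper: there is no "if mbstring available... else..."
// branch, because mb_substr is assumed to exist in every environment
// (native extension or pure PHP fallback).
function sketch_utf8_substr($str, $offset, $length = null)
{
    return mb_substr($str, $offset, $length, 'UTF-8');
}

// Slices by code points: three of them, starting at offset 1.
echo sketch_utf8_substr("h\u{00E9}llo", 1, 3), "\n";
```

With a single code path, the helper can be unit tested the same way in every environment.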
Re: Consider Patchwork\Utf8 for unicode handling
Okin7 wrote: As for the technical merit, I don't think phpbb currently handles grapheme clusters, although that seems important for unicode string handling. My lib does this in a portable way, relying on the intl extension when available for performance.
Could you explain the practical consequences of this to someone not familiar with the details of unicode?
Re: Consider Patchwork\Utf8 for unicode handling
The subject is introduced and detailed here: http://unicode.org/reports/tr29/
In unicode, a g̈ can be encoded as 2 code points (two UTF-8 characters): U+0067 LATIN SMALL LETTER G followed by U+0308 COMBINING DIAERESIS.
With mbstring, mb_substr("ag̈u", 1, 2, 'UTF-8') returns "g̈", because it counts code points, whereas everyone would expect to get "g̈u".
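The difference can be reproduced with a short script. This is only a sketch: it assumes mbstring (or a fallback for it) and PCRE with unicode support, and it uses the \X pattern to split on grapheme clusters rather than the lib's own functions.

```php
<?php
// "a", then "g" + U+0308 COMBINING DIAERESIS, then "u":
// 4 code points, but 3 grapheme clusters.
$s = "ag\u{0308}u";

// Code point slicing: offset 1, length 2 covers "g" and U+0308 only.
$byCodePoint = mb_substr($s, 1, 2, 'UTF-8');

// Grapheme cluster slicing: \X matches one extended grapheme cluster,
// so "g" and its combining diaeresis stay together as one unit.
preg_match_all('/\X/u', $s, $m);
$byGrapheme = implode('', array_slice($m[0], 1, 2));

echo $byCodePoint, "\n";
echo $byGrapheme, "\n";
```

The first slice stops after the combining mark, while the second keeps the trailing "u".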
Re: Consider Patchwork\Utf8 for unicode handling
Patchwork\Utf8 has reached version 1.1.5, which mostly brings better packaging, especially now that the Laravel 4 framework has made it one of its dependencies.