Automatic UTF-8 normalization

igorw
Registered User
Posts: 500
Joined: Thu Jan 04, 2007 11:47 pm

Post by igorw »

This could be made into an RFC at some point, but I'll leave it as discussion for now.

Currently (in 3.0) you have to call utf8_normalize_nfc() on every Unicode request string. At the same time, we already tell request_var() whether the value is Unicode or not. In code that means:

$input = utf8_normalize_nfc(request_var('input', '', true));

Not only is that a lot to type, it is also redundant: you are supplying the same information twice. It would be easier to just write:

$input = request_var('input', '', true);

The utf8_normalize_nfc() call would then move into request_var() and run whenever the $multibyte argument is set to true.
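
A rough sketch of what that could look like. The function name request_var_normalized() and the $source array parameter are made up for illustration (the real request_var() reads from the superglobals and does a lot more type handling), and Normalizer from PHP's intl extension stands in for utf8_normalize_nfc() so the snippet runs outside phpBB:

```php
<?php
// Hypothetical stand-in for request_var(); NOT the actual phpBB implementation.
// Assumes the intl extension (class Normalizer) is loaded.
function request_var_normalized(array $source, $var_name, $default, $multibyte = false)
{
    if (!isset($source[$var_name]) || !is_string($source[$var_name]))
    {
        // The default comes from our own code, so it is returned untouched.
        return $default;
    }

    $value = trim($source[$var_name]);

    if ($multibyte)
    {
        // Normalize once, centrally, instead of at every call site.
        $normalized = Normalizer::normalize($value, Normalizer::FORM_C);

        // normalize() does not return a string for invalid input;
        // this sketch falls back to the default in that case.
        $value = is_string($normalized) ? $normalized : $default;
    }

    return $value;
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
$input = request_var_normalized(array('input' => "e\xCC\x81"), 'input', '', true);

// Compare against the precomposed U+00E9
var_dump($input === "\xC3\xA9");
```

With something like this in place, callers would only pass true for $multibyte and never touch the normalizer directly.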

Would there be any reason for this not to be feasible? One possible issue would be future bug fix merging, but considering request_var is being totally rewritten anyway it may not make much difference.

naderman
Consultant
Posts: 1727
Joined: Sun Jan 11, 2004 2:11 am
Location: Karlsruhe, Germany

Re: Automatic UTF-8 normalization

Post by naderman »

The idea for 3.1 is to keep things backward compatible. The new request class I'm working on has a 100% compatible request_var function. How do you propose we make this change backward compatible? I guess we could leave request_var as is and have the method on the class normalize automatically?

Re: Automatic UTF-8 normalization

Post by igorw »

It's a bit tricky, actually. The most logical place to put it is type_cast_helper::set_var. This also has the advantage that only user input is normalized. Currently the $default argument of request_var also goes through normalization (it should already be normalized).

You could let request_var make both request::variable and set_var aware that the data will be normalized manually. But since normalizing a string twice should (afaik) not do any harm, you might as well just add it to set_var and let the old code normalize it a second time. That would be a bit slower but less obtrusive. My "no harm" claim would need to be proven first, though; I'm no Unicode guru.

Re: Automatic UTF-8 normalization

Post by naderman »

Indeed, you are correct: we could just apply it twice, as per http://unicode.org/reports/tr15/#Design_Goals
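
For anyone who wants to check this locally, here is a quick sketch. It uses PHP's intl Normalizer, which is an assumption on my part: phpBB 3.0's utf8_normalize_nfc() is a separate implementation, so this only illustrates the property, not phpBB's own code. The helper name nfc_is_idempotent() is made up for the example.

```php
<?php
// Assumes the intl extension (class Normalizer) is loaded.
// nfc_is_idempotent() is a hypothetical helper used only for this check.
function nfc_is_idempotent($string)
{
    $once  = Normalizer::normalize($string, Normalizer::FORM_C);
    $twice = Normalizer::normalize($once, Normalizer::FORM_C);

    // A second pass must leave already normalized text unchanged.
    return $once === $twice;
}

$samples = array(
    "e\xCC\x81",     // e + U+0301 COMBINING ACUTE ACCENT
    "\xC3\xA9",      // precomposed U+00E9
    "\xE2\x84\xAB",  // U+212B ANGSTROM SIGN
    "plain ascii",
);

foreach ($samples as $sample)
{
    var_dump(nfc_is_idempotent($sample));
}
```

So old code that still calls the normalizer on a value that was already normalized should get the same string back, just at the cost of a second pass.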

imkingdavid
Registered User
Posts: 1050
Joined: Thu Jul 30, 2009 12:06 pm

Re: Automatic UTF-8 normalization

Post by imkingdavid »

This would definitely be a nice change so that I don't have to keep normalizing all the user input. That function takes a long time to type, especially when using it multiple times. :lol:

Anyway, I agree with eviL3 on the placement, and it should be fine to apply it twice.

Re: Automatic UTF-8 normalization

Post by igorw »

On a slightly related note, PHP 5.3.0 has native support for Unicode normalization through the intl extension. See Normalizer::normalize(). It's most likely more efficient than the current pure-PHP implementation. Considering there are already fallback wrappers for existing UTF-8 functions, normalization could be treated the same way.
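
A minimal sketch of such a wrapper, just to show the shape. The function name normalize_nfc_with_fallback() and the $fallback callback are hypothetical; inside phpBB the fallback would presumably be the existing PHP implementation, which is not reproduced here.

```php
<?php
// Hypothetical wrapper; the name and the $fallback parameter are
// illustration only and do not exist in phpBB.
function normalize_nfc_with_fallback($string, $fallback)
{
    // Prefer the native implementation when the intl extension is available.
    if (class_exists('Normalizer'))
    {
        $result = Normalizer::normalize($string, Normalizer::FORM_C);

        if (is_string($result))
        {
            return $result;
        }
        // Native normalization failed (e.g. invalid UTF-8); as a design
        // choice of this sketch, let the fallback decide what to do.
    }

    return call_user_func($fallback, $string);
}

// Placeholder fallback that returns its input unchanged; a real one
// would be the pure-PHP normalizer.
$identity = function ($string) {
    return $string;
};

var_dump(normalize_nfc_with_fallback("plain ascii", $identity));
```

Call sites would not need to know which implementation ran, which is the same arrangement the other UTF-8 fallback wrappers use.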

Re: Automatic UTF-8 normalization

Post by igorw »

I've added some changes, based on Nils' request-class branch. Both testing and feedback would be great.

http://github.com/evil3/phpbb3/compare/ ... hancements

A_Jelly_Doughnut
Registered User
Posts: 1780
Joined: Wed Jun 04, 2003 4:23 pm

Re: Automatic UTF-8 normalization

Post by A_Jelly_Doughnut »

A lot of that code looks quite familiar (from Nils and "old_trunk" :))

I'd probably rather see the normalizer/unicode changes in a separate branch from the request class, which I think we want to water down (probably get rid of the code that disables the superglobals).

It would probably be good for the native version of utf8_normalize_nfc() to check Normalizer::isNormalized() before re-normalizing. There's no way we're going to remove every call to utf8_normalize_nfc() scattered around the code for 3.1, and I imagine that checking isNormalized() is faster than normalizing again. (This may already be done inside Normalizer::normalize(); I've not checked.)
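
Roughly what I have in mind, as a sketch only. The function name normalize_nfc_if_needed() is made up, it assumes the intl extension is loaded, and whether the extra check actually beats calling normalize() unconditionally would need to be measured.

```php
<?php
// Hypothetical helper; assumes the intl extension (class Normalizer).
function normalize_nfc_if_needed($string)
{
    // Skip the normalization pass when the string is already in NFC.
    if (Normalizer::isNormalized($string, Normalizer::FORM_C))
    {
        return $string;
    }

    // Note: isNormalized() also reports false for invalid UTF-8, in which
    // case normalize() will not return a string. Callers must handle that.
    return Normalizer::normalize($string, Normalizer::FORM_C);
}

// Already normalized input is handed back as-is ...
var_dump(normalize_nfc_if_needed("\xC3\xA9") === "\xC3\xA9");

// ... while decomposed input (e + U+0301) is composed to U+00E9.
var_dump(normalize_nfc_if_needed("e\xCC\x81") === "\xC3\xA9");
```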

Re: Automatic UTF-8 normalization

Post by igorw »

Only the two latest commits are by me, the rest is Nils'. ;)

Re: Automatic UTF-8 normalization

Post by A_Jelly_Doughnut »

Ah, I see. You based it on /naderman/phpbb3/feature/request-class/ rather than one of the phpbb3/phpbb3/ branches.
