phpBB

Development Discussion Board

phpBB's testing ground of bleeding edge code
Automatic UTF-8 normalization

General discussion of development ideas and the approaches taken in the 3.x branch of phpBB. The next feature release of phpBB 3 will be 3.1/Ascraeus followed by 3.2/Arsia.

Post by igorw » Sat Apr 03, 2010 8:55 pm

This could be made into an RFC at some point, but I'll leave it as discussion for now.

Currently (in 3.0) you have to call utf8_normalize_nfc() on every Unicode request string, even though request_var() is already told whether the value is Unicode or not. In code that means:
Code:
$input = utf8_normalize_nfc(request_var('input', '', true));

Not only is it a lot to type, it is redundant. You are supplying the same information twice. It would be easier to just do:
Code:
$input = request_var('input', '', true);

That would mean moving the utf8_normalize_nfc() call into request_var(), so that it is applied whenever the $multibyte argument is set to true.
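
A minimal sketch of that idea, not the actual phpBB code: sketch_request_var() and sketch_normalize_nfc() are hypothetical stand-ins. The sketch reads from a plain array instead of the superglobals, handles strings only, and assumes the intl extension's Normalizer class for the normalization step (without it the string is returned untouched).

```php
<?php
// Hypothetical stand-in for utf8_normalize_nfc(): delegates to the intl
// extension's Normalizer when it is available.
function sketch_normalize_nfc($str)
{
    if (class_exists('Normalizer')) {
        $result = Normalizer::normalize($str, Normalizer::FORM_C);
        // normalize() returns false on invalid input; keep the original then
        return ($result === false) ? $str : $result;
    }
    return $str; // no intl: left untouched in this sketch
}

// Hypothetical, simplified request_var(): $source plays the role of the
// request data, and only string values are handled.
function sketch_request_var(array $source, $name, $default, $multibyte = false)
{
    if (!isset($source[$name]) || !is_string($source[$name])) {
        return $default;
    }

    $value = trim($source[$name]);

    if ($multibyte) {
        // The caller already said "this is Unicode", so normalize here
        // instead of making every caller wrap the result.
        return sketch_normalize_nfc($value);
    }

    // Assumption for this sketch: non-multibyte input is reduced to ASCII
    // by replacing every high byte with "?".
    return preg_replace('/[\x80-\xFF]/', '?', $value);
}

$input = sketch_request_var(array('input' => "cafe\xCC\x81"), 'input', '', true);
```

With this shape the caller states "multibyte" once, and only the submitted value is normalized; the $default is returned as-is.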

Would there be any reason for this not to be feasible? One possible issue would be future bug fix merging, but considering request_var is being totally rewritten anyway it may not make much difference.
User avatar
igorw
Registered User
 
Posts: 500
Joined: Thu Jan 04, 2007 11:47 pm

Re: Automatic UTF-8 normalization

Post by naderman » Sun Apr 04, 2010 12:45 pm

The idea of 3.1 is to keep things backward compatible. The new request class I'm working on has a 100% compatible request_var function. How do you propose we make this backwards compatible? I guess we could leave request_var as is, but the method on the class does it automatically?
www.naderman.de
Move your forum to Forumatic - we'll take care of maintenance & spam
User avatar
naderman
Development Team Leader
 
Posts: 1649
Joined: Sun Jan 11, 2004 2:11 am
Location: Karlsruhe, Germany

Re: Automatic UTF-8 normalization

Post by igorw » Sun Apr 04, 2010 2:05 pm

It's a bit tricky, actually. The most logical place to put it is type_cast_helper::set_var. This also has the advantage that only user input is normalized. Currently the $default argument of request_var also goes through normalization (it should already be normalized).

You could let request_var make both request::variable and set_var aware that the data will be normalized manually. But since normalizing a string twice should (afaik) not do any harm, you might as well just add it into set_var and let the old code normalize it a second time. That would be a bit slower but less obtrusive. My "no harm" claim would need to be proven first, though; I'm no Unicode guru.

Re: Automatic UTF-8 normalization

Post by naderman » Sun Apr 04, 2010 3:15 pm

Indeed, you are correct: we could just apply it twice, as per http://unicode.org/reports/tr15/#Design_Goals
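
A quick way to see this for a given string, as a sketch that assumes the intl extension is loaded. sketch_nfc_is_stable() is a hypothetical helper name, not part of phpBB or PHP.

```php
<?php
// Returns true when normalizing an already NFC-normalized string a second
// time changes nothing. Assumes the intl extension's Normalizer class.
function sketch_nfc_is_stable($str)
{
    $once = Normalizer::normalize($str, Normalizer::FORM_C);
    if ($once === false) {
        return false; // invalid input, nothing to compare
    }
    return Normalizer::normalize($once, Normalizer::FORM_C) === $once;
}

if (class_exists('Normalizer')) {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
    $decomposed = "e\xCC\x81";
    var_dump(sketch_nfc_is_stable($decomposed));
}
```

This only checks individual strings; the general guarantee comes from the Unicode report linked above.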

Re: Automatic UTF-8 normalization

Post by imkingdavid » Sun Apr 04, 2010 5:12 pm

This would definitely be a nice change so that I don't have to keep normalizing all the user input. That function takes a long time to type, especially using it multiple times. :lol:

Anyway, I agree with eviL3 on the placement, and it should be fine to apply it twice.
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.
User avatar
imkingdavid
Development Team
 
Posts: 900
Joined: Thu Jul 30, 2009 12:06 pm

Re: Automatic UTF-8 normalization

Post by igorw » Sat Apr 24, 2010 1:18 pm

On a slightly related note, PHP 5.3.0 has native support for unicode normalization. See Normalizer::Normalize(). It's most likely more efficient than the current PHP implementation. Considering there are already fallback wrappers for existing UTF-8 functions, normalization could be treated the same way.
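
Roughly what such a wrapper could look like, as a sketch only. sketch_utf8_normalize_nfc() and sketch_php_normalize_nfc() are hypothetical names; the latter is a stub standing in for the existing pure-PHP normalizer and does not actually normalize anything.

```php
<?php
// Hypothetical placeholder for the existing pure-PHP normalizer. A real
// fallback would implement NFC itself; this stub returns its input as-is.
function sketch_php_normalize_nfc($str)
{
    return $str;
}

// Decide once, at load time, which implementation backs the wrapper.
if (class_exists('Normalizer')) {
    // Native path: the intl extension is available.
    function sketch_utf8_normalize_nfc($str)
    {
        $result = Normalizer::normalize($str, Normalizer::FORM_C);
        // normalize() returns false on invalid input; this sketch then
        // hands the original string back unchanged.
        return ($result === false) ? $str : $result;
    }
} else {
    // Fallback path: no intl, use the PHP implementation.
    function sketch_utf8_normalize_nfc($str)
    {
        return sketch_php_normalize_nfc($str);
    }
}

$normalized = sketch_utf8_normalize_nfc("cafe\xCC\x81");
```

Callers only ever see one function name, so nothing else has to care whether the native class exists.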

Re: Automatic UTF-8 normalization

Post by igorw » Sat Apr 24, 2010 1:47 pm

I've added some changes, based on Nils' request-class branch. Both testing and feedback would be great.

http://github.com/evil3/phpbb3/compare/ ... hancements

Re: Automatic UTF-8 normalization

Post by A_Jelly_Doughnut » Sat Apr 24, 2010 3:54 pm

A lot of that code looks quite familiar (from Nils and "old_trunk" :))

I'd probably rather see the normalizer/unicode changes in a separate branch from the request class, which I think we want to water down (probably get rid of the code that disables the superglobals)

It would probably be good for the native version of utf8_normalize_nfc() to check Normalizer::isNormalized() before re-normalizing, because there's no way that we're going to remove every call to utf8_normalize_nfc() that is scattered around the code for 3.1, and I imagine that checking isNormalized() is faster than normalizing again. (This could already be done inside Normalizer::Normalize(), I've not checked.)
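
Something along these lines, as a rough sketch. sketch_native_normalize_nfc() is a hypothetical name, and whether the extra check is actually faster than normalizing again is an assumption that would need measuring.

```php
<?php
// Hypothetical native-backed normalizer that skips the work when the
// string is already in NFC. Assumes the intl extension's Normalizer class;
// without it the string is returned untouched.
function sketch_native_normalize_nfc($str)
{
    if (!class_exists('Normalizer')) {
        return $str;
    }

    // Cheap path: already normalized, so a second call costs only the check.
    if (Normalizer::isNormalized($str, Normalizer::FORM_C)) {
        return $str;
    }

    $result = Normalizer::normalize($str, Normalizer::FORM_C);
    // normalize() returns false on invalid input; keep the original then
    return ($result === false) ? $str : $result;
}

// Old code that still wraps the value itself just hits the cheap path.
$value = sketch_native_normalize_nfc(sketch_native_normalize_nfc("cafe\xCC\x81"));
```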
A_Jelly_Doughnut
User avatar
A_Jelly_Doughnut
MOD Team
 
Posts: 1751
Joined: Wed Jun 04, 2003 4:23 pm

Re: Automatic UTF-8 normalization

Post by igorw » Sat Apr 24, 2010 4:03 pm

Only the two latest commits are by me, the rest is Nils'. ;)

Re: Automatic UTF-8 normalization

Post by A_Jelly_Doughnut » Sat Apr 24, 2010 4:26 pm

Ah, I see. You based it off /naderman/phpbb3/feature/request-class/ rather than one of the phpbb3/phpbb3/ branches.
