Language support in 2.2

Discussion of general topics related to the new version and its place in the world. Don't discuss new features, report bugs, ask for support, et cetera. Don't use this to spam for other boards or attack those boards!
E.Z
Registered User
Posts: 9
Joined: Fri Sep 05, 2003 10:40 pm

Language support in 2.2

Post by E.Z »

[edited by dhn]Because of BBCode problems the post is hard to read. E.Z made an HTML version that you can read here: http://zvi.org/dev/phpbb_lang1.html [/edited]



Hi,

This is my first post here, so first of all I'd like to thank all phpBB developers and supporters for their great work. I especially like the new pricing scheme you've announced :D

I use phpBB 2.0.4 on a web site that serves an Israeli hiking group that I also manage. It's a small message board by any standard, but it serves as a crucial information center for the group.

I set up the board to work in Hebrew, and it works nicely, but the task wasn't as smooth and easy as it could have been, partly because of missing features in the code and partly because of inadequate support files such as templates and translations.

My intention isn't to criticize. On the contrary, I wish to describe a few problems and how they can be fixed in 2.2, for the benefit of all Right-to-Left (RTL) languages (i.e. not just Hebrew but also Arabic and maybe others). For some of the issues I may be able to contribute knowledge and actual work. This is a long (but IMHO interesting) post, so take a deep breath or just skip on to something else...

Multiple languages (with different directionality) in the same board

By this I mean not the user interface in different languages, but forums and topics that use different languages. Note that the only way to support multiple languages in a web page is to use UTF-8 (there's only one charset declaration for the entire HTML page).

What I wish is to have a language attribute for the entire board, for each forum and maybe even for each topic. Forums default to the board's language, and topics default to the forum's language. The default UI language (e.g. for guests and new users) is of course the board's language, but can be changed by users at will.

When a user enters a forum, the forum is displayed using the forum's language, including proper layout (LTR or RTL). The UI layout and language don't change, only the layout of the table that contains the forum's data.

When a user posts a new topic, the default language of the topic is the forum's language, but the user may change the language (if permitted by a configuration setting). When a user replies to an existing topic, the default language is the topic's language, but again, if permitted for the forum, the user may change the language.

When a topic is displayed, the layout (LTR or RTL) for the topic is determined by the language of the first message. The contents of messages in other languages are displayed with the correct directionality for each language, but within the layout determined by the first message.

I implemented something that more or less works along these lines, using a smart but not very elegant trick. I add a special character combination to a forum's name which determines whether the forum is LTR or RTL. Same for topic titles. Then I've hooked in a piece of code that checks for this signature and modifies the rendering of the contents.

You may wish to look at this on my site http://hug-elad.org/forum - note that most of the forums are in Hebrew, which will probably look like gibberish to you. There are two (almost empty) English forums at the bottom. Look at a message in a Hebrew forum and then at a message in an English forum, and see how the layout is switched. Also note that the forum names in the jumpbox contain ~R~ or ~L~, which signify either an RTL or an LTR forum (it works at the topic level too).
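
The marker trick described above can be sketched roughly as follows. This is a Python illustration, not the actual phpBB (PHP) modification: the function name `split_direction_marker`, the rule that the marker may sit anywhere in the name, and the LTR fallback are all assumptions, since the post only says that names carry ~R~ or ~L~.

```python
import re

# Marker syntax taken from the post: "~R~" for RTL, "~L~" for LTR.
MARKER = re.compile(r"~([RL])~")

def split_direction_marker(name, default="ltr"):
    """Return (clean_name, direction) for a forum or topic name.

    Assumption: the first marker found decides the direction and is
    stripped from the displayed name; names without a marker fall back
    to `default`.
    """
    match = MARKER.search(name)
    if match is None:
        return name.strip(), default
    direction = "rtl" if match.group(1) == "R" else "ltr"
    clean = (name[:match.start()] + name[match.end():]).strip()
    return clean, direction

print(split_direction_marker("~L~ English forum"))
```

The returned direction would then be passed to whatever code renders the forum or topic table.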

What's needed to implement this:
- A language setting field in the board's main config record.
- A language field in the forum record (default = 0 = main board lang).
- A language field in the topic header record (default = 0 = forum lang).
- A language field in the message header record (default = 0 = topic lang).
- Language configuration / selection added at the proper screens, of course.
- In the new message form, a change of language should reload the form using the new language and layout.
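
The inheritance chain in the list above (message, then topic, then forum, then board) could be resolved roughly like this. It is a minimal Python sketch rather than phpBB code: the function name, the use of short language codes instead of numeric IDs, and the `DIRECTION` table are assumptions made for illustration; only the "0 means inherit" rule comes from the proposal.

```python
INHERIT = 0  # per the proposal: 0 means "use the parent's language"

def effective_language(board_lang, forum_lang=INHERIT,
                       topic_lang=INHERIT, post_lang=INHERIT):
    """Walk from the most specific setting to the least specific one."""
    for lang in (post_lang, topic_lang, forum_lang):
        if lang != INHERIT:
            return lang
    return board_lang

# Hypothetical language table; the direction would drive the table layout.
DIRECTION = {"en": "ltr", "he": "rtl", "ar": "rtl"}

lang = effective_language("he", forum_lang="en")
print(lang, DIRECTION[lang])
```

A Hebrew board with an English forum resolves to English and an LTR layout for that forum, while every forum left at 0 keeps the board's Hebrew and RTL.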

Improve templates

The current templates do not fully account for directionality. Even though directionality itself is included, right/left alignment is in many cases hard coded, making the output mixed up. Not much to say except:

- Add language layout directives (i.e. DIR=) to all the relevant templates, not just to set the page layout (that more or less works today), but also at the level of forum / topic tables.
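
As a rough illustration of the template point above, the direction and the alignment can both be derived from the content language instead of being hard coded. This Python sketch only generates the HTML fragments; the helper names, the `forum-table` class and the two-entry `RTL_LANGS` set are hypothetical, and real phpBB templates would express this in their own template syntax.

```python
from html import escape

RTL_LANGS = {"he", "ar"}  # assumption: a minimal set for illustration

def table_open_tag(lang, css_class="forum-table"):
    """Build an opening <table> tag whose direction follows the content language."""
    direction = "rtl" if lang in RTL_LANGS else "ltr"
    return '<table class="{}" dir="{}">'.format(
        escape(css_class, quote=True), direction)

def start_align(lang):
    """Resolve 'start of line' alignment instead of hard-coding left or right."""
    return "right" if lang in RTL_LANGS else "left"

print(table_open_tag("he"), start_align("he"))
```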

Take into account UTF-8 string lengths!

I converted the Hebrew language files to UTF-8, to make phpBB compatible with the rest of my site, and also to be ready in case I need to support other languages (e.g. Arabic, Russian).

It works, but there are problems with string length limits. Most language encodings use 1 byte per character, so it's easy to match the input size limit to the column size in the database.

However UTF-8 uses a variable number of bytes per character, usually 1-2 but with Chinese/Japanese/Korean even up to 4 (or is it 6) bytes per character.

Currently the phpBB code and DB schemas don't take this into account. In one case I typed a very long topic title, which was within the character limit, but in UTF-8 it exceeded the byte limit. This led to corruption of the topic and I had a hard time fixing it with phpMyAdmin.

What's needed to correct this:
- Increase the sizes of string columns in the DB schema. Some databases do not recognize the distinction between chars and bytes, so there's no escape from just making the columns a bit wider - I suggest by 40-60%.
- In forms, if the current encoding is UTF-8, limit the size of input strings to half of the size declared in the DB. While it's true that there's no way to predict exactly how many bytes will be used, a ratio of 2 bytes per char will prevent many overruns.
- It may be wise to derive the bytes-per-char ratio from the board main language setting mentioned in the previous point. For English and most Latin languages it's 1:1, Hebrew and Arabic are 2:1 or 3:1, Chinese/Japanese will probably be 4:1 I guess, etc.
- Make sure there are no buffer overruns. I'm not sure if the corruption I experienced is due to a DB bug or a PHP code bug, but the entire supply chain should be checked - for security reasons as well.
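
To make the bytes-per-char ratio concrete, here is a small Python sketch that measures it for sample strings and derives a form limit from a column size. The function names and the 60-byte column are assumptions chosen for illustration. For reference, Hebrew letters take 2 bytes each in UTF-8, common Chinese/Japanese characters take 3, and UTF-8 as currently standardized never needs more than 4 bytes per character.

```python
def bytes_per_char(sample, encoding="utf-8"):
    """Average number of encoded bytes per character in a sample string."""
    if not sample:
        return 0.0
    return len(sample.encode(encoding)) / len(sample)

def form_maxlength(column_bytes, ratio):
    """Largest character count that always fits if each char costs `ratio` bytes."""
    return column_bytes // ratio

print(bytes_per_char("hello"))   # ASCII letters
print(bytes_per_char("שלום"))    # Hebrew letters
print(bytes_per_char("日本語"))   # common CJK characters
print(form_maxlength(60, 2))     # hypothetical 60-byte column, 2:1 ratio
```

A fixed ratio is only a safe bound when it is at least the worst case for the characters users actually type, which is why trimming before the DB write is still worth having as a backstop.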

Enforce English / LTR layout in some places

The previous points were all about flexibility of using different languages and layouts, but in some places (mostly admin stuff), non-English languages and non-LTR layout may cause serious problems. Why?

Most boards are installed on hosted web sites, where the board admin has no control at all over the underlying locale settings, filesystem character support, etc. So admins must be very careful not to use, for example, file and directory names that contain exotic characters.

It's perfectly possible to type English text (e.g. a file name) into an input field even when the page is in Hebrew. However, due to the BiDi algorithm at work on the browser side, the text doesn't appear as it should. This is crucial for paths and file names, but other stuff may get confused as well. For example:

In LTR (normal) layout: forum/mydir/
In RTL layout it will show: /forum/mydir

In LTR (normal) layout: 1+2=3
In RTL layout it will show: 3=1+2

Note how the slash at the end is shown as if it were at the beginning of the path string, while it's actually still at the end. Even for me, an experienced and knowledgeable user, this is confusing and causes mistyping.
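
One way to see why this happens is to look at the Unicode bidirectional class of each character. Below is a minimal Python sketch using the standard `unicodedata` module; the helper name `bidi_classes` is made up for illustration, and the code only inspects the classes, it does not implement the reordering itself.

```python
import unicodedata

def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# "L" is strong left-to-right, "R" strong right-to-left and "EN" a
# European digit; "CS", "ES" and "ON" are weak or neutral classes that
# have no direction of their own.
print(bidi_classes("forum/mydir/"))
print(bidi_classes("1+2=3"))
```

The letters are strong LTR characters, but the slash is only a weak separator. A trailing slash therefore follows the paragraph direction, and in an RTL paragraph that puts it on the left. In the same way the neutral "=" lets the RTL paragraph order the runs "1+2" and "3" from right to left.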

By the way, as it is today the admin CP is language aware but almost completely layout unaware - which causes even more serious confusion. For example, radio selections are inverted (it looks like Yes is selected while actually it's No, etc.). Imagine what this can do for settings such as "Board Active"... :(

Possible approaches:
- Alternative #1: Lock certain pages to English, or at least to LTR layout, regardless of the user's and/or the board's default language. This user is usually the admin or a moderator, and they should be able to handle some English.
- Alternative #2 (recommended): Lock language and layout only for specific input fields that are prone to BiDi typing mistakes, such as path and file names, email, password, etc. In this case, the admin CP templates should also include layout support.
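
Alternative #2 could look something like the following, sketched in Python for illustration only. The helper name and the `FORCE_LTR_FIELDS` set are hypothetical; the only real mechanism relied on is the standard HTML `dir` attribute, which pins a single element to LTR regardless of the page direction.

```python
from html import escape

# Assumption: field names that should always be typed and shown left-to-right.
FORCE_LTR_FIELDS = {"path", "filename", "email", "password"}

def render_text_input(name, value=""):
    """Return an <input> tag, pinning direction to LTR for sensitive fields."""
    attrs = 'type="text" name="{}" value="{}"'.format(
        escape(name, quote=True), escape(value, quote=True))
    if name in FORCE_LTR_FIELDS:
        attrs += ' dir="ltr"'
    return "<input {}>".format(attrs)

print(render_text_input("path", "forum/mydir/"))
print(render_text_input("subject"))
```

Fields outside the set keep inheriting the page direction, so a Hebrew subject line still behaves as RTL.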

That's it, at least for now. Thank you for taking the time to read all this.

I hope there will be a fruitful discussion following this message, and I expect that whoever is in charge will explain how to submit these enhancement requests to the developers.

As I said I'm willing to help - mostly I can help with translation, fixing templates, testing, etc.

Regards,

E.Z
E.Z
Registered User
Posts: 9
Joined: Fri Sep 05, 2003 10:40 pm

Re: Language support in 2.2

Post by E.Z »

Apparently 2.2 message posting isn't yet fully working, so all the formatting codes in my message above aren't working. This makes it a bit hard to read such a long message - I apologize, but all my attempts to fix it didn't help much.

Eyal.
psoTFX
Registered User
Posts: 1984
Joined: Tue Jul 03, 2001 8:50 pm

Re: Language support in 2.2

Post by psoTFX »

I can't comment on most of this right now ... However, I can comment on UTF-8 ... I should start by noting that I did spend and have spent a fair amount of time examining the implications of using unicode and whether we could move to utilising it throughout phpBB ... my conclusion is, at this stage we cannot. An overview of why not is given below:

1) Database support

At this time MySQL 3.x/4.0.x has no native support for Unicode (ucs or utf8) codepages; MySQL 4.1.x does include support, as do various other DBs if the appropriate encoding is specified. This imposes limits on field size, and simply "upping the field lengths" is not a solution IMO. It also poses problems when sorting on string data, e.g. author names, etc.

2) PHP Support

PHP 4, like PHP 3, is built around iso-8859-1, and not multibyte charsets such as unicode, big5, etc. This poses problems when it comes to string handling: strlen counts bytes rather than characters, so for a big5 string it returns up to twice the number of characters, and the same byte-oriented behaviour does a good job of screwing up things like substr, strpos, etc. It also poses problems for regular expressions. Depending on appropriate support, the u modifier is supposed to enable unicode compatibility ... however, this requires a version of PHP in excess of our minimum requirements on the Windows platform (I'm also unsure whether it actually works).

The mbstring module is available but is not universal by any margin. And while I (and others) have had some success it's not the most reliable PHP module going.
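
The string-handling problem can be shown in a few lines. The sketch below uses Python, with a `bytes` object standing in for a byte-oriented PHP 4 string; that substitution is an assumption made for illustration, and no PHP functions are actually called.

```python
text = "שלום"                 # four Hebrew letters
raw = text.encode("utf-8")    # what a byte-oriented function sees

print(len(text))              # counts characters
print(len(raw))               # counts bytes, like a byte-based strlen

# A byte-based "substring" can stop in the middle of a character.
try:
    raw[:3].decode("utf-8")
except UnicodeDecodeError as exc:
    print("broken character:", exc.reason)

print(len("中文".encode("big5")))  # two characters, counted in Big5 bytes
```

Any length check, offset or cut position computed on the bytes is therefore off by a factor that depends on the text, which is the kind of fudging the post refers to.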

Therefore I do not plan on implementing a full UTF-8/Unicode backend for phpBB 2.2. When PHP catches up and offers native support for unicode without having to fudge solutions left, right and centre I will, without much doubt switch over.
E.Z
Registered User
Posts: 9
Joined: Fri Sep 05, 2003 10:40 pm

Re: Language support in 2.2

Post by E.Z »

Paul,

First of all thanks for the very quick response.

I understand what you're saying about incomplete Unicode support in PHP and in MySQL. I thought PHP's Unicode and mbstring support is working, and I read that MySQL 4.1+ supports Unicode, but I have not tested either of them seriously. I did some small stuff with PHP and it seemed to work, but I'll take a look at it again.

However, phpBB does work now with UTF-8, so things aren't as bad as they might look. The only issues I've encountered are corruption when a UTF-8 string exceeded the byte limit, and non-working searches/sorts with MySQL (should work fine with MySQL 4.1+, PostgreSQL, MSSQL, Firebird, etc).

For the first issue I've suggested a workable solution - not perfect but adequate: apply a bytes-per-char ratio to calculate input field limit, and trim strings before sending to DB. This should work for any DB regardless whether it supports Unicode or not. It would be nice also to increase some column sizes, but this isn't a must.
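
The "trim strings before sending to DB" step could be sketched like this in Python. The function name and the byte budgets are hypothetical; the point is only that the cut is made on the encoded bytes and that a character split at the cut point is dropped rather than stored half-written.

```python
def truncate_to_bytes(text, max_bytes, encoding="utf-8"):
    """Trim `text` so its encoded form fits in `max_bytes` without splitting a character."""
    encoded = text.encode(encoding)
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops the incomplete character left at the cut point.
    return encoded[:max_bytes].decode(encoding, errors="ignore")

print(truncate_to_bytes("שלום", 5))
```

With a 5-byte budget the four-letter Hebrew word keeps its first two letters (4 bytes); the third letter would need bytes 5 and 6, so it is dropped whole.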

For the second issue there's no solution on phpBB side, it entirely depends on the DB to support Unicode. However users of MySQL 4.1+ and most other databases should have this working as well.

I'm not deeply familiar with phpBB code, but I'm not sure it needs to be switched to mbstring. It works as it is. Code changes will need to handle new functionality (eg. calc input field size based on byte-per-char ratio), but this doesn't require any Unicode string handling.

All that said, UTF-8 is only a part of multi-language support. Even with simpler charsets the other issues I detailed in my first message are still valid.

With ISO-8859-8-I (Hebrew) for example, it's perfectly possible to display full English content and run a dual-language board. Similarly Arabic charsets include all English letters, and so on.

So even if there are no special provision for UTF-8 in the code, fixing the other things (and using a DB with Unicode) will enable full multi-language boards.

BTW - as my first message formatting got screwed up, I've made it available as HTML on http://zvi.org/dev/phpbb_lang1.html

Eyal.
psoTFX
Registered User
Posts: 1984
Joined: Tue Jul 03, 2001 8:50 pm

Re: Language support in 2.2

Post by psoTFX »

erm, I do have some idea of what I'm talking about ya know ;) We will not fudge a solution to character lengths or string functions. When PHP is built around unicode rather than iso-8859-1 I will very happily switch to it internally (rather than purely superficial as you're suggesting). Given phpBB is in more than 40 languages I'm aware of the implications of what is required.
Roberdin
Registered User
Posts: 1546
Joined: Wed Apr 09, 2003 8:44 pm
Location: London, United Kingdom

Re: Language support in 2.2

Post by Roberdin »

E.Z wrote: phpBB does work now with UTF-8

Whether it works with the latest build isn't important. phpBB is designed to be compatible with as many PHP installations as possible. My server's one barely gets past the current phpBB 2.2 minimum version requirements.
E.Z
Registered User
Posts: 9
Joined: Fri Sep 05, 2003 10:40 pm

Re: Language support in 2.2

Post by E.Z »

Paul,

I wasn't in any way implying that you don't know what you're talking about. If that was in any way implied from my response then I apologize.

I meant that despite the limitations of PHP/MySQL with regard to Unicode, and despite the fact that phpBB wasn't written to support Unicode, phpBB does work quite well with UTF-8. If anything, this only testifies in favor of phpBB developers.

Leaving UTF-8 aside, there are still the other issues that need considering, in order to make phpBB fully multi-language capable, namely language settings per board/forum/message, and full RTL aware templates.

If the development team decides to accept my suggestions "as is" or even modified, then I am willing to help. For the first part (per-object language settings) I can mostly help with specifications and testing. For the second part (templates), I can test the templates and make the necessary corrections, as well as create a new Hebrew translation.

If those issues can be worked out, phpBB will have all the infrastructure it needs and it will be very useful with all the single-byte charsets. Those who require UTF-8 can implement it themselves as I do now (maybe I'll even create a mod for that).

So please, let me know when and how to post those issues to the feature request database, and also who should I contact to offer my help.

Thanks,

Eyal.

PS - Thanks also for editing the original post and adding the link.
E.Z
Registered User
Posts: 9
Joined: Fri Sep 05, 2003 10:40 pm

Re: Language support in 2.2

Post by E.Z »

Roberdin,

When I say that phpBB works now with UTF-8, I refer to version 2.0.4 that I use, not the latest development version. So if your server is good enough for 2.0.x then you can use UTF-8, albeit with some minor limitations that can be worked around.

However if your DB (for example MySQL versions below 4.1) doesn't support Unicode, then you also lose the ability to search and the ability to sort by any string. But this has nothing to do with phpBB version or with PHP at all, it's a DB limitation.

Regards,

Eyal.
Roberdin
Registered User
Posts: 1546
Joined: Wed Apr 09, 2003 8:44 pm
Location: London, United Kingdom

Re: Language support in 2.2

Post by Roberdin »

Well, that's kind of useless. Suppose I need the Search function? :roll:
Rob
E.Z
Registered User
Posts: 9
Joined: Fri Sep 05, 2003 10:40 pm

Re: Language support in 2.2

Post by E.Z »

If you need a working Search then either don't use UTF-8 or use a DB that supports Unicode, such as MySQL 4.1 and later, PostgreSQL, Firebird, etc.

The problem is not with phpBB or PHP, but with MySQL version 3.X and 4.0.X that don't support Unicode.

Also, if you search for characters that are in the standard ASCII range (0-127), even if you use UTF-8 with a database that does NOT support UTF-8, searches in English will still work.
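
The reason ASCII searches survive can be sketched in Python. UTF-8 encodes ASCII characters as the same single bytes, and every byte of a non-ASCII character is 0x80 or above, so a byte-level match on an English word cannot be thrown off by the Hebrew around it. The function name and the sample message are made up for illustration, and `bytes.lower()` is used here as a stand-in for a database that only knows how to case-fold ASCII.

```python
def ascii_search(haystack, needle):
    """Case-insensitive search done purely on UTF-8 bytes.

    bytes.lower() only changes ASCII letters, mimicking a store that
    has no idea the column holds UTF-8.
    """
    return needle.encode("utf-8").lower() in haystack.encode("utf-8").lower()

message = "שלום India"  # made-up sample: Hebrew text with one English word
print(ascii_search(message, "india"))

# No byte of a Hebrew letter falls in the ASCII range.
print(all(b >= 0x80 for b in "שלום".encode("utf-8")))
```

Case-insensitive matching and sorting of the non-ASCII text itself is a different matter, since that needs the database to understand the encoding.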

You can test it on my board (http://hug-elad.org/forum), which uses UTF-8 and where most of the messages and the UI are in Hebrew. However, there are some messages that contain English words - try searching for "india" without the quotes. You can verify that I use UTF-8 by looking at the HTML source.

EZ.