Read my post regarding UTF-8 and it will clear up those issues.
birdfoot wrote: Hi guys,
Sorry if I sound totally dumb. I don't really know much about character encoding. However, wouldn't text stored in &#nnnn; form work?
I'm just curious. I've used vBulletin before and had its charset set to ISO-8859-1. I noticed that whenever I create a post using Chinese or Japanese characters, those characters get stored in the DB in &#nnnn; form, unlike how phpBB stores them. I also do not have any language packs installed.
When it comes to display, they work perfectly. They are displayed exactly as they were entered.
The only thing that gave me problems was searching. In vBulletin, by default, searching for Chinese/Japanese text is not possible. However, there is a way to overcome this by enabling Fulltext search in vBulletin. Some alterations also need to be made to the DB, such as adding indices and changing the table type for posts and threads to MyISAM. This solution was actually provided within vBulletin itself.
There are limitations though: the searched text does not get highlighted in the results, and sometimes you get wrong results. The wrong results are due to the search working in an "OR" manner, i.e. if any one of the characters is found (e.g. ２, which was entered and stored in &#nnnn; form), then that post is returned as a result. If you put quotes around the search string, you get the correct results for what you really wanna look for. Also, I noticed that standard stuff like numbers gets changed to &#nnnn; form too (like the one I listed in the example above). Of course, another issue is that this requires more space in the DB.
Some other info concerning my text input methods:
1. I'm using Windows XP Pro with English (US) as the native OS language setting.
2. I use the IME that ships with Windows XP to enter the text.
The search problem lies in the use of preg functions that don't split the words correctly. They can be replaced to solve this issue.
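To illustrate the splitting problem, here is a rough sketch in Python rather than PHP. It is not the actual vBulletin or phpBB code. The function names are made up for the example, and the ASCII-only `\w` pattern is only an assumption about how such a splitter behaves. It shows how text stored in &#nnnn; form falls apart into one numeric "word" per character, which would explain the "OR"-like matches:

```python
import re

# "２日" as it would sit in the DB in &#nnnn; form (２ = 65298, 日 = 26085).
stored = "&#65298;&#26085;"

def split_words_ascii(text):
    # Hypothetical stand-in for a byte/ASCII-oriented splitter:
    # only [A-Za-z0-9_] counts as part of a word.
    return re.findall(r"\w+", text, flags=re.ASCII)

def split_words_unicode(text):
    # Unicode-aware splitter: CJK characters and fullwidth digits are word characters.
    return re.findall(r"\w+", text)

# Each entity becomes its own numeric token, so a search matches any one of them.
tokens = split_words_ascii(stored)

# On real (decoded) text, a Unicode-aware pattern keeps the word together.
words = split_words_unicode("２日 test")
```

With the entity form, `tokens` holds two separate numbers, while `words` keeps `２日` as a single word next to `test`.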
The storage format you mentioned happens whenever you enter a character that is not included in the ISO charset you are using. In UTF-8 all characters can be represented, which is why some of them take two or more bytes.
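A small Python sketch of both points, for illustration only. The `xmlcharrefreplace` error handler imitates what a browser does when a form is submitted in ISO-8859-1, and the sample strings are my own:

```python
import html

text = "日本語"

# Characters missing from ISO-8859-1 are replaced by &#nnnn; references.
as_entities = text.encode("iso-8859-1", errors="xmlcharrefreplace")
# as_entities == b"&#26085;&#26412;&#35486;"

# The same text encoded directly as UTF-8.
as_utf8 = text.encode("utf-8")

# Storage cost: 8 bytes per character as entities vs 3 bytes per character in UTF-8.
entity_size = len(as_entities)  # 24
utf8_size = len(as_utf8)        # 9

# The entity form still displays correctly because it decodes back to the same text.
round_trip = html.unescape(as_entities.decode("ascii"))

# UTF-8 is variable width: 1 byte for ASCII, 2 for é, 3 for 日.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in "aé日"}
```

This also shows the extra space the &#nnnn; form needs in the DB compared with storing the characters as UTF-8.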