Nicofuma wrote:The XML representation add some tags and formalize the text (new line, spaces and tabs) and doesn't remove any part of the text
For precision's sake, it doesn't remove text but it does remove control codes that are not whitespace. That's bytes 0x00..0x08, 0x0B..0x0C and 0x0E..0x1F. It doesn't touch spaces or tabs but it normalizes EOL to \n. That's actually very close to what phpBB does by default.
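To make the byte ranges above concrete, here is a minimal sketch in Python. It is not the actual implementation, only an illustration. The function name normalize_text is made up for this example, and I'm assuming that "normalizes EOL" means both \r\n and a lone \r become \n.

```python
import re

# Control codes that are not whitespace: 0x00..0x08, 0x0B..0x0C and 0x0E..0x1F.
# Tab (0x09), LF (0x0A) and CR (0x0D) are deliberately left out of the class.
_CONTROL_CODES = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F]')


def normalize_text(text: str) -> str:
    """Hypothetical helper: normalize EOL to \\n and drop non-whitespace control codes."""
    # Assumption: CRLF and lone CR are both treated as line endings.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Spaces and tabs are untouched; only the control codes listed above are removed.
    return _CONTROL_CODES.sub('', text)
```

Anything outside those ranges, including spaces and tabs, comes out unchanged.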
About storing parsed text in a separate column, in my opinion the biggest problem is not storage size, it's complexity. The issue is twofold: API and DB structure.
API. In phpBB's current codebase, when you want to display a text you send the post_text string through generate_text_for_display() and you receive HTML in return. If that were to change, each call site would need to be changed (manually?) to add the metadata as a sixth argument, and then you would still have to track down the source of the data (usually a SQL query) and add the metadata column to that query. I see at least 38 calls to generate_text_for_display() in the current codebase. On top of that there are about 10 calls to generate_text_for_storage() that would need to be modified, as well as the corresponding SQL queries.
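For anyone who wants to check those call counts against their own checkout, here is a rough sketch in Python. It is not how I got my numbers, just one way to approximate them. The helper name count_calls is made up, and it assumes the sources are .php files under a single root directory. It is a plain regex scan, so matches inside comments or strings are counted too.

```python
import re
from pathlib import Path


def count_calls(root, function_name):
    """Hypothetical helper: count call sites of function_name in .php files under root."""
    # Match "name(" or "name (" as a whole word, but skip the definition
    # itself ("function name(") using a fixed-width negative lookbehind.
    pattern = re.compile(
        r'(?<!function\s)\b' + re.escape(function_name) + r'\s*\('
    )
    counts = {}
    for path in Path(root).rglob('*.php'):
        text = path.read_text(encoding='utf-8', errors='replace')
        hits = len(pattern.findall(text))
        if hits:
            counts[str(path)] = hits
    return counts
```

Summing the values of the returned dict gives the total, and the keys give the list of files that would have to be touched.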
DB structure. I grepped through phpBB's sources and I see at least 9 different kinds of rich text: posts, private messages, user signatures, poll titles, poll options, forum descriptions, forum rules, group descriptions and the admin's contact info. Nearly as many tables would need to be changed, each with a column added.
In addition, a number of tests and test fixtures would need to be updated. Such an extensive change would need to take place in a separate PR. I absolutely see the value of storing a copy of the original text, but the development cost is big enough to outweigh it.
Nicofuma wrote:And to help us I would like to have some benchmarks (If you need it, we can give you a sample from phpbb.com):
I don't have any real-life forums so yes, if you want me to run benchmarks on real-world data I'll need you to provide it.
I don't think that benchmarking the performance of storing metadata separately will tell you much though. I would expect it to cost less than a millisecond per page.