phpBB

Development Discussion Board

phpBB's testing ground of bleeding edge code
Advanced search

[RFC] Post Revisions

Publish your own request for comments or patches for the next version of phpBB. Discuss the contributions and proposals of others. Upcoming releases are 3.1/Ascraeus and 3.2/Arsia.

Re: [RFC] Post Revisions

Postby t_backoff » Tue May 08, 2012 8:24 pm

David,

I quickly glanced over your initial changes and noticed this:

Code: Select all
define('POST_REVISIONS_TABLE',        $table_previx . 'post_revisions'); 

That should be $table_prefix
User avatar
t_backoff
Moderator Team
Moderator Team
 
Posts: 113
Joined: Sat Jun 12, 2010 3:25 am

Re: [RFC] Post Revisions

Postby imkingdavid » Mon May 14, 2012 3:32 pm

t_backoff wrote:David,

I quickly glanced over your initial changes and noticed this:

Code: Select all
define('POST_REVISIONS_TABLE',        $table_previx . 'post_revisions');

That should be $table_prefix

Good catch. Fixed.
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.
User avatar
imkingdavid
Development Team
Development Team
 
Posts: 901
Joined: Thu Jul 30, 2009 12:06 pm

Re: [RFC] Post Revisions

Postby imkingdavid » Fri May 25, 2012 5:34 pm

For reference:
Code: Select all
<imkingdavid> hmm... the diff engine packaged with phpbb only shows changes per line, not per character. so if I were to write: "Post" and then change it to "Posting", it would show a line removed "Post" and a line added "Posting" instead of 3 characters added "ing"
<imkingdavid> is that preferrable or should I figure out how to do a per-character diff?
<bantu> imkingdavid: AFAIK diff is by definition line-based
<imkingdavid> bantu: perhaps, but the diff engine I referenced in my gsoc proposal, http://www.raymondhill.net/finediff/viewdiff-ex.php has options for per-character, per-word, per-sentence, and per paragraph/line
<bantu> ah
<imkingdavid> I think that a diff for code works best as per-line, but a diff for user-written content (i.e. prose) is better either per-word or per-character
<bantu> yes, that might be the case
<imkingdavid> I don't want to package two diff engines with phpBB, though, so if we were to use a per word or per character diff,  I would have to write it myself
<imkingdavid> which is okay if we want to do that
<-- stodan (~daniel@unaffiliated/stodan) has quit (Quit: Over & Out)
<bantu> imkingdavid: Have you thought about conflict resolution yet?
<imkingdavid> bantu: not yet. i'd need an example conflict to test
<imkingdavid> although...
<imkingdavid> it only ever compare two post revisions, so I'm not sure how there could be a conflict
<bantu> exactly
<bantu> so we just won't support branching off
<bantu> say two people editing at the same time
<bantu> this is fine I guess
<imkingdavid> revisions are based on posts, not other revisions. so if two people edit at the same time, two revisions will be made on the post, and the most recently finished one will be used
<imkingdavid> if the other one is decided to be better, they can swap them
<bantu> imkingdavid: they will usually want to merge them
<bantu> and the last editor will already get a warning about it
<imkingdavid> right
<bantu> I'm not sure about the overhead of using a character or word based approach for storage.
<imkingdavid> bantu: all revisions are stored in full text, nto as diffs
<imkingdavid> not*
<bantu> I'm pretty sure git does line-based and e.g. GitHub just highlights words again
<imkingdavid> right
<bantu> imkingdavid: so this is just about display
<imkingdavid> basically
<imkingdavid> when I change the word "Post" to "Posting" I only want to see "ing" in green, not "Post" in red and "Posting" in green
<imkingdavid> although, if we go word-based, it would be the latter
<imkingdavid> but the whole line wouldn't show as a change, just that word
<imkingdavid> which would be okay too
<bantu> line based is certainly bad for display
<bantu> I agree with that
<imkingdavid> alright, I'll work on either word or character based diff. which would you prefer?
<bantu> Especially since a post can be a single line that is very long.
<imkingdavid> right
<bantu> I don't think there would be a problem with packaging another diff engine.
<bantu> The current diff engine is however also used for doing actual merges.
<bantu> So it is kind of important that things actually work there.
<imkingdavid> right
<imkingdavid> i can include a new diff engine separately, since that would be easier than adding to the current engine
<imkingdavid> here's the code for that diff engine I referenced: https://github.com/gorhill/PHP-FineDiff
<-- rxu (~Miranda@phpbb/developer/rxu) has quit (Read error: Connection reset by peer)
<imkingdavid> i'll just change the class name to fit with naming guidelines and then make sure the code conforms to coding guidlines
<imkingdavid> or is that not required for 3rd party libs?
<bantu> it really is the other way around
<bantu> third party libraries should be used as is if possible
<imkingdavid> okay
<bantu> since it does support word-based and character-based diffing, it should not matter right now
<bantu> feel free to pick one
<imkingdavid> alright


TL;DR: Will be using FineDiff (MIT Licensed) for post revision diff, because the currently-packaged line-based diff is not visually appealing for prose diffs. Instead, will use either character- or word-based (defaults to character) diff.
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.
User avatar
imkingdavid
Development Team
Development Team
 
Posts: 901
Joined: Thu Jul 30, 2009 12:06 pm

Re: [RFC] Post Revisions

Postby bantu » Sat May 26, 2012 12:03 am

imkingdavid wrote:TL;DR: Will be using FineDiff (MIT Licensed) for post revision diff, because the currently-packaged line-based diff is not visually appealing for prose diffs. Instead, will use either character- or word-based (defaults to character) diff.

It turns out we cannot use FineDiff as it does not support UTF8 properly. Also, after closer review, it does not look very mature to me.

[01:45:24] <naderman> imkingdavid: please entirely ignore the diff engine used for updates, and just use a different one
[01:45:28] <naderman> oh he's gone
[01:45:33] <naderman> bantu: well you tell him that
[01:46:01] <bantu> naderman: that is what I told him
[01:46:20] <naderman> well not quite that clearly, but ok :P
[01:46:42] <naderman> and try and make sure whichever one you pick is proper PHP5 code
[01:49:00] <naderman> * Does it work with UTF-8?
[01:49:00] <naderman> As of now, the code assume single-byte characters. To use UTF-8 text, you can
[01:49:00] <naderman> always convert the encoding using mb_convert_encoding():
[01:49:07] <naderman> so the library he was looking at is not an option
[01:52:49] <bantu> naderman: interesting
[01:53:41] <bantu> I would also prefer if the library would be "more mature".
[01:53:48] <naderman> yeah
[01:53:59] <naderman> is he planning on doing snapshot or delta storage?
[01:55:48] <bantu> naderman: I think he doesn't want to do delta storage
[01:55:54] <naderman> k
[01:56:29] <bantu> this will cost quite a bit of storage then
[01:56:41] <naderman> yeah
[01:56:52] <naderman> but implementing a delta store on top of SQL sucks too
[01:56:54] <bantu> ideally, the new bbcode engine would be finished first
[01:57:00] <naderman> I wonder what mediawiki does
[01:57:03] <naderman> true
[01:57:19] <naderman> well if it's just about displaying diffs, I guess it doesn't matter that much
[01:57:23] <naderman> what matters is all the other stuff
[01:57:38] <naderman> I would actually recommend to him that he entirely ignores displaying diffs until the very end
[01:57:43] <naderman> and just displays them side by side for now
[01:58:04] <naderman> all the functionality around revisions is way more important
[01:58:20] <bantu> yeah, I agree
User avatar
bantu
3.0 Release Manager
3.0 Release Manager
 
Posts: 437
Joined: Thu Sep 07, 2006 11:22 am
Location: Karlsruhe, Germany

Re: [RFC] Post Revisions

Postby imkingdavid » Tue May 29, 2012 1:36 pm

Alright, I will keep an eye out for a good diff engine that can handle utf8, although I did look at quite a few over the weekend and there are really not that many diff engines out there. I may end up having to make something myself. Just to be sure, what exactly is needed to make the diff engine utf8-compatible?

Next question, I have thought a bit more over the weekend about storing as delta and calculating revisions at runtime, but I feel like that would be way too complicated. The differences would then have to be rewound and recalculated if any revision is deleted, and if a revision is made to an earlier revision, we would need to decide if the differences would be calculated from the most recent revision or the one that was currently in use when the post was edited. And there are probably even more issues that I am not thinking of off the top of my head.

So, storing the posts in full is probably the best way to do this. I have addressed the issues with this approach taking up storage space. But I do have a question: As it is right now on my branch, every post has at least 1 revision (i.e. the original content == the first revision) because otherwise it is impossible to allow changing between revisions without having to keep switching data around and potentially losing data if something goes wrong. However, that also means that there is data duplication: the *_posts table will always have the latest version of the post, because the edit functionality not only edits the post as is the original behavior, but also inserts a new revision containing the same information (note that I have modified viewtopic.php to always serve data from the revisions table instead of the posts table).

My question, therefore, is: because the revisions table has all of the post information that is able to be versioned, would it be okay to remove those fields from the posts table? Otherwise, each post will take up twice the space it currently does.
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.
User avatar
imkingdavid
Development Team
Development Team
 
Posts: 901
Joined: Thu Jul 30, 2009 12:06 pm

Re: [RFC] Post Revisions

Postby naderman » Tue May 29, 2012 1:46 pm

No, the posts table should continue to have all data available. The revisions table will contain a lot more data, making indexes etc. less efficient. We should keep the entire post data necessary to show a post local in a row in the posts table.

Instead you should only create a revision of the previous state when editing, rather than creating a revision of the current state. That way you only ever need an entry in the revision table if there has been at least one change.
www.naderman.de
Move your forum to Forumatic - we'll take care of maintenance & spam
User avatar
naderman
Development Team Leader
Development Team Leader
 
Posts: 1649
Joined: Sun Jan 11, 2004 2:11 am
Location: Karlsruhe, Germany

Re: [RFC] Post Revisions

Postby drathbun » Tue May 29, 2012 4:19 pm

Random scattered thoughts...

I would prefer a character diff rather than word, for reasons already listed (post versus posting example given).

I don't see a problem with storing revisions as text and calculating the differences between subsequent post versions (as stated, a post with no edits has zero rows in the revision table). That seems to be more efficient than storing a single version of the text and storing revision instructions to derive the new text. It is far more likely to want the final text display (which would be on viewtopic or search results) and the post revision request is going to be a very infrequent request.

Indexing on the post revisions table does not have to be substantial, only an index on post_id and revision key. Even if the table gets large, I can't see a use case where you would look at post revisions for more than one post at a time.
Sometimes you're the windshield, sometimes you're the bug.
User avatar
drathbun
Registered User
 
Posts: 72
Joined: Wed Feb 15, 2006 6:40 pm
Location: Texas

Re: [RFC] Post Revisions

Postby imkingdavid » Tue May 29, 2012 4:51 pm

drathbun wrote:Random scattered thoughts...

I would prefer a character diff rather than word, for reasons already listed (post versus posting example given).

I would as well, except that it often displays very oddly. So there would have to be some added logic to get it to display correctly. For instance, take the following two strings, which I fabricated entirely for this example, even though there are probably examples that better illustrate the issue:
I am a posterity person

I ask a pottery question

Here is the resulting character diff:
I amsk a postterity pquerstion

Instead of the actual changes I made (word diff):
I amask a posteritypottery personquestion


As I said, not the best example, but I think it illustrates how often times coincidental similarities between strings can make the diff appear more confusing than a word-based diff.

That being said I think word based and character based are both better than line based:
I am a posterity person
I ask a pottery question
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.
User avatar
imkingdavid
Development Team
Development Team
 
Posts: 901
Joined: Thu Jul 30, 2009 12:06 pm

Re: [RFC] Post Revisions

Postby drathbun » Tue May 29, 2012 7:10 pm

How about word-based for everything but changes at the end of a word then? :P

Your example is an interesting extreme. For the most part, I would expect character based diff to better show spelling error corrections (pottery versus posterity), where a word-based diff will show how the thought process of the poster progressed as he or she fine-tuned the post.

I used the diff highlight functions from Wordpress when I played around with this for my board. They use array functions to split posts into words and compare the arrays, so character-based diff is not really possible. And I haven't done a lot of performance testing because as I said above, this particular feature should not be called very often. As a total percentage of page views during any given day, I would expect this to be less than 1%.
Sometimes you're the windshield, sometimes you're the bug.
User avatar
drathbun
Registered User
 
Posts: 72
Joined: Wed Feb 15, 2006 6:40 pm
Location: Texas

Re: [RFC] Post Revisions

Postby MichaelC » Tue May 29, 2012 9:48 pm

I personally prefer a line based diff but showing what has changed between the two lines.
I am a posterity person
I ask a pottery person
Unknown Bliss
psoTFX wrote:I went with Olympus because as I said to the teams ... "It's been one hell of a hill to climb"

No unsolicited PMs please except for quotes.
User avatar
MichaelC
Website Team
Website Team
 
Posts: 797
Joined: Thu Jan 28, 2010 6:29 pm

Previous Next

Return to [3.x] RFCs

Who is online

Users browsing this forum: Arty, Exabot [Bot], naim, paulus and 10 guests