[RFC] Post Revisions

Note: We are moving the topics of this forum and it will be deleted at some point

Publish your own request for comments/change or patches for the next version of phpBB. Discuss the contributions and proposals of others. Upcoming releases are 3.2/Rhea and 3.3.
Post Reply
User avatar
tbackoff
Registered User
Posts: 180
Joined: Sat Jun 12, 2010 3:25 am

Re: [RFC] Post Revisions

Post by tbackoff »

David,

I quickly glanced over your initial changes and noticed this:

Code: Select all

define('POST_REVISIONS_TABLE',        $table_previx . 'post_revisions'); 
That should be $table_prefix

User avatar
imkingdavid
Registered User
Posts: 1050
Joined: Thu Jul 30, 2009 12:06 pm

Re: [RFC] Post Revisions

Post by imkingdavid »

t_backoff wrote:David,

I quickly glanced over your initial changes and noticed this:

Code: Select all

define('POST_REVISIONS_TABLE',        $table_previx . 'post_revisions');
That should be $table_prefix
Good catch. Fixed.
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.

User avatar
imkingdavid
Registered User
Posts: 1050
Joined: Thu Jul 30, 2009 12:06 pm

Re: [RFC] Post Revisions

Post by imkingdavid »

For reference:

Code: Select all

<imkingdavid> hmm... the diff engine packaged with phpbb only shows changes per line, not per character. so if I were to write: "Post" and then change it to "Posting", it would show a line removed "Post" and a line added "Posting" instead of 3 characters added "ing"
<imkingdavid> is that preferrable or should I figure out how to do a per-character diff?
<bantu> imkingdavid: AFAIK diff is by definition line-based
<imkingdavid> bantu: perhaps, but the diff engine I referenced in my gsoc proposal, http://www.raymondhill.net/finediff/viewdiff-ex.php has options for per-character, per-word, per-sentence, and per paragraph/line
<bantu> ah
<imkingdavid> I think that a diff for code works best as per-line, but a diff for user-written content (i.e. prose) is better either per-word or per-character
<bantu> yes, that might be the case
<imkingdavid> I don't want to package two diff engines with phpBB, though, so if we were to use a per word or per character diff,  I would have to write it myself
<imkingdavid> which is okay if we want to do that
<-- stodan (~daniel@unaffiliated/stodan) has quit (Quit: Over & Out)
<bantu> imkingdavid: Have you thought about conflict resolution yet?
<imkingdavid> bantu: not yet. i'd need an example conflict to test
<imkingdavid> although...
<imkingdavid> it only ever compare two post revisions, so I'm not sure how there could be a conflict
<bantu> exactly
<bantu> so we just won't support branching off
<bantu> say two people editing at the same time
<bantu> this is fine I guess
<imkingdavid> revisions are based on posts, not other revisions. so if two people edit at the same time, two revisions will be made on the post, and the most recently finished one will be used
<imkingdavid> if the other one is decided to be better, they can swap them
<bantu> imkingdavid: they will usually want to merge them
<bantu> and the last editor will already get a warning about it
<imkingdavid> right
<bantu> I'm not sure about the overhead of using a character or word based approach for storage.
<imkingdavid> bantu: all revisions are stored in full text, nto as diffs
<imkingdavid> not*
<bantu> I'm pretty sure git does line-based and e.g. GitHub just highlights words again
<imkingdavid> right
<bantu> imkingdavid: so this is just about display
<imkingdavid> basically
<imkingdavid> when I change the word "Post" to "Posting" I only want to see "ing" in green, not "Post" in red and "Posting" in green
<imkingdavid> although, if we go word-based, it would be the latter
<imkingdavid> but the whole line wouldn't show as a change, just that word
<imkingdavid> which would be okay too
<bantu> line based is certainly bad for display
<bantu> I agree with that
<imkingdavid> alright, I'll work on either word or character based diff. which would you prefer?
<bantu> Especially since a post can be a single line that is very long.
<imkingdavid> right
<bantu> I don't think there would be a problem with packaging another diff engine.
<bantu> The current diff engine is however also used for doing actual merges.
<bantu> So it is kind of important that things actually work there.
<imkingdavid> right
<imkingdavid> i can include a new diff engine separately, since that would be easier than adding to the current engine
<imkingdavid> here's the code for that diff engine I referenced: https://github.com/gorhill/PHP-FineDiff
<-- rxu (~Miranda@phpbb/developer/rxu) has quit (Read error: Connection reset by peer)
<imkingdavid> i'll just change the class name to fit with naming guidelines and then make sure the code conforms to coding guidlines
<imkingdavid> or is that not required for 3rd party libs?
<bantu> it really is the other way around
<bantu> third party libraries should be used as is if possible
<imkingdavid> okay
<bantu> since it does support word-based and character-based diffing, it should not matter right now
<bantu> feel free to pick one
<imkingdavid> alright
TL;DR: Will be using FineDiff (MIT Licensed) for post revision diff, because the currently-packaged line-based diff is not visually appealing for prose diffs. Instead, will use either character- or word-based (defaults to character) diff.
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.

User avatar
bantu
3.0 Release Manager
3.0 Release Manager
Posts: 557
Joined: Thu Sep 07, 2006 11:22 am
Location: Karlsruhe, Germany
Contact:

Re: [RFC] Post Revisions

Post by bantu »

imkingdavid wrote:TL;DR: Will be using FineDiff (MIT Licensed) for post revision diff, because the currently-packaged line-based diff is not visually appealing for prose diffs. Instead, will use either character- or word-based (defaults to character) diff.
It turns out we cannot use FineDiff as it does not support UTF8 properly. Also, after closer review, it does not look very mature to me.
[01:45:24] <naderman> imkingdavid: please entirely ignore the diff engine used for updates, and just use a different one
[01:45:28] <naderman> oh he's gone
[01:45:33] <naderman> bantu: well you tell him that
[01:46:01] <bantu> naderman: that is what I told him
[01:46:20] <naderman> well not quite that clearly, but ok :P
[01:46:42] <naderman> and try and make sure whichever one you pick is proper PHP5 code
[01:49:00] <naderman> * Does it work with UTF-8?
[01:49:00] <naderman> As of now, the code assume single-byte characters. To use UTF-8 text, you can
[01:49:00] <naderman> always convert the encoding using mb_convert_encoding():
[01:49:07] <naderman> so the library he was looking at is not an option
[01:52:49] <bantu> naderman: interesting
[01:53:41] <bantu> I would also prefer if the library would be "more mature".
[01:53:48] <naderman> yeah
[01:53:59] <naderman> is he planning on doing snapshot or delta storage?
[01:55:48] <bantu> naderman: I think he doesn't want to do delta storage
[01:55:54] <naderman> k
[01:56:29] <bantu> this will cost quite a bit of storage then
[01:56:41] <naderman> yeah
[01:56:52] <naderman> but implementing a delta store on top of SQL sucks too
[01:56:54] <bantu> ideally, the new bbcode engine would be finished first
[01:57:00] <naderman> I wonder what mediawiki does
[01:57:03] <naderman> true
[01:57:19] <naderman> well if it's just about displaying diffs, I guess it doesn't matter that much
[01:57:23] <naderman> what matters is all the other stuff
[01:57:38] <naderman> I would actually recommend to him that he entirely ignores displaying diffs until the very end
[01:57:43] <naderman> and just displays them side by side for now
[01:58:04] <naderman> all the functionality around revisions is way more important
[01:58:20] <bantu> yeah, I agree

User avatar
imkingdavid
Registered User
Posts: 1050
Joined: Thu Jul 30, 2009 12:06 pm

Re: [RFC] Post Revisions

Post by imkingdavid »

Alright, I will keep an eye out for a good diff engine that can handle utf8, although I did look at quite a few over the weekend and there are really not that many diff engines out there. I may end up having to make something myself. Just to be sure, what exactly is needed to make the diff engine utf8-compatible?

Next question, I have thought a bit more over the weekend about storing as delta and calculating revisions at runtime, but I feel like that would be way too complicated. The differences would then have to be rewound and recalculated if any revision is deleted, and if a revision is made to an earlier revision, we would need to decide if the differences would be calculated from the most recent revision or the one that was currently in use when the post was edited. And there are probably even more issues that I am not thinking of off the top of my head.

So, storing the posts in full is probably the best way to do this. I have addressed the issues with this approach taking up storage space. But I do have a question: As it is right now on my branch, every post has at least 1 revision (i.e. the original content == the first revision) because otherwise it is impossible to allow changing between revisions without having to keep switching data around and potentially losing data if something goes wrong. However, that also means that there is data duplication: the *_posts table will always have the latest version of the post, because the edit functionality not only edits the post as is the original behavior, but also inserts a new revision containing the same information (note that I have modified viewtopic.php to always serve data from the revisions table instead of the posts table).

My question, therefore, is: because the revisions table has all of the post information that is able to be versioned, would it be okay to remove those fields from the posts table? Otherwise, each post will take up twice the space it currently does.
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.

User avatar
naderman
Consultant
Posts: 1727
Joined: Sun Jan 11, 2004 2:11 am
Location: Berlin, Germany
Contact:

Re: [RFC] Post Revisions

Post by naderman »

No, the posts table should continue to have all data available. The revisions table will contain a lot more data, making indexes etc. less efficient. We should keep the entire post data necessary to show a post local in a row in the posts table.

Instead you should only create a revision of the previous state when editing, rather than creating a revision of the current state. That way you only ever need an entry in the revision table if there has been at least one change.

drathbun
Registered User
Posts: 72
Joined: Wed Feb 15, 2006 6:40 pm
Location: Texas
Contact:

Re: [RFC] Post Revisions

Post by drathbun »

Random scattered thoughts...

I would prefer a character diff rather than word, for reasons already listed (post versus posting example given).

I don't see a problem with storing revisions as text and calculating the differences between subsequent post versions (as stated, a post with no edits has zero rows in the revision table). That seems to be more efficient than storing a single version of the text and storing revision instructions to derive the new text. It is far more likely to want the final text display (which would be on viewtopic or search results) and the post revision request is going to be a very infrequent request.

Indexing on the post revisions table does not have to be substantial, only an index on post_id and revision key. Even if the table gets large, I can't see a use case where you would look at post revisions for more than one post at a time.
Sometimes you're the windshield, sometimes you're the bug.

User avatar
imkingdavid
Registered User
Posts: 1050
Joined: Thu Jul 30, 2009 12:06 pm

Re: [RFC] Post Revisions

Post by imkingdavid »

drathbun wrote:Random scattered thoughts...

I would prefer a character diff rather than word, for reasons already listed (post versus posting example given).
I would as well, except that it often displays very oddly. So there would have to be some added logic to get it to display correctly. For instance, take the following two strings, which I fabricated entirely for this example, even though there are probably examples that better illustrate the issue:
I am a posterity person
I ask a pottery question
Here is the resulting character diff:
I amsk a postterity pquerstion
Instead of the actual changes I made (word diff):
I amask a posteritypottery personquestion
As I said, not the best example, but I think it illustrates how often times coincidental similarities between strings can make the diff appear more confusing than a word-based diff.

That being said I think word based and character based are both better than line based:
I am a posterity person
I ask a pottery question
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.

drathbun
Registered User
Posts: 72
Joined: Wed Feb 15, 2006 6:40 pm
Location: Texas
Contact:

Re: [RFC] Post Revisions

Post by drathbun »

How about word-based for everything but changes at the end of a word then? :P

Your example is an interesting extreme. For the most part, I would expect character based diff to better show spelling error corrections (pottery versus posterity), where a word-based diff will show how the thought process of the poster progressed as he or she fine-tuned the post.

I used the diff highlight functions from Wordpress when I played around with this for my board. They use array functions to split posts into words and compare the arrays, so character-based diff is not really possible. And I haven't done a lot of performance testing because as I said above, this particular feature should not be called very often. As a total percentage of page views during any given day, I would expect this to be less than 1%.
Sometimes you're the windshield, sometimes you're the bug.

User avatar
MichaelC
Development Team
Development Team
Posts: 889
Joined: Thu Jan 28, 2010 6:29 pm

Re: [RFC] Post Revisions

Post by MichaelC »

I personally prefer a line based diff but showing what has changed between the two lines.
I am a posterity person
I ask a pottery person
Formerly known as Unknown Bliss
psoTFX wrote: I went with Olympus because as I said to the teams ... "It's been one hell of a hill to climb"
No unsolicited PMs please except for quotes.

Post Reply