Elimination of duplicate posts (continuation from 13153)
MegaZ
06-01-2003, 01:28 PM
Delta,
Quote: "1. If topics are identical to other ones, MegaZ should be able to take care of it by installing the new duplicate remover modules for this phpBB."
First of all, there is no duplicate remover module. Even if I wrote one myself, it would be practically impossible, because we're talking about comparing the subject and content of ALL forum messages before a post is accepted. This is neither efficient nor logical. Think about the database server - it would probably end up eating all the system resources, resulting in a DoS.
Quote: "2. Cursing is already taken care of: curse words simply need to be entered, and the forum itself replaces them with characters or eliminates them."
Hmm... are we talking about entering bad words in 3-4 different languages? What do you define as "bad"? If I see that the word "f*ck" is filtered (and it is, as you can see - "u" is replaced by a "*"), I don't have to type it in exactly the same way. I can write "f_uck" or "f uck", thus bypassing all of the filtering.
Plus, the more bad words there are, the more loaded the server gets. Think about all the string parsing that has to occur before a post is shown. Again, another load on the server!
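A minimal Python sketch of the bypass described above. It assumes a filter that masks whole-word matches from a list; the word list (`darn`) and the function name `censor` are made up for illustration and are not phpBB's actual censor code.

```python
import re

# Hypothetical word list; a real board would load this from its settings.
BAD_WORDS = ["darn"]

def censor(text: str) -> str:
    """Mask each listed word with asterisks (whole-word, case-insensitive)."""
    for word in BAD_WORDS:
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        text = pattern.sub("*" * len(word), text)
    return text

print(censor("well darn it"))   # well **** it
print(censor("well d_arn it"))  # well d_arn it  (inserted character slips through)
print(censor("well d arn it"))  # well d arn it  (so does an inserted space)
```

Every extra entry in the list adds one more pattern to run over each post, which is the per-word cost mentioned above.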
Delta
06-01-2003, 05:39 PM
Well, Mr. MegaZ... if you are familiar with PHP (I bet you are), you could have written that module with no problem. You could make it
so it compares messages from the last 24 hours (sort by date and compare; I believe the SQL part already does that). Comparing a couple of KB on the box once a day won't take any resources.
Or, I have seen similar modules written for other forums. All you have to do is port one, which I've noticed you do a pretty good job of :)
When it comes to cursing, I have seen IRC bots do a pretty good job of filtering: they simply ignore special characters except the space character. They also have a special case that combines pairs of adjacent words and checks the result, so "f uck" cases get eliminated too.
Or logical tokenizer functions can be applied, if you have had a chance to use them.
Nothing is impossible, pal.
MegaZ
06-02-2003, 10:05 PM
Quote: "Well, Mr. MegaZ... if you are familiar with PHP (I bet you are), you could have written that module with no problem. You could make it so it compares messages from the last 24 hours (sort by date and compare; I believe the SQL part already does that). Comparing a couple of KB on the box once a day won't take any resources. Or, I have seen similar modules written for other forums. All you have to do is port one, which I've noticed you do a pretty good job of."
I don't know if we are misunderstanding each other, but by "duplicate posts" I mean topics that are posted by the same user, with the same subject and message. Since I'd have to insert another SELECT statement before the information gets written to the database, it would add a considerable load to the server. On top of that, I'd have to select the username, subject and message for the last 24 hours and compare each row to the entered data.
Here is what happens:
[code:1]SELECT A.poster_id AS poster_id,
       A.post_id AS post_id,
       A.post_username AS username,
       FROM_UNIXTIME(A.post_time) AS date,
       B.post_subject AS subject,
       B.post_text AS text
FROM phpbb_posts AS A
JOIN phpbb_posts_text AS B ON A.post_id = B.post_id
WHERE FROM_UNIXTIME(A.post_time) > DATE_SUB(NOW(), INTERVAL 24 HOUR);

191 rows in set (2.45 sec)[/code:1]
2.45 seconds just for posting a message?
And that's just the database call. Now think of all the messages being compared one by one with the entered username/subject/message. When the board gets busy, PHP will die because of memory limits, causing the Apache process to crash! Of course, I could limit the SELECT statement to, let's say, the last 200 messages, but I really don't see the point.
Thanks, but no thanks...
Quote: "When it comes to cursing, I have seen IRC bots do a pretty good job of filtering: they simply ignore special characters except the space character. They also have a special case that combines pairs of adjacent words and checks the result, so "f uck" cases get eliminated too. Or logical tokenizer functions can be applied, if you have had a chance to use them. Nothing is impossible, pal."
IRC bots parse strings line by line and on the fly. We're talking about parsing humongous strings every time someone browses the forum. In order to solve problems with strings like "f uck", I'd have to remove all spaces and other characters and leave only the letters (A-Z, a-z). This could easily be done with a regexp, but again - at the cost of efficiency and server load. That means every message would be converted to one very long string, and I'd have to look for occurrences of bad words in those strings. Then I'd need to put back the characters I removed to retain the formatting. Oops... problematic, huh? How do I know where to place a character like a space? So the whole thing gets messy. Thiswouldnotbecool :-) At least this is the only solution that comes to my mind. If you have better ideas, let me know.
P.S. I'm not saying that it's impossible - it's just inefficient.
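One possible way around the "put the characters back" problem, sketched in Python: instead of rebuilding the message, remember the original position of every kept letter and mask the message in place. This is an illustration only, not something proposed in the thread; the word list and the name `censor_obfuscated` are hypothetical, and only ASCII letters are considered.

```python
# Hypothetical word list, lower-case.
BAD_WORDS = ["darn"]

def censor_obfuscated(text: str) -> str:
    """Mask listed words even when other characters are inserted between letters."""
    letters = []    # ASCII letters only, lower-cased
    positions = []  # positions[i] = index in `text` of letters[i]
    for index, ch in enumerate(text):
        if ch.isascii() and ch.isalpha():
            letters.append(ch.lower())
            positions.append(index)
    stripped = "".join(letters)

    out = list(text)
    for word in BAD_WORDS:
        start = stripped.find(word)
        while start != -1:
            # Mask the original characters; everything else stays where it was.
            for k in range(start, start + len(word)):
                out[positions[k]] = "*"
            start = stripped.find(word, start + 1)
    return "".join(out)

print(censor_obfuscated("well d_arn it"))  # well *_*** it
print(censor_obfuscated("mad arnold"))     # ma* ***old  (false positive)
```

The second call shows the mess MegaZ is worried about: once spaces are ignored, harmless neighbouring words can match, and the whole message still has to be scanned for every listed word.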
Delta
06-03-2003, 11:58 AM
Mega, here is a simple way to do it:
1. MySQL procedure/query:
select the last 500 posts
write them to a text file
// or you could use PHP functions to read and write from the DB.
2. From PHP/C, open the text file,
remove duplicates, /* there are many fairly fast algorithms. */
close it.
3. Read the result back into the DB.
Since all the comparison happens on the local machine, your Apache/MySQL should not have any problem, except for the SQL read/write operations, which should not take too many resources.
~cheers.
P.S. Kiyovtora - say what you want, forum age > 6.
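Step 2 of the batch approach above could look roughly like this in Python. It is a sketch under assumptions: rows are taken to be `(post_id, poster_id, subject, text)` tuples already read from the exported file, ordered oldest first, and the function name `find_duplicates` is made up.

```python
def find_duplicates(rows):
    """Return the post_ids that repeat an earlier (poster, subject, text) triple."""
    seen = set()
    duplicates = []
    for post_id, poster_id, subject, text in rows:
        key = (poster_id, subject, text)
        if key in seen:
            duplicates.append(post_id)  # a later copy; candidate for deletion
        else:
            seen.add(key)               # first occurrence is kept
    return duplicates

# Made-up sample data for illustration.
sample = [
    (1, 7, "Uzbekistan", "hello"),
    (2, 7, "Uzbekistan", "hello"),   # same user, subject and text as post 1
    (3, 8, "Uzbekistan", "hello"),   # different user, so not a duplicate
    (4, 7, "Uzbekistan", "hello!"),  # text differs by one character
]
print(find_duplicates(sample))  # [2]
```

A set lookup makes the pass linear in the number of rows, so the comparison itself is cheap; the open question in the thread is the cost of reading the rows out of the database in the first place.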
MegaZ
06-03-2003, 02:57 PM
Delta, if querying 191 posts took 2.45 seconds, imagine what's going to happen when selecting the last 500 posts! What's the point of writing to a file? I don't get it...
Are you proposing to do it once, or every time someone tries to post a message?
Aleteya, I completely agree with you!
Delta_wokeup
06-03-2003, 04:13 PM
Dude, let's forget all of the above...
Solution: SELECT UNIQUE from Field
Just look up the keyword "unique".
deltagoingbacktosleep
06-03-2003, 04:14 PM
MegaZ, I forgot to say: that is what a database is for. It is designed to perform operations like that.
Delta
06-03-2003, 04:44 PM
Forgot to say something again... sleepy me: DISTINCT would do the same thing, but faster.
Something like "select distinct * from mytable ..."; it uses the new unique functionality.
The best way would be this: whenever it creates an entry or adds data, if CREATE comes with DISTINCT, it would prevent duplicates every time someone tries to add one and give an error when that happens. Since everything is indexed in the DB, it has got to perform that operation fairly fast, since that is what it's designed for... Magic :P
You can also set up PHP so it catches SQL errors and prints them to the screen.
~cheers.
MegaZ
06-03-2003, 08:04 PM
Nope, "select distinct" would be no good either.
What's the use of it? It will only select distinct posts, which is of no use in this particular situation.
The only way to prevent duplicates in the forum is to do the following:
1) Insert a SELECT statement before an INSERT occurs.
2) Compare the results of the SELECT with the post values.
3) If a db field matches the post values, print a "duplicate message" error.
4) Else insert the post values into the database.
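The four steps above can be sketched in Python with an in-memory SQLite database. The `posts` table here is a simplified stand-in invented for the example, not phpBB's real schema, and the 24-hour window is borrowed from earlier in the thread.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; phpBB splits this data across several tables.
conn.execute(
    "CREATE TABLE posts (post_id INTEGER PRIMARY KEY, poster_id INTEGER,"
    " subject TEXT, body TEXT, post_time INTEGER)"
)

def try_post(poster_id, subject, body, now=None):
    now = int(time.time()) if now is None else now
    cutoff = now - 24 * 3600
    # 1) SELECT before the INSERT, limited to the last 24 hours.
    rows = conn.execute(
        "SELECT poster_id, subject, body FROM posts WHERE post_time > ?",
        (cutoff,),
    ).fetchall()
    # 2) + 3) Compare the fetched rows with the submitted values.
    if (poster_id, subject, body) in rows:
        return "duplicate message"
    # 4) Otherwise store the post.
    conn.execute(
        "INSERT INTO posts (poster_id, subject, body, post_time)"
        " VALUES (?, ?, ?, ?)",
        (poster_id, subject, body, now),
    )
    return "posted"

print(try_post(1, "Uzbekistan", "hello", now=1_000_000))  # posted
print(try_post(1, "Uzbekistan", "hello", now=1_000_060))  # duplicate message
```

Every call pays for the SELECT and for pulling all recent rows into the application, whether or not the post turns out to be a duplicate.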
The major problem comes with #1. Whether I give a SELECT or SELECT DISTINCT doesn't matter - the database call to retrieve the last 200 or whatever messages is expensive.
Having a unique field in the database is obviously not an option - you wouldn't be able to have just "ok" twice as a reply - ever. The same thing applies to the subject. For example, there were 4 messages with the subject "Uzbekistan", all posted on different dates. If I make the subject column "unique", nobody will ever be able to create a second topic with the subject "Uzbekistan".
Quote: "The best way would be this: whenever it creates an entry or adds data, if CREATE comes with DISTINCT, it would prevent duplicates every time someone tries to add one and give an error when that happens."
CREATE is for creating tables, not inserting values. CREATE never happens on phpBB, because the tables are already there - it only inserts, modifies and deletes values. AFAIK, DISTINCT can only be used with a SELECT in SQL.
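Both objections can be checked with a small Python sketch. It uses SQLite rather than MySQL, and the `replies` and `topics` tables are invented for the example: DISTINCT only changes what a SELECT returns, while a UNIQUE column blocks every later topic with the same subject.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DISTINCT: duplicates are still stored; only the query result is collapsed.
conn.execute("CREATE TABLE replies (reply_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO replies (body) VALUES (?)", [("ok",), ("ok",), ("thanks",)]
)
stored = conn.execute("SELECT COUNT(*) FROM replies").fetchone()[0]
distinct = [
    row[0]
    for row in conn.execute("SELECT DISTINCT body FROM replies ORDER BY body")
]
print(stored, distinct)  # 3 ['ok', 'thanks']

# UNIQUE on the subject column: the second "Uzbekistan" topic is refused.
conn.execute(
    "CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, subject TEXT UNIQUE)"
)
conn.execute("INSERT INTO topics (subject) VALUES ('Uzbekistan')")
try:
    conn.execute("INSERT INTO topics (subject) VALUES ('Uzbekistan')")
    second_topic_allowed = True
except sqlite3.IntegrityError:
    second_topic_allowed = False
print(second_topic_allowed)  # False
```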
CrazyDT
06-03-2003, 11:58 PM
MegaZ, what you could also do is:
1) Create a table with at least 2 columns that stores a hash of each post together with the user ID.
2) Each time someone tries to insert a new post, take a hash of the post and check it against the hash table. Since a hash contains less information than the post and does not guarantee an exact match, you can further verify against the original post message. If it matches, you'll know what to do.
This method would probably be the most efficient.
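A rough Python sketch of this hash-table idea, using SQLite in memory. The table and column names are invented, SHA-256 is an arbitrary choice of hash (the thread does not name one), and the extra `post_id` column is an assumption added so the original message can be fetched for the final verification.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: the posts themselves, plus a small lookup table of hashes.
conn.execute(
    "CREATE TABLE posts (post_id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)
conn.execute(
    "CREATE TABLE post_hashes (user_id INTEGER, body_hash TEXT, post_id INTEGER)"
)
conn.execute("CREATE INDEX idx_post_hashes ON post_hashes (user_id, body_hash)")

def body_hash(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def submit(user_id: int, body: str) -> bool:
    """Store the post and return True, or return False if it is a duplicate."""
    digest = body_hash(body)
    candidates = conn.execute(
        "SELECT p.body FROM post_hashes AS ph"
        " JOIN posts AS p ON p.post_id = ph.post_id"
        " WHERE ph.user_id = ? AND ph.body_hash = ?",
        (user_id, digest),
    ).fetchall()
    # A hash match is only a hint; confirm against the stored message text.
    if any(stored == body for (stored,) in candidates):
        return False
    cur = conn.execute(
        "INSERT INTO posts (user_id, body) VALUES (?, ?)", (user_id, body)
    )
    conn.execute(
        "INSERT INTO post_hashes VALUES (?, ?, ?)", (user_id, digest, cur.lastrowid)
    )
    return True
```

The lookup compares short fixed-length strings through an index instead of scanning full message bodies, which is where the saving would come from.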
Delta
06-04-2003, 12:55 AM
People!! SQL is designed for doing operations like that...
As you were saying before, I think we are misunderstanding each other:
it's not true that you can use the unique/distinct feature in SELECT only.
I meant that when you CREATE, you can add a UNIQUE option, which will take care of the specific fields you put there. And the DB should be indexed.
Something like: CREATE UNIQUE INDEX firstname_lastname ON people (firstname,lastname);
Or it can be done as a field:
CREATE TABLE tablename (
rest of columns,
UNIQUE [index] (indexcolumn)
.......
);
Quote: "CREATE is for creating tables, not inserting values. CREATE never happens on phpBB, because the tables are already there - it only inserts, modifies and deletes values. AFAIK, DISTINCT can only be used with a SELECT in SQL."
I know what CREATE does. phpBB is poorly designed, since it never natively checks for duplicates.
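The unique index from the example above, tried out in Python against an in-memory SQLite database (not MySQL, so error messages will differ). The `people` table and index name come from the post; the helper `add_person` and the sample names are made up. It also shows the application catching the database error, as suggested earlier for PHP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (firstname TEXT, lastname TEXT)")
conn.execute(
    "CREATE UNIQUE INDEX firstname_lastname ON people (firstname, lastname)"
)

def add_person(firstname: str, lastname: str) -> str:
    try:
        conn.execute("INSERT INTO people VALUES (?, ?)", (firstname, lastname))
        return "inserted"
    except sqlite3.IntegrityError as exc:
        # The database refuses the row; the application only reports it.
        return "rejected: " + str(exc)

print(add_person("Ali", "Valiev"))   # inserted
print(add_person("Ali", "Valiev"))   # rejected: ... (exact text depends on the database)
print(add_person("Ali", "Karimov"))  # inserted (only the pair must be unique)
```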
CrazyDT
06-04-2003, 02:16 AM
Delta, using the database to weed out duplicate messages will add huge overhead, because a message body can be very lengthy. Each time you post, it will take a long time to go through every row (currently 106420 of them), with an average length of up to 2000 bytes. That is a lot! Cost/benefit > 1. This approach is not worth considering.
MegaZ
06-04-2003, 03:42 AM
CrazyDT, I like the idea with hashing posts & user_id...but is there a chance two posts get the same hash? I'm sure it's very unlikely, but you never know ;-)
Actually, I have an idea. Based on your proposal, here is the way I am thinking about doing it.
1) Create a table with four fields: user_id (int5), subject (int32), message (int32), post_date (datetime). None of the fields should be unique. Index post_date (since I'll be querying the last 24 hours).
2) Before a post gets written into the database, perform the following:
a. DELETE everything from the table where post_date is older than 24 hours.
b. SELECT everything from the table where post_date is within the last 24 hours.
c. If the user_id, hashed subject and hashed message of the post match what we have in the table, output a "duplicate message" error.
d. Else, INSERT user_id, hashed subject, hashed message and post_date into the table.
3) INSERT the post data into the database.
This way, the table size will be pretty small and the additional load won't be as bad as doing a regular query.
Any other thoughts?
Delta,
Quote: "Something like: CREATE UNIQUE INDEX firstname_lastname ON people (firstname,lastname); Or it can be done as a field."
Creating another index is not an option. Since the database is already quite large, it would use additional storage (hard disk space). Indexes are fast for querying but slow down inserts and updates. We don't want it any slower...
I think CrazyDT's idea is much better in terms of server load and performance.
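The plan in steps 1 to 3 above could be sketched like this in Python with SQLite. Several things here are assumptions rather than MegaZ's design: hashes are stored as hex text from SHA-256, timestamps are plain integers in seconds, the names `recent_posts` and `check_and_record` are invented, and the match in step c is done inside the WHERE clause instead of fetching every row and comparing in the application.

```python
import hashlib
import sqlite3

DAY = 24 * 3600
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_posts"
    " (user_id INTEGER, subject TEXT, message TEXT, post_date INTEGER)"
)
conn.execute("CREATE INDEX idx_recent_date ON recent_posts (post_date)")

def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def check_and_record(user_id: int, subject: str, message: str, now: int) -> str:
    # a. Drop entries older than 24 hours so the table stays small.
    conn.execute("DELETE FROM recent_posts WHERE post_date <= ?", (now - DAY,))
    # b. + c. Look for the same user with the same hashed subject and message.
    hit = conn.execute(
        "SELECT 1 FROM recent_posts"
        " WHERE user_id = ? AND subject = ? AND message = ? LIMIT 1",
        (user_id, digest(subject), digest(message)),
    ).fetchone()
    if hit:
        return "duplicate message"
    # d. Remember this post's hashes.
    conn.execute(
        "INSERT INTO recent_posts VALUES (?, ?, ?, ?)",
        (user_id, digest(subject), digest(message), now),
    )
    return "ok"  # step 3, the real INSERT of the post, is left to the caller

print(check_and_record(1, "Uzbekistan", "hello", now=0))        # ok
print(check_and_record(1, "Uzbekistan", "hello", now=100))      # duplicate message
print(check_and_record(1, "Uzbekistan", "hello", now=DAY + 1))  # ok
```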
CrazyDT
06-04-2003, 07:46 PM
I've decided to continue this in a separate topic from
http://www.forum.uz/viewtopic.php?t=13153 because that thread is growing rapidly, and it's propagating across the forum since it's a global message.
A couple of additional comments on your plan.
Quote: "CrazyDT, I like the idea with hashing posts & user_id...but is there a chance two posts get the same hash? I'm sure it's very unlikely, but you never know"
To be 100% sure, you can perform an additional check against the original post.
Also, I'd like to make sure that we're trying to eliminate duplicate topics, not posts/replies to a topic, because quite often someone replies with just a single word or expression such as ':D' or 'cool'.
noodles
06-04-2003, 08:05 PM
What are you going to do with posts that have the same meaning but are worded differently? Again, you'll have to delete them yourself... :P
MegaZ
06-05-2003, 02:24 PM
CrazyDT, there is one potential problem with duplicate post removal.
If a user finds out that an exact message will not be posted, he/she can add a simple "." or some other character like space. This will screw up the hash value and the post will be allowed to be written.
Max has noticed the problem as well.
If we're talking about filtering out new threads only, then there is no point in doing the duplicate posts thing :-) Most of the time I notice similar/duplicate posts and get rid of those myself (which happens very rarely).
So, the question is - is there a point in preventing duplicate posts at all? The accuracy of duplicate posts filter is going to be very low, considering the fact that people will find out whether they submitted a duplicate post. So, if I were the user and I got a message saying "duplicate post", I would simply change my username or add an additional character to my post, thus making it "not duplicate" technically. Floods will keep on coming, no matter how we fight against those.
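The evasion described above is easy to reproduce. A short Python sketch, with SHA-256 standing in for whichever hash the board would use and a made-up message:

```python
import hashlib

def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

original = "Buy cheap tickets here"
evasive = original + "."  # one extra character, as described above

print(digest(original) == digest(original))  # True: an exact repost is caught
print(digest(original) == digest(evasive))   # False: the modified repost is not
```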
One way to fight against flood is to increase the number of seconds a user is supposed to spend before he/she can post another message. This time is already 30 seconds, but I could increase that to something like 2 minutes. But most participants won't like the idea, because they will not be able to reply to a message quickly.
Another way is to force everyone to register. But that will definitely decrease the number of forum participants. According to the small survey I conducted previously, for some reason people think that when they register, their presence is no longer anonymous (although that's not true). And I'll have to take the "hit" again about censorship, etc.
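The flood interval mentioned above comes down to one comparison per post. A minimal Python sketch, assuming an in-memory dictionary keyed by user or IP address; the names `may_post` and `last_post_time` are invented, and phpBB keeps this state in its database rather than in memory.

```python
FLOOD_INTERVAL = 30  # seconds; the current value mentioned above

last_post_time = {}  # hypothetical store: user or IP -> time of last accepted post

def may_post(user_key: str, now: int, interval: int = FLOOD_INTERVAL) -> bool:
    last = last_post_time.get(user_key)
    if last is not None and now - last < interval:
        return False  # too soon after the previous post
    last_post_time[user_key] = now
    return True

print(may_post("guest-1", now=0))   # True
print(may_post("guest-1", now=10))  # False: only 10 seconds have passed
print(may_post("guest-1", now=30))  # True
```

Raising `interval` to 120 would implement the 2-minute variant, with the drawback for quick replies already noted.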
MegaZ
06-05-2003, 04:48 PM
CrazyDT, I moved the posts from topic "13153" here. The messages were off-topic there anyway.
CrazyDT
06-06-2003, 12:30 AM
MegaZ, you've got good points there. This feature isn't useful enough because it can't be made accurate enough. Instead of spending time on this enhancement, you might as well spend it implementing a cool new feature.
MegaZ
06-06-2003, 01:42 AM
CrazyDT, exactly :-) I'm working on a new version of music.arbuz.com :-)
noodles
06-06-2003, 10:15 AM
MegaZ,
Quote: "One way to fight against flood is to increase the number of seconds a user is supposed to spend before he/she can post another message. This time is already 30 seconds, but I could increase that to something like 2 minutes. But most participants won't like the idea, because they will not be able to reply to a message quickly."
Good point! :P I wouldn't like it.
Quote: "Another way is to force everyone to register. But that will definitely decrease the number of forum participants. According to the small survey I have conducted previously, for some reason people think that when they register, their presence is no longer anonymous (although it's not true)."
I guess that with only the essential fields that must be filled in when registering, it's barely possible to learn anything specific about the user. So I'd like to know the opinion of other users on this issue. You could create a poll. :P
Good luck ;)
Dunyokezganqalandar
12-02-2003, 02:57 PM
In general, I think few people from Uzbekistan visit forum.uz. I, for my part, am trying to tell all my friends about it,
but it's still few.
noodles
12-05-2003, 12:56 PM
It's been almost 6 months since my last post in this topic, with no reply in all that time. So, people, you don't give a damn about your so-called "privacy" =) So let's make this forum for registered users only, then =))
spoon
12-05-2003, 02:42 PM
MegaZ,
Quote: "One way to fight against flood is to increase the number of seconds a user is supposed to spend before he/she can post another message. This time is already 30 seconds, but I could increase that to something like 2 minutes. But most participants won't like the idea, because they will not be able to reply to a message quickly."
There is a flaw in the flood protection system. After you've posted something, if you want to edit it right afterwards, it won't let you.
So you either have to copy the post and delete it, or click on another topic, then come back and do your editing.
Dunyokezganqalandar
12-05-2003, 04:46 PM
MegaZ!
I have a suggestion: before deleting someone's post, could you send a warning e-mail?
Then that person can log in and edit the post to fix it.
I want to see responses to my posts.