[RFC] stop distributing worthless CAPTCHAS in 3.1

callumacrae
Former Team Member
Posts: 1046
Joined: Tue Apr 27, 2010 9:37 am
Location: England

Re: [RFC] stop distributing worthless CAPTCHAS in 3.1

Post by callumacrae »

DavidIQ wrote: And as was mentioned there and I'll repeat here, assigning default Q&A answers from a set of questions and answers is going to be very easily breakable, more so than the current CAPTCHA that comes on by default. No matter how randomly you select the question/answer pair, the list would still be available to spam bot developers, at which point it would be a matter of adding them to their bots. At least with an image CAPTCHA the bots have to actually do some work to solve it. In your proposal we're just giving them the answers.

I'll let the dev team decide if this should be merged to that other topic or not.
They can be generated though, right?
Made by developers, for developers!

DavidIQ
Customisations Team Leader
Posts: 1904
Joined: Thu Mar 02, 2006 4:29 pm
Location: Earth

Post by DavidIQ »

The questions and answers? Maybe...but how do you propose these be auto-generated without providing a list of questions and answers?

Pony99CA
Registered User
Posts: 986
Joined: Sun Feb 08, 2009 2:35 am
Location: Hollister, CA

Post by Pony99CA »

DavidIQ wrote:
Pony99CA wrote:I agree with Tabitha and Steve. However, I'm not sure why this topic exists -- I suggested removing useless CAPTCHAs in the topic that Tabitha linked to (and Steve had posted in, too ;)).

I even proposed a scheme to generate default questions and answers so that no extra user step was required at installation.
And as was mentioned there and I'll repeat here, assigning default Q&A answers from a set of questions and answers is going to be very easily breakable, more so than the current CAPTCHA that comes on by default. No matter how randomly you select the question/answer pair, the list would still be available to spam bot developers, at which point it would be a matter of adding them to their bots. At least with an image CAPTCHA the bots have to actually do some work to solve it. In your proposal we're just giving them the answers.
You've misunderstood my proposal, I think; Callum was on the right track.

There is no "list of questions" or "question/answer pairs" in my proposal; there is a list of question templates from which a template is selected at random and then a new question is generated at random using that template.

For example, here's a template:

Code: Select all

Type characters #<x>, #<y> and #<z> from the following string:  <s>
<x>, <y> and <z> would be randomly generated, then a string of at least max(<x>, <y>, <z>) characters would be generated, and the answer would be created by picking those characters from the string.
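As a rough illustration of the template above, here is a minimal sketch of how one question/answer pair might be generated. It assumes PHP 7 or later; the function name, the alphabet, and the `$max_position` parameter are made up for this example and are not part of phpBB or of the proposal itself.

```php
<?php
// Hypothetical generator for the "Type characters #<x>, #<y> and #<z>" template.
// Function and parameter names are illustrative, not part of phpBB.
// $max_position must be at least 3, otherwise three distinct positions cannot be found.
function generate_position_question(int $max_position = 10): array
{
    // Pick three distinct 1-based positions; array keys keep them unique.
    $picked = array();
    while (count($picked) < 3) {
        $picked[random_int(1, $max_position)] = true;
    }
    $positions = array_keys($picked);
    sort($positions);

    // Build a random string with at least max(<x>, <y>, <z>) characters.
    $alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789';
    $length = max($positions) + random_int(0, 5);
    $s = '';
    for ($i = 0; $i < $length; $i++) {
        $s .= $alphabet[random_int(0, strlen($alphabet) - 1)];
    }

    // The answer is the characters at those positions (position 1 is index 0).
    $answer = '';
    foreach ($positions as $position) {
        $answer .= $s[$position - 1];
    }

    $question = sprintf(
        'Type characters #%d, #%d and #%d from the following string:  %s',
        $positions[0], $positions[1], $positions[2], $s
    );

    return array('question' => $question, 'answer' => $answer);
}
```

Calling `generate_position_question()` returns an array with a `question` to display and the `answer` to compare against; both differ on every call, so there is no fixed list to hand to a bot.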

Also, as noted, there would be other default templates, such as:

Code: Select all

In string <s>, type the <ul>-cased characters.  (<ul> is "upper" or "lower")
Type the <an> characters in the following string: <s>  (<an> is "alphabetic" or "numeric")
In string <s>, type the <oe>-numbered characters.  (<oe> is "odd" or "even")
Please tell us how a bot would know the answers to these questions without actually reading the question similarly to how a human would? These are exactly the types of questions admins are being encouraged to use in the support area of phpBB.com, so if a bot could correctly parse these, admins would be getting a lot more spam than they are.
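To make the other templates a little more concrete, here is a sketch of how the expected answer for each one could be computed once <s> and the option have been chosen. The function names are hypothetical, the string is assumed to be plain ASCII, and "odd"/"even" is assumed to count positions from 1.

```php
<?php
// Hypothetical answer functions for the three extra templates (ASCII strings assumed).

// "<ul>-cased characters": $case is 'upper' or 'lower'.
function answer_by_case(string $s, string $case): string
{
    $pattern = ($case === 'upper') ? '/[^A-Z]/' : '/[^a-z]/';
    return preg_replace($pattern, '', $s); // drop everything that is not the wanted case
}

// "<an> characters": $kind is 'alphabetic' or 'numeric'.
function answer_by_kind(string $s, string $kind): string
{
    $pattern = ($kind === 'alphabetic') ? '/[^A-Za-z]/' : '/[^0-9]/';
    return preg_replace($pattern, '', $s);
}

// "<oe>-numbered characters": $parity is 'odd' or 'even', counting positions from 1.
function answer_by_parity(string $s, string $parity): string
{
    $answer = '';
    for ($i = 0; $i < strlen($s); $i++) {
        $position = $i + 1;
        if (($position % 2 === 1) === ($parity === 'odd')) {
            $answer .= $s[$i];
        }
    }
    return $answer;
}
```

For the string `aB3cD`, the "upper" template would expect `BD` and the "numeric" template would expect `3`.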

Steve
Silicon Valley Pocket PC (http://www.svpocketpc.com)
Creator of manage_bots and spoof_user (ask me)
Need hosting for a small forum with full cPanel & MySQL access? Contact me or PM me.

Dog Cow
Registered User
Posts: 271
Joined: Wed May 25, 2005 2:14 pm

Post by Dog Cow »

I think that we should ask the administrators to provide one or more knowledge-based questions.

All of these string- and mathematics-based, automatically generated questions are no good in my opinion, because I think that they would require more brain-power from the user than a knowledge-based question (asking for a fact) would. If the forum asks for a fact, then that is simple memory recall for most humans.

Modern-day Istanbul was renamed after the Turks conquered it. What was its previous name?

imkingdavid
Registered User
Posts: 1050
Joined: Thu Jul 30, 2009 12:06 pm

Post by imkingdavid »

Dog Cow wrote:I think that we should ask the administrators to provide one or more knowledge-based questions.

All of these string- and mathematics-based, automatically generated questions are no good in my opinion, because I think that they would require more brain-power from the user than a knowledge-based question (asking for a fact) would. If the forum asks for a fact, then that is simple memory recall for most humans.

Modern-day Istanbul was renamed after the Turks conquered it. What was its previous name?
It is also a simple thing to search on Google. If a bot did a Google search, crawled the results, and found the most common word in them (except for things like "the" and "and"), it would not take much to come up with Constantinople. If not on the first try, it likely wouldn't take too long.

What I'm worried about is that once we have some templates set, a bot could just as easily parse the templates to extract the information it needs to find, the whole string from which to find it, and eventually the correct answer.
I do custom MODs. PM for a quote!
View My: MODs | Portfolio
Please do NOT contact for support via PM or email.
Remember, the enemy's gate is down.

Post by Pony99CA »

imkingdavid wrote:
Dog Cow wrote:I think that we should ask the administrators to provide one or more knowledge-based questions.

All of these string- and mathematics-based, automatically generated questions are no good in my opinion, because I think that they would require more brain-power from the user than a knowledge-based question (asking for a fact) would. If the forum asks for a fact, then that is simple memory recall for most humans.

Modern-day Istanbul was renamed after the Turks conquered it. What was its previous name?
It is also a simple thing to search on Google. If a bot did a Google search, crawled the results, and found the most common word in them (except for things like "the" and "and"), it would not take much to come up with Constantinople. If not on the first try, it likely wouldn't take too long.
It's simple for humans to do -- not necessarily for a bot. In the example given, the first answer (not including the Google "including results for" links) said "Fall of Constantinople - Wikipedia, the free encyclopedia". How would a bot know whether the answer was "fall" or "Constantinople" (or even "Wikipedia", "free" or "encyclopedia" for that matter)? While they might be able to weed out those last three, they'd still have to pick from two. And if the bot looked at more than the first result, there's also "Byzantium", "Hagia Sophia" and so on in the first page.

See my attempt at debunking the claim that bots can use Google. That used a question (picked at random, by the way) whose answer didn't happen to appear in the first two Google results. I think a bot could more easily parse the questions that we're telling people to use than parse Google's results.

Maybe a bot could attempt every word or phrase on the page, but wouldn't they get locked out trying that?
imkingdavid wrote: What I'm worried about is that once we have some templates set, a bot could just as easily parse the templates to extract the information it needs to find, the whole string from which to find it, and eventually the correct answer.
Sure, they might be able to do that -- but then they could do that now, because those are the types of questions being recommended by the support staff.

Plus, I still think that it would be more difficult than the already broken CAPTCHAs phpBB currently ships by default. They only make it harder for humans without slowing bots down much, so we should get rid of them.

And remember, if somebody kept a default question in my proposal, they'd get a nag screen warning them to change the question.

Post by imkingdavid »

Pony99CA wrote:It's simple for humans to do -- not necessarily for a bot. In the example given, the first answer (not including the Google "including results for" links) said "Fall of Constantinople - Wikipedia, the free encyclopedia". How would a bot know whether the answer was "fall" or "Constantinople" (or even "Wikipedia", "free" or "encyclopedia" for that matter)? While they might be able to weed out those last three, they'd still have to pick from two. And if the bot looked at more than the first result, there's also "Byzantium", "Hagia Sophia" and so on in the first page.
To be honest, I have no knowledge of how bots actually work, but if I were going to write one, that is what I would do: copy the question into <insert search engine name here>, get the first page of results, do a word frequency count for any significant words (filter out any unrelated words [a, the, and, it, etc.]), and try the most common, the next most common, etc. Then as soon as I find a working answer, I save that in some database with the question. Next time I encounter that question, I try that answer first; if it does not work, I go back through the process.

Maybe that is more sophisticated than bots can currently handle, maybe it isn't. I don't know, but I would assume it's not. If I were to write a PHP bot, I could use https://github.com/fabpot/Goutte to scrape a web page, then just do an explode on whitespace to separate each word, and then use a foreach or while loop to count each occurrence of each word (excepting the extraneous conjunctions and such), and go from there.
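For what it's worth, the counting step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the text is assumed to be fetched already (no scraping is shown), the function name and stop-word list are made up, and it splits on any non-alphanumeric character rather than a plain explode on whitespace so that punctuation does not stick to words.

```php
<?php
// Hypothetical sketch: rank the words of some already-fetched text by frequency.
function rank_candidate_words(string $text, array $stop_words): array
{
    // Lower-case, then split on anything that is not an ASCII letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($words as $word) {
        if (in_array($word, $stop_words, true)) {
            continue; // skip "the", "and", etc.
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts); // most frequent first; order among ties is not guaranteed
    return array_keys($counts);
}
```

Note that this only ever produces single words as candidates, so an answer that is a phrase would never show up in the ranking.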

That being said, I could just as easily write a script that understands "Find the #X, #Y, and #Z characters in A". Once the script understands that string format, it can extract X, Y, and Z positions, A string, and do a simple calculation like so:

Code: Select all

// Question: Find the 1st, 3rd, and last characters in the following string: A dog is not a cat
// Assume that the script understands how to parse that question and extract the information
// (shouldn't be too hard if you know what you're looking for)
// We end up with the following
$string = 'A dog is not a cat';
// Indexing starts at 0, so the first character is index 0 and the last is strlen($string) - 1
$positions = array(0, 2, strlen($string) - 1);
$answer = '';
// Append each of the characters at the specified positions to the answer string
foreach ($positions as $pos) {
    $answer .= $string[$pos];
}
// here you go!
return $answer;
// return is: Adt
Pony99CA wrote:Maybe a bot could attempt every word or phrase on the page, but wouldn't they get locked out trying that?
As I said, it would only try the most commonly encountered words. And we lock out for password attempts, not captcha attempts, iirc.
Pony99CA wrote:Sure, they might be able to do that -- but then they could do that now, because those are the types of questions being recommended by the support staff.
Who says they can't? All we're saying is that they currently don't. Or at least, not that we've seen. Otherwise, we wouldn't be recommending said questions.
Pony99CA wrote:Plus, I still think that it would be more difficult than the already broken CAPTCHAs phpBB currently ships by default. They only make it harder for humans without slowing bots down much, so we should get rid of them.
We agree on this point, at least. :D

Post by Pony99CA »

imkingdavid wrote:That being said, I could just as easily write a script that understands "Find the #X, #Y, and #Z characters in A". Once the script understands that string format, it can extract X, Y, and Z positions, A string, and do a simple calculation[....]
Sure, once you know what the template is and have found it on the page, you can do that. That's why I've suggested multiple templates.

And let's not forget that I proposed the automatic question for one reason -- because some people didn't want to add another step to phpBB installation requiring the user to select a question and answer. My proposal was meant to avoid that while still getting a reasonably good question. The user would get a nag screen if they hadn't changed the default question.

Although it would be interesting to see what would happen if the Q&A CAPTCHA allowed "Generated question" as an option and every registration attempt got a randomly generated question. (That would probably encourage bot authors to break the system faster, but I'd be curious about it.)
imkingdavid wrote:
Pony99CA wrote:Maybe a bot could attempt every word or phrase on the page, but wouldn't they get locked out trying that?
As I said, it would only try the most commonly encountered words. And we lock out for password attempts, not captcha attempts, iirc.
Nope, we have this on the Registration settings page:
Registration attempts:
Number of attempts users can make at solving the anti-spambot task before being locked out of that session.
I think that would prevent that explode and count scenario pretty well. And what happens when the answer is a phrase, not just one word (like "Detroit Tigers" in my example)?
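To show why an attempt limit blunts the try-the-most-common-words approach, here is a minimal sketch of a per-session counter. It is not phpBB's actual implementation; the function name, the `captcha_attempts` session key, and the return values are all assumptions made for this example.

```php
<?php
// Hypothetical per-session attempt counter; not phpBB's actual implementation.
// Returns 'solved', 'retry' or 'locked'.
function check_captcha_attempt(array &$session, string $given, string $expected, int $max_attempts): string
{
    $attempts = $session['captcha_attempts'] ?? 0;
    if ($attempts >= $max_attempts) {
        return 'locked'; // no further guesses are evaluated in this session
    }

    // Compare case-insensitively and ignore surrounding whitespace.
    if (strcasecmp(trim($given), $expected) === 0) {
        return 'solved';
    }

    // Record the failed guess and lock the session once the limit is reached.
    $session['captcha_attempts'] = $attempts + 1;
    return ($attempts + 1 >= $max_attempts) ? 'locked' : 'retry';
}
```

With a limit of 3, a bot working down a ranked word list gets at most three guesses per session, and a correct answer offered after the lock is still rejected.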
imkingdavid wrote:
Pony99CA wrote:Sure, they might be able to do that -- but then they could do that now, because those are the types of questions being recommended by the support staff.
Who says they can't? All we're saying is that they currently don't. Or at least, not that we've seen. Otherwise, we wouldn't be recommending said questions.
That's my point -- they aren't doing it now. Sure, maybe they can (now or in the future), but the proposal should work -- at least for a while. :D (And, again, the user would be nagged, er, encouraged to pick a new question.)

If bots ever get true artificial intelligence (enough to pass the Turing test), all CAPTCHAs will become pretty much useless anyway. :shock:
imkingdavid wrote:
Pony99CA wrote:Plus, I still think that it would be more difficult than the already broken CAPTCHAs phpBB currently ships by default. They only make it harder for humans without slowing bots down much, so we should get rid of them.
We agree on this point, at least. :D
So, as Jean-Luc Picard said, "Make it so!" I don't care if installation requires the user to select a question and answer or if we use a generated question, but do one of them. It will cut down on spam registrations -- and support requests complaining about too much spam. ;)

Post by Dog Cow »

I think that careful wording of questions can make automated lookup of the answer more difficult. For example, notice that my "question" was actually a statement then followed by a very short question that referred back to the statement without repeating any of the key terms.

It should be possible to provide even less context in the original statement, a bare minimum, and still have the human understand what is being asked of him or her.

Istanbul was originally named something else. What was this name?
imkingdavid wrote:
Pony99CA wrote:It's simple for humans to do -- not necessarily for a bot. In the example given, the first answer (not including the Google "including results for" links) said "Fall of Constantinople - Wikipedia, the free encyclopedia". How would a bot know whether the answer was "fall" or "Constantinople" (or even "Wikipedia", "free" or "encyclopedia" for that matter)? While they might be able to weed out those last three, they'd still have to pick from two. And if the bot looked at more than the first result, there's also "Byzantium", "Hagia Sophia" and so on in the first page.
To be honest, I have no knowledge of how bots actually work, but if I were going to write one, that is what I would do: copy the question into <insert search engine name here>, get the first page of results, do a word frequency count for any significant words (filter out any unrelated words [a, the, and, it, etc.]), and try the most common, the next most common, etc.
That's the programmer in you over-engineering a solution to a simple problem. :geek:

Here's how I'd solve it:

Pick a popular site that is using Q&A. Spend a few minutes refreshing the registration page to copy as many questions as I feel like. Compose answers for these questions. Write a bot that uses a built-in dictionary lookup to recognize each question and provide the correct answer. Have it skip questions that it doesn't recognize.

Repeat for each target site.

Post by imkingdavid »

Dog Cow wrote:
imkingdavid wrote:
Pony99CA wrote:It's simple for humans to do -- not necessarily for a bot. In the example given, the first answer (not including the Google "including results for" links) said "Fall of Constantinople - Wikipedia, the free encyclopedia". How would a bot know whether the answer was "fall" or "Constantinople" (or even "Wikipedia", "free" or "encyclopedia" for that matter)? While they might be able to weed out those last three, they'd still have to pick from two. And if the bot looked at more than the first result, there's also "Byzantium", "Hagia Sophia" and so on in the first page.
To be honest, I have no knowledge of how bots actually work, but if I were going to write one, that is what I would do: copy the question into <insert search engine name here>, get the first page of results, do a word frequency count for any significant words (filter out any unrelated words [a, the, and, it, etc.]), and try the most common, the next most common, etc.
That's the programmer in you over-engineering a solution to a simple problem. :geek:

Here's how I'd solve it:

Pick a popular site that is using Q&A. Spend a few minutes refreshing the registration page to copy as many questions as I feel like. Compose answers for these questions. Write a bot that uses a built-in dictionary lookup to recognize each question and provide the correct answer. Have it skip questions that it doesn't recognize.

Repeat for each target site.
:D

I guess I just prefer something more dynamic that would require less work from me. I thought mine to be a fairly simple solution, but I suppose yours is a little simpler code-wise, because instead of the script having to learn the questions/answers itself, you are feeding it the information it needs.