DavidIQ wrote: And as was mentioned there and I'll repeat here, assigning default Q&A answers from a set of questions and answers is going to be very easily breakable, more so than the current CAPTCHA that comes on by default. No matter how random you select the question/answer pair the list would still be available to SPAM bot developers at which point it would be a matter of adding them to their bots. At least with an image CAPTCHA the bots have to actually do some work to solve. In your proposal we're just giving them the answers. I'll let the dev team decide if this should be merged to that other topic or not.
They can be generated though, right?
[RFC] stop distributing worthless CAPTCHAS in 3.1
- callumacrae
- Former Team Member
- Posts: 1046
- Joined: Tue Apr 27, 2010 9:37 am
- Location: England
- Contact:
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
- DavidIQ
- Customisations Team Leader
- Posts: 1905
- Joined: Thu Mar 02, 2006 4:29 pm
- Location: Earth
- Contact:
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
The questions and answers? Maybe...but how do you propose these be auto-generated without providing a list of questions and answers?
- Pony99CA
- Registered User
- Posts: 986
- Joined: Sun Feb 08, 2009 2:35 am
- Location: Hollister, CA
- Contact:
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
Pony99CA wrote: I agree with Tabitha and Steve. However, I'm not sure why this topic exists -- I suggested removing useless CAPTCHAs in the topic that Tabitha linked to (and Steve had posted in, too). I even proposed a scheme to generate default questions and answers so that no extra user step was required at installation.
DavidIQ wrote: And as was mentioned there and I'll repeat here, assigning default Q&A answers from a set of questions and answers is going to be very easily breakable, more so than the current CAPTCHA that comes on by default. No matter how random you select the question/answer pair the list would still be available to SPAM bot developers at which point it would be a matter of adding them to their bots. At least with an image CAPTCHA the bots have to actually do some work to solve. In your proposal we're just giving them the answers.
You've misunderstood my proposal, I think; Callum was on the right track.
There is no "list of questions" or "question/answer pairs" in my proposal; there is a list of question templates from which a template is selected at random and then a new question is generated at random using that template.
For example, here's a template:
Code: Select all
Type characters #<x>, #<y> and #<z> from the following string: <s>
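To make the template idea concrete, here is a minimal sketch of how one question and its answer might be generated from the template above. It assumes PHP 7 or later; the function name generate_position_question, the character set, and the default string length are illustrative assumptions, not phpBB code and not part of the proposal as posted.

```php
<?php
// Hypothetical helper: builds one question/answer pair from the positional template.
function generate_position_question(int $length = 10): array
{
    // Illustrative alphabet that leaves out look-alike characters such as l/1 and o/0.
    $alphabet = 'abcdefghjkmnpqrstuvwxyz23456789';
    $s = '';
    for ($i = 0; $i < $length; $i++) {
        $s .= $alphabet[random_int(0, strlen($alphabet) - 1)];
    }

    // Choose three distinct 1-based positions and list them in ascending order.
    $positions = range(1, $length);
    shuffle($positions);
    $positions = array_slice($positions, 0, 3);
    sort($positions);

    $answer = '';
    foreach ($positions as $p) {
        $answer .= $s[$p - 1]; // convert the 1-based position to a 0-based index
    }

    $question = sprintf(
        'Type characters #%d, #%d and #%d from the following string: %s',
        $positions[0],
        $positions[1],
        $positions[2],
        $s
    );

    return ['question' => $question, 'answer' => $answer];
}
```

Calling generate_position_question() returns an array with a 'question' and an 'answer' key; both change on every call because the string and the positions are random.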
Also, as noted, there would be other default templates, such as:
Code: Select all
In string <s>, type the <ul>-cased characters. (<ul> is "upper" or "lower")
Type the <an> characters in the following string: <s> (<an> is "alphabetic" or "numeric")
In string <s>, type the <oe>-numbered characters. (<oe> is "odd" or "even")
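As a rough illustration, the expected answer for these three filter-style templates could be computed along the following lines. This is only a sketch: the function names keep_character and answer_for_filter and the kind labels are assumptions made for the example, and positions are taken to be 1-based as in the first template.

```php
<?php
// Hypothetical rule table: decides whether one character belongs in the answer.
function keep_character(string $kind, string $ch, int $position): bool
{
    switch ($kind) {
        case 'upper':
            return ctype_upper($ch);
        case 'lower':
            return ctype_lower($ch);
        case 'alphabetic':
            return ctype_alpha($ch);
        case 'numeric':
            return ctype_digit($ch);
        case 'odd':
            return $position % 2 === 1;
        case 'even':
            return $position % 2 === 0;
        default:
            throw new InvalidArgumentException("Unknown template kind: $kind");
    }
}

// Builds the expected answer by keeping the characters the template asks for.
function answer_for_filter(string $s, string $kind): string
{
    $answer = '';
    foreach (str_split($s) as $index => $ch) {
        if (keep_character($kind, $ch, $index + 1)) { // positions are 1-based
            $answer .= $ch;
        }
    }
    return $answer;
}
```

For the string aB3dE7, the 'upper' kind would give BE and the 'numeric' kind would give 37.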
Steve
Silicon Valley Pocket PC (http://www.svpocketpc.com)
Creator of manage_bots and spoof_user (ask me)
Need hosting for a small forum with full cPanel & MySQL access? Contact me or PM me.
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
I think that we should ask the administrators to provide one or more knowledge-based questions.
All of these string- and mathematics-based, automatically generated questions are no good in my opinion, because I think that they would require more brain power from the user than would a knowledge-based question (asking for a fact). If the forum asks for a fact, then that is simple memory recall for most humans.
Modern-day Istanbul was renamed after the Turks conquered it. What was its previous name?
- imkingdavid
- Registered User
- Posts: 1050
- Joined: Thu Jul 30, 2009 12:06 pm
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
Dog Cow wrote: I think that we should ask the administrators to provide one or more knowledge-based questions. All of these string- and mathematics-based, automatically generated questions are no good in my opinion, because I think that they would require more brain power from the user than would a knowledge-based question (asking for a fact). If the forum asks for a fact, then that is simple memory recall for most humans. Modern-day Istanbul was renamed after the Turks conquered it. What was its previous name?
It is also a simple thing to search on Google. If a bot did a Google search and crawled the results, found the most common word in the results (except for things like "the" and "and", etc.), it would not take much to come up with Constantinople. If not in the first try, it wouldn't likely take too long.
What I'm worried about is that once we have some templates set, a bot could just as easily parse the templates to extract the information it needs to find, the whole string from which to find it, and eventually the correct answer.
- Pony99CA
- Registered User
- Posts: 986
- Joined: Sun Feb 08, 2009 2:35 am
- Location: Hollister, CA
- Contact:
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
Dog Cow wrote: I think that we should ask the administrators to provide one or more knowledge-based questions. All of these string- and mathematics-based, automatically generated questions are no good in my opinion, because I think that they would require more brain power from the user than would a knowledge-based question (asking for a fact). If the forum asks for a fact, then that is simple memory recall for most humans. Modern-day Istanbul was renamed after the Turks conquered it. What was its previous name?
imkingdavid wrote: It is also a simple thing to search on Google. If a bot did a Google search and crawled the results, found the most common word in the results (except for things like "the" and "and", etc.), it would not take much to come up with Constantinople. If not in the first try, it wouldn't likely take too long.
It's simple for humans to do -- not necessarily for a bot. In the example given, the first answer (not including the Google "including results for" links) said "Fall of Constantinople - Wikipedia, the free encyclopedia". How would a bot know whether the answer was "fall" or "Constantinople" (or even "Wikipedia", "free" or "encyclopedia" for that matter)? While they might be able to weed out those last three, they'd still have to pick from two. And if the bot looked at more than the first result, there's also "Byzantium", "Hagia Sophia" and so on in the first page.
See my attempt at debunking the "bots using Google" claim. That used a question (picked at random, by the way) which didn't happen to contain the answer in the first two Google results. I think a bot could more easily parse the questions that we're telling people to use than parse Google's results.
Maybe a bot could attempt every word or phrase on the page, but wouldn't they get locked out trying that?
imkingdavid wrote: What I'm worried about is that once we have some templates set, a bot could just as easily parse the templates to extract the information it needs to find, the whole string from which to find it, and eventually the correct answer.
Sure, they might be able to do that -- but then they could do that now, because those are the types of questions being recommended by the support staff.
Plus, I still think that it would be more difficult than the already broken CAPTCHAs phpBB currently ships by default. They only make it harder for humans without slowing bots down much, so we should get rid of them.
And remember, if somebody kept a default question in my proposal, they'd get a nag screen warning them to change the question.
Steve
- imkingdavid
- Registered User
- Posts: 1050
- Joined: Thu Jul 30, 2009 12:06 pm
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
Pony99CA wrote: It's simple for humans to do -- not necessarily for a bot. In the example given, the first answer (not including the Google "including results for" links) said "Fall of Constantinople - Wikipedia, the free encyclopedia". How would a bot know whether the answer was "fall" or "Constantinople" (or even "Wikipedia", "free" or "encyclopedia" for that matter)? While they might be able to weed out those last three, they'd still have to pick from two. And if the bot looked at more than the first result, there's also "Byzantium", "Hagia Sophia" and so on in the first page.
To be honest, I have no knowledge of how bots actually work, but if I were going to write one, that is what I would do: copy the question into <insert search engine name here>, get the first page of results, do a word frequency count for any significant words (filter out any unrelated words [a, the, and, it, etc.]), and try the most common, the next most common, etc. Then as soon as I find a working answer, I save that in some database with the question. Next time I encounter that question, I try that answer first; if it does not work, go back through the process.
Maybe that is more sophisticated than bots can currently handle, maybe it isn't. I don't know, but I would assume it's not. If I were to write a PHP bot, I could use https://github.com/fabpot/Goutte to scrape a web page, then just do an explode on whitespace to separate each word, and then use a foreach or while loop to count each occurrence of each word (excepting the extraneous conjunctions and such), and go from there.
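The counting step described above might look roughly like this. It is a sketch only: the function name rank_candidates and the stop-word list are assumptions for illustration, the text is passed in as a plain string, and no scraping library or network access is involved.

```php
<?php
// Hypothetical sketch: rank candidate answers by how often each word appears in some text.
function rank_candidates(string $text, array $stopWords, int $limit = 5): array
{
    // Lower-case the text and split on anything that is not a letter.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    foreach ($words as $word) {
        if (in_array($word, $stopWords, true)) {
            continue; // skip filler words such as "the" and "and"
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts); // most frequent words first
    return array_slice(array_keys($counts), 0, $limit);
}
```

Because it counts single words, a sketch like this can only ever propose one-word answers, and the order of words that tie on frequency is not something to rely on.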
That being said, I could just as easily write a script that understands "Find the #X, #Y, and #Z characters in A". Once the script understands that string format, it can extract X, Y, and Z positions, A string, and do a simple calculation like so:
Code: Select all
// Question: Find the 1st, 3rd, and last characters in the following string: A dog is not a cat
// Assume that the script understands how to parse that question and extract the information
// (shouldn't be too hard if you know what you're looking for)
// We end up with the following
$string = 'A dog is not a cat';
// indexing starts at 0, so the first character is index 0 and the last is strlen($string) - 1
$positions = array(0, 2, strlen($string) - 1);
$answer = '';
// Append each of the characters at the specified positions to the answer string
foreach ($positions as $pos) {
    $answer .= $string[$pos];
}
// here you go!
echo $answer;
// prints: Adt
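For completeness, the parsing step that the snippet above takes for granted could be sketched with a regular expression. This assumes the exact wording used in the example question; the function name solve_position_question is hypothetical, and any other template wording would need its own pattern.

```php
<?php
// Hypothetical parser for one specific question wording.
function solve_position_question(string $question): ?string
{
    $pattern = '/^Find the (.+) characters in the following string: (.+)$/';
    if (!preg_match($pattern, $question, $m)) {
        return null; // wording not recognised
    }
    [, $positionList, $string] = $m;

    // Pull out ordinals such as "1st" or "3rd", plus the word "last".
    preg_match_all('/(\d+)(?:st|nd|rd|th)|last/', $positionList, $tokens, PREG_SET_ORDER);

    $answer = '';
    foreach ($tokens as $token) {
        $index = $token[0] === 'last'
            ? strlen($string) - 1   // last character
            : (int) $token[1] - 1;  // 1-based ordinal to 0-based index
        if ($index < 0 || $index >= strlen($string)) {
            return null; // position falls outside the string
        }
        $answer .= $string[$index];
    }
    return $answer;
}
```

A question with any other wording returns null, which is the practical argument for shipping several templates rather than one.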
Pony99CA wrote: Maybe a bot could attempt every word or phrase on the page, but wouldn't they get locked out trying that?
As I said, it would only try the most commonly encountered words. And we lock out for password attempts, not captcha attempts, iirc.
Pony99CA wrote: Sure, they might be able to do that -- but then they could do that now, because those are the types of questions being recommended by the support staff.
Who says they can't? All we're saying is that they currently don't. Or at least, not that we've seen. Otherwise, we wouldn't be recommending said questions.
Pony99CA wrote: Plus, I still think that it would be more difficult than the already broken CAPTCHAs phpBB currently ships by default. They only make it harder for humans without slowing bots down much, so we should get rid of them.
We agree on this point, at least.
- Pony99CA
- Registered User
- Posts: 986
- Joined: Sun Feb 08, 2009 2:35 am
- Location: Hollister, CA
- Contact:
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
imkingdavid wrote: That being said, I could just as easily write a script that understands "Find the #X, #Y, and #Z characters in A". Once the script understands that string format, it can extract X, Y, and Z positions, A string, and do a simple calculation[....]
Sure, once you know what the template is and have found it on the page, you can do that. That's why I've suggested multiple templates.
And let's not forget that I proposed the automatic question for one reason -- because some people didn't want to add another step to phpBB installation requiring the user to select a question and answer. My proposal was meant to avoid that while still getting a reasonably good question. The user would get a nag screen if they hadn't changed the default question.
Although it would be interesting to see what would happen if the Q&A CAPTCHA allowed "Generated question" as an option and every registration attempt got a randomly generated question. (That would probably encourage bot authors to break the system faster, but I'd be curious about it.)
Pony99CA wrote: Maybe a bot could attempt every word or phrase on the page, but wouldn't they get locked out trying that?
imkingdavid wrote: As I said, it would only try the most commonly encountered words. And we lock out for password attempts, not captcha attempts, iirc.
Nope, we have this on the Registration settings page:
Registration attempts: Number of attempts users can make at solving the anti-spambot task before being locked out of that session.
I think that would prevent that explode and count scenario pretty well. And what happens when the answer is a phrase, not just one word (like "Detroit Tigers" in my example)?
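To show why an attempt limit blunts a guess-the-most-common-word approach, here is a small sketch of a per-session counter. It assumes PHP 7.4 or later; the class AttemptLimiter and its methods are hypothetical and are not how phpBB implements the setting.

```php
<?php
// Hypothetical per-session attempt counter; not phpBB's actual implementation.
final class AttemptLimiter
{
    private int $maxAttempts;
    private int $used = 0;

    public function __construct(int $maxAttempts)
    {
        $this->maxAttempts = $maxAttempts;
    }

    // Returns true only if the session is not locked out and the guess matches.
    public function tryAnswer(string $guess, string $correct): bool
    {
        if ($this->isLockedOut()) {
            return false;
        }
        $this->used++;
        return strcasecmp(trim($guess), $correct) === 0;
    }

    public function isLockedOut(): bool
    {
        return $this->used >= $this->maxAttempts;
    }
}
```

With a limit of three, a bot that guesses "fall", "Byzantium" and "Wikipedia" is locked out before it ever gets to try "Constantinople".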
Pony99CA wrote: Sure, they might be able to do that -- but then they could do that now, because those are the types of questions being recommended by the support staff.
imkingdavid wrote: Who says they can't? All we're saying is that they currently don't. Or at least, not that we've seen. Otherwise, we wouldn't be recommending said questions.
That's my point -- they aren't doing it now. Sure, maybe they can (now or in the future), but the proposal should work -- at least for a while.
If bots ever get true artificial intelligence (enough to pass the Turing test), all CAPTCHAs will become pretty much useless anyway.
Pony99CA wrote: Plus, I still think that it would be more difficult than the already broken CAPTCHAs phpBB currently ships by default. They only make it harder for humans without slowing bots down much, so we should get rid of them.
imkingdavid wrote: We agree on this point, at least.
So, as Jean-Luc Picard said, "Make it so!" I don't care if installation requires the user to select a question and answer or if we use a generated question, but do one of them. It will cut down on spam registrations -- and support requests complaining about too much spam.
Steve
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
I think that careful wording of questions can make automated lookup of the answer more difficult. For example, notice that my "question" was actually a statement then followed by a very short question that referred back to the statement without repeating any of the key terms.
It should be possible to provide even less context in the original statement, a bare minimum, and still have the human understand what is being asked of him or her.
Istanbul was originally named something else. What was this name?
Pony99CA wrote: It's simple for humans to do -- not necessarily for a bot. In the example given, the first answer (not including the Google "including results for" links) said "Fall of Constantinople - Wikipedia, the free encyclopedia". How would a bot know whether the answer was "fall" or "Constantinople" (or even "Wikipedia", "free" or "encyclopedia" for that matter)? While they might be able to weed out those last three, they'd still have to pick from two. And if the bot looked at more than the first result, there's also "Byzantium", "Hagia Sophia" and so on in the first page.
imkingdavid wrote: To be honest, I have no knowledge of how bots actually work, but if I were going to write one, that is what I would do: copy the question into <insert search engine name here>, get the first page of results, do a word frequency count for any significant words (filter out any unrelated words [a, the, and, it, etc.]), and try the most common, the next most common, etc.
That's the programmer in you over-engineering a solution to a simple problem.
Here's how I'd solve it:
Pick a popular site that is using Q&A. Spend a few minutes refreshing the registration page to copy as many questions as I feel like. Compose answers for these questions. Write a bot that uses a built-in dictionary lookup to recognize each question and provide the correct answer. Have it skip questions that it doesn't recognize.
Repeat for each target site.
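The lookup approach above needs very little code. A rough sketch, assuming the questions have been collected by hand: the function names normalize_question and lookup_answer and the single table entry are illustrative, with the answer taken from earlier in this topic.

```php
<?php
// Hypothetical normaliser: collapse whitespace and ignore case so trivial differences still match.
function normalize_question(string $q): string
{
    return strtolower(trim(preg_replace('/\s+/', ' ', $q)));
}

// Returns the stored answer, or null when the question is unknown and should be skipped.
function lookup_answer(array $known, string $question): ?string
{
    return $known[normalize_question($question)] ?? null;
}

// Illustrative table of hand-collected questions and answers.
$known = [
    normalize_question('Istanbul was originally named something else. What was this name?') => 'Constantinople',
];
```

A table like this only helps against a fixed set of questions; a question that is generated fresh for every registration would never match a stored key.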
- imkingdavid
- Registered User
- Posts: 1050
- Joined: Thu Jul 30, 2009 12:06 pm
Re: [RFC] stop distributing worthless CAPTCHAS in 3.1
Pony99CA wrote: It's simple for humans to do -- not necessarily for a bot. In the example given, the first answer (not including the Google "including results for" links) said "Fall of Constantinople - Wikipedia, the free encyclopedia". How would a bot know whether the answer was "fall" or "Constantinople" (or even "Wikipedia", "free" or "encyclopedia" for that matter)? While they might be able to weed out those last three, they'd still have to pick from two. And if the bot looked at more than the first result, there's also "Byzantium", "Hagia Sophia" and so on in the first page.
imkingdavid wrote: To be honest, I have no knowledge of how bots actually work, but if I were going to write one, that is what I would do: copy the question into <insert search engine name here>, get the first page of results, do a word frequency count for any significant words (filter out any unrelated words [a, the, and, it, etc.]), and try the most common, the next most common, etc.
Dog Cow wrote: That's the programmer in you over-engineering a solution to a simple problem. Here's how I'd solve it: Pick a popular site that is using Q&A. Spend a few minutes refreshing the registration page to copy as many questions as I feel like. Compose answers for these questions. Write a bot that uses a built-in dictionary lookup to recognize each question and provide the correct answer. Have it skip questions that it doesn't recognize. Repeat for each target site.
I guess I just prefer something more dynamic that would require less work from me. I thought mine to be a fairly simple solution, but I suppose yours is a little simpler code-wise, because instead of the script having to learn the questions/answers itself, you are feeding it the information it needs.