CAPTCHA
A CAPTCHA (an initialism for "Completely Automated Public Turing test to tell Computers and Humans Apart", trademarked by Carnegie Mellon University) is a type of challenge-response test used in computing to determine whether or not the user is human. The term was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper (all of Carnegie Mellon University), and John Langford (of IBM). A common type of CAPTCHA requires the user to type the letters shown in a distorted image, sometimes together with an obscured sequence of letters or digits that appears on the screen.
Because the test is administered by a computer, in contrast to the standard Turing test that is administered by a human, a CAPTCHA is sometimes described as a reverse Turing test (although this term is ambiguous because it could also mean a Turing test in which the participants are both attempting to prove they are the computer).
Origin
Since the early days of the Internet, users have wanted to make text illegible to computers. The first such people were hackers, posting about sensitive topics to online forums they thought were being automatically monitored for keywords. To circumvent such filters, they would replace a word with look-alike characters. HELLO could become h3ll0 or |-|3|_|_() or )-(3££0, as well as numerous other variants, such that a filter could not possibly detect all of them. This later became known as leetspeak.[citation needed]
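The following minimal Python sketch illustrates why a literal keyword filter misses such substitutions. The look-alike table and the function names are assumptions made for this illustration and cover only one of the many possible variants.

```python
# Hypothetical look-alike table; real leetspeak uses many more substitutions.
LOOKALIKES = {"e": "3", "o": "0"}

def naive_filter(text, keywords):
    """Return True if any keyword appears literally in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

def obfuscate(word):
    """Replace each letter that has a look-alike with that look-alike."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in word.lower())

keywords = ["hello"]
disguised = obfuscate("HELLO")              # "h3ll0"
caught_plain = naive_filter("HELLO", keywords)        # literal match found
caught_disguised = naive_filter(disguised, keywords)  # no literal match
```

A filter could add "h3ll0" to its keyword list, but each additional substitution multiplies the number of variants it would have to enumerate.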
The first discussion of automated tests which distinguish humans from computers for the purpose of controlling access to web services appears in a 1996 manuscript of Moni Naor from the Weizmann Institute of Science, entitled "Verification of a human in the loop, or Identification via the Turing Test". Primitive CAPTCHAs seem to have been later developed in 1997 at AltaVista by Andrei Broder and his colleagues in order to prevent bots from adding URLs to their search engine. Looking for a way to make their images resistant to OCR attack, the team looked at the manual to their Brother scanner, which had recommendations for improving OCR's results (similar typefaces, plain backgrounds, etc.). The team created puzzles by attempting to simulate what the manual claimed would cause bad OCR. In 2000, von Ahn and Blum developed and publicized the notion of a CAPTCHA, which included any program that can distinguish humans from computers. They invented multiple examples of CAPTCHAs, including the first CAPTCHAs to be widely used (at Yahoo!).
Applications
CAPTCHAs are used to prevent automated software from performing actions which degrade the quality of service of a given system, whether due to abuse or resource expenditure. Although CAPTCHAs are most often deployed as a response to encroachment by commercial interests, the notion that they exist to stop only spammers is mistaken. CAPTCHAs can be deployed to protect systems vulnerable to e-mail spam, such as the webmail services of Gmail, Hotmail, and Yahoo. CAPTCHAs have also found active use in stopping automated posting to blogs or forums, whether as a result of commercial promotion, or harassment and vandalism. CAPTCHAs also serve an important function in rate limiting, as automated usage of a service might be desirable until such usage is done in excess, and to the detriment of human users. In such a case, a CAPTCHA can enforce automated usage policies as set by the administrator when certain usage metrics exceed a given threshold.
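As a rough sketch of the rate-limiting use described above, the Python class below asks for a CAPTCHA only after a client exceeds a request threshold within a sliding time window. The class name, the default threshold, and the window length are assumptions chosen for illustration, not values taken from any particular service.

```python
import time
from collections import defaultdict, deque

class CaptchaGate:
    """Require a CAPTCHA once a client exceeds max_requests per window."""

    def __init__(self, max_requests=5, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client id -> request timestamps

    def needs_captcha(self, client_id, now=None):
        """Record one request and report whether a CAPTCHA should be shown."""
        now = time.monotonic() if now is None else now
        stamps = self._history[client_id]
        stamps.append(now)
        # Discard timestamps that have fallen out of the sliding window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        return len(stamps) > self.max_requests
```

Automated clients that stay under the threshold are never challenged, which matches the idea that automated usage is acceptable until it becomes excessive.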
Characteristics
CAPTCHAs are by definition fully automated, requiring little human maintenance or intervention in administering the test. This has obvious benefits in cost and reliability.
In practice, the algorithm used to create the CAPTCHA does not need to be made public, though it may be covered by a patent. Although publication can help demonstrate that breaking it requires the solution to a difficult problem in the field of artificial intelligence, deliberate withholding of the algorithm can increase the integrity of a limited set of systems (see security through obscurity). The most important factor in deciding whether an algorithm should be made open or restricted is the size of the system. Although an algorithm which survives scrutiny by security experts may be assumed to be more conceptually secure than an unevaluated algorithm, an unevaluated algorithm specific to a very limited set of systems is generally of less interest to those engaging in automated abuse. Breaking a CAPTCHA generally requires some effort specific to that particular CAPTCHA implementation, and an abuser may decide that the benefit granted by automated bypass is negated by the effort required to engage in abuse of that system in the first place.
Accessibility
- See also: Web accessibility
CAPTCHAs based on reading text — or other visual-perception tasks — prevent blind or visually impaired users from accessing the protected resource.[1] These CAPTCHAs also create barriers for the much larger number of people with learning disabilities involving text recognition (the visual perceptual problems often casually referred to as "dyslexia"). Users do not need to have severe vision or learning disabilities to fail at least one attempt at the average CAPTCHA: For example, it's difficult for most people to distinguish a distorted capital O from a distorted lowercase o, but some CAPTCHAs require this. These CAPTCHAs assume that it is possible to define a human being as someone who can identify distorted text 100% accurately on the first attempt (if the user mistypes even one character of the text, most CAPTCHA schemes will present an entirely different CAPTCHA image for the next attempt).
Other visual implementations do not require users to enter text, instead asking the user to pick images with common themes from a random selection.[2]
For users who cannot see the challenge (for example blind users, or color-blind users facing a color-based test), visual CAPTCHAs present serious problems. Because CAPTCHAs are designed to be unreadable by machines, common assistive technology tools such as screen readers cannot interpret them. Some implementations of CAPTCHAs permit users to opt for an audio CAPTCHA.[3]
While providing an audio CAPTCHA allows access to most blind users, it still hinders many users, including: users with slow Internet connections and/or poor audio hardware, non-native speakers of the language used in the audio file, users with auditory learning disabilities, and people who are both visually and hearing impaired. According to sense.org.uk, about 4% of people over 60 in the UK have both vision and hearing impairments. There are about 23,000 people in the UK who have serious vision and hearing impairments. According to The National Technical Assistance Consortium for Children and Young Adults Who Are Deaf-Blind (NTAC), there were 9,516 deafblind children in the USA in 2004.[4] Gallaudet University quotes a 1993 estimate of 35,000 fully deafblind adults in the USA.[5] Deafblind population estimates depend heavily on the degree of impairment used in the definition. An open question is what fraction of people cited as impaired use websites that would restrict them.
The use of CAPTCHAs can prevent a small number of disabled individuals from using significant subsets of such common Web-based services as PayPal, Gmail, Orkut, Yahoo!, and many forum and weblog systems, as they exist by default. While most noteworthy services will generally make alternative accommodations for disabled users, offering users with disabilities the option of bypassing a CAPTCHA by contacting the website's owners does not provide truly equitable access, since it means that users with disabilities must wait (sometimes many days) to conduct their business on the website in question, while all other users get instant access at any time. Since sites may use CAPTCHAs as part of the initial registration process, or even at every login, this challenge can completely block access. In certain jurisdictions, site owners could become the target of litigation if they use CAPTCHAs that discriminate against certain people with disabilities. For example, a CAPTCHA may make a site incompatible with Section 508 in the United States.
Even for perfectly sighted individuals, new generations of graphical CAPTCHAs, designed to overcome sophisticated recognition software, can be very hard or impossible to read, and may include characters not familiar to foreign users reading a translated version of the page.
Any hard artificial intelligence problem, such as speech recognition or common-sense questions, could be used as the basis for a CAPTCHA.
ProtectWebForm proposed a method, called "Smart CAPTCHA", intended to make CAPTCHAs less burdensome for users.[6] Its developers advise combining the CAPTCHA with JavaScript support. Because it is too hard for most spam robots to parse and execute scripts, the proposal is to use a simple script that fills in the CAPTCHA fields and hides the image and the field from human eyes.
The "Smart" CAPTCHAs are an example of security through obscurity: they only protect the site because bot authors have not encountered the specific type of JavaScript in question; thus, a single widespread implementation of a specific "Smart" CAPTCHA would be ineffective.
One alternative method involves displaying to the user a simple mathematical equation and requiring the user to enter the solution as verification. Although these are much easier to defeat using software, they are suitable for scenarios where graphical imagery is not appropriate, and they provide a much higher level of accessibility for visually impaired users than the image-based CAPTCHAs. However, these may be difficult for users with a cognitive disorder. Like "Smart" CAPTCHAs, mathematical puzzles provide security through obscurity and would be ineffective if a specific implementation were widespread.
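A minimal Python sketch of such an arithmetic challenge is shown below, together with a few-line solver that illustrates why these puzzles are easy to defeat in software once their format is known. The question wording, the number range, and the function names are assumptions made for this example.

```python
import random
import re

def make_challenge(rng=random):
    """Return (question, answer) for a small addition problem."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"What is {a} + {b}?", a + b

def check_answer(expected, submitted):
    """Accept the submission only if it parses to the expected integer."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False

def solve(question):
    """A bot that knows the question format: extract both numbers and add."""
    a, b = map(int, re.findall(r"\d+", question))
    return a + b

question, answer = make_challenge()
bot_passes = check_answer(answer, str(solve(question)))  # the bot succeeds
```

Because the question is plain text, it can be read aloud by a screen reader, which is the accessibility advantage noted above; the same property is what makes it trivial for a script to parse.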
Other kinds of challenges, such as those that require understanding the meaning of some text (e.g. a logic puzzle, trivia question, or instructions on how to create a password) can also be used as a CAPTCHA. Again, there is little research into their resistance against countermeasures, as current graphical CAPTCHAs can provide an equal level of protection, when properly implemented.
Circumvention
There are a few approaches to defeating CAPTCHAs: using cheap human labor to recognize them, exploiting bugs in the implementation that allow the attacker to completely bypass the CAPTCHA, and finally improving character recognition software.
Human solvers
CAPTCHAs are vulnerable to a relay attack that uses humans to solve the puzzles. One approach involves relaying the puzzles to a sweatshop of human operators. According to one estimate, the operators could easily solve hundreds of them each hour. If the humans are dedicated employees who receive minimum wage, this is not likely to be viable,[7] but services like the Amazon Mechanical Turk have had success using micropayments to attract human problem-solvers for other tasks. Another variation of this technique involves copying the CAPTCHA images and using them as CAPTCHAs for a high-traffic site (such as an adult site) owned by the attacker. With enough traffic, the attacker can get a solution to the CAPTCHA puzzle in time to relay it back to the target site.[8]
Insecure implementation
Howard Yeend has identified two implementation issues with poorly designed CAPTCHA systems:[9]
- Some CAPTCHA protection systems can be bypassed without using OCR simply by re-using the session ID of a known CAPTCHA image.
- CAPTCHAs residing on shared servers also present a problem; a security issue on another virtual host may leave the CAPTCHA issuer's site vulnerable.
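One common way to guard against the first issue is to make each challenge single-use on the server. The Python sketch below is an illustration under that assumption and is not taken from Yeend's article; the class and method names are hypothetical.

```python
import secrets

class ChallengeStore:
    """Server-side map from challenge id to answer, consumed on first check."""

    def __init__(self):
        self._answers = {}

    def issue(self, answer):
        """Store the expected answer and return an unguessable challenge id."""
        challenge_id = secrets.token_urlsafe(16)
        self._answers[challenge_id] = answer
        return challenge_id

    def verify(self, challenge_id, submitted):
        # pop() removes the entry whether or not the answer is right,
        # so a known (id, answer) pair cannot be replayed.
        expected = self._answers.pop(challenge_id, None)
        if expected is None:
            return False
        return secrets.compare_digest(
            expected.lower().encode("utf-8"), submitted.lower().encode("utf-8")
        )
```

With this design, an attacker who solves one image by hand cannot reuse its identifier for further submissions, because the identifier stops being valid after the first verification attempt.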
If part of the software generating the CAPTCHA runs client-side (the validation is done on a server but the text that the user is required to identify is rendered on the client side), users can sometimes modify the client to display the unrendered text. Some CAPTCHA systems use MD5 hashes stored client-side; these can often be "cracked."[10]
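To illustrate why a client-side hash of a short answer offers little protection, the Python sketch below recovers the answer by exhaustive search. It assumes an unsalted MD5 digest and an answer drawn from lowercase letters only; the function names are hypothetical.

```python
import hashlib
from itertools import product
from string import ascii_lowercase

def md5_hex(text):
    """Hex MD5 digest of a string, as a page might embed it client-side."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def crack(digest, length, alphabet=ascii_lowercase):
    """Try every string of the given length until one hashes to digest."""
    for candidate in product(alphabet, repeat=length):
        guess = "".join(candidate)
        if md5_hex(guess) == digest:
            return guess
    return None

leaked = md5_hex("kfz")        # digest visible in the page source
recovered = crack(leaked, 3)   # exhaustive search over 26**3 candidates
```

Even a four-letter lowercase answer has only 26**4 = 456,976 candidates, which a script can hash far faster than a human can read the image.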
Computer character recognition
Although visual CAPTCHAs were originally designed to defeat standard OCR software designed for document scanning, a number of research projects have proven that it is possible to defeat many CAPTCHAs with programs that are specifically tuned for a particular type of CAPTCHA. For CAPTCHAs with distorted letters, the approach typically consists of the following steps:
- Extraction of the image from the web page.
- Removal of background clutter, for example with color filters and detection of thin lines.
- Segmentation, i.e. splitting the image into segments containing a single letter.
- Identifying the letter for each segment.
Most visual CAPTCHAs present the image to the web browser as a single HTML img element. Some implementations split the image into multiple parts or encode parts of the image into the HTML code itself,[11] forcing an automatic process to render the page and perform OCR on the page rather than just the image of interest.
Removal of clutter is typically very easy to do automatically. In 2005, it was also shown that neural network algorithms have a lower error rate than humans in glyph identification.[12] The only part where humans still outperform computers is segmentation. If the background clutter consists of shapes similar to letter shapes, and the letters are connected by this clutter, the segmentation becomes nearly impossible with current software. Hence, an effective CAPTCHA should focus on the segmentation.
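The role of segmentation can be illustrated with a simple column-projection splitter, sketched below in Python. This is a toy example on a hand-made binary bitmap, not the method of any cited paper: it splits wherever a column contains no ink, and it fails as soon as a stroke connects the glyphs.

```python
def segment_columns(bitmap):
    """Split a binary image (rows of 0/1) into (start, end) column ranges
    separated by columns that contain no ink."""
    width = len(bitmap[0])
    ink = [any(row[x] for row in bitmap) for x in range(width)]
    segments, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                      # a glyph begins
        elif not has_ink and start is not None:
            segments.append((start, x))    # a glyph ends at an empty column
            start = None
    if start is not None:
        segments.append((start, width))
    return segments

# Three toy glyphs separated by empty columns.
clean = [
    [1, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 1],
]
# The same glyphs with a horizontal stroke drawn through the middle row.
cluttered = [row[:] for row in clean]
cluttered[1] = [1] * 7

clean_segments = segment_columns(clean)          # three separate ranges
cluttered_segments = segment_columns(cluttered)  # one merged range
```

On the clean bitmap the splitter finds three glyphs; on the cluttered one it returns a single merged region, so a per-letter classifier never receives isolated letters.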
Neural networks have been used with great success to defeat CAPTCHAs as they are generally indifferent to both affine and non-linear transformations. As they learn by example rather than through explicit coding, with appropriate tools very limited technical knowledge is required to defeat more complex CAPTCHAs.
Some CAPTCHA-defeating projects:
- Mori et al. published a paper in IEEE CVPR'03 detailing a method for defeating one of the most popular CAPTCHAs, EZ-Gimpy, which was tested as being 92% accurate in defeating it.[13] The same method was also shown to defeat the more complex and less-widely deployed Gimpy program 33% of the time. However, the existence of implementations of their algorithm in actual use is indeterminate at this time.
- PWNtcha has made significant progress in defeating commonly used CAPTCHAs, which has contributed to a general migration towards more sophisticated CAPTCHAs.[14]
- A number of Microsoft Research papers describe how computer programs and humans cope with varying degrees of distortion.[15]
Image-recognition CAPTCHAs vs. character-recognition CAPTCHAs
With the 'demonstration' (through research publications) that character recognition CAPTCHAs are vulnerable to computer vision based attacks, some researchers have proposed alternatives to character recognition, in the form of image recognition CAPTCHAs which require users to identify simple objects in the images presented. The argument is that object recognition is typically considered a more challenging problem than character recognition, due to the limited domain of characters and digits in the alphabets of most natural languages.
An open source image recognition CAPTCHA system is available to users of the popular phpBB2 forum software in the form of an addon called KittenAuth[16] which in its default form requires the user to select a stated type of animal from an array of thumbnail images but which can be customized for example to present images of interest to the forum's userbase.
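The general idea of a select-the-animal challenge can be sketched in Python as follows. This is not KittenAuth's code; the image pool, the file names, and the grid size are placeholders assumed for illustration.

```python
import random

# Hypothetical labelled image pool; the file names are placeholders.
IMAGE_POOL = {
    "kitten": ["kitten1.jpg", "kitten2.jpg", "kitten3.jpg", "kitten4.jpg"],
    "puppy": ["puppy1.jpg", "puppy2.jpg", "puppy3.jpg", "puppy4.jpg"],
    "duckling": ["duck1.jpg", "duck2.jpg", "duck3.jpg", "duck4.jpg"],
}

def make_grid(target, n_targets=3, grid_size=9, rng=random):
    """Return (shuffled grid of file names, set of indexes showing target)."""
    targets = rng.sample(IMAGE_POOL[target], n_targets)
    others = [
        image
        for label, images in IMAGE_POOL.items()
        if label != target
        for image in images
    ]
    decoys = rng.sample(others, grid_size - n_targets)
    grid = targets + decoys
    rng.shuffle(grid)
    correct = {i for i, image in enumerate(grid) if image in targets}
    return grid, correct

def check_selection(correct, selected):
    """Pass only if exactly the target thumbnails were selected."""
    return set(selected) == correct
```

A real deployment would need a much larger pool and would have to serve the thumbnails without revealing their labels (for example, not through descriptive file names as in this toy pool), since otherwise a bot could read the answer directly.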
Some proposed image recognition CAPTCHAs include:
- Chew et al. published their work in the 7th International Information Security Conference, ISC'04, proposing three different versions of image recognition CAPTCHAs, and validating the proposal with user studies. It is suggested that one of the versions, the anomaly CAPTCHA, is best with 100% of human users being able to pass an anomaly CAPTCHA with at least 90% probability in 42 seconds.[17]
- Datta et al. published their paper in the ACM Multimedia '05 Conference, named IMAGINATION (IMAge Generation for INternet AuthenticaTION), proposing a systematic approach to image recognition CAPTCHAs. Images are distorted in such a way that state-of-the-art image recognition approaches[18] (which are potential attack technologies) fail to recognize them.[19]
References
- ^ The W3C paper Inaccessibility of CAPTCHA outlined some of the accessibility problems with CAPTCHAs.
- ^ HumanAuth supports ADA and Section 508 requirements without forcing users to read distorted CAPTCHA text. Retrieved on October 23, 2006.
- ^ The article Proposal for an accessible Captcha describes how audio and visual test can be combined to increase accessibility in a Captcha.
- ^ http://www.tr.wou.edu/ntac/index.cfm?path=publications/publications_census.html
- ^ http://library.gallaudet.edu/dr/faq-stats-deaf-blind.html
- ^ http://www.protectwebform.com/smartcaptcha
- ^ Hire People To Solve CAPTCHA Challenges. Petmail Design (2005-07-21). Retrieved on August 22, 2006.
- ^ Doctorow, Cory (2004-01-27). Solving and creating captchas with free porn. Boing Boing. Retrieved on August 22, 2006.
- ^ Breaking CAPTCHAs Without Using OCR. Howard Yeend (pureMango.co.uk) (2005). Retrieved on August 22, 2006.
- ^ Online services allow MD5 hashes to be cracked. Retrieved on December 2, 2006.
- ^ Stop Spam Bots with the HTML Encoded Captcha (HEC)
- ^ Kumar Chellapilla, Kevin Larson, Patrice Simard, Mary Czerwinski (2005). "Computers beat Humans at Single Character Recognition in Reading based Human Interaction Proofs (HIPs)" (PDF). Microsoft Research. Retrieved on 2006-08-02.
- ^ http://www.cs.berkeley.edu/~mori/gimpy/mori_gimpy.pdf
- ^ http://sam.zoy.org/pwntcha/
- ^ http://research.microsoft.com/~kumarc/
- ^ http://www.thepcspy.com/articles/security/the_cutest_humantest_kittenauth Thepcspy.com
- ^ http://www.cs.berkeley.edu/~tygar/papers/Image_Recognition_CAPTCHAs/imagecaptcha.pdf
- ^ http://en.wikipedia.org/wiki/CBIR
- ^ http://infolab.stanford.edu/~wangz/project/imsearch/IMAGINATION/ACM05/
External links
- Cryptographp: Free Captcha script for PHP, (now in English)
- The Captcha Project
- Inaccessibility of CAPTCHA: Alternatives to Visual Turing Tests on the Web, a W3C Working Group Note.
- Captcha History from PARC.
- Google Tech Talk on Human Computation
- Steps to Implement Captcha Web Service in Oracle's eBusiness Framework using Java
- Advantages and Pitfalls of 3D/Isometric Captchas
Defeating CAPTCHAs
- Breaking a Visual CAPTCHA (Gimpy) By Greg Mori and Jitendra Malik
- OCR Research Team defeats weak CAPTCHAs.
- PWNtcha - CAPTCHA decoder
- Defeating a simple CAPTCHA with Open Source software
- CAPTCHA testing with Web browser automation
- OCR test to read easy texts in pics
- Decoding CAPTCHA using PHP | Hypertext Preprocessor, Theory & Source
- Will Solve Captcha for Money? - Article on Slashdot about using low-paid data entry workers to defeat CAPTCHAs in bulk.