Bypass Captcha using Python and Tesseract OCR engine

06:21:00 python , rechaptchapython 0 Comments

A CAPTCHA is a type of challenge-response test used in computing as an 
attempt to ensure that the response is generated by a person. The 
process usually involves one computer (a server) asking a user to 
complete a simple test which the computer is able to generate and 
grade. The term "CAPTCHA" was coined in 2000 by Luis von Ahn, Manuel 
Blum, Nicholas J. Hopper, and John Langford (all of Carnegie Mellon 
University). It is an acronym based on the word "capture" and standing 
for "Completely Automated Public Turing test to tell Computers and 
Humans Apart".

In this post I am going to show you how to crack weak captchas 
using Python and the Tesseract OCR engine. A few days back I was playing 
around with a web application. The application was using a captcha as an 
anti-automation technique when taking user feedback.

First let me give you a brief idea of how the captcha worked in that web application.

Inspecting the captcha image, I found that the form loads it in this way:

<img src="http://www.site.com/captcha.php"> 

From this you can easily understand that the “captcha.php” file returns an image file.

If we access the URL http://www.site.com/captcha.php, each and every time it generates an image with a new random number.

To make this clearer, let me give you an example.

Suppose that after opening the feedback form you get a few text fields and a 
captcha. Suppose that at a certain time the captcha loaded with a number, for 
example "4567".

So if you use that code "4567" the form will be submitted successfully.

Now the most interesting thing was this: if you copy the captcha image URL 
(which is http://www.site.com/captcha.php in this case) and open the 
image in a new tab of the same browser, the captcha loads with a different 
number, as I told you earlier. Suppose you got "9090" this 
time. Now if you try to submit the feedback form with the number that 
was loaded earlier with the feedback form (which was "4567"), the 
application will not accept that form. If you enter "9090", the 
application will accept that form.
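
The behaviour described above can be modeled with a small sketch. This is not the application's real code: the class and method names are hypothetical, and it assumes the server keeps a single "expected" code per visitor session and overwrites it every time captcha.php is requested. It is written for Python 3.

```python
import random


class CaptchaSession:
    """Toy model of one visitor's server-side session (hypothetical names)."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.expected = None  # code stored for the most recent captcha image

    def serve_captcha(self):
        """Model a request to captcha.php: make a new code and remember it."""
        self.expected = "%04d" % self.rng.randrange(10000)
        return self.expected  # a real server would render this into an image

    def submit_form(self, code):
        """Model the form POST: only the most recently stored code is accepted."""
        return self.expected is not None and code == self.expected


session = CaptchaSession()
first = session.serve_captcha()    # image loaded together with the form
second = session.serve_captcha()   # same URL opened again in a new tab
print(session.submit_form(second))  # the newest code is the one that matches
```

In this model the older code stops working as soon as a newer image is requested, which is the behaviour seen with "4567" and "9090". Whether the real application also clears the stored code after a submission is not known; the sketch leaves it in place.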

For a clearer idea I have created this simple figure.

My strategy to bypass this anti-automation technique was:

1) Download only the image from http://www.site.com/captcha.php
2) Feed that image to the OCR engine
3) Craft an HTTP POST request with all required parameters and the decoded captcha code, and POST it.

Now what is happening here?

When you request the image file, the server performs steps 1 to 5 as shown in the figure.

When we then post the HTTP request, the server matches the 
received captcha code against the value that it temporarily stored. As 
long as the OCR engine read the image correctly, the code matches and the server accepts the form.
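
The request flow can be sketched as follows, in Python 3 and with the standard library only. It rests on assumptions that the post does not confirm: that the server ties the stored code to a session cookie (so the image request and the POST must share cookies), that the captcha form field is called "captcha", and that the form is posted to a URL you supply. The function names are illustrative, and read_captcha stands for whatever OCR routine you plug in.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_opener():
    """Build an opener that remembers cookies between requests."""
    jar = CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_form_body(fields, captcha_code, captcha_field="captcha"):
    """URL-encode the form fields plus the decoded captcha code.

    captcha_field is an assumed parameter name; check the real form.
    """
    data = dict(fields)
    data[captcha_field] = captcha_code
    return urllib.parse.urlencode(data).encode("ascii")


def submit_with_captcha(opener, captcha_url, form_url, fields, read_captcha):
    """Fetch a captcha image, decode it, and POST the form with the same opener."""
    image_bytes = opener.open(captcha_url).read()    # server stores a new code
    code = read_captcha(image_bytes)                 # OCR step, supplied by caller
    body = build_form_body(fields, code)
    return opener.open(form_url, data=body).read()   # passing data makes it a POST
```

Using one opener for both requests is the design choice that matters here: if the image were downloaded without cookies, the server would have no way to connect the stored code to the later POST under the session assumption above.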

I used this Python script to automate the entire process.

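# Python 2 script. Requires PIL and the pytesser module.
from PIL import Image, ImageEnhance
from pytesser import image_to_string
from urllib import urlretrieve

def get(link):
    # Download the captcha image and save it as temp.png.
    urlretrieve(link, 'temp.png')

get('http://www.site.com/captcha.php')

# Enlarge the image five times so the digits are easier to recognize.
im = Image.open("temp.png")
nx, ny = im.size
im2 = im.resize((int(nx*5), int(ny*5)), Image.BICUBIC)
im2.save("temp2.png")

# Preview the original image with 30% more contrast.
# This is only displayed; it is not used in the steps below.
enh = ImageEnhance.Contrast(im)
enh.enhance(1.3).show("30% more contrast")

# Remove background noise: every pixel that is not pure black becomes white.
imgx = Image.open('temp2.png')
imgx = imgx.convert("RGBA")
pix = imgx.load()
for y in xrange(imgx.size[1]):
    for x in xrange(imgx.size[0]):
        if pix[x, y] != (0, 0, 0, 255):
            pix[x, y] = (255, 255, 255, 255)
imgx.save("bw.gif", "GIF")

# Resize the cleaned image to 116x56 and save it as a TIFF file for the OCR engine.
original = Image.open('bw.gif')
bg = original.resize((116, 56), Image.NEAREST)
ext = ".tif"
bg.save("input-NEAREST" + ext)

# Run OCR on the processed image and print the recognized text.
image = Image.open('input-NEAREST.tif')
print image_to_string(image)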

Here I am only posting the OCR part of the code. If you are a Python lover like me, you can use the "httplib" Python module to do the rest. This script is not independent: the pytesser Python
module is required to run it. PyTesser is an Optical Character 
Recognition module for Python. It takes as input an image or image file 
and outputs a string.

PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script.

You can get this package at http://code.google.com/p/pytesser/
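
PyTesser dates from the Python 2 era. If you are on Python 3, one commonly used wrapper around the same Tesseract executable is pytesseract. The sketch below is not what I used in this post; it assumes Pillow, pytesseract and the Tesseract binary are installed, that the captcha contains digits only, and the function names and the threshold of 128 are illustrative choices.

```python
def upscaled_size(size, factor=5):
    """Return the (width, height) obtained by scaling size by factor."""
    width, height = size
    return (int(width * factor), int(height * factor))


def read_digits(path, factor=5, threshold=128):
    """Enlarge, binarize and OCR a captcha image; returns the recognized text."""
    # Imported here so the helper above can be used without these packages.
    from PIL import Image
    import pytesseract

    img = Image.open(path).convert("L")  # grayscale
    img = img.resize(upscaled_size(img.size, factor), Image.BICUBIC)
    # Pixels brighter than the threshold become white, the rest black.
    img = img.point(lambda p: 255 if p > threshold else 0)
    # --psm 7 treats the image as a single text line; the whitelist asks
    # Tesseract to output digits only (support varies by Tesseract version).
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    return pytesseract.image_to_string(img, config=config).strip()
```

A call such as read_digits("temp.png") would then take the place of the image_to_string step, with the enlarging and cleaning folded into one function.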

The script works in this way.

1) First the script downloads the captcha image using the "urlretrieve" function from Python's urllib module.
2) Then it makes the image bigger so the digits are easier to recognize.
3) After that it cleans the background noise by turning every pixel that is not pure black into white.
4) At last it feeds the processed image to the OCR engine.
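
The noise-cleaning rule can be shown on its own with a small Python 3 sketch. It works on a plain list of RGB tuples instead of a PIL image so that it runs without any extra package. The function name and the threshold parameter are my additions for illustration: a threshold of 0 keeps only pure black pixels, as the script does, while a higher value also keeps dark grey pixels, which may help when resizing has softened the edges of the digits.

```python
def clean_pixels(pixels, threshold=0):
    """Return a copy of an RGB pixel grid with non-dark pixels turned white.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    A pixel is kept (as black) when its brightest channel is <= threshold;
    threshold=0 keeps pure black pixels only.
    """
    black, white = (0, 0, 0), (255, 255, 255)
    return [
        [black if max(px) <= threshold else white for px in row]
        for row in pixels
    ]


row = [[(0, 0, 0), (30, 30, 30), (200, 200, 200)]]
print(clean_pixels(row))                # only the first pixel stays black
print(clean_pixels(row, threshold=60))  # the dark grey pixel is kept as well
```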

Here is another Python script which is very useful while testing captchas. You can add these lines to your script if the target captcha image is too small. This script can help you change the resolution of any image.

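from PIL import Image, ImageEnhance

# Enlarge test.png five times and save the result.
im = Image.open("test.png")
nx, ny = im.size
im2 = im.resize((int(nx*5), int(ny*5)), Image.BICUBIC)
im2.save("final_pic.png")

# Preview the original image with 30% more contrast (display only, nothing is saved).
enh = ImageEnhance.Contrast(im)
enh.enhance(1.3).show("30% more contrast")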

Thanks for reading. I hope it was helpful. Feel free to share and drop comments.
