A self-adaptive algorithm to defeat text-based CAPTCHA

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is used almost everywhere in data entry to guard against automated scripts such as bots. Text-based schemes are still the most widely applied; they typically require users to answer a question involving a recognition task. The segmentation required differs from one type of CAPTCHA to another, and so far there is no universal way to solve the segmentation problem. In this paper, we present a novel adaptive algorithm and, based on it, build a system that defeats several CAPTCHAs at the same time. The CAPTCHA datasets we use come from the State Administration for Industry & Commerce of the People's Republic of China. There are 33 CAPTCHA entrances in total that we need to solve. In our experiments, we assume that the entrance of each CAPTCHA is known. Results are provided showing how well our algorithms work on these CAPTCHAs.


I. INTRODUCTION
CAPTCHAs play an increasingly important role in automatically distinguishing human beings from computer programs. For example, Google improves its service by blocking access by automated spammers, eBay protects its online marketplace by preventing bots from flooding the site with scams, and Facebook limits the creation of fraudulent profiles used to cheat at games [1]. Text-based CAPTCHAs are the most widely used. They combine distorted characters with obfuscation techniques so that the text can be recognized by people but may be hard for automated bots [2] [3] [4].
To defeat a text-based CAPTCHA, we need three steps: denoising, segmentation and recognition. We first introduce the preprocessing step. Its purpose is to reduce the influence of noise so that segmentation is correct, which in turn increases the recognition accuracy. Many methods exist, combining image processing and artificial intelligence algorithms, such as the median filter [5], neighborhood filter, wavelet thresholding, universal denoising, the K-nearest neighbors algorithm, support vector machines and so on. However, choosing among these candidates is the key to performing the right denoising [1].
After preprocessing, we use segmentation to divide the image into characters for better recognition. Segmentation depends heavily on the features of the CAPTCHA [6]. For some special CAPTCHAs, discussed further below, we present an adaptive character-length scheme to implement the segmentation.
The last step is recognition. There are three popular ways to recognize a CAPTCHA. The first, Optical Character Recognition (OCR), is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text [7]. It is widely used as a form of data entry from printed records, such as passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static data, or any other suitable documentation. The second is Template Matching [8]. Predating OCR, it is among the earliest artificial intelligence methods for recognition. For simple cases it is very convenient, and for some rotated CAPTCHAs it can work better than OCR. However, if the circumstances are too complex, including heavy rotation, overlapping and twisting, the third way, a convolutional neural network, is the potential solution.
The rest of this paper is organized as follows: Section II describes the background on the three steps of denoising, segmentation and recognition. Section III proposes the scheme of adaptive segmentation. Section IV presents the experiment. The paper concludes with Section V.

II. BACKGROUND
In the real world, different kinds of CAPTCHAs can be found on many popular websites [9] [3] [10]. A low-cost attack on a Microsoft CAPTCHA was presented for solving the segmentation-resistant task, achieving a segmentation success rate higher than 90% [11]. Moreover, a projection-based segmentation algorithm for breaking MSN and YAHOO CAPTCHAs proved effective, doubling the correct segmentation rate of the traditional method [12]. In addition, an algorithm using ellipse-shaped blob detection for breaking the Facebook CAPTCHA also proved useful [13].
Fig. 1 presents some examples of text-based CAPTCHAs which will be discussed later. Fig. 2 shows the basic flow chart for defeating the CAPTCHAs. In this section, we describe the whole process in detail.

A. Preprocessing Step
There are several potential techniques for denoising the images, as introduced above. The appropriate method should be chosen according to the characteristics of each CAPTCHA type.
1) Thresholding: The main difficulty with thresholding is that it considers only intensity, not any relationship between the pixels. Nothing guarantees that the pixels identified by the thresholding process are contiguous. We can easily include extraneous pixels that are not part of the desired region, and we can just as easily miss isolated pixels within the region (especially near its boundaries). These effects get worse as the noise increases, simply because it becomes more likely that a pixel's intensity does not represent the normal intensity in the region [14]. When we use thresholding, we typically have to balance a tradeoff between losing too much of the region and picking up too many extraneous background pixels. (Shadows of objects in the image are also troublesome, not only where they fall across another object but also where they are mistakenly included as part of a dark object on a light background.) Here we use automated methods for finding thresholds: to set a global threshold or to adapt a local threshold to an area, we usually look at the histogram to see whether it has two or more distinct modes, one for the foreground and one for the background [15].
Recall that a histogram is a probability distribution: $p(g) = n_g / n$, where $n_g$ is the number of pixels with intensity $g$ and $n$ is the total number of pixels. As can be seen in Fig. 3, the samples in the left column consist of colored background noise and binary characters. Global thresholding is therefore a very efficient way to obtain the denoised images.
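As an illustration of choosing a global threshold from a bimodal histogram, the following is a minimal sketch. The paper does not name the specific automated method it uses; Otsu's between-class variance criterion is swapped in here as one common choice, and the function name `otsu_threshold` and the sample pixel values are assumptions made for the example.

```python
def otsu_threshold(pixels, levels=256):
    """Return a global threshold t; pixels <= t form one class, pixels > t the other.

    Assumption: Otsu's criterion stands in for the unspecified automated method.
    """
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low, sum_low = 0, 0
    for t in range(levels):
        w_low += hist[t]              # number of pixels with intensity <= t
        sum_low += t * hist[t]
        if w_low == 0:
            continue                  # nothing in the low class yet
        w_high = total - w_low
        if w_high == 0:
            break                     # everything is already in the low class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance (up to a constant factor 1 / total**2).
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Hypothetical grayscale samples: dark strokes near 10 and a light background near 200.
pixels = [10] * 50 + [12] * 50 + [200] * 40 + [210] * 60
t = otsu_threshold(pixels)
binary = [1 if v <= t else 0 for v in pixels]   # 1 marks dark (character) pixels
```

Any threshold between the two modes separates them equally well here, which is the situation the histogram inspection described above is looking for.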
2) Median filter: The median filter is a nonlinear digital filtering technique, often used to remove noise. Such noise reduction is a typical preprocessing step to improve the results of later processing (for example, edge detection on an image). Median filtering is very widely used in digital image processing because, under certain conditions, it preserves edges while removing noise.
The main idea of the median filter is to run through the signal entry by entry, replacing each entry with the median of the neighboring entries. For example, if the first window of three entries is [2 2 80], the first output is y[1] = Median[2 2 80] = 2 [5]. Algorithm 1 gives one classic algorithm.
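Before turning to Algorithm 1, the sliding-window idea can be checked on a small signal. This is a minimal 1D sketch, not the histogram-based 2D algorithm; the input signal `[2, 80, 6, 3]` and the choice of repeating edge values at the boundary are assumptions, picked so that the first window is [2 2 80] as in the example above.

```python
from statistics import median


def median_filter_1d(x, radius=1):
    """Replace each entry by the median of its window of size 2 * radius + 1.

    Boundary handling (an assumption): indices outside the signal are
    clamped, i.e. the first and last values are repeated.
    """
    n = len(x)
    out = []
    for i in range(n):
        window = [x[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(median(window))
    return out


signal = [2, 80, 6, 3]              # hypothetical input with one noise spike
filtered = median_filter_1d(signal)
print(filtered)                     # [2, 6, 6, 3]
```

The spike of 80 is removed while the neighboring values are left close to their originals, which is the edge-preserving behavior the text refers to.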

Algorithm 1 Median Filtering Algorithm
Input: Image X of size m × n, kernel radius τ
Output: Image Y of the same size as X
Initialize: kernel histogram H
for i = 1 to m do
    for j = 1 to n do
        for k = −τ to τ do
            Remove X_{i+k, j−τ−1} from H
            Add X_{i+k, j+τ} to H
        end for
        Y_{i,j} ← median(H)
    end for
end for

3) Affine transformation: In geometry, an affine transformation (also called an affine map or an affinity) is a function between affine spaces that preserves points, straight lines and planes. Sets of parallel lines also remain parallel after an affine transformation. An affine transformation does not necessarily preserve angles between lines or distances between points, though it does preserve ratios of distances between points lying on a straight line [16]. An affine map is composed of two functions: a translation and a linear map. Ordinary vector algebra represents linear maps by matrix multiplication and translations by vector addition. If the linear map is expressed as multiplication by a matrix $A$ and the translation as addition of a vector $b$, an affine map $f$ acting on a vector $x$ can be represented as $f(x) = Ax + b$. Fig. 4 illustrates the effect of an affine transformation. In short, ratios along lines are retained while angles are not guaranteed to be.
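The two properties stated above (ratios along a line are kept, angles are not) can be verified numerically. The sketch below applies a 2D affine map of the form f(x) = Ax + b; the matrix, translation and points are hypothetical values chosen only so that the arithmetic stays exact.

```python
def affine_map(A, b, x):
    """Apply f(x) = A x + b to a 2D point x."""
    return (A[0][0] * x[0] + A[0][1] * x[1] + b[0],
            A[1][0] * x[0] + A[1][1] * x[1] + b[1])


def midpoint(p, q):
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


# Hypothetical map: horizontal scaling combined with a shear, then a shift.
A = [[2, 1],
     [0, 1]]
b = (3, -1)

p, q = (0, 0), (4, 2)
fp, fq = affine_map(A, b, p), affine_map(A, b, q)

# Ratio preservation: the image of the midpoint is the midpoint of the images.
image_of_mid = affine_map(A, b, midpoint(p, q))
mid_of_images = midpoint(fp, fq)

# Angles are not preserved: perpendicular directions stop being perpendicular.
u = affine_map(A, (0, 0), (1, 0))   # linear part only, applied to a direction
v = affine_map(A, (0, 0), (0, 1))
dot_after = u[0] * v[0] + u[1] * v[1]   # nonzero, so no longer a right angle
```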

B. Segmentation
Segmentation is the most important part because it strongly affects the performance of recognition. Usually, segmentation divides the image into multiple parts for easier analysis [6]. In short, we simplify the image and obtain the segments by detecting boundaries such as lines or curves [17]. Here, our goal is to segment the CAPTCHA into characters for better recognition in the next step. The difficulty is evident: rotation, noise and even twisted characters all have to be dealt with one by one.

1) Histogram-based method: The histogram-based method computes the number of pixels in each row or column. In general, a histogram is a sequence y_0, y_1, ..., y_n, where y_i is the number of pixels in the image with gray level i and n is the maximum gray level attained; for segmentation, the same counting is applied to the pixels of each row or column. If the histogram shows wide gaps between adjacent characters, the start and end points of each character are easy to set [18].
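A minimal sketch of this column-wise counting follows, assuming the image has already been binarized so that 1 marks a character pixel. The function names, the `min_pixels` parameter and the toy image are hypothetical and serve only to illustrate the idea.

```python
def column_projection(img):
    """Count foreground pixels (value 1) in each column of a binary image."""
    return [sum(col) for col in zip(*img)]


def segment_columns(img, min_pixels=1):
    """Return (start, end) column ranges, end exclusive, of consecutive
    columns whose foreground count is at least min_pixels."""
    proj = column_projection(img)
    segments, start = [], None
    for x, count in enumerate(proj):
        if count >= min_pixels and start is None:
            start = x                         # a character begins here
        elif count < min_pixels and start is not None:
            segments.append((start, x))       # the character ended before x
            start = None
    if start is not None:                     # character touching the right edge
        segments.append((start, len(proj)))
    return segments


# Toy 3 x 7 binary image containing two separated "characters".
img = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1],
]
print(column_projection(img))   # [0, 3, 2, 0, 0, 3, 1]
print(segment_columns(img))     # [(1, 3), (5, 7)]
```

The empty columns 3 and 4 form the gap that separates the two segments; when characters touch or are rotated, no such gap exists, which is why this method alone is not sufficient.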
2) K-means clustering: K-means is a widely used method in cluster analysis. It partitions n observations into K clusters, assigning each observation to the cluster whose centroid (mean) is nearest. As can be seen in Fig. 3, every character in the middle and right columns has a single color, which means that its RGB values are more closely related to each other than to those of other characters, let alone the noise. There are several variants of K-means; below is the basic pseudocode we exploit.

Algorithm 2 K-Means Algorithm
1. Select K points as the initial centroids
2. Repeat
       Form K clusters by assigning all points to the closest centroid
       Recompute the centroid of each cluster
3. Until the centroids do not change

In practice, segmentation is highly case-dependent because CAPTCHAs vary considerably. So far, no algorithm reliably segments touching, bent or twisted characters. A careful combination of histograms, K-means and the affine transformation is our best weapon for defeating the CAPTCHAs. In the next section, we focus on an algorithm that handles many CAPTCHA types by using adaptive-length segmentation.
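The steps of Algorithm 2 can be sketched directly for RGB pixel values. This is a minimal illustration, assuming squared Euclidean distance in RGB space and caller-supplied initial centroids; the sample colors are hypothetical.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Basic K-means on tuples of numbers (e.g. RGB values).

    Returns (centroids, clusters), where clusters[i] lists the points
    assigned to centroids[i].
    """
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 2a: assign every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Step 2b: recompute each centroid (keep the old one if a cluster is empty).
        new_centroids = [
            tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 3: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical pixels: three reddish and two bluish, as from two colored characters.
red1, red2, red3 = (250, 10, 10), (240, 20, 15), (245, 15, 12)
blue1, blue2 = (10, 10, 245), (20, 15, 250)
points = [red1, red2, blue1, blue2, red3]

centroids, clusters = kmeans(points, initial_centroids=[red1, blue1])
```

With one initial centroid taken from each color, the pixels separate into a reddish and a bluish cluster, matching the observation that each character's RGB values lie close together.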

C. Recognition
After successful segmentation, the last step is to recognize each character, that is, to translate the image into text. Here we mainly introduce two methods.
1) Optical character recognition (OCR): Optical character recognition (OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text [19]. It is widely used as a form of data entry from printed paper records, such as passports, receipts, bank statements, or any other suitable documentation. It is one of the most common methods of digitizing printed text so that it can be edited, searched and stored electronically. Various OCR software packages exist; one of the best-known open-source packages is Tesseract, maintained by Google [7] [20]. We use it as our recognition tool. If its input format is followed rigorously, the recognition results are quite good.
2) Template Matching: Originally, template matching (TM) is a method for searching for and finding the location of a template image within a larger image [8]. It can also be used to find the candidate character, provided the noise has been removed completely and enough clean templates are available. The method simply slides the template image over the input image (as in 2D convolution) and compares the template with the input image. We then compute the covariance between each predefined template and the input image and take the candidate with the highest covariance as the label.
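The covariance-based labeling described above can be sketched for a character patch that has already been segmented and scaled to the template size, so no sliding is needed. The 3 × 3 templates, their labels and the noisy patch are hypothetical; a real system would use larger, cleaner templates.

```python
def flatten(img):
    return [v for row in img for v in row]


def covariance(a, b):
    """Population covariance between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def classify(patch, templates):
    """Return the label whose template has the highest covariance with patch."""
    flat = flatten(patch)
    scores = {label: covariance(flat, flatten(t)) for label, t in templates.items()}
    return max(scores, key=scores.get), scores


# Hypothetical binary templates (1 = stroke pixel).
templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# A "T" with one extra noise pixel in the bottom-right corner.
patch = [[1, 1, 1], [0, 1, 0], [0, 1, 1]]

label, scores = classify(patch, templates)
```

Despite the noise pixel, the "T" template scores highest, which illustrates why the method works well only when denoising leaves the patch close to a clean template.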

III. THE SCHEME OF ADAPTIVE SEGMENTATION
There are many CAPTCHAs with different fonts, as Fig. 5 shows. All of them come from the same entrance, which means we have to deal with them simultaneously. In total there are 8 font types, covering traditional and simplified Chinese characters, Arabic numerals, symbol operators and character operators. The first task is to segment each type correctly. The remaining task is to identify which of these font types each character belongs to.
As Fig. 5 shows, we cannot simply use the histogram-based method to perform the segmentation, because rotation makes neighboring characters interfere with each other. We therefore first use an affine transformation to turn the rotated characters upright. Mathematically, this is the affine map of Section II-A, $y = Ax + b$. One attractive feature of the matrix representation is that it lets us factor a complex transform into a set of simpler transforms. Here we only use the rotation transformation, so the following rotation matrix is essential: $\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$, where $\theta$ is the angle of counterclockwise rotation around the origin.
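A minimal sketch of the rotation step follows, acting on point coordinates. The function names and the 30-degree slant are hypothetical; the sketch assumes a mathematical coordinate system with the y axis pointing up, so in image coordinates (y pointing down) the visual direction of rotation is reversed.

```python
import math


def rotate(point, theta, center=(0.0, 0.0)):
    """Rotate point counterclockwise by theta radians about center,
    using the matrix [[cos t, -sin t], [sin t, cos t]]."""
    c, s = math.cos(theta), math.sin(theta)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (center[0] + c * dx - s * dy,
            center[1] + s * dx + c * dy)


def deskew(points, slant, center=(0.0, 0.0)):
    """Undo a known, fixed counterclockwise slant by rotating through -slant."""
    return [rotate(p, -slant, center) for p in points]


# A vertical stroke, slanted by a hypothetical fixed angle, then corrected.
stroke = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
slant = math.radians(30)
slanted = [rotate(p, slant) for p in stroke]
restored = deskew(slanted, slant)
```

Because the correction is the inverse rotation, applying it to every pixel coordinate of the character restores the upright stroke up to floating-point error.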
After careful observation, we find that the angle is fixed, which means that every time we load an image we first apply the same affine transformation to make it vertical. The remaining task is to segment it, as Fig. 6 shows. Another issue is the segmentation itself: as introduced before, there are about eight font types, and we need to check each of them character by character.
We finally optimize the whole flow; the basic idea is as follows. Once an image enters our system, we check its first character. If it is an Arabic numeral, we go to the third character and check whether it is a digit or a Chinese character. If it is a digit, we go back to the second character to see whether it is an algebraic symbol or a Chinese character operator. Otherwise, we go ahead to the fourth character. If the first character is not a digit, we check the second character to determine the font type. If the second character is an algebraic symbol, the third character must be a Chinese character, and we can ignore the fourth one. If the second character is not an algebraic symbol, the simplest way is to check directly whether the fourth character equals one particular ideograph. Fig. 7 shows the basic flow chart of our adaptive segmentation.
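The order in which characters are inspected can be written down as a small decision function. This is only a sketch of the control flow described above: the kind labels ("digit", "symbol", "cn_operator", "cn_char") and the branch tags are hypothetical names, and the function assumes that each character's kind can already be determined by some classifier.

```python
def inspection_order(kinds):
    """Return (positions, branch) for a four-character CAPTCHA.

    kinds: kinds of the characters at positions 1 to 4 (hypothetical labels).
    positions: 1-based positions in the order they are inspected.
    branch: a hypothetical tag naming the path taken through the flow.
    """
    positions = [1]                          # always start with the first character
    if kinds[0] == "digit":
        positions.append(3)                  # digit first: look at the third character
        if kinds[2] == "digit":
            positions.append(2)              # go back: symbol or Chinese operator?
            branch = "digit/digit"
        else:
            positions.append(4)              # otherwise move on to the fourth
            branch = "digit/non-digit"
    else:
        positions.append(2)                  # not a digit: second character decides
        if kinds[1] == "symbol":
            branch = "non-digit/symbol"      # third must be Chinese; fourth ignored
        else:
            positions.append(4)              # compare the fourth with the ideograph
            branch = "non-digit/non-symbol"
    return positions, branch


print(inspection_order(["digit", "symbol", "digit", "symbol"]))
# ([1, 3, 2], 'digit/digit')
```

Each path inspects at most three of the four characters, which is where the flow saves work compared with classifying every position.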
There is one more, very tricky segmentation issue. After the affine transformation, we can use the histogram-based algorithm to extract the characters one by one. However, in some Chinese characters the component and the radical are separated, so a single character can be mistakenly split into two. We correct such errors by using a fixed length. That is, the lengths of operators, characters and digits should always be the same, which means we must first measure the length of each font type in preparation.
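The fixed-length correction can be sketched as a merge of neighboring segments that together still fit within one character. This is an illustrative simplification: it assumes a single expected width for all characters, whereas the text measures a length per font type, and the function name, the tolerance `tol` and the sample ranges are hypothetical.

```python
def merge_to_fixed_width(segments, width, tol=2):
    """Merge adjacent (start, end) column ranges that together span at most
    width + tol columns, so a split radical and component are rejoined."""
    merged = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= width + tol:
            merged[-1] = (merged[-1][0], end)    # still within one character
        else:
            merged.append((start, end))          # start a new character
    return merged


# Hypothetical projection output: the first and third characters were split in two.
raw = [(0, 6), (8, 20), (24, 44), (48, 58), (60, 68)]
print(merge_to_fixed_width(raw, width=20))
# [(0, 20), (24, 44), (48, 68)]
```

Segments that are already a full character wide are left alone, because adding a neighbor would exceed the expected width.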

IV. EXPERIMENT
We select about 3000 samples from the 33 entrances. Because many of them are similar, we first group the similar CAPTCHAs together. Fig. 8 illustrates the process of defeating the CAPTCHAs. As mentioned before, the entrances are known, so for each entrance we only need to focus on one font type. The last four rows, however, belong to a single entrance, as shown in Section III. The outcome is very clear. Fig. 9 shows the accuracy rate for each CAPTCHA type: the best accuracy rate is as high as 100%, while the lowest is 33.3%.
Furthermore, comparing the results in Fig. 9, we can see that rotated CAPTCHAs usually have a worse recognition rate than the others. The best recognition rate for a rotated type is 71.6%, in the third row. The accuracy rates of the other rotated fonts, in the second, fifth and eighth rows, are 54%, 33.3% and 35% respectively. The accuracy rate clearly decreases as the rotation angle increases. That is to say, rotation hinders accurate segmentation and recognition, which is our pain point as well. For the remaining types, which can use OCR directly, we expect the accuracy rate to be favorable. We cannot yet deal with all of the entrances, because some of them are too complicated, with heavy rotation and even overlapping. In future work, we would like to build a convolutional neural network for classification instead of OCR or Template Matching.

V. CONCLUSIONS
CAPTCHA is designed to tell computers and humans apart. Defeating a CAPTCHA also helps to find its weaknesses, which can in turn improve its security. In this paper, as a contribution toward both defeating and improving CAPTCHAs systematically, we evaluated various methods on real-world data and identified the most suitable algorithms. Under ideal conditions, OCR is perhaps the best solution. In practice, however, not all CAPTCHAs can be recognized well by OCR because of rotation. In such cases, Template Matching is an alternative, although its accuracy rate is less satisfactory than that of OCR. We therefore proposed an algorithm that corrects rotated CAPTCHAs by affine transformation and then segments them through our adaptive system. After this correction, OCR can be used to achieve the highest accuracy rate with a considerably fast approach. Finally, with our novel adaptive segmentation, similar complex types of rotated CAPTCHAs can be solved efficiently.

Fig. 7. The detailed flow chart of adaptive segmentation