Neural networks for motion recognition. Development of an image recognition system based on artificial neural networks. Neural networks in biometrics and face recognition


A neural network is a mathematical model, together with its software or hardware-and-software implementation, built on the principles of biological neural networks, the networks of neurons found in living organisms. Scientific interest in this structure arose because studying its model yields information about the system being modeled, which means that such a model can find practical application in many branches of modern science and technology. This article addresses questions concerning the use of neural networks for building image recognition systems, which are widely used in security systems. The questions related to the image recognition algorithm and its application are examined in detail, and the methodology for training neural networks is briefly described.

neural networks

neural network training

image recognition

local perception paradigm

security systems


Today, technological and research progress is rapidly opening up new horizons. One of them is the modeling of the surrounding natural world by means of mathematical algorithms. Here there are trivial problems, such as simulating the motion of sea waves, and extremely complex, non-trivial, multicomponent ones, such as modeling the functioning of the human brain. In the course of studying this question, a separate concept emerged: the neural network. A neural network is a mathematical model, together with its software or hardware-and-software implementation, built on the principles of biological neural networks, the networks of neurons found in living organisms. Scientific interest in this structure arose because studying its model yields information about the system being modeled, which means that such a model can find practical application in many branches of modern science and technology.

Brief history of neural network development

It is worth noting that the concept of a "neural network" originates in the work of the American researchers W. McCulloch and W. Pitts (1943), where the authors first mention it, give it a definition and make the first attempt to build a model of a neural network. As early as 1949, D. Hebb proposed the first learning algorithm. A series of studies in the field of neural network learning followed, and the first working prototypes appeared around 1990-1991. Nevertheless, the computing power of the hardware of that time was insufficient for neural networks to run fast enough. By 2010 the power of GPUs had grown greatly, and the concept of programming directly on graphics cards emerged, which increased computer performance significantly (by a factor of 3-4). In 2012 a neural network won the ImageNet competition for the first time, which marked the beginning of their further rapid development and the emergence of the term deep learning.

In the modern world, neural networks have enormous reach, and scientists consider research into the behavior and states of neural networks extremely promising. The list of areas in which neural networks have found use is vast: recognition and classification of images, forecasting, the solution of approximation problems, certain aspects of data compression, data analysis and, of course, security systems of various kinds.

Neural networks are actively studied today in the scientific communities of many countries. Viewed in this way, a neural network can be regarded as a particular case among a number of pattern recognition methods, discriminant analysis and clustering methods.

It should also be noted that over the past year more funding was allocated to startups in the field of image recognition systems than over the previous five years, which indicates a rather large demand for this type of development in the market.

Application of neural networks to image recognition

Consider the standard tasks solved by neural networks applied to images:

● identification of objects;

● recognition of parts of objects (for example, faces, hands, legs, etc.);

● semantic definition of object boundaries (allows only the boundaries of objects to be kept in the picture);

● semantic segmentation (allows the image to be divided into separate objects);

● estimation of surface normals (allows two-dimensional pictures to be converted into three-dimensional representations);

● selection of objects of attention (allows one to determine what a person would pay attention to in a given image).

It is worth noting that the image recognition task is far from trivial; solving it is a complex and unusual process. The recognition of a human face, a handwritten digit, and many other objects characterized by a number of unique features significantly complicates the identification process.

In this study, an algorithm for creating and training a neural network to recognize handwritten characters will be considered. The image is read in through the inputs of the neural network, and the outputs produce the recognition result.

At this point it is necessary to dwell briefly on the classification of neural networks. Today three main kinds are distinguished:

● convolutional neural networks (CNN);

● recurrent neural networks (RNN);

● networks trained by reinforcement learning.

One of the most frequent examples of a neural network is the classic neural network topology. Such a network can be represented as a fully connected graph; its characteristic features are the forward propagation of information and the backward propagation of the error signal. This technology has no recursive properties. A neural network with the classic topology is depicted in Fig. 1.

Fig. 1. Neural network with simple topology

Fig. 2. Neural network with 4 layers of hidden neurons

One of the obviously significant drawbacks of this network topology is redundancy. Because of it, data supplied, for example, as a two-dimensional matrix must be converted into a one-dimensional vector at the input. Thus, for an image of a handwritten Latin letter described by a 34x34 matrix, 1156 inputs are required. This suggests that the computing power spent on software and hardware implementations of this algorithm would be too great.
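To illustrate the redundancy just described, here is a minimal sketch (assumed code, not the authors' implementation) of what feeding a 34x34 image matrix to a fully connected network requires:

```python
# A fully connected network cannot consume a 2-D matrix directly:
# the image must be flattened into a 1-D vector, so the input layer
# alone needs 34 * 34 = 1156 neurons.

def flatten(image):
    """Turn a 2-D matrix (list of rows) into a 1-D input vector."""
    return [pixel for row in image for pixel in row]

# A hypothetical 34x34 image of a handwritten letter (all zeros here).
image = [[0.0] * 34 for _ in range(34)]
input_vector = flatten(image)
n_inputs = len(input_vector)   # 1156 input neurons required
```

The flattening also discards the two-dimensional neighborhood structure of the pixels, which is exactly the shortcoming the convolutional approach described below was designed to overcome.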

The problem was solved by the scientist Yann LeCun, who analyzed the work of the Nobel laureates in medicine T. Wiesel and D. Hubel. In their study, the visual cortex of the cat's brain served as the object of investigation. Analysis of the results showed that the cortex contains a number of simple cells as well as a number of complex cells. Simple cells reacted to images of straight lines received from the visual receptors, and complex cells to translational movement in one direction. As a result, a principle for constructing neural networks, called convolutional, was developed. The idea of this principle is that the network alternates convolutional layers (C-layers), subsampling layers (S-layers) and, at the output of the network, fully connected layers (F-layers).

The construction of a network of this kind rests on three paradigms: the paradigm of local perception, the paradigm of shared weights and the subsampling paradigm.

The essence of the local perception paradigm is that each input neuron receives not the entire image matrix but only a part of it; the remaining parts are fed to other input neurons. Here a parallelization mechanism can be observed: with this method the topology of the image can be preserved from layer to layer while it is processed multidimensionally, that is, several sets of neurons can take part in the processing.

The shared weights paradigm suggests that a small set of weights can be used for a large number of connections. Such sets are also called kernels. As for the final image processing result, shared weights have a positive effect on the properties of the neural network: its behavior becomes invariant to shifts in images, and noise components are filtered out without separate processing.

Based on the above, it can be concluded that applying the convolution procedure to an image with a given kernel produces an output image whose elements characterize how well each region matches the filter, that is, a feature map is generated. This algorithm is shown in Fig. 3.

Fig. 3. Feature map generation algorithm
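The convolution that produces such a feature map can be sketched in plain Python (an assumed illustration, not the authors' code); the kernel here is a hypothetical vertical-edge filter:

```python
# Slide a small kernel over the image; each output element measures how
# well the corresponding image patch matches the filter. The collection
# of these responses is the feature map.

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(image[i + u][j + v] * kernel[u][v]
                    for u in range(kh) for v in range(kw))
            row.append(s)
        out.append(row)
    return out

# A tiny image with one vertical edge and a 2x2 vertical-edge kernel.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
feature_map = convolve2d(image, kernel)
# The strongest responses sit exactly on the edge between the 0s and 1s.
```

Note how the same four kernel weights are reused at every position, which is the shared-weights paradigm in action.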

The subsampling paradigm consists of reducing the input image by decreasing the spatial dimension of its mathematical equivalent, an n-dimensional matrix. The need for subsampling stems from the requirement of invariance to the scale of the original image. By alternating layers of the two types it becomes possible to generate new feature maps from existing ones; in practical terms, a multidimensional matrix can thus be degenerated into a vector, and then into a scalar value.
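The subsampling step can be sketched as a max-pooling operation (an assumed illustration; averaging is an equally common choice):

```python
# A 2x2 max-pool halves each spatial dimension of a feature map, making
# later layers less sensitive to small shifts and scale changes.

def max_pool(feature_map, size=2):
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[i + u][j + v]
                 for u in range(size) for v in range(size))
             for j in range(0, w - size + 1, size)]
            for i in range(0, h - size + 1, size)]

fm = [[1, 3, 2, 0],
      [4, 2, 1, 1],
      [0, 1, 5, 2],
      [2, 0, 1, 3]]
pooled = max_pool(fm)   # a 4x4 map shrinks to a 2x2 map
```

Repeated convolution and pooling is exactly the "degeneration" of a matrix into a vector and then a scalar that the paragraph above describes.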

Implementation of neural network training

Existing networks are divided into three classes of architecture in terms of learning:

● supervised learning (perceptron);

● unsupervised learning (adaptive resonance networks);

● mixed learning (radial basis function networks).

One of the most important criteria for evaluating the operation of a neural network in image recognition is the quality of recognition. It is worth noting that for a quantitative assessment of recognition quality the mean squared error is most often used:

E = (1/P) · Σ_p (D_p − O(I_p, W))²   (1)

In this expression, E_p = (D_p − O(I_p, W))² is the recognition error for the p-th training pair, D_p is the expected output of the neural network (the network should usually strive for 100% recognition, but in practice this does not happen), and O(I_p, W) is the network output, which depends on the p-th input I_p and the set of weight coefficients W; the latter includes the convolution kernels and the weights of all layers. The overall error is computed as the arithmetic mean over all training pairs.
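A minimal sketch of this error calculation (the numbers below are hypothetical, not taken from the study):

```python
# Mean squared error in the sense of formula (1): the squared differences
# between expected outputs D_p and actual network outputs O(I_p, W),
# averaged over all training pairs.

def mean_squared_error(expected, actual):
    return sum((d - o) ** 2 for d, o in zip(expected, actual)) / len(expected)

D = [1.0, 0.0, 0.0]   # expected outputs D_p
O = [0.8, 0.1, 0.1]   # network outputs O(I_p, W)
error = mean_squared_error(D, O)   # (0.04 + 0.01 + 0.01) / 3 = 0.02
```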

As a result of the analysis, a regularity was derived whereby the nominal value of a weight, at which the error value is minimal, can be calculated from dependence (2):

w_opt = w − (∂E/∂w) / (∂²E/∂w²)   (2)

From this dependence it follows that the optimal weight is obtained by subtracting from the current weight the first derivative of the error with respect to that weight, divided by the second derivative of the error.

These dependences make the calculation of the error at the output layer trivial. The error in the hidden layers of neurons can be calculated using the backpropagation method. The main idea of the method is to propagate information, in the form of an error signal, from the output neurons to the input ones, that is, in the direction opposite to the propagation of signals through the neural network.
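Backpropagation for a tiny two-layer network can be sketched as follows (an assumed illustration of the general method, not the study's implementation; the weights and training pair are hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w_hidden, w_out, lr=0.5):
    # forward pass: hidden activations, then a single output neuron
    h = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    o = sigmoid(sum(wi * hi for wi, hi in zip(w_out, h)))
    # error signal at the output layer
    delta_o = (o - target) * o * (1 - o)
    # propagate the error signal backwards through the output weights
    delta_h = [delta_o * w_out[j] * h[j] * (1 - h[j]) for j in range(len(h))]
    # gradient-descent weight updates
    w_out = [w_out[j] - lr * delta_o * h[j] for j in range(len(h))]
    w_hidden = [[w_hidden[j][i] - lr * delta_h[j] * x[i]
                 for i in range(len(x))] for j in range(len(h))]
    return w_hidden, w_out, (o - target) ** 2

# one training pair; the squared error shrinks as training proceeds
x, target = [1.0, 0.0], 1.0
w_h, w_o = [[0.1, 0.2], [0.3, 0.4]], [0.5, 0.6]
errors = []
for _ in range(20):
    w_h, w_o, e = train_step(x, target, w_h, w_o)
    errors.append(e)
```

The error signal computed at the output (delta_o) flows backwards through the output weights to produce the hidden-layer signals (delta_h), opposite to the forward direction of the data.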

It should also be noted that network training is performed on specially prepared databases of images classified into a large number of classes, and it takes quite a long time.
Today the largest such database is ImageNet (www.image_net.org). It is freely accessible to academic institutions.

Conclusion

As a result of the foregoing, it is worth noting that neural networks and algorithms built on the principle of their functioning can be used in fingerprint-card recognition systems for internal affairs bodies. Often the software component of a hardware-and-software complex aimed at recognizing such a unique and complex image as a fingerprint pattern, which serves as identification data, does not fully solve the tasks assigned to it. A program implemented with algorithms based on a neural network is much more efficient.

Summarizing, the following can be stated:

● neural networks can be used both in image and in text recognition;

● this theory makes it possible to speak of the creation of a new promising class of models, namely models based on intelligent modeling;

● neural networks are capable of learning, which points to the possibility of optimizing their functioning. This capability is an extremely important option for the practical implementation of the algorithm;

● the evaluation of an image recognition algorithm based on a neural network can be quantified; accordingly, there are mechanisms for adjusting parameters to the required values by calculating the desired weight coefficients.

Further study of neural networks today appears to be a promising area of research that will be successfully applied in ever more branches of science, technology and human activity. The main emphasis in the development of modern recognition systems has now shifted to semantic segmentation of 3D images in geodesy, medicine, prototyping and other fields; these are quite complex algorithms, and this is due to:

● the lack of a sufficient number of databases of reference images;

● the lack of a sufficient number of available experts for the initial training of the system;

● the fact that such images are stored not as pixels, which requires additional resources both from the computer and from the developers.

It should also be noted that today there are a large number of standard architectures for building neural networks, which significantly simplifies the task of building a network from scratch, reducing it to the selection of a network structure suitable for the specific task.

Currently there are quite a number of innovative companies on the market engaged in image recognition using neural network training technologies. It is known for certain that they have reached an image recognition accuracy of around 95% using a database of 10,000 images. Nevertheless, all these achievements concern static images; with video streams everything is currently much more complicated.

Bibliographic reference

Markova S.V., Zhigalov K.Yu. Application of a neural network to create an image recognition system // Fundamental Studies. - 2017. - № 8-1. - P. 60-64;
URL: http://fundamental-research.ru/ru/article/view?id=41621 (accessed: 03/24/2020).

AlexNet is a convolutional neural network that has had a great influence on the development of machine learning, especially on computer vision algorithms. The network won the ImageNet LSVRC-2012 image recognition contest by a large margin (with an error rate of 15.3% against 26.2% for second place).

The AlexNet architecture is similar to the LeNet network created by Yann LeCun. However, AlexNet has more filters per layer and stacked convolutional layers. The network includes convolutions, max pooling, dropout, data augmentation, ReLU activation functions and stochastic gradient descent.

Features of AlexNet

  1. ReLU is used as the activation function instead of the hyperbolic tangent to add nonlinearity to the model. Thanks to this, at the same accuracy, training becomes 6 times faster.
  2. The use of dropout instead of ordinary regularization addresses the overfitting problem. However, training time doubles with a dropout rate of 0.5.
  3. Overlapping pooling is used to reduce the size of the network. Thanks to this, the top-1 and top-5 error rates are reduced by 0.4% and 0.3%, respectively.
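The first two ingredients above can be sketched in a few lines (an assumed illustration, not the original AlexNet code; the activation values are hypothetical):

```python
import random

# ReLU clips negative inputs to zero; dropout zeroes each activation
# with probability p during training, discouraging co-adaptation of units.

def relu(x):
    return max(0.0, x)

def dropout(activations, p=0.5, seed=0):
    rng = random.Random(seed)   # seeded here only to keep the sketch deterministic
    return [0.0 if rng.random() < p else a for a in activations]

acts = [relu(x) for x in [-2.0, -0.5, 0.0, 0.5, 2.0]]
dropped = dropout(acts, p=0.5)   # roughly half the units are silenced
```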

The ImageNet dataset

ImageNet is a set of 15 million labeled high-resolution images divided into 22,000 categories. The images were collected on the Internet and labeled by hand using Amazon's Mechanical Turk crowdsourcing service. Since 2010, the annual ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held as part of the Pascal Visual Object Challenge. The challenge uses a subset of ImageNet with 1000 images in each of 1000 categories: in total, 1.2 million training images, 50,000 validation images and 150,000 test images. ImageNet consists of images of varying resolution, so for the competition they are scaled to a fixed resolution of 256 × 256; if an image was originally rectangular, it is cropped to a square at its center.
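The center-cropping step just described can be sketched as follows (assumed code, not the competition's own tooling; the tiny "image" is hypothetical):

```python
# Crop a rectangular image (a list of pixel rows) to a centered square;
# in the real pipeline the square would then be rescaled to 256x256.

def center_square_crop(image):
    h, w = len(image), len(image[0])
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return [row[left:left + side] for row in image[top:top + side]]

# a hypothetical 4x6 "image": cropping keeps the central 4x4 region
image = [list(range(6)) for _ in range(4)]
square = center_square_crop(image)
```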

Architecture

Figure 1

The network architecture is shown in Figure 1. AlexNet contains eight layers with weight coefficients. The first five are convolutional, and the remaining three are fully connected. The output is passed through a softmax function, which forms a distribution over the 1000 class labels. The network maximizes a multinomial logistic regression objective, which is equivalent to maximizing, averaged over all training cases, the log-probability of the correct label under the predicted distribution. The kernels of the second, fourth and fifth convolutional layers are connected only to those kernel maps of the previous layer that reside on the same GPU. The kernels of the third convolutional layer are connected to all kernel maps of the second layer. Neurons in the fully connected layers are connected to all neurons of the previous layer.
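The softmax mentioned above can be sketched in a few lines (an assumed illustration; AlexNet uses 1000 classes, the toy example below uses 3):

```python
import math

# Turn the final layer's raw scores into a probability distribution.

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# the probabilities sum to 1 and the largest score gets the largest share
```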

Thus, AlexNet contains 5 convolutional layers and 3 fully connected layers. ReLU is applied after every convolutional and fully connected layer. Dropout is applied before the first and second fully connected layers. The network contains 62.3 million parameters and spends 1.1 billion computations on a forward pass. The convolutional layers, which account for 6% of all parameters, perform 95% of the computations.

Training

AlexNet is trained for 90 epochs. Training takes 6 days running simultaneously on two NVIDIA GeForce GTX 580 GPUs, which is why the network is divided into two parts. Stochastic gradient descent is used with a learning rate of 0.01, momentum 0.9 and weight decay 0.0005. The update scheme for the weight coefficients w has the form:

v_{i+1} = 0.9 · v_i − 0.0005 · ε · w_i − ε · ∂L/∂w,
w_{i+1} = w_i + v_{i+1},

where i is the iteration number, v is the momentum variable, and ε (epsilon) is the learning rate. Throughout training the learning rate was kept the same for all layers and adjusted manually: the heuristic was to divide it by 10 whenever the validation error stopped decreasing, which happened 3 times over the course of training.
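The update rule above can be sketched for a single scalar weight (the real network applies it elementwise to every weight; the gradient value below is hypothetical):

```python
# One iteration of SGD with momentum 0.9, weight decay 0.0005 and
# learning rate epsilon = 0.01, following the update scheme above.

def sgd_step(w, v, grad, epsilon=0.01, momentum=0.9, weight_decay=0.0005):
    v = momentum * v - weight_decay * epsilon * w - epsilon * grad
    return w + v, v

w, v = 1.0, 0.0
w, v = sgd_step(w, v, grad=0.2)   # one iteration of the update scheme
```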

Examples of use and implementation

The results show that a large, deep convolutional neural network is able to achieve record results on very difficult datasets using only supervised learning. A year after the publication of AlexNet, all participants in the ImageNet competition began using convolutional neural networks to solve the classification task. AlexNet became a landmark implementation of convolutional neural networks and opened a new era of research. Today implementing AlexNet has become easier with deep learning libraries: PyTorch, TensorFlow, Keras.

Result

The network achieves top-1 and top-5 error rates of 37.5% and 17.0%, respectively. The best performance achieved during the ILSVRC-2010 competition was 47.1% and 28.2%, using an approach that averaged the predictions of six sparse-coding models trained on different feature vectors. Since then, results of 45.7% and 25.7% have been achieved using an approach that averages the predictions of two classifiers trained on Fisher vectors. The ILSVRC-2010 results are shown in Table 1.


Left: eight ILSVRC-2010 test images and the five labels the model considers most likely. The correct label is written under each image, and its probability is shown with a red bar if it falls in the top five. Right: five ILSVRC-2010 test images in the first column; the remaining columns show six training images.

An overview of neural network methods used in image recognition. Neural methods are methods based on the use of various types of neural networks (NN). The main directions of application of various NNs to image recognition are:

  • extraction of key characteristics or features of the given images;
  • classification of the images themselves or of characteristics already extracted from them (in the first case the extraction of key characteristics happens implicitly inside the network);
  • solution of optimization problems.

The architecture of artificial NNs has some similarity to natural neural networks. NNs designed to solve different tasks can differ significantly in their algorithms of functioning, but their main properties are as follows.

An NN consists of elements called formal neurons, which are themselves very simple and are connected to other neurons. Each neuron converts the set of signals arriving at its input into an output signal. It is the connections between neurons, encoded by weights, that play the key role. One advantage of NNs (as well as a drawback when implementing them on sequential architectures) is that all elements can operate in parallel, thereby significantly increasing the efficiency of the solution, especially in image processing. Besides allowing many tasks to be solved effectively, NNs provide powerful, flexible and universal learning mechanisms, which is their main advantage over other methods (probabilistic methods, linear separators, decision trees, etc.). Learning removes the need to hand-pick key features, their significance and the relations between them. Nevertheless, the choice of the initial representation of the input data (a vector in n-dimensional space, frequency characteristics, wavelets, etc.) significantly affects the quality of the solution and is a separate topic. NNs have good generalization ability (better than that of decision trees), i.e., they can successfully extend the experience gained on a finite training set to the entire set of images.

We now describe the use of NNs for image recognition, noting the possibilities of applying them to face recognition.

1. Multilayer neural networks

The architecture of a multilayer neural network (MNN) consists of sequentially connected layers, where each neuron of a layer is connected by its inputs to all neurons of the previous layer and by its outputs to those of the next. An NN with two decision layers can approximate any multidimensional function to any accuracy. An NN with one decision layer can only form linear separating surfaces, which strongly narrows the range of problems it can solve; in particular, such a network cannot solve a problem of the "exclusive or" type. An NN with a nonlinear activation function and two decision layers allows any convex regions to be formed in the solution space, and with three decision layers, regions of any complexity, including non-convex ones. At the same time, the MNN does not lose its generalization ability. MNNs are trained using the backpropagation algorithm, which is a gradient descent in the space of weights minimizing the total network error. In this process the errors (more precisely, the weight-correction values) propagate in the reverse direction, from the outputs to the inputs, through the weights connecting the neurons.

The simplest use of a single-layer NN (so-called auto-associative memory) is to train the network to reconstruct the images fed to it. By feeding a test image and computing the quality of the reconstructed image, one can estimate how well the network recognized the input image. The positive properties of this method are that the network can restore distorted and noisy images, but it is not suitable for more serious purposes.
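The "quality of reconstruction" criterion can be sketched as follows (an assumed illustration with hypothetical 4-pixel images; a trained auto-associative network would supply the reconstructed vector):

```python
# Mean squared reconstruction error: small for images the network has
# learned, large for unfamiliar ones, which is what makes it usable as
# a recognition score.

def reconstruction_error(original, reconstructed):
    return sum((a - b) ** 2
               for a, b in zip(original, reconstructed)) / len(original)

stored = [0.9, 0.1, 0.8, 0.2]      # an image the network has memorized
noisy = [0.85, 0.15, 0.75, 0.25]   # a distorted version of it
other = [0.1, 0.9, 0.2, 0.8]       # an unrelated image
err_known = reconstruction_error(stored, noisy)     # small
err_unknown = reconstruction_error(stored, other)   # large
```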

Fig. 1. A multilayer neural network for image classification. The neuron with maximum activity (here the first) indicates membership in the recognized class.

The MNN is also used for direct classification of images: the input is either the image itself in some form or a set of previously extracted key characteristics of the image; at the output, the neuron with the maximum activity indicates membership in the recognized class (Fig. 1). If this activity is below some threshold, it is considered that the submitted image does not belong to any of the known classes. The learning process establishes the correspondence between the images fed to the input and membership in a specific class. This is called supervised learning. Applied to face recognition, this approach works well for access-control tasks involving a small group of persons. It provides a direct comparison of the images by the network itself, but as the number of classes grows, the training and operating time of the network grows exponentially. Therefore, for tasks such as searching for a similar person in a large database, the extraction of a compact key characteristic on which the search can be based is required.

A classification approach using the frequency characteristics of the whole image has been described in the literature. A single-layer NN based on multivalued neurons was used. 100% recognition on the MIT database was reported, but recognition was performed among the images on which the network had been trained.

The use of MNNs to classify images of faces based on characteristics such as the distances between certain specific parts of the face (nose, mouth, eyes) has also been described; in this case those distances were fed to the NN input. Hybrid methods were also used: in the first, the results of processing by a hidden Markov model were fed to the NN, and in the second, the result of the NN's work was fed to the input of a Markov model. In the second case no advantage was observed, which suggests that the result of the NN classification is sufficient on its own.

The use of NNs to classify images when the decomposition of the image by the principal component method is fed to the network input has also been described.

In the classical MNN, the interlayer connections are fully connected and the image is represented as a one-dimensional vector, although it is two-dimensional. The architecture of the convolutional NN (CNN) is aimed at overcoming these shortcomings. It uses local receptive fields (providing local two-dimensional connectivity of neurons), shared weights (providing detection of certain features anywhere in the image) and a hierarchical organization with spatial subsampling. The CNN provides partial resistance to scale changes, shifts, rotations and distortions. The CNN architecture consists of many layers, each of which has several planes, and the neurons of the next layer are connected only to a small number of neurons of the previous layer in a local neighborhood (as in the human visual cortex). The weights at each point of one plane are the same (convolutional layers). A convolutional layer is followed by a layer that reduces its dimension by local averaging; then another convolutional layer follows, and so on. In this way a hierarchical organization is achieved. The later layers extract more general characteristics that depend less on image distortions. The CNN is trained by the standard backpropagation method. A comparison of MNNs and CNNs showed significant advantages of the latter both in speed and in classification reliability. A useful property of the CNN is that the characteristics formed at the outputs of the upper layers of the hierarchy can be used for classification by the nearest-neighbor method (for example, by computing the Euclidean distance), and the CNN can successfully extract such characteristics even for images absent from the training set. CNNs are characterized by fast training and operation.
Testing the CNN on the ORL database, which contains images of faces with small variations in lighting, scale, rotation, position and emotion, showed recognition accuracy of about 98%, including, for known faces, variants of their images absent from the training set. This result makes the architecture promising for further development in the field of face recognition.

MNNs are also applied to the detection of objects of a given type. Besides the fact that any trained MNN can to some extent determine whether images belong to "its" classes, it can be specially trained to reliably detect particular classes. In this case the output classes are "belongs" and "does not belong" to the given type of image. A neural network detector has been used to detect the image of a face in an input image: the image is scanned by a 20x20-pixel window, which is fed to the input of a network deciding whether the region belongs to the class of faces. Training was performed both with positive examples (various images of faces) and negative ones (images without faces). To increase the reliability of detection, a committee of NNs trained with different initial weights was used; as a result the networks erred in different ways, and the final decision was made by a vote of the whole committee.

Fig. 2. Principal components (eigenfaces) and the decomposition of an image into principal components.

Neural networks are also used to extract key characteristics of an image, which are then used for subsequent classification. A neural implementation of the principal component analysis method has also been described. The essence of principal component analysis is to obtain maximally decorrelated coefficients characterizing the input images. These coefficients, called the principal components, are used for statistical image compression, in which a small number of coefficients represents the entire image. A neural network with one hidden layer of n neurons (where n is much less than the dimension of the image), trained by backpropagation to reproduce at its output the image presented at its input, forms at the hidden layer the coefficients of the first n principal components, which are then used for comparison. Typically between 10 and 200 principal components are used. As the component number grows, its representativeness drops sharply, and using components with large numbers makes little sense. With nonlinear activation functions of the neural elements, a nonlinear decomposition into principal components becomes possible; the nonlinearity allows variations in the input data to be reflected more accurately. Applying principal component analysis to images of faces, we obtain principal components called eigenfaces (holons in some works), which also have a useful property: there are components that mainly reflect such essential characteristics of a face as gender, race, and emotion. When reconstructed, the components look face-like: the first reflect the most general shape of a face, the later ones various small differences between faces (Fig. 2). This method works well for searching for similar face images in large databases. A further reduction in the dimensionality of the principal components using neural networks has also been demonstrated.
By assessing the quality of the reconstruction of the input image, one can quite accurately determine whether it belongs to the class of faces.
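
The decomposition described above can be sketched with ordinary (non-neural) principal component analysis via a singular value decomposition. The data below is random and merely stands in for flattened face images; the array sizes and component counts are illustrative assumptions.

```python
import numpy as np

# A minimal PCA sketch: compress each "image" to a few coefficients (the
# principal components) and reconstruct it from them. Random data stands in
# for real face images here.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))          # 100 images flattened to 64 pixels each
mean = X.mean(axis=0)
Xc = X - mean                           # center the data

# Principal components = right singular vectors of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_components = 10                       # typically 10-200 components are kept
W = Vt[:n_components]                   # each row is one principal component

codes = Xc @ W.T                        # each image is now just 10 coefficients
reconstruction = codes @ W + mean       # decode back to 64 "pixels"

# Reconstruction error shrinks as more components are kept
err10 = np.linalg.norm(X - reconstruction)
W2 = Vt[:40]
err40 = np.linalg.norm(X - ((Xc @ W2.T) @ W2 + mean))
print(err10 > err40)  # True: 40 components reconstruct better than 10
```

Assessing the reconstruction error of an input image against the face components in the same way is what allows membership in the class of faces to be judged.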

Introduction

The subject of this study is the development of an image recognition system based on the apparatus of artificial neural networks. The image recognition task is very important, since the ability of a computer to automatically recognize images opens many new opportunities in the development of science and technology, such as systems for finding people and other objects in photographs, quality control of production without human participation, automated vehicle control, and many others.

As for artificial neural networks, in recent years this branch of machine learning has been developing rapidly thanks to a significant increase in the computing power of available computers and the widespread use of graphics cards for computation, which makes it possible to train neural networks of much greater depth and more complex structure than before; these, in turn, show significantly better results than other algorithms on many tasks, in particular image recognition. This direction of neural network development was named deep learning and is currently one of the most successful and rapidly developing. For example, in the annual ImageNet-2014 recognition competition, the overwhelming majority of successful algorithms used deep convolutional networks.

Since the image recognition task is very broad and in most cases requires a separate approach for different types of images, considering it as a whole within a single study is practically impossible, so it was decided to consider, as an example, one particular subtype of image recognition: the recognition of road signs.

Thus, the main purpose of this study was to develop an image recognition system for road signs based on artificial neural networks. To achieve this goal, the following tasks were formulated:

Performing an analytical review of the literature on artificial neural networks and their application to the image recognition task.

Development of an algorithm for recognizing road signs using the apparatus of artificial neural networks.

Development of a prototype recognition system based on the developed algorithm. The result of this task must be a software package that allows the user to upload an image and obtain a prediction of its class.

Experimental studies. It is necessary to conduct experiments and evaluate the accuracy of the algorithm.

During the study, all the tasks set were fulfilled. Specific results for each of them will be described in the main part of the work.

1. Review of literature

1.1 Machine learning

Neural networks, which are discussed in detail in this work, are one of the varieties of machine learning algorithms. Machine learning is one of the subsections of artificial intelligence. The main property of machine learning algorithms is their ability to learn during operation. For example, an algorithm for constructing a decision tree, given no preliminary information about what the data are or what patterns they contain, but only an input set of objects with the values of some features for each of them along with a class label, itself reveals hidden patterns in the process of building the tree, that is, it learns; after training it is capable of predicting the class of new objects it has not seen before.

Two main types of machine learning are distinguished: learning with a teacher (supervised) and learning without a teacher (unsupervised). Learning with a teacher assumes that, besides the source data, the algorithm is provided with some additional information about them, which it can use for learning. The most popular tasks for learning with a teacher are classification and regression. The classification task can be formulated as follows: given a set of objects, each belonging to one of several classes, determine to which of these classes a new object belongs. The road sign recognition task considered in this work is a typical classification task: there are several types of road signs (the classes), and the task of the algorithm is to "recognize" a sign, that is, to attribute it to one of the existing classes.

Learning without a teacher differs from learning with a teacher in that the algorithm is given no additional information besides the set of source data. The most popular example is the clustering task, whose essence is as follows: a set of objects belonging to different classes is given (but which object belongs to which class is unknown, and the number of classes itself may also be unknown), and the goal of the algorithm is to split this set of objects into subsets of "similar" objects, that is, objects belonging to one class.
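
A minimal illustration of learning without a teacher is k-means clustering. The two-blob synthetic data, the number of clusters, and all names below are illustrative assumptions, not part of the original text.

```python
import numpy as np

# A minimal k-means sketch: only the points themselves are given, no class
# labels, yet the algorithm splits them into subsets of similar objects.
rng = np.random.default_rng(1)
# Two unlabeled blobs of 2-D points
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

k = 2
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):
    # Assignment step: each point goes to its nearest center
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    # Update step: each center moves to the mean of its points
    # (a center with no points is left where it is, to avoid a NaN mean)
    centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])

# The two discovered clusters coincide with the two generating blobs
print(sorted(np.bincount(labels)))
```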

Among all machine learning algorithms, several main families are distinguished. For the classification task, the most popular families include, for example:

· Rule-based classifiers - the main idea of such classifiers is to find rules of the form "IF - THEN" for attributing objects to one class or another. Statistical metrics are commonly used to search for such rules; constructing rules from a decision tree is also common.

· Logistic regression - the main idea is the search for a plane that most precisely separates the space into two half-spaces, so that objects of different classes fall into different half-spaces. The equation of the target plane is sought as a linear combination of the input parameters. Such a classifier can be trained, for example, by the gradient descent method.
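
As a sketch of this idea, the following trains a logistic regression by gradient descent on synthetic two-class data; the learning rate, iteration count, and data are illustrative assumptions.

```python
import numpy as np

# A minimal logistic regression trained by gradient descent: the learned
# weights w and bias b define the separating plane w.x + b = 0.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)
b = 0.0
lr = 0.5
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probability of class 1
    grad_w = X.T @ (p - y) / len(y)      # gradient of the log-loss
    grad_b = (p - y).mean()
    w -= lr * grad_w                     # gradient descent step
    b -= lr * grad_b

accuracy = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(accuracy)  # close to 1.0 on this well-separated data
```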

· Bayesian classifier - as the name implies, this classifier is based on Bayes' theorem, which is written in the form

P(C|x) = P(x|C) · P(C) / P(x)

The idea of the classifier is to find the class with the maximum posterior probability, given that all parameters have the values they take for the instance being classified. In the general case, this requires preliminary knowledge of a very large number of conditional probabilities and, accordingly, a huge training sample and high computational complexity, so in practice a variant called the naive Bayesian classifier is most often used, in which all parameters are assumed independent of each other; the formula then takes a much simpler form, and only a small number of conditional probabilities is required to use it.


Although this assumption is usually far from reality, the naive Bayesian classifier often shows good results.
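
A minimal sketch of a naive Bayesian classifier under a Gaussian model of each feature; the synthetic data and all parameter choices are illustrative assumptions.

```python
import numpy as np

# A minimal Gaussian naive Bayes classifier: under the "naive" independence
# assumption, P(C|x) is proportional to P(C) * product_i P(x_i|C), with each
# P(x_i|C) modeled as a one-dimensional Gaussian estimated from the data.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

params = {}
for c in (0, 1):
    Xc = X[y == c]
    params[c] = (Xc.mean(axis=0), Xc.var(axis=0), len(Xc) / len(X))

def predict(x):
    scores = {}
    for c, (mu, var, prior) in params.items():
        # log P(C) + sum_i log P(x_i|C) -- logs avoid numerical underflow
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum()
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)

preds = np.array([predict(x) for x in X])
print((preds == y).mean())  # high accuracy on this synthetic data
```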

· Decision trees - in simplified form, this algorithm builds a tree in which each node corresponds to a certain test performed on the object's parameters, and the leaves are the final classes. There are many varieties of decision trees and algorithms for constructing them; one of the most popular algorithms, for example, is C4.5.

· Neural networks - a model represented as a set of elements (neurons) and connections between them, which in the general case can be directed or undirected and have some weights. During the operation of a neural network, a part of its neurons, called input neurons, receives a signal (the input data), which propagates through the network in some way, and at the output of the network (the output neurons) one can observe the result of the network's operation, for example, the probabilities of individual classes. Neural networks are discussed in more detail in the next section.

· Support vector machines - the idea, as with logistic regression, is to search for a separating plane (or several planes); however, the method of finding this plane differs: the plane sought is the one whose distance to the nearest points representing both classes is maximal, for which quadratic optimization methods are commonly used.

· Lazy classifiers (lazy learners) - a special kind of classification algorithm which, instead of building a model in advance and then making decisions about an object's class on its basis, relies on the idea that similar objects most often share the same class. When such an algorithm receives an object to classify, it searches among the previously seen objects for those similar to it and, using information about their classes, forms its prediction for the class of the target object.
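
The lazy-classification idea can be sketched with k-nearest neighbours, its textbook representative; the synthetic data and the choice of k = 5 are illustrative assumptions.

```python
import numpy as np

# A minimal k-nearest-neighbours classifier, a typical "lazy learner": no
# model is built in advance; each query is answered by looking up the most
# similar training objects and voting on their classes.
rng = np.random.default_rng(4)
X_train = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y_train = np.array([0] * 30 + [1] * 30)

def knn_predict(x, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)   # similarity = distance
    nearest = y_train[np.argsort(dists)[:k]]      # k most similar objects
    return np.bincount(nearest).argmax()          # majority vote

print(knn_predict(np.array([0.1, -0.1])))  # 0: near the first blob
print(knn_predict(np.array([2.1, 1.9])))   # 1: near the second blob
```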

As can be seen, classification algorithms can be based on a variety of ideas and, of course, show different efficiency for different types of tasks. For tasks with a small number of input features, rule-based systems may be useful; when some similarity metric can be computed quickly and conveniently, lazy classifiers; and for tasks with a very large number of parameters that are difficult to identify or interpret, such as image or speech recognition, neural networks become the most suitable classification method.

1.2 Neural Networks

Artificial neural networks are one of the best-known and most widely used machine learning models. The idea of artificial neural networks is based on imitating the nervous system of animals and humans.

A simplified model of the animal nervous system is a system of cells, each of which has a body and branches of two types: dendrites and axons. At a certain moment a cell receives signals from other cells through its dendrites and, if these signals are sufficient, becomes excited and transmits this excitation through its axons to the other cells with which it is connected. Thus the signal (excitation) spreads through the entire nervous system. The model of neural networks is arranged similarly. A neural network consists of neurons and directed connections between them, each connection having some weight. Part of the neurons are input neurons: they receive data from the external environment. At each step, a neuron receives signals from all of its incoming connections, computes the weighted sum of these signals, applies some function to it, and transmits the result along each of its outputs. The network also has a number of output neurons that form the result of the network's operation; for a classification task, the output values of these neurons can represent the predicted probabilities of each class for the input object. Accordingly, training a neural network means selecting weights for the connections between neurons so that the output values for all input data are as close as possible to the true ones.
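
The weighted-sum-plus-activation step described above can be sketched for a single neuron; the sigmoid activation and all the numbers below are illustrative assumptions.

```python
import numpy as np

# A single artificial neuron: the weighted sum of incoming signals is passed
# through an activation function (here the logistic sigmoid).
def neuron(inputs, weights, bias):
    s = np.dot(weights, inputs) + bias   # weighted sum of incoming signals
    return 1 / (1 + np.exp(-s))          # activation function

x = np.array([1.0, 0.5, -1.0])           # signals from three input neurons
w = np.array([0.4, -0.2, 0.1])           # connection weights
out = neuron(x, w, bias=0.0)
print(round(out, 3))  # sigmoid(0.4*1 - 0.2*0.5 - 0.1*1) = sigmoid(0.2) ~ 0.55
```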

Several main types of neural network architectures are distinguished:

· Feed-forward networks - the neurons and the links between them form an acyclic graph, and signals propagate in only one direction. These networks are the most popular and widely studied, and their training presents the fewest difficulties.

· Recurrent neural networks - in such networks, in contrast to feed-forward networks, signals can be transmitted in both directions and can reach the same neuron several times during the processing of one input value. A particular variety of recurrent neural network is, for example, the Boltzmann machine. The main difficulty in working with such networks is their training: creating an efficient training algorithm for the general case is a challenge and still has no universal solution.

· Self-organizing Kohonen maps - a neural network intended primarily for clustering and visualizing data.

In the history of the development of neural networks, three main periods of growth are distinguished. The first studies in the field of artificial neural networks date back to the 1940s. In 1943, W. McCulloch and W. Pitts published the work "A Logical Calculus of the Ideas Immanent in Nervous Activity", which set out the basic principles of building artificial neural networks. In 1949, D. Hebb's book "The Organization of Behavior" was published, in which the author considered the theoretical foundations of learning in neural networks and for the first time formulated the concept of training a neural network as adjusting the weights between neurons. In 1954, W. Clark made the first attempt to implement an analogue of a Hebbian network on a computer. In 1958, F. Rosenblatt proposed the perceptron model, which was essentially a neural network with one hidden layer. The basic structure of Rosenblatt's perceptron is presented in Figure 1.

Figure 1. Rosenblatt's perceptron

This model was trained using the error-correction method: the weights remain unchanged as long as the output of the perceptron is correct, and when an error occurs, each weight changes by 1 in the direction opposite to the sign of the error. As Rosenblatt proved, this algorithm always converges. Using such a model, it proved possible to create a computer that recognized some letters of the Latin alphabet, which was undoubtedly a great success at the time.
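
The error-correction rule can be sketched on a tiny linearly separable problem (logical AND); the representation as a threshold unit with a bias term is an illustrative assumption.

```python
import numpy as np

# Rosenblatt's error-correction rule: weights stay unchanged while the output
# is correct; on an error they shift by 1 in the direction opposite to the
# sign of the error. Here the perceptron learns logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])               # logical AND (linearly separable)

w = np.zeros(2)
b = 0.0
for epoch in range(10):
    for xi, yi in zip(X, y):
        out = 1 if xi @ w + b > 0 else 0
        err = yi - out
        w += err * xi                    # change weights only on error
        b += err

print([1 if xi @ w + b > 0 else 0 for xi in X])  # [0, 0, 0, 1]
```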

However, interest in neural networks decreased significantly after the publication in 1969 of M. Minsky and S. Papert's book "Perceptrons", in which they described significant limitations of the perceptron model, in particular its inability to represent the exclusive-or (XOR) function, and also pointed to the excessively high computing power required to train neural networks. Since these scientists had very high authority in the scientific community, neural networks were for a long time regarded as a dead-end technology. The situation changed only after the backpropagation algorithm was developed in 1974.

The backpropagation algorithm was proposed in 1974 simultaneously and independently by two scientists, P. Werbos and A. Galushkin. The algorithm is based on the gradient descent method. Its basic idea is to propagate information about the error from the outputs of the network to its inputs, that is, in the direction opposite to the normal flow of signals; the weights of the connections are adjusted on the basis of the error information that reaches them. The basic requirement the algorithm imposes is that the activation function of the neurons must be differentiable, since gradient descent, unsurprisingly, is computed from the gradient.
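
A minimal sketch of the algorithm on a one-hidden-layer network trained on XOR, the very function a single perceptron cannot represent. The architecture, learning rate, and iteration count are illustrative assumptions, not the historical formulation.

```python
import numpy as np

# Backpropagation on a network with one hidden layer: the error is propagated
# from the output layer back toward the input, and the weights are adjusted
# by gradient descent. The sigmoid is differentiable, as the method requires.
rng = np.random.default_rng(5)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

sigmoid = lambda z: 1 / (1 + np.exp(-z))
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)   # hidden -> output

loss0 = ((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2).mean()
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                      # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)           # error at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)            # error propagated backward
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(0)

loss1 = ((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2).mean()
print(loss1 < loss0)  # True: the error has decreased during training
```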

The backpropagation algorithm makes it possible to train networks with several hidden layers, bypassing the limitations of the perceptron that had blocked the development of this field earlier. From a mathematical point of view, the algorithm reduces to sequential matrix multiplications, a well-studied and well-optimized task. In addition, the algorithm parallelizes well, which makes it possible to speed up network training significantly. All this together led to a new flourishing of neural networks and a variety of active research in this direction.

At the same time, the backpropagation algorithm has a number of problems. The use of gradient descent carries the risk of converging to a local minimum. Another important problem is the long training time when there are many layers: the error signal, propagating backward, tends to shrink more and more as it approaches the beginning of the network, so the initial layers of the network train extremely slowly. Another disadvantage of neural networks in general is the difficulty of interpreting the results of their work. A trained neural network model is something of a black box: an object is supplied to its input and a forecast appears at its output, but determining which features of the input object were taken into account, and which neuron is responsible for what, is usually quite problematic. This makes neural networks considerably less attractive in comparison, for example, with decision trees, in which the trained model itself represents a distillation of knowledge about the subject area, and it is easy for the researcher to understand why a given object was attributed to a particular class.

These disadvantages, combined with the fact that, although neural networks showed good results, those results were comparable to the results of other classifiers, for example the increasingly popular support vector machines, whose results were much easier to interpret and which required less time to train, led to the next decline in the development of neural networks.

This decline ended only in the 2000s, when the concept of deep learning appeared and began to spread. The revival of neural networks was helped by the emergence of new architectures, such as convolutional networks, restricted Boltzmann machines, and stacked autoencoders, which made it possible to achieve significantly better results in such machine learning tasks as image and speech recognition. An essential factor in their development was also the emergence and spread of powerful video cards and their use for computational tasks. The video card, which differs from the processor by a significantly larger number of cores, albeit each of lower power, is ideally suited to the task of training neural networks. This, combined with the significantly increased power of computers in recent years and the spread of computing clusters, made it possible to train substantially more complex and deeper neural network architectures than before.

1.3 Deep learning

One of the most important problems encountered when using machine learning algorithms is the problem of choosing the right features on which to train. This problem becomes especially significant in tasks such as image recognition, speech recognition, and natural language processing, that is, those where there is no obvious set of features that can be used for training. Typically, the set of features for learning is chosen by the researcher through analytical work, and it is the selected set of features that largely determines the success of the algorithm. For the image recognition task, for example, features might be the prevailing color in the image, the pattern of its variation, the presence of clear boundaries in the image, or something else. The question of image recognition and the choice of correct features is considered in more detail in the corresponding chapter.

However, this approach has significant disadvantages. First, it implies a significant amount of work to identify features, and this work is carried out manually by the researcher and may require much time. Second, the identification of features on the basis of which a quality algorithm can be obtained becomes in many ways a matter of chance; moreover, a person is unlikely to notice features that may strongly influence the internal structure of the image but are not obvious. The idea of automatically identifying features that can then be used by machine learning algorithms is therefore particularly attractive, and it is precisely this opportunity that the deep learning approach provides.

From the point of view of machine learning theory, deep learning is a subset of so-called representation learning. The main concept of representation learning is precisely this automatic search for features on which some algorithm, for example classification, will then operate.

On the other hand, another important problem faced when using machine learning is the presence of factors of variation that can significantly affect the appearance of the source data while having no relation to the essence that the researcher is trying to analyze. In the image recognition task, such factors may be the angle at which the object in the image is turned toward the observer, the time of day, lighting, and so on: depending on the point of view and the weather, a red car may have a different shade and shape in a photograph. Therefore, for tasks such as identifying the item depicted in a photo, it seems reasonable to take into account not specific low-level facts, such as the color of a particular pixel, but characteristics at a higher level of abstraction, for example the presence of wheels. However, determining the presence of wheels from the raw image is itself a nontrivial task whose direct solution can be quite complex. In addition, the presence of wheels is only one of a huge variety of possible features, and defining them all and devising algorithms to check the image for each of them does not look realistic. It is here that researchers can use all the advantages of the deep learning approach. Deep learning is based on representing the source object as a hierarchical structure of features, so that each next level of features is built from elements of the previous level. For images, the lowest level is the raw pixels; the next level is the segments that can be distinguished among these pixels; then come the angles and other geometric shapes into which the segments combine.
At the next level, these shapes form objects recognizable to humans, such as wheels, and finally the last level of the hierarchy is responsible for specific items in the image, for example a car.

Modern implementations of the deep learning approach use multilayer neural networks of various architectures. Neural networks are ideally suited to the task of extracting a hierarchical set of features from data, since a neural network is, in essence, a set of neurons, each of which is activated only if the input data satisfies certain criteria; that is, each neuron is a feature detector, and its activation rules, which define the feature, are trained automatically. At the same time, neural networks in their most common form have a hierarchical structure in which each next layer of neurons uses the neurons of the previous layer as its input; in other words, features of a higher level are formed on the basis of features of a lower level.

The spread of this approach, and with it the next flourishing of neural networks, was driven by three interrelated factors:

· The emergence of new neural network architectures tailored to particular tasks (convolutional networks, Boltzmann machines, etc.)

· The development and availability of computation on GPUs and of parallel computing in general

· The appearance and spread of the layer-by-layer approach to training neural networks, in which each layer is trained separately using the standard backpropagation algorithm (usually on unlabeled data, essentially acting as an autoencoder), which allows the essential features at that level to be identified; all layers are then combined into a single network, and the network is trained further using data labeled for the specific task (fine-tuning). This approach has two significant advantages. First, it substantially increases the efficiency of network training, since at each moment not the whole deep structure but a network with one hidden layer is being trained; as a result, the problems of error values shrinking with network depth, and the corresponding drop in learning speed, disappear. Second, this approach allows unlabeled data, of which there is usually far more than labeled data, to be used in training, which makes network training easier and more accessible to researchers. Labeled data in this approach is required only at the very end, to fine-tune the network for a specific classification task, and since the overall structure of the features describing the data has already been created during the preliminary training, fine-tuning requires significantly less data than the initial feature learning. Besides reducing the required amount of labeled data, this approach also allows a network to be trained once on a large amount of unlabeled data and the resulting feature structure then reused for various classification tasks by fine-tuning the network on different data sets, in far less time than complete retraining of the network each time would require.

Let us consider in a little more detail the main neural network architectures commonly used in the context of deep learning.

· Multilayer perceptron - an ordinary fully connected neural network with a large number of layers. The question of how many layers counts as "large" has no unambiguous answer, but networks with 5-7 layers are usually already considered "deep". Although this architecture has no fundamental differences from the networks used before the spread of deep learning, it can be very effective if the task of training it is solved successfully, which was the main problem with such networks earlier. Currently this problem is solved by using graphics cards to train the network, which accelerates training and thus allows a greater number of training iterations, or by the layer-by-layer training mentioned earlier. In 2012, Ciresan and colleagues published the article "Deep Big Multilayer Perceptrons for Digit Recognition", in which they suggested that a multilayer perceptron with a large number of layers, given sufficiently long training (achievable in reasonable time using parallel computing on GPUs) and sufficient training data (achievable by applying various random transformations to the source data set), can be no less effective than other, more complex models. Their model, a neural network with 5 hidden layers, achieved an error rate of 0.35% when classifying digits from the MNIST dataset, better than the previously published results of more complex models. By combining several networks trained in this way into a single model, they managed to reduce the error rate to 0.31%. Thus, despite its seeming simplicity, the multilayer perceptron is an entirely successful representative of deep learning algorithms.

· Stacked autoencoder - this model is closely related to the multilayer perceptron and, more generally, to the task of training deep neural networks: it is with stacked autoencoders that the layer-by-layer training of deep networks is implemented. However, this model is used not only for training other models and often has great practical importance in itself. To describe the essence of a stacked autoencoder, first consider the concept of an ordinary autoencoder. The autoencoder is an algorithm for learning without a teacher in which the network's own input values serve as its expected output values. Schematically, the autoencoder model is presented in Figure 2:

Figure 2. Classic autoencoder

Obviously, the task of training such a model has a trivial solution if the number of neurons in the hidden layer equals the number of input neurons: the hidden layer then simply relays its input values to the output. Therefore, when training autoencoders, additional constraints are introduced: for example, the number of neurons in the hidden layer is made significantly smaller than in the input layer, or special regularization techniques are applied aimed at ensuring a high degree of sparsity of the hidden-layer neurons. One of the most common applications of autoencoders in pure form is obtaining a compressed representation of the source data. For example, an autoencoder with 30 neurons in the hidden layer, trained on the MNIST dataset, can restore the source images on the output layer quite accurately, which means that each of the source images can in fact be adequately described by only 30 numbers. In this application autoencoders are often considered as an alternative to the principal component method. A stacked autoencoder is essentially a combination of several ordinary autoencoders trained layer by layer, with the output values of the trained hidden-layer neurons of the first autoencoder acting as the input values for the second, and so on.
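
A minimal autoencoder sketch in the spirit described above: a narrow hidden layer (8 neurons for 32 inputs) trained by gradient descent to reproduce its own input. The toy data and all sizes are illustrative assumptions.

```python
import numpy as np

# A minimal autoencoder: the expected outputs are the inputs themselves, and
# the narrow hidden layer forces a compressed representation to be learned.
rng = np.random.default_rng(6)
X = rng.random((200, 32))                 # toy "images" of 32 pixels each

sigmoid = lambda z: 1 / (1 + np.exp(-z))
W_enc = rng.normal(0, 0.1, (32, 8)); b_enc = np.zeros(8)
W_dec = rng.normal(0, 0.1, (8, 32)); b_dec = np.zeros(32)

def forward(X):
    code = sigmoid(X @ W_enc + b_enc)     # compressed 8-number code
    return code, sigmoid(code @ W_dec + b_dec)

_, out = forward(X)
loss_before = ((out - X) ** 2).mean()
for _ in range(2000):
    code, out = forward(X)
    d_out = (out - X) * out * (1 - out) / len(X)
    d_code = (d_out @ W_dec.T) * code * (1 - code)
    W_dec -= 1.0 * code.T @ d_out; b_dec -= 1.0 * d_out.sum(0)
    W_enc -= 1.0 * X.T @ d_code;   b_enc -= 1.0 * d_code.sum(0)

_, out = forward(X)
loss_after = ((out - X) ** 2).mean()
print(loss_after < loss_before)  # True: reconstruction has improved
```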

· Convolutional networks - one of the most popular deep learning models, used primarily for image recognition. The concept of convolutional networks is built on three basic ideas:

o Local receptive fields - for the image recognition task, this means that the recognition of a particular element should primarily be influenced by its immediate surroundings, while pixels located in another part of the image are most likely unrelated to this element and contain no information that would help identify it correctly.

o Shared weights - the presence of shared weights in the model embodies the assumption that the same object can be found in any part of the image, and the same template is therefore used for the search in all parts of the image.

o Subsampling - a concept that makes the model more resistant to insignificant deviations from the desired pattern, including those caused by small deformations, changes in lighting, and so on. The idea of subsampling is that when comparing against the template, it is not the exact value of a given pixel or pixel region that is taken into account, but its aggregation over some neighbourhood, for example the average or maximum value.

From a mathematical point of view, the basis of convolutional neural networks is the matrix convolution operation, which consists of the element-wise multiplication of a matrix representing a small portion of the original image (for example, 7×7 pixels) with a matrix of the same size called the convolution kernel, followed by summation of the resulting values. The convolution kernel is in essence a template, and the number obtained as the result of the summation characterizes the degree of similarity of this image area to the template. Accordingly, each layer of a convolutional network consists of a certain number of templates, and the task of training the network is to select the correct values in these templates, so that they reflect the most significant characteristics of the source images. Each template is compared in turn with all parts of the image, which is precisely the expression of the idea of weight sharing. Layers of this type in a convolutional network are called convolutional layers. In addition to convolutional layers there are subsampling layers, which replace small areas of the image with a single number, thereby simultaneously reducing the input size for the next layer and making the network more resistant to small changes in the data. In the last layers of a convolutional network, one or more fully connected layers are commonly used to perform the classification of objects directly. In recent years the use of convolutional networks has become the de facto standard for image classification and allows the best results in this area to be achieved.
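
The two operations just described can be sketched directly. Strictly speaking, the loop below computes cross-correlation, which is what convolutional networks usually compute in practice; the image, the kernel, and all sizes are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the two operations underlying convolutional networks:
# sliding a small kernel (the "template") over an image, and 2x2 max-pooling
# (the subsampling step).
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # element-wise product with the template, then summation:
            # a large value means this patch resembles the template
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

def max_pool(x, size=2):
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.zeros((8, 8))
image[:, 4] = 1.0                        # a vertical line
kernel = np.array([[-1., 1., -1.],       # template responding to vertical lines
                   [-1., 1., -1.],
                   [-1., 1., -1.]])
response = convolve2d(image, kernel)
print(response.shape)        # (6, 6)
print(response.max())        # 3.0, strongest where the line matches the template
pooled = max_pool(response)
print(pooled.shape)          # (3, 3)
```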

· Restricted Boltzmann Machines, another kind of deep learning model, used, in contrast to convolutional networks, primarily for speech recognition tasks. A Boltzmann machine in its classical form is an undirected graph reflecting the dependencies between its nodes (neurons); part of the neurons are visible and part are hidden. From the point of view of neural networks, a Boltzmann machine is essentially a recurrent neural network; from the point of view of statistics, a Markov random field. Important concepts for Boltzmann machines are the network energy and equilibrium states. The energy of the network depends on how many of the connected neurons are simultaneously in the activated state, and training such a network consists in driving it to the equilibrium state in which its energy is minimal. The main disadvantage of such networks is that training them in the general case is very difficult. To address this, G. Hinton and colleagues proposed the Restricted Boltzmann Machine, which constrains the network structure to a bipartite graph with only visible neurons in one part and only hidden neurons in the other, so that connections exist only between visible and hidden neurons. This restriction made it possible to develop efficient training algorithms for networks of this kind, which led to significant progress in speech recognition, where this model has practically displaced the previously dominant hidden Markov models.
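The network energy mentioned above has a standard closed form (this is the textbook definition, not taken from the source). For visible units $v_i$, hidden units $h_j$, biases $a_i$, $b_j$, and weights $w_{ij}$:

```latex
E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i \, w_{ij} \, h_j
```

Low-energy configurations are high-probability ones, since $P(\mathbf{v}, \mathbf{h}) \propto e^{-E(\mathbf{v}, \mathbf{h})}$, which is why training seeks the weights that minimize the energy of the observed data.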

Now, having considered the basic concepts and principles of deep learning, let us briefly review the main principles and the evolution of image recognition, and the place that deep learning occupies in it.

1.4 Image recognition

There are many formulations of the image recognition task, and giving it a single definition is rather difficult. For example, image recognition can be considered as the task of finding and identifying certain logical objects in a source image.

Image recognition is traditionally a hard task for a computer algorithm, first of all because of the high variability of the images of individual objects. Thus, the task of finding a car in an image is simple for the human brain, which automatically identifies the features characteristic of a car (wheels, a specific shape) and, if necessary, "completes" the picture in the imagination by filling in the missing parts, but extremely difficult for a computer, since there is a huge number of car makes and models with widely varying shapes, and, in addition, the final appearance of an object in an image depends strongly on the shooting point, the angle from which it is captured, and other parameters. Lighting also plays an important role: it affects the colors of the resulting image and can hide or distort individual parts.

Thus, the main difficulties in image recognition stem from:

· Variability of objects within a class

· Variability of shape, size, orientation, and position in the image

· Lighting variability

Throughout the history of image recognition, various methods have been proposed to combat these difficulties, and significant progress has already been made in this area.

The first studies in the field of image recognition were published in 1963 by L. Roberts in the article "Machine Perception of Three-Dimensional Solids", where the author attempted to abstract away from possible changes in the shape of an object and concentrated on recognizing images of simple geometric shapes under varying lighting and in the presence of rotations. The computer program he developed could identify geometric objects of certain simple shapes in an image and build their three-dimensional model on the computer.

In 1987, S. Ullman and D. Huttenlocher published the article "Object Recognition using Alignment", where they also attempted to recognize objects of relatively simple shapes, organizing the recognition process in two stages: first, the search for the image region in which the target object is located, together with the determination of its possible size and orientation ("alignment") using a small set of characteristic features; then, a pixel-by-pixel comparison of the potential object image with the expected one.

However, pixel-by-pixel image comparison has many significant drawbacks, such as its computational cost, the need for a template for each of the possible object classes, and the fact that it can only find a specific object, not a whole class of objects. In some situations this is acceptable, but in most cases what is needed is a search not for a single specific object but for any object of a given class.

One of the important directions in the further development of image recognition was recognition based on contour extraction. In many cases it is precisely the contours that contain most of the information about an image, and at the same time representing an image as a set of contours simplifies it considerably. The classic and best-known approach to finding contours in an image is the Canny edge detector, whose operation is based on searching for the local maximum of the gradient.
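The gradient search at the heart of the Canny detector can be illustrated with a greatly simplified sketch (the real detector adds Gaussian smoothing, non-maximum suppression along the gradient direction, and hysteresis thresholding; this toy version only thresholds the gradient magnitude):

```python
# Simplified edge detection: mark pixels where the local intensity
# gradient magnitude (via central differences) exceeds a threshold.
import math

def gradient_magnitude(image):
    """Central-difference gradient magnitude for interior pixels."""
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = (image[i][j + 1] - image[i][j - 1]) / 2.0
            gy = (image[i + 1][j] - image[i - 1][j]) / 2.0
            mag[i][j] = math.hypot(gx, gy)
    return mag

def detect_edges(image, threshold=0.4):
    mag = gradient_magnitude(image)
    return [[1 if m > threshold else 0 for m in row] for row in mag]

# A step edge between a dark and a bright region is marked as a contour.
img = [[0, 0, 1, 1]] * 4
edges = detect_edges(img)
```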

Another important direction in image analysis is the use of mathematical methods such as frequency filtering and spectral analysis. These methods are used, for example, for image compression (JPEG) or for improving image quality (Gaussian filter). However, since they are not directly related to image recognition, they will not be considered here in more detail.

Another task often considered alongside image recognition is segmentation. Its main objective is to isolate individual objects in an image, each of which can then be studied separately. The segmentation task is greatly simplified if the source image is binary, that is, consists of pixels of only two colors. In this case the problem is often solved using methods of mathematical morphology. Their essence is to represent the image as a set of binary values and apply logical operations to this set, among which are translation, dilation (logical addition), and erosion (logical multiplication). With these operations and their derivatives, such as closing and opening, one can, for example, eliminate noise in an image or extract boundaries. When such methods are used for segmentation, the key task becomes precisely eliminating noise and forming more or less homogeneous regions in the image, which can then easily be found with algorithms similar to the search for connected components in a graph; these regions are the desired image segments.
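The binary morphology operations mentioned above are easy to sketch. The following illustration (pure Python, 3×3 square structuring element, toy image) shows how opening, i.e. erosion followed by dilation, removes isolated noise pixels while preserving larger objects:

```python
# Binary morphology: dilation = logical OR over a neighbourhood,
# erosion = logical AND; opening = erosion then dilation.

def _neighbourhood(image, i, j):
    h, w = len(image), len(image[0])
    return [image[y][x]
            for y in range(max(0, i - 1), min(h, i + 2))
            for x in range(max(0, j - 1), min(w, j + 2))]

def dilate(image):
    return [[1 if any(_neighbourhood(image, i, j)) else 0
             for j in range(len(image[0]))] for i in range(len(image))]

def erode(image):
    return [[1 if all(_neighbourhood(image, i, j)) else 0
             for j in range(len(image[0]))] for i in range(len(image))]

def opening(image):
    """Erosion followed by dilation: removes speckle noise, keeps blobs."""
    return dilate(erode(image))

# A 3x3 object plus one isolated noise pixel in the corner:
img = [
    [1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
opened = opening(img)  # the noise pixel disappears, the 3x3 block survives
```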

As for segmenting RGB images, one important source of information about the segments of an image is its texture. To characterize image texture, the Gabor filter is often used; it was created in an attempt to reproduce the way human vision perceives textures. The operation of this filter is based on a frequency-domain transformation of the image.

Another important family of image recognition algorithms is based on the search for local features. Local features are well-distinguishable regions of the image that make it possible to relate the image to a model (the desired object), determine whether the image corresponds to the model, and, if it does, determine the model parameters (for example, the angle of inclination, the applied compression, etc.). To perform their function well, local features must be robust to affine transformations, shifts, and so on. A classic example of local features are corners, which are often present on the boundaries of various objects. The most popular algorithm for finding corners is the Harris detector.
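A compact sketch of the Harris corner response may help here (illustrative only; window weights, constants, and the toy image are invented for the example). Gradients are accumulated over a window into the structure tensor M, and the response R = det(M) - k·trace(M)² is large only where intensity varies in two directions at once, i.e. at a corner rather than along a straight edge:

```python
# Harris corner response on a tiny grayscale image (pure Python).

def harris_response(image, k=0.05):
    h, w = len(image), len(image[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    # Central-difference gradients on interior pixels.
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx[i][j] = (image[i][j + 1] - image[i][j - 1]) / 2.0
            gy[i][j] = (image[i + 1][j] - image[i - 1][j]) / 2.0
    response = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            sxx = sxy = syy = 0.0
            for di in (-1, 0, 1):          # 3x3 accumulation window
                for dj in (-1, 0, 1):
                    x, y = gx[i + di][j + dj], gy[i + di][j + dj]
                    sxx += x * x
                    sxy += x * y
                    syy += y * y
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            response[i][j] = det - k * trace * trace
    return response

# A bright square occupying one quadrant: the response peaks at its corner
# (3, 3), not along its straight edges and not in the flat regions.
img = [[1 if (i >= 3 and j >= 3) else 0 for j in range(6)] for i in range(6)]
resp = harris_response(img)
```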

Recently, image recognition methods based on neural networks and deep learning have become increasingly popular. These methods began to flourish after convolutional networks appeared at the end of the 20th century (LeCun et al.), showing significantly better image recognition results than the other methods. Thus, most of the leading (and not only leading) algorithms in the annual ImageNet-2014 image recognition competition used convolutional networks in one form or another.

1.5 Traffic sign recognition

Road sign recognition is, in general, one of many image (or, in some cases, video) recognition tasks. It has great practical value, since road sign recognition is used, for example, in vehicle automation systems. The task has many variations: detecting the presence of road signs in a photo, selecting the image region that contains a road sign, determining which particular sign is depicted in the photo, and so on. Three global subtasks are usually distinguished: detection of signs against the surrounding landscape; recognition proper, that is, classification; and so-called tracking, meaning the algorithm's ability to "follow" a road sign, keeping it in focus across video frames. Each of these subtasks is a separate research subject with its own circle of researchers and traditional approaches. This work focuses on the task of classifying the road sign depicted in a photo, so we consider it in more detail.

This is a classification task with unbalanced class frequencies: the probability of an image belonging to different classes differs, since some classes occur more often than others. For example, on Russian roads the speed limit sign "40" occurs significantly more often than the "no through road" sign. In addition, road signs form several groups of classes that are strongly similar within a group: for example, all speed limit signs look very similar and differ only in the numbers inside them, which of course considerably complicates classification. On the other hand, road signs have a clear geometric shape and a small set of possible colors, which could significantly simplify classification, were it not for the fact that real photographs of road signs are taken from different angles and under different lighting. Thus, although the classification of road signs can be considered a typical image recognition task, achieving the best results requires a special approach.

Until a certain point, research on this topic was rather scattered and unconnected: each researcher set his own tasks and used his own data set, so existing results could not be compared or generalized. For example, in 2005 Bahlmann and colleagues, within a comprehensive road sign recognition system supporting all three previously mentioned subtasks, implemented a sign recognition algorithm operating with an accuracy of 94% on road signs from 23 different classes. Training was performed on 40,000 images, with the number of images per class varying from 30 to 600. This system used the AdaBoost algorithm and Haar wavelets to detect road signs, and an approach based on the Expectation Maximization algorithm to classify the signs found. The speed limit sign recognition system developed by Moutarde in 2007 achieved accuracy of up to 90% and was trained on a set of 281 images. It used circle and square detectors to find road signs in images (for European and American signs, respectively); each digit was then extracted and classified using a neural network. In 2010, Ruta and colleagues developed a system for detecting and classifying 48 different types of road signs with a classification accuracy of 85.3%. Their approach was based on finding circles and polygons and selecting a small number of special regions in them that distinguish a given sign from all others. They also used a special color transformation, called by the authors the Color Distance Transform, which reduces the number of colors present in the image and accordingly improves the comparability of images while reducing the amount of data processed.
Broggi and colleagues in 2007 proposed a three-stage algorithm for detecting and classifying road signs, consisting of color segmentation, shape detection, and a neural network, but their publication gives no quantitative results for the algorithm. Gao et al. in 2006 presented a sign recognition system based on analysis of the color and shape of the candidate sign, showing 95% recognition accuracy on 98 instances of road signs.

The fragmented state of road sign recognition research changed in 2011, when the IJCNN conference (International Joint Conference on Neural Networks) held a road sign recognition competition. For this competition, the GTSRB data set (German Traffic Sign Recognition Benchmark) was created, containing more than 50,000 images of road signs found on German roads, divided into 43 different classes. The competition based on this data set consisted of two stages. Following the second stage, the article "Man vs. Computer: Benchmarking Machine Learning Algorithms for Traffic Sign Recognition" was published, providing an overview of the competition results and descriptions of the approaches used by the most successful teams. In the wake of this event, a number of articles were also published by the authors of the participating algorithms, and this data set later became the basic benchmark for road sign recognition algorithms, much like the well-known MNIST for handwritten digit recognition.

The most successful algorithms in this competition were a committee of convolutional networks (the IDSIA team), a multi-scale convolutional network (Multi-Scale CNN, the Sermanet team), and random forests (Random Forests, the CAOR team). Let us consider each of these algorithms in a little more detail.

The neural network committee proposed by the IDSIA team from the Italian Dalle Molle Institute for Artificial Intelligence Research, led by D. Ciresan, achieved an accuracy of 99.46%, which is higher than human accuracy (99.22%) as assessed within the same competition. This algorithm was subsequently described in more detail in the article "Multi-Column Deep Neural Network for Traffic Sign Classification". The main idea of the approach is that four different normalization methods were applied to the source data: image adjustment, histogram equalization, adaptive histogram equalization, and contrast normalization. Then, for each data set obtained as a result of normalization, as well as for the original data set, five convolutional networks of eight layers each were built and trained with randomly initialized weights, and various random transformations were applied to the network inputs, which increased the size and variability of the training sample. The resulting prediction was formed by averaging the predictions of all the convolutional networks. GPU-based computation was used to train these networks.
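The committee's final step, averaging the class probability vectors of the individual networks, can be sketched as follows (the probability vectors below are made up for illustration):

```python
# Toy committee: average per-network class probabilities and take the argmax.

def committee_predict(predictions):
    """predictions: list of per-network probability vectors (equal length)."""
    n_classes = len(predictions[0])
    mean = [sum(p[c] for p in predictions) / len(predictions)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean

nets = [
    [0.7, 0.2, 0.1],   # network 1 favours class 0
    [0.1, 0.8, 0.1],   # network 2 favours class 1
    [0.6, 0.3, 0.1],   # network 3 favours class 0
]
label, mean_probs = committee_predict(nets)  # class 0 wins the vote
```

Averaging rather than hard voting lets a network that is only mildly confident still contribute its full probability mass to the decision.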

The algorithm using a multi-scale convolutional network was proposed by a team consisting of P. Sermanet and Y. LeCun from New York University and was described in detail in the article "Traffic Sign Recognition with Multi-Scale Convolutional Networks". In this algorithm, all source images were scaled to 32×32 pixels and converted to shades of gray, after which contrast normalization was applied to them. The size of the original training set was also increased fivefold by applying small random transformations to the source images. The resulting network consisted of two stages, as shown in Figure 3; the final classifier used the output values not only of the second stage but also of the first. This network achieved an accuracy of 98.31%.

Figure 3. Multi-scale convolutional network

The third successful algorithm, based on random forests, was developed by the CAOR team from Mines ParisTech. A detailed description of their algorithm was published in the article "Real-Time Traffic Sign Recognition using Spatially Weighted HOG Trees". The algorithm builds a forest of 500 random decision trees, each trained on a randomly selected subset of the training set, with the final output of the classifier determined by majority vote. Unlike the previously discussed approaches, this classifier used not the source images as sets of pixels but the HOG representations (histograms of oriented gradients) provided by the competition organizers along with them. The final result of the algorithm was 96.14% correctly classified images, which shows that methods unrelated to neural networks and deep learning can also be used for road sign recognition, although their performance still lags behind that of convolutional networks.

1.6 Analysis of existing libraries

To implement the neural network algorithms in the system being developed, it was decided to use one of the existing libraries. An analysis of existing software for implementing deep learning algorithms was therefore carried out, and a choice was made on its basis. The analysis consisted of two phases: theoretical and practical.

During the theoretical phase, the libraries Deeplearning4j, Theano, Pylearn2, Torch, and Caffe were considered. Let us look at each of them in more detail.

· Deeplearning4j (www.deeplearning4j.org), an open source library for neural networks and deep learning algorithms written in Java. It can be used from Java, Scala, and Clojure, and integration with Hadoop, Apache Spark, Akka, and AWS is supported. The library is developed and maintained by Skymind, which also provides commercial support for it. Internally it uses ND4J, a library for fast work with n-dimensional arrays developed by the same company. Deeplearning4j supports many network types, among them multilayer perceptrons, convolutional networks, restricted Boltzmann machines, stacked denoising autoencoders, deep autoencoders, recursive autoencoders, deep belief networks, recurrent networks, and some others. An important feature of this library is its ability to run on a cluster. It also supports network training on the GPU.

· Theano (www.github.com/theano/theano), an open source Python library that allows one to efficiently create, evaluate, and optimize mathematical expressions involving multidimensional arrays. Multidimensional arrays and operations on them are represented using the NumPy library. Theano is intended primarily for scientific research and was created by a group of scientists from the University of Montreal. Its capabilities are very broad, and work with neural networks is only a small part of them. At the same time, it is the most popular and most frequently mentioned library when it comes to deep learning.

· Pylearn2 (www.github.com/lisa-lab/pylearn2), an open source Python library built on top of Theano but providing a more convenient and simple interface for researchers: it offers a ready-made set of algorithms and allows simple configuration of networks via YAML files. Developed by a group of scientists from the LISA laboratory of the University of Montreal.

· Torch (www.torch.ch), an open source library for computation and machine learning algorithms implemented in C but allowing researchers to work with it through the much more convenient Lua scripting language. The library provides its own efficient implementation of operations on matrices and multidimensional arrays and supports computation on the GPU. It allows the implementation of both fully connected and convolutional networks.

· Caffe (www.caffe.berkeleyvision.org), a library focused on the efficient implementation of deep learning algorithms, developed primarily by the Berkeley Vision and Learning Center; like all the previous ones, it is open source. The library is implemented in C++ but also provides convenient interfaces for Python and Matlab. It supports fully connected and convolutional networks, allows networks to be described as a set of layers in the .prototxt format, and supports computation on the GPU. Among the library's advantages is also a large number of pretrained models and examples, which, together with its other characteristics, makes it the easiest of the libraries listed above to start working with.

Based on the combined criteria, three libraries were selected for further consideration: Deeplearning4j, Theano, and Caffe. These three libraries were installed and tested in practice.

Among these libraries, Deeplearning4j turned out to be the most problematic to install; in addition, errors were found in the demonstration examples supplied with the library, which raised certain doubts about its reliability and made further study considerably harder. Taking into account the lower performance of Java relative to C++, in which Caffe is implemented, it was decided to abandon this library.

The Theano library also turned out to be quite difficult to install and configure; however, a large amount of high-quality, well-structured documentation and working code examples exist for it, so in the end the library was successfully set up, including GPU support. As it turned out, though, implementing even an elementary neural network in this library requires writing a large amount of custom code, and describing and modifying the network structure is correspondingly difficult. Therefore, despite this library's potentially much broader capabilities compared with Caffe, it was decided to settle on the latter as the best fit for the tasks at hand.

1.7 The Caffe library

The Caffe library provides a simple and convenient interface that allows the researcher to easily configure and train neural networks. To work with the library, one creates a network description in the prototxt format (Protocol Buffer definition files, a data description language created by Google), which is somewhat similar to JSON, well structured, and human-readable. The network description is essentially a sequential description of each of its layers. As input data, the library can work with databases (LevelDB or LMDB), in-memory data, HDF5 files, and images. A special data type called DummyData is also available for development and testing purposes.

The library supports layers of the following types: InnerProduct (fully connected layer), Splitting (sends the data to several output layers at once), Flattening (converts data from a multidimensional matrix to a vector), Reshape (changes the dimensions of the data), Concatenation (merges several input layers into one output), Slicing, and some others. Special layer types are also supported for convolutional networks: Convolution (convolutional layer), Pooling (subsampling layer), and Local Response Normalization (local data normalization layer). In addition, several loss functions used in network training are supported (Softmax, Euclidean, Hinge, Sigmoid Cross-Entropy, Infogain, and Accuracy), as well as neuron activation functions (Rectified-Linear, Sigmoid, Hyperbolic Tangent, Absolute Value, Power, and BNLL), all of which are also configured as separate network layers.
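As an illustration, a fragment of a prototxt network description using such layers might look as follows (layer names and parameter values here are invented for the sketch, not taken from the configurations used in this study):

```protobuf
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param { num_output: 20 kernel_size: 5 }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param { pool: MAX kernel_size: 2 stride: 2 }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool1"
  top: "ip1"
  inner_product_param { num_output: 43 }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip1"
  bottom: "label"
  top: "loss"
}
```

Each `layer` block names its inputs (`bottom`) and outputs (`top`), so the whole network is simply a chain of such declarations.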

Thus, the network is described declaratively in a fairly simple form. Examples of the network configurations used in this study can be seen in Appendix 1. To use the library's standard scripts, one also needs to create a solver.prototxt file, which describes the training configuration: the number of training iterations, the learning rate, the computing platform (CPU or GPU), and so on.

Model training can be launched using the bundled scripts (after adapting them to the task at hand) or manually, by writing code against the provided Python or Matlab API. There are scripts that allow one not only to run network training but also, for example, to create a database from a provided list of images; in this case, before being added to the database, the images are scaled to a fixed size and normalized. The available scripts also encapsulate some auxiliary actions, for example, evaluating the current accuracy of the model every few iterations and saving the current state of the trained model to a snapshot file. Snapshot files make it possible to resume training instead of starting over if the need arises, and also, after a certain number of iterations, to change the model configuration, for example by adding a new layer, while the already trained weights of the earlier layers retain their values, which makes it possible to implement the layer-wise pre-training mechanism described earlier.

Overall, the library proved quite convenient to work with and made it possible to implement all the desired models and to obtain classification accuracy values for them.

2. Development of the prototype of the image recognition system

2.1 Image classification algorithm

In the course of studying the theoretical material and conducting practical experiments, the following set of ideas took shape to be embodied in the final algorithm:

· Use deep convolutional neural networks. Convolutional networks consistently show the best results in image recognition, including road sign recognition, so their use in the developed algorithm is a natural choice.

· Use multilayer perceptrons. Despite the overall effectiveness of convolutional networks, there are types of images for which a multilayer perceptron shows better results, so it was decided to use this algorithm as well.

· Combine the results of several models using an additional classifier. Since it was decided to use at least two types of neural networks, a method is required to form a common classification result from the results of each of them. For this purpose, an additional classifier not based on neural networks is used, whose input values are the classification results of each of the networks and whose output is the final predicted image class.

· Apply additional transformations to the input data. To make the input images more amenable to recognition and thus improve the classifier's performance, several types of transformations are applied to the input data, and the results of each are processed by a separate network trained to recognize images with precisely that type of transformation.

Based on all the above ideas, the following image classifier concept was formed. The classifier is an ensemble of six independently operating neural networks: two multilayer perceptrons and four convolutional networks. Networks of the same type differ from each other in the type of transformation applied to the input data. The input data is scaled so that each network always receives data of the same size, and these sizes may differ between networks. To aggregate the results of all the networks, an additional classical classifier is used, for which two options were tried: the J48 algorithm, based on decision trees, and the KStar algorithm, a "lazy" classifier. The transformations used in the classifier are:

· Binarization: the image is replaced with a new one consisting of pixels of only black and white. Adaptive thresholding is used to perform the binarization. The essence of the method is that for each pixel, the average value over some neighbourhood of pixels is computed (it is assumed that the image contains only shades of gray; the source images were converted accordingly beforehand), and then, on the basis of this average, it is decided whether the pixel should be considered black or white.

· Histogram equalization: a function is applied to the image histogram such that the values in the resulting histogram are distributed as uniformly as possible. The target function is calculated from the color intensity distribution function of the source image. An example of applying such a function to an image histogram is shown in Figure 4. This method can be used both for black-and-white and for color images, separately for each color component; in this study, both options were used.

Figure 4. Results of applying histogram equalization to an image

· Contrast strengthening: for each pixel, a local minimum and maximum over some neighbourhood of the image are computed, and the pixel is then replaced by the local maximum if its initial value is closer to the maximum, or by the local minimum otherwise. Applied to black-and-white images.
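The three transformations above can be sketched in a few lines of Python. This is an illustrative reimplementation on tiny toy images (neighbourhood sizes and all pixel values are invented for the example), not the code used in the system:

```python
# Toy versions of the three input transformations: adaptive-threshold
# binarization, histogram equalization, and local contrast strengthening.

def neighbourhood(image, i, j, radius=1):
    """Pixel values in the (2*radius+1)-square window around (i, j)."""
    h, w = len(image), len(image[0])
    return [image[y][x]
            for y in range(max(0, i - radius), min(h, i + radius + 1))
            for x in range(max(0, j - radius), min(w, j + radius + 1))]

def adaptive_binarize(image):
    """White (1) where a pixel exceeds its local mean, black (0) otherwise."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            patch = neighbourhood(image, i, j)
            out[i][j] = 1 if image[i][j] > sum(patch) / len(patch) else 0
    return out

def equalize(image, levels=256):
    """Remap intensity levels through the normalized cumulative histogram."""
    total = len(image) * len(image[0])
    hist = [0] * levels
    for row in image:
        for v in row:
            hist[v] += 1
    cdf, acc = [], 0
    for count in hist:
        acc += count
        cdf.append(acc)
    cdf_min = next(c for c in cdf if c > 0)
    remap = lambda v: round((cdf[v] - cdf_min) / (total - cdf_min) * (levels - 1))
    return [[remap(v) for v in row] for row in image]

def sharpen_contrast(image):
    """Snap each pixel to the nearer of its local minimum and maximum."""
    out = []
    for i in range(len(image)):
        row = []
        for j in range(len(image[0])):
            patch = neighbourhood(image, i, j)
            lo, hi = min(patch), max(patch)
            v = image[i][j]
            row.append(hi if hi - v <= v - lo else lo)
        out.append(row)
    return out

# One bright pixel on a dark background is isolated by binarization:
spot = [[10] * 4 for _ in range(4)]
spot[1][1] = 100
binary = adaptive_binarize(spot)

# A low-contrast image occupying levels 100..103 is stretched over 0..255:
flat = [[100] * 4, [101] * 4, [102] * 4, [103] * 4]
eq = equalize(flat)

# A soft intensity ramp becomes a hard step:
ramp = [[0, 20, 70, 90]] * 3
crisp = sharpen_contrast(ramp)
```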

The general scheme of the resulting classifier is presented in Figure 5:

Figure 5. Final classifier scheme

The part of the model responsible for transforming the input data and for the neural networks is implemented using the Python language and the Caffe library. Let us describe the structure of each of the networks in more detail.

Both multilayer perceptrons contain four hidden layers, and their overall configuration is as follows:

· Input layer

· Layer 1, 1500 neurons

· Layer 2, 500 neurons

· Layer 3, 300 neurons

· Layer 4, 150 neurons

· Output layer

An example of a Caffe configuration file describing this network can be seen in Appendix 1. As for the convolutional networks, the well-known LeNet network (developed by LeCun for handwritten digit recognition) was taken as the basis of their architecture. However, the network was modified to suit the images in question, which are significantly smaller. A brief description of it follows.

The scheme of this network is presented in Figure 6.

Figure 6. Convolutional network scheme

Each of the neural networks included in the model is trained separately. After training, a special Python script obtains, for each network and each image of the training set, the classification result in the form of a list of class probabilities, selects the two most likely classes, and writes the obtained values, together with the real image class, to a file. The resulting file is then passed as the training set to a classifier (J48 or KStar) implemented in the Weka library. Further classification is accordingly performed using this library.
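The feature-extraction step described above can be sketched as follows (function names and the toy probability vectors are illustrative, not the author's actual script): each network's probability list is reduced to its two most likely classes, and the flattened values plus the true label form one training row for the secondary classifier.

```python
# Build one training row for the meta-classifier (J48 / KStar) from the
# per-network class probabilities and the true label.

def top_two(probabilities):
    """Return the indices of the two most probable classes, best first."""
    ranked = sorted(range(len(probabilities)),
                    key=probabilities.__getitem__, reverse=True)
    return ranked[0], ranked[1]

def build_meta_row(per_network_probs, true_label):
    """Flatten the top-2 predictions of every network into one feature row."""
    row = []
    for probs in per_network_probs:
        row.extend(top_two(probs))
    row.append(true_label)
    return row

# Two toy networks, three classes, true class 2:
net_outputs = [
    [0.1, 0.3, 0.6],   # network 1: top-2 are classes 2, 1
    [0.2, 0.5, 0.3],   # network 2: top-2 are classes 1, 2
]
row = build_meta_row(net_outputs, true_label=2)
```

Writing one such row per training image yields exactly the kind of tabular data set that Weka's classifiers consume.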

2.2 System architecture

Having considered the algorithm for recognizing road signs using neural networks and an additional classifier, we now proceed to describe the developed system that uses it.

The designed system is an application with a web interface that lets the user upload an image of a road sign and obtain the classification result produced by the described algorithm. The application consists of 4 modules: the web application, the neural network module, the classification module and the administrator interface. The interaction of the modules is shown schematically in Figure 7.

Figure 7. Scheme of the classification system

The numbers in the diagram indicate the sequence of actions when the user works with the system. The user uploads an image. The request is processed by the web server, and the image is passed to the neural network module, where all necessary transformations are performed on it (scaling, color scheme conversion, etc.), after which each neural network forms its prediction. The control logic of this module then selects the two most likely predictions from each network and returns this data to the web server, which forwards it to the classification module. There the data is processed and the final answer about the predicted sign class is produced, returned to the web server and from there to the user. All interaction between the user and the web server, and between the web server and the neural network and classification modules, is carried out via REST requests over HTTP. The image is transmitted in multipart form-data format, and the classifier results in JSON. This design keeps the modules well isolated from one another, so they can be developed independently, including in different programming languages, and the logic of any one module can easily be changed without affecting the others.
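To illustrate the transport format, here is a hedged sketch of how a client could assemble such a multipart/form-data request using only the standard library (the endpoint path and field name are assumptions, not taken from the system):

```python
import uuid

def encode_multipart(field, filename, data):
    """Minimal multipart/form-data encoder using only the standard library."""
    boundary = uuid.uuid4().hex
    head = (f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n").encode()
    body = head + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

body, ctype = encode_multipart("image", "sign.jpg", b"\xff\xd8\xff")

# Hypothetical endpoint; the reply would be JSON, e.g. {"class": 14}:
# req = urllib.request.Request("http://localhost:8080/classify", data=body,
#                              headers={"Content-Type": ctype})
# result = json.load(urllib.request.urlopen(req))
```

In practice an HTTP client library handles this encoding automatically; the sketch only shows what travels over the wire between the modules.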

The user interface of the system is implemented in HTML and JavaScript, the web server and the classification module in Java, and the neural network module in Python. The appearance of the user interface is shown in Figure 8.

Figure 8. System user interface

Using the system assumes that the neural network and classification modules already contain trained models. An administrator interface is provided for training the models; it is essentially a set of Python scripts for training the neural networks and a Java console utility for training the final classifier. These tools are not expected to be used frequently or by non-professional users, so no more elaborate interface is required for them.

Overall, the designed application successfully performs all the tasks set before it, in particular allowing the user to conveniently obtain a class prediction for a chosen image. What remains is the question of the practical results of the classifier used in this algorithm; it is considered in Chapter 3.

3. Experimental research results

3.1 Source data

As input data, this study used the previously mentioned GTSRB dataset (German Traffic Sign Recognition Benchmark). It consists of 51,840 images belonging to 43 classes, with different numbers of images per class. The distribution of the number of images by class is presented in Figure 9.

Figure 9. Distribution of the number of images by classes

Input image sizes also vary: the smallest image is 15 pixels wide, the largest 250 pixels. The overall distribution of image sizes is shown in Figure 10.

Figure 10. Distribution of image sizes

The source images are provided in PPM format, i.e. as files where each pixel corresponds to three numbers: the intensities of the red, green and blue color components.

3.2 Preliminary data processing

Before starting work, the source data was prepared accordingly: converted from PPM to JPEG format, with which the Caffe library can operate; randomly divided into training and test sets in an 80:20 ratio; and scaled. The classification algorithm uses images of two sizes, 45×45 (for training the multilayer perceptron on binarized data) and 60×60 (for training the other networks), so instances of both sizes were created for each image of the training and test sets. The preliminary transformations (binarization, histogram equalization, contrast enhancement) were then applied to each image, and the resulting images were saved in an LMDB database (Lightning Memory-Mapped Database), a fast and efficient key-value store; this storage method gives the Caffe library the fastest and most convenient access. The Python Imaging Library (PIL) and scikit-image were used to transform the images. Examples of the images obtained after each transformation are shown in Figure 11. The images stored in the database were later used to train the neural networks directly.
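A simplified sketch of this preparation step (Pillow/PIL for the format conversion and scaling, a seeded random shuffle for the 80:20 split; the binarization, equalization and LMDB writing are omitted, and the file names are illustrative):

```python
import random
from PIL import Image  # Pillow / Python Imaging Library, as named in the text

def prepare(ppm_path, jpeg_path, size):
    """Convert one PPM image to JPEG (which Caffe can read) and scale it."""
    img = Image.open(ppm_path).convert("RGB")
    img.resize((size, size)).save(jpeg_path, "JPEG")

def split_80_20(paths, seed=0):
    """Randomly split the image list into training and test sets, 80:20."""
    shuffled = paths[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train, test = split_80_20([f"img_{i}.ppm" for i in range(100)])
```

Fixing the shuffle seed makes the split reproducible, so the same train/test partition can be reused for every network in the ensemble.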

Figure 11. Results of the application of conversion to the image

As for the training of the neural networks, each network was trained separately and the results of its work were evaluated, after which the final classifier was built and trained. Before that, however, the simplest network was built: a perceptron with one hidden layer. This network served two purposes: learning to work with the Caffe library on a simple example, and forming a baseline against which the results of the other networks could be assessed more objectively. The next section therefore considers each of the network models and its results in more detail.

3.3 Results of individual models

The models implemented during this study include:

· Neural network with one hidden layer

· Multilayer neural network based on the source data

· Multilayer neural network based on binarized data

· Convolutional network based on the source data

· Convolutional network based on RGB data after histogram equalization

· Convolutional network based on grayscale data after histogram equalization

· Convolutional network based on grayscale data after contrast enhancement

· Combined model consisting of two multilayer neural networks and 4 convolutional networks

Consider each of them in more detail.

The neural network with one hidden layer, although it does not qualify as a deep learning model, turned out to be very useful to implement: first, as training material for working with the library, and second, as a baseline algorithm against which the other models can be compared. Its undoubted advantages are the ease of construction and the high speed of training.

This model was built for the original color images of size 45×45 pixels, with 500 neurons in the hidden layer. Training took about 30 minutes, and the resulting prediction accuracy was 59.7%.

The second model is a multilayer fully connected neural network. It was built for the binarized and color versions of the smaller images and contained 4 hidden layers. The network configuration is as follows:

· Input layer

· Layer 1, 1500 neurons

· Layer 2, 500 neurons

· Layer 3, 300 neurons

· Layer 4, 150 neurons

· Output layer

Schematically, the model of this network is shown in Figure 12.

Figure 12. Multilayer perceptron scheme

The overall accuracy of the model was 66.1% for binarized images and 81.5% for color images. The binarized model is nonetheless justified despite its lower accuracy: there were images for which only the binarized model determined the correct class. In addition, the color model required significantly more training time: about 5 hours compared with 1.5 hours for the binarized version.

The remaining models are all based on convolutional networks, since such networks have shown the greatest effectiveness in image recognition tasks. Their architecture is based on the well-known LeNet network; however, since the images in question are significantly smaller, the network was modified. A brief description of the architecture:

· 3 convolutional layers with kernel sizes 9, 3 and 3, respectively

· 3 subsampling layers

· 3 fully connected layers with 100, 100 and 43 neurons
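Assuming 'valid' convolutions with stride 1 and 2×2 subsampling after each convolutional layer (the text does not state these details, so they are assumptions), the spatial size of the feature maps for a 60×60 input can be traced as follows:

```python
def feature_map_sizes(input_size=60, kernels=(9, 3, 3), pool=2):
    """Trace spatial size through conv ('valid', stride 1) + pooling pairs."""
    sizes = [input_size]
    s = input_size
    for k in kernels:
        s = s - k + 1   # valid convolution shrinks by kernel - 1
        s = s // pool   # 2x2 subsampling halves each dimension
        sizes.append(s)
    return sizes

sizes = feature_map_sizes()  # [60, 26, 12, 5]
```

Under these assumptions the last subsampling layer produces 5×5 maps, which would then feed the fully connected layers of 100, 100 and 43 neurons.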

This network was trained separately on the original larger images, on images after histogram equalization (with color preserved), on black-and-white images after histogram equalization, and finally on black-and-white images with enhanced contrast. The training results are presented in Table 1:

Table 1. Results of the convolutional networks

The best results were shown by the network based on black-and-white images after histogram equalization. This can be explained by the fact that equalization improves image quality, for example the contrast between the sign and the background and the overall brightness, while eliminating the redundant information contained in color. Color carries little semantic load here (a person easily recognizes the same signs in a black-and-white version) but inflates the image and complicates classification.

The combined model was built and evaluated according to the following scheme:

1. Train each network on the training set using the backpropagation method (the set is the same for all networks, but different transformations are applied to the images).

2. For each instance of the training set, obtain the two most likely classes, in descending order of likelihood, from each network; save the resulting set of values (12 in total) together with the actual class label.

3. Use the resulting data set of 12 attributes plus the class label as the training set for the final classifier.

4. Assess the accuracy of the obtained model: for each instance of the test set, obtain the two most likely classes from each network and the final class prediction based on this data.
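The stacking step above can be sketched as follows, with scikit-learn's DecisionTreeClassifier standing in for WEKA's J48 (scikit-learn implements CART rather than C4.5, but the stacking logic is the same) and randomly generated data standing in for the real network predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in for WEKA's J48

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 12 attributes per image (the top-2 predicted
# classes from each of the 6 networks) and the true class label y.
y = rng.integers(0, 43, size=1000)
X = rng.integers(0, 43, size=(1000, 12))
X[:, 0] = y  # pretend the first network's top prediction is always right

# Train the final classifier on 80% of the meta-dataset, test on the rest.
tree = DecisionTreeClassifier(random_state=0).fit(X[:800], y[:800])
acc = tree.score(X[800:], y[800:])
```

Because one feature perfectly determines the label here, the tree learns to rely on it; on the real data the tree instead learns which networks are trustworthy for which sign types.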

Based on these steps, the overall accuracy of the combined algorithm was calculated: 93% when using J48 and 94.8% when using KStar. The decision tree algorithm thus shows slightly worse results, but it has two important advantages. First, the resulting tree exposes the classification logic and gives a better understanding of the real data structure (for example, which of the networks gives the most accurate predictions for a specific type of sign, so that its prediction uniquely determines the result). Second, once the model is built, this algorithm classifies new instances very quickly, since it takes only one pass down the tree. The KStar algorithm, by contrast, does not actually build a model; classification is based on searching for the most similar instances in the training sample. It therefore provides no additional information about the instances it classifies, and, more importantly, classifying each instance can take significant time, which may be unacceptable for tasks requiring a very fast result, such as road sign recognition in automated driving.

Table 2 presents a general comparison of the results of all the considered algorithms.

Table 2. Comparison of the results of the work of the algorithms

Figure 13 presents a training plot for an example network, the convolutional network for grayscale data with histogram equalization (number of iterations along the horizontal axis, accuracy along the vertical axis).

Figure 13. Convolutional network training plot

To summarize the study, it is also useful to examine the classification results and identify which signs are the easiest to classify and which, on the contrary, are recognized with difficulty. Consider the output of the J48 algorithm and the resulting contingency table (see Appendix 3). For some signs the classification accuracy is 100% or very close to it: for example, the signs "Stop" (class 14), "Yield" (class 13), "Priority road" (class 12), "End of all restrictions" (class 32) and "No vehicles" (class 15) (Figure 12). Most of these signs have a characteristic shape ("Priority road") or special graphic elements with no analogues on other signs ("End of all restrictions").

Figure 12. Examples of easily recognizable road signs

Other signs are often confused with one another, for example "detour on the left" and "detour on the right", or different speed limit signs (Figure 13).

Figure 13. Examples of frequently confused signs

A striking pattern is that the neural networks often confuse symmetrical signs. This especially concerns convolutional networks, which look for local features in the image rather than analyzing it as a whole; for classifying such images, multilayer perceptrons are better suited.

Summing up, convolutional neural networks and a combined algorithm based on them made it possible to obtain good results in classifying road signs: the accuracy of the resulting classifier is almost 95%, which is sufficient for practical use. Moreover, the proposed approach of using an additional classifier to combine the results of the neural networks leaves many opportunities for further improvement.

Conclusion

In this paper, the task of image recognition using artificial neural networks was studied in detail. The most relevant approaches to image recognition were reviewed, including those using deep neural networks, and an original image recognition algorithm was developed for the task of road sign recognition using deep networks. Based on the results, we can say that all the tasks set at the beginning of the work were completed:

An analytical review of the literature on the use of artificial neural networks for image recognition was carried out. The review found that approaches based on deep convolutional networks are the most effective and widespread.

An image recognition algorithm was developed for the road sign recognition task, using an ensemble of neural networks consisting of two multilayer perceptrons and 4 deep convolutional networks, with two variants of an additional classifier (J48 and KStar) to combine the results of the individual networks and form the final prediction.

A prototype system for image recognition on the example of road signs was developed based on the algorithm from paragraph 3. It provides a web interface that allows the user to upload an image and, using pre-trained models, classifies the image and displays the classification result to the user.

The algorithm developed in paragraph 3 was trained on the GTSRB dataset, and the results of each of its constituent networks, as well as the overall accuracy of the algorithm with each of the two additional classifiers, were evaluated separately. The experiments showed that the highest recognition accuracy, 94.8%, is achieved using the ensemble of neural networks with the KStar classifier, while among the individual networks the best result, 89.1% accuracy, was shown by the convolutional network that uses a preliminary conversion of the image to grayscale followed by histogram equalization.

Overall, this study confirmed that deep artificial neural networks, especially convolutional networks, are currently the most effective and promising approach to image classification, as is confirmed by the results of numerous studies and image recognition competitions.

List of used literature

1. Al-Azawi M. A. N. Neural Network Based Automatic Traffic Signs Recognition // International Journal of Digital Information and Wireless Communications (IJDIWC). - 2011. - Vol. 1. - No. 4. - P. 753-766.

2. Baldi P. Autoencoders, Unsupervised Learning, and Deep Architectures // ICML Unsupervised and Transfer Learning. - 2012. - Vol. 27. - P. 37-50.

3. Bahlmann C. et al. A System for Traffic Sign Detection, Tracking, and Recognition Using Color, Shape, and Motion Information // Intelligent Vehicles Symposium, 2005. Proceedings. IEEE. - IEEE, 2005. - P. 255-260.

4. Bastien F. et al. Theano: New Features and Speed Improvements // arXiv preprint arXiv:1211.5590. - 2012.

5. Bengio Y., Goodfellow I., Courville A. Deep Learning. - MIT Press, book in preparation.

6. Bergstra J. et al. Theano: A CPU and GPU Math Compiler in Python // Proc. 9th Python in Science Conf. - 2010. - P. 1-7.

7. Broggi A. et al. Real Time Road Signs Recognition // Intelligent Vehicles Symposium, 2007 IEEE. - IEEE, 2007. - P. 981-986.

8. Canny J. A Computational Approach to Edge Detection // Pattern Analysis and Machine Intelligence, IEEE Transactions on. - 1986. - No. 6. - P. 679-698.

9. Ciresan D., Meier U., Schmidhuber J. Multi-Column Deep Neural Networks for Image Classification // Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. - IEEE, 2012. - P. 3642-3649.

10. Ciresan D. et al. A Committee of Neural Networks for Traffic Sign Classification // Neural Networks (IJCNN), The 2011 International Joint Conference on. - IEEE, 2011. - P. 1918-1921.

11. Cireşan D. C. et al. Deep Big Multilayer Perceptrons for Digit Recognition // Neural Networks: Tricks of the Trade. - Springer Berlin Heidelberg, 2012. - P. 581-598.

12. Daugman J. G. Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression // Acoustics, Speech and Signal Processing, IEEE Transactions on. - 1988. - Vol. 36. - No. 7. - P. 1169-1179.

13. Gao X. W. et al. Recognition of Traffic Signs Based on Their Colour and Shape Features Extracted Using Human Vision Models // Journal of Visual Communication and Image Representation. - 2006. - Vol. 17. - No. 4. - P. 675-685.

14. Goodfellow I. J. et al. Pylearn2: A Machine Learning Research Library // arXiv preprint arXiv:1308.4214. - 2013.

15. Han J., Kamber M., Pei J. Data Mining: Concepts and Techniques. - Morgan Kaufmann, 2006.

16. Harris C., Stephens M. A Combined Corner and Edge Detector // Alvey Vision Conference. - 1988. - Vol. 15. - P. 50.

17. Houben S. et al. Detection of Traffic Signs in Real-World Images: The German Traffic Sign Detection Benchmark // Neural Networks (IJCNN), The 2013 International Joint Conference on. - IEEE, 2013. - P. 1-8.

18. Huang F. J., LeCun Y. Large-Scale Learning with SVM and Convolutional Nets for Generic Object Recognition // 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. - 2006.

19. Huttenlocher D. P., Ullman S. Object Recognition Using Alignment // Proc. ICCV. - 1987. - Vol. 87. - P. 102-111.

20. Jia Y. Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding // caffe.berkeleyvision.org. - 2013.

21. Krizhevsky A., Sutskever I., Hinton G. E. ImageNet Classification with Deep Convolutional Neural Networks // Advances in Neural Information Processing Systems. - 2012. - P. 1097-1105.

22. Lafuente-Arroyo S. et al. Traffic Sign Classification Invariant to Rotations Using Support Vector Machines // Proceedings of Advanced Concepts for Intelligent Vision Systems, Brussels, Belgium. - 2004.

23. LeCun Y., Bengio Y. Convolutional Networks for Images, Speech, and Time Series // The Handbook of Brain Theory and Neural Networks. - 1995. - Vol. 3361. - P. 310.

24. LeCun Y. et al. Learning Algorithms for Classification: A Comparison on Handwritten Digit Recognition // Neural Networks: The Statistical Mechanics Perspective. - 1995. - Vol. 261. - P. 276.

25. Masci J. et al. Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction // Artificial Neural Networks and Machine Learning - ICANN 2011. - Springer Berlin Heidelberg, 2011. - P. 52-59.

26. Matan O. et al. Handwritten Character Recognition Using Neural Network Architectures // Proceedings of the 4th USPS Advanced Technology Conference. - 1990. - P. 1003-1011.

27. McCulloch W. S., Pitts W. A Logical Calculus of the Ideas Immanent in Nervous Activity // The Bulletin of Mathematical Biophysics. - 1943. - Vol. 5. - No. 4. - P. 115-133.

28. Minsky M., Papert S. Perceptrons. - 1969.

29. Mitchell T. Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression. - 2005. - Manuscript.

Neural networks as a tool for solving hard-to-formalize tasks have already been mentioned quite a lot. And here on Habr it was shown how these networks are used for image recognition, applied to the problem of breaking CAPTCHAs. However, there are quite a few types of neural networks. Is the classic fully connected neural network (FCNN) actually well suited to the task of image recognition (classification)?

1. Task

So, we are going to solve an image recognition task. It can be recognition of faces, objects, characters, etc. I suggest starting with the task of recognizing handwritten digits. This task is good for a number of reasons:

    To recognize a handwritten character, it is rather difficult to compile a formalized (non-intelligent) algorithm, which becomes clear as soon as you look at one and the same digit written by different people

    The task is quite relevant and related to OCR (Optical Character Recognition)

    There is a freely distributed database of handwritten symbols available for download and experiments.

    There are quite a few articles on this topic, so it is very easy and convenient to compare different approaches

As input data, it is proposed to use the MNIST database. This base contains 60,000 training pairs (image - label) and 10,000 test images. The images are normalized in size and centered. The size of each digit is no more than 20x20 pixels, but the digits are inscribed in a square of size 28x28. An example of the first 12 digits from the MNIST training set is shown in the figure:
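MNIST is distributed in its own IDX binary format; a small reader written under stated assumptions (big-endian header with magic number 2051 followed by unsigned bytes; the file name in the comment is the standard distribution name) might look like this:

```python
import gzip
import struct
import numpy as np

def read_idx_images(path):
    """Parse an MNIST IDX image file (optionally gzip-compressed)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        # 16-byte big-endian header: magic, image count, rows, cols.
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051, "not an IDX image file"
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(n, rows, cols)

# imgs = read_idx_images("train-images-idx3-ubyte.gz")  # shape (60000, 28, 28)
```

The label files use the same layout with magic 2049 and a single label byte per image.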

Thus, the task is formulated as follows: create and train a neural network to recognize handwritten digits, taking their images as input and activating one of 10 outputs. By activation we mean the value 1 at the output; the values of the remaining outputs should (ideally) be -1. Why a 0-to-1 scale is not used here, I will explain later.
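The target encoding described here, +1 on the true class and -1 everywhere else, can be written directly:

```python
import numpy as np

def encode_target(digit, n_classes=10):
    """Desired output vector: +1 on the true class, -1 everywhere else."""
    target = -np.ones(n_classes)
    target[digit] = 1.0
    return target

target = encode_target(3)  # [-1, -1, -1, 1, -1, -1, -1, -1, -1, -1]
```

At recognition time the predicted digit is simply the index of the output with the largest value.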

2. "Ordinary" neural networks.

Most people understand "ordinary" or "classical" neural networks to mean multilayer feedforward networks trained with error backpropagation:

As follows from the name, in such a network each neuron is connected to each neuron of the next layer, the signal travels only in the direction from the input layer to the output layer, and there are no recursions. For short, we will call such a network a fully connected neural network (FCNN).

First you need to decide how to feed the input data. The simplest and almost uncontested solution for an FCNN is to unfold the two-dimensional image matrix into a one-dimensional vector. So for a handwritten digit image of size 28x28 we will have 784 inputs, which is already quite a lot. Then comes the step for which many conservative scientists dislike neural network researchers and their methods: the choice of architecture. They dislike it because the choice of architecture is pure shamanism. To this day there are no methods that allow the structure and composition of a neural network to be unambiguously determined from the task description. In its defense I will say that for hard-to-formalize tasks such a method is unlikely ever to be created. In addition, there are many network reduction methods (for example OBD), as well as various heuristics and empirical rules. One such rule states that the number of neurons in the hidden layer should be at least an order of magnitude greater than the number of inputs. If we take into account that the transformation from image to class indicator is itself quite complex and substantially nonlinear, one layer will not do here. Based on the above, we roughly estimate the number of neurons in the hidden layers at about 15,000 (10,000 in the 2nd layer and 5,000 in the 3rd). In this configuration with two hidden layers, the number of adjustable, trainable connections will be about 10 million between the inputs and the first hidden layer, plus 50 million between the first and second, plus 50 thousand between the second and the output layer, given that we have 10 outputs, each indicating a digit from 0 to 9. In total, roughly 60,000,000 connections. It was not in vain that I mentioned they are adjustable: during training, an error gradient will have to be computed for each of them.
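The connection arithmetic above is easy to verify (weights between layers only, biases ignored):

```python
# Inputs (28x28 pixels), two hidden layers, 10 outputs; count the weights
# in each fully connected layer-to-layer block and sum them.
layers = [28 * 28, 10_000, 5_000, 10]
connections = sum(a * b for a, b in zip(layers, layers[1:]))
print(connections)  # 57890000
```

That is roughly 58 million connections, in line with the rounded figure of 60 million, and each one needs its error gradient computed on every training pass.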

What can you do; the beauty of artificial intelligence demands sacrifices. But if you think about it, it occurs to you that when we unfold the image into a linear chain of bytes, we irretrievably lose something, and with each layer this loss only worsens. So it is: we lose the topology of the image, i.e. the relationships between its individual parts. In addition, the recognition task implies that the neural network should be resistant to small shifts, rotations and scale changes of the image, i.e. it should extract certain invariants of a person's handwriting style or whatever else. So what should a neural network be like in order to be not very computationally complex and, at the same time, more invariant to various image distortions?

3. Convolutional neural networks

The solution to this problem was found by the American scientist of French origin Yann LeCun, inspired by the work of Nobel laureates in medicine Torsten Nils Wiesel and David H. Hubel. These scientists studied the visual cortex of the cat's brain and found that there are so-called simple cells, which react particularly strongly to straight lines at different angles, and complex cells, which react to the movement of lines in one direction. Yann LeCun proposed using so-called convolutional neural networks.

6. Results

Attached to the program on MATLAB Central are an already trained neural network and a GUI demonstrating the results of its work. Below are examples of recognition:



The link leads to a table comparing recognition methods on MNIST. First place belongs to convolutional neural networks with a recognition error rate of 0.39%. Most of these erroneously recognized images would not be recognized correctly by every person. In addition, elastic distortions of the input images were used, as well as preliminary unsupervised pre-training. But about these methods, perhaps, in another article.

Links.

  1. Yann LeCun, J. S. Denker, S. Solla, R. E. Howard and L. D. Jackel: Optimal Brain Damage, in Touretzky, David (Eds), Advances in Neural Information Processing Systems 2 (NIPS*89), Morgan Kaufman, Denver, CO, 1990
  2. Y. LeCun and Y. Bengio: Convolutional Networks for Images, Speech, and Time-Series, in Arbib, M. A. (Eds), The Handbook of Brain Theory and Neural Networks, MIT Press, 1995
  3. Y. LeCun, L. Bottou, G. Orr and K. Muller: Efficient BackProp, in Orr, G. and Muller K. (Eds), Neural Networks: Tricks of the Trade, Springer, 1998
  4. Ranzato Marc'Aurelio, Christopher Poultney, Sumit Chopra and Yann LeCun: Efficient Learning of Sparse Representations with an Energy-Based Model, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems (NIPS 2006), MIT Press, 2006.