Learning to rank. Metrics in machine learning tasks. Training an algorithm and building a confusion matrix


In recent years much attention has been paid to image reconstruction (inpainting), so quality assessment is an important task for comparing different image restoration methods. In many cases, reconstruction methods blur textures and structures when restoring large areas of distorted pixel values. An objective quantitative assessment of restoration results is currently lacking, which is why many approaches rely on expert evaluation. This article discusses a new machine learning approach to assessing the quality of restored images that uses a model of the human visual system: local image regions are represented by descriptors as parametric distributions. Support vector regression is then used to predict the perceived quality of restored images in accordance with expert assessments. The paper demonstrates that the quality scores obtained with this approach correlate with subjective quality assessment.

Keywords: machine learning, visual quality, reconstruction, image processing

1. Gastaldo P. Machine learning solutions for objective visual quality assessment // 6th International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM). - Vol. 12. - 2012.

2. Bertalmio M., Bertozzi A., Sapiro G. Navier-Stokes, fluid dynamics, and image and video inpainting // Proc. IEEE Computer Vision and Pattern Recognition (CVPR). - Hawaii, 2001. - PP. 213-226.

3. Criminisi A., Perez P., Toyama K. Region filling and object removal by exemplar-based image inpainting // IEEE Trans. Image Process. - 13 (9). - 2004. - PP. 28-34.

4. Vijay M., Cheung S.S. Eye tracking based perceptual image inpainting quality analysis // Image Processing (ICIP), 17th IEEE International Conference on. - 2010. - PP. 1109-1112.

5. Ardis P.A., Singhal A. Visual salience metrics for image inpainting // SPIE Electronic Imaging. International Society for Optics and Photonics. - 2009.

6. Cheung S.S., Zhao J., Venkatesh V. Efficient object-based video inpainting // Image Processing, 2006 IEEE International Conference on. - 2006. - PP. 705-708.

7. Peretyagin G.I. Representation of images by Gaussian random fields // Avtometriya. - No. 6. - 1984. - P. 42-48.

8. Frantc V.A., Voronin V.V., Marchuk V.I., Sherstobitov A.I., Agaian S., Egiazarian K. Machine learning approach for objective inpainting quality assessment // Proc. SPIE 9120, Mobile Multimedia/Image Processing, Security, and Applications. - Vol. 91200S. - 2014.

9. Paul A., Singhal A., Brown C. Inpainting quality assessment // Journal of Electronic Imaging. - Vol. 19. - 2010. - PP. 011002-011002.

An objective image quality metric is an important part of image processing systems. One important application of objective quality metrics is evaluating the effectiveness of image processing algorithms and systems. Despite the large number of publications on this topic, the task of assessing the quality of reconstructed images is considered in only a few of them. At the same time, the task of restoring lost image areas has attracted considerable attention recently.

There are two approaches to image quality assessment: quantitative assessment using mathematical methods (RMS error, Lp-norm, measures that account for the perceptual properties of the human visual system) and subjective assessment based on expert evaluations.

A quality score obtained with existing approaches may differ significantly from a score given by experts. Most existing quality assessment approaches require a reference image, but in many cases a reference image is not available. The reconstruction of lost pixels is one such case. Thus, developing a quantitative metric for assessing the quality of reconstructed images is a relevant task.

Considerable progress has been made in developing quantitative image quality measures, but the proposed criteria are far from perfect. Most attempts to find acceptable quality measures address special cases: a measure is proposed based on some physiological assumptions, or more often simply because it is convenient for analysis and computation, and only then are its properties evaluated. Creating better image quality measures requires a deeper study of the properties of the human visual system.

The purpose of this work is to develop a machine-learning-based metric for assessing the quality of images produced by reconstruction methods.

Mathematical model

The article uses notation similar to that adopted in the referenced work. The entire image consists of two non-overlapping regions: the reconstructed region and the known region. Figure 1 shows an example of the location of these regions.

Fig 1. Image model

Given an image and a region Ω inside it, the reconstruction task consists in modifying the pixel values within Ω so that the region does not stand out against the surrounding image. The purpose of reconstruction may be to restore damaged parts of an image (for example, scratches and cracks in old photos) or to remove unwanted objects from it. The region Ω shown in Figure 1 is always specified by the user, i.e. determining Ω is not part of the reconstruction problem.

Algorithm for assessing image recovery

In general, successfully constructing an image quality metric based on machine learning requires solving the following three tasks:

1. Defining the feature space that serves as a description of the input signals.

2. Choosing a mapping from the feature space to the space of quality scores.

3. Training the system and checking its stability (checking for overfitting, etc.).

The structural diagram of the chosen approach is shown in Figure 2 and contains the following steps:

1. Selecting a region of interest (using a saliency map);

2. Computing low-level image features;

3. Constructing a descriptor of the restored region from the low-level features;

4. Solving a regression problem to obtain a numerical quality score from the resulting descriptor vector.

Fig. 2. Block diagram of the algorithm

Prior work shows that visual attention plays an important role in human visual perception. At any moment the human eye clearly sees only a small part of the scene, while a much larger area of the scene is perceived as "blurred". This "blurred" information is enough to assess the importance of different areas of the scene and to draw attention to important areas of the visual field. Most methods produce an attention (saliency) map - a two-dimensional image in which the value of each pixel reflects the importance of the corresponding area.

The Saliency Toolbox described in the referenced work is used to obtain attention maps. This toolkit uses a model of the human visual system. It is important to note that it makes no sense to compare the restored region in the source and restored images directly, since the overall content can change significantly. To select regions of interest, the saliency density is compared inside and outside the restored region of the reconstructed image.

Here the attention map of the reconstructed image assigns a saliency value to each pixel. The resulting density value is used as a threshold when deciding which parts of the image are included in the assessment and which are not: only regions whose saliency exceeds the threshold are taken into account.

Spectral representations are used as low-level features of local regions. The Fourier, Walsh and Haar bases are analyzed below using an efficiency vector. Correctly computing the components of the system efficiency criterion in the presence of noise and distortion requires statistical averaging.

When synthesizing signal processing algorithms and systems, the minimum average risk criterion is most often used, since it allows the statistics of signals and noise to be taken into account. When implementing frequency transforms and estimating computational costs, the choice of the spectral decomposition basis is essential. To optimize the choice of the signal decomposition basis, it is advisable to use the minimum average risk criterion. This requires that the class of signals and processes be specified and that their probabilistic characteristics be known.

For a given class of two-dimensional processes, each subclass is assumed to occur with a certain probability; one index denotes a subclass of processes with common properties and the other denotes a particular process within the subclass. A set of basis systems for decomposition into a generalized Fourier series is compared.

Truncating the Fourier series to a finite number of terms introduces an error, characterized by the distance (in some metric) between the signal and the partial sum of the series.

Obtaining the Fourier coefficients in hardware, or computing them, involves certain computational costs. We therefore introduce a loss function that accounts both for the truncation error of the Fourier series and for the cost of hardware and computing resources.

The conditional risk depends both on the signal subclass and on the chosen basis. It is computed by statistically averaging the loss function over the realizations, where the averaging (denoted by angle brackets) is taken with respect to the probability density of the analyzed signals and noise.

The average risk is obtained by averaging the conditional risk over the signal subclasses, with weights equal to the a priori probabilities of the subclasses.

In accordance with the minimum average risk criterion, the basis chosen is the one that minimizes the average risk.
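
The display formulas for this criterion did not survive extraction; a plausible reconstruction of the standard minimum-average-risk formulation, with notation chosen here purely for illustration (loss Π, truncation error ε, implementation cost C, weights α and β, subclass densities p_j and probabilities P_j), is:

\Pi_j(b, x) = \alpha\,\varepsilon_j(b, x) + \beta\,C(b)

R_j(b) = \langle \Pi_j \rangle = \int \Pi_j(b, x)\, p_j(x)\, dx

\bar{R}(b) = \sum_j P_j\, R_j(b), \qquad b^{*} = \arg\min_b \bar{R}(b)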

To evaluate the effectiveness of the system image processing criterion, test images are generated as textures obtained by modeling Gaussian fields with specified correlation functions. Homogeneous normal random fields, like stationary normal random processes, are most simply generated by the shaping filter method.

As an example, the article considers the representation of random realizations with various correlation functions in the trigonometric (Fourier), Walsh and Haar bases. The analysis is carried out in the selected bases for generated image models of 256×256 pixels. Three types of subclass probability distributions are considered: 1) uniform; 2) decreasing; 3) increasing, together with a chosen cost function.

The average risk is obtained by averaging the conditional risk over the signal subclasses using the adopted a priori subclass probabilities; the computed values are presented in Table 1.

Table 1

Average risk values

Types of probability distribution

The results presented in the table show that, for the adopted models of two-dimensional signals and their probability distributions, the Haar basis has the smallest average risk and the Fourier basis the largest.

Based on this analysis, the Haar basis is selected to represent local image regions. It should be noted that the size of the restored region differs from image to image, so a fixed-size high-level representation must be formed from the low-level features. The "bag of words" approach is used as this high-level representation. The procedure for constructing a descriptor (signature) of the restored region consists of two steps. In the first step a dictionary is built using the low-level features extracted from all images of the training set. To build the dictionary, the extracted features are partitioned into 100 classes with the k-means clustering algorithm. Each dictionary element is the centroid of one of the classes found by the clustering procedure, and each word in the dictionary represents the Haar transform of an 8×8 image block. In the second step, the obtained dictionary is used to build histograms of word frequencies that serve as the feature vector - the descriptor of the restored region (Fig. 3). The resulting set of descriptors is used to train a regression machine (support vector regression). To obtain the word-frequency histogram, all visually salient regions of a given image are extracted (salience is determined using the attention map), the Haar transform is applied to each extracted block, and the block is classified against the dictionary using the Euclidean distance.

Each bin of the resulting histogram contains the number of low-level features of a particular class in the given restored region. After normalizing the histogram, the image signature - the high-level representation of the restored region - is obtained.
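
Below is a minimal Python sketch of this bag-of-visual-words pipeline. It is an illustration rather than the authors' implementation: the block extraction, the 2D Haar transform and the expert scores are stand-ins, while the dictionary size (100) and block size (8×8) follow the text.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

def haar_like_features(block):
    # Stand-in for a 2D Haar transform of an 8x8 block: here we simply flatten
    # the block; a real implementation would return the Haar coefficients.
    return block.ravel()

def region_descriptor(blocks, kmeans):
    # Assign each 8x8 block to the nearest dictionary word (Euclidean distance)
    # and build a normalized word-frequency histogram.
    words = kmeans.predict(np.array([haar_like_features(b) for b in blocks]))
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy data: 500 random 8x8 "salient" blocks used to build the dictionary.
rng = np.random.default_rng(0)
train_blocks = [rng.random((8, 8)) for _ in range(500)]
features = np.array([haar_like_features(b) for b in train_blocks])
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)

# Descriptors for 20 hypothetical training images and their expert scores (1..5).
descriptors = np.array([region_descriptor([rng.random((8, 8)) for _ in range(50)], kmeans)
                        for _ in range(20)])
expert_scores = rng.uniform(1, 5, size=20)

svr = SVR(kernel="rbf").fit(descriptors, expert_scores)
print(svr.predict(descriptors[:3]))   # predicted perceived quality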

Fig.3. Building a histogram

Evaluating the effectiveness of the image restoration quality assessment algorithm

To evaluate the effectiveness of the developed metric, a set of 300 test images was used. The following restoration methods were chosen: a method based on searching for self-similar regions, a method based on spectral transforms, and a method based on partial differential equations. For each image an expert assessment was obtained with the participation of 30 people. The results were divided into two disjoint sets: the first was used for training and the second for verifying the result.

The experts rated quality on a scale where 5 corresponds to "excellent" and 1 to "very bad". To evaluate the effectiveness of the obtained metric, the correlation coefficient between the scores produced by the objective metric and the expert quality scores is used. Analysis of the results in Table 2 shows that the proposed approach outperforms the known quality metrics on the selected test data set.

Table 2

Correlation coefficients of objective image quality metrics for various computation methods

Proposed approach

Conclusion

The article presents an objective machine-learning-based metric for image quality assessment. Quantitative image quality measures are needed for designing and evaluating image reproduction systems. Such measures would largely remove the need for the labor-intensive and inaccurate current practice of assessing images by subjective examination. In addition, quantitative measures make it possible to develop methods for optimizing image processing systems. It has been demonstrated that the quality scores obtained with the proposed approach correlate with subjective quality assessment.

The work was supported by the Ministry of Education and Science of Russia in the framework of the FTP "Research and Development on Priority Directions for the Development of the Scientific and Technology Complex of Russia for 2014-2020" (Agreement No. 14.586.21.0013).

Reviewers:

Fedosov V.P., Doctor of Technical Sciences, Professor, Head of Department, Engineering and Technological Academy of the Southern Federal University, Rostov-on-Don;

Marchuk V.I., Doctor of Technical Sciences, Professor, Head of the Department of Radioelectronic and Electrical Systems and Complexes, ISOiP (branch of DSTU), Shakhty.

Bibliographic reference

Voronin V.V. Assessment of the quality of image restoration based on machine learning // Modern Problems of Science and Education. - 2014. - No. 6.
URL: http://science-education.ru/ru/Article/View?id=16294 (accessed: 02/01/2020).

While preparing an assignment for the GoTo summer school entrance, we found that there is practically no good description of the main ranking metrics in Russian (the assignment concerned a particular case of the ranking problem - building a recommendation algorithm). At E-Contenta we actively use various ranking metrics, so we decided to fix this by writing this article.

The ranking task now comes up everywhere: sorting web pages for a given search query, personalizing a news feed, recommending videos, goods, music... In a word, it is a hot topic. There is even a dedicated direction in machine learning that studies ranking algorithms capable of self-learning - learning to rank. To choose the best from the whole variety of algorithms and approaches, one must be able to evaluate their quality quantitatively. The most common ranking quality metrics are discussed below.

Briefly about the ranking task

Ranking is the task of sorting a set of elements by their relevance. Most often relevance is understood with respect to some object. For example, in information retrieval the object is a query, the elements are all kinds of documents (links to them), and relevance is how well a document matches the query; in recommendations the object is a user, the elements are pieces of recommended content (goods, videos, music), and relevance is the probability that the user will use (buy / like / view) this content.

Formally, consider N objects and M elements. The result of a ranking algorithm for an object is a mapping that assigns each element a weight characterizing its degree of relevance to the object (the greater the weight, the more relevant the element). The set of weights induces a permutation of the elements (we assume the set of elements is ordered), obtained by sorting them in decreasing order of weight.

To evaluate ranking quality, some "gold standard" is needed against which the algorithm's results can be compared. Consider a reference relevance function that characterizes the "true" relevance of the elements to the given object (1 means the element is ideally relevant, 0 means it is completely irrelevant), as well as the corresponding permutation obtained by sorting by decreasing reference relevance.

There are two main ways to obtain the reference relevance:
1. From historical data. For example, in content recommendations one can take the user's views (likes, purchases), assign weight 1 to the corresponding (viewed) elements and 0 to everything else.
2. From expert assessment. For example, in the search task one can involve a team of assessors who manually rate the relevance of documents to a query.

It should be noted that when the reference relevance takes only the extreme values 0 and 1, the reference permutation is usually not considered, and only the set of relevant elements (those with relevance 1) is taken into account.

The purpose of a ranking quality metric is to determine how well the relevance scores produced by the algorithm and the corresponding permutation agree with the true relevance values. Let us consider the main metrics.

Mean Average Precision

Mean average precision at K (MAP@K) is one of the most frequently used ranking metrics. To understand how it works, let's start with the basics.

Note: the "*precision" metrics are used in binary tasks, where the reference relevance takes only two values: 0 and 1.

Precision at K

Precision at K (p@K) - precision on the top K elements - is the basic ranking metric for a single object. Suppose our ranking algorithm produced relevance scores for every element. Taking the K elements with the largest scores, we can compute the share of relevant ones among them. This is exactly what precision at K does: p@K = (number of relevant elements among the top K) / K.

Note: by the "k-th element" we mean the element that ended up at position k after the permutation. Thus the first element is the one with the largest score, the second is the one with the second largest score, and so on.

Average precision at K

Precision at K is simple to understand and implement, but it has an important drawback: it does not take into account the order of elements within the top. So if we guessed only one of ten elements, it does not matter whether it was in first place or in last - p@10 is the same, although the first case is obviously much better.

This shortcoming is addressed by the ranking metric average precision at K (ap@K), which equals the sum of p@k over indices k from 1 to K, taken only at the positions of relevant elements, divided by K: ap@K = (1/K) · Σ_{k=1..K} p@k · rel(k), where rel(k) = 1 if the element at position k is relevant and 0 otherwise.

So if out of three elements only the one in last place turned out to be relevant, then ap@3 = (1/3)/3 ≈ 0.11; if only the first element was guessed, then ap@3 = 1/3 ≈ 0.33; and if all three are relevant, ap@3 = 1.

Now MAP@K is within our grasp.

Mean average precision at K

Mean average precision at K (MAP@K) is one of the most commonly used ranking metrics. In p@K and ap@K, ranking quality is evaluated for a single object (user, search query). In practice there are many objects: we deal with hundreds of thousands of users, millions of search queries, and so on. The idea of MAP@K is to compute ap@K for each object and average: MAP@K = (1/N) · Σ over objects of ap@K.

Note: this idea is quite logical under the assumption that all users are equally needed and equally important. If that is not the case, then instead of simple averaging one can use weighted averaging, weighting the ap@K of each object by its corresponding "importance".
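
A small Python sketch of these metrics, assuming binary relevance labels already ordered by the algorithm's predicted scores (the function and variable names are ours, chosen for illustration):

import numpy as np

def precision_at_k(relevance, k):
    # relevance: binary labels of the elements, sorted by predicted score
    return np.mean(relevance[:k])

def average_precision_at_k(relevance, k):
    # sum p@i only at positions of relevant elements, divided by k
    rel = np.asarray(relevance[:k])
    precisions = [precision_at_k(rel, i + 1) for i in range(len(rel)) if rel[i]]
    return np.sum(precisions) / k

def mean_average_precision_at_k(relevances, k):
    # average ap@k over all objects (users, queries)
    return np.mean([average_precision_at_k(r, k) for r in relevances])

ranked = [0, 0, 1]                             # only the last of three elements is relevant
print(average_precision_at_k(ranked, 3))       # ~0.11, as in the example above
print(mean_average_precision_at_k([[1, 0, 1], [0, 0, 1]], 3))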

Normalized Discounted Cumulative Gain

Normalized discounted cumulative gain (nDCG) is another common ranking quality metric. As with MAP@K, let's start with the basics.

Cumulative gain at K

Consider again a single object and the K elements with the largest predicted scores. Cumulative gain at K (CG@K) is a basic ranking metric based on a simple idea: the more relevant the elements in the top, the better: CG@K = Σ_{k=1..K} r(k), where r(k) is the reference relevance of the element at position k.

This metric has obvious disadvantages: it is not normalized and does not take into account the positions of relevant elements.

Note that, unlike p@K, CG@K can also be used when the reference relevance values are non-binary.

Discounted cumulative gain at K

Discounted cumulative gain at K (DCG@K) is a modification of cumulative gain at K that accounts for the order of elements in the list by multiplying the relevance of an element by a weight equal to the inverse logarithm of its position number: DCG@K = Σ_{k=1..K} g(r(k)) / log2(k + 1), where a common choice of gain is g(r) = 2^r - 1.

Note: if the reference relevance takes only the values 0 and 1, then 2^r - 1 = r, and the formula takes a simpler form: DCG@K = Σ_{k=1..K} r(k) / log2(k + 1).

Using the logarithm as a discount function can be explained by the following intuitive consideration: from the ranking point of view, positions at the beginning of the list differ much more strongly than positions at its end. In a search engine there is a whole abyss between positions 1 and 11 (only in a few cases out of a hundred does the user go beyond the first page of search results), while there is no particular difference between positions 101 and 111 - few people ever reach them. These subjective considerations are nicely expressed by the logarithm: the discount changes quickly at the beginning of the list and slowly at its end.

Discounted cumulative gain solves the problem of accounting for the positions of relevant elements, but only aggravates the lack of normalization: while the reference relevance varies within a fixed range, DCG@K takes values on a scale that is hard to interpret. The following metric is designed to solve this problem.

Normalized discounted cumulative gain at K

As one can guess from the name, normalized discounted cumulative gain at K (nDCG@K) is nothing but a normalized version of DCG@K: nDCG@K = DCG@K / IDCG@K,

where IDCG@K is the maximum (I for "ideal") value of DCG@K, obtained from the ideal ordering. Since we agreed that the reference relevance is non-negative, nDCG@K takes values in [0, 1].

Thus nDCG@K inherits from DCG@K the accounting for element positions in the list, while taking values in the range from 0 to 1.

Note: by analogy with MAP@K, one can compute nDCG@K averaged over all objects.
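
A sketch in Python, assuming graded relevance labels ordered by predicted score and the 2^r - 1 gain mentioned above (other gain choices exist):

import numpy as np

def dcg_at_k(relevance, k):
    # relevance: graded labels of the elements, already sorted by predicted score
    rel = np.asarray(relevance[:k], dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(k + 1) for k = 1..K
    return np.sum((2 ** rel - 1) / discounts)

def ndcg_at_k(relevance, k):
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2], 6))   # ~0.95 for this toy ranking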

Mean Reciprocal Rank

Mean reciprocal rank (MRR) is another frequently used ranking quality metric. It is given by the formula MRR = (1/N) · Σ over objects of 1/rank_i,

where 1/rank_i is the reciprocal rank for the i-th object - very simple in essence, it is the reciprocal of the position of the first correctly guessed (relevant) element.

Mean reciprocal rank varies in the range [0, 1] and takes element positions into account. Unfortunately, it does so only for one element - the first correctly predicted one - and ignores all subsequent ones.
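
A minimal sketch (binary relevance lists per object, sorted by predicted score; names are illustrative):

import numpy as np

def mean_reciprocal_rank(relevances):
    ranks = []
    for rel in relevances:
        hits = np.flatnonzero(rel)                 # positions of relevant elements
        ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return np.mean(ranks)

print(mean_reciprocal_rank([[0, 0, 1], [1, 0, 0], [0, 1, 0]]))   # (1/3 + 1 + 1/2) / 3 ≈ 0.61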

Metrics based on rank correlation

It is worth separately highlighting ranking quality metrics based on one of the rank correlation coefficients. In statistics, a rank correlation coefficient is a correlation coefficient that takes into account not the values themselves but only their ranks (order). Let us consider the two most common rank correlation coefficients: Kendall's and Spearman's.

Kendall's rank correlation coefficient

The first is Kendall's correlation coefficient, which is based on counting concordant (and discordant) pairs in the two permutations - pairs of elements to which the permutations assigned the same (respectively, different) order: τ = (number of concordant pairs - number of discordant pairs) / (total number of pairs).

Spearman's rank coefficient

The second is Spearman's rank correlation coefficient - essentially nothing more than the Pearson correlation computed on ranks. There is a fairly convenient formula expressing it from the ranks directly:

ρ = 1 - 6 · Σ d_i² / (n (n² - 1)),

where d_i is the difference between the ranks assigned to element i by the two permutations; the result coincides with the Pearson correlation coefficient computed on the ranks.

Metrics based on rank correlation have a drawback we already know: they do not take element positions into account (even worse than p@K, since the correlation is computed over all elements rather than over the K elements with the highest rank). Therefore, in practice they are used extremely rarely.
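
For reference, both coefficients are available in scipy; a quick sketch with made-up rankings:

from scipy.stats import kendalltau, spearmanr

predicted_ranking = [1, 2, 3, 4, 5]      # ranks produced by the algorithm
reference_ranking = [2, 1, 3, 5, 4]      # "true" ranks

tau, _ = kendalltau(predicted_ranking, reference_ranking)
rho, _ = spearmanr(predicted_ranking, reference_ranking)
print(tau, rho)   # 0.6 and 0.8 for this toy example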

Metrics based on a cascade behavior model

Up to this point we have not looked into how the user (hereafter we consider the particular case where the object is a user) examines the elements offered to them. In fact, we implicitly assumed that viewing each element is independent of viewing the others - a kind of "naivety". In practice, elements are often viewed by the user one by one, and whether the user views the next element depends on their satisfaction with the previous ones. Consider an example: in response to a search query, the ranking algorithm offered the user several documents. If the documents at positions 1 and 2 turned out to be extremely relevant, the probability that the user will view the document at position 3 is small, because they are already quite satisfied with the first two.

Models of user behavior in which the offered elements are examined sequentially and the probability of viewing an element depends on the relevance of the previous ones are called cascade models.

Expected Reciprocal Rank

Expected reciprocal rank (ERR) is an example of a ranking quality metric based on a cascade model. It is given by the formula ERR = Σ_k (1/k) · P(the user stops at position k),

where rank is understood in order of decreasing predicted score. The most interesting part of this metric is the stopping probabilities. They are computed using the assumptions of the cascade model: P(stop at k) = p_k · Π_{i<k} (1 - p_i),

where p_i is the probability that the user will be satisfied with the element at rank i. These probabilities are computed from the reference relevance values. Since in our case the relevance takes values in [0, 1], we can consider the simple option p_i = r(i),

which can be read as: the probability of being satisfied by the element at position i equals its true relevance.
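
A sketch under the assumption p_i = r(i) with relevance in [0, 1] (the mapping from graded labels to satisfaction probabilities varies between formulations):

import numpy as np

def expected_reciprocal_rank(relevance):
    # relevance: satisfaction probabilities r(i) in [0, 1], sorted by predicted score
    err, not_satisfied_yet = 0.0, 1.0
    for k, p in enumerate(relevance, start=1):
        err += not_satisfied_yet * p / k      # user stops exactly at position k
        not_satisfied_yet *= (1.0 - p)
    return err

print(expected_reciprocal_rank([0.9, 0.3, 0.5]))   # ≈ 0.93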

System analysts who write functional requirements documents (FRDs) regularly run into things that are hard to formalize. Examples are requirements of the type:

  • The application must work quickly
  • The application must consume little traffic
  • The video material should be high quality.

Such requirements, written into the FRD "as is", become a monstrous source of problems later. Formalizing them is the analyst's constant headache. Usually the analyst solves the problem in two steps: first an "equivalent" formal requirement is proposed, then in the course of communication (with the customer, a domain expert, etc.) it is shown that this formal requirement can replace the original one. Strictly speaking, the requirement we obtain is not functional: it describes not "what" the system should be able to do, but "how well" it should do it, and that "how well" must be formulated with a specific measurable characteristic.

That was a preamble to the thesis that a system analyst should have a good command of the mathematical apparatus and at the same time be able to explain the "math" to the customer. Now let us consider an example.

About the task of classification

Suppose we are writing an FRD for a contextual advertising system similar to Amazon Omakase. One of the modules of our future system will be a contextual analyzer:

The analyzer takes the text of a web page as input and performs its contextual analysis. How it does this is of little interest to us; what matters is that the output is a set of product categories (from a predefined set). Based on these categories we can then show banners, product links (as Amazon does), and so on. For us the analyzer is so far a black box to which we can pose a question (the text of a document) and receive an answer.

The customer wants the analyzer to "determine the context well". We need to formulate what this requirement means. First, let's talk about the context as such, i.e. the set of categories returned by the analyzer. This can be framed as a classification task: a document (web page) is assigned a set of classes from a predefined number of them; in our case the classes are product categories. Classification is quite common in text processing (for example, spam filters).

Evaluation metrics

Let us consider evaluation metrics applicable to the classification task. Suppose we know the correct categories for some number of documents. We group the answers of our hypothetical analyzer as follows:

  • True positives - the categories we expected to see and got at the output
  • False positives - categories that should not be at the output but that the analyzer mistakenly returned
  • False negatives - the categories we expected to see but the analyzer did not detect
  • True negatives - categories that should not be at the output and are indeed absent from the analyzer's output.

Let us call the set of documents (web pages) for which we know the right answers a test sample. If we count the number of hits per category (a hit is a document-category pair), we get the canonical contingency table:

The left column of the table contains the "correct" document-category combinations (those we expect at the output), the right column the incorrect ones. The top row holds the positive answers of the classifier, the bottom row the negative ones (in our case, the absence of a category in the answer). If the total number of document-category pairs is N, it is easy to see that TP + FP + FN + TN = N.

In principle we could now write the customer's requirement as FP + FN = 0 (the number of incorrect answers is zero) and stop there. In practice, however, such systems do not exist, and the analyzer will of course make errors relative to the test sample. The accuracy metric helps us understand the percentage of errors:

In the numerator we see the diagonal of the matrix - the total number of correct answers - divided by the total number of answers: Accuracy = (TP + TN) / N. For example, an analyzer that gave 9 correct answers out of 10 possible has an accuracy of 90%.

The F1 metric

A simple example where the accuracy metric is not applicable is the task of detecting shoe brands. Suppose we want to count mentions of shoe brands in a text. Consider a classification task whose purpose is to determine whether a given entity in the text is a shoe brand (Timberland, Columbia, Ted Baker, Ralph Lauren, etc.). In other words, we split the entities in the text into two classes: A - a shoe brand, B - everything else.

Now consider a degenerate classifier that simply returns class B (everything else) for any entity. For this classifier the number of true positive answers is 0. Generally speaking, how often do we actually encounter shoe brands when reading text on the internet? Oddly enough, in the general case 99.9999% of the words in a text are not shoe brands. Let us build the answer distribution matrix for a sample of 100,000:

Its accuracy equals 99990 / 100000 = 99.99%! So we have easily built a classifier that essentially does nothing, yet has a huge share of correct answers. At the same time it is perfectly clear that we have not solved the shoe-brand detection task at all. The point is that the correct entities in our text are heavily "diluted" among other words that carry no meaning for the classification. Given this example, it is quite understandable that we want to use other metrics. For instance, the TN value is clearly "junk" - it looks like a correct answer, but as TN grows its contribution strongly "suppresses" TP (which is what matters to us) in the accuracy formula.

We define precision (P) as: P = TP / (TP + FP).

As is easy to see, precision characterizes how many of the positive answers returned by the classifier are correct. The higher the precision, the lower the number of false hits.

Precision, however, gives no idea of whether the classifier returned all the correct answers. For that there is the so-called recall (R): R = TP / (TP + FN).

Recall characterizes the ability of the classifier to "guess" as many of the expected positive answers as possible. Note that false positives do not affect this metric.

Precision and recall give a fairly comprehensive characterization of a classifier, "from different angles". Usually, when building such systems, one constantly has to balance between these two metrics. If you try to raise recall by making the classifier more "optimistic", precision drops because of the increasing number of false positives. If instead you tighten the classifier, making it more "pessimistic" - for example by filtering the results more strictly - then precision grows but recall drops because some correct answers are rejected. Therefore it is convenient to characterize the classifier by a single value, the so-called F1 metric:

F1 = 2·P·R / (P + R). In fact this is simply the harmonic mean of P and R. The F1 metric reaches its maximum of 1 (100%) when P = R = 100% (it is not hard to see that for our degenerate classifier F1 = 0). F1 is one of the most common metrics for this kind of system, and it is F1 that we will use to formulate the threshold quality of our analyzer in the FRD.

When computing F1 for a classification task there are two main approaches.

  • Total (micro-averaged) F1: the results over all classes are combined into a single table, from which the F1 metric is computed.
  • Average (macro-averaged) F1: for each class we build its own contingency matrix and compute its F1 value, then take the simple arithmetic mean over all classes.

Why is the second method needed? The sample sizes of different classes can differ greatly. For some classes we may have very few examples and for others a lot. As a result, the metrics of one "large" class, folded into one common table, will drown out all the others. When we want to assess the quality of the system more or less uniformly across all classes, the second option is better suited.
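
The same distinction in scikit-learn terms, a quick sketch with made-up labels (average="micro" pools a single table, average="macro" averages per-class F1):

from sklearn.metrics import f1_score

y_true = ["A", "B", "B", "B", "B", "B", "A", "B"]
y_pred = ["B", "B", "B", "B", "B", "B", "A", "B"]

print(f1_score(y_true, y_pred, average="micro"))   # pooled ("total") F1
print(f1_score(y_true, y_pred, average="macro"))   # per-class F1, then averaged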

Training and test samples

Above we considered classification on a single sample for which all the answers are known. Applied to the contextual analyzer we are trying to describe, things look a bit more complicated.

First of all, we must fix the product categories. A situation where we guarantee some F1 value while the set of classes can expand without limit is practically hopeless. Therefore it is additionally agreed that the category set is fixed.

We compute F1 for a given sample that is known in advance. This sample is usually called the training sample. However, we do not know how the classifier will behave on data unknown to us. For this purpose a so-called test sample, sometimes called a golden set, is used. The difference between the training and test samples is purely conventional: having some set of examples, we can split it into training and test parts however we like. But for self-learning systems, forming a proper training sample is critical: wrongly chosen examples can strongly affect the quality of the system.

A typical situation is when the classifier shows a good result on the training sample and fails completely on the test sample. If our classification algorithm is based on machine learning (i.e. depends on the training sample), we can evaluate its quality with a more complex "rolling" scheme. We divide all available examples into, say, 10 parts. We take the first part and use it to train the algorithm; the remaining 90% of examples are used as a test sample and F1 is computed. Then we take the second part as the training set and obtain another F1 value, and so on. In the end we have 10 values of F1, and we take their arithmetic mean as the final result. I repeat that this approach (also called cross-validation) makes sense only for algorithms based on machine learning.
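
In scikit-learn terms a k-fold cross-validation of this sort can be sketched as follows; the classifier and data here are placeholders, and note that scikit-learn's standard folds train on the larger part of the data, the reverse of the split described above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# 10-fold cross-validation: train on 9/10 of the data, evaluate F1 on the
# held-out 1/10, repeat for each fold and average.
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print(scores.mean())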

Returning to writing the FRD, we note that our situation is much worse. We have a potentially unlimited set of input data (all web pages on the internet) and no way to assess the context of a page other than human participation. Thus our sample can only be formed manually, and it depends heavily on the whims of its compiler (the decision whether a page belongs to some category is made by a person). We can estimate F1 on the examples known to us, but we cannot find out F1 for all internet pages. Therefore, for potentially unlimited data sets (such as the set of web pages, of which there are very many), an evaluation based on random sampling is sometimes used: a certain number of examples (pages) is selected at random and an operator (a person) provides the correct set of categories (classes) for them. We can then test the classifier on these selected examples. Further, assuming the selected examples are typical, we can approximately estimate the precision of the algorithm. At the same time we cannot estimate recall (it is unknown how many correct answers lie outside the selected examples), and therefore we cannot compute F1 either.

Thus, if we want to know how the algorithm behaves on all possible input data, the best we can estimate in this situation is an approximate precision value. If everyone agrees to use a predetermined fixed sample, then the average F1 value can be computed on that sample.

So what, in the end?

In the end we will have to do the following:

  1. Fix the training sample. It will be built based on the customer's notions of the "correct" context.
  2. Fix the set of categories for our analyzer: we cannot compute F1 over an open-ended set of classes.
  3. State the requirement in the form: the analyzer must determine the context with an average F1 value of at least 80% (for example).
  4. Explain this to the customer.

As you can see, writing an FRD for such a system is not easy (especially the last item), but it is possible. As for the threshold F1 value, in such cases one can start from the F1 values reported for similar classification tasks.

Hi, Habr!

In machine learning tasks, metrics are used to assess the quality of models and compare different algorithms, and their selection and analysis is an essential part of a data scientist's job.

In this article we will look at some quality criteria for classification tasks, discuss what matters when choosing a metric, and see what can go wrong.

Metrics in classification tasks

To demonstrate useful sklearn functions and give a visual representation of the metrics, we will use our dataset on telecom operator customer churn, which we met in the first article of the course.

Let us load the necessary libraries and look at the data:

import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pylab import rc, plot
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (precision_recall_curve, classification_report,
                             confusion_matrix, roc_curve)

df = pd.read_csv("../../data/telecom_churn.csv")

df.head(5)


# Map the binary columns and dummy-encode State
# (for simplicity; for tree-based models this is better not done)
d = {"Yes": 1, "No": 0}
df["International plan"] = df["International plan"].map(d)
df["Voice mail plan"] = df["Voice mail plan"].map(d)
df["Churn"] = df["Churn"].astype("int64")

le = LabelEncoder()
df["State"] = le.fit_transform(df["State"])
ohe = OneHotEncoder(sparse=False)
encoded_state = ohe.fit_transform(df["State"].values.reshape(-1, 1))
tmp = pd.DataFrame(encoded_state,
                   columns=["state " + str(i) for i in range(encoded_state.shape[1])])
df = pd.concat([df, tmp], axis=1)

Accuracy, Precision and Recall

Before moving on to the metrics themselves, we need to introduce an important concept for describing them in terms of classification errors - the confusion matrix (error matrix).
Suppose we have two classes and an algorithm that predicts for each object which of the classes it belongs to; then the classification error matrix looks like this:

                      y = 1 (true)           y = 0 (true)
ŷ = 1 (predicted)     True Positive (TP)     False Positive (FP)
ŷ = 0 (predicted)     False Negative (FN)    True Negative (TN)

Here ŷ is the algorithm's answer on an object and y is the true class label of that object.
Thus there are two types of classification error: False Negative (FN) and False Positive (FP).

Training the algorithm and building the confusion matrix

X = df.drop("Churn", axis=1)
y = df["Churn"]

# Split into train and test; all metrics will be evaluated on the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.33, random_state=42)

# Train a plain logistic regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)

# Confusion-matrix plotting function from the sklearn documentation
def plot_confusion_matrix(cm, classes, normalize=False,
                          title="Confusion matrix", cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")
    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

font = {"size": 15}
plt.rc("font", **font)

cnf_matrix = confusion_matrix(y_test, lr.predict(X_test))
plt.figure(figsize=(10, 8))
plot_confusion_matrix(cnf_matrix, classes=["Non-churned", "Churned"],
                      title="Confusion matrix")
plt.savefig("conf_matrix.png")
plt.show()

Accuracy.

An intuitive, obvious and almost never used metric is accuracy - the share of correct answers of the algorithm: accuracy = (TP + TN) / (TP + TN + FP + FN).

This metric is useless in tasks with unequal classes, which is easy to show with an example.

Suppose we want to evaluate a mail spam filter. We have 100 non-spam emails, 90 of which our classifier identified correctly (True Negative = 90, False Positive = 10), and 10 spam emails, 5 of which the classifier also identified correctly (True Positive = 5, False Negative = 5).
Then accuracy = (5 + 90) / (5 + 90 + 10 + 5) ≈ 86.4%.

However, if we simply predict all emails as non-spam, we get a higher accuracy: (0 + 100) / (0 + 100 + 0 + 10) ≈ 90.9%.

At the same time our model has no predictive power at all, since we originally wanted to identify the spam emails. Moving from a metric common to all classes to per-class quality indicators will help us overcome this.

Precision, Recall and the F-measure

To assess the quality of the algorithm on each class separately, we introduce the metrics precision and recall: precision = TP / (TP + FP), recall = TP / (TP + FN).

Precision can be interpreted as the share of objects called positive by the classifier that are indeed positive, while recall shows what share of all positive-class objects the algorithm managed to find.

It is precisely the introduction of precision that keeps us from simply assigning all objects to one class, since then the False Positive count grows. Recall demonstrates the ability of the algorithm to detect the given class at all, while precision demonstrates the ability to distinguish this class from other classes.

As we noted earlier, there are two types of classification error: False Positive and False Negative. In statistics the first is called a type I error and the second a type II error. In our subscriber-churn task, a type I error is taking a loyal subscriber for a churning one, since our null hypothesis is that none of the subscribers churn and we reject this hypothesis. Accordingly, a type II error is "missing" a churning subscriber and erroneously accepting the null hypothesis.

Unlike accuracy, precision and recall do not depend on the class ratio and are therefore applicable under imbalanced samples.
In real practice the task is often to find the optimal (for the customer) balance between these two metrics. A classic example is the customer churn task.
Obviously we cannot find all churning customers and only them. But having decided on a strategy and a budget for customer retention, we can choose the thresholds for precision and recall we need. For example, we may focus on retaining only high-margin customers, or those more likely to leave, since we are limited by the call-center resource.

Usually, when optimizing the hyperparameters of an algorithm (for example, when iterating over a grid with GridSearchCV), a single metric is used whose improvement we expect to see on the test sample.
There are several ways to combine precision and recall into an aggregate quality criterion. The F-measure (in general F_beta) is the harmonic mean of precision and recall:

F_beta = (1 + beta²) · precision · recall / (beta² · precision + recall).

Here beta determines the weight of precision in the metric; for beta = 1 this is the harmonic mean (with a factor of 2, so that F_1 = 1 when precision = 1 and recall = 1).
The F-measure reaches its maximum when recall and precision both equal one, and it is close to zero if either argument is close to zero.
sklearn has a convenient function, metrics.classification_report, that returns the recall, precision and F-measure for each class, as well as the number of instances of each class.

report = classification_report(y_test, lr.predict(X_test),
                               target_names=["Non-churned", "Churned"])
print(report)

              precision    recall  f1-score   support
 Non-churned       0.88      0.97      0.93       941
     Churned       0.60      0.25      0.35       159
 avg / total       0.84      0.87      0.84      1100

It should be noted here that in problems with imbalanced classes, which prevail in real practice, one often has to resort to techniques of artificially modifying the dataset to even out the class ratio. There are many of them, and we will not touch on them here; you can look at some methods and choose the one that suits your task.

AUC-ROC and AUC-PR

When converting the algorithm's real-valued output (as a rule, a probability of class membership; see SVM separately) into a binary label, we must choose some threshold above which 0 becomes 1. A threshold of 0.5 seems natural and close at hand, but it by no means always turns out to be optimal, for example in the aforementioned absence of class balance.

One way to evaluate the model as a whole, without tying it to a specific threshold, is AUC-ROC (or ROC AUC) - the Area Under the receiver operating characteristic Curve. This curve is a line from (0, 0) to (1, 1) in the coordinates of the True Positive Rate (TPR) and the False Positive Rate (FPR):

TPR = TP / (TP + FN) is already known to us - it is recall - while FPR = FP / (FP + TN) shows what share of negative-class objects the algorithm predicted incorrectly. In the ideal case, when the classifier makes no mistakes (FPR = 0, TPR = 1), the area under the curve equals one; conversely, when the classifier outputs class probabilities at random, AUC-ROC tends to 0.5, since the classifier produces equal shares of TP and FP.
Each point on the plot corresponds to the choice of some threshold. The area under the curve reflects the quality of the algorithm (larger is better); in addition, the steepness of the curve itself matters - we want to maximize TPR while minimizing FPR, so ideally the curve should tend toward the point (0, 1).

Code for drawing the ROC curve

sns.set(font_scale=1.5)
sns.set_color_codes("muted")

plt.figure(figsize=(10, 8))
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1], pos_label=1)
lw = 2
plt.plot(fpr, tpr, lw=lw, label="ROC curve")
plt.plot([0, 1], [0, 1])          # diagonal of a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.savefig("roc.png")
plt.show()

The AUC-ROC criterion is robust to imbalanced classes (spoiler: alas, not everything is so clear-cut) and can be interpreted as the probability that a randomly chosen positive object will be ranked higher by the classifier (will receive a higher probability of being positive) than a randomly chosen negative object.

Consider the following task: we need to select 100 relevant documents out of 1 million documents. We have built two algorithms:

  • Algorithm 1 returns 100 documents, 90 of which are relevant. Thus TPR = 90/100 = 0.9 and FPR = 10/999900 ≈ 0.00001.
  • Algorithm 2 returns 2000 documents, 90 of which are relevant. Thus TPR = 90/100 = 0.9 and FPR = 1910/999900 ≈ 0.00191.

Most likely we would choose the first algorithm, which produces very few false positives compared to its competitor. But the difference in False Positive Rate between the two algorithms is extremely small - just 0.0019. This is a consequence of the fact that AUC-ROC measures the share of False Positives relative to True Negatives, and in tasks where the second (larger) class is less important to us, it may give a not entirely adequate picture when comparing algorithms.

To correct the situation, let us return to recall and precision:

  • Algorithm 1: precision = 90/100 = 0.9, recall = 90/100 = 0.9.
  • Algorithm 2: precision = 90/2000 = 0.045, recall = 90/100 = 0.9.

Here a significant difference between the two algorithms is already noticeable - 0.855 in precision, to be exact!

Precision and recall are also used to build a curve and, by analogy with AUC-ROC, to find the area under it (AUC-PR).

It can be noted here that on small datasets the area under the PR curve may be overly optimistic, because it is computed by the trapezoid method, but usually there is enough data in such tasks. For details on the relationship between AUC-ROC and AUC-PR, see the linked reference.
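
A sketch of building the PR curve and the area under it for the same model as above (precision_recall_curve is already imported; the auc import is our addition):

from sklearn.metrics import auc

precision, recall, thresholds = precision_recall_curve(
    y_test, lr.predict_proba(X_test)[:, 1])

plt.figure(figsize=(10, 8))
plt.plot(recall, precision, lw=2, label="PR curve")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()

# Area under the PR curve (trapezoid rule over the (recall, precision) points)
print(auc(recall, precision))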

Logistic loss

Standing somewhat apart is the logistic loss function, defined as:

logloss = -(1/L) · Σ_{i=1..L} ( y_i · log(ŷ_i) + (1 - y_i) · log(1 - ŷ_i) ),

where ŷ_i is the algorithm's answer on the i-th object, y_i is the true class label on the i-th object, and L is the sample size.

The mathematical interpretation of the logistic loss function has already been covered in detail in the post on linear models.
This metric rarely appears in business requirements, but it often does in Kaggle tasks.
Intuitively, minimizing logloss can be thought of as maximizing accuracy by penalizing wrong predictions. However, it must be noted that logloss penalizes the classifier's confidence in a wrong answer extremely heavily.

Consider an example:

def logloss_crutch(y_true, y_pred, eps=1e-15):
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print("Logloss with an uncertain classification %f" % logloss_crutch(1, 0.5))
>> Logloss with an uncertain classification 0.693147

print("Logloss with a confident classification and a correct answer %f" % logloss_crutch(1, 0.9))
>> Logloss with a confident classification and a correct answer 0.105361

print("Logloss with a confident classification and a wrong answer %f" % logloss_crutch(1, 0.1))
>> Logloss with a confident classification and a wrong answer 2.302585

Note how dramatically logloss grew for a confident but wrong answer!
Consequently, an error on a single object can noticeably worsen the overall error on the sample. Such objects are often outliers, which must not be forgotten: filter them out or examine them separately.
Everything falls into place if you plot the logloss curve:

It can be seen that the closer the algorithm's answer is to zero when the ground truth is 1, the higher the error value and the steeper the curve grows.
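
A sketch of such a plot for the ground truth = 1 case, reusing the logloss_crutch helper above (numpy and matplotlib are imported earlier):

x = np.linspace(0.001, 0.999, 500)              # predicted probability of class 1
plt.figure(figsize=(8, 5))
plt.plot(x, logloss_crutch(1, x), lw=2)
plt.xlabel("Predicted probability (ground truth = 1)")
plt.ylabel("Logloss")
plt.show()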

Summing up:

  • In the case of multiclass classification, you need to watch the metrics of each class closely and follow the logic of solving the task rather than that of optimizing the metric
  • In the case of unequal classes, you need to select a class balance for training and a metric that correctly reflects the quality of classification
  • The choice of metric should be made with a focus on the subject area, preprocessing the data and, possibly, segmenting it (as in splitting customers into rich and poor)

Useful links

  1. Evgeny Sokolov's course: a seminar on model selection (with information on metrics for regression tasks)
  2. Problems on AUC-ROC from A.G. Dyakonov
  3. You can read about other metrics on Kaggle; a link to a competition where it was used is attached to the description of each metric
  4. A presentation by Bogdan Melnik aka ld86 on learning from unbalanced samples

Any data scientist works with large amounts of data daily. It is believed that about 60-70% of the time is taken by the first stage of the workflow: cleaning, filtering and transforming the data into a format suitable for machine learning algorithms. The second stage covers preprocessing and the actual training of models. In today's article we will concentrate on the second stage of the process and consider various techniques and recommendations that are the result of my participation in more than 100 machine learning competitions. Although the described concepts are quite general, they will be useful in solving many specific problems.

All code examples are written in Python!

Data

Before applying machine learning algorithms, the data must be converted to a tabular form. This process, shown in the figure below, is the most complex and time-consuming.

Figure legend: Data in Database - data in the database; Data Munging - data filtering; Useful Data - useful data; Data Conversion - data conversion; Tabular Data - tabular data; Data - independent variables (features); Labels - dependent (target) variables.

After the data is converted to a tabular form, it can be used to train models. The tabular form is the most common representation of data in machine learning and data mining. Table rows are individual objects (observations). Table columns contain independent variables (features), denoted X, and dependent (target) variables, denoted y. Depending on the task class, the target variables can be represented by one or several columns.

Types of target variables

The target variable determines the class of the problem and can be represented in one of the following forms:

  • One column with binary values: a two-class (binary) classification task; each object belongs to only one of two classes.
  • One column with real values: a regression task; a single value is predicted.
  • Several columns with binary values: a multi-class classification task; each object belongs to exactly one class.
  • Several columns with real values: a regression task; several values are predicted.
  • Several columns with binary values: a multi-label classification task; one object can belong to several classes at once.

Metrics

When solving any machine learning task, you need a way to evaluate the result, that is, you need a metric. For example, for a two-class classification task the area under the ROC curve (ROC AUC) is usually used as the metric. For multi-class classification, categorical cross-entropy is usually applied. For regression, mean squared error (MSE) is used.
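A quick sketch of computing these three metrics with sklearn.metrics on toy values:

    import numpy as np
    from sklearn.metrics import roc_auc_score, log_loss, mean_squared_error

    y_true = np.array([0, 1, 1, 0, 1])
    y_prob = np.array([0.1, 0.8, 0.6, 0.3, 0.9])         # predicted probabilities of class 1

    print(roc_auc_score(y_true, y_prob))                  # ROC AUC for binary classification
    print(log_loss(y_true, y_prob))                       # cross-entropy (binary case; works for multi-class too)
    print(mean_squared_error([1.5, 2.0], [1.4, 2.3]))     # MSE for regression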

We will not consider the metrics in detail, since they can be quite diverse and are chosen for the specific task.

Libraries

First of all, you need to install the basic libraries required for numerical computation, such as numpy and scipy. Then you can install the most popular libraries for data analysis and machine learning (an install sketch follows the list):

  • Exploration and transformation of data: pandas (http://pandas.pydata.org/).
  • A wide range of machine learning algorithms: scikit-learn (http://scikit-learn.org/stable/).
  • The best implementation of gradient boosting: xgboost (https://github.com/dmlc/xgboost).
  • Neural networks: keras (http://keras.io/).
  • Visualization: matplotlib (http://matplotlib.org/).
  • Progress indicator: tqdm (https://pypi.python.org/pypi/tqdm).
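One possible way to install them, assuming pip and the package names as published on PyPI:

    pip install numpy scipy pandas scikit-learn xgboost keras matplotlib tqdm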

I should say that I do not use Anaconda (https://www.continuum.io/downloads). Anaconda bundles most of the popular libraries and significantly simplifies the installation process, but I need more freedom. The choice is yours. 🙂

Framework for machine learning

In 2015 I presented the concept of a framework for automatic machine learning. The system is still under development, but it will be released soon. The structure of the framework, which will serve as the basis for the rest of this article, is shown in the figure below.

Figure from the publication: A. Thakur and A. Krohn-Grimberghe, AutoCompete: A Framework for Machine Learning Competitions.

The framework receives as input data that has already been converted to tabular form. The pink lines show the pipeline for the simplest case.

In the first step, the class of the problem is determined. This can be done by analyzing the target variable. The problem may be classification or regression; a classification may be two-class or multi-class, and the classes may or may not overlap. Once the class of the problem is determined, we split the original dataset into two parts: a training set and a validation set, as shown in the figure below.

If we are dealing with classification, the split must be done so that the ratio of the number of objects belonging to the different classes in the resulting sets matches that ratio in the source dataset (stratified splitting). This is easy to do with the StratifiedKFold class from the scikit-learn library.

For a regression task, an ordinary split with the KFold class, which is also available in scikit-learn, is suitable.

In addition, more complex data-splitting methods exist for the regression problem that ensure the same distribution of the target variable in the resulting sets. They are left for the reader to study independently.
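A minimal splitting sketch under these assumptions (toy data; modern scikit-learn API with StratifiedKFold and KFold from sklearn.model_selection; the eval_size variable name is illustrative):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, KFold

    X = np.random.rand(1000, 20)                    # toy feature matrix
    y = np.random.randint(0, 2, size=1000)          # toy binary labels

    eval_size = 0.10                                # validation set = 10% of the data

    # Classification: a stratified split preserves the class proportions.
    skf = StratifiedKFold(n_splits=int(1 / eval_size), shuffle=True, random_state=42)
    train_idx, valid_idx = next(skf.split(X, y))
    X_train, y_train = X[train_idx], y[train_idx]
    X_valid, y_valid = X[valid_idx], y[valid_idx]

    # Regression: an ordinary KFold split is sufficient.
    kf = KFold(n_splits=int(1 / eval_size), shuffle=True, random_state=42)
    train_idx, valid_idx = next(kf.split(X))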

In the code example above, the size of the validation set (eval_size) is 10% of the source dataset. This value should be chosen based on the amount of source data.

After splitting the data, all transformations applied to the training set must be saved and then applied to the validation set. The validation set must never be combined with the training set: doing so would yield very good scores while the models would be useless because of severe overfitting.

In the next step we determine the types of the features. The three most common types are numeric, categorical and text. Let us consider the dataset from the popular Titanic passenger survival task (https://www.kaggle.com/c/titanic/data).

In this dataset, the survival column contains the target variable. In the previous step we already separated the target variable from the features. The features pclass, sex and embarked are categorical. The features age, sibsp, parch and the like are numeric. The feature name is textual. However, I do not think the passenger's name will be useful in predicting whether that passenger survived or not.

Numeric features need no conversion: in their original form they are ready for normalization and model training.

Do not forget that before using OneHotEncoder you need to convert the categories to numbers with LabelEncoder.
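A minimal sketch with toy values (note that recent scikit-learn versions can one-hot encode string categories directly, in which case the LabelEncoder step is unnecessary):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    embarked = np.array(['S', 'C', 'S', 'Q', 'C'])    # toy categorical feature

    lbl = LabelEncoder()
    codes = lbl.fit_transform(embarked)               # categories -> integer codes

    ohe = OneHotEncoder()
    onehot = ohe.fit_transform(codes.reshape(-1, 1))  # integer codes -> sparse one-hot matrix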

Since the data from the Titanic competition does not contain a good example of a text feature, let us formulate a general rule for converting text features: combine all text features into one, and then apply appropriate algorithms that convert the text into a numerical representation.

Text features can be combined as follows:
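For instance, with pandas (the title and body column names are purely illustrative):

    import pandas as pd

    df = pd.DataFrame({
        'title': ['first post', 'second post'],
        'body':  ['hello world', 'lorem ipsum dolor'],
    })

    # Concatenate all text columns into a single text feature.
    text_data = (df['title'] + ' ' + df['body']).tolist()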

Then we can perform the conversion with the CountVectorizer or TfidfVectorizer class from scikit-learn.

Usually TfidfVectorizer gives a better result than CountVectorizer. In practice I have found that the following TfidfVectorizer parameter values are optimal in most cases:
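An illustrative configuration (these parameter values and the toy documents are assumptions for the sketch, not necessarily the settings the author had in mind):

    from sklearn.feature_extraction.text import TfidfVectorizer

    text_data = ['first post hello world', 'second post hello again',
                 'third post hello world again', 'fourth post something else']

    tfv = TfidfVectorizer(
        min_df=3,               # ignore terms that appear in fewer than 3 documents
        analyzer='word',
        ngram_range=(1, 2),     # unigrams and bigrams
        sublinear_tf=True,
        stop_words='english',
    )
    X_text = tfv.fit_transform(text_data)   # sparse TF-IDF feature matrix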

If you fit the vectorizer only on the training set, do not forget to save it to disk so that it can later be applied to the validation set.

In the next step, the features obtained from the transformations described above are passed to the stacker. This framework node combines all the transformed features into a single matrix. Note that here we are talking about a feature stacker, which should not be confused with a model stacker, a different popular technique.

Features can be combined with the hstack function from the numpy library (for dense features) or with the hstack function from the sparse module of the scipy library (for sparse features).
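A minimal sketch of both cases on toy matrices:

    import numpy as np
    from scipy import sparse

    dense_a = np.random.rand(5, 3)
    dense_b = np.random.rand(5, 2)
    X_dense = np.hstack((dense_a, dense_b))                  # dense features -> (5, 5) array

    sparse_a = sparse.csr_matrix(dense_a)
    sparse_b = sparse.csr_matrix(dense_b)
    X_sparse = sparse.hstack((sparse_a, sparse_b)).tocsr()   # sparse features -> (5, 5) CSR matrix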

If other preprocessing steps are performed, for example dimensionality reduction or feature selection (discussed below), the resulting features can be combined efficiently with the FeatureUnion class from scikit-learn.
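A sketch of such a combination on toy data (PCA and SelectKBest here simply stand in for arbitrary preprocessing branches):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import FeatureUnion

    X, y = make_classification(n_samples=300, n_features=30, random_state=0)

    union = FeatureUnion([
        ('pca', PCA(n_components=10)),
        ('kbest', SelectKBest(f_classif, k=5)),
    ])
    X_combined = union.fit_transform(X, y)   # 10 PCA components + 5 selected features -> 15 columns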

Once all features are combined into one matrix, we can start training models. Since the features are not normalized, at this stage only ensemble algorithms based on decision trees should be applied (a minimal training sketch follows the list):

  • RandomForestClassifier
  • RandomForestRegressor
  • ExtraTreesClassifier
  • ExtraTreesRegressor
  • XGBClassifier
  • XGBRegressor
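A minimal training sketch on toy data (assuming xgboost is installed; the parameters are defaults, not tuned values):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, n_features=25, random_state=0)

    # Tree ensembles work directly on unnormalized (even heterogeneous stacked) features.
    for model in (RandomForestClassifier(n_estimators=200, random_state=0),
                  ExtraTreesClassifier(n_estimators=200, random_state=0),
                  XGBClassifier(n_estimators=200)):
        model.fit(X, y)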

To apply linear models, the features must be normalized with the Normalizer or StandardScaler class from scikit-learn.

These normalization methods give a good result only for dense features. To apply StandardScaler to sparse features, pass with_mean=False as a parameter.
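A sketch of both variants on toy data; with_mean=False skips centering, which would otherwise destroy sparsity:

    import numpy as np
    from scipy import sparse
    from sklearn.preprocessing import StandardScaler

    X_dense = np.random.rand(100, 5)
    X_sparse = sparse.csr_matrix(X_dense)

    X_dense_scaled = StandardScaler().fit_transform(X_dense)

    # For sparse input, centering is disabled so the matrix stays sparse.
    scaler = StandardScaler(with_mean=False).fit(X_sparse)
    X_sparse_scaled = scaler.transform(X_sparse)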

If the steps described above have given us a "good" model, we can move on to tuning the hyperparameters. If the model does not satisfy us, we can keep working on the features. In particular, we can apply various dimensionality reduction methods as additional steps.

To keep things simple, we will not consider linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). In general, principal component analysis (PCA) is used to reduce the dimensionality of the data. When working with images, start with 10-15 components and increase this number as long as the result keeps improving. When working with other types of data, you can start with 50-60 components.

In the case of text data, after converting the text into a sparse matrix you can apply singular value decomposition (SVD). A truncated SVD, TruncatedSVD, is available in scikit-learn.

The number of SVD components that, as a rule, gives a good result for features produced by CountVectorizer or TfidfVectorizer is 120-200. A larger number of components improves the result slightly at the cost of significant computation.
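A sketch of TruncatedSVD applied to a TF-IDF matrix (toy documents, so far fewer components are used here than the 120-200 recommended above):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ['the quick brown fox', 'the lazy dog sleeps', 'the fox jumps over the lazy dog',
            'a quick brown dog', 'the dog and the fox']
    X_tfidf = TfidfVectorizer().fit_transform(docs)        # sparse TF-IDF matrix

    # In practice 120-200 components; kept tiny here because the toy vocabulary is tiny.
    svd = TruncatedSVD(n_components=3, random_state=0)
    X_svd = svd.fit_transform(X_tfidf)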

After performing the described steps, do not forget to normalize the features so that linear models can be applied. Next we can either use the prepared features to train models, or perform feature selection.

There are various feature selection methods. One popular method is the greedy feature selection algorithm, which works as follows. Step 1: train and evaluate a model on each individual feature; select the single feature that gives the best score. Step 2: train and evaluate a model on pairs consisting of the best feature chosen at the previous step and each of the remaining features; select the best of the remaining features. Repeat these steps until the desired number of features is selected or some other stopping criterion is met. An implementation of this algorithm that uses the area under the ROC curve as the metric is available here: https://github.com/abhishekkrthakur/greedyfeatureselection. Note that this implementation is imperfect and requires modification for each specific task.
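A compact sketch of this greedy forward scheme (not the referenced implementation) on toy data, scoring candidates with cross-validated ROC AUC of a logistic regression:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def greedy_feature_selection(X, y, n_features, scoring='roc_auc'):
        """Forward greedy selection: at each step add the feature that improves the CV score most."""
        selected, remaining = [], list(range(X.shape[1]))
        for _ in range(n_features):
            best_score, best_feat = -np.inf, None
            for f in remaining:
                cols = selected + [f]
                score = cross_val_score(LogisticRegression(max_iter=1000),
                                        X[:, cols], y, scoring=scoring, cv=3).mean()
                if score > best_score:
                    best_score, best_feat = score, f
            selected.append(best_feat)
            remaining.remove(best_feat)
        return selected

    X, y = make_classification(n_samples=300, n_features=15, n_informative=5, random_state=0)
    print(greedy_feature_selection(X, y, n_features=5))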

Another, faster feature selection method uses one of the machine learning algorithms that estimate feature importance, for example logistic regression or random forest; the selected features can then be used to train other algorithms.

When performing feature selection with a random forest, remember that the number of trees should be small and that you should not tune the hyperparameters heavily, otherwise you may overfit.

Feature selection can also be performed with gradient boosting algorithms. It is recommended to use the xgboost library rather than the corresponding implementation from scikit-learn, because xgboost is much faster and more scalable.

The RandomForestClassifier and RandomForestRegressor algorithms and xgboost also allow feature selection on sparse data.
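A sketch of importance-based selection with a small, lightly tuned random forest on toy data, using scikit-learn's SelectFromModel helper:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    X, y = make_classification(n_samples=500, n_features=40, n_informative=8, random_state=0)

    # A small forest, without heavy tuning, ranks the features by importance.
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    selector = SelectFromModel(forest, prefit=True)   # default threshold: mean importance
    X_selected = selector.transform(X)
    print(X_selected.shape)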

Another popular technique for selecting non-negative features is selection based on the chi-squared criterion. An implementation of this method is also available in scikit-learn.
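A sketch with toy non-negative count-like features; k=20 matches the value discussed below:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    rng = np.random.RandomState(0)
    X_counts = rng.poisson(1.0, size=(200, 50))   # non-negative toy "count" features
    y = rng.randint(0, 2, size=200)

    # chi2 requires non-negative features (e.g. counts or TF-IDF values).
    skb = SelectKBest(chi2, k=20)
    X_selected = skb.fit_transform(X_counts, y)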

In the code above we use the SelectKBest class together with the chi-squared criterion (chi2) to select the 20 best features. The number of selected features is, in fact, a hyperparameter that needs to be optimized to improve the model's result.

Do not forget to save all the transformers that you applied to the training set. They will be needed to evaluate the models on the validation set.

The next step is choosing a machine learning algorithm and tuning its hyperparameters.

In the general case, when choosing a machine learning algorithm, the following options should be considered:

  • Classification:
    • Random forest (Random Forest).
    • Logistic regression (Logistic Regression).
    • Naive Bayes (Naive Bayes).
    • K nearest neighbors (K-Nearest Neighbors).
  • Regression:
    • Random forest (Random Forest).
    • Gradient boosting (Gradient Boosting).
    • Linear regression (Linear Regression).
    • Ridge regression (Ridge Regression).
    • Lasso regression (Lasso Regression).
    • Support vector machine (Support Vector Machine).

The table below shows the main hyperparameters of each algorithm and the ranges of their optimal values.

The RS* label in the table means that optimal values cannot be specified in advance, and a random search should be performed (a random-search sketch is given below).
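A sketch of such a random search with RandomizedSearchCV on toy data (the search ranges below are illustrative assumptions, not the values from the table):

    from scipy.stats import randint, uniform
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    param_dist = {
        'n_estimators': randint(100, 1000),
        'max_depth': randint(3, 30),
        'max_features': uniform(0.1, 0.8),   # fraction of features, sampled from [0.1, 0.9]
    }
    search = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                                n_iter=20, scoring='roc_auc', cv=3, random_state=0)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)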

Let us remind you once again: do not forget to save all the transformers you applied:
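For example, with joblib (the file name and the toy fitted scaler below are illustrative):

    import joblib
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit([[0.0, 1.0], [2.0, 3.0]])   # any fitted transformer
    joblib.dump(scaler, 'scaler.pkl')                         # save it next to the model artifacts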

And do not forget to apply them to the validation set:
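And later, a sketch of loading and applying it (X_valid here is a stand-in for the validation features):

    import joblib

    X_valid = [[1.0, 2.0]]                 # stand-in for the validation features
    scaler = joblib.load('scaler.pkl')     # the transformer saved above
    X_valid_scaled = scaler.transform(X_valid)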

The approach we have considered and the framework based on it have shown good results on most of the datasets I have had to work with. Of course, when solving very complex tasks this technique does not always give a good result. Nothing is perfect, but by learning we improve, just as happens in machine learning.