Data preprocessing methods for machine learning algorithms with Python
Why is data preprocessing important?
Anyone familiar with data mining and machine learning knows that data preparation typically takes up more than 70% of a project's time, and the quality of the data largely determines how well the resulting model predicts and generalizes. Data quality involves many factors, including accuracy, completeness, consistency, timeliness, credibility and interpretability. Real-world data often contains large numbers of missing values, plenty of noise, and outliers caused by manual entry errors, none of which is conducive to training a model. Data cleaning applies the appropriate method to each kind of dirty data so that we end up with standard, clean, consistent data ready for statistics and data mining.
What are the data preprocessing methods?
The main steps of data preprocessing are data cleaning, data integration, data reduction and data transformation. This article covers the specific methods in each of these four areas. If you handle these aspects well in a project, it will be of great help to the modeling that follows and will let you reach a good result quickly.
Data cleaning
The main idea of data cleaning is to "clean up" the data by filling in missing values, smoothing noisy data, smoothing or removing outliers, and resolving inconsistencies. If users consider the data dirty, they will not trust mining results based on it; in other words, the output will not be reliable.
1. Handling missing values
In the real world, the process of acquiring information and data produces missing values and gaps for all sorts of reasons. The method chosen to deal with missing values depends mainly on the distribution of the variable and its importance (its information content and predictive power), and falls roughly into the following categories:
* Delete the variable: if the variable's missing rate is high (> 80%), its coverage is low, and its importance is low, it can simply be removed.
* Constant fill: in engineering practice, -9999 is commonly used as the replacement value.
* Statistical fill: if the missing rate is low (less than 95%) and the importance is low, fill according to the data distribution: for roughly uniformly distributed data, fill missing values with the variable's mean; for skewed distributions, use the median.
* Interpolation fill: including random interpolation, multiple imputation, hot-deck imputation, Lagrange interpolation, Newton interpolation, etc.
* Model fill: use regression, Bayesian methods, random forests, decision trees and similar models to predict the missing values.
* Dummy variable fill: if the variable is discrete and has few distinct values, it can be converted into dummy variables. For example, a SEX variable with the values male, female and NA can be converted into the three columns IS_SEX_MALE, IS_SEX_FEMALE and IS_SEX_NA. If a variable has a dozen or so distinct values, the less frequent values can be grouped into a single 'other' category according to their frequency, reducing the dimensionality. This approach preserves as much of the variable's information as possible.
To summarize, a common workflow is: first use pandas.isnull().sum() to check the missing ratio of each variable, then decide whether to delete or fill. If the variable to be filled is continuous, the mean or random interpolation is generally used; if the variable is discrete, the median or dummy variables are usually used. A sketch of this workflow is shown below.
Note: if the variable is to be discretized (binned), missing values are usually treated as a separate bin (one distinct value of the discrete variable).
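A minimal sketch of that workflow with pandas; the DataFrame, column names and fill choices here are made up for illustration:

```python
import numpy as np
import pandas as pd

# toy data with missing values (columns and values are hypothetical)
df = pd.DataFrame({
    "age": [23, np.nan, 45, 36, np.nan],
    "income": [3500, 4200, np.nan, 5100, 4800],
    "sex": ["male", "female", None, "male", "female"],
})

# 1. inspect the missing ratio of each variable
missing_ratio = df.isnull().sum() / len(df)
print(missing_ratio)

# 2. continuous variables: fill with the mean (or the median for skewed data)
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# 3. discrete variable: convert to dummy variables, keeping NA as its own column
sex_dummies = pd.get_dummies(df["sex"], prefix="IS_SEX", dummy_na=True)
df = pd.concat([df.drop(columns="sex"), sex_dummies], axis=1)
print(df)
```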
2. Handling outliers
Outliers are points that fall outside a certain region of the normal data distribution; data beyond that range is usually treated as anomalous or noisy. Anomalies can be divided into two kinds: "pseudo anomalies", which are produced by specific business operations and reflect a normal business state rather than an anomaly in the data itself; and "true anomalies", which are not produced by business operations but reflect an abnormal distribution of the data itself, i.e. outliers. The main methods for detecting outliers are the following:
* Simple statistical analysis: judge whether there are anomalies from a boxplot or from the quantiles; for example, the pandas describe function can quickly reveal outliers.
* The 3σ rule: if the data is normally distributed, points more than 3 standard deviations from the mean are usually defined as outliers.
* Median absolute deviation (MAD): a distance-based method that is robust to outliers; unlike approaches that compute each observation's distance from the mean, which amplify the influence of outliers, it works from the median of the absolute deviations.
* Distance-based: define a proximity measure between objects and judge whether an object is anomalous by how far it is from the others. The drawbacks are the high computational complexity, which makes it unsuitable for large data sets, and poor behaviour on data sets containing regions of different density.
* Density-based: an outlier's local density is significantly lower than that of most of its neighbours; suitable for non-uniform data sets.
* Clustering-based: run a clustering algorithm and discard small clusters that are far away from the other clusters.
To summarize, at the preprocessing stage outliers are treated as abnormal points that degrade data quality, not as the target of anomaly detection in the usual sense. The author therefore generally uses relatively simple and intuitive methods, combining the boxplot with the MAD statistic to judge whether a variable contains outliers, as in the sketch below.
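A minimal sketch of outlier detection with the IQR (boxplot) rule and MAD; the thresholds (1.5 × IQR, 3 × MAD) and the scaling constant 1.4826 are common conventions rather than values from the article:

```python
import pandas as pd

x = pd.Series([2, 3, 3, 4, 5, 4, 3, 100])  # hypothetical variable with one outlier

# boxplot / IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# MAD rule: flag points more than 3 scaled MADs from the median
median = x.median()
mad = 1.4826 * (x - median).abs().median()  # 1.4826 makes MAD comparable to a std for normal data
mad_outliers = (x - median).abs() > 3 * mad

print(x[iqr_outliers | mad_outliers])
```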
Specific treatment methods (a code sketch follows this list):
* Depending on the number of anomalous points and their impact, consider whether to delete them; this loses relatively much information.
* If the data is log-scaled, the logarithmic transformation removes the effect of extreme values; this method is effective and loses no information.
* Replace anomalous points with the mean or median: simple, efficient, and with little information loss.
* When training tree models: tree models are fairly robust to outliers, so there is no information loss and model training is not affected.
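A sketch of the log-transform and replacement options; the series and the IQR threshold are illustrative only:

```python
import numpy as np
import pandas as pd

x = pd.Series([2, 3, 3, 4, 5, 4, 3, 100])  # hypothetical variable

# option 1: log transform to compress the scale (requires non-negative values)
x_log = np.log1p(x)

# option 2: replace points flagged by the IQR rule with the median
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
is_outlier = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
x_fixed = x.mask(is_outlier, x.median())
print(pd.DataFrame({"x": x, "log": x_log, "median_replaced": x_fixed}))
```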
3. Handling noise
Noise is the random error or variance of an observed variable, i.e. the gap between the observed point and the true point. The usual treatment is to bin the data points (by equal frequency or equal width) and then replace all the values in each bin with the bin's mean, median or boundary value (different data distributions call for different choices), which smooths the data. Another approach is to build a regression model of the variable against predictor variables and use the regression to approximate the variable's value. A sketch of binning-based smoothing follows.
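A minimal sketch of smoothing by binning with pandas; the data and the number of bins are arbitrary choices for illustration:

```python
import pandas as pd

x = pd.Series([21, 24, 8, 25, 28, 34, 4, 21, 26, 29, 15, 34])  # hypothetical noisy variable

# equal-frequency binning into 3 bins, then smooth each value with its bin mean
eq_freq_bins = pd.qcut(x, q=3)
smoothed_by_mean = x.groupby(eq_freq_bins).transform("mean")

# equal-width binning into 3 bins, smoothing with the bin median instead
eq_width_bins = pd.cut(x, bins=3)
smoothed_by_median = x.groupby(eq_width_bins).transform("median")

print(pd.DataFrame({
    "x": x,
    "eq_freq_mean": smoothed_by_mean,
    "eq_width_median": smoothed_by_median,
}))
```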
Data integration
Most data analysis tasks involve data integration: combining data from multiple sources into a consistent data store such as a data warehouse. These sources may include multiple databases, data cubes or flat files.
1. The entity identification problem: for example, how can the data analyst or the computer be confident that customer_id in one database and cust_number in another refer to the same entity? Usually, databases and data warehouses carry metadata (data about data), and this metadata can help avoid errors during schema integration.
2. The redundancy problem: an attribute is redundant if it can be "derived" from another table, such as annual salary; inconsistent attribute or dimension naming can also lead to redundancy in the integrated data set. Redundancy can be detected with correlation analysis: for numeric variables, compute the correlation coefficient matrix; for nominal variables, compute the chi-square statistic (see the sketch after this list).
3. Conflicting data values: data from different sources must be unified when merged, i.e. standardized and de-duplicated.
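A sketch of those two redundancy checks; the DataFrame and column names are hypothetical, and the chi-square test uses scipy:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "salary": [3000, 4000, 5000, 6000, 7000],
    "annual_salary": [36000, 48000, 60000, 72000, 84000],  # derivable, hence redundant
    "city": ["A", "B", "A", "B", "A"],
    "region": ["north", "south", "north", "south", "north"],
})

# numeric variables: correlation coefficient matrix
print(df[["salary", "annual_salary"]].corr())

# nominal variables: chi-square test of independence on a contingency table
table = pd.crosstab(df["city"], df["region"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)
```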
Data reduction
Data reduction techniques produce a reduced representation of the data set that is much smaller yet still closely preserves the integrity of the original data. Mining on the reduced data set is therefore more efficient and produces the same (or nearly the same) analytical results. The general strategies are the following:
1. Dimensionality reduction
Data used for analysis may contain hundreds of attributes, most of which are irrelevant to the mining task or redundant. Dimensionality reduction removes the irrelevant attributes to reduce the volume of data while keeping the loss of information to a minimum.
Attribute subset selection: the goal is to find a minimum set of attributes such that the probability distribution of the data classes is as close as possible to the distribution obtained using all of the attributes. Mining on the reduced attribute set has further benefits: it reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand.
* Stepwise forward selection: the procedure starts from an empty attribute set, selects the best attribute in the original set and adds it to the set; at each subsequent iteration, the best of the remaining original attributes is added to the set.
* Stepwise backward elimination: the procedure starts from the full attribute set and, at each step, deletes the worst attribute still in the set.
* Combined forward selection and backward elimination: the two methods can be combined so that each step selects the best attribute and removes the worst from among the remaining attributes.
Python's scikit-learn provides Recursive Feature Elimination (RFE), which uses this idea to select feature subsets; it is generally used with an SVM or a regression model, as in the sketch below.
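A minimal sketch of RFE with a logistic regression estimator; the data set and the number of features to keep are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so the linear model converges cleanly

# recursively drop the weakest feature until 10 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # 1 = selected; larger = eliminated earlier
```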
Single-variable importance: analyze the correlation between each single variable and the target, and delete variables with low predictive power. Unlike attribute subset selection, this approach usually analyzes the variables from a statistical or information-theoretic point of view.
* Pearson correlation coefficient and chi-square test: analyze the correlation between the target variable and each single variable.
* Regression coefficients: train a linear or logistic regression, extract the coefficient of each variable, and rank by importance.
* Tree-model Gini index: train a decision tree model, extract the importance of each variable, i.e. its Gini index, and rank.
* Lasso regularization: train a regression model with an L1 regularization penalty, which makes the feature weight vector sparse.
* IV indicator: in risk-control models, the IV (information value) of each variable is usually computed to define its importance, with the threshold typically set at 0.02 or above.
The theory and implementation of the methods above are not covered in detail here and are left for the reader to master. The author usually decides based on business requirements: if features built from users or products need to be explainable, consider statistical methods first, such as plotting a variable's distribution curve or histogram and then computing the relevant correlation indicators, before turning to model-based methods. If the features are purely for modeling, model-based methods are typically used to filter them. If you use more complex models such as GBDT or DNN, it is recommended not to do feature selection at all, but to do feature crossing instead. A sketch of a few of the importance measures above follows.
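A minimal sketch of three of these importance measures (Pearson correlation, tree-based Gini importance, Lasso sparsity) on the breast cancer data set; applying Lasso to the 0/1 label as plain regression is only meant to show the sparsity effect:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Pearson correlation of each feature with the target
pearson = X.corrwith(y).abs().sort_values(ascending=False)

# Gini-based importance from a tree ensemble
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
gini = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)

# Lasso (L1): weak features are shrunk to exactly zero
X_scaled = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)
coefs = pd.Series(lasso.coef_, index=X.columns)

print(pearson.head(), gini.head(), coefs[coefs != 0], sep="\n\n")
```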
2. Dimensionality transformation
Dimensionality transformation reduces the existing data to a smaller number of dimensions while preserving the data's information as completely as possible. Several lossy dimensionality transformation methods are introduced below (a code sketch of PCA and SVD follows the list); they can greatly improve the efficiency of modeling in practice.
* Principal component analysis (PCA) and factor analysis (FA): PCA maps the current dimensions into a lower-dimensional space in such a way that the variance of each variable in the new space is maximized; FA finds a (smaller) set of common factors for the current feature vectors and describes the current feature vectors as linear combinations of those common factors.
* Singular value decomposition (SVD): the dimensions produced by SVD are harder to interpret and the computation is heavier than PCA; it is generally used to reduce the dimensionality of sparse matrices, for example in image compression or recommender systems.
* Clustering: group features with similar behaviour into a single variable, thereby significantly reducing the dimensionality.
* Linear combination: run a linear regression on several variables, assign each variable a weight based on its coefficient, and recombine that class of variables into one variable according to the weights.
* Manifold learning: a family of more complex nonlinear methods; see the sklearn LLE Example.
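A minimal sketch of PCA and truncated SVD with scikit-learn; the data set and the number of components are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# PCA: standardize first so no single feature dominates the variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.sum())  # variance retained by 5 components

# TruncatedSVD: works directly on (possibly sparse) matrices without centering
svd = TruncatedSVD(n_components=5)
X_svd = svd.fit_transform(X)
print(X_pca.shape, X_svd.shape)
```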
Data transformation
Data transformation includes standardization, discretization and sparsification of the data, with the goal of making it suitable for mining.
1. Standardization: different features may have inconsistent scales, and the differences between their values may be large; left unprocessed, this can affect the results of the analysis. The data therefore needs to be scaled so that it falls within a specific range and can be analyzed together. In particular, distance-based mining methods such as clustering, KNN and SVM must be preceded by standardization (a sketch follows the list below).
* Max-min (min-max) normalization: maps the data into the [0, 1] interval.
* Z-score standardization: after processing, the data has mean 0 and variance 1.
* Log transformation: for time-series data and variables whose scales differ greatly, a log transformation is usually applied.
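A minimal sketch of the three scalings; the two-feature array is hypothetical:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 800.0], [3.0, 3200.0]])  # hypothetical two-feature data

# max-min normalization into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# z-score standardization: mean 0, variance 1 per column
X_zscore = StandardScaler().fit_transform(X)

# log transformation of the large-scale second column
X_log = X.copy()
X_log[:, 1] = np.log1p(X_log[:, 1])

print(X_minmax, X_zscore, X_log, sep="\n\n")
```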
2. Discretization: data discretization splits continuous data into segments, turning it into a series of discrete intervals. The splitting can follow an equal-distance principle, an equal-frequency principle, or an optimization method. The main reasons for discretizing data include the following:
* The model requires it: algorithms such as decision trees and naive Bayes are based on discrete data, so to use them the data must be discretized. Effective discretization can also reduce the time and space cost of the algorithm and improve its ability to classify and cluster the samples as well as its robustness to noise.
* Discretized features are easier to understand than the raw continuous ones.
* It can effectively overcome hidden defects in the data and make the model results more stable.
Equal-frequency binning: make the number of samples in each bin equal; for example, with a total sample of n=100 split into k=5 bins, the principle is to ensure that each bin contains size=20 samples.
Equal-width binning: split the attribute into bins of equal width; for example, an age variable (0-100) can be split into the five equal-width bins [0, 20], [20, 40], [40, 60], [60, 80] and [80, 100].
Clustering binning: cluster the data and put the points of each cluster into one bin; the number of clusters is given in advance when building the model. A sketch of all three binning strategies is given below.
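A minimal sketch of the three binning strategies on a hypothetical age variable; the sample size and number of bins/clusters are arbitrary:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

age = pd.Series(np.random.default_rng(0).integers(0, 100, size=100))  # hypothetical ages

# equal-frequency binning: 5 bins with (roughly) the same number of samples
eq_freq = pd.qcut(age, q=5)

# equal-width binning: 5 bins of width 20 over [0, 100]
eq_width = pd.cut(age, bins=[0, 20, 40, 60, 80, 100], include_lowest=True)

# clustering-based binning: k-means with a given number of clusters
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(age.to_frame())

print(eq_freq.value_counts(), eq_width.value_counts(),
      pd.Series(clusters).value_counts(), sep="\n\n")
```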