When it comes to machine learning (ML) algorithms, it’s a classic case of garbage-in-garbage-out: If the data being fed to the algorithm does not meet the requirements of the algorithm, then the accuracy of the results will suffer.
Data preprocessing is a requirement for any ML project and must be customized to the needs of each one. It is an iterative task rather than a one-time activity: preprocessing choices are revisited based on feedback from the model's accuracy. Detailed below are a few critical data preprocessing techniques that can be used to prepare data and ensure superior model accuracy.
Missing values can reduce model accuracy because they shrink the number of data points available for model creation. Therefore, it is very important to identify missing values during data preprocessing.
Causes of missing values in data
- Issues in data collection: In many cases, data is pushed periodically by sensors. However, the data may not be sent due to sensor malfunction or loss of connection. Also, a manual error by the data entry operator may result in missing values in the data.
- Issues in data extraction: Many times, data is extracted from various input sources with different formats such as websites or documents. Issues in the data extraction process may result in missing values. To address this, the data needs to be validated once it is extracted and checked for missing values so the extraction process can be modified to correct the issue.
How to handle missing values
- Deletion: If there are only a few observations with missing values then the observations can be deleted without impacting model accuracy. This is the simplest way to handle the problem of missing values.
- Imputation: Imputation is the process of filling in missing values. New values can be estimated using a few different methods such as mean, median or mode of the data available for that attribute. Selection of the method or methods depends on the semantics of the attribute being filled.
- Prediction model: This approach uses models such as logistic regression to estimate missing values. The model is trained on the observations without missing values and then used to predict the missing ones. However, if the attribute with missing values is not well correlated with the other attributes, the estimated values will not be accurate.
- K-nearest neighbors: KNN finds the k nearest neighbors of the observation with the missing value and uses the mean or median of the attribute value of these nearest neighbors to impute the missing value. The advantage of KNN is that it provides a better-estimated value for imputation than the other methods described above.
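The three main strategies above can be sketched with pandas and scikit-learn. The dataset below is purely illustrative, and the column names are hypothetical:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical dataset with two missing ages (illustrative values only)
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 38, None],
    "income": [40_000, 52_000, 48_000, 61_000, 58_000, 47_000],
})

# Deletion: drop the observations that contain missing values
dropped = df.dropna()

# Imputation: fill missing ages with the attribute's median
median_filled = df.fillna({"age": df["age"].median()})

# KNN imputation: estimate each missing value from its 2 nearest
# neighbors, measured across the remaining attributes
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
```

Deletion is the cheapest option but discards whole rows; the KNN approach keeps every observation while borrowing values from similar ones.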
Outliers are values that diverge widely from the common pattern of an attribute. For example, if the attribute age has a value of 200, it is considered an outlier: the typical age range is roughly 1 to 100, and 200 falls far outside it. Dealing with outliers is very important as they can impact the model drastically, especially models using regression.
Outliers can be detected using various visualization techniques such as a scatter plot or box plot.
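As a numeric complement to visual inspection, the interquartile-range (IQR) rule that underlies the box plot can flag outliers directly. A minimal sketch, using illustrative age values:

```python
import pandas as pd

# Illustrative ages; 200 is the suspect value
ages = pd.Series([23, 31, 27, 45, 38, 52, 200])

# Box-plot rule: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
```

Here only the value 200 falls outside the fences, matching what a box plot would show as a lone point past the whiskers.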
How to handle outliers
- Deletion: In cases where an outlier appears to be an incorrect attribute value and the number of such observations is very low, they can be deleted.
- Data transformation: Various transformations can be performed on the data to handle outliers, such as taking the natural log of the values. Another form of transformation is “binning” the data, which involves defining ranges and placing each observation in a bin. For example, age ranges can be defined as <18, 19-35, 36-70 and >70. Any outlier value such as 200 then falls into the >70 bin and will not adversely impact the model.
- Imputation: The methods described above for imputing missing values can also be used to replace outlier values.
- Normalization: Algorithms that use the distance between data points—such as KNN—can be biased when the attributes are not in a similar range. Attributes with a larger range will dominate the distance calculation, reducing model accuracy. To overcome this, data points are normalized, or scaled, so the range of each attribute becomes 0 to 1. This assigns equal weight to each attribute, which provides better accuracy.
- Standardization: Sometimes the data distribution is skewed to one side, either right or left. Such data can first be reshaped using transformations such as the log, square-root or cube-root of the values, and then standardized so that it is centered around a mean of 0 with a standard deviation of 1.
- Discretization: Binning is used to convert continuous variables into categorical data. This is helpful in cases where categorical data is good enough in place of a continuous variable. For example, test grades can be binned into A, B, C, D and E. Discretization can be very helpful for classification algorithms such as decision trees.
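Binning, normalization and standardization can all be sketched in a few lines of pandas; the values and bin edges below are illustrative, reusing the age example:

```python
import pandas as pd

# Illustrative ages; 200 is an outlier
df = pd.DataFrame({"age": [12, 25, 40, 67, 200]})

# Binning: the outlier simply falls into the top bin
bins = [0, 18, 35, 70, float("inf")]
labels = ["<18", "19-35", "36-70", ">70"]
df["age_bin"] = pd.cut(df["age"], bins=bins, labels=labels)

# Min-max normalization: rescale the attribute to the 0-1 range
df["age_norm"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Standardization (z-score): center on mean 0, standard deviation 1
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()
```

Note how the value 200 ends up in the same “>70” bin as 67, while min-max scaling would let it compress all the other values toward 0—one reason binning is often preferred when outliers are present.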
New feature extraction
Feature extraction is the process of creating an additional feature using information from existing features. For example, for a retail store to understand sales patterns, it is more beneficial to look at the day of the week instead of a specific date. So creating another feature called “day of the week” from the date will improve the model.
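The day-of-week example can be sketched directly with pandas; the sales figures and dates below are hypothetical:

```python
import pandas as pd

# Hypothetical retail sales records keyed by date
sales = pd.DataFrame({
    "date":   pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-07"]),
    "amount": [120.0, 95.5, 310.0],
})

# Extract a new feature from the existing date column
sales["day_of_week"] = sales["date"].dt.day_name()
```

The model can now learn weekly patterns (e.g., weekend spikes) that a raw date column would hide.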
Existing feature transformation
Existing feature transformation includes converting categorical data to numerical data, which is necessitated by the limitation of some algorithms to work only on numerical data. The following two methods perform this conversion:
- Label encoding: Converting categorical data to numerical data is called label encoding. For example, gender data for males and females will be converted to 1 and 0 values without adding any new features.
- One hot encoding: For some categorical data such as colors, there may not be a natural way to order the values. In this case, the one-hot encoding technique is used, which creates a new binary feature for each category value: the column matching an observation's value is set to 1 and all others to 0.
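Both encodings are available out of the box in pandas; the gender and color values below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "color":  ["red", "blue", "green", "red"],
})

# Label encoding: map each category to an integer in place
df["gender_enc"] = df["gender"].map({"female": 0, "male": 1})

# One-hot encoding: one binary column per color value
one_hot = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df, one_hot], axis=1)
```

Label encoding is compact but implies an ordering, which is why unordered attributes such as color are one-hot encoded instead.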
There are a variety of data preprocessing methods that can be used to improve ML model accuracy. While not an exhaustive list, the methods discussed above illustrate a range of techniques for improving data quality during preparation, which is essential to increasing the accuracy of the results produced by the modeling algorithm.