The data, whether big or small, may be incomplete, unstructured, or have missing attributes. Sometimes the data arrives in a different format, so we have to process it into a specific format before any machine learning or deep learning algorithm can be applied.
Data preprocessing is the technique of converting raw data into a clean data set. In machine learning, data preprocessing is the vital and fundamental step that structures the data so that the model fits well and produces accurate results.
Data preprocessing is divided into four stages:
- Data cleaning
- Data integration
- Data reduction
- Data transformation
Data cleaning refers to techniques to clean data by removing outliers, smoothing noisy data, replacing missing values, and correcting inconsistent data.
1) Missing Data: Most datasets have missing values, which arise during data collection or data validation.
Some of the common reasons are:
- Some data is lost while transferring the database.
- Some fields were not filled in by the user.
- There were errors in processing the dataset.
Some approaches to dealing with missing data are:
- Eliminating rows with missing data.
- Filling approximate values in the missing places.
- Using a standard value to replace the missing value.
- Predicting and replacing missing values using algorithms such as regression or decision trees.
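The first three approaches can be sketched with pandas; the column names and values here are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with missing values (hypothetical data).
df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "salary": [50000, 60000, np.nan, 80000]})

# Approach 1: eliminate rows that contain any missing value.
dropped = df.dropna()

# Approach 2: fill in an approximate value, e.g. the column mean.
filled_mean = df.fillna(df.mean())

# Approach 3: replace missing values with a standard constant.
filled_const = df.fillna(0)
```

Which approach is appropriate depends on how much data is missing: dropping rows is safe only when few rows are affected.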
2) Duplicate Data: These are repeated data points in the dataset that contribute no new information.
Duplicate data mostly arises during data collection in scenarios when
- The user combines data sets from multiple sources.
- The user scrapes data from the web.
- The user receives data from other clients.
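Detecting and removing such duplicates is straightforward with pandas; the frame below is a hypothetical example.

```python
import pandas as pd

# Frame containing one exact duplicate row (hypothetical data).
df = pd.DataFrame({"id": [1, 2, 2, 3],
                   "value": ["a", "b", "b", "c"]})

# Flag duplicate rows, then drop them, keeping the first occurrence.
dupes = df.duplicated()
deduped = df.drop_duplicates()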
3) Inconsistent Data: Sometimes data is placed incorrectly in the dataset. It is therefore always advisable to assess the data, for example by checking the type of each feature and whether it is the same across all the data objects.
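One way to perform such an assessment in pandas is to inspect column dtypes; a column holding mixed types falls back to the generic `object` dtype. The column name and stray value below are hypothetical.

```python
import pandas as pd

# A numeric column polluted with a string entry (hypothetical data).
df = pd.DataFrame({"height_cm": [170, 182, "tall", 165]})

# Data assessment: mixed types force a generic "object" dtype.
is_consistent = df["height_cm"].dtype != "object"

# Coerce to numeric; the inconsistent entry becomes NaN for later cleaning.
df["height_cm"] = pd.to_numeric(df["height_cm"], errors="coerce")
```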
4) Outliers in the Data: These are values that deviate drastically from other observations and can degrade the performance of various models. Outliers can also surface when comparing relationships between two sets of data.
Outliers can come up in the data due to reasons such as
- Data corruption
- Input error when data is entered manually
- Faulty measurements
To deal with these anomalous values, data smoothing techniques such as binning, regression, outlier analysis, and specifying absolute bounds on the data are used.
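As a sketch of bounding outliers, the interquartile-range (IQR) rule is a common way to derive absolute bounds; values outside them can then be flagged or clipped. The sample series is hypothetical.

```python
import pandas as pd

# Sample with one obvious outlier (hypothetical data).
s = pd.Series([10, 12, 11, 13, 12, 95])

# Derive absolute bounds from the interquartile-range rule.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Outlier analysis: values falling outside the bounds.
outliers = s[(s < lower) | (s > upper)]

# Smoothing: clip values to the computed bounds.
smoothed = s.clip(lower, upper)
```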
Since data is collected from multiple sources, data integration has become a vital part of the process. Combining sources can introduce redundant and inconsistent data, which could result in poor accuracy.
To deal with these issues and maintain the data integrity, approaches such as tuple duplication detection and data conflict detection can be used.
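A minimal sketch of conflict detection when integrating two sources, assuming both share a key column; the table contents are hypothetical.

```python
import pandas as pd

# Two sources describing the same records (hypothetical data).
a = pd.DataFrame({"id": [1, 2], "city": ["Pune", "Delhi"]})
b = pd.DataFrame({"id": [1, 2], "city": ["Pune", "Mumbai"]})

# Integrate on the shared key and flag value conflicts between sources.
merged = a.merge(b, on="id", suffixes=("_a", "_b"))
merged["conflict"] = merged["city_a"] != merged["city_b"]
```

Flagged rows can then be resolved manually or by a precedence rule (e.g. trusting the more recent source).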
Most datasets have a large number of features. As dimensionality increases, the volume of the space occupied by the data grows, making it difficult to model and visualize.
A few major benefits of dimensionality reduction are :
- Data Analysis algorithms work better if the dimensionality of the dataset is lower.
- The models which are built on top of lower-dimensional data are more understandable and explainable.
- The data also becomes easier to visualize.
A few methods to reduce the volume of data are:
- Principal component analysis (PCA): a statistical method that reduces the number of attributes by lumping highly correlated attributes together.
- Singular value decomposition (SVD): a factorization of a real or complex matrix.
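PCA can be sketched with scikit-learn on synthetic data; here two nearly perfectly correlated features collapse into a single principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two highly correlated features plus a little noise (synthetic data).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

# PCA lumps the correlated pair into one component that
# captures almost all of the variance.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
```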
The final step of data preprocessing is transforming the data into a form appropriate for data modeling. The data available may not be in the right format or may require transformations to make it more useful; for the applied model to achieve good results, the data must be in a proper format.
Data Transformation activities and techniques include:
- Categorical encoding.
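Categorical encoding can be sketched with pandas one-hot encoding, which turns each category into its own binary column; the feature below is hypothetical.

```python
import pandas as pd

# A categorical feature (hypothetical data).
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary indicator column per category.
encoded = pd.get_dummies(df, columns=["color"])
```

Alternatives such as ordinal or label encoding are preferable when the categories have a natural order.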
Implementing Data Preprocessing in Machine Learning
- 1. Getting the dataset
- 2. Import libraries
- 3. Import the dataset
- 4. Treating missing values
- 5. Feature scaling
- 6. Dimensionality reduction
- 7. Splitting the dataset.
- Getting the dataset – Datasets are the major component of machine learning. They can be collected manually, or large datasets can be obtained from websites such as Kaggle.
- Import libraries – Python is the most preferred language for machine learning because of its wide range of libraries. We can import the libraries required for data processing, such as pandas, NumPy, and scikit-learn.
- Import the dataset – The dataset can be imported into the code using the “read_csv()” function of the pandas library.
- Treating missing values – To handle missing values, we can delete the rows containing them or replace them with a statistic such as the median of the other values.
- Feature scaling- It is a method to standardize the independent variables of a dataset within a specific range.
- Splitting the dataset – The dataset for the machine learning model must be split into two separate sets: a training set and a test set. The training set is the subset of the dataset used to train the machine learning model; the test set is the subset used to evaluate it.
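This split can be sketched with scikit-learn's train_test_split; the features and labels here are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels (hypothetical data).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows as the test set; fixing the seed
# makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```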