Data Preprocessing and Feature Engineering are highly underrated steps in a machine learning pipeline. Their importance is often overshadowed by the choice of model used for training, yet the secret ingredient to a high-performing model lies in the quality of the data fed into it.
While often used interchangeably, Data Preprocessing and Feature Engineering serve distinct purposes. Data Preprocessing ensures data quality, while Feature Engineering enhances a model’s predictive value.
Data Preprocessing
Data Preprocessing is tasked with tidying up the raw, messy dataset to make it suitable for feature engineering.
Key Tasks
- Handling Missing Data — There are multiple ways to handle missing data, from dropping incomplete records to imputing plausible values. Please refer to my previous blogs to learn how to handle missing data in detail; a minimal sketch also follows this list.
- Removing duplicates and outliers — Duplicate records can occur in a dataset for a myriad of reasons, such as data entry errors and system glitches, and they can lead to biased results. Before removing repeated or redundant entries, it is essential to verify that they are not valid, intentional additions; this may require some domain knowledge. Outliers are data points that deviate significantly from the rest of the dataset. They can occur due to measurement errors, natural variability, or rare events, and they can skew statistical analyses and negatively impact machine learning models. Again, it is vital to analyse the cause of outliers before deciding to remove them, as some carry valid information about the dataset. If they are irrelevant, remove them, but avoid discarding so many that you significantly shrink the dataset or bias the results. (A short sketch of both steps follows this list.)
- Scaling and Normalisation — These techniques standardise the range of numerical features so that they are on a similar scale, ensuring that machine learning models can learn effectively without being biased toward features with larger magnitudes or different units. Scaling adjusts the range of features so that they are comparable in magnitude. Normalisation rescales features to fit within a specific range, typically between 0 and 1; this variant is called min-max scaling. Standardisation is another method, which rescales data to a mean of 0 and a standard deviation of 1. (Both are illustrated in a sketch after this list.)
- Encoding categorical variables — This is a crucial preprocessing step in machine learning. Since most machine learning algorithms work with numerical data, categorical variables (non-numeric data) must be transformed into numerical representations before modelling. Properly encoded data improves model performance while minimising biases and computational inefficiencies. Categorical variables fall into two classes: nominal and ordinal. Nominal variables have no inherent order, for example, colours such as “red”, “green”, and “blue”, or car brands such as “Volkswagen”, “Benz”, and “BMW”. Ordinal variables have a meaningful order or ranking, for example, education level (“High School”, “Bachelor’s”, “Master’s”) or ratings (“Poor”, “Average”, “Good”, “Excellent”). The choice of encoding technique depends on the category of the variable: dummy encoding and one-hot encoding suit nominal categories, while label encoding and ordinal encoding suit ordinal categories. Many more techniques exist that are not discussed in this article. (A sketch of one-hot and ordinal encoding follows this list.)
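As a minimal sketch of handling missing data, using pandas and scikit-learn (the `age` and `income` columns here are hypothetical toy data):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "income": [50000, 62000, None, 58000],
})

# Option 1: drop rows that contain any missing value
df_dropped = df.dropna()

# Option 2: impute missing values with each column's median
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```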
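Duplicates and simple numeric outliers can be handled as sketched below; this assumes pandas and uses the common 1.5 × IQR rule, which is only one heuristic among many:

```python
import pandas as pd

# Toy data: the value 300 is a likely outlier
df = pd.DataFrame({"value": [10, 12, 11, 11, 13, 300]})

# Remove exact duplicate rows (after confirming the repeats are not valid entries)
df = df.drop_duplicates()

# Keep only points within 1.5 * IQR of the interquartile range
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
df_no_outliers = df[df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```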
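Min-max scaling and standardisation are one-liners in scikit-learn; the two-column array below is invented purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Min-max scaling: rescales each feature to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Standardisation: zero mean and unit standard deviation per feature
X_standard = StandardScaler().fit_transform(X)
```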
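Finally, a sketch of encoding one nominal and one ordinal column with pandas and scikit-learn; the example columns mirror the ones mentioned above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "colour": ["red", "green", "blue"],       # nominal: no inherent order
    "rating": ["Poor", "Good", "Excellent"],  # ordinal: ranked categories
})

# One-hot encode the nominal variable
df = pd.get_dummies(df, columns=["colour"])

# Ordinal-encode the ranked variable with an explicit category order
order = [["Poor", "Average", "Good", "Excellent"]]
df["rating"] = OrdinalEncoder(categories=order).fit_transform(df[["rating"]]).ravel()
```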
Feature Engineering
Feature Engineering is the step that follows Data Preprocessing. This step creates new features or transforms existing ones to enhance the model’s ability to detect patterns in the dataset and make predictions.
Key Tasks
- Feature Creation — As the name suggests, this process creates new features from existing ones in the dataset, in order to capture additional relationships or underlying patterns. For example, extracting components such as the year, month, day, or day of the week from a date column (sketched after this list).
- Feature Transformation — This process modifies existing features to make them more suitable for modelling. It is often used to handle skewed data, outliers, or non-linear relationships: transformations help normalise data distributions and linearise relationships between variables, which many machine learning algorithms prefer. Common options include the Logarithmic, Square Root, and Box-Cox transformations (see the sketch after this list).
- Feature Extraction — This process creates new features by transforming or combining existing ones into a lower-dimensional space while retaining most of the relevant information. It is used when a high-dimensional dataset is computationally expensive to work with and you want to reduce its dimensionality while preserving as much information as possible. Common techniques include Principal Component Analysis (PCA), word embeddings for text data, and the Fourier transform (PCA is sketched after this list).
- Feature Selection — This process identifies and retains only the most relevant features while removing irrelevant or redundant ones, which reduces noise in the dataset, mitigates overfitting, speeds up training, and simplifies the model without sacrificing accuracy. It is used when your dataset is noisy and you believe only a subset of the features is helpful for the task. Standard techniques can be classified as Wrapper, Filter, and Embedded methods (a filter-method sketch follows this list).
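A minimal sketch of the date example from the list above, using pandas (the dates are made up):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2024-01-15", "2024-06-03"])})

# Derive new features from the raw date column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek  # Monday = 0
```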
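The three transformations named above can be sketched with NumPy and SciPy; the skewed values are invented for illustration:

```python
import numpy as np
from scipy import stats

# Right-skewed toy data; Box-Cox requires strictly positive values
x = np.array([1.0, 2.0, 10.0, 100.0, 1000.0])

x_log = np.log1p(x)              # logarithmic transformation, log(1 + x)
x_sqrt = np.sqrt(x)              # square root transformation
x_boxcox, lam = stats.boxcox(x)  # Box-Cox; lam is the fitted lambda
```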
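For feature extraction, here is a PCA sketch with scikit-learn; the random matrix stands in for a real high-dimensional dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # stand-in for a 10-dimensional dataset

# Project the data onto the 3 directions of highest variance
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance each component retains
```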
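And a sketch of a simple filter method for feature selection with scikit-learn; the synthetic dataset deliberately plants only a few informative features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 10 features, only 3 informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter method: keep the 3 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the retained features
```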
Conclusion
Data Preprocessing improves the quality of your dataset. As the saying goes, “Garbage in, garbage out.” Feature Engineering focuses on improving model performance by providing more meaningful inputs. As Andrew Ng famously said, “Applied machine learning is basically feature engineering.” Both are indispensable steps in building robust machine learning models: while preprocessing ensures clean, consistent data, feature engineering transforms that data into meaningful inputs for your model. Understanding their differences and roles allows you to optimise your workflow and build better-performing models.