In the previous article, we explored the different types of missing data and how they can be identified. In this article, we look at the techniques for handling missing data and their use cases.
There are three broad techniques for handling missing data:
- Dropping rows or columns with missing records
- Imputation by statistical methods like mean, mode, or median
- Creating a separate model to predict the missing values
We will discuss each of these techniques and its use cases in detail.
Dropping rows or columns
Removing missing values is rarely the best approach, especially when preserving data is a priority. In certain scenarios, however, removal is the most practical option.
Dropping Columns
Think about whether the particular column or feature is critical for the task at hand. When a column has a high share of missing data (thresholds of roughly 40–70% are common, with 50% a frequent rule of thumb), it may not be a good candidate for analysis, so one choice is to drop it. Dropping such variables keeps the rest of the dataset clean but sacrifices potentially critical features. For example, eliminating a “customer feedback” column missing 60% of responses simplifies analysis but may obscure key insights.
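As a minimal pandas sketch (assuming a DataFrame named df, with an illustrative 50% cutoff, not a fixed rule), columns above a chosen missingness threshold can be dropped like this:

```python
import pandas as pd

# Toy frame standing in for a real dataset
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29],
    "customer_feedback": [None, "good", None, None, None],  # 80% missing
})

threshold = 0.5                              # illustrative cutoff
missing_fraction = df.isna().mean()          # fraction of NaNs per column
cols_to_drop = missing_fraction[missing_fraction > threshold].index
df_reduced = df.drop(columns=cols_to_drop)   # keeps only "age"
```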
Applicability:
- MCAR/MAR/MNAR: Use sparingly, only when a variable’s missingness renders it unusable
Dropping Rows
Sometimes, the number of records with missing data is very small compared to the size of the entire dataset. In such cases, removing those records may not affect the task at hand. This is known as listwise deletion or complete-case analysis. While computationally simple, it risks substantial data loss and biased estimates unless the data are MCAR. For example, deleting 25% of rows in a survey dataset can reduce statistical power and skew demographic distributions. Deletion is only effective if the remaining data still represent the original dataset’s distribution.
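A minimal sketch of listwise deletion with pandas (df is a hypothetical DataFrame):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [50_000, None, 60_000, 75_000],
})

# Listwise deletion: keep only the rows with no missing values
complete_cases = df.dropna()
print(f"Kept {len(complete_cases)} of {len(df)} rows")
```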
Applicability:
- MCAR: Safe if missingness is minimal (<5%)
- MAR/MNAR: Avoid due to bias risks
Statistical Imputation
Mean/Median/Mode Imputation
Replacing missing numerical values with the column’s mean or median, and missing categorical values with the mode, is computationally efficient. This method is also known as univariate imputation: missing values are estimated using only the column they belong to. However, it distorts variance and correlations. For skewed income data, median imputation preserves central tendency better than the mean.
- Mean: the average of the column replaces missing values
- Median: after sorting the column in ascending order, the middle value replaces missing values
- Mode: the most frequent value in the column replaces missing values
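A minimal sketch of univariate imputation with pandas (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [40_000, 55_000, None, 48_000, None],  # skewed numeric column
    "segment": ["a", None, "b", "a", "a"],           # categorical column
})

# Median for the skewed numeric column, mode for the categorical one
df["income"] = df["income"].fillna(df["income"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])
```

scikit-learn’s SimpleImputer offers the same mean/median/most-frequent strategies when imputation needs to live inside a modelling pipeline.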
Applicability:
- MCAR: Optimal for small datasets with low missingness.
- MAR: Limited utility unless combined with multivariate methods.
- MNAR: Not recommended due to systemic bias.
Creating a Separate Model to Predict Missing Values
The column with missing values is treated as the target, and the remaining columns are used as input features; a model trained on the complete rows then predicts the missing entries. This method is computationally expensive and time-consuming.
Regression Imputation
Regression models predict missing values using their relationships to other variables in the dataset, for example, predicting missing house prices from square footage and location. While effective for MAR data, this approach underestimates uncertainty by treating imputed values as exact.
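A minimal sketch of regression imputation on the house-price example (the data and column names are invented):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "sqft": [1200, 1500, 900, 2000, 1100],
    "location_score": [7, 8, 5, 9, 6],
    "price": [250_000, np.nan, 180_000, np.nan, 210_000],
})

known = df[df["price"].notna()]   # complete rows used for training
missing = df["price"].isna()

# Fit on the complete rows, then predict the missing target values
model = LinearRegression().fit(known[["sqft", "location_score"]], known["price"])
df.loc[missing, "price"] = model.predict(df.loc[missing, ["sqft", "location_score"]])
```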
Advantages:
- Preserves connection between the variables
- Reduces bias compared to ignoring missing values
- Integrates seamlessly with multiple imputation frameworks
Drawbacks:
- Assumes linear relationships. If the real relationship is curved or complex, the predictions may be wrong
- When using regression alone, imputed values often fit too neatly into the pattern, reducing natural variability. For example, all predicted incomes might cluster tightly around the trend line, ignoring real-world randomness
Applicability:
- MAR: Highly effective with well-specified models. Works when the reason for missingness is explained by other variables (e.g., income missing for older people, but age is recorded).
- MCAR: Overkill compared to simpler methods.
- MNAR: Fails if the missingness relates to the missing value itself (e.g., people with high incomes skip reporting it); more advanced methods are needed
K-Nearest Neighbors (KNN) Imputation
KNN imputes missing values by averaging the nearest neighbors’ observed values. For instance, a patient’s missing cholesterol level could be inferred from similar patients’ profiles. KNN preserves local structures but scales poorly with large datasets.
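A minimal sketch using scikit-learn’s KNNImputer on the cholesterol example (the values are invented, and in real use the features should be scaled first):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows are patients: [age, BMI, cholesterol]; NaN marks a missing reading
X = np.array([
    [45, 27.0, 210.0],
    [50, 29.5, np.nan],   # cholesterol to be imputed
    [47, 28.1, 205.0],
    [60, 31.0, 240.0],
])

# Each missing entry becomes the average of its k nearest neighbours,
# where distance is computed on the observed features
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```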
Advantages:
- Can handle both numbers and categories
- Unlike methods that assume simple relationships (like straight-line trends), KNN can handle messy or curved patterns in data, making it useful for real-world scenarios
Drawbacks:
- Computationally intensive for large datasets. KNN requires checking every data point to find neighbors. For large datasets (like millions of rows), this becomes very slow and uses a lot of computer memory.
- If some features (like “eye color” in a health study) don’t relate to the missing value, KNN’s guesses can be wrong. It also struggles if data isn’t scaled properly (e.g., mixing income in dollars and age in years).
- Choosing the right number of neighbors (k) is tricky. Too few neighbors (e.g., k=3) picks up noise or outliers. Too many neighbors (e.g., k=100) oversimplifies and ignores local patterns.
Applicability:
- MAR: Robust for datasets with strong feature correlations
- MCAR: Viable but computationally intensive
- MNAR: Limited without external data
Multiple Imputation
Multiple imputation (MI) generates several plausible datasets by varying imputed values, then pools results to account for uncertainty. MI is ideal for MAR data, as seen in clinical trials where patient dropout correlates with observed covariates.
Steps:
- Imputation: Create m datasets using stochastic regression
- Analysis: Perform analyses on each dataset
- Pooling: Combine estimates using Rubin’s rules
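The steps above can be sketched with scikit-learn’s IterativeImputer. This is a simplified illustration: the pooling here averages point estimates only, whereas full Rubin’s rules also combine within- and between-imputation variance.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: column 1 is complete, column 2 has missing entries
X = np.array([[1.0, 2.1], [2.0, np.nan], [3.0, 6.0], [4.0, np.nan], [5.0, 10.2]])

m = 5                     # number of imputed datasets
estimates = []
for seed in range(m):
    # sample_posterior=True draws imputations stochastically, so each of
    # the m datasets differs, capturing imputation uncertainty
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    X_imp = imputer.fit_transform(X)
    estimates.append(X_imp[:, 1].mean())   # analysis step on each dataset

pooled_estimate = np.mean(estimates)       # pooling step
```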
Advantages:
- Combining results from the different imputed datasets gives more robust estimates in studies or models, especially when the data is missing at random.
- The procedure uses patterns in the observed data to estimate the missing values, so relationships between variables are taken into account.
Drawbacks:
- This process is more complex to perform
- Because it works with multiple versions of the dataset, it takes longer to run and needs more computing power, especially for large datasets.
- The accuracy of the imputed values depends on the models used. If the assumptions made by the models are not correct, the results might be less reliable.
- Since many datasets are created, you need to carefully combine the results. This extra step requires a good understanding of the process to avoid mistakes.
Applicability:
- MAR: Gold standard for valid inferences
- MNAR: Requires specialized models incorporating missingness mechanisms
Machine Learning Models
Some tree-based algorithms handle missing data internally, via surrogate splits (CART-style trees) or learned default split directions (gradient boosting), while neural networks require pre-imputation. For example, XGBoost’s built-in handling of missing values simplifies preprocessing for MAR data.
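As a minimal sketch, scikit-learn’s HistGradientBoostingRegressor (a gradient-boosted tree model similar in spirit to XGBoost) accepts NaN inputs directly, so no pre-imputation step is needed; the data below is invented:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy features with missing entries; the trees learn which branch
# missing values should follow, so NaNs are handled natively
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [4.0, 5.0], [5.0, 6.5]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# min_samples_leaf lowered only because this toy dataset is tiny
model = HistGradientBoostingRegressor(min_samples_leaf=1).fit(X, y)
print(model.predict(np.array([[3.0, np.nan]])))
```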
Advantages:
- Machine learning models (like decision trees or neural networks) can find complicated relationships in the data, even if the patterns are not obvious.
- Machine learning models perform better as the amount of data increases. They can learn from lots of examples to make more accurate predictions.
- These models can work with numbers, categories (like “yes” or “no”), or even mixed types of data.
- You can choose or tweak a machine learning model to suit your dataset and problem. For instance, you might use Random Forests for structured data or neural networks for more complex datasets.
Drawbacks:
- Machine learning models need enough data to learn patterns accurately. If your dataset is small, the predictions might not be reliable.
- Training machine learning models takes time and computing power, especially for large datasets or complex models. This might be challenging if resources are limited.
- Sometimes, the model learns the data too well and picks up noise or random details instead of general patterns. This can lead to overfitting.
- Setting up and training machine learning models requires knowledge about algorithms, tuning parameters, and evaluating results. It’s not as straightforward as simpler methods like mean imputation.
- Many machine learning models act like “black boxes,” meaning they give predictions without explaining how they arrived at them. This can make it harder to trust or interpret the results.
Applicability:
- MAR/MCAR: Effective with robust algorithms
- MNAR: Limited without explicit missingness modeling
- Use them when you have a large dataset with complex patterns that simpler methods can’t handle. Avoid them if your dataset is small, you lack computing resources, or you need easy-to-understand results.