Mastering Missing Data in Your Datasets — Part 2

In the previous article, we explored the different types of missing data, and how they can be identified. In this article we look at the different techniques for handling the missing data, and their use cases.

There are 3 broad techniques for handling the missing data:

1. Dropping rows or columns with missing records
2. Imputation by statistical methods like mean, mode, or median
3. Creating a separate model to predict the missing values

Dropping Rows or Columns

Removing missing values is rarely the first choice, especially when preserving data is a priority. In some scenarios, however, dropping the affected rows or columns is the most practical option.

Dropping Columns

Think about whether the particular column or feature is critical for the task at hand. When a feature has a large share of its values missing, it may not be a good candidate for analysis, and one choice is to drop it. Rules of thumb for the cutoff vary, roughly from 40% to 70%, with more than 50% missing being a common one. Dropping such a column simplifies the dataset, but it can sacrifice a potentially critical feature. For example, eliminating a "customer feedback" column missing 60% of responses simplifies analysis but may obscure key insights.

Applicability: MCAR/MAR/MNAR: Use sparingly, only when a variable's missingness renders it unusable.
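
As a minimal sketch of the column-dropping idea, the snippet below measures the share of missing values per column and drops those above a cutoff. The toy DataFrame, the column names, and the 0.5 threshold are assumptions made for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "feedback" is missing in 3 of 5 rows (60%).
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "spend": [120.0, 85.5, 60.0, 210.0, 99.0],
    "feedback": [np.nan, "good", np.nan, np.nan, "poor"],
})

# Fraction of missing values in each column.
missing_share = df.isna().mean()

# Assumed cutoff for illustration; choose it per project.
threshold = 0.5
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

# Drop only the columns whose missingness exceeds the cutoff.
df_reduced = df.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Inspecting missing_share before dropping anything makes it easier to spot a column that is sparse but still important for the task.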

Dropping Rows

Sometimes, the number of records with missing data is very small compared to the size of the entire dataset. In such cases, removing those records may not affect the task at hand. This is known as listwise deletion or complete case analysis. While computationally simple, it risks substantial data loss and biased estimates unless the data is MCAR. For example, deleting 25% of the rows in a survey dataset can reduce statistical power and skew demographic distributions. Deleting records is only effective if the remaining data still represents the original dataset's distribution.

Applicability: MCAR: Safe if missingness is minimal (<5%). MAR/MNAR: Avoid due to bias risks.
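
A rough sketch of listwise deletion with pandas follows. The data is a made-up toy set, and its missing share is deliberately larger than the 5% guideline so that the effect is visible. Comparing summary statistics before and after deletion is one simple, assumed way to check whether the remaining rows still resemble the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows out of ten.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29, 51, 47, 33, 60],
    "income": [40, 52, 48, 75, 61, 45, np.nan, 80, 55, 90],
})

# Share of rows that have at least one missing value.
incomplete_share = df.isna().any(axis=1).mean()

# Listwise deletion: keep only the complete rows.
df_complete = df.dropna()

# Rough check: column means before and after deletion.
comparison = pd.DataFrame({
    "before": df.mean(),
    "after": df_complete.mean(),
})
print(incomplete_share)
print(comparison)
```

If the "before" and "after" columns differ noticeably, the deleted rows were probably not a random subset and deletion may be introducing bias.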

Statistical Imputation

Mean/Median/Mode Imputation

Replacing missing numerical values with the column's mean or median, or missing categorical values with the mode, is computationally efficient. This method is also known as univariate imputation, because each missing value is estimated using only the observed values of the same column. However, it shrinks the column's variance and weakens its correlations with other variables, since every gap receives the same value. For skewed data such as income, the median represents central tendency better than the mean, which is pulled toward extreme values.

Mean: the average of the column's observed values. Median: the middle value of the column after sorting in ascending order. Mode: the most frequent value of the column. In each case, the chosen statistic replaces every missing value in that column.

Applicability: MCAR: Optimal for small datasets with low missingness. MAR: Limited utility unless combined with multivariate methods. MNAR: Not recommended due to systematic bias.
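
The three univariate fills can be sketched with pandas as below. The column names and values are invented for illustration. The income column includes one very large value, so the difference between the mean and the median is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: income is skewed by a single large value.
df = pd.DataFrame({
    "income": [30.0, 35.0, 40.0, np.nan, 400.0],
    "segment": ["A", "B", None, "A", "A"],
})

# Numerical column: fill with the mean or with the median.
mean_filled = df["income"].fillna(df["income"].mean())
median_filled = df["income"].fillna(df["income"].median())

# Categorical column: fill with the most frequent value.
mode_filled = df["segment"].fillna(df["segment"].mode().iloc[0])

# Variance of the observed values vs. variance after filling.
print(df["income"].var(), mean_filled.var(), median_filled.var())
```

In this toy case the mean fill inserts 126.25 while the median fill inserts 37.5, and both filled columns have a smaller variance than the observed values alone, which is the distortion described above.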

Creating a Separate Model to Predict Missing Values

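Here the column with missing values is treated as the target, and the remaining columns serve as input features. A model is trained on the rows where the target is observed and is then used to predict the missing entries. This approach is more computationally expensive and time-consuming than statistical imputation.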

Regression Imputation

Regression models predict missing values from their relationship to other variables in the dataset, for example predicting missing house prices from square footage and location. While effective for MAR data, regression imputation understates uncertainty because it treats imputed values as exact.

Advantages: Preserves connection between the variables, reduces bias compared to ignoring missing values, and integrates seamlessly with multiple imputation frameworks.

Drawbacks: Assumes linear relationships — if the real relationship is curved or complex, the predictions may be wrong. Imputed values often fit too neatly into the pattern, reducing natural variability.

Applicability: MAR: Highly effective with well-specified models. MCAR: Overkill compared to simpler methods. MNAR: Fails if the missingness relates to the missing value itself.
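
A minimal sketch of regression imputation using scikit-learn's LinearRegression is shown below. The house-price data, the column names, and the single predictor are assumptions for illustration. The observed prices are exactly linear in square footage, so the imputed values land on the line with no scatter.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: price is missing for two houses.
df = pd.DataFrame({
    "sqft": [800, 1000, 1200, 1500, 1800, 2000],
    "price": [160.0, 200.0, np.nan, 300.0, np.nan, 400.0],
})

observed = df["price"].notna()

# Fit only on rows where the target is observed.
model = LinearRegression()
model.fit(df.loc[observed, ["sqft"]], df.loc[observed, "price"])

# Predict the missing targets and write them into a copy.
df_imputed = df.copy()
df_imputed.loc[~observed, "price"] = model.predict(df.loc[~observed, ["sqft"]])
print(df_imputed)
```

Because every imputed price sits exactly on the fitted line, the filled column looks more regular than real data would, which illustrates the reduced variability listed under the drawbacks.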

K-Nearest Neighbors (KNN) Imputation

KNN imputes a missing value from the k most similar records, typically by averaging those neighbors' observed values for that feature. For instance, a patient's missing cholesterol level could be inferred from similar patients' profiles. KNN preserves local structures but scales poorly with large datasets.

Advantages: Can handle both numbers and categories. Unlike methods that assume simple relationships, KNN can handle messy or curved patterns in data, making it useful for real-world scenarios.

Drawbacks: Computationally intensive for large datasets. Sensitive to irrelevant features and requires proper feature scaling. Choosing the right number of neighbors (k) is tricky — too few picks up noise, too many oversimplifies.

Applicability: MAR: Robust for datasets with strong feature correlations. MCAR: Viable but computationally intensive. MNAR: Limited without external data.
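
The sketch below uses scikit-learn's KNNImputer together with StandardScaler, since KNN is sensitive to feature scale. The patient-style numbers, the column meaning (age, BMI, cholesterol), and k = 2 are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: age, bmi, cholesterol (last row lacks cholesterol).
X = np.array([
    [30.0, 22.0, 180.0],
    [32.0, 23.0, 190.0],
    [60.0, 30.0, 250.0],
    [62.0, 31.0, 260.0],
    [31.0, 22.5, np.nan],
])

# Scale features so that no single column dominates the distance.
# StandardScaler ignores NaNs when fitting and keeps them in the output.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Impute from the 2 nearest neighbors, then undo the scaling.
imputer = KNNImputer(n_neighbors=2)
X_imputed = scaler.inverse_transform(imputer.fit_transform(X_scaled))
print(X_imputed[4])
```

The last patient is closest to the first two rows, so the imputed cholesterol is the average of their values. Trying a few values of n_neighbors on held-out data is one common way to pick k.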

Multiple Imputation

Multiple imputation (MI) generates several plausible datasets by varying imputed values, then pools results to account for uncertainty. MI is ideal for MAR data, as seen in clinical trials where patient dropout correlates with observed covariates.

Steps: (1) Imputation: Create m datasets using stochastic regression. (2) Analysis: Perform analyses on each dataset. (3) Pooling: Combine estimates using Rubin's rules.

Advantages: Produces more robust answers by combining results from different imputed datasets. Uses patterns in observed data to guess missing values, taking relationships between variables into account.

Drawbacks: More complex to perform, requires more computing power, and accuracy depends on model assumptions. Combining results requires careful methodology.

Applicability: MAR: Gold standard for valid inferences. MNAR: Requires specialized models incorporating missingness mechanisms.
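
The three steps can be sketched in plain NumPy as follows. This is a simplified illustration under stated assumptions, not a full MI implementation: the data is simulated, the analysis is just the mean of y, and the helper impute_once is hypothetical. It refits the regression on a bootstrap sample for each imputation as a rough stand-in for drawing the model parameters, which dedicated MI libraries handle more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y depends on x, and y is more often missing when x > 0 (MAR).
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)
missing = rng.random(n) < np.where(x > 0, 0.5, 0.1)
y_obs = np.where(missing, np.nan, y)

def impute_once(x, y_obs, rng):
    """One stochastic regression imputation, fitted on a bootstrap sample."""
    obs = ~np.isnan(y_obs)
    idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
    slope, intercept = np.polyfit(x[idx], y_obs[idx], deg=1)
    resid_sd = np.std(y_obs[idx] - (intercept + slope * x[idx]), ddof=2)
    filled = y_obs.copy()
    # Prediction plus random noise, so imputed values keep natural scatter.
    noise = rng.normal(scale=resid_sd, size=(~obs).sum())
    filled[~obs] = intercept + slope * x[~obs] + noise
    return filled

# Steps 1 and 2: create m completed datasets and analyse each one.
m = 20
estimates, variances = [], []
for _ in range(m):
    completed = impute_once(x, y_obs, rng)
    estimates.append(completed.mean())                        # point estimate
    variances.append(completed.var(ddof=1) / len(completed))  # squared std. error

# Step 3: pool with Rubin's rules.
q_bar = np.mean(estimates)                  # pooled estimate
within = np.mean(variances)                 # within-imputation variance
between = np.var(estimates, ddof=1)         # between-imputation variance
total_var = within + (1 + 1 / m) * between  # total variance
pooled_se = np.sqrt(total_var)
print(q_bar, pooled_se)
```

The between-imputation term is what single imputation leaves out: it widens the standard error to reflect that the missing values are not actually known.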

Machine Learning Models

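Some tree-based implementations handle missing data internally: CART-style decision trees can use surrogate splits, and gradient-boosting libraries such as XGBoost learn a default branch direction for missing values at each split. Neural networks, by contrast, generally require imputation beforehand. For example, XGBoost's built-in handling of missing values simplifies preprocessing for MAR data.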

Advantages: Can find complicated relationships in the data, scale well with large amounts of data, and handle mixed data types. Models can be customized to suit specific datasets and problems.

Drawbacks: Require enough data to learn patterns accurately. Training is time- and compute-intensive. Prone to overfitting and often act as "black boxes" without easy interpretability.

Applicability: MAR/MCAR: Effective with robust algorithms. MNAR: Limited without explicit missingness modeling. Use when you have a large dataset with complex patterns that simpler methods can't handle.