The quality of a dataset plays a fundamental role in the success of any machine learning project. Poor data quality can significantly hinder the model's ability to learn meaningful patterns, leading to unreliable predictions, biased outcomes, and reduced generalisability.
Missing data is one of the most common data quality issues, and it can bias a model if it is not handled properly. This article covers the following topics:
- Reasons for Missing Values in Datasets
- Understanding the Types of Missing Data (MCAR, MAR, and MNAR)
- How to Identify the Type of Missing Data
Reasons for Missing Values in Datasets
1. Data Collection Issues:
- Surveys and questionnaires may have skipped questions (affecting categorical and ordinal data).
- Sensors and devices may fail to record measurements (affecting numerical data).
2. Human Factors:
- Sensitive questions like income or demographics are often left unanswered (affecting ratio and nominal data).
- Subjective questions like satisfaction ratings may be skipped (affecting ordinal data).
3. Technical Problems:
- Equipment malfunctions can result in missing continuous and interval data (e.g., weather sensors failing to record temperature).
4. Systematic Exclusions:
- Certain groups may systematically omit responses due to cultural or social factors (affecting all types of data).
Understanding the Types of Missing Data
Missing data can be classified into three categories:
- MCAR (Missing Completely at Random)
- MAR (Missing at Random)
- MNAR (Missing Not at Random)
Missing Completely at Random (MCAR)
MCAR occurs when the probability that a value is missing does not depend on any values in the dataset, observed or missing. The missingness is completely random and unrelated to the data itself.
Example: A computer glitch randomly deletes some entries in a weather dataset or fails to record some values. The failures have nothing to do with the weather values being recorded.
Missing at Random (MAR)
MAR occurs when the probability that a value is missing depends on other measured variables, but not on the missing values themselves. Here, the cause of the missingness is captured by variables that were observed. MAR is a broader condition than MCAR: every MCAR dataset is also MAR, but not the other way round.
Example: A weighing scale produces more missing values when placed on a soft surface than on a hard surface. Here the surface type, which is recorded, explains the missingness.
Missing Not at Random (MNAR)
MNAR occurs when the probability that a value is missing depends on the missing value itself or on other variables that were not measured. The data may be missing for reasons that are unknown to us or that cannot be measured. This mechanism can introduce significant bias into the observed data distribution, and it is the most difficult case to handle.
Example: In a study, it was observed that people with higher incomes are less likely to report their income, so most of the missing income values came from high earners. The average income computed from the observed data would therefore be biased downwards.
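The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration under assumed conditions: the `age` and `income` columns, the sample size, and the missingness probabilities are all made up for the example and do not come from any real study. It removes income values in three different ways and compares the mean of the remaining values with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 10_000

# Hypothetical data: age is always observed, and income rises with age.
age = rng.uniform(20, 65, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every row has the same 30% chance of losing its income value.
mcar_mask = rng.random(n) < 0.30

# MAR: the chance of losing income depends only on the observed age column.
mar_prob = np.where(df["age"] > 45, 0.50, 0.10)
mar_mask = rng.random(n) < mar_prob

# MNAR: the chance of losing income depends on the income value itself.
mnar_prob = np.where(df["income"] > df["income"].median(), 0.50, 0.10)
mnar_mask = rng.random(n) < mnar_prob

# Compare the mean of the values that remain with the true mean.
true_mean = df["income"].mean()
observed_means = {
    name: df.loc[~mask, "income"].mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}

print(f"True mean income: {true_mean:,.0f}")
for name, mean in observed_means.items():
    print(f"{name}: observed mean = {mean:,.0f}, bias = {mean - true_mean:+,.0f}")
```

In this setup the MCAR mean should stay close to the true mean, while the MNAR mean should fall below it because high incomes are removed more often. The MAR mean also drifts here, since older people earn more in the simulated data, but that drift is tied to the observed `age` column and can in principle be corrected for, which is not possible in the MNAR case.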
How to Identify the Type of Missing Data
There are multiple ways to determine whether missing data in your dataset belongs to the MCAR, MAR, or MNAR category.
1. Analyze Correlations Between Missingness and Other Variables
In machine learning and data analysis, correlation refers to a statistical measure that quantifies the degree to which two variables are related. It indicates whether and how strongly pairs of variables are associated with each other.
A heatmap of missingness correlations can help identify the type of missing data:
- MCAR: Low or zero correlations in the heatmap suggest MCAR. If the missingness in one variable is not correlated with missingness or observed values in other variables, the data may be MCAR.
- MAR: Sizeable correlations in the heatmap suggest MAR. If the missingness in one variable is correlated with observed values of other variables, the data is likely to be MAR, although correlations alone cannot rule out MNAR.
- MNAR: If missingness is related to the value of the variable itself (e.g., people with high incomes are less likely to report income), it is MNAR. This cannot be detected directly from correlations, because the values in question are missing, and it requires domain knowledge.
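One way to compute the numbers behind such a heatmap is sketched below with pandas. The helper name `missingness_correlations` and the example columns are hypothetical, and the simulated data is only there to make the sketch runnable. Each column with missing values is turned into a 0/1 indicator, which is then correlated with the numeric columns.

```python
import numpy as np
import pandas as pd

def missingness_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Correlate each column's missingness indicator with the numeric columns."""
    # 1 where a value is missing, 0 where it is observed.
    indicators = df.isna().astype(int)
    # Keep only columns where some, but not all, values are missing.
    indicators = indicators.loc[:, indicators.nunique() > 1]
    indicators = indicators.add_suffix("_missing")
    numeric = df.select_dtypes(include="number")
    combined = pd.concat([indicators, numeric], axis=1)
    # Pearson correlation; pandas excludes missing values pair by pair.
    corr = combined.corr()
    # Rows: missingness indicators. Columns: indicators and numeric variables.
    return corr.loc[indicators.columns, :]

# Made-up example data for illustration only.
rng = np.random.default_rng(seed=1)
n = 5_000
age = rng.uniform(20, 65, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, size=n)
height = rng.normal(170, 10, size=n)
df = pd.DataFrame({"age": age, "income": income, "height": height})

# MAR-style missingness: income is dropped more often for older respondents.
df.loc[rng.random(n) < np.where(age > 45, 0.5, 0.1), "income"] = np.nan
# MCAR-style missingness: height is dropped for a uniform 20% of rows.
df.loc[rng.random(n) < 0.2, "height"] = np.nan

corr = missingness_correlations(df)
print(corr.round(2))
```

The resulting table can be passed to a plotting function such as `seaborn.heatmap` to draw the heatmap. Note that the correlation between `income_missing` and `income` itself is undefined (NaN), because income is never observed in the rows where it is missing. This is the practical reason MNAR cannot be read off a correlation table.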
2. Use Statistical Tests
For MCAR: Little's MCAR test can be used to assess whether the data is MCAR. Its null hypothesis is that the data is MCAR, so a non-significant p-value means the data is consistent with MCAR:
- If p-value > 0.05: There is no evidence against MCAR, so the data may be MCAR
- If p-value ≤ 0.05: The data is unlikely to be MCAR (it could be MAR or MNAR)
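The sketch below does not implement Little's test. It shows a simpler, related check that follows the same logic for one pair of columns: Welch's t-test (via `scipy.stats.ttest_ind`) on whether an observed variable has a different mean in rows where another variable is missing. The helper name `missingness_t_test` and the example data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def missingness_t_test(df, target, other, alpha=0.05):
    """Welch's t-test: does `other` differ between rows where `target`
    is missing and rows where it is observed?"""
    is_missing = df[target].isna()
    group_missing = df.loc[is_missing, other].dropna()
    group_observed = df.loc[~is_missing, other].dropna()
    # equal_var=False selects Welch's version, which allows unequal variances.
    t_stat, p_value = stats.ttest_ind(group_missing, group_observed, equal_var=False)
    return {
        "t": float(t_stat),
        "p_value": float(p_value),
        # A small p-value is evidence against MCAR for this pair of columns.
        "evidence_against_mcar": bool(p_value <= alpha),
    }

# Made-up example: income is missing more often for older respondents.
rng = np.random.default_rng(seed=2)
n = 5_000
age = rng.uniform(20, 65, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})
df.loc[rng.random(n) < np.where(age > 45, 0.5, 0.1), "income"] = np.nan

result = missingness_t_test(df, target="income", other="age")
print(result)
```

Unlike Little's test, which considers all variables jointly, this check looks at one pair of columns at a time, so running it across many pairs raises the usual multiple-comparison concerns. As with Little's test, a non-significant result does not prove MCAR, and a significant one cannot distinguish MAR from MNAR.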
3. Visualize Patterns
Create scatter plots with data points color-coded by missingness, or box plots and histograms that compare the distribution of other variables between rows where the value is observed and rows where it is missing:
- MCAR data should show missingness spread evenly across the values of other variables.
- MAR may show some patterns related to observed variables.
- MNAR relates to the missing variable itself, so its pattern usually cannot be confirmed from plots of the observed data alone.
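The comparison plots described above can be sketched with matplotlib as follows. The function name `plot_by_missingness`, the column names, and the simulated data are hypothetical, and the non-interactive `Agg` backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_by_missingness(df, target, other):
    """Compare the distribution of `other` between rows where `target`
    is observed and rows where it is missing."""
    is_missing = df[target].isna()
    observed = df.loc[~is_missing, other].dropna().to_numpy()
    missing = df.loc[is_missing, other].dropna().to_numpy()
    labels = [f"{target} observed", f"{target} missing"]

    fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

    # Box plot: one box per group.
    ax_box.boxplot([observed, missing])
    ax_box.set_xticks([1, 2])
    ax_box.set_xticklabels(labels)
    ax_box.set_ylabel(other)

    # Histograms: normalised so groups of different size are comparable.
    ax_hist.hist(observed, bins=20, alpha=0.6, density=True, label=labels[0])
    ax_hist.hist(missing, bins=20, alpha=0.6, density=True, label=labels[1])
    ax_hist.set_xlabel(other)
    ax_hist.legend()

    fig.tight_layout()
    return fig

# Made-up example: income is missing more often for older respondents.
rng = np.random.default_rng(seed=3)
n = 2_000
age = rng.uniform(20, 65, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})
df.loc[rng.random(n) < np.where(age > 45, 0.5, 0.1), "income"] = np.nan

fig = plot_by_missingness(df, target="income", other="age")
```

Calling `fig.savefig(...)` writes the figure to a file. With this simulated data the "missing" group should sit visibly higher on the age axis, which is the kind of pattern that points towards MAR. Under MCAR the two boxes and the two histograms would be expected to look roughly alike.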
4. Use Domain Knowledge
Domain knowledge is critical for MNAR. MNAR arises when missingness depends on the unobserved values of the variable itself. This cannot be detected statistically from the observed data alone; it requires an understanding of the dataset's context.
By combining statistical tests, visualisations, and domain expertise, you can classify the type of missing data and make informed decisions for handling it effectively!
The next article will be a detailed review of the different techniques that can be employed for handling the different types of missing data.