The quality of a dataset plays a fundamental role in the success of any machine learning project. Poor data quality can significantly hinder the model's ability to learn meaningful patterns, leading to unreliable predictions, biased outcomes, and reduced generalisability.

Missing data is one of the most common data quality issues, and it can lead to biased models if not handled properly. This article covers the following topics:

  1. Reasons for Missing Values in Datasets
  2. Understanding the Types of Missing Data (MCAR, MAR, and MNAR)
  3. How to Identify the Type of Missing Data

Reasons for Missing Values in Datasets

1. Data Collection Issues:

2. Human Factors:

3. Technical Problems:

4. Systematic Exclusions:

Understanding the Types of Missing Data

The missing data can be classified into three categories:

  1. MCAR (Missing Completely at Random)
  2. MAR (Missing at Random)
  3. MNAR (Missing Not at Random)

Missing Completely at Random (MCAR)

MCAR occurs when the missingness does not depend on any values in the dataset, observed or missing. The probability that a value is missing is the same for every record and is unrelated to the data itself.

Example: A computer glitch randomly deletes some entries in a weather dataset or fails to record some values. The failure has nothing to do with the weather readings themselves.

Missing at Random (MAR)

MAR occurs when the missingness depends on other observed variables but not on the missing values themselves. In other words, once the observed variables are taken into account, the probability of a value being missing no longer depends on the value itself. MCAR is a special case of MAR, so MAR is the more general assumption.

Example: A weighing scale produces more missing values when placed on a soft surface than on a hard surface. Here the surface type is recorded, and it explains the missingness.

Missing Not at Random (MNAR)

MNAR occurs when the missingness depends on unobserved data: the missing values themselves or variables that were never measured. The data may be missing for reasons unknown to us or reasons that cannot be measured. Because the observed values are then a systematically skewed sample, MNAR can introduce significant bias, and it is the most complex case to handle.

Example: In a study, it was observed that people with higher incomes are less likely to report their incomes. Hence most of the missing income data came from people with higher incomes. As a result, the average income computed from the observed data underestimates the true average.
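To make the difference between the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the synthetic `age` and `income` variables, the missingness probabilities, and the helper `stratified_mean` are all hypothetical and not taken from any real study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data (hypothetical): age is always observed, income has gaps.
age = rng.uniform(20, 65, size=n)
income = 20_000 + 800 * (age - 20) + rng.normal(0, 8_000, size=n)
older = age > 45

# MCAR: every income has the same 30% chance of being missing.
mask_mcar = rng.random(n) < 0.3

# MAR: the chance depends only on age, which is fully observed.
mask_mar = rng.random(n) < np.where(older, 0.5, 0.1)

# MNAR: the chance depends on the income value itself.
mask_mnar = rng.random(n) < np.where(income > np.median(income), 0.5, 0.1)


def stratified_mean(values, is_missing, group):
    """Average the observed mean of each group, weighted by the group's full size."""
    total = 0.0
    for g in np.unique(group):
        in_group = group == g
        total += values[in_group & ~is_missing].mean() * in_group.sum()
    return total / values.size


true_mean = income.mean()
mean_mcar = income[~mask_mcar].mean()
mean_mar_naive = income[~mask_mar].mean()
mean_mar_adjusted = stratified_mean(income, mask_mar, older)
mean_mnar = income[~mask_mnar].mean()
mean_mnar_adjusted = stratified_mean(income, mask_mnar, older)

print(f"true mean               : {true_mean:,.0f}")
print(f"MCAR, observed only     : {mean_mcar:,.0f}")
print(f"MAR, observed only      : {mean_mar_naive:,.0f}")
print(f"MAR, adjusted for age   : {mean_mar_adjusted:,.0f}")
print(f"MNAR, observed only     : {mean_mnar:,.0f}")
print(f"MNAR, adjusted for age  : {mean_mnar_adjusted:,.0f}")
```

In this setup the MCAR estimate should stay close to the true mean. The naive MAR estimate is biased, but conditioning on the observed variable (age group) recovers the true mean. Under MNAR the same adjustment does not remove the bias, because the cause of the missingness is not in the observed data.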

How to Identify the Type of Missing Data

There are multiple ways to determine whether the missing data in your dataset belongs to the MCAR, MAR, or MNAR category.

1. Analyze Correlations in Missingness

In machine learning and data analysis, correlation refers to a statistical measure that quantifies the degree to which two variables are related. It indicates whether and how strongly pairs of variables are associated with each other.

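A heatmap of missingness correlations can help identify the type of missing data. If the missingness indicator of one column correlates with the observed values of other columns, the data are unlikely to be MCAR, and MAR becomes a plausible explanation.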
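The heatmap idea above could be sketched as follows, assuming pandas and matplotlib are available. The helper names `missingness_correlation` and `plot_heatmap`, the output file name, and the synthetic `demo` frame (modelled loosely on the weighing-scale example) are hypothetical choices for illustration only.

```python
import numpy as np
import pandas as pd


def missingness_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Correlate each column's missingness indicator with the numeric columns."""
    indicators = df.isna().astype(float)
    # Keep only columns where some, but not all, values are missing.
    rate = indicators.mean()
    indicators = indicators.loc[:, (rate > 0) & (rate < 1)].add_suffix("_missing")
    values = df.select_dtypes(include="number")
    combined = pd.concat([indicators, values], axis=1)
    # Rows: missingness indicators. Columns: indicators and observed values.
    return combined.corr().loc[indicators.columns, :]


def plot_heatmap(corr: pd.DataFrame, path: str = "missingness_heatmap.png") -> None:
    """Save the correlation matrix as a heatmap image."""
    # Imported here so the correlation helper works without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(corr.shape[1]))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(corr.shape[0]))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax, label="correlation")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


# Synthetic demo data: weight goes missing more often on a soft surface.
rng = np.random.default_rng(1)
n = 2_000
demo = pd.DataFrame({
    "surface_soft": rng.integers(0, 2, size=n).astype(float),
    "temperature": rng.normal(20, 5, size=n),
    "weight": rng.normal(70, 10, size=n),
})
drop = rng.random(n) < np.where(demo["surface_soft"] == 1, 0.4, 0.05)
demo.loc[drop, "weight"] = np.nan

corr = missingness_correlation(demo)
print(corr.round(2))
```

Calling `plot_heatmap(corr)` writes the figure to disk. The correlation between `weight_missing` and `weight` itself comes out as NaN, since the value is never observed when the indicator is 1; this is also why correlations alone cannot reveal MNAR.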

2. Use Statistical Tests

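For MCAR: Little's MCAR test can be used to assess whether the data are MCAR. Its null hypothesis is that the data are MCAR, so a significant p-value is evidence against MCAR, while a non-significant p-value means the data are consistent with MCAR. It does not prove that they are.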
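The statistic behind Little's test can be sketched in a few lines, assuming pandas, NumPy, and SciPy are available. This is a simplified approximation, not a reference implementation: the full test uses maximum-likelihood (EM) estimates of the mean and covariance, whereas this sketch substitutes available-case means and pairwise covariances, so the p-values are only approximate. The function name `little_mcar_sketch` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def little_mcar_sketch(df: pd.DataFrame):
    """Approximate Little's MCAR test; returns (statistic, dof, p_value)."""
    data = df.select_dtypes(include="number")
    mu = data.mean()    # available-case means (stand-in for EM estimates)
    sigma = data.cov()  # pairwise covariances (stand-in for EM estimates)

    # Encode each row's missingness pattern, e.g. "010" = second column missing.
    pattern = data.isna().apply(
        lambda row: "".join("1" if m else "0" for m in row), axis=1
    )
    if pattern.nunique() < 2:
        raise ValueError("need at least two distinct missingness patterns")

    statistic = 0.0
    n_observed_cols = 0
    for _, rows in data.groupby(pattern):
        observed = rows.columns[rows.notna().all()]
        if len(observed) == 0:
            continue  # rows with every value missing carry no information
        diff = rows[observed].mean().to_numpy() - mu[observed].to_numpy()
        cov = sigma.loc[observed, observed].to_numpy()
        statistic += len(rows) * float(diff @ np.linalg.solve(cov, diff))
        n_observed_cols += len(observed)

    dof = n_observed_cols - data.shape[1]
    if dof <= 0:
        raise ValueError("not enough observed columns across patterns")
    return statistic, dof, float(chi2.sf(statistic, dof))


# Synthetic demo: y is dropped at random (MCAR) or depending on x (MAR).
rng = np.random.default_rng(2)
n = 1_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

mcar = pd.DataFrame({"x": x, "y": y})
mcar.loc[rng.random(n) < 0.3, "y"] = np.nan

mar = pd.DataFrame({"x": x, "y": y})
mar.loc[rng.random(n) < np.where(x > 0, 0.6, 0.1), "y"] = np.nan

stat_mcar, dof_mcar, p_mcar = little_mcar_sketch(mcar)
stat_mar, dof_mar, p_mar = little_mcar_sketch(mar)
print(f"MCAR data: statistic={stat_mcar:.2f}, dof={dof_mcar}, p={p_mcar:.3f}")
print(f"MAR data : statistic={stat_mar:.2f}, dof={dof_mar}, p={p_mar:.3g}")
```

On the MAR data the p-value should be very small, which rejects MCAR. Note that the test cannot separate MAR from MNAR; it only checks whether the observed data are compatible with MCAR.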

3. Visualize Patterns

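Create scatter plots with data points color-coded by missingness, or box plots and histograms comparing the distributions of other variables between rows where the value is observed and rows where it is missing. A clear difference between the two groups suggests that the missingness depends on that variable, which points away from MCAR.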
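One possible way to build such a comparison is sketched below, assuming pandas and matplotlib. The helpers `split_by_missingness` and `plot_comparison`, the output file name, and the synthetic `people` frame are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd


def split_by_missingness(df: pd.DataFrame, target: str, other: str):
    """Values of `other`, split by whether `target` is missing in the same row."""
    is_missing = df[target].isna()
    when_missing = df.loc[is_missing, other].dropna().to_numpy()
    when_observed = df.loc[~is_missing, other].dropna().to_numpy()
    return when_missing, when_observed


def plot_comparison(when_missing, when_observed, target, other,
                    path="missingness_comparison.png"):
    """Save a box plot and a histogram comparing the two groups."""
    # Imported here so the splitting helper works without matplotlib.
    import matplotlib.pyplot as plt

    fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
    ax_box.boxplot([when_observed, when_missing])
    ax_box.set_xticks([1, 2])
    ax_box.set_xticklabels([f"{target} observed", f"{target} missing"])
    ax_box.set_ylabel(other)
    ax_hist.hist(when_observed, bins=30, alpha=0.6, density=True,
                 label=f"{target} observed")
    ax_hist.hist(when_missing, bins=30, alpha=0.6, density=True,
                 label=f"{target} missing")
    ax_hist.set_xlabel(other)
    ax_hist.legend()
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


# Synthetic demo: income goes missing more often for older people.
rng = np.random.default_rng(3)
n = 2_000
people = pd.DataFrame({
    "age": rng.uniform(20, 65, size=n),
    "income": rng.normal(40_000, 10_000, size=n),
})
drop = rng.random(n) < np.where(people["age"] > 45, 0.5, 0.1)
people.loc[drop, "income"] = np.nan

age_when_missing, age_when_observed = split_by_missingness(people, "income", "age")
print(f"mean age, income missing : {age_when_missing.mean():.1f}")
print(f"mean age, income observed: {age_when_observed.mean():.1f}")
```

Calling `plot_comparison(age_when_missing, age_when_observed, "income", "age")` saves both plots side by side. In this synthetic data the age distribution shifts upward when income is missing, which is the kind of pattern that suggests MAR rather than MCAR.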

4. Use Domain Knowledge

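Domain knowledge is critical for identifying MNAR, which arises when missingness depends on the unobserved values of the variable itself. Because those values are missing, MNAR cannot be confirmed from the observed data alone; recognising it requires an understanding of the dataset's context and of how the data were collected.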

By combining statistical tests, visualisations, and domain expertise, you can classify the type of missing data and make informed decisions for handling it effectively!

The next article will be a detailed review of the different techniques that can be employed for handling the different types of missing data.

References

  1. https://stefvanbuuren.name/fimd/sec-MCAR.html
  2. https://www.theanalysisfactor.com/missing-data-mechanism/
  3. https://blog.dailydoseofds.com/p/enrich-your-missing-data-analysis
  4. https://spotintelligence.com/2024/10/18/handling-missing-data-in-machine-learning/
  5. https://stackoverflow.com/a/76247351