Debug School

rakesh kumar


Different ways to handle missing values by imputing them from the dataset

Using mean imputation
Using median imputation
Using mode imputation
Using constant imputation
Using K-Nearest Neighbors (KNN) imputation
Using regression imputation
Using interpolation
Using forward/backward fill imputation
Using hot-deck imputation
Using multiple imputation
Using custom imputation based on domain knowledge, such as filling with "Unknown" or zero

Using mean imputation:

import pandas as pd
from sklearn.impute import SimpleImputer

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})


# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data)
print(data_imputed)

Output:

[[ 1.    6.  ]
 [ 2.    8.25]
 [ 3.    8.  ]
 [ 4.    9.  ]
 [ 5.   10.  ]]
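
In practice the column means are often learned on one dataset and then reused on new rows. The following is a minimal sketch of that pattern, not part of the original example: the `new_rows` frame is hypothetical, and `statistics_` is the fitted attribute where SimpleImputer stores the per-column fill values.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Same example dataset as above
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

# Learn the per-column means from the observed (non-missing) values only
imputer = SimpleImputer(strategy='mean')
imputer.fit(data)
learned_means = imputer.statistics_
print(learned_means)

# Hypothetical new rows with the same columns (an assumption for illustration)
new_rows = pd.DataFrame({
    'col1': [None, 7],
    'col2': [3, None]
})

# Missing entries are filled with the means learned from `data`, not from `new_rows`
new_rows_imputed = imputer.transform(new_rows)
print(new_rows_imputed)
```

Fitting once and calling transform on later data keeps the fill values consistent between training and prediction.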

Using median imputation:
The median is the middle value in a sorted dataset; with an even number of values, it is the average of the two middle values.

import pandas as pd
from sklearn.impute import SimpleImputer

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

# Impute missing values with median
imputer = SimpleImputer(strategy='median')
data_imputed = imputer.fit_transform(data)
print(data_imputed)

Output:

[[ 1.   6. ]
 [ 2.   8.5]
 [ 3.   8. ]
 [ 4.   9. ]
 [ 5.  10. ]]

Using mode imputation:
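The mode is the value that appears most frequently in a dataset. When several values tie, as in this example where every value occurs once, SimpleImputer uses the smallest of them.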

import pandas as pd
from sklearn.impute import SimpleImputer

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})


# Impute missing values with mode
imputer = SimpleImputer(strategy='most_frequent')
data_imputed = imputer.fit_transform(data)
print(data_imputed)

Output:

[[ 1.  6.]
 [ 2.  6.]
 [ 1.  8.]
 [ 4.  9.]
 [ 5. 10.]]
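
Mode imputation is most often useful for categorical columns, where a mean or median is not defined. The sketch below is an illustration only: the `colors` frame and its values are made up for this example and do not come from the dataset above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry
colors = pd.DataFrame(
    {'color': ['red', 'blue', np.nan, 'red', 'green']},
    dtype=object
)

# 'most_frequent' also works on string data; 'red' occurs twice, so it is the mode
imputer = SimpleImputer(strategy='most_frequent')
colors_imputed = imputer.fit_transform(colors)
print(colors_imputed)
```

The result is a NumPy object array in which the missing entry has been replaced by the most common category.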

Using constant imputation:

import pandas as pd
from sklearn.impute import SimpleImputer

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

# Impute missing values with a constant value
imputer = SimpleImputer(strategy='constant', fill_value=-999)
data_imputed = imputer.fit_transform(data)
print(data_imputed)

Output:

[[   1.    6.]
 [   2. -999.]
 [-999.    8.]
 [   4.    9.]
 [   5.   10.]]

Using K-Nearest Neighbors (KNN) imputation:
K-nearest neighbors (KNN) imputation estimates each missing value from the values of the closest neighboring data points. With n_neighbors=3, the estimate is the average of the three nearest rows in which that feature is observed.

import pandas as pd
from sklearn.impute import KNNImputer

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

# Impute missing values using KNN imputation
imputer = KNNImputer(n_neighbors=3)
data_imputed = imputer.fit_transform(data)
print(data_imputed)

Output:

[[ 1.          6.        ]
 [ 2.          8.33333333]
 [ 3.33333333  8.        ]
 [ 4.          9.        ]
 [ 5.         10.        ]]
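
To see where a KNN estimate comes from, the calculation for row 1 can be reproduced by hand. This is a sketch under the assumption that KNNImputer uses its default NaN-aware Euclidean distance and uniform weights; the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same values as the example dataset, as a float array
X = np.array([
    [1, 6],
    [2, np.nan],
    [np.nan, 8],
    [4, 9],
    [5, 10],
], dtype=float)

# Pairwise distances that skip missing coordinates;
# rows with no coordinate in common get a NaN distance
dist = nan_euclidean_distances(X)

# Row 1 is missing col2, so donors are the rows where col2 is observed
donors = [0, 2, 3, 4]
donor_dist = dist[1, donors]

# argsort places NaN distances last, so they are never among the nearest
nearest = [donors[i] for i in np.argsort(donor_dist)[:3]]
manual_estimate = X[nearest, 1].mean()
print(nearest, manual_estimate)

# Cross-check against KNNImputer itself
knn_estimate = KNNImputer(n_neighbors=3).fit_transform(X)[1, 1]
print(knn_estimate)
```

Row 2 shares no observed column with row 1, so it cannot act as a neighbor, and the three remaining donors are averaged.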

Using regression imputation:
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

# Impute missing values using regression imputation
imputer = IterativeImputer(estimator=LinearRegression())
data_imputed = imputer.fit_transform(data)
print(data_imputed)

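Output (rounded; the iterative solver stops at a convergence tolerance, so the printed digits can differ slightly from these values):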

[[ 1.  6.]
 [ 2.  7.]
 [ 3.  8.]
 [ 4.  9.]
 [ 5. 10.]]

Using interpolation:

import pandas as pd

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

# Interpolate missing values
data_imputed = data.interpolate()
print(data_imputed)

Output:

   col1  col2
0   1.0   6.0
1   2.0   7.0
2   3.0   8.0
3   4.0   9.0
4   5.0  10.0

Using forward fill:

import pandas as pd

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

# Forward fill missing values
data_imputed = data.ffill()
print(data_imputed)

Output:

   col1  col2
0   1.0   6.0
1   2.0   6.0
2   2.0   8.0
3   4.0   9.0
4   5.0  10.0

Using backward fill:

import pandas as pd

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

# Backward fill missing values
data_imputed = data.bfill()
print(data_imputed)

Output:

   col1  col2
0   1.0   6.0
1   2.0   8.0
2   4.0   8.0
3   4.0   9.0
4   5.0  10.0
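
One caveat with directional fills is that they cannot fill a gap at the edge they start from. The sketch below uses a made-up series (not the dataset above) with missing values at both ends to show this, and chains the two fills as one possible workaround.

```python
import pandas as pd

# Hypothetical series with missing values at the start, middle, and end
s = pd.Series([None, 2.0, None, 4.0, None])

# Forward fill leaves the first value missing: nothing precedes it
forward = s.ffill()

# Backward fill leaves the last value missing: nothing follows it
backward = s.bfill()

# Chaining both fills covers the gaps at either edge
combined = s.ffill().bfill()

print(pd.DataFrame({'original': s, 'ffill': forward, 'bfill': backward, 'both': combined}))
```

Whether chaining is appropriate depends on the data; for ordered data such as time series, a backward fill copies a later observation into an earlier row.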

These are just some of the common ways to handle missing values by imputing them from the dataset. The choice of method depends on the specific problem and the characteristics of the data.
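
To make the differences between methods concrete, the pandas-only sketch below applies several of the fills above to col2 of the example dataset and places the results side by side. The `comparison` frame is just an illustrative helper.

```python
import pandas as pd

# Same example dataset as above
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

col = data['col2']

# One column per imputation method, so the filled value in row 1 can be compared
comparison = pd.DataFrame({
    'original': col,
    'mean': col.fillna(col.mean()),
    'median': col.fillna(col.median()),
    'ffill': col.ffill(),
    'bfill': col.bfill(),
    'interpolate': col.interpolate(),
})
print(comparison)
```

For the single missing entry in row 1, the methods give 8.25, 8.5, 6.0, 8.0, and 7.0 respectively, which shows how much the choice can matter even on a tiny dataset.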

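Using hot-deck imputation:
Hot-deck imputation copies the missing value from a similar observed record (a donor). The example below approximates it with KNNImputer and a single nearest neighbor.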

import pandas as pd
from sklearn.impute import KNNImputer

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, None, 5],
    'col2': [6, None, 8, 9, 10]
})

# Perform hot-deck imputation using KNNImputer
imputer = KNNImputer(n_neighbors=1, weights='uniform')
data_imputed = imputer.fit_transform(data)
print(data_imputed)

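Output (row 2 is equally distant from rows 0 and 4, so its donor depends on tie-breaking; row 0 is shown here):

[[ 1.  6.]
 [ 2.  6.]
 [ 1.  8.]
 [ 5.  9.]
 [ 5. 10.]]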

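Using multiple imputation:
IterativeImputer models each feature that has missing values as a function of the other features and repeats this in rounds. A single run, as in the example here, returns one completed dataset; multiple imputation in the strict sense repeats the run with sample_posterior=True and different random seeds, then pools the results.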

import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, None, 10]
})

# Run one pass of IterativeImputer (default estimator: BayesianRidge)
imputer = IterativeImputer(max_iter=10, random_state=0)
data_imputed = imputer.fit_transform(data)
print(data_imputed)

Output:

[[ 1.          6.        ]
 [ 2.          6.32653061]
 [ 3.          8.        ]
 [ 4.          8.6462585 ]
 [ 5.         10.        ]]
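
A single IterativeImputer run produces only one completed dataset. The sketch below shows one way to obtain several, assuming `sample_posterior=True` with a different `random_state` per run; the number of imputations (5) and the pooled statistic (the mean of col2) are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Same example dataset as above
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, None, 10]
})

n_imputations = 5  # arbitrary choice for this sketch
completed = []
for seed in range(n_imputations):
    # sample_posterior=True draws imputed values from the predictive
    # distribution, so each seed yields a different completed dataset
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    completed.append(imputer.fit_transform(data))

# Compute the statistic of interest on each completed dataset, then pool
estimates = np.array([filled[:, 1].mean() for filled in completed])
pooled_estimate = estimates.mean()
between_variance = estimates.var(ddof=1)

print(estimates)
print(pooled_estimate, between_variance)
```

The spread of the per-dataset estimates reflects the uncertainty introduced by the missing values, which a single imputation hides.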

Using custom imputation based on domain knowledge:

import pandas as pd

# Example dataset
data = pd.DataFrame({
    'col1': [1, 2, None, 4, 5],
    'col2': [6, None, 8, 9, 10]
})

# Impute missing values with custom values based on domain knowledge
data['col1'] = data['col1'].fillna(0)
# Filling a numeric column with a string converts it to object dtype
data['col2'] = data['col2'].fillna('Unknown')
print(data)

Output:

   col1     col2
0   1.0      6.0
1   2.0  Unknown
2   0.0      8.0
3   4.0      9.0
4   5.0     10.0