Using mean imputation
Using median imputation
Using mode imputation
Using constant imputation
Using K-Nearest Neighbors (KNN) imputation
Using regression imputation
Using interpolation
Using forward/backward fill
Using hot-deck imputation
Using multiple imputation
Using custom imputation based on domain knowledge (e.g., zero or "Unknown")
Using mean imputation:
The mean is the average of the observed values in a column; every missing entry is replaced with that average.
import pandas as pd
from sklearn.impute import SimpleImputer
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Output:
[[ 1.    6.  ]
 [ 2.    8.25]
 [ 3.    8.  ]
 [ 4.    9.  ]
 [ 5.   10.  ]]
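The same values can be produced with plain pandas if you prefer to skip scikit-learn; a minimal sketch:
import pandas as pd
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# fillna with the column means matches SimpleImputer(strategy='mean'),
# but returns a DataFrame instead of a NumPy array
data_imputed = data.fillna(data.mean())
print(data_imputed)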
Using median imputation:
The median is the middle value of the sorted observed values; it is less sensitive to outliers than the mean.
import pandas as pd
from sklearn.impute import SimpleImputer
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Impute missing values with median
imputer = SimpleImputer(strategy='median')
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Output:
[[ 1.   6. ]
 [ 2.   8.5]
 [ 3.   8. ]
 [ 4.   9. ]
 [ 5.  10. ]]
Using mode imputation:
The mode is the value that appears most frequently in a column. In this toy data every value occurs exactly once, and scikit-learn's 'most_frequent' strategy breaks such ties by taking the smallest value, which is why col1 is filled with 1 and col2 with 6.
import pandas as pd
from sklearn.impute import SimpleImputer
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Impute missing values with mode
imputer = SimpleImputer(strategy='most_frequent')
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Output:
[[ 1. 6.]
[ 2. 6.]
[ 1. 8.]
[ 4. 9.]
[ 5. 10.]]
Using constant imputation:
A fixed sentinel value (here -999) replaces every missing entry.
import pandas as pd
from sklearn.impute import SimpleImputer
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Impute missing values with a constant value
imputer = SimpleImputer(strategy='constant', fill_value=-999)
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Output:
[[   1.    6.]
 [   2. -999.]
 [-999.    8.]
 [   4.    9.]
 [   5.   10.]]
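Sentinel values can mislead models that treat them as real magnitudes, so it often helps to also record which entries were imputed. A minimal sketch using SimpleImputer's add_indicator parameter:
import pandas as pd
from sklearn.impute import SimpleImputer
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# add_indicator=True appends one binary column per feature that had
# missing values, flagging the imputed rows
imputer = SimpleImputer(strategy='constant', fill_value=-999, add_indicator=True)
data_imputed = imputer.fit_transform(data)
print(data_imputed)  # four columns: col1, col2, and two missingness flags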
Using K-Nearest Neighbors (KNN) imputation:
K-nearest neighbors (KNN) imputation estimates each missing value from the values of the closest rows, where closeness is measured over the features that both rows have observed.
import pandas as pd
from sklearn.impute import KNNImputer
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Impute missing values using KNN imputation
imputer = KNNImputer(n_neighbors=3)
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Output:
[[ 1.          6.        ]
 [ 2.          8.33333333]
 [ 3.33333333  8.        ]
 [ 4.          9.        ]
 [ 5.         10.        ]]
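Under the hood, KNNImputer measures distance only over the features a pair of rows both have observed; a quick way to inspect the distance matrix it works from (the nan for rows 1 and 2, which share no observed feature, marks a pair that can never donate to each other):
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances
X = np.array([[1, 6], [2, np.nan], [np.nan, 8], [4, 9], [5, 10]])
# Distances are computed over commonly observed features and rescaled
print(nan_euclidean_distances(X, X))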
Using regression imputation:
Regression imputation models each feature that has gaps as a function of the other features and fills the gaps with the model's predictions.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Impute missing values using regression imputation
imputer = IterativeImputer(estimator=LinearRegression())
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Output (in this toy data the observed pairs satisfy col2 = col1 + 5, so the fitted line recovers the missing values exactly):
[[ 1. 6.]
[ 2. 7.]
[ 3. 8.]
[ 4. 9.]
[ 5. 10.]]
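IterativeImputer accepts any scikit-learn regressor as the estimator; a minimal sketch swapping in a random forest, which can pick up non-linear relationships a straight line would miss:
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Any estimator with fit/predict works; random_state makes the run repeatable
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0))
print(imputer.fit_transform(data))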
Using interpolation:
Interpolation estimates each gap from its neighboring values; the pandas default is linear, which suits ordered data such as time series.
import pandas as pd
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Interpolate missing values
data_imputed = data.interpolate()
print(data_imputed)
Output:
col1 col2
0 1.0 6.0
1 2.0 7.0
2 3.0 8.0
3 4.0 9.0
4 5.0 10.0
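interpolate supports methods beyond the linear default; a minimal sketch with method='time', assuming a datetime-indexed series so that unevenly spaced gaps are weighted by the actual time elapsed:
import pandas as pd
# A small daily series with a gap; the timestamps are unevenly spaced
ts = pd.Series([1.0, None, 3.0],
index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04']))
# method='time' fills the gap in proportion to the elapsed time
print(ts.interpolate(method='time'))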
Using forward fill:
Forward fill propagates the last observed value forward into each gap.
import pandas as pd
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Forward fill missing values
data_imputed = data.ffill()
print(data_imputed)
Output:
col1 col2
0 1.0 6.0
1 2.0 6.0
2 2.0 8.0
3 4.0 9.0
4 5.0 10.0
Using backward fill:
Backward fill propagates the next observed value backward into each gap.
import pandas as pd
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Backward fill missing values
data_imputed = data.bfill()
print(data_imputed)
Output:
col1 col2
0 1.0 6.0
1 2.0 8.0
2 4.0 8.0
3 4.0 9.0
4 5.0 10.0
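Forward fill cannot fill a leading NaN and backward fill cannot fill a trailing one; chaining the two covers both ends:
import pandas as pd
data = pd.DataFrame({'col1': [None, 2, None, 4, None]})
# ffill leaves the leading NaN; bfill then fills it from the next observed value
data_imputed = data.ffill().bfill()
print(data_imputed)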
These are some of the most common ways to impute missing values. A few more specialized techniques follow; as always, the right choice depends on the specific problem and the characteristics of the data.
Using hot-deck imputation:
Hot-deck imputation copies a value from a similar ("donor") record; KNNImputer with n_neighbors=1 acts as a deterministic, nearest-neighbor hot deck.
import pandas as pd
from sklearn.impute import KNNImputer
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, None, 5],
'col2': [6, None, 8, 9, 10]
})
# Perform hot-deck imputation using KNNImputer
imputer = KNNImputer(n_neighbors=1, weights='uniform')
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Output:
[[ 1.  6.]
 [ 2.  6.]
 [ 1.  8.]
 [ 5.  9.]
 [ 5. 10.]]
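Classic hot-deck imputation draws the donor at random from the observed values instead of always using the nearest neighbor; a minimal sketch, where random_hot_deck is a hypothetical helper written for this example:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def random_hot_deck(s):
    # Replace each missing entry with a random draw from the column's observed values
    observed = s.dropna().to_numpy()
    out = s.copy()
    out[s.isna()] = rng.choice(observed, size=int(s.isna().sum()))
    return out

data = pd.DataFrame({
'col1': [1, 2, None, None, 5],
'col2': [6, None, 8, 9, 10]
})
print(data.apply(random_hot_deck))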
Using multiple imputation:
IterativeImputer fills each feature from the others in rounds, in the spirit of MICE; a single call yields one completed dataset, so true multiple imputation repeats the process (see the sketch after the output).
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, None, 10]
})
# Perform multiple imputation using IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Output:
[[ 1. 6. ]
[ 2. 6.32653061]
[ 3. 8. ]
[ 4. 8.6462585 ]
[ 5. 10. ]]
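True multiple imputation generates several completed datasets and pools whatever is estimated from them; a minimal sketch, assuming posterior sampling with different seeds is an acceptable stand-in for a full MICE procedure:
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, None, 10]
})
# sample_posterior=True draws imputations instead of point predictions,
# so different random_state values give different completed datasets
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(data)
    for seed in range(5)
]
# In a real analysis you would fit your model on each completed dataset
# and pool the estimates; here we just print the five imputed matrices
for m in imputations:
    print(m)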
Using custom imputation based on domain knowledge:
When you know what missingness means, encode it directly: here a missing col1 measurement is treated as zero and a missing col2 entry as the category 'Unknown'.
import pandas as pd
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# Impute missing values with custom values based on domain knowledge
data['col1'] = data['col1'].fillna(0)  # a missing measurement means zero here
data['col2'] = data['col2'].fillna('Unknown')  # note: this upcasts the column to object dtype
print(data)
Output:
   col1     col2
0   1.0      6.0
1   2.0  Unknown
2   0.0      8.0
3   4.0      9.0
4   5.0     10.0
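The same result can be written in one call by passing a per-column mapping to fillna:
import pandas as pd
# Example dataset
data = pd.DataFrame({
'col1': [1, 2, None, 4, 5],
'col2': [6, None, 8, 9, 10]
})
# One fillna call with a column-to-value mapping
data_imputed = data.fillna({'col1': 0, 'col2': 'Unknown'})
print(data_imputed)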