Imputing missing values using KNN imputation in Pandas
Pandas: Machine Learning Integration Exercise-11 with Solution
Write a Pandas program that imputes missing values using K-Nearest Neighbors (KNN).
The following exercise demonstrates how to impute missing values using the K-Nearest Neighbors (KNN) algorithm.
Sample Solution:
Code:
import pandas as pd
from sklearn.impute import KNNImputer
# Load the dataset
df = pd.read_csv('data.csv')
# Separate the columns to impute (Age and Salary) from the columns carried over unchanged (ID, Name, Gender, Target)
numeric_cols = ['Age', 'Salary']
non_numeric_cols = ['ID', 'Name', 'Gender', 'Target']
# Apply KNN imputation only to the numeric columns
imputer = KNNImputer(n_neighbors=3)
df_numeric_imputed = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)
# Combine the non-numeric columns with the imputed numeric data
df_imputed = pd.concat([df[non_numeric_cols].reset_index(drop=True), df_numeric_imputed], axis=1)
# Output the dataset with imputed values
print(df_imputed)
Output:
   ID      Name  Gender  Target        Age        Salary
0   1      Sara  Female       0  25.000000  50000.000000
1   2    Ophrah    Male       1  30.000000  60000.000000
2   3    Torben    Male       0  22.000000  70000.000000
3   4  Masaharu    Male       1  35.000000  80000.000000
4   5      Kaya  Female       0  25.666667  55000.000000
5   6   Abaddon    Male       1  29.000000  63333.333333
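The two imputed values in the output (Age in row 4 and Salary in row 5) can be checked by hand. The sketch below assumes that data.csv held exactly the values shown above, with Age missing in row 4 and Salary missing in row 5; the file itself is not shown, so this is an inference from the output. It is a simplified two-column illustration, not KNNImputer's actual implementation: when one of two features is missing, the neighbor ranking reduces to the absolute difference in the other feature, and the helper name impute_from_other is hypothetical.

```python
import numpy as np

# Assumed numeric columns of data.csv (inferred from the printed output).
age = np.array([25.0, 30.0, 22.0, 35.0, np.nan, 29.0])
salary = np.array([50000.0, 60000.0, 70000.0, 80000.0, 55000.0, np.nan])

def impute_from_other(target, other, row, k=3):
    """Mean of `target` over the k rows closest to `row` in `other`."""
    # A row can donate only if it has both a target value and a comparable `other` value.
    donors = np.flatnonzero(~np.isnan(target) & ~np.isnan(other))
    dist = np.abs(other[donors] - other[row])
    nearest = donors[np.argsort(dist, kind="stable")[:k]]
    return target[nearest].mean()

age_row4 = impute_from_other(age, salary, row=4)     # neighbors by Salary: rows 0, 1, 2
salary_row5 = impute_from_other(salary, age, row=5)  # neighbors by Age: rows 1, 0, 3
print(age_row4, salary_row5)
```

Under these assumptions the Age for row 4 is the mean of 25, 30 and 22, and the Salary for row 5 is the mean of 60000, 50000 and 80000, which agrees with the values printed by the solution.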
Explanation:
- Import Libraries:
- pandas is imported for handling data in DataFrame format.
- KNNImputer from sklearn is imported for imputing missing values using K-Nearest Neighbors (KNN).
- Load Dataset:
- The data.csv file is read using pd.read_csv() and stored in a DataFrame df.
- Separate Numeric and Non-Numeric Columns:
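- Two lists are created: numeric_cols containing the columns to impute ('Age', 'Salary') and non_numeric_cols containing the columns left out of imputation ('ID', 'Name', 'Gender', 'Target'). ID and Target hold numbers, but they act as an identifier and a label, so they are not passed to the imputer.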
- Initialize and Apply KNN Imputer:
- KNNImputer is initialized with n_neighbors=3, meaning that each missing value is replaced by the mean of that column over the 3 nearest rows that have a value for it (the default uniform weighting).
- The fit_transform() method is applied to the numeric_cols ('Age' and 'Salary') to fill in the missing values, creating a DataFrame df_numeric_imputed with the imputed data.
- Combine Imputed Data with Non-Numeric Columns:
- The imputed numeric data (df_numeric_imputed) is combined with the original non-numeric columns (df[non_numeric_cols]) using pd.concat().
- The reset_index(drop=True) call gives df[non_numeric_cols] a fresh 0-based index that matches the default index of df_numeric_imputed, so the rows line up when the two are concatenated.
- Output the Final Dataset:
- The fully imputed dataset (df_imputed) is printed, containing both the non-numeric and imputed numeric data.
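For readers without data.csv, the steps above can be sketched as a self-contained script. The inline DataFrame is an assumed stand-in for the file, reconstructed from the printed output with Age missing in row 4 and Salary missing in row 5. As a variation on the solution, the imputed values are assigned back into a copy of the original DataFrame rather than concatenated, which keeps the original index and column order without needing reset_index().

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Assumed stand-in for data.csv (inferred from the exercise output).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Name": ["Sara", "Ophrah", "Torben", "Masaharu", "Kaya", "Abaddon"],
    "Gender": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "Target": [0, 1, 0, 1, 0, 1],
    "Age": [25, 30, 22, 35, np.nan, 29],
    "Salary": [50000, 60000, 70000, 80000, 55000, np.nan],
})

numeric_cols = ["Age", "Salary"]

# Impute only the selected columns and write them back in place on a copy,
# so the remaining columns and the index are left untouched.
imputer = KNNImputer(n_neighbors=3)
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print(df_imputed)
```

Because Age and Salary are on very different scales, a dataset where rows have several features present at once may benefit from scaling the columns before imputation; that is outside the scope of this exercise.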
https://www.w3resource.com/python-exercises/pandas/pandas-impute-missing-values-using-knn-imputation.php