Removing duplicate rows in a Pandas DataFrame

Last update on December 21 2024 07:43:18 (UTC/GMT +8 hours)

Remove duplicate rows from a Pandas DataFrame.

Sample Solution:

Python Code:

import pandas as pd

# Create a sample DataFrame with duplicate rows
data = {'Name': ['Ross', 'Bob', 'Ross', 'Geoffrey', 'Bob'],
        'Age': [25, 30, 25, 22, 30],
        'Salary': [50000, 60000, 50000, 45000, 60000]}

df = pd.DataFrame(data)

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()

# Display the DataFrame without duplicates
print(df_no_duplicates)

Output:

       Name  Age  Salary
0      Ross   25   50000
1       Bob   30   60000
3  Geoffrey   22   45000

Explanation:

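In the exercise above:

We create a sample DataFrame (df) with columns 'Name', 'Age', and 'Salary'. Rows 2 and 4 repeat rows 0 and 1 exactly.
The df.drop_duplicates() method returns a new DataFrame with duplicate rows removed, keeping the first occurrence of each. The original df is left unchanged.
The resulting DataFrame (df_no_duplicates) contains only unique rows and keeps their original index labels (0, 1, 3).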
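
To see which rows are treated as duplicates, or to change which occurrence is kept, the following sketch may help. It rebuilds the same sample data and uses df.duplicated() and the keep parameter of drop_duplicates(). The variable names (mask, last_kept, none_kept, renumbered) are illustrative only.

```python
import pandas as pd

# Same sample data as in the exercise
df = pd.DataFrame({'Name': ['Ross', 'Bob', 'Ross', 'Geoffrey', 'Bob'],
                   'Age': [25, 30, 25, 22, 30],
                   'Salary': [50000, 60000, 50000, 45000, 60000]})

# Boolean mask: True for every row that repeats an earlier row
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False, True]

# keep='last' retains the last occurrence of each duplicated row
last_kept = df.drop_duplicates(keep='last')
print(last_kept.index.tolist())  # [2, 3, 4]

# keep=False drops every row that has a duplicate anywhere
none_kept = df.drop_duplicates(keep=False)
print(none_kept['Name'].tolist())  # ['Geoffrey']

# Renumber the index from 0 after removing duplicates
renumbered = df.drop_duplicates().reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

The default is keep='first', which is why the exercise output shows index labels 0, 1, and 3.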

You can also specify a subset of columns to consider when identifying duplicates using the subset parameter. For example, to remove duplicates based on the 'Name' column:

df_no_duplicates = df.drop_duplicates(subset='Name')
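
With the sample data above, subset='Name' gives the same result as the default call, because the repeated names also repeat every other column. The sketch below uses a hypothetical DataFrame (df_varied, not part of the original exercise) in which 'Ross' appears twice with different salaries, to show where subset actually changes the outcome.

```python
import pandas as pd

# Hypothetical data: 'Ross' appears twice, but with different salaries
df_varied = pd.DataFrame({'Name': ['Ross', 'Bob', 'Ross', 'Geoffrey'],
                          'Age': [25, 30, 25, 22],
                          'Salary': [50000, 60000, 52000, 45000]})

# No two rows are identical in every column, so nothing is removed
full_row = df_varied.drop_duplicates()
print(len(full_row))  # 4

# Comparing on 'Name' only drops the second 'Ross' row
by_name = df_varied.drop_duplicates(subset='Name')
print(by_name.index.tolist())  # [0, 1, 3]

# A list of columns is also accepted; keep='last' retains the later 'Ross' row
by_name_age = df_varied.drop_duplicates(subset=['Name', 'Age'], keep='last')
print(by_name_age['Salary'].tolist())  # [60000, 52000, 45000]
```

Note that deduplicating on a subset discards the differing values in the other columns, so it is worth checking which occurrence should be kept.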

Adjust the column names and data to match the structure of your own DataFrame.
