Pandas Drop Duplicate Rows in DataFrame

Current Location：Home > Learning > PROGRAM > Python >

Python PHP Java Go TypeScript C++ Vba Node.js C语言 MATLAB

Pandas Drop Duplicate Rows in DataFrame

Author：JIYIK Last Updated：2025/04/12 Views：

This tutorial explains how to DataFrame.drop_duplicates()remove all duplicate rows from a Pandas DataFrame using the remove_by method.

`DataFrame.drop_duplicates()`grammar

DataFrame.drop_duplicates(subset=None, keep="first", inplace=False, ignore_index=False)

It returns a DataFrame with all the duplicate rows removed.

Use `DataFrame.drop_duplicates()`the method to remove duplicate rows

import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

df_without_duplicates = df_with_duplicates.drop_duplicates()

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300

It removes rows where all values for all columns are the same. By default, rows in a DataFrame are considered duplicates only if they have the same values for each column. In df_with_duplicatesthe DataFrame, the first and fifth rows have the same values for all columns, so the fifth row is removed.

Set `subset`the parameters to remove duplicates based on specific columns only

import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"])

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100

Here, we pass Nameas subsetthe parameter to drop_duplicates()the method. The fourth and fifth rows are deleted because their Namecolumns have the same value as the first column.

`drop_duplicates()`Set in method`keep='last'`

import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"], keep="last")

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
5  302   Watch  300

It deletes all rows except the last row Namewhich has the same value as the column.

We set keep=Falseto remove all rows with the same value in any column.

import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"], keep=False)

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

Output:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100

It deletes the first, fifth, and sixth rows because their Namecolumns all have the same values.

Previous：Pandas DataFrame Delete a row

Next：Get the first row of Dataframe Pandas

For reprinting, please send an email to 1244347461@qq.com for approval. After obtaining the author's consent, kindly include the source as a link.

Article URL：

JIYIK CN >