How can I remove the duplicates from the DataFrame

Time:04-11

Please check the image for details. The problem is that some titles differ only in the color; the rest of the details match in every column.

Example:

  1. Apple iPhone 12 Pro Max (256GB) - Gold
  2. Apple iPhone 12 Pro Max (256GB) - Pacific Blue

The only difference between the two is the color (Gold vs. Pacific Blue); all the other values are the same.

Now I want to remove all such duplicates, where the only difference is in the Title and the rest of the details are the same.

import pandas as pd 
import numpy as np 

df = pd.read_csv('../amazon_listing_scraper/amazon_listing.csv')
df.head(5)

df[np.isin(df, ['Apple iPhone 12 Pro Max (256GB) - Gold','Apple iPhone 12 Pro Max (256GB) - Pacific Blue']).any(axis=1)]

CodePudding user response:

I worked on cleaning a dataset with duplicate Names and this is how I dropped those:

df = df.drop_duplicates(subset = "Name")

You can try doing:

df = df.drop_duplicates(subset = "Title")

Read the documentation to learn more.
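Note that titles which differ only by color are not exact duplicates of each other, so dropping on the title column alone may leave them in place. One possible alternative, sketched here under assumptions, is to compare every column except the title. The column names "Title", "Price" and "Stars" and the sample values are hypothetical, since the real CSV layout is not shown in the question.

```python
import pandas as pd

# Hypothetical sample data; the real column names and values may differ.
df = pd.DataFrame({
    "Title": [
        "Apple iPhone 12 Pro Max (256GB) - Gold",
        "Apple iPhone 12 Pro Max (256GB) - Pacific Blue",
        "Samsung Galaxy S21 (128GB) - Phantom Gray",
    ],
    "Price": [1099, 1099, 799],
    "Stars": [4.7, 4.7, 4.5],
})

# Every column except "Title" takes part in the duplicate check.
other_cols = df.columns.difference(["Title"]).tolist()

# Keep the first row of each group whose non-title values all match.
deduped = df.drop_duplicates(subset=other_cols)
print(deduped)
```

A caveat with this approach: two unrelated products that happen to share all other values would also be collapsed into one row, so it is only safe when the remaining columns identify a product well enough.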

CodePudding user response:

This is an extension of @Ishan Shishodiya's post.

If the color is appended to the name with " - ", then you can split on it first, adding a new column named "color" to the DataFrame, drop duplicates on "title", and then drop "color" again.

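# split at the first " - " only; n must be passed by keyword in recent pandas
df[['title', 'color']] = df['title'].str.split(' - ', n=1, expand=True)
df = df.drop_duplicates(subset='title')
# axis=1 is required to drop a column rather than a row label
df = df.drop('color', axis=1)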

Minimal Example

I show this with only two columns to keep the example simple.

import pandas as pd

df = pd.DataFrame({
    'title': ["Apple iPhone 12 Pro Max (256GB) - Gold", "Apple iPhone 12 Pro Max (256GB) - Pacific Blue"],
    'stars':[4.7]*2
})
>>> df
                                            title    stars
0          Apple iPhone 12 Pro Max (256GB) - Gold      4.7
1  Apple iPhone 12 Pro Max (256GB) - Pacific Blue      4.7

df[['title', 'color']] = df['title'].str.split(' - ', n=1, expand=True)
df = df.drop_duplicates(subset='title')
df = df.drop("color", axis=1)

>>> df
                             title    stars
0  Apple iPhone 12 Pro Max (256GB)      4.7
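
The approach above overwrites "title" with the color stripped off. If the original titles should stay intact, a possible variant is to build the base title as a separate Series and filter on it. This is only a sketch: it assumes the color is always the part after the last " - ", and the third row is made-up data added to show that other products are kept.

```python
import pandas as pd

df = pd.DataFrame({
    "title": [
        "Apple iPhone 12 Pro Max (256GB) - Gold",
        "Apple iPhone 12 Pro Max (256GB) - Pacific Blue",
        "Samsung Galaxy S21 (128GB) - Phantom Gray",  # made-up extra row
    ],
    "stars": [4.7, 4.7, 4.5],
})

# Base title = everything before the last " - ".
# Titles without the separator are left unchanged.
base = df["title"].str.rsplit(" - ", n=1).str[0]

# Keep only the first row for each base title; original titles are preserved.
deduped = df[~base.duplicated()]
print(deduped)
```

Using rsplit rather than split means a title that contains " - " earlier in the product name still loses only its final segment.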