Pandas drop_duplicates in any one columns and also in other csv-CodePudding

P.S. NOT JUST "Fruit" repeat I want "ref" repeated also drop it!!! Thanks ALL!!

Q1: Can pandas drop duplicates if any one cell is duplicated (all data from csv)

df1 read csv:

Fruit ref

Banana 123

Banana 456

Apple 789

Apple 456

apricot 123

Orange 999

===============

I want:

Fruit ref

Banana 123

Apple 789

Orange 999

i.e If duplicated on Fruit OR duplicated on ref then remove whole row, not matter which columns repeated, just keep the first

Q2: Another question is similar but in anther csv

df2 read csv:

Fruit ref

Banana 8778

Apple 7899

Mango 999

Grapes 8778

Pear 5465

Cherry 7445

Mango 5465

===========

I want:

Fruit ref

Pear 5465

Cherry 894

Similar, I want drop duplicated not matter which columns repeated, just keep the first

Banana, Apple is repeated in df1 so drop it

999 also repeated in df1 so drop it

5465 is repeated in df2(same df) so drop it

Thus, just keep Pear 5465 and Cherry 894

Q3: I have many csv data is similar, how can I remove all if repeated in COL(Fruit) or COL(ref)? or it must concat or merge to one csv then use drop_duplicates?

Sorry, my English not good and difficult to explain what I want, how you know what I want to say.

CodePudding user response：

For first chain conditions for both columns:

df = df1[~df1['Fruit'].duplicated() & ~df1['ref'].duplicated()]
print (df)
    Fruit  ref
0  Banana  123
2   Apple  789
5  Orange  999

For second first join both DataFrames and remove duplicates same way, for values from df2 only use inner join by df2:

df12 = pd.concat([df1, df2])
df = df12[~df12['Fruit'].duplicated() & ~df12['ref'].duplicated()].merge(df2)
print (df)
    Fruit   ref
0    Pear  5465
1  Cherry  7445

CodePudding user response：

df = pd.DataFrame({
    'Fruit' : ['Banana', 'Banana', 'Apple', 'Apple', 'Apricot', 'Orange'],
    'ref' : [123, 456, 789, 456, 123, 999]
})

    Fruit   ref
0   Banana  123
1   Banana  456
2   Apple   789
3   Apple   456
4   Apricot 123
5   Orange  999

df1 = df.groupby(["Fruit"]).filter(lambda df:df.shape[0] == 1)
df2 = df.groupby(["ref"]).filter(lambda df:df.shape[0] == 1)
df = pd.concat([df1, df2]).drop_duplicates()

Output:

    Fruit   ref
4   Apricot 123
5   Orange  999
2   Apple   789

Also check this post to know more. here

CodePudding user response：

just a hack,

>>> import pandas as pd
>>> data=[['banana',123],['banana',456],['apple',789],['apricot',123],['orange','999']]
>>> df = pd.DataFrame(data,columns=['fruit','ref'])
>>> df.drop_duplicates(subset=['fruit'])
     fruit  ref
0   banana  123
2    apple  789
3  apricot  123
4   orange  999
>>> (df.drop_duplicates(subset=['fruit'])).drop_duplicates(subset=['ref'])
    fruit  ref
0  banana  123
2   apple  789
4  orange  999

click here for snippet