P.S. NOT JUST "Fruit" repeat I want "ref" repeated also drop it!!! Thanks ALL!!
Q1: Can pandas drop duplicates if any one cell is duplicated (all data from csv)
df1 read csv:
Fruit ref
Banana 123
Banana 456
Apple 789
Apple 456
apricot 123
Orange 999
===============
I want:
Fruit ref
Banana 123
Apple 789
Orange 999
i.e If duplicated on Fruit OR duplicated on ref then remove whole row, not matter which columns repeated, just keep the first
Q2: Another question is similar but in anther csv
df2 read csv:
Fruit ref
Banana 8778
Apple 7899
Mango 999
Grapes 8778
Pear 5465
Cherry 7445
Mango 5465
===========
I want:
Fruit ref
Pear 5465
Cherry 894
Similar, I want drop duplicated not matter which columns repeated, just keep the first
Banana, Apple is repeated in df1 so drop it
999 also repeated in df1 so drop it
5465 is repeated in df2(same df) so drop it
Thus, just keep Pear 5465 and Cherry 894
Q3: I have many csv data is similar, how can I remove all if repeated in COL(Fruit) or COL(ref)? or it must concat or merge to one csv then use drop_duplicates?
Sorry, my English not good and difficult to explain what I want, how you know what I want to say.
CodePudding user response:
For first chain conditions for both columns:
df = df1[~df1['Fruit'].duplicated() & ~df1['ref'].duplicated()]
print (df)
Fruit ref
0 Banana 123
2 Apple 789
5 Orange 999
For second first join both DataFrames and remove duplicates same way, for values from df2
only use inner join by df2
:
df12 = pd.concat([df1, df2])
df = df12[~df12['Fruit'].duplicated() & ~df12['ref'].duplicated()].merge(df2)
print (df)
Fruit ref
0 Pear 5465
1 Cherry 7445
CodePudding user response:
df = pd.DataFrame({
'Fruit' : ['Banana', 'Banana', 'Apple', 'Apple', 'Apricot', 'Orange'],
'ref' : [123, 456, 789, 456, 123, 999]
})
Fruit ref
0 Banana 123
1 Banana 456
2 Apple 789
3 Apple 456
4 Apricot 123
5 Orange 999
df1 = df.groupby(["Fruit"]).filter(lambda df:df.shape[0] == 1)
df2 = df.groupby(["ref"]).filter(lambda df:df.shape[0] == 1)
df = pd.concat([df1, df2]).drop_duplicates()
Output:
Fruit ref
4 Apricot 123
5 Orange 999
2 Apple 789
Also check this post to know more. here
CodePudding user response:
just a hack,
>>> import pandas as pd
>>> data=[['banana',123],['banana',456],['apple',789],['apricot',123],['orange','999']]
>>> df = pd.DataFrame(data,columns=['fruit','ref'])
>>> df.drop_duplicates(subset=['fruit'])
fruit ref
0 banana 123
2 apple 789
3 apricot 123
4 orange 999
>>> (df.drop_duplicates(subset=['fruit'])).drop_duplicates(subset=['ref'])
fruit ref
0 banana 123
2 apple 789
4 orange 999