Get rid of duplicate rows based on conditions for a pandas dataframe


I have a dataframe with many rows sharing duplicated index labels, like this:

import pandas as pd

olddf = pd.DataFrame(index=['MJ','MJ','MJ','BJ','KJ','KJ'],
                     data={'name':['masdjsdf','machael jordon','mskkkadke','boris johnson', 'kim jongun', 'kkasdfl'],
                           'age':[23,40,31,35,25,30]})

I need to drop the rows with a duplicated index whose name doesn't match the dictionary dic = {'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}. After the operation, the dataframe should become

newdf = pd.DataFrame(index=['MJ','BJ','KJ'],
                     data={'name':['machael jordon','boris johnson', 'kim jongun'],
                           'age':[40,35,25]})
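
For reference, printing the two frames (nothing here beyond the snippets above) should give roughly:

print(olddf)
#               name  age
# MJ        masdjsdf   23
# MJ  machael jordon   40
# MJ       mskkkadke   31
# BJ   boris johnson   35
# KJ      kim jongun   25
# KJ         kkasdfl   30

print(newdf)
#               name  age
# MJ  machael jordon   40
# BJ   boris johnson   35
# KJ      kim jongun   25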

Thank you...

CodePudding user response:

Use map to look up each index label in the dictionary, then eq to compare the result with the name column. This yields a boolean Series that is True where the two are equal, which you can use as a mask to slice the original dataframe:

mask = olddf['name'].eq(olddf.index.map(dic))   # True where the name matches dic for that index label

newdf = olddf[mask]

Output:

              name  age
MJ  machael jordon   40
BJ   boris johnson   35
KJ      kim jongun   25
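
To see what the mask is doing, here are the intermediate values on the example data (just the code above, spelled out):

olddf.index.map(dic).tolist()
# ['machael jordon', 'machael jordon', 'machael jordon', 'boris johnson', 'kim jongun', 'kim jongun']

mask.tolist()
# [False, True, False, True, True, False]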

What about also keeping the non-duplicated rows?

Simple: add a second mask:

mask2 = ~olddf.index.duplicated(keep=False)   # True only for index labels that appear once

newdf = olddf[mask | mask2]                   # keep rows that match dic OR have a unique index
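
As a quick illustration (my own check, assuming the example data above): index.duplicated(keep=False) flags every label that occurs more than once, so the second mask only keeps 'BJ' here, and the union with the first mask happens to select the same three rows.

olddf.index.duplicated(keep=False).tolist()
# [True, True, True, False, True, True]

mask2.tolist()
# [False, False, False, True, False, False]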

CodePudding user response:

Take it easy, make it easy. Convert the dict to a pd.Series and check membership with .isin(). Code below:

lst={'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}

olddf[olddf.isin(pd.Series(lst, name='name')).any(axis=1)]   # keep rows where any cell matches; pass axis as a keyword

Outcome

              name  age
MJ  machael jordon   40
BJ   boris johnson   35
KJ      kim jongun   25
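
For what it's worth, DataFrame.isin with a Series aligns on the index label, which is what makes this work; the intermediate boolean frame on the example data should look roughly like this (my own illustration, not part of the original answer):

olddf.isin(pd.Series(lst, name='name'))
#      name    age
# MJ  False  False
# MJ   True  False
# MJ  False  False
# BJ   True  False
# KJ   True  False
# KJ  False  False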