I have a dataframe with lots of duplicated rows on index, like this:
olddf = pd.DataFrame(index=['MJ','MJ','MJ','BJ','KJ','KJ'],data={'name':['masdjsdf','machael jordon','mskkkadke','boris johnson', 'kim jongun', 'kkasdfl'],'age':[23,40,31,35,25,30]})
I need to get rid of the duplicate index(rows) which also don't match the dictionary
dic = {'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}.
So after the operation, the dataframe should become
newdf = pd.DataFrame(index=['MJ','BJ','KJ'],data={'name':['machael jordon','boris johnson', 'kim jongun',],'age':[40,35,25]})
Thank you...
CodePudding user response:
Use map
to set the values from the dictionary keys, then eq
to compare with the column's data. If equal this yields True, you can use the resulting Series of booleans as a mask to slice the original dataframe:
mask = olddf['name'].eq(olddf.index.map(dic))
newdf = olddf[mask]
Output:
name age
MJ machael jordon 40
BJ boris johnson 35
KJ kim jongun 25
also keeping the non duplicated rows
Simple, add a second mask:
mask2 = ~olddf.index.duplicated(keep=False)
newdf = olddf[mask|mask2]
CodePudding user response:
Take it easy, make it easy. Make the dic a pd.Series and check inclusion using .isin(). Code below
lst={'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}
olddf[olddf.isin(pd.Series(lst, name='name')).any(1)]
Outcome
name age
MJ machael jordon 40
BJ boris johnson 35
KJ kim jongun 25