I want to fetch 1 record from duplicate rows in Df except one column. example :
index | col1 | col2 | col3 | col4 |
---|---|---|---|---|
0 | abc | efg | hij | op |
1 | abc | efg | hij | ki |
2 | abc | efg | hij | tf |
3 | abc | efg | hij | ge |
4 | xyz | mmm | qt | aa |
5 | xyz | mmm | qt | bb |
6 | xyz | mmm | qt | cc |
(order by col4 asc) Thus, desired result could be like
index | col1 | col2 | col3 | col4 | rank |
---|---|---|---|---|---|
0 | abc | efg | hij | op | 1 |
1 | abc | efg | hij | ki | 2 |
2 | abc | efg | hij | tf | 3 |
3 | abc | efg | hij | ge | 4 |
4 | xyz | mmm | qt | aa | 1 |
5 | xyz | mmm | qt | bb | 2 |
6 | xyz | mmm | qt | cc | 3 |
The goal is to obtain a rank for each similar result to fetch data as df = df[df['rank'] == 1]
CodePudding user response:
Use:
df['rank'] = df.groupby(df.columns.difference(['col4']).tolist()).cumcount().add(1)
CodePudding user response:
ansev's answers does the job but for those of you who like the rank method two things might be worth trying. Unfortunately, pandas.groupby.rank does only work for one column but this case that might be enough. I confess, the case for multiple columns is a little bit busy...
df = pd.DataFrame({
'col1':['abc','abc','abc','abc','xyz','xyz','xyz'],
'col2':['efg','efg','efg','efg','mmm','mmm','mmm'],
'col3':['hij','hij','hij','hij','qt','qt','qt'],
'col4':['op','ki','tf','ge','aa','bb','cc']
})
df['rank_single_column'] = df.groupby('col3')['col4'].rank(method='first').astype(int)
df['rank_multiple_columns'] = df.groupby(df[df.columns.difference(['col4']).tolist()].apply(tuple,axis=1))['col4'].rank(method='first').astype(int)
An advandage of rank (not asked in the question) would be the possibility to sort for the 'col4'. This could also be achieved by adding sort_values() into ansev's answer.
df['rank'] = df.sort_values(['col4'],ascending=True).groupby(df.columns.difference(['col4']).tolist()).cumcount().add(1)