I have the following dataframe:
id location method
1 456 Phone
1 456 OS
6 456 OS
6 943 Specialist
What I'm trying to do is implement the following logic:
- If there's only one record for a given combination of location and method, I do nothing. That's the case for the first and last rows.
- If there's more than one record for a combination of location and method, I want to keep only those where id == 1.
So, the resulting dataframe would be:
id location method
1 456 Phone
1 456 OS
6 943 Specialist
When I only need to consider the location column (without method), I have this solution: df.loc[df['id'].eq(1).groupby(df['location'], sort=False).idxmax()]
(Reference: Filter and apply condition between multiple rows)
But I cannot figure out how to combine this filter with the "method" column. Any ideas?
CodePudding user response:
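A possible solution is to sort by id and keep the first row of each (location, method) group. Note that this keeps the smallest id in each group, which matches the requirement only as long as 1 is the smallest id: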
(df.sort_values(by='id')
.groupby(['location', 'method']).first()
.reset_index().sort_index(axis=1))
Output:
id location method
0 1 456 OS
1 1 456 Phone
2 6 943 Specialist
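The idxmax approach from the question can also be extended to two keys by passing both columns to groupby. The following is only a sketch on the sample data from the question, not a general solution: for each (location, method) group it keeps the first row where id == 1, and falls back to the group's first row when no row has id == 1. If a group could contain several rows with id == 1, only the first of them would survive.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [[1, 456, "Phone"], [1, 456, "OS"], [6, 456, "OS"], [6, 943, "Specialist"]],
    columns=["id", "location", "method"],
)

# For each (location, method) group, idxmax returns the index label of the
# first row where id == 1, or of the group's first row if no row has id == 1.
keep = df["id"].eq(1).groupby([df["location"], df["method"]], sort=False).idxmax()

# Select those rows; the original column order is preserved.
result = df.loc[keep]
print(result)
```

Because the rows are selected by their original index labels, the result keeps the original index (no reset_index is applied here).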
CodePudding user response:
First, you can use groupby on multiple columns (location and method). Then, within each group, you can select the row you need (here I sort by "id" and take the first one).
import pandas as pd

# generate the dataframe
df = pd.DataFrame(
    [
        [1, 456, "Phone"],
        [1, 456, "OS"],
        [6, 456, "OS"],
        [6, 943, "Specialist"],
    ],
    columns=["id", "location", "method"],
)
# group by both columns, sort each group by id, and select its first row
# (newer pandas versions may warn about apply operating on the grouping columns)
n_df = df.groupby(["location", "method"]).apply(lambda x: x.sort_values("id").iloc[0])
# remove the index that groupby generates
n_df.reset_index(drop=True, inplace=True)
print(n_df)
id location method
0 1 456 OS
1 1 456 Phone
2 6 943 Specialist
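Both answers above keep the smallest id per group, which only coincides with "id == 1" on this sample. If the rule should be applied literally (leave unique combinations alone, and otherwise keep only rows with id == 1), a boolean mask is one way to sketch it. This assumes the sample data from the question; the names is_dup and result are just illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [[1, 456, "Phone"], [1, 456, "OS"], [6, 456, "OS"], [6, 943, "Specialist"]],
    columns=["id", "location", "method"],
)

# True for every row whose (location, method) combination occurs more than once
is_dup = df.duplicated(["location", "method"], keep=False)

# Keep rows that are unique for their combination, or that have id == 1
result = df[~is_dup | df["id"].eq(1)]
print(result)
```

Unlike the sort-and-take-first approach, this drops every row of a duplicated combination when none of its rows has id == 1, so it is worth checking which behaviour is wanted for that case.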