How to perform groupby and remove duplicates based on the first occurrence of a column condition?


This problem is a bit hard for me to wrap my head around, so I hope I can explain it properly below. I have a data frame with a lot of rows but only 3 columns, like the one below:

import pandas as pd

data = {'line_group': [1, 1, 8, 8, 4, 4, 5, 5],
        'route_order': [1, 2, 1, 2, 1, 2, 1, 2],
        'StartEnd': ['20888->20850', '20888->20850', '20888->20850', '20888->20850',
                     '20961->20960', '20961->20960', '20961->20960', '20961->20960']}
df = pd.DataFrame(data)

   line_group  route_order      StartEnd
0           1            1  20888->20850
1           1            2  20888->20850
2           8            1  20888->20850
3           8            2  20888->20850
4           4            1  20961->20960
5           4            2  20961->20960
6           5            1  20961->20960
7           5            2  20961->20960

In the end, I want to use this data to plot routes between points, for instance 20888 to 20850. But the problem is that there are a lot of trips/line_groups that go through the same two points, so when I plot them the lines overlap and plotting becomes very slow, which is not what I want.

So for each unique StartEnd I only want to keep the first line_group, like in the data frame below:

(desired output: only the rows for line_group 1 and line_group 4, i.e. one trip per unique StartEnd)

I believe it could have something to do with groupby, like in the code below that I have tried, but it doesn't produce the results I want. Also, in the full dataset the route orders aren't just from one point to another and can be much longer (e.g. 1, 2, 3, 4, ...).

df.drop_duplicates(subset='StartEnd', keep="first")

CodePudding user response:

Group by StartEnd and keep only the first line_group value

Then filter to the rows whose line_group is one of those kept values

# For each StartEnd, take the line_group of its first row
unique_groups = df.groupby('StartEnd')['line_group'].agg(lambda x: list(x)[0]).reset_index()

StartEnd    line_group
20888->20850    1
20961->20960    4

# Keep only the rows belonging to those first line_groups
unique_line_groups = unique_groups['line_group']
filtered_df = df[df['line_group'].isin(unique_line_groups)]

Final Output

line_group  route_order StartEnd
1          1    20888->20850
1          2    20888->20850
4          1    20961->20960
4          2    20961->20960
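
If you prefer, the two steps can be collapsed into a single expression; this is just an equivalent sketch of the same idea, using groupby(...).first() to pick the first line_group per StartEnd:

# One-line equivalent: keep rows whose line_group is the first one seen for its StartEnd
filtered_df = df[df['line_group'].isin(df.groupby('StartEnd')['line_group'].first())]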

CodePudding user response:

You can add route_order to the subset argument to get the output you want.

In [8]: df.drop_duplicates(subset=['StartEnd', 'route_order'], keep='first')
Out[8]: 
   line_group  route_order      StartEnd
0           1            1  20888->20850
1           1            2  20888->20850
4           4            1  20961->20960
5           4            2  20961->20960
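
As a quick sanity check (a sketch, assuming each StartEnd should end up coming from a single trip), you can confirm that every StartEnd now maps to exactly one line_group:

result = df.drop_duplicates(subset=['StartEnd', 'route_order'], keep='first')
# Every printed value should be 1 if each route is now drawn from one trip only
print(result.groupby('StartEnd')['line_group'].nunique())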

CodePudding user response:

You can use groupby.first():

df.groupby(["route_order", "StartEnd"], as_index=False).first()

output:

   route_order      StartEnd  line_group
0            1  20888->20850           1
1            1  20961->20960           4
2            2  20888->20850           1
3            2  20961->20960           4
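
Note that this result is ordered by route_order, so the two trips are interleaved. If your plotting step expects each trip's points to be contiguous and in route order (an assumption about how you draw the routes), you can sort afterwards:

out = df.groupby(["route_order", "StartEnd"], as_index=False).first()
# Reorder so each trip (line_group) is contiguous, with its points in route order
out = out.sort_values(["line_group", "route_order"]).reset_index(drop=True)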