I have a dataframe (df) that contains datapoints along rivers, where each river basin (Basin ID) contains multiple rivers (River ID), and each river has points along its length (Length). It looks something like this (simplified):
Basin_ID | River_ID | Length |
---|---|---|
1 | 1 | 5 |
1 | 1 | 10 |
1 | 2 | 5 |
1 | 2 | 7 |
1 | 2 | 12 |
1 | 3 | 5 |
2 | 1 | 5 |
2 | 1 | 10 |
2 | 1 | 12 |
2 | 1 | 14 |
2 | 2 | 5 |
In this example Basin 1 has three rivers, and Basin 2 has two rivers. The actual table has 600K rows and 12 other columns with other river data.
Objective: I want to extract the single longest river subset for each unique basin in df, so that I end up with something like this:
Basin_ID | River_ID | Length |
---|---|---|
1 | 2 | 5 |
1 | 2 | 7 |
1 | 2 | 12 |
2 | 1 | 5 |
2 | 1 | 10 |
2 | 1 | 12 |
2 | 1 | 14 |
I assume that I would need to do something like
res = df.groupby('Basin_ID').apply(lambda x: ...
But because I need to group both by Basin_ID and River_ID, I am finding it hard to come up with an expression that works as it should. Any advice welcome, thanks!
CodePudding user response:
df2=df.assign(maxRiverLength=df.groupby('Basin_ID').transform(lambda x: x.max())['Length']).set_index(['Basin_ID','River_ID'])
df.set_index(['Basin_ID','River_ID']).loc[df2[df2['Length']==df2['maxRiverLength']].index].reset_index()
There might be a smarter way, but I reproduced your output in 2 steps:
I first create a new column containing the maxRiverLength per Basin_ID and assign this to a df2 where I set the index to Basin_ID and River_ID:
I then take the original df, set the index to Basin_ID and River_ID as well and filter by the indexes of df2 having 'Length' = 'maxRiverLength'