I have a Pandas dataframe that looks like this:
import pandas as pd
df = pd.DataFrame({
'city': ['New York','New York','New York','Los Angeles','Los Angeles','Houston','Houston','Houston'],
'airport': ['LGA', 'EWR', 'JFK', 'LAX', 'BUR', 'IAH', 'HOU', 'EFD'],
'distance': [38, 50, 32, 8, 50, 90, 78, 120]
}
df
city airport distance
0 New York LGA 38
1 New York EWR 50
2 New York JFK 32
3 Los Angeles LAX 8
4 Los Angeles BUR 50
5 Houston IAH 90
6 Houston HOU 78
7 Houston EFD 120
I would like to output a separate dataframe based on the following logic:
- if the value in the
distance
column is 40 or less between a given city and associated airport, than keep the row - if, within a given city, there is no distance below 40, then show only the shortest (lowest) distance
The desired dataframe would look like this:
city airport distance
0 New York LGA 38
1 New York JFK 32
3 Los Angeles LAX 8
4 Houston HOU 78 <-- this is returned, even though it's more than 40
How would I do this?
Thanks!
CodePudding user response:
So in your case do with drop_duplicates
then combine_first
out = df.sort_values('distance').drop_duplicates('city').combine_first(df.loc[df['distance']<40])
Out[228]:
city airport distance
0 NewYork LGA 38
2 NewYork JFK 32
3 LosAngeles LAX 8
6 Houston HOU 78
CodePudding user response:
Another possible solution, which is based on the following ideas:
Create a dataframe that only contains rows where
distance
is lower or equal to 40.Create another dataframe whose rows correspond to the minimum of
distance
per group ofcities
.Concatenate the above two dataframes.
Remove the duplicates.
(pd.concat([tdf.loc[tdf.distance.le(40)],
tdf.iloc[tdf.groupby('city')['distance'].idxmin()]])
.drop_duplicates()
)
Output:
city airport distance
0 New York LGA 38
2 New York JFK 32
3 Los Angeles LAX 8
6 Houston HOU 78