Home > Software engineering >  How to conditionally filter a Pandas dataframe
How to conditionally filter a Pandas dataframe

Time:10-29

I have a Pandas dataframe that looks like this:

import pandas as pd

df = pd.DataFrame({
'city': ['New York','New York','New York','Los Angeles','Los Angeles','Houston','Houston','Houston'],
'airport': ['LGA', 'EWR', 'JFK', 'LAX', 'BUR', 'IAH', 'HOU', 'EFD'],
'distance': [38, 50, 32, 8, 50, 90, 78, 120]
}
df

    city           airport    distance
0   New York       LGA        38
1   New York       EWR        50
2   New York       JFK        32
3   Los Angeles    LAX        8
4   Los Angeles    BUR        50
5   Houston        IAH        90
6   Houston        HOU        78
7   Houston        EFD        120

I would like to output a separate dataframe based on the following logic:

  1. if the value in the distance column is 40 or less between a given city and associated airport, than keep the row
  2. if, within a given city, there is no distance below 40, then show only the shortest (lowest) distance

The desired dataframe would look like this:

    city           airport    distance
0   New York       LGA        38
1   New York       JFK        32
3   Los Angeles    LAX        8
4   Houston        HOU        78      <-- this is returned, even though it's more than 40

How would I do this?

Thanks!

CodePudding user response:

So in your case do with drop_duplicates then combine_first

out = df.sort_values('distance').drop_duplicates('city').combine_first(df.loc[df['distance']<40])
Out[228]: 
         city airport  distance
0     NewYork     LGA        38
2     NewYork     JFK        32
3  LosAngeles     LAX         8
6     Houston     HOU        78

CodePudding user response:

Another possible solution, which is based on the following ideas:

  1. Create a dataframe that only contains rows where distance is lower or equal to 40.

  2. Create another dataframe whose rows correspond to the minimum of distance per group of cities.

  3. Concatenate the above two dataframes.

  4. Remove the duplicates.

(pd.concat([tdf.loc[tdf.distance.le(40)], 
            tdf.iloc[tdf.groupby('city')['distance'].idxmin()]])
 .drop_duplicates()
)

Output:

          city airport  distance
0     New York     LGA        38
2     New York     JFK        32
3  Los Angeles     LAX         8
6      Houston     HOU        78
  • Related