Home > other >  In pandas, what would be the idiomatic way to select multiple columns based on different patterns?
In pandas, what would be the idiomatic way to select multiple columns based on different patterns?

Time:09-09

I'm trying to replicate some of R4DS's dplyr exercises using Python's pandas, with the nycflights13.flights dataset. What I want to do is select, from that dataset:

  • Columns through year to day (inclusive);
  • All columns that end with "delay";
  • The distance and air_time columns

In the book, Hadley uses the following syntax:

library("tidyverse")
library("nycflights13")

flights_sml <- select(flights,
   year:day,
   ends_with("delay"),
   distance,
   air_time
)

In pandas, I came up with the following "solution":

import pandas as pd
from nycflights13 import flights

flights_sml = pd.concat([
    flights.loc[:, 'year':'day'],
    flights.loc[:, flights.columns.str.endswith("delay")],
    flights.distance,
    flights.air_time,
], axis=1)

Another possible implementation:

flights_sml = flights.filter(regex='year|day|month|delay$|^distance$|^air_time$', axis=1)

But I'm sure this is not the idiomatic way to write such DF-operation. I digged around, but haven't found something that fits in this situation from pandas API.

CodePudding user response:

You are correct. This will create multiple dataframes/series and then concatenate them together, resulting in a lot of extra work. Instead, you can create a list of the columns you want to use and then simply select those.

For example (keeping the same column order):

cols = ['year', 'month', 'day']   [col for col in flights.columns if col.endswith('delay')]   ['distance', 'air_time']
flights_sml = flights[cols]

CodePudding user response:

According to the dataset's columns info, we can utilize str.contains

df.loc[:, df.columns.str.contains('year|month|day|delay|distance|air_time')]
   year  month  day  dep_delay  arr_delay  air_time  distance
0  2013      1    1        2.0       11.0     227.0      1400
1  2013      1    1        4.0       20.0     227.0      1416
2  2013      1    1        2.0       33.0     160.0      1089
3  2013      1    1       -1.0      -18.0     183.0      1576
4  2013      1    1       -6.0      -25.0     116.0       762

CodePudding user response:

One option is with select_columns from pyjanitor:

# pip install pyjanitor
import pandas as pd 
import janitor
from nycflights13 import flights

flights_sml = flights.select_columns(
                          slice('year', 'day'), 
                          '*delay', 
                          'distance', 
                          'air_time'
                       )

flights_sml.head()
   year  month  day  dep_delay  arr_delay  distance  air_time
0  2013      1    1        2.0       11.0      1400     227.0
1  2013      1    1        4.0       20.0      1416     227.0
2  2013      1    1        2.0       33.0      1089     160.0
3  2013      1    1       -1.0      -18.0      1576     183.0
4  2013      1    1       -6.0      -25.0       762     116.0
  • Related