I'm trying to replicate some of R4DS's dplyr
exercises using Python's pandas
, with the nycflights13.flights
dataset. What I want to do is select, from that dataset:
- Columns through
year
today
(inclusive); - All columns that end with "delay";
- The
distance
andair_time
columns
In the book, Hadley uses the following syntax:
library("tidyverse")
library("nycflights13")
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
In pandas, I came up with the following "solution":
import pandas as pd
from nycflights13 import flights
flights_sml = pd.concat([
flights.loc[:, 'year':'day'],
flights.loc[:, flights.columns.str.endswith("delay")],
flights.distance,
flights.air_time,
], axis=1)
Another possible implementation:
flights_sml = flights.filter(regex='year|day|month|delay$|^distance$|^air_time$', axis=1)
But I'm sure this is not the idiomatic way to write such DF-operation. I digged around, but haven't found something that fits in this situation from pandas
API.
CodePudding user response:
You are correct. This will create multiple dataframes/series and then concatenate them together, resulting in a lot of extra work. Instead, you can create a list of the columns you want to use and then simply select those.
For example (keeping the same column order):
cols = ['year', 'month', 'day'] [col for col in flights.columns if col.endswith('delay')] ['distance', 'air_time']
flights_sml = flights[cols]
CodePudding user response:
According to the dataset's columns info, we can utilize str.contains
df.loc[:, df.columns.str.contains('year|month|day|delay|distance|air_time')]
year month day dep_delay arr_delay air_time distance
0 2013 1 1 2.0 11.0 227.0 1400
1 2013 1 1 4.0 20.0 227.0 1416
2 2013 1 1 2.0 33.0 160.0 1089
3 2013 1 1 -1.0 -18.0 183.0 1576
4 2013 1 1 -6.0 -25.0 116.0 762
CodePudding user response:
One option is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
from nycflights13 import flights
flights_sml = flights.select_columns(
slice('year', 'day'),
'*delay',
'distance',
'air_time'
)
flights_sml.head()
year month day dep_delay arr_delay distance air_time
0 2013 1 1 2.0 11.0 1400 227.0
1 2013 1 1 4.0 20.0 1416 227.0
2 2013 1 1 2.0 33.0 1089 160.0
3 2013 1 1 -1.0 -18.0 1576 183.0
4 2013 1 1 -6.0 -25.0 762 116.0