I have two columns, "basecamp_date" and "highpoint_date", in my "expeditions" dataframe. They hold a start date ("basecamp_date") and an end date ("highpoint_date"). I would like to create a new column that expresses the duration between these two dates, but I have no idea how to do it.
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
CodePudding user response:
Convert the columns to datetimes in read_csv with parse_dates, then subtract them and use Series.dt.days to get the difference in days:
file = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv"
expeditions = pd.read_csv(file, parse_dates=['basecamp_date','highpoint_date'])
expeditions['diff'] = expeditions['highpoint_date'].sub(expeditions['basecamp_date']).dt.days
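Some rows may have a missing date, in which case the subtraction yields a missing value and `.dt.days` returns floats with NaN. A minimal sketch of that behaviour, using a small made-up dataframe (the `sample` rows are hypothetical, not taken from the CSV) and an optional cast to the nullable `Int64` dtype:

```python
import pandas as pd

# Hypothetical rows standing in for the real CSV; the last one lacks a basecamp date.
sample = pd.DataFrame({
    "basecamp_date": pd.to_datetime(["1978-04-01", "1979-10-05", None]),
    "highpoint_date": pd.to_datetime(["1978-05-02", "1979-10-20", "1980-05-10"]),
})

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days
# (NaN where either date is missing).
diff = sample["highpoint_date"].sub(sample["basecamp_date"]).dt.days

# Optional: a nullable integer dtype keeps the day counts as integers
# and represents the missing value as <NA>.
sample["diff"] = diff.astype("Int64")
print(sample)
```

Whether to keep the float column or cast it is a matter of taste; the cast only matters if you want integer day counts alongside missing values.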
CodePudding user response:
You can convert those columns to datetime and then subtract them to get the duration:
tstart = pd.to_datetime(expeditions['basecamp_date'])
tend = pd.to_datetime(expeditions['highpoint_date'])
expeditions['duration'] = tend - tstart
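The result of the subtraction is a timedelta column rather than a plain number. If you need a numeric value, here is a small sketch of two common conversions, using made-up timestamps (the values below are hypothetical and only illustrate the difference between whole and fractional days):

```python
import pandas as pd

# Hypothetical timestamps, not taken from the expeditions data.
tstart = pd.to_datetime(pd.Series(["2020-01-01 06:00", "2020-03-10 00:00"]))
tend = pd.to_datetime(pd.Series(["2020-01-03 18:00", "2020-03-12 00:00"]))

duration = tend - tstart                            # timedelta column
whole_days = duration.dt.days                       # days component only
fractional_days = duration / pd.Timedelta(days=1)   # duration as float days

print(pd.DataFrame({
    "duration": duration,
    "whole_days": whole_days,
    "fractional_days": fractional_days,
}))
```

For date-only columns like the ones in the question, both conversions give the same numbers; they differ only when the values carry a time of day.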