In a CSV file, there is a column with a date (month/day/year) and two columns with times (hour:minute). One time column is the start time and the other is the end time. All columns are read as objects (strings), not as datetime. Some values in the time columns are over 23:59, and when they are, the format is hour:minute:second (from what I've seen so far). Ex: 24:50:00, 25:35:00, etc. How would I parse the time columns? I'm getting an error message and I think it's because the time is over the usual limit. Also, for the date column I'm told that if the start time exceeds 23:59, the date rolls forward by however much the time is over the limit. Ex: a date of 1/1/2000 with a start time of 24:50:00 (hour:minute:second) is 1/2/2000 with time 0:50 (hour:minute). Do I create a new column and merge the two, and if so, how? And what should I do for the end time column?
When reading the csv file, I tried to parse the time series with parse_dates:
time_parser = lambda x: pd.datetime.strptime(x, '%H:%M')
df = pd.read_csv('data.csv', parse_dates = ['StartTime'], date_parser = time_parser)
But I get an error message that says something like: "25:39 does not match format %H:%M". I'm not sure if the parser just ignores the extra :00 (seconds) mentioned above, but I think the problem is that the time exceeds 23:59. How should I approach this?
Answer:
Parse the date to datetime, parse the time to timedelta, and add the two together. Note that to_timedelta expects a certain input format (HH:MM:SS), which in your case can be enforced by appending :00 to the HH:MM strings. Ex:
import pandas as pd

df = pd.DataFrame({"date": ["1/1/2000", "1/1/2000", "1/1/2000"],
                   "time": ["23:59", "24:50", "25:30"]})

# date -> datetime, time -> timedelta (HH:MM:SS); the sum rolls over past midnight
df["datetime"] = (
    pd.to_datetime(df["date"], format="%m/%d/%Y")
    + pd.to_timedelta(df["time"] + ":00")
)
print(df)
       date   time            datetime
0  1/1/2000  23:59 2000-01-01 23:59:00
1  1/1/2000  24:50 2000-01-02 00:50:00
2  1/1/2000  25:30 2000-01-02 01:30:00
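
The question mentions that values over 23:59 already carry seconds (24:50:00) while the others are plain hour:minute, and that there is an end time column as well. A minimal sketch of the same datetime + timedelta idea for that mixed case follows. The column names Date, StartTime and EndTime, the sample values, and the helper to_offset are assumptions for illustration, not taken from the actual file.

```python
import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
df = pd.DataFrame({
    "Date": ["1/1/2000", "1/1/2000", "1/1/2000"],
    "StartTime": ["23:10", "24:50:00", "25:35:00"],
    "EndTime": ["23:59", "25:10:00", "26:05:00"],
})

def to_offset(times: pd.Series) -> pd.Series:
    """Convert 'HH:MM' or 'HH:MM:SS' strings to timedeltas."""
    # Append ':00' only where the value has a single colon (HH:MM).
    needs_seconds = times.str.count(":") == 1
    return pd.to_timedelta(times.where(~needs_seconds, times + ":00"))

date = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Adding the offset to the date moves times past 23:59 onto the next day.
df["Start"] = date + to_offset(df["StartTime"])
df["End"] = date + to_offset(df["EndTime"])
print(df[["Start", "End"]])
```

Both time columns are treated the same way here: each is an offset from midnight of the date in the same row, so the end time needs no special handling. When reading from the CSV, the time columns can be left as strings (no parse_dates) and converted afterwards like this.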