I have a a dataframe where each lines contains text, and the start of this text is a time that is in the format of 00:00:00-00:00:01. I would like to extract the starting time, the ending time but also the difference between those time into 3 columns of my df.
Here is a sample of the dataframe :
text |
---|
10:30:00-12:30:00 some random text(some special caracters [-'(@ ) some numbers 456231 386 |
15:35:10-15:36:12 some other text some numbers 9875321651132 |
Here is the expected result
text | start time | end time | delta |
---|---|---|---|
10:30:00-12:30:00 some random text(some special caracters [-'(@ ) some numbers 456231 386 | 10:30:00 | 12:30:00 | 02:00:00 |
15:35:10-15:36:12 some other text some numbers 9875321651132 | 15:35:10 | 15:36:12 | 00:01:02 |
I think using str.extract like this would do the trick but can't get the matching as I would like, where the pattern match a 00:00:00 format
df['start time'] = df['text'].str.extract(r'( ^Pattern)') # ^ for the begining of the string where the start time is.
df['end time'] = df['text'].str.extract(r'( -Patern)')
#the - because the end time is between a "-" so the pattern for the end time would have to start with "-" to get it right
And for the delta, I was basically thinking about converting the start and end time strings to datetime and doing :
df['delta'] = df['end time']- df['start time']
CodePudding user response:
For start_time
and end_time
, you can do:
df[['start_time','end_time']] = df['text'].str.extract('(\d{2}:\d{2}:\d{2})-(\d{2}:\d{2}:\d{2})')
Then for delta, you can convert the strings into Timedelta
, get the difference, and convert to string:
df['delta'] = (pd.to_timedelta(df['end_time'])
.sub(pd.to_timedelta(df['start_time']))
)
CodePudding user response:
similar to Quang Hoang solution, but if you need to keep the difference in hours, minute and seconds then you extract only time part
# assuming- is inbetween to times and there is no space in b/w them,
# extracting two times, one prior to hyphen, another prior to whitespace
df[['start','end']]=df['text'].str.extract(r'(.*?)-(.*?)\s')
#takea difference and extract a part after the days
#timedelta always return days component
df['difference']=(pd.to_timedelta(df['end']) .sub(pd.to_timedelta(df['start']))).astype(str).str.extract(r'days (.*)')
df
text start end difference
0 10:30:00-12:30:00 some random text(some specia... 10:30:00 12:30:00 02:00:00
1 15:35:10-15:36:12 some other text some numbers... 15:35:10 15:36:12 00:01:02
CodePudding user response:
Here my main logic is to 3d list and convert that list to dataframe.
to get the list, I am passing first loop to read each row strings and gives me the date only, so after loop I get first row list something like
[[Timedelta('0 days 10:30:00'), imedelta('0 days 12:30:00')]
.
Then running loop where I am getting difference between end and start date.
Lastly, converting to the dataframe.
Code:
df1 = pd.DataFrame([i [i[1]-i[0]] for i in [[pd.to_timedelta(x) for x in re.sub('[^0123456789\:]',r' ', s ).split(' ')[:2]] for s in df['text']]])
df1.columns = ['strt','end','delta']
Output:
strt end delta
0 0 days 10:30:00 0 days 12:30:00 0 days 02:00:00
1 0 days 15:35:10 0 days 15:36:12 0 days 00:01:02