Extract time from string in dataframe's rows into new columns-CodePudding

I have a a dataframe where each lines contains text, and the start of this text is a time that is in the format of 00:00:00-00:00:01. I would like to extract the starting time, the ending time but also the difference between those time into 3 columns of my df.

Here is a sample of the dataframe :

text
10:30:00-12:30:00 some random text(some special caracters [-'(@ ) some numbers 456231 386
15:35:10-15:36:12 some other text some numbers 9875321651132

Here is the expected result

text	start time	end time	delta
10:30:00-12:30:00 some random text(some special caracters [-'(@ ) some numbers 456231 386	10:30:00	12:30:00	02:00:00
15:35:10-15:36:12 some other text some numbers 9875321651132	15:35:10	15:36:12	00:01:02

I think using str.extract like this would do the trick but can't get the matching as I would like, where the pattern match a 00:00:00 format

df['start time'] = df['text'].str.extract(r'( ^Pattern)')  # ^ for the begining of the string where the start time is.
df['end time'] = df['text'].str.extract(r'( -Patern)')     
#the - because the end time is between a "-" so the pattern for the end time would have to start with "-" to get it right

And for the delta, I was basically thinking about converting the start and end time strings to datetime and doing :

df['delta'] = df['end time']- df['start time']

CodePudding user response：

For start_time and end_time, you can do:

df[['start_time','end_time']] = df['text'].str.extract('(\d{2}:\d{2}:\d{2})-(\d{2}:\d{2}:\d{2})')

Then for delta, you can convert the strings into Timedelta, get the difference, and convert to string:

df['delta'] = (pd.to_timedelta(df['end_time'])
    .sub(pd.to_timedelta(df['start_time']))
)

CodePudding user response：

similar to Quang Hoang solution, but if you need to keep the difference in hours, minute and seconds then you extract only time part

# assuming- is inbetween to times and there is no space in b/w them,
# extracting two times, one prior to hyphen, another prior to whitespace

df[['start','end']]=df['text'].str.extract(r'(.*?)-(.*?)\s')

#takea difference and extract a part after the days
#timedelta always return days component
df['difference']=(pd.to_timedelta(df['end'])    .sub(pd.to_timedelta(df['start']))).astype(str).str.extract(r'days (.*)')
df

    text                                                start       end         difference
0   10:30:00-12:30:00 some random text(some specia...   10:30:00    12:30:00    02:00:00
1   15:35:10-15:36:12 some other text some numbers...   15:35:10    15:36:12    00:01:02

CodePudding user response：

Here my main logic is to 3d list and convert that list to dataframe.

to get the list, I am passing first loop to read each row strings and gives me the date only, so after loop I get first row list something like [[Timedelta('0 days 10:30:00'), imedelta('0 days 12:30:00')].

Then running loop where I am getting difference between end and start date.

Lastly, converting to the dataframe.

Code:

df1 = pd.DataFrame([i [i[1]-i[0]] for i in [[pd.to_timedelta(x) for x in re.sub('[^0123456789\:]',r' ', s ).split(' ')[:2]] for s in df['text']]])
df1.columns = ['strt','end','delta']

Output:

              strt              end           delta
0   0 days 10:30:00 0 days 12:30:00 0 days 02:00:00
1   0 days 15:35:10 0 days 15:36:12 0 days 00:01:02