Extract date and time using Regex and then convert it to datetime

Time:10-17

I have a dataframe with a single column named ImagePath, and this is what one entry looks like:

Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12_23-53-18_Color_ID_511.jpg

This is what I want to extract:

2022-10-12_23-53-18

This is what I have:

dfDate = dfColor['ImagePath'].str.extract(r"^(?:.*?_){3}([^\nColor] )").to_string()
dfDate = pd.to_datetime(dfDate, format='%Y-%m-%d_%H-%M-%S_')

I do not want to depend on the word "Color", as I do right now, because in the future a different word might appear there. I would like to rely on the "_" separators and take everything between the 3rd and 5th underscore.

CodePudding user response:

Another option, with str.split:

dfDate = dfColor['ImagePath'].str.split('_').str[3:5].str.join('_')
dfDate = pd.to_datetime(dfDate, format='%Y-%m-%d_%H-%M-%S')
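To see the split approach end to end, here is a minimal sketch. It assumes a hypothetical one-row frame built from the sample path in the question; the names tokens and stamp are introduced only for illustration.

```python
import pandas as pd

# Hypothetical one-row frame built from the sample path in the question.
dfColor = pd.DataFrame({
    "ImagePath": [
        "Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12_23-53-18_Color_ID_511.jpg"
    ]
})

# Split on "_" and keep tokens 3 and 4 (0-based): the date and the time.
tokens = dfColor["ImagePath"].str.split("_")
stamp = tokens.str[3:5].str.join("_")

# Parse the rejoined "date_time" string into a datetime64 Series.
dfDate = pd.to_datetime(stamp, format="%Y-%m-%d_%H-%M-%S")
print(dfDate.iloc[0])
```

Note that the token positions count every underscore in the full path, so a directory name containing "_" would shift them.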

CodePudding user response:

Change the regex to

^(?:[^_]*_){3}([^_]*_[^_]*)

Details:

  • ^ - start of string
  • (?:[^_]*_){3} - substring till (and including) the third _
  • ([^_]*_[^_]*) - Group 1 (the actual output of .extract()): zero or more non-underscore chars, _ and again zero or more non-underscore chars.
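The pattern described above can be plugged into str.extract roughly as follows. This is a sketch, assuming the same single sample path as in the question; passing expand=False is a choice made here so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical one-row frame built from the sample path in the question.
dfColor = pd.DataFrame({
    "ImagePath": [
        "Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12_23-53-18_Color_ID_511.jpg"
    ]
})

# Skip three "_"-terminated chunks, then capture two chunks joined by "_".
pattern = r"^(?:[^_]*_){3}([^_]*_[^_]*)"

# expand=False returns a Series when the pattern has a single group.
stamp = dfColor["ImagePath"].str.extract(pattern, expand=False)
dfDate = pd.to_datetime(stamp, format="%Y-%m-%d_%H-%M-%S")
print(dfDate.iloc[0])
```

Because the format string has no trailing underscore, it matches the captured text exactly.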

A variation of the solution: you may require the group pattern to match a datetime-like string:

^(?:[^_]*_){3}(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})

Or, allowing single digit hours/minutes/seconds/months/days:

^(?:[^_]*_){3}(\d{4}-\d{1,2}-\d{1,2}_\d{1,2}-\d{1,2}-\d{1,2})
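One way to see what the stricter group buys you is to try it outside pandas with the standard re module. The sketch below uses the two-digit variant; the helper parse_stamp and the second, timestamp-free path are hypothetical and exist only for illustration.

```python
import re
from datetime import datetime

# Strict variant: the group must look like YYYY-MM-DD_HH-MM-SS.
STRICT = re.compile(r"^(?:[^_]*_){3}(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})")


def parse_stamp(path):
    """Return the datetime found after the third underscore, or None."""
    match = STRICT.search(path)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d_%H-%M-%S")


good = parse_stamp(
    "Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12_23-53-18_Color_ID_511.jpg"
)
# Hypothetical path with no timestamp after the third underscore.
bad = parse_stamp("Images/LINE 58/AGSD/LUC/L58_AFND_K3_Color_ID_511.jpg")

print(good, bad)
```

With the looser ([^_]*_[^_]*) group, the second path would still yield a capture (Color_ID) that then fails at the datetime conversion; the strict group rejects it at the regex stage. In str.extract, a row that does not match comes back as NaN.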

CodePudding user response:

Here is one way to do it:

# Assumes a 4-digit year follows an underscore; matching continues
# until the next underscore that is followed by a non-digit.

_(\d{4}-.*?(?=_\D))


# _        : matches an underscore
# (\d{4}-  : opens the capture group: 4 digits followed by -
# .*?      : non-greedy, matches any characters
# (?=_\D)) : positive lookahead for an underscore followed by a non-digit,
#            which terminates the match; the final ) closes the capture group
df['date'] = df['url'].str.extract(r'_(\d{4}-.*?(?=_\D))')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d_%H-%M-%S')

df


    url     date
0   Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12...   2022-10-12 23:53:18
