I have a dataframe with a single column named ImagePath, and this is what one entry looks like:
Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12_23-53-18_Color_ID_511.jpg
This is what I want to extract:
2022-10-12_23-53-18
This is what I have:
dfDate = dfColor['ImagePath'].str.extract(r"^(?:.*?_){3}([^\nColor] )").to_string()
dfDate = pd.to_datetime(dfDate, format='%Y-%m-%d_%H-%M-%S_')
I do not want to depend on the word "Color", as I do right now, because in the future any other word might appear there. I would like to rely on the "_" and take everything between the 3rd and 5th underscore.
CodePudding user response:
Another option, with str.split:
dfDate = dfColor['ImagePath'].str.split('_').str[3:5].str.join('_')
dfDate = pd.to_datetime(dfDate, format='%Y-%m-%d_%H-%M-%S')
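For reference, the split-based approach can be tried end to end with a minimal sketch like the one below. The one-row dataframe is built from the sample path in the question, and the intermediate names parts and stamp are only illustrative.

```python
import pandas as pd

# One-row frame built from the sample path in the question.
dfColor = pd.DataFrame({
    "ImagePath": [
        "Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12_23-53-18_Color_ID_511.jpg",
    ]
})

# Split on "_" into a list per row, keep items 3 and 4 (date and time),
# then glue them back together with "_".
parts = dfColor["ImagePath"].str.split("_")
stamp = parts.str[3:5].str.join("_")

# Parse the joined string with the same format as in the answer above.
dates = pd.to_datetime(stamp, format="%Y-%m-%d_%H-%M-%S")
print(dates)
```

Note that the positions 3 and 4 are counted from the start of the whole path, so this assumes the directory part contains no underscores.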
CodePudding user response:
Change the regex to
^(?:[^_]*_){3}([^_]*_[^_]*)
See the regex demo.
Details:
^ - start of string
(?:[^_]*_){3} - substring up to (and including) the third _
([^_]*_[^_]*) - Group 1 (the actual output of .extract()): zero or more non-underscore chars, _, and again zero or more non-underscore chars.
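As a rough sketch of how this pattern plugs into pandas, the snippet below runs it through .str.extract() and then parses the result. The second path, ending in "Depth", is a made-up example added only to show that the match does not depend on the word "Color".

```python
import pandas as pd

paths = pd.Series([
    "Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12_23-53-18_Color_ID_511.jpg",
    # Hypothetical path with a different word after the timestamp.
    "Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-13_00-01-02_Depth_ID_7.jpg",
])

# Skip three underscore-terminated chunks, then capture two chunks joined by "_".
pattern = r"^(?:[^_]*_){3}([^_]*_[^_]*)"

# expand=False returns a Series because the pattern has a single group.
stamp = paths.str.extract(pattern, expand=False)
dates = pd.to_datetime(stamp, format="%Y-%m-%d_%H-%M-%S")
print(dates)
```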
A variation of the solution: you may require the group pattern to match a datetime-like string:
^(?:[^_]*_){3}(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})
Or, allowing single digit hours/minutes/seconds/months/days:
^(?:[^_]*_){3}(\d{4}-\d{1,2}-\d{1,2}_\d{1,2}-\d{1,2}-\d{1,2})
See this regex demo.
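To illustrate the difference between the generic group and the datetime-like group, here is a small sketch using only the standard re module. The path without a timestamp and the helper first_group are hypothetical and exist only for this comparison.

```python
import re

# Generic version: any two underscore-separated chunks after the third "_".
LENIENT = re.compile(r"^(?:[^_]*_){3}([^_]*_[^_]*)")
# Stricter version: the captured part must look like YYYY-MM-DD_HH-MM-SS.
STRICT = re.compile(r"^(?:[^_]*_){3}(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})")

good = "Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12_23-53-18_Color_ID_511.jpg"
# Hypothetical path where the timestamp is missing.
bad = "Images/LINE 58/AGSD/LUC/L58_AFND_K3_Color_ID_511.jpg"


def first_group(pattern, text):
    """Return group 1 of the match, or None when the pattern does not match."""
    m = pattern.match(text)
    return m.group(1) if m else None


for name, pattern in (("lenient", LENIENT), ("strict", STRICT)):
    print(name, first_group(pattern, good), first_group(pattern, bad))
```

With the generic group, the path without a timestamp still yields a value (here "Color_ID"), which pd.to_datetime would then fail to parse with the format used above. With the stricter group there is simply no match, which .str.extract() reports as NaN.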
CodePudding user response:
Here is one way to do it:
# assuming four digits (the year) follow an underscore,
# keep matching until an underscore followed by a non-digit
_(\d{4}-.*?(?=_\D))
# _ : matches an underscore
# (\d{4}- : start of the group, 4 digits followed by -
# .*? : non-greedy, matches any characters
# (?=_\D)) : positive lookahead for an underscore followed by a non-digit, which terminates the match
# raw string, so that \d and \D are not treated as string escapes
df['date']=df['url'].str.extract(r'_(\d{4}-.*?(?=_\D))')
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d_%H-%M-%S')
df
                                                 url                date
0  Images/LINE 58/AGSD/LUC/L58_AFND_K3_2022-10-12... 2022-10-12 23:53:18