I'm trying to capture and replace by regex in a DataFrame column which contains a date, i want capture the following date format in the text "YYYY-MM-DD" it seems my syntax for capturing and replace is correct but some how it doesn't work.
lst_date_version = ["2021-10-10 rev. 002", "2021-11-28 rev. 003", "2021-09-27 rev. 008","2021-11-29 rev. 008", "2021-10-16 rev. 003", "2021-10-25 rev. 008","2021-11-03 rev. 003", "2021-04-12 rev. 008", "2021-03-19 rev. 004"]
df_test_date = pd.DataFrame({"Version":lst_date_version})
df_test_date["Version"] = df_test_date["Version"].str.replace(r"(\d{4}-\d{2})-(\d{2})", r"\1", regex=True)
print(df_test_date["Version"])
the result seems the day in the date format is remove (2021-10-10 rev. 002 ==> 2021-10 rev. 002):
0 2021-10 rev. 002
1 2021-11 rev. 003
2 2021-09 rev. 008
3 2021-11 rev. 008
4 2021-10 rev. 003
5 2021-10 rev. 008
6 2021-11 rev. 003
7 2021-04 rev. 008
8 2021-03 rev. 004
but when i do the following :
df_test_date["Version"] = df_test_date["Version"].str.replace(r"(\d{4}-\d{2})-(\d{2})", r"\0", regex=True)
print(df_test_date["Version"])
the result is :
0 rev. 002
1 rev. 003
2 rev. 008
3 rev. 008
4 rev. 003
5 rev. 008
6 rev. 003
7 rev. 008
8 rev. 004
in the meantime i find different way (invert the capture) to what i wanted to achieve by this:
df_test_date["Version"] = df_test_date["Version"].str.replace(r"(\srev. )", r"\0", regex=True))
a big thanks for your help in advance :)
PS: adapted the questions based on remarks :)
CodePudding user response:
Your code fails as (\d{4}-\d{2})-(\d{2})
is never matching.
You could use str.split
with n=1
:
df_test_date['Version'] = df_test_date['Version'].str.split(n=1).str[1]
else, change your regex to \d{4}-\d{2}-\d{2}\s*
:
df_test_date['Version'] = df_test_date["Version"].str.replace(r"\d{4}-\d{2}-\d{2}\s*",
'', regex=True)
CodePudding user response:
I would use str.extract
here:
df_test_date["Version"] = df_test_date["Version"].str.extract(r'^(\d{4}-\d{2}-\d{2})')
If you want to use str.replace
, then use a pattern which matches the entire input:
df_test_date["Version"] = df_test_date["Version"].str.replace(r'^(\d{4}-\d{2}-\d{2}) rev\. \d $', r'\1')