Home > Net >  how to capture and replace with regex in data frame column in python pandas
how to capture and replace with regex in data frame column in python pandas

Time:03-08

I'm trying to capture and replace by regex in a DataFrame column which contains a date, i want capture the following date format in the text "YYYY-MM-DD" it seems my syntax for capturing and replace is correct but some how it doesn't work.

lst_date_version = ["2021-10-10 rev. 002", "2021-11-28 rev. 003", "2021-09-27 rev. 008","2021-11-29 rev. 008", "2021-10-16 rev. 003", "2021-10-25 rev. 008","2021-11-03 rev. 003", "2021-04-12 rev. 008", "2021-03-19 rev. 004"]
df_test_date = pd.DataFrame({"Version":lst_date_version})
df_test_date["Version"] = df_test_date["Version"].str.replace(r"(\d{4}-\d{2})-(\d{2})", r"\1", regex=True)
print(df_test_date["Version"])

the result seems the day in the date format is remove (2021-10-10 rev. 002 ==> 2021-10 rev. 002):

0    2021-10 rev. 002
1    2021-11 rev. 003
2    2021-09 rev. 008
3    2021-11 rev. 008
4    2021-10 rev. 003
5    2021-10 rev. 008
6    2021-11 rev. 003
7    2021-04 rev. 008
8    2021-03 rev. 004

but when i do the following :

 df_test_date["Version"] = df_test_date["Version"].str.replace(r"(\d{4}-\d{2})-(\d{2})", r"\0", regex=True)
    print(df_test_date["Version"])

the result is :

0     rev. 002
1     rev. 003
2     rev. 008
3     rev. 008
4     rev. 003
5     rev. 008
6     rev. 003
7     rev. 008
8     rev. 004

in the meantime i find different way (invert the capture) to what i wanted to achieve by this:

 df_test_date["Version"] = df_test_date["Version"].str.replace(r"(\srev. )", r"\0", regex=True))

a big thanks for your help in advance :)

PS: adapted the questions based on remarks :)

CodePudding user response:

Your code fails as (\d{4}-\d{2})-(\d{2}) is never matching.

You could use str.split with n=1:

df_test_date['Version'] = df_test_date['Version'].str.split(n=1).str[1]

else, change your regex to \d{4}-\d{2}-\d{2}\s*:

df_test_date['Version'] = df_test_date["Version"].str.replace(r"\d{4}-\d{2}-\d{2}\s*",
                                    '', regex=True)

CodePudding user response:

I would use str.extract here:

df_test_date["Version"] = df_test_date["Version"].str.extract(r'^(\d{4}-\d{2}-\d{2})')

If you want to use str.replace, then use a pattern which matches the entire input:

df_test_date["Version"] = df_test_date["Version"].str.replace(r'^(\d{4}-\d{2}-\d{2}) rev\. \d $', r'\1')
  • Related