Home > Back-end >  Pandas df.replace with regex group
Pandas df.replace with regex group

Time:12-24

I have a column with strings such as: Posted: 1 day ago, Posted: 2 days ago. I want to convert this column to a column of dates, i.e.: datetime.date(2021, 12, 22), datetime.date(2021, 12, 21).

I tried using regex groups combined with df.replace() to achieve it in a compact operation:

df2 = df.replace({r"Posted: (\d ) days? ago": str(date.today() - timedelta(int(r"\1")))}, regex=True)

but this results in ValueError: invalid literal for int() with base 10: '\\1' error since int() evaluates its input not as a reference to the earlier regex group but as a literal string. Merely just obtaining the matched pattern works fine though, either of the following two would work if I only wanted to preserve the numerical value in the column, instead of translating it to datetime objects:

df2 = df.replace({r"Posted: (\d ) days? ago": "\g<1>"}, regex=True)

df2 = df.replace({r"Posted: (\d ) days? ago": r"\1"}, regex=True)

How can I obtain the referenced regex value to pass it on to timedelta()?

Full code:

import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    [['Posted: 1 day ago', 'xa01332cs', 101],
     ['Posted: 2 days ago', 'd11as99101', 630],
     ['Posted: 11 days ago', '12011rww1a', 301]
    ],
    columns = ['Date', 'Code', 'Value']
)

def preprocess(df):
    
    #df2 = df.replace({r"Posted: (\d ) days? ago": "\g<1>"}, regex=True)     # this works
    #df2 = df.replace({r"Posted: (\d ) days? ago": r"\1"}, regex=True)       # this works identically to previous row
    df2 = df.replace({r"Posted: (\d ) days? ago": str(date.today() - timedelta(int(r"\1")))}, regex=True)
    return df2

preprocess(df)

CodePudding user response:

You can't use date - timedelta, but you can use datetime - timedelta:

from datetime import datetime, timedelta

df['Date'] = datetime.datetime.today() - df.Date.str.extract('Posted: (\d ) days? ago')[0].astype(int).apply(timedelta)

Output:

>>> df
                        Date        Code  Value
0 2021-12-22 08:33:03.396630   xa01332cs    101
1 2021-12-21 08:33:03.396630  d11as99101    630
2 2021-12-12 08:33:03.396630  12011rww1a    301

CodePudding user response:

You can extract the numbers, convert it to timedelta, then subtract:

df['New Date'] = datetime.datetime.today() - df['Date'].str.extract(r"Posted: (\d ) days? ago").astype(int) * pd.Timedelta('1D')

Output:

                  Date        Code  Value                   New Date
0    Posted: 1 day ago   xa01332cs    101 2021-12-22 10:36:13.361973
1   Posted: 2 days ago  d11as99101    630 2021-12-21 10:36:13.361973
2  Posted: 11 days ago  12011rww1a    301 2021-12-12 10:36:13.361973
  • Related