I have a column with strings such as: Posted: 1 day ago
, Posted: 2 days ago
. I want to convert this column to a column of dates, i.e.: datetime.date(2021, 12, 22)
, datetime.date(2021, 12, 21)
.
I tried using regex groups combined with df.replace()
to achieve it in a compact operation:
df2 = df.replace({r"Posted: (\d ) days? ago": str(date.today() - timedelta(int(r"\1")))}, regex=True)
but this results in ValueError: invalid literal for int() with base 10: '\\1'
error since int()
evaluates its input not as a reference to the earlier regex group but as a literal string. Merely just obtaining the matched pattern works fine though, either of the following two would work if I only wanted to preserve the numerical value in the column, instead of translating it to datetime objects:
df2 = df.replace({r"Posted: (\d ) days? ago": "\g<1>"}, regex=True)
df2 = df.replace({r"Posted: (\d ) days? ago": r"\1"}, regex=True)
How can I obtain the referenced regex value to pass it on to timedelta()
?
Full code:
import pandas as pd
from datetime import date, timedelta
df = pd.DataFrame(
[['Posted: 1 day ago', 'xa01332cs', 101],
['Posted: 2 days ago', 'd11as99101', 630],
['Posted: 11 days ago', '12011rww1a', 301]
],
columns = ['Date', 'Code', 'Value']
)
def preprocess(df):
#df2 = df.replace({r"Posted: (\d ) days? ago": "\g<1>"}, regex=True) # this works
#df2 = df.replace({r"Posted: (\d ) days? ago": r"\1"}, regex=True) # this works identically to previous row
df2 = df.replace({r"Posted: (\d ) days? ago": str(date.today() - timedelta(int(r"\1")))}, regex=True)
return df2
preprocess(df)
CodePudding user response:
You can't use date - timedelta
, but you can use datetime - timedelta
:
from datetime import datetime, timedelta
df['Date'] = datetime.datetime.today() - df.Date.str.extract('Posted: (\d ) days? ago')[0].astype(int).apply(timedelta)
Output:
>>> df
Date Code Value
0 2021-12-22 08:33:03.396630 xa01332cs 101
1 2021-12-21 08:33:03.396630 d11as99101 630
2 2021-12-12 08:33:03.396630 12011rww1a 301
CodePudding user response:
You can extract the numbers, convert it to timedelta, then subtract:
df['New Date'] = datetime.datetime.today() - df['Date'].str.extract(r"Posted: (\d ) days? ago").astype(int) * pd.Timedelta('1D')
Output:
Date Code Value New Date
0 Posted: 1 day ago xa01332cs 101 2021-12-22 10:36:13.361973
1 Posted: 2 days ago d11as99101 630 2021-12-21 10:36:13.361973
2 Posted: 11 days ago 12011rww1a 301 2021-12-12 10:36:13.361973