Within my dataframe I have two columns: 'release_date' and 'release_year'.
I am trying to replace the year in each 'release_date' value with the corresponding value in 'release_year'.
I have tried the following:
df.loc[:, 'release_date'] = df['release_date'].apply(lambda x: x.replace(x.year == df['release_year']))
However, I am getting the error: 'value must be an integer, received <class 'pandas.core.series.Series'> for year'
Having checked the dtype, the release_date column is stored as datetime64[ns].
CodePudding user response:
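You need to use pandas.DataFrame.apply here rather than pandas.Series.apply, as you need data from another column. In your attempt the lambda only sees one date at a time, and x.year == df['release_year'] is a comparison that produces a whole Series, which is then passed to replace as the year; that is what the error is complaining about. Consider the following simple example: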
import datetime
import pandas as pd
df = pd.DataFrame({'release_date':[datetime.date(1901,1,1),datetime.date(1902,1,1),datetime.date(1903,1,1)],'release_year':[2001,2002,2003]})
df['changed_date'] = df.apply(lambda x:x.release_date.replace(year=x.release_year),axis=1)
print(df)
Output:
  release_date  release_year changed_date
0   1901-01-01          2001   2001-01-01
1   1902-01-01          2002   2002-01-01
2   1903-01-01          2003   2003-01-01
Note axis=1, which means the function is applied to each row, so it receives the row (a pandas.Series) as its argument.
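The example above builds the column from datetime.date objects, while the question's column is datetime64[ns]. Here is a minimal sketch of the same row-wise idea on a datetime64[ns] column; the sample dates are made up for illustration, and the int(...) cast is a defensive assumption rather than something the original answer uses:

```python
import pandas as pd

# Made-up sample data; release_date is datetime64[ns] as in the question
df = pd.DataFrame({
    "release_date": pd.to_datetime(["1901-01-01", "1902-06-15", "1903-12-31"]),
    "release_year": [2001, 2002, 2003],
})

# axis=1 hands each row to the lambda as a pandas.Series,
# so both columns of that row are available by name
df["changed_date"] = df.apply(
    lambda row: row["release_date"].replace(year=int(row["release_year"])),
    axis=1,
)

print(df)
```

One thing to watch: replace(year=...) raises a ValueError if the date is 29 February and the target year is not a leap year, so such rows would need separate handling.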
CodePudding user response:
Casting to string and then parsing back to datetime is more efficient here, and also more readable if you ask me. Example:
import datetime
import pandas as pd
N = 100000
df = pd.DataFrame({'release_date':[datetime.date(1901,1,1),datetime.date(1902,1,1),datetime.date(1903,1,1)]*N,
'release_year':[2001,2002,2003]*N})
df['changed_date'] = pd.to_datetime(
    # year as text + the "MM-DD" tail of the ISO date string, e.g. "2001" + "01-01"
    df['release_year'].astype(str) + df['release_date'].astype(str).str[5:],
    format="%Y%m-%d"
)
df['changed_date']
Out[176]:
0        2001-01-01
1        2002-01-01
2        2003-01-01
3        2001-01-01
4        2002-01-01
            ...
299995   2002-01-01
299996   2003-01-01
299997   2001-01-01
299998   2002-01-01
299999   2003-01-01
Name: changed_date, Length: 300000, dtype: datetime64[ns]
>>> %timeit df['changed_date'] = df.apply(lambda x:x.release_date.replace(year=x.release_year),axis=1)
6.73 s ± 542 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df['changed_date'] = pd.to_datetime(df['release_year'].astype(str) + df['release_date'].astype(str).str[5:], format="%Y%m-%d")
651 ms ± 78.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
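If slicing date strings feels fragile, another vectorized option (not taken from either answer, and not timed here) is to let pd.to_datetime assemble the dates from numeric year, month and day columns. This is a sketch that assumes release_date is already datetime64[ns], so the .dt accessor is available; the sample dates and the helper frame named parts are made up for illustration:

```python
import pandas as pd

# Made-up sample data; release_date is datetime64[ns] as in the question
df = pd.DataFrame({
    "release_date": pd.to_datetime(["1901-01-01", "1902-06-15", "1903-12-31"]),
    "release_year": [2001, 2002, 2003],
})

# pd.to_datetime can build dates from a frame with year/month/day columns
parts = pd.DataFrame({
    "year": df["release_year"],
    "month": df["release_date"].dt.month,
    "day": df["release_date"].dt.day,
})
df["changed_date"] = pd.to_datetime(parts)

print(df)
```

This keeps only the date part, so any time-of-day component in release_date is dropped unless hour, minute and second columns are added as well. A 29 February paired with a non-leap year raises by default; passing errors="coerce" should turn such rows into NaT instead.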