noob here.
I have a dataframe that looks like this:
start | end | start_year |
---|---|---|
NaT | NaT | 2020 |
NaT | NaT | 2021 |
and I want to fill in the NaT values with the first and last day of the year listed in the start_year column, so that it looks like this:
start | end | start_year |
---|---|---|
2020-01-01 | 2020-12-31 | 2020 |
2021-01-01 | 2021-12-31 | 2021 |
I tried to fill in the NaTs in the 'end' column like this:
df2.loc[df2['start'].isnull()
& df2['end'].isnull()
& df2['start_year'].notnull()
, "end"] = dt.date(df2["start_year"], 12, 31)
but I get this error:
TypeError: cannot convert the series to <class 'int'>
When I look at just the start year column it says this:
Name: start_year, Length: 4213, dtype: int64
I also tried using
df2["start_year"].values
but that didn't help.
Apologies if I'm just being an idiot. I searched around on here and google but couldn't find an answer.
Answer:
For both the start and end columns, keep the value if it is already filled; otherwise fill it with the first day (respectively the last day) of the year in start_year. Because apply with axis=1 works row by row, x['start_year'] is a single number here, not a whole Series:
import pandas as pd
df['start'] = df.apply(lambda x: x['start'] if pd.notna(x['start']) else pd.Timestamp(year=int(x['start_year']), month=1, day=1), axis=1)
df['end'] = df.apply(lambda x: x['end'] if pd.notna(x['end']) else pd.Timestamp(year=int(x['start_year']), month=12, day=31), axis=1)
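For reference, here is a self-contained sketch of the row-wise approach above. The sample data and the helper name fill_from_year are made up for illustration. The attempt in the question fails because dt.date() needs one integer per call, while df2["start_year"] is a whole Series; calling a helper once per row sidesteps that.

```python
import pandas as pd

# Sample data (assumed for illustration): one row already filled, two missing.
df = pd.DataFrame({
    "start": pd.to_datetime(["2019-03-01", None, None]),
    "end": pd.to_datetime(["2019-06-30", None, None]),
    "start_year": [2019, 2020, 2021],
})

def fill_from_year(row, column, month, day):
    """Hypothetical helper: keep the existing value, else build a date from start_year."""
    if pd.notna(row[column]):
        return row[column]
    # row["start_year"] is a scalar here because apply(axis=1) passes one row at a time.
    return pd.Timestamp(year=int(row["start_year"]), month=month, day=day)

df["start"] = df.apply(fill_from_year, axis=1, args=("start", 1, 1))
df["end"] = df.apply(fill_from_year, axis=1, args=("end", 12, 31))
print(df)
```

Row-wise apply is easy to read but slow on large frames, so for the 4213 rows in the question it is fine, while a vectorized approach may be preferable for much bigger data.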
Answer:
Use:
# If necessary, convert the columns to datetimes first:
# df['start'] = pd.to_datetime(df['start'])
# df['end'] = pd.to_datetime(df['end'])
# Replace missing values with the first and last day of start_year
df['start'] = df['start'].fillna(pd.to_datetime(df['start_year'], format='%Y'))
# The first day of the following year minus one day is December 31
df['end'] = df['end'].fillna(pd.to_datetime(df['start_year'].add(1), format='%Y') - pd.Timedelta('1 day'))
print(df)
start end start_year
0 2020-01-01 2020-12-31 2020
1 2021-01-01 2021-12-31 2021
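If you would rather keep the .loc mask from the question, a vectorized sketch along these lines should also work. The sample frame is made up, and it assumes start_year holds whole years (errors="coerce" turns anything unparseable into NaT). The point is that the right-hand side has to be a column of dates, not a single dt.date built from a Series.

```python
import pandas as pd

# Sample data (assumed for illustration): two rows missing, one already filled.
df2 = pd.DataFrame({
    "start": pd.to_datetime([None, None, "2019-03-01"]),
    "end": pd.to_datetime([None, None, "2019-06-30"]),
    "start_year": [2020, 2021, 2019],
})

# Build whole columns of candidate dates at once from ISO-style strings.
year_text = df2["start_year"].astype(str)
year_start = pd.to_datetime(year_text + "-01-01", errors="coerce")
year_end = pd.to_datetime(year_text + "-12-31", errors="coerce")

# Same condition as in the question: both dates missing, year present.
mask = df2["start"].isna() & df2["end"].isna() & df2["start_year"].notna()

# Assign only the masked rows; pandas aligns the right-hand side on the index.
df2.loc[mask, "start"] = year_start[mask]
df2.loc[mask, "end"] = year_end[mask]
print(df2)
```

Unlike fillna, this only touches rows where both start and end are missing, which matches the condition in the original attempt.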
Answer:
df['start_year'].apply(pd.Period).dt.to_timestamp('A')
Output (the last day of each year):
0 2020-12-31
1 2021-12-31
Name: start_year, dtype: datetime64[ns]
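The Period idea can be stretched to cover both columns. This is only a sketch: it swaps in pd.PeriodIndex with freq="Y" and the start_time / end_time attributes instead of the exact call above, and the sample frame is assumed.

```python
import pandas as pd

# Sample data (assumed for illustration): both date columns entirely missing.
df = pd.DataFrame({
    "start": pd.Series([pd.NaT, pd.NaT], dtype="datetime64[ns]"),
    "end": pd.Series([pd.NaT, pd.NaT], dtype="datetime64[ns]"),
    "start_year": [2020, 2021],
})

# One yearly period per row, parsed from the year written as text.
years = pd.PeriodIndex(df["start_year"].astype(str), freq="Y")

# start_time is midnight on January 1; end_time is the last instant of
# December 31, so normalize() resets it to midnight of that day.
year_start = pd.Series(years.start_time, index=df.index)
year_end = pd.Series(years.end_time.normalize(), index=df.index)

# Only missing values are replaced; existing dates are kept.
df["start"] = df["start"].fillna(year_start)
df["end"] = df["end"].fillna(year_end)
print(df)
```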