It's been a while since I used pandas and I'm not sure how to do this. Say I have a time series like this:
# creating time series
import pandas as pd
df = pd.DataFrame({'a': [0, 1, 2],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2020-04-01'),
                            pd.to_datetime('2022-04-01')]})
s = df.set_index('date')['a']
s
date
2017-04-01 0
2020-04-01 1
2022-04-01 2
Name: a, dtype: int64
I would like to expand the dataset to daily values from the start date through 2050 so that I can interpolate over it.
I know I can call s.interpolate(method='linear'), but I believe the rows containing NaN have to exist first.
How can I extend the dataframe?
CodePudding user response:
It's probably easier to create the full date range and join your values onto it than to try to insert the missing rows. We can use the pd.date_range() function to create a datetime index covering the range you want and convert it to a dataframe, so we can use the join function, which combines dataframes by index.
import pandas as pd
df = pd.DataFrame({'a': [0, 1, 2],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2020-04-01'),
                            pd.to_datetime('2022-04-01')]})
s = df.set_index('date')['a']
full_date_df = pd.date_range(start='2017-04-01', end='2050-12-31').to_frame()
df_w_nulls = full_date_df.join(s)
df_w_nulls
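To get from the joined frame to interpolated values, something along these lines might work. It is only a sketch: it rebuilds the example series, joins onto an empty frame built on the daily index instead of using to_frame() (my substitution, to avoid the extra date column), and the names full_index and interpolated are made up for illustration.

```python
import pandas as pd

# Rebuild the example series from the question.
s = pd.Series(
    [0, 1, 2],
    index=pd.to_datetime(['2017-04-01', '2020-04-01', '2022-04-01']),
    name='a',
)

# Daily index over the whole target range (hypothetical name).
full_index = pd.date_range(start='2017-04-01', end='2050-12-31', freq='D')

# Empty frame on the daily index; the left join leaves NaN where s has no value.
full_date_df = pd.DataFrame(index=full_index)
df_w_nulls = full_date_df.join(s)

# Fill the NaN rows by linear interpolation between the known points.
interpolated = df_w_nulls['a'].interpolate(method='linear')
print(interpolated.loc[pd.Timestamp('2021-04-01')])
```

Linear interpolation does not extrapolate, so with the default settings every day after 2022-04-01 simply repeats the last known value, 2.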
CodePudding user response:
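You can try reindexing the series onto the target dates. Note that this example uses a yearly frequency (every April 1) and fills the gaps with 0; if you want to interpolate, use freq='D' and leave out fillna(0) so the gaps stay NaN. Newer pandas releases also deprecate the 'AS-APR' alias in favor of 'YS-APR'.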
date = pd.date_range(start='2017-04-01', end='2050-04-01', freq='AS-APR')
s = s.reindex(date).fillna(0)
print(s)
2017-04-01 0.0
2018-04-01 0.0
2019-04-01 0.0
2020-04-01 1.0
2021-04-01 0.0
2022-04-01 2.0
2023-04-01 0.0
2024-04-01 0.0
2025-04-01 0.0
2026-04-01 0.0
2027-04-01 0.0
2028-04-01 0.0
2029-04-01 0.0
2030-04-01 0.0
2031-04-01 0.0
2032-04-01 0.0
2033-04-01 0.0
2034-04-01 0.0
2035-04-01 0.0
2036-04-01 0.0
2037-04-01 0.0
2038-04-01 0.0
2039-04-01 0.0
2040-04-01 0.0
2041-04-01 0.0
2042-04-01 0.0
2043-04-01 0.0
2044-04-01 0.0
2045-04-01 0.0
2046-04-01 0.0
2047-04-01 0.0
2048-04-01 0.0
2049-04-01 0.0
2050-04-01 0.0
Freq: AS-APR, Name: a, dtype: float64
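If the goal is interpolation rather than zeros, the same reindex idea could be adapted roughly as below. This is a sketch under a few assumptions: a daily frequency, an end date of 2050-04-01 carried over from the example above, method='time' as the interpolation method, and the made-up names daily, expanded and filled.

```python
import pandas as pd

# Rebuild the example series from the question.
s = pd.Series(
    [0, 1, 2],
    index=pd.to_datetime(['2017-04-01', '2020-04-01', '2022-04-01']),
    name='a',
)

# Daily target index, starting at the first observation.
daily = pd.date_range(start=s.index.min(), end='2050-04-01', freq='D')

# reindex keeps the known values and puts NaN on every other day.
expanded = s.reindex(daily)

# Interpolate using the actual time distance between index entries.
filled = expanded.interpolate(method='time')
print(filled.loc[pd.Timestamp('2021-04-01')])
```

Skipping fillna(0) is the important part here: interpolate only fills NaN entries, so zeros would be treated as real observations.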
CodePudding user response:
I recommend creating a new dataframe that contains the dates you are looking for. One way to do it:
import pandas as pd
from pandas import Timestamp
start = Timestamp('2017-04-01')
end = Timestamp('2050-01-01')
# One row per day between start and end (both inclusive).
all_dates = pd.date_range(start=start, end=end, freq='D')
# You can fill the "Value" column with any value you want. Here it is filled with 0.
df = pd.DataFrame({'Date': all_dates, 'Value': [0] * len(all_dates)})
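To bring the three known values into such a frame and interpolate between them, one option might be a left merge on the date column. The sketch below assumes the known points live in a separate frame (called known here, a made-up name) and leaves the placeholder column out so that missing days come through as NaN rather than 0.

```python
import pandas as pd

# Known observations from the question, with the date as a regular column.
known = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-01', '2020-04-01', '2022-04-01']),
    'Value': [0, 1, 2],
})

# One row per day over the target range.
all_dates = pd.date_range(start='2017-04-01', end='2050-01-01', freq='D')
df = pd.DataFrame({'Date': all_dates})

# Left merge keeps every day; days without an observation get NaN in "Value".
df = df.merge(known, on='Date', how='left')

# Fill the NaN rows by linear interpolation between the known points.
df['Value'] = df['Value'].interpolate(method='linear')
print(df.loc[df['Date'] == pd.Timestamp('2021-04-01'), 'Value'].iloc[0])
```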