Home > Back-end >  How to populate date in a dataframe using pandas in python
How to populate date in a dataframe using pandas in python

Time:04-28

I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, saying add three (month_num) more dates to each case and removing the original ones.

original dataframe:

   Case       Date
0     1 2010-01-01
1     2 2011-04-01
2     3 2012-08-01

after populating dates:

   Case        Date
0     1  2010-02-01
1     1  2010-03-01
2     1  2010-04-01
3     2  2011-05-01
4     2  2011-06-01
5     2  2011-07-01
6     3  2012-09-01
7     3  2012-10-01
8     3  2012-11-01

I tried to declare an empty dataframe with the same column names and data type, and used for loop to loop over Case and month_num, and add rows into the new dataframe.

import pandas as pd

data = [[1, '2010-01-01'],  [2, '2011-04-01'], [3, '2012-08-01']]
 
df = pd.DataFrame(data, columns = ['Case', 'Date'])

df.Date = pd.to_datetime(df.Date)

df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])

month_num = 3

for c in df.Case:
    for m in range(1, month_num 1):
        temp = df.loc[df['Case']==c]
        temp['Date'] = temp['Date']   pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp]) 

df_new.reset_index(inplace=True, drop=True)

My code can work, however, when the original dataframe and month_num become large, it took huge time to run. Are there any better ways to do what I need? Thanks a alot!!

CodePudding user response:

Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.

As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate once the list.

CodePudding user response:

Given your input data this is what worked on my notebook:

df2=pd.DataFrame()

df2['Date']=df['Date'].apply(lambda x: pd.date_range(start=x, periods=3,freq='M')).explode()

df3=pd.merge_asof(df2,df,on='Date')
df3['Date']=df3['Date']  pd.DateOffset(days=1)
df3[['Case','Date']]
  • We create a df2 to which we populate 'Date' with the needed dates coming from the original df
  • Then df3 resulting of a merge_asof between df2 and df (to populate the 'Case' column)
  • Finally , we offset the resulting column off 1 day
  • Related