Home > Net >  Overcome Performance warning in Pandas
Overcome Performance warning in Pandas

Time:11-10

mydata = [{'ID' : '10', 'StartDate': '10/10/2016', 'EndDate': '15/10/2016'},
          {'ID' : '20', 'StartDate': '10/10/2016', 'EndDate': '18/10/2016'}]

df = pd.DataFrame(mydata)

df['StartDate'] = pd.to_datetime(df['StartDate']).dt.date
df['EndDate'] = pd.to_datetime(df['EndDate']).dt.date
df = df.loc[df.index.repeat((df['EndDate'] - df['StartDate']).dt.days   1)]
df['Date'] = df['StartDate']   pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')

I am getting Performance warning :

PerformanceWarning: Adding/subtracting object-dtype array to TimedeltaArray not vectorized.

What can I do to make it vectorized?

CodePudding user response:

For vectorized solution need datetimes instead dates:

df['StartDate'] = pd.to_datetime(df['StartDate'], dayfirst=True)
df['EndDate'] = pd.to_datetime(df['EndDate'], dayfirst=True)

If there are times and need remove them use Series.dt.normalize instead dt.date:

df['StartDate'] = pd.to_datetime(df['StartDate'], dayfirst=True).dt.normalize()
df['EndDate'] = pd.to_datetime(df['EndDate'], dayfirst=True).dt.normalize()

df = df.loc[df.index.repeat((df['EndDate'] - df['StartDate']).dt.days   1)]
df['Date'] = df['StartDate']   pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df = df.reset_index(drop=True)
print (df)
    ID  StartDate    EndDate       Date
0   10 2016-10-10 2016-10-15 2016-10-10
1   10 2016-10-10 2016-10-15 2016-10-11
2   10 2016-10-10 2016-10-15 2016-10-12
3   10 2016-10-10 2016-10-15 2016-10-13
4   10 2016-10-10 2016-10-15 2016-10-14
5   10 2016-10-10 2016-10-15 2016-10-15
6   20 2016-10-10 2016-10-18 2016-10-10
7   20 2016-10-10 2016-10-18 2016-10-11
8   20 2016-10-10 2016-10-18 2016-10-12
9   20 2016-10-10 2016-10-18 2016-10-13
10  20 2016-10-10 2016-10-18 2016-10-14
11  20 2016-10-10 2016-10-18 2016-10-15
12  20 2016-10-10 2016-10-18 2016-10-16
13  20 2016-10-10 2016-10-18 2016-10-17
14  20 2016-10-10 2016-10-18 2016-10-18

CodePudding user response:

This should fix the issue.

pandas complained about dayfirst not set for dates like "%d/%m/%Y". Removed .dt.date that prevent pandas to set dtype to datetime64 and caused the warning.

mydata = [{'ID' : '10', 'StartDate': '10/10/2016', 'EndDate': '15/10/2016'}, {'ID' : '20', 'StartDate': '10/10/2016', 'EndDate': '18/10/2016'}]

df = pd.DataFrame(mydata)

df['StartDate'] = pd.to_datetime(df['StartDate'], dayfirst=True)
df['EndDate'] = pd.to_datetime(df['EndDate'], dayfirst=True)
df = df.loc[df.index.repeat((df['EndDate'] - df['StartDate']).dt.days   1)]
df['Date'] = df['StartDate']   pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')

Before

StartDate    object
EndDate      object

After

StartDate    datetime64[ns]
EndDate      datetime64[ns]
  • Related