generate date feature column using pandas

Time:12-20

I have a timeseries data frame that has columns like these:

Date          temp_data        holiday           day   

01.01.2000    10000              0                1
02.01.2000    0                  1                2
03.01.2000    2000               0                3
..
..
..
26.01.2000    200                0                26
27.01.2000     0                 1                27
28.01.2000    500                0                28
29.01.2000     0                 1                29
30.01.2000    200                0                30
31.01.2000     0                 1                31
01.02.2000     0                 1                 1
02.02.2000    2500               0                 2

Here, holiday = 0 when data is present, which indicates a working day.

holiday = 1 when no data is present, which indicates a non-working day.

I am trying to derive three new columns from this data: second_last_working_day_of_month, third_last_working_day_of_month, and fourth_last_wday.

The output data frame should look like this:

Date          temp_data   holiday     day     secondlast_wd   thirdlast_wd    fourthlast_wd

01.01.2000    10000         0          1             1                0              0
02.01.2000    0             1          2             0                0              0
03.01.2000    2000          0          3             0                0              0
..
..
25.01.2000    345           0          25            0                0              1
26.01.2000    200           0          26            0                1              0
27.01.2000     0            1          27            0                0              0
28.01.2000    500           0          28            1                0              0
29.01.2000     0            1          29            0                0              0
30.01.2000    200           0          30            0                0              0
31.01.2000     0            1          31            0                0              0
01.02.2000     0            1          1             0                0              0
02.02.2000    2500          0          2             0                0              0

Can anyone help me with this?

CodePudding user response:

Example

import pandas as pd

data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28], 
        ['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31], 
        ['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])

df

    Date        temp_data   holiday day
0   26.01.2000  200         0       26
1   27.01.2000  0           1       27
2   28.01.2000  500         0       28
3   29.01.2000  0           1       29
4   30.01.2000  200         0       30
5   31.01.2000  0           1       31
6   26.02.2000  200         0       26
7   27.02.2000  0           0       27
8   28.02.2000  500         0       28
9   29.02.2000  0           1       29

Code

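For example, to build the secondlast_wd column (n = 2):

n = 2
# Dates are day-first (dd.mm.yyyy), so pass the format explicitly
s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
# Count working days backwards within each month; the n-th last working day has count n
result = df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(n)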

result

0    False
1    False
2     True
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: holiday, dtype: bool

Add result as the secondlast_wd column:

df.assign(secondlast_wd=result.astype('int'))

output:

    Date        temp_data   holiday day secondlast_wd
0   26.01.2000  200         0       26  0
1   27.01.2000  0           1       27  0
2   28.01.2000  500         0       28  1
3   29.01.2000  0           1       29  0
4   30.01.2000  200         0       30  0
5   31.01.2000  0           1       31  0
6   26.02.2000  200         0       26  0
7   27.02.2000  0           0       27  1
8   28.02.2000  500         0       28  0
9   29.02.2000  0           1       29  0

You can change n to get the third-last, fourth-last, and so on.
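
As a rough sketch of varying n, the loop below builds all three requested columns in one pass. The names is_workday, month_key and rank_from_end are illustrative and not part of the answer above. Grouping by year-month with dt.to_period('M') instead of dt.month is an assumption that the real data covers more than one year; with dt.month alone, January 2000 and January 2001 would fall into the same group.

```python
import pandas as pd

data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28],
        ['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31],
        ['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],
        ['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])

# Dates are day-first (dd.mm.yyyy); rows are assumed to be sorted by date.
s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
month_key = s.dt.to_period('M')  # assumption: keep each year-month separate

is_workday = df['holiday'].eq(0)
# Running count of working days from the end of each month (1 = last working day).
rank_from_end = is_workday.astype(int).iloc[::-1].groupby(month_key).cumsum()

# Column names follow the question; the mapping itself is only illustrative.
for name, n in {'secondlast_wd': 2, 'thirdlast_wd': 3, 'fourthlast_wd': 4}.items():
    df[name] = (is_workday & rank_from_end.eq(n)).astype(int)
```

On the sample data each month has only three working days, so the fourthlast_wd column stays all zeros.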


Update (in response to a comment)

Check which rows are working days (in reversed index order):

df.iloc[::-1, 2].eq(0) # 2 is the position of the 'holiday' column; df.loc[::-1, "holiday"] also works

9    False
8     True
7     True
6     True
5    False
4     True
3    False
2     True
1    False
0     True
Name: holiday, dtype: bool

Take the cumulative sum in reverse order within each group (month). Going backwards through the month, the count increases by 1 at each working day and stays unchanged at each holiday.

df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum()


9    0
8    1
7    2
6    3
5    0
4    1
3    1
2    2
1    2
0    3
Name: holiday, dtype: int64

Rows where holiday == 0 and the reverse count equals 2 are the second-last working days:

df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(2)

0    False
1    False
2     True
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: holiday, dtype: bool

The & operation aligns both sides on the index labels, so the result comes back in the original (not reversed) order.
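
A small illustration of why the reversed Series can be combined with the original one: pandas matches the operands of & by index label, not by position. The Series a and b below are made-up values used only for this demonstration.

```python
import pandas as pd

a = pd.Series([True, True, False], index=[0, 1, 2])
b = pd.Series([True, False, True], index=[2, 1, 0])  # same labels, reversed order

combined = a & b
# Values are paired by label:
#   label 0: a is True,  b is True  -> True
#   label 1: a is True,  b is False -> False
#   label 2: a is False, b is True  -> False
print(combined.loc[0], combined.loc[1], combined.loc[2])
```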


Another way

A more readable version would be:

s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
idx1 = df[df['holiday'].eq(0)].groupby(s.dt.month, as_index=False).nth(-2).index
df.loc[idx1, 'lastsecondary_wd'] = 1
df['lastsecondary_wd'] = df['lastsecondary_wd'].fillna(0).astype('int')

This gives the same result.
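
If several such columns are needed, one possible generalisation is a small helper. This is only a sketch: nth_last_workday_flag is a hypothetical name, and it uses groupby().cumcount(ascending=False) in place of nth(-2), which is a different technique from the code above. It assumes the rows are sorted by date and that dates are in dd.mm.yyyy format.

```python
import pandas as pd

def nth_last_workday_flag(df, n):
    """Return a 0/1 Series marking the n-th last working day of each year-month."""
    dates = pd.to_datetime(df['Date'], format='%d.%m.%Y')
    is_workday = df['holiday'].eq(0)
    # Year-month key for working days only.
    months = dates[is_workday].dt.to_period('M')
    # Position counted from the end of each month: 0 = last working day.
    pos_from_end = months.groupby(months).cumcount(ascending=False)
    flag = pd.Series(0, index=df.index, dtype=int)
    flag.loc[pos_from_end[pos_from_end.eq(n - 1)].index] = 1
    return flag

data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28],
        ['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31],
        ['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],
        ['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])

df['secondlast_wd'] = nth_last_workday_flag(df, 2)
df['thirdlast_wd'] = nth_last_workday_flag(df, 3)
```

Months with fewer than n working days simply get no flagged row.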
