Home > Software engineering >  how to create various dummies based on consecutive values from another column
how to create various dummies based on consecutive values from another column

Time:05-03

I have the following panel dataset. "winner" =1 if in period (date), someone is a winner, zero if loser.

ID   date  winner 
A   2017Q4  NaN
A   2018Q4   1         
A   2019Q4   0   
A   2020Q4   0    
A   2021Q4   1   
B   2017Q4  NaN 
B   2018Q4   1   
B   2019Q4   1   
B   2020Q4   0    
B   2021Q4   0   
C   2017Q4  NaN
C   2018Q4   0      
C   2019Q4   0    
C   2020Q4   0  
C   2021Q4   0 
D   2017Q4  NaN   
D   2018Q4   0                 
D   2019Q4   1   
D   2020Q4   1 
D   2021Q4   1 

I want to create four dummy variables, WW =1 if someone is winner in two consecutive periods. LL=1 if loser in two consecutive periods. WL if winner in period 1 and loser the next period, and LW vice versa.

UPDATE

when i apply the answers below i get the following

ID   date  winner  WW  LL  WL  LW
A   2017Q4  NaN
A   2018Q4   1     0   0   0   0  
A   2019Q4   0     0   0   1   0    
A   2020Q4   0     0   1   0   0  
A   2021Q4   1     0   0   0   1
B   2017Q4  NaN
B   2018Q4   1     0   0   0   0   
B   2019Q4   1     1   0   0   0
B   2020Q4   0     0   0   1   0
B   2021Q4   0     0   1   0   0
C   2017Q4  NaN
C   2018Q4   0     0   0   0   0      
C   2019Q4   0     0   1   0   0
C   2020Q4   0     0   1   0   0
C   2021Q4   0     0   1   0   0
D   2017Q4  NaN
D   2018Q4   0     0   0   0   0                
D   2019Q4   1     0   0   0   1
D   2020Q4   1     1   0   0   0
D   2021Q4   1     1   0   0   0

How do i make sure I get the NaN when the previous value is NaN? desired output

ID   date  winner  WW  LL  WL  LW
A   2017Q4  NaN
A   2018Q4   1    NaN NaN NaN NaN   
A   2019Q4   0     0   0   1   0    
A   2020Q4   0     0   1   0   0  
A   2021Q4   1     0   0   0   1
B   2017Q4  NaN
B   2018Q4   1    NaN NaN NaN NaN  
B   2019Q4   1     1   0   0   0
B   2020Q4   0     0   0   1   0
B   2021Q4   0     0   1   0   0
C   2017Q4  NaN
C   2018Q4   0    NaN NaN NaN NaN    
C   2019Q4   0     0   1   0   0
C   2020Q4   0     0   1   0   0
C   2021Q4   0     0   1   0   0
D   2017Q4  NaN
D   2018Q4   0    NaN NaN NaN NaN               
D   2019Q4   1     0   0   0   1
D   2020Q4   1     1   0   0   0
D   2021Q4   1     1   0   0   0

How to do this in the most simple way?

CodePudding user response:

Here's one way: Use groupby.shift to get the previous record; then use numpy.select to assign values, which you use get_dummies to convert to dummy variables:

import numpy as np
df['previous'] = df.groupby('ID')['winner'].shift()
tmp = df[['previous','winner']]
dummy_vars = ['WW','LL','WL', 'LW']
out = (df.join(pd.get_dummies(np.select([tmp.eq(1).all(1), 
                                         tmp.eq(0).all(1),
                                         tmp.eq([1,0]).all(1), 
                                         tmp.eq([0,1]).all(1)], 
                                        dummy_vars, ''))[dummy_vars ['']]
               .mask(df['previous'].isna(), ''))
       .drop(columns=['previous','']))

Output:

   ID    date  winner WW LL WL LW
0   A  2018Q4       1            
1   A  2019Q4       0  0  0  1  0
2   A  2020Q4       0  0  1  0  0
3   A  2021Q4       1  0  0  0  1
4   B  2018Q4       1            
5   B  2019Q4       1  1  0  0  0
6   B  2020Q4       0  0  0  1  0
7   B  2021Q4       0  0  1  0  0
8   C  2018Q4       0            
9   C  2019Q4       0  0  1  0  0
10  C  2020Q4       0  0  1  0  0
11  C  2021Q4       0  0  1  0  0
12  D  2018Q4       0            
13  D  2019Q4       1  0  0  0  1
14  D  2020Q4       1  1  0  0  0
15  D  2021Q4       1  1  0  0  0

CodePudding user response:

  1. map 1 and 0 to "W" and "L"
  2. get the 2-period streak
  3. get_dummies for the "streak"
  4. join to original DataFrame ignoring the first row of each ID
wins = df["winner"].fillna(0).map({1:"W",0:"L"})
streaks = wins.shift()   wins
other = pd.get_dummies(streaks.where(df["ID"].eq(df["ID"].shift())))
output = df.join(other.where(df["ID"].duplicated()&df["winner"].shift().notna()))

>>> output

   ID    date  winner   LL   LW   WL   WW
0   A  2017Q4     NaN  NaN  NaN  NaN  NaN
1   A  2018Q4     1.0  NaN  NaN  NaN  NaN
2   A  2019Q4     0.0  0.0  0.0  1.0  0.0
3   A  2020Q4     0.0  1.0  0.0  0.0  0.0
4   A  2021Q4     1.0  0.0  1.0  0.0  0.0
5   B  2017Q4     NaN  NaN  NaN  NaN  NaN
6   B  2018Q4     1.0  NaN  NaN  NaN  NaN
7   B  2019Q4     1.0  0.0  0.0  0.0  1.0
8   B  2020Q4     0.0  0.0  0.0  1.0  0.0
9   B  2021Q4     0.0  1.0  0.0  0.0  0.0
10  C  2017Q4     NaN  NaN  NaN  NaN  NaN
11  C  2018Q4     0.0  NaN  NaN  NaN  NaN
12  C  2019Q4     0.0  1.0  0.0  0.0  0.0
13  C  2020Q4     0.0  1.0  0.0  0.0  0.0
14  C  2021Q4     0.0  1.0  0.0  0.0  0.0
15  D  2017Q4     NaN  NaN  NaN  NaN  NaN
16  D  2018Q4     0.0  NaN  NaN  NaN  NaN
17  D  2019Q4     1.0  0.0  1.0  0.0  0.0
18  D  2020Q4     1.0  0.0  0.0  0.0  1.0
19  D  2021Q4     1.0  0.0  0.0  0.0  1.0
  • Related