looping over pandas dataframe to create dummies-CodePudding

I have a dataset which looks like following-

|      year     |     state   | election year 1-yes|
|---------------|-------------|--------------------|
|          2010 |     haryana |          1         | 
|          2010 |     haryana |          1         | 
|          2010 |     up      |          0         | 
|          2011 |     goa     |          1         |
|          2012 |     haryana |          0         | 
|          2012 |     up      |          1         | 
|          2013 |     up      |          0         |
|          2013 |     up      |          0         |
|          2013 |     haryana |          1         |
|          2015 |     haryana |          0         |
|          2015 |     up      |          0         |

The third column tells if there is an election in that particular state in that particular year. Haryana has election in 2010, so 1, but goa and up do not have any election in 2010, so the election year is 0 for them.

I want a loop which can add another column which tells the number of years until next election. So, when there is a election in haryana in 2010, it will be 0 (as the election is scheduled in this year only) and in 2012, it will be 3 as next election is in 2015. So, the required column has a range from 0 to 4 for all the states.

The table below summarizes my desired output.

| year |  state | election year 1-yes| years until next election|
|------|--------|--------------------|--------------------------|
| 2010 |haryana |          1         |            0             |
| 2010 |haryana |          1         |            0             |
| 2010 |  up    |          0         |            2             | 
| 2011 | goa    |          1         |            0             | 
| 2012 |haryana |          0         |            3             | 
| 2012 |  up    |          1         |            0             |
| 2013 |  up    |          0         |            4             | 
| 2013 |  up    |          0         |            4             |
| 2013 |haryana |          0         |            2             |
| 2015 |haryana |          1         |            0             |
| 2015 |  up    |          0         |            2             |

I am looking for a simple code loop on year and state as my data has over 30 states and 12 years.

CodePudding user response：

Example

data = [[2010, 'haryana', 1], [2010, 'haryana', 1], [2010, 'up', 0],
        [2011, 'goa', 1], [2012, 'haryana', 0], [2012, 'up', 1],
        [2013, 'up', 0], [2013, 'up', 0], [2013, 'haryana', 0],
        [2015, 'haryana', 1], [2015, 'up', 0]]
df = pd.DataFrame(data, columns=['year', 'state', 'election'])

df

    year    state   election
0   2010    haryana 1
1   2010    haryana 1
2   2010    up      0
3   2011    goa     1
4   2012    haryana 0
5   2012    up      1
6   2013    up      0
7   2013    up      0
8   2013    haryana 0
9   2015    haryana 1
10  2015    up      0

Code

df1 = df.drop_duplicates().copy()
cond1 = df1['election'].eq(0)
s = (df1['election'].mask(cond1).mask(~cond1, df['year'])
     .groupby(df1['state']).bfill())
df1['next'] = s.mask(~cond1, 0).mask(cond1, s.sub(df1['year']))

df1

    year    state   election    next
0   2010    haryana 1           0.0
2   2010    up      0           2.0
3   2011    goa     1           0.0
4   2012    haryana 0           3.0
5   2012    up      1           0.0
6   2013    up      0           NaN
8   2013    haryana 0           2.0
9   2015    haryana 1           0.0
10  2015    up      0           NaN

merge df and df1

df.merge(df1, how='left')

result:

    year    state   election    next
0   2010    haryana 1           0.0
1   2010    haryana 1           0.0
2   2010    up      0           2.0
3   2011    goa     1           0.0
4   2012    haryana 0           3.0
5   2012    up      1           0.0
6   2013    up      0           NaN
7   2013    up      0           NaN
8   2013    haryana 0           2.0
9   2015    haryana 1           0.0
10  2015    up      0           NaN