I am trying to sort a pandas dataframe. The data looks like-
year | state | district | Party | rank | share in votes |
---|---|---|---|---|---|
2010 | haryana | kaithal | Winner | 1 | 40.12 |
2010 | haryana | kaithal | bjp | 2 | 30.52 |
2010 | haryana | kaithal | NOTA | 3 | 29 |
2010 | goa | panji | Winner | 3 | 10 |
2010 | goa | panji | INC | 2 | 40 |
2010 | goa | panji | BJP | 1 | 50 |
2013 | up | meerut | Winner | 2 | 40 |
2013 | up | meerut | SP | 1 | 60 |
2015 | haryana | kaithal | Winner | 2 | 15 |
2015 | haryana | kaithal | BJP | 3 | 35 |
2015 | haryana | kaithal | INC | 1 | 50 |
This data is for multiple states for multiple years. In this dataset, there are multiple values for each district. I want to calculate the margin of share for each district in this manner. I have tried this, but not able to write fully. I am not able to write code for defining the margin of share and get a dataframe with only one (margin of share) value corresponding to each district instead of party wise shares.
for year in df['YEAR']:
for state in df['STATE']:
for district in df['DISTRICT']:
for rank in df['RANK']:
for party in df['PARTY']:
if rank==1 and party=='WINNER':
then margin of share =Share of Winner-Share of party at rank 2. If share WINNER does not have rank 1 then Margin of Share= share of winner - share of party at rank 1.
I am basically trying to get this output-
| year | state |district| margin of share|
|---------------|-------------|--------|----------------|
| 2010 | haryana |kaithal | 9.6 |
| 2010 | goa |panji | -40 |
| 2010 | up |kaithal | -20 |
| 2015 | haryana |kaithal | -35 |
I wish to have create a different data frame with columns Year, State, District and margin of SHARE.
CodePudding user response:
Create MultiIndex by first 3 columns by DataFrame.set_index
, create masks, filter with DataFrame.loc
and subtract values, last use Series.fillna
for replace not matched values by condition m3
:
df1 = df.set_index(['year', 'state', 'district'])
m1 = df1.Party=='Winner'
m2 = df1['rank']==2
m3 = df1['rank']==1
s1 = (df1.loc[m1 & m3,'share in votes']
.sub(df1.loc[m2,'share in votes']))
print (s1)
year state district
2010 goa panji NaN
haryana kaithal 9.6
2013 up meerut NaN
2015 haryana kaithal NaN
Name: share in votes, dtype: float64
s2 = (df1.loc[m1,'share in votes']
.sub(df1.loc[m3,'share in votes']))
print (s2)
year state district
2010 haryana kaithal 0.0
goa panji -40.0
2013 up meerut -20.0
2015 haryana kaithal -35.0
Name: share in votes, dtype: float64
df = s1.fillna(s2).reset_index()
print (df)
year state district share in votes
0 2010 goa panji -40.0
1 2010 haryana kaithal 9.6
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
CodePudding user response:
use groupby
and where
with conditions
g = df.groupby(['year', 'state', 'district'])
cond1 = df['Party'].eq('Winner')
cond2 = df['rank'].eq(1)
cond3 = df['rank'].eq(2)
df1 = g['share in votes'].agg(lambda x: (x.where(cond1).sum() - x.where(cond3).sum()) if x.where(cond1 & cond2).sum() != 0 else (x.where(cond1).sum() - x.where(cond2).sum())).reset_index()
result(df1
):
year state district share in votes
0 2010 goa panji -40.0
1 2010 haryana kaithal 9.6
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
if you want sort like df
use following code:
df.iloc[:, :3].drop_duplicates().merge(df1)
result:
year state district share in votes
0 2010 haryana kaithal 9.6
1 2010 goa panji -40.0
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0