getting new dataframe from existing in dataframe with conditions on multiple columns-CodePudding

I am trying to sort a pandas dataframe. The data looks like-

year	state	district	Party	rank	share in votes
2010	haryana	kaithal	Winner	1	40.12
2010	haryana	kaithal	bjp	2	30.52
2010	haryana	kaithal	NOTA	3	29
2010	goa	panji	Winner	3	10
2010	goa	panji	INC	2	40
2010	goa	panji	BJP	1	50
2013	up	meerut	Winner	2	40
2013	up	meerut	SP	1	60
2015	haryana	kaithal	Winner	2	15
2015	haryana	kaithal	BJP	3	35
2015	haryana	kaithal	INC	1	50

This data is for multiple states for multiple years. In this dataset, there are multiple values for each district. I want to calculate the margin of share for each district in this manner. I have tried this, but not able to write fully. I am not able to write code for defining the margin of share and get a dataframe with only one (margin of share) value corresponding to each district instead of party wise shares.

for year in df['YEAR']:
 for state in df['STATE']:
    for district in df['DISTRICT']:
        for rank in df['RANK']:
            for party in df['PARTY']:
                if rank==1 and party=='WINNER':

then margin of share =Share of Winner-Share of party at rank 2. If share WINNER does not have rank 1 then Margin of Share= share of winner - share of party at rank 1.

I am basically trying to get this output-

|      year     |     state   |district| margin of share|
|---------------|-------------|--------|----------------|
|          2010    | haryana  |kaithal | 9.6            |
|          2010    | goa      |panji   | -40            |
|          2010    | up       |kaithal | -20            |
|          2015    | haryana  |kaithal | -35            |

I wish to have create a different data frame with columns Year, State, District and margin of SHARE.

CodePudding user response：

Create MultiIndex by first 3 columns by DataFrame.set_index, create masks, filter with DataFrame.loc and subtract values, last use Series.fillna for replace not matched values by condition m3:

df1 = df.set_index(['year', 'state', 'district'])
m1 = df1.Party=='Winner'
m2 = df1['rank']==2
m3 = df1['rank']==1

s1 = (df1.loc[m1 & m3,'share in votes']
        .sub(df1.loc[m2,'share in votes']))
print (s1)
year  state    district
2010  goa      panji       NaN
      haryana  kaithal     9.6
2013  up       meerut      NaN
2015  haryana  kaithal     NaN
Name: share in votes, dtype: float64

s2 = (df1.loc[m1,'share in votes']
        .sub(df1.loc[m3,'share in votes']))
print (s2)
year  state    district
2010  haryana  kaithal      0.0
      goa      panji      -40.0
2013  up       meerut     -20.0
2015  haryana  kaithal    -35.0
Name: share in votes, dtype: float64

df = s1.fillna(s2).reset_index()
print (df)
   year    state district  share in votes
0  2010      goa    panji           -40.0
1  2010  haryana  kaithal             9.6
2  2013       up   meerut           -20.0
3  2015  haryana  kaithal           -35.0

CodePudding user response：

use groupby and where with conditions

g = df.groupby(['year', 'state', 'district'])
cond1 = df['Party'].eq('Winner')
cond2 = df['rank'].eq(1)
cond3 = df['rank'].eq(2)
df1 = g['share in votes'].agg(lambda x: (x.where(cond1).sum() - x.where(cond3).sum()) if x.where(cond1 & cond2).sum() != 0 else (x.where(cond1).sum() - x.where(cond2).sum())).reset_index()

result(df1):

    year    state   district    share in votes
0   2010    goa     panji       -40.0
1   2010    haryana kaithal     9.6
2   2013    up      meerut      -20.0
3   2015    haryana kaithal     -35.0

if you want sort like df use following code:

df.iloc[:, :3].drop_duplicates().merge(df1)

result:

    year    state   district    share in votes
0   2010    haryana kaithal     9.6
1   2010    goa     panji       -40.0
2   2013    up      meerut      -20.0
3   2015    haryana kaithal     -35.0