Create new column with values in long format pandas dataframe-CodePudding

I am looking to calculate the vote share for a candidate in a particular district for a particular election.

I've got a dataset that gives me the party of the candidate, the district that they ran in, and the year of the election.

However, I have their competitors as well and want to calculate the vote share for one of the candidates, say the Democratic candidate. How would I then place that vote share value in the rows for all of those in that particular race?

Here's an example of a pandas dataframe that I have:

   year    state  district                   candidate       party  candidatevotes  totalvotes
0  1976  alabama         1              BILL DAVENPORT    DEMOCRAT           58906      157170 
1  1976  alabama         1                JACK EDWARDS  REPUBLICAN           98257      157170 
2  1976  alabama         1                     WRITEIN         NaN               7      157170
3  1976  alabama         2             J CAROLE KEAHEY    DEMOCRAT           66288      156362
4  1976  alabama         2  WILLIAM L "BILL" DICKINSON  REPUBLICAN           90069      156362

What I want to be able to do is have a column that would take the vote share of the democratic candidate for that particular district and year. But that vote share value be there for all of the candidates in that particular race:

   year    state  district                   candidate       party  candidatevotes  totalvotes    Dem_vote_share
0  1976  alabama         1              BILL DAVENPORT    DEMOCRAT           58906      157170    0.374
1  1976  alabama         1                JACK EDWARDS  REPUBLICAN           98257      157170    0.374
2  1976  alabama         1                     WRITEIN         NaN               7      157170    0.374
3  1976  alabama         2             J CAROLE KEAHEY    DEMOCRAT           66288      156362    0.424
4  1976  alabama         2  WILLIAM L "BILL" DICKINSON  REPUBLICAN           90069      156362    0.424

If helpful: a full dataset can be found here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2

CodePudding user response：

You should use transform() with a custom function. You can try with:

df['Dem_votes_share'] = df.groupby('party')['candidatevotes'].transform(lambda x: sum(x)/df['candidatevotes'].sum())

Following an mock-up example:

df = pd.DataFrame()
df['party'] = ['D','R','NULL','D','R']
df['candidatevotes'] = [58906,98257,7,66288,90069]
df['candidate'] = ['Bill','Jack','Writein','J Carole','William L']

We get:

  party  candidatevotes  candidate  Dem_votes_share
0     D           58906       Bill         0.399309
1     R           98257       Jack         0.600669
2  NULL               7    Writein         0.000022
3     D           66288   J Carole         0.399309
4     R           90069  William L         0.600669

CodePudding user response：

Filter your df for the kind of party you want. Then get the Dem_votes_share and map its results to the df.

New Input:

df = pd.DataFrame({
    'year': [1976, 1976, 1980, 1980, 1976, 1976, 1980, 1980], 
    'state': ['alabama',  'alabama',  'alaska',  'alaska',  'alabama',  'alabama',  'alaska',  'alaska'], 
    'district': [1, 1, 1, 1, 2, 2, 2, 2], 
    'candidate': ['BILL DAVENPORT',  'JACK EDWARDS',  'WRITEIN',  'foo',  'J CAROLE KEAHEY',  'WILLIAM L "BILL" DICKINSON',  'foo2',  'foo3'], 
    'party': ['DEMOCRAT',  'REPUBLICAN',  'DEMOCRAT',  'REPUBLICAN',  'DEMOCRAT',  'REPUBLICAN',  'DEMOCRAT',  'REPUBLICAN'], 
    'candidatevotes': [58906, 98257, 50000, 20000, 66288, 90069, 10000, 20000], 
    'totalvotes': [157163, 157163, 70000, 70000, 156362, 156362, 30000, 30000]
})

   year    state  district                   candidate       party  candidatevotes  totalvotes
0  1976  alabama         1              BILL DAVENPORT    DEMOCRAT           58906      157163
1  1976  alabama         1                JACK EDWARDS  REPUBLICAN           98257      157163
2  1980   alaska         1                     WRITEIN    DEMOCRAT           50000       70000
3  1980   alaska         1                         foo  REPUBLICAN           20000       70000
4  1976  alabama         2             J CAROLE KEAHEY    DEMOCRAT           66288      156362
5  1976  alabama         2  WILLIAM L "BILL" DICKINSON  REPUBLICAN           90069      156362
6  1980   alaska         2                        foo2    DEMOCRAT           10000       30000
7  1980   alaska         2                        foo3  REPUBLICAN           20000       30000

mapping_vals = (
    df[df['party'].eq('DEMOCRAT')]
    .set_index(['year', 'district'])
    .apply(lambda x: x['candidatevotes']/x['totalvotes'],axis=1)
)
print(mapping_vals)

year  district
1976  1           0.374808
1980  1           0.714286
1976  2           0.423939
1980  2           0.333333
dtype: float64


df['Dem_vote_share'] = df.set_index(['year', 'district']).index.map(mapping_vals)
print(df)

Output:

   year    state  district                   candidate       party  candidatevotes  totalvotes  Dem_vote_share
0  1976  alabama         1              BILL DAVENPORT    DEMOCRAT           58906      157163        0.374808
1  1976  alabama         1                JACK EDWARDS  REPUBLICAN           98257      157163        0.374808
2  1980   alaska         1                     WRITEIN    DEMOCRAT           50000       70000        0.714286
3  1980   alaska         1                         foo  REPUBLICAN           20000       70000        0.714286
4  1976  alabama         2             J CAROLE KEAHEY    DEMOCRAT           66288      156362        0.423939
5  1976  alabama         2  WILLIAM L "BILL" DICKINSON  REPUBLICAN           90069      156362        0.423939
6  1980   alaska         2                        foo2    DEMOCRAT           10000       30000        0.333333
7  1980   alaska         2                        foo3  REPUBLICAN           20000       30000        0.333333