I'm not even sure how to describe this, but I am looking for a method to sum the values of others and replace the NaN with a particular value.
My data consists of Organization, Population and Square Miles. I have population and square miles data for each county in the state, however some Organizations span across multiple counties. I merged the two data sets (organization info with pop/square miles data) but am obviously left with NaNs for the organizations that span across multiple counties.
If I create a dictionary like the following:
counties = {'Job a': ['county a', 'county b', 'county c'],
'Job b': ['county d', 'county e', 'county f'],
'Job c': ['county g', 'county h']} etc..
If I have table 1 like this:
county sq_mile population
0 County a 2500 15000
1 County b 750 400
2 County c 4000 3500
3 County d 4300 4500
4 County e 2000 1500
5 County f 1000 1500
6 County g 4300 3500
7 County h 4400 3200
8 County i 4000 3500
How can I take table 2 (which has just organizations) and fill missing data using the dictionary and associated column values?
organization sq_mile population
0 job a 7250 18900
1 job b 7300 7500
2 job c 8700 6700
3 county i 4000 3500
CodePudding user response:
Try this:
counties_r = {i: k for k in counties.keys() for i in counties[k]}
df.groupby(df['County'].str.lower().map(counties_r).fillna(df['County']))\
.sum().reset_index()
Output:
County Sq. Mile Population
0 County i 4000 3500
1 Job a 7250 18900
2 Job b 7300 7500
3 Job c 8700 6700
Swap the keys and values in your dictionary the map that swapped dictionary to the county column making everything lowercase then use groupby with sum.