Pandas dataframe merge row by addition


I want to create a dataframe from census data and calculate the number of people who filed a tax return for each specific earnings group.

For now, I wrote this

census_df = pd.read_csv('../zip code data/19zpallagi.csv')
sub_census_df = census_df[['zipcode', 'agi_stub', 'N02650', 'A02650', 'ELDERLY', 'A07180']].copy()

num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
                            'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']

# one column per agi_stub group: the group's return count where agi_stub == i, NaN elsewhere
for i, column_name in zip(range(1, 7), num_of_returns):
    sub_census_df[column_name] = sub_census_df[sub_census_df['agi_stub'] == i]['N02650']

Each zip code has 6 income groups attached to it. I want to end up with one row per zip code, with the number of returns for each group appearing just once as its own column. I already tried changing the NaNs to 0 and using groupby('zipcode').sum(), but I get a sum of about 50 million returns for zip code 0, where it seems only around 800k should exist.
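Roughly, the step I tried after the loop above was:

sub_census_df[num_of_returns] = sub_census_df[num_of_returns].fillna(0)
merged_df = sub_census_df.groupby('zipcode').sum()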

Here is the dataframe that I currently get after the loop (before the groupby):

zipcode agi_stub    N02650  A02650  ELDERLY A07180  Number_of_returns_1_25000   Number_of_returns_25000_50000   Number_of_returns_50000_75000   Number_of_returns_75000_100000  Number_of_returns_100000_200000 Number_of_returns_200000_more   Amount_1_25000  Amount_25000_50000  Amount_50000_75000  Amount_75000_100000 Amount_100000_200000    Amount_200000_more
0   0   1   778140.0    10311099.0  144610.0    2076.0  778140.0    NaN NaN NaN NaN NaN 10311099.0  NaN NaN NaN NaN NaN
1   0   2   525940.0    19145621.0  113810.0    17784.0 NaN 525940.0    NaN NaN NaN NaN NaN 19145621.0  NaN NaN NaN NaN
2   0   3   285700.0    17690402.0  82410.0 9521.0  NaN NaN 285700.0    NaN NaN NaN NaN NaN 17690402.0  NaN NaN NaN
3   0   4   179070.0    15670456.0  57970.0 8072.0  NaN NaN NaN 179070.0    NaN NaN NaN NaN NaN 15670456.0  NaN NaN
4   0   5   257010.0    35286228.0  85030.0 14872.0 NaN NaN NaN NaN 257010.0    NaN NaN NaN NaN NaN 35286228.0  NaN

And here is what I want to get:

zipcode Number_of_returns_1_25000   Number_of_returns_25000_50000   Number_of_returns_50000_75000   Number_of_returns_75000_100000  Number_of_returns_100000_200000 Number_of_returns_200000_more   
0   0   778140.0                              525940.0                         285700.0     179070.0                 257010.0   850.0

CodePudding user response:

Here is one way to do it, using groupby and summing the desired columns:

num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
                            'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']

df.groupby('zipcode', as_index=False)[num_of_returns].sum()
    zipcode     Number_of_returns_1_25000   Number_of_returns_25000_50000   Number_of_returns_50000_75000   Number_of_returns_75000_100000  Number_of_returns_100000_200000     Number_of_returns_200000_more
0   0   778140.0    525940.0    285700.0    179070.0    257010.0    0.0
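The NaNs left by your loop are simply skipped by sum(), so the fillna step isn't required. If you also want the Amount_* columns from your frame aggregated the same way, the same call handles them; a sketch, reusing the column names shown in your output:

amount_cols = ['Amount_1_25000', 'Amount_25000_50000', 'Amount_50000_75000',
               'Amount_75000_100000', 'Amount_100000_200000', 'Amount_200000_more']

df.groupby('zipcode', as_index=False)[num_of_returns + amount_cols].sum()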

CodePudding user response:

This question needs more information to give a proper answer. For example, you leave out what is meant by certain columns in your data frame:

- `N1: Number of returns`

- `agi_stub: Size of adjusted gross income`
     According to the IRS, this has the following levels:

     0 = No AGI Stub
     1 = 'Under $1'
     2 = '$1 under $10,000'
     3 = '$10,000 under $25,000'
     4 = '$25,000 under $50,000'
     5 = '$50,000 under $75,000'
     6 = '$75,000 under $100,000'
     7 = '$100,000 under $200,000'
     8 = '$200,000 under $500,000'
     9 = '$500,000 under $1,000,000'
     10 = '$1,000,000 or more'

I got the above from https://www.irs.gov/pub/irs-soi/16incmdocguide.doc

With this information, I think what you want to find is the number of people who filed a tax return for each of the income levels of agi_stub. If that is what you mean, this can be achieved by:

import pandas as pd

data = pd.read_csv("./data/19zpallagi.csv")

## select only the desired columns
data = data[['zipcode', 'agi_stub', 'N1']]

## solution to your problem?
df = data.pivot_table(
    index='zipcode',
    values='N1',
    columns='agi_stub',
    aggfunc=['sum']
)

## bit of cleaning up: flatten the MultiIndex columns into readable names
PREFIX = 'agi_stub_level_'
df.columns = [PREFIX + level for level in df.columns.get_level_values(1).astype(str)]

Here's the output.

In [77]: df
Out[77]: 
         agi_stub_level_1  agi_stub_level_2  ...  agi_stub_level_5  agi_stub_level_6
zipcode                                      ...                                    
0              50061850.0        37566510.0  ...        21938920.0         8859370.0
1001               2550.0            2230.0  ...            1420.0             230.0
1002               2850.0            1830.0  ...            1840.0             990.0
1005                650.0             570.0  ...             450.0              60.0
1007               1980.0            1530.0  ...            1830.0             460.0
...                   ...               ...  ...               ...               ...
99827               470.0             360.0  ...             170.0              40.0
99833               550.0             380.0  ...             290.0              80.0
99835              1250.0            1130.0  ...             730.0             190.0
99901              1960.0            1520.0  ...            1030.0             290.0
99999            868450.0          644160.0  ...          319880.0          142960.0

[27595 rows x 6 columns]
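If you would rather have the column names from your question than the generic agi_stub_level_N labels, here is a small follow-up sketch (it reuses num_of_returns from your post and PREFIX from above, and assumes the file only contains agi_stub levels 1 through 6, as the output suggests):

## map 'agi_stub_level_1' ... 'agi_stub_level_6' onto the names used in the question
rename_map = dict(zip([PREFIX + str(i) for i in range(1, 7)], num_of_returns))
df = df.rename(columns=rename_map).reset_index()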