Home > Blockchain >  Pandas: copy value from one dataframe to another where other values match
Pandas: copy value from one dataframe to another where other values match

Time:10-24

First time asking a question so please forgive me. Dealing with two different dataframes, one containing state level data and another containing individual level data (within states)

Surveyframe:

Location Year Age Smokes
1 NY 2000 15 False
2 NY 2000 17 True
3 NY 2001 13 True
4 NY 2001 16 False
5 SD 2000 15 False
6 SD 2000 17 True
7 SD 2001 13 True
8 SD 2001 16 False

etc...

taxframe:

Location Year SubMeasure Value
1 NY 2000 Total Tax/Pack 0.50
2 NY 2000 Avg Cost/Pack 5.50
3 NY 2001 Total Tax/Pack 0.75
4 NY 2001 Avg Cost/Pack 5.75
5 SD 2000 Total Tax/Pack 0.10
6 SD 2000 Avg Cost/Pack 3.25
7 SD 2001 Total Tax/Pack 0.10
8 SD 2001 Avg Cost/Pack 3.25

etc...

Desire:

Location Year Age Smokes Total Tax/Pack Avg Cost/Pack
1 NY 2000 15 False 0.50 5.50
2 NY 2000 17 True 0.50 5.50
3 NY 2001 13 True 0.75 5.75
4 NY 2001 16 False 0.75 5.75
5 SD 2000 15 False 0.10 3.25
6 SD 2000 17 True 0.10 3.25
7 SD 2001 13 True 0.10 3.25
8 SD 2001 16 False 0.10 3.25

Using data for around 10 states with multiple sub-measures and over 200k individuals.

My first idea was to loop through each column appending to the surveyrfame, filling in value from the taxframe where the location and year match the location and year of the current indes, but that seems inefficient. Is there a better way to get this done with pandas?

Thanks

CodePudding user response:

You can use pandas to merge the dataframes.

import pandas as pd    

df1 = Surveyframe
df2 = taxframe
       
new_df = pd.merge(df1, df2,  how='left', on=['Location','Year'])

This should do what you want. This code will do a left join. There are other join types:

how{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’ Type of merge to be performed.

left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

cross: creates the cartesian product from both frames, preserves the order of the left keys.

Also - make sure that the headers you are using as the key are the same in both dataframes (e.g. location, year). Otherwise you can use left_on and right_on if they have different names.

CodePudding user response:

Made as follows. There is a grouping by columns 'Location', 'Year' in the function my_func. x.values[0][0] is 'Location' and x.values[0][1] is 'Year'. A selection is made based on these values from the "taxframe" data frame.

Next, from the filtered data, we get 'Value' where 'Total Tax/Pack' and 'Avg Cost/Pack'.

We substitute values through loc (loc is an explicit indexing on the left of the row indices, on the right is the name of the column). Here you can read about loc.

import pandas as pd
import numpy as np

Surveyframe['Total Tax/Pack'] = np.nan
Surveyframe['Avg Cost/Pack'] = np.nan


def my_func(x):
    filt = taxframe.loc[(taxframe['Location'] == x.values[0][0]) & (taxframe['Year'] == x.values[0][1])]
    ttp = filt[filt['SubMeasure'] == 'Total Tax/Pack']['Value']
    acp = filt[filt['SubMeasure'] == 'Avg Cost/Pack']['Value']

    Surveyframe.loc[x.index, 'Total Tax/Pack'] = ttp.values[0]
    Surveyframe.loc[x.index, 'Avg Cost/Pack'] = acp.values[0]


Surveyframe.groupby(['Location', 'Year']).apply(my_func)

print(Surveyframe)

Output

  Location  Year  Age  Smokes  Total Tax/Pack  Avg Cost/Pack
0       NY  2000   15   False            0.50           5.50
1       NY  2000   17    True            0.50           5.50
2       NY  2001   13    True            0.75           5.75
3       NY  2001   16   False            0.75           5.75
4       SD  2000   15   False            0.10           3.25
5       SD  2000   17    True            0.10           3.25
6       SD  2001   13    True            0.10           3.25
7       SD  2001   16   False            0.10           3.25
  • Related