Pandas: copy value from one dataframe to another where other values match-CodePudding

First time asking a question so please forgive me. Dealing with two different dataframes, one containing state level data and another containing individual level data (within states)

Surveyframe:

	Location	Year	Age	Smokes
1	NY	2000	15	False
2	NY	2000	17	True
3	NY	2001	13	True
4	NY	2001	16	False
5	SD	2000	15	False
6	SD	2000	17	True
7	SD	2001	13	True
8	SD	2001	16	False

etc...

taxframe:

	Location	Year	SubMeasure	Value
1	NY	2000	Total Tax/Pack	0.50
2	NY	2000	Avg Cost/Pack	5.50
3	NY	2001	Total Tax/Pack	0.75
4	NY	2001	Avg Cost/Pack	5.75
5	SD	2000	Total Tax/Pack	0.10
6	SD	2000	Avg Cost/Pack	3.25
7	SD	2001	Total Tax/Pack	0.10
8	SD	2001	Avg Cost/Pack	3.25

etc...

Desire:

	Location	Year	Age	Smokes	Total Tax/Pack	Avg Cost/Pack
1	NY	2000	15	False	0.50	5.50
2	NY	2000	17	True	0.50	5.50
3	NY	2001	13	True	0.75	5.75
4	NY	2001	16	False	0.75	5.75
5	SD	2000	15	False	0.10	3.25
6	SD	2000	17	True	0.10	3.25
7	SD	2001	13	True	0.10	3.25
8	SD	2001	16	False	0.10	3.25

Using data for around 10 states with multiple sub-measures and over 200k individuals.

My first idea was to loop through each column appending to the surveyrfame, filling in value from the taxframe where the location and year match the location and year of the current indes, but that seems inefficient. Is there a better way to get this done with pandas?

Thanks

CodePudding user response：

You can use pandas to merge the dataframes.

import pandas as pd    

df1 = Surveyframe
df2 = taxframe
       
new_df = pd.merge(df1, df2,  how='left', on=['Location','Year'])

This should do what you want. This code will do a left join. There are other join types:

how{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’ Type of merge to be performed.

left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

cross: creates the cartesian product from both frames, preserves the order of the left keys.

Also - make sure that the headers you are using as the key are the same in both dataframes (e.g. location, year). Otherwise you can use left_on and right_on if they have different names.

CodePudding user response：

Made as follows. There is a grouping by columns 'Location', 'Year' in the function my_func. x.values[0][0] is 'Location' and x.values[0][1] is 'Year'. A selection is made based on these values from the "taxframe" data frame.

Next, from the filtered data, we get 'Value' where 'Total Tax/Pack' and 'Avg Cost/Pack'.

We substitute values through loc (loc is an explicit indexing on the left of the row indices, on the right is the name of the column). Here you can read about loc.

import pandas as pd
import numpy as np

Surveyframe['Total Tax/Pack'] = np.nan
Surveyframe['Avg Cost/Pack'] = np.nan


def my_func(x):
    filt = taxframe.loc[(taxframe['Location'] == x.values[0][0]) & (taxframe['Year'] == x.values[0][1])]
    ttp = filt[filt['SubMeasure'] == 'Total Tax/Pack']['Value']
    acp = filt[filt['SubMeasure'] == 'Avg Cost/Pack']['Value']

    Surveyframe.loc[x.index, 'Total Tax/Pack'] = ttp.values[0]
    Surveyframe.loc[x.index, 'Avg Cost/Pack'] = acp.values[0]


Surveyframe.groupby(['Location', 'Year']).apply(my_func)

print(Surveyframe)

Output

  Location  Year  Age  Smokes  Total Tax/Pack  Avg Cost/Pack
0       NY  2000   15   False            0.50           5.50
1       NY  2000   17    True            0.50           5.50
2       NY  2001   13    True            0.75           5.75
3       NY  2001   16   False            0.75           5.75
4       SD  2000   15   False            0.10           3.25
5       SD  2000   17    True            0.10           3.25
6       SD  2001   13    True            0.10           3.25
7       SD  2001   16   False            0.10           3.25