First time asking a question so please forgive me. Dealing with two different dataframes, one containing state level data and another containing individual level data (within states)
Surveyframe:
Location | Year | Age | Smokes | |
---|---|---|---|---|
1 | NY | 2000 | 15 | False |
2 | NY | 2000 | 17 | True |
3 | NY | 2001 | 13 | True |
4 | NY | 2001 | 16 | False |
5 | SD | 2000 | 15 | False |
6 | SD | 2000 | 17 | True |
7 | SD | 2001 | 13 | True |
8 | SD | 2001 | 16 | False |
etc...
taxframe:
Location | Year | SubMeasure | Value | |
---|---|---|---|---|
1 | NY | 2000 | Total Tax/Pack | 0.50 |
2 | NY | 2000 | Avg Cost/Pack | 5.50 |
3 | NY | 2001 | Total Tax/Pack | 0.75 |
4 | NY | 2001 | Avg Cost/Pack | 5.75 |
5 | SD | 2000 | Total Tax/Pack | 0.10 |
6 | SD | 2000 | Avg Cost/Pack | 3.25 |
7 | SD | 2001 | Total Tax/Pack | 0.10 |
8 | SD | 2001 | Avg Cost/Pack | 3.25 |
etc...
Desire:
Location | Year | Age | Smokes | Total Tax/Pack | Avg Cost/Pack | |
---|---|---|---|---|---|---|
1 | NY | 2000 | 15 | False | 0.50 | 5.50 |
2 | NY | 2000 | 17 | True | 0.50 | 5.50 |
3 | NY | 2001 | 13 | True | 0.75 | 5.75 |
4 | NY | 2001 | 16 | False | 0.75 | 5.75 |
5 | SD | 2000 | 15 | False | 0.10 | 3.25 |
6 | SD | 2000 | 17 | True | 0.10 | 3.25 |
7 | SD | 2001 | 13 | True | 0.10 | 3.25 |
8 | SD | 2001 | 16 | False | 0.10 | 3.25 |
Using data for around 10 states with multiple sub-measures and over 200k individuals.
My first idea was to loop through each column appending to the surveyrfame, filling in value from the taxframe where the location and year match the location and year of the current indes, but that seems inefficient. Is there a better way to get this done with pandas?
Thanks
CodePudding user response:
You can use pandas to merge the dataframes.
import pandas as pd
df1 = Surveyframe
df2 = taxframe
new_df = pd.merge(df1, df2, how='left', on=['Location','Year'])
This should do what you want. This code will do a left join. There are other join types:
how{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’ Type of merge to be performed.
left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves the order of the left keys.
Also - make sure that the headers you are using as the key are the same in both dataframes (e.g. location, year). Otherwise you can use left_on and right_on if they have different names.
CodePudding user response:
Made as follows. There is a grouping by columns 'Location', 'Year' in the function my_func.
x.values[0][0]
is 'Location' and x.values[0][1]
is 'Year'.
A selection is made based on these values from the "taxframe" data frame.
Next, from the filtered data, we get 'Value' where 'Total Tax/Pack' and 'Avg Cost/Pack'.
We substitute values through loc (loc is an explicit indexing on the left of the row indices, on the right is the name of the column). Here you can read about loc.
import pandas as pd
import numpy as np
Surveyframe['Total Tax/Pack'] = np.nan
Surveyframe['Avg Cost/Pack'] = np.nan
def my_func(x):
filt = taxframe.loc[(taxframe['Location'] == x.values[0][0]) & (taxframe['Year'] == x.values[0][1])]
ttp = filt[filt['SubMeasure'] == 'Total Tax/Pack']['Value']
acp = filt[filt['SubMeasure'] == 'Avg Cost/Pack']['Value']
Surveyframe.loc[x.index, 'Total Tax/Pack'] = ttp.values[0]
Surveyframe.loc[x.index, 'Avg Cost/Pack'] = acp.values[0]
Surveyframe.groupby(['Location', 'Year']).apply(my_func)
print(Surveyframe)
Output
Location Year Age Smokes Total Tax/Pack Avg Cost/Pack
0 NY 2000 15 False 0.50 5.50
1 NY 2000 17 True 0.50 5.50
2 NY 2001 13 True 0.75 5.75
3 NY 2001 16 False 0.75 5.75
4 SD 2000 15 False 0.10 3.25
5 SD 2000 17 True 0.10 3.25
6 SD 2001 13 True 0.10 3.25
7 SD 2001 16 False 0.10 3.25