We have 2 datasets and the goal is to add columns from one dataset to another. Below is example of code:
import pandas as pd
d1 = {'city_id': [116,1,1,1,116,1,1,116,1], 'key': [14,14,22,21,22,14,13,80,99]}
d2={'key':[14,22,80],'population':[2000,7500,11000],'median_income':[30000,50000,44000]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
print(df1)
print()
print(df2)
city_id key
0 116 14
1 1 14
2 1 22
3 1 21
4 116 22
5 1 14
6 1 13
7 116 80
8 1 99
key population median_income
0 14 2000 30000
1 22 7500 50000
2 80 11000 44000
On next step I do:
print(df1.loc[df1['city_id']==116].assign(
population=lambda x: x['key'].map(dict(zip(df2['key'],df2['population'])))
,median_income=lambda x:x['key'].map(dict(zip(df2['key'],df2['median_income'])))
))
city_id key population median_income
0 116 14 2000 30000
4 116 22 7500 50000
7 116 80 11000 44000
Than, when I try to assign it into original dataframe, I get an error:
df1.loc[df1['city_id']==116]=df1.loc[df1['city_id']==116].assign(
population=lambda x: x['key'].map(dict(zip(df2['key'],df2['population'])))
,median_income=lambda x:x['key'].map(dict(zip(df2['key'],df2['median_income'])))
)
ValueError: shape mismatch: value array of shape (3,4) could not be broadcast to indexing result of shape (3,2)
While the expected result is:
city_id key population median_income
0 116 14 2000 30000
1 1 14 NaN NaN
2 1 22 NaN NaN
3 1 21 NaN NaN
4 116 22 7500 50000
5 1 14 NaN NaN
6 1 13 NaN NaN
7 116 80 11000 44000
8 1 99 NaN NaN
What can possibly solve the issue?
Note! We cannot use merge, because in reality there are 20 different 'city_id' and it will create numerous suffixes like'population_x','population_y','population_z'...'median_income_x', 'median_income_y','median_income_z', which seems to be way less handy. The idea is to create function and using assing for each city_id.
CodePudding user response:
I think it is very unlikely this is the correct thing to do in this setting. You should almost surely use merge and melt as necessary. That being said, instead you can simply add two new, empty columns to the original dataset.
import pandas as pd
d1 = {'city_id': [116,1,1,1,116,1,1,116,1], 'key': [14,14,22,21,22,14,13,80,99]}
d2={'key':[14,22,80],'population':[2000,7500,11000],'median_income':[30000,50000,44000]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
import numpy as np
df1[["population", "median_income"]] = np.nan
df1.loc[df1['city_id']==116]=df1.loc[df1['city_id']==116].assign(
population=lambda x: x['key'].map(dict(zip(df2['key'],df2['population'])))
,median_income=lambda x:x['key'].map(dict(zip(df2['key'],df2['median_income'])))
)
This works as you want it to.
CodePudding user response:
It looks like you're better off merging the two then blanking out the columns where the city_id isn't 116.
df3 = df1.merge(df2, on='key', how='left')
df3.loc[df3['city_id'] != 116, ['population', 'median_income']] = np.nan