Pandas assigning with loc-CodePudding

We have 2 datasets and the goal is to add columns from one dataset to another. Below is example of code:

import pandas as pd
d1 = {'city_id': [116,1,1,1,116,1,1,116,1], 'key': [14,14,22,21,22,14,13,80,99]}
d2={'key':[14,22,80],'population':[2000,7500,11000],'median_income':[30000,50000,44000]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
print(df1)
print()
print(df2)

   city_id  key
0      116   14
1        1   14
2        1   22
3        1   21
4      116   22
5        1   14
6        1   13
7      116   80
8        1   99


   key  population  median_income
0   14        2000          30000
1   22        7500          50000
2   80       11000          44000

On next step I do:

print(df1.loc[df1['city_id']==116].assign(
population=lambda x: x['key'].map(dict(zip(df2['key'],df2['population'])))
,median_income=lambda x:x['key'].map(dict(zip(df2['key'],df2['median_income'])))
))

   city_id  key  population  median_income
0      116   14        2000          30000
4      116   22        7500          50000
7      116   80       11000          44000

Than, when I try to assign it into original dataframe, I get an error:

df1.loc[df1['city_id']==116]=df1.loc[df1['city_id']==116].assign(
population=lambda x: x['key'].map(dict(zip(df2['key'],df2['population'])))
,median_income=lambda x:x['key'].map(dict(zip(df2['key'],df2['median_income'])))
)
ValueError: shape mismatch: value array of shape (3,4)  could not be broadcast to indexing result of shape (3,2)

While the expected result is:

   city_id  key  population  median_income
0      116   14        2000         30000
1      1     14         NaN          NaN
2      1     22         NaN          NaN
3      1     21         NaN          NaN
4      116   22        7500         50000
5      1     14         NaN          NaN
6      1     13         NaN          NaN
7      116   80       11000        44000
8      1     99         NaN          NaN

What can possibly solve the issue?

Note! We cannot use merge, because in reality there are 20 different 'city_id' and it will create numerous suffixes like'population_x','population_y','population_z'...'median_income_x', 'median_income_y','median_income_z', which seems to be way less handy. The idea is to create function and using assing for each city_id.

CodePudding user response：

I think it is very unlikely this is the correct thing to do in this setting. You should almost surely use merge and melt as necessary. That being said, instead you can simply add two new, empty columns to the original dataset.

import pandas as pd
d1 = {'city_id': [116,1,1,1,116,1,1,116,1], 'key': [14,14,22,21,22,14,13,80,99]}
d2={'key':[14,22,80],'population':[2000,7500,11000],'median_income':[30000,50000,44000]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)

import numpy as np 
df1[["population", "median_income"]] = np.nan

df1.loc[df1['city_id']==116]=df1.loc[df1['city_id']==116].assign(
population=lambda x: x['key'].map(dict(zip(df2['key'],df2['population'])))
,median_income=lambda x:x['key'].map(dict(zip(df2['key'],df2['median_income'])))
)

This works as you want it to.

CodePudding user response：

It looks like you're better off merging the two then blanking out the columns where the city_id isn't 116.

df3 = df1.merge(df2, on='key', how='left')
df3.loc[df3['city_id'] != 116, ['population', 'median_income']] = np.nan