I have 2 datasets:
df_1.head(4)
region postal_code
Adrar 1000
Broko 5633
Conan 4288
Cymus 7435
df_2.head(4)
Name Charges region postal_code Revenue
Lia HG Pintol 4522 345
Joss PX Inend 7455 142
Amph CT 5633 148
Andrew UY Liven 9033 147
The second dataset has many missing values in 'region' column... But we can get those missing values using first dataset by matching values of postal_code... For example, in the third row of df_2, 'region' column is missing but by matching it's respective postal_code with df_1, we can find it's region as 'Broko'... Can someone please suggest on how to code it
CodePudding user response:
You can use boolean indexing and a map
:
m = df2['region'].isna()
df2.loc[m, 'region'] = (df2.loc[m, 'postal_code']
.map(df1.set_index('postal_code')['region'])
)
Another less efficient approach could be:
df2['region'] = (df2['region']
.fillna(df2['postal_code']
.map(df1.set_index('postal_code')['region']))
)
# or in place
df2['region'].update(df2['postal_code']
.map(df1.set_index('postal_code')['region']))
Output:
Name Charges region postal_code Revenue
0 Lia HG Pintol 4522 345
1 Joss PX Inend 7455 142
2 Amph CT Broko 5633 148
3 Andrew UY Liven 9033 147
CodePudding user response:
Example
data1 = {'region': {0: 'Adrar', 1: 'Broko', 2: 'Conan', 3: 'Cymus'},
'postal_code': {0: 1000, 1: 5633, 2: 4288, 3: 7435}}
data2 = {'Name': {0: 'Lia', 1: 'Joss', 2: 'Amph', 3: 'Andrew'},
'Charges': {0: 'HG', 1: 'PX', 2: 'CT', 3: 'UY'},
'region': {0: 'Pintol', 1: 'Inend', 2: None, 3: 'Liven'},
'postal_code': {0: 4522, 1: 7455, 2: 5633, 3: 9033},
'Revenue': {0: 345, 1: 142, 2: 148, 3: 147}}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Code
use map
and fillna
mapper = df1.set_index('postal_code')['region']
df2.assign(region=df2['region'].fillna(df2['postal_code'].map(mapper)))
result:
Name Charges region postal_code Revenue
0 Lia HG Pintol 4522 345
1 Joss PX Inend 7455 142
2 Amph CT Broko 5633 148
3 Andrew UY Liven 9033 147
CodePudding user response:
try this:
mapper = dict(df1.values[:, ::-1])
df2['region'] = df2['region'].combine_first(df2['postal_code'].replace(mapper))