I have two dataframes say df1 and df2. df1 has column info which has matching values from df2 column code_dig.
what i want is to check whether the info column from df1 has a matching value in code_dig column of df2, if there is matching value then it should bring corresponding code_loc value into a new column. If there were no multiple values within same row separated by semicolons, it was pretty straight forward
Here is bit subset of the data
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'info': ['301', '521;519', '412;905', '409;206', '301']})
df2 = pd.DataFrame({'code_dig': [206, 301, 409, 412, 519, 521, 905],
'code_loc': [202, 302, 406, 404.3, 'inf', 'inf', 'inf' ]})
Desired output
id info added_from_df2
0 1 301 302
1 2 521;519 inf;inf
2 3 412;905 404.3;inf
3 4 409;206 406;202
4 5 301 302
What i tried so far: Since i was not able to find solution for above issue i tried a workaround, thought of splitting it all and adding the corresponding codes and then zipping it again which works but the issue is that order of the codes matter. lets say 521;519 != 519;521. and when zipping back it messes up the original order.
# split all ; seperated codes to assign localised codes
df1['info'] = df1['info'].str.split(';')
df1 = df1.explode('info',ignore_index=True) # count: 2227
df_merged = df1.merge(df2, left_on = 'info', right_on = 'code_dig')
grouped = df_merged.groupby('id').agg({'code_dig':'; '.join, 'code_loc':'; '.join,}).reset_index()
# adding localised codes and re-group the splitted signs
digitised_lv = df_merged.merge(grouped, on = 'id', how='inner')
digitised_lv.drop_duplicates(subset = 'id', inplace = True)
# to file
digitised_lv.to_file('result.shp')
some unfinished attempts includes making the df2 as dictionary and trying isin but it doesnt work, or trying to figure out if two loops can work but since no programming background i am stuck.
df1['info']=df1['info'].apply(lambda row: row.split(';'))
new_dict = dict(df2.values)
lst = [new_dict.keys]
df1['info'].isin(lst)
Trying two loops:
df1['added_from_df2'] = 0
for i, row in df1.iterrows():
for i_,row_ in df2.iterrows():
if row['info'] == row_['code_dig']:
? HOW to assign :( ?
CodePudding user response:
Here is one option with a complex transform:
df1.join(
df1['info']
.str.split(';')
.explode()
.astype(int)
.map(df2.set_index('code_dig')['code_loc'])
.astype(str)
.groupby(level=0).agg(';'.join)
.rename('added_from_df2')
)
Or a regex alternative:
dic = df2.astype(str).set_index('code_dig')['code_loc'].to_dict()
# {'206': '202', '301': '302', '409': '406', '412': '404.3',
# '519': 'inf', '521': 'inf', '905': 'inf'}
df1['added_from_df2'] = (df1['info']
.str.replace('\d ', lambda m: dic.get(m.group(), 'nan'),
regex=True)
)
output:
id info added_from_df2
0 1 301 302
1 2 521;519 inf;inf
2 3 412;905 404.3;inf
3 4 409;206 406;202
4 5 301 302