Home > Back-end >  Merge Pandas with Filter
Merge Pandas with Filter

Time:03-16

I have 2 dataframes and I need to do a left join between them to bring the sk_provider, the problem is that the description column of the first dataframe has the values ​​separated by a slash, so the join doesn't work as expected, how can I resolve this?

First dataframe:

sk_provider provider
1 gmail
2 hotmail/outlook
3 ig
4 terra
5 uol/bol/zipmail
6 yahoo
7 outros

Second dataframe:

porcentagem valor provider type
100% 664 gmail enviados
100% 407 hotmail enviados
100% 1 ig enviados
100% 15 terra enviados
100% 43 uol enviados
100% 74 yahoo enviados

Expected output:

porcentagem valor provider type sk_provider
100% 664 gmail enviados 1
100% 407 hotmail enviados 2
100% 1 ig enviados 3
100% 15 terra enviados 4
100% 43 uol enviados 5
100% 74 yahoo enviados 6

CodePudding user response:

Split the strings in column provider of df1 around the delimiter / then explode on provider and merge the exploded dataframe with df2

df1['provider'] = df1['provider'].str.split('/')
df2.merge(df1.explode('provider'), on='provider', how='left')

  porcentagem  valor provider      type  sk_provider
0        100%    664    gmail  enviados            1
1        100%    407  hotmail  enviados            2
2        100%      1       ig  enviados            3
3        100%     15    terra  enviados            4
4        100%     43      uol  enviados            5
5        100%     74    yahoo  enviados            6

CodePudding user response:

I was able to use the .explode function from pandas to try and re-set your first df. This depends on your provider column being separated by a / consistantly

data = {
        'Column1' : [1, 2],
        'Column2' : ['gmail', 'hotmail/outlook']}
df = pd.DataFrame(data)
df['emails'] = df['Column2'].apply(lambda x : x.split('/')[0:])
df1 = df.explode('emails')[['Column1', 'emails']]

This will also require you to make a new df to join (df1) so be careful of that when you are performing your join.

CodePudding user response:

2 steps:

step 1: create a row per provider, split by /

df1['providers'] = df1['provider'].str.split('/')
df1 = df.set_index(['sk_provider'])['providers'].apply(pd.Series).stack().reset_index().drop(columns=['level_1']).rename(columns={0:'provider'})

result:

    sk_provider provider
0   1           gmail
1   2           hotmail
2   2           outlook
3   3           ig
4   4           terra
5   5           uol
6   5           bol
7   5           zipmail
8   6           yahoo
9   7           outros

step 2: merge

pd.merge(df2, df1, left_on =['provider'], right_on=['provider'])

result:

    porcentagem valor   provider    type    sk_provider
0   100%         664    gmail       enviados    1
1   100%         407    hotmail     enviados    2
2   100%           1    ig          enviados    3
3   100%          15    terra       enviados    4
4   100%          43    uol         enviados    5
5   100%          74    yahoo       enviados    6
  • Related