I have 2 dataframes and I need to do a left join between them to bring the sk_provider
, the problem is that the description
column of the first dataframe has the values separated by a slash, so the join doesn't work as expected, how can I resolve this?
First dataframe:
sk_provider | provider |
---|---|
1 | gmail |
2 | hotmail/outlook |
3 | ig |
4 | terra |
5 | uol/bol/zipmail |
6 | yahoo |
7 | outros |
Second dataframe:
porcentagem | valor | provider | type |
---|---|---|---|
100% | 664 | gmail | enviados |
100% | 407 | hotmail | enviados |
100% | 1 | ig | enviados |
100% | 15 | terra | enviados |
100% | 43 | uol | enviados |
100% | 74 | yahoo | enviados |
Expected output:
porcentagem | valor | provider | type | sk_provider |
---|---|---|---|---|
100% | 664 | gmail | enviados | 1 |
100% | 407 | hotmail | enviados | 2 |
100% | 1 | ig | enviados | 3 |
100% | 15 | terra | enviados | 4 |
100% | 43 | uol | enviados | 5 |
100% | 74 | yahoo | enviados | 6 |
CodePudding user response:
Split the strings in column provider
of df1
around the delimiter /
then explode
on provider
and merge
the exploded dataframe with df2
df1['provider'] = df1['provider'].str.split('/')
df2.merge(df1.explode('provider'), on='provider', how='left')
porcentagem valor provider type sk_provider
0 100% 664 gmail enviados 1
1 100% 407 hotmail enviados 2
2 100% 1 ig enviados 3
3 100% 15 terra enviados 4
4 100% 43 uol enviados 5
5 100% 74 yahoo enviados 6
CodePudding user response:
I was able to use the .explode function from pandas to try and re-set your first df. This depends on your provider column being separated by a / consistantly
data = {
'Column1' : [1, 2],
'Column2' : ['gmail', 'hotmail/outlook']}
df = pd.DataFrame(data)
df['emails'] = df['Column2'].apply(lambda x : x.split('/')[0:])
df1 = df.explode('emails')[['Column1', 'emails']]
This will also require you to make a new df to join (df1) so be careful of that when you are performing your join.
CodePudding user response:
2 steps:
step 1: create a row per provider, split by /
df1['providers'] = df1['provider'].str.split('/')
df1 = df.set_index(['sk_provider'])['providers'].apply(pd.Series).stack().reset_index().drop(columns=['level_1']).rename(columns={0:'provider'})
result:
sk_provider provider
0 1 gmail
1 2 hotmail
2 2 outlook
3 3 ig
4 4 terra
5 5 uol
6 5 bol
7 5 zipmail
8 6 yahoo
9 7 outros
step 2: merge
pd.merge(df2, df1, left_on =['provider'], right_on=['provider'])
result:
porcentagem valor provider type sk_provider
0 100% 664 gmail enviados 1
1 100% 407 hotmail enviados 2
2 100% 1 ig enviados 3
3 100% 15 terra enviados 4
4 100% 43 uol enviados 5
5 100% 74 yahoo enviados 6