I have the following two dataframes:
df1
col1 col2
A 5, 3, 4
B 5, 1
C 3, 4
df2
c1 c2
1 XXX
2 CCC
3 AAA
4 DDD
5 FFF
Script to create df1 and df2:
import pandas as pd
data1 = [['A', '5, 3, 4'], ['B', '5, 1'], ['C', '3, 4']]
df1 = pd.DataFrame(data1, columns = ['col1', 'col2'])
data2 = [['1', 'XXX'], ['2', 'CCC'], ['3', 'AAA'], ['4', 'DDD'], ['5', 'FFF']]
df2 = pd.DataFrame(data2, columns = ['c1', 'c2'])
I need to expand col2 in df1 and merge the expanded columns with c2 of df2 by c1.
I know how to expand col2, but the challenge is that the number of expanded values differs from row to row, so I don't know how to merge df_expanded with df2.
df_expanded = df1['col2'].str.split(', ', expand=True)
The expected result:
col1  col2_1  col2_2  col2_3  col2_1_title  col2_2_title  col2_3_title
A     5       3       4       FFF           AAA           DDD
B     5       1               FFF           XXX
C     3       4               AAA           DDD
How can I get this result?
CodePudding user response:
Let us do it step by step: split col2, look each piece up in df2, then join both results back onto df1.
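# Split on ', ' (comma plus space) so the pieces carry no leading blanks
s = df1.col2.str.split(', ').apply(pd.Series).add_prefix('col_')
# c1 holds strings in the question's df2, so look the string pieces up directly
t = s.apply(lambda col: col.map(df2.set_index('c1')['c2']))
out = df1.join(s).join(t.add_suffix('_title'))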
Out[42]:
  col1     col2 col_0 col_1 col_2 col_0_title col_1_title col_2_title
0    A  5, 3, 4     5     3     4         FFF         AAA         DDD
1    B     5, 1     5     1   NaN         FFF         XXX         NaN
2    C     3, 4     3     4   NaN         AAA         DDD         NaN
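If the exact layout from the question is needed (1-based column names such as col2_1_title, and blanks instead of NaN), the same idea can be adapted roughly as below. This is only a sketch: the names lookup, parts, titles and result are made up for illustration, and it assumes c1 holds strings, as in the question's script.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'B', 'C'],
                    'col2': ['5, 3, 4', '5, 1', '3, 4']})
df2 = pd.DataFrame({'c1': ['1', '2', '3', '4', '5'],
                    'c2': ['XXX', 'CCC', 'AAA', 'DDD', 'FFF']})

# c1 -> c2 lookup; both sides are strings in the question's data
lookup = df2.set_index('c1')['c2']

# One column per split piece; shorter rows are padded with missing values
parts = df1['col2'].str.split(', ', expand=True)
parts.columns = [f'col2_{i + 1}' for i in parts.columns]  # 1-based names

# Map every piece to its title, column by column
titles = parts.apply(lambda col: col.map(lookup)).add_suffix('_title')

# Put everything side by side and blank out the missing cells
result = pd.concat([df1[['col1']], parts, titles], axis=1).fillna('')
print(result)
```

Replacing missing values with empty strings is purely cosmetic; dropping the fillna('') call keeps them as missing values, which is usually easier to work with afterwards.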
CodePudding user response:
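Here is one approach that uses the stacked data for a merge and then reshapes the result through a MultiIndex. Note that the output differs slightly from the expected one (the column suffixes are 0-based and the title columns are named col2_title_N):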
df3 = (df1['col2']
       .str.split(', ', expand=True)
       .stack()
       #.astype(int) # only if df2 has integers in c1
       .to_frame('col2')
       .merge(df2.set_index('c1').rename(columns={'c2': 'col2_title'}),
              left_on='col2', right_index=True)
       .unstack(1)
      )
df3.columns = df3.columns.map(lambda x: f'{x[0]}_{x[1]}')
df_expand = df1[['col1']].join(df3)
output:
  col1 col2_0 col2_1 col2_2 col2_title_0 col2_title_1 col2_title_2
0    A      5      3      4          FFF          AAA          DDD
1    B      5      1    NaN          FFF          XXX          NaN
2    C      3      4    NaN          AAA          DDD          NaN
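The same long-then-wide idea can also be spelled with explode and pivot, which keeps every intermediate frame flat and may be easier to inspect step by step. This is a sketch of a variant, not the code above: the names long_df, wide and the helper column pos are invented for illustration, and c1 is assumed to hold strings as in the question's script.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'B', 'C'],
                    'col2': ['5, 3, 4', '5, 1', '3, 4']})
df2 = pd.DataFrame({'c1': ['1', '2', '3', '4', '5'],
                    'c2': ['XXX', 'CCC', 'AAA', 'DDD', 'FFF']})

# Long form: one row per (col1, piece)
long_df = (df1.set_index('col1')['col2']
              .str.split(', ')
              .explode()
              .reset_index())

# Position of each piece within its original row (0, 1, 2, ...)
long_df['pos'] = long_df.groupby('col1').cumcount()

# Ordinary column-on-column merge; how='left' keeps pieces without a title
long_df = long_df.merge(
    df2.rename(columns={'c1': 'col2', 'c2': 'col2_title'}),
    on='col2', how='left')

# Back to wide form: one column per (value, position) pair
wide = long_df.pivot(index='col1', columns='pos',
                     values=['col2', 'col2_title'])
wide.columns = [f'{name}_{pos}' for name, pos in wide.columns]
wide = wide.reset_index()
print(wide)
```

Because the merge happens on plain columns, a dtype mismatch between the pieces and c1 is easy to spot and fix in long_df before pivoting.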
CodePudding user response:
I really wanted a prettier way to do this but I don't think this is any better than the other answers.
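# Build a c1 -> c2 lookup function from df2's two columns
get = dict(zip(*map(df2.get, df2))).get
d = df1.col2.str.split(', ', expand=True).add_prefix('col2_')
# applymap is deprecated since pandas 2.1; d.map(get) is the replacement there
df1[['col1']].join(d).join(d.applymap(get).add_suffix('_title'))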
  col1 col2_0 col2_1 col2_2 col2_0_title col2_1_title col2_2_title
0    A      5      3      4          FFF          AAA          DDD
1    B      5      1   None          FFF          XXX         None
2    C      3      4   None          AAA          DDD         None
Maybe this isn't exactly what OP wanted, output-wise, but it seems better in some subjective way.
s = df1.set_index('col1').col2
m = df2.set_index('c1').c2
# regex=True treats each c1 key as a pattern, so '5, 3, 4' becomes 'FFF, AAA, DDD'
pd.concat({
    'number': s, 'title': s.replace(m, regex=True)
}).str.split(', ', expand=True).unstack(0)
          0            1            2
     number title number title number title
col1
A         5   FFF      3   AAA      4   DDD
B         5   FFF      1   XXX   None  None
C         3   AAA      4   DDD   None  None
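One caveat with the regex replacement: each c1 key is used as a pattern, so a key like 1 can also match inside a longer number such as 10. The sample data only has single-digit keys, so it is not a problem here. If multi-digit keys are possible, a sketch like the following matches whole runs of digits instead; lookup and to_title are hypothetical names, and unknown numbers are left untouched by assumption.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'B', 'C'],
                    'col2': ['5, 3, 4', '5, 1', '3, 4']})
df2 = pd.DataFrame({'c1': ['1', '2', '3', '4', '5'],
                    'c2': ['XXX', 'CCC', 'AAA', 'DDD', 'FFF']})

# Plain dict from c1 to c2 (both strings in the question's data)
lookup = dict(zip(df2['c1'], df2['c2']))

def to_title(match):
    """Return the title for a matched number, or the number itself if unknown."""
    return lookup.get(match.group(), match.group())

s = df1.set_index('col1')['col2']

# r'\d+' matches a whole run of digits, so '10' is never split into '1' and '0'
titles = s.str.replace(r'\d+', to_title, regex=True)
print(titles)
```

The resulting titles Series can be dropped into the pd.concat call above in place of s.replace(m, regex=True).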